Introduction to R Programming
This material is an update and translation of materials developed by students from the Department of Genetics at ESALQ/USP - Brazil. Access the Portuguese content taught in other events at this website.
We suggest that, before starting the practice described here, you follow this tutorial for installing R and RStudio.
Getting Familiar with the RStudio Interface
When opening RStudio, you will see:
The interface is divided into four windows with main functions:
- Code editing
- Workspace and history
- Console
- Files, plots, packages, and help
Explore each of the windows. There are numerous functionalities for each of them, and we will cover some of them throughout the course.
A First Script
The code editing window (probably located in the top left corner) will be used to write your code. Open a new script by clicking the + in the top left corner and selecting R script.
Let’s begin our work with the traditional Hello World. Type in your script:
## Hello world
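A minimal sketch of a line that produces this output, assuming the greeting is printed with cat() (an assumption; print() would add a [1] prefix and quotation marks):

```r
# Print a greeting in the console
cat("Hello world")
```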
Now, select the line and press the Run button or use Ctrl + Enter. When you do this, your code will be processed in the Console window, where the written code will appear in blue (if you have R’s default colors), followed by the desired result. A line is skipped only if there is a # symbol in front of it. Now, try putting # in front of the written code. Again, select the line and press Run.
The # symbol is used for comments in the code. Commenting is a great organizational practice and helps you remember, later, what you were thinking when you wrote the code. It’s also essential for others to understand it. As in the example:
## Hello world
Important: whenever you want to make any changes, edit your script rather than typing directly in the console, because what is typed in the console is not saved with your script!
To save your script, you can use the Files tab located (by default) in the bottom right corner. Browse to a location of your preference and create a new folder named CourseR.
Tip:
- Avoid putting spaces and punctuation in folder and file names, as this can make access via the command line in R difficult. For example, instead of Course R, we opt for CourseR.
Then, just click on the floppy disk icon located in the RStudio header or use Ctrl + S and select the created CourseR directory. R scripts are saved with the .R extension.
Setting the Working Directory
Another good practice in R is to keep the script in the same directory where your raw data (input files for the script) and processed data (graphs, tables, etc.) are located. For this, we’ll have R identify the same directory where you saved the script as the working directory. This way, it will understand that this is where the data will be obtained from and where the results will also go.
You can do this using RStudio’s tools: locate the CourseR directory through the Files tab, click on More and then “Set as Working Directory”. Notice that a setwd() command with the path to the folder will appear in the console.
In other words, you can use this same command to perform this action. The result will be our working folder. When you’re lost or want to make sure the working directory has been changed, use getwd().
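A minimal sketch of setting and checking the working directory from code. The folder location is hypothetical: it is created inside R’s temporary directory so the example runs anywhere; in practice you would pass the path to your own CourseR folder.

```r
# Hypothetical location: a CourseR folder inside R's temporary directory
course_dir <- file.path(tempdir(), "CourseR")
dir.create(course_dir, showWarnings = FALSE)

old_dir <- setwd(course_dir)  # set the working directory; returns the previous one
getwd()                       # print the current working directory
list.files()                  # files R can see there (none for a new folder)

setwd(old_dir)                # go back to where we started
```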
Making Life Easier with Tab
Now, imagine you have a directory like ~/Documents/masters/semester1/course_such/class_such/data_28174/analysis_276182/results_161/. It’s not easy to remember this entire path to write in a setwd() command.
In addition to the convenience of the RStudio window, you can also use the Tab key to complete the path for you. Try it by searching for a folder on your computer. Just start typing the path and press Tab, and it will complete the name for you! If more than one file name starts with what you typed, press Tab twice and it will show you all the options.
The Tab key works not only for indicating paths but also for commands and object names. It’s very common to make typing errors in code. Using Tab will significantly reduce these errors.
Tab can be even more powerful if you have access to the GitHub Copilot tool. With it, you can use Tab to complete the code you’re writing. It’s an artificial intelligence-based tool that suggests code as you write. It’s a paid tool, but you can use it for free for 60 days.
Basic Operations
Let’s get to the language!
R can function as a simple calculator, using the same syntax as other programs (like Excel):
1 + 1.3 # Decimal defined with "."
2 * 3
2^3
4 / 2
sqrt(4) # square root
log(100, base = 10) # Logarithm base 10
log(100) # Natural logarithm
Now, use the basic operations to solve the expression below. Remember to use parentheses () to establish priorities in operations.
\(\left(\frac{13+2+1.5}{3}\right) + \log_{4} 96\)
Expected result:
## [1] 8.792481
Notice that if you position the parentheses incorrectly, R gives no error message. This is what we call a logical error or silent error: the code runs but doesn’t do what you want it to do. This is the most dangerous and difficult type of error to fix. See an example:
## [1] 18.79248
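A sketch of how the two results above can arise, assuming the base-4 logarithm is written as log(96, base = 4). The second line drops the parentheses around the sum, so only 1.5 is divided by 3:

```r
# Parentheses in the intended places
(13 + 2 + 1.5) / 3 + log(96, base = 4)   # 8.792481

# No parentheses around the sum: R runs it without complaint,
# but the answer is different
13 + 2 + 1.5 / 3 + log(96, base = 4)     # 18.79248
```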
Problems that R reports with a message come in two kinds: warnings and errors. In both cases, R returns a message to help you correct them. Warnings don’t stop the code from running but draw attention to something; errors, however, must be corrected for the code to run. Syntax errors, such as an unexpected symbol, are a common kind of error.
Example of an error:
You might also forget to close a parenthesis, quotation mark, bracket, or brace; in these cases, R will wait for you to complete the command, showing a + in the console:
If this happens, go to the console and press ESC, which will end the block so you can correct it.
The commands log and sqrt are two of many basic functions that R has. Functions are organized sets of instructions to perform a task. For all of them, R has a description to help in their use. To access this help, type ? followed by the function name, and the function description will open in RStudio’s Help window.
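A minimal sketch of ways to reach the documentation of a function, using log and sqrt as examples:

```r
?log            # open the help page for log()
help("sqrt")    # the same request written as a function call
args(log)       # quick reminder of a function's arguments
```

Outside RStudio (for example in a terminal), the help page is shown as plain text instead of in the Help window.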
If R’s own description isn’t enough for you to understand how the function works, search on Google (preferably in English). There are many websites and forums with educational information about R functions.
Vector Operations
Vectors are the simplest structures worked with in R. We build a vector with a numeric sequence using:
## [1] 1 3 2 5 2
IMPORTANT NOTE: c is the R function (Combine Values into a Vector or List) with which we build a vector!
We use the : symbol to create sequences of integer numbers, like:
## [1] 1 2 3 4 5 6 7 8 9 10
We can use other functions to generate sequences, such as:
## [1] 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90
## [20] 95 100
## [1] 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90
## [20] 95 100
- Create a sequence using the seq function that varies from 4 to 30, with intervals of 3.
## [1] 4 7 10 13 16 19 22 25 28
The rep function generates sequences with repeated numbers:
## [1] 3 4 5 3 4 5
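The calls below are a sketch of commands that reproduce the outputs shown in this section; the exact arguments used in the course are an assumption.

```r
c(1, 3, 2, 5, 2)                  # combine values into a vector
1:10                              # integers from 1 to 10
seq(from = 0, to = 100, by = 5)   # 0, 5, 10, ..., 100
seq(0, 100, 5)                    # same call with positional arguments
seq(4, 30, by = 3)                # 4, 7, ..., 28 (never goes past 30)
rep(3:5, times = 2)               # 3 4 5 3 4 5
```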
We can perform operations using these vectors:
Notice that it’s already getting tiresome to type the same numbers repeatedly. Let’s solve this by creating objects to store our vectors and much more.
Creating Objects
The storage of information in objects and their possible manipulation makes R an object-oriented language. To create an object, simply assign values to variables, as follows:
x = c(30.1, 30.4, 40, 30.2, 30.6, 40.1)
# or
x <- c(30.1, 30.4, 40, 30.2, 30.6, 40.1)
y <- c(0.26, 0.3, 0.36, 0.24, 0.27, 0.35)
Old-school users tend to use the <- sign, but for creating objects it has the same function as =. Some prefer to use <- for object assignment and = only for defining arguments within functions. Organize yourself in the way you prefer.
To access the values within the object, simply type its name:
## [1] 30.1 30.4 40.0 30.2 30.6 40.1
The language is case-sensitive. Therefore, x is different from X:
The object X was not created.
The naming of objects is a personal choice, but it’s suggested to maintain a pattern for better organization. Here are some tips:
- Use descriptive names
- Avoid starting with numbers
- Don’t use spaces (use _ or camelCase)
- Don’t use special characters
- Maintain consistency in the chosen pattern
- Avoid very long names
- Don’t use accents or non-ASCII characters
Some names cannot be used because they establish fixed roles in R:
- TRUE - True, logical value
- FALSE - False, logical value
- if, else - Reserved words for conditional structures
- for, in, while, repeat, break, next - Reserved words for loop structures
- function - Reserved word for function definition
- NA, NaN, NULL, Inf - Reserved words for special values
- NA_integer_, NA_real_, NA_character_, NA_complex_ - Special values to represent missing data
We can then perform operations with the created object:
## [1] 32.1 32.4 42.0 32.2 32.6 42.1
## [1] 60.2 60.8 80.0 60.4 61.2 80.2
To perform the operation, R aligns the two vectors and performs the operation element by element. Observe:
## [1] 30.36 30.70 40.36 30.44 30.87 40.45
## [1] 7.826 9.120 14.400 7.248 8.262 14.035
If the vectors have different sizes, it will repeat the smaller one to perform the element-by-element operation with all elements of the larger one.
If the length of the larger vector is not a multiple of the length of the smaller one, we’ll get a warning:
## Warning in x * c(1, 2, 3, 4): longer object length is not a multiple of shorter
## object length
## [1] 30.1 60.8 120.0 120.8 30.6 80.2
Notice that the warning doesn’t compromise the code’s functionality; it just gives a hint that something might not be as you’d like.
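The recycling behaviour described above can be sketched as follows, using the x and y vectors defined earlier (repeated here so the example runs on its own):

```r
x <- c(30.1, 30.4, 40, 30.2, 30.6, 40.1)
y <- c(0.26, 0.3, 0.36, 0.24, 0.27, 0.35)

x + 2              # the single value 2 is reused for all six elements
x + y              # same length: element 1 with element 1, and so on
x * c(1, 2)        # length 2 divides 6: the pattern 1, 2 is repeated three times
x * c(1, 2, 3, 4)  # length 4 does not divide 6: a result plus a warning
```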
We can also store the operation in another object:
We can also apply some functions, for example:
## [1] 101.59
## [1] 16.93167
## [1] 6.427507
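The three values above are consistent with storing the average of x and y in a new object and then summarizing it. This is a sketch under that assumption: the name z and its definition are inferred from the printed results, not taken from the course code.

```r
x <- c(30.1, 30.4, 40, 30.2, 30.6, 40.1)
y <- c(0.26, 0.3, 0.36, 0.24, 0.27, 0.35)

z <- (x + y) / 2   # assumed definition; store the result of an operation

sum(z)    # 101.59
mean(z)   # 16.93167
var(z)    # 6.427507
```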
Indexing
We access only the 3rd value of the created vector with []:
We can also access the numbers from position 2 to 4 with:
## [1] 15.35 20.18 15.22
To get information about the created vector, use:
## num [1:6] 15.2 15.3 20.2 15.2 15.4 ...
The str function tells us about the structure of the vector, which is a numeric vector with 6 elements.
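A short sketch of indexing with square brackets. The vector z is typed in directly here with illustrative values so the example runs on its own:

```r
z <- c(15.18, 15.35, 20.18, 15.22, 15.435, 20.225)  # illustrative values

z[3]         # third element
z[2:4]       # elements 2 to 4
z[c(1, 6)]   # first and last elements
z[-1]        # everything except the first element
str(z)       # compact description: type, length, first values
```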
Vectors can also receive other categories such as characters:
Another class is factors, which can be a bit complex to handle. Generally, factors are values categorized by levels. For example, if we transform our character vector clone into a factor, levels will be assigned to each word:
## Factor w/ 4 levels "GRA01","GRA02",..: 2 3 4 2 1 3
## [1] "GRA01" "GRA02" "URO01" "URO03"
This way, we’ll have only 4 levels for a vector with 6 elements, since the words “GRA02” and “URO01” are repeated. We can get the number of elements in the vector or its length with:
## [1] 6
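A sketch of the character and factor steps above. The clone values match those printed later in the field1 table; the object name clone_factor is hypothetical.

```r
clone <- c("GRA02", "URO01", "URO03", "GRA02", "GRA01", "URO01")

clone_factor <- as.factor(clone)  # hypothetical name for the factor version
str(clone_factor)                 # 4 levels, stored as integer codes
levels(clone_factor)              # the distinct values, sorted alphabetically
length(clone_factor)              # 6 elements
```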
There are also logical vectors, which receive true or false values:
## [1] FALSE FALSE FALSE FALSE FALSE TRUE
With it we can, for example, identify which positions have elements greater than 40:
## [1] 6
## [1] 40.1
## [1] 40.1
We can also locate specific elements with:
## [1] TRUE FALSE TRUE TRUE FALSE FALSE
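A sketch of commands that give the logical results above. The comparison x > 40 follows from the text; the values passed to %in% are an assumption chosen to match the last output.

```r
x <- c(30.1, 30.4, 40, 30.2, 30.6, 40.1)
clone <- c("GRA02", "URO01", "URO03", "GRA02", "GRA01", "URO01")

logical <- x > 40     # note that 40 itself is not greater than 40
logical
which(logical)        # positions where the condition holds: 6
x[logical]            # values at those positions: 40.1
x[which(logical)]     # same result through the positions

clone %in% c("GRA02", "URO03")  # TRUE where the element is one of the listed names
```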
The functions any and all can also be useful. Look them up in the help.
Find more about other logical operators, like the > used, at this link.
Warning 1
Create a numeric sequence containing 10 integer values, and save it in an object called “a”.
## [1] 1 2 3 4 5 6 7 8 9 10
Create another sequence, using decimal numbers and any mathematical operation, so that its values are identical to object “a”.
## [1] 1 2 3 4 5 6 7 8 9 10
The two vectors look equal, don’t they?
Then, using a logical operator, let’s verify if object “b” is equal to object “a”.
## [1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
Some values are not equal. How is this possible?
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
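The difference comes from how decimal numbers are stored: values such as 0.1 have no exact binary representation, so arithmetic on them can be off by a tiny amount. A minimal sketch, assuming b is built from a decimal sequence (the course’s exact construction is not shown), of safer ways to compare:

```r
a <- 1:10
b <- seq(0.1, 1, by = 0.1) * 10   # assumed construction; prints as 1 2 ... 10

a == b                    # exact comparison: can be FALSE where rounding error crept in
abs(a - b) < 1e-8         # compare within a small tolerance instead
isTRUE(all.equal(a, b))   # all.equal() applies a tolerance to the whole vector
a == round(b)             # or round before comparing

0.1 + 0.2 == 0.3          # FALSE: the classic example of the same effect
```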
Warning 2
It’s not possible to mix different classes within the same vector. When trying to do this, notice that R will try to equalize to a single class:
## [1] "TRUE" "oops" "1"
In this case, all elements were transformed into characters.
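A short sketch of this coercion. R converts mixed values to the most flexible class present, in the order logical, integer, numeric, character:

```r
mixed <- c(TRUE, "oops", 1)
mixed                        # "TRUE" "oops" "1"
class(mixed)                 # "character"

c(TRUE, 1.5)                 # logical + numeric gives numeric: 1.0 1.5
class(c(TRUE, FALSE, 2L))    # logical + integer gives "integer"
```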
Some Tips:
- Be careful with operation priority; when in doubt, always add parentheses according to your priority interest.
- Remember that if you forget to close any ( or [ or ", R’s console will wait for you to close it, indicating with a +. Nothing will be processed until you directly type a ) in the console or press ESC.
- Be careful not to overwrite already created objects by creating others with the same name. Use, for example: height1, height2.
- Keep in your .R script only the commands that worked and, preferably, add comments. You can, for example, comment on difficulties encountered, so you don’t make the same mistakes later.
If you’re ahead of your colleagues, you can already do the exercises from Session 1; if not, do them at another time and send us your questions.
Matrices
Matrices are another class of objects widely used in R. With them we can perform large-scale operations in an automated way.
Since they are used in operations, we typically store numeric elements in them. To create a matrix, we determine a sequence of numbers and indicate the number of rows and columns in the matrix:
## [,1] [,2]
## [1,] 1 7
## [2,] 2 8
## [3,] 3 9
## [4,] 4 10
## [5,] 5 11
## [6,] 6 12
We can also use sequences already stored in vectors to generate a matrix, as long as they are numeric:
## [,1] [,2]
## [1,] 30.1 0.26
## [2,] 30.4 0.30
## [3,] 40.0 0.36
## [4,] 30.2 0.24
## [5,] 30.6 0.27
## [6,] 40.1 0.35
With them we can perform matrix operations:
## [,1] [,2]
## [1,] 2 14
## [2,] 4 16
## [3,] 6 18
## [4,] 8 20
## [5,] 10 22
## [6,] 12 24
## [,1] [,2]
## [1,] 1 49
## [2,] 4 64
## [3,] 9 81
## [4,] 16 100
## [5,] 25 121
## [6,] 36 144
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 50 58 66 74 82 90
## [2,] 58 68 78 88 98 108
## [3,] 66 78 90 102 114 126
## [4,] 74 88 102 116 130 144
## [5,] 82 98 114 130 146 162
## [6,] 90 108 126 144 162 180
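A sketch of commands consistent with the three results above. The object name X is an assumption; the operations are inferred from the printed values.

```r
X <- matrix(1:12, nrow = 6, ncol = 2)  # filled column by column

X * 2        # every element multiplied by 2
X * X        # element-by-element product
X %*% t(X)   # matrix product: (6 x 2) times (2 x 6) gives a 6 x 6 matrix
```

Note the difference between * (element by element) and %*% (matrix multiplication).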
Using these operations requires knowledge of matrix algebra. If you want to delve deeper into this, the book Linear Models in Statistics, Rencher (2008) has a good review about it. You can also explore R syntax for these operations at this link.
We access the numbers inside the matrix by giving the coordinates [row,column], as in the example:
## [1] 0.24
Sometimes it can be informative to give names to the columns and rows of the matrix. We do this with:
## height diameter
## GRA02 30.1 0.26
## URO01 30.4 0.30
## URO03 40.0 0.36
## GRA02 30.2 0.24
## GRA01 30.6 0.27
## URO01 40.1 0.35
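A sketch of building, indexing, and naming the matrix shown above. The name W is the one used later when the matrix is stored in a list; the construction with matrix() is an assumption.

```r
x <- c(30.1, 30.4, 40, 30.2, 30.6, 40.1)
y <- c(0.26, 0.3, 0.36, 0.24, 0.27, 0.35)
clone <- c("GRA02", "URO01", "URO03", "GRA02", "GRA01", "URO01")

W <- matrix(c(x, y), nrow = 6, ncol = 2)  # x fills column 1, y fills column 2
W[4, 2]                                   # row 4, column 2: 0.24

colnames(W) <- c("height", "diameter")
rownames(W) <- clone
W["URO03", "diameter"]                    # names also work as coordinates: 0.36
```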
The functions colnames and rownames also work with data.frames.
Data.frames
Unlike matrices, data.frames are not meant for matrix algebra, but they allow the combination of vectors with different classes. Data frames are similar to tables generated in other programs, like Excel.
Data frames are combinations of vectors of the same length. All the vectors we’ve created so far have length 6; verify this.
We can thus combine them into columns of a single data.frame:
field1 <- data.frame(
"clone" = clone, # Before the "=" sign
"height" = x, # we establish the names
"diameter" = y, # of the columns
"age" = rep(3:5, 2),
"cut" = logical
)
field1
## clone height diameter age cut
## 1 GRA02 30.1 0.26 3 FALSE
## 2 URO01 30.4 0.30 4 FALSE
## 3 URO03 40.0 0.36 5 FALSE
## 4 GRA02 30.2 0.24 3 FALSE
## 5 GRA01 30.6 0.27 4 FALSE
## 6 URO01 40.1 0.35 5 TRUE
We can access each of the columns with:
## [1] 3 4 5 3 4 5
Or also with:
## [1] 3 4 5 3 4 5
Here, the number after the comma inside the brackets refers to the column. The position before the comma refers to the row. Since we left it empty, we’re referring to all rows for that column.
This way, if we want to obtain specific content, we can give the coordinates with [row,column]:
## [1] 30.1
- Get the diameter of clone “URO03”.
## [1] 0.36
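A sketch of the access patterns above on a reduced copy of field1 (rebuilt here without the cut column so the example runs on its own). The last line shows one alternative to counting rows by hand; it is an illustration, not necessarily the course’s solution.

```r
field1 <- data.frame(
  clone    = c("GRA02", "URO01", "URO03", "GRA02", "GRA01", "URO01"),
  height   = c(30.1, 30.4, 40, 30.2, 30.6, 40.1),
  diameter = c(0.26, 0.3, 0.36, 0.24, 0.27, 0.35),
  age      = rep(3:5, 2)
)

field1$age      # column by name
field1[, 4]     # column by position (all rows, 4th column)
field1[1, 2]    # row 1, column 2: 30.1
field1[3, 3]    # diameter of the third tree (clone URO03): 0.36

# Select the row by its clone name instead of its position
field1[field1$clone == "URO03", "diameter"]
```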
Even though it’s a data frame, we can perform operations with the numeric vectors it contains.
- With the diameter and height of the trees, calculate the volume according to the following formula and store it in a volume object:
\(3.14 \times (diameter/2)^2 \times height\)
## [1] 1.597287 2.147760 4.069440 1.365523 1.751131 3.856116
Now, let’s add the calculated vector with the volume to our data frame. For this, use the cbind function.
## clone height diameter age cut volume
## 1 GRA02 30.1 0.26 3 FALSE 1.597287
## 2 URO01 30.4 0.30 4 FALSE 2.147760
## 3 URO03 40.0 0.36 5 FALSE 4.069440
## 4 GRA02 30.2 0.24 3 FALSE 1.365523
## 5 GRA01 30.6 0.27 4 FALSE 1.751131
## 6 URO01 40.1 0.35 5 TRUE 3.856116
## 'data.frame': 6 obs. of 6 variables:
## $ clone : chr "GRA02" "URO01" "URO03" "GRA02" ...
## $ height : num 30.1 30.4 40 30.2 30.6 40.1
## $ diameter: num 0.26 0.3 0.36 0.24 0.27 0.35
## $ age : int 3 4 5 3 4 5
## $ cut : logi FALSE FALSE FALSE FALSE FALSE TRUE
## $ volume : num 1.6 2.15 4.07 1.37 1.75 ...
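A sketch of the volume calculation and of adding it as a new column, on a reduced copy of field1 with only the columns needed here:

```r
field1 <- data.frame(
  height   = c(30.1, 30.4, 40, 30.2, 30.6, 40.1),
  diameter = c(0.26, 0.3, 0.36, 0.24, 0.27, 0.35)
)

# Volume = 3.14 * (diameter / 2)^2 * height, computed for all rows at once
volume <- 3.14 * (field1$diameter / 2)^2 * field1$height
volume

field1 <- cbind(field1, volume)  # bind the new vector as an extra column
str(field1)
```

An equivalent way to add the column is field1$volume <- volume.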
Some tips:
Remember that, to build matrices and data frames, the columns must have the same number of elements.
If you don’t know which operator or function should be used, search on Google or ask ChatGPT or a similar tool. For example, if you’re unsure how to calculate the standard deviation, search for “standard deviation R”. The R community is very active and most of your questions about it have already been answered somewhere on the web.
Don’t forget that everything you do in R needs to be explicitly indicated, like a multiplication 4ac with 4*a*c. To generate a vector 1,3,2,6 you need: c(1,3,2,6).
Lists
Lists consist of a collection of objects, not necessarily of the same class. In them, we can store all the other objects we’ve already seen and retrieve them through indexing with [[. As an example, let’s use some objects that have already been generated.
my_list <- list(
  field1 = field1,
  height_mean = tapply(field1$height, field1$age, mean),
  matrix_ex = W
)
str(my_list)
## List of 3
## $ field1 :'data.frame': 6 obs. of 6 variables:
## ..$ clone : chr [1:6] "GRA02" "URO01" "URO03" "GRA02" ...
## ..$ height : num [1:6] 30.1 30.4 40 30.2 30.6 40.1
## ..$ diameter: num [1:6] 0.26 0.3 0.36 0.24 0.27 0.35
## ..$ age : int [1:6] 3 4 5 3 4 5
## ..$ cut : logi [1:6] FALSE FALSE FALSE FALSE FALSE TRUE
## ..$ volume : num [1:6] 1.6 2.15 4.07 1.37 1.75 ...
## $ height_mean: num [1:3(1d)] 30.1 30.5 40
## ..- attr(*, "dimnames")=List of 1
## .. ..$ : chr [1:3] "3" "4" "5"
## $ matrix_ex : num [1:6, 1:2] 30.1 30.4 40 30.2 30.6 40.1 0.26 0.3 0.36 0.24 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "GRA02" "URO01" "URO03" "GRA02" ...
## .. ..$ : chr [1:2] "height" "diameter"
To access the data.frame field1 stored in the list, we can use either its position or its name:
## clone height diameter age cut volume
## 1 GRA02 30.1 0.26 3 FALSE 1.597287
## 2 URO01 30.4 0.30 4 FALSE 2.147760
## 3 URO03 40.0 0.36 5 FALSE 4.069440
## 4 GRA02 30.2 0.24 3 FALSE 1.365523
## 5 GRA01 30.6 0.27 4 FALSE 1.751131
## 6 URO01 40.1 0.35 5 TRUE 3.856116
## clone height diameter age cut volume
## 1 GRA02 30.1 0.26 3 FALSE 1.597287
## 2 URO01 30.4 0.30 4 FALSE 2.147760
## 3 URO03 40.0 0.36 5 FALSE 4.069440
## 4 GRA02 30.2 0.24 3 FALSE 1.365523
## 5 GRA01 30.6 0.27 4 FALSE 1.751131
## 6 URO01 40.1 0.35 5 TRUE 3.856116
To access a specific column in the data.frame field1, which is inside my_list:
## [1] 0.26 0.30 0.36 0.24 0.27 0.35
## [1] 0.26 0.30 0.36 0.24 0.27 0.35
## [1] 0.26 0.30 0.36 0.24 0.27 0.35
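A sketch of equivalent ways to reach an element of a list and a column inside it. The small list below is a stand-in built for illustration, not the course’s my_list.

```r
# Toy list (hypothetical contents) holding a data.frame and a note
my_list <- list(
  field1 = data.frame(height = c(30.1, 30.4), diameter = c(0.26, 0.30)),
  note   = "toy example"
)

# Three ways to get the data.frame
my_list[[1]]
my_list$field1
my_list[["field1"]]

# Three ways to get its diameter column
my_list$field1$diameter
my_list[[1]][, "diameter"]
my_list[["field1"]][["diameter"]]
```

Single brackets, as in my_list[1], return a list containing the element rather than the element itself.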
Lists are very useful, for example, when we are going to use/generate various objects within a loop.
Arrays
This is a type of object that you probably won’t use at the beginning, but it’s good to know about its existence. Arrays are used to store data with more than two dimensions. For example, if we create an array:
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18
##
## , , 4
##
## [,1] [,2] [,3]
## [1,] 19 21 23
## [2,] 20 22 24
We will have four matrices with two rows and three columns, and the numbers from 1 to 24 will be distributed in them by columns.
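A sketch of a call that produces the array printed above, plus indexing with three coordinates. The object name A is hypothetical.

```r
A <- array(1:24, dim = c(2, 3, 4))  # 2 rows, 3 columns, 4 matrices

dim(A)        # 2 3 4
A[2, 3, 4]    # row 2, column 3 of the fourth matrix: 24
A[, , 2]      # the whole second matrix
```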
If you’re ahead of your colleagues, you can already do the exercises from Session 2; if not, do them at another time and send us your questions through the forum.
Importing and Exporting Data
Download the objects created so far by clicking here
RData files are exclusive to R. You can open them by double-clicking the file or by using:
This will recover all objects created so far in the tutorial.
If you have internet access, you can also use the web link directly:
To generate this RData file, I ran all the code presented so far, and used:
This command saves everything in your Global Environment (all objects displayed in the upper-right corner). You can also save only a specific object by using:
If we remove it from our Global Environment with:
We can easily load it again with:
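A minimal sketch of the save, remove, and load cycle described above. The file name and its location in R’s temporary directory are hypothetical; in the course you would use a file in your working directory.

```r
x <- c(30.1, 30.4, 40, 30.2, 30.6, 40.1)

# Hypothetical file inside R's temporary directory
rdata_file <- file.path(tempdir(), "x_example.RData")

save(x, file = rdata_file)   # save only the object x
rm(x)                        # remove it from the Global Environment
exists("x")                  # FALSE in a fresh session: x is gone

load(rdata_file)             # bring it back under the same name
x

# save.image(file = ...) works the same way but stores every object
# in the Global Environment instead of a single one.
```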
The RData format is exclusive to R. It is useful for tasks like ours where we pause the analysis one day and resume another, without having to rerun everything. However, there are situations where you need to export your data to other programs that require different formats, such as .txt or .csv. For that, you can use:
write.table(field1, file = "campo1.txt", sep = ";", dec = ".", row.names = FALSE)
write.csv(field1, file = "campo1.csv", row.names = TRUE)
Note: You can use packages to export and import data in other formats. For example:
- The openxlsx package allows you to export and import Excel files.
- The vroom package can be used to handle large compressed tables.
To install packages, use the function install.packages("package_name") and to load the desired package, use library(package_name). More details about packages will be covered in the “Package Usage” section of this course.
When exporting data, there are multiple options for file formatting, which are important to consider if the file will be used in other software later.
Open the generated files in your text editor to view their formatting. Notice that the .txt file was saved with ; as the field separator and . as the decimal separator. Meanwhile, the .csv file was saved with , as the field separator and . as the decimal separator.
These files can also be read again in R by using the appropriate functions and specifications:
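library(openxlsx) # provides read.xlsx(); install it first with install.packages("openxlsx")
campo1_txt <- read.table(file = "campo1.txt", sep = ";", dec = ".", header = TRUE)
campo1_csv <- read.csv(file = "campo1.csv")
campo1_xlsx <- read.xlsx("campo1.xlsx", sheet = 1) # assumes campo1.xlsx exists in the working directory
head(campo1_txt)
head(campo1_csv)
head(campo1_xlsx)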
Now that we have learned how to import data, we will work with the dataset generated from the questionnaire sent to you and to students of other editions of this course.
The spreadsheet with the data is available at the link below. Add it to your working directory or specify the folder path when importing it into R, as shown below.
We can also import the data directly from GitHub by pointing to the web address:
Here, we will also use the argument na.strings to indicate how missing data was labeled.
dados <- read.csv(file = "data/dados_2025.csv", na.strings = "-", header = TRUE, dec = ",")
head(dados)
Let’s explore the structure of the collected data:
## 'data.frame': 124 obs. of 8 variables:
## $ Timestamp : chr "07/04/2025 14:00:00" "07/04/2025 14:00:00" "07/04/2025 14:00:00" "07/04/2025 14:00:00" ...
## $ Affiliation : chr "Rcourse2021" "Rcourse2021" "Rcourse2021" "Rcourse2021" ...
## $ Using.Google.Maps..please.provide.the.longitude.of.the.city.country.of.your.birth. : chr "-54.5724" "-47.6476" "-48.0547" "-106.6563" ...
## $ Using.Google.Maps..please.provide.the.latitude.of.the.city.country.of.your.birth. : chr "-25.5263" "-22.725" "-15.911" "52.1418" ...
## $ Background.Field..e.g..Molecular.Biology..Animal.Breeding..Genetics..etc.. : chr "Agronomia" "Agronomia" "Biotecnologia" "Licenciatura em Ciências Biológicas" ...
## $ Current.professional.affiliation : chr "PhD" "PhD" "Masters degree" "Other" ...
## $ If.you.chose.other.in.the..current.professional.affiliation..question.above..please.specify.: chr NA NA NA NA ...
## $ Level.of.R.knowledge : chr "Intermediate" "Intermediate" "Beginner (some knowledge)" "Intermediate" ...
## [1] 124 8
Observe that the column names still correspond to the complete questions from the questionnaire. Let’s simplify them to make them easier to work with:
## [1] "Timestamp"
## [2] "Affiliation"
## [3] "Using.Google.Maps..please.provide.the.longitude.of.the.city.country.of.your.birth."
## [4] "Using.Google.Maps..please.provide.the.latitude.of.the.city.country.of.your.birth."
## [5] "Background.Field..e.g..Molecular.Biology..Animal.Breeding..Genetics..etc.."
## [6] "Current.professional.affiliation"
## [7] "If.you.chose.other.in.the..current.professional.affiliation..question.above..please.specify."
## [8] "Level.of.R.knowledge"
colnames(dados) <- c("Date", "Affiliation", "Longitude", "Latitude", "Background", "Present_Occupation", "Explain", "KnowledgeR")
colnames(dados)
## [1] "Date" "Affiliation" "Longitude"
## [4] "Latitude" "Background" "Present_Occupation"
## [7] "Explain" "KnowledgeR"
## 'data.frame': 124 obs. of 8 variables:
## $ Date : chr "07/04/2025 14:00:00" "07/04/2025 14:00:00" "07/04/2025 14:00:00" "07/04/2025 14:00:00" ...
## $ Affiliation : chr "Rcourse2021" "Rcourse2021" "Rcourse2021" "Rcourse2021" ...
## $ Longitude : chr "-54.5724" "-47.6476" "-48.0547" "-106.6563" ...
## $ Latitude : chr "-25.5263" "-22.725" "-15.911" "52.1418" ...
## $ Background : chr "Agronomia" "Agronomia" "Biotecnologia" "Licenciatura em Ciências Biológicas" ...
## $ Present_Occupation: chr "PhD" "PhD" "Masters degree" "Other" ...
## $ Explain : chr NA NA NA NA ...
## $ KnowledgeR : chr "Intermediate" "Intermediate" "Beginner (some knowledge)" "Intermediate" ...
Now we will use the dataset to learn different commands and functions in the R environment.
First, let’s check how many students responded to the course questionnaire by counting the number of rows. For this, use the nrow function:
## [1] 124
Next, let’s check if our group includes people who share the same educational background and occupation.
This can be easily done using the table function, which indicates the frequency of each observation:
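A sketch of nrow and table on a small made-up data frame. The object toy and its values are hypothetical; the column names mirror those of dados.

```r
# Made-up stand-in for the questionnaire data
toy <- data.frame(
  Background         = c("Agronomia", "Agronomia", "Biologia", "Agronomia"),
  Present_Occupation = c("PhD", "Other", "PhD", "PhD")
)

nrow(toy)                                      # number of respondents: 4
table(toy$Background)                          # frequency of each background
table(toy$Background, toy$Present_Occupation)  # cross-tabulation of two columns
```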
Conditional Structures
if and else
For our next activity with the dataset, let’s first understand how the if and else structures work.
In the conditional structures if and else, we set a condition for if. If the condition is true, the first task will be performed; otherwise (else), another task will be performed. For example:
## two is not greater than three
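A minimal sketch of an if/else that produces the message above. The variable names a, b, and msg are hypothetical.

```r
a <- 2
b <- 3

if (a > b) {
  msg <- "two is greater than three"
} else {
  msg <- "two is not greater than three"  # runs because 2 > 3 is FALSE
}
cat(msg)
```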
- Activity: Determine the R Knowledge Level of the second person who responded (row 2). Print the message “Intermediate Level” if it is intermediate or “Basic or Advanced Level” otherwise. (Tip: the == operator means “exactly equal to”.)
## Date Affiliation Longitude Latitude
## 1 07/04/2025 14:00:00 Rcourse2021 -54.5724 -25.5263
## 2 07/04/2025 14:00:00 Rcourse2021 -47.6476 -22.725
## 3 07/04/2025 14:00:00 Rcourse2021 -48.0547 -15.911
## 4 07/04/2025 14:00:00 Rcourse2021 -106.6563 52.1418
## 5 07/04/2025 14:00:00 Rcourse2021 -47.6604 -22.7641
## 6 07/04/2025 14:00:00 Rcourse2021 -47.6434 -22.7118
## Background Present_Occupation Explain
## 1 Agronomia PhD <NA>
## 2 Agronomia PhD <NA>
## 3 Biotecnologia Masters degree <NA>
## 4 Licenciatura em Ciências Biológicas Other <NA>
## 5 Agronomia Profissional <NA>
## 6 Agronomia PhD <NA>
## KnowledgeR
## 1 Intermediate
## 2 Intermediate
## 3 Beginner (some knowledge)
## 4 Intermediate
## 5 Beginner (some knowledge)
## 6 Intermediate
if (dados$KnowledgeR[2] == "Intermediate") {
  cat("Intermediate Level")
} else {
  cat("Basic or Advanced Level")
}
## Intermediate Level
We can specify more than one condition by chaining the if and else structures (else if).
Now, let’s examine the occupation of the respondents to the questionnaire:
if (dados$Present_Occupation[2] == "Other") {
  print(dados$Explain[2])
} else if (dados$Present_Occupation[2] == "PhD") {
  print(paste("Is your PhD in:", dados$Background[2], "?"))
} else {
  print(paste("Are you still working with:", dados$Background[2], "?"))
}
## [1] "Is your PhD in: Agronomia ?"
However, note that if evaluates a single TRUE or FALSE value, so these structures handle one element of a vector at a time. If we want to go through all the elements, we need to use another resource.
Repetition Structures
For
This can be done with for, a powerful and widely used tool. It is a loop structure, meaning it repeats the same activity once for each element of a given sequence. See the examples below:
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 5 6 7 8 9 10 11 12 13 14
In the examples above, i works as an index that varies from 1 to 10 during the operation specified within the braces.
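A sketch of two loops consistent with the outputs above. The first prints the index; the second stores a result per iteration. The object name result and the operation i + 4 are assumptions chosen to reproduce the printed vector.

```r
# Print each value of the index
for (i in 1:10) {
  print(i)
}

# Fill a vector one position at a time
result <- numeric(10)      # empty numeric vector with 10 slots
for (i in 1:10) {
  result[i] <- i + 4
}
result                     # 5 6 7 8 9 10 11 12 13 14
```

Creating the vector with its final length before the loop avoids growing it at every iteration.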
Using this structure, we can repeat the operation performed with the if and else structures for the entire vector:
for (i in 1:nrow(dados)) {
  if (dados$Present_Occupation[i] == "Other") {
    print(dados$Explain[i])
  } else if (dados$Present_Occupation[i] == "PhD") {
    print(paste("Is your PhD in:", dados$Background[i], "?"))
  } else {
    print(paste("Are you still working with:", dados$Background[i], "?"))
  }
}
Note that some participants who chose “Other” did not explain their answer, so the Explain column holds NA (not available) for them. We can use the is.na function to detect these missing values and print a clearer message instead:
for (i in 1:nrow(dados)) {
  if (dados$Present_Occupation[i] == "Other") {
    if (is.na(dados$Explain[i])) {
      print("This person did not explain what they meant by 'Other'")
    } else {
      print(dados$Explain[i])
    }
  } else if (dados$Present_Occupation[i] == "PhD") {
    print(paste("Is your PhD in:", dados$Background[i], "?"))
  } else {
    print(paste("Are you still working with:", dados$Background[i], "?"))
  }
}
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Biotecnologia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Zootecnia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Engenharia Florestal ?"
## [1] "Is your PhD in: Bacharel em Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Engenharia Agrícola ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Engenharia Agronômica ?"
## [1] "Are you still working with: Ciências Biológicas ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Engenheiro Florestal ?"
## [1] "Are you still working with: Genética e Melhoramento de Plantas ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Are you still working with: Agronomia ?"
## [1] "Are you still working with: Engenharia Agronômica ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Ciências Biológicas ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Are you still working with: Ciências Biológicas ?"
## [1] "Is your PhD in: Engenharia Agronômica ?"
## [1] "Are you still working with: Ciências Biológicas ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Biotecnologia ?"
## [1] "Is your PhD in: Biologia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Are you still working with: Biologia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Biologia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Are you still working with: Melhoramento genético ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Ciências Biológicas ?"
## [1] "Are you still working with: Ciências Biológicas ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Are you still working with: Biologia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Zootecnia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Biotecnologia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Genética e Melhoramento ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Biologia ?"
## [1] "Is your PhD in: Genética e Melhoramento de Plantas ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Engeheria em Biotecnologia Vegetal (Chile) ?"
## [1] "Are you still working with: AGRONOMIA ?"
## [1] "Are you still working with: Engenharia Agronômica ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Biología ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Ciências Biológicas ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Ciências Biológicas ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Engenharia Biotecnológica ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Engenharia Florestal ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Biologia ?"
## [1] "Is your PhD in: Biologia ?"
## [1] "Is your PhD in: Ciências Biológicas ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Engenheiro Agrônomo ?"
## [1] "Is your PhD in: Genética e Melhoramento ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Are you still working with: Engenharia Agronômica / Licenciatura em Ciências Agrárias ?"
## [1] "Are you still working with: Ciencias Biologicas ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Eng. Florestal ?"
## [1] "Postdoc"
## [1] "Are you still working with: Psychology ?"
## [1] "Are you still working with: Plant Genetics ?"
## [1] "Are you still working with: Plant Breeding ?"
## [1] "Finished my Masters "
## [1] "Are you still working with: Soil Sciencies ?"
## [1] "Masters degree"
## [1] "PhD"
Tip: Indentation
Notice the difference:
# Without indentation
for (i in 1:nrow(dados)) {
if (dados$Present_Occupation[i] == "Other") {
if (is.na(dados$Explain[i])) {
print("This person did not explain what they meant by 'Other'")
} else {
print(dados$Explain[i])
}
} else if (dados$Present_Occupation[i] == "PhD") {
print(paste("Is your PhD in:", dados$Background[i], "?"))
} else {
print(paste("Are you still working with:", dados$Background[i], "?"))
}
}
RStudio’s code editor makes it easy to indent R code. Select the area you want to indent and press Ctrl+I.
Now let’s work with column 5, which contains the participants’ backgrounds. Notice that the table function returns different categories that could be grouped into just one, such as “Agronomic Engineering” and “Agronomy.” We will use a loop to identify which participants responded with areas related to “Agro.” Then, we will determine which responses weren’t “Agronomy” and ask the participants to modify their answers to “Agronomy.”
Tip: To identify the pattern “Agro,” we can use the grepl function.
## [1] "Agronomia"
## [2] "Agronomia"
## [3] "Biotecnologia"
## [4] "Licenciatura em Ciências Biológicas"
## [5] "Agronomia"
## [6] "Agronomia"
## [7] "Zootecnia"
## [8] "Agronomia"
## [9] "Agronomia"
## [10] "Engenharia Florestal"
## [11] "Bacharel em Agronomia"
## [12] "Engenharia Agronômica"
## [13] "Engenharia Agrícola"
## [14] "Agronomia"
## [15] "Engenharia Agronômica"
## [16] "Ciências Biológicas"
## [17] "Agronomia"
## [18] "Engenheiro Florestal"
## [19] "Genética e Melhoramento de Plantas"
## [20] "Agronomia"
## [21] "Agronomia"
## [22] "Engenharia Agronômica"
## [23] "Agronomia"
## [24] "Agronomia"
## [25] "Ciências Biológicas"
## [26] "Agronomia"
## [27] "Ciências Biológicas"
## [28] "Engenharia Agronômica"
## [29] "Ciências Biológicas"
## [30] "Agronomia"
## [31] "Biotecnologia"
## [32] "Biologia"
## [33] "Agronomia"
## [34] "Biologia"
## [35] "Agronomia"
## [36] "Biologia"
## [37] "Agronomia"
## [38] "Melhoramento genético"
## [39] "FAEM - UFPEL"
## [40] "Agronomia"
## [41] "Ciências Biológicas"
## [42] "Ciências Biológicas"
## [43] "Agronomia"
## [44] "Biologia"
## [45] "Agronomia"
## [46] "Agronomia"
## [47] "Eng. Agronômica"
## [48] "Zootecnia"
## [49] "Agronomia"
## [50] "Biotecnologia"
## [51] "Agronomia"
## [52] "Agronomia"
## [53] "Agronomia"
## [54] "Agronomia"
## [55] "Agronomia"
## [56] "Agronomia"
## [57] "Agronomia"
## [58] "Genética e Melhoramento"
## [59] "Agronomia"
## [60] "Agronomia"
## [61] "Agronomia"
## [62] "Ciências bilógicas"
## [63] "Agronomia"
## [64] "Biologia"
## [65] "Genética e Melhoramento de Plantas"
## [66] "Agronomia"
## [67] "Agronomia"
## [68] "Agronomia"
## [69] "Agronomia"
## [70] "Agronomia"
## [71] "Agronomia"
## [72] "Agronomia"
## [73] "Agronomia"
## [74] "Engeheria em Biotecnologia Vegetal (Chile)"
## [75] "AGRONOMIA"
## [76] "Engenharia Agronômica"
## [77] "Agronomia"
## [78] "Agronomia"
## [79] "Biología"
## [80] "Agronomia"
## [81] "Ciências Biológicas"
## [82] "Agronomia"
## [83] "Agronomia"
## [84] "Agronomia"
## [85] "Agronomia"
## [86] "Agronomia"
## [87] "Ciências Biológicas"
## [88] "Agronomia"
## [89] "Agronomia"
## [90] "Agronomia"
## [91] "Agronomia"
## [92] "Agronomia"
## [93] "Engenharia Biotecnológica"
## [94] "Agronomia"
## [95] "Agronomia"
## [96] "Engenharia Florestal"
## [97] "Agronomia"
## [98] "Agronomia"
## [99] "Biologia"
## [100] "Biologia"
## [101] "Ciências Biológicas"
## [102] "Agronomia"
## [103] "Agronomia"
## [104] "Agronomia"
## [105] "Engenheiro Agrônomo"
## [106] "Genética e Melhoramento"
## [107] "Agronomia"
## [108] "Agronomia"
## [109] "Biotecnologia"
## [110] "Agronomia"
## [111] "Agronomia"
## [112] "Biotecnologia"
## [113] "Engenharia Agronômica / Licenciatura em Ciências Agrárias"
## [114] "Ciencias Biologicas"
## [115] "Agronomia"
## [116] "Eng. Florestal"
## [117] "Genetics"
## [118] "Psychology"
## [119] "Plant Genetics"
## [120] "Plant Breeding"
## [121] "Climate Change"
## [122] "Soil Sciencies "
## [123] "Animal Science"
## [124] "Animal Breeding and Genetics"
## [1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
## [13] FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [25] FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
## [37] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE
## [49] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [61] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [73] TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
## [85] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## [97] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## [109] FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE
## [1] "Agronomia"
## [2] "Agronomia"
## [3] "Agronomia"
## [4] "Agronomia"
## [5] "Agronomia"
## [6] "Agronomia"
## [7] "Bacharel em Agronomia"
## [8] "Engenharia Agronômica"
## [9] "Agronomia"
## [10] "Engenharia Agronômica"
## [11] "Agronomia"
## [12] "Agronomia"
## [13] "Agronomia"
## [14] "Engenharia Agronômica"
## [15] "Agronomia"
## [16] "Agronomia"
## [17] "Agronomia"
## [18] "Engenharia Agronômica"
## [19] "Agronomia"
## [20] "Agronomia"
## [21] "Agronomia"
## [22] "Agronomia"
## [23] "Agronomia"
## [24] "Agronomia"
## [25] "Agronomia"
## [26] "Agronomia"
## [27] "Eng. Agronômica"
## [28] "Agronomia"
## [29] "Agronomia"
## [30] "Agronomia"
## [31] "Agronomia"
## [32] "Agronomia"
## [33] "Agronomia"
## [34] "Agronomia"
## [35] "Agronomia"
## [36] "Agronomia"
## [37] "Agronomia"
## [38] "Agronomia"
## [39] "Agronomia"
## [40] "Agronomia"
## [41] "Agronomia"
## [42] "Agronomia"
## [43] "Agronomia"
## [44] "Agronomia"
## [45] "Agronomia"
## [46] "Agronomia"
## [47] "Agronomia"
## [48] "Engenharia Agronômica"
## [49] "Agronomia"
## [50] "Agronomia"
## [51] "Agronomia"
## [52] "Agronomia"
## [53] "Agronomia"
## [54] "Agronomia"
## [55] "Agronomia"
## [56] "Agronomia"
## [57] "Agronomia"
## [58] "Agronomia"
## [59] "Agronomia"
## [60] "Agronomia"
## [61] "Agronomia"
## [62] "Agronomia"
## [63] "Agronomia"
## [64] "Agronomia"
## [65] "Agronomia"
## [66] "Agronomia"
## [67] "Agronomia"
## [68] "Agronomia"
## [69] "Agronomia"
## [70] "Agronomia"
## [71] "Agronomia"
## [72] "Agronomia"
## [73] "Engenharia Agronômica / Licenciatura em Ciências Agrárias"
## [74] "Agronomia"
for (i in 1:nrow(dados)) {
  if (grepl("Agro", dados[i, 5])) {
    if (dados[i, 5] != "Agronomy") {
      print("Please replace your response with Agronomy.")
    }
  }
}
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
Notice that the code above does not return the rows with incorrect data; it only prints a message. To fix this, we need to store these rows in a vector and then access them.
homog <- vector()
for (i in 1:nrow(dados)) {
  if (grepl("Agro", dados[i, 5])) {
    if (dados[i, 5] != "Agronomy") {
      print("Please replace your response with Agronomy.")
      homog <- c(homog, i)
    }
  }
}
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] 1 2 5 6 8 9 11 12 14 15 17 20 21 22 23 24 26 28 30
## [20] 33 35 37 40 43 45 46 47 49 51 52 53 54 55 56 57 59 60 61
## [39] 63 66 67 68 69 70 71 72 73 76 77 78 80 82 83 84 85 86 88
## [58] 89 90 91 92 94 95 97 98 102 103 104 107 108 110 111 113 115
How would you correct these incorrect elements? Try it!
dados[homog, 5] <- "Agronomy"
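The same selection and replacement can also be done without a loop. The sketch below uses a small made-up vector (the name background and its values are assumptions, not the course data) to show grepl and which working on all elements at once. Note that != is an exact comparison, so the target string must be spelled exactly as it appears in the data.

```r
# Toy vector standing in for the background column (hypothetical values)
background <- c("Agronomy", "Agronomic Engineering", "Biology", "Agronomy Eng.")

# grepl() returns one TRUE/FALSE per element
is_agro <- grepl("Agro", background)

# Positions that mention "Agro" but are not exactly "Agronomy"
to_fix <- which(is_agro & background != "Agronomy")
to_fix

# Replace all of them in a single assignment
background[to_fix] <- "Agronomy"
background
```

Storing the positions first, as with homog above, lets you inspect them before overwriting anything.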
While
In this type of repetition structure, the task is repeated for as long as a certain condition remains true; the loop stops when the condition becomes false.
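The loop that produced the output below is not shown here. A minimal sketch that prints the same values, assuming a counter x that starts at 1 and is increased by one on each pass:

```r
x <- 1
while (x < 5) {
  x <- x + 1  # update the variable tested in the condition
  cat(x)      # print the current value without a line break
}
```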
## 2345
It’s very important that the condition eventually becomes false; otherwise the loop will run indefinitely and you’ll have to interrupt it by external means. In RStudio, you can click the red stop symbol in the top right corner of the console window or press Esc in the console. When running R in a terminal, press Ctrl+C.
It’s not very difficult for this to happen, just a small error like:
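The original example is not shown here. One common slip, sketched below as an assumption, is forgetting to update the variable tested in the condition. The counter n_iter and the cap of 1000 passes are additions made only so that this sketch stops on its own; without them the loop would never end.

```r
x <- 1
n_iter <- 0
while (x < 5) {
  # The line `x <- x + 1` was forgotten, so x < 5 stays TRUE
  n_iter <- n_iter + 1
  if (n_iter >= 1000) {
    break  # safety cap added for this sketch only
  }
}
x       # still 1
n_iter  # number of passes before the safety cap stopped the loop
```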
Here we can use the break and next commands to meet other conditions, like:
## 23
## 235
The break command stops the loop completely when the condition is met, while next skips the rest of the current iteration and continues with the next one.
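The loops behind the two outputs above are not shown. A sketch that prints the same values, assuming a counter x starting at 1 and a special case when x equals 4 (the names x_after_break and x_after_next are illustrative):

```r
# break: leave the loop as soon as x reaches 4
x <- 1
while (x < 5) {
  x <- x + 1
  if (x == 4) break
  cat(x)
}
x_after_break <- x
cat("\n")

# next: skip only the pass where x is 4, then carry on
x <- 1
while (x < 5) {
  x <- x + 1  # must come before next, or x would stay at 4 forever
  if (x == 4) next
  cat(x)
}
x_after_next <- x
```

With break the loop ends while x is 4, so 4 and 5 are never printed. With next only 4 is skipped.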
Repeat
This structure also requires a stop condition, but this condition is necessarily placed inside the code block using break. It then repeats the code block until the condition interrupts it.
## 234
The repeat structure is similar to while, but with the key difference that the stop condition must be explicitly defined within the code block using break. It will continue to execute the code block indefinitely until it encounters the break statement.
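The code for the repeat output above is not shown. A minimal sketch that prints the same values, assuming a counter x starting at 1 and a stop condition at 4:

```r
x <- 1
repeat {
  x <- x + 1
  cat(x)
  if (x == 4) {
    break  # the only way out of a repeat block
  }
}
```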
Loops within Loops
It’s also possible to use repetition structures within repetition structures. For example, if we want to work on both columns and rows of a matrix.
# Creating an empty matrix
ex_mat <- matrix(nrow = 10, ncol = 10)
# each number inside the matrix will be the product of the column index by the row index
for (i in 1:dim(ex_mat)[1]) {
  for (j in 1:dim(ex_mat)[2]) {
    ex_mat[i, j] <- i * j
  }
}
Another example of use:
var1 <- c("fertilizer1", "fertilizer2")
var2 <- c("ESS", "URO", "GRA")
w <- 1
for (i in var1) {
  for (j in var2) {
    file_name <- paste0(i, "_plant_", j, ".txt")
    file <- data.frame("block" = "fake_data", "treatment" = "fake_data")
    write.table(file, file = file_name)
    w <- w + 1
  }
}
# Check your working directory, files should have been generated
If you’re ahead of your colleagues, you can already do the exercises from Session 3. If not, do them at another time and send us your questions through the forum.
Some tips:
- Be careful when running the same command multiple times, some variables might not be the same as they were before. For the command to work the same way, the input objects need to be in the form you expect.
- Remember that = is for defining objects and == is the equality sign.
- In conditional and repetition structures, remember that it’s necessary to maintain the expected syntax: if(){} and for(i in 1:10){}. R is case-sensitive, so If will not work. In for, we can change the letter that will be the index, but it’s always necessary to provide a vector to iterate over, such as a sequence of integers or characters.
- Using indentation helps to visualize the beginning and end of each code structure and makes it easier to open and close braces. Indentation refers to those spaces we use before the line, like:
# Creating an empty matrix
ex_mat <- matrix(nrow = 10, ncol = 10)
# each number inside the matrix will be the product of the column index by the row index
for (i in 1:dim(ex_mat)[1]) { # First level, no space
  for (j in 1:dim(ex_mat)[2]) { # Second level has one space (tab)
    ex_mat[i, j] <- i * j # Third level has two spaces
  } # Closed the second level
} # Closed the first level
The consistent use of indentation makes your code more readable and helps prevent errors by making the structure clearer. Most modern IDEs, including RStudio, provide automatic indentation features to help maintain this consistency.
Vectorization
Although loops are intuitive and easier to understand, in R they are often slower and less efficient than vectorization. Vectorization is a technique that allows operations to be applied to all elements of a vector or matrix at once, without the need to iterate over each element individually.
Here is a simple example of non-vectorized code (using a loop) and its vectorized version:
# Not vectorized (using loop)
numbers <- 1:5
loop_result <- numeric(length(numbers))
for (i in 1:length(numbers)) {
  loop_result[i] <- numbers[i] * 2
}
# Vectorized approach
numbers <- 1:5
vectorized_result <- numbers * 2
loop_result == vectorized_result
## [1] TRUE TRUE TRUE TRUE TRUE
This code transformation can become more complex depending on the scenario. For example, think about what a vectorized version of the previous loop would look like:
# Not vectorized (using loop)
ex_mat <- matrix(nrow = 10, ncol = 10)
for (i in 1:dim(ex_mat)[1]) {
  for (j in 1:dim(ex_mat)[2]) {
    ex_mat[i, j] <- i * j
  }
}
# Vectorized?
This is a good moment for you to practice using an AI tool to help you transform the code into a vectorized version. You can use ChatGPT or Copilot, for example. Compare the result of the provided code with what you generated with the loop to verify that the tool is really doing what you want. This transformation is worth it if the loop code is taking too long or if you have to run the same code many times.
ex_mat <- outer(1:10, 1:10, "*")
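Whatever the source of a vectorized version, it is worth checking it against the loop. A small sketch of that comparison, with the illustrative names loop_mat and vec_mat used so that the two results can sit side by side:

```r
# Loop version
loop_mat <- matrix(nrow = 10, ncol = 10)
for (i in 1:nrow(loop_mat)) {
  for (j in 1:ncol(loop_mat)) {
    loop_mat[i, j] <- i * j
  }
}

# Vectorized version: outer() applies "*" to every pair (i, j)
vec_mat <- outer(1:10, 1:10, "*")

# TRUE only if every cell matches
all(loop_mat == vec_mat)
```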
Creating Functions
If you’re already comfortable using loops, you might be wondering: “What if I want to do this several times?” or “What if I want to apply this logic to different datasets?” That’s where functions come in.
We can create custom functions to perform specific tasks. The basic syntax to create a function in R is:
The function my_function takes two arguments (arg1 and arg2) and returns their sum. You can call the function by passing the desired values:
## [1] 8
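The definition and the call are not shown above. A minimal sketch that matches the description; the argument values 3 and 5 are an assumption, chosen only because they add up to the output shown:

```r
# General pattern: name <- function(arguments) { body }
my_function <- function(arg1, arg2) {
  result <- arg1 + arg2
  return(result)
}

# Calling the function with two values
my_function(3, 5)
```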
Example of a custom function using vectorization:
vectorized_sum <- function(vector) {
  # Check if the vector is numeric
  if (!is.numeric(vector)) {
    stop("The vector must be numeric.")
  }
  # Sum the vector elements
  total <- sum(vector)
  # Z-score standardization
  z_score <- (vector - mean(vector)) / sd(vector)
  result <- list(sum = total, z_score = z_score)
  return(result)
}
# Calling the function
vectorized_result <- vectorized_sum(c(1, 2, 3, 4, 5))
vectorized_result
## $sum
## [1] 15
##
## $z_score
## [1] -1.2649111 -0.6324555 0.0000000 0.6324555 1.2649111
Example of a function using a data.frame as input. Note that repeated use of the function will require that the input data has the same format or at least the required columns. A good practice is to check if the required columns are present before performing calculations.
calc_volume <- function(data_frame) {
  if (!all(c("diameter", "height") %in% colnames(data_frame))) {
    stop("The columns 'diameter' and 'height' must be present in the data frame.")
  }
  volume <- 3.14 * ((data_frame$diameter / 2)^2) * data_frame$height
  return(volume)
}
It’s recommended to always document your functions. There is a package called roxygen2 that automatically generates manual pages for documenting functions in a package. It requires proper syntax, as shown below:
#' Calculate the volume of cylinders based on diameter and height
#'
#' @param data_frame A data frame containing the columns "diameter" and "height".
#'
#' @return A numeric vector with the calculated volumes for each cylinder.
#'
#' @details
#' Calculates the volume using the formula:
#' \deqn{Volume = \pi \times \left(\frac{diameter}{2}\right)^2 \times height}
#' If the required columns are missing, the function will stop with an error.
#'
#' @examples
#' df <- data.frame(diameter = c(4, 6), height = c(10, 15))
#' calc_volume(df)
#' # [1] 125.6 423.9
#'
#' @note
#' The function uses 3.14 as an approximation for pi. For higher precision, consider replacing 3.14 with the built-in `pi` constant.
calc_volume <- function(data_frame) {
  if (!all(c("diameter", "height") %in% colnames(data_frame))) {
    stop("The columns 'diameter' and 'height' must be present in the data frame.")
  }
  volume <- 3.14 * ((data_frame$diameter / 2)^2) * data_frame$height
  return(volume)
}
If this function is part of a package, you just need to run the command roxygen2::roxygenise() to create the help page. See an example of an R package structure at https://github.com/Breeding-Insight/BIGr
apply Function Family
The apply family of functions can also be used as repetition structures. Their syntax is more concise compared to for or while and can simplify code writing.
apply
The apply function is the base of its family, so understanding it is essential. Its syntax is: apply(X, MARGIN, FUN, ...), where X is an array (including matrices); MARGIN is 1 to apply to rows, 2 to columns, and c(1,2) to both; and FUN is the function to apply.
Simple matrix example:
Sum of columns:
## [1] 3 15 27 39
Sum of rows:
## [1] 36 48
Equivalent using for loops:
## [1] 3
## [1] 15
## [1] 27
## [1] 39
## [1] 36
## [1] 48
Example using a custom function:
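The calls behind the outputs above are not shown here. The sketch below is one way to reproduce them, assuming a 2 × 4 matrix built from seq(0, 21, 3), chosen because it gives the same column and row sums. The custom function range_width is hypothetical and only illustrates passing your own function to apply.

```r
# Assumed example matrix: 2 rows, 4 columns, filled column by column
ex_mat <- matrix(seq(0, 21, 3), nrow = 2)

# MARGIN = 2 applies the function to each column
col_sums <- apply(ex_mat, 2, sum)
col_sums

# MARGIN = 1 applies the function to each row
row_sums <- apply(ex_mat, 1, sum)
row_sums

# Equivalent with for loops
for (j in 1:ncol(ex_mat)) {
  print(sum(ex_mat[, j]))
}
for (i in 1:nrow(ex_mat)) {
  print(sum(ex_mat[i, ]))
}

# Hypothetical custom function: largest value minus smallest value
range_width <- function(x) {
  max(x) - min(x)
}
row_widths <- apply(ex_mat, 1, range_width)
row_widths
```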
lapply
Unlike apply, lapply can take vectors and lists (mainly used with lists) and returns a list.
ex_list <- list(
  A = matrix(seq(0, 21, 3), nrow = 2),
  B = matrix(seq(0, 14, 2), nrow = 2),
  C = matrix(seq(0, 39, 5), nrow = 2)
)
str(ex_list)
## List of 3
## $ A: num [1:2, 1:4] 0 3 6 9 12 15 18 21
## $ B: num [1:2, 1:4] 0 2 4 6 8 10 12 14
## $ C: num [1:2, 1:4] 0 5 10 15 20 25 30 35
Select the second element of each matrix:
## $A
## [1] 3
##
## $B
## [1] 2
##
## $C
## [1] 5
Using a custom function:
## $A
## [,1] [,2]
## [1,] -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983
## [3,] 0.3872983 0.3872983
## [4,] 1.1618950 1.1618950
##
## $B
## [,1] [,2]
## [1,] -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983
## [3,] 0.3872983 0.3872983
## [4,] 1.1618950 1.1618950
##
## $C
## [,1] [,2]
## [1,] -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983
## [3,] 0.3872983 0.3872983
## [4,] 1.1618950 1.1618950
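The calls that produced the two outputs above are not shown. A sketch that gives the same results, assuming the first call extracts the second element with "[" and the custom function standardizes each row (z_by_row is a hypothetical name):

```r
ex_list <- list(
  A = matrix(seq(0, 21, 3), nrow = 2),
  B = matrix(seq(0, 14, 2), nrow = 2),
  C = matrix(seq(0, 39, 5), nrow = 2)
)

# "[" is itself a function, so x[2] can be requested for every list element
second_elements <- lapply(ex_list, "[", 2)
second_elements

# Hypothetical custom function: z-scores within each row of a matrix.
# apply() returns one column per row, so each result is 4 x 2.
z_by_row <- function(m) {
  apply(m, 1, function(r) (r - mean(r)) / sd(r))
}
z_list <- lapply(ex_list, z_by_row)
z_list
```

All three matrices give the same z-scores because each row is an evenly spaced sequence, and standardizing removes the differences in location and scale.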
sapply
sapply is a variant of lapply that tries to simplify the output into a vector, matrix, or array.
## A B C
## 3 2 5
## A B C
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,] 0.3872983 0.3872983 0.3872983
## [4,] 1.1618950 1.1618950 1.1618950
## [5,] -1.1618950 -1.1618950 -1.1618950
## [6,] -0.3872983 -0.3872983 -0.3872983
## [7,] 0.3872983 0.3872983 0.3872983
## [8,] 1.1618950 1.1618950 1.1618950
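The sapply calls are not shown above. A sketch consistent with both outputs, assuming the same two operations used in the lapply section (the helper name z_by_row is hypothetical):

```r
ex_list <- list(
  A = matrix(seq(0, 21, 3), nrow = 2),
  B = matrix(seq(0, 14, 2), nrow = 2),
  C = matrix(seq(0, 39, 5), nrow = 2)
)

# One value per list element is simplified to a named vector
second_vec <- sapply(ex_list, "[", 2)
second_vec

# Hypothetical helper: z-scores within each row
z_by_row <- function(m) {
  apply(m, 1, function(r) (r - mean(r)) / sd(r))
}

# Each 4 x 2 result is flattened to 8 values, giving an 8 x 3 matrix
z_mat <- sapply(ex_list, z_by_row)
z_mat
```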
tapply
tapply applies functions based on the levels of a categorical variable (factor), commonly used with data frames.
dados$Affiliation <- as.factor(dados$Affiliation)
dados$KnowledgeR_num <- NA
dados$KnowledgeR_num[dados$KnowledgeR == "Advanced"] <- 3
dados$KnowledgeR_num[dados$KnowledgeR == "Intermediate"] <- 2
dados$KnowledgeR_num[dados$KnowledgeR == "Beginner (some knowledge)"] <- 1
dados$KnowledgeR_num[dados$KnowledgeR == "No R knowledge"] <- 0
dados$KnowledgeR_num[dados$KnowledgeR == ""] <- NA
tapply(dados$KnowledgeR_num, dados$Affiliation, mean)
## Agronomic Engineer
## 1.00000
## Breeding Insight
## 1.75000
## CIA Central Pecuario
## 0.00000
## INTA
## 1.00000
## National Institute of Innovation and Transfer in Agricultural Technology
## 2.00000
## Rcourse2021
## 1.87069
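When the numeric column contains NA, the mean of any group with a missing value is also NA. The toy sketch below (the vectors scores and group are made-up values, not the course data) shows how extra arguments placed after the function are passed on to it:

```r
# Toy data (hypothetical values)
scores <- c(3, 2, NA, 0, 1, 2)
group <- factor(c("A", "A", "A", "B", "B", "B"))

# One missing value makes the whole group mean NA
means_with_na <- tapply(scores, group, mean)
means_with_na

# Arguments after FUN are passed to it, here mean(..., na.rm = TRUE)
means_no_na <- tapply(scores, group, mean, na.rm = TRUE)
means_no_na
```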
mapply
mapply is a multivariate version of sapply, allowing functions to be applied to multiple vectors.
sum_fun <- function(x, y) {
  return(x + y)
}
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
result_mapply <- mapply(sum_fun, vector1, vector2)
print(result_mapply)
## [1] 5 7 9
multiply <- function(x, y, z) {
  return(x * y * z)
}
vector3 <- c(7, 8, 9)
result_mapply <- mapply(multiply, vector1, vector2, vector3)
print(result_mapply)
## [1] 28 80 162
result_mapply <- mapply(function(x, y) {
  return(c(sum = x + y, product = x * y))
}, vector1, vector2)
print(result_mapply)
## [,1] [,2] [,3]
## sum 5 7 9
## product 4 10 18
sum_product <- function(x, y) {
  return(c(sum = x + y, product = x * y))
}
result_mapply <- mapply(sum_product, vector1, vector2)
print(result_mapply)
## [,1] [,2] [,3]
## sum 5 7 9
## product 4 10 18
sum_product <- function(x, y, z) {
  return(c(sum = x + y + z, product = x * y * z))
}
result_mapply <- mapply(sum_product, vector1, vector2, vector3)
print(result_mapply)
## [,1] [,2] [,3]
## sum 12 15 18
## product 28 80 162
If you’re ahead of your classmates, you can go ahead and do the Extra session exercises. If not, do them at home and send us your questions via the forum.
Long and Wide Format
In data analysis and visualization with R, especially using the tidyverse, the structure of your data matters a lot. tidyverse is a collection of R packages designed for data science, and it emphasizes the use of tidy data principles. Tidy data is a standardized way of organizing data that makes it easier to work with. Here is a list of all packages in the tidyverse:
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## [1] "broom" "conflicted" "cli" "dbplyr"
## [5] "dplyr" "dtplyr" "forcats" "ggplot2"
## [9] "googledrive" "googlesheets4" "haven" "hms"
## [13] "httr" "jsonlite" "lubridate" "magrittr"
## [17] "modelr" "pillar" "purrr" "ragg"
## [21] "readr" "readxl" "reprex" "rlang"
## [25] "rstudioapi" "rvest" "stringr" "tibble"
## [29] "tidyr" "xml2" "tidyverse"
There are two main formats for organizing data:
- Wide format: one row per subject with multiple columns for repeated measures.
- Long format: one row per observation, making it tidy and compatible with tools like ggplot2.
Let’s consider a dataset showing populations (in millions) of 10 countries across three years:
# Wide format data (realistic population estimates in millions)
data_wide <- tibble(
country = c(
"USA", "Canada", "Mexico", "Brazil", "Costa Rica",
"Uruguay", "China", "Japan", "India", "Greenland"
),
`2000` = c(282, 31, 98, 174, 4, 3.3, 1267, 127, 1050, 0.056),
`2010` = c(309, 34, 112, 196, 4.5, 3.4, 1340, 128, 1230, 0.057),
`2020` = c(331, 38, 126, 213, 5, 3.5, 1402, 126, 1380, 0.056)
)
data_wide
## # A tibble: 10 × 4
## country `2000` `2010` `2020`
## <chr> <dbl> <dbl> <dbl>
## 1 USA 282 309 331
## 2 Canada 31 34 38
## 3 Mexico 98 112 126
## 4 Brazil 174 196 213
## 5 Costa Rica 4 4.5 5
## 6 Uruguay 3.3 3.4 3.5
## 7 China 1267 1340 1402
## 8 Japan 127 128 126
## 9 India 1050 1230 1380
## 10 Greenland 0.056 0.057 0.056
Use pivot_longer() to convert from wide to long format:
data_long <- pivot_longer(data_wide,
cols = -country,
names_to = "year",
values_to = "population"
)
data_long
## # A tibble: 30 × 3
## country year population
## <chr> <chr> <dbl>
## 1 USA 2000 282
## 2 USA 2010 309
## 3 USA 2020 331
## 4 Canada 2000 31
## 5 Canada 2010 34
## 6 Canada 2020 38
## 7 Mexico 2000 98
## 8 Mexico 2010 112
## 9 Mexico 2020 126
## 10 Brazil 2000 174
## # ℹ 20 more rows
Note that our data is stored as a tibble rather than a plain data.frame, because we created it with tibble(). Tibbles are a modern take on data.frames, designed to be stricter and easier to inspect. They come from the tibble package, which is part of the tidyverse. Here are some practical differences between them:
Feature | data.frame | tibble (from tibble package) |
---|---|---|
Base or Tidyverse | Base R | Part of the tidyverse |
Printing | Prints entire dataset (can be large) | Prints a preview (10 rows, fitted columns) |
Column types | May convert types automatically (e.g. strings to factors before R 4.0) | No automatic type conversion |
Subsetting | df[, 1] may return a vector | tibble[, 1] always returns a tibble |
Row names | Always has row names | Doesn’t use row names |
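The subsetting row of the table can be checked directly. The sketch below builds the same two columns as a data.frame and as a tibble; the object names (df, tb, df_col, tb_col, df_keep) are illustrative and not part of the course data:

```r
library(tibble)

# The same two columns stored as a data.frame and as a tibble
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
tb <- tibble(id = 1:3, name = c("a", "b", "c"))

# Selecting a single column with [, 1]
df_col <- df[, 1] # plain integer vector: the data.frame structure is dropped
tb_col <- tb[, 1] # still a tibble with one column

class(df_col)
class(tb_col)

# In base R, drop = FALSE keeps the data.frame structure
df_keep <- df[, 1, drop = FALSE]
class(df_keep)
```

If you need a plain vector from a tibble, tb[[1]] or tb$id returns one.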
Most tidyverse functions, especially those in ggplot2, work best with long (tidy) data.
Wide format tables are usually easier to read if you intend to export them to a CSV or Excel file. You can convert long data back to wide format with pivot_wider():
data_wide_back <- pivot_wider(data_long,
names_from = year,
values_from = population
)
data_wide_back
## # A tibble: 10 × 4
## country `2000` `2010` `2020`
## <chr> <dbl> <dbl> <dbl>
## 1 USA 282 309 331
## 2 Canada 31 34 38
## 3 Mexico 98 112 126
## 4 Brazil 174 196 213
## 5 Costa Rica 4 4.5 5
## 6 Uruguay 3.3 3.4 3.5
## 7 China 1267 1340 1402
## 8 Japan 127 128 126
## 9 India 1050 1230 1380
## 10 Greenland 0.056 0.057 0.056
Introduction to pipe use
The pipe operator (%>%) is a powerful tool in R, especially when using the tidyverse packages. It allows you to chain together multiple functions in a clear and readable way. Instead of nesting functions within each other, you can use the pipe to pass the output of one function directly into the next, where it becomes the first argument. Here is an example:
data_wide_back <- data_long %>%
pivot_wider(
names_from = year,
values_from = population
)
data_wide_back
## # A tibble: 10 × 4
## country `2000` `2010` `2020`
## <chr> <dbl> <dbl> <dbl>
## 1 USA 282 309 331
## 2 Canada 31 34 38
## 3 Mexico 98 112 126
## 4 Brazil 174 196 213
## 5 Costa Rica 4 4.5 5
## 6 Uruguay 3.3 3.4 3.5
## 7 China 1267 1340 1402
## 8 Japan 127 128 126
## 9 India 1050 1230 1380
## 10 Greenland 0.056 0.057 0.056
The pipe operator pays off when you have a sequence of operations to perform on a dataset. Let’s explore some other tidyverse functions that can be used with the pipe operator.
data_wide_back <- data_long %>%
pivot_wider(
names_from = year,
values_from = population
) %>%
mutate(total_population = `2000` + `2010` + `2020`) %>% # mutate function will add a new column
arrange(desc(total_population)) # arrange function will sort the data by the new column, descending
data_wide_back
## # A tibble: 10 × 5
## country `2000` `2010` `2020` total_population
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 China 1267 1340 1402 4009
## 2 India 1050 1230 1380 3660
## 3 USA 282 309 331 922
## 4 Brazil 174 196 213 583
## 5 Japan 127 128 126 381
## 6 Mexico 98 112 126 336
## 7 Canada 31 34 38 103
## 8 Costa Rica 4 4.5 5 13.5
## 9 Uruguay 3.3 3.4 3.5 10.2
## 10 Greenland 0.056 0.057 0.056 0.169
The mutate function is used to create new variables or modify existing ones, while the arrange function is used to sort the data frame by one or more variables. Other useful functions include filter (to keep rows that meet a condition) and select (to keep specific columns).
data_wide_back <- data_long %>%
pivot_wider(
names_from = year,
values_from = population
) %>%
mutate(total_population = `2000` + `2010` + `2020`) %>%
arrange(desc(total_population)) %>%
filter(total_population > 1000) %>% # Filter countries with total population greater than 1000
select(country, total_population) # Select only the country and total population columns
data_wide_back
## # A tibble: 2 × 2
## country total_population
## <chr> <dbl>
## 1 China 4009
## 2 India 3660
Using the long format, we can also summarize the data with the summarise function. This is useful for calculating summary statistics such as the total, mean, or maximum population by year.
data_summary <- data_long %>%
group_by(year) %>%
summarise(
total_population = sum(population, na.rm = TRUE),
avg_population = mean(population, na.rm = TRUE),
max_population = max(population, na.rm = TRUE)
)
data_summary
## # A tibble: 3 × 4
## year total_population avg_population max_population
## <chr> <dbl> <dbl> <dbl>
## 1 2000 3036. 304. 1267
## 2 2010 3357. 336. 1340
## 3 2020 3625. 362. 1402
The pipe became popular with the dplyr package, which re-exports the %>% operator from magrittr. Since R 4.1.0, a native pipe is also available in base R. The base R pipe operator is |>, and it works similarly to %>%.
The main difference is that the base R pipe operator (|>) does not require the magrittr package, which is necessary for the %>% operator. Another difference is that the %>% operator allows you to use the dot placeholder (.) to specify where the input should go in the next function. This is particularly useful when the input is not the first argument of the function:
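The model summary that follows is consistent with a call of this shape. This is a sketch rather than the original chunk: it assumes the wide table from the Long and Wide Format section, rebuilt here so the block runs on its own, and the name model_summary is illustrative.

```r
library(tidyverse)

# Same values as data_wide above (population in millions)
data_wide <- tibble(
  country = c(
    "USA", "Canada", "Mexico", "Brazil", "Costa Rica",
    "Uruguay", "China", "Japan", "India", "Greenland"
  ),
  `2000` = c(282, 31, 98, 174, 4, 3.3, 1267, 127, 1050, 0.056),
  `2010` = c(309, 34, 112, 196, 4.5, 3.4, 1340, 128, 1230, 0.057),
  `2020` = c(331, 38, 126, 213, 5, 3.5, 1402, 126, 1380, 0.056)
)

# The dot marks where the piped data goes: here the data argument,
# because the first argument of lm() is the formula
model_summary <- data_wide %>%
  lm(`2020` ~ `2000`, data = .) %>%
  summary()
model_summary
```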
##
## Call:
## lm(formula = `2020` ~ `2000`, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.426 -4.807 -1.463 3.357 130.481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5810 23.1470 0.068 0.947
## `2000` 1.1885 0.0434 27.384 3.41e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.18 on 8 degrees of freedom
## Multiple R-squared: 0.9894, Adjusted R-squared: 0.9881
## F-statistic: 749.9 on 1 and 8 DF, p-value: 3.409e-09
The lm function is used to fit a linear model, and the summary function provides a summary of the fitted model. Note that the dot placeholder (.) indicates that the input data from the previous step should be used as the data argument in the lm function.
The base R pipe operator does not support the dot placeholder (.). Instead, you can wrap an anonymous function in parentheses to specify where the input should go (R 4.2.0 and later also accept _ as a placeholder, but only for a named argument):
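A sketch of the anonymous-function form with the native pipe, assuming R 4.1.0 or later (needed for both |> and the \(df) shorthand). The data values repeat data_wide from earlier so the block runs on its own; df and model_summary are illustrative names.

```r
library(tibble)

# Same values as data_wide above (population in millions)
data_wide <- tibble(
  country = c(
    "USA", "Canada", "Mexico", "Brazil", "Costa Rica",
    "Uruguay", "China", "Japan", "India", "Greenland"
  ),
  `2000` = c(282, 31, 98, 174, 4, 3.3, 1267, 127, 1050, 0.056),
  `2010` = c(309, 34, 112, 196, 4.5, 3.4, 1340, 128, 1230, 0.057),
  `2020` = c(331, 38, 126, 213, 5, 3.5, 1402, 126, 1380, 0.056)
)

# The anonymous function names its input df and passes it as data;
# the trailing () calls the function on the piped value
model_summary <- data_wide |>
  (\(df) lm(`2020` ~ `2000`, data = df))() |>
  summary()
model_summary
```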
##
## Call:
## lm(formula = `2020` ~ `2000`, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.426 -4.807 -1.463 3.357 130.481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5810 23.1470 0.068 0.947
## `2000` 1.1885 0.0434 27.384 3.41e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.18 on 8 degrees of freedom
## Multiple R-squared: 0.9894, Adjusted R-squared: 0.9881
## F-statistic: 749.9 on 1 and 8 DF, p-value: 3.409e-09
Introduction to ggplot2
ggplot2 is a powerful and flexible package for creating visualizations in R. It is part of the tidyverse collection of packages and is widely used for data visualization. The main idea behind ggplot2 is to create graphics based on the Grammar of Graphics, a theoretical framework for data visualization that breaks down a graphic into a set of independent, structured components. In ggplot2 you will find the following main components:
- Data: The dataset you’re using.
- Aesthetics (aes): The visual properties (like position, color, size) that map to the data.
- Geometries (geom): The types of visual elements in a plot, like points, lines, bars, etc.
- Statistics (stat): Statistical transformations, like smoothing or binning, applied to data.
- Scales (scale): Adjustments for mapping data to aesthetics (e.g., color scales).
- Coordinates (coord): The coordinate system, such as Cartesian or polar.
- Facets (facet): Splitting the data into subsets to create multiple panels (like creating small multiples).
Here’s a breakdown of the key elements in the Grammar of Graphics:
Data
The data is the foundation of any plot. It contains the variables you want to visualize. In ggplot2, you specify the data using the data argument:
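A minimal sketch of passing a data frame through the data argument. To keep the block runnable on its own it uses a three-country excerpt of the long table; pop_excerpt and p are illustrative names, not objects from the course.

```r
library(ggplot2)

# Three-country excerpt of the long-format population data (millions)
pop_excerpt <- data.frame(
  country = rep(c("USA", "Brazil", "Japan"), each = 3),
  year = rep(c("2000", "2010", "2020"), times = 3),
  population = c(282, 309, 331, 174, 196, 213, 127, 128, 126)
)

# With only data supplied, ggplot() draws an empty panel:
# no aesthetics or geometries have been defined yet
p <- ggplot(data = pop_excerpt)
p
```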
Aesthetics (aes)
Aesthetics define how the data maps to visual properties of the plot, like:
- x and y position (x, y)
- color (color)
- size (size)
- shape (shape)
- fill (fill)
For example, in aes(x = year, y = population):
- x = year (horizontal axis)
- y = population (vertical axis)
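The mapping described above can be sketched as follows, on a three-country excerpt of the long table (pop_excerpt and p are illustrative names). Because no geometry has been added yet, the plot shows the two axes but no data.

```r
library(ggplot2)

# Three-country excerpt of the long-format population data (millions)
pop_excerpt <- data.frame(
  country = rep(c("USA", "Brazil", "Japan"), each = 3),
  year = rep(c("2000", "2010", "2020"), times = 3),
  population = c(282, 309, 331, 174, 196, 213, 127, 128, 126)
)

# year goes to the horizontal axis, population to the vertical axis
p <- ggplot(data = pop_excerpt, aes(x = year, y = population))
p
```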
Geometries (geom)
Geometries are the visual elements of a plot. Different types of geometries allow you to create different kinds of plots:
- geom_point() for scatter plots
- geom_line() for line plots
- geom_bar() for bar charts
- geom_histogram() for histograms
- geom_boxplot() for boxplots
For example:
ggplot(data = data_long, aes(x = year, y = population, group = country)) +
geom_line() + geom_point()
# add color by country
ggplot(data = data_long, aes(x = year, y = population, group = country, color = country)) +
geom_line()
The first plot draws one line per country with a point at each observation. The second plot colors each line by country.
Statistics (stat)
Some plots require statistical transformations, such as smoothing or binning. You can apply these transformations using the stat_* functions.
We can apply statistical summaries. For example, plot the mean population across countries for each year:
ggplot(data_long, aes(x = year, y = population)) +
stat_summary(fun = mean, geom = "line", group = 1) +
labs(title = "Average Population Trend")
Scales (scale)
Modify how data maps to visual aesthetics (e.g., color):
ggplot(data_long, aes(x = year, y = population, group = country, color = country)) +
geom_line(linewidth = 1.2) + # linewidth replaces size for lines since ggplot2 3.4.0
scale_color_manual(values = c("China" = "red", "India" = "orange", "USA" = "blue")) +
labs(title = "Population by Country (Selected Colors)")
Here, scale_color_manual() customizes the color mapping for country. Countries not named in values fall back to the scale's na.value (gray by default).
You can also use ready-made palettes from the viridis package, which are color-blind friendly:
## Loading required package: viridisLite
ggplot(data_long, aes(x = year, y = population, group = country, color = country)) +
geom_line(linewidth = 1.2) +
scale_color_viridis_d() +
labs(title = "Population by Country (Viridis Colors)")
This uses the viridis color palette, which is designed to be perceptually uniform and color-blind friendly.
Coordinates (coord)
The coordinate system defines the layout of the plot. The most common system is Cartesian (x and y axes), but you can also use polar coordinates, or transform the axes.
For example:
ggplot(data_long %>% filter(year == "2020"), aes(x = country, y = population)) +
geom_col() +
coord_flip() +
labs(title = "Population by Country in 2020 (Flipped)")
This flips the axes, so that the x-axis becomes the y-axis and vice versa.
Facets (facet)
Faceting allows you to split a plot into multiple panels, which is useful for comparing subsets of the data. Facets can be done by rows or columns.
ggplot(data_long, aes(x = year, y = population)) +
geom_line(group = 1) +
facet_wrap(~ country) +
labs(title = "Population Trend by Country (Faceted)")
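To illustrate faceting by rows or by columns, here is a sketch with facet_grid(), which takes the row and column variables explicitly. It runs on a three-country excerpt of the long table; pop_excerpt, base_plot, p_rows, and p_cols are illustrative names.

```r
library(ggplot2)

# Three-country excerpt of the long-format population data (millions)
pop_excerpt <- data.frame(
  country = rep(c("USA", "Brazil", "Japan"), each = 3),
  year = rep(c("2000", "2010", "2020"), times = 3),
  population = c(282, 309, 331, 174, 196, 213, 127, 128, 126)
)

base_plot <- ggplot(pop_excerpt, aes(x = year, y = population, group = country)) +
  geom_line()

# One panel per country, stacked as rows
p_rows <- base_plot + facet_grid(rows = vars(country))

# One panel per country, placed side by side as columns
p_cols <- base_plot + facet_grid(cols = vars(country))

p_rows
p_cols
```

facet_wrap() is usually the better fit when one variable has many levels, since it wraps the panels into a grid instead of a single row or column.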
Labels and Themes
You can customize the plot with labels and themes. Labels include titles, axis labels, and legends. Themes control the overall appearance of the plot.
ggplot(data_long, aes(x = year, y = population, group = country, color = country)) +
geom_line(linewidth = 1.2) +
labs(
title = "Population Over Time by Country",
subtitle = "Based on simulated data (millions)",
x = "Year",
y = "Population (Millions)",
caption = "Data source: Simulated"
) +
theme_minimal()
You can customize fonts and style:
ggplot(data_long, aes(x = year, y = population, color = country, group = country)) +
geom_line(linewidth = 1.2) +
labs(
title = "Population Over Time by Country",
subtitle = "Based on simulated data (millions)",
x = "Year",
y = "Population (Millions)",
caption = "Data source: Simulated"
) +
theme_minimal() + theme(
plot.title = element_text(size = 18, face = "bold"),
axis.title = element_text(size = 14),
legend.title = element_text(size = 12)
)
Creating map plots with ggplot2
# Load necessary datasets
dados <- read.csv("https://breeding-insight.github.io/learn-hub/r-intro/data/dados_2025.csv")
colnames(dados) <- c("Date", "Affiliation", "Longitude", "Latitude", "Background", "Present_Occupation", "Explain", "KnowledgeR")
# Create quantitative measure
dados$KnowledgeR_num <- NA
dados$KnowledgeR_num[dados$KnowledgeR == "Advanced"] <- 3
dados$KnowledgeR_num[dados$KnowledgeR == "Intermediate"] <- 2
dados$KnowledgeR_num[dados$KnowledgeR == "Beginner (some knowledge)"] <- 1
dados$KnowledgeR_num[dados$KnowledgeR == "No R knowledge"] <- 0
dados$KnowledgeR_num[dados$KnowledgeR == ""] <- NA
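# Load the world map data (map_data() needs the maps package installed)
world_map <- map_data("world")
# Coordinates are read as text; convert them to numeric (invalid entries become NA)
dados$Longitude <- as.numeric(dados$Longitude)
dados$Latitude <- as.numeric(dados$Latitude)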
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
# Create the plot
ggplot() +
geom_polygon(data = world_map, aes(x = long, y = lat, group = group), fill = "lightgray", color = "white") +
# Plot the student locations
geom_point(data = dados, aes(x = Longitude, y = Latitude), alpha = 0.7) +
labs(title = "R course students", x = "Longitude", y = "Latitude") +
theme_minimal() +
theme(legend.position = "bottom")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Color by Present Occupation
ggplot() +
geom_polygon(data = world_map, aes(x = long, y = lat, group = group), fill = "lightgray", color = "white") +
# Plot the student locations, colored by present occupation
geom_point(data = dados, aes(x = Longitude, y = Latitude, color = Present_Occupation), alpha = 0.7) +
labs(title = "R course students", x = "Longitude", y = "Latitude") +
theme_minimal() +
theme(legend.position = "bottom")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Use a color-blind-friendly palette
ggplot() +
geom_polygon(data = world_map, aes(x = long, y = lat, group = group), fill = "lightgray", color = "white") +
# Plot the student locations, colored by present occupation
geom_point(data = dados, aes(x = Longitude, y = Latitude, color = Present_Occupation), alpha = 0.7) +
scale_colour_viridis_d() +
labs(title = "R course students", x = "Longitude", y = "Latitude") +
theme_minimal() +
theme(legend.position = "bottom")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).