Introduction to R Programming




This material is an update and translation of materials developed by students from the Department of Genetics at ESALQ/USP - Brazil. Access the Portuguese content taught in other events at this website.

We suggest that, before starting the practice described here, you follow this tutorial for installing R and RStudio.

Getting Familiar with the RStudio Interface

When you open RStudio, you will see an interface divided into four windows, each with a main function:

  • Code editing
  • Workspace and history
  • Console
  • Files, plots, packages, and help

Explore each of the windows. There are numerous functionalities for each of them, and we will cover some of them throughout the course.

A First Script

The code editing window (probably located in the top left corner) will be used to write your code. Open a new script by clicking the + in the top left corner and selecting R script.

Let’s begin our work with the traditional Hello World. Type in your script:

cat("Hello world")
## Hello world

Now, select the line and press the Run button or use Ctrl + Enter.

When you do this, your code will be processed in the Console window, where the written code will appear in blue (if you have R’s default colors), followed by the desired result. The only case in which a line is not executed is when there is a # symbol in front of it. Now, try putting # in front of the written code. Again, select the line and press Run.

# cat("Hello world")

The # symbol is used for comments in the code. This is a great organizational practice and helps to remember, later, what you were thinking when you wrote the code. It’s also essential for others to understand it. As in the example:

# Starting work in R
cat("Hello world")
## Hello world

Important: whenever you want to make any changes, edit your script rather than typing directly in the console, because commands typed in the console are not saved with your script!

To save your script, you can use the Files tab located (by default) in the bottom right corner. Look for a location of your preference and create a new folder named CourseR.

Tip:

  • Avoid putting spaces and punctuation in folder and file names, as this can make access via the command line in R difficult. For example, instead of Course R, we opt for CourseR.

Then, just click on the floppy disk icon located in the RStudio header or use Ctrl + S and select the created CourseR directory. R scripts are saved with the .R extension.

Setting the Working Directory

Another good practice in R is to keep the script in the same directory where your raw data (input files for the script) and processed data (graphs, tables, etc.) are located. For this, we’ll have R identify the same directory where you saved the script as the working directory. This way, it will understand that this is where the data will be obtained from and where the results will also go.

You can do this using RStudio’s facilities: simply locate the CourseR directory through the Files tab, click on More and then “Set as Working Directory”. Notice that something like this will appear in the console:

setwd("~/Documents/CourseR")

In other words, you can run this same command yourself to set the working directory. If you get lost, or want to confirm that the working directory has been changed, use:

getwd()
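
As a small sketch of how these two functions fit together, you can store the current directory, switch to another one, and then switch back. The folder used here is the temporary one that R creates for each session (returned by tempdir()); it is only a safe stand-in for your own CourseR folder.

```r
# Remember where we are now
old_dir <- getwd()

# tempdir() is a folder R creates for the session; used here only as an example
setwd(tempdir())
getwd() # now points to the temporary folder

# Go back to the directory we started from
setwd(old_dir)
getwd()
```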

Making Life Easier with Tab

Now, imagine you have a directory like ~/Documents/masters/semester1/course_such/class_such/data_28174/analysis_276182/results_161/. It’s not easy to remember this entire path to write in a setwd() command.

In addition to the convenience of the RStudio window, you can also use the Tab key to complete the path for you. Try it by searching for a folder on your computer. Just start typing the path and press Tab, and it will complete the name for you! If more than one file or folder starts with the same characters, press Tab twice and it will show you all the options.

The Tab key works not only for indicating paths but also for commands and object names. It’s very common to make typing errors in code. Using Tab will significantly reduce these errors.

Tab can be even more powerful if you have access to the GitHub Copilot tool. With it, you can use Tab to complete the code you’re writing. It’s an artificial intelligence-based tool that suggests code as you write. It’s a paid tool, but you can use it for free for 60 days.

Basic Operations

Let’s get to the language!

R can function as a simple calculator, using the same syntax as other programs (like Excel):

1 + 1.3 # Decimal defined with "."
2 * 3
2^3
4 / 2

sqrt(4) # square root
log(100, base = 10) # Logarithm base 10
log(100) # Natural logarithm

Now, use the basic operations to solve the expression below. Remember to use parentheses () to establish priorities in operations.

\(\left(\frac{13+2+1.5}{3}\right) + \log_{4} 96\)

Expected result:

## [1] 8.792481

Notice that if you position the parentheses incorrectly, the code won’t produce any error message. This is what we call a logical error or silent error: the code runs but doesn’t do what you want it to do. This is the most dangerous and difficult type of error to fix. See an example:

13 + 2 + 1.5 / 3 + log(96, base = 4)
## [1] 18.79248

Other problems do produce a message, which can be either a warning or an error. A common cause of errors is a syntax error, where the code is not written according to the rules of the language. In these cases, R will return a message to help you correct them. Warnings don’t stop the code from running but draw attention to something; errors, however, must be corrected for the code to run.

Example of an error:

((13+2+1,5)/3) + log(96, base = 4)
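
To see the difference between a warning and an error in practice, here is a minimal sketch. The calls to suppressWarnings() and tryCatch() are only our way of keeping the example running from start to finish; in everyday use you would simply read the message R prints.

```r
# A warning: R still returns a result (here NaN) and just flags the problem
result_warning <- suppressWarnings(sqrt(-1))
result_warning

# An error: the command stops and no result is produced.
# tryCatch() captures the error message so the script can continue.
result_error <- tryCatch(
  log("ten"),
  error = function(e) conditionMessage(e)
)
result_error
```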

You might also forget to close a parenthesis, quotation mark, bracket, or brace; in these cases, R will wait for the command to close the code block, indicating with a +:

((13+2+1.5)/3 + log(96, base = 4)

If this happens, go to the console and press ESC, which will end the block so you can correct it.

The commands log and sqrt are two of many basic functions that R has. Functions are organized sets of instructions to perform a task. For all of them, R has a description to help in their use. To access this help, use:

?log

And the function description will open in RStudio’s Help window.

If R’s own description isn’t enough for you to understand how the function works, search on Google (preferably in English). There are many websites and forums with educational information about R functions.

Vector Operations

Vectors are the simplest structures worked with in R. We build a vector with a numeric sequence using:

c(1, 3, 2, 5, 2)
## [1] 1 3 2 5 2

IMPORTANT NOTE: c is the R function (Combine Values into a Vector or List) that we use to build a vector!

We use the : symbol to create sequences of integer numbers, like:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

We can use other functions to generate sequences, such as:

seq(from = 0, to = 100, by = 5)
##  [1]   0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90
## [20]  95 100
# or
seq(0, 100, 5) # If you already know the order of the function arguments
##  [1]   0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90
## [20]  95 100

  • Create a sequence using the seq function that varies from 4 to 30, with intervals of 3.

## [1]  4  7 10 13 16 19 22 25 28

The rep function generates sequences with repeated numbers:

rep(3:5, 2)
## [1] 3 4 5 3 4 5
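
The rep function has a few more arguments that change how the repetition is done. A short sketch of the times and each arguments (the object names are ours, used only to hold the results):

```r
# Repeat the whole vector twice (same as rep(3:5, 2))
whole_twice <- rep(3:5, times = 2)
whole_twice # 3 4 5 3 4 5

# Repeat each element twice before moving to the next
each_twice <- rep(3:5, each = 2)
each_twice # 3 3 4 4 5 5

# Give a different number of repetitions for each element
custom <- rep(c("a", "b"), times = c(2, 1))
custom # "a" "a" "b"
```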

We can perform operations using these vectors:

c(1, 4, 3, 2) * 2
c(4, 2, 1, 5) + c(5, 2, 6, 1)
c(4, 2, 1, 5) * c(5, 2, 6, 1)

Notice that it’s already getting tiresome to type the same numbers repeatedly. Let’s solve this by creating objects to store our vectors and much more.

Creating Objects

The storage of information in objects and their possible manipulation makes R an object-oriented language. To create an object, simply assign values to variables, as follows:

x <- c(30.1, 30.4, 40, 30.2, 30.6, 40.1)
# or
x = c(30.1, 30.4, 40, 30.2, 30.6, 40.1)

y <- c(0.26, 0.3, 0.36, 0.24, 0.27, 0.35)

Old-school users tend to use the <- sign, but it has the same function as =. Some prefer to use <- for object assignment and = only for defining arguments within functions. Organize yourself in the way you prefer.
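
One place where the two signs behave differently is inside a function call. As a minimal sketch (the names total, values and steps are ours, chosen only for illustration): <- inside a call creates an object, while = inside a call only names an argument.

```r
# "<-" inside a call creates the object `values` and passes it to sum()
total <- sum(values <- c(1, 2, 3))

# "=" inside a call only says which argument receives the value;
# no objects called `from` or `to` are created
steps <- seq(from = 1, to = 3)

"values" %in% ls() # TRUE: the object exists
"from" %in% ls() # FALSE, unless you created an object named `from` yourself
```

This is one reason some people reserve = for arguments and <- for assignment.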

To access the values within the object, simply:

x
## [1] 30.1 30.4 40.0 30.2 30.6 40.1

The language is case-sensitive. Therefore, x is different from X:

X

The object X was not created.

The naming of objects is a personal choice, but it’s suggested to maintain a pattern for better organization. Here are some tips:

  • Use descriptive names
  • Avoid starting with numbers
  • Don’t use spaces (use _ or camelCase)
  • Don’t use special characters
  • Maintain consistency in the chosen pattern
  • Avoid very long names
  • Don’t use accents or non-ASCII characters

Some names cannot be used because they establish fixed roles in R:

  • TRUE - True, logical value
  • FALSE - False, logical value
  • if, else - Reserved words for conditional structures
  • for, while, repeat, break, next, in - Reserved words for loop structures
  • function - Reserved word for function definition
  • NA, NaN, NULL, Inf - Reserved words for special values
  • NA_integer_, NA_real_, NA_character_, NA_complex_ - Special values to represent missing data of a specific type
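
To check for yourself what happens when a reserved word is used as an object name, here is a small sketch. The function try_code is a helper we define only for this demonstration (it is not part of R); it runs a piece of code in a throwaway environment, so nothing in your workspace is changed.

```r
# Hypothetical helper for this demonstration only: runs code given as text
# in a separate, temporary environment and reports whether R accepted it
try_code <- function(code) {
  tryCatch(
    {
      eval(parse(text = code), envir = new.env())
      "accepted"
    },
    error = function(e) "rejected"
  )
}

try_code("TRUE <- 1") # "rejected": TRUE is reserved
try_code("if <- 1") # "rejected": if is reserved
try_code("T <- 0") # "accepted": T is only a shortcut for TRUE, not reserved
try_code("height1 <- 1") # "accepted": an ordinary name
```

Since T and F can be overwritten, writing TRUE and FALSE in full is the safer habit.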

We can then perform operations with the created object:

x + 2
## [1] 32.1 32.4 42.0 32.2 32.6 42.1
x * 2
## [1] 60.2 60.8 80.0 60.4 61.2 80.2

To perform the operation, R aligns the two vectors and performs the operation element by element. Observe:

x + y
## [1] 30.36 30.70 40.36 30.44 30.87 40.45
x * y
## [1]  7.826  9.120 14.400  7.248  8.262 14.035

If the vectors have different sizes, it will repeat the smaller one to perform the element-by-element operation with all elements of the larger one.

x * 2
x * c(1, 2)

If the length of the larger vector is not a multiple of the length of the smaller one, we’ll get a warning:

x * c(1, 2, 3, 4)
## Warning in x * c(1, 2, 3, 4): longer object length is not a multiple of shorter
## object length
## [1]  30.1  60.8 120.0 120.8  30.6  80.2

Notice that the warning doesn’t compromise the code’s functionality; it just gives a hint that something might not be as you’d like.
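
To make the repetition of the smaller vector explicit, the sketch below builds the repeated version by hand with rep and compares it with what R does automatically. The vector v is made up for the example.

```r
v <- c(10, 20, 30, 40, 50, 60)

# R repeats c(1, 2) three times behind the scenes
recycled <- v * c(1, 2)
recycled # 10 40 30 80 50 120

# The same thing written out explicitly
explicit <- v * rep(c(1, 2), length.out = length(v))

identical(recycled, explicit) # TRUE
```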

We can also store the operation in another object:

z <- (x + y) / 2
z

We can also apply some functions, for example:

sum(z) # sum of z values
## [1] 101.59
mean(z) # mean
## [1] 16.93167
var(z) # variance
## [1] 6.427507

Indexing

We access only the 3rd value of the created vector with []:

z[3]

We can also access the numbers from position 2 to 4 with:

z[2:4]
## [1] 15.35 20.18 15.22
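
The brackets accept more than a single position or a range. A brief sketch with a made-up vector v, showing selection by several positions, exclusion with a negative index, and selection by a condition:

```r
v <- c(15.18, 15.35, 20.18, 15.22)

picked <- v[c(1, 3)] # positions 1 and 3
picked # 15.18 20.18

dropped <- v[-1] # everything except position 1
dropped # 15.35 20.18 15.22

large <- v[v > 16] # only the values greater than 16
large # 20.18
```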

To get information about the created vector, use:

str(z)
##  num [1:6] 15.2 15.3 20.2 15.2 15.4 ...

The str function tells us about the structure of the vector, which is a numeric vector with 6 elements.

Vectors can also receive other categories such as characters:

clone <- c("GRA02", "URO01", "URO03", "GRA02", "GRA01", "URO01")

Another class is factors, which can be a bit complex to handle.

Generally, factors are values categorized by levels. For example, if we transform our character vector clone into a factor, levels will be assigned to each word:

clone_factor <- as.factor(clone)
str(clone_factor)
##  Factor w/ 4 levels "GRA01","GRA02",..: 2 3 4 2 1 3
levels(clone_factor)
## [1] "GRA01" "GRA02" "URO01" "URO03"

This way, we’ll have only 4 levels for a vector with 6 elements, since the words “GRA02” and “URO01” are repeated. We can get the number of elements in the vector or its length with:

length(clone_factor)
## [1] 6
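
Because factors can be tricky, it may help to see how the levels relate to the values. In this sketch (the vector dose is hypothetical), we set the order of the levels ourselves instead of accepting the alphabetical order that as.factor would use:

```r
dose <- c("low", "high", "medium", "low")

# Define the levels in a meaningful order
dose_factor <- factor(dose, levels = c("low", "medium", "high"))

levels(dose_factor) # "low" "medium" "high"

# Internally each value is stored as the position of its level
as.integer(dose_factor) # 1 3 2 1

# Count how many times each level appears
table(dose_factor)
```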

There are also logical vectors, which receive true or false values:

logical <- x > 40
logical # Are the elements greater than 40?
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE

With it we can, for example, identify which positions have elements greater than 40:

which(logical) # Getting the positions of TRUE elements
## [1] 6
x[which(logical)] # Getting numbers greater than 40 from vector x by position
## [1] 40.1
# or
x[which(x > 40)]
## [1] 40.1

We can also locate specific elements with:

clone %in% c("URO03", "GRA02")
## [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE

The functions any and all can also be useful. Research about them.
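
As a starting point for that research, here is a minimal sketch of what any and all return. The vector heights is our own example, with the same values used for x above.

```r
heights <- c(30.1, 30.4, 40, 30.2, 30.6, 40.1)

# Is at least one element greater than 40?
any_above_40 <- any(heights > 40)
any_above_40 # TRUE

# Are all elements greater than 30?
all_above_30 <- all(heights > 30)
all_above_30 # TRUE

# Are all elements greater than 40?
all_above_40 <- all(heights > 40)
all_above_40 # FALSE
```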

Find more about other logical operators, like the > used, at this link.

Warning 1

Create a numeric sequence containing 10 integer values, and save it in an object called “a”.

(a <- 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10

Create another sequence, using decimal numbers and any mathematical operation, so that its values are identical to object “a”.

b <- seq(from = 0.1, to = 1, 0.1)
(b <- b * 10)
##  [1]  1  2  3  4  5  6  7  8  9 10

The two vectors look equal, don’t they?

Then, using a logical operator, let’s verify if object “b” is equal to object “a”.

a == b
##  [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

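Some values are not equal. How is this possible? Most decimal numbers, such as 0.1, cannot be stored exactly in the computer’s binary representation, so the values in “b” carry tiny rounding differences that are hidden when printed. Rounding removes them: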

a == round(b)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
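
When comparing decimal numbers, a common approach is to test for near-equality instead of exact equality. A minimal sketch using base R’s all.equal, which compares values within a small tolerance (wrapping it in isTRUE gives a plain TRUE or FALSE):

```r
# The classic example: exact comparison fails
exact <- 0.1 + 0.2 == 0.3
exact # FALSE

# Comparison within a tolerance succeeds
near <- isTRUE(all.equal(0.1 + 0.2, 0.3))
near # TRUE

# The same idea applied to the two sequences from this section
a <- 1:10
b <- seq(from = 0.1, to = 1, by = 0.1) * 10
same_sequences <- isTRUE(all.equal(a, b))
same_sequences # TRUE
```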

Warning 2

It’s not possible to mix different classes within the same vector. When trying to do this, notice that R will try to equalize to a single class:

wrong <- c(TRUE, "oops", 1)
wrong
## [1] "TRUE" "oops" "1"

In this case, all elements were transformed into characters.
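
The conversion follows a fixed order: logical values become numbers, and numbers become characters, whenever a more general class is present. A short sketch (object names are ours):

```r
# Logical mixed with numeric: TRUE becomes 1
mixed_num <- c(TRUE, 2)
mixed_num # 1 2
class(mixed_num) # "numeric"

# Numeric mixed with character: everything becomes text
mixed_chr <- c(1, "a")
mixed_chr # "1" "a"
class(mixed_chr) # "character"

# This is why logical vectors can be summed: each TRUE counts as 1
n_true <- sum(c(TRUE, FALSE, TRUE))
n_true # 2

# Text that looks like a number can be converted back explicitly
as.numeric("3.5") # 3.5
```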

Some Tips:

  • Be careful with operation priority; when in doubt, always add parentheses according to your priority interest.
  • Remember that if you forget to close any ( or [ or ", R’s console will wait for you to close it, indicating with a +. Nothing will be processed until you directly type a ) in the console or press ESC.
  • Be careful not to overwrite already created objects by creating others with the same name. Use, for example: height1, height2.
  • Keep in your .R script only the commands that worked and, preferably, add comments. You can, for example, comment on difficulties encountered, so you don’t make the same mistakes later.

If you’re ahead of your colleagues, you can already do the exercises from Session 1; if not, do them at another time and send us your questions.

Matrices

Matrices are another class of objects widely used in R. With them we can perform large-scale operations in an automated way.

Since they are used in operations, we typically store numeric elements in them. To create a matrix, we determine a sequence of numbers and indicate the number of rows and columns in the matrix:

X <- matrix(1:12, nrow = 6, ncol = 2)
X
##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8
## [3,]    3    9
## [4,]    4   10
## [5,]    5   11
## [6,]    6   12

We can also use sequences already stored in vectors to generate a matrix, as long as they are numeric:

W <- matrix(c(x, y), nrow = 6, ncol = 2)
W
##      [,1] [,2]
## [1,] 30.1 0.26
## [2,] 30.4 0.30
## [3,] 40.0 0.36
## [4,] 30.2 0.24
## [5,] 30.6 0.27
## [6,] 40.1 0.35

With them we can perform matrix operations:

X * 2
##      [,1] [,2]
## [1,]    2   14
## [2,]    4   16
## [3,]    6   18
## [4,]    8   20
## [5,]   10   22
## [6,]   12   24
X * X
##      [,1] [,2]
## [1,]    1   49
## [2,]    4   64
## [3,]    9   81
## [4,]   16  100
## [5,]   25  121
## [6,]   36  144
X %*% t(X) # Matrix multiplication
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   50   58   66   74   82   90
## [2,]   58   68   78   88   98  108
## [3,]   66   78   90  102  114  126
## [4,]   74   88  102  116  130  144
## [5,]   82   98  114  130  146  162
## [6,]   90  108  126  144  162  180

Using these operations requires knowledge of matrix algebra. If you want to delve deeper into this, the book Linear Models in Statistics, Rencher (2008) has a good review about it. You can also explore R syntax for these operations at this link.

We access the numbers inside the matrix by giving the coordinates [row,column], as in the example:

W[4, 2] # Number positioned in row 4 and column 2
## [1] 0.24
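
Leaving one of the coordinates empty returns a whole row or a whole column. A minimal sketch with a small made-up matrix m, which also shows the byrow argument for filling a matrix row by row instead of column by column:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

second_row <- m[2, ] # all columns of row 2
second_row # 2 4 6

third_col <- m[, 3] # all rows of column 3
third_col # 5 6

dim(m) # 2 3 (rows, columns)

# By default values fill the matrix column by column; byrow = TRUE changes that
m_by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_by_row[2, ] # 4 5 6
```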

Sometimes it can be informative to give names to the columns and rows of the matrix. We do this with:

colnames(W) <- c("height", "diameter")
rownames(W) <- clone
W
##       height diameter
## GRA02   30.1     0.26
## URO01   30.4     0.30
## URO03   40.0     0.36
## GRA02   30.2     0.24
## GRA01   30.6     0.27
## URO01   40.1     0.35

These functions colnames and rownames also work with data.frames.

Data.frames

Unlike matrices, data.frames are not meant for matrix operations, but they allow the combination of vectors with different classes. Data frames are similar to tables generated in other programs, like Excel.

Data frames are combinations of vectors of the same length. All the vectors we’ve created so far have length 6; verify this.

We can thus combine them into columns of a single data.frame:

field1 <- data.frame(
  "clone" = clone, # Before the "=" sign
  "height" = x, # we establish the names
  "diameter" = y, # of the columns
  "age" = rep(3:5, 2),
  "cut" = logical
)
field1
##   clone height diameter age   cut
## 1 GRA02   30.1     0.26   3 FALSE
## 2 URO01   30.4     0.30   4 FALSE
## 3 URO03   40.0     0.36   5 FALSE
## 4 GRA02   30.2     0.24   3 FALSE
## 5 GRA01   30.6     0.27   4 FALSE
## 6 URO01   40.1     0.35   5  TRUE

We can access each of the columns with:

field1$age
## [1] 3 4 5 3 4 5

Or also with:

field1[, 4]
## [1] 3 4 5 3 4 5

Here, the number inside the brackets refers to the column, being the second element (separated by comma). The first element refers to the row. Since we left the first element empty, we’re referring to all rows for that column.

This way, if we want to obtain specific content, we can give the coordinates with [row,column]:

field1[1, 2]
## [1] 30.1

  • Get the diameter of clone “URO03”.

## [1] 0.36
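
The coordinates inside [row, column] can also be conditions or column names. A minimal sketch with a made-up data frame (trees and its columns are hypothetical, not part of the course data):

```r
trees <- data.frame(
  id = c("T1", "T2", "T3"),
  height = c(12.5, 15.0, 9.8),
  age = c(4, 6, 3)
)

# Rows where height is greater than 10, all columns
tall <- trees[trees$height > 10, ]
tall

# A single value: the height of the tree whose id is "T3"
t3_height <- trees[trees$id == "T3", "height"]
t3_height # 9.8

# The ids of the trees older than 3
old_ids <- trees$id[trees$age > 3]
old_ids
```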

Even though it’s a data frame, we can perform operations with the numeric vectors it contains.

  • With the diameter and height of the trees, calculate the volume according to the following formula and store it in a volume object:

\(3.14*(diameter/2)^2*height\)

## [1] 1.597287 2.147760 4.069440 1.365523 1.751131 3.856116

Now, let’s add the calculated vector with the volume to our data frame. For this, use the cbind function.

field1 <- cbind(field1, volume)
field1
##   clone height diameter age   cut   volume
## 1 GRA02   30.1     0.26   3 FALSE 1.597287
## 2 URO01   30.4     0.30   4 FALSE 2.147760
## 3 URO03   40.0     0.36   5 FALSE 4.069440
## 4 GRA02   30.2     0.24   3 FALSE 1.365523
## 5 GRA01   30.6     0.27   4 FALSE 1.751131
## 6 URO01   40.1     0.35   5  TRUE 3.856116
str(field1)
## 'data.frame':    6 obs. of  6 variables:
##  $ clone   : chr  "GRA02" "URO01" "URO03" "GRA02" ...
##  $ height  : num  30.1 30.4 40 30.2 30.6 40.1
##  $ diameter: num  0.26 0.3 0.36 0.24 0.27 0.35
##  $ age     : int  3 4 5 3 4 5
##  $ cut     : logi  FALSE FALSE FALSE FALSE FALSE TRUE
##  $ volume  : num  1.6 2.15 4.07 1.37 1.75 ...
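
cbind is not the only way to add a column. As an alternative sketch (the data frame plots is made up for illustration), assigning to a new column name with $ creates the column directly:

```r
plots <- data.frame(width = c(2, 3, 4), length = c(5, 5, 10))

# Create a new column from two existing ones
plots$area <- plots$width * plots$length
plots$area # 10 15 40

names(plots) # "width" "length" "area"
```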

Some tips:

  • Remember that, to build matrices and data frames, the columns must have the same number of elements.

  • If you don’t know which operator or function should be used, search on Google or ask ChatGPT or a similar tool. For example, if you’re unsure how to calculate the standard deviation, search for “standard deviation R”. The R community is very active and most of your questions about it have already been answered somewhere on the web.

  • Don’t forget that everything you do in R needs to be explicitly indicated, like a multiplication 4ac with 4*a*c. To generate a vector 1,3,2,6 you need: c(1,3,2,6).

Lists

Lists consist of a collection of objects, not necessarily of the same class. In them, we can store all the other objects we’ve already seen and retrieve them through indexing with [[. As an example, let’s use some objects that have already been generated.

my_list <- list(field1 = field1, height_mean = tapply(field1$height, field1$age, mean), matrix_ex = W)
str(my_list)
## List of 3
##  $ field1     :'data.frame': 6 obs. of  6 variables:
##   ..$ clone   : chr [1:6] "GRA02" "URO01" "URO03" "GRA02" ...
##   ..$ height  : num [1:6] 30.1 30.4 40 30.2 30.6 40.1
##   ..$ diameter: num [1:6] 0.26 0.3 0.36 0.24 0.27 0.35
##   ..$ age     : int [1:6] 3 4 5 3 4 5
##   ..$ cut     : logi [1:6] FALSE FALSE FALSE FALSE FALSE TRUE
##   ..$ volume  : num [1:6] 1.6 2.15 4.07 1.37 1.75 ...
##  $ height_mean: num [1:3(1d)] 30.1 30.5 40
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ : chr [1:3] "3" "4" "5"
##  $ matrix_ex  : num [1:6, 1:2] 30.1 30.4 40 30.2 30.6 40.1 0.26 0.3 0.36 0.24 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "GRA02" "URO01" "URO03" "GRA02" ...
##   .. ..$ : chr [1:2] "height" "diameter"

To access the data.frame field1:

my_list[[1]]
##   clone height diameter age   cut   volume
## 1 GRA02   30.1     0.26   3 FALSE 1.597287
## 2 URO01   30.4     0.30   4 FALSE 2.147760
## 3 URO03   40.0     0.36   5 FALSE 4.069440
## 4 GRA02   30.2     0.24   3 FALSE 1.365523
## 5 GRA01   30.6     0.27   4 FALSE 1.751131
## 6 URO01   40.1     0.35   5  TRUE 3.856116
# or
my_list$field1
##   clone height diameter age   cut   volume
## 1 GRA02   30.1     0.26   3 FALSE 1.597287
## 2 URO01   30.4     0.30   4 FALSE 2.147760
## 3 URO03   40.0     0.36   5 FALSE 4.069440
## 4 GRA02   30.2     0.24   3 FALSE 1.365523
## 5 GRA01   30.6     0.27   4 FALSE 1.751131
## 6 URO01   40.1     0.35   5  TRUE 3.856116

To access a specific column in the data.frame field1, which is inside my_list:

my_list[[1]][[3]]
## [1] 0.26 0.30 0.36 0.24 0.27 0.35
# or
my_list[[1]]$diameter
## [1] 0.26 0.30 0.36 0.24 0.27 0.35
# or
my_list$field1$diameter
## [1] 0.26 0.30 0.36 0.24 0.27 0.35
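
A detail that often causes confusion is the difference between single and double brackets in lists. As a small sketch with a made-up list (toy_list is hypothetical): [ returns a smaller list, while [[ returns the element itself.

```r
toy_list <- list(numbers = 1:3, label = "abc")

# Single brackets: the result is still a list (with one element)
class(toy_list[1]) # "list"

# Double brackets: the result is the stored object
class(toy_list[[1]]) # "integer"

# New elements can be added by name
toy_list$extra <- c(TRUE, FALSE)
length(toy_list) # 3
names(toy_list) # "numbers" "label" "extra"
```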

Lists are very useful, for example, when we are going to use/generate various objects within a loop.

Arrays

This is a type of object that you probably won’t use at the beginning, but it’s good to know about its existence. Arrays are used to store data with more than two dimensions. For example, if we create an array:

(my_array <- array(1:24, dim = c(2, 3, 4)))
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   13   15   17
## [2,]   14   16   18
## 
## , , 4
## 
##      [,1] [,2] [,3]
## [1,]   19   21   23
## [2,]   20   22   24

We will have four matrices with two rows and three columns, and the numbers from 1 to 24 will be distributed in them by columns.
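
Indexing an array works like indexing a matrix, with one coordinate per dimension. A minimal sketch using an array with the same shape as the one above (stored here under the name arr):

```r
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr) # 2 3 4

# Row 1, column 2 of the third matrix
arr[1, 2, 3] # 15

# The whole first matrix (2 rows, 3 columns)
first_matrix <- arr[, , 1]
dim(first_matrix) # 2 3
```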

If you’re ahead of your colleagues, you can already do the exercises from Session 2; if not, do them at another time and send us your questions through the forum.

Importing and Exporting Data

Download the objects created so far by clicking here

RData files are exclusive to R. You can open them by double-clicking the file or by using:

load("data/dia_1.RData")

This will recover all objects created so far in the tutorial.

If you have internet access, you can also use the web link directly:

load(url("https://breeding-insight.github.io/learn-hub/r-intro/data/dia_1.RData"))

To generate this RData file, I ran all the code presented so far, and used:

save.image(file = "data/dia_1.RData")

This command saves everything in your Global Environment (all objects displayed in the upper-right corner). You can also save only a specific object by using:

save(field1, file = "data/campo1.RData")

If we remove it from our Global Environment with:

rm(field1) # Make sure to save the RData file before removing

We can easily load it again with:

load("data/campo1.RData")
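
The whole save, remove and load cycle can be tried safely with a temporary file. In this sketch the object scores and the file paths (created with tempfile) are ours, so nothing in your course folder is touched. The second half shows saveRDS and readRDS, a base R alternative not covered in the text above, which stores a single object and lets you choose its name when reading it back.

```r
scores <- c(7.5, 8.0, 9.25)

# Save to a temporary .RData file, remove the object, then load it again
rdata_path <- tempfile(fileext = ".RData")
save(scores, file = rdata_path)
rm(scores)
load(rdata_path) # recreates `scores` under its original name
scores

# Alternative: saveRDS() stores one object; readRDS() returns it,
# so you decide the name it gets when read back
rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, file = rds_path)
scores_copy <- readRDS(rds_path)
identical(scores, scores_copy) # TRUE
```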

The RData format is exclusive to R. It is useful for tasks like ours where we pause the analysis one day and resume another, without having to rerun everything. However, there are situations where you need to export your data to other programs that require different formats, such as .txt or .csv. For that, you can use:

write.table(field1, file = "campo1.txt", sep = ";", dec = ".", row.names = FALSE)
write.csv(field1, file = "campo1.csv", row.names = TRUE)

Note: You can use packages to export and import data in other formats. For example:

  • The openxlsx package allows you to export and import Excel files.
  • The vroom package can be used to handle large compressed tables.

To install packages, use the function install.packages("package_name") and to load the desired package, use library(package_name). More details about packages will be covered in the “Package Usage” section of this course.

# install.packages("openxlsx")
library(openxlsx)

write.xlsx(field1, file = "campo1.xlsx")

When exporting data, there are multiple options for file formatting, which are important to consider if the file will be used in other software later.

Open the generated files in your text editor to view their formatting. Notice that the .txt file was saved with ";" as the field separator and "." as the decimal separator. Meanwhile, the .csv file was saved with "," as the field separator and "." as the decimal separator.

These files can also be read again in R by using the appropriate functions and specifications:

campo1_txt <- read.table(file = "campo1.txt", sep = ";", dec = ".", header = TRUE)
campo1_csv <- read.csv(file = "campo1.csv")
campo1_xlsx <- read.xlsx("campo1.xlsx", sheet = 1)
head(campo1_txt)
head(campo1_csv)
head(campo1_xlsx)
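
One formatting option worth checking is row.names. The sketch below, using a made-up data frame toy and a temporary file, shows that exporting with row.names = TRUE adds an extra first column, which read.csv then names X when the file is read back:

```r
toy <- data.frame(id = c("a", "b"), value = c(1.5, 2.5))
csv_path <- tempfile(fileext = ".csv")

# With row names: the file gets an extra, unnamed first column
write.csv(toy, file = csv_path, row.names = TRUE)
names_with <- names(read.csv(csv_path))
names_with # "X" "id" "value"

# Without row names: the columns match the original data frame
write.csv(toy, file = csv_path, row.names = FALSE)
names_without <- names(read.csv(csv_path))
names_without # "id" "value"
```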

Now that we have learned how to import data, we will work with the dataset generated from the questionnaire sent to you and to students of other editions of this course.

The spreadsheet with the data is available at the link below. Add it to your working directory or specify the folder path when importing it into R, as shown below.

We can also import the data directly from GitHub by pointing to the web address:

dados <- read.csv("https://breeding-insight.github.io/learn-hub/r-intro/data/dados_2025.csv")

Here, we will also use the argument na.strings to indicate how missing data was labeled.

dados <- read.csv(file = "data/dados_2025.csv", na.strings = "-", header = TRUE, dec = ",")
head(dados)
load("data/dados_2025.RData")
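
To see what na.strings changes, here is a minimal sketch that reads a tiny made-up table from text instead of a file (the text argument of read.csv allows this). The columns plot and yield are hypothetical.

```r
raw_text <- "plot,yield\nA,10\nB,-\nC,12"

# Telling R that "-" means missing: the column stays numeric
with_na <- read.csv(text = raw_text, na.strings = "-")
with_na$yield # 10 NA 12
mean(with_na$yield, na.rm = TRUE) # 11

# Without na.strings, the "-" forces the whole column to be read as text
without_na <- read.csv(text = raw_text)
is.numeric(without_na$yield) # FALSE
```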

Let’s explore the structure of the collected data:

str(dados)
## 'data.frame':    124 obs. of  8 variables:
##  $ Timestamp                                                                                   : chr  "07/04/2025 14:00:00" "07/04/2025 14:00:00" "07/04/2025 14:00:00" "07/04/2025 14:00:00" ...
##  $ Affiliation                                                                                 : chr  "Rcourse2021" "Rcourse2021" "Rcourse2021" "Rcourse2021" ...
##  $ Using.Google.Maps..please.provide.the.longitude.of.the.city.country.of.your.birth.          : chr  "-54.5724" "-47.6476" "-48.0547" "-106.6563" ...
##  $ Using.Google.Maps..please.provide.the.latitude.of.the.city.country.of.your.birth.           : chr  "-25.5263" "-22.725" "-15.911" "52.1418" ...
##  $ Background.Field..e.g..Molecular.Biology..Animal.Breeding..Genetics..etc..                  : chr  "Agronomia" "Agronomia" "Biotecnologia" "Licenciatura em Ciências Biológicas" ...
##  $ Current.professional.affiliation                                                            : chr  "PhD" "PhD" "Masters degree" "Other" ...
##  $ If.you.chose.other.in.the..current.professional.affiliation..question.above..please.specify.: chr  NA NA NA NA ...
##  $ Level.of.R.knowledge                                                                        : chr  "Intermediate" "Intermediate" "Beginner (some knowledge)" "Intermediate" ...
# Also
dim(dados)
## [1] 124   8

Observe that the column names still correspond to the complete questions from the questionnaire. Let’s simplify them to make them easier to work with:

colnames(dados)
## [1] "Timestamp"                                                                                   
## [2] "Affiliation"                                                                                 
## [3] "Using.Google.Maps..please.provide.the.longitude.of.the.city.country.of.your.birth."          
## [4] "Using.Google.Maps..please.provide.the.latitude.of.the.city.country.of.your.birth."           
## [5] "Background.Field..e.g..Molecular.Biology..Animal.Breeding..Genetics..etc.."                  
## [6] "Current.professional.affiliation"                                                            
## [7] "If.you.chose.other.in.the..current.professional.affiliation..question.above..please.specify."
## [8] "Level.of.R.knowledge"
colnames(dados) <- c("Date", "Affiliation", "Longitude", "Latitude", "Background", "Present_Occupation", "Explain", "KnowledgeR")
colnames(dados)
## [1] "Date"               "Affiliation"        "Longitude"         
## [4] "Latitude"           "Background"         "Present_Occupation"
## [7] "Explain"            "KnowledgeR"
str(dados)
## 'data.frame':    124 obs. of  8 variables:
##  $ Date              : chr  "07/04/2025 14:00:00" "07/04/2025 14:00:00" "07/04/2025 14:00:00" "07/04/2025 14:00:00" ...
##  $ Affiliation       : chr  "Rcourse2021" "Rcourse2021" "Rcourse2021" "Rcourse2021" ...
##  $ Longitude         : chr  "-54.5724" "-47.6476" "-48.0547" "-106.6563" ...
##  $ Latitude          : chr  "-25.5263" "-22.725" "-15.911" "52.1418" ...
##  $ Background        : chr  "Agronomia" "Agronomia" "Biotecnologia" "Licenciatura em Ciências Biológicas" ...
##  $ Present_Occupation: chr  "PhD" "PhD" "Masters degree" "Other" ...
##  $ Explain           : chr  NA NA NA NA ...
##  $ KnowledgeR        : chr  "Intermediate" "Intermediate" "Beginner (some knowledge)" "Intermediate" ...
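
If only one column needs a new name, you do not have to retype all of them. A small sketch with a made-up data frame (survey and its column names are hypothetical) that renames a single column by looking up its current name:

```r
survey <- data.frame(Q1.long.question = 1:2, Q2.other = c("a", "b"))

# Replace only the name that matches "Q1.long.question"
names(survey)[names(survey) == "Q1.long.question"] <- "score"

names(survey) # "score" "Q2.other"
```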

Now we will use the dataset to learn different commands and functions in the R environment.

First, let’s check how many students responded to the course questionnaire by counting the number of rows. For this, use the nrow function:

nrow(dados)
## [1] 124

Next, let’s check if our group includes people who share the same educational background and occupation.

This can be easily done using the table function, which indicates the frequency of each observation:

table(dados$Background)
table(dados$Present_Occupation)
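
The result of table can itself be sorted or turned into proportions. A minimal sketch with a made-up vector of answers (levels_seen is hypothetical, not the course data):

```r
levels_seen <- c("Beginner", "Intermediate", "Beginner", "Advanced", "Beginner")

counts <- table(levels_seen)
counts[["Beginner"]] # 3

# Most frequent answer first
sorted_counts <- sort(counts, decreasing = TRUE)
names(sorted_counts)[1] # "Beginner"

# Proportions instead of counts
shares <- prop.table(counts)
shares[["Beginner"]] # 0.6
```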

Conditional Structures

if and else

For our next activity with the dataset, let’s first understand how the if and else structures work.

In the conditional structures if and else, we set a condition for if. If the condition is true, the code inside the first pair of braces is run; otherwise (else), the code inside the second pair is run. For example:

if (2 > 3) {
  cat("two is greater than three")
} else {
  cat("two is not greater than three")
}
## two is not greater than three

  • Activity: Determine the R Knowledge Level of the second person who responded (row 2). Send the message “Intermediate Level” if it is intermediate or “Basic or Advanced Level” otherwise. (tip: the == operator refers to “exactly equal to”)

head(dados)
##                  Date Affiliation Longitude Latitude
## 1 07/04/2025 14:00:00 Rcourse2021  -54.5724 -25.5263
## 2 07/04/2025 14:00:00 Rcourse2021  -47.6476  -22.725
## 3 07/04/2025 14:00:00 Rcourse2021  -48.0547  -15.911
## 4 07/04/2025 14:00:00 Rcourse2021 -106.6563  52.1418
## 5 07/04/2025 14:00:00 Rcourse2021  -47.6604 -22.7641
## 6 07/04/2025 14:00:00 Rcourse2021  -47.6434 -22.7118
##                            Background Present_Occupation Explain
## 1                           Agronomia                PhD    <NA>
## 2                           Agronomia                PhD    <NA>
## 3                       Biotecnologia     Masters degree    <NA>
## 4 Licenciatura em Ciências Biológicas              Other    <NA>
## 5                           Agronomia       Profissional    <NA>
## 6                           Agronomia                PhD    <NA>
##                  KnowledgeR
## 1              Intermediate
## 2              Intermediate
## 3 Beginner (some knowledge)
## 4              Intermediate
## 5 Beginner (some knowledge)
## 6              Intermediate
if (dados$KnowledgeR[2] == "Intermediate") {
  cat("Intermediate Level")
} else {
  cat("Basic or Advanced Level")
}
## Intermediate Level

We can specify more than one condition by repeating the if and else structure.
Now, let’s examine the occupation of the respondents to the questionnaire:

if (dados$Present_Occupation[2] == "Other") {
  print(dados$Explain[2])
} else if (dados$Present_Occupation[2] == "PhD") {
  print(paste("Is your PhD in:", dados$Background[2], "?"))
} else {
  print(paste("Are you still working with:", dados$Background[2], "?"))
}
## [1] "Is your PhD in: Agronomia ?"
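
As a minimal sketch of how such a chain is evaluated, the example below classifies a single number. The variable names latitude and hemisphere and the value used are hypothetical, chosen only for illustration. Conditions are tested from top to bottom, and only the block of the first true condition runs.

```r
# Hypothetical value, used only for illustration
latitude <- -22.725

# Conditions are checked in order; the first TRUE one wins
if (latitude > 0) {
  hemisphere <- "Northern"
} else if (latitude < 0) {
  hemisphere <- "Southern"
} else {
  hemisphere <- "Equator"
}

cat(hemisphere, "\n")
## Southern
```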

However, note that if expects a single TRUE or FALSE value, so these structures can only be applied to one element of a vector at a time (since R 4.2.0, a condition with more than one element is an error). If we want to go through all the elements, we need another resource.
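
The sketch below illustrates this point on a small made-up vector. The names occupation and is_phd are assumptions standing in for a column such as dados$Present_Occupation. The comparison itself works on the whole vector, but if can only use one of the resulting values at a time.

```r
# Hypothetical stand-in for a column of the survey data
occupation <- c("PhD", "Other", "Masters degree")

# The comparison is vectorized: one TRUE/FALSE per element
is_phd <- occupation == "PhD"
is_phd
## [1]  TRUE FALSE FALSE

# if() needs a single value, so we pick one element by position
if (is_phd[1]) {
  cat("The first respondent is a PhD student\n")
} else {
  cat("The first respondent is not a PhD student\n")
}
## The first respondent is a PhD student
```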

Repetition Structures

For

This can be done with the for function, a powerful and widely used tool. It is a loop structure: it runs the same block of code once for each element of a sequence. See the examples below:

for (i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
test <- vector()
for (i in 1:10) {
  test[i] <- i + 4
}
test
##  [1]  5  6  7  8  9 10 11 12 13 14

In the examples above, i works as an index that varies from 1 to 10 during the operation specified within the braces.
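
One detail worth knowing about the index sequence: 1:n counts backwards when n is zero, so a loop written as 1:length(x) still runs on an empty vector. The sketch below, using the hypothetical vectors values and empty, shows the base R function seq_along as a safer way to build the index. It also creates the result vector with its final length before the loop. This is one possible pattern, not the only correct one.

```r
# Hypothetical example vector
values <- c(10, 20, 30)

# Create the result vector with its final length before the loop
doubled <- numeric(length(values))
for (i in seq_along(values)) {
  doubled[i] <- values[i] * 2
}
doubled
## [1] 20 40 60

# With an empty vector, 1:length() still produces two indices
empty <- numeric(0)
1:length(empty)
## [1] 1 0
seq_along(empty)
## integer(0)
```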

Using this structure, we can repeat the operation performed with if and else structures for the entire vector:

for (i in 1:nrow(dados)) {
  if (dados$Present_Occupation[i] == "Other") {
    print(dados$Explain[i])
  } else if (dados$Present_Occupation[i] == "PhD") {
    print(paste("Is your PhD in:", dados$Background[i], "?"))
  } else {
    print(paste("Are you still working with:", dados$Background[i], "?"))
  }
}

Note that some participants who chose “Other” did not fill in the explanation, so dados$Explain is NA (not available) for them and the loop above simply prints NA. To print an informative message instead, we can use the is.na function to check whether a response is missing.

for (i in 1:nrow(dados)) {
  if (dados$Present_Occupation[i] == "Other") {
    if (is.na(dados$Explain[i])) {
      print("This person did not explain what they meant by 'Other'")
    } else {
      print(dados$Explain[i])
    }
  } else if (dados$Present_Occupation[i] == "PhD") {
    print(paste("Is your PhD in:", dados$Background[i], "?"))
  } else {
    print(paste("Are you still working with:", dados$Background[i], "?"))
  }
}
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Biotecnologia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Zootecnia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Engenharia Florestal ?"
## [1] "Is your PhD in: Bacharel em Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Engenharia Agrícola ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Engenharia Agronômica ?"
## [1] "Are you still working with: Ciências Biológicas ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Engenheiro Florestal ?"
## [1] "Are you still working with: Genética e Melhoramento de Plantas ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Are you still working with: Agronomia ?"
## [1] "Are you still working with: Engenharia Agronômica ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Ciências Biológicas ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Are you still working with: Ciências Biológicas ?"
## [1] "Is your PhD in: Engenharia Agronômica ?"
## [1] "Are you still working with: Ciências Biológicas ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Biotecnologia ?"
## [1] "Is your PhD in: Biologia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Are you still working with: Biologia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Biologia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Are you still working with: Melhoramento genético ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Ciências Biológicas ?"
## [1] "Are you still working with: Ciências Biológicas ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Are you still working with: Biologia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Zootecnia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Biotecnologia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Genética e Melhoramento ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Biologia ?"
## [1] "Is your PhD in: Genética e Melhoramento de Plantas ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Engeheria em Biotecnologia Vegetal (Chile) ?"
## [1] "Are you still working with: AGRONOMIA ?"
## [1] "Are you still working with: Engenharia Agronômica ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Biología ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Ciências Biológicas ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Ciências Biológicas ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Engenharia Biotecnológica ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Engenharia Florestal ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Biologia ?"
## [1] "Is your PhD in: Biologia ?"
## [1] "Is your PhD in: Ciências Biológicas ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Engenheiro Agrônomo ?"
## [1] "Is your PhD in: Genética e Melhoramento ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Is your PhD in: Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Is your PhD in: Agronomia ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "This person did not explain what they meant by 'Other'"
## [1] "Are you still working with: Engenharia Agronômica / Licenciatura em Ciências Agrárias ?"
## [1] "Are you still working with: Ciencias Biologicas ?"
## [1] "Are you still working with: Agronomia ?"
## [1] "Is your PhD in: Eng. Florestal ?"
## [1] "Postdoc"
## [1] "Are you still working with: Psychology ?"
## [1] "Are you still working with: Plant Genetics ?"
## [1] "Are you still working with: Plant Breeding ?"
## [1] "Finished my Masters "
## [1] "Are you still working with: Soil Sciencies  ?"
## [1] "Masters degree"
## [1] "PhD"
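
The following standalone sketch, built on a hypothetical vector named explain, shows why missing values need an explicit check. Comparing NA with == does not give TRUE or FALSE but NA, which if cannot use as a condition. The is.na function always returns TRUE or FALSE.

```r
# Hypothetical responses; NA marks a missing answer
explain <- c("Postdoc", NA, "Finished my Masters")

# is.na gives one TRUE/FALSE per element
is.na(explain)
## [1] FALSE  TRUE FALSE

# A comparison involving NA is itself NA
NA == "Postdoc"
## [1] NA

for (i in seq_along(explain)) {
  if (is.na(explain[i])) {
    print("No explanation given")
  } else {
    print(explain[i])
  }
}
## [1] "Postdoc"
## [1] "No explanation given"
## [1] "Finished my Masters"
```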

Tip: Indentation

Compare the loop above with the same code written without indentation:

# Without indentation
for (i in 1:nrow(dados)) {
if (dados$Present_Occupation[i] == "Other") {
if (is.na(dados$Explain[i])) {
print("This person did not explain what they meant by 'Other'")
} else {
print(dados$Explain[i])
}
} else if (dados$Present_Occupation[i] == "PhD") {
print(paste("Is your PhD in:", dados$Background[i], "?"))
} else {
print(paste("Are you still working with:", dados$Background[i], "?"))
}
}

RStudio’s code editor makes it easy to indent R code. Select the area you want to indent and press Ctrl+i.

Now let’s work with column 5, which contains the participants’ backgrounds. Notice that the table function returns different categories that could be grouped into just one, such as “Engenharia Agronômica” (Agronomic Engineering) and “Agronomia” (Agronomy). We will use a loop to identify which participants gave responses containing “Agro.” Then, we will determine which of those responses are not exactly “Agronomy” and ask those participants to change their answers to “Agronomy.”

Tip: To identify the pattern “Agro,” we can use the grepl function.
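
Before applying it to the survey data, here is a minimal sketch of what grepl returns, using a hypothetical vector named areas. Matching is case-sensitive by default, so “AGRONOMIA” does not match the pattern “Agro”. The ignore.case argument of grepl relaxes this.

```r
# Hypothetical responses, not taken from the survey file
areas <- c("Agronomia", "AGRONOMIA", "Biologia", "Bacharel em Agronomia")

# One TRUE/FALSE per element; case-sensitive by default
grepl("Agro", areas)
## [1]  TRUE FALSE FALSE  TRUE

# ignore.case = TRUE also catches the upper-case response
grepl("agro", areas, ignore.case = TRUE)
## [1]  TRUE  TRUE FALSE  TRUE
```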

# Example of using grepl
dados[, 5]
##   [1] "Agronomia"                                                
##   [2] "Agronomia"                                                
##   [3] "Biotecnologia"                                            
##   [4] "Licenciatura em Ciências Biológicas"                      
##   [5] "Agronomia"                                                
##   [6] "Agronomia"                                                
##   [7] "Zootecnia"                                                
##   [8] "Agronomia"                                                
##   [9] "Agronomia"                                                
##  [10] "Engenharia Florestal"                                     
##  [11] "Bacharel em Agronomia"                                    
##  [12] "Engenharia Agronômica"                                    
##  [13] "Engenharia Agrícola"                                      
##  [14] "Agronomia"                                                
##  [15] "Engenharia Agronômica"                                    
##  [16] "Ciências Biológicas"                                      
##  [17] "Agronomia"                                                
##  [18] "Engenheiro Florestal"                                     
##  [19] "Genética e Melhoramento de Plantas"                       
##  [20] "Agronomia"                                                
##  [21] "Agronomia"                                                
##  [22] "Engenharia Agronômica"                                    
##  [23] "Agronomia"                                                
##  [24] "Agronomia"                                                
##  [25] "Ciências Biológicas"                                      
##  [26] "Agronomia"                                                
##  [27] "Ciências Biológicas"                                      
##  [28] "Engenharia Agronômica"                                    
##  [29] "Ciências Biológicas"                                      
##  [30] "Agronomia"                                                
##  [31] "Biotecnologia"                                            
##  [32] "Biologia"                                                 
##  [33] "Agronomia"                                                
##  [34] "Biologia"                                                 
##  [35] "Agronomia"                                                
##  [36] "Biologia"                                                 
##  [37] "Agronomia"                                                
##  [38] "Melhoramento genético"                                    
##  [39] "FAEM - UFPEL"                                             
##  [40] "Agronomia"                                                
##  [41] "Ciências Biológicas"                                      
##  [42] "Ciências Biológicas"                                      
##  [43] "Agronomia"                                                
##  [44] "Biologia"                                                 
##  [45] "Agronomia"                                                
##  [46] "Agronomia"                                                
##  [47] "Eng. Agronômica"                                          
##  [48] "Zootecnia"                                                
##  [49] "Agronomia"                                                
##  [50] "Biotecnologia"                                            
##  [51] "Agronomia"                                                
##  [52] "Agronomia"                                                
##  [53] "Agronomia"                                                
##  [54] "Agronomia"                                                
##  [55] "Agronomia"                                                
##  [56] "Agronomia"                                                
##  [57] "Agronomia"                                                
##  [58] "Genética e Melhoramento"                                  
##  [59] "Agronomia"                                                
##  [60] "Agronomia"                                                
##  [61] "Agronomia"                                                
##  [62] "Ciências bilógicas"                                       
##  [63] "Agronomia"                                                
##  [64] "Biologia"                                                 
##  [65] "Genética e Melhoramento de Plantas"                       
##  [66] "Agronomia"                                                
##  [67] "Agronomia"                                                
##  [68] "Agronomia"                                                
##  [69] "Agronomia"                                                
##  [70] "Agronomia"                                                
##  [71] "Agronomia"                                                
##  [72] "Agronomia"                                                
##  [73] "Agronomia"                                                
##  [74] "Engeheria em Biotecnologia Vegetal (Chile)"               
##  [75] "AGRONOMIA"                                                
##  [76] "Engenharia Agronômica"                                    
##  [77] "Agronomia"                                                
##  [78] "Agronomia"                                                
##  [79] "Biología"                                                 
##  [80] "Agronomia"                                                
##  [81] "Ciências Biológicas"                                      
##  [82] "Agronomia"                                                
##  [83] "Agronomia"                                                
##  [84] "Agronomia"                                                
##  [85] "Agronomia"                                                
##  [86] "Agronomia"                                                
##  [87] "Ciências Biológicas"                                      
##  [88] "Agronomia"                                                
##  [89] "Agronomia"                                                
##  [90] "Agronomia"                                                
##  [91] "Agronomia"                                                
##  [92] "Agronomia"                                                
##  [93] "Engenharia Biotecnológica"                                
##  [94] "Agronomia"                                                
##  [95] "Agronomia"                                                
##  [96] "Engenharia Florestal"                                     
##  [97] "Agronomia"                                                
##  [98] "Agronomia"                                                
##  [99] "Biologia"                                                 
## [100] "Biologia"                                                 
## [101] "Ciências Biológicas"                                      
## [102] "Agronomia"                                                
## [103] "Agronomia"                                                
## [104] "Agronomia"                                                
## [105] "Engenheiro Agrônomo"                                      
## [106] "Genética e Melhoramento"                                  
## [107] "Agronomia"                                                
## [108] "Agronomia"                                                
## [109] "Biotecnologia"                                            
## [110] "Agronomia"                                                
## [111] "Agronomia"                                                
## [112] "Biotecnologia"                                            
## [113] "Engenharia Agronômica / Licenciatura em Ciências Agrárias"
## [114] "Ciencias Biologicas"                                      
## [115] "Agronomia"                                                
## [116] "Eng. Florestal"                                           
## [117] "Genetics"                                                 
## [118] "Psychology"                                               
## [119] "Plant Genetics"                                           
## [120] "Plant Breeding"                                           
## [121] "Climate Change"                                           
## [122] "Soil Sciencies "                                          
## [123] "Animal Science"                                           
## [124] "Animal Breeding and Genetics"
grepl("Agro", dados[, 5]) # Which rows contain the characters "Agro"
##   [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
##  [13] FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
##  [37]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
##  [49]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
##  [61]  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [73]  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
##  [85]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
##  [97]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [109] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE
dados[grepl("Agro", dados[, 5]), 5]
##  [1] "Agronomia"                                                
##  [2] "Agronomia"                                                
##  [3] "Agronomia"                                                
##  [4] "Agronomia"                                                
##  [5] "Agronomia"                                                
##  [6] "Agronomia"                                                
##  [7] "Bacharel em Agronomia"                                    
##  [8] "Engenharia Agronômica"                                    
##  [9] "Agronomia"                                                
## [10] "Engenharia Agronômica"                                    
## [11] "Agronomia"                                                
## [12] "Agronomia"                                                
## [13] "Agronomia"                                                
## [14] "Engenharia Agronômica"                                    
## [15] "Agronomia"                                                
## [16] "Agronomia"                                                
## [17] "Agronomia"                                                
## [18] "Engenharia Agronômica"                                    
## [19] "Agronomia"                                                
## [20] "Agronomia"                                                
## [21] "Agronomia"                                                
## [22] "Agronomia"                                                
## [23] "Agronomia"                                                
## [24] "Agronomia"                                                
## [25] "Agronomia"                                                
## [26] "Agronomia"                                                
## [27] "Eng. Agronômica"                                          
## [28] "Agronomia"                                                
## [29] "Agronomia"                                                
## [30] "Agronomia"                                                
## [31] "Agronomia"                                                
## [32] "Agronomia"                                                
## [33] "Agronomia"                                                
## [34] "Agronomia"                                                
## [35] "Agronomia"                                                
## [36] "Agronomia"                                                
## [37] "Agronomia"                                                
## [38] "Agronomia"                                                
## [39] "Agronomia"                                                
## [40] "Agronomia"                                                
## [41] "Agronomia"                                                
## [42] "Agronomia"                                                
## [43] "Agronomia"                                                
## [44] "Agronomia"                                                
## [45] "Agronomia"                                                
## [46] "Agronomia"                                                
## [47] "Agronomia"                                                
## [48] "Engenharia Agronômica"                                    
## [49] "Agronomia"                                                
## [50] "Agronomia"                                                
## [51] "Agronomia"                                                
## [52] "Agronomia"                                                
## [53] "Agronomia"                                                
## [54] "Agronomia"                                                
## [55] "Agronomia"                                                
## [56] "Agronomia"                                                
## [57] "Agronomia"                                                
## [58] "Agronomia"                                                
## [59] "Agronomia"                                                
## [60] "Agronomia"                                                
## [61] "Agronomia"                                                
## [62] "Agronomia"                                                
## [63] "Agronomia"                                                
## [64] "Agronomia"                                                
## [65] "Agronomia"                                                
## [66] "Agronomia"                                                
## [67] "Agronomia"                                                
## [68] "Agronomia"                                                
## [69] "Agronomia"                                                
## [70] "Agronomia"                                                
## [71] "Agronomia"                                                
## [72] "Agronomia"                                                
## [73] "Engenharia Agronômica / Licenciatura em Ciências Agrárias"
## [74] "Agronomia"
for (i in 1:nrow(dados)) {
  if (grepl("Agro", dados[i, 5])) {
    if (dados[i, 5] != "Agronomy") {
      print("Please replace your response with Agronomy.")
    }
  }
}
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."

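Notice that the code above only prints a message; it does not tell us which rows need to be changed. Because the responses were recorded in Portuguese (“Agronomia”), every response containing “Agro” differs from “Agronomy,” so all of them are flagged. To find these rows, we store their indices in a vector and then use it to access them.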

homog <- vector()
for (i in 1:nrow(dados)) {
  if (grepl("Agro", dados[i, 5])) {
    if (dados[i, 5] != "Agronomy") {
      print("Please replace your response with Agronomy.")
      homog <- c(homog, i)
    }
  }
}
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
## [1] "Please replace your response with Agronomy."
homog
##  [1]   1   2   5   6   8   9  11  12  14  15  17  20  21  22  23  24  26  28  30
## [20]  33  35  37  40  43  45  46  47  49  51  52  53  54  55  56  57  59  60  61
## [39]  63  66  67  68  69  70  71  72  73  76  77  78  80  82  83  84  85  86  88
## [58]  89  90  91  92  94  95  97  98 102 103 104 107 108 110 111 113 115

How would you correct these incorrect elements? Try it!

dados[homog, 5] <- "Agronomy"

While

In this type of repetition structure, the task is repeated while a condition remains true; the loop stops as soon as the condition becomes false.

x <- 1

while (x < 5) {
  x <- x + 1
  cat(x)
}
## 2345

It’s very important that the condition eventually becomes false; otherwise the loop will run indefinitely and you’ll have to interrupt it by external means. In RStudio, you can click the red stop symbol in the top right corner of the console window or press Esc in the console. If you run R in a terminal, press Ctrl+C.

This can happen easily with a small error, such as forgetting to assign the new value to x:

x <- 1

while (x < 5) {
  x + 1
  cat(x)
}

Here we can use the break and next commands to meet other conditions, like:

x <- 1

while (x < 5) {
  x <- x + 1
  if (x == 4) break
  cat(x)
}
## 23
x <- 1

while (x < 5) {
  x <- x + 1
  if (x == 4) next
  cat(x)
}
## 235

The break command stops the loop completely when the condition is met, while next skips the rest of the current iteration and continues with the next one.
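
As a defensive sketch (an addition, not part of the original course material), break can also guard against the accidental infinite loop shown earlier. The counter iterations and the limit max_iterations are hypothetical names chosen for illustration:

```r
x <- 1
iterations <- 0         # hypothetical safety counter
max_iterations <- 100   # hypothetical upper bound on the number of passes

while (x < 5) {
  x + 1                 # same bug as before: the result is never assigned to x
  iterations <- iterations + 1
  if (iterations >= max_iterations) {
    message("Stopping after ", iterations, " iterations; x is still ", x)
    break               # leave the loop even though x < 5 is still TRUE
  }
}
```

The loop now ends on its own, and the message points to the variable that never changed.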

Repeat

This structure also requires a stop condition, but this condition is necessarily placed inside the code block using break. It then repeats the code block until the condition interrupts it.

x <- 1
repeat{
  x <- x + 1
  cat(x)
  if (x == 4) break
}
## 234

The repeat structure is similar to while, but with the key difference that the stop condition must be explicitly defined within the code block using break. It will continue to execute the code block indefinitely until it encounters the break statement.
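
One practical consequence, sketched here with the hypothetical counters while_runs and repeat_runs: because repeat only tests its condition inside the block, the block always runs at least once, whereas a while loop may not run at all.

```r
# while checks the condition first, so the body never runs when x starts at 10
x <- 10
while_runs <- 0
while (x < 5) {
  while_runs <- while_runs + 1
  x <- x + 1
}

# repeat runs the body first and only then reaches the break condition
y <- 10
repeat_runs <- 0
repeat {
  repeat_runs <- repeat_runs + 1
  y <- y + 1
  if (y >= 5) break
}

while_runs
## [1] 0
repeat_runs
## [1] 1
```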

Loops within Loops

It’s also possible to use repetition structures within repetition structures. For example, if we want to work on both columns and rows of a matrix.

# Creating an empty matrix
ex_mat <- matrix(nrow = 10, ncol = 10)

# each number inside the matrix will be the product of the column index by the row index
for (i in 1:dim(ex_mat)[1]) {
  for (j in 1:dim(ex_mat)[2]) {
    ex_mat[i, j] <- i * j
  }
}

Another example of use:

var1 <- c("fertilizer1", "fertilizer2")
var2 <- c("ESS", "URO", "GRA")

w <- 1
for (i in var1) {
  for (j in var2) {
    file_name <- paste0(i, "_plant_", j, ".txt")
    file <- data.frame("block" = "fake_data", "treatment" = "fake_data")
    write.table(file, file = file_name)
    w <- w + 1
  }
}

# Check your working directory, files should have been generated
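
If only the file names are needed, the same combinations can be built without nested loops. This is a minimal sketch using base R's expand.grid(); the column names species and fertilizer are hypothetical labels, and no files are written:

```r
var1 <- c("fertilizer1", "fertilizer2")
var2 <- c("ESS", "URO", "GRA")

# expand.grid() varies its first argument fastest, so listing the species
# first reproduces the order of the nested loop above
combos <- expand.grid(species = var2, fertilizer = var1, stringsAsFactors = FALSE)
file_names <- paste0(combos$fertilizer, "_plant_", combos$species, ".txt")
file_names
## [1] "fertilizer1_plant_ESS.txt" "fertilizer1_plant_URO.txt"
## [3] "fertilizer1_plant_GRA.txt" "fertilizer2_plant_ESS.txt"
## [5] "fertilizer2_plant_URO.txt" "fertilizer2_plant_GRA.txt"
```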

If you’re ahead of your colleagues, you can already do the exercises from Session 3. If not, do them at another time and send us your questions through the forum.

Some tips:

  • Be careful when running the same command multiple times, some variables might not be the same as they were before. For the command to work the same way, the input objects need to be in the form you expect.
  • Remember that = is for defining objects and == is the equality sign.
  • In conditional and repetition structures, remember that it’s necessary to maintain the expected syntax: if(){} and for(i in 1:10){}. R is case-sensitive, so If would not work. In for, we can change the letter used as the index, but it’s always necessary to provide a vector to iterate over, such as a sequence of integers or characters.
  • Using indentation helps to visualize the beginning and end of each code structure and makes it easier to open and close braces. Indentation refers to those spaces we use before the line, like:
# Creating an empty matrix
ex_mat <- matrix(nrow = 10, ncol = 10)

# each number inside the matrix will be the product of the column index by the row index
for (i in 1:dim(ex_mat)[1]) { # First level: no indentation
  for (j in 1:dim(ex_mat)[2]) { # Second level: indented once
    ex_mat[i, j] <- i * j # Third level: indented twice
  } # Closed the second level
} # Closed the first level

The consistent use of indentation makes your code more readable and helps prevent errors by making the structure clearer. Most modern IDEs, including RStudio, provide automatic indentation features to help maintain this consistency.
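
A related pitfall, added here as a sketch to complement the tip about for sequences: 1:length(x) counts backwards when x is empty, so the loop body still runs. The base R function seq_along() avoids this. The counter names are hypothetical:

```r
empty <- character(0)

# 1:length(empty) is 1:0, that is c(1, 0), so the body runs twice
count_colon <- 0
for (i in 1:length(empty)) {
  count_colon <- count_colon + 1
}

# seq_along(empty) is an empty sequence, so the body never runs
count_seq <- 0
for (i in seq_along(empty)) {
  count_seq <- count_seq + 1
}

count_colon
## [1] 2
count_seq
## [1] 0
```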

Vectorization

Although loops are intuitive and easier to understand, in R they are often slower and less efficient than vectorized code. Vectorization is a technique that allows operations to be applied to all elements of a vector or matrix at once, without the need to iterate over each element individually.

Here is a simple example of non-vectorized code (using a loop) and its vectorized version:

# Not vectorized (using loop)
numbers <- 1:5
loop_result <- numeric(length(numbers))
for (i in 1:length(numbers)) {
  loop_result[i] <- numbers[i] * 2
}

# Vectorized approach
numbers <- 1:5
vectorized_result <- numbers * 2

loop_result == vectorized_result
## [1] TRUE TRUE TRUE TRUE TRUE
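
To get a feeling for the speed difference, the sketch below times both approaches on a longer vector with base R's system.time(). The vector length of one million is an arbitrary choice, and the timings depend on your machine, so no output is shown:

```r
numbers <- seq_len(1e6)

# Loop version: fill a pre-allocated vector one element at a time
loop_time <- system.time({
  loop_result <- numeric(length(numbers))
  for (i in seq_along(numbers)) {
    loop_result[i] <- numbers[i] * 2
  }
})

# Vectorized version: one operation on the whole vector
vectorized_time <- system.time({
  vectorized_result <- numbers * 2
})

loop_time["elapsed"]
vectorized_time["elapsed"]

# Both approaches give the same values
identical(loop_result, vectorized_result)
```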

This code transformation can become more complex depending on the scenario. For example, think about what a vectorized version of the previous loop would look like:

# Not vectorized (using loop)
ex_mat <- matrix(nrow = 10, ncol = 10)

for (i in 1:dim(ex_mat)[1]) {
  for (j in 1:dim(ex_mat)[2]) {
    ex_mat[i, j] <- i * j
  }
}

# Vectorized?

This is a good moment to practice using an AI tool to help you transform the code into a vectorized version. You can use ChatGPT or Copilot, for example. Compare the result of the suggested code with what you generated with the loop to verify that the tool is really doing what you want. This transformation is worth it if the loop code is taking too long or if you have to run the same code many times. One possible answer is:

ex_mat <- outer(1:10, 1:10, "*")
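
Whatever tool suggests the vectorized code, it is worth checking it against the loop. A minimal sketch of such a check, using the hypothetical object names loop_mat and vec_mat:

```r
# Loop version
loop_mat <- matrix(nrow = 10, ncol = 10)
for (i in 1:nrow(loop_mat)) {
  for (j in 1:ncol(loop_mat)) {
    loop_mat[i, j] <- i * j
  }
}

# Candidate vectorized version
vec_mat <- outer(1:10, 1:10, "*")

# all() collapses the element-by-element comparison into a single TRUE/FALSE
all(loop_mat == vec_mat)
## [1] TRUE
```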

Creating Functions

If you’re already comfortable using loops, you might be wondering: “What if I want to do this several times?” or “What if I want to apply this logic to different datasets?” That’s where functions come in.

We can create custom functions to perform specific tasks. The basic syntax to create a function in R is:

my_function <- function(arg1, arg2) {
  # Function code
  result <- arg1 + arg2
  return(result)
}

The function my_function takes two arguments (arg1 and arg2) and returns their sum. You can call the function by passing the desired values:

result <- my_function(3, 5)
result # Output: 8
## [1] 8
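
Arguments can also have default values and can be passed by name. The sketch below uses a hypothetical function, add_numbers, to illustrate both; the default of 10 is an arbitrary choice:

```r
# arg2 has a default value, so it can be omitted in the call
add_numbers <- function(arg1, arg2 = 10) {
  return(arg1 + arg2)
}

add_numbers(3)                   # uses the default arg2 = 10
## [1] 13
add_numbers(3, arg2 = 5)         # overrides the default
## [1] 8
add_numbers(arg2 = 1, arg1 = 2)  # named arguments can be given in any order
## [1] 3
```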

Example of a custom function using vectorization:

vectorized_sum <- function(vector) {
  # Check if the vector is numeric
  if (!is.numeric(vector)) {
    stop("The vector must be numeric.")
  }
  # Sum the vector elements
  total <- sum(vector)

  # Z-score standardization
  z_score <- (vector - mean(vector)) / sd(vector)

  result <- list(sum = total, z_score = z_score)
  return(result)
}

# Calling the function
vectorized_result <- vectorized_sum(c(1, 2, 3, 4, 5))
vectorized_result
## $sum
## [1] 15
## 
## $z_score
## [1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111

Example of a function using a data.frame as input. Note that repeated use of the function will require that the input data has the same format or at least the required columns. A good practice is to check if the required columns are present before performing calculations.

calc_volume <- function(data_frame) {
  if (!all(c("diameter", "height") %in% colnames(data_frame))) {
    stop("The columns 'diameter' and 'height' must be present in the data frame.")
  }

  volume <- 3.14 * ((data_frame$diameter / 2)^2) * data_frame$height
  return(volume)
}
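
A short usage sketch follows. The data frames trees and bad_input are hypothetical examples, and tryCatch() is used only to capture the error message instead of stopping the script:

```r
calc_volume <- function(data_frame) {
  if (!all(c("diameter", "height") %in% colnames(data_frame))) {
    stop("The columns 'diameter' and 'height' must be present in the data frame.")
  }

  volume <- 3.14 * ((data_frame$diameter / 2)^2) * data_frame$height
  return(volume)
}

# Input with the required columns
trees <- data.frame(diameter = c(4, 6), height = c(10, 15))
volumes <- calc_volume(trees)
volumes
## [1] 125.6 423.9

# Input without the 'diameter' column triggers the check
bad_input <- data.frame(width = 1, height = 2)
error_message <- tryCatch(
  calc_volume(bad_input),
  error = function(e) conditionMessage(e)
)
error_message
## [1] "The columns 'diameter' and 'height' must be present in the data frame."
```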

It’s recommended to always document your functions. There is a package called roxygen2 that automatically generates manual pages for documenting functions in a package. It requires proper syntax, as shown below:

#' Calculate the volume of cylinders based on diameter and height
#'
#' @param data_frame A data frame containing the columns "diameter" and "height".
#'
#' @return A numeric vector with the calculated volumes for each cylinder.
#'
#' @details
#' Calculates the volume using the formula:
#' \deqn{Volume = \pi \times \left(\frac{diameter}{2}\right)^2 \times height}
#' If the required columns are missing, the function will stop with an error.
#'
#' @examples
#' df <- data.frame(diameter = c(4, 6), height = c(10, 15))
#' calc_volume(df)
#' # [1] 125.6 423.9
#'
#' @note
#' The function uses 3.14 as an approximation for pi. For higher precision, consider replacing 3.14 with the built-in `pi` constant.
calc_volume <- function(data_frame) {
  if (!all(c("diameter", "height") %in% colnames(data_frame))) {
    stop("The columns 'diameter' and 'height' must be present in the data frame.")
  }

  volume <- 3.14 * ((data_frame$diameter / 2)^2) * data_frame$height
  return(volume)
}

If this function is part of a package, you just need to run the command roxygen2::roxygenise() to create the help page. See an example of an R package structure at https://github.com/Breeding-Insight/BIGr

apply Function Family

The apply family of functions can also be used as repetition structures. Their syntax is more concise compared to for or while and can simplify code writing.

apply

The apply function is the base of its family, so understanding it is essential. Its syntax is: apply(X, MARGIN, FUN, ...), where X is an array (including matrices), MARGIN is 1 to apply to rows, 2 to columns, and c(1,2) to both; FUN is the function to apply.

Simple matrix example:

ex_mat <- matrix(seq(0, 21, 3), nrow = 2)

Sum of columns:

apply(ex_mat, 2, sum)
## [1]  3 15 27 39

Sum of rows:

apply(ex_mat, 1, sum)
## [1] 36 48

Equivalent using for loops:

for (i in 1:dim(ex_mat)[2]) {
  print(sum(ex_mat[, i]))
}
## [1] 3
## [1] 15
## [1] 27
## [1] 39
for (i in 1:dim(ex_mat)[1]) {
  print(sum(ex_mat[i, ]))
}
## [1] 36
## [1] 48
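
For the specific case of sums, base R also offers the vectorized functions colSums() and rowSums(), which return the same values as the apply() calls above:

```r
ex_mat <- matrix(seq(0, 21, 3), nrow = 2)

col_totals <- colSums(ex_mat)
col_totals
## [1]  3 15 27 39

row_totals <- rowSums(ex_mat)
row_totals
## [1] 36 48
```

colMeans() and rowMeans() work the same way for averages.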

Example using a custom function:

zscore <- function(vector) {
  result <- (vector - mean(vector)) / sd(vector)
  return(result)
}

row_results <- apply(ex_mat, 1, function(x) zscore(x))
column_results <- apply(ex_mat, 2, function(x) zscore(x))

lapply

Unlike apply, lapply can take vectors and lists (mainly used with lists) and returns a list.

ex_list <- list(
  A = matrix(seq(0, 21, 3), nrow = 2),
  B = matrix(seq(0, 14, 2), nrow = 2),
  C = matrix(seq(0, 39, 5), nrow = 2)
)
str(ex_list)
## List of 3
##  $ A: num [1:2, 1:4] 0 3 6 9 12 15 18 21
##  $ B: num [1:2, 1:4] 0 2 4 6 8 10 12 14
##  $ C: num [1:2, 1:4] 0 5 10 15 20 25 30 35

Select the second element of each matrix (matrices are stored column by column, so this is the value in row 2, column 1):

lapply(ex_list, "[", 2)
## $A
## [1] 3
## 
## $B
## [1] 2
## 
## $C
## [1] 5

Using a custom function:

results <- lapply(ex_list, function(x) apply(x, 1, zscore))
results
## $A
##            [,1]       [,2]
## [1,] -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983
## [3,]  0.3872983  0.3872983
## [4,]  1.1618950  1.1618950
## 
## $B
##            [,1]       [,2]
## [1,] -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983
## [3,]  0.3872983  0.3872983
## [4,]  1.1618950  1.1618950
## 
## $C
##            [,1]       [,2]
## [1,] -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983
## [3,]  0.3872983  0.3872983
## [4,]  1.1618950  1.1618950

sapply

sapply is a variant of lapply that tries to simplify the output into a vector, matrix, or array.

sapply(ex_list, "[", 2)
## A B C 
## 3 2 5
results <- sapply(ex_list, function(x) apply(x, 1, zscore))
results
##               A          B          C
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,]  0.3872983  0.3872983  0.3872983
## [4,]  1.1618950  1.1618950  1.1618950
## [5,] -1.1618950 -1.1618950 -1.1618950
## [6,] -0.3872983 -0.3872983 -0.3872983
## [7,]  0.3872983  0.3872983  0.3872983
## [8,]  1.1618950  1.1618950  1.1618950
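
The shape of the sapply output depends on what the function returns for each element. The following sketch shows the three typical cases; the object names are hypothetical:

```r
ex_list <- list(
  A = matrix(seq(0, 21, 3), nrow = 2),
  B = matrix(seq(0, 14, 2), nrow = 2),
  C = matrix(seq(0, 39, 5), nrow = 2)
)

# One value per element: the result is a named vector
totals <- sapply(ex_list, sum)
totals
##   A   B   C 
##  84  56 140

# Same-length vectors per element: the result is a matrix (one column per element)
dims <- sapply(ex_list, dim)
dims
##      A B C
## [1,] 2 2 2
## [2,] 4 4 4

# Different lengths per element: sapply cannot simplify and returns a list
doubled <- sapply(list(a = 1:2, b = 1:3), function(v) v * 2)
is.list(doubled)
## [1] TRUE
```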

tapply

tapply applies functions based on the levels of a categorical variable (factor), commonly used with data frames.

dados$Affiliation <- as.factor(dados$Affiliation)

dados$KnowledgeR_num <- NA
dados$KnowledgeR_num[dados$KnowledgeR == "Advanced"] <- 3
dados$KnowledgeR_num[dados$KnowledgeR == "Intermediate"] <- 2
dados$KnowledgeR_num[dados$KnowledgeR == "Beginner (some knowledge)"] <- 1
dados$KnowledgeR_num[dados$KnowledgeR == "No R knowledge"] <- 0
dados$KnowledgeR_num[dados$KnowledgeR == ""] <- NA

tapply(dados$KnowledgeR_num, dados$Affiliation, mean)
##                                                       Agronomic Engineer 
##                                                                  1.00000 
##                                                         Breeding Insight 
##                                                                  1.75000 
##                                                     CIA Central Pecuario 
##                                                                  0.00000 
##                                                                     INTA 
##                                                                  1.00000 
## National Institute of Innovation and Transfer in Agricultural Technology 
##                                                                  2.00000 
##                                                              Rcourse2021 
##                                                                  1.87069
tapply(dados$KnowledgeR_num, dados$Affiliation, function(x) mean(x, na.rm = TRUE))
##                                                       Agronomic Engineer 
##                                                                  1.00000 
##                                                         Breeding Insight 
##                                                                  1.75000 
##                                                     CIA Central Pecuario 
##                                                                  0.00000 
##                                                                     INTA 
##                                                                  1.00000 
## National Institute of Innovation and Transfer in Agricultural Technology 
##                                                                  2.00000 
##                                                              Rcourse2021 
##                                                                  1.87069
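
Both calls return the same values here, which suggests that no group contains missing values in this dataset. The small sketch below uses a hypothetical data frame, toy, to show when na.rm = TRUE makes a difference:

```r
# Hypothetical data: group B has one missing score
toy <- data.frame(
  group = factor(c("A", "A", "B", "B")),
  score = c(1, 3, 2, NA)
)

# Without na.rm, a single NA makes the group mean NA
with_na <- tapply(toy$score, toy$group, mean)
with_na
##  A  B 
##  2 NA

# With na.rm = TRUE, the mean is computed from the remaining values
without_na <- tapply(toy$score, toy$group, function(x) mean(x, na.rm = TRUE))
without_na
## A B 
## 2 2
```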

mapply

mapply is a multivariate version of sapply: it applies a function element by element to several vectors in parallel (the first elements together, then the second elements, and so on).

sum_fun <- function(x, y) {
  return(x + y)
}
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
result_mapply <- mapply(sum_fun, vector1, vector2)
print(result_mapply)
## [1] 5 7 9
multiply <- function(x, y, z) {
  return(x * y * z)
}
vector3 <- c(7, 8, 9)
result_mapply <- mapply(multiply, vector1, vector2, vector3)
print(result_mapply)
## [1]  28  80 162
result_mapply <- mapply(function(x, y) {
  return(c(sum = x + y, product = x * y))
}, vector1, vector2)
print(result_mapply)
##         [,1] [,2] [,3]
## sum        5    7    9
## product    4   10   18
sum_product <- function(x, y) {
  return(c(sum = x + y, product = x * y))
}
result_mapply <- mapply(sum_product, vector1, vector2)
print(result_mapply)
##         [,1] [,2] [,3]
## sum        5    7    9
## product    4   10   18
sum_product <- function(x, y, z) {
  return(c(sum = x + y + z, product = x * y * z))
}
result_mapply <- mapply(sum_product, vector1, vector2, vector3)
print(result_mapply)
##         [,1] [,2] [,3]
## sum       12   15   18
## product   28   80  162
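
When one argument should stay the same for every element, it can be passed through the MoreArgs argument of mapply instead of being repeated in a vector. A minimal sketch with a hypothetical function, scale_sum, and an arbitrary multiplier of 10:

```r
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)

# multiplier is a constant, not a vector to iterate over
scale_sum <- function(x, y, multiplier) {
  return((x + y) * multiplier)
}

result_more_args <- mapply(scale_sum, vector1, vector2,
  MoreArgs = list(multiplier = 10)
)
print(result_more_args)
## [1] 50 70 90
```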

If you’re ahead of your classmates, you can go ahead and do the Extra session exercises. If not, do them at home and send us your questions via the forum.

Long and Wide Format

In data analysis and visualization with R, especially using the tidyverse, the structure of your data matters a lot. The tidyverse is a collection of R packages designed for data science, and it emphasizes the use of tidy data principles. Tidy data is a standardized way of organizing data that makes it easier to work with. Here is a list of all packages in the tidyverse:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
tidyverse_packages()
##  [1] "broom"         "conflicted"    "cli"           "dbplyr"       
##  [5] "dplyr"         "dtplyr"        "forcats"       "ggplot2"      
##  [9] "googledrive"   "googlesheets4" "haven"         "hms"          
## [13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
## [17] "modelr"        "pillar"        "purrr"         "ragg"         
## [21] "readr"         "readxl"        "reprex"        "rlang"        
## [25] "rstudioapi"    "rvest"         "stringr"       "tibble"       
## [29] "tidyr"         "xml2"          "tidyverse"

There are two main formats for organizing data:

  • Wide format: one row per subject with multiple columns for repeated measures.
  • Long format: one row per observation, making it tidy and compatible with tools like ggplot2.

Let’s consider a dataset showing populations (in millions) of 10 countries across three years:

# Wide format data (realistic population estimates in millions)
data_wide <- tibble(
  country = c(
    "USA", "Canada", "Mexico", "Brazil", "Costa Rica",
    "Uruguay", "China", "Japan", "India", "Greenland"
  ),
  `2000` = c(282, 31, 98, 174, 4, 3.3, 1267, 127, 1050, 0.056),
  `2010` = c(309, 34, 112, 196, 4.5, 3.4, 1340, 128, 1230, 0.057),
  `2020` = c(331, 38, 126, 213, 5, 3.5, 1402, 126, 1380, 0.056)
)

data_wide
## # A tibble: 10 × 4
##    country      `2000`   `2010`   `2020`
##    <chr>         <dbl>    <dbl>    <dbl>
##  1 USA         282      309      331    
##  2 Canada       31       34       38    
##  3 Mexico       98      112      126    
##  4 Brazil      174      196      213    
##  5 Costa Rica    4        4.5      5    
##  6 Uruguay       3.3      3.4      3.5  
##  7 China      1267     1340     1402    
##  8 Japan       127      128      126    
##  9 India      1050     1230     1380    
## 10 Greenland     0.056    0.057    0.056

Use pivot_longer() to convert from wide to long format:

data_long <- pivot_longer(data_wide,
  cols = -country,
  names_to = "year",
  values_to = "population"
)

data_long
## # A tibble: 30 × 3
##    country year  population
##    <chr>   <chr>      <dbl>
##  1 USA     2000         282
##  2 USA     2010         309
##  3 USA     2020         331
##  4 Canada  2000          31
##  5 Canada  2010          34
##  6 Canada  2020          38
##  7 Mexico  2000          98
##  8 Mexico  2010         112
##  9 Mexico  2020         126
## 10 Brazil  2000         174
## # ℹ 20 more rows
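
Note that year is stored as text (<chr>) because it comes from the column names. If a numeric year is needed, for example for calculations, one option is to convert it after pivoting. The sketch below uses a reduced, hypothetical tibble (small_wide) so that it runs on its own:

```r
library(tibble)
library(tidyr)
library(dplyr)

# Reduced example with two countries and two years
small_wide <- tibble(
  country = c("USA", "Canada"),
  `2000` = c(282, 31),
  `2010` = c(309, 34)
)

small_long <- small_wide %>%
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "population"
  ) %>%
  mutate(year = as.integer(year)) # convert "2000" (text) to 2000 (integer)

small_long
```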

Note that our data was created with tibble(), so it is stored as a tibble rather than a base data.frame. Tibbles are a modern version of data.frames, designed to be easier to use. They are provided by the tibble package, which is part of the tidyverse, and are often used in data analysis. Here are some practical differences between them:

Feature | data.frame | tibble (from tibble package)
Base or Tidyverse | Base R | Part of the tidyverse
Printing | Prints entire dataset (can be large) | Prints a preview (10 rows, fitted columns)
Column types | May convert types automatically (e.g. strings to factors in R versions before 4.0) | No automatic type conversion
Subsetting | df[, 1] may return a vector | tibble[, 1] always returns a tibble
Row names | Always has row names | Doesn’t use row names
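
The subsetting difference is the one most likely to affect your code. A short sketch with two hypothetical objects, df and tb, holding the same data:

```r
library(tibble)

df <- data.frame(a = 1:3, b = c("x", "y", "z"))
tb <- tibble(a = 1:3, b = c("x", "y", "z"))

# Selecting one column of a data.frame with [ , ] drops to a vector
class(df[, 1])
## [1] "integer"

# The same call on a tibble keeps the tibble structure
class(tb[, 1])
## [1] "tbl_df"     "tbl"        "data.frame"

# To keep a data.frame, use drop = FALSE; to get a vector from a tibble, use [[ ]]
class(df[, 1, drop = FALSE])
## [1] "data.frame"
class(tb[[1]])
## [1] "integer"
```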

Most tidyverse functions, especially ggplot2, work best with long (tidy) data.

Wide format tables are usually easier to read if you intend to export them to a CSV or Excel file. You can convert long data to wide format with pivot_wider():

data_wide_back <- pivot_wider(data_long,
  names_from = year,
  values_from = population
)

data_wide_back
## # A tibble: 10 × 4
##    country      `2000`   `2010`   `2020`
##    <chr>         <dbl>    <dbl>    <dbl>
##  1 USA         282      309      331    
##  2 Canada       31       34       38    
##  3 Mexico       98      112      126    
##  4 Brazil      174      196      213    
##  5 Costa Rica    4        4.5      5    
##  6 Uruguay       3.3      3.4      3.5  
##  7 China      1267     1340     1402    
##  8 Japan       127      128      126    
##  9 India      1050     1230     1380    
## 10 Greenland     0.056    0.057    0.056

Introduction to pipe use

The pipe operator (%>%) is a powerful tool in R, especially when using the tidyverse package. It allows you to chain together multiple functions in a clear and readable way. Instead of nesting functions within each other, you can use the pipe to pass the output of one function directly into the next. Here is an example:

data_wide_back <- data_long %>%
  pivot_wider(
    names_from = year,
    values_from = population
  )

data_wide_back
## # A tibble: 10 × 4
##    country      `2000`   `2010`   `2020`
##    <chr>         <dbl>    <dbl>    <dbl>
##  1 USA         282      309      331    
##  2 Canada       31       34       38    
##  3 Mexico       98      112      126    
##  4 Brazil      174      196      213    
##  5 Costa Rica    4        4.5      5    
##  6 Uruguay       3.3      3.4      3.5  
##  7 China      1267     1340     1402    
##  8 Japan       127      128      126    
##  9 India      1050     1230     1380    
## 10 Greenland     0.056    0.057    0.056

The usage of the pipe operator makes sense when you have a sequence of operations to perform on a dataset. Let’s explore some other tidyverse functions that can be used with the pipe operator.

data_wide_back <- data_long %>%
  pivot_wider(
    names_from = year,
    values_from = population
  ) %>%
  mutate(total_population = `2000` + `2010` + `2020`) %>%  # mutate function will add a new column
  arrange(desc(total_population))                          # arrange function will sort the data by the new column, descending
data_wide_back
## # A tibble: 10 × 5
##    country      `2000`   `2010`   `2020` total_population
##    <chr>         <dbl>    <dbl>    <dbl>            <dbl>
##  1 China      1267     1340     1402             4009    
##  2 India      1050     1230     1380             3660    
##  3 USA         282      309      331              922    
##  4 Brazil      174      196      213              583    
##  5 Japan       127      128      126              381    
##  6 Mexico       98      112      126              336    
##  7 Canada       31       34       38              103    
##  8 Costa Rica    4        4.5      5               13.5  
##  9 Uruguay       3.3      3.4      3.5             10.2  
## 10 Greenland     0.056    0.057    0.056            0.169

The mutate function is used to create new variables or modify existing ones, while the arrange function is used to sort the data frame by one or more variables. Other useful functions include filter (to filter rows based on conditions), and select (to select specific columns).

data_wide_back <- data_long %>%
  pivot_wider(
    names_from = year,
    values_from = population
  ) %>%
  mutate(total_population = `2000` + `2010` + `2020`) %>%
  arrange(desc(total_population)) %>%
  filter(total_population > 1000) %>% # Filter countries with total population greater than 1000
  select(country, total_population)   # Select only the country and total population columns
data_wide_back
## # A tibble: 2 × 2
##   country total_population
##   <chr>              <dbl>
## 1 China               4009
## 2 India               3660

Using the long format, we can also summarize the data using the summarise function. This is useful for calculating summary statistics like mean, median, or total population by year.

data_summary <- data_long %>%
  group_by(year) %>%
  summarise(
    total_population = sum(population, na.rm = TRUE),
    avg_population = mean(population, na.rm = TRUE),
    max_population = max(population, na.rm = TRUE)
  )

data_summary
## # A tibble: 3 × 4
##   year  total_population avg_population max_population
##   <chr>            <dbl>          <dbl>          <dbl>
## 1 2000             3036.           304.           1267
## 2 2010             3357.           336.           1340
## 3 2020             3625.           362.           1402

The %>% pipe comes from the magrittr package and became popular through dplyr and the rest of the tidyverse. Since R 4.1.0, base R also provides a native pipe operator, |>, which works similarly to %>%.

The main difference is that the base R pipe (|>) works without loading any package, whereas %>% requires magrittr (which is loaded automatically with the tidyverse). Another difference is that %>% lets you use the placeholder . to specify where the input should go in the next function. This is particularly useful when the input is not the first argument of the function:

data_wide %>%
  lm(`2020` ~ `2000`, data = .) %>%
  summary()
## 
## Call:
## lm(formula = `2020` ~ `2000`, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.426   -4.807   -1.463    3.357  130.481 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.5810    23.1470   0.068    0.947    
## `2000`        1.1885     0.0434  27.384 3.41e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.18 on 8 degrees of freedom
## Multiple R-squared:  0.9894, Adjusted R-squared:  0.9881 
## F-statistic: 749.9 on 1 and 8 DF,  p-value: 3.409e-09

The lm function is used to fit a linear model, and the summary function provides a summary of the fitted model. Note that the . placeholder indicates that the input data from the previous step should be used as the data argument in the lm function.

The base R pipe operator does not support the . placeholder. Instead, you can wrap an anonymous function in parentheses to specify where the input should go:

data_wide |>
  (\(df) lm(`2020` ~ `2000`, data = df))() |>
  summary()
## 
## Call:
## lm(formula = `2020` ~ `2000`, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.426   -4.807   -1.463    3.357  130.481 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.5810    23.1470   0.068    0.947    
## `2000`        1.1885     0.0434  27.384 3.41e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.18 on 8 degrees of freedom
## Multiple R-squared:  0.9894, Adjusted R-squared:  0.9881 
## F-statistic: 749.9 on 1 and 8 DF,  p-value: 3.409e-09
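
Starting with R 4.2.0, the base pipe also accepts an underscore (_) as a placeholder, provided it is passed to a named argument. The sketch below assumes R 4.2.0 or later and uses a hypothetical data frame, pop, with the 2000 and 2020 values from the table above stored under the names y2000 and y2020:

```r
# Requires R >= 4.2.0 (the _ placeholder does not parse in older versions)
pop <- data.frame(
  y2000 = c(282, 31, 98, 174, 4, 3.3, 1267, 127, 1050, 0.056),
  y2020 = c(331, 38, 126, 213, 5, 3.5, 1402, 126, 1380, 0.056)
)

# _ marks where the piped data goes; it must be used with a named argument
fit <- pop |>
  lm(y2020 ~ y2000, data = _)

coef(fit)
```

The estimated coefficients should match the ones in the summaries above, since the data are the same.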

Introduction to ggplot2

ggplot2 is a powerful and flexible package for creating visualizations in R. It is part of the tidyverse collection of packages and is widely used for data visualization. The main idea behind ggplot2 is to create graphics based on the Grammar of Graphics, a theoretical framework for data visualization that breaks down a graphic into a set of independent, structured components. In ggplot2 you will find the following main components:

  • Data: The dataset you’re using.
  • Aesthetics (aes): The visual properties (like position, color, size) that map to the data.
  • Geometries (geom): The types of visual elements in a plot, like points, lines, bars, etc.
  • Statistics (stat): Statistical transformations, like smoothing or binning, applied to data.
  • Scales (scale): Adjustments for mapping data to aesthetics (e.g., color scales).
  • Coordinates (coord): The coordinate system, such as Cartesian or polar.
  • Facets (facet): Splitting the data into subsets to create multiple panels (like creating small multiples).

Here’s a breakdown of the key elements in the Grammar of Graphics:

Data

The data is the foundation of any plot. It contains the variables you want to visualize. In ggplot2, you specify the data using the data argument:

ggplot(data = data_long)

Aesthetics (aes)

Aesthetics define how the data maps to visual properties of the plot, like:

  • x and y position (x, y)
  • color (color)
  • size (size)
  • shape (shape)
  • fill (fill)

For example:

ggplot(data = data_long, aes(x = year, y = population))

Here:

  • x = year (horizontal axis)
  • y = population (vertical axis)

Geometries (geom)

Geometries are the visual elements of a plot. Different types of geometries allow you to create different kinds of plots:

  • geom_point() for scatter plots
  • geom_line() for line plots
  • geom_bar() for bar charts
  • geom_histogram() for histograms
  • geom_boxplot() for boxplots

For example:

ggplot(data = data_long, aes(x = year, y = population)) +
  geom_point() 

ggplot(data = data_long, aes(x = year, y = population, group = country)) +
  geom_line() + geom_point()

# add color by country
ggplot(data = data_long, aes(x = year, y = population, group = country, color = country)) +
  geom_line() 

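The first plot is a scatter plot in which each point represents one country in one year. The second connects the points of each country with a line, and the third colors the lines by country.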

Statistics (stat)

Some plots require statistical transformations, such as smoothing or binning. You can apply these transformations using the stat_* functions.

We can also apply statistical summaries. For example, plot the mean population across countries for each year:

ggplot(data_long, aes(x = year, y = population)) +
  stat_summary(fun = mean, geom = "line", group = 1) +
  labs(title = "Average Population Trend")
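
For an actual smoothed or fitted trend, geom_smooth() can be used, but it needs a numeric x axis. The sketch below is an illustration under assumptions: it uses a reduced, hypothetical data frame (pop_long) with three of the countries above, stores year as a number, and fits a straight line per country with method = "lm":

```r
library(ggplot2)

# Reduced data with a numeric year column
pop_long <- data.frame(
  country = rep(c("USA", "Brazil", "Japan"), each = 3),
  year = rep(c(2000, 2010, 2020), times = 3),
  population = c(282, 309, 331, 174, 196, 213, 127, 128, 126)
)

trend_plot <- ggplot(pop_long, aes(x = year, y = population, color = country)) +
  geom_point() +
  # one fitted straight line per country, without the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Linear Population Trend by Country")

trend_plot
```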

Scales (scale)

Modify how data maps to visual aesthetics (e.g., color):

ggplot(data_long, aes(x = year, y = population, group = country, color = country)) +
  geom_line(linewidth = 1.2) + # linewidth replaces size for lines since ggplot2 3.4.0
  scale_color_manual(values = c("China" = "red", "India" = "orange", "USA" = "blue")) +
  labs(title = "Population by Country (Selected Colors)")

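Here, scale_color_manual() customizes the color mapping for country. Countries without an assigned color are drawn in the scale’s default color for missing values (grey). You can also get ready-to-use palettes from the viridis package, which is color-blind friendly: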

library(viridis)
## Loading required package: viridisLite
ggplot(data_long, aes(x = year, y = population, group = country, color = country)) +
  geom_line(linewidth = 1.2) +
  scale_color_viridis_d() +
  labs(title = "Population by Country (Viridis Colors)")

This uses the viridis color palette, which is designed to be perceptually uniform and color-blind friendly.

Coordinates (coord)

The coordinate system defines the layout of the plot. The most common system is Cartesian (x and y axes), but you can also use polar coordinates, or transform the axes.

For example:

ggplot(data_long %>% filter(year == "2020"), aes(x = country, y = population)) +
  geom_col() +
  coord_flip() +
  labs(title = "Population by Country in 2020 (Flipped)")

This flips the axes, so that the x-axis becomes the y-axis and vice versa.

Facets (facet)

Faceting allows you to split a plot into multiple panels, which is useful for comparing subsets of the data. Facets can be done by rows or columns.

ggplot(data_long, aes(x = year, y = population)) +
  geom_line(group = 1) +
  facet_wrap(~ country) +
  labs(title = "Population Trend by Country (Faceted)")
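
Because the countries differ greatly in size, a shared y axis makes the smaller ones look flat. One option is the scales argument of facet_wrap(), which gives each panel its own y axis. The sketch below uses a reduced, hypothetical data frame (pop_long) with the values for China and Greenland from the table above:

```r
library(ggplot2)

pop_long <- data.frame(
  country = rep(c("China", "Greenland"), each = 3),
  year = rep(c("2000", "2010", "2020"), times = 2),
  population = c(1267, 1340, 1402, 0.056, 0.057, 0.056)
)

facet_plot <- ggplot(pop_long, aes(x = year, y = population)) +
  geom_line(group = 1) +
  # scales = "free_y" lets each panel choose its own y-axis range
  facet_wrap(~ country, scales = "free_y") +
  labs(title = "Population Trend by Country (Free y Axes)")

facet_plot
```

Free scales make trends within each panel easier to see, but values can no longer be compared directly across panels.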

Labels and Themes

You can customize the plot with labels and themes. Labels include titles, axis labels, and legends. Themes control the overall appearance of the plot.

ggplot(data_long, aes(x = year, y = population, group = country, color = country)) +
  geom_line(linewidth = 1.2) +
  labs(
    title = "Population Over Time by Country",
    subtitle = "Based on simulated data (millions)",
    x = "Year",
    y = "Population (Millions)",
    caption = "Data source: Simulated"
  ) +
  theme_minimal()

You can customize fonts and style:

ggplot(data_long, aes(x = year, y = population, color = country, group = country)) +
  geom_line(linewidth = 1.2) +
  labs(
    title = "Population Over Time by Country",
    subtitle = "Based on simulated data (millions)",
    x = "Year",
    y = "Population (Millions)",
    caption = "Data source: Simulated"
  ) +
  theme_minimal() + theme(
    plot.title = element_text(size = 18, face = "bold"),
    axis.title = element_text(size = 14),
    legend.title = element_text(size = 12)
  )
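When several plots should share the same look, the theme settings can be stored in an object and added to each plot. The sketch below assumes a made-up data frame (`toy`) and a hypothetical object name (`course_theme`); neither comes from the course materials.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  year = rep(c(2000, 2010, 2020), times = 3),
  population = c(10, 12, 15, 20, 22, 25, 5, 6, 8),
  country = rep(c("A", "B", "C"), each = 3)
)

# Store a complete theme plus a few overrides in one reusable object
course_theme <- theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    axis.title = element_text(size = 14),
    legend.position = "bottom"
  )

# Add the stored theme to any plot
p_themed <- ggplot(toy, aes(x = year, y = population, color = country)) +
  geom_line() +
  labs(title = "Toy data with a reusable theme", x = "Year", y = "Population") +
  course_theme

print(p_themed)
```

Keeping the theme in one place means a later style change only needs to be made once.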

Creating map plots with ggplot2

# Load necessary datasets
dados <- read.csv("https://breeding-insight.github.io/learn-hub/r-intro/data/dados_2025.csv")

colnames(dados) <- c("Date", "Affiliation", "Longitude", "Latitude", "Background", "Present_Occupation", "Explain", "KnowledgeR")

# Create quantitative measure
dados$KnowledgeR_num <- NA
dados$KnowledgeR_num[dados$KnowledgeR == "Advanced"] <- 3
dados$KnowledgeR_num[dados$KnowledgeR == "Intermediate"] <- 2
dados$KnowledgeR_num[dados$KnowledgeR == "Beginner (some knowledge)"] <- 1
dados$KnowledgeR_num[dados$KnowledgeR == "No R knowledge"] <- 0
dados$KnowledgeR_num[dados$KnowledgeR == ""] <- NA


# Load the world map data (map_data() requires the maps package to be installed)
world_map <- map_data("world")

dados$Latitude <- as.numeric(dados$Latitude)
## Warning: NAs introduced by coercion
dados$Longitude <- as.numeric(dados$Longitude)
## Warning: NAs introduced by coercion
# Create the plot
ggplot() +
  geom_polygon(data = world_map, aes(x = long, y = lat, group = group), fill = "lightgray", color = "white") +
  # Plot the student locations
  geom_point(data = dados, aes(x = Longitude, y = Latitude), alpha = 0.7) +
  labs(title = "R course students", x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(legend.position = "bottom")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
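The coercion warnings and the removed rows above are related: as.numeric() returns NA for any entry it cannot parse as a number, and geom_point() drops rows whose coordinates are missing. The sketch below reproduces this with a made-up data frame (`students`, with invented values) and shows one way to count and remove such rows before plotting.

```r
# Hypothetical survey answers stored as text, used only for illustration
students <- data.frame(
  Longitude = c("-47.6", "12.5", "unknown", ""),
  Latitude  = c("-22.7", "41.9", "40.7", "n/a"),
  stringsAsFactors = FALSE
)

# Entries that are not valid numbers become NA
# (warnings are suppressed here because the NAs are expected)
students$Longitude <- suppressWarnings(as.numeric(students$Longitude))
students$Latitude  <- suppressWarnings(as.numeric(students$Latitude))

# Flag rows where either coordinate is missing
bad_rows <- !complete.cases(students[, c("Longitude", "Latitude")])
sum(bad_rows)

# Keep only the rows that can be placed on a map
students_clean <- students[!bad_rows, ]
students_clean
```

Counting the flagged rows before plotting makes it clear how many observations will be missing from the map.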

# Color by Present Occupation
ggplot() +
  geom_polygon(data = world_map, aes(x = long, y = lat, group = group), fill = "lightgray", color = "white") +
  # Plot the student locations, colored by present occupation
  geom_point(data = dados, aes(x = Longitude, y = Latitude, color = Present_Occupation), alpha = 0.7) +
  labs(title = "R course students", x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(legend.position = "bottom")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Use a color-blind-friendly palette
ggplot() +
  geom_polygon(data = world_map, aes(x = long, y = lat, group = group), fill = "lightgray", color = "white") +
  # Plot the student locations, colored by present occupation
  geom_point(data = dados, aes(x = Longitude, y = Latitude, color = Present_Occupation), alpha = 0.7) +
  scale_colour_viridis_d() +
  labs(title = "R course students", x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(legend.position = "bottom")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
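The numeric recoding of KnowledgeR earlier in this section assigns one level per line. An alternative sketch, using a named lookup vector, is shown below. The vectors `knowledge_levels` and `answers` are hypothetical names with invented example values; the level labels are the ones used in the recoding above.

```r
# Named lookup vector: names are the survey answers, values are the scores
knowledge_levels <- c(
  "No R knowledge" = 0,
  "Beginner (some knowledge)" = 1,
  "Intermediate" = 2,
  "Advanced" = 3
)

# Hypothetical answers, including an empty one
answers <- c("Advanced", "No R knowledge", "", "Intermediate",
             "Beginner (some knowledge)")

# Indexing by name looks up each answer; unmatched or empty answers give NA
scores <- unname(knowledge_levels[answers])
scores
```

With the course data, the same idea would read `dados$KnowledgeR_num <- unname(knowledge_levels[dados$KnowledgeR])`, assuming KnowledgeR is a character column.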