Introduction to R Workshop

Author

Jeffrey Girard

Published

April 3, 2025

Overview

This one-hour workshop is organized by the Kansas Data Science Consortium (KDSC) and strives to provide a brief and gentle introduction to the R programming language.

Instructor

The instructor of this workshop, Dr. Jeffrey Girard, is a professor in the Department of Psychology at the University of Kansas and one of the three co-directors of the KDSC. He runs the AffCOM Research Lab, which applies computational approaches to study emotion, personality, and mental health. He also teaches (paid) statistics and data science workshops via his company, SMaRT Workshops.

Workshop Outline

Installing R and RStudio
Learning the R Language
Working with Data Frames
Preview of Future Workshops

Installing R and RStudio

You need to install R and, although doing so is optional, I recommend also installing RStudio. Both are freely available! Once they are installed, using the installers linked below, you will always open RStudio to access R. That is, R needs to be on your computer, but you won’t be accessing it directly.

Installing R

On Windows, click here to download the appropriate R installer.
On a newer Mac with Apple chips (from 2020 to now), click here to download the appropriate R installer.
On an older Mac with Intel chips (from 2006 to 2020), click here to download the appropriate R installer.

Installing RStudio

On Windows, click here to download the appropriate installer.
On any Mac, click here to download the appropriate installer.

Learning the R Language

R is controlled by commands that you type into RStudio’s Console. It takes some time to learn how to properly phrase commands, but doing so opens up many exciting possibilities for you! You typically type in one command at a time and hit Enter to get an immediate response from R. It’s like a chat room with your computer.

R as a Calculator

At its most basic level, you can use the R console as a calculator. In the following grey box, you can see the command I entered and R’s response (following #>).

(9.5 + 3.75 - 1.25) * (3 / 2)^2
#> [1] 27

Note that R uses periods (.) to mark decimals and doesn’t like commas or spaces inside numbers. So type numbers out like 1234567.89 and not like 1,234,567.89 or 1 234 567,89.
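As a small additional sketch (these particular numbers are my own illustration, not part of the original materials), the calculator follows the usual order of operations, and parentheses can be used to change that order.

```r
# Multiplication happens before addition
2 + 3 * 4
#> [1] 14

# Parentheses change the order of operations
(2 + 3) * 4
#> [1] 20

# The ^ operator raises a number to a power
2^3
#> [1] 8

# Large numbers are typed without commas or spaces
1000000 / 4
#> [1] 250000
```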

Creating Objects

Just like creating variables in algebra, we can store information in temporary objects in R. For instance, we can tell R that x refers to the number 57.23. We do this using the <- operator, which we read as “x gets 57.23.” Any time we want to see what information is stored in an object, we can send a command to R including only its name; this is called printing it.

x <- 57.23
x
#> [1] 57.23

The main purpose of creating objects is to use them in future commands. This can be very handy.

(x - 32) * (5 / 9)
#> [1] 14.01667

We can also give objects more descriptive names, which makes them easier to remember and understand. There are several rules about what names are allowed to be. For now, let’s stick with simple names that only include letters (without numbers, spaces, or symbols) as doing so will never break those rules. Below, we can see how descriptive names clarify what the previous (ambiguous) operation was doing: converting a temperature from Fahrenheit to Celsius.

tempF <- 57.23
tempC <- (tempF - 32) * (5 / 9)
tempC
#> [1] 14.01667
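To illustrate how stored objects can be reused, here is a minimal sketch that converts the Celsius value back to Fahrenheit and then gives tempF a new value. The name backToF and the value 98.6 are my own choices for illustration.

```r
tempF <- 57.23
tempC <- (tempF - 32) * (5 / 9)

# Convert back to Fahrenheit using the stored Celsius value
backToF <- tempC * (9 / 5) + 32
backToF
#> [1] 57.23

# Assigning to an existing name replaces the old value
tempF <- 98.6
tempF
#> [1] 98.6

# tempC was calculated earlier, so it does not update on its own
tempC
#> [1] 14.01667
```

If we wanted tempC to reflect the new tempF, we would need to run the conversion command again.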

Using Functions

R has the ability to transform objects in many useful ways. This is usually accomplished using functions, which I like to think of as recipes (e.g., in baking). A recipe calls for ingredients and includes a series of steps for transforming those ingredients into a tasty treat. Similarly, functions in R call for inputs and include a series of steps for transforming those inputs into an output. The telltale sign that a function is being used in R is text followed by parentheses. Continuing the baking analogy, I think of these parentheses as the edges of a mixing bowl that you put the ingredients/inputs into. For example, we can use the sqrt() function to calculate the (principal) square root of an input number (i.e., x).

sqrt(x = 25)
#> [1] 5

There are many useful functions built into R! Another one that will help us learn the basics is the round() function, which will round the input number to a specified number of digits (i.e., decimal places). In this case, we need two inputs: the number to round (i.e., x) and the number of digits to round to (i.e., digits). We can accomplish this by separating each ingredient with a comma.

round(x = 2/3, digits = 4)
#> [1] 0.6667

The digits input allows us to configure the function’s behavior. This means that R can have a single round() function and doesn’t need to have separate functions for rounding to, e.g., 1 or 2 or 5 digits.

round(x = 2/3, digits = 2)
#> [1] 0.67
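A few related patterns are worth sketching here, using only sqrt() and round() from above: inputs can be matched by position rather than by name, an optional input can be left out so that its default is used, and the output of one function can become the input of another.

```r
# Inputs can be matched by position instead of by name
round(2/3, 2)
#> [1] 0.67

# If digits is left out, round() uses its default of 0
round(x = 2/3)
#> [1] 1

# The output of one function can be the input to another
round(x = sqrt(x = 2), digits = 3)
#> [1] 1.414
```

Naming the inputs takes more typing, but it makes commands easier to read while you are still learning.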

Vector Objects

One function that we use very frequently is the c() function. Note that the name is lowercase c, which I think of as being short for “collect.” It groups multiple similar objects (e.g., numbers) into a single collection object, which we refer to as a vector. Note that this function can take any number of inputs, none of which needs an individual name.

Collecting objects into a vector not only keeps those objects together but also lets us do two useful things with them. To showcase this, let’s create a simple vector called grades to store the grades of five students on a recent exam.

grades <- c(78, 84, 91, 92, 94)
grades
#> [1] 78 84 91 92 94

First, we can use a vector to transform all sub-objects at once. For example, imagine that there was a scoring error and all grades are 2 points lower than they should be. Instead of needing to add 2 to each of the five grades individually, I can simply add 2 to the vector once and it will apply to all the grades at once. That would be a huge time savings in a course with many students!

corrected <- grades + 2
corrected
#> [1] 80 86 93 94 96
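The same idea works with other arithmetic operators and with two vectors of the same length. The sketch below is only an illustration; the bonus vector is hypothetical and not part of the original example.

```r
grades <- c(78, 84, 91, 92, 94)
corrected <- grades + 2

# Divide every grade by 100 to express it as a proportion
corrected / 100
#> [1] 0.80 0.86 0.93 0.94 0.96

# Add two vectors of the same length, element by element
bonus <- c(0, 1, 0, 2, 0)  # hypothetical bonus points
corrected + bonus
#> [1] 80 87 93 96 96
```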

Second, we can use a vector to summarize across the sub-objects. For example, I might want to calculate the number of students in the class, the minimum and maximum grades, and the average grade. By giving the corrected vector to special functions in R, we can do this easily.

length(x = corrected)
#> [1] 5

range(x = corrected)
#> [1] 80 96

mean(x = corrected)
#> [1] 89.8
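Several other built-in functions summarize a vector in the same way. As a brief sketch, here are min(), max(), median(), and sum() applied to the corrected grades (recreated here so the example runs on its own).

```r
corrected <- c(80, 86, 93, 94, 96)

# Lowest and highest grades, one at a time
min(corrected)
#> [1] 80

max(corrected)
#> [1] 96

# Middle grade when the grades are sorted
median(x = corrected)
#> [1] 93

# Total of all grades
sum(corrected)
#> [1] 449
```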

Character Data

Sometimes we need to provide R with text instead of numbers. This type of data is called character data and each instance is called a string. Just like numbers, we can collect strings into vectors and there are special functions that can transform and summarize character data. In R, we mark each string by surrounding it with quotation marks ("). Here are a few examples:

customers <- c("John Smith", "Jane Doe", "Joe Bloggs")
customers
#> [1] "John Smith" "Jane Doe"   "Joe Bloggs"

toupper(x = customers)
#> [1] "JOHN SMITH" "JANE DOE"   "JOE BLOGGS"

nchar(x = customers)
#> [1] 10  8 10
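As a further sketch, tolower() is the counterpart of toupper(), paste() joins strings together, and length() counts strings just as it counts numbers. The greeting text is my own example.

```r
customers <- c("John Smith", "Jane Doe", "Joe Bloggs")

# Convert every string to lowercase
tolower(x = customers)
#> [1] "john smith" "jane doe"   "joe bloggs"

# paste() joins strings; the single greeting is reused for each customer
paste("Hello,", customers)
#> [1] "Hello, John Smith" "Hello, Jane Doe"   "Hello, Joe Bloggs"

# Number of strings in the vector
length(x = customers)
#> [1] 3
```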

Working with Data Frames

R comes with some example datasets built into it, which can be accessed by looking up their names and printing them. For example, the mtcars dataset includes information about different car models.

mtcars
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

This object is more complex than anything we’ve seen yet. It is an object type called a data frame, which is essentially multiple vectors combined together such that each row represents one observation (in this case, one car model) and each column represents one variable (in this case, an attribute like horsepower hp or fuel efficiency mpg). Most of the time, you will be using data frames for data analysis and visualization tasks.
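To make the "multiple vectors combined together" idea concrete, here is a minimal sketch that builds a tiny data frame with the data.frame() function. The students, scores, and gradebook objects are hypothetical and exist only for illustration. The nrow() and ncol() functions then count the rows and columns of any data frame, including mtcars.

```r
# Each vector becomes one column; all vectors must have the same length
students <- c("Ana", "Ben", "Cy")  # hypothetical names
scores <- c(80, 86, 93)            # hypothetical grades
gradebook <- data.frame(student = students, score = scores)
gradebook
#>   student score
#> 1     Ana    80
#> 2     Ben    86
#> 3      Cy    93

# Count the observations (rows) and variables (columns) in mtcars
nrow(x = mtcars)
#> [1] 32

ncol(x = mtcars)
#> [1] 11
```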

Basic Summaries

Often the first thing you should do after accessing a dataset (i.e., before visualizing or analyzing it) is to summarize it. This will provide a set of summary statistics for each variable/column, including its minimum (Min.), first quartile (1st Qu.), median (Median), mean (Mean), third quartile (3rd Qu.), and maximum (Max.). These six numbers provide a good summary of extreme values (min and max), common values (first and third quartiles), and central tendencies (mean and median).

summary(object = mtcars)
#>       mpg             cyl             disp             hp       
#>  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
#>  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
#>  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
#>  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
#>  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
#>  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
#>       drat             wt             qsec             vs        
#>  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
#>  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
#>  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
#>  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
#>  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
#>  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
#>        am              gear            carb      
#>  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
#>  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
#>  Median :0.0000   Median :4.000   Median :2.000  
#>  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
#>  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
#>  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Here we can see that all the cars have between 52 and 335 horsepower (hp), half of the cars have between 96.5 and 180, and the average is 146.7. This is also a good opportunity to check for outliers, errors, and missing values. For example, if I saw that the minimum hp was negative, I would double-check the data as that would result in a very strange car.
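One possible way to check for missing values directly is sketched below. R marks missing values as NA, and the is.na() function flags them; this is just one approach among several, not necessarily the one used in the workshop.

```r
# Count missing values (NA) across the whole data frame
sum(is.na(x = mtcars))
#> [1] 0

# TRUE here would mean that at least one value is missing
any(is.na(x = mtcars))
#> [1] FALSE
```

Because mtcars is a clean example dataset, nothing is missing; real datasets often are not so tidy.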

Basic Visualizations

It is often useful to create quick visualizations of your main variables of interest. For instance, we can create histograms to examine which values of horsepower (hp) and fuel efficiency (mpg) are more and less common. This can be done using the hist() function and the $ operator. This operator allows us to extract a specific column from a data frame. For instance, we can use mtcars$hp to extract (and then visualize) the horsepower numbers.

hist(x = mtcars$hp, xlab = "Horsepower")

We can also optionally use the xlab input to control the label of the x-axis. Let’s repeat with the mpg variable.

hist(x = mtcars$mpg, xlab = "Miles per Gallon")
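The hist() function accepts other optional inputs as well. As a hedged sketch, the main input sets the plot's title and the breaks input suggests roughly how many bars to draw (R treats a single number here as a suggestion and may pick a nearby value). The title text and the choice of 10 are my own.

```r
# Add a title and request roughly 10 bars
hist(
  x = mtcars$hp,
  breaks = 10,
  main = "Distribution of Horsepower",
  xlab = "Horsepower"
)
```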

Finally, let’s explore if there is a relationship between cars’ horsepower and fuel efficiency. To do so, we can use the plot() function and use the x and y axes to represent each variable.

plot(x = mtcars$hp, y = mtcars$mpg, xlab = "Horsepower", ylab = "Miles per Gallon")

It looks like cars with higher horsepower generally have lower fuel efficiency, although this relationship seems to “slow down” or “level off” after around 200 horsepower.
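If we want a single number to accompany the plot, one option (sketched here, and not covered in detail in this workshop) is the cor() function, which calculates the correlation between two vectors. The object name hpMpgCor is my own choice.

```r
# Correlation between horsepower and fuel efficiency
hpMpgCor <- cor(x = mtcars$hp, y = mtcars$mpg)
round(x = hpMpgCor, digits = 2)
#> [1] -0.78
```

The negative value agrees with the downward trend in the plot. Note that this correlation describes a straight-line relationship, so it will not capture the "leveling off" pattern.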

Preview of Future Workshops

This workshop only covered the basics. Future workshops over the next few weeks will discuss how to use R to: create much prettier and more complex data visualizations (April 15) and conduct elementary statistical analyses (April 22). Learn more about these free trainings at this link!