Introduction to R Workshop
Overview
This one-hour workshop is organized by the Kansas Data Science Consortium (KDSC) and strives to provide a brief and gentle introduction to the R programming language.
Instructor
The instructor of this workshop, Dr. Jeffrey Girard, is a professor in the Department of Psychology at the University of Kansas and one of the three co-directors of the KDSC. He runs the AffCOM Research Lab, which applies computational approaches to study emotion, personality, and mental health. He also teaches (paid) statistics and data science workshops via his company, SMaRT Workshops.
Workshop Outline
- Installing R and RStudio
- Learning the R Language
- Working with Data Frames
- Preview of Future Workshops
Installing R and RStudio
You need to install R and, although doing so is optional, I recommend also installing RStudio. Both are freely available! Once they are installed, using the installers linked below, you will always open RStudio to access R. That is, R needs to be on your computer, but you won’t be accessing it directly.
Installing R
On Windows, click here to download the appropriate R installer.
On a newer Mac with Apple chips (from 2020 to now), click here to download the appropriate R installer.
On an older Mac with Intel chips (from 2006 to 2020), click here to download the appropriate R installer.
Installing RStudio
On Windows, click here to download the appropriate installer.
On any Mac, click here to download the appropriate installer.
Learning the R Language
R is controlled by commands that you type into RStudio’s Console. It takes some time to learn how to properly phrase commands, but doing so opens up many exciting possibilities for you! You typically type in one command at a time and hit Enter to get an immediate response from R. It’s like a chat room with your computer.
R as a Calculator
At its most basic level, you can use the R console as a calculator. In the following grey box, you can see the command I entered and R’s response (following #>).
(9.5 + 3.75 - 1.25) * (3 / 2)^2
#> [1] 27
Note that R uses periods (.) to mark decimals and doesn’t like commas or spaces inside numbers. So type numbers out like 1234567.89 and not like 1,234,567.89 or 1 234 567,89.
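To make the calculator idea a bit more concrete, here is a small sketch of a few more commands you could type into the Console. The particular numbers are just ones I made up for illustration. It shows that R follows the usual order of operations and that parentheses let you override it.

```r
# Exponents are evaluated first, then multiplication, then addition
2 + 3 * 4^2
#> [1] 50

# Parentheses change the order of evaluation
(2 + 3) * 4^2
#> [1] 80

# A long number typed the way R expects it (no commas or spaces)
1234567.89 / 1000
#> [1] 1234.568
```

Note that R only displays about seven significant digits by default, which is why the last result looks rounded on screen.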
Creating Objects
Just like creating variables in algebra, we can store information in temporary objects in R. For instance, we can tell R that x refers to the number 57.23. We do this using the <- operator, which we read as “x gets 57.23.” Any time we want to see what information is stored in an object, we can send a command to R including only its name; this is called printing it.
x <- 57.23
x
#> [1] 57.23
The main purpose of creating objects is to use them in future commands. This can be very handy.
(x - 32) * (5 / 9)
#> [1] 14.01667
We can also give objects more descriptive names, which makes them easier to remember and understand. There are several rules about what names are allowed to be. For now, let’s stick with simple names that only include letters (without numbers, spaces, or symbols) as doing so will never break those rules. Below, we can see how descriptive names clarify what the previous (ambiguous) operation was doing: converting a temperature from Fahrenheit to Celsius.
tempF <- 57.23
tempC <- (tempF - 32) * (5 / 9)
tempC
#> [1] 14.01667
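As a further sketch, here is the reverse conversion (Celsius to Fahrenheit). The names celsius and fahrenheit are hypothetical ones I picked for this example. It also shows something that can surprise newcomers: assigning a new value to an existing name replaces the old value, but objects that were calculated earlier are not updated automatically.

```r
# Hypothetical object names chosen for this sketch
celsius <- 14.5
fahrenheit <- celsius * (9 / 5) + 32
fahrenheit
#> [1] 58.1

# Assigning to an existing name replaces the stored value
celsius <- 20
celsius
#> [1] 20

# fahrenheit was calculated earlier, so it does not change on its own
fahrenheit
#> [1] 58.1
```

If you want fahrenheit to reflect the new temperature, you would need to run the conversion command again.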
Using Functions
R has the ability to transform objects in many useful ways. This is usually accomplished using functions, which I like to think of as recipes (e.g., in baking). A recipe calls for ingredients and includes a series of steps for transforming those ingredients into a tasty treat. Similarly, functions in R call for inputs and include a series of steps for transforming those inputs into an output. The telltale sign that a function is being used in R is text followed by parentheses. Continuing the baking analogy, I think of these parentheses as the edges of a mixing bowl that you put the ingredients/inputs into. For example, we can use the sqrt() function to calculate the (principal) square root of an input number (i.e., x).
sqrt(x = 25)
#> [1] 5
There are many useful functions built into R! Another one that will help us learn the basics is the round() function, which will round the input number to a specified number of digits (i.e., decimal places). In this case, we need two inputs: the number to round (i.e., x) and the number of digits to round to (i.e., digits). We can accomplish this by separating each ingredient with a comma.
round(x = 2/3, digits = 4)
#> [1] 0.6667
The digits input allows us to configure the function’s behavior. This means that R can have a single round() function and doesn’t need to have separate functions for rounding to, e.g., 1 or 2 or 5 digits.
round(x = 2/3, digits = 2)
#> [1] 0.67
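Here is a short sketch of three related points about function inputs that the examples above hint at. First, if you leave off the input names, R matches the inputs by their position. Second, some inputs have default values (for round(), digits defaults to 0). Third, the output of one function can be used as the input to another.

```r
# Inputs can be matched by position instead of by name
round(2/3, 2)
#> [1] 0.67

# If digits is left out, round() uses its default of 0 decimal places
round(x = 2/3)
#> [1] 1

# The output of sqrt() can be used as the input to round()
round(x = sqrt(x = 2), digits = 3)
#> [1] 1.414
```

I tend to write the input names out while learning, as it makes commands easier to read later.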
Vector Objects
One function that we use very frequently is the c() function. Note that the name is lowercase c, which I think of as being short for “collect.” It groups multiple similar objects (e.g., numbers) into a single collection object, which we refer to as a vector. Note that this function can take any number of inputs, and these inputs do not have individual names.
Collecting objects into a vector not only keeps those objects together but also lets us do two useful things with them. To showcase this, let’s create a simple vector called grades to store the grades of five students on a recent exam.
grades <- c(78, 84, 91, 92, 94)
grades
#> [1] 78 84 91 92 94
First, we can use a vector to transform all sub-objects at once. For example, imagine that there was a scoring error and all grades are 2 points lower than they should be. Instead of needing to add 2 to each of the five grades individually, I can simply add 2 to the vector once and it will apply to all the grades at once. That would be a huge time savings in a course with many students!
corrected <- grades + 2
corrected
#> [1] 80 86 93 94 96
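The same idea works with other arithmetic operators and even with two vectors of the same length. In this sketch, proportions and bonus are hypothetical objects I made up: the first rescales the grades to a 0 to 1 scale, and the second gives each student a different number of bonus points.

```r
grades <- c(78, 84, 91, 92, 94)

# Divide every grade by 100 in a single command
proportions <- grades / 100
proportions
#> [1] 0.78 0.84 0.91 0.92 0.94

# Add two vectors of the same length, matching them up position by position
bonus <- c(0, 1, 0, 2, 0)
grades + bonus
#> [1] 78 85 91 94 94
```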
Second, we can use a vector to summarize across the sub-objects. For example, I might want to calculate the number of students in the class, the minimum and maximum grades, and the average grade. By giving the corrected vector to special functions in R, we can do this easily.
length(x = corrected)
#> [1] 5
range(x = corrected)
#> [1] 80 96
mean(x = corrected)
#> [1] 89.8
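A few other built-in summary functions work the same way. This sketch recreates the corrected vector so that it can be run on its own, and it also shows that the mean is simply the sum of the values divided by how many there are.

```r
corrected <- c(80, 86, 93, 94, 96)

min(corrected)
#> [1] 80
max(corrected)
#> [1] 96
median(x = corrected)
#> [1] 93

# The mean is the sum of the grades divided by the number of grades
sum(corrected) / length(x = corrected)
#> [1] 89.8
```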
Character Data
Sometimes we need to provide R with text instead of numbers. This type of data is called character data and each instance is called a string. Just like numbers, we can collect strings into vectors and there are special functions that can transform and summarize character data. In R, we mark each string by surrounding it with quotation marks ("). Here are a few examples:
customers <- c("John Smith", "Jane Doe", "Joe Bloggs")
customers
#> [1] "John Smith" "Jane Doe" "Joe Bloggs"
toupper(x = customers)
#> [1] "JOHN SMITH" "JANE DOE" "JOE BLOGGS"
nchar(x = customers)
#> [1] 10 8 10
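Here is a sketch of a few more things you might try with a character vector. The greeting text is just something I made up. Note that length() counts how many strings are in the vector, whereas nchar() counts the characters inside each string.

```r
customers <- c("John Smith", "Jane Doe", "Joe Bloggs")

# How many strings are in the vector?
length(x = customers)
#> [1] 3

# Convert every string to lowercase
tolower(x = customers)
#> [1] "john smith" "jane doe"   "joe bloggs"

# Glue a greeting onto the front of every string
paste("Dear", customers)
#> [1] "Dear John Smith" "Dear Jane Doe"   "Dear Joe Bloggs"
```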
Working with Data Frames
R comes with some example datasets built into it, which can be accessed by looking up their names and printing them. For example, the mtcars dataset includes information about different car models.
mtcars
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
This object is more complex than anything we’ve seen yet. It is an object type called a data frame, which is essentially multiple vectors combined together such that each row represents one observation (in this case, one car model) and each column represents one variable (in this case, an attribute like horsepower hp or fuel efficiency mpg). Most of the time, you will be using data frames for data analysis and visualization tasks.
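To illustrate the idea that a data frame is multiple vectors combined together, here is a sketch that builds a tiny one by hand using the data.frame() function. The students object and its contents are made up for this example. It also shows a few functions for checking the size of a data frame and for printing only its first few rows, which is handy when a dataset is long.

```r
# Each named input becomes one column; all columns must have the same length
students <- data.frame(
  name = c("Ana", "Ben", "Cal"),
  grade = c(80, 86, 93)
)
students
#>   name grade
#> 1  Ana    80
#> 2  Ben    86
#> 3  Cal    93

# Count the rows (observations) and columns (variables)
nrow(students)
#> [1] 3
ncol(students)
#> [1] 2

# dim() reports both at once; mtcars has 32 cars and 11 variables
dim(mtcars)
#> [1] 32 11

# Print only the first three rows of a long data frame
head(x = mtcars, n = 3)
```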
Basic Summaries
Often the first thing you should do after accessing a dataset (i.e., before visualizing or analyzing it) is to summarize it using the summary() function. This will provide a set of summary statistics for each variable/column, including its minimum (Min.), first quartile (1st Qu.), median (Median), mean (Mean), third quartile (3rd Qu.), and maximum (Max.). These six numbers provide a good summary of extreme values (min and max), common values (first and third quartiles), and central tendencies (mean and median).
summary(object = mtcars)
#>       mpg             cyl             disp             hp
#>  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
#>  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
#>  Median :19.20   Median :6.000   Median :196.3   Median :123.0
#>  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
#>  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
#>  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
#>       drat             wt             qsec             vs
#>  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
#>  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
#>  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
#>  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
#>  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
#>  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
#>        am              gear            carb
#>  Min.   :0.0000   Min.   :3.000   Min.   :1.000
#>  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
#>  Median :0.0000   Median :4.000   Median :2.000
#>  Mean   :0.4062   Mean   :3.688   Mean   :2.812
#>  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
#>  Max.   :1.0000   Max.   :5.000   Max.   :8.000
Here we can see that all the cars have between 52 and 335 horsepower (hp), the middle half of the cars have between 96.5 and 180, and the average is 146.7. This is also a good opportunity to check for outliers, errors, and missing values. For example, if I saw that the minimum hp was negative, I would double-check the data as that would result in a very strange car.
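Reading the summary by eye works well for a small dataset, but the same checks can also be phrased as commands. This is only a sketch of one way to do it, using the built-in is.na() function to flag missing values and a simple comparison to flag negative values.

```r
# Are there any missing values anywhere in the data frame?
any(is.na(mtcars))
#> [1] FALSE

# How many missing values are there in total?
sum(is.na(mtcars))
#> [1] 0

# Are any of the values negative? (None should be for this dataset)
any(mtcars < 0)
#> [1] FALSE
```

All three checks come back clean for mtcars, which is what we would expect from a tidy example dataset.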
Basic Visualizations
It is often useful to create quick visualizations of your main variables of interest. For instance, we can create histograms to examine which values of horsepower (hp) and fuel efficiency (mpg) are more and less common. This can be done using the hist() function and the $ operator. This operator allows us to extract a specific column from a data frame. For instance, we can use mtcars$hp to extract (and then visualize) the horsepower numbers.
hist(x = mtcars$hp, xlab = "Horsepower")
We can also optionally use the xlab input to control the label of the x-axis. Let’s repeat with the mpg variable.
hist(x = mtcars$mpg, xlab = "Miles per Gallon")
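Since the $ operator gives us back an ordinary vector, everything we learned about vectors earlier still applies. The sketch below stores the horsepower column in an object (the name hp_values is just one I picked) and then draws a histogram using two more optional inputs of hist(): breaks, which suggests roughly how many bars to use, and main, which sets the title.

```r
# Extract the horsepower column as a plain numeric vector
hp_values <- mtcars$hp
length(x = hp_values)
#> [1] 32
mean(x = hp_values)
#> [1] 146.6875

# Draw the histogram; storing the result lets us inspect the bar counts
h <- hist(
  x = hp_values,
  breaks = 10,
  main = "Distribution of Horsepower",
  xlab = "Horsepower"
)

# The bar heights add up to the number of cars
sum(h$counts)
#> [1] 32
```

R treats the breaks number as a suggestion only, so the final number of bars may differ a little from what you asked for.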
Finally, let’s explore whether there is a relationship between cars’ horsepower and fuel efficiency. To do so, we can use the plot() function and use the x and y axes to represent each variable.
plot(x = mtcars$hp, y = mtcars$mpg, xlab = "Horsepower", ylab = "Miles per Gallon")
It looks like cars with higher horsepower generally have lower fuel efficiency, although this relationship seems to “slow down” or “level off” after around 200 horsepower.
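If you would like to go one small step beyond eyeballing the plot, here is a sketch of two optional additions. These are my own suggestions rather than part of the workshop proper: the lowess() function estimates a smooth trend that can be drawn over the points with lines(), and the cor() function calculates the correlation between the two variables.

```r
# Redraw the scatterplot
plot(x = mtcars$hp, y = mtcars$mpg, xlab = "Horsepower", ylab = "Miles per Gallon")

# Estimate a smooth trend and draw it on top of the existing points
trend <- lowess(x = mtcars$hp, y = mtcars$mpg)
lines(trend)

# Calculate the correlation; negative means mpg tends to drop as hp rises
r <- cor(x = mtcars$hp, y = mtcars$mpg)
round(x = r, digits = 2)
#> [1] -0.78
```

Keep in mind that a correlation describes a straight-line relationship, so it will not capture the leveling off that the smooth trend line suggests.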
Preview of Future Workshops
This workshop only covered the basics. Future workshops over the next few weeks will discuss how to use R to: create much prettier and more complex data visualizations (April 15) and conduct elementary statistical analyses (April 22). Learn more about these free trainings at this link!