KDSC S4 Bootcamp
Summer Social Science Statistics Workshop
1 Workshop Overview
This workshop is designed to prepare incoming graduate students in psychology and related social and behavioral sciences for success in their first graduate-level statistics courses. Graduate statistics can move quickly and assume a level of fluency with core concepts, terminology, and skills that students may not have used recently or may have learned unevenly in undergraduate courses. My goal is to bridge those gaps—whether they stem from differences in prior training, time since last coursework, or variation in how research methods and statistics were taught.
We will begin by reinforcing key ideas from undergraduate statistics and research design, ensuring a solid understanding of concepts such as variables, measurement scales, reliability and validity, and the logic of statistical inference. This foundation will make it easier to follow the more technical material covered in graduate courses, and to connect statistical procedures to meaningful research questions.
In addition, the workshop provides a gentle, hands-on introduction to statistical programming in R. Many graduate courses and research projects now expect students to conduct analyses and produce visualizations in R rather than in point-and-click software. Through guided examples, participants will learn how to work with data in R, compute and interpret descriptive statistics, and create clear, informative plots for exploratory data analysis. By the end, students will be better equipped to approach their graduate coursework with confidence, apply statistical tools to their own research, and continue developing their analytic skills throughout their academic and professional careers.
2 Why Learn Statistics?
Statistics is a cornerstone of psychological science—not because psychologists love numbers for their own sake, but because human reasoning is prone to bias. Left to “common sense,” we often see what we expect to see. Decades of research on belief bias show that, even when faced with logically valid arguments, people judge them incorrectly if the conclusion conflicts with their prior beliefs. Statistics helps guard against these errors by providing objective tools to evaluate evidence, ensuring that conclusions are grounded in data rather than intuition.
2.2 Why Psychology Relies Heavily on Statistics
Unlike physics, which studies relatively simple entities like electrons, psychology studies people, who are complex, variable, and unpredictable. This complexity makes statistical tools indispensable. Psychologists need to understand statistics to:
Design better research: Good research design and statistics are deeply linked.
Interpret the literature: Most psychology papers rely on statistical results.
Work independently: Trained statisticians are scarce and expensive; researchers must be self-sufficient.
Even outside research, a statistical lens is essential for critically evaluating claims in clinical work, policy, and everyday life.
2.3 A Skill for Life, Not Just the Lab
We live in a world overflowing with data but short on clear understanding. From misleading headlines to misinterpreted survey results, statistical literacy is a survival skill for informed citizenship. For psychology students, statistics is not just an academic requirement—it’s a tool for thinking clearly, making sound decisions, and contributing credible, trustworthy knowledge.
2.4 Discussion Questions
Why isn’t “common sense” enough when evaluating evidence, and how can statistics help us avoid the pitfalls of belief bias?
In the Berkeley admissions example, what does Simpson’s paradox teach us about the importance of looking beyond aggregated data?
How does the complexity and unpredictability of human behavior make statistics more essential in psychology than in some other sciences?
Can you think of a recent news story, social media post, or public debate where statistical literacy would have helped you interpret the information more accurately?
3 Research Design
Research design is the framework for planning, conducting, and evaluating a study. In psychology, it involves deciding what to measure, how to measure it, and how to interpret results while minimizing bias and error. Although this is only a brief overview, it connects core concepts of measurement, variable types, reliability, validity, and study design—each of which directly affects the quality of statistical analysis.
3.1 Psychological Measurement
Measurement in psychology means assigning numbers or categories to aspects of human behavior or mental processes. Because many psychological concepts are abstract (e.g., “intelligence” or “attitude”), researchers must operationalize them by turning vague ideas into specific, measurable variables. This involves defining the construct, selecting a method of measurement (self-report, observation, records), and deciding allowable values. Operationalization links the theoretical construct to the actual data collected.
3.2 Scales of Measurement
Variables differ in the kind of information they convey:
Nominal: Categories without order (e.g., eye colour).
Ordinal: Categories with order but unequal intervals (e.g., tournament rankings, Likert scales).
Interval: Equal intervals but arbitrary zero (e.g., temperature in °C).
Ratio: Equal intervals with a true zero (e.g., reaction time).
Variables may also be continuous (any value within a range) or discrete (distinct, separate values). These distinctions affect which statistical methods are appropriate.
Construct: Test anxiety
Possible operationalization steps:
Definition: Test anxiety = feelings of tension and worry specifically related to taking exams.
Measurement method: Administer the Test Anxiety Inventory (TAI), a self-report questionnaire with strong evidence of validity.
Allowable values: Responses scored on a 1–4 Likert scale for each item, summed to produce a total score from 20–80.
Variable type: Interval-like (treated as continuous in analysis).
By clearly defining what “test anxiety” means, selecting a suitable measurement tool, and deciding how responses will be represented, the vague idea of “being nervous during exams” becomes a measurable variable that can be analysed statistically.
3.3 Measurement Quality
High-quality measurement requires attention to both reliability and validity.
Reliability is the consistency of a measurement. Types include test–retest (over time), inter-rater (between observers), parallel forms (across equivalent measures), and internal consistency (across items within a measure). A measure can be reliable but invalid; however, very low reliability usually undermines validity.
Validity asks whether the measurement actually reflects the intended construct (construct validity), appears appropriate to experts or stakeholders (face validity), and captures the construct in relevant real-world contexts (ecological validity).
Research goal: Measure aggression in children.
Chosen measure: Number of times a child shouts during a playground observation.
Problem: Shouting may sometimes reflect aggression, but can also occur during play, excitement, or calling to friends. The measure may capture general vocal activity rather than aggression specifically.
Why this matters: The measurement does not align cleanly with the intended construct. Conclusions about “aggression” based on this operationalization could be misleading.
3.4 The Role of Variables
In analysis, variables play different roles. Predictors (independent variables) are used to explain or forecast outcomes (dependent variables). Clear distinction between these roles helps structure analyses and interpret causal claims.
3.5 Types of Research
Experimental research manipulates predictors under controlled conditions, often using random assignment to reduce confounds and support causal inference.
Non-experimental research observes naturally occurring variables. This includes quasi-experiments (similar to experiments but without random assignment to conditions) and case studies (in-depth analysis of one or a few instances). Non-experimental designs can be more naturalistic but are more vulnerable to confounds.
3.6 Research Validity
Research validity concerns whether a study’s design and execution support accurate, generalizable conclusions.
Internal validity: Can observed differences be attributed to the manipulated or focal variables, rather than confounds?
External validity: Will results generalize to other people, settings, and situations?
Ecological validity: Does the study setting and procedure resemble the real-world context of interest?
Threats to research validity can arise from multiple sources:
Changes over time: History effects (external events) and maturation (natural changes in participants) can influence results independently of the study variables.
Measurement process: Testing effects (practice or familiarity) and regression to the mean (extreme scores moving toward average) can distort outcomes.
Participant selection: Selection bias (systematic group differences) and attrition (dropouts) can reduce comparability.
Social and psychological influences: Experimenter bias (subtle cues from the researcher), demand effects (participants guessing the study’s purpose), and placebo effects can alter behavior.
Ethical or procedural issues: Fraud, poor control procedures, or unrepresentative samples can undermine both internal and external validity.
3.7 Discussion Questions
Choose a psychological construct (e.g., self-esteem, motivation, empathy). How could you operationalize it in at least two different ways, and what trade-offs might each approach have in terms of reliability and validity?
Why might a measure be highly reliable but still have poor validity? Can you think of a real-world example where this might occur?
In what ways do experimental designs strengthen internal validity compared to non-experimental designs? Can you think of a situation where a non-experimental design might still be preferable?
Pick one threat to research validity (e.g., selection bias, demand effects). How might it appear in a psychological study, and what steps could a researcher take to minimize its impact?
4 Getting Started in R
R is a free, open-source statistical computing environment that is more powerful and flexible than spreadsheets and avoids the high licensing costs of proprietary software. It’s widely used in research, highly extensible through thousands of free add-on packages, and supports cutting-edge methods found in advanced textbooks. Because R is also a full programming language, learning it not only builds data analysis skills but also introduces valuable programming abilities relevant to modern psychological research. Although R has a learning curve and some quirks, its strengths in cost, capability, and long-term utility make it one of the best tools for serious statistical work.
4.1 Installing R and RStudio
You need to install R, and although doing so is optional, I recommend also installing RStudio. Both are freely available! Once they are installed (using the installers linked below), you will always open RStudio to access R. That is, R needs to be on your computer, but you won’t be accessing it directly.
4.1.1 Installing R
On Windows, click here to download the appropriate R installer.
On a newer Mac with Apple chips (from 2020 to now), click here to download the appropriate R installer.
On an older Mac with Intel chips (from 2006 to 2020), click here to download the appropriate R installer.
4.1.2 Installing RStudio
On Windows, click here to download the appropriate installer.
On any Mac, click here to download the appropriate installer.
4.2 Learning the R Language
R is controlled by commands that you type into RStudio’s Console. It takes some time to learn how to properly phrase commands, but doing so opens up many exciting possibilities for you! You typically type in one command at a time and hit Enter to get an immediate response from R. It’s like a chat room with your computer.
4.2.1 R as a Calculator
At its most basic level, you can use the R console as a calculator. In the following grey box, you can see the command I entered and R’s response (following #>).
(9.5 + 3.75 - 1.25) * (3 / 2)^2
#> [1] 27
Note that R uses periods (.) to mark decimals and doesn’t like commas or spaces inside numbers. So type numbers out like 1234567.89 and not like 1,234,567.89 or 1 234 567,89.
4.2.2 Creating Objects
Just like creating variables in algebra, we can store information in temporary objects in R. For instance, we can tell R that x refers to the number 57.23. We do this using the <- operator, which we read as “x gets 57.23.” Any time we want to see what information is stored in an object, we can send a command to R including only its name; this is called printing it.
x <- 57.23
x
#> [1] 57.23
The main purpose of creating objects is to use them in future commands. This can be very handy.
(x - 32) * (5 / 9)
#> [1] 14.01667
We can also give objects more descriptive names, which makes them easier to remember and understand. There are several rules about what names are allowed. For now, let’s stick with simple names that only include letters (without numbers, spaces, or symbols), as doing so will never break those rules. Below, we can see how descriptive names clarify what the previous (ambiguous) operation was doing: converting a temperature from Fahrenheit to Celsius.
tempF <- 57.23
tempC <- (tempF - 32) * (5 / 9)
tempC
#> [1] 14.01667
There is also a keyboard shortcut in RStudio to input the arrow operator: Alt + - (the minus key) on Windows, or Option + - on a Mac.
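One behavior worth knowing, shown in a small sketch reusing the temperature objects above: assignment stores the current value, not a live formula, so changing an input later does not update objects that were computed from it.

```r
# Assignment copies the current value; tempC is not a formula tied to tempF
tempF <- 212
tempC <- (tempF - 32) * (5 / 9)
tempC
#> [1] 100
tempF <- 32   # overwrite tempF with a new value
tempC         # unchanged until we rerun the conversion line
#> [1] 100
```

If you want tempC to reflect the new tempF, simply rerun the conversion command.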
4.2.3 Using Functions
R has the ability to transform objects in many useful ways. This is usually accomplished using functions, which I like to think of as recipes (e.g., in baking). A recipe calls for ingredients and includes a series of steps for transforming those ingredients into a tasty treat. Similarly, functions in R call for inputs and include a series of steps for transforming those inputs into an output. The telltale sign that a function is being used in R is text followed by parentheses. Continuing the baking analogy, I think of these parentheses as the edges of a mixing bowl that you put the ingredients/inputs into. For example, we can use the sqrt() function to calculate the (principal) square root of an input number (i.e., x).
sqrt(x = 25)
#> [1] 5
There are many useful functions built into R! Another one that will help us learn the basics is the round() function, which will round the input number to a specified number of digits (i.e., decimal places). In this case, we need two inputs: the number to round (i.e., x) and the number of digits to round to (i.e., digits). We can accomplish this by separating each ingredient with a comma.
round(x = 2/3, digits = 4)
#> [1] 0.6667
The digits input allows us to configure the function’s behavior. This means that R can have a single round() function and doesn’t need separate functions for rounding to, e.g., 1 or 2 or 5 digits.
round(x = 2/3, digits = 2)
#> [1] 0.67
We can also leave off the argument names and provide the inputs by position (i.e., in the expected order). This is common practice for the data argument (the first input), while later optional arguments like digits are usually still named.
round(2/3, 2)
#> [1] 0.67
round(2/3, digits = 2)
#> [1] 0.67
4.2.4 Vector Objects
One function that we use very frequently is the c() function. Note that the name is lowercase c, which I think of as being short for “collect.” It groups multiple similar objects (e.g., numbers) into a single collection object, which we refer to as a vector. Note that this function can take any number of inputs, each of which does not have an individual name.
Collecting objects into a vector will not only keep those objects together, but there are functions that let us do two useful things with vectors. To showcase this, let’s create a simple vector called grades to store the grades of five students on a recent exam.
grades <- c(78, 84, 91, 92, 94)
grades
#> [1] 78 84 91 92 94
First, we can use a vector to transform all sub-objects at once. For example, imagine that there was a scoring error and all grades are 2 points lower than they should be. Instead of needing to add 2 to each of the five grades individually, I can simply add 2 to the vector once and it will apply to all the grades at once. That would be a huge time savings in a course with many students!
corrected <- grades + 2
corrected
#> [1] 80 86 93 94 96
Second, we can use a vector to summarize across the sub-objects. For example, I might want to calculate the number of students in the class or add up all their points. By giving the corrected vector to special functions in R, we can do this easily.
length(corrected)
#> [1] 5
sum(corrected)
#> [1] 449
4.2.5 Character Data
Sometimes we need to provide R with text instead of numbers. This type of data is called character data and each instance is called a string. Just like numbers, we can collect strings into vectors and there are special functions that can transform and summarize character data. In R, we mark each string by surrounding it with quotation marks ("). Here are a few examples:
customers <- c("John Smith", "Jane Doe", "Joe Bloggs")
customers
#> [1] "John Smith" "Jane Doe" "Joe Bloggs"
toupper(customers)
#> [1] "JOHN SMITH" "JANE DOE" "JOE BLOGGS"
nchar(customers)
#> [1] 10 8 10
4.3 Installing and Loading Packages
Base R can do a lot, but many analyses and visualizations are easier with add-on packages.
- Install once (per machine):
install.packages("moments")
install.packages("ggplot2")
(You don’t need to run this every time—only when the package isn’t installed.)
- Load each session:
library(moments)
library(ggplot2)
4.4 Working with Datasets
4.4.1 Tidy Data
Before we start working with more complex data structures in R, it’s important to understand the idea of tidy data. This concept is a standard way of organizing data that makes analysis and visualization easier.
In tidy data:
Each row is a single observation: a case, unit, or instance you’ve measured.
Each column is a single variable: a property or attribute that you’ve measured for each observation.
Each cell contains a single value: no multiple entries or lists inside one cell.
For example, if you are recording reaction times and accuracy for participants assigned to two conditions, a tidy dataset would have:
One row for each participant
One column for each variable (e.g., ParticipantID, Condition, ReactionTime)
One value per cell (e.g., 610 in the ReactionTime column)
A non-tidy dataset might cram several values into one cell separated by commas, or spread a single variable across multiple columns (e.g., one column per testing session) instead of one column per variable. While such formats are common in raw data, they require more work to analyze. R’s modern data tools work best when your data follows the tidy format.
| ParticipantID | Condition | ReactionTime | Accuracy |
|---|---|---|---|
| P001 | Control | 525 | 0.92 |
| P002 | Treatment | 498 | 0.88 |
| P003 | Control | 610 | 0.95 |
| P004 | Treatment | 472 | 0.91 |
| P005 | Control | 587 | 0.89 |
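As a sketch previewing the data frames introduced in the next section, the tidy table above can be entered in R with the data.frame() function, one vector per column (the object name experiment is just illustrative):

```r
# The tidy example table as an R data frame: each column is a vector,
# each row an observation, each cell a single value
experiment <- data.frame(
  ParticipantID = c("P001", "P002", "P003", "P004", "P005"),
  Condition     = c("Control", "Treatment", "Control", "Treatment", "Control"),
  ReactionTime  = c(525, 498, 610, 472, 587),
  Accuracy      = c(0.92, 0.88, 0.95, 0.91, 0.89)
)
experiment
```

Printing experiment shows the same five rows and four columns as the table.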
4.4.2 Data Frames
R, and many packages, come with some example datasets built in, which can be accessed by looking up their names and printing them. For example, the mpg dataset is included in the ggplot2 package and includes information about different car models.
mpg
#> # A tibble: 234 × 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
#> 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
#> 3 audi a4 2 2008 4 manu… f 20 31 p comp…
#> 4 audi a4 2 2008 4 auto… f 21 30 p comp…
#> 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
#> 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
#> 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
#> 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
#> 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
#> 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
#> # ℹ 224 more rows
This object is more complex than anything we’ve seen yet. It is an object type called a data frame, which is essentially multiple vectors combined together such that each row represents one observation (in this case, one car model) and each column represents one variable (in this case, an attribute like engine size displ, highway fuel efficiency hwy, or vehicle class class). Most of the time, you will be using data frames for data analysis and visualization tasks.
We can use the $ operator to extract any column/variable from a data frame as a vector.
mpg$hwy
#> [1] 29 29 31 30 26 26 27 26 25 28 27 25 25 25 25 24 25 23 20 15 20 17 17 26 23
#> [26] 26 25 24 19 14 15 17 27 30 26 29 26 24 24 22 22 24 24 17 22 21 23 23 19 18
#> [51] 17 17 19 19 12 17 15 17 17 12 17 16 18 15 16 12 17 17 16 12 15 16 17 15 17
#> [76] 17 18 17 19 17 19 19 17 17 17 16 16 17 15 17 26 25 26 24 21 22 23 22 20 33
#> [101] 32 32 29 32 34 36 36 29 26 27 30 31 26 26 28 26 29 28 27 24 24 24 22 19 20
#> [126] 17 12 19 18 14 15 18 18 15 17 16 18 17 19 19 17 29 27 31 32 27 26 26 25 25
#> [151] 17 17 20 18 26 26 27 28 25 25 24 27 25 26 23 26 26 26 26 25 27 25 27 20 20
#> [176] 19 17 20 17 29 27 31 31 26 26 28 27 29 31 31 26 26 27 30 33 35 37 35 15 18
#> [201] 20 20 22 17 19 18 20 29 26 29 29 24 44 29 26 29 29 29 29 23 24 44 41 29 26
#> [226] 28 29 29 29 28 29 26 26 26This will become helpful for calculating summaries of our variables, next.
4.5 Try It Yourself
In the R console, calculate the value of:
\[(12.5 - 7.5) \times \left( \frac{4}{3} \right)^2\]
Create an object called length_in storing the value 15 in inches. Convert it to centimeters using the formula below. Store the result in an object called length_cm and print it.
\[\text{cm} = \text{in} \times 2.54\]
Create a vector called scores with the values 85, 90, 92, 88, 95.
- Add 3 to all scores to correct for a scoring error.
- Find the number of scores, the minimum and maximum, and the average of the corrected scores.
Load the ggplot2 package and print the diamonds dataset. Extract the price column and calculate the total (i.e., sum) value of all diamonds in the dataset.
5 Descriptive Statistics
5.1 Measures of Central Tendency
When we want to describe a set of numbers, one useful step is to find a typical or central value. These are called measures of central tendency, and they give us a quick sense of where most of the data is “centered.” The three most common are:
Mean: Often called the average, the mean is found by adding up all the values and dividing by how many there are. It is sensitive to extreme values (outliers).
Median: The middle value when all numbers are ordered from smallest to largest. If there’s an even number of values, the median is the average of the two middle values. It’s less affected by extreme values than the mean.
Mode: The most frequently occurring value in the data. A set can have no mode, one mode, or more than one mode.
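Before computing these on real data, a made-up example (the vector name quiz is ours) shows how a single outlier moves the mean but not the median:

```r
# Five quiz scores with one extreme value
quiz <- c(4, 5, 5, 6, 30)
mean(quiz)     # (4 + 5 + 5 + 6 + 30) / 5; pulled upward by the outlier
#> [1] 10
median(quiz)   # the middle ordered value; unaffected by how extreme 30 is
#> [1] 5
```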
mean(mpg$hwy)
#> [1] 23.44017
median(mpg$hwy)
#> [1] 24
table(mpg$class)
#>
#> 2seater compact midsize minivan pickup subcompact suv
#> 5 47 41 11 33 35 62
which.max(table(mpg$class))
#> suv
#> 7
5.2 Measures of Variability
While measures of central tendency tell us where the “center” of our data is, measures of variability describe how spread out the values are. They help us understand whether the data points are clustered closely together or scattered far apart. Common measures include:
Range: The difference between the largest and smallest values. It’s quick to calculate but can be overly influenced by extreme values.
Interquartile Range (IQR): The range of the middle 50% of the data (between the 25th and 75th percentiles). This reduces the influence of extreme values.
Variance: The average of the squared differences between each value and the mean. It’s a key concept in statistics, but the units are “squared,” making it less intuitive.
Standard Deviation (SD): The square root of the variance. It’s in the same units as the original data, so it’s easier to interpret.
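To connect these definitions to R’s functions, here is a by-hand sketch on a small made-up vector; note that var() and sd() use the sample formula, dividing by n - 1 rather than n:

```r
# Sample variance by hand: summed squared deviations from the mean, divided by n - 1
x <- c(2, 4, 4, 4, 6)
n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_var        # matches var(x)
#> [1] 2
sqrt(manual_var)  # matches sd(x): back in the original units
#> [1] 1.414214
```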
range(mpg$hwy)
#> [1] 12 44
diff(range(mpg$hwy))
#> [1] 32
IQR(mpg$hwy)
#> [1] 9
var(mpg$hwy)
#> [1] 35.45778
sd(mpg$hwy)
#> [1] 5.954643
5.3 Measures of Shape
Beyond knowing where the data is centered and how spread out it is, we can also describe the shape of its distribution. Two common measures of shape are skewness and kurtosis.
- Skewness: Describes the asymmetry of the data’s distribution.
- Negative skew (<0=left-skewed): the left tail is longer, with more extreme low values. The mean is often less than the median.
- Zero skew (0=unskewed): the distribution is symmetric.
- Positive skew (>0=right-skewed): the right tail is longer, with more extreme high values. The mean is often greater than the median.
- Kurtosis: Describes the tailedness or “peakedness” of the distribution compared to a normal distribution.
- Low kurtosis (<3=platykurtic): flatter peak and lighter tails.
- Normal kurtosis (3=mesokurtic): similar to a normal distribution.
- High kurtosis (>3=leptokurtic): sharper peak and heavier tails, meaning more extreme values.
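A quick base-R sketch of the mean-versus-median heuristic for skew, using made-up right-skewed values (the vector name incomes is ours):

```r
# Right-skewed data: most values are low, with a long tail of high values
incomes <- c(20, 22, 25, 26, 30, 31, 35, 120)
mean(incomes)     # dragged toward the right tail
#> [1] 38.625
median(incomes)   # stays near the bulk of the data
#> [1] 28
```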
library(moments)
skewness(mpg$hwy)
#> [1] 0.366865
kurtosis(mpg$hwy)
#> [1] 3.163929
5.4 Visualizing Variability
When you want to understand how a single variable is distributed, different plot types can highlight different aspects of the data. The ggplot2 package offers a variety of geoms for this purpose.
5.4.1 Bar plots
Bar plots are typically used for categorical variables, showing counts or proportions for each category. The height of each bar corresponds to the count (or proportion, if specified) for that category.
library(ggplot2)
ggplot(data = mpg, mapping = aes(x = class)) + geom_bar()5.4.2 Histograms
Histograms display the distribution of a numeric variable by grouping values into bins. The height of each bar reflects how many observations fall within that range of values.
ggplot(data = mpg, mapping = aes(x = hwy)) + geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
5.4.3 Density plots
Density plots provide a smoothed version of the histogram, showing the distribution as a continuous curve. This is especially useful for comparing the shapes of multiple distributions without the visual noise of bin edges.
ggplot(data = mpg, mapping = aes(x = hwy)) + geom_density()
5.4.4 Boxplots
Boxplots give a compact summary of a distribution’s center, spread, and outliers.
The box spans from the first quartile (Q1) to the third quartile (Q3), containing the middle 50% of the data.
The line inside the box marks the median (Q2).
The whiskers extend from the box to the most extreme values within 1.5 times the interquartile range (IQR) from Q1 or Q3.
Points beyond the whiskers are considered outliers and are plotted individually as points.
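The whisker and outlier rules can be checked by hand on a small made-up vector (all object names here are illustrative):

```r
# Boxplot ingredients computed manually
x <- c(1, 3, 4, 5, 5, 6, 7, 9, 20)
q <- quantile(x, probs = c(0.25, 0.50, 0.75))   # Q1, median, Q3
iqr <- unname(q[3] - q[1])                      # 7 - 4 = 3
upper_limit <- unname(q[3] + 1.5 * iqr)         # whiskers cannot extend past this
upper_limit                                     # 20 exceeds 11.5, so it would plot as an outlier
#> [1] 11.5
```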
ggplot(data = mpg, mapping = aes(x = hwy)) + geom_boxplot()
5.5 Visualizing Joint Variability
To examine how two variables vary together, choose a plot that matches their types (numeric–numeric vs. numeric–categorical). The ggplot2 geoms below highlight different relationships and patterns. Note that we now need to specify two aesthetic mappings instead of just one.
5.5.1 Colored density plots (numeric by categorical)
Colored density plots compare the distribution of a numeric variable across categories by overlaying group-wise densities.
ggplot(data = mpg, mapping = aes(x = hwy, color = drv)) + geom_density(linewidth = 1)
5.5.2 Boxplots with a discrete y-axis (numeric vs. categorical)
Horizontal boxplots summarize a numeric variable for each category, making group comparisons of medians, spread, and outliers straightforward. Mapping the categorical variable to the y-axis keeps labels readable and emphasizes between-group differences.
ggplot(data = mpg, mapping = aes(x = hwy, y = class)) + geom_boxplot()
5.5.3 Scatterplots (numeric vs. numeric)
Scatterplots reveal association, clusters, and potential nonlinearity between two numeric variables.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point()
5.6 Try It Yourself
Using the diamonds dataset (included in the ggplot2 package), calculate the mean and median of the carat variable. How do the mean and median compare, and what might that suggest about skewness?
In the diamonds dataset, compute the range, IQR, variance, and standard deviation of the carat variable. Which measure of spread is least influenced by extreme values?
Create a boxplot showing the distribution of price for each cut. Which cut tends to have the highest median price? Which has the widest spread?
Make a scatterplot of carat (x-axis) vs. price (y-axis). What kind of relationship do you observe between these variables?
6 Graduate Courses
See the two-year schedule posted at: https://college.ku.edu/grad-quant
6.1 PSYC 790
Statistical Methods in Psychology I
- Linear modeling
- Ordinary least squares estimation
- Probability theory
- Statistical inference
- t-test group comparisons
- ANOVA group comparisons
- Correlation associations
- Multiple regression
6.2 PSYC 791
Statistical Methods in Psychology II
- Generalized linear modeling
- Maximum likelihood estimation
- Binary regression
- Count regression
- Ordinal regression
6.3 PSYC 792
Data Science for the Social and Behavioral Sciences
- R programming
- Task automation
- Data processing
- Data visualization
- Data communication
6.4 PSYC 894
Multilevel Modeling
- Linear mixed-effects modeling
- Clustered data
- Longitudinal data
- Primer on Bayesian estimation