Kelvin Smith Library
KSL offers a series of Digital Scholarship Workshops every semester. Find information and sign up in campus groups.
This is not a standalone resource. This document contains code samples for pasting into .rmd files during the workshop.
Code samples for Workshop
More complete tutorial resources will be coming soon!
For help with R, or to schedule class visits or workshops, please contact David Beales, Digital Scholarship Partner at the Kelvin Smith Library.
R is a free software environment for statistical computing and graphics. It utilizes a library of thousands of packages for a wide variety of statistical and data visualization purposes. There is a strong interdisciplinary user community.
RStudio is a free IDE for R that provides tools for code, data and workspace management. We recommend that you install RStudio alongside R from the beginning as it greatly reduces the learning curve.
R for Data Science, 2nd edition (2023), https://r4ds.hadley.nz/, is the recommended textbook for learning R and the tidyverse. It is a complete and free online resource.
Posit hosts a complete set of cheat sheets for R and the tidyverse, https://posit.co/resources/cheatsheets/, including a comprehensive guide to working in RStudio, https://rstudio.github.io/cheatsheets/html/rstudio-ide.html
The tidyverse, https://www.tidyverse.org/, is a collection of R packages designed for data science. These packages share a common philosophy, grammar, and data structures, making it easier to learn and use them together. The tidyverse is especially known for making data manipulation, visualization, and analysis more intuitive and consistent.
The tidyverse was created to make data science in R easier, more consistent, and more intuitive, especially for people working with real-world data. Base R has many powerful functions, but they often have inconsistent syntax and naming conventions, which can be confusing for beginners. The tidyverse addresses this by emphasizing clear, readable code that follows a logical, step-by-step workflow, often using the pipe operator (%>% or |>). It also promotes the concept of “tidy data,” where each variable is a column and each observation is a row, which makes data analysis more predictable and efficient. The tidyverse consists of modular packages that each do one thing well but work seamlessly together, enabling a smooth transition from data import to visualization and reporting. Created by Hadley Wickham and others at RStudio in 2016, the tidyverse was also designed to support teaching and learning, offering a consistent and beginner-friendly approach to modern data science in R.
Tidy data principles: Each variable is a column, each observation is a row, and each type of observational unit is a table.
Piping (%>%): Used to write code in a clear, readable, step-by-step style.
Consistent syntax and function naming across packages.
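As a minimal sketch of the piping style described above, the example below runs the same three steps twice: once as nested function calls and once with a pipe. The mtcars dataset (built into R) and the hp > 100 cutoff are assumptions chosen only for illustration; they are not part of the workshop materials. The native pipe |> requires R 4.1 or later.

```r
library(dplyr)

# mtcars is tidy: each row is one car (observation), each column one variable.

# Nested calls are read from the inside out
nested <- summarize(
  group_by(filter(mtcars, hp > 100), cyl),
  mean_mpg = mean(mpg)
)

# The same steps with the native pipe are read top to bottom
piped <- mtcars |>
  filter(hp > 100) |>    # keep cars with more than 100 horsepower
  group_by(cyl) |>       # group by number of cylinders
  summarize(mean_mpg = mean(mpg))

# The magrittr pipe (%>%), re-exported by dplyr, works the same way here
piped_magrittr <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))
```

All three objects hold the same summary table; only the way the code reads differs.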
These packages are all loaded when you run library(tidyverse).
ggplot2 – for data visualization
dplyr – for data manipulation (e.g., filter, select, mutate, summarize)
tidyr – for tidying and reshaping data (e.g., pivoting, separating columns)
readr – for reading rectangular data (like CSV files)
tibble – a modern version of data frames
purrr – for functional programming and iteration
stringr – for string (text) processing
forcats – for working with categorical variables (factors)
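To show how several of these packages work together, here is a small sketch that builds a tibble, reshapes it into tidy form with tidyr, and cleans a text column with stringr. The scores_wide table and its column names are hypothetical and made up for this example.

```r
library(tidyverse)

# Hypothetical wide table: one row per student, one column per exam
scores_wide <- tibble(
  student = c("Ana", "Ben", "Cy"),
  exam_1  = c(88, 92, 75),
  exam_2  = c(90, 85, 80)
)

# tidyr: reshape so that each row is one student-exam observation
scores_long <- scores_wide |>
  pivot_longer(
    cols      = starts_with("exam_"),
    names_to  = "exam",
    values_to = "score"
  ) |>
  # dplyr + stringr: turn "exam_1" into the more readable "Exam 1"
  mutate(exam = str_replace(exam, "exam_", "Exam "))

scores_long
```

After reshaping, each variable (student, exam, score) has its own column, which is the layout that dplyr and ggplot2 expect.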
Exploratory Data Analysis (EDA) is the process of examining and understanding your dataset before doing formal modeling or hypothesis testing. The goal is to summarize the main characteristics of the data, often using visual methods and descriptive statistics. EDA helps identify patterns, detect outliers, test assumptions, and spot errors or missing values.
EDA in R is often done using tidyverse packages, especially dplyr for data manipulation and ggplot2 for visualization. You begin by inspecting the structure of the dataset using functions like glimpse() or summary() to get an overview of variable types and basic statistics. To identify missing data, is.na() combined with colSums() is commonly used. For understanding individual variables, dplyr functions like summarize() and count() help describe numeric and categorical variables, respectively. When exploring relationships or distributions, ggplot2 is the go-to tool: histograms and density plots are used for single numeric variables, bar plots for categorical counts, and boxplots or scatterplots for comparing variables. Throughout EDA, you can use facet_wrap() in ggplot2 to create multi-panel plots that show how patterns differ across groups. Together, these tools provide a flexible and coherent framework for getting to know your data before moving on to modeling or inference.
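The workflow above can be sketched roughly as follows. The airquality dataset (built into R) is an assumption chosen because it contains missing values; the workshop may use different data, and the bin count of 15 is arbitrary.

```r
library(dplyr)
library(ggplot2)

# 1. Inspect structure and basic statistics
glimpse(airquality)
summary(airquality)

# 2. Count missing values in each column
missing_counts <- colSums(is.na(airquality))
missing_counts

# 3. Describe a numeric variable (na.rm = TRUE skips missing values)
ozone_summary <- airquality |>
  summarize(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    sd_ozone   = sd(Ozone, na.rm = TRUE),
    n_missing  = sum(is.na(Ozone))
  )

# 4. Count rows per level of a grouping variable
month_counts <- airquality |> count(Month)

# 5. Distribution of one numeric variable, one panel per month
ozone_hist <- ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(bins = 15, na.rm = TRUE) +
  facet_wrap(~ Month)

# Type ozone_hist (or print(ozone_hist)) to draw the plot
```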
Chapter 7 of the first edition of the free ebook, R for Data Science, gives an introduction to EDA in R with many code examples: https://r4ds.had.co.nz/exploratory-data-analysis.html. And don't forget about the cheat sheets! https://rstudio.github.io/cheatsheets/html/data-transformation.html
ggplot2 is THE tool for data visualization in R. It is part of the tidyverse and is built around the principles of the Grammar of Graphics, which allows you to create complex plots by layering components such as data, aesthetics, and geometric objects. This approach encourages users to think about the structure of a plot rather than just the end result, making it easier to build, customize, and understand visualizations.
To begin visualizing data with ggplot2, you first define the dataset and the aesthetic mappings, that is, which variables go on the x- and y-axes and, optionally, which variables define color, size, shape, or other properties. Then, you add geoms (geometric objects) to represent the data, such as geom_point() for scatterplots, geom_bar() for bar charts, or geom_boxplot() for comparing distributions. You can further enhance your plots with faceting (using facet_wrap() or facet_grid()), which creates multiple subplots based on the values of a categorical variable. This is especially helpful when comparing groups. ggplot2 also makes it easy to add layers like trend lines (for example, with geom_smooth()), annotations, and custom themes to improve the clarity and aesthetics of your plots.
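A layered plot along these lines might look like the sketch below. It uses the mpg dataset that ships with ggplot2; the choice of variables, the linear trend line, and the theme are assumptions made for illustration only.

```r
library(ggplot2)

# Data and aesthetic mappings: engine size on x, highway mileage on y
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # Geom layer: one point per car, colored by vehicle class
  geom_point(aes(color = class)) +
  # Trend line layer: a straight-line fit without the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Faceting: one panel per drive type (4 = four-wheel, f = front, r = rear)
  facet_wrap(~ drv) +
  # Labels and a custom theme
  labs(
    title = "Highway mileage by engine size",
    x     = "Engine displacement (litres)",
    y     = "Highway miles per gallon",
    color = "Vehicle class"
  ) +
  theme_minimal()

# Type p (or print(p)) to draw the plot
```

Each + adds one component, so you can build the plot up a line at a time and rerun it to see what each layer contributes.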
Because it integrates well with dplyr, ggplot2 fits naturally into a tidyverse workflow, allowing for seamless transitions from data wrangling to visualization.
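As one possible illustration of that workflow, the sketch below summarizes data with dplyr and passes the result straight into ggplot2. It again assumes the mpg dataset from ggplot2; the summary chosen (mean highway mileage per vehicle class) is only an example.

```r
library(dplyr)
library(ggplot2)

# Wrangle: mean highway mileage and number of cars per vehicle class
class_summary <- mpg |>
  group_by(class) |>
  summarize(mean_hwy = mean(hwy), n = n()) |>
  arrange(desc(mean_hwy))

# Visualize: the summary table is piped directly into ggplot()
p <- class_summary |>
  ggplot(aes(x = reorder(class, mean_hwy), y = mean_hwy)) +
  geom_col() +
  coord_flip() +   # horizontal bars keep the class labels readable
  labs(x = "Vehicle class", y = "Mean highway miles per gallon")

# Type p (or print(p)) to draw the plot
```

Note the switch from |> to + once ggplot() is called: the pipe passes data between functions, while + adds layers to a plot.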
The free ebook, ggplot2: Elegant Graphics for Data Analysis (3e), https://ggplot2-book.org/, is an excellent resource for learning more about how to use ggplot2. And don't forget about the cheat sheets! https://rstudio.github.io/cheatsheets/html/data-visualization.html