Research Guides: Digital Scholarship: R and RStudio: Overview

R is a free software environment for statistical computing and graphics. It utilizes a library of thousands of packages for a wide variety of statistical and data visualization purposes. There is a strong interdisciplinary user community.

https://www.r-project.org/

RStudio is a free IDE for R that provides tools for code, data and workspace management. We recommend that you install RStudio alongside R from the beginning as it greatly reduces the learning curve.

https://posit.co/products/open-source/rstudio/

R for Data Science, 2nd edition (2023), https://r4ds.hadley.nz/, is the recommended textbook for learning R and the tidyverse. It is complete and free to read online.

Posit hosts a complete set of cheat sheets for R and the tidyverse at https://posit.co/resources/cheatsheets/, including a comprehensive guide to working in RStudio: https://rstudio.github.io/cheatsheets/html/rstudio-ide.html

The tidyverse, https://www.tidyverse.org/, is a collection of R packages designed for data science. These packages share a common philosophy, grammar, and data structures, making it easier to learn and use them together. The tidyverse is especially known for making data manipulation, visualization, and analysis more intuitive and consistent.

The tidyverse was created to make data science in R easier, more consistent, and more intuitive, especially for people working with real-world data. Base R has many powerful functions, but they often have inconsistent syntax and naming conventions, which can be confusing for beginners. The tidyverse addresses this by emphasizing clear, readable code that follows a logical, step-by-step workflow, often using a pipe operator: %>% from the magrittr package, or the native |> available since R 4.1. It also promotes the concept of “tidy data,” where each variable is a column and each observation is a row, which makes data analysis more predictable and efficient. The tidyverse consists of modular packages that each do one thing well but work seamlessly together, enabling a smooth transition from data import to visualization and reporting. Created by Hadley Wickham and others at RStudio (now Posit) in 2016, the tidyverse was also designed to support teaching and learning, offering a consistent and beginner-friendly approach to modern data science in R.

Core Features of the Tidyverse:

Tidy data principles: Each variable is a column, each observation is a row, and each type of observational unit is a table.
Piping (%>% or |>): Used to write code in a clear, readable, step-by-step style.
Consistent syntax and function naming across packages.
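
As a rough illustration of the piping style described above, the sketch below writes the same three steps twice: once as nested function calls and once with the native pipe. It assumes the dplyr package is installed and uses mtcars, a dataset that ships with R. The 100-horsepower cutoff is an arbitrary choice for the example.

```r
library(dplyr)

# mtcars ships with R: one row per car, one column per variable.
# Nested calls have to be read from the inside out.
nested <- summarize(
  group_by(filter(mtcars, hp > 100), cyl),
  mean_mpg = mean(mpg)
)

# The same steps with the native pipe (R >= 4.1) read top to bottom.
piped <- mtcars |>
  filter(hp > 100) |>               # keep cars with more than 100 horsepower
  group_by(cyl) |>                  # one group per cylinder count
  summarize(mean_mpg = mean(mpg))   # one summary row per group

piped
```

Both versions compute the same table; the piped form simply lists the steps in the order they happen. Replacing |> with %>% should give the same result here.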

Key Packages in the Tidyverse:

These packages are all loaded when you run library(tidyverse).

ggplot2 – for data visualization
dplyr – for data manipulation (e.g., filter, select, mutate, summarize)
tidyr – for tidying and reshaping data (e.g., pivoting, separating columns)
readr – for reading rectangular data (like CSV files)
tibble – a modern version of data frames
purrr – for functional programming and iteration
stringr – for string (text) processing
forcats – for working with categorical variables (factors)
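
To give a sense of how a few of these packages work together, here is a minimal sketch that builds a small tibble and reshapes it into tidy form with tidyr. The table, its column names, and its values are made up for illustration only.

```r
library(tibble)
library(tidyr)

# Hypothetical "wide" table: the years are spread across column names,
# so the variable "year" is hidden in the header.
wide <- tribble(
  ~site, ~`2022`, ~`2023`,
  "A",   10,      12,
  "B",    7,       9
)

# pivot_longer() gathers the year columns so that each row is one
# site-year observation and each variable has its own column.
tidy <- wide |>
  pivot_longer(
    cols = c(`2022`, `2023`),
    names_to = "year",
    values_to = "count"
  )

tidy
```

The reshaped table has one row per site and year, which is the layout that dplyr and ggplot2 functions generally expect.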

Exploratory Data Analysis (EDA) is the process of examining and understanding your dataset before doing formal modeling or hypothesis testing. The goal is to summarize the main characteristics of the data, often using visual methods and descriptive statistics. EDA helps identify patterns, detect outliers, test assumptions, and spot errors or missing values.

EDA in R is often done using tidyverse packages, especially dplyr for data manipulation and ggplot2 for visualization. You begin by inspecting the structure of the dataset using functions like glimpse() or summary() to get an overview of variable types and basic statistics. To identify missing data, is.na() combined with colSums() is commonly used. For understanding individual variables, dplyr functions like summarize() and count() help describe numeric and categorical variables, respectively. When exploring relationships or distributions, ggplot2 is the go-to tool: histograms and density plots are used for single numeric variables, bar plots for categorical counts, and boxplots or scatterplots for comparing variables. Throughout EDA, you can use facet_wrap() in ggplot2 to create multi-panel plots that show how patterns differ across groups. Together, these tools provide a flexible and coherent framework for getting to know your data before moving on to modeling or inference.
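
The first steps described above might look like the following sketch. It assumes dplyr is installed and uses airquality, a dataset that ships with R and contains some missing values. The choice of the Ozone and Month columns is only for illustration.

```r
library(dplyr)

# Inspect structure and basic statistics.
glimpse(airquality)   # column types and the first few values
summary(airquality)   # min, quartiles, mean, max, and NA count per column

# Count missing values in each column.
missing_per_column <- colSums(is.na(airquality))
missing_per_column

# Describe a numeric variable; na.rm = TRUE skips the missing values.
ozone_stats <- airquality |>
  summarize(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    sd_ozone   = sd(Ozone, na.rm = TRUE)
  )

# Count rows per level of a grouping variable.
rows_per_month <- airquality |>
  count(Month)

rows_per_month
```

Checking for missing values early matters because functions such as mean() return NA by default when any input value is missing.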

Chapter 7 of the first edition of the free ebook R for Data Science gives an introduction to EDA in R with many code examples (https://r4ds.had.co.nz/exploratory-data-analysis.html). And don't forget about the cheat sheets! https://rstudio.github.io/cheatsheets/html/data-transformation.html

ggplot2 is THE tool for data visualization in R. It is part of the tidyverse and is built around the principles of the Grammar of Graphics, which allows you to create complex plots by layering components such as data, aesthetics, and geometric objects. This approach encourages users to think about the structure of a plot rather than just the end result, making it easier to build, customize, and understand visualizations.

To begin visualizing data with ggplot2, you first define the dataset and the aesthetic mappings, that is, which variables go on the x- and y-axes, and optionally, which variables define color, size, shape, or other properties. Then, you add geoms (geometric objects) to represent the data, such as geom_point() for scatterplots, geom_bar() for bar charts, or geom_boxplot() for comparing distributions. You can further enhance your plots with faceting (using facet_wrap() or facet_grid()), which creates multiple subplots based on the values of a categorical variable. This is especially helpful when comparing groups. ggplot2 also makes it easy to add layers like trend lines (for example, with geom_smooth()), annotations, and custom themes to improve the clarity and aesthetics of your plots.
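
The layering described above can be sketched as follows, using the mpg dataset that ships with ggplot2. The particular variables, the linear trend line, and the theme are arbitrary choices for the example, not recommendations.

```r
library(ggplot2)

# Start with the data and the aesthetic mappings, then add layers with +.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +           # scatterplot, colored by car class
  geom_smooth(method = "lm", se = FALSE) +   # straight trend line per panel
  facet_wrap(~ drv) +                        # one panel per drive type
  labs(
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    color = "Car class"
  ) +
  theme_minimal()

p  # printing the object draws the plot
```

Because the plot is an ordinary R object, you can build it up step by step and add or remove a layer without rewriting the rest.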

Because it integrates well with dplyr, ggplot2 fits naturally into a tidyverse workflow, allowing for seamless transitions from data wrangling to visualization.

The free ebook ggplot2: Elegant Graphics for Data Analysis (3e), https://ggplot2-book.org/, is an excellent resource for learning more about how to use ggplot2. And don't forget about the cheat sheets! https://rstudio.github.io/cheatsheets/html/data-visualization.html

This work is openly licensed via CC BY-NC-SA 4.0. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) this material under the following terms:

You must give appropriate credit, provide a link to the license, and indicate if changes were made.
You may not use the material for commercial purposes.
If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.