Research Guides: R and RStudio in Digital Scholarship: Exploratory Data Analysis (EDA)

Exploratory Data Analysis with dplyr

Exploratory Data Analysis (EDA) is the process of examining and understanding your dataset before doing formal modeling or hypothesis testing. The goal is to summarize the main characteristics of the data, often using visual methods and descriptive statistics. EDA helps identify patterns, detect outliers, test assumptions, and spot errors or missing values.

EDA in R is often done using tidyverse packages, especially dplyr for data manipulation and ggplot2 for visualization. You begin by inspecting the structure of the dataset with functions like glimpse() (from dplyr) or summary() (from base R) to get an overview of variable types and basic statistics. To count missing values in each column, a common base R idiom is colSums(is.na(data)). For understanding individual variables, the dplyr functions summarize() and count() help describe numeric and categorical variables, respectively: summarize() computes statistics such as the mean or median, while count() tallies how often each category occurs.

When exploring relationships or distributions, ggplot2 is the go-to tool: histograms and density plots are used for single numeric variables, bar plots for categorical counts, boxplots for comparing a numeric variable across groups, and scatterplots for comparing two numeric variables. Throughout EDA, you can use facet_wrap() in ggplot2 to create multi-panel plots that show how patterns differ across groups. Together, these tools provide a flexible and coherent framework for getting to know your data before moving on to modeling or inference.
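The workflow described above can be sketched in a few lines. This is a minimal illustration, not a prescribed recipe: it assumes dplyr and ggplot2 are installed, and it uses the built-in airquality dataset as a stand-in for your own data. The object names (aq, missing_counts, ozone_summary, month_counts) are chosen here for illustration only.

```r
library(dplyr)
library(ggplot2)

# airquality ships with R (datasets package); it is an assumed example
# dataset here, chosen because it contains missing values.
aq <- as_tibble(airquality)

# 1. Inspect structure: variable types and basic statistics
glimpse(aq)
summary(aq)

# 2. Count missing values in each column
missing_counts <- colSums(is.na(aq))
missing_counts

# 3. Describe a numeric variable with summarize()
ozone_summary <- aq %>%
  summarize(
    n            = n(),
    n_missing    = sum(is.na(Ozone)),
    mean_ozone   = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_ozone     = sd(Ozone, na.rm = TRUE)
  )
ozone_summary

# ...and a categorical (grouping) variable with count()
month_counts <- aq %>% count(Month)
month_counts

# 4. Visualize: distribution of one numeric variable, one panel per month
ggplot(aq, aes(x = Ozone)) +
  geom_histogram(bins = 20, na.rm = TRUE) +
  facet_wrap(~ Month)

# Compare a numeric variable across groups
ggplot(aq, aes(x = factor(Month), y = Ozone)) +
  geom_boxplot(na.rm = TRUE)

# Relationship between two numeric variables
ggplot(aq, aes(x = Temp, y = Ozone)) +
  geom_point(na.rm = TRUE)
```

To adapt this to your own data, replace airquality with your data frame and swap in your own column names. The na.rm = TRUE arguments tell the summary functions and plot layers to drop missing values silently, so it is worth checking the missing-value counts from step 2 first.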

Chapter 7 of the free ebook R for Data Science (first edition; the same topic is Chapter 10 in the second edition) gives an introduction to EDA in R with many code examples. And don't forget about the cheat sheets!