Introduction
Exploratory data analysis (EDA) is an essential first step towards determining the validity of your data and should be performed throughout the data pipeline. However, EDA is often performed too late or not at all. The R programming language, particularly when used through the RStudio IDE, is a widely used open source platform for data analysis and data visualization, thanks to its extensive variety of packages and an attentive community devoted to data analysis. Consequently, there are several exploratory data analysis packages, each of which has its own pros and cons.
Here, we use the dlookr package to conduct preliminary exploratory data analysis aimed at diagnosing any major issues with an imported data set. dlookr offers a clean and straightforward way to uncover issues such as outliers and missing data, and to generate summary statistical reports.
What is Exploratory Data Analysis?
Exploratory data analysis is a statistical approach to analyzing data sets that investigates and summarizes their main characteristics, often through statistical graphics and other data visualization methods.
What are Some Important Data Set Characteristics?
There are several characteristics that are arguably important, but we will only consider those covered in this workshop series. Let’s start with the fundamentals that will help guide us.
Diagnostics
When importing data sets, it is important to consider the characteristics of the data columns, rows, and individual cells.
Variables
Name of each variable
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Age_group |
---|---|---|---|---|---|---|---|---|---|
6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | Middle |
1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | Middle |
8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | Middle |
1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | Young |
0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | Middle |
5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 | Young |
Types
Data type of each variable
variables | types |
---|---|
Pregnancies | integer |
Glucose | integer |
BloodPressure | integer |
SkinThickness | integer |
Insulin | integer |
BMI | numeric |
DiabetesPedigreeFunction | numeric |
Age | integer |
Outcome | integer |
Age_group | factor |
Numerical: Continuous
Measurable numbers that are fractional or decimal and cannot be counted (e.g., time, height, weight)
Numerical: Discrete
Countable whole numbers or integers (e.g., number of successes or failures)
Categorical: Nominal
Labeling variables without any order or quantitative value (e.g., hair color, nationality)
Categorical: Ordinal
Where there is a hierarchical order along a scale (e.g., ranks, letter grades, age groups)
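As a minimal sketch of how variable names and types can be inspected, the base R snippet below rebuilds a few columns of the example table by hand. The object names `diabetes`, `var_types`, and `age_ordinal` are illustrative assumptions, and the ordering of the age groups (Young before Middle) is assumed from the labels rather than stated in the data.

```r
# A few columns of the six example rows shown above, typed in by hand.
diabetes <- data.frame(
  Pregnancies = c(6L, 1L, 8L, 1L, 0L, 5L),
  Glucose     = c(148L, 85L, 183L, 89L, 137L, 116L),
  BMI         = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  Outcome     = c(1L, 0L, 1L, 0L, 1L, 0L),
  Age_group   = factor(c("Middle", "Middle", "Middle", "Young", "Middle", "Young"))
)

# Name of each variable
names(diabetes)

# Data type of each variable (first class only, to get one label per column)
var_types <- vapply(diabetes, function(x) class(x)[1], character(1))
var_types

# An ordinal variable can be stored as an ordered factor.
# Assumption: only the two levels seen in these rows, with Young < Middle.
age_ordinal <- factor(diabetes$Age_group,
                      levels = c("Young", "Middle"),
                      ordered = TRUE)

# Ordered factors support comparisons along their scale.
age_ordinal < "Middle"
```

Integer columns here correspond to discrete numerical data, `numeric` columns to continuous data, and factors to categorical data, with `ordered = TRUE` marking an ordinal scale.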
Missing Values (NAs)
Cells, rows, or columns without data
Missing percent: percentage of missing values
Unique count: number of unique values
Unique rate: unique count / total number of observations
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Age_group |
---|---|---|---|---|---|---|---|---|---|
6 | 148 | 72 | NA | 0 | NA | 0.627 | NA | 1 | Middle |
1 | 85 | 66 | 29 | 0 | NA | 0.351 | 31 | 0 | Middle |
8 | NA | 64 | 0 | 0 | 23.3 | 0.672 | NA | 1 | Middle |
1 | 89 | 66 | 23 | NA | 28.1 | 0.167 | NA | 0 | NA |
NA | 137 | NA | 35 | 168 | 43.1 | NA | 33 | 1 | NA |
5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | NA | Young |
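The missing-value measures above can be sketched in base R as follows, using three columns of the table with NAs. The names `diabetes_na` and `diagnosis` are illustrative assumptions. In this sketch `NA` counts as a unique value of its own, which may or may not match how a given package defines the unique count.

```r
# Three columns of the example table with missing values shown above.
diabetes_na <- data.frame(
  Glucose   = c(148L, 85L, NA, 89L, 137L, 116L),
  BMI       = c(NA, NA, 23.3, 28.1, 43.1, 25.6),
  Age_group = factor(c("Middle", "Middle", "Middle", NA, NA, "Young"))
)

n_obs <- nrow(diabetes_na)

diagnosis <- data.frame(
  variables       = names(diabetes_na),
  # Number and percentage of NA cells per column
  missing_count   = colSums(is.na(diabetes_na)),
  missing_percent = 100 * colMeans(is.na(diabetes_na)),
  # Number of distinct values per column (NA is counted as one value here)
  unique_count    = vapply(diabetes_na, function(x) length(unique(x)), integer(1)),
  row.names       = NULL
)

# Unique rate: unique count / total number of observations
diagnosis$unique_rate <- diagnosis$unique_count / n_obs
diagnosis
```

A unique rate close to 1 suggests an identifier-like or continuous variable, while a low rate suggests a categorical one.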
Summary Statistics
Above we described some properties of data. However, you will need to know some descriptive characteristics of your data before you can move forward. Enter summary statistics.
Summary statistics allow you to summarize large amounts of information about your data as quickly as possible.
Central Tendency
Measuring a central property of your data. Some examples you’ve probably heard of are:
Mean: Average value
Median: Middle value
Mode: Most common value
Notice, however, that all measures of central tendency can be pretty similar, such as in the top panel. This will become important when we discuss data transformations in Chapter 3.
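To make the three measures concrete, here is a small base R sketch using the Glucose and Pregnancies values from the example rows. Base R has no built-in function for the statistical mode (`mode()` returns the storage type instead), so `stat_mode()` below is a hypothetical helper written for illustration only.

```r
# Values taken from the six example rows shown earlier.
glucose     <- c(148, 85, 183, 89, 137, 116)
pregnancies <- c(6, 1, 8, 1, 0, 5)

# Mean: average value
glucose_mean <- mean(glucose)

# Median: middle value of the sorted data
glucose_median <- median(glucose)

# Mode: most common value(s). Hypothetical helper, not part of base R.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}
pregnancies_mode <- stat_mode(pregnancies)

c(mean = glucose_mean, median = glucose_median)
pregnancies_mode
```

If several values tie for the highest count, `stat_mode()` returns all of them, so it is worth checking the length of the result on real data.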
Statistical Dispersion
Measure of data variability, scatter, or spread. Some examples you may have heard of:
Standard deviation (SD): The amount of variation that occurs in a set of values.
Interquartile range (IQR): The difference between the 75th and 25th percentiles
Outliers: Values below \(Q_1 - 1.5 * IQR\) or above \(Q_3 + 1.5 * IQR\), where \(Q_1\) and \(Q_3\) are the 25th and 75th percentiles
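The sketch below computes these dispersion measures in base R for the DiabetesPedigreeFunction values from the example rows. It assumes the common convention of flagging values more than 1.5 * IQR below the 25th percentile or above the 75th percentile, and it relies on the default quantile method in R; other quantile definitions can shift the fences slightly. The names `pedigree`, `lower_fence`, and `upper_fence` are illustrative.

```r
# DiabetesPedigreeFunction values from the six example rows.
pedigree <- c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201)

# Standard deviation
pedigree_sd <- sd(pedigree)

# 25th and 75th percentiles and the interquartile range
q   <- quantile(pedigree, probs = c(0.25, 0.75))
iqr <- IQR(pedigree)

# Fences at 1.5 * IQR beyond the quartiles (assumed convention)
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers <- pedigree[pedigree < lower_fence | pedigree > upper_fence]
outliers
```

With only six rows this is purely illustrative; on the full data set the quartiles, and therefore the flagged values, would differ.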
Distribution Shape
Measures describing the shape of a distribution, usually compared to a normal distribution (bell curve)
Skewness: The asymmetry of the distribution
Kurtosis: The tailedness of the distribution
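Base R does not ship skewness or kurtosis functions, so the following is a minimal sketch based on simple moment formulas. The helpers `skewness_moment()` and `kurtosis_excess()` are hypothetical names, and packages may use bias-corrected estimators that give slightly different numbers.

```r
# Moment-based skewness: third central moment scaled by variance^(3/2).
# Zero for a symmetric sample, positive for a long right tail.
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: fourth central moment scaled by variance^2, minus 3,
# so that a normal distribution scores approximately 0.
kurtosis_excess <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# DiabetesPedigreeFunction values from the six example rows.
pedigree <- c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201)

skewness_moment(pedigree)  # positive: one large value stretches the right tail
kurtosis_excess(pedigree)
```

A perfectly symmetric sample such as `1:5` gives a skewness of 0 with these helpers, which is a quick sanity check before applying them to real columns.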
Statistical Dependence (Correlation)
Measure of the statistical association between two random variables. Correlation is often used as a first approximation of causality, but note that correlation \(\neq\) causation.
- Correlations are calculated between numerical variables, but you can also compare numerical variables across categories (see the first plot above).
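As a rough illustration, the base R sketch below computes correlations between two numerical columns from the example rows and then compares a numerical variable across a categorical one. The variable names are illustrative, and six rows are far too few to draw conclusions from; the point is only to show the mechanics.

```r
# Columns taken from the six example rows shown earlier.
glucose   <- c(148, 85, 183, 89, 137, 116)
bmi       <- c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)
age_group <- factor(c("Middle", "Middle", "Middle", "Young", "Middle", "Young"))

# Pearson correlation (linear association) between two numerical variables
pearson <- cor(glucose, bmi)

# Spearman correlation (rank-based, monotonic association)
spearman <- cor(glucose, bmi, method = "spearman")

# Comparing a numerical variable across categories: mean Glucose per age group
group_means <- tapply(glucose, age_group, mean)

c(pearson = pearson, spearman = spearman)
group_means
```

Both coefficients lie between -1 and 1, and neither says anything about which variable, if any, drives the other.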