Introduction

Image Credit: Choonghyun Ryu

Exploratory data analysis (EDA) is an essential first step towards determining the validity of your data and should be performed throughout the data pipeline. However, EDA is often performed too late or not at all. The R programming language, commonly used through the RStudio IDE, is a widely used open source platform for data analysis and data visualization, thanks to its extensive variety of packages and an attentive community devoted to data analysis. Consequently, there are several exploratory data analysis packages, each of which has its own pros and cons.

Here, we use the dlookr package to conduct preliminary exploratory data analysis aimed at diagnosing any major issues with an imported data set. dlookr offers a clean and straightforward way to uncover issues such as outliers and missing data, and to produce summary statistical reports.

What is Exploratory Data Analysis?

Exploratory data analysis is a statistical approach to analyzing data sets that investigates and summarizes their main characteristics, often through statistical graphics and other data visualization methods.


What are Some Important Data Set Characteristics?

There are several characteristics that are arguably important, but we will only consider those covered in this workshop series. Let’s start with the fundamentals that will help guide us.

Diagnostics

When importing data sets, it is important to consider characteristics about the data columns, rows, and individual cells.


Variables

Name of each variable

Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI   DiabetesPedigreeFunction  Age  Outcome  Age_group
6            148      72             35             0        33.6  0.627                     50   1        Middle
1            85       66             29             0        26.6  0.351                     31   0        Middle
8            183      64             0              0        23.3  0.672                     32   1        Middle
1            89       66             23             94       28.1  0.167                     21   0        Young
0            137      40             35             168      43.1  2.288                     33   1        Middle
5            116      74             0              0        25.6  0.201                     30   0        Young

Types

Data type of each variable

variables                 types
Pregnancies               integer
Glucose                   integer
BloodPressure             integer
SkinThickness             integer
Insulin                   integer
BMI                       numeric
DiabetesPedigreeFunction  numeric
Age                       integer
Outcome                   integer
Age_group                 factor
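As a minimal sketch of how the names and types above could be checked in base R, the block below rebuilds a few columns of the example rows by hand. The object names `dataset` and `variable_types` are illustrative assumptions, not part of the workshop material.

```r
# A few columns from the example rows above, typed in by hand for illustration
dataset <- data.frame(
  Pregnancies = c(6L, 1L, 8L, 1L, 0L, 5L),
  Glucose     = c(148L, 85L, 183L, 89L, 137L, 116L),
  BMI         = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  Age_group   = factor(c("Middle", "Middle", "Middle", "Young", "Middle", "Young"))
)

# Name of each variable
variable_names <- names(dataset)

# Data type of each variable: the first class reported for each column
variable_types <- data.frame(
  variables = variable_names,
  types     = unname(vapply(dataset, function(x) class(x)[1], character(1)))
)

print(variable_types)
```

Whole numbers written with the `L` suffix are stored as integers, decimal numbers as numeric, and labelled categories as factors.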

Numerical: Continuous

Measurable numbers that are fractional or decimal and cannot be counted (e.g., time, height, weight)

Numerical: Discrete

Countable whole numbers or integers (e.g., number of successes or failures)


Categorical: Nominal

Labeling variables without any order or quantitative value (e.g., hair color, nationality)

Categorical: Ordinal

Where there is a hierarchical order along a scale (e.g., ranks, letter grades, age groups)
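The difference between nominal and ordinal variables can be sketched in base R with factors. The hair color values are made up for illustration, and the third level `"Elderly"` is an assumption; only `"Young"` and `"Middle"` appear in the example rows.

```r
# Nominal: labels with no inherent order (hypothetical values)
hair_color <- factor(c("brown", "black", "blonde", "brown"))

# Ordinal: labels ordered along a scale
# ("Elderly" is an assumed extra level, not seen in the example rows)
age_group <- factor(
  c("Middle", "Middle", "Middle", "Young", "Middle", "Young"),
  levels  = c("Young", "Middle", "Elderly"),
  ordered = TRUE
)

is.ordered(hair_color)       # FALSE: no ordering among the labels
is.ordered(age_group)        # TRUE: levels follow Young < Middle < Elderly
age_group[1] > age_group[4]  # TRUE: "Middle" ranks above "Young"
```

Comparisons such as `>` are only meaningful for ordered factors; R returns `NA` with a warning if they are attempted on an unordered one.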

Missing Values (NAs)

Cells, rows, or columns without data

  • Missing percent: percentage of missing values.

  • Unique count: number of unique values.

  • Unique rate: rate of unique values, calculated as unique count / total number of observations.

Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI   DiabetesPedigreeFunction  Age  Outcome  Age_group
6            148      72             NA             0        NA    0.627                     NA   1        Middle
1            85       66             29             0        NA    0.351                     31   0        Middle
8            NA       64             0              0        23.3  0.672                     NA   1        Middle
1            89       66             23             NA       28.1  0.167                     NA   0        NA
NA           137      NA             35             168      43.1  NA                        33   1        NA
5            116      74             0              0        25.6  0.201                     30   NA       Young
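The missing percent, unique count, and unique rate defined above can be computed with a few lines of base R. This is a rough sketch on four columns of the table above; `dataset_na` and `missing_summary` are illustrative names, and this version treats `NA` as one distinct value when counting uniques, which is an assumption you may want to change.

```r
# Four columns from the table above, typed in by hand for illustration
dataset_na <- data.frame(
  Glucose   = c(148, 85, NA, 89, 137, 116),
  BMI       = c(NA, NA, 23.3, 28.1, 43.1, 25.6),
  Age       = c(NA, 31, NA, NA, 33, 30),
  Age_group = c("Middle", "Middle", "Middle", NA, NA, "Young")
)

n_obs <- nrow(dataset_na)

missing_summary <- data.frame(
  variables       = names(dataset_na),
  # Number and percentage of NA cells per column
  missing_count   = unname(colSums(is.na(dataset_na))),
  missing_percent = unname(colMeans(is.na(dataset_na)) * 100),
  # Distinct values per column (NA is counted as one distinct value here)
  unique_count    = unname(vapply(dataset_na,
                                  function(x) length(unique(x)),
                                  integer(1)))
)

# Unique rate: unique count / total number of observations
missing_summary$unique_rate <- missing_summary$unique_count / n_obs

print(missing_summary)
```

A column with a high missing percent, such as `Age` here with half its values missing, is a candidate for closer inspection before any further analysis.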

Summary Statistics

Above we described some properties of data. However, you will need to know some descriptive characteristics of your data before you can move forward. Enter summary statistics.

Summary statistics allow you to summarize large amounts of information about your data as quickly as possible.

Central Tendency

Measuring a central property of your data. Some examples you’ve probably heard of are:

  • Mean: Average value

  • Median: Middle value

  • Mode: Most common value

Notice, however, that all measures of central tendency can be pretty similar, such as in the top panel. This will become important when we discuss data transformations in Chapter 3.
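The three measures can be sketched in base R using two columns from the example rows. Base R has no built-in function for the statistical mode, so `stat_mode()` below is a hypothetical helper that returns the first most frequent value when there is a tie.

```r
glucose     <- c(148, 85, 183, 89, 137, 116)
pregnancies <- c(6, 1, 8, 1, 0, 5)

# Mean: average value
mean_glucose <- mean(glucose)

# Median: middle value of the sorted data
median_glucose <- median(glucose)

# Mode: most common value (hypothetical helper, not a base R function)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
mode_pregnancies <- stat_mode(pregnancies)

mean_glucose      # about 126.33
median_glucose    # 126.5
mode_pregnancies  # 1
```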

Statistical Dispersion

Measure of data variability, scatter, or spread. Some examples you may have heard of:

  • Standard deviation (SD): The amount of variation that occurs in a set of values.

  • Interquartile range (IQR): The difference between the 75th and 25th percentiles

  • Outliers: A value more than \(1.5 \times IQR\) below the 25th percentile or above the 75th percentile
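As a small sketch of these three measures, the block below applies base R to the DiabetesPedigreeFunction values from the example rows. The fences use R's default quantile method; other quantile definitions give slightly different cutoffs, so treat the thresholds as one reasonable convention rather than the only one.

```r
dpf <- c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201)

# Standard deviation: amount of variation around the mean
sd_dpf <- sd(dpf)

# Interquartile range: 75th percentile minus 25th percentile
quartiles <- quantile(dpf, probs = c(0.25, 0.75))
iqr_dpf   <- IQR(dpf)

# Outlier fences: 1.5 * IQR beyond the quartiles
lower_fence <- quartiles[["25%"]] - 1.5 * iqr_dpf
upper_fence <- quartiles[["75%"]] + 1.5 * iqr_dpf

# Values falling outside the fences
outliers <- dpf[dpf < lower_fence | dpf > upper_fence]
outliers  # 2.288
```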

Distribution Shape

Measures describing the shape of a distribution, usually compared to a normal distribution (bell curve)

  • Skewness: The asymmetry of the distribution

  • Kurtosis: The tailedness of the distribution
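Base R does not ship skewness or kurtosis functions, so the sketch below defines two hypothetical helpers using simple moment-based formulas. Packages may use slightly different estimators, so the exact numbers can differ from those in a package report.

```r
# Hypothetical helpers based on central moments (not base R functions)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: 0 for a normal distribution
moment_excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric <- c(1, 2, 3, 4, 5)
dpf       <- c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201)

moment_skewness(symmetric)         # 0: perfectly symmetric
moment_excess_kurtosis(symmetric)  # -1.3: flatter tails than a normal distribution
moment_skewness(dpf) > 0           # TRUE: long right tail from the 2.288 value
```

A positive skewness indicates a longer right tail, a negative one a longer left tail.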

Statistical Dependence (Correlation)

Measure of the statistical relationship between two random variables. Notably, correlations are often used as a stand-in for causality, but a correlation alone does not establish a causal link (see correlation \(\neq\) causation)

  • Correlations are calculated between numerical values, but you can compare numerical variables across categories (see the first plot above).
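A minimal base R sketch of both ideas, using columns from the example rows: a correlation between two numerical variables, and a numerical variable summarized across categories. With only six rows the coefficients are purely illustrative and should not be read as a finding about the data set.

```r
glucose   <- c(148, 85, 183, 89, 137, 116)
bmi       <- c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)
age_group <- factor(c("Middle", "Middle", "Middle", "Young", "Middle", "Young"))

# Correlation between two numerical variables
pearson_r    <- cor(glucose, bmi, method = "pearson")   # linear association
spearman_rho <- cor(glucose, bmi, method = "spearman")  # rank-based association

# Comparing a numerical variable across categories
glucose_by_group <- tapply(glucose, age_group, mean)
glucose_by_group  # Middle: 138.25, Young: 102.5
```

Both coefficients range from -1 to 1, and neither says anything about which variable, if any, drives the other.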