Introduction

Exploratory data analysis (EDA) is an essential first step towards determining the validity of your data and should be performed throughout the data pipeline. However, EDA is often performed too late or not at all. The R programming language, often used through the RStudio IDE, is a widely used open source platform for data analysis and data visualization, thanks to its extensive variety of packages and an attentive community devoted to data analysis. Consequently, there are several exploratory data analysis packages, each with its own pros and cons.

Here, we use the dlookr package to conduct preliminary exploratory data analysis aimed at diagnosing any major issues with an imported data set. dlookr offers a clean and straightforward way to uncover issues such as outliers and missing data, and to produce summary statistical reports.

What is Exploratory Data Analysis?

Exploratory data analysis is a statistical approach to analyzing data sets that investigates and summarizes their main characteristics, often through statistical graphics and other data visualization methods.

What are Some Important Data Set Characteristics?

There are several characteristics that are arguably important, but we will only consider those covered in this workshop series. Let’s start with the fundamentals that will help guide us.

Diagnostics

When importing data sets, it is important to consider characteristics about the data columns, rows, and individual cells.

Variables

Name of each variable

Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome	Age_group
6	148	72	35	0	33.6	0.627	50	1	Middle
1	85	66	29	0	26.6	0.351	31	0	Middle
8	183	64	0	0	23.3	0.672	32	1	Middle
1	89	66	23	94	28.1	0.167	21	0	Young
0	137	40	35	168	43.1	2.288	33	1	Middle
5	116	74	0	0	25.6	0.201	30	0	Young

Types

Data type of each variable

variables	types
Pregnancies	integer
Glucose	integer
BloodPressure	integer
SkinThickness	integer
Insulin	integer
BMI	numeric
DiabetesPedigreeFunction	numeric
Age	integer
Outcome	integer
Age_group	factor
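
As a minimal sketch of how such a table of types could be produced, the snippet below rebuilds a few columns of the example data by hand and asks base R for the class of each one. The object names `diabetes` and `var_types` are illustrative assumptions, not names used by the workshop.

```r
# A few columns from the first six rows of the example data, entered by hand.
diabetes <- data.frame(
  Pregnancies = c(6L, 1L, 8L, 1L, 0L, 5L),
  Glucose = c(148L, 85L, 183L, 89L, 137L, 116L),
  BMI = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  Age = c(50L, 31L, 32L, 21L, 33L, 30L),
  Age_group = factor(c("Middle", "Middle", "Middle", "Young", "Middle", "Young"))
)

# One row per variable, with the class R assigned to that column.
var_types <- data.frame(
  variables = names(diabetes),
  types = unname(vapply(diabetes, function(col) class(col)[1], character(1)))
)
print(var_types)
```

Whole numbers written with the `L` suffix are stored as integers, decimals as numeric, and labels wrapped in `factor()` as factors, which mirrors the types listed above.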

Numerical: Continuous

Measurable numbers that are fractional or decimal and cannot be counted (e.g., time, height, weight)

Numerical: Discrete

Countable whole numbers or integers (e.g., number of successes or failures)

Categorical: Nominal

Labeling variables without any order or quantitative value (e.g., hair color, nationality)

Categorical: Ordinal

Where there is a hierarchical order along a scale (e.g., ranks, letter grades, age groups)
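
In R, the nominal versus ordinal distinction is commonly expressed with unordered and ordered factors. The sketch below is illustrative only: the hair color values are made up, and the ordering Young < Middle is an assumption about how the age groups would be ranked.

```r
# Nominal: categories with no inherent order (illustrative values).
hair_color <- factor(c("brown", "black", "blond", "brown"))

# Ordinal: categories with an order; levels are listed from lowest to highest.
# The ordering Young < Middle is an assumption made for this example.
age_group <- factor(
  c("Middle", "Middle", "Middle", "Young", "Middle", "Young"),
  levels = c("Young", "Middle"),
  ordered = TRUE
)

is.ordered(hair_color)          # FALSE
is.ordered(age_group)           # TRUE

# Order comparisons are only meaningful for ordered factors.
younger <- age_group < "Middle"
print(younger)
```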

Missing Values (NAs)

Cells, rows, or columns without data

Missing percent: percentage of missing values
Unique count: number of unique values
Unique rate: unique count / total number of observations

Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome	Age_group
6	148	72	NA	0	NA	0.627	NA	1	Middle
1	85	66	29	0	NA	0.351	31	0	Middle
8	NA	64	0	0	23.3	0.672	NA	1	Middle
1	89	66	23	NA	28.1	0.167	NA	0	NA
NA	137	NA	35	168	43.1	NA	33	1	NA
5	116	74	0	0	25.6	0.201	30	NA	Young
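
The missing-value measures described above can be computed by hand in base R. The sketch below uses four columns of the table above; `diabetes_na`, `count_unique`, and `missing_summary` are hypothetical names, and excluding `NA` from the unique count is an assumption (a given package may count it differently).

```r
# Four columns of the table above, with NA marking missing cells.
diabetes_na <- data.frame(
  Glucose = c(148L, 85L, NA, 89L, 137L, 116L),
  BMI = c(NA, NA, 23.3, 28.1, 43.1, 25.6),
  Age = c(NA, 31L, NA, NA, 33L, 30L),
  Age_group = factor(c("Middle", "Middle", "Middle", NA, NA, "Young"))
)

n_obs <- nrow(diabetes_na)

# Hypothetical helper: number of distinct non-missing values (assumption: NA is not counted).
count_unique <- function(x) length(unique(x[!is.na(x)]))

missing_summary <- data.frame(
  variables = names(diabetes_na),
  missing_count = unname(colSums(is.na(diabetes_na))),
  unique_count = unname(vapply(diabetes_na, count_unique, integer(1)))
)
missing_summary$missing_percent <- 100 * missing_summary$missing_count / n_obs
missing_summary$unique_rate <- missing_summary$unique_count / n_obs
print(missing_summary)
```

For example, Age has three missing cells out of six observations, so its missing percent is 50.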

Summary Statistics

Above, we described some properties of data. However, you will need to know some descriptive characteristics of your data before you can move forward. Enter summary statistics.

Summary statistics allow you to summarize large amounts of information about your data as quickly as possible.

Central Tendency

Measuring a central property of your data. Some examples you’ve probably heard of are:

Mean: Average value
Median: Middle value
Mode: Most common value
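
As a small worked example, the three measures can be computed for the Pregnancies column of the first table. Base R provides `mean()` and `median()`, but its `mode()` function reports a storage type rather than the most common value, so `stat_mode` below is a hypothetical helper written for this sketch.

```r
# Pregnancies column from the first six rows of the example data.
pregnancies <- c(6, 1, 8, 1, 0, 5)

# Hypothetical helper: returns the most frequent value(s); ties return every tied value.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

mean(pregnancies)       # 3.5
median(pregnancies)     # 3
stat_mode(pregnancies)  # 1
```

Here the three measures differ (3.5, 3, and 1), which already hints that the values are not symmetrically distributed.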

Notice, however, that all values of central tendency can be pretty similar, such as in the top panel. This will become important when we discuss data transformations in Chapter 3.

Statistical Dispersion

Measure of data variability, scatter, or spread. Some examples you may have heard of:

Standard deviation (SD): How far values typically lie from the mean
Interquartile range (IQR): The difference between the 75th and 25th percentiles
Outliers: Values below \(Q_1 - 1.5 \times IQR\) or above \(Q_3 + 1.5 \times IQR\), where \(Q_1\) and \(Q_3\) are the 25th and 75th percentiles
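
The sketch below computes these measures for the DiabetesPedigreeFunction column of the first table, flagging outliers with the common rule of fences at 1.5 times the IQR below the 25th percentile and above the 75th percentile. It assumes R's default quantile definition; other definitions can shift the fences slightly. The variable names are illustrative.

```r
# DiabetesPedigreeFunction column from the first six rows of the example data.
dpf <- c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201)

# Standard deviation (sample SD, dividing by n - 1).
dpf_sd <- sd(dpf)

# 25th and 75th percentiles, using R's default quantile definition.
q <- quantile(dpf, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]  # equivalent to IQR(dpf)

# Fences for the 1.5 * IQR rule.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- dpf[dpf < lower_fence | dpf > upper_fence]
print(outliers)  # 2.288
```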

Distribution Shape

Measures describing the shape of a distribution, usually compared to a normal distribution (bell curve)

Skewness: The asymmetry of the distribution
Kurtosis: The tailedness of the distribution
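
Base R has no built-in skewness or kurtosis function, so the sketch below uses the simple moment-based definitions, under which a normal distribution has skewness 0 and kurtosis 3. The helper names are hypothetical, and packages often apply bias corrections or report excess kurtosis, so their results may differ from these.

```r
# BMI column from the first six rows of the example data.
bmi <- c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)

# Hypothetical helpers based on central moments (no bias correction).
central_moment <- function(x, k) mean((x - mean(x))^k)
skewness <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

skewness(bmi)  # positive: the longer tail is on the right
kurtosis(bmi)
```

A positive skewness indicates a longer right tail and a negative one a longer left tail; a perfectly symmetric sample gives 0.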

Statistical Dependence (Correlation)

Measure of the statistical relationship between two random variables. Notably, correlations are often used as a stand-in for causal relationships, but they only measure association (see correlation \(\neq\) causation)

Correlations are computed between numerical variables, but you can also compare numerical variables across categories (see the first plot above).
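
As a rough illustration with base R, the sketch below correlates two numerical columns from the first table and then compares a numerical column across the age groups. Six rows are far too few to draw conclusions, so the point is only the mechanics; the variable names are illustrative.

```r
# Columns from the first six rows of the example data.
glucose <- c(148, 85, 183, 89, 137, 116)
age <- c(50, 31, 32, 21, 33, 30)
age_group <- factor(c("Middle", "Middle", "Middle", "Young", "Middle", "Young"))

# Pearson measures linear association; Spearman measures monotonic (rank) association.
pearson_r <- cor(glucose, age, method = "pearson")
spearman_rho <- cor(glucose, age, method = "spearman")

# Comparing a numerical variable across categories: mean glucose per age group.
group_means <- tapply(glucose, age_group, mean)
print(group_means)
```

Both coefficients fall between -1 and 1, and neither says anything about which variable, if any, drives the other.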