Data7 Exploratory Data Analysis in MySQL

Purpose of Materials

Exploratory data analysis of a novel data set with MySQL

Objectives

Load and explore a data set with summary statistics
Diagnose and remove outliers
Diagnose missing values in a data set

Overview

Exploratory data analysis is an essential first step towards determining the validity of your data and should be performed throughout the data pipeline. However, EDA is often performed too late or not at all.

SQL (Structured Query Language) is a programming language for database management, which lets you store, retrieve, manage, and manipulate data tables within databases. Though SQL has limited mathematical capabilities it can be used to perform EDA. A major disadvantage however is that SQL cannot be used to perform statistical graphics and other data visualization methods. For this, I recommend either the R programming language, specifically through the RStudio IDE and ggplot2 from the tidyverse package suite, or Python, specifically the seaborn library.

Here, we utilize MySQL to conduct preliminary exploratory data analysis aimed at diagnosing any major issues with an imported data set. We introduce a clean and straightforward methodology to uncover issues such as data outliers, missing data, as well as summary statistical reports.

Required Setup

We first need to prepare our environment with the necessary libraries and start our MySQL connection. Note that you can use SQLite for some of these functions, but not all.

install.packages("RMySQL")

library(DBI)
library(RMySQL)
library(here)

SQL_Con <- dbConnect(RMySQL::MySQL(), dbname = 'data_diabetes_RStudio', username = 'root', host='localhost', password = 'Chism1154')

Examine and Describe a Data Set

Examine columns and data types
Summary statistics
Define box plots
Describe meta data

Examine a Data Set

We should always examine a dataset in our databases. Here we are looking at the first 5 rows of the dataset.

-- First 5 rows
SELECT * 
FROM diabetes
LIMIT 5;

5 records
Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
6	148	72	35	0	33.6	0.627	50	1
1	85	66	29	0	26.6	0.351	31	0
8	183	64	0	0	23.3	0.672	32	1
1	89	66	23	94	28.1	0.167	21	0
0	137	40	35	168	43.1	2.288	33	1

Describe your Data

We need to start by seeing what data types our columns actually are.

-- What are the properties of the data
DESCRIBE diabetes;

9 records
Field	Type	Null	Default
Pregnancies	int	YES	NA
Glucose	int	YES	NA
BloodPressure	int	YES	NA
SkinThickness	int	YES	NA
Insulin	int	YES	NA
BMI	double	YES	NA
DiabetesPedigreeFunction	double	YES	NA
Age	int	YES	NA
Outcome	int	YES	NA

Field: name of each variable
type: data type of each variable
Null: allows NULL values
Key: primary key that was used
Default: default value for the column
Extra: any additional information

Unique values

Next we can look at unique values within columns - e.g., Pregnancies and Age.

SELECT 
DISTINCT Pregnancies, Age
FROM diabetes;

Displaying records 1 - 10
Pregnancies	Age
6	50
1	31
8	32
1	21
0	33
5	30
3	26
10	29
2	53
8	54

Filtering

We can also filter data by qualifications.

SELECT *
FROM diabetes
WHERE Age >= 50
LIMIT 5;

5 records
Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
6	148	72	35	0	33.6	0.627	50	1
2	197	70	45	543	30.5	0.158	53	1
8	125	96	0	0	0.0	0.232	54	1
10	139	80	0	0	27.1	1.441	57	0
1	189	60	23	846	30.1	0.398	59	1

Sorting

We might want to sort columns in ascending (ASC) or descending (DESC) order.

SELECT Glucose, Age
FROM diabetes
ORDER BY Glucose ASC
LIMIT 10;

Displaying records 1 - 10
Glucose	Age
0	41
0	21
0	22
0	37
0	22
44	36
56	22
57	67
57	41
61	46

Summary Statistics of your Data

Numerical Variables

Our entire database is numerical data, but we will look at two of the numerical columns, Glucose and Insulin.

Note that the stat column will be called ‘Count’ but this isn’t a problem.

-- Summary statistics of our numerical columns
SELECT 'Count',
    count(Glucose) as Glucose,
    count(Insulin) as Insulin
FROM diabetes
UNION
SELECT 'Total',
    sum(Glucose) as Glucose,
    sum(Insulin) as Insulin
FROM diabetes
UNION
SELECT 'Mean',
    avg(Glucose),
    avg(Insulin)
FROM diabetes
UNION
SELECT 'Min',
    min(Glucose),
    min(Insulin)
FROM diabetes
UNION
SELECT 'Max',
    max(Glucose),
    max(Insulin)
FROM diabetes
UNION
SELECT 'Std. Dev.',
    STDDEV_SAMP(Glucose),
    STDDEV_SAMP(Insulin)
FROM diabetes
UNION
SELECT 'Variance',
    VAR_SAMP(Glucose),
    VAR_SAMP(Insulin)
FROM diabetes;

7 records
Count	Glucose	Insulin
Count	768.00000	768.00000
Total	92847.00000	61286.00000
Mean	120.89453	79.79948
Min	0.00000	0.00000
Max	199.00000	846.00000
Std. Dev.	31.97262	115.24400
Variance	1022.24831	13281.18008

Count: number of observations
Total: sum of all values in a columns
Mean: arithmetic mean (average value)
Min: minimum value
Max: maximum value
Std. Dev.: standard deviation of the data
Variance: variance of the data

Median and Percentiles

SQL does not have straightforward functions for percentiles (including the Median), so let’s write them!

Make 100 bins
Assign percentiles to bins
Select desired percentiles (25, 50, 75) = (Q1, Median, Q3)

WITH perc AS (SELECT Glucose, NTILE(100) OVER (ORDER BY Glucose) AS 'Percentile'
FROM diabetes)
SELECT Percentile, MAX(Glucose) as Glucose
FROM perc
GROUP BY Percentile
HAVING Percentile = 25 OR Percentile = 50 or Percentile = 75;

3 records
Percentile	Glucose
25	100
50	119
75	144

Categorical Variables

Our original data does not have a categorical column, so we will use diabetes_Age. In the Age_group column categories were created by a qualifier:

Young: Age <= 21

Middle: Age between 21 and 30

Elderly: Age > 30

SELECT Age_group,
       COUNT(*) AS 'Count',
       COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS 'Ratio'
FROM diabetes_Age
GROUP BY Age_group

3 records
Age_group	Count	Ratio
Middle	270	0.35156
Young	417	0.54297
Elderly	81	0.10547

Count: number of values in the column
Ratio: the number of observations over the total observations

Outliers

Values outside of \(1.5 * IQR\)

There are several numerical variables that have outliers above, let’s see what the data look like with and without them.

Outlier Detection

The most common method to detect outliers is by using the interquartile range:

\(1.5 * IQR\). Above we found the first and third quartiles for Glucose (Q1 = 100, Q3 = 144), therefore the IQR is 44.

Qualifier: \(1.5∗44=66\)

Lower limit: \(100-66 = 34\)

Upper limit: \(144 + 66 = 210\)

SELECT Glucose
FROM diabetes
WHERE Glucose < 210 AND Glucose > 34 
ORDER BY Glucose ASC;

Displaying records 1 - 10
Glucose
44
56
57
57
61
62
65
67
68
68

Missing Values (NAs)

Table showing the extent of NAs in columns containing them.

SELECT * 
FROM
diabetesNA
LIMIT 5;

5 records
Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
6	148	72	35	NA	33.6	0.627	50	NA
1	NA	NA	NA	0	NA	0.351	31	0
8	183	64	NA	0	23.3	0.672	32	1
NA	89	NA	23	NA	28.1	0.167	NA	0
NA	137	40	35	NA	43.1	2.288	33	1

However, “NA” is not NULL in SQL, so we will have to change this.

Create new table from diabetesNA.

CREATE TABLE diabetesNull
AS
SELECT * 
FROM diabetesNA;

Change 'NA' into NULL - Its possible to do this all in one argument, but I find that temperamental.

Pregnancies

UPDATE diabetesNull
SET
Pregnancies = NULL WHERE Pregnancies = 'NA';

Glucose

UPDATE diabetesNull
SET
Glucose = NULL WHERE Glucose = 'NA';

BloodPressure

UPDATE diabetesNull
SET
BloodPressure = NULL WHERE BloodPressure = 'NA';

SkinThickness

UPDATE diabetesNull
SET
SkinThickness = NULL WHERE SkinThickness = 'NA';

Insulin

UPDATE diabetesNull
SET
Insulin = NULL WHERE Insulin = 'NA';

BMI

UPDATE diabetesNull
SET
BMI = NULL WHERE BMI = 'NA';

DiabetesPedigreeFunction

UPDATE diabetesNull
SET
DiabetesPedigreeFunction = NULL WHERE DiabetesPedigreeFunction = 'NA';

Age

UPDATE diabetesNull
SET
Age = NULL WHERE Age = 'NA';

Outcome

UPDATE diabetesNull
SET
Outcome = NULL WHERE Outcome = 'NA';

Diagnose NAs

Now we can see true NAs! (They look the same in the Quarto render, but SQL treats them differently)

SELECT * 
FROM diabetesNULL

Displaying records 1 - 10
Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
6	148	72	35	NA	33.6	0.627	50	NA
1	NA	NA	NA	0	NA	0.351	31	0
8	183	64	NA	0	23.3	0.672	32	1
NA	89	NA	23	NA	28.1	0.167	NA	0
NA	137	40	35	NA	43.1	2.288	33	1
3	78	50	32	NA	31	0.248	26	1
NA	115	0	0	0	NA	0.134	NA	0
2	197	70	NA	543	30.5	0.158	53	NA
4	NA	NA	NA	0	NA	0.191	30	0
10	168	74	0	0	38	0.537	34	1

We can see the number of NULL values in each column.

SELECT 'NAs',
    SUM(CASE WHEN Glucose IS NULL THEN 1 ELSE 0 END) AS Glucose, 
    SUM(CASE WHEN Insulin IS NULL THEN 1 ELSE 0 END) AS Insulin
FROM diabetesNULL
UNION
SELECT 'NA Freq',
    SUM(CASE WHEN Glucose IS NULL THEN 1 ELSE 0 END) / SUM(CASE WHEN Glucose IS NULL THEN 0 ELSE 1 END) AS Glucose, 
    SUM(CASE WHEN Insulin IS NULL THEN 1 ELSE 0 END) / SUM(CASE WHEN Insulin IS NULL THEN 0 ELSE 1 END) AS Insulin
FROM diabetesNULL

2 records
NAs	Glucose	Insulin
NAs	168.0000	164.0000
NA Freq	0.4541	0.4385

Created: 09/21/2022 (G. Chism); Last update: 09/21/2022

CC BY-NC-SA