Data7 Exploratory Data Analysis in MySQL

Author

Greg Chism

Published

September 21, 2022

Purpose of Materials

Exploratory data analysis of a novel data set with MySQL


Objectives

  1. Load and explore a data set with summary statistics
  2. Diagnose and remove outliers
  3. Diagnose missing values in a data set

Overview

Exploratory data analysis is an essential first step towards determining the validity of your data and should be performed throughout the data pipeline. However, EDA is often performed too late or not at all.

SQL (Structured Query Language) is a programming language for database management, which lets you store, retrieve, manage, and manipulate data tables within databases. Though SQL has limited mathematical capabilities it can be used to perform EDA. A major disadvantage however is that SQL cannot be used to perform statistical graphics and other data visualization methods. For this, I recommend either the R programming language, specifically through the RStudio IDE and ggplot2 from the tidyverse package suite, or Python, specifically the seaborn library.

Here, we utilize MySQL to conduct preliminary exploratory data analysis aimed at diagnosing any major issues with an imported data set. We introduce a clean and straightforward methodology to uncover issues such as data outliers, missing data, as well as summary statistical reports.


Required Setup

We first need to prepare our environment with the necessary libraries and start our MySQL connection. Note that you can use SQLite for some of these functions, but not all.

install.packages("RMySQL")

library(DBI)
library(RMySQL)
library(here)

SQL_Con <- dbConnect(RMySQL::MySQL(), dbname = 'data_diabetes_RStudio', username = 'root', host='localhost', password = 'Chism1154')

Examine and Describe a Data Set

  • Examine columns and data types
  • Summary statistics
  • Define box plots
  • Describe meta data

Examine a Data Set

We should always examine a dataset in our databases. Here we are looking at the first 5 rows of the dataset.

-- First 5 rows
SELECT * 
FROM diabetes
LIMIT 5;
5 records
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
6 148 72 35 0 33.6 0.627 50 1
1 85 66 29 0 26.6 0.351 31 0
8 183 64 0 0 23.3 0.672 32 1
1 89 66 23 94 28.1 0.167 21 0
0 137 40 35 168 43.1 2.288 33 1

Describe your Data

We need to start by seeing what data types our columns actually are.

-- What are the properties of the data
DESCRIBE diabetes;
9 records
Field Type Null Key Default Extra
Pregnancies int YES NA
Glucose int YES NA
BloodPressure int YES NA
SkinThickness int YES NA
Insulin int YES NA
BMI double YES NA
DiabetesPedigreeFunction double YES NA
Age int YES NA
Outcome int YES NA
  • Field: name of each variable
  • type: data type of each variable
  • Null: allows NULL values
  • Key: primary key that was used
  • Default: default value for the column
  • Extra: any additional information

Unique values

Next we can look at unique values within columns - e.g., Pregnancies and Age.

SELECT 
DISTINCT Pregnancies, Age
FROM diabetes;
Displaying records 1 - 10
Pregnancies Age
6 50
1 31
8 32
1 21
0 33
5 30
3 26
10 29
2 53
8 54

Filtering

We can also filter data by qualifications.

SELECT *
FROM diabetes
WHERE Age >= 50
LIMIT 5;
5 records
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
6 148 72 35 0 33.6 0.627 50 1
2 197 70 45 543 30.5 0.158 53 1
8 125 96 0 0 0.0 0.232 54 1
10 139 80 0 0 27.1 1.441 57 0
1 189 60 23 846 30.1 0.398 59 1

Sorting

We might want to sort columns in ascending (ASC) or descending (DESC) order.

SELECT Glucose, Age
FROM diabetes
ORDER BY Glucose ASC
LIMIT 10;
Displaying records 1 - 10
Glucose Age
0 41
0 21
0 22
0 37
0 22
44 36
56 22
57 67
57 41
61 46

Summary Statistics of your Data

Numerical Variables

Our entire database is numerical data, but we will look at two of the numerical columns, Glucose and Insulin.

Note that the stat column will be called ‘Count’ but this isn’t a problem.

-- Summary statistics of our numerical columns
SELECT 'Count',
    count(Glucose) as Glucose,
    count(Insulin) as Insulin
FROM diabetes
UNION
SELECT 'Total',
    sum(Glucose) as Glucose,
    sum(Insulin) as Insulin
FROM diabetes
UNION
SELECT 'Mean',
    avg(Glucose),
    avg(Insulin)
FROM diabetes
UNION
SELECT 'Min',
    min(Glucose),
    min(Insulin)
FROM diabetes
UNION
SELECT 'Max',
    max(Glucose),
    max(Insulin)
FROM diabetes
UNION
SELECT 'Std. Dev.',
    STDDEV_SAMP(Glucose),
    STDDEV_SAMP(Insulin)
FROM diabetes
UNION
SELECT 'Variance',
    VAR_SAMP(Glucose),
    VAR_SAMP(Insulin)
FROM diabetes; 
7 records
Count Glucose Insulin
Count 768.00000 768.00000
Total 92847.00000 61286.00000
Mean 120.89453 79.79948
Min 0.00000 0.00000
Max 199.00000 846.00000
Std. Dev. 31.97262 115.24400
Variance 1022.24831 13281.18008
  • Count: number of observations

  • Total: sum of all values in a columns

  • Mean: arithmetic mean (average value)

  • Min: minimum value

  • Max: maximum value

  • Std. Dev.: standard deviation of the data

  • Variance: variance of the data


Median and Percentiles

SQL does not have straightforward functions for percentiles (including the Median), so let’s write them!

  • Make 100 bins

  • Assign percentiles to bins

  • Select desired percentiles (25, 50, 75) = (Q1, Median, Q3)

WITH perc AS (SELECT Glucose, NTILE(100) OVER (ORDER BY Glucose) AS 'Percentile'
FROM diabetes)
SELECT Percentile, MAX(Glucose) as Glucose
FROM perc
GROUP BY Percentile
HAVING Percentile = 25 OR Percentile = 50 or Percentile = 75;
3 records
Percentile Glucose
25 100
50 119
75 144

Categorical Variables

Our original data does not have a categorical column, so we will use diabetes_Age. In the Age_group column categories were created by a qualifier:

Young: Age <= 21

Middle: Age between 21 and 30

Elderly: Age > 30

SELECT Age_group,
       COUNT(*) AS 'Count',
       COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS 'Ratio'
FROM diabetes_Age
GROUP BY Age_group
3 records
Age_group Count Ratio
Middle 270 0.35156
Young 417 0.54297
Elderly 81 0.10547
  • Count: number of values in the column

  • Ratio: the number of observations over the total observations


Outliers

Values outside of \(1.5 * IQR\)

Boxplot showing the IQR, lower and upper quartiles, median, and outliers

Image Credit: CÉDRIC SCHERER


There are several numerical variables that have outliers above, let’s see what the data look like with and without them.


Outlier Detection

The most common method to detect outliers is by using the interquartile range:

\(1.5 * IQR\). Above we found the first and third quartiles for Glucose (Q1 = 100, Q3 = 144), therefore the IQR is 44.

Qualifier: \(1.5∗44=66\)

Lower limit: \(100-66 = 34\)

Upper limit: \(144 + 66 = 210\)

SELECT Glucose
FROM diabetes
WHERE Glucose < 210 AND Glucose > 34 
ORDER BY Glucose ASC;
Displaying records 1 - 10
Glucose
44
56
57
57
61
62
65
67
68
68

Missing Values (NAs)

Table showing the extent of NAs in columns containing them.

SELECT * 
FROM
diabetesNA
LIMIT 5;
5 records
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
6 148 72 35 NA 33.6 0.627 50 NA
1 NA NA NA 0 NA 0.351 31 0
8 183 64 NA 0 23.3 0.672 32 1
NA 89 NA 23 NA 28.1 0.167 NA 0
NA 137 40 35 NA 43.1 2.288 33 1

However, “NA” is not NULL in SQL, so we will have to change this.

  • Create new table from diabetesNA.
CREATE TABLE diabetesNull
AS
SELECT * 
FROM diabetesNA;
  • Change 'NA' into NULL - Its possible to do this all in one argument, but I find that temperamental.

Pregnancies

UPDATE diabetesNull
SET
Pregnancies = NULL WHERE Pregnancies = 'NA';

Glucose

UPDATE diabetesNull
SET
Glucose = NULL WHERE Glucose = 'NA';

BloodPressure

UPDATE diabetesNull
SET
BloodPressure = NULL WHERE BloodPressure = 'NA';

SkinThickness

UPDATE diabetesNull
SET
SkinThickness = NULL WHERE SkinThickness = 'NA';

Insulin

UPDATE diabetesNull
SET
Insulin = NULL WHERE Insulin = 'NA';

BMI

UPDATE diabetesNull
SET
BMI = NULL WHERE BMI = 'NA';

DiabetesPedigreeFunction

UPDATE diabetesNull
SET
DiabetesPedigreeFunction = NULL WHERE DiabetesPedigreeFunction = 'NA';

Age

UPDATE diabetesNull
SET
Age = NULL WHERE Age = 'NA';

Outcome

UPDATE diabetesNull
SET
Outcome = NULL WHERE Outcome = 'NA';

Diagnose NAs

Now we can see true NAs! (They look the same in the Quarto render, but SQL treats them differently)

SELECT * 
FROM diabetesNULL 
Displaying records 1 - 10
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
6 148 72 35 NA 33.6 0.627 50 NA
1 NA NA NA 0 NA 0.351 31 0
8 183 64 NA 0 23.3 0.672 32 1
NA 89 NA 23 NA 28.1 0.167 NA 0
NA 137 40 35 NA 43.1 2.288 33 1
3 78 50 32 NA 31 0.248 26 1
NA 115 0 0 0 NA 0.134 NA 0
2 197 70 NA 543 30.5 0.158 53 NA
4 NA NA NA 0 NA 0.191 30 0
10 168 74 0 0 38 0.537 34 1

We can see the number of NULL values in each column.

SELECT 'NAs',
    SUM(CASE WHEN Glucose IS NULL THEN 1 ELSE 0 END) AS Glucose, 
    SUM(CASE WHEN Insulin IS NULL THEN 1 ELSE 0 END) AS Insulin
FROM diabetesNULL
UNION
SELECT 'NA Freq',
    SUM(CASE WHEN Glucose IS NULL THEN 1 ELSE 0 END) / SUM(CASE WHEN Glucose IS NULL THEN 0 ELSE 1 END) AS Glucose, 
    SUM(CASE WHEN Insulin IS NULL THEN 1 ELSE 0 END) / SUM(CASE WHEN Insulin IS NULL THEN 0 ELSE 1 END) AS Insulin
FROM diabetesNULL
2 records
NAs Glucose Insulin
NAs 168.0000 164.0000
NA Freq 0.4541 0.4385

Created: 09/21/2022 (G. Chism); Last update: 09/21/2022

CC BY-NC-SA