```
install.packages("RMySQL")
library(DBI)
library(RMySQL)
library(here)
<- dbConnect(RMySQL::MySQL(), dbname = 'data_diabetes_RStudio', username = 'root', host='localhost', password = 'Chism1154') SQL_Con
```

# Data7 Exploratory Data Analysis in MySQL

## Purpose of Materials

**Exploratory data analysis of a novel data set with MySQL**

## Objectives

- Load and explore a data set with summary statistics
- Diagnose and remove outliers
- Diagnose missing values in a data set

## Overview

Exploratory data analysis is an essential first step towards determining the validity of your data and should be performed throughout the data pipeline. However, EDA is often performed too late or not at all.

SQL (Structured Query Language) is a programming language for database management, which lets you store, retrieve, manage, and manipulate data tables within databases. Though SQL has limited mathematical capabilities it can be used to perform EDA. A major disadvantage however is that SQL cannot be used to perform statistical graphics and other data visualization methods. For this, I recommend either the R programming language, specifically through the RStudio IDE and `ggplot2`

from the `tidyverse`

package suite, or Python, specifically the `seaborn`

library.

Here, we utilize MySQL to conduct preliminary exploratory data analysis aimed at diagnosing any major issues with an imported data set. We introduce a clean and straightforward methodology to uncover issues such as data outliers, missing data, as well as summary statistical reports.

## Required Setup

We first need to prepare our environment with the necessary libraries and start our MySQL connection. Note that you can use SQLite for some of these functions, but not all.

## Examine and Describe a Data Set

- Examine columns and data types
- Summary statistics
- Define box plots
- Describe meta data

### Examine a Data Set

We should always examine a dataset in our databases. Here we are looking at the first 5 rows of the dataset.

```
-- First 5 rows
SELECT *
FROM diabetes
LIMIT 5;
```

Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|

6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |

1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |

8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |

1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |

0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |

### Describe your Data

We need to start by seeing what data types our columns actually are.

```
-- What are the properties of the data
DESCRIBE diabetes;
```

Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|

Pregnancies | int | YES | NA | ||

Glucose | int | YES | NA | ||

BloodPressure | int | YES | NA | ||

SkinThickness | int | YES | NA | ||

Insulin | int | YES | NA | ||

BMI | double | YES | NA | ||

DiabetesPedigreeFunction | double | YES | NA | ||

Age | int | YES | NA | ||

Outcome | int | YES | NA |

`Field`

: name of each variable`type`

: data type of each variable`Null`

: allows NULL values`Key`

: primary key that was used`Default`

: default value for the column`Extra`

: any additional information

### Unique values

Next we can look at unique values within columns - e.g., Pregnancies and Age.

```
SELECT
DISTINCT Pregnancies, Age
FROM diabetes;
```

Pregnancies | Age |
---|---|

6 | 50 |

1 | 31 |

8 | 32 |

1 | 21 |

0 | 33 |

5 | 30 |

3 | 26 |

10 | 29 |

2 | 53 |

8 | 54 |

### Filtering

We can also filter data by qualifications.

```
SELECT *
FROM diabetes
WHERE Age >= 50
LIMIT 5;
```

Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|

6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |

2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |

8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |

10 | 139 | 80 | 0 | 0 | 27.1 | 1.441 | 57 | 0 |

1 | 189 | 60 | 23 | 846 | 30.1 | 0.398 | 59 | 1 |

### Sorting

We might want to sort columns in ascending (`ASC`

) or descending (`DESC`

) order.

```
SELECT Glucose, Age
FROM diabetes
ORDER BY Glucose ASC
LIMIT 10;
```

Glucose | Age |
---|---|

0 | 41 |

0 | 21 |

0 | 22 |

0 | 37 |

0 | 22 |

44 | 36 |

56 | 22 |

57 | 67 |

57 | 41 |

61 | 46 |

## Summary Statistics of your Data

### Numerical Variables

Our entire database is numerical data, but we will look at two of the numerical columns, `Glucose`

and `Insulin`

.

Note that the stat column will be called ‘Count’ but this isn’t a problem.

```
-- Summary statistics of our numerical columns
SELECT 'Count',
count(Glucose) as Glucose,
count(Insulin) as Insulin
FROM diabetes
UNION
SELECT 'Total',
sum(Glucose) as Glucose,
sum(Insulin) as Insulin
FROM diabetes
UNION
SELECT 'Mean',
avg(Glucose),
avg(Insulin)
FROM diabetes
UNION
SELECT 'Min',
min(Glucose),
min(Insulin)
FROM diabetes
UNION
SELECT 'Max',
max(Glucose),
max(Insulin)
FROM diabetes
UNION
SELECT 'Std. Dev.',
STDDEV_SAMP(Glucose),
STDDEV_SAMP(Insulin)
FROM diabetes
UNION
SELECT 'Variance',
VAR_SAMP(Glucose),
VAR_SAMP(Insulin)
FROM diabetes;
```

Count | Glucose | Insulin |
---|---|---|

Count | 768.00000 | 768.00000 |

Total | 92847.00000 | 61286.00000 |

Mean | 120.89453 | 79.79948 |

Min | 0.00000 | 0.00000 |

Max | 199.00000 | 846.00000 |

Std. Dev. | 31.97262 | 115.24400 |

Variance | 1022.24831 | 13281.18008 |

`Count`

: number of observations`Total`

: sum of all values in a columns`Mean`

: arithmetic mean (average value)`Min`

: minimum value`Max`

: maximum value`Std. Dev.`

: standard deviation of the data`Variance`

: variance of the data

**Median** and **Percentiles**

SQL does not have straightforward functions for percentiles (including the **Median**), so let’s write them!

Make 100 bins

Assign percentiles to bins

Select desired percentiles (25, 50, 75) = (Q1, Median, Q3)

```
WITH perc AS (SELECT Glucose, NTILE(100) OVER (ORDER BY Glucose) AS 'Percentile'
FROM diabetes)
SELECT Percentile, MAX(Glucose) as Glucose
FROM perc
GROUP BY Percentile
HAVING Percentile = 25 OR Percentile = 50 or Percentile = 75;
```

Percentile | Glucose |
---|---|

25 | 100 |

50 | 119 |

75 | 144 |

### Categorical Variables

Our original data does not have a categorical column, so we will use `diabetes_Age`

. In the `Age_group`

column categories were created by a qualifier:

`Young`

: Age <= 21

`Middle`

: Age between 21 and 30

`Elderly`

: Age > 30

```
SELECT Age_group,
COUNT(*) AS 'Count',
COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS 'Ratio'
FROM diabetes_Age
GROUP BY Age_group
```

Age_group | Count | Ratio |
---|---|---|

Middle | 270 | 0.35156 |

Young | 417 | 0.54297 |

Elderly | 81 | 0.10547 |

`Count`

: number of values in the column`Ratio`

: the number of observations over the total observations

## Outliers

Values outside of \(1.5 * IQR\)

There are several numerical variables that have outliers above, let’s see what the data look like with and without them.

### Outlier Detection

The most common method to detect outliers is by using the interquartile range:

\(1.5 * IQR\). Above we found the first and third quartiles for Glucose (Q1 = 100, Q3 = 144), therefore the IQR is 44.

Qualifier: \(1.5∗44=66\)

Lower limit: \(100-66 = 34\)

Upper limit: \(144 + 66 = 210\)

```
SELECT Glucose
FROM diabetes
WHERE Glucose < 210 AND Glucose > 34
ORDER BY Glucose ASC;
```

Glucose |
---|

44 |

56 |

57 |

57 |

61 |

62 |

65 |

67 |

68 |

68 |

## Missing Values (NAs)

Table showing the extent of NAs in columns containing them.

```
SELECT *
FROM
diabetesNALIMIT 5;
```

Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|

6 | 148 | 72 | 35 | NA | 33.6 | 0.627 | 50 | NA |

1 | NA | NA | NA | 0 | NA | 0.351 | 31 | 0 |

8 | 183 | 64 | NA | 0 | 23.3 | 0.672 | 32 | 1 |

NA | 89 | NA | 23 | NA | 28.1 | 0.167 | NA | 0 |

NA | 137 | 40 | 35 | NA | 43.1 | 2.288 | 33 | 1 |

However, “NA” is not NULL in SQL, so we will have to change this.

- Create new table from diabetesNA.

```
CREATE TABLE diabetesNull
AS
SELECT *
FROM diabetesNA;
```

- Change
`'NA'`

into`NULL`

- Its possible to do this all in one argument, but I find that temperamental.

**Pregnancies**

```
UPDATE diabetesNull
SET
= NULL WHERE Pregnancies = 'NA'; Pregnancies
```

**Glucose**

```
UPDATE diabetesNull
SET
= NULL WHERE Glucose = 'NA'; Glucose
```

**BloodPressure**

```
UPDATE diabetesNull
SET
= NULL WHERE BloodPressure = 'NA'; BloodPressure
```

**SkinThickness**

```
UPDATE diabetesNull
SET
= NULL WHERE SkinThickness = 'NA'; SkinThickness
```

**Insulin**

```
UPDATE diabetesNull
SET
= NULL WHERE Insulin = 'NA'; Insulin
```

**BMI**

```
UPDATE diabetesNull
SET
= NULL WHERE BMI = 'NA'; BMI
```

**DiabetesPedigreeFunction**

```
UPDATE diabetesNull
SET
= NULL WHERE DiabetesPedigreeFunction = 'NA'; DiabetesPedigreeFunction
```

**Age**

```
UPDATE diabetesNull
SET
= NULL WHERE Age = 'NA'; Age
```

**Outcome**

```
UPDATE diabetesNull
SET
= NULL WHERE Outcome = 'NA'; Outcome
```

### Diagnose NAs

Now we can see true NAs! (They look the same in the Quarto render, but SQL treats them differently)

```
SELECT *
FROM diabetesNULL
```

Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|

6 | 148 | 72 | 35 | NA | 33.6 | 0.627 | 50 | NA |

1 | NA | NA | NA | 0 | NA | 0.351 | 31 | 0 |

8 | 183 | 64 | NA | 0 | 23.3 | 0.672 | 32 | 1 |

NA | 89 | NA | 23 | NA | 28.1 | 0.167 | NA | 0 |

NA | 137 | 40 | 35 | NA | 43.1 | 2.288 | 33 | 1 |

3 | 78 | 50 | 32 | NA | 31 | 0.248 | 26 | 1 |

NA | 115 | 0 | 0 | 0 | NA | 0.134 | NA | 0 |

2 | 197 | 70 | NA | 543 | 30.5 | 0.158 | 53 | NA |

4 | NA | NA | NA | 0 | NA | 0.191 | 30 | 0 |

10 | 168 | 74 | 0 | 0 | 38 | 0.537 | 34 | 1 |

We can see the number of NULL values in each column.

```
SELECT 'NAs',
SUM(CASE WHEN Glucose IS NULL THEN 1 ELSE 0 END) AS Glucose,
SUM(CASE WHEN Insulin IS NULL THEN 1 ELSE 0 END) AS Insulin
FROM diabetesNULL
UNION
SELECT 'NA Freq',
SUM(CASE WHEN Glucose IS NULL THEN 1 ELSE 0 END) / SUM(CASE WHEN Glucose IS NULL THEN 0 ELSE 1 END) AS Glucose,
SUM(CASE WHEN Insulin IS NULL THEN 1 ELSE 0 END) / SUM(CASE WHEN Insulin IS NULL THEN 0 ELSE 1 END) AS Insulin
FROM diabetesNULL
```

NAs | Glucose | Insulin |
---|---|---|

NAs | 168.0000 | 164.0000 |

NA Freq | 0.4541 | 0.4385 |

Created: 09/21/2022 (G. Chism); Last update: 09/21/2022