# Install the csvkit
brew install csvkit
Data7 Exploratory Data Analysis in Unix Shell
Purpose of Materials
Exploratory data analysis of a novel data set with Bash Unix shell
Objectives
- Load and explore a data set with summary statistics
- Diagnose missing values in a data set
Overview
Exploratory data analysis is an essential first step towards determining the validity of your data and should be performed throughout the data pipeline. However, EDA is often performed too late or not at all.
Unix Shell is a command-line shell user interface for Unix-like operating systems. The shell is used by the operating system to control the execution of the system through shell scripts. Though Unix shell has limited mathematical capabilities it can be used to perform EDA. A major disadvantage however is that Unix shell cannot be used to perform statistical graphics and other data visualization methods. For this, I recommend either the R programming language, specifically through the RStudio IDE and ggplot2
from the tidyverse
package suite, or Python, specifically the seaborn
library. You can however use Unix shell to visualize data through R. To learn this I recommend reading Chapter 7 of the Data Science at the Command Line book.
Here, we utilize the Bash Unix Shell to conduct preliminary exploratory data analysis aimed at diagnosing any major issues within a data set stored in a directory. We introduce a clean and straightforward methodology to uncover issues such as data outliers, missing data, as well as summary statistical reports.
Required Setup
You will need to have homebrew
installed and then the csvkit
which is a powerful data exploration toolkit.
We first need to prepare our environment to explore a dataset. In our case, we will move to the directory housing diabetes.csv
.
Note that the Bash
code chunks will contain all code from previous chunks, where relevant, so that you can see the progression.
Listing files
ls
List files
ls
Data7_EDA_In_Shell.Rproj
Data7_EDA_In_Shell.html
Data7_EDA_In_Shell.qmd
Data7_EDA_In_Shell.rmarkdown
Data7_EDA_In_Shell_files
LICENSE
README.md
cover.png
data
Move to the data directory
cd data
change directory to data
cd data
ls
diabetes.csv
diabetesNA.csv
diabetes_Age.csv
diabetes.csv
Original Pima County Indians Diabetes dataset
diabetesNA.csv
Dataset with randomly added NA
values to 30% of rows in each column
diabetes_Age.csv
Dataset with an added Age_group
column (see below for qualifier)
Examine and Describe a Data Set
- Examine columns and data types
- Summary statistics
- Define box plots
- Describe meta data
Examine a Data Set
We should always examine a dataset in our databases. Here we are looking at the first 5 rows of the dataset.
Head
head
Head first 8 rows (default)
-n
Number, add a number to get the number of rows - e.g., 5 rows
# Move to the folder that contains our data
cd data
# First 5 rows
head -n 5 diabetes_Age.csv
"Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin","BMI","DiabetesPedigreeFunction","Age","Outcome","Age_group"
6,148,72,35,0,33.6,0.627,50,1,"Middle"
1,85,66,29,0,26.6,0.351,31,0,"Middle"
8,183,64,0,0,23.3,0.672,32,1,"Middle"
1,89,66,23,94,28.1,0.167,21,0,"Young"
Refined head
: line numbers
head
Head first 8 rows (default)
-n
Number, add a number to get the number of rows - e.g., 5 rows
|
Pipeline
nl
Line numbers
# Move to the folder that contains our data
cd data
# Line numbers for rows that wrap
head -n 5 diabetes_Age.csv | nl
1 "Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin","BMI","DiabetesPedigreeFunction","Age","Outcome","Age_group"
2 6,148,72,35,0,33.6,0.627,50,1,"Middle"
3 1,85,66,29,0,26.6,0.351,31,0,"Middle"
4 8,183,64,0,0,23.3,0.672,32,1,"Middle"
5 1,89,66,23,94,28.1,0.167,21,0,"Young"
Column names
csvcut
Filters and truncates CSV files, but assumes that the file is delimited by commas
-n
Number, add a number to get the number of rows - e.g., 5 rows
# Move to the folder that contains our data
cd data
# Examine the headers
csvcut -n diabetes_Age.csv
1: Pregnancies
2: Glucose
3: BloodPressure
4: SkinThickness
5: Insulin
6: BMI
7: DiabetesPedigreeFunction
8: Age
9: Outcome
10: Age_group
Data structure
csvcut
Filters and truncates CSV files, but assumes that the file is delimited by commas
csvlook
Renders a CSV to the command line in a Markdown-compatible, fixed-width format
|
Pipeline
csvlook
Renders a CSV to the command line in a Markdown-compatible, fixed-width format
head
Head first 8 rows (default)
# Move to the folder that contains our data
cd data
# Examine the dataset, print to fixed-width format
csvcut diabetes_Age.csv | csvlook | head
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Age_group |
| ----------- | ------- | ------------- | ------------- | ------- | ---- | ------------------------ | --- | ------- | --------- |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | True | Middle |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | False | Middle |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | True | Middle |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | False | Young |
| 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | True | Middle |
| 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | False | Young |
| 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | True | Young |
| 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | False | Young |
Describe your Data
We need to start by seeing what data types our columns actually are.
csvsql
SQL function to show SQL data types and whether there are NULL
values
# Move to the folder that contains our data
cd data
# SQL function to show data types and if there are NULL values
csvsql diabetes_Age.csv
CREATE TABLE "diabetes_Age" (
"Pregnancies" DECIMAL NOT NULL,
"Glucose" DECIMAL NOT NULL,
"BloodPressure" DECIMAL NOT NULL,
"SkinThickness" DECIMAL NOT NULL,
"Insulin" DECIMAL NOT NULL,
"BMI" DECIMAL NOT NULL,
"DiabetesPedigreeFunction" DECIMAL NOT NULL,
"Age" DECIMAL NOT NULL,
"Outcome" BOOLEAN NOT NULL,
"Age_group" VARCHAR NOT NULL
);
Unique values
Next we can look at the number of unique values in each column.
csvstat
Summary statistics toolkit (also see below)
--unique
Count of unique values
# Move to the folder that contains our data
cd data
# Number of unique values in each column
csvstat diabetes_Age.csv --unique
1. Pregnancies: 17
2. Glucose: 136
3. BloodPressure: 47
4. SkinThickness: 51
5. Insulin: 186
6. BMI: 248
7. DiabetesPedigreeFunction: 517
8. Age: 52
9. Outcome: 2
10. Age_group: 3
Filtering
csvgrep
Filter data to rows where certain columns contain a given value or match a regular expression.
-c
Columns
-i
Inverse match (does not equal/match)
-r
REGEX, a string to search for
"[1]"
Search for 1 values
|
Pipeline
csvlook
Renders a CSV to the command line in a Markdown-compatible, fixed-width format
head
Head first 8 rows (default)
# Move to the folder that contains our data
cd data
# Filter rows with values opposite of 1 in the Outcome column
csvgrep -c Outcome -i -r "[1]" diabetes_Age.csv | csvlook | head
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Age_group |
| ----------- | ------- | ------------- | ------------- | ------- | ---- | ------------------------ | --- | ------- | --------- |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | False | Middle |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | False | Young |
| 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | False | Young |
| 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | False | Young |
| 4 | 110 | 92 | 0 | 0 | 37.6 | 0.191 | 30 | False | Young |
| 10 | 139 | 80 | 0 | 0 | 27.1 | 1.441 | 57 | False | Elderly |
| 1 | 103 | 30 | 38 | 83 | 43.3 | 0.183 | 33 | False | Middle |
| 3 | 126 | 88 | 41 | 235 | 39.3 | 0.704 | 27 | False | Young |
awk
Filtering by numerical comparisons
-F
Filter
NR
Number of rows that the filter will go through at one time
|
Pipeline
csvlook
Renders a CSV to the command line in a Markdown-compatible, fixed-width format
head
Head first 8 rows (default)
# Move to the folder that contains our data
cd data
# Filter data based on the criteria: column 2 values greater than 50
< diabetes_Age.csv awk -F, 'NR==1 || ($2 > 50)' | csvcut | csvlook | head
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Age_group |
| ----------- | ------- | ------------- | ------------- | ------- | ---- | ------------------------ | --- | ------- | --------- |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | True | Middle |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | False | Middle |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | True | Middle |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | False | Young |
| 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | True | Middle |
| 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | False | Young |
| 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | True | Young |
| 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | False | Young |
Sorting
We might want to sort columns in ascending or descending order.
csvsort
Sort data by a qualifier, default is ascending
-c
Column
-r
Reverse (descending order)
|
Pipeline
csvcut
Filters and truncates CSV files, but assumes that the file is delimited by commas
csvlook
Renders a CSV to the command line in a Markdown-compatible, fixed-width format
head
Head first 8 rows (default)
# Move to the folder that contains our data
cd data
# Sort data by the Glucose column (ascending order)
csvsort -c 'Glucose' diabetes_Age.csv | csvcut | csvlook | head
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Age_group |
| ----------- | ------- | ------------- | ------------- | ------- | ---- | ------------------------ | --- | ------- | --------- |
| 1 | 0 | 48 | 20 | 0 | 24.7 | 0.140 | 22 | False | Young |
| 1 | 0 | 74 | 20 | 23 | 27.7 | 0.299 | 21 | False | Young |
| 1 | 0 | 68 | 35 | 0 | 32.0 | 0.389 | 22 | False | Young |
| 5 | 0 | 80 | 32 | 0 | 41.0 | 0.346 | 37 | True | Middle |
| 6 | 0 | 68 | 41 | 0 | 39.0 | 0.727 | 41 | True | Middle |
| 5 | 44 | 62 | 0 | 0 | 25.0 | 0.587 | 36 | False | Middle |
| 2 | 56 | 56 | 28 | 45 | 24.2 | 0.332 | 22 | False | Young |
| 9 | 57 | 80 | 37 | 0 | 32.8 | 0.096 | 41 | False | Middle |
Summary Statistics of your Data
Numerical Variables
Our entire database is numerical data, but we will look at two of the numerical columns, Glucose
and Insulin
.
csvstat
Summary statistics for all columns
|
Pipeline
head
Head first 8 rows (default)
-n
Number, add a number to get the number of rows - e.g., 34 rows
# Move to the folder that contains our data
cd data
# Summary statistics, showing 34 rows of the output
csvstat diabetes_Age.csv | head -n 34
1. "Pregnancies"
Type of data: Number
Contains null values: False
Unique values: 17
Smallest value: 0
Largest value: 17
Sum: 2,953
Mean: 3.845
Median: 3
StDev: 3.37
Most common values: 1 (135x)
0 (111x)
2 (103x)
3 (75x)
4 (68x)
2. "Glucose"
Type of data: Number
Contains null values: False
Unique values: 136
Smallest value: 0
Largest value: 199
Sum: 92,847
Mean: 120.895
Median: 117
StDev: 31.973
Most common values: 100 (17x)
99 (17x)
125 (14x)
111 (14x)
106 (14x)
Type of data
: Data typeContains null values
: Presence ofNull
valuesUnique values
: Number of unique valuesSmallest value
: minimum valueLargest value
: maximum valueSum
: sum of the entire column (numerical only)Mean
: arithmetic meanMedian
: middle valueStDev
: standard deviationMost common values
: frequency of 5 most common values
Specific statistics
You can also look at individual statistics (such as unique above)
Note: Outcome
and Age_group
have no mean values since they are boolean
(true/false) and varchar
(character) data types.
--mean
Only show arithmetic mean
# Move to the folder that contains our data
cd data
# Mean of all numerical data columns
csvstat diabetes_Age.csv --mean
1. Pregnancies: 3.845
2. Glucose: 120.895
3. BloodPressure: 69.105
4. SkinThickness: 20.536
5. Insulin: 79.799
6. BMI: 31.993
7. DiabetesPedigreeFunction: 0.472
8. Age: 33.241
9. Outcome: None
10. Age_group: None
Categorical Variables
diabetes_Age
has a categorical Age_group
column categories were created by a qualifier:
Young
: Age <= 21
Middle
: Age between 21 and 30
Elderly
: Age > 30
-c
Column
# Move to the folder that contains our data
cd data
# Summary statistics of the Age_group column
csvstat diabetes_age.csv -c Age_group
10. "Age_group"
Type of data: Text
Contains null values: False
Unique values: 3
Longest value: 7 characters
Most common values: Young (417x)
Middle (270x)
Elderly (81x)
Row count: 768
All of the summary statistics are the same as above except:
Longest value
: number of characters in the longest character value
Missing Values (NAs)
We will investigate NA values in the diabetesNA.csv
dataset.
# Move to the folder that contains our data
cd data
# Summary statistics, showing 34 rows of the output
csvstat diabetesNA.csv | head -n 34
1. "Pregnancies"
Type of data: Number
Contains null values: True (excluded from calculations)
Unique values: 17
Smallest value: 0
Largest value: 17
Sum: 2,087
Mean: 3.879
Median: 3
StDev: 3.384
Most common values: None (230x)
1 (89x)
0 (78x)
2 (73x)
3 (57x)
2. "Glucose"
Type of data: Number
Contains null values: True (excluded from calculations)
Unique values: 128
Smallest value: 0
Largest value: 199
Sum: 64,622
Mean: 120.115
Median: 117
StDev: 31.834
Most common values: None (230x)
125 (12x)
100 (12x)
99 (12x)
102 (11x)
Contains null values
This will be true, but note that these are excluded from calculations
Filtering out NAs
csvgrep
Filter data to rows where certain columns contain a given value or match a regular expression.
-c
Columns
-i
Inverse match (does not equal/match)
-r
REGEX, a string to search for
"[NA]"
Search for NA character values
|
Pipeline
head
Head first 8 rows (default)
nl
Line numbers
# Move to the folder that contains our data
cd data
# Filter data from the Glucose that are not NAs
csvgrep -c Glucose -i -r "[NA]" diabetesNA.csv | csvcut | csvlook | head
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
| ----------- | ------- | ------------- | ------------- | ------- | ---- | ------------------------ | --- | ------- |
| 6 | 148 | 72 | 35 | | 33.6 | 0.627 | 50 | |
| 8 | 183 | 64 | | 0 | 23.3 | 0.672 | 32 | True |
| | 89 | | 23 | | 28.1 | 0.167 | | False |
| | 137 | 40 | 35 | | 43.1 | 2.288 | 33 | True |
| | 116 | | | 0 | 25.6 | | 30 | False |
| 3 | 78 | 50 | 32 | | 31.0 | 0.248 | 26 | True |
| | 115 | 0 | 0 | 0 | | 0.134 | | False |
| 2 | 197 | 70 | | 543 | 30.5 | 0.158 | 53 | |
Notice how NAs are blank in the output
Created: 09/23/2022 (G. Chism); Last update: 09/23/2022