Data7 Exploratory Data Analysis in Unix Shell

Author

Greg Chism

Published

September 22, 2022

Purpose of Materials

Exploratory data analysis of a novel data set with Bash Unix shell


Objectives

  1. Load and explore a data set with summary statistics
  2. Diagnose missing values in a data set

Overview

Exploratory data analysis is an essential first step towards determining the validity of your data and should be performed throughout the data pipeline. However, EDA is often performed too late or not at all.

Unix Shell is a command-line shell user interface for Unix-like operating systems. The shell is used by the operating system to control the execution of the system through shell scripts. Though Unix shell has limited mathematical capabilities it can be used to perform EDA. A major disadvantage however is that Unix shell cannot be used to perform statistical graphics and other data visualization methods. For this, I recommend either the R programming language, specifically through the RStudio IDE and ggplot2 from the tidyverse package suite, or Python, specifically the seaborn library. You can however use Unix shell to visualize data through R. To learn this I recommend reading Chapter 7 of the Data Science at the Command Line book.

Here, we utilize the Bash Unix Shell to conduct preliminary exploratory data analysis aimed at diagnosing any major issues within a data set stored in a directory. We introduce a clean and straightforward methodology to uncover issues such as data outliers, missing data, as well as summary statistical reports.


Required Setup

You will need to have homebrew installed and then the csvkit which is a powerful data exploration toolkit.

# Install the csvkit
brew install csvkit 

We first need to prepare our environment to explore a dataset. In our case, we will move to the directory housing diabetes.csv.

Note that the Bash code chunks will contain all code from previous chunks, where relevant, so that you can see the progression.

Listing files

ls List files

ls
Data7_EDA_In_Shell.Rproj
Data7_EDA_In_Shell.html
Data7_EDA_In_Shell.qmd
Data7_EDA_In_Shell.rmarkdown
Data7_EDA_In_Shell_files
LICENSE
README.md
cover.png
data

Move to the data directory

cd data change directory to data

cd data 
ls
diabetes.csv
diabetesNA.csv
diabetes_Age.csv

diabetes.csv Original Pima County Indians Diabetes dataset

diabetesNA.csv Dataset with randomly added NA values to 30% of rows in each column

diabetes_Age.csv Dataset with an added Age_group column (see below for qualifier)


Examine and Describe a Data Set

  • Examine columns and data types
  • Summary statistics
  • Define box plots
  • Describe meta data

Examine a Data Set

We should always examine a dataset in our databases. Here we are looking at the first 5 rows of the dataset.

Refined head: line numbers

head Head first 8 rows (default)

-n Number, add a number to get the number of rows - e.g., 5 rows

| Pipeline

nl Line numbers

# Move to the folder that contains our data
cd data

# Line numbers for rows that wrap
head -n 5 diabetes_Age.csv | nl
     1  "Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin","BMI","DiabetesPedigreeFunction","Age","Outcome","Age_group"
     2  6,148,72,35,0,33.6,0.627,50,1,"Middle"
     3  1,85,66,29,0,26.6,0.351,31,0,"Middle"
     4  8,183,64,0,0,23.3,0.672,32,1,"Middle"
     5  1,89,66,23,94,28.1,0.167,21,0,"Young"

Column names

csvcut Filters and truncates CSV files, but assumes that the file is delimited by commas

-n Number, add a number to get the number of rows - e.g., 5 rows

# Move to the folder that contains our data
cd data

# Examine the headers
csvcut -n diabetes_Age.csv
  1: Pregnancies
  2: Glucose
  3: BloodPressure
  4: SkinThickness
  5: Insulin
  6: BMI
  7: DiabetesPedigreeFunction
  8: Age
  9: Outcome
 10: Age_group

Data structure

csvcut Filters and truncates CSV files, but assumes that the file is delimited by commas

csvlook Renders a CSV to the command line in a Markdown-compatible, fixed-width format

| Pipeline

csvlook Renders a CSV to the command line in a Markdown-compatible, fixed-width format

head Head first 8 rows (default)

# Move to the folder that contains our data
cd data

# Examine the dataset, print to fixed-width format
csvcut diabetes_Age.csv | csvlook | head 
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin |  BMI | DiabetesPedigreeFunction | Age | Outcome | Age_group |
| ----------- | ------- | ------------- | ------------- | ------- | ---- | ------------------------ | --- | ------- | --------- |
|           6 |     148 |            72 |            35 |       0 | 33.6 |                    0.627 |  50 |    True | Middle    |
|           1 |      85 |            66 |            29 |       0 | 26.6 |                    0.351 |  31 |   False | Middle    |
|           8 |     183 |            64 |             0 |       0 | 23.3 |                    0.672 |  32 |    True | Middle    |
|           1 |      89 |            66 |            23 |      94 | 28.1 |                    0.167 |  21 |   False | Young     |
|           0 |     137 |            40 |            35 |     168 | 43.1 |                    2.288 |  33 |    True | Middle    |
|           5 |     116 |            74 |             0 |       0 | 25.6 |                    0.201 |  30 |   False | Young     |
|           3 |      78 |            50 |            32 |      88 | 31.0 |                    0.248 |  26 |    True | Young     |
|          10 |     115 |             0 |             0 |       0 | 35.3 |                    0.134 |  29 |   False | Young     |

Describe your Data

We need to start by seeing what data types our columns actually are.

csvsql SQL function to show SQL data types and whether there are NULL values

# Move to the folder that contains our data
cd data

# SQL function to show data types and if there are NULL values
csvsql diabetes_Age.csv
CREATE TABLE "diabetes_Age" (
    "Pregnancies" DECIMAL NOT NULL, 
    "Glucose" DECIMAL NOT NULL, 
    "BloodPressure" DECIMAL NOT NULL, 
    "SkinThickness" DECIMAL NOT NULL, 
    "Insulin" DECIMAL NOT NULL, 
    "BMI" DECIMAL NOT NULL, 
    "DiabetesPedigreeFunction" DECIMAL NOT NULL, 
    "Age" DECIMAL NOT NULL, 
    "Outcome" BOOLEAN NOT NULL, 
    "Age_group" VARCHAR NOT NULL
);

Unique values

Next we can look at the number of unique values in each column.

csvstat Summary statistics toolkit (also see below)

--unique Count of unique values

# Move to the folder that contains our data
cd data

# Number of unique values in each column
csvstat diabetes_Age.csv --unique
  1. Pregnancies: 17
  2. Glucose: 136
  3. BloodPressure: 47
  4. SkinThickness: 51
  5. Insulin: 186
  6. BMI: 248
  7. DiabetesPedigreeFunction: 517
  8. Age: 52
  9. Outcome: 2
 10. Age_group: 3

Filtering

csvgrep Filter data to rows where certain columns contain a given value or match a regular expression.

-c Columns

-i Inverse match (does not equal/match)

-r REGEX, a string to search for

"[1]" Search for 1 values

| Pipeline

csvlook Renders a CSV to the command line in a Markdown-compatible, fixed-width format

head Head first 8 rows (default)

# Move to the folder that contains our data
cd data

# Filter rows with values opposite of 1 in the Outcome column
csvgrep -c Outcome -i -r "[1]" diabetes_Age.csv | csvlook | head
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin |  BMI | DiabetesPedigreeFunction | Age | Outcome | Age_group |
| ----------- | ------- | ------------- | ------------- | ------- | ---- | ------------------------ | --- | ------- | --------- |
|           1 |      85 |            66 |            29 |       0 | 26.6 |                    0.351 |  31 |   False | Middle    |
|           1 |      89 |            66 |            23 |      94 | 28.1 |                    0.167 |  21 |   False | Young     |
|           5 |     116 |            74 |             0 |       0 | 25.6 |                    0.201 |  30 |   False | Young     |
|          10 |     115 |             0 |             0 |       0 | 35.3 |                    0.134 |  29 |   False | Young     |
|           4 |     110 |            92 |             0 |       0 | 37.6 |                    0.191 |  30 |   False | Young     |
|          10 |     139 |            80 |             0 |       0 | 27.1 |                    1.441 |  57 |   False | Elderly   |
|           1 |     103 |            30 |            38 |      83 | 43.3 |                    0.183 |  33 |   False | Middle    |
|           3 |     126 |            88 |            41 |     235 | 39.3 |                    0.704 |  27 |   False | Young     |

awk Filtering by numerical comparisons

-F Filter

NR Number of rows that the filter will go through at one time

| Pipeline

csvlook Renders a CSV to the command line in a Markdown-compatible, fixed-width format

head Head first 8 rows (default)

# Move to the folder that contains our data
cd data

# Filter data based on the criteria: column 2 values greater than 50
< diabetes_Age.csv awk -F, 'NR==1 || ($2 > 50)' | csvcut | csvlook | head
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin |  BMI | DiabetesPedigreeFunction | Age | Outcome | Age_group |
| ----------- | ------- | ------------- | ------------- | ------- | ---- | ------------------------ | --- | ------- | --------- |
|           6 |     148 |            72 |            35 |       0 | 33.6 |                    0.627 |  50 |    True | Middle    |
|           1 |      85 |            66 |            29 |       0 | 26.6 |                    0.351 |  31 |   False | Middle    |
|           8 |     183 |            64 |             0 |       0 | 23.3 |                    0.672 |  32 |    True | Middle    |
|           1 |      89 |            66 |            23 |      94 | 28.1 |                    0.167 |  21 |   False | Young     |
|           0 |     137 |            40 |            35 |     168 | 43.1 |                    2.288 |  33 |    True | Middle    |
|           5 |     116 |            74 |             0 |       0 | 25.6 |                    0.201 |  30 |   False | Young     |
|           3 |      78 |            50 |            32 |      88 | 31.0 |                    0.248 |  26 |    True | Young     |
|          10 |     115 |             0 |             0 |       0 | 35.3 |                    0.134 |  29 |   False | Young     |

Sorting

We might want to sort columns in ascending or descending order.

csvsort Sort data by a qualifier, default is ascending

-c Column

-r Reverse (descending order)

| Pipeline

csvcut Filters and truncates CSV files, but assumes that the file is delimited by commas

csvlook Renders a CSV to the command line in a Markdown-compatible, fixed-width format

head Head first 8 rows (default)

# Move to the folder that contains our data
cd data

# Sort data by the Glucose column (ascending order)
csvsort -c 'Glucose' diabetes_Age.csv | csvcut | csvlook | head
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin |  BMI | DiabetesPedigreeFunction | Age | Outcome | Age_group |
| ----------- | ------- | ------------- | ------------- | ------- | ---- | ------------------------ | --- | ------- | --------- |
|           1 |       0 |            48 |            20 |       0 | 24.7 |                    0.140 |  22 |   False | Young     |
|           1 |       0 |            74 |            20 |      23 | 27.7 |                    0.299 |  21 |   False | Young     |
|           1 |       0 |            68 |            35 |       0 | 32.0 |                    0.389 |  22 |   False | Young     |
|           5 |       0 |            80 |            32 |       0 | 41.0 |                    0.346 |  37 |    True | Middle    |
|           6 |       0 |            68 |            41 |       0 | 39.0 |                    0.727 |  41 |    True | Middle    |
|           5 |      44 |            62 |             0 |       0 | 25.0 |                    0.587 |  36 |   False | Middle    |
|           2 |      56 |            56 |            28 |      45 | 24.2 |                    0.332 |  22 |   False | Young     |
|           9 |      57 |            80 |            37 |       0 | 32.8 |                    0.096 |  41 |   False | Middle    |

Summary Statistics of your Data

Numerical Variables

Our entire database is numerical data, but we will look at two of the numerical columns, Glucose and Insulin.

csvstat Summary statistics for all columns

| Pipeline

head Head first 8 rows (default)

-n Number, add a number to get the number of rows - e.g., 34 rows

# Move to the folder that contains our data
cd data

# Summary statistics, showing 34 rows of the output
csvstat diabetes_Age.csv | head -n 34
  1. "Pregnancies"

    Type of data:          Number
    Contains null values:  False
    Unique values:         17
    Smallest value:        0
    Largest value:         17
    Sum:                   2,953
    Mean:                  3.845
    Median:                3
    StDev:                 3.37
    Most common values:    1 (135x)
                           0 (111x)
                           2 (103x)
                           3 (75x)
                           4 (68x)

  2. "Glucose"

    Type of data:          Number
    Contains null values:  False
    Unique values:         136
    Smallest value:        0
    Largest value:         199
    Sum:                   92,847
    Mean:                  120.895
    Median:                117
    StDev:                 31.973
    Most common values:    100 (17x)
                           99 (17x)
                           125 (14x)
                           111 (14x)
                           106 (14x)
  • Type of data: Data type

  • Contains null values: Presence of Null values

  • Unique values: Number of unique values

  • Smallest value: minimum value

  • Largest value: maximum value

  • Sum: sum of the entire column (numerical only)

  • Mean: arithmetic mean

  • Median: middle value

  • StDev: standard deviation

  • Most common values: frequency of 5 most common values

Specific statistics

You can also look at individual statistics (such as unique above)

Note: Outcome and Age_group have no mean values since they are boolean (true/false) and varchar (character) data types.

--mean Only show arithmetic mean

# Move to the folder that contains our data
cd data

# Mean of all numerical data columns
csvstat diabetes_Age.csv --mean 
  1. Pregnancies: 3.845
  2. Glucose: 120.895
  3. BloodPressure: 69.105
  4. SkinThickness: 20.536
  5. Insulin: 79.799
  6. BMI: 31.993
  7. DiabetesPedigreeFunction: 0.472
  8. Age: 33.241
  9. Outcome: None
 10. Age_group: None

Categorical Variables

diabetes_Age has a categorical Age_group column categories were created by a qualifier:

Young: Age <= 21

Middle: Age between 21 and 30

Elderly: Age > 30

-c Column

# Move to the folder that contains our data
cd data

# Summary statistics of the Age_group column
csvstat diabetes_age.csv -c Age_group
 10. "Age_group"

    Type of data:          Text
    Contains null values:  False
    Unique values:         3
    Longest value:         7 characters
    Most common values:    Young (417x)
                           Middle (270x)
                           Elderly (81x)

Row count: 768

All of the summary statistics are the same as above except:

  • Longest value: number of characters in the longest character value

Missing Values (NAs)

We will investigate NA values in the diabetesNA.csv dataset.

# Move to the folder that contains our data
cd data

# Summary statistics, showing 34 rows of the output
csvstat diabetesNA.csv | head -n 34
  1. "Pregnancies"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         17
    Smallest value:        0
    Largest value:         17
    Sum:                   2,087
    Mean:                  3.879
    Median:                3
    StDev:                 3.384
    Most common values:    None (230x)
                           1 (89x)
                           0 (78x)
                           2 (73x)
                           3 (57x)

  2. "Glucose"

    Type of data:          Number
    Contains null values:  True (excluded from calculations)
    Unique values:         128
    Smallest value:        0
    Largest value:         199
    Sum:                   64,622
    Mean:                  120.115
    Median:                117
    StDev:                 31.834
    Most common values:    None (230x)
                           125 (12x)
                           100 (12x)
                           99 (12x)
                           102 (11x)

Contains null values This will be true, but note that these are excluded from calculations

Filtering out NAs

csvgrep Filter data to rows where certain columns contain a given value or match a regular expression.

-c Columns

-i Inverse match (does not equal/match)

-r REGEX, a string to search for

"[NA]" Search for NA character values

| Pipeline

head Head first 8 rows (default)

nl Line numbers

# Move to the folder that contains our data
cd data

# Filter data from the Glucose that are not NAs
csvgrep -c Glucose -i -r "[NA]" diabetesNA.csv | csvcut | csvlook | head  
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin |  BMI | DiabetesPedigreeFunction | Age | Outcome |
| ----------- | ------- | ------------- | ------------- | ------- | ---- | ------------------------ | --- | ------- |
|           6 |     148 |            72 |            35 |         | 33.6 |                    0.627 |  50 |         |
|           8 |     183 |            64 |               |       0 | 23.3 |                    0.672 |  32 |    True |
|             |      89 |               |            23 |         | 28.1 |                    0.167 |     |   False |
|             |     137 |            40 |            35 |         | 43.1 |                    2.288 |  33 |    True |
|             |     116 |               |               |       0 | 25.6 |                          |  30 |   False |
|           3 |      78 |            50 |            32 |         | 31.0 |                    0.248 |  26 |    True |
|             |     115 |             0 |             0 |       0 |      |                    0.134 |     |   False |
|           2 |     197 |            70 |               |     543 | 30.5 |                    0.158 |  53 |         |

Notice how NAs are blank in the output


Created: 09/23/2022 (G. Chism); Last update: 09/23/2022

CC BY-NC-SA