# Sets the number of significant figures to two - e.g., 0.01
options(digits = 2)
# Required package for quick package downloading and loading
if (!require(pacman))
install.packages("pacman")
# Downloads and load required packages
::p_load(dlookr, # Exploratory data analysis
pacman# Needed for Box-Cox transformations
forecast, # HTML tables from R outputs
formattable, # Standardizes paths to data
here, # Alternative to formattable
kableExtra, # Needed to write HTML reports
knitr, # To generate NAs
missRanger, # Powerful data wrangling package suite tidyverse)
Transforming like a Data… Transformer
Purpose of this chapter
Using data transformation to correct non-normality in numerical data
Take-aways
- Load and explore a data set with publication quality tables
- Quickly diagnose non-normality in data
- Data transformation
- Prepare an HTML summary report showcasing data transformations
Required Setup
We first need to prepare our environment with the necessary packages
Load and Examine a Data Set
- Load data and view
- Examine columns and data types
- Examine data normality
- Describe properties of data
# Let's load a data set from the diabetes data set
<- read.csv(here("EDA_In_R_Book", "data", "diabetes.csv")) |>
dataset # Add a categorical group
mutate(Age_group = ifelse(Age >= 21 & Age <= 30, "Young",
ifelse(Age > 30 & Age <= 50, "Middle",
"Elderly")),
Age_group = fct_rev(Age_group))
# What does the data look like?
|>
dataset head() |>
formattable()
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Age_group |
---|---|---|---|---|---|---|---|---|---|
6 | 148 | 72 | 35 | 0 | 34 | 0.63 | 50 | 1 | Middle |
1 | 85 | 66 | 29 | 0 | 27 | 0.35 | 31 | 0 | Middle |
8 | 183 | 64 | 0 | 0 | 23 | 0.67 | 32 | 1 | Middle |
1 | 89 | 66 | 23 | 94 | 28 | 0.17 | 21 | 0 | Young |
0 | 137 | 40 | 35 | 168 | 43 | 2.29 | 33 | 1 | Middle |
5 | 116 | 74 | 0 | 0 | 26 | 0.20 | 30 | 0 | Young |
Data Normality
Normal distributions (bell curves) are a common data assumptions for many hypothesis testing statistics, in particular parametric statistics. Deviations from normality can either strongly skew the results or reduce the power to detect a significant statistical difference.
Here are the distribution properties to know and consider:
The mean, median, and mode are the same value.
Distribution symmetry at the mean.
Normal distributions can be described by the mean and standard deviation.
Here’s an example using the Glucose
column in our dataset
Describing Properties of our Data (Refined)
Skewness
The symmetry of the distribution
See Introduction 4.3 for more information about these values
|>
dataset select(Glucose, Insulin, BMI, SkinThickness) |>
describe() |>
select(described_variables, skewness) |>
formattable()
described_variables | skewness |
---|---|
Glucose | 0.17 |
Insulin | 2.27 |
BMI | -0.43 |
SkinThickness | 0.11 |
Note that we will remove the other percentiles to produce a cleaner output
describes_variables
: name of the column being describedskewness
: skewness
Testing Normality (Accelerated)
Q-Q plots
Testing overall normality of two columns
Testing normality of groups
Note that you can also use normality()
to run Shapiro-Wilk tests, but since this test is not viable at N < 20
, I recommend Q-Q plots.
Q-Q Plots
Plots of the quartiles of a target data set against the predicted quartiles from a normal distribution.
Notably, plot_normality()
will show you the logaritmic and skewed transformations (more below)
|>
dataset plot_normality(Glucose, Insulin, Age)
Normality within Groups
Looking within Age_group at the subgroup normality
Q-Q Plots
%>%
dataset group_by(Age_group) %>%
select(Glucose, Insulin) %>%
plot_normality()
Transforming Data
Your data could be more easily interpreted with a transformation, since not all relationships in nature follow a linear relationship - i.e., many biological phenomena follow a power law (or logarithmic curve), where they do not scale linearly.
We will try to transform the Insulin
column with through several approaches and discuss the pros and cons of each. First however, we will remove 0
values, because Insulin
values are impossible…
<- dataset |>
InsMod filter(Insulin > 0)
Square-root, Cube-root, and Logarithmic Transformations
Resolving Skewness using transform()
.
“sqrt”: square-root transformation. \(\sqrt x\) (moderate skew)
“log”: log transformation. \(log(x)\) (greater skew)
“log+1”: log transformation. \(log(x + 1)\). Used for values that contain 0.
“1/x”: inverse transformation. \(1/x\) (severe skew)
“x^2”: squared transformation. \(x^2\)
“x^3”: cubed transformation. \(x^3\)
We will compare the sqrt
, log+1
, 1/x
(inverse), x^2
, and x^3
transformations. Note that you would have to add a constant to use the log
transformation, so it is easier to use the log+1
instead. You however need to add a constant to both the sqrt
and 1/x
transformations because they don’t include zeros and will otherwise skew the results. Here we removed zeros a priori.
Square-root Transformation
<- transform(InsMod$Insulin, method = "sqrt")
sqrtIns
summary(sqrtIns)
* Resolving Skewness with sqrt
* Information of Transformation (before vs after)
Original Transformation
n 394.0 394.00
na 0.0 0.00
mean 155.5 11.75
sd 118.8 4.17
se_mean 6.0 0.21
IQR 113.8 5.05
skewness 2.2 1.01
kurtosis 6.4 1.46
p00 14.0 3.74
p01 18.0 4.24
p05 41.7 6.45
p10 50.3 7.09
p20 69.2 8.32
p25 76.2 8.73
p30 87.9 9.38
p40 105.0 10.25
p50 125.0 11.18
p60 145.8 12.07
p70 176.0 13.27
p75 190.0 13.78
p80 210.0 14.49
p90 292.4 17.10
p95 395.5 19.89
p99 580.5 24.09
p100 846.0 29.09
|>
sqrtIns plot()
Logarithmic (+1) Transformation
<- transform(InsMod$Insulin, method = "log+1")
Log1Ins
summary(Log1Ins)
* Resolving Skewness with log+1
* Information of Transformation (before vs after)
Original Transformation
n 394.0 394.000
na 0.0 0.000
mean 155.5 4.818
sd 118.8 0.691
se_mean 6.0 0.035
IQR 113.8 0.905
skewness 2.2 -0.088
kurtosis 6.4 0.237
p00 14.0 2.708
p01 18.0 2.944
p05 41.7 3.753
p10 50.3 3.938
p20 69.2 4.251
p25 76.2 4.347
p30 87.9 4.488
p40 105.0 4.663
p50 125.0 4.836
p60 145.8 4.989
p70 176.0 5.176
p75 190.0 5.252
p80 210.0 5.352
p90 292.4 5.682
p95 395.5 5.983
p99 580.5 6.366
p100 846.0 6.742
|>
Log1Ins plot()
Inverse Transformation
<- transform(InsMod$Insulin, method = "1/x")
InvIns
summary(InvIns)
* Resolving Skewness with 1/x
* Information of Transformation (before vs after)
Original Transformation
n 394.0 3.9e+02
na 0.0 0.0e+00
mean 155.5 1.1e-02
sd 118.8 9.0e-03
se_mean 6.0 4.6e-04
IQR 113.8 7.9e-03
skewness 2.2 3.2e+00
kurtosis 6.4 1.5e+01
p00 14.0 1.2e-03
p01 18.0 1.7e-03
p05 41.7 2.5e-03
p10 50.3 3.4e-03
p20 69.2 4.8e-03
p25 76.2 5.3e-03
p30 87.9 5.7e-03
p40 105.0 6.9e-03
p50 125.0 8.0e-03
p60 145.8 9.5e-03
p70 176.0 1.1e-02
p75 190.0 1.3e-02
p80 210.0 1.4e-02
p90 292.4 2.0e-02
p95 395.5 2.4e-02
p99 580.5 5.6e-02
p100 846.0 7.1e-02
|>
InvIns plot()
Squared Transformation
<- transform(InsMod$Insulin, method = "x^2")
SqrdIns
summary(SqrdIns)
* Resolving Skewness with x^2
* Information of Transformation (before vs after)
Original Transformation
n 394.0 3.9e+02
na 0.0 0.0e+00
mean 155.5 3.8e+04
sd 118.8 7.2e+04
se_mean 6.0 3.7e+03
IQR 113.8 3.0e+04
skewness 2.2 4.8e+00
kurtosis 6.4 3.1e+01
p00 14.0 2.0e+02
p01 18.0 3.2e+02
p05 41.7 1.7e+03
p10 50.3 2.5e+03
p20 69.2 4.8e+03
p25 76.2 5.8e+03
p30 87.9 7.7e+03
p40 105.0 1.1e+04
p50 125.0 1.6e+04
p60 145.8 2.1e+04
p70 176.0 3.1e+04
p75 190.0 3.6e+04
p80 210.0 4.4e+04
p90 292.4 8.5e+04
p95 395.5 1.6e+05
p99 580.5 3.4e+05
p100 846.0 7.2e+05
|>
SqrdIns plot()
Cubed Transformation
<- transform(InsMod$Insulin, method = "x^3")
CubeIns
summary(CubeIns)
* Resolving Skewness with x^3
* Information of Transformation (before vs after)
Original Transformation
n 394.0 3.9e+02
na 0.0 0.0e+00
mean 155.5 1.4e+07
sd 118.8 4.8e+07
se_mean 6.0 2.4e+06
IQR 113.8 6.4e+06
skewness 2.2 7.7e+00
kurtosis 6.4 7.7e+01
p00 14.0 2.7e+03
p01 18.0 5.8e+03
p05 41.7 7.2e+04
p10 50.3 1.3e+05
p20 69.2 3.3e+05
p25 76.2 4.4e+05
p30 87.9 6.8e+05
p40 105.0 1.2e+06
p50 125.0 2.0e+06
p60 145.8 3.1e+06
p70 176.0 5.5e+06
p75 190.0 6.9e+06
p80 210.0 9.3e+06
p90 292.4 2.5e+07
p95 395.5 6.2e+07
p99 580.5 2.0e+08
p100 846.0 6.1e+08
|>
CubeIns plot()
Box-cox Transformation
There are several transformations, each with it’s own “criteria”, and they don’t always fix extremely skewed data. Instead, you can just choose the Box-Cox transformation which searches for the the best lambda value that maximizes the log-likelihood (basically, what power transformation is best). The benefit is that you should have normally distributed data after, but the power relationship might be pretty abstract (i.e., what would a transformation of x^0.12 be interpreted as in your system?..)
<- transform(InsMod$Insulin, method = "Box-Cox")
BoxCoxIns
summary(BoxCoxIns)
* Resolving Skewness with Box-Cox
* Information of Transformation (before vs after)
Original Transformation
n 394.0 394.000
na 0.0 0.000
mean 155.5 3.011
sd 118.8 0.262
se_mean 6.0 0.013
IQR 113.8 0.335
skewness 2.2 -0.630
kurtosis 6.4 1.003
p00 14.0 2.027
p01 18.0 2.168
p05 41.7 2.588
p10 50.3 2.673
p20 69.2 2.808
p25 76.2 2.848
p30 87.9 2.904
p40 105.0 2.973
p50 125.0 3.037
p60 145.8 3.092
p70 176.0 3.157
p75 190.0 3.183
p80 210.0 3.216
p90 292.4 3.320
p95 395.5 3.409
p99 580.5 3.515
p100 846.0 3.610
|>
BoxCoxIns plot()
Produce an HTML Transformation Summary
transformation_web_report(dataset)