import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from skimpy import clean_columns
# Load the dataset
df = pd.read_csv('path_to_your_file.csv')
# Clean column names
df = clean_columns(df)
# Assign and remove NAs
df.replace({'.': np.nan, '': np.nan}, inplace = True)
df.dropna(inplace = True)
# Convert 'date' to datetime and set as index
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace = True)
# Plot the AQI values
df['aqi_value'].plot(title = 'AQI Time Series')
plt.ylabel('AQI Value')
plt.show()
Time series analysis
Objective:
Analyze AQI time series data to identify underlying patterns, trends, and seasonality.
Apply ARIMA models to forecast future AQI values.
Explore and apply detrending methods to examine the time series data without its trend component.
Investigate the seasonality in the data, understanding how AQI values change over different times of the year.
Prerequisites:
Software and Libraries: Ensure Python, Jupyter Notebook, and necessary libraries (pandas, matplotlib, statsmodels, pmdarima) are installed.
Datasets: Access to the provided dataset with date and aqi_value columns, among others, to perform the analysis.
Key Concepts:
Time Series Analysis
Time Series Data: Data points collected or recorded at specific time intervals.
Trend: The long-term movement in time series data, showing an increase or decrease in the data over time.
Seasonality: Regular patterns or cycles of fluctuations in time series data that occur due to seasonal factors.
ARIMA Modeling
ARIMA (AutoRegressive Integrated Moving Average): A popular statistical method for time series forecasting that models autocorrelation in the data and handles trend through differencing. Seasonal patterns require its seasonal extension (SARIMA).
Parameters (p, d, q):
p: The number of lag observations included in the model (AR part).
d: The degree of differencing required to make the time series stationary.
q: The size of the moving average window (MA part).
Stationarity and Differencing
Stationarity: A characteristic of a time series whose statistical properties (mean, variance) do not change over time.
Differencing: A method of transforming a time series to make it stationary by subtracting the previous observation from the current observation.
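To make the definition of differencing concrete, here is a minimal sketch on a synthetic series (the values are invented for illustration and are not taken from the AQI dataset). A series with a steady linear trend becomes constant after one round of differencing, which is the effect the d parameter relies on.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a linear upward trend (illustrative values only)
dates = pd.date_range("2022-01-01", periods=10, freq="D")
series = pd.Series(50 + 2.0 * np.arange(10), index=dates)

# First-order differencing: y_t - y_{t-1}; the first value has no predecessor,
# so it becomes NaN and is dropped
differenced = series.diff().dropna()

print(differenced.head())
```

In the exercise, the same call would be applied to df['aqi_value'] if the stationarity test suggests the series is not stationary.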
Detrending
- Detrending: The process of removing the trend component from a time series to analyze the cyclical and irregular components.
Seasonality Analysis
- Seasonal Decompose: A method to separate out the seasonal component from the time series data, allowing for analysis of specific patterns that repeat over fixed periods.
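The idea behind an additive decomposition can be sketched by hand with pandas. This is a simplified illustration on a synthetic, noise-free series with an assumed weekly period of 7 days; it is not the exact algorithm used by any library, and the numbers are made up.

```python
import numpy as np
import pandas as pd

period = 7  # assumed weekly cycle in daily data
n = 8 * period
t = np.arange(n)
pattern = np.array([3.0, 1.0, -2.0, -4.0, 0.0, 1.0, 1.0])  # sums to zero
dates = pd.date_range("2022-01-01", periods=n, freq="D")
y = pd.Series(40 + 0.1 * t + pattern[t % period], index=dates)

# Trend: centered moving average over one full period
# (the first and last 3 values are NaN because the window is incomplete)
trend = y.rolling(window=period, center=True).mean()

# Seasonal: average detrended value at each position within the period
detrended = y - trend
position = pd.Series(t % period, index=dates)
seasonal = detrended.groupby(position).transform("mean")

# Residual: what is left after removing the trend and seasonal parts
residual = y - trend - seasonal

print(seasonal.iloc[:period].round(2).tolist())
```

Because the synthetic series contains no noise, the residual is essentially zero wherever the trend is defined. Real AQI data will leave a visible residual component.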
Dataset:
The data this week comes from the EPA’s air quality measurements for the Tucson, AZ core-based statistical area (CBSA) in 2022.
We’ll use the dataset ad_aqi_tracker_data-2022.csv, which includes daily observations on air quality, along with multi-year averages.
Metadata for ad_aqi_tracker_data-2022.csv:
Variable | Class | Description |
---|---|---|
Date | DateTime | Date of observation |
AQI Values | int | Air quality index reading |
Main Pollutant | character | Primary pollutant at time of reading |
Site Name | character | Name of collection site |
Site ID | character | ID of collection site |
Source | double | Data source |
Note: You will have to change the data type for some columns to match the above.
(Source: https://www.airnow.gov/aqi-basics)
Question:
How can we apply ARIMA modeling to forecast future Air Quality Index (AQI) values based on historical data, and what insights can be gained from detrending and analyzing the seasonality in AQI time series data?
Step 1: Setup and Data Preprocessing
Load the dataset into a pandas DataFrame.
Convert the date column to datetime format and set it as the index of the DataFrame.
Convert the aqi_value column as needed.
Plot the aqi_value time series to visually inspect the data.
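The type conversions in this step could look like the sketch below. The small DataFrame is hypothetical (invented rows and a made-up site ID) and only mimics the shape of the raw file after column cleaning. The key assumption is that missing readings appear as "." strings, as in the NA-handling code at the top of this page.

```python
import pandas as pd

# Hypothetical rows mimicking the raw file after column cleaning
raw = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-02", "2022-01-03"],
    "aqi_value": ["42", ".", "57"],   # "." marks a missing reading
    "site_id": [1001, 1001, 1001],    # made-up ID for illustration
})

raw["date"] = pd.to_datetime(raw["date"])
# errors="coerce" turns unparseable entries such as "." into NaN
raw["aqi_value"] = pd.to_numeric(raw["aqi_value"], errors="coerce")
# IDs are labels rather than quantities, so store them as strings
raw["site_id"] = raw["site_id"].astype(str)

clean = raw.dropna(subset=["aqi_value"]).set_index("date")
print(clean.dtypes)
```

Converting aqi_value to a numeric type before plotting or modeling matters because a column read as text will not work with the decomposition and ARIMA steps that follow.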
Step 2: Time Series Decomposition
Use seasonal_decompose from the statsmodels package to decompose the time series into trend, seasonal, and residual components.
Plot the decomposed components to understand the underlying patterns.
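from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose the time series
# If the date index has no set frequency, seasonal_decompose needs an explicit
# period argument, e.g. period = 7 for a weekly cycle (an assumption; pick what fits the data)
decomposition = seasonal_decompose(df['aqi_value'], model = 'additive')
# Plot the decomposed components
decomposition.plot()
plt.show()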
Step 3: Testing for Stationarity
Perform an Augmented Dickey-Fuller (ADF) test to check the stationarity of the time series.
If the series is not stationary, apply differencing to make it stationary.
from statsmodels.tsa.stattools import adfuller
# Perform Augmented Dickey-Fuller test
result = adfuller(df['aqi_value'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
# Interpretation: a p-value above 0.05 means the unit-root null cannot be rejected
if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")
Step 4: ARIMA Model
Use the auto_arima function from the pmdarima package to identify the optimal parameters (p, d, q) for the ARIMA model.
Fit an ARIMA model with the identified parameters.
Plot the original vs. fitted values to assess the model’s performance.
from pmdarima import auto_arima
# Identify the optimal ARIMA model
auto_model = auto_arima(df['aqi_value'], start_p = 1, start_q = 1,
                        test = 'adf',           # use adftest to find optimal 'd'
                        max_p = 3, max_q = 3,   # maximum p and q
                        m = 1,                  # frequency of series
                        d = None,               # let model determine 'd'
                        seasonal = False,       # No Seasonality
                        start_P = 0,
                        D = 0,
                        trace = True,
                        error_action = 'ignore',
                        suppress_warnings = True,
                        stepwise = True)
print(auto_model.summary())
# Fit ARIMA model
model = auto_model.fit(df['aqi_value'])
# Plot original vs fitted values
df['fitted'] = model.predict_in_sample()
df[['aqi_value', 'fitted']].plot(title='Original vs. Fitted Values')
plt.show()
Step 5: Forecasting
Forecast AQI values for the next 30 days using the fitted ARIMA model.
Plot the forecasted values alongside the historical data to visualize the forecast.
# Forecast the next 30 days
forecast, conf_int = model.predict(n_periods = 30, return_conf_int = True)
# Dates for the 30 days after the last observation
# (slicing avoids the `closed` argument, which pandas 2.0 removed from date_range)
future_dates = pd.date_range(df.index[-1], periods = 31)[1:]
# Plot the forecast
plt.figure(figsize = (8, 6))
plt.plot(df.index, df['aqi_value'], label = 'Historical')
plt.plot(future_dates, forecast, label = 'Forecast')
plt.fill_between(future_dates, conf_int[:, 0], conf_int[:, 1], color = 'red', alpha = 0.3)
plt.title('AQI Forecast')
plt.legend()
plt.show()
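The forecast values need a date axis that begins the day after the last observation. A minimal sketch of building that axis, assuming daily data; the last date used here is a hypothetical stand-in for df.index[-1]:

```python
import pandas as pd

# Hypothetical last observation date; in the exercise this would be df.index[-1]
last_date = pd.Timestamp("2022-12-31")
horizon = 30  # number of days to forecast

# Start one day after the last observation so history and forecast do not overlap
future_dates = pd.date_range(last_date + pd.Timedelta(days=1), periods=horizon, freq="D")

print(future_dates[0], future_dates[-1])
```

Starting the range explicitly at the next day gives exactly one date per forecast step without relying on version-specific date_range arguments.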
Step 6: Detrending and Seasonality Analysis
Explore different detrending methods (e.g., subtracting a moving average, polynomial detrending) using the detrend_aqi and poly_trend columns.
Analyze seasonality patterns in the detrended data.
# Detrending using moving average
df['moving_avg'] = df['aqi_value'].rolling(window = 12).mean()
df['detrended'] = df['aqi_value'] - df['moving_avg']
# Plot detrended data (the first 11 values are NaN because the window is incomplete)
df[['detrended']].plot(title='Detrended AQI Time Series')
plt.show()
# Assuming seasonality was identified, you can further analyze it,
# for example, by averaging detrended values by month or another relevant period.
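Polynomial detrending and the monthly averaging mentioned above can be sketched as follows. The series is synthetic (a stand-in for df['aqi_value'] with an invented trend, annual cycle, and noise), and the polynomial degree of 2 is an assumption chosen for illustration. The names poly_trend and detrend_aqi follow the column names mentioned in this step.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['aqi_value']: one year of daily values
dates = pd.date_range("2022-01-01", "2022-12-31", freq="D")
t = np.arange(len(dates))
rng = np.random.default_rng(0)
aqi = pd.Series(
    45 + 0.02 * t + 10 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 3, len(t)),
    index=dates,
)

# Polynomial detrending: fit a degree-2 polynomial in time and subtract it
coeffs = np.polyfit(t, aqi.to_numpy(), deg=2)
poly_trend = pd.Series(np.polyval(coeffs, t), index=dates)
detrend_aqi = aqi - poly_trend

# Seasonality check: average the detrended values by calendar month
monthly_means = detrend_aqi.groupby(detrend_aqi.index.month).mean()
print(monthly_means.round(2))
```

Months whose average sits consistently above or below zero point to a seasonal pattern. With only one year of data, such a pattern cannot be separated from a one-off event, so treat it as suggestive rather than conclusive.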
Submission:
- Submit your Jupyter Notebook via the course’s learning management system, including your code, visualizations, and a brief discussion of your findings regarding the trend, seasonality, and forecast of AQI values.