Time Series Analysis - Energy consumption in Barcelona¶
Nowadays, studying Data Science and developing personal projects is easier than ever: there are many websites with free and interesting datasets out there. This is just one example. At the Barcelona government website, I found this dataset, which gives us a very good starting point for a Time Series analysis.
Data cleaning and preparation¶
The pandas
library has many functionalities for working with time series. But to really make
the most of these capabilities, it's essential to have a datetime index. That's what we're doing in the
code cell below. I won't go into much detail here: we're just renaming columns and preparing the dates to
become a useful DateTimeIndex.
import pandas as pd
import numpy as np
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
df = pd.read_csv('data_raw.csv', thousands='.')
# Drop empty column
df.drop(columns=['Unnamed: 1'], inplace=True)
# Format year and months
# Rows like 'Año 2002' hold the year; keep only the number
df.Periodo = df.Periodo.apply(lambda x: x.split()[-1] if 'Año' in x else x)
df = df.iloc[4:].reset_index(drop=True)
months = ['Enero', 'Febrero', 'Marzo', 'Abril', 'Mayo', 'Junio', 'Julio', 'Agosto', 'Septiembre', 'Octubre', 'Noviembre', 'Diciembre']
d = dict(zip(months, np.arange(1,13)))
df['Periodo'] = df.Periodo.str.strip().replace(d)
def format_date(row):
    # Year rows keep their value as a string; month rows now hold an integer 1-12
    if isinstance(row['Periodo'], str):
        return row['Periodo']
    # The row holding the year sits exactly `month` positions above the current row
    year_index = row.name - row['Periodo']
    year = df.loc[year_index, 'Periodo']
    return '1-' + str(row['Periodo']) + '-' + year

df['Periodo'] = df.apply(format_date, axis=1)
df = df[df['Periodo'].str.contains('-')]
# Change column names
renamed = dict(zip(df.columns, ['period', 'total', 'domestic', 'low_voltage', 'high_voltage']))
df.rename(columns=renamed, inplace=True)
# Set datetime index
df['period'] = pd.to_datetime(df['period'], dayfirst=True)
df.set_index('period', inplace=True)
# Keep only the years where we have months
df = df['2002-01-01':]
df.head()
| period | total | domestic | low_voltage | high_voltage |
|---|---|---|---|---|
| 2002-01-01 | 586030 | 276261 | 167946 | 141823 |
| 2002-02-01 | 551446 | 219226 | 150389 | 181831 |
| 2002-03-01 | 608499 | 269371 | 164810 | 174318 |
| 2002-04-01 | 440794 | 185130 | 137355 | 118309 |
| 2002-05-01 | 517328 | 262391 | 157361 | 97576 |
Exploratory Data Analysis¶
Let's start by plotting all the data to see how it looks.
import plotly.express as px
COLOR_MAP = dict(zip(df.columns, ['#DDBEA8', '#488B49','#F3DFC1', '#C25B66',]))
fig = px.line(df, width=800,
color_discrete_map=COLOR_MAP) \
.update_layout(template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',) \
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False)
fig.show()
The peak in February 2004 is explained in the original source of the data:
Low voltage gives an error due to problems with the S.C.E. In March the excess gets compensated.
To balance it, let's set both entries to the average of the two months. The good thing about using dates as an index is that we can select entries based on the date. Here we're selecting the values for February and March.
feb = df.loc['2004-02-01']
mar = df.loc['2004-03-01']
df.loc['2004-02-01':'2004-03-01','low_voltage'] = (feb['low_voltage'] + mar['low_voltage']) / 2
df.loc['2004-02-01':'2004-03-01','total'] = (feb['total'] + mar['total']) / 2
df.loc['2004-02-01':'2004-03-01']
| period | total | domestic | low_voltage | high_voltage |
|---|---|---|---|---|
| 2004-02-01 | 594331.5 | 271946 | 161593 | 162441 |
| 2004-03-01 | 594331.5 | 261995 | 161593 | 169095 |
Long-term trends - rolling average¶
Another advantage is the ability to resample the dataset very quickly. Here, for example, we can easily get the total consumption over each year. This reveals that there's been a reduction in domestic energy consumption in recent years, along with a slight increase in non-domestic consumption.
data = df.loc[:,['domestic', 'low_voltage', 'high_voltage']].resample('A').sum()
fig = px.line(data,
              title="Annual energy consumption in Barcelona",
width=800,
color_discrete_map=COLOR_MAP) \
.update_layout(template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',) \
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False)
fig.show()
Adding up all the values within a year gives a good estimate of the trend, but a better one comes from rolling averages. Rolling statistics are often used to smooth out short-term fluctuations and highlight long-term trends or patterns in the data.
import plotly.graph_objects as go
data = df.loc[:,['domestic', 'low_voltage', 'high_voltage']].rolling(12).mean()
fig = px.line(data,
title="Rolling average - 12 months window",
width=800,
color_discrete_map=COLOR_MAP) \
.update_layout(template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',) \
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False)
fig.add_trace(go.Scatter(
x=['2008', '2020'],
y=[250000, 250000],
text=["2008-09 Debt Crisis",
"COVID-19"],
mode="text",
name=""
))
fig.add_shape(type="rect",
              xref='paper', yref='paper',
              x0=0.35, y0=0,
              x1=0.4, y1=1,
              line=dict(color='rgba(250,250,250,0)'),
              fillcolor='rgba(250,250,250,0.1)')
fig.add_shape(type="rect",
              xref='paper', yref='paper',
              x0=0.95, y0=0,
              x1=0.98, y1=1,
              line=dict(color='rgba(250,250,250,0)'),
              fillcolor='rgba(250,250,250,0.1)')
fig.show()
By taking the rolling average over the past 12 months, we can see the trends much more clearly. The effects of big macroeconomic events such as the 2008 debt crisis and the COVID-19 pandemic are also very obvious.
Intra-year variability - detrending¶
Now we want to see whether there are months where the energy consumption is higher than in others. We could
plot the mean of each month, but here we are plotting all the years together to see how the whole trend
flows. By subtracting the rolling mean from the original time series, we remove the long-term patterns;
hence, we are removing inter-year variability. Then we have to transform the years into columns and keep
each month as a separate row. Pandas' pivot_table
method is particularly useful for that.
In general, we see that in the summer months there's a decrease in domestic energy consumption and an increase in the rest of low voltage (corresponding to small businesses) and in high voltage.
In green we've marked the year 2009, when the energy supply in Barcelona moved to another company. In red we've highlighted the year 2020, where we can clearly see the effect of the lockdown.
from plotly.subplots import make_subplots

def plot_year_trend(df, column):
    sub_df = df.loc['2003':'2021', [column]]
    # Remove long-term patterns by subtracting the 12-month rolling mean
    rolling = sub_df.rolling(12).mean()
    df_detrend = sub_df - rolling
    df_detrend['Month'] = df_detrend.index.month
    df_detrend['Year'] = df_detrend.index.year
    # One column per year, one row per month
    df_detrend = pd.pivot_table(df_detrend, index='Month', columns='Year', values=column)

    def get_color(year):
        if year == 2020:
            return COLOR_MAP['high_voltage']
        if year == 2009:
            return COLOR_MAP['domestic']
        return COLOR_MAP['low_voltage']

    color_map = {str(y): get_color(y) for y in df_detrend.columns}
    fig = px.line(df_detrend, title=f"Annual trend for {column}",
                  color_discrete_map=color_map)
    return fig
trend_figures = [plot_year_trend(df, 'domestic'),
                 plot_year_trend(df, 'low_voltage'),
                 plot_year_trend(df, 'high_voltage'),]
fig = make_subplots(rows=len(trend_figures),
                    shared_xaxes=True,
                    vertical_spacing=0.01,
                    row_titles=['Domestic', 'Low voltage', 'High voltage'],
                    cols=1)
for i, figure in enumerate(trend_figures):
    for trace in range(len(figure["data"])):
        fig.append_trace(figure["data"][trace], row=i+1, col=1)
fig.update_layout(height=750, width=800,
showlegend=False,
title=f"""Intra-year variability in energy consumption
<br><sup>
<span style='color: {COLOR_MAP["domestic"]}'>2009</span>
<span style='color: {COLOR_MAP["high_voltage"]}'>2020</span>
<span style='color: {COLOR_MAP["low_voltage"]}'>rest of the years</span>
</sup>""",
template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',
)\
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False) \
.update_traces(line={'width':1.5})
fig.show()
The valley seen in 2009 is explained by this:
(5) As of July 2009, a new company, Endesa Energía XXI S.L., was created, which inherited from the old company Endesa Distribución S.L. all clients up to a power of 10 kW; the rest have gone to the free market.
(6) As of September 2009, the energy billed by the new company Endesa Energía XXI S.L. is included.
Periodicity - First-order differencing¶
Time series normally have two basic components: trend and seasonality. Until now, we've looked into the trends, both long- and short-term. To look at seasonality, we can apply differencing. This computes the difference between consecutive observations in the time series and helps in focusing on the changes within each season. First-order differencing has many applications in different fields, like Heart Rate Variability analysis.
In our case, it helps to remove the long-term trends and to see the periodicity of the data, as the plot shows. We can clearly see "waves", or a frequency pattern, in the energy consumption.
diff = df[['domestic', 'low_voltage', 'high_voltage']].diff()
px.line(diff[1:],
width=800, height=600,
color_discrete_map=COLOR_MAP,) \
.update_layout(template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',
) \
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False) \
.update_traces(line={'width': 1})
Correlation¶
When we first explored the data, we saw that, in the long term, domestic consumption was decreasing while high voltage consumption was increasing. We should then expect some correlation between the two variables. But when we compute the correlations, we don't see very high values (left heatmap). This is because a correlation calculated on the unprocessed data doesn't distinguish between trend and seasonal components explicitly. In other words, while there may be a negative correlation between two variables in the long term, the positive seasonal correlation (right heatmap) would cancel the first one out.
But these values still aren't quite right. A quick look at the first-order differencing plot (above) reveals that the data has a lot of noise. To really see the correlation between the components, we have to decompose the data.
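To build some intuition for this cancellation, here's a tiny synthetic sketch of mine (not derived from the dataset): two series share the same seasonality but have opposite trends, so the raw correlation nearly vanishes while the differenced, seasonal correlation is strongly positive.
# Synthetic illustration: opposite trends + shared 12-month seasonality
t = np.arange(240)
season = np.sin(2 * np.pi * t / 12)
a = -0.01 * t + season + np.random.normal(0, 0.1, len(t))
b = 0.01 * t + season + np.random.normal(0, 0.1, len(t))
print(np.corrcoef(a, b)[0, 1])                    # near zero: trend and seasonality cancel out
print(np.corrcoef(np.diff(a), np.diff(b))[0, 1])  # strongly positive: seasonality dominates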
general_correlation = df.iloc[:,1:].corr()
seasonal_corr = df.diff().iloc[1:,1:].corr()
fig = make_subplots(rows=1, cols=2,
shared_yaxes=True,
column_titles=["General correlation", "Seasonal correlation"])
trend_traces = px.imshow(general_correlation,
text_auto=".2f",
color_continuous_scale=[COLOR_MAP['high_voltage'],
COLOR_MAP['low_voltage'],
COLOR_MAP['domestic']],
x=['Domes.', 'Low V', 'High V'],
y=['Domes.', 'Low V', 'High V']) \
.update_layout(template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',
) \
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False)['data']
seasonal_traces = px.imshow(seasonal_corr,
text_auto=".2f",
x=['Domes.', 'Low V', 'High V'],
y=['Domes.', 'Low V', 'High V'],
color_continuous_scale=[COLOR_MAP['high_voltage'],
COLOR_MAP['low_voltage'],
COLOR_MAP['domestic']],) \
['data']
for trace in trend_traces:
fig.add_trace(trace, row=1, col=1)
for trace in seasonal_traces:
fig.add_trace(trace, row=1, col=2)
fig.update_layout(template='plotly_dark',
width=800,
plot_bgcolor='#34495E',
paper_bgcolor='#34495E')
fig.show()
Decomposition¶
Apart from trend and seasonality, there's also a lot of noise and other short-term fluctuations. While trends correspond to the long-term direction of the data and seasonality captures repeating patterns, residuals represent the remaining variability. Any decent forecasting task has to take these two components (and the residuals) into consideration. By understanding the trends and seasonality inherent in a time series, we can build more accurate and reliable forecasting models.
The Python library statsmodels
has a very useful function to do this. Now, the seasonal plot
makes much more sense than before.
from statsmodels.tsa.seasonal import seasonal_decompose
decompositions = {column: seasonal_decompose(df.loc[:,column]) for column in df.columns[1:]}
fig = make_subplots(rows=3, cols=1,
row_titles=['Trend component', 'Seasonal component', 'Residual'])
for column in decompositions.keys():
    decompositions[column].trend.name = column
    decompositions[column].seasonal.name = column
    # Drop the NaNs first so the x and y arrays stay aligned (trend and resid have NaNs at the edges)
    trend = decompositions[column].trend.dropna()
    seasonal = decompositions[column].seasonal.dropna()
    resid = decompositions[column].resid.dropna()
    fig.add_trace(go.Scatter(x=trend.index, y=trend,
                             line=dict(color=COLOR_MAP[column])), row=1, col=1)
    fig.add_trace(go.Scatter(x=seasonal.index, y=seasonal,
                             line=dict(color=COLOR_MAP[column])), row=2, col=1)
    fig.add_trace(go.Scatter(x=resid.index, y=resid,
                             line=dict(color=COLOR_MAP[column])), row=3, col=1)
fig.update_layout(height=750, width=800,
showlegend=False,
title=f"""Energy consumption decomposition
<br><sup>
<span style='color: {COLOR_MAP["domestic"]}'>domestic</span>
<span style='color: {COLOR_MAP["low_voltage"]}'>low voltage</span>
<span style='color: {COLOR_MAP["high_voltage"]}'>high voltage</span>
</sup>""",
template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',
)\
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False) \
.update_traces(line={'width':1.5})
fig.show()
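As a quick sanity check (a small addition of mine), seasonal_decompose uses an additive model by default, so summing the three components should reconstruct the original series wherever all of them are defined:
# Additive decomposition: observed = trend + seasonal + residual
dec = decompositions['domestic']
reconstructed = (dec.trend + dec.seasonal + dec.resid).dropna()
print(np.allclose(reconstructed, df.loc[reconstructed.index, 'domestic']))  # expected: True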
We can recalculate the correlations, this time with more accurate values. Now we see something that makes sense. There's a negative correlation between domestic energy consumption and the rest (corresponding to small businesses and industry). In the trend component this measure is quite low, but it's much higher in the seasonal component, as we expected when we plotted the intra-year variability.
trend_df = pd.concat([dec.trend for dec in decompositions.values()], axis=1).dropna()
seasonal_df = pd.concat([dec.seasonal for dec in decompositions.values()], axis=1).dropna()
trend_corr = trend_df.corr()
seasonal_corr = seasonal_df.corr()
fig = make_subplots(rows=1, cols=2,
shared_yaxes=True,
column_titles=["Trend correlation", "Seasonal correlation"])
trend_traces = px.imshow(trend_corr,
text_auto=".2f",
color_continuous_scale=[COLOR_MAP['high_voltage'],
COLOR_MAP['low_voltage'],
COLOR_MAP['domestic']],
x=['Domes.', 'Low V', 'High V'],
y=['Domes.', 'Low V', 'High V']) \
.update_layout(template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',
) \
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False)['data']
seasonal_traces = px.imshow(seasonal_corr,
text_auto=".2f",
x=['Domes.', 'Low V', 'High V'],
y=['Domes.', 'Low V', 'High V'],
color_continuous_scale=[COLOR_MAP['high_voltage'],
COLOR_MAP['low_voltage'],
COLOR_MAP['domestic']],) \
['data']
for trace in trend_traces:
fig.add_trace(trace, row=1, col=1)
for trace in seasonal_traces:
fig.add_trace(trace, row=1, col=2)
fig.update_layout(template='plotly_dark',
width=800,
plot_bgcolor='#34495E',
paper_bgcolor='#34495E')
fig.show()
Autocorrelation¶
Another way to detect seasonality is the autocorrelation function (ACF). It quantifies the degree to which each observation is related to previous observations at different lags. We can set lags of different periods, like 6 months, 12, etc. It's crucial for identifying the order of autoregressive (AR) and moving average (MA) processes, and helps a lot when selecting the appropriate parameters for ARIMA models (we'll see them soon).
When we plot the ACF, we clearly see peaks every 12 months. This means that the series are correlated with themselves shifted by one year.
from statsmodels.tsa.stattools import acf, pacf
acf_df = pd.DataFrame({column:acf(df.loc[:,column], nlags=36)
for column in df.columns[1:]},)
px.line(acf_df, width=800, height=600,
        title="""Autocorrelation function
        <br><sup>We observe a clear correlation peak at lag 12</sup>""",
        color_discrete_map=COLOR_MAP,) \
.update_layout(template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',
) \
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False)
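As a quick numeric cross-check of that lag-12 peak (my addition, using pandas' built-in Series.autocorr, which computes a plain Pearson correlation with the shifted series, so the values are close to, though not identical to, the ACF ones):
# Lag-12 autocorrelation of each series
for column in df.columns[1:]:
    print(f"{column}: {df[column].autocorr(lag=12):.3f}")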
The autocorrelation function has a sister: the partial autocorrelation function (PACF). This function also shows the relationships between observations at different lags, but it excludes the effect of the intermediate observations. For example, if we're measuring the correlation at lag 5, the function represents the relation between the current observation and the one that's exactly 5 time points behind, excluding the ones that are 4, 3, 2 or 1 time points behind.
The PACF is particularly useful to tune the order $p$ in autoregressive models ($AR(p)$). This order $p$ indicates how many past values are considered in the model. ARIMA models assume stationarity, so differencing is applied before computing the PACF to achieve stationarity, remove trends, and focus on the direct autocorrelations needed to identify the order of the autoregressive terms in the model.
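For instance, once a candidate $p$ has been read off the PACF, fitting a pure autoregressive model on the differenced series is a one-liner with statsmodels. This is just a sketch; the lag value below is illustrative, not a tuned choice:
from statsmodels.tsa.ar_model import AutoReg

# AR(p) on the differenced (stationary) series; p=12 is only an example value
ar_model = AutoReg(df['domestic'].diff().dropna(), lags=12).fit()
print(ar_model.params.head())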
In the following plots, any spikes outside the greyed areas are statistically significant, meaning that there is a strong autocorrelation between values at those lags.
def create_corr_plot(column, plot_pacf=False):
    series = df.loc[:, column]
    # Work on the differenced series to remove trends; alpha=0.05 returns 95% confidence intervals
    corr_array = pacf(series.diff().dropna(), alpha=0.05) if plot_pacf else acf(series.diff().dropna(), alpha=0.05)
    lower_y = corr_array[1][:, 0] - corr_array[0]
    upper_y = corr_array[1][:, 1] - corr_array[0]

    fig = go.Figure()
    # Vertical stems from zero up to each correlation value
    for x in range(len(corr_array[0])):
        fig.add_scatter(x=(x, x), y=(0, corr_array[0][x]), mode='lines', line_color=COLOR_MAP['total'])
    fig.add_scatter(x=np.arange(len(corr_array[0])), y=corr_array[0], mode='markers',
                    marker_color=COLOR_MAP[column], marker_size=8)
    # Shaded significance band around zero
    fig.add_scatter(x=np.arange(len(corr_array[0])), y=upper_y, mode='lines', line_color='rgba(255,255,255,0)')
    fig.add_scatter(x=np.arange(len(corr_array[0])), y=lower_y, mode='lines', fillcolor='rgba(250,250,250,0.1)',
                    fill='tonexty', line_color='rgba(255,255,255,0)')
    fig.update_traces(showlegend=False)
    fig.update_yaxes(zerolinecolor='#000000')
    return fig
fig = make_subplots(rows=3, cols=2,
column_titles=['Autocorrelation', 'Partial autocorrelation'],
row_titles=['Domestic', 'Low voltage', 'High voltage'])
pacf_figures = [create_corr_plot(column, True) for column in df.columns[1:]]
acf_figures = [create_corr_plot(column) for column in df.columns[1:]]
for i, figure in enumerate(pacf_figures):
for trace in range(len(figure["data"])):
fig.append_trace(figure["data"][trace], row=i+1, col=2)
for i, figure in enumerate(acf_figures):
for trace in range(len(figure["data"])):
fig.append_trace(figure["data"][trace], row=i+1, col=1)
fig.update_layout(
width=800,height=900,
template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',
) \
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False)
fig.show()
Time series forecasting¶
ARIMA models¶
ARIMA stands for AutoRegressive Integrated Moving Average. It's a combination of three different parts. The Autoregressive process models the current value of a variable as a linear combination of its past values. How many past values? That's determined by the parameter $p$.
$${\displaystyle X_{t}=c+\sum _{i=1}^{p}\varphi _{i}X_{t-i}+\varepsilon _{t}}$$The Moving Average part, on the other hand, models the current value as a linear combination of past forecast errors (not an average of past values, despite the name). How many past errors is determined by the parameter $q$:
$${\displaystyle X_{t}=\mu +\varepsilon _{t}+\sum _{i=1}^{q}\theta _{i}\varepsilon _{t-i}}$$
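To make these two definitions concrete, here's a small simulation sketch of mine (not part of the original analysis) that generates a pure AR(1) and a pure MA(1) process with statsmodels:
from statsmodels.tsa.arima_process import ArmaProcess

# ArmaProcess takes lag polynomials: a leading 1, with the AR coefficients negated
ar1 = ArmaProcess(ar=[1, -0.8], ma=[1])   # X_t = 0.8 X_{t-1} + e_t
ma1 = ArmaProcess(ar=[1], ma=[1, 0.8])    # X_t = e_t + 0.8 e_{t-1}
simulated = pd.DataFrame({'AR(1)': ar1.generate_sample(nsample=200),
                          'MA(1)': ma1.generate_sample(nsample=200)})
px.line(simulated, width=800).show()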
Finally, the Integrated part corrects the model in case it is non-stationary. In our case, the rolling average plot clearly shows long-term trends, which makes our data non-stationary. But just looking at a plot to decide whether the data is stationary or not isn't very scientific, is it? Luckily, the Augmented Dickey-Fuller test was designed just for that. Its null hypothesis assumes that the data is non-stationary. For the three series (domestic, low and high voltage) the p-value is high enough that we cannot reject non-stationarity.
from statsmodels.tsa.stattools import adfuller

for column in df.columns[1:]:
    print(f"{column}: {adfuller(df.loc[:, column])[1]}")

domestic: 0.842010473579625
low_voltage: 0.3599026643228239
high_voltage: 0.09802548827010471
To make a time series stationary, we have to compute the differences, as we did above with first-order differencing. The number of times we have to difference the data determines the last parameter, $d$. We can use the Dickey-Fuller test again to check how many rounds of differencing we need. We see that after differencing just once, we get really low p-values in the test.
for column in df.columns[1:]:
    print(f"First order differencing {column}: {adfuller(df.loc[:, column].diff().dropna())[1]}")

First order differencing domestic: 1.346953258378036e-27
First order differencing low_voltage: 7.674326893256915e-20
First order differencing high_voltage: 4.517936348563829e-18
The model, then, can be written as ARIMA(p, d, q). We've already decided the value of $d$, so now we need to decide $p$ and $q$: ARIMA(p, 1, q). To determine the values of $p$ and $q$, we can take a look at the lags of the PACF (for $p$) and ACF (for $q$) plots. We can consider any peaks falling outside the significance area. For example, let's take a closer look at the PACF plot corresponding to domestic consumption. Lags 1, 7, 8, 10 and 11 have a greater significance; lags 6 and 9 too, but to a lesser extent. The ACF plot has a high peak at lag 12, revealing seasonality. This suggests that ARIMA may not be the best model to forecast the energy consumption, but let's keep going with it; we'll take a look at Seasonal ARIMA (SARIMA) later.
pacf_plot = create_corr_plot('domestic', plot_pacf=True)
acf_plot = create_corr_plot('domestic')
fig = make_subplots(rows=1, cols=2,
column_titles=['Autocorrelation', 'Partial autocorrelation'],
)
for trace in range(len(pacf_plot["data"])):
fig.append_trace(pacf_plot["data"][trace], row=1, col=2)
for trace in range(len(acf_plot["data"])):
fig.append_trace(acf_plot["data"][trace], row=1, col=1)
fig.update_layout(title="Domestic consumption", width=800,
template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',
) \
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False)
As we can see, there are a lot of possible combinations for $p$ and $q$. We could arbitrarily choose one, or we can run a for loop over a range of possible combinations and select the one that yields the minimum error on the prediction. As with any machine learning model, we also have to split the data into train and test sets (we're using an 80/20 ratio in this case). The code below summarizes the process:
from itertools import product
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error as mse
from statsmodels.tools.sm_exceptions import ConvergenceWarning, ValueWarning
import warnings
warnings.simplefilter('ignore', ValueWarning)
warnings.simplefilter('ignore', ConvergenceWarning)
# warnings.simplefilter('ignore', UserWarning)
split_idx = df.index[int(len(df) * .8)]
domestic_train = df.loc[:split_idx, 'domestic']
domestic_test = df.loc[split_idx:, 'domestic']
p_range = range(1,12)
d_range = range(0,2)
q_range = range(1,13)
pdq_combinations = list(product(p_range, d_range, q_range))
combinations = []
rmses = []

for pdq in pdq_combinations:
    try:
        model = ARIMA(domestic_train, order=pdq).fit()
        # Forecast from the end of the training set onwards
        prediction = model.predict(start=len(domestic_train), end=len(df))
        rmse = np.sqrt(mse(domestic_test, prediction))
        combinations.append(pdq)
        rmses.append(rmse)
    except Exception:
        continue
results = pd.DataFrame({'combination': combinations, 'rmse': rmses})
# Show the best (lowest-RMSE) combination
results.loc[[results.rmse.idxmin()]]
| | combination | rmse |
|---|---|---|
| 77 | (4, 0, 6) | 22030.620452 |
As we can see, even the best combination gives quite a high error, and looking at the plot we see that, although it captures some of the fluctuations, the prediction is not very accurate.
comb = results.loc[results.rmse.idxmin(), 'combination']
error = results.rmse.min()
model = ARIMA(domestic_train, order=comb).fit()
forecast = model.predict(len(domestic_train), len(df))
px.line(pd.concat([df.loc[split_idx:,'domestic'], forecast], axis=1),
title=f"""ARIMA model prediction
<br><sup>p={comb[0]}, d={comb[1]}, q={comb[2]} | RMSE = {error:.2f}</sup>""",
width=800,
color_discrete_map=COLOR_MAP,) \
.update_layout(template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',
) \
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False)
SARIMA models¶
In our case, it's very clear that we have seasonality. We saw it when we first plotted the data, when we decomposed it, and finally in the ACF and PACF plots, with those clear peaks at lag 12. Seasonal AutoRegressive Integrated Moving Average (SARIMA) is an extension of the ARIMA model designed to capture both non-seasonal and seasonal patterns in time series data. The SARIMA model is denoted as SARIMA(p, d, q)(P, D, Q)m, where (p, d, q) is the non-seasonal order, (P, D, Q) is the seasonal order, and $m$ is the length of the seasonal cycle. The non-seasonal parameters (p, d, q) are the same as in the ARIMA model, representing the autoregressive order, the degree of differencing, and the moving average order, respectively. The seasonal parameters (P, D, Q) mirror their non-seasonal counterparts but apply to the seasonal component.
The data we are looking at has a yearly seasonal component; therefore, $m$ will be 12, as we have monthly data. Having a P of 1, for example, means that the model will evaluate each point with an autoregression that takes one seasonal cycle (12 months in our case) into account.
To save some space, we're not showing the for loop used to select the best parameters here, but it follows the same logic as before.
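For reference, here's a minimal sketch of what that loop could look like, under the same assumptions as the ARIMA search (same train/test split, RMSE as the criterion). The parameter grid is illustrative and deliberately coarse, since each SARIMAX fit is slow, and the first test point is skipped so that actuals and forecasts line up with the split defined above:
from statsmodels.tsa.statespace.sarimax import SARIMAX

sarima_scores = []
# Illustrative, reduced grid; m is fixed to 12 (monthly data, yearly cycle)
for p, q, P, Q in product(range(0, 9, 2), range(0, 7, 2), range(0, 2), range(0, 2)):
    try:
        candidate = SARIMAX(domestic_train, order=(p, 0, q),
                            seasonal_order=(P, 1, Q, 12)).fit(disp=False)
        pred = candidate.get_forecast(steps=len(df) - len(domestic_train)).predicted_mean
        sarima_scores.append(((p, 0, q), (P, 1, Q, 12), np.sqrt(mse(domestic_test.iloc[1:], pred))))
    except Exception:
        continue
sarima_scores.sort(key=lambda r: r[-1])  # lowest RMSE first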
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(domestic_train, order=(8,0,6), seasonal_order=(1,1,1,12)).fit(disp=False)
forecast = model.get_forecast(steps=len(df) - len(domestic_train))
# Recompute the error for the SARIMA forecast, aligned with the out-of-sample period
error = np.sqrt(mse(domestic_test.iloc[1:], forecast.predicted_mean))
fig = px.line(pd.concat([df.loc[split_idx:,'domestic'], forecast.predicted_mean], axis=1),
title=f"""SARIMA model prediction
<br><sup>p={8}, d={0}, q={6} | P=1, D=1, Q=1, m=12 | RMSE = {error:.2f} | Confidence interval in grey</sup>""",
width=800,
color_discrete_map=COLOR_MAP,) \
.update_layout(template='plotly_dark',
plot_bgcolor='#34495E',
paper_bgcolor='#34495E',
) \
.update_xaxes(showgrid=False) \
.update_yaxes(showgrid=False)
confidence_intervals = forecast.conf_int()
fig.add_trace(go.Scatter(x=confidence_intervals.index,
y=confidence_intervals['lower domestic'],
line=dict(color='rgba(0,0,0,0)'), showlegend=False))
fig.add_trace(go.Scatter(x=confidence_intervals.index,
y=confidence_intervals['upper domestic'],
line=dict(color='rgba(0,0,0,0)'),
fillcolor='rgba(250,250,250,0.1)',
fill='tonexty', showlegend=False))
fig.show()
Conclusion¶
Time series forecasting is a vast field with applications in various domains, including finance, economics, and environmental science. This brief overview just shows key concepts such as trends, seasonality, and the ARIMA model. However, time series analysis has a multitude of advanced techniques and models, each designed for specific data characteristics and forecasting requirements.
This introduction is just a stepping stone, so feel free to explore the diverse landscape of time series forecasting and discover the many more tools available for predicting future trends and patterns.
Further steps¶
Beyond the introductory concepts, we could explore other machine learning algorithms, like Long Short-Term Memory (LSTM) networks or ensemble methods, which can capture intricate patterns that traditional models may overlook.
Another thing that I really want to do is integrate economic metrics to study the influence that the economy has on energy consumption. By correlating the time series with economic indicators such as GDP, inflation rates, or interest rates, we could uncover other relationships and maybe use them to refine the predictive models. But that's material for many other articles!