Forecasting Solar Energy#

Objective#

The primary objective is to build and evaluate a predictive model, with a focus on understanding the relative importance of different features in making predictions. As more solar power is integrated into power systems, accurately forecasting solar power output becomes crucial for reliable and economic system operation. Conventional generators must ramp up and down to follow the rise and fall of solar output around sunrise and sunset, and on cloudy days forecasting solar power fluctuations is important for scheduling adequate reserve capacity. The model works with the structured data needed for solar power forecasting and aims to optimize prediction accuracy while providing explainability through a simple example.
We use historical time-series data from 2006 for a site in Mississippi to analyze and forecast solar-energy output. We want to reiterate that the purpose of this module is not to produce the most accurate forecast, but to demonstrate the process of developing a machine learning pipeline for solar power forecasting and analyzing feature importance for interpretability. The dataset used in this module is a sample dataset for demonstration purposes only, and the techniques and methods shown here can be applied to other solar power forecasting datasets. Other forecasting methods may follow quite different procedures, but the main data processing steps should be similar.

Purpose#

  • To develop a machine learning pipeline for solar power forecasting.

  • To analyze feature importance so that forecasting results are interpretable.

  • To improve predictive accuracy using feature engineering and optimization techniques.

Who is this useful for?#

  • Data Scientists: Interested in understanding feature importance in forecasting models.

  • Decision-Makers: Seeking insights from the predictions for actionable strategies.

  • Students & Researchers: Exploring predictive modeling and feature analysis.

Applications#

  • Predicting outcomes from structured time-series data, demonstrated here for solar power forecasting (similar pipelines apply to, e.g., sales, risk assessment, or customer behavior).

  • Identifying key drivers influencing outcomes for resource allocation.

  • Benchmarking forecasting performance (accuracy) against baseline algorithms.

Notebook Components#

  1. Data Preparation: Importing, cleaning, and preprocessing the historical time-series data for solar power forecasting.

  2. Model Development: Training machine learning models.

    • ARIMA model: ARIMA stands for AutoRegressive Integrated Moving Average, a popular statistical method for time series forecasting. An ARIMA model has three main components:

      1. AutoRegressive (AR): the variable is regressed on its own lagged (past) values; the number of lags is denoted by p.

      2. Integrated (I): the data is differenced to make it stationary, i.e., its mean and variance are constant over time; the number of differences needed is denoted by d.

      3. Moving Average (MA): the error term is modeled as a linear combination of past error terms; the number of lagged error terms is denoted by q.

      The model is written ARIMA(p, d, q): p is the lag order, d is the degree of differencing, and q is the size of the moving-average window. Building an ARIMA model involves identification (choosing p, d, and q with Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots), estimation (fitting the model with the chosen parameters), diagnostic checking (verifying that the residuals resemble white noise, i.e., normally distributed with zero mean and constant variance), and forecasting (using the fitted model to predict future values). A minimal ARIMA sketch appears after this list.

    • Prophet model: Prophet is an open-source forecasting tool developed by Facebook. It is designed for time series with daily, weekly, and yearly seasonality as well as holiday effects; it is robust to missing data and shifts in the trend, and typically handles outliers well. It works best with series that have strong seasonal effects and several seasons of historical data, and it makes it easy to add extra regressors to improve forecast accuracy. Building a Prophet model involves:

      1. Data preparation: format the data with columns 'ds' (date) and 'y' (value to forecast).

      2. Model initialization: create a Prophet object and specify any seasonalities or holidays.

      3. Model fitting: fit the model to the historical data.

      4. Forecasting: use the fitted model to make future predictions.

      5. Visualization: plot the forecasted values along with the historical data to assess the model's performance.

    • LightGBM model: LightGBM (Light Gradient Boosting Machine) is a highly efficient, scalable gradient boosting framework that uses tree-based learning algorithms. Its key features:

      1. Gradient boosting: models are built sequentially, with each new model correcting the errors of the previous ones by minimizing a loss function via gradient descent.

      2. Tree-based learning: decision trees are the base learners, accelerated by Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which reduce the number of data instances and features considered during training.

      3. Efficiency: histogram-based decision tree learning reduces the cost of finding the best split, and leaf-wise tree growth (rather than the level-wise growth used in other frameworks) can produce deeper trees and better accuracy.

      4. Scalability: LightGBM handles large, high-dimensional datasets efficiently and supports parallel and distributed learning, making it suitable for big data applications.

      5. Accuracy: thanks to these optimizations, LightGBM often achieves higher accuracy than other gradient boosting frameworks.

  3. Feature Importance Analysis: Evaluating which features contribute most to predictions, and analyzing the key factors in solar power forecasting.

  4. Visualization: Graphically representing feature importance for interpretability.
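To make the ARIMA(p, d, q) notation above concrete, here is a minimal, self-contained sketch on a synthetic toy series; the order (2, 0, 2) is purely illustrative, not a recommendation for the solar data.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Toy series: a noisy sine wave standing in for any univariate signal
rng = np.random.default_rng(0)
toy = np.sin(np.linspace(0, 20, 200)) + rng.normal(scale=0.2, size=200)

# ARIMA(p=2, d=0, q=2): two AR lags, no differencing, two MA lags
fit = ARIMA(toy, order=(2, 0, 2)).fit()
print(fit.forecast(steps=5))  # predict the next five points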

# Import all the required libraries, pandas for data analytics, numpy for numerical calculation, matplotlib for plotting and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
!{sys.executable} -m pip install statsmodels
!{sys.executable} -m pip install scikit-learn
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.statespace.sarimax import SARIMAX

pd.set_option('display.max_rows', 300)  # limit the maximum number of rows shown in display
# Define the data path for the input data: historical 5-minute solar power output
import os

current_dir = os.getcwd()
csv_path = os.path.join(current_dir, "ms-pv-2006", "Actual_30.25_-89.45_2006_UPV_118MW_5_Min.csv")

# Read the historical data with pandas; keep two copies, one for the ARIMA
# workflow (df) and one for the Prophet-style workflow (data)
df = pd.read_csv(csv_path)
data = pd.read_csv(csv_path)

Add a new column 'Datetime', parsed with the format specified below, and make it our index for later use.

df['Datetime'] = pd.to_datetime(df['LocalTime'], format='%m/%d/%y %H:%M')
df.set_index('Datetime', inplace=True)
df
LocalTime Power(MW)
Datetime
2006-01-01 00:00:00 01/01/06 00:00 0.0
2006-01-01 00:05:00 01/01/06 00:05 0.0
2006-01-01 00:10:00 01/01/06 00:10 0.0
2006-01-01 00:15:00 01/01/06 00:15 0.0
2006-01-01 00:20:00 01/01/06 00:20 0.0
... ... ...
2006-12-31 23:35:00 12/31/06 23:35 0.0
2006-12-31 23:40:00 12/31/06 23:40 0.0
2006-12-31 23:45:00 12/31/06 23:45 0.0
2006-12-31 23:50:00 12/31/06 23:50 0.0
2006-12-31 23:55:00 12/31/06 23:55 0.0

105120 rows Γ— 2 columns

# Show the historical data; keep the power series as dff for later modeling
dff = df['Power(MW)']
df.head(200)
LocalTime Power(MW)
Datetime
2006-01-01 00:00:00 01/01/06 00:00 0.0
2006-01-01 00:05:00 01/01/06 00:05 0.0
... ... ...
2006-01-01 07:15:00 01/01/06 07:15 0.0
2006-01-01 07:20:00 01/01/06 07:20 3.6
2006-01-01 07:25:00 01/01/06 07:25 1.0
2006-01-01 07:30:00 01/01/06 07:30 3.6
... ... ...
2006-01-01 16:30:00 01/01/06 16:30 17.8
2006-01-01 16:35:00 01/01/06 16:35 9.9

(output truncated: power stays at 0.0 overnight and becomes nonzero from about 07:20 onward)
data['LocalTime'] = pd.to_datetime(data['LocalTime'], format='%m/%d/%y %H:%M')
data = data.rename(columns={'LocalTime': 'ds', 'Power(MW)': 'y'})
# Forward-fill any missing power values
if data['y'].isnull().sum() > 0:
    data['y'] = data['y'].ffill()

# Resample the 5-minute data to hourly means ('h' replaces the deprecated 'H' alias)
data_hourly = data.resample('h', on='ds').mean().reset_index()
# Plot the one-year historical solar generation curve
plt.figure(figsize=(12, 6))
plt.plot(data_hourly['ds'], data_hourly['y'], color='b', label='Power (MW)')
plt.xlabel('Date')
plt.ylabel('Power (MW)')
plt.title('Hourly Power Generation')
plt.legend()
plt.show()

From the overview, our data seems to have high daily seasonality, with zeros in the nighttime and peak power output during the day.

ARIMA#

We begin our forecasting with the Auto-Regressive Integrated Moving Average (ARIMA) model, a widely used method for time-series forecasting. ARIMA is particularly effective for univariate data that is stationary, meaning its statistical properties such as mean and variance are constant over time. However, our dataset exhibits daily seasonality, with power output peaking during the day and dropping to zero at night. This inherent seasonality suggests that our data may not be strictly stationary, which could impact the model’s accuracy. Despite this, we proceed with ARIMA to establish a baseline and will consider additional preprocessing steps, such as differencing, to address non-stationarity and improve model performance.

# Check whether the historical solar power series satisfies the stationarity requirement
def adf_test(series):
    result = adfuller(series)
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    if result[1] > 0.05:
        print("Series is non-stationary")
    else:
        print("Series is stationary")

The Augmented Dickey-Fuller (ADF) test is a statistical test used to determine if a time series is stationary, meaning its statistical properties such as mean, variance, and autocorrelation are constant over time. The test works by assessing the null hypothesis that a unit root is present in the time series data, which would indicate non-stationarity. If the p-value obtained from the test is below a certain threshold (commonly 0.05), the null hypothesis is rejected, suggesting that the series is stationary. Conversely, a p-value above the threshold indicates that the series is non-stationary and may require differencing or other transformations to achieve stationarity.
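As a quick illustration (a toy example, not part of the pipeline), the test behaves as expected on two synthetic series: white noise passes, while a random walk, the classic unit-root process, fails. This uses the adf_test helper defined above.

rng = np.random.default_rng(1)
white_noise = rng.normal(size=1000)   # stationary by construction
random_walk = np.cumsum(white_noise)  # unit-root process, non-stationary

adf_test(pd.Series(white_noise))  # expected: "Series is stationary"
adf_test(pd.Series(random_walk))  # expected: "Series is non-stationary"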

adf_test(dff)
ADF Statistic: -39.02861991156054
p-value: 0.0
Series is stationary

The ADF test suggests our data is stationary. Given the strong daily seasonality this may not strictly be true, but we move on with it and do not difference the series yet. Differencing is a transformation technique used in time series analysis to make a non-stationary series stationary by removing trends or seasonality: each observation is replaced by its difference from the previous observation. We may still want to try differencing later.
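As a tiny numerical illustration of differencing:

s = pd.Series([1, 3, 6, 10])
print(s.diff())  # NaN, 2.0, 3.0, 4.0 -- the first value has no predecessor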

plt.figure(figsize=(12,6))
plt.subplot(121)
plot_acf(dff, ax=plt.gca())
plt.subplot(122)
plot_pacf(dff, ax=plt.gca())
plt.show()

The ARIMA model has p, d, and q values that we must select based on our data; these ACF and PACF plots help determine them.
The plots above suggest that differencing is required for our data.

# Differencing: first-order difference of the power series
df['PowerDiff'] = dff.diff()
diff = df['PowerDiff'].dropna()

Construct the plot again for our differenced data.

plt.figure(figsize=(12,6))
plt.subplot(121)
plot_acf(diff.dropna(), ax=plt.gca())  # ACF for 'q'
plt.subplot(122)
plot_pacf(diff.dropna(), ax=plt.gca())  # PACF for 'p'
plt.show()

From these plots we can finally read off the values of p, d, and q to use for our model: p = 2, d = 0, and q = 2.
We can also check whether these choices are sound by running a grid search over candidate orders for our model.

import warnings
from statsmodels.tsa.arima.model import ARIMA
warnings.filterwarnings("ignore")

def evaluate_arima_model(X, arima_order):
    """Fit an ARIMA model with the given order and return its AIC."""
    model = ARIMA(X, order=arima_order)
    model_fit = model.fit()
    return model_fit.aic


def grid_search_arima(data, p_values, d_values, q_values):
    """Exhaustively search (p, d, q) combinations and keep the lowest-AIC order."""
    best_aic = float("inf")
    best_order = None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                try:
                    aic = evaluate_arima_model(data, (p, d, q))
                    if aic < best_aic:
                        best_aic = aic
                        best_order = (p, d, q)
                except Exception:
                    # Some orders fail to converge; skip them
                    continue
    return best_order


p_values = range(0, 3)
d_values = range(0, 2)
q_values = range(0, 3)


best_order = grid_search_arima(dff, p_values, d_values, q_values)
print(f'Best ARIMA order: {best_order}')
Best ARIMA order: (2, 0, 2)
# Splitting the data into test and train.
train_size = int(len(df) * 0.8)
train, test = dff[:train_size], dff[train_size:]
# Fitting the model
best_p, best_d, best_q = best_order
model = ARIMA(train, order=(best_p, best_d, best_q))
model_fit = model.fit()


predictions = model_fit.forecast(steps=len(test))


mse = mean_squared_error(test, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
RMSE: 25.9281711705478

Output the predicted values at different times of day to sanity-check the forecast.

print(predictions[:21024:100])
print(f"Test set length: {len(test)}, Predictions length: {len(predictions)}")
2006-10-20 00:00:00     0.242440
2006-10-20 08:20:00    15.077833
2006-10-20 16:40:00    20.586042
2006-10-21 01:00:00    22.638099
2006-10-21 09:20:00    23.402582
2006-10-21 17:40:00    23.687387
2006-10-22 02:00:00    23.793489
2006-10-22 10:20:00    23.833017
2006-10-22 18:40:00    23.847743
2006-10-23 03:00:00    23.853229
2006-10-23 11:20:00    23.855273
2006-10-23 19:40:00    23.856034
2006-10-24 04:00:00    23.856318
...
2006-12-31 22:00:00    23.856486
Freq: 500min, Name: predicted_mean, dtype: float64
Test set length: 21024, Predictions length: 21024

(output truncated: every later value flattens to 23.856486)
# Plot
plt.figure(figsize=(10,6))
plt.plot(train.index, train, label='Train Data', color='blue')
plt.plot(test.index, test, label='Test Data', color='green')
plt.plot(test.index, predictions, label='Predictions', color='red', linestyle='--')
plt.legend()
plt.xlabel('Datetime')
plt.ylabel('Power (MW)')
plt.title('ARIMA Model Predictions vs Actual')
plt.show()

We can see from the data and the plot that the prediction rises sharply and then flattens out at about 23.86 MW for both daytime and nighttime, which is very poor accuracy.

# Forecast the next 100 values (100 five-minute steps)
future_steps = 100
forecast = model_fit.forecast(steps=future_steps)
# Plot; the forecast index uses the data's native 5-minute frequency
plt.figure(figsize=(10,6))
plt.plot(df.index, dff, label='Original Data')
plt.plot(pd.date_range(df.index[-1], periods=future_steps, freq='5min'), forecast, label='Forecast', color='green')
plt.legend()
plt.show()

ARIMA yielded a poor prediction, likely because our data is highly seasonal. So we next try a model known for handling seasonal data effectively.
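Before leaving the ARIMA family, note that the SARIMAX class imported earlier adds an explicit seasonal (P, D, Q, s) term. The sketch below is illustrative only and is not run here: with 5-minute data a daily season is s = 288 steps, and fitting with such a long seasonal period is computationally heavy, which is one more reason to switch models.

# Illustrative only: seasonal_order=(P, D, Q, s); s = 288 five-minute steps per day
seasonal_model = SARIMAX(train, order=(2, 0, 2), seasonal_order=(1, 0, 1, 288))
# seasonal_fit = seasonal_model.fit(disp=False)  # very slow at this resolution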

Prophet#

The Prophet model, developed by Meta, is a robust and user-friendly tool for time series forecasting. It is particularly well suited to data with strong seasonal patterns and missing values, as well as scenarios with outliers or trend changes. We now apply it to our data.

import sys
!{sys.executable} -m pip install prophet
from prophet import Prophet
import os
current_dir = os.getcwd()
csv_path = os.path.join(current_dir, "ms-pv-2006", "Actual_30.25_-89.45_2006_UPV_118MW_5_Min.csv")
data = pd.read_csv(csv_path)
data['LocalTime'] = pd.to_datetime(data['LocalTime'], format='%m/%d/%y %H:%M')
# Prophet expects the columns to be named 'ds' (datetime) and 'y' (target)
data = data.rename(columns={'LocalTime': 'ds', 'Power(MW)': 'y'})
if data['y'].isnull().sum() > 0:
    data['y'] = data['y'].ffill()

# Resample to hourly means and split 80/20 into train and test
data_hourly = data.resample('h', on='ds').mean().reset_index()
split_index = int(len(data_hourly) * 0.8)
train_data = data_hourly[:split_index]
test_data = data_hourly[split_index:]
# Fit the model
model = Prophet(yearly_seasonality=True, daily_seasonality=True, weekly_seasonality=True)
model.fit(train_data)
17:28:17 - cmdstanpy - INFO - Chain [1] start processing
17:28:17 - cmdstanpy - INFO - Chain [1] done processing
<prophet.forecaster.Prophet at 0x7fedf0ba6d90>
# Make future predictions: 30 days of hourly steps beyond the training data
future = model.make_future_dataframe(periods=24*30, freq='h')
forecast = model.predict(future)
test_forecast = forecast.merge(test_data, on='ds', how='right')
plt.figure(figsize=(12, 6))
plt.plot(test_data['ds'], test_data['y'], label='Test Data (Actual)', color='blue', alpha=0.6)
plt.plot(test_forecast['ds'], test_forecast['yhat'], label='Predicted (Test)', color='orange', alpha=0.8)
plt.fill_between(test_forecast['ds'], test_forecast['yhat_lower'], test_forecast['yhat_upper'], color='orange', alpha=0.2, label='Uncertainty Interval')
plt.title("Test Data Accuracy (Prediction)")
plt.xlabel("Date")
plt.ylabel("Power (MW)")
plt.legend()
plt.grid()
plt.show()
plt.figure(figsize=(14, 7))
plt.plot(data['ds'], data['y'], label='Historical Data', color='black', alpha=0.6)
plt.plot(forecast['ds'], forecast['yhat'], label='Forecasted Data', color='green')
plt.fill_between(forecast['ds'], forecast['yhat_lower'], forecast['yhat_upper'], color='green', alpha=0.2, label='Uncertainty Interval')
plt.title("Full Forecast with Historical Data")
plt.xlabel("Date")
plt.ylabel("Power (MW)")
plt.legend()
plt.grid()
plt.show()
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']][:7000:70]
ds yhat yhat_lower yhat_upper
0 2006-01-01 00:00:00 -6.665930 -24.240018 10.358625
70 2006-01-03 22:00:00 -4.474233 -20.038008 11.853261
140 2006-01-06 20:00:00 -6.592283 -22.713803 9.164866
210 2006-01-09 18:00:00 -5.583250 -22.844907 12.420671
280 2006-01-12 16:00:00 26.849308 10.893639 44.342990
350 2006-01-15 14:00:00 48.079698 31.277399 63.720274
420 2006-01-18 12:00:00 49.432818 32.739164 65.115053
490 2006-01-21 10:00:00 53.029588 37.641518 69.696168
560 2006-01-24 08:00:00 45.396122 29.912687 61.725492
630 2006-01-27 06:00:00 10.705178 -6.510347 26.607076
700 2006-01-30 04:00:00 -8.956497 -24.422312 8.420139
...
6930 2006-10-16 18:00:00 -7.963110 -23.904965 7.692729

(output truncated: the sampled rows repeat the same daily cycle of slightly negative nighttime and roughly 50-65 MW midday forecasts through mid-October)

Accuracy#

We can see that accuracy has improved greatly with this model, which accounts for all the seasonality; however, it still predicts values below 0, which can never occur for solar power output.
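Since negative solar output is physically impossible, a simple post-processing step (not applied in this notebook) would be to clip the forecast at zero:

# Clip predictions and the lower uncertainty bound at zero
forecast['yhat'] = forecast['yhat'].clip(lower=0)
forecast['yhat_lower'] = forecast['yhat_lower'].clip(lower=0)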

model.plot_components(forecast)
plt.show()

These plots reveal the seasonality patterns in the forecast, highlighting the underlying trends in the data.
So far we have not accounted for the seasonality of the data ourselves before forecasting; we have let the model handle it on its own. Next we handle the seasonality explicitly through feature engineering.

LightGBM with feature engineering#

LightGBM (Light Gradient Boosting Machine) is a powerful and efficient gradient boosting framework widely used for machine learning tasks.

Feature engineering is the process of preparing and transforming raw data into features that better represent the underlying problem to improve the performance of a machine learning model. A feature is an individual measurable property or characteristic of a phenomenon being observed, often represented as a column in a dataset.
The goal of feature engineering is to extract the most relevant information from the raw data, making it easier for the model to learn patterns and make predictions.

import sys
!{sys.executable} -m pip install lightgbm
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error
import os
current_dir = os.getcwd()
csv_path = os.path.join(current_dir, "ms-pv-2006", "Actual_30.25_-89.45_2006_UPV_118MW_5_Min.csv")
data = pd.read_csv(csv_path)
data['LocalTime'] = pd.to_datetime(data['LocalTime'], format='%m/%d/%y %H:%M')
# Feature Engineering
data['hour'] = data['LocalTime'].dt.hour
data['day_of_week'] = data['LocalTime'].dt.dayofweek
data['month'] = data['LocalTime'].dt.month
data['is_daytime'] = ((data['hour'] >= 6) & (data['hour'] <= 18)).astype(int)

Add the features: hour, day of the week, month and is_daytime as columns to the data.
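One common refinement, not applied here, is to encode cyclical features such as hour with sine/cosine pairs, so that 23:00 and 00:00 end up close together in feature space:

# Hypothetical cyclical encoding of the hour-of-day feature
data['hour_sin'] = np.sin(2 * np.pi * data['hour'] / 24)
data['hour_cos'] = np.cos(2 * np.pi * data['hour'] / 24)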

# Define features and target variable
X = data[['hour', 'day_of_week', 'month', 'is_daytime']]
y = data['Power(MW)']
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=False)
# LightGBM Model Training
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
# Train the model with early stopping
model = lgb.train(
    params,
    lgb_train,
    valid_sets=[lgb_eval],  # Validation set
    num_boost_round=1000,
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),  # Early stopping callback
        lgb.log_evaluation(period=100)          # Log evaluation progress every 100 iterations
    ]
)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.041361 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 44
[LightGBM] [Info] Number of data points in the train set: 84096, number of used features: 4
[LightGBM] [Info] Start training from score 23.859394
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[38]	valid_0's rmse: 15.1899
# Predictions
y_pred = model.predict(X_test, num_iteration=model.best_iteration)
# Evaluate the Model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
Root Mean Squared Error (RMSE): 15.189919633723044
Mean Absolute Error (MAE): 10.479850535578153
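For reference, with actual values $y_i$, predictions $\hat{y}_i$, and $n$ test samples, the reported metrics are

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|.$$

Because RMSE squares the errors, it penalizes large misses more heavily and is never smaller than MAE, consistent with the values above.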
# Visualization: Actual vs Predicted
plt.figure(figsize=(15, 6))
plt.plot(y_test.values[:1000], label="Actual", color='blue', linewidth=0.8)
plt.plot(y_pred[:1000], label="Predicted", color='orange', linestyle='--', linewidth=0.8)
plt.title('Actual vs Predicted Power Generation', fontsize=16)
plt.xlabel('Sample Index', fontsize=12)
plt.ylabel('Power (MW)', fontsize=12)
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

This yielded a much better result than the other models, with low RMSE and MAE for time-series data with high seasonality.

# Feature Importance
importance = model.feature_importance()
feature_names = X.columns
plt.figure(figsize=(10, 6))
plt.barh(feature_names, importance, color='teal', alpha=0.7)
plt.title('Feature Importance', fontsize=16)
plt.xlabel('Importance', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.tight_layout()
plt.show()

Note that in LightGBM's default split-based importance, larger values mean a feature was used in more splits. A binary feature such as is_daytime (true between 6 AM and 6 PM, false otherwise) can be split on only once per tree path, so its split count is naturally low even when it carries strong signal; as expected, the day/night distinction is the decisive factor for a successful prediction.
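As a supplementary check (not part of the original pipeline), LightGBM can also report gain-based importance, the total loss reduction contributed by each feature's splits, which is often more informative for low-cardinality features like is_daytime:

# Gain-based importance: total improvement in the objective from each feature's splits
gain = model.feature_importance(importance_type='gain')
for name, g in sorted(zip(feature_names, gain), key=lambda t: -t[1]):
    print(f"{name}: {g:.1f}")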

Key findings#

  • Because the data is highly seasonal, accuracy improves substantially when using models better suited to such data, with gradient boosting combined with feature engineering yielding the best results.

  • The most significant feature is is_daytime, a boolean indicating whether the observation falls between 6 AM and 6 PM, underscoring its importance in prediction success.

  • Graphical representation clearly illustrates the ranking of feature importance.