Forecasting Solar Energy#

Objective#

The primary objective is to build and evaluate a predictive model, with a focus on understanding the relative importance of different features in making predictions. As more solar power is integrated into power systems, accurately forecasting solar power output becomes crucial for reliable and economic system operation. Conventional generators must ramp up and down to follow the rise and fall of solar output around sunrise and sunset, and on cloudy days forecasting solar power fluctuations is important for scheduling adequate reserve capacity. The model works with the structured data needed for solar power forecasting and aims to optimize prediction accuracy while providing explainability through a simple example.
We use historical time-series data from 2006 for a site in Mississippi to analyze and forecast solar-energy output. We want to reiterate that the purpose of this module is not to produce the most accurate forecast, but to demonstrate the process of developing a machine learning pipeline for solar power forecasting and analyzing feature importance for interpretability. The dataset used in this module is a sample dataset for demonstration purposes only, and the techniques and methods shown here can be applied to other solar power forecasting datasets. Other forecasting methods may follow quite different procedures, but the main data processing steps should be similar.

Purpose#

  • To develop a machine learning pipeline for solar power forecasting.

  • To analyze feature importance so that forecasting results are interpretable.

  • To improve predictive accuracy using feature engineering and optimization techniques.

Who is this useful for?#

  • Data Scientists: Interested in understanding feature importance in forecasting models.

  • Decision-Makers: Seeking insights from the predictions for actionable strategies.

  • Students & Researchers: Exploring predictive modeling and feature analysis.

Applications#

  • Predicting outcomes from structured time-series data, demonstrated here for solar power forecasting (similar pipelines apply to, e.g., sales, risk assessment, or customer behavior).

  • Identifying key drivers influencing outcomes for resource allocation.

  • Benchmarking forecasting performance (accuracy) against baseline algorithms.

Notebook Components#

  1. Data Preparation: Importing, cleaning, and preprocessing the historical time-series data for solar power forecasting.

  2. Model Development: Training machine learning models.

    • ARIMA model: ARIMA stands for AutoRegressive Integrated Moving Average, a popular statistical method for time series forecasting. An ARIMA model has three main components:

      1. AutoRegressive (AR): the variable is regressed on its own lagged (past) values; the number of lags is denoted by p.

      2. Integrated (I): the data is differenced to make it stationary, i.e., its mean and variance are constant over time; the number of differences needed is denoted by d.

      3. Moving Average (MA): the error term is modeled as a linear combination of past error terms; the number of lagged error terms is denoted by q.

      The model is written ARIMA(p, d, q): p is the lag order, d is the degree of differencing, and q is the size of the moving-average window. Building an ARIMA model involves identification (choosing p, d, and q with Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots), estimation (fitting the model with the chosen parameters), diagnostic checking (verifying that the residuals resemble white noise, i.e., normally distributed with zero mean and constant variance), and forecasting (using the fitted model to predict future values). A minimal ARIMA sketch appears after this list.

    • Prophet model: Prophet is an open-source forecasting tool developed by Facebook. It is designed for time series with daily, weekly, and yearly seasonality as well as holiday effects; it is robust to missing data and shifts in the trend, and typically handles outliers well. It works best with series that have strong seasonal effects and several seasons of historical data, and it makes it easy to add extra regressors to improve forecast accuracy. Building a Prophet model involves:

      1. Data preparation: format the data with columns 'ds' (date) and 'y' (value to forecast).

      2. Model initialization: create a Prophet object and specify any seasonalities or holidays.

      3. Model fitting: fit the model to the historical data.

      4. Forecasting: use the fitted model to make future predictions.

      5. Visualization: plot the forecasted values along with the historical data to assess the model's performance.

    • LightGBM model: LightGBM (Light Gradient Boosting Machine) is a highly efficient, scalable gradient boosting framework that uses tree-based learning algorithms. Its key features:

      1. Gradient boosting: models are built sequentially, with each new model correcting the errors of the previous ones by minimizing a loss function via gradient descent.

      2. Tree-based learning: decision trees are the base learners, accelerated by Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which reduce the number of data instances and features considered during training.

      3. Efficiency: histogram-based decision tree learning reduces the cost of finding the best split, and leaf-wise tree growth (rather than the level-wise growth used in other frameworks) can produce deeper trees and better accuracy.

      4. Scalability: LightGBM handles large, high-dimensional datasets efficiently and supports parallel and distributed learning, making it suitable for big data applications.

      5. Accuracy: thanks to these optimizations, LightGBM often achieves higher accuracy than other gradient boosting frameworks.

  3. Feature Importance Analysis: Evaluating which features contribute most to predictions, and analyzing the key factors in solar power forecasting.

  4. Visualization: Graphically representing feature importance for interpretability.
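To make the ARIMA(p, d, q) notation above concrete, here is a minimal, self-contained sketch on a synthetic toy series; the order (2, 0, 2) is purely illustrative, not a recommendation for the solar data.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Toy series: a noisy sine wave standing in for any univariate signal
rng = np.random.default_rng(0)
toy = np.sin(np.linspace(0, 20, 200)) + rng.normal(scale=0.2, size=200)

# ARIMA(p=2, d=0, q=2): two AR lags, no differencing, two MA lags
fit = ARIMA(toy, order=(2, 0, 2)).fit()
print(fit.forecast(steps=5))  # predict the next five points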

# Import all the required libraries, pandas for data analytics, numpy for numerical calculation, matplotlib for plotting and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
!{sys.executable} -m pip install statsmodels
!{sys.executable} -m pip install scikit-learn
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.statespace.sarimax import SARIMAX

pd.set_option('display.max_rows', 300)  # limit the maximum number of rows shown in display
# Define the data path for the input data: historical 5-minute solar power output
import os

current_dir = os.getcwd()
csv_path = os.path.join(current_dir, "ms-pv-2006", "Actual_30.25_-89.45_2006_UPV_118MW_5_Min.csv")

# Read the historical data with pandas; keep two copies, one for the ARIMA
# workflow (df) and one for the Prophet-style workflow (data)
df = pd.read_csv(csv_path)
data = pd.read_csv(csv_path)

Add a new column 'Datetime', parsed with the format specified below, and make it our index for later use.

df['Datetime'] = pd.to_datetime(df['LocalTime'], format='%m/%d/%y %H:%M')
df.set_index('Datetime', inplace=True)
df
LocalTime Power(MW)
Datetime
2006-01-01 00:00:00 01/01/06 00:00 0.0
2006-01-01 00:05:00 01/01/06 00:05 0.0
2006-01-01 00:10:00 01/01/06 00:10 0.0
2006-01-01 00:15:00 01/01/06 00:15 0.0
2006-01-01 00:20:00 01/01/06 00:20 0.0
... ... ...
2006-12-31 23:35:00 12/31/06 23:35 0.0
2006-12-31 23:40:00 12/31/06 23:40 0.0
2006-12-31 23:45:00 12/31/06 23:45 0.0
2006-12-31 23:50:00 12/31/06 23:50 0.0
2006-12-31 23:55:00 12/31/06 23:55 0.0

105120 rows Γ— 2 columns

# Show the historical data; keep the power series as dff for later modeling
dff = df['Power(MW)']
df.head(200)
LocalTime Power(MW)
Datetime
2006-01-01 00:00:00 01/01/06 00:00 0.0
2006-01-01 00:05:00 01/01/06 00:05 0.0
... ... ...
2006-01-01 07:15:00 01/01/06 07:15 0.0
2006-01-01 07:20:00 01/01/06 07:20 3.6
2006-01-01 07:25:00 01/01/06 07:25 1.0
2006-01-01 07:30:00 01/01/06 07:30 3.6
... ... ...
2006-01-01 16:30:00 01/01/06 16:30 17.8
2006-01-01 16:35:00 01/01/06 16:35 9.9

(output truncated: power stays at 0.0 overnight and becomes nonzero from about 07:20 onward)
data['LocalTime'] = pd.to_datetime(data['LocalTime'], format='%m/%d/%y %H:%M')
data = data.rename(columns={'LocalTime': 'ds', 'Power(MW)': 'y'})
# Forward-fill any missing power values
if data['y'].isnull().sum() > 0:
    data['y'] = data['y'].ffill()

# Resample the 5-minute data to hourly means ('h' replaces the deprecated 'H' alias)
data_hourly = data.resample('h', on='ds').mean().reset_index()
# Plot the one-year historical solar generation curve
plt.figure(figsize=(12, 6))
plt.plot(data_hourly['ds'], data_hourly['y'], color='b', label='Power (MW)')
plt.xlabel('Date')
plt.ylabel('Power (MW)')
plt.title('Hourly Power Generation')
plt.legend()
plt.show()

From the overview, our data seems to have high daily seasonality, with zeros in the nighttime and peak power output during the day.

ARIMA#

We begin our forecasting with the Auto-Regressive Integrated Moving Average (ARIMA) model, a widely used method for time-series forecasting. ARIMA is particularly effective for univariate data that is stationary, meaning its statistical properties such as mean and variance are constant over time. However, our dataset exhibits daily seasonality, with power output peaking during the day and dropping to zero at night. This inherent seasonality suggests that our data may not be strictly stationary, which could impact the model’s accuracy. Despite this, we proceed with ARIMA to establish a baseline and will consider additional preprocessing steps, such as differencing, to address non-stationarity and improve model performance.

# Check whether the historical solar power series satisfies the stationarity requirement
def adf_test(series):
    result = adfuller(series)
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    if result[1] > 0.05:
        print("Series is non-stationary")
    else:
        print("Series is stationary")

The Augmented Dickey-Fuller (ADF) test is a statistical test used to determine if a time series is stationary, meaning its statistical properties such as mean, variance, and autocorrelation are constant over time. The test works by assessing the null hypothesis that a unit root is present in the time series data, which would indicate non-stationarity. If the p-value obtained from the test is below a certain threshold (commonly 0.05), the null hypothesis is rejected, suggesting that the series is stationary. Conversely, a p-value above the threshold indicates that the series is non-stationary and may require differencing or other transformations to achieve stationarity.
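As a quick illustration (a toy example, not part of the pipeline), the test behaves as expected on two synthetic series: white noise passes, while a random walk, the classic unit-root process, fails. This uses the adf_test helper defined above.

rng = np.random.default_rng(1)
white_noise = rng.normal(size=1000)   # stationary by construction
random_walk = np.cumsum(white_noise)  # unit-root process, non-stationary

adf_test(pd.Series(white_noise))  # expected: "Series is stationary"
adf_test(pd.Series(random_walk))  # expected: "Series is non-stationary"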

adf_test(dff)
ADF Statistic: -39.02861991156054
p-value: 0.0
Series is stationary

The ADF test suggests our data is stationary. Given the strong daily seasonality this may not strictly be true, but we move on with it and do not difference the series yet. Differencing is a transformation technique used in time series analysis to make a non-stationary series stationary by removing trends or seasonality: each observation is replaced by its difference from the previous observation. We may still want to try differencing later.
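As a tiny numerical illustration of differencing:

s = pd.Series([1, 3, 6, 10])
print(s.diff())  # NaN, 2.0, 3.0, 4.0 -- the first value has no predecessor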

plt.figure(figsize=(12,6))
plt.subplot(121)
plot_acf(dff, ax=plt.gca())
plt.subplot(122)
plot_pacf(dff, ax=plt.gca())
plt.show()

The ARIMA model has p, d, and q values that we must select based on our data; these ACF and PACF plots help determine them.
The plots above suggest that differencing is required for our data.

# Differencing: first-order difference of the power series
df['PowerDiff'] = dff.diff()
diff = df['PowerDiff'].dropna()

Construct the plot again for our differenced data.

plt.figure(figsize=(12,6))
plt.subplot(121)
plot_acf(diff.dropna(), ax=plt.gca())  # ACF for 'q'
plt.subplot(122)
plot_pacf(diff.dropna(), ax=plt.gca())  # PACF for 'p'
plt.show()

From these plots we can finally read off the values of p, d, and q to use for our model: p = 2, d = 0, and q = 2.
We can also check whether these choices are sound by running a grid search over candidate orders for our model.

import warnings
from statsmodels.tsa.arima.model import ARIMA
warnings.filterwarnings("ignore")

def evaluate_arima_model(X, arima_order):
    """Fit an ARIMA model with the given order and return its AIC."""
    model = ARIMA(X, order=arima_order)
    model_fit = model.fit()
    return model_fit.aic


def grid_search_arima(data, p_values, d_values, q_values):
    """Exhaustively search (p, d, q) combinations and keep the lowest-AIC order."""
    best_aic = float("inf")
    best_order = None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                try:
                    aic = evaluate_arima_model(data, (p, d, q))
                    if aic < best_aic:
                        best_aic = aic
                        best_order = (p, d, q)
                except Exception:
                    # Some orders fail to converge; skip them
                    continue
    return best_order


p_values = range(0, 3)
d_values = range(0, 2)
q_values = range(0, 3)


best_order = grid_search_arima(dff, p_values, d_values, q_values)
print(f'Best ARIMA order: {best_order}')
Best ARIMA order: (2, 0, 2)
# Splitting the data into test and train.
train_size = int(len(df) * 0.8)
train, test = dff[:train_size], dff[train_size:]
# Fitting the model
best_p, best_d, best_q = best_order
model = ARIMA(train, order=(best_p, best_d, best_q))
model_fit = model.fit()


predictions = model_fit.forecast(steps=len(test))


mse = mean_squared_error(test, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
RMSE: 25.9281711705478

Output the predicted values at different times of day to sanity-check the forecast.

print(predictions[:21024:100])
print(f"Test set length: {len(test)}, Predictions length: {len(predictions)}")
2006-10-20 00:00:00     0.242440
2006-10-20 08:20:00    15.077833
2006-10-20 16:40:00    20.586042
2006-10-21 01:00:00    22.638099
2006-10-21 09:20:00    23.402582
2006-10-21 17:40:00    23.687387
2006-10-22 02:00:00    23.793489
2006-10-22 10:20:00    23.833017
2006-10-22 18:40:00    23.847743
2006-10-23 03:00:00    23.853229
2006-10-23 11:20:00    23.855273
2006-10-23 19:40:00    23.856034
2006-10-24 04:00:00    23.856318
...
2006-12-31 22:00:00    23.856486
Freq: 500min, Name: predicted_mean, dtype: float64
Test set length: 21024, Predictions length: 21024

(output truncated: every later value flattens to 23.856486)
# Plot
plt.figure(figsize=(10,6))
plt.plot(train.index, train, label='Train Data', color='blue')
plt.plot(test.index, test, label='Test Data', color='green')
plt.plot(test.index, predictions, label='Predictions', color='red', linestyle='--')
plt.legend()
plt.xlabel('Datetime')
plt.ylabel('Power (MW)')
plt.title('ARIMA Model Predictions vs Actual')
plt.show()

We can see from the data and the plot that the prediction rises sharply and then flattens out at about 23.86 MW for both daytime and nighttime, which is very poor accuracy.

# Forecast the next 100 values (100 five-minute steps)
future_steps = 100
forecast = model_fit.forecast(steps=future_steps)
# Plot; the forecast index uses the data's native 5-minute frequency
plt.figure(figsize=(10,6))
plt.plot(df.index, dff, label='Original Data')
plt.plot(pd.date_range(df.index[-1], periods=future_steps, freq='5min'), forecast, label='Forecast', color='green')
plt.legend()
plt.show()

ARIMA yielded a poor prediction, likely because our data is highly seasonal. So we next try a model known for handling seasonal data effectively.
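Before leaving the ARIMA family, note that the SARIMAX class imported earlier adds an explicit seasonal (P, D, Q, s) term. The sketch below is illustrative only and is not run here: with 5-minute data a daily season is s = 288 steps, and fitting with such a long seasonal period is computationally heavy, which is one more reason to switch models.

# Illustrative only: seasonal_order=(P, D, Q, s); s = 288 five-minute steps per day
seasonal_model = SARIMAX(train, order=(2, 0, 2), seasonal_order=(1, 0, 1, 288))
# seasonal_fit = seasonal_model.fit(disp=False)  # very slow at this resolution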

Prophet#

The Prophet model, developed by Meta, is a robust and user-friendly tool for time series forecasting. It is particularly well suited to data with strong seasonal patterns and missing values, as well as scenarios with outliers or trend changes. We now apply it to our data.

import sys
!{sys.executable} -m pip install prophet
from prophet import Prophet
import os
current_dir = os.getcwd()
csv_path = os.path.join(current_dir, "ms-pv-2006", "Actual_30.25_-89.45_2006_UPV_118MW_5_Min.csv")
data = pd.read_csv(csv_path)
data['LocalTime'] = pd.to_datetime(data['LocalTime'], format='%m/%d/%y %H:%M')
# Prophet expects the columns to be named 'ds' (datetime) and 'y' (target)
data = data.rename(columns={'LocalTime': 'ds', 'Power(MW)': 'y'})
if data['y'].isnull().sum() > 0:
    data['y'] = data['y'].ffill()

# Resample to hourly means and split 80/20 into train and test
data_hourly = data.resample('h', on='ds').mean().reset_index()
split_index = int(len(data_hourly) * 0.8)
train_data = data_hourly[:split_index]
test_data = data_hourly[split_index:]
# Fit the model
model = Prophet(yearly_seasonality=True, daily_seasonality=True, weekly_seasonality=True)
model.fit(train_data)
17:28:17 - cmdstanpy - INFO - Chain [1] start processing
17:28:17 - cmdstanpy - INFO - Chain [1] done processing
<prophet.forecaster.Prophet at 0x7fedf0ba6d90>
# Make future predictions: 30 days of hourly steps beyond the training data
future = model.make_future_dataframe(periods=24*30, freq='h')
forecast = model.predict(future)
test_forecast = forecast.merge(test_data, on='ds', how='right')
plt.figure(figsize=(12, 6))
plt.plot(test_data['ds'], test_data['y'], label='Test Data (Actual)', color='blue', alpha=0.6)
plt.plot(test_forecast['ds'], test_forecast['yhat'], label='Predicted (Test)', color='orange', alpha=0.8)
plt.fill_between(test_forecast['ds'], test_forecast['yhat_lower'], test_forecast['yhat_upper'], color='orange', alpha=0.2, label='Uncertainty Interval')
plt.title("Test Data Accuracy (Prediction)")
plt.xlabel("Date")
plt.ylabel("Power (MW)")
plt.legend()
plt.grid()
plt.show()
plt.figure(figsize=(14, 7))
plt.plot(data['ds'], data['y'], label='Historical Data', color='black', alpha=0.6)
plt.plot(forecast['ds'], forecast['yhat'], label='Forecasted Data', color='green')
plt.fill_between(forecast['ds'], forecast['yhat_lower'], forecast['yhat_upper'], color='green', alpha=0.2, label='Uncertainty Interval')
plt.title("Full Forecast with Historical Data")
plt.xlabel("Date")
plt.ylabel("Power (MW)")
plt.legend()
plt.grid()
plt.show()
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']][:7000:70]
ds yhat yhat_lower yhat_upper
0 2006-01-01 00:00:00 -6.665930 -24.240018 10.358625
70 2006-01-03 22:00:00 -4.474233 -20.038008 11.853261
140 2006-01-06 20:00:00 -6.592283 -22.713803 9.164866
210 2006-01-09 18:00:00 -5.583250 -22.844907 12.420671
280 2006-01-12 16:00:00 26.849308 10.893639 44.342990
350 2006-01-15 14:00:00 48.079698 31.277399 63.720274
420 2006-01-18 12:00:00 49.432818 32.739164 65.115053
490 2006-01-21 10:00:00 53.029588 37.641518 69.696168
560 2006-01-24 08:00:00 45.396122 29.912687 61.725492
630 2006-01-27 06:00:00 10.705178 -6.510347 26.607076
700 2006-01-30 04:00:00 -8.956497 -24.422312 8.420139
...
6930 2006-10-16 18:00:00 -7.963110 -23.904965 7.692729

(output truncated: the sampled rows repeat the same daily cycle of slightly negative nighttime and roughly 50-65 MW midday forecasts through mid-October)

Accuracy#

We can see that accuracy has improved greatly with this model, which accounts for all the seasonality; however, it still predicts values below 0, which can never occur for solar power output.
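Since negative solar output is physically impossible, a simple post-processing step (not applied in this notebook) would be to clip the forecast at zero:

# Clip predictions and the lower uncertainty bound at zero
forecast['yhat'] = forecast['yhat'].clip(lower=0)
forecast['yhat_lower'] = forecast['yhat_lower'].clip(lower=0)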

model.plot_components(forecast)
plt.show()

These plots reveal the seasonality patterns in the forecast, highlighting the underlying trends in the data.
So far we have not accounted for the seasonality of the data ourselves before forecasting; we have let the model handle it on its own. Next we handle the seasonality explicitly through feature engineering.

LightGBM with feature engineering#

LightGBM (Light Gradient Boosting Machine) is a powerful and efficient gradient boosting framework widely used for machine learning tasks.

Feature engineering is the process of preparing and transforming raw data into features that better represent the underlying problem to improve the performance of a machine learning model. A feature is an individual measurable property or characteristic of a phenomenon being observed, often represented as a column in a dataset.
The goal of feature engineering is to extract the most relevant information from the raw data, making it easier for the model to learn patterns and make predictions.

import sys
!{sys.executable} -m pip install lightgbm
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error
import os
current_dir = os.getcwd()
csv_path = os.path.join(current_dir, "ms-pv-2006", "Actual_30.25_-89.45_2006_UPV_118MW_5_Min.csv")
data = pd.read_csv(csv_path)
data['LocalTime'] = pd.to_datetime(data['LocalTime'], format='%m/%d/%y %H:%M')
# Feature Engineering
data['hour'] = data['LocalTime'].dt.hour
data['day_of_week'] = data['LocalTime'].dt.dayofweek
data['month'] = data['LocalTime'].dt.month
data['is_daytime'] = ((data['hour'] >= 6) & (data['hour'] <= 18)).astype(int)

Add the features: hour, day of the week, month and is_daytime as columns to the data.
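One common refinement, not applied here, is to encode cyclical features such as hour with sine/cosine pairs, so that 23:00 and 00:00 end up close together in feature space:

# Hypothetical cyclical encoding of the hour-of-day feature
data['hour_sin'] = np.sin(2 * np.pi * data['hour'] / 24)
data['hour_cos'] = np.cos(2 * np.pi * data['hour'] / 24)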

# Define features and target variable
X = data[['hour', 'day_of_week', 'month', 'is_daytime']]
y = data['Power(MW)']
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=False)
# LightGBM Model Training
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
# Train the model with early stopping
model = lgb.train(
    params,
    lgb_train,
    valid_sets=[lgb_eval],  # Validation set
    num_boost_round=1000,
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),  # Early stopping callback
        lgb.log_evaluation(period=100)          # Log evaluation progress every 100 iterations
    ]
)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.041361 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 44
[LightGBM] [Info] Number of data points in the train set: 84096, number of used features: 4
[LightGBM] [Info] Start training from score 23.859394
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[38]	valid_0's rmse: 15.1899
# Predictions
y_pred = model.predict(X_test, num_iteration=model.best_iteration)
# Evaluate the Model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
Root Mean Squared Error (RMSE): 15.189919633723044
Mean Absolute Error (MAE): 10.479850535578153
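For reference, with actual values $y_i$, predictions $\hat{y}_i$, and $n$ test samples, the reported metrics are

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|.$$

Because RMSE squares the errors, it penalizes large misses more heavily and is never smaller than MAE, consistent with the values above.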
# Visualization: Actual vs Predicted
plt.figure(figsize=(15, 6))
plt.plot(y_test.values[:1000], label="Actual", color='blue', linewidth=0.8)
plt.plot(y_pred[:1000], label="Predicted", color='orange', linestyle='--', linewidth=0.8)
plt.title('Actual vs Predicted Power Generation', fontsize=16)
plt.xlabel('Sample Index', fontsize=12)
plt.ylabel('Power (MW)', fontsize=12)
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

This yielded a much better result than the other models, with low RMSE and MAE for time-series data with high seasonality.

# Feature Importance
importance = model.feature_importance()
feature_names = X.columns
plt.figure(figsize=(10, 6))
plt.barh(feature_names, importance, color='teal', alpha=0.7)
plt.title('Feature Importance', fontsize=16)
plt.xlabel('Importance', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.tight_layout()
plt.show()

Note that in LightGBM's default split-based importance, larger values mean a feature was used in more splits. A binary feature such as is_daytime (true between 6 AM and 6 PM, false otherwise) can be split on only once per tree path, so its split count is naturally low even when it carries strong signal; as expected, the day/night distinction is the decisive factor for a successful prediction.
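As a supplementary check (not part of the original pipeline), LightGBM can also report gain-based importance, the total loss reduction contributed by each feature's splits, which is often more informative for low-cardinality features like is_daytime:

# Gain-based importance: total improvement in the objective from each feature's splits
gain = model.feature_importance(importance_type='gain')
for name, g in sorted(zip(feature_names, gain), key=lambda t: -t[1]):
    print(f"{name}: {g:.1f}")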

Key findings#

  • Because the data is highly seasonal, accuracy improves substantially when using models better suited to such data, with gradient boosting combined with feature engineering yielding the best results.

  • The most significant feature is is_daytime, a boolean indicating whether the observation falls between 6 AM and 6 PM, underscoring its importance in prediction success.

  • Graphical representation clearly illustrates the ranking of feature importance.