Forecasting Solar Energy#
Objective#
The primary objective is to build and evaluate a predictive model, with a focus on understanding the relative importance of different features in making predictions. As more solar power is integrated into power systems, accurately forecasting solar power output becomes crucial for the reliable and economic operation of the system. Conventional generators need to ramp up and down to follow the rise and decline of solar output during sunrise and sunset. On cloudy days, it is also important to forecast solar power fluctuations so that adequate reserve capacity can be prepared. The model deals with the structured data needed for solar power forecasting and aims to optimize prediction accuracy while providing explainability through a simple example.
We use historical time-series data from 2006 for a site in Mississippi to analyze and forecast solar-energy output.
We want to reiterate that the purpose of this module is not to provide the most accurate forecast, but to demonstrate the process of developing a machine learning pipeline for solar power forecasting and analyzing feature importance for interpretability. The dataset used in this module is a sample dataset for demonstration purposes only. The techniques and methods used here can be applied to other datasets for solar power forecasting. Other forecasting methods may follow quite different procedures, but the main data-processing steps should be similar.
Purpose#
To develop a machine learning pipeline for solar power forecasting.
To analyze feature importance for interpretability of the forecasting results.
To improve predictive accuracy using feature engineering and optimization techniques.
Who is this useful for?#
Data Scientists: Interested in understanding feature importance in forecasting models.
Decision-Makers: Seeking insights from the predictions for actionable strategies.
Students & Researchers: Exploring predictive modeling and feature analysis.
Applications#
Predicting outcomes in structured time-series data, using solar power forecasting as the worked example; the same workflow applies to domains such as sales, risk assessment, or customer behavior.
Identifying key drivers influencing outcomes for resource allocation.
Benchmarking forecasting performance (accuracy) against baseline algorithms.
Notebook Components#
Data Preparation: Importing, cleaning, and preprocessing the historical time-series data for solar power forecasting.
Model Development: Training machine learning models.
ARIMA model - ARIMA stands for AutoRegressive Integrated Moving Average, a popular statistical method for time series forecasting. The model has three main components:
1. AutoRegressive (AR) part: regresses the variable on its own lagged (past) values; the number of lagged values to include is denoted by p.
2. Integrated (I) part: differences the data to make it stationary, meaning its mean and variance are constant over time; the number of differences needed to achieve stationarity is denoted by d.
3. Moving Average (MA) part: models the error term as a linear combination of past error terms; the number of lagged error terms to include is denoted by q.
The model is generally written as ARIMA(p, d, q), where p is the lag order, d is the number of times the raw observations are differenced, and q is the size of the moving-average window. Steps to build an ARIMA model: Identification (determine p, d, and q using tools such as the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots), Estimation (fit the model to the time series with the identified parameters), Diagnostic checking (verify the residuals resemble white noise, i.e., normally distributed with zero mean and constant variance), and Forecasting (use the fitted model to make future predictions). A minimal sketch on synthetic data follows this list.
Prophet model - Prophet is an open-source forecasting tool developed by Facebook. It is designed to handle time series that may have daily, weekly, and yearly seasonality, along with holiday effects. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. It works best with time series that have strong seasonal effects and several seasons of historical data. The model is intuitive and makes it easy to incorporate additional regressors to improve forecast accuracy. Steps to build a Prophet model: Data preparation (ensure the data has columns 'ds' for the date and 'y' for the value to forecast), Model initialization (create a Prophet object and specify any seasonalities or holidays), Model fitting (fit the model to the historical data), Forecasting (use the fitted model to make future predictions), and Visualization (plot the forecasted values along with the historical data to assess the model's performance). A minimal sketch on synthetic data follows this list.
LightGBM model - LightGBM (Light Gradient Boosting Machine) is a highly efficient and scalable gradient boosting framework that uses tree-based learning algorithms. Its key features:
Gradient boosting: models are built sequentially, with each new model correcting the errors of the previous ones by minimizing a loss function via gradient descent.
Tree-based learning: decision trees serve as the base learners; Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) speed up training and reduce memory usage.
Efficiency: histogram-based decision tree learning reduces the cost of finding the best split, and leaf-wise tree growth (unlike the level-wise growth used in other frameworks) can yield deeper trees and better accuracy; GOSS and EFB further reduce the number of data instances and features considered during training.
Scalability: LightGBM handles large, high-dimensional datasets efficiently and supports parallel and distributed learning, making it suitable for big data applications.
Accuracy: its efficient implementation and advanced techniques often achieve higher accuracy than other gradient boosting frameworks.
A minimal sketch on synthetic data follows this list.
Feature Importance Analysis: Evaluating which features contribute most to the predictions and analyzing the key factors in solar power forecasting.
Visualization: Graphically representing feature importance for interpretability.
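To make the ARIMA notation concrete before we begin, here is a minimal sketch on synthetic data (our illustration; the real model for the solar series is fitted later in this notebook, and this assumes statsmodels is installed as below):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA
# Toy series: a random walk, which a single difference (d=1) makes stationary
rng = np.random.default_rng(0)
toy_series = rng.normal(size=200).cumsum()
fit = ARIMA(toy_series, order=(2, 1, 2)).fit()  # p=2 AR lags, d=1 difference, q=2 MA terms
print(fit.aic)                # information criterion, used later for model comparison
print(fit.forecast(steps=5))  # five-step-ahead forecast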
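Similarly, a minimal Prophet sketch on synthetic hourly data (our illustration, assuming the prophet package installed later in this notebook):

import numpy as np
import pandas as pd
from prophet import Prophet
# Toy frame in Prophet's required format: a 'ds' timestamp column and a 'y' value column
ds = pd.date_range('2006-01-01', periods=240, freq='h')
toy = pd.DataFrame({'ds': ds, 'y': np.sin(2 * np.pi * ds.hour / 24)})
m = Prophet(daily_seasonality=True)                     # enable the daily cycle explicitly
m.fit(toy)                                              # fit on the toy history
future = m.make_future_dataframe(periods=24, freq='h')  # extend one day past the history
print(m.predict(future)[['ds', 'yhat']].tail())         # forecasted values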
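And a minimal LightGBM regression sketch on synthetic data (again our illustration; the solar model later in this notebook uses the same lgb.train API):

import numpy as np
import lightgbm as lgb
# Toy regression: the target depends mainly on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=1000)
params = {
    'objective': 'regression',
    'num_leaves': 31,      # leaf-wise growth is capped by leaf count, not depth
    'learning_rate': 0.1,
}
booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=50)
print(booster.predict(X[:3]))  # predictions for the first three rows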
# Import all the required libraries, pandas for data analytics, numpy for numerical calculation, matplotlib for plotting and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
!{sys.executable} -m pip install statsmodels
!{sys.executable} -m pip install scikit-learn
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Update submodules to fetch data
pd.set_option('display.max_rows', 300) #set the limit for the maximum number of rows in display
# define the data path for the input data: historical 5-minute solar power output
import os
current_dir = os.getcwd()
csv_path = os.path.join(current_dir, "ms-pv-2006", "Actual_30.25_-89.45_2006_UPV_118MW_5_Min.csv")
# read the historical data with pandas (two copies: df for ARIMA, data for the later models)
df = pd.read_csv(csv_path)
data = pd.read_csv(csv_path)
Add a new column 'Datetime', parsed from 'LocalTime' with the format specified below, and make it our index for later use.
df['Datetime'] = pd.to_datetime(df['LocalTime'], format='%m/%d/%y %H:%M')
df.set_index('Datetime', inplace=True)
df
| Datetime | LocalTime | Power(MW) |
|---|---|---|
| 2006-01-01 00:00:00 | 01/01/06 00:00 | 0.0 |
| 2006-01-01 00:05:00 | 01/01/06 00:05 | 0.0 |
| 2006-01-01 00:10:00 | 01/01/06 00:10 | 0.0 |
| 2006-01-01 00:15:00 | 01/01/06 00:15 | 0.0 |
| 2006-01-01 00:20:00 | 01/01/06 00:20 | 0.0 |
| ... | ... | ... |
| 2006-12-31 23:35:00 | 12/31/06 23:35 | 0.0 |
| 2006-12-31 23:40:00 | 12/31/06 23:40 | 0.0 |
| 2006-12-31 23:45:00 | 12/31/06 23:45 | 0.0 |
| 2006-12-31 23:50:00 | 12/31/06 23:50 | 0.0 |
| 2006-12-31 23:55:00 | 12/31/06 23:55 | 0.0 |

105120 rows × 2 columns
# extract the power series for later use and show the first 200 rows of historical data
dff = df['Power(MW)']
df.head(200)
| Datetime | LocalTime | Power(MW) |
|---|---|---|
| 2006-01-01 00:00:00 | 01/01/06 00:00 | 0.0 |
| 2006-01-01 00:05:00 | 01/01/06 00:05 | 0.0 |
| ... | ... | ... |
| 2006-01-01 07:15:00 | 01/01/06 07:15 | 0.0 |
| 2006-01-01 07:20:00 | 01/01/06 07:20 | 3.6 |
| 2006-01-01 07:25:00 | 01/01/06 07:25 | 1.0 |
| 2006-01-01 07:30:00 | 01/01/06 07:30 | 3.6 |
| 2006-01-01 07:35:00 | 01/01/06 07:35 | 16.8 |
| 2006-01-01 07:40:00 | 01/01/06 07:40 | 20.9 |
| ... | ... | ... |
| 2006-01-01 11:05:00 | 01/01/06 11:05 | 48.1 |
| 2006-01-01 11:10:00 | 01/01/06 11:10 | 54.0 |
| ... | ... | ... |
| 2006-01-01 16:25:00 | 01/01/06 16:25 | 21.3 |
| 2006-01-01 16:30:00 | 01/01/06 16:30 | 17.8 |
| 2006-01-01 16:35:00 | 01/01/06 16:35 | 9.9 |

200 rows × 2 columns (intermediate rows omitted: output is zero overnight, ramps up from 07:20, and fluctuates through the day)
data['LocalTime'] = pd.to_datetime(data['LocalTime'], format='%m/%d/%y %H:%M')
data = data.rename(columns={'LocalTime': 'ds', 'Power(MW)': 'y'})  # Prophet-style column names
if data['y'].isnull().sum() > 0:
    data['y'] = data['y'].ffill()  # forward-fill any missing power values
data_hourly = data.resample('h', on='ds').mean().reset_index()  # hourly means; 'h' replaces the deprecated 'H' alias
# plot the one-year historical solar generation curve
plt.figure(figsize=(12, 6))
plt.plot(data_hourly['ds'], data_hourly['y'], color='b', label='Power (MW)')
plt.xlabel('Date')
plt.ylabel('Power (MW)')
plt.title('Hourly Power Generation')
plt.legend()
plt.show()

From this overview, our data shows strong daily seasonality, with zero output at night and peak power output during the day.
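One quick way to confirm this daily cycle is to average the hourly output by hour of day; a small sketch (our addition) using the data_hourly frame created above:

# average power by hour of day: zeros overnight, a clear midday peak
daily_profile = data_hourly.groupby(data_hourly['ds'].dt.hour)['y'].mean()
plt.figure(figsize=(8, 4))
plt.bar(daily_profile.index, daily_profile.values, color='orange')
plt.xlabel('Hour of day')
plt.ylabel('Mean power (MW)')
plt.title('Average daily generation profile')
plt.show()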
ARIMA#
We begin our forecasting with the Auto-Regressive Integrated Moving Average (ARIMA) model, a widely used method for time-series forecasting. ARIMA is particularly effective for univariate data that is stationary, meaning its statistical properties such as mean and variance are constant over time. However, our dataset exhibits daily seasonality, with power output peaking during the day and dropping to zero at night. This inherent seasonality suggests that our data may not be strictly stationary, which could impact the model's accuracy. Despite this, we proceed with ARIMA to establish a baseline and will consider additional preprocessing steps, such as differencing, to address non-stationarity and improve model performance.
# check whether the input series of historical solar power satisfies the stationarity requirement
def adf_test(series):
    result = adfuller(series)
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    if result[1] > 0.05:
        print("Series is non-stationary")
    else:
        print("Series is stationary")
The Augmented Dickey-Fuller (ADF) test is a statistical test used to determine if a time series is stationary, meaning its statistical properties such as mean, variance, and autocorrelation are constant over time. The test works by assessing the null hypothesis that a unit root is present in the time series data, which would indicate non-stationarity. If the p-value obtained from the test is below a certain threshold (commonly 0.05), the null hypothesis is rejected, suggesting that the series is stationary. Conversely, a p-value above the threshold indicates that the series is non-stationary and may require differencing or other transformations to achieve stationarity.
adf_test(dff)
ADF Statistic: -39.02861991156054
p-value: 0.0
Series is stationary
The ADF test rejects the unit-root null hypothesis by a wide margin and suggests our data is stationary. Stationarity in this sense does not rule out seasonality, but we move on without differencing for now. Differencing is a transformation technique used in time series analysis to make a non-stationary series stationary by removing trends or seasonality: it subtracts the previous observation from the current observation.
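As a concrete numeric illustration of differencing (our addition, not part of the pipeline):

s = pd.Series([2, 5, 9, 9, 4])
print(s.diff().tolist())  # [nan, 3.0, 4.0, 0.0, -5.0]: each value minus its predecessor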
Though we might want to try differencing later.
plt.figure(figsize=(12,6))
plt.subplot(121)
plot_acf(dff, ax=plt.gca())
plt.subplot(122)
plot_pacf(dff, ax=plt.gca())
plt.show()

The ARIMA model has p, d, and q values that we must select based on our data; we use these ACF and PACF plots to determine them. The plots above suggest that differencing our data is worth trying.
# Differencing: subtract each observation from the next
df['PowerDiff'] = dff.diff()
diff = df['PowerDiff']
Construct the plot again for our differenced data.
plt.figure(figsize=(12,6))
plt.subplot(121)
plot_acf(diff.dropna(), ax=plt.gca()) # ACF for 'q'
plt.subplot(122)
plot_pacf(diff.dropna(), ax=plt.gca()) # PACF for 'p'
plt.show()

From these graphs we can finally read off the values of p, d, and q for our model: p = 2, d = 0, and q = 2.
We can also check whether these choices are reasonable by running a grid search over candidate orders to find the best one for our model.
import warnings
from statsmodels.tsa.arima.model import ARIMA
warnings.filterwarnings("ignore")

def evaluate_arima_model(X, arima_order):
    # fit an ARIMA model with the given order and return its AIC score
    model = ARIMA(X, order=arima_order)
    model_fit = model.fit()
    return model_fit.aic

def grid_search_arima(data, p_values, d_values, q_values):
    # exhaustively try every (p, d, q) combination and keep the lowest-AIC order
    best_aic = float("inf")
    best_order = None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                try:
                    aic = evaluate_arima_model(data, (p, d, q))
                    if aic < best_aic:
                        best_aic = aic
                        best_order = (p, d, q)
                except Exception:
                    # some orders fail to converge; skip them
                    continue
    return best_order
p_values = range(0, 3)
d_values = range(0, 2)
q_values = range(0, 3)
best_order = grid_search_arima(dff, p_values, d_values, q_values)
print(f'Best ARIMA order: {best_order}')
Best ARIMA order: (2, 0, 2)
# Split the data into train (first 80%) and test (last 20%) sets
train_size = int(len(df) * 0.8)
train, test = dff[:train_size], dff[train_size:]
# Fitting the model
best_p, best_d, best_q = best_order
model = ARIMA(train, order=(best_p, best_d, best_q))
model_fit = model.fit()
predictions = model_fit.forecast(steps=len(test))
mse = mean_squared_error(test, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
RMSE: 25.9281711705478
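For context, we can compare this against a seasonal-naive baseline (our addition, not in the original notebook) that simply repeats the last observed day across the test horizon:

# seasonal-naive baseline: repeat the final training day (288 five-minute steps)
season = 288  # 24 h x 12 five-minute intervals per hour
tiled = np.tile(train.iloc[-season:].to_numpy(), int(np.ceil(len(test) / season)))[:len(test)]
print('Seasonal-naive RMSE:', np.sqrt(mean_squared_error(test, tiled)))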
Output predicted values at different times of day (every 100th of the 21,024 five-minute steps) to inspect the forecast:
print(predictions[:21024:100])
print(f"Test set length: {len(test)}, Predictions length: {len(predictions)}")
2006-10-20 00:00:00     0.242440
2006-10-20 08:20:00    15.077833
2006-10-20 16:40:00    20.586042
2006-10-21 01:00:00    22.638099
2006-10-21 09:20:00    23.402582
2006-10-21 17:40:00    23.687387
2006-10-22 02:00:00    23.793489
2006-10-22 10:20:00    23.833017
2006-10-22 18:40:00    23.847743
2006-10-23 03:00:00    23.853229
2006-10-23 11:20:00    23.855273
2006-10-23 19:40:00    23.856034
2006-10-24 04:00:00    23.856318
2006-10-24 12:20:00    23.856424
2006-10-24 20:40:00    23.856463
...                    23.856486
2006-12-31 22:00:00    23.856486
Freq: 500min, Name: predicted_mean, dtype: float64
Test set length: 21024, Predictions length: 21024
(Intermediate rows omitted: every value from late October onward sits at 23.856486.)
# Plot
plt.figure(figsize=(10,6))
plt.plot(train.index, train, label='Train Data', color='blue')
plt.plot(test.index, test, label='Test Data', color='green')
plt.plot(test.index, predictions, label='Predictions', color='red', linestyle='--')
plt.legend()
plt.xlabel('Datetime')
plt.ylabel('Power (MW)')
plt.title('ARIMA Model Predictions vs Actual')
plt.show()

We can see from the data and the plot that our prediction rises sharply and then flattens out at about 23.86 MW for both daytime and nighttime, which gives very poor accuracy.
# Forecast the next 100 values (5-minute steps)
future_steps = 100
forecast = model_fit.forecast(steps=future_steps)
# Plot the original series and the forecast at the data's 5-minute resolution
plt.figure(figsize=(10,6))
plt.plot(df.index, dff, label='Original Data')
plt.plot(pd.date_range(df.index[-1], periods=future_steps, freq='5min'), forecast, label='Forecast', color='green')
plt.legend()
plt.show()

ARIMA yielded a poor prediction, potentially because our data is highly seasonal, so we try a model known for handling seasonal data effectively.
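Before moving on, note that the SARIMAX class imported earlier offers a seasonal extension of ARIMA. A hedged sketch on the hourly-resampled series (the order values here are illustrative assumptions, not tuned; a seasonal period of 288 at 5-minute resolution would be very slow to fit):

hourly = dff.resample('h').mean()  # hourly means so the seasonal period is m=24
sarimax_fit = SARIMAX(hourly, order=(1, 0, 1), seasonal_order=(1, 1, 1, 24)).fit(disp=False)
print(sarimax_fit.forecast(steps=24))  # one day ahead

For the remainder of this module, though, we continue with Prophet.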
Prophet#
The Prophet model, developed by Meta, is a robust and user-friendly tool for time series forecasting. It is particularly well-suited for data with strong seasonal patterns and missing values, as well as scenarios where the data may have outliers or trend changes. Below is an overview of the Prophet model and its components.
import sys
!{sys.executable} -m pip install prophet
from prophet import Prophet
import os
current_dir = os.getcwd()
csv_path = os.path.join(current_dir, "ms-pv-2006", "Actual_30.25_-89.45_2006_UPV_118MW_5_Min.csv")
data = pd.read_csv(csv_path)
data['LocalTime'] = pd.to_datetime(data['LocalTime'], format='%m/%d/%y %H:%M')
data = data.rename(columns={'LocalTime': 'ds', 'Power(MW)': 'y'})
if data['y'].isnull().sum() > 0:
    data['y'] = data['y'].ffill()  # forward-fill any missing power values
data_hourly = data.resample('h', on='ds').mean().reset_index()  # 'h' replaces the deprecated 'H' alias
split_index = int(len(data_hourly) * 0.8)
train_data = data_hourly[:split_index]
test_data = data_hourly[split_index:]
# Fit the model
model = Prophet(yearly_seasonality=True, daily_seasonality=True, weekly_seasonality=True)
model.fit(train_data)
17:28:17 - cmdstanpy - INFO - Chain [1] start processing
17:28:17 - cmdstanpy - INFO - Chain [1] done processing
<prophet.forecaster.Prophet at 0x7fedf0ba6d90>
# Make future predictions
future = model.make_future_dataframe(periods=24*30, freq='h')  # extend 30 days past the training data
forecast = model.predict(future)
test_forecast = forecast.merge(test_data, on='ds', how='right')
plt.figure(figsize=(12, 6))
plt.plot(test_data['ds'], test_data['y'], label='Test Data (Actual)', color='blue', alpha=0.6)
plt.plot(test_forecast['ds'], test_forecast['yhat'], label='Predicted (Test)', color='orange', alpha=0.8)
plt.fill_between(test_forecast['ds'], test_forecast['yhat_lower'], test_forecast['yhat_upper'], color='orange', alpha=0.2, label='Uncertainty Interval')
plt.title("Test Data Accuracy (Prediction)")
plt.xlabel("Date")
plt.ylabel("Power (MW)")
plt.legend()
plt.grid()
plt.show()

plt.figure(figsize=(14, 7))
plt.plot(data['ds'], data['y'], label='Historical Data', color='black', alpha=0.6)
plt.plot(forecast['ds'], forecast['yhat'], label='Forecasted Data', color='green')
plt.fill_between(forecast['ds'], forecast['yhat_lower'], forecast['yhat_upper'], color='green', alpha=0.2, label='Uncertainty Interval')
plt.title("Full Forecast with Historical Data")
plt.xlabel("Date")
plt.ylabel("Power (MW)")
plt.legend()
plt.grid()
plt.show()

forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']][:7000:70]
| | ds | yhat | yhat_lower | yhat_upper |
|---|---|---|---|---|
| 0 | 2006-01-01 00:00:00 | -6.665930 | -24.240018 | 10.358625 |
| 70 | 2006-01-03 22:00:00 | -4.474233 | -20.038008 | 11.853261 |
| 140 | 2006-01-06 20:00:00 | -6.592283 | -22.713803 | 9.164866 |
| 210 | 2006-01-09 18:00:00 | -5.583250 | -22.844907 | 12.420671 |
| 280 | 2006-01-12 16:00:00 | 26.849308 | 10.893639 | 44.342990 |
| 350 | 2006-01-15 14:00:00 | 48.079698 | 31.277399 | 63.720274 |
| 420 | 2006-01-18 12:00:00 | 49.432818 | 32.739164 | 65.115053 |
| 490 | 2006-01-21 10:00:00 | 53.029588 | 37.641518 | 69.696168 |
| 560 | 2006-01-24 08:00:00 | 45.396122 | 29.912687 | 61.725492 |
| 630 | 2006-01-27 06:00:00 | 10.705178 | -6.510347 | 26.607076 |
| 700 | 2006-01-30 04:00:00 | -8.956497 | -24.422312 | 8.420139 |
| ... | ... | ... | ... | ... |
| 6860 | 2006-10-13 20:00:00 | -6.815872 | -22.830380 | 9.753774 |
| 6930 | 2006-10-16 18:00:00 | -7.963110 | -23.904965 | 7.692729 |

100 rows × 4 columns (intermediate rows omitted; the sampled forecast cycles between negative nighttime values and midday peaks around 50-65 MW)
Accuracy#
We can see that this model, which accounts for all the seasonality, has increased our accuracy greatly, but it still predicts some values below zero, which can never happen for solar power output.
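A simple post-processing fix (our suggestion, not part of the original pipeline) is to clip the forecast at zero before scoring; note that yhat is NaN for test hours beyond the 30-day forecast horizon, so we drop those rows first:

valid = test_forecast.dropna(subset=['yhat'])  # keep rows where a forecast exists
clipped = valid['yhat'].clip(lower=0)          # zero out physically impossible negative output
print('RMSE after clipping:', np.sqrt(mean_squared_error(valid['y'], clipped)))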
model.plot_components(forecast)
plt.show()

These plots reveal the seasonality patterns in our forecasted data, highlighting the underlying trends of the data.
Until now we have not accounted for the seasonality of the data before forecasting and have let the model handle it by itself. We now address the seasonality ourselves through feature engineering.
LightGBM with feature engineering#
LightGBM (Light Gradient Boosting Machine) is a powerful and efficient gradient boosting framework widely used for machine learning tasks.
Feature engineering is the process of preparing and transforming raw data into features that better represent the underlying problem to improve the performance of a machine learning model. A feature is an individual measurable property or characteristic of a phenomenon being observed, often represented as a column in a dataset.
The goal of feature engineering is to extract the most relevant information from the raw data, making it easier for the model to learn patterns and make predictions.
import sys
!{sys.executable} -m pip install lightgbm
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error
import os
current_dir = os.getcwd()
csv_path = os.path.join(current_dir, "ms-pv-2006", "Actual_30.25_-89.45_2006_UPV_118MW_5_Min.csv")
data = pd.read_csv(csv_path)
data['LocalTime'] = pd.to_datetime(data['LocalTime'], format='%m/%d/%y %H:%M')
# Feature Engineering
data['hour'] = data['LocalTime'].dt.hour
data['day_of_week'] = data['LocalTime'].dt.dayofweek
data['month'] = data['LocalTime'].dt.month
data['is_daytime'] = ((data['hour'] >= 6) & (data['hour'] <= 18)).astype(int)
Add the features: hour, day of the week, month and is_daytime as columns to the data.
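One common refinement worth noting (a hypothetical extension on our part; it is not used in the results below) is a cyclical encoding of the hour, so that 23:00 and 00:00 sit close together in feature space:

# hypothetical extra features: sine/cosine encoding of the hour of day
data['hour_sin'] = np.sin(2 * np.pi * data['hour'] / 24)
data['hour_cos'] = np.cos(2 * np.pi * data['hour'] / 24)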
# Define features and target variable
X = data[['hour', 'day_of_week', 'month', 'is_daytime']]
y = data['Power(MW)']
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=False)
# LightGBM Model Training
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
# Train the model with early stopping
model = lgb.train(
    params,
    lgb_train,
    valid_sets=[lgb_eval],  # validation set
    num_boost_round=1000,
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),  # stop when validation stops improving for 50 rounds
        lgb.log_evaluation(period=100)  # log evaluation progress every 100 iterations
    ]
)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.041361 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 44
[LightGBM] [Info] Number of data points in the train set: 84096, number of used features: 4
[LightGBM] [Info] Start training from score 23.859394
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[38] valid_0's rmse: 15.1899
# Predictions
y_pred = model.predict(X_test, num_iteration=model.best_iteration)
# Evaluate the Model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
Root Mean Squared Error (RMSE): 15.189919633723044
Mean Absolute Error (MAE): 10.479850535578153
# Visualization: Actual vs Predicted
plt.figure(figsize=(15, 6))
plt.plot(y_test.values[:1000], label="Actual", color='blue', linewidth=0.8)
plt.plot(y_pred[:1000], label="Predicted", color='orange', linestyle='--', linewidth=0.8)
plt.title('Actual vs Predicted Power Generation', fontsize=16)
plt.xlabel('Sample Index', fontsize=12)
plt.ylabel('Power (MW)', fontsize=12)
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

This yielded a much better result than the other models, with low RMSE and MAE for time-series data with high seasonality.
# Feature Importance
importance = model.feature_importance()
feature_names = X.columns
plt.figure(figsize=(10, 6))
plt.barh(feature_names, importance, color='teal', alpha=0.7)
plt.title('Feature Importance', fontsize=16)
plt.xlabel('Importance', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.tight_layout()
plt.show()

Note that LightGBM's default feature importance counts how many times a feature is used in a split, so higher values mean a feature is used more often. A binary feature such as is_daytime, which is true between 6 AM and 6 PM and false otherwise, can only be split once along a tree path, so its split count can understate its contribution even though it encodes the day/night pattern that dominates a successful prediction.
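To measure contribution rather than split counts, LightGBM can also report gain-based importance (the total loss reduction attributable to each feature); a short sketch:

gain = model.feature_importance(importance_type='gain')  # total gain contributed by each feature
for name, g in sorted(zip(feature_names, gain), key=lambda t: -t[1]):
    print(f'{name}: {g:.1f}')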
Key findings#
Because the data is highly seasonal, accuracy improves markedly with models better suited to such data, and gradient boosting combined with feature engineering yields the best results.
The most significant feature is is_daytime, a boolean indicating whether an observation falls between 6 AM and 6 PM, underscoring the dominance of the day/night pattern in prediction success.
The graphical representation clearly illustrates the ranking of feature importance.