Weather is a major driver of many things that happen in the real world. In fact, it is so important that almost any forecasting model built with machine learning benefits from incorporating it as an input.
Think about the following scenarios:
- Forecasting how much electricity a solar farm will produce tomorrow.
- Forecasting how many taxi rides a city will see in a given hour.
- Forecasting demand or supply for any business that is affected by weather conditions.
It is fair to say that in scenarios like these, any model that doesn't include weather as a factor is either pointless or not as good as it could be.
Surprisingly, while there are a lot of online resources on how to forecast the weather itself, there's virtually nothing that shows how to obtain and use weather data effectively as a feature, i.e. as an input for predicting something else. That is what this post is about.
First we’ll highlight the challenges associated with using weather data for modelling, which models are commonly used, and what providers are out there. Then we’ll run a case study and use data from one of the providers to build a machine learning model that forecasts taxi rides in New York.
At the end of this post you will have learned about:
- The challenges associated with using weather data in forecasting models
- The most common numerical weather prediction models and data providers
- How to obtain live and historical weather forecasts through an API and use them as features in a machine learning model
For an ML model in production we need two things: (1) live data to produce predictions in real time, and (2) a bulk of historical data to train a model that is able to do so.
Obviously, when making live predictions, we will use the current weather forecast as an input, as it is the most up-to-date estimate of what is going to happen in the future. For instance, when predicting how much solar energy will be produced tomorrow the model input we need is what the forecasts say about tomorrow’s weather.
If we want the model to perform well in the real world, the training data needs to reflect the live data. For model training, there's a choice to be made between historical measurements and historical forecasts. Historical measurements reflect only the outcome, i.e. what weather stations recorded. However, the live model is going to consume forecasts, not measurements, since the measurements aren't yet available at the time the model makes its prediction.
If there is a chance to obtain historical forecasts, they should always be preferred as this trains the model under the exact same conditions as are available at the time of live predictions.
Consider this example: whenever there is heavy cloud cover, a solar farm produces little electricity. A model trained on historical measurements will learn that a high value of the cloud-cover feature guarantees low output. A model trained on historical forecasts, on the other hand, will learn that there is another dimension to this: forecast distance. When predicting several days ahead, a high cloud-cover value is only an estimate and does not mean the day in question will be cloudy with certainty. In such cases the model learns to rely on this feature only partially and to weigh in other features when predicting solar generation.
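To make this concrete, here is a minimal, hypothetical sketch of what training data built from historical forecasts can look like (all names and values are made up for illustration): the same target moment appears once per forecast distance, with the weather features as they were predicted at that distance.
import pandas as pd

# Hypothetical training rows: one per (target_moment, forecast_distance).
# cloud_cover holds the value as *forecast* at that distance, not the
# measured outcome; the realized target (solar output) is the same in each row.
train = pd.DataFrame({
    "target_moment": pd.to_datetime(["2022-06-01 12:00"] * 3),
    "forecast_distance": [6, 24, 72],       # hours ahead at forecast time
    "cloud_cover": [0.95, 0.80, 0.55],      # further out = vaguer estimate
    "solar_output_mwh": [1.2, 1.2, 1.2],    # what actually happened
})
print(train)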
Weather data ≠ weather data. Many factors can rule out a specific set of weather data as even remotely useful. Among the main ones are:
- Location: the data must cover, and be accurate for, the places you care about
- Spatial and temporal granularity
- Forecast horizon and how frequently forecasts are updated
- Which weather variables are available
- Accuracy
Additionally, the shape or format of the data can be cumbersome to work with. Every extra ETL step you have to build may introduce bugs, and the time-dependent nature of the data can make this work quite frustrating.
Data that is older than a day or a week often comes in the form of CSV dumps, via FTP servers, or at best through a separate API endpoint, which then often exposes different fields than the live forecast endpoint. This creates a risk of mismatched data and can blow up the complexity of your ETL.
Costs vary widely depending on the provider and the types of weather data required. For instance, some providers charge per coordinate, which becomes a problem when many locations are needed. Historical weather forecasts in particular are generally difficult and costly to obtain.
Numerical weather prediction models, as they are often called, simulate the physical behavior of all the different aspects of weather. There’s plenty of them, varying in their format (see above), the parts of the globe they cover, and accuracy.
Here’s a quick list of the most widely used weather models:
- GFS (Global Forecast System) by NOAA (USA), the most widely used global model
- IFS (Integrated Forecasting System) by ECMWF (Europe)
- ICON by the German Weather Service (DWD)
- UM (Unified Model) by the UK Met Office
- ARPEGE by Météo-France
Providers exist to bring data from weather models to the end user. Often enough they also run their own proprietary forecasting models on top of the standard weather models. Here are some well-known ones:
- OpenWeatherMap
- AccuWeather
- Tomorrow.io
- Visual Crossing
- Meteomatics
For the machine learning use case, such providers often either do not offer historical forecasts, or the process of getting and combining the data is both cumbersome and expensive. In contrast, blueskyapi.io offers a simple API that serves both live and historical forecasts in the same format, making the data pipelining very straightforward. The underlying data comes from GFS, the most widely used weather model.
Imagine you own a taxi business in NYC and want to forecast the number of taxi rides in order to optimize your staff and fleet planning. As you have access to NYC's historical combined taxi data, you decide to make use of it and create a machine learning model.
We'll use NYC's public yellow taxi trip record data, which can be downloaded from the NYC Taxi & Limousine Commission website.
First some imports:
import pandas as pd
import numpy as np
import holidays
import datetime
import pytz
from dateutil.relativedelta import relativedelta
from matplotlib import pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
import shap
import pyarrow
timezone = pytz.timezone("US/Eastern")
dates = pd.date_range("2022-04", "2023-03", freq="MS", tz=timezone)
To get our taxi dataset, we need to loop through the files and create an aggregated dataframe with counts per hour. This will take about 20s to complete.
aggregated_dfs = []
for date in dates:
    print(date)
    df = pd.read_parquet(
        f"./data/yellow_tripdata_{date.strftime('%Y-%m')}.parquet",
        engine="pyarrow",
    )
    df["timestamp"] = pd.DatetimeIndex(
        df["tpep_pickup_datetime"], tz=timezone, ambiguous="NaT"
    ).floor("H")
    # data cleaning: the raw files sometimes include timestamps
    # outside the month they belong to
    df = df[
        (df.timestamp >= date)
        & (df.timestamp < date + relativedelta(months=1))
    ]
    aggregated_dfs.append(
        df.groupby(["timestamp"]).agg({"trip_distance": "count"}).reset_index()
    )

df = pd.concat(aggregated_dfs).reset_index(drop=True)
df.columns = ["timestamp", "count"]
Let’s have a look at the data. First 2 days:
df.head(48).plot("timestamp", "count")
Everything:
fig, ax = plt.subplots()
fig.set_size_inches(20, 8)
ax.plot(df.timestamp, df["count"])
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
Interestingly, we can see that during some of the holiday periods the number of taxi rides drops noticeably. From a time series perspective there is no obvious trend or heteroscedasticity in the data.
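If you want to verify this more formally, the seasonal_decompose import from above can be put to use. A minimal sketch, assuming a (mostly) complete hourly series and a weekly period of 24 × 7 = 168 hours:
# Decompose the hourly ride counts into trend, weekly seasonality, and residual
decomposition = seasonal_decompose(df["count"], period=24 * 7)
decomposition.plot()
plt.show()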
Next, we’ll add a couple of typical features used in time series forecasting.
Encode timestamp pieces
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.day_of_week
Encode holidays
us_holidays = holidays.UnitedStates()
df["date"] = df["timestamp"].dt.date
df["holiday_today"] = [ind in us_holidays for ind in df.date]
df["holiday_tomorrow"] = [ind + datetime.timedelta(days=1) in us_holidays for ind in df.date]
df["holiday_yesterday"] = [ind - datetime.timedelta(days=1) in us_holidays for ind in df.date]
Now we come to the interesting bit: the weather data. Below is a walkthrough of how to use the BlueSky weather API. For Python users, it is available via pip:
pip install blueskyapi
However, it is also possible to use cURL directly.
BlueSky’s basic API is free. It’s recommended to get an API key via the website, as this will boost the amount of data that can be pulled from the API.
With their paid subscription you can obtain additional weather variables, more frequent forecast updates, better granularity, etc., but for the sake of this case study that is not needed.
import blueskyapi
client = blueskyapi.Client() # use API key here to boost data limit
We need to pick the location, forecast distances, and weather variables of interest. Let's get a full year's worth of weather forecasts to match the taxi data.
# New York (longitude is west of Greenwich, hence negative; note that some
# weather APIs expect 0-360 longitudes, in which case 286.0 is the equivalent)
lat = 40.5
lon = -74.0
weather = client.forecast_history(
lat=lat,
lon=lon,
min_forecast_moment="2022-04-01T00:00:00+00:00",
max_forecast_moment="2023-04-01T00:00:00+00:00",
forecast_distances=[3,6], # hours ahead
columns=[
'precipitation_rate_at_surface',
'apparent_temperature_at_2m',
'temperature_at_2m',
'total_cloud_cover_at_convective_cloud_layer',
'wind_speed_gust_at_surface',
'categorical_rain_at_surface',
'categorical_snow_at_surface'
],
)
weather.iloc[0]
That's all we have to do to obtain the weather data!
We need to ensure the weather data gets mapped correctly to the taxi data. For that we need the target moment a weather forecast was made for, which we get by adding forecast_distance to forecast_moment:
weather["target_moment"] = weather.forecast_moment + pd.to_timedelta(
weather.forecast_distance, unit="h"
)
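A quick look at the relevant columns confirms the mapping:
# Each forecast_moment plus its forecast_distance gives the target_moment
print(weather[["forecast_moment", "forecast_distance", "target_moment"]].head())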
A typical issue when joining data is the data type and timezone awareness of the timestamps. Let's align the timezones to ensure the join is correct.
df["timestamp"] = [timezone.normalize(ts).astimezone(pytz.utc) for ts in df["timestamp"]]
weather["target_moment"] = weather["target_moment"].dt.tz_localize('UTC')
As a last step, we join each timestamp in the taxi data to the closest available weather forecast:
d = pd.merge_asof(df, weather, left_on="timestamp", right_on="target_moment", direction="nearest")
d.iloc[0]
Our dataset is complete!
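Before moving on, it can be worth sanity-checking the as-of join. A small sketch: how far is each taxi timestamp from the forecast it was matched to? Large gaps would point at missing forecast data.
# Distance between each taxi hour and its matched forecast target moment
gap = (d["timestamp"] - d["target_moment"]).abs()
print(gap.max())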
Before modelling, it usually makes sense to check a couple more things, such as whether the target variable is stationary and whether there are missing values or anomalies in the data. For the sake of this blog post, however, we'll keep it simple and fit an out-of-the-box random forest model on the features we extracted and created:
d = d[~d.isnull().any(axis=1)].reset_index(drop=True)
X = d[
[
"day_of_week",
"hour",
"holiday_today",
"holiday_tomorrow",
"holiday_yesterday",
"precipitation_rate_at_surface",
"apparent_temperature_at_2m",
"temperature_at_2m",
"total_cloud_cover_at_convective_cloud_layer",
"wind_speed_gust_at_surface",
"categorical_rain_at_surface",
"categorical_snow_at_surface"
]
]
y = d["count"]
# shuffle=False keeps the temporal order: we train on the past
# and test on the most recent third of the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, shuffle=False
)
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
pred_train = rf.predict(X_train)
plt.figure(figsize=(50,8))
plt.plot(y_train)
plt.plot(pred_train)
plt.show()
pred_test = rf.predict(X_test)
plt.figure(figsize=(50,8))
plt.plot(y_test.reset_index(drop=True))
plt.plot(pred_test)
plt.show()
As expected, some accuracy is lost on the test set compared to the training set. This could be improved, but overall the predictions look reasonable, albeit often conservative for the very high values.
print("MAPE is", round(mean_absolute_percentage_error(y_test,pred_test) * 100, 2), "%")
MAPE is 17.16 %
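To see where those conservative high-end predictions occur, one option is a predicted-vs-actual scatter on the test set (a minimal sketch):
# Points below the diagonal are under-predictions, mostly at the high end
plt.figure(figsize=(6, 6))
plt.scatter(y_test, pred_test, s=2)
plt.plot([0, y_test.max()], [0, y_test.max()], color="red")
plt.xlabel("actual rides per hour")
plt.ylabel("predicted rides per hour")
plt.show()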
To confirm that adding weather data improved the model, let’s compare it with a benchmark model that is fitted on everything but the weather data:
X = d[
[
"day_of_week",
"hour",
"holiday_today",
"holiday_tomorrow",
"holiday_yesterday"
]
]
y = d["count"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42, shuffle=False
)
rf0 = RandomForestRegressor(random_state=42)
rf0.fit(X_train, y_train)
pred_train = rf0.predict(X_train)
pred_test = rf0.predict(X_test)
print("MAPE is", round(mean_absolute_percentage_error(y_test,pred_test) * 100, 2), "%")
MAPE is 17.76 %
Adding weather data improved the taxi ride forecast MAPE by 0.6 percentage points. While that may not seem like much, depending on a business's operations such an improvement can have a significant impact.
Besides metrics, let's have a look at the feature importances. We're going to use the SHAP package, which uses Shapley values to explain the individual, marginal contribution of each feature to the model, i.e. how much each feature contributes on top of the other features.
explainer = shap.Explainer(rf)
shap_values = explainer(X_test)
This will take a couple of minutes, as it evaluates plenty of "what if" scenarios across all the features: how would the prediction change if a given feature's value were unknown?
shap.plots.beeswarm(shap_values)
We can see that by far the most important explanatory variables are the hour of the day and the day of the week. This makes perfect sense: taxi ride counts are highly cyclical, with demand varying a lot over the day and the week. Some of the weather data turned out to be useful as well. When it's cold, there are more cab rides, although to some degree temperature might also act as a proxy for general yearly seasonality in taxi demand. Another important feature is wind gusts, with fewer cabs being used when gusts are stronger. A hypothesis here could be that there is less traffic during stormy weather.
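To dig deeper into a single relationship, such as temperature, SHAP's scatter plot can be used (a small sketch):
# SHAP values for temperature against the feature's value; a downward
# slope would support "colder weather, more cab rides"
shap.plots.scatter(shap_values[:, "temperature_at_2m"])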
That's it! You have created a simple model using weather data that can be used in practice.
In this article we discussed the importance of weather data in forecasting models across various sectors, the challenges associated with using it effectively, and the available numerical weather prediction models and providers, highlighting the BlueSky API as a cost-effective and efficient way to obtain both live and historical forecasts. Through a case study on forecasting New York taxi rides, the article provided a hands-on demonstration of using weather data in machine learning, teaching you the basic skills you need to get started:
- Obtaining live and historical weather forecasts through an API
- Joining weather forecasts onto a target dataset correctly, handling timezones and forecast distances
- Training a model and benchmarking it against a weather-free baseline
- Interpreting feature importances with SHAP
Q1. How can weather data be used in forecasting models?
A. Weather data can be incorporated into time series forecasting models as a set of external variables or covariates, also called features, used to forecast some other time-dependent target variable. Unlike many other features, weather data is both conceptually and practically more complicated to add to such a model; the article explains how to do this correctly.
Q2. What should be considered when choosing weather data for a model?
A. It's important to consider aspects such as accuracy, granularity, forecast horizon, forecast updates, and the relevance of the weather data. You should ensure it is reliable and corresponds to the location of interest. Also, not all weather variables may be impactful to your operations, so feature selection is crucial to avoid overfitting and enhance model performance.
Q3. Why should businesses incorporate weather data into their forecasts?
A. There are many possible reasons. For instance, by integrating weather data, businesses can anticipate fluctuations in demand or supply caused by weather changes and adjust accordingly. This can help optimize resource allocation, reduce waste, and improve customer service by preparing for expected changes.
Q4. Why use machine learning for this?
A. Machine learning algorithms can automatically identify patterns in historical data, including subtle relationships between weather changes and operational metrics. They can handle large volumes of data, accommodate multiple variables, and improve over time as they are exposed to more data.