Weather is a major driver of many things that happen in the real world. In fact, it is so important that almost any forecasting model built with machine learning benefits from incorporating it as an input.
Think about the following scenarios:
- Forecasting how much electricity a solar farm will produce tomorrow.
- Forecasting how many taxi rides a city will see in a given hour.
- Forecasting demand or supply for any business that is affected by weather conditions.
It is fair to say that in scenarios like these, any model that doesn't include weather as a factor is either pointless or not as good as it could be.
Surprisingly, while there are a lot of online resources on how to forecast the weather itself, there's virtually nothing that shows how to obtain and use weather data effectively as a feature, i.e. as an input for predicting something else. That is what this post is about.
First we’ll highlight the challenges associated with using weather data for modelling, which models are commonly used, and what providers are out there. Then we’ll run a case study and use data from one of the providers to build a machine learning model that forecasts taxi rides in New York.
At the end of this post you will have learned about:
- The challenges associated with using weather data in forecasting models
- The most common numerical weather prediction models and data providers
- How to obtain live and historical weather forecasts through an API and use them as features in a machine learning model
For an ML model in production we need two things: (1) live data to produce predictions in real time, and (2) a bulk of historical data to train a model that is able to do so.
Obviously, when making live predictions, we will use the current weather forecast as an input, as it is the most up-to-date estimate of what is going to happen in the future. For instance, when predicting how much solar energy will be produced tomorrow the model input we need is what the forecasts say about tomorrow’s weather.
If we want the model to perform well in the real world, the training data needs to reflect the live data. For model training, there's a choice to be made between historical measurements and historical forecasts. Historical measurements reflect only the outcome, i.e. what weather stations recorded. However, the live model is going to consume forecasts, not measurements, since the measurements aren't yet available at the time the model makes its prediction.
If there is a chance to obtain historical forecasts, they should always be preferred as this trains the model under the exact same conditions as are available at the time of live predictions.
Consider this example: whenever there is heavy cloud cover, a solar farm produces little electricity. A model trained on historical measurements will learn that a high value of the cloud-cover feature guarantees low output. A model trained on historical forecasts, on the other hand, will learn that there is another dimension to this: forecast distance. When predicting several days ahead, a high cloud-cover value is only an estimate and does not mean the day in question will be cloudy with certainty. In such cases the model learns to rely on this feature only partially and to weigh in other features when predicting solar generation.
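To make this concrete, here is a minimal, hypothetical sketch of what training data built from historical forecasts can look like (all names and values are made up for illustration): the same target moment appears once per forecast distance, with the weather features as they were predicted at that distance.
import pandas as pd

# Hypothetical training rows: one per (target_moment, forecast_distance).
# cloud_cover holds the value as *forecast* at that distance, not the
# measured outcome; the realized target (solar output) is the same in each row.
train = pd.DataFrame({
    "target_moment": pd.to_datetime(["2022-06-01 12:00"] * 3),
    "forecast_distance": [6, 24, 72],       # hours ahead at forecast time
    "cloud_cover": [0.95, 0.80, 0.55],      # further out = vaguer estimate
    "solar_output_mwh": [1.2, 1.2, 1.2],    # what actually happened
})
print(train)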
Weather data ≠ weather data. Many factors can rule out a specific set of weather data as even remotely useful. Among the main ones are:
- Location: the data must cover, and be accurate for, the places you care about
- Spatial and temporal granularity
- Forecast horizon and how frequently forecasts are updated
- Which weather variables are available
- Accuracy
Additionally, the shape or format of the data can be cumbersome to work with. Every extra ETL step you have to build may introduce bugs, and the time-dependent nature of the data can make this work quite frustrating.
Data that is older than a day or a week often comes in the form of CSV dumps, via FTP servers, or at best through a separate API endpoint, which then often exposes different fields than the live forecast endpoint. This creates a risk of mismatched data and can blow up the complexity of your ETL.
Costs vary widely depending on the provider and the types of weather data required. For instance, some providers charge per coordinate, which becomes a problem when many locations are needed. Historical weather forecasts in particular are generally difficult and costly to obtain.
Numerical weather prediction models, as they are often called, simulate the physical behavior of all the different aspects of weather. There’s plenty of them, varying in their format (see above), the parts of the globe they cover, and accuracy.
Here’s a quick list of the most widely used weather models:
- GFS (Global Forecast System) by NOAA (USA), the most widely used global model
- IFS (Integrated Forecasting System) by ECMWF (Europe)
- ICON by the German Weather Service (DWD)
- UM (Unified Model) by the UK Met Office
- ARPEGE by Météo-France
Providers exist to bring data from weather models to the end user. Often enough they also run their own proprietary forecasting models on top of the standard weather models. Here are some well-known ones:
- OpenWeatherMap
- AccuWeather
- Tomorrow.io
- Visual Crossing
- Meteomatics
For the machine learning use case, such providers often either do not offer historical forecasts, or the process of getting and combining the data is both cumbersome and expensive. In contrast, blueskyapi.io offers a simple API that serves both live and historical forecasts in the same format, making the data pipelining very straightforward. The underlying data comes from GFS, the most widely used weather model.
Imagine you own a taxi business in NYC and want to forecast the number of taxi rides in order to optimize your staff and fleet planning. As you have access to NYC's historical combined taxi data, you decide to make use of it and create a machine learning model.
We'll use NYC's public yellow taxi trip record data, which can be downloaded from the NYC Taxi & Limousine Commission website.
First some imports:
import pandas as pd
import numpy as np
import holidays
import datetime
import pytz
from dateutil.relativedelta import relativedelta
from matplotlib import pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
import shap
import pyarrow
timezone = pytz.timezone("US/Eastern")
dates = pd.date_range("2022-04", "2023-03", freq="MS", tz=timezone)
To get our taxi dataset, we need to loop through the files and create an aggregated dataframe with counts per hour. This will take about 20s to complete.
aggregated_dfs = []
for date in dates:
    print(date)
    df = pd.read_parquet(
        f"./data/yellow_tripdata_{date.strftime('%Y-%m')}.parquet",
        engine="pyarrow",
    )
    df["timestamp"] = pd.DatetimeIndex(
        df["tpep_pickup_datetime"], tz=timezone, ambiguous="NaT"
    ).floor("H")
    # data cleaning: the raw files sometimes include timestamps
    # outside the month they belong to
    df = df[
        (df.timestamp >= date)
        & (df.timestamp < date + relativedelta(months=1))
    ]
    aggregated_dfs.append(
        df.groupby(["timestamp"]).agg({"trip_distance": "count"}).reset_index()
    )

df = pd.concat(aggregated_dfs).reset_index(drop=True)
df.columns = ["timestamp", "count"]
Let’s have a look at the data. First 2 days:
df.head(48).plot("timestamp", "count")
Everything:
fig, ax = plt.subplots()
fig.set_size_inches(20, 8)
ax.plot(df.timestamp, df["count"])
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
Interestingly, we can see that during some of the holiday periods the number of taxi rides drops noticeably. From a time series perspective there is no obvious trend or heteroscedasticity in the data.
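If you want to verify this more formally, the seasonal_decompose import from above can be put to use. A minimal sketch, assuming a (mostly) complete hourly series and a weekly period of 24 × 7 = 168 hours:
# Decompose the hourly ride counts into trend, weekly seasonality, and residual
decomposition = seasonal_decompose(df["count"], period=24 * 7)
decomposition.plot()
plt.show()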
Next, we’ll add a couple of typical features used in time series forecasting.
Encode timestamp pieces
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.day_of_week
Encode holidays
us_holidays = holidays.UnitedStates()
df["date"] = df["timestamp"].dt.date
df["holiday_today"] = [ind in us_holidays for ind in df.date]
df["holiday_tomorrow"] = [ind + datetime.timedelta(days=1) in us_holidays for ind in df.date]
df["holiday_yesterday"] = [ind - datetime.timedelta(days=1) in us_holidays for ind in df.date]
Now we come to the interesting bit: the weather data. Below is a walkthrough of how to use the BlueSky weather API. For Python users, it is available via pip:
pip install blueskyapi
However, it is also possible to use cURL directly.
BlueSky’s basic API is free. It’s recommended to get an API key via the website, as this will boost the amount of data that can be pulled from the API.
With their paid subscription you can obtain additional weather variables, more frequent forecast updates, better granularity, etc., but for the sake of this case study that is not needed.
import blueskyapi
client = blueskyapi.Client() # use API key here to boost data limit
We need to pick the location, forecast distances, and weather variables of interest. Let's get a full year's worth of weather forecasts to match the taxi data.
# New York (longitude is west of Greenwich, hence negative; note that some
# weather APIs expect 0-360 longitudes, in which case 286.0 is the equivalent)
lat = 40.5
lon = -74.0
weather = client.forecast_history(
lat=lat,
lon=lon,
min_forecast_moment="2022-04-01T00:00:00+00:00",
max_forecast_moment="2023-04-01T00:00:00+00:00",
forecast_distances=[3,6], # hours ahead
columns=[
'precipitation_rate_at_surface',
'apparent_temperature_at_2m',
'temperature_at_2m',
'total_cloud_cover_at_convective_cloud_layer',
'wind_speed_gust_at_surface',
'categorical_rain_at_surface',
'categorical_snow_at_surface'
],
)
weather.iloc[0]
That's all we have to do to obtain the weather data!
We need to ensure the weather data gets mapped correctly to the taxi data. For that we need the target moment a weather forecast was made for, which we get by adding forecast_distance to forecast_moment:
weather["target_moment"] = weather.forecast_moment + pd.to_timedelta(
weather.forecast_distance, unit="h"
)
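A quick look at the relevant columns confirms the mapping:
# Each forecast_moment plus its forecast_distance gives the target_moment
print(weather[["forecast_moment", "forecast_distance", "target_moment"]].head())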
A typical issue when joining data is the data type and timezone awareness of the timestamps. Let's align the timezones to ensure the join is correct.
df["timestamp"] = [timezone.normalize(ts).astimezone(pytz.utc) for ts in df["timestamp"]]
weather["target_moment"] = weather["target_moment"].dt.tz_localize('UTC')
As a last step, we join each timestamp in the taxi data to the closest available weather forecast:
d = pd.merge_asof(df, weather, left_on="timestamp", right_on="target_moment", direction="nearest")
d.iloc[0]
Our dataset is complete!
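Before moving on, it can be worth sanity-checking the as-of join. A small sketch: how far is each taxi timestamp from the forecast it was matched to? Large gaps would point at missing forecast data.
# Distance between each taxi hour and its matched forecast target moment
gap = (d["timestamp"] - d["target_moment"]).abs()
print(gap.max())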
Before modelling, it usually makes sense to check a couple more things, such as whether the target variable is stationary and whether there are missing values or anomalies in the data. For the sake of this blog post, however, we'll keep it simple and fit an out-of-the-box random forest model on the features we extracted and created:
d = d[~d.isnull().any(axis=1)].reset_index(drop=True)
X = d[
[
"day_of_week",
"hour",
"holiday_today",
"holiday_tomorrow",
"holiday_yesterday",
"precipitation_rate_at_surface",
"apparent_temperature_at_2m",
"temperature_at_2m",
"total_cloud_cover_at_convective_cloud_layer",
"wind_speed_gust_at_surface",
"categorical_rain_at_surface",
"categorical_snow_at_surface"
]
]
y = d["count"]
# shuffle=False keeps the temporal order: we train on the past
# and test on the most recent third of the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, shuffle=False
)
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
pred_train = rf.predict(X_train)
plt.figure(figsize=(50,8))
plt.plot(y_train)
plt.plot(pred_train)
plt.show()
pred_test = rf.predict(X_test)
plt.figure(figsize=(50,8))
plt.plot(y_test.reset_index(drop=True))
plt.plot(pred_test)
plt.show()
As expected, some accuracy is lost on the test set compared to the training set. This could be improved, but overall the predictions look reasonable, albeit often conservative for the very high values.
print("MAPE is", round(mean_absolute_percentage_error(y_test,pred_test) * 100, 2), "%")
MAPE is 17.16 %
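To see where those conservative high-end predictions occur, one option is a predicted-vs-actual scatter on the test set (a minimal sketch):
# Points below the diagonal are under-predictions, mostly at the high end
plt.figure(figsize=(6, 6))
plt.scatter(y_test, pred_test, s=2)
plt.plot([0, y_test.max()], [0, y_test.max()], color="red")
plt.xlabel("actual rides per hour")
plt.ylabel("predicted rides per hour")
plt.show()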
To confirm that adding weather data improved the model, let’s compare it with a benchmark model that is fitted on everything but the weather data:
X = d[
[
"day_of_week",
"hour",
"holiday_today",
"holiday_tomorrow",
"holiday_yesterday"
]
]
y = d["count"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42, shuffle=False
)
rf0 = RandomForestRegressor(random_state=42)
rf0.fit(X_train, y_train)
pred_train = rf0.predict(X_train)
pred_test = rf0.predict(X_test)
print("MAPE is", round(mean_absolute_percentage_error(y_test,pred_test) * 100, 2), "%")
MAPE is 17.76 %
Adding weather data improved the taxi ride forecast MAPE by 0.6 percentage points. While that may not seem like much, depending on a business's operations such an improvement can have a significant impact.
Besides metrics, let's have a look at the feature importances. We're going to use the SHAP package, which uses Shapley values to explain the individual, marginal contribution of each feature to the model, i.e. how much each feature contributes on top of the other features.
explainer = shap.Explainer(rf)
shap_values = explainer(X_test)
This will take a couple of minutes, as it evaluates plenty of "what if" scenarios across all the features: how would the prediction change if a given feature's value were unknown?
shap.plots.beeswarm(shap_values)
We can see that by far the most important explanatory variables are the hour of the day and the day of the week. This makes perfect sense: taxi ride counts are highly cyclical, with demand varying a lot over the day and the week. Some of the weather data turned out to be useful as well. When it's cold, there are more cab rides, although to some degree temperature might also act as a proxy for general yearly seasonality in taxi demand. Another important feature is wind gusts, with fewer cabs being used when gusts are stronger. A hypothesis here could be that there is less traffic during stormy weather.
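To dig deeper into a single relationship, such as temperature, SHAP's scatter plot can be used (a small sketch):
# SHAP values for temperature against the feature's value; a downward
# slope would support "colder weather, more cab rides"
shap.plots.scatter(shap_values[:, "temperature_at_2m"])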
That's it! You have created a simple model using weather data that can be used in practice.
In this article we discussed the importance of weather data in forecasting models across various sectors, the challenges associated with using it effectively, and the available numerical weather prediction models and providers, highlighting the BlueSky API as a cost-effective and efficient way to obtain both live and historical forecasts. Through a case study on forecasting New York taxi rides, the article provided a hands-on demonstration of using weather data in machine learning, teaching you the basic skills you need to get started:
- Obtaining live and historical weather forecasts through an API
- Joining weather forecasts onto a target dataset correctly, handling timezones and forecast distances
- Training a model and benchmarking it against a weather-free baseline
- Interpreting feature importances with SHAP
Q1. How can weather data be used in forecasting models?
A. Weather data can be incorporated into time series forecasting models as a set of external variables or covariates, also called features, used to forecast some other time-dependent target variable. Unlike many other features, weather data is both conceptually and practically more complicated to add to such a model; the article explains how to do this correctly.
Q2. What should be considered when choosing weather data for a model?
A. It's important to consider aspects such as accuracy, granularity, forecast horizon, forecast updates, and the relevance of the weather data. You should ensure it is reliable and corresponds to the location of interest. Also, not all weather variables may be impactful to your operations, so feature selection is crucial to avoid overfitting and enhance model performance.
Q3. Why should businesses incorporate weather data into their forecasts?
A. There are many possible reasons. For instance, by integrating weather data, businesses can anticipate fluctuations in demand or supply caused by weather changes and adjust accordingly. This can help optimize resource allocation, reduce waste, and improve customer service by preparing for expected changes.
Q4. Why use machine learning for this?
A. Machine learning algorithms can automatically identify patterns in historical data, including subtle relationships between weather changes and operational metrics. They can handle large volumes of data, accommodate multiple variables, and improve over time as they are exposed to more data.