Forest Fire prediction using Machine Learning

Aman Preet Last Updated : 16 Oct, 2024

8 min read

This article was published as a part of the Data Science Blogathon

What does this article entail?

Forest, bush, or vegetarian fire, can be described as any uncontrolled and non-prescribed combustion or burning of plants in a natural setting such as a forest, grassland, etc. In this article we are not determining if a forest fire will take place or not, we are predicting the confidence of the forest fire based on some attributes.

Forest Fire Prediction Using ML — Image 1

Why do we need a forest fire prediction model?

Well, the first question arises as that why we even need Machine learning to predict forest fire in that particular area? So, yes the question is valid that despite having the experienced forest department who have been dealing with these issues for a long time why is there a need for ML, having said that answer is quite simple that the experienced forest department can check on 3-4 parameters from their human mind but ML on other hand can handle the numerous parameters whether it can be latitude, longitude, satellite, version, and whatnot, so dealing with this multi-relationship of a parameter that is responsible for the fire in the forest we do need ML for sure!

Importing libraries

import datetime as dt

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestRegressor

Reading forest fire exploration dataset (.csv)

forest = pd.read_csv('fire_archive.csv')

Let’s have a look at our dataset (2.7+ MB)

Data exploration

forest.shape

Output:

(36011, 15)

Here we can see that we have 36011 rows and 15 columns in our dataset obviously, we have to do a lot of data cleaning but first

Let’s explore this dataset more

forest.columns

Output:

Index(['latitude', 'longitude', 'brightness', 'scan', 'track', 'acq_date',
       'acq_time', 'satellite', 'instrument', 'confidence', 'version',
       'bright_t31', 'frp', 'daynight', 'type'],
      dtype='object')

Checking for the null values in the forest fire prediction dataset

forest.isnull().sum()

Output:

latitude      0
longitude     0
brightness    0
scan          0
track         0
acq_date      0
acq_time      0
satellite     0
instrument    0
confidence    0
version       0
bright_t31    0
frp           0
daynight      0
type          0
dtype: int64

Fortunately, we don’t have any null values in this dataset

forest.describe()

Output:

plt.figure(figsize=(10, 10))
sns.heatmap(forest.corr(),annot=True,cmap='viridis',linewidths=.5)

Output:

Data cleaning

forest = forest.drop(['track'], axis = 1)

Here we are dropping the track column

Note: By the way from the dataset we are not finding if the forest fire happens or not, we are trying to find the confidence of the forest fire happening. They may seem to be the same thing but there is a very small difference between them, try to find that 🙂

Finding categorical data

print("The scan column")
print(forest['scan'].value_counts())
print()
print("The aqc_time column")
print(forest['acq_time'].value_counts())
print()
print("The satellite column")
print(forest['satellite'].value_counts())
print()
print("The instrument column")
print(forest['instrument'].value_counts())
print()
print("The version column")
print(forest['version'].value_counts())
print()
print("The daynight column")
print(forest['daynight'].value_counts())
print()

Output:

The scan column
1.0    8284
1.1    6000
1.2    3021
1.3    2412
1.4    1848
1.5    1610
1.6    1451
1.7    1281
1.8    1041
1.9     847
2.0     707
2.2     691
2.1     649
2.3     608
2.5     468
2.4     433
2.8     422
3.0     402
2.7     366
2.9     361
2.6     347
3.1     259
3.2     244
3.6     219
3.4     203
3.3     203
3.8     189
3.9     156
4.7     149
4.3     137
3.5     134
3.7     134
4.1     120
4.6     118
4.5     116
4.2     108
4.0     103
4.4     100
4.8      70
Name: scan, dtype: int64

The aqc_time column
506     851
454     631
122     612
423     574
448     563
       ... 
1558      1
635       1
1153      1
302       1
1519      1
Name: acq_time, Length: 662, dtype: int64

The satellite column
Aqua     20541
Terra    15470
Name: satellite, dtype: int64

The instrument column
MODIS    36011
Name: instrument, dtype: int64

The version column
6.3    36011
Name: version, dtype: int64

The daynight column
D    28203
N     7808
Name: daynight, dtype: int64

From the above data, we can see that some columns have just one value recurring in them, meaning they are not valuable to us
So we will drop them altogether.
Thus only satellite and day-night columns are the only categorical type.

Having said that, we can even use the scan column to restructure it into a categorical data type column. Which we will be doing in just a while.

forest = forest.drop(['instrument', 'version'], axis = 1)

forest.head()

Output:

Looking at the head of the data for Forest fire prediction

daynight_map = {"D": 1, "N": 0}
satellite_map = {"Terra": 1, "Aqua": 0}

forest['daynight'] = forest['daynight'].map(daynight_map)
forest['satellite'] = forest['satellite'].map(satellite_map)

forest.head()

Output:

Looking at the type of another column

forest['type'].value_counts()

Output:

0    35666
2      335
3       10
Name: type, dtype: int64

Concatenating forest and types data frame

types = pd.get_dummies(forest['type'])
forest = pd.concat([forest, types], axis=1)

forest = forest.drop(['type'], axis = 1)
forest.head()

Output:

Viewing the data after concatenating forest and types dataframe

Renaming columns for better understanding

forest = forest.rename(columns={0: 'type_0', 2: 'type_2', 3: 'type_3'})

Binning Method

Now I mentioned we will be converting scan column to categorical type, we will be doing this using the binning method.
The range for these columns was 1 to 4.8

bins = [0, 1, 2, 3, 4, 5]
labels = [1,2,3,4,5]
forest['scan_binned'] = pd.cut(forest['scan'], bins=bins, labels=labels)

forest.head()

Output:

Converting the datatype to datetype from string or NumPy.

forest['acq_date'] = pd.to_datetime(forest['acq_date'])

Now we will be dropping the scan column and handle datetype data – we can extract useful information from these datatypes just like we do with categorical data.

forest = forest.drop(['scan'], axis = 1)

Creating a new column year with the help of acq_date column

forest['year'] = forest['acq_date'].dt.year
forest.head()

Output:

As we have added the year column similarly we will add the month and day column

forest['month'] = forest['acq_date'].dt.month
forest['day'] = forest['acq_date'].dt.day

Checking the shape of the dataset again

forest.shape

Output:

(36011, 17)

Now, as we can see that two more columns have been added which are a breakdown of date columns

Separating our target variable

y = forest['confidence']
fin = forest.drop(['confidence', 'acq_date', 'acq_time', 'bright_t31', 'type_0'], axis = 1)

Checking for correlation once again

plt.figure(figsize=(10, 10))
sns.heatmap(fin.corr(),annot=True,cmap='viridis',linewidths=.5)

Output:

Checking for correlation — Image by Author

Let’s see our cleaned and sorted dataset now

fin.head()

Output:

Splitting the clean data into training and testing dataset

Xtrain, Xtest, ytrain, ytest = train_test_split(fin.iloc[:, :500], y, test_size=0.2)

Model building

Using RandomForestRegressor for model building

random_model = RandomForestRegressor(n_estimators=300, random_state = 42, n_jobs = -1)

#Fit
random_model.fit(Xtrain, ytrain)

y_pred = random_model.predict(Xtest)

#Checking the accuracy
random_model_accuracy = round(random_model.score(Xtrain, ytrain)*100,2)
print(round(random_model_accuracy, 2), '%')

Output:

95.32 %

Checking the accuracy

random_model_accuracy1 = round(random_model.score(Xtest, ytest)*100,2)
print(round(random_model_accuracy1, 2), '%')

Output:

65.32 %

Saving the model by pickle module using the serialized format

import pickle 
saved_model = pickle.dump(random_model, open('ForestModelOld.pickle','wb'))

Model Tuning

The accuracy is not so great, plus the model is overfitting
So we use RandomCV

Getting all the parameters from the model

random_model.get_params()

Output:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 300,
 'n_jobs': -1,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

Putting RadomizedSearchCV in action!

"""
n_estimators = number of trees in the forest
max_features = max number of features considered for splitting a node
max_depth = max number of levels in each decision tree
min_samples_split = min number of data points placed in a node before the node is split
min_samples_leaf = min number of data points allowed in a leaf node
bootstrap = method for sampling data points (with or without replacement)
"""

from sklearn.model_selection import RandomizedSearchCV

Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 300, stop = 500, num = 20)]
Number of features to consider at every split
max_features = ['auto', 'sqrt']
Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(15, 35, num = 7)]
max_depth.append(None)
Minimum number of samples required to split a node
min_samples_split = [2, 3, 5]
Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
                }
print(random_grid)

Output:

{'n_estimators': [300, 310, 321, 331, 342, 352, 363, 373, 384, 394, 405, 415, 426, 436, 447, 457, 468, 478, 489, 500], 'max_features': ['auto', 'sqrt'], 'max_depth': [15, 18, 21, 25, 28, 31, 35, None], 'min_samples_split': [2, 3, 5], 'min_samples_leaf': [1, 2, 4]}

A random search of parameters, using 3 fold cross-validation, search across 100 different combinations and use all available cores
n_iter, which controls the number of different combinations to try, and cv which is the number of folds to use for cross-validation

rf_random = RandomizedSearchCV(estimator = random_model, param_distributions = random_grid, 
                                n_iter = 50, cv = 3, verbose=2, random_state=42)
# Fit the random search model
rf_random.fit(Xtrain, ytrain)

Output:

Just like this snippet, there will be numerous folds in this RandomizedSearchCV

Getting the best parameter out of it

rf_random.best_params_

Output:

{'n_estimators': 394,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 25}

Creating a new model with tuned parameters

random_new = RandomForestRegressor(n_estimators = 394, min_samples_split = 2, min_samples_leaf = 1, max_features = 'sqrt',
                                      max_depth = 25, bootstrap = True)

#Fit
random_new.fit(Xtrain, ytrain)

y_pred1 = random_new.predict(Xtest)

#Checking the accuracy
random_model_accuracy1 = round(random_new.score(Xtrain, ytrain)*100,2)
print(round(random_model_accuracy1, 2), '%')

Output:

95.31 %

Checking the accuracy

random_model_accuracy2 = round(random_new.score(Xtest, ytest)*100,2)
print(round(random_model_accuracy2, 2), '%')

Output:

67.39 %

Saving the tuned model by pickle module using the serialized format

saved_model = pickle.dump(random_new, open('ForestModel.pickle','wb'))

Loading the tuned pickled model

reg_from_pickle = pickle.load(saved_model)

bz2file

Here comes the cherry on the cake part (bonus of this article). Let’s understand what this bz2file module is all about. Let’s get started!

What is bz2file

bz2file is one of the modules in python which are responsible for compression and decompression of files, hence it can help in decreasing the serialized or deserialized file to a smaller size which will be very helpful in the long run when we have large datasets

How bz2file is helpful here?

As we know that our dataset is 2.7+ MB and our Random forest model is whooping 700+ MB so we need to compress that so that that model won’t be leading as a hectic situation for storage.

How to install bz2file?

Jupyter notebook: !pip install bz2file
Anaconda prompt/CMD: pip install bz2file

Hence I installed bz2file, which is used to compress data. This is a life-saving package for those who have low spaces on their disk but want to store or use large datasets. Now the pickled file was over 700 MB in size which when used bz2 compressed into a file of size 93 MB or less.

import bz2

compressionLevel = 9
source_file = 'ForestModel.pickle' # this file can be in a different format, like .csv or others...
destination_file = 'ForestModel.bz2'

with open(source_file, 'rb') as data:
    tarbz2contents = bz2.compress(data.read(), compressionLevel)
    
fh = open(destination_file, "wb")
fh.write(tarbz2contents)
fh.close()

This code will suppress the size of the tuned pickle model.

Okay so that’s a wrap from my side!

Endnotes

Thank you for reading my article 🙂

I hope you guys will like this step-by-step learning of forest fire prediction using machine learning. One last thing I wanna mention is that I’m very well aware of the fact that the accuracy of the model is not that good but the objective of the article is quite balanced so you guys can try out different Ml algorithms to look for better accuracy.

Here’s the repo link to this article.

Here you can access my other articles which are published on Analytics Vidhya as a part of the Blogathon (link)

If got any queries you can connect with me on LinkedIn, refer to this link

About me

Greeting to everyone, I’m currently working in TCS and previously I worked as a Data Science Associate Analyst in Zorba Consulting India. Along with full-time work, I’ve got an immense interest in the same field i.e. Data Science along with its other subsets of Artificial Intelligence such as, Computer Vision, Machine learning, and Deep learning feel free to collaborate with me on any project on the above-mentioned domains (LinkedIn).

Image Source-

Image 1 – https://www.theleader.info/wp-content/uploads/2017/08/forest-fire.jpg

The media shown in this article on forest fire prediction is not owned by Analytics Vidhya and are used at the Author’s discretion.

Aman Preet

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Marina

Very nice your post. But what will be the final outcome of the code?

vishal tigga

Hello the post has been very much helpful to me but I want to know more about data sets? Like from where the data sets have been taken, as per the normal data set for the forest fire prediction conclude fwi, ffmc etc. but here nothing aligns.. So please help me I will grateful for your help

Shahrukh Ahmed

hi, can you please provide me the dataset link

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Forest Fire prediction using Machine Learning

What does this article entail?

Why do we need a forest fire prediction model?

Importing libraries

Reading forest fire exploration dataset (.csv)

Let’s have a look at our dataset (2.7+ MB)

Data exploration

Output:

Output:

Checking for the null values in the forest fire prediction dataset

Output:

Output:

Output:

Data cleaning

Finding categorical data

Output:

Output:

Output:

Looking at the type of another column

Output:

Concatenating forest and types data frame

Output:

Renaming columns for better understanding

Binning Method

Output:

Creating a new column year with the help of acq_date column

Output:

Checking the shape of the dataset again

Output:

Separating our target variable

Checking for correlation once again

Output:

Output:

Splitting the clean data into training and testing dataset

Model building

Using RandomForestRegressor for model building

Output:

Checking the accuracy

Output: