Regression Analysis : Real-time Portugal 2019 Election Results

Priyal Last Updated : 23 May, 2021

This article was published as a part of the Data Science Blogathon

Hope you all are safe and healthy! Welcome to my blog!

Today, we will see Regression Analysis using the Portugal 2019 Election Results dataset.

Photo by Wojciech Portnicki on Unsplash

Concept :

To give an overview, ML models can be classified based on the task performed and the nature of the output:

Regression and Classification come under supervised learning, while Clustering comes under unsupervised learning.

· Regression: Output is a continuous variable.

· Classification: Output is a categorical variable.

· Clustering: No notion of output.

Regression: It is a predictive modelling technique in which we try to find a significant relationship between a dependent variable (also called the target variable) and one or more independent variables (also called features or predictors). There are various types of regression techniques: Linear, Logistic, Polynomial, Ridge, Lasso and many more.

About Dataset

This Dataset describes the evolution of results in the Portuguese Parliamentary Elections of October 6th, 2019. The data spans a time interval of 4 hours and 25 minutes, in intervals of 5 minutes, concerning the results of the 27 parties involved in the electoral event.

The dataset has 28 columns, of which the “FinalMandates” column is the target variable: the final number of MPs elected.

The variables in the dataset are:

1. TimeElapsed (Numeric): Time (minutes) passed since the first data acquisition

2. time (timestamp): Date and time of the data acquisition

3. territoryName (string): Short name of the location (district or nation-wide)

4. totalMandates (numeric): MPs elected at the moment

5. availableMandates (numeric): MPs left to elect at the moment

6. numParishes (numeric): Total number of parishes in this location

7. numParishesApproved (numeric): Number of parishes approved in this location

8. blankVotes (numeric): Number of blank votes

9. blankVotesPercentage (numeric): Percentage of blank votes

10. nullVotes (numeric): Number of null votes

11. nullVotesPercentage (numeric): Percentage of null votes

12. votersPercentage (numeric): Percentage of voters

13. subscribedVoters (numeric): Number of subscribed voters in the location

14. totalVoters (numeric): Total number of voters

15. pre.blankVotes (numeric): Number of blank votes (previous election)

16. pre.blankVotesPercentage (numeric): Percentage of blank votes (previous election)

17. pre.nullVotes (numeric): Number of null votes (previous election)

18. pre.nullVotesPercentage (numeric): Percentage of null votes (previous election)

19. pre.votersPercentage (numeric): Percentage of voters (previous election)

20. pre.subscribedVoters (numeric): Number of subscribed voters in the location (previous election)

21. pre.totalVoters (numeric): Total number of voters (previous election)

22. Party (string): Political Party

23. Mandates (numeric): MPs elected at the moment for the party in a given district

24. Percentage (numeric): Percentage of votes for the party

25. validVotesPercentage (numeric): Percentage of valid votes for the party

26. Votes (numeric): Number of votes for the party

27. Hondt (numeric): Number of MPs according to the current distribution of votes

28. FinalMandates (numeric): Target: final number of MPs elected at the district/national level
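To make the Hondt column a little more concrete, here is a minimal sketch of the D'Hondt highest-averages rule that Portugal uses to turn votes into seats. I am assuming the Hondt column follows this standard rule; the function name `dhondt` and the vote counts are hypothetical and are not taken from the dataset.

```python
def dhondt(votes, seats):
    """Allocate `seats` among parties with the D'Hondt highest-averages rule."""
    allocated = {party: 0 for party in votes}
    for _ in range(seats):
        # Each party's quotient is votes / (seats already won + 1);
        # the next seat goes to the party with the highest quotient.
        winner = max(votes, key=lambda p: votes[p] / (allocated[p] + 1))
        allocated[winner] += 1
    return allocated


# Hypothetical vote counts for four parties competing for 8 seats.
example_votes = {"A": 100_000, "B": 80_000, "C": 30_000, "D": 20_000}
example_seats = dhondt(example_votes, seats=8)
print(example_seats)  # {'A': 4, 'B': 3, 'C': 1, 'D': 0}
```

Because the votes are still being counted, this allocation can change over time, which is why Hondt at a given moment may differ from FinalMandates.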

Problem Definition:

Here, the task is to predict how many MPs were elected at a district/national level after the 2019 Portugal Parliament Elections.

1. Importing Libraries and Dataset

The first step is to import the necessary libraries, such as NumPy, pandas, matplotlib and seaborn, into our notebook.

Then we load the dataset from its CSV file into a pandas DataFrame and check the top five rows to get a first look at the data.
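The loading step can be sketched roughly as below. To keep the snippet self-contained, it reads a tiny inline stand-in instead of the real CSV file; the rows are made-up values that only borrow a few column names from the dataset, and the path in the comment is a placeholder.

```python
import io

import pandas as pd

# Tiny inline stand-in for the CSV file (illustrative values only).
sample_csv = io.StringIO(
    "TimeElapsed,time,territoryName,Party,Votes,Hondt,FinalMandates\n"
    "0,2019-10-06 20:10:02,Aveiro,PS,1200,2,7\n"
    "0,2019-10-06 20:10:02,Aveiro,BE,300,0,2\n"
    "5,2019-10-06 20:15:02,Braga,PS,3400,3,8\n"
)

# With the real file this would be pd.read_csv("path/to/your_file.csv").
dataset = pd.read_csv(sample_csv)

print(dataset.head())   # top five rows (here only three exist)
print(dataset.shape)    # (rows, columns)
```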

2. Cleaning Dataset

1. Checking Null Values: Using dataset.isnull().sum(), we confirmed that there are no missing values in the dataset.

2. Checking Datatypes: We checked the datatypes of all columns to spot any inconsistencies in the data.

3. Converting Format: Additionally, we changed the datatype of the time column from object to datetime for better analysis.
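The three cleaning steps above can be sketched on a small stand-in DataFrame. The values are made up for illustration; only the column names come from the dataset description.

```python
import pandas as pd

# Small stand-in for the election DataFrame (illustrative values).
dataset = pd.DataFrame({
    "time": ["2019-10-06 20:10:02", "2019-10-06 20:15:02"],
    "territoryName": ["Aveiro", "Braga"],
    "Votes": [1200, 3400],
})

# 1. Count missing values per column.
missing_per_column = dataset.isnull().sum()
print(missing_per_column)

# 2. Inspect the datatype of every column.
print(dataset.dtypes)

# 3. Convert the time column from text to datetime.
dataset["time"] = pd.to_datetime(dataset["time"])
print(dataset["time"].dtype)
```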

Exploratory Data Analysis :

We now conduct EDA to gain insights into the data.

1. Correlation:

Checking correlation with sns.heatmap() revealed that multiple columns are highly correlated.
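As a rough sketch of what the heatmap tells us, the snippet below builds a small synthetic DataFrame (not the real data), computes the correlation matrix and lists the column pairs whose absolute correlation exceeds 0.9. The same matrix could be passed to sns.heatmap() to draw the plot; the 0.9 threshold matches the one used in this article.

```python
import numpy as np
import pandas as pd

# Synthetic data: two strongly related columns and one unrelated column.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "totalVoters": base * 1000 + 5000,
    "subscribedVoters": base * 1800 + 9000 + rng.normal(scale=50, size=200),
    "blankVotesPercentage": rng.normal(size=200),
})

corr = df.corr()  # Pearson correlation matrix

# Walk the upper triangle so every pair is reported once.
cols = list(corr.columns)
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(high_pairs)
```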

2. Data Visualization:

a. Univariate Analysis:

For this, I plotted a countplot for Party and for territoryName.

Regression analysis | count plot territory name

b. Bivariate Analysis:

For bivariate analysis, I plotted different variables against the target variable, ‘FinalMandates’, to understand the relationships in the data.

Regression analysis | Bivariate analysis

EDA Concluding Remarks:

1. Heatmap Analysis:

We can conclude that many pairs of variables have a correlation > 0.9, so they can be combined later to reduce the dimensionality of the data.

2. Univariate Data Visualization:

We observe that:

a. Parties with minimum count are JPP and MAS.

b. Most territories have count in the ranges of 800 to 1000 per territory.

3. Bi-Variate Analysis:

We can observe that:

a. Hondt and Votes are directly correlated with the target variable and show discrete values.

b. totalVoters shows negligible correlation with the target variable.

c. Party and TerritoryName variables have outliers present in the data.

Pre-Processing Pipeline:

After the EDA process, we are aware of the changes that need to be incorporated into the dataset to make it more suitable for building machine learning models:

1. PCA (Principal Component Analysis):

As our DataFrame has 28 columns and most of the columns are correlated to a high degree (>0.9), we can use PCA to reduce the dimensionality of the data.

We divided the highly correlated variables into two groups based on their correlation: PCA Group A and PCA Group B.

Each group was then combined into a single component, which reduced the number of columns from 28 to 15.
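Here is a minimal sketch of collapsing one group of correlated columns into a single principal component. The data is synthetic, the choice of columns in `group_a` is hypothetical (the actual grouping is in the linked repository), and standardizing before PCA is my assumption rather than something stated above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three columns driven by the same underlying "size".
rng = np.random.default_rng(42)
size = rng.uniform(1_000, 100_000, 300)
df = pd.DataFrame({
    "subscribedVoters": size,
    "totalVoters": size * 0.5 + rng.normal(0, 200, 300),
    "blankVotes": size * 0.02 + rng.normal(0, 20, 300),
    "Party": rng.integers(0, 5, 300),
})

# Hypothetical group of highly correlated columns.
group_a = ["subscribedVoters", "totalVoters", "blankVotes"]

# Standardize first so no column dominates because of its units.
scaled = StandardScaler().fit_transform(df[group_a])

# Replace the whole group with its first principal component.
pca = PCA(n_components=1)
df["pca_group_a"] = pca.fit_transform(scaled)[:, 0]
df = df.drop(columns=group_a)

print(pca.explained_variance_ratio_)  # share of variance kept by the component
print(df.columns.tolist())
```

If the explained variance ratio is close to 1, very little information is lost by replacing the group with one column.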

2. Label Encoding:

As there are two string-valued feature variables, territoryName and Party, we used Label Encoding to convert them to numerical values.
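A short sketch of label encoding with scikit-learn's LabelEncoder, on a few made-up rows. Keeping one fitted encoder per column (the `encoders` dictionary is my own addition) makes it possible to map the numbers back to names later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows using the two string columns of the dataset.
df = pd.DataFrame({
    "territoryName": ["Aveiro", "Braga", "Aveiro", "Porto"],
    "Party": ["PS", "PPD/PSD", "BE", "PS"],
})

encoders = {}
for column in ["territoryName", "Party"]:
    encoder = LabelEncoder()
    # Classes are sorted alphabetically and mapped to 0, 1, 2, ...
    df[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder

print(df)
print(encoders["Party"].classes_)  # original labels, in encoded order
```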

3. Removing Outliers:

During Data Visualization we observed that outliers are present in the data. We will now analyze them further and remove them to make the machine learning model more robust.

a. Box Plot:

Box plots were used to further visualize the outliers present in the data across the categorical variables.

We can observe that many outliers are present in this data; in the next step we will attempt to remove them.

b. Z score Analysis:

This analysis is used to remove outliers from the dataset. It works by calculating a z-score for every data value (the number of standard deviations the value lies from its column mean) and removing the rows that contain a value with an absolute z-score greater than 3.

After applying z-score analysis, we removed 3299 rows from the dataset. We are left with 18344 rows and 15 columns.
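The z-score filter can be sketched as follows on synthetic data with one planted outlier. Dropping a row when any of its columns has an absolute z-score of 3 or more is the usual pattern, and I am assuming that is what was done here.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic data with one planted outlier in the first row.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Votes": rng.normal(1000, 100, 500),
    "blankVotes": rng.normal(50, 5, 500),
})
df.loc[0, "Votes"] = 5000

# Absolute z-score of every value, computed column by column.
z = np.abs(zscore(df.to_numpy(), axis=0))

# Keep only the rows where every column is within 3 standard deviations.
mask = (z < 3).all(axis=1)
cleaned = df[mask]

print(len(df) - len(cleaned), "rows removed")
```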

c. Normal Distribution Analysis:

To check for normal distribution of the data, first, we plotted Histogram data and checked for skewness.

Regression analysis | Distribution analysis

With histplots, we observed that votersPercentage is skewed towards the left and availableMandates is skewed towards the right. So we replaced the values in these columns with their square roots to reduce the skew.
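To see the effect of the square-root transform, here is a small sketch that measures skewness before and after. It uses synthetic right-skewed values in place of the real availableMandates column, because a long right tail is where the effect is easiest to see.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, non-negative values (stand-in for a real column).
rng = np.random.default_rng(7)
df = pd.DataFrame({"availableMandates": rng.exponential(scale=20, size=2000)})

skew_before = df["availableMandates"].skew()

# The square root compresses large values more than small ones.
df["availableMandates"] = np.sqrt(df["availableMandates"])

skew_after = df["availableMandates"].skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness closer to 0 means the histogram is closer to symmetric.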

Now, our dataset is ready to be put into the machine learning model for regression analysis.

Building Machine Learning Model:

a. Scaling Dataset:

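We standardize the X variables with StandardScaler, which rescales each feature to zero mean and unit variance so that all features are on a comparable scale. Note that this changes the scale of the data, not the shape of its distribution.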
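A tiny sketch of what StandardScaler does, using a made-up feature matrix with two columns on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: the second column is 200 times larger.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # each column now has mean ~0
print(X_scaled.std(axis=0))   # and standard deviation ~1
```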

b. Splitting Dataset:

After preprocessing, we now split the data into training/testing subsets.
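The split can be sketched with scikit-learn's train_test_split. The 80/20 ratio and the random_state value below are assumptions for illustration; the ratio actually used is in the linked repository.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and target with 100 samples.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Assumed 80/20 split; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```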

c. Evaluating Models:

We then fitted various regression models and calculated metrics such as the model score, Mean Absolute Error, Mean Squared Error, Root Mean Squared Error and R2 Score.

Here, ‘for loop’ is used for fitting different models in one go.
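A loop of this kind could look like the sketch below. It runs on synthetic data from make_regression rather than the election data, the list of models is only an example, and I am assuming the "model score" is the score on the training set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the election data.
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Example set of models; any scikit-learn regressor can be added here.
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "score": model.score(X_train, y_train),  # assumed: training-set score
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_test, pred),
    }
    print(name, results[name])
```

Collecting the metrics in one dictionary makes it easy to turn them into a DataFrame and compare the models side by side.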

Mean Squared Error and the R2 Score describe how close the model's predictions are to the actual data points: a lower error and an R2 closer to 1 indicate a better fit.

Based on these performance metrics, Random Forest Regressor and Decision Tree Regressor performed best, with extremely low error and a high R2 score.

Regression analysis | Randomforest regressor

d. Hyper-Parameter Tuning:

To improve performance even further, we perform hyperparameter tuning on both models and choose the final model based on the resulting metrics.

Comparing the two results, Random Forest Regressor is the final model because its RMSE (0.03) is smaller than that of Decision Tree Regressor (0.04).
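One common way to do this tuning is a grid search with cross-validation, sketched below on synthetic data. Using GridSearchCV, this particular parameter grid and RMSE as the scoring metric are all assumptions for illustration; the search actually used is in the linked repository.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the election data.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Hypothetical, deliberately small grid to keep the search fast.
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print("cross-validated RMSE:", -search.best_score_)
```

The same search can be repeated with DecisionTreeRegressor and its own grid, after which the two best RMSE values can be compared.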

e. Checking Model prediction:

To check model performance, we now plot a scatter plot of the actual test values against the values predicted by the Random Forest model.

Concluding Remarks

1. Took FinalMandates as the output variable.

2. Understood the relationship of the target variable with other variables by using Data Visualization:

· Votes and Hondt have a positive linear relationship with the target.

· Total Votes and Mandates have discrete values against the target variable.

· Correlation between many variables is > 0.9, hence used PCA to decrease the dimensionality of the data from 28 to 15 columns.

· Label-encoded object data such as Party and territoryName for better analysis.

3. Removed outliers using z-score analysis and transformed skewed columns to bring the data closer to a normal distribution.

4. Checked various regressor models and found that Random Forest and Decision Tree performed best, with R2 scores > 0.99.

5. Performed hyperparameter tuning to find the best parameters of these models and finally chose Random Forest as the final model.

6. The final score for the RFR model is 0.9998, the RMSE is 0.02 and the R2 score is 0.9996.

7. Plotted a scatter plot and found a near-linear trend that shows a close match between test and predicted values.

8. Overall, the model is a good predictor of true values.

Here is the link to my complete solution and dataset used:

https://github.com/priyalagarwal27/Regression-on-ELECTION-dataset

Well, what are your thoughts on this? I would love to hear them. I hope you liked the article.

Stay connected for more related articles. Please leave any suggestions, questions or requests for further clarification in the comment section below.

Thanks for reading and Happy Machine Learning!!!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Priyal

Hello everyone out there. I'm Priyal Agarwal, working as a Data Analyst. With a background in data science and analytics, I’m passionate about leveraging data to drive business strategies and enhance customer experiences. I’m particularly interested in predictive analytics and machine learning.
I am excited about contributing to the data science community by developing innovative solutions that push the boundaries of what's possible. I believe that data science and AI have the power to revolutionize industries and improve lives, and I am eager to be at the forefront of this transformative journey. Looking forward to connecting with like-minded professionals and expanding my knowledge.


