Fake News Classification Using Deep Learning

pavan Last Updated : 22 Mar, 2022
8 min read

This article was published as a part of the Data Science Blogathon.

Introduction

Here’s a quick puzzle for you. I’ll give you two titles, and you’ll have to tell me which is fake. Ready? Let’s get started:

“Adani Group is planning to explore investment in the EV sector.”
“Wipro is planning to buy an EV-based startup.”

Well, it turns out that both of those headlines were fake news. In this article, you will learn how to build a fake news classifier using deep learning.

Fake News Classification
Image – 1

The grim reality is that there is a lot of misinformation and disinformation on the internet. Ninety per cent of Canadians have fallen for false news, according to a 2019 survey conducted by Ipsos Public Affairs for Canada’s Centre for International Governance Innovation.

It got me thinking: is it feasible to build an algorithm that can tell whether an article’s title is fake news? Well, it appears to be the case!

In this post, we explore classification models built with an LSTM and with BERT to identify fake news from article titles.

Go through this Github link to view the complete code.

Dataset for Fake News Classification

We use a dataset from Kaggle. It consists of 2095 articles with details such as the author, title, and other information. Go through the link to get the dataset.

EDA

Let us start analyzing our data to get better insights from it. The dataset looks clean, so we map the class labels Real and Fake to 0 and 1.

import pandas as pd

#Load the dataset and keep only the title and label columns
data = pd.read_csv('/content/news_articles.csv')
data = data[['title', 'label']]
data['label'] = data['label'].map({'Real': 0, 'Fake': 1})
data.head()
EDA | Fake News Classification
Image by Author

Since we have 1294 samples of real news and 801 samples of fake news, there is an approximately 62:38 ratio, which means our dataset is somewhat imbalanced. For this project, we use only the title and label columns.
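As a quick sanity check, the class balance can be read directly off the dataframe loaded above (a small sketch, assuming the same data variable):

#Count real (0) vs fake (1) titles and show the class ratio
print(data['label'].value_counts())
print(data['label'].value_counts(normalize=True).round(2))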

Now, we can analyze the trends present in our dataset. To get an idea of the title lengths, we compute the mean, minimum, and maximum character lengths of the titles and visualize the distribution with a histogram.

# Character Length of Titles - Min, Mean, Max
print('Mean Length', data['title'].apply(len).mean())
print('Min Length', data['title'].apply(len).min())
print('Max Length', data['title'].apply(len).max())

x = data['title'].apply(len).plot.hist()
Fake News Classification
Image by Author

We can observe that title lengths range from 2 to 443 characters, that most titles fall in the 0-100 character range, and that the mean length is around 61 characters.

Preprocessing Data

Now we will use the NLTK library to preprocess our dataset, which includes:

Tokenization:

It is the process of splitting text into smaller units, so each title becomes a list of words.

Lemmatization:

It reduces each word to its root form; for example, children becomes child.

Stop words Removal:

Common words such as the and for are removed from the dataset because they add little meaning.

#Import nltk preprocessing library to convert text into a readable format
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

#Tokenize the string (create a list -> each index is a word)
data['title'] = data.apply(lambda row: nltk.word_tokenize(row['title']), axis=1)

#Define text lemmatization model (eg: walks will be changed to walk)
lemmatizer = WordNetLemmatizer()

#Loop through title dataframe and lemmatize each word
def lemma(data):
  return [lemmatizer.lemmatize(w) for w in data]

#Apply to dataframe
data['title'] = data['title'].apply(lemma)

#Define all stopwords in the English language (it, was, for, etc.)
stop = stopwords.words('english')

#Remove them from our dataframe
data['title'] = data['title'].apply(lambda x: [i for i in x if i not in stop])

data.head()
Preprocessing Data
Image by Author

We create two models using this data for text classification:

  • An LSTM model (using TensorFlow Hub’s Wiki-words-250 embeddings)
  • A BERT model.

 

LSTM Model for Fake News Classification

We split our data into a 70:30 ratio of train and test.

from sklearn.model_selection import train_test_split

#Split data into training and testing dataset
#(titles holds the preprocessed title tokens and labels the 0/1 classes from the dataframe above)
title_train, title_test, y_train, y_test = train_test_split(titles, labels,
                                                            test_size=0.3,
                                                            random_state=1000)

To get predictions from our model, we need to encode the text as numeric vectors before the machine can process it.

TensorFlow’s Wiki-words-250 embeddings are trained with the Word2Vec skip-gram architecture, in which the model learns to predict a word’s context from the word itself.

Consider this sentence as an example:

I am going on a voyage in my car.

The word voyage is passed as input with a window size of one. The window size is the number of words before and after the target word that the model tries to predict. In our case, those words are go and car (after removing stop words; go is the lemmatized form of going).

We one-hot encode our word, resulting in an input vector of size 1 x V, where V is the vocabulary size. This vector is multiplied by a weight matrix with V rows (one for each word in our vocabulary) and E columns, where E is a hyperparameter indicating the size of each embedding. Because the input vector is one-hot encoded, all of its values are zero except for a single one (representing the word we are inputting), so multiplying it by the weight matrix simply selects one row: a 1 x E vector that is the embedding for that word.

The 1 x E vector is then passed to the output layer, a softmax regression classifier with V neurons (corresponding to the vocabulary’s one-hot encoding). Each neuron produces a value between 0 and 1 indicating the likelihood of the corresponding word appearing within the window around the input word.
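To make that arithmetic concrete, here is a minimal NumPy sketch of the skip-gram forward pass with a toy vocabulary; the sizes and random weight matrices are purely illustrative and are not the Wiki-words-250 parameters.

import numpy as np

V, E = 10, 4                        # toy vocabulary size and embedding size
W_in = np.random.rand(V, E)         # V x E weight matrix (its rows are the embeddings)
W_out = np.random.rand(E, V)        # E x V output weight matrix

x = np.zeros((1, V))                # 1 x V one-hot input vector
x[0, 3] = 1                         # suppose the input word has index 3

embedding = x @ W_in                # 1 x E: picks out row 3 of W_in
logits = embedding @ W_out          # 1 x V: one score per vocabulary word
probs = np.exp(logits) / np.exp(logits).sum()   # softmax: P(word appears in the window)
print(probs.round(3))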

Tensorflow’s Wiki-words-250 provides word embeddings with a size E of 250. We apply the embeddings by looping through all of the titles and computing the embedding for each word. Because the titles have different lengths, we then use the pad_sequences function to pad them to a common length.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/Wiki-words-250/2")

#Convert each series of words to a word2vec embedding
indiv = []
for i in title_train:
  indiv.append(np.array(embed(i)))

#Accounts for the different lengths of the titles
indiv = tf.keras.preprocessing.sequence.pad_sequences(indiv, dtype='float')

#Embed and pad the test titles the same way (used later for validation and evaluation)
indiv_test = []
for i in title_test:
  indiv_test.append(np.array(embed(i)))
indiv_test = tf.keras.preprocessing.sequence.pad_sequences(indiv_test, dtype='float',
                                                           maxlen=indiv.shape[1])

indiv.shape

Therefore, the training data has the shape (1466, 46, 250): 1466 samples, a maximum title length of 46 words, and 250 features per word.

Now, we build our model. It consists of:

  • 1 LSTM layer with 50 units
  • 2 Dense layers (the first with 20 neurons, the second with 5), each with a ReLU activation function.
  • 1 Dense output layer with activation function sigmoid.

We will use the Adam optimizer, binary cross-entropy loss, and accuracy as the performance metric. The model will be trained for 20 epochs. Feel free to further adjust these hyperparameters.

#Sequential model has a 50 cell LSTM layer before Dense layers
model = tf.keras.models.Sequential()

model.add(tf.keras.layers.LSTM(50))
model.add(tf.keras.layers.Dense(20, activation='relu'))
model.add(tf.keras.layers.Dense(5, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

#Compile model with binary_crossentropy loss, Adam optimizer, and accuracy metrics
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=['accuracy'])

#Train model for 20 epochs, validating on the embedded and padded test titles
model.fit(indiv, y_train, validation_data=(indiv_test, y_test), epochs=20)
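To report the test figure explicitly, the trained network can also be evaluated on the padded test embeddings built earlier (a short sketch using the indiv_test array):

#Evaluate the LSTM model on the held-out titles
model.evaluate(indiv_test, y_test)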

We get an accuracy of 59.4% on test data.

Using BERT for Fake News Classification

What would you reply if I asked you to name the English term with the most definitions?

That word is “set,” according to the Oxford English Dictionary’s Second Edition.

If you think about it, we could make a lot of different statements using that term in various settings. Consider the following scenario:

My books are part of a set of stationery collections
My teammate won the first set in badminton
I set the table for lunch

The problem with Word2Vec is that it generates the same embedding no matter how the word is used. To combat this, we use BERT, which can build contextualized embeddings.
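As a quick, hedged illustration using the Wiki-words-250 module loaded in the LSTM section above: the lookup for “set” never sees the sentence at all, so it always returns the same static vector, whereas BERT’s per-token outputs (the sequence_output introduced below) change with the surrounding context.

import numpy as np

#The static embedding for "set" is identical regardless of context
v_stationery = np.array(embed(["set"]))    # "set" as in "a set of stationery"
v_lunch = np.array(embed(["set"]))         # "set" as in "set the table for lunch"
print(np.allclose(v_stationery, v_lunch))  # True: the same 250-dim vector both times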

BERT is known as “Bidirectional Encoder Representations from Transformers.” It employs a transformer model to generate contextualized embeddings by utilizing attention mechanisms.

A transformer model uses an encoder-decoder design. The encoder creates a continuous representation of the input it has learned from, while the decoder takes the previously generated output together with that representation to produce the next output. Because BERT’s purpose is to build a vector representation of the text, it employs only the encoder.

Pre-Training & Fine-Tuning

BERT is trained in two ways. The first is masked language modelling: before sequences are fed to the model, 15% of the words are replaced with a [MASK] token, and the model predicts the masked words using the context supplied by the unmasked words.

This is accomplished by (see the sketch below):
  • Applying a classification layer on top of the encoder output and multiplying it by the embedding matrix, so the result has the same size as the vocabulary.
  • Applying the softmax function to calculate the likelihood of each word.
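For intuition only, here is a minimal Keras sketch of such a masked-language-modelling head; the layer, tensor shapes, and vocabulary size are illustrative assumptions and are not part of the classifier built in this article.

import tensorflow as tf

VOCAB_SIZE = 30522   # size of BERT's uncased WordPiece vocabulary
HIDDEN = 768         # output width of the BERT-base encoder

#Hypothetical MLM head: project every token's encoder output onto the vocabulary
mlm_head = tf.keras.layers.Dense(VOCAB_SIZE, activation='softmax')

encoder_output = tf.random.normal((1, 128, HIDDEN))   # placeholder encoder output
token_probs = mlm_head(encoder_output)                # shape (1, 128, 30522)

#token_probs[0, t] is a probability distribution over the vocabulary for token t;
#at masked positions, the highest-probability word is the model's prediction.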

The second strategy is next sentence prediction: the model is given two sentences as input and predicts whether the second sentence follows the first. During training, half of the inputs are genuine consecutive pairs, while the other half pair the first sentence with a random sentence from the corpus. To distinguish between the two sentences:

  • A [CLS] token is added at the start of the first sentence and a [SEP] token at the end of each sentence.
  • Each token (word) receives a positional embedding that encodes its location in the text. Because a transformer model has no recurrence, it has no inherent sense of word order.
  • Each token is given a sentence embedding (further differentiating between the sentences).

For Next Sentence Prediction, the output of the [CLS] embedding, which stands for “aggregate sequence representation for sentence classification,” is passed through a classification layer with softmax to return the probability of the two sentences being sequential.

Pre-training and fine-tuning | Fake News Classification

Image by Author

Implementation of BERT

We use the BERT preprocessor and encoder from TensorFlow Hub. Do not run the titles through the preprocessing pipeline described earlier (which lowercases, lemmatizes, and removes stop words); the BERT preprocessor handles this internally.

We split our data for training and testing in the ratio of 80:20.

from sklearn.model_selection import train_test_split

#Split data into training and testing dataset
#(the raw, unpreprocessed titles are used here, as noted above)
title_train, title_test, y_train, y_test = train_test_split(titles, labels,
                                                            test_size=0.2,
                                                            random_state=1000)

Now, load the BERT preprocessor and encoder:

# Use the bert preprocesser and bert encoder from tensorflow_hub 
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3") 
bert_encoder = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4')

We can now work on our neural network. It must be a functional model, with each layer’s output serving as an argument to the next.

  • 1 Input layer: Used to pass sentences into the model.
  • The bert_preprocess layer: Preprocess the input text.
  • The bert_encoder layer: Pass the preprocessed tokens into the BERT encoder.
  • 1 Dropout layer with a rate of 0.2; the BERT encoder’s pooled_output is passed into it.
  • 2 Dense layers with 10 and 1 neurons. The first uses a ReLU activation function, and the second is sigmoid.
import tensorflow as tf

# Input Layers
input_layer = tf.keras.layers.Input(shape=(), dtype=tf.string, name='news')

# BERT layers
processed = bert_preprocess(input_layer)
output = bert_encoder(processed)

# Fully Connected Layers
layer = tf.keras.layers.Dropout(0.2, name='dropout')(output['pooled_output'])
layer = tf.keras.layers.Dense(10,activation='relu', name='hidden')(layer)
layer = tf.keras.layers.Dense(1,activation='sigmoid', name='output')(layer)


model = tf.keras.Model(inputs=[input_layer],outputs=[layer])

As you can see, the pooled_output is passed into the dropout layer. This value represents the overall sequence representation of the text; as previously mentioned, it corresponds to the output of the [CLS] token.

We use the Adam optimizer, binary cross-entropy loss, and accuracy as the performance metric, and train the model for five epochs. Feel free to tweak these hyperparameters even more.

#Compile model on adam optimizer, binary_crossentropy loss, and accuracy metrics
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
#Train model on 5 epochs
model.fit(title_train, y_train, epochs= 5)

#Evaluate model on test data
model.evaluate(title_test,y_test)

Image by Author

Above, you can see that our model achieved an accuracy of 61.33%.
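As a usage sketch (the headline below is just the example from the introduction, and the score is whatever the trained model happens to output), a raw title can be passed straight to the model because the BERT preprocessor is part of the graph:

#Score a new headline with the trained BERT classifier
sample = tf.constant(["Wipro is planning to buy an EV-based startup."])
prob_fake = model.predict(sample)[0][0]
print(f"Probability of being fake: {prob_fake:.2f}")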

Conclusion

To improve the model performance:

  • Train the models on a larger dataset.
  • Tweak hyperparameters of the model.

I hope you found this post insightful and that it gave you a better understanding of NLP techniques for fake news classification.


References

Image – 1: Photo by Roman Kraft on Unsplash

 

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion. 

Embarking on a transformative odyssey through the realms of AI, ML, and NLP, I've woven a tapestry of experience over three dynamic years. Amidst the digital symphony, I now find myself enraptured by the artistry of Generative AI, sculpting the future of innovation. As I dance with colossal language models, each keystroke becomes a brushstroke, painting the canvas of possibility in this ever-evolving technological landscape.

