Summarize Twitter Live data using Pretrained NLP models

manmohan24nov Last Updated : 06 Nov, 2020

4 min read

Introduction

Twitter users spend an average of 4 minutes on social media Twitter. On an average of 1 minute, they read the same stuff. It shows that users spend around 25% of their time reading the same stuff.

Also, most of the tweets will not appear on your dashboard. You may get to know the trending topics, but you miss not trending topics. In trending topics, you might only read the top 5 tweets and their comments.

So, what are you going to do to avoid wastage of time on Twitter?

I would say summarize your whole trending Twitter tags data. And, then you can finish reading all trending tweets in less than 2 minutes.

In this article, I will explain to you how you can leverage Natural Language Processing (NLP) pre-trained models to summarize twitter posts based on hashtags. We will use 4 ( T5, BART, GPT-2, XLNet) pre-trained models for this job.

Why use 4 types of pre-trained models for summarization?

Each pre-trained model has its own architecture and weights. So, the summarization output given by these models could be different from each other.

Test the twitter data on different models and then choose the model which shows summarization close to your understanding. And then deploy that model into production.

Let’s start with collecting Twitter Live data.

Twitter Live Data

You can get Twitter live data in 2 ways.

Official Twitter API. Follow this article to get a Twitter dataset.
Use the Beautiful Soup library to scrape the data from Twitter.

I will be using step 1 to fetch the data. Once you receive the credentials for Twitter API, follow the below code to get Twitter data through API.

	In [1]: import tweepy

	In [2]: # Twitter API credentials
	...: consumer_key = "consumer key"
	...: consumer_secret = "consumer secret"
	...: access_key = "access key"
	...: access_secret = "access secret"

	In [3]: auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
	...: auth.set_access_token(access_key, access_secret)
	...: api = tweepy.API(auth,wait_on_rate_limit=True)

	In [4]: # Define the search term and removes the retweets to avoid duplicate data
	...: search_words = "#USElections2020"
	...: search_words = search_words + " -filter:retweets"

	In [5]: # Collect tweets
	...: tweet_new =tweepy.Cursor(api.search,
	...: q=search_words,
	...: lang="en",
	...: result_type='popular').items(100)

	In [6]: tweet_data=[]
	...: # Iterate the tweets
	...: for tweet in tweet_new:
	...: tweet_data.append(tweet.text)
	...:

	In [7]: tweet_data
	Out[7]:
	['Nine Nigerian-Americans Contesting In United States Election On Tuesday \| Sahara Reporters https://t.co/YvozKoaP70… https://t.co/lZ3h9Ne5pA',
	'#SRPoll: Which Candidate Do You Support In The US Presidential Election?\n\n1. @JoeBiden - Democratic Candidate\n\n2.… https://t.co/vpbqN5P9jn',
	'One Day To US Elections, @JoeBiden Leads As @realDonaldTrump Trails In Battleground States, Threatens To Head To Co… https://t.co/JaBB79T5sb',
	'Time Machine:\nAmazing that Martin Luther King Jr. was 31 years old when he gave this prescient and prophetic speech… https://t.co/hQjT5nHZ3A',
	'The best thing about #USElections2020 is how the NRIs are supporting the narratives for #JoeBiden of equality, free… https://t.co/yne65KlkdT',
	'A reminder as you seek comfort food in the days ahead that calories don’t count if you don’t use a plate. #HandToMouth #USElections2020',
	'Don’t forget the big #USElections2020 overnight show with me and a host of brilliant guests on both sides of the At… https://t.co/6G2LVXXZGM',
	'When I tell you Trump is a 🇳🇬Politician you lot will say its nonsense.\nWọn n pin iresi in the abroad… https://t.co/uVl27mRQTJ',
	'“In a choice between a clown and a gaffe-prone plagiarist tarred by his son’s alleged corruption, Trump deserves th… https://t.co/PBR8IdqCo6',
	'OPINION: An eerily timely message to Americans, resonating across the decades, on the eve of the 2020 presidential… https://t.co/cqIrQuBp3b',
	'In the coming days and weeks especially, it is critical that social media platforms apply their standards in a mann… https://t.co/kXXm26lncZ',
	'With just 2 days to go, what does my timeline think about the #USElections2020']

view raw fetch_tweeter_data.py hosted with ❤ by GitHub

Now, let’s start summarizing data using pre-trained models one by one.

1. Summarization using T5 Model

T5 is a state of the art model used in various NLP tasks that includes summarization. We will be using the transformers library to download the T5 pre-trained model and load that model in a code.

The Transformers library is developed and maintained by the Hugging Face team. It’s an open-source library.

Know more about the T5 model here.

Here is code to summarize the Twitter dataset using the T5 model.

	>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
	2020-11-03 15:26:26.375782: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
	>>> model = T5ForConditionalGeneration.from_pretrained('t5-base')
	>>> tokenizer = T5Tokenizer.from_pretrained('t5-base')
	>>> text = " ".join(tweet_data)
	>>> TEXT_CLEANING_RE = "@\S+\|https?:\S+\|http?:\S\|[^A-Za-z0-9]+"
	>>> text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
	>>> Preprocessed_text = "summarize: "+Text
	>>> tokens_input = tokenizer.encode(Preprocessed_text,return_tensors="pt", max_length=512, truncation=True)
	>>> summary_ids = model.generate(tokens_input,
	... min_length=60,
	... max_length=180,
	... length_penalty=4.0)
	>>>
	>>> summary = tokenizer.decode(summary_ids[0])
	>>> print(summary)
	sahara reporters will be covering the uselections2020 on tuesday. a nigerian journalist will be presenting a live coverage of the uselections2020.
	a nigerian journalist will be presenting a live coverage of the uselections2020.

view raw t5_summerization.py hosted with ❤ by GitHub

Observation on Code

You can use different types of T5 pre-trained models having different weights and architecture. Available versions of the T5 model in the transformer library are t5-base, t5-large, t5-small, t5-3B, and t5-11B.
Return_tensor value should be pt for PyTorch.
The maximum sentence length used to train the pre-models is 512. So, keep the max_length value to 512.
The length of the summarized sentence increase with an increase in length_penality value. Length_penality=1 means no penalty.

2. Summarization using BART models

BART uses both BERT (bidirectional encoder) and GPT (left to the right decoder) architecture with seq2seq translation. BART achieves the state of the art results in the summarization task.

BART pre-trained model is trained on CNN/Daily mail data for the summarization task, but it will also give good results for the Twitter dataset.

We will take advantage of the hugging face transformer library to download the T5 model and then load the model in a code.

Here is code to summarize the Twitter dataset using the BART model.

	>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
	2020-11-03 15:26:26.375782: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
	>>> model = T5ForConditionalGeneration.from_pretrained('t5-base')
	>>> tokenizer = T5Tokenizer.from_pretrained('t5-base')
	>>> text = " ".join(tweet_data)
	>>> TEXT_CLEANING_RE = "@\S+\|https?:\S+\|http?:\S\|[^A-Za-z0-9]+"
	>>> text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
	>>> Preprocessed_text = "summarize: "+Text
	>>> tokens_input = tokenizer.encode(Preprocessed_text,return_tensors="pt", max_length=512, truncation=True)
	>>> summary_ids = model.generate(tokens_input,
	... min_length=60,
	... max_length=180,
	... length_penalty=4.0)
	>>>
	>>> summary = tokenizer.decode(summary_ids[0])
	>>> print(summary)
	sahara reporters will be covering the uselections2020 on tuesday. a nigerian journalist will be presenting a live coverage of the uselections2020.
	a nigerian journalist will be presenting a live coverage of the uselections2020.

view raw t5_summerization.py hosted with ❤ by GitHub

Observation on Code

You can increase and decrease the length of the summarization using min_length and max_length. Ideally, summarization length should be 10% to 20% of the total article length.
This model is ideally suitable to summarize the news articles. But it can also give good results on Twitter data.
You can use different BART model versions such as bart-large, bart-base, bart-large-cnn, and bart-large-mnli.

3. Summarization using GPT-2 model

GPT-2 model with 1.5 million parameters is a large transformer-based language model. It’s trained for predicting the next word. So, we can use this specialty to summarize Twitter data.

GPT-2 models come with various versions. And, each version’s size is more than 1 GB.

We will be using the bert-extractive-summarizer library to download GPT-2 models. Learn more about the bert-extractive-summarizer library here.

Use pip install bert-extractive-summarizer command to install the library.

Here is a code to summarize the Twitter dataset using the GPT-2 model.

	>>> from summarizer import TransformerSummarizer
	>>> import re
	>>> GPT2_model = TransformerSummarizer(transformer_type="GPT2",transformer_model_key="gpt2-medium")
	>>> text = " ".join(tweet_data)
	>>> TEXT_CLEANING_RE = "@\S+\|https?:\S+\|http?:\S\|[^A-Za-z0-9]+"
	>>> text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
	>>> summerize = ''.join(GPT2_model(text, min_length=60, max_length=120))
	>>> summerize
	'Overnight show with me and a host of brilliant guests on both sides of the at trump s defeat will expose narendramodi to international censure change in the white house likely to force the in in a choice between a clown and a gaffe prone plagiarist tarred by his son s alleged corruption trump deserves th see a detailed map of'

view raw gpt_2_summerization.py hosted with ❤ by GitHub

Observation on Code

The transformer_type value will vary according to the pre-trained model we use.
You can change the transformer_model_key as per the requirement. GPT-2 has four versions gpt2, gpt2-medium, gpt2-large and gpt2-XL.
This library also has a min_length and max_length option. You can assign values to these variables as per your requirement.

4. Summarization using XLNet model

XLNet is an improved version of the BERT model which implement permutation language modeling in its architecture. Also, XLNet is a bidirectional transformer where the next tokens are predicted in random order.

The XLNet model has two versions xlnet-base-cased and xlnet-large-cased.

Here is a code to summarize the Twitter dataset using the XLNet model.

	>>> from summarizer import TransformerSummarizer
	>>> import re
	>>> xlnet_model = TransformerSummarizer(transformer_type="XLNet",transformer_model_key="xlnet-base-cased")
	>>> text = " ".join(tweet_data)
	>>> TEXT_CLEANING_RE = "@\S+\|https?:\S+\|http?:\S\|[^A-Za-z0-9]+"
	>>> text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
	>>> summerize = ''.join(xlnet_model(text, min_length=60, max_length=120))
	>>> summerize
	"The fixwithohimai and chidiodinkalu look ahead to tomorrow's presidential election. The uselections2020 overnight show will feature guests on both sides of the at trump s defeat.
	A new poll shows potus leading in one of the most important swing states pennsylvania."

view raw xlnet_summerize.py hosted with ❤ by GitHub

Observation on Code

You can change the value of min_length and max_length as per your requirement.
This model will trim the sentence length if it exceeds 512 value.

Other use-cases of Summarization

Summarize each article and present it to the readers as a summary.
You can use this method to generate high-quality SEO. It will help your articles to discover more on google.
Summarize the whole comment section of the post. These posts may belong to Reddit or Twitter social media platform.
You can summarize the whitepapers, e-books, or blog posts and share them on your social media platform.

Conclusion

In this article, we have summarized the Twitter live data using T5, BART, GPT-2, and XLNet pre-trained models. Each model generates a different summarize output for the same dataset. Summarization by the T5 model and BART has outperformed the GPT-2 and XLNet models.

These pre-trained models can also summarize articles, e-books, blogs with human-level performance. In the future, you can see a lot of improvements in summarization tasks. And this will help you to solve many summarization related tasks.

manmohan24nov

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Introduction to Computer Vision

Getting Started with Image Data

Introduction to CNN and Implementation

Introduction to CNN and implementation

Introduction to Transfer Learning

CNN Visualization

Overview of Pretrained Models

Inception

ResNets

DenseNets

CSRNet

Introduction to Object Detection

Region Based Convolutional Neural Network

Single Stage Networks

Transformed Based Object Detection Models

Face Detection

Object Tracking

Pose Estimation

Introduction to Image Segmentation

Understanding Deep Learning Architectures for Image Segmentation

Video Classification

Introduction to Image Generation

Experiments with Generative Adversarial Networks

Zero and Few Shot Learning

Model Deployment

Summarize Twitter Live data using Pretrained NLP models

Introduction

Why use 4 types of pre-trained models for summarization?

Twitter Live Data

1. Summarization using T5 Model

2. Summarization using BART models

3. Summarization using GPT-2 model

4. Summarization using XLNet model

Conclusion

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at