Twitter users spend an average of 4 minutes on social media Twitter. On an average of 1 minute, they read the same stuff. It shows that users spend around 25% of their time reading the same stuff.
Also, most of the tweets will not appear on your dashboard. You may get to know the trending topics, but you miss not trending topics. In trending topics, you might only read the top 5 tweets and their comments.
So, what are you going to do to avoid wastage of time on Twitter?
I would say summarize your whole trending Twitter tags data. And, then you can finish reading all trending tweets in less than 2 minutes.
In this article, I will explain to you how you can leverage Natural Language Processing (NLP) pre-trained models to summarize twitter posts based on hashtags. We will use 4 ( T5, BART, GPT-2, XLNet) pre-trained models for this job.
Why use 4 types of pre-trained models for summarization?
Each pre-trained model has its own architecture and weights. So, the summarization output given by these models could be different from each other.
Test the twitter data on different models and then choose the model which shows summarization close to your understanding. And then deploy that model into production.
Let’s start with collecting Twitter Live data.
Twitter Live Data
You can get Twitter live data in 2 ways.
Official Twitter API. Follow this article to get a Twitter dataset.
Use the Beautiful Soup library to scrape the data from Twitter.
I will be using step 1 to fetch the data. Once you receive the credentials for Twitter API, follow the below code to get Twitter data through API.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Now, let’s start summarizing data using pre-trained models one by one.
1. Summarization using T5 Model
T5 is a state of the art model used in various NLP tasks that includes summarization. We will be using the transformers library to download the T5 pre-trained model and load that model in a code.
The Transformers library is developed and maintained by the Hugging Face team. It’s an open-source library.
Here is code to summarize the Twitter dataset using the T5 model.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
You can use different types of T5 pre-trained models having different weights and architecture. Available versions of the T5 model in the transformer library are t5-base, t5-large, t5-small, t5-3B, and t5-11B.
Return_tensor value should be pt for PyTorch.
The maximum sentence length used to train the pre-models is 512. So, keep the max_length value to 512.
The length of the summarized sentence increase with an increase in length_penality value. Length_penality=1 means no penalty.
2. Summarization using BART models
BART uses both BERT (bidirectional encoder) and GPT (left to the right decoder) architecture with seq2seq translation. BART achieves the state of the art results in the summarization task.
BART pre-trained model is trained on CNN/Daily mail data for the summarization task, but it will also give good results for the Twitter dataset.
We will take advantage of the hugging face transformer library to download the T5 model and then load the model in a code.
Here is code to summarize the Twitter dataset using the BART model.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
You can increase and decrease the length of the summarization using min_length and max_length. Ideally, summarization length should be 10% to 20% of the total article length.
This model is ideally suitable to summarize the news articles. But it can also give good results on Twitter data.
You can use different BART model versions such as bart-large, bart-base, bart-large-cnn, and bart-large-mnli.
3. Summarization using GPT-2 model
GPT-2 model with 1.5 million parameters is a large transformer-based language model. It’s trained for predicting the next word. So, we can use this specialty to summarize Twitter data.
GPT-2 models come with various versions. And, each version’s size is more than 1 GB.
Use pip install bert-extractive-summarizer command to install the library.
Here is a code to summarize the Twitter dataset using the GPT-2 model.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
'Overnight show with me and a host of brilliant guests on both sides of the at trump s defeat will expose narendramodi to international censure change in the white house likely to force the in in a choice between a clown and a gaffe prone plagiarist tarred by his son s alleged corruption trump deserves th see a detailed map of'
The transformer_type value will vary according to the pre-trained model we use.
You can change the transformer_model_key as per the requirement. GPT-2 has four versions gpt2, gpt2-medium, gpt2-large and gpt2-XL.
This library also has a min_length and max_length option. You can assign values to these variables as per your requirement.
4. Summarization using XLNet model
XLNet is an improved version of the BERT model which implement permutation language modeling in its architecture. Also, XLNet is a bidirectional transformer where the next tokens are predicted in random order.
The XLNet model has two versions xlnet-base-cased and xlnet-large-cased.
Here is a code to summarize the Twitter dataset using the XLNet model.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
"The fixwithohimai and chidiodinkalu look ahead to tomorrow's presidential election. The uselections2020 overnight show will feature guests on both sides of the at trump s defeat.
A new poll shows potus leading in one of the most important swing states pennsylvania."
You can change the value of min_length and max_length as per your requirement.
This model will trim the sentence length if it exceeds 512 value.
Other use-cases of Summarization
Summarize each article and present it to the readers as a summary.
You can use this method to generate high-quality SEO. It will help your articles to discover more on google.
Summarize the whole comment section of the post. These posts may belong to Reddit or Twitter social media platform.
You can summarize the whitepapers, e-books, or blog posts and share them on your social media platform.
Conclusion
In this article, we have summarized the Twitter live data using T5, BART, GPT-2, and XLNet pre-trained models. Each model generates a different summarize output for the same dataset. Summarization by the T5 model and BART has outperformed the GPT-2 and XLNet models.
These pre-trained models can also summarize articles, e-books, blogs with human-level performance. In the future, you can see a lot of improvements in summarization tasks. And this will help you to solve many summarization related tasks.
We use cookies essential for this site to function well. Please click to help us improve its usefulness with additional cookies. Learn about our use of cookies in our Privacy Policy & Cookies Policy.
Show details
Powered By
Cookies
This site uses cookies to ensure that you get the best experience possible. To learn more about how we use cookies, please refer to our Privacy Policy & Cookies Policy.
brahmaid
It is needed for personalizing the website.
csrftoken
This cookie is used to prevent Cross-site request forgery (often abbreviated as CSRF) attacks of the website
Identityid
Preserves the login/logout state of users across the whole site.
sessionid
Preserves users' states across page requests.
g_state
Google One-Tap login adds this g_state cookie to set the user status on how they interact with the One-Tap modal.
MUID
Used by Microsoft Clarity, to store and track visits across websites.
_clck
Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.
_clsk
Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.
SRM_I
Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.
SM
Use to measure the use of the website for internal analytics
CLID
The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.
SRM_B
Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.
_gid
This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.
_ga_#
Used by Google Analytics, to store and count pageviews.
_gat_#
Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.
collect
Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.
AEC
cookies ensure that requests within a browsing session are made by the user, and not by other sites.
G_ENABLED_IDPS
use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.
test_cookie
This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.
_we_us
this is used to send push notification using webengage.
WebKlipperAuth
used by webenage to track auth of webenagage.
ln_or
Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.
JSESSIONID
Use to maintain an anonymous user session by the server.
li_rm
Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.
AnalyticsSyncHistory
Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.
lms_analytics
Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.
liap
Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.
visit
allow for the Linkedin follow feature.
li_at
often used to identify you, including your name, interests, and previous activity.
s_plt
Tracks the time that the previous page took to load
lang
Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings
s_tp
Tracks percent of page viewed
AMCV_14215E3D5995C57C0A495C55%40AdobeOrg
Indicates the start of a session for Adobe Experience Cloud
s_pltp
Provides page name value (URL) for use by Adobe Analytics
s_tslv
Used to retain and fetch time since last visit in Adobe Analytics
li_theme
Remembers a user's display preference/theme setting
li_theme_set
Remembers which users have updated their display / theme preferences
We do not use cookies of this type.
_gcl_au
Used by Google Adsense, to store and track conversions.
SID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
SAPISID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
__Secure-#
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
APISID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
SSID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
HSID
Save certain preferences, for example the number of search results per page or activation of the SafeSearch Filter. Adjusts the ads that appear in Google Search.
DV
These cookies are used for the purpose of targeted advertising.
NID
These cookies are used for the purpose of targeted advertising.
1P_JAR
These cookies are used to gather website statistics, and track conversion rates.
OTZ
Aggregate analysis of website visitors
_fbp
This cookie is set by Facebook to deliver advertisements when they are on Facebook or a digital platform powered by Facebook advertising after visiting this website.
fr
Contains a unique browser and user ID, used for targeted advertising.
bscookie
Used by LinkedIn to track the use of embedded services.
lidc
Used by LinkedIn for tracking the use of embedded services.
bcookie
Used by LinkedIn to track the use of embedded services.
aam_uuid
Use these cookies to assign a unique ID when users visit a website.
UserMatchHistory
These cookies are set by LinkedIn for advertising purposes, including: tracking visitors so that more relevant ads can be presented, allowing users to use the 'Apply with LinkedIn' or the 'Sign-in with LinkedIn' functions, collecting information about how visitors use the site, etc.
li_sugr
Used to make a probabilistic match of a user's identity outside the Designated Countries
MR
Used to collect information for analytics purposes.
ANONCHK
Used to store session ID for a users session to ensure that clicks from adverts on the Bing search engine are verified for reporting purposes and for personalisation
We do not use cookies of this type.
Cookie declaration last updated on 24/03/2023 by Analytics Vidhya.
Cookies are small text files that can be used by websites to make a user's experience more efficient. The law states that we can store cookies on your device if they are strictly necessary for the operation of this site. For all other types of cookies, we need your permission. This site uses different types of cookies. Some cookies are placed by third-party services that appear on our pages. Learn more about who we are, how you can contact us, and how we process personal data in our Privacy Policy.
you copied pasted T5 code under BART section too. Kindly correct it