A web page is a document or information resource that is accessible through the World Wide Web. It is typically made up of HTML (Hypertext Markup Language), which provides the structure and content of the page, and CSS (Cascading Style Sheets), which provides the styling information for how the page should be presented to the user.
A web page can contain text, images, videos, audio, and other multimedia content. It can also contain interactive elements, such as forms, links, and buttons, that allow users to interact with the page and access additional information or services.
Web pages are hosted on web servers and accessed through web browsers, such as Google Chrome, Mozilla Firefox, or Apple Safari. A web page’s URL (Uniform Resource Locator) provides a unique identifier that can be used to access the page from anywhere in the world.
Web pages play a central role in the World Wide Web, providing the primary means of accessing information and services online. They are used for a wide range of purposes, including personal communication, commerce, education, entertainment, and information dissemination.
In this article, we will build a classifier that pre-identifies which URLs are relevant to what you are looking for.
This article was published as a part of the Data Science Blogathon.
Web page classification refers to the process of categorizing web pages into predefined classes or categories based on their content and structure. This can be useful for a variety of purposes, such as organizing web pages for easier searching and browsing, filtering out irrelevant or malicious web pages, and improving the accuracy of search engine results.
Classification of web pages can be performed using machine learning techniques, such as decision trees, random forests, support vector machines, and neural networks. These algorithms take as input a set of features extracted from the web pages, such as the frequency of specific HTML tags, the presence of certain keywords in the text, or the structure of the links between web pages. The algorithms then learn to map these features to class labels based on a training dataset of web pages that have been manually annotated with their correct class labels.
In order to obtain high accuracy in web page classification, it is important to have a large and diverse training dataset and effective feature engineering to capture the relevant information about the web pages. The choice of machine learning algorithm will also impact performance and should be selected based on the nature of the problem and the available training data.
There are several reasons why web page classification is important:
In summary, web page classification is a crucial component in the organization, discovery, and analysis of online content and has numerous applications in fields such as search engines, online advertising, and user personalization.
The problem of web page classification is to accurately categorize a given web page into one or more predefined classes based on its content and structure.
When you type “breast cancer symptoms and effects” into a Google search, you get around 130 million results, and going through each of them is practically impossible.
Instead, you can build a classifier to pre-identify which of the URLs are actually relevant to what you are looking for.
Problem Statement:
In this case study, we are provided with URLs from 53000+ web pages. The objective is to build a classifier that can classify the web pages into their respective classes (Each web page can belong to only 1 class).
Below is the list of the classes that we have in the target variable; note that each of the URLs in our data set will belong to only one class. So this will be a multi-class classification problem.
Basically, given the complete URL, predict the tag a web page belongs to out of 9 predefined tags as given below:
The dataset contains the following features:
The objective here is to predict the class of the web page from the above-mentioned 9 classes. So, let’s quickly go through the components of a URL.
Let us have a look at each component of a URL to have a better understanding of the data.
Here we have an example of this part here:
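As a rough illustration of these components, the sketch below splits a URL with `urlparse` from the Python standard library. The URL itself is hypothetical and is not taken from the dataset.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen only to illustrate the parts of a URL
example_url = "https://www.example.com/articles/clinical-trials?page=2#results"

parts = urlparse(example_url)
print(parts.scheme)    # protocol, here "https"
print(parts.netloc)    # domain, here "www.example.com"
print(parts.path)      # path, here "/articles/clinical-trials"
print(parts.query)     # query string, here "page=2"
print(parts.fragment)  # fragment, here "results"
```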
Let’s go with a practical example. First, import all libraries.
In the above section, we learned about the problem statement of web page classification. We’ll start by importing the required packages, and then we’ll load the data set.
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
from tqdm import tqdm
from urllib.parse import urlparse
pd.set_option("display.max_colwidth", 200)
import warnings
warnings.filterwarnings('ignore')
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import cross_val_predict, GroupKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from scipy.sparse import hstack
%matplotlib inline
We have the data set saved as the file webpage data.csv.
df = pd.read_csv("webpage data.csv")
Let’s look at the shape of the dataset.
df.shape
So we have 53,229 rows and 4 columns, and let’s look at what these columns look like.
The webpage ID is a unique ID for each row. Then we have the domain (fiercepharma.com for the first few rows), the URL of each page and, finally, the target, which is the tag for each URL.
To confirm, let’s see how many unique classes we have in the Tag column. We can do this by calling the unique function on that column of the data frame.
df['Tag'].unique()
This returns all the unique categories in the Tag column, and we can see news, clinical trials, conferences, and so on.
As mentioned in the problem statement, there are 9 separate categories. Let us have a look at a few samples from each category to get a feel for the data.
To start with, we have the class profile, so let’s look at a few examples. These appear to be pages about people in the healthcare domain.
df[df['Tag'] == 'profile'].head(2)
Now let’s look at some other examples.
The class conferences is quite self-explanatory. We have events within the URL, which suggests that the page shares the details of a particular conference conducted by an organization.
df[df['Tag'] == 'conferences'].head(2)
Let’s look at some other examples.
We have another example here for the class forum. Here the URLs contain words such as community and forum.
df[df['Tag'] == 'forum'].head(2)
df[df['Tag'] == 'others'].head(2)
So that’s a brief about the data set and the target classes.
Now, let’s just look at the distribution of the target class. So we’ll be using the value counts function here and we’ll just plot this.
cnt_tag = df['Tag'].value_counts()
plt.figure(figsize=(12,6))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x=cnt_tag.index, y=cnt_tag.values, alpha=0.8, color=color[3])
plt.xticks(rotation='vertical')
plt.show()
As you can see in the plot above, most of the URLs belong to the class others. Then we have news and publication, each with a frequency of approximately 7,500.
Other classes, like thesis and guidelines, have quite a low frequency compared to others, news, and publication. So, clearly, this is an imbalanced classification problem.
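One way to put a number on the imbalance is to look at class shares instead of raw counts. The sketch below uses a made-up toy series in place of df['Tag'], so the figures it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy labels standing in for df['Tag']; the counts are made up for illustration
toy_tags = pd.Series(["others"] * 6 + ["news"] * 3 + ["thesis"] * 1)

# Share of each class in the data
class_share = toy_tags.value_counts(normalize=True)
print(class_share)

# Ratio between the most and the least frequent class
imbalance_ratio = class_share.max() / class_share.min()
print(imbalance_ratio)
```

On the real data, the same two lines could be applied to df['Tag'] directly.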
A word cloud, also known as a tag cloud, is a visual representation of the most frequent words in a text corpus. In the context of web pages, a word cloud can be generated for the URLs of a set of web pages in order to gain insight into the common words used in the URLs.
Now I want to see how the words are distributed across the dataset. One way to do this is to find the common words by plotting word clouds.
A word cloud is a visualization where the most frequent words appear in large sizes, and the less frequent words appear in smaller sizes.
Let’s visualize all the words in our data using the word cloud plot.
all_words = ' '.join([text for text in df['Url']])
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=500, random_state=21,
                      max_font_size=110).generate(all_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
Let’s have a look at it. The word cloud represents the frequency of the words: the bigger a word appears, the higher its frequency. Here, biomedcentral and articles are the two most frequent words.
Overall, most of the URLs point to healthcare pages, as can be seen from the word cloud, and there are words such as thesis and edu, which again suggests that word frequency should be an important feature for prediction.
A good idea would be to create word clouds for each category. So I am going to do this for the category thesis.
Here, I take the words in the Url column only for the rows where the tag is thesis.
The general parameters, such as width and height, stay the same.
all_words = ' '.join([text for text in df[df['Tag'] == 'thesis']['Url']])
wordcloud = WordCloud(width=800, height=500, random_state=21,
max_font_size=110).generate(all_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
Now, this shows a word cloud for the category thesis. The most common words are EDU, handle, kernel thesis library, etc.
Similarly, you can plot word clouds for the other categories to get an idea of the most common words in their URLs; those words help in determining the final class.
Here we simply have domain and URL, which are neither numeric nor categorical variables, as each URL is unique.
Each URL in the dataset can be considered a single string, since the words in a URL are not separated by spaces. Instead, there are 2 types of separators here: ‘/’ and ‘-’. We can replace these with spaces and get the individual words this way.
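def clean_url(df):
    # Turn the separators into spaces so that each word becomes a token
    df["Url"] = df["Url"].str.replace("/", " ")
    df["Url"] = df["Url"].str.replace("-", " ")
    # Drop the protocol prefix, which carries no information about the class
    df["Url"] = df["Url"].str.replace("https:", "")
    df["Url"] = df["Url"].str.replace("http:", "")
    return df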
df = clean_url(df)
df.head(5)
Let’s look at the head of the newly cleaned data.
As you can see, we now have separate words instead of one complete string.
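To see the effect of these replacements on one URL, here is a small plain-Python sketch. The helper clean_url_text and the sample URL are hypothetical, and the final whitespace collapse is an extra step that the pandas version above does not perform.

```python
def clean_url_text(url):
    """Mirror the replacements used above on a single string."""
    for old, new in (("/", " "), ("-", " "), ("https:", ""), ("http:", "")):
        url = url.replace(old, new)
    # Extra step (not in the pandas version): collapse the spaces left by "//"
    return " ".join(url.split())


sample = "https://www.example.com/clinical-trials/breast-cancer"
cleaned = clean_url_text(sample)
print(cleaned)
```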
Now we can move to the feature extraction part. We have tokenized the words and done the necessary cleaning. It is time to convert these to features.
Here, we will use bag-of-words (BOW) features first and TF-IDF features later. Sklearn provides functionality for both. Let’s use that to create these features.
If you look at the documentation for Sklearn feature extraction, you will see that:
The URLs contain a lot of abbreviations for the same words, so it could be a good idea to create bag-of-words features from characters as well.
Let’s start with the bag of words on whole words, using the count vectorizer. We give the count vectorizer an ngram_range of (1, 3), so it counts single words as well as sequences of two and three words.
# Word BOW on URLs (character features are added later)
vec_bow = CountVectorizer(ngram_range=(1, 3), min_df=400)
vec_bow.fit(df['Url'])
Url_bow = vec_bow.transform(df['Url'])
We’ll keep on adding the features as we go ahead.
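To make the effect of ngram_range and min_df more concrete, here is a minimal sketch on two made-up, already cleaned URLs. The values (1, 2) and min_df=2 are chosen only to suit the toy data and are not the settings used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical cleaned URLs
toy_urls = [
    "www.example.com clinical trials",
    "www.example.org clinical trials results",
]

# Keep unigrams and bigrams that occur in at least 2 documents
toy_vec = CountVectorizer(ngram_range=(1, 2), min_df=2)
toy_matrix = toy_vec.fit_transform(toy_urls)

# The learned vocabulary holds single words and word pairs shared by both URLs
vocab = sorted(toy_vec.vocabulary_)
print(vocab)
print(toy_matrix.toarray())
```

Terms that appear in only one of the two URLs, such as com or results, are dropped by min_df, which is how the larger threshold of 400 keeps the real feature matrix manageable.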
Let’s build a first model and see how it scores. To build a model, we first need to create a train and validation set.
We will not randomly shuffle the data set into the train and validation set in this particular problem.
Randomly splitting the dataset into train and test and checking performance is not the right approach for this problem. Here’s why.
Let’s say we have a domain ecommons.cornell.edu. This is basically Cornell University’s digital repository, and it predominantly contains pages of the class thesis. Now, suppose this domain (ecommons.cornell.edu) and class (thesis) combination is contained in both train and test. In that case, the model can predict the class to be thesis just on the basis of the domain, but such a model would not be useful and would not generalize well to a new thesis from a different domain.
Well, let me explain that using a simple example. So let’s say here we have a subset of the data set. We have agelab.mit.edu as one of the domains and aac.asm.org as the other domain, and against both these domains, we have the following tags.
The Solution?
The train and test data split should be done based on the Domain-Tag combination, such that no 2 URLs for the same class and domain are kept in the train and test, respectively, because, in that case, the domain can be directly mapped to the tag and that would be a leakage.
This is a multi-class classification problem, and the metric we will use here is the weighted F1 score. It computes the F1 score of each class and averages them with weights proportional to the number of true samples of each class in the set being evaluated.
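As a small check of what the weighted average does, the sketch below computes the per-class F1 scores on made-up labels, weights them by class support by hand, and compares the result with scikit-learn’s average="weighted". The label arrays are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true and predicted labels for three classes
y_true = np.array([1, 1, 1, 1, 2, 2, 3, 3])
y_pred = np.array([1, 1, 1, 2, 2, 2, 3, 1])
labels = [1, 2, 3]

# F1 score of each class separately
per_class_f1 = f1_score(y_true, y_pred, average=None, labels=labels)

# Support = number of true samples of each class
support = np.array([(y_true == c).sum() for c in labels])

# Weighted F1 = support-weighted mean of the per-class scores
manual_weighted = np.sum(per_class_f1 * support) / support.sum()
sklearn_weighted = f1_score(y_true, y_pred, average="weighted")
print(manual_weighted, sklearn_weighted)
```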
Here we create a new variable called target_str, which is the domain and the tag joined by an underscore. We’ll use GroupKFold to create the train and validation sets, with the number of folds set to five.
The GroupKFold split will create five different folds of the complete dataset df.
# Replicate train/test split strategy for cross validation
df["target_str"] = df["Domain"].astype(str) + '_' + df["Tag"].astype(str)
# Build the folds once so that every model is scored on the same splits
cvlist = list(GroupKFold(n_splits=5).split(df, groups=df["target_str"]))
Here we pass the new variable target_str as the groups, so the GroupKFold split ensures that all URLs with the same Domain-Tag combination land in the same fold and never appear in both the train and validation sets.
df["target_str"].head()
cvlist
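The grouping behaviour can be checked on a toy example. In the sketch below, the group names are hypothetical Domain_Tag strings, and the loop records whether any group ends up on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical Domain_Tag groups for eight URLs
toy_groups = np.array([
    "a.edu_thesis", "a.edu_thesis", "b.org_news", "b.org_news",
    "c.com_forum", "c.com_forum", "d.net_others", "d.net_others",
])
toy_X = np.arange(len(toy_groups)).reshape(-1, 1)

shared_groups = []
for train_idx, valid_idx in GroupKFold(n_splits=4).split(toy_X, groups=toy_groups):
    # Groups that appear in both the train and the validation part of this fold
    overlap = set(toy_groups[train_idx]) & set(toy_groups[valid_idx])
    shared_groups.append(overlap)
    print(sorted(set(toy_groups[valid_idx])), "shared with train:", overlap)
```

Every overlap set should be empty, which is exactly the property that prevents the domain-to-tag leakage described above.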
Now, the variable X stores all the features created from the bag of words with the count vectorizer above.
X = Url_bow
TAG_DICT = {"others":1, "news": 2, "publication":3, "profile": 4,
"conferences": 5, "forum": 6, "clinicalTrials": 7,
"thesis": 8, "guidelines": 9}
df['target'] = df.Tag.map(TAG_DICT)
y = df["target"].values
In the lines of code above, we convert the strings in the target variable to numbers and store them in the variable y.
def cv_score(ml_model, features):
    """Fit ml_model on each GroupKFold split and return the weighted F1 per fold."""
    cv_scores = []
    X = features
    # Custom cross validation based on group KFold
    for i, (train_index, test_index) in enumerate(cvlist, start=1):
        print('\n{} of Group kfold {}'.format(i, len(cvlist)))
        xtr, xvl = X[train_index], X[test_index]
        ytr, yvl = y[train_index], y[test_index]
        # Fit the model on the training part of this fold
        model = ml_model
        model.fit(xtr, ytr)
        pred_probs = model.predict_proba(xvl)
        # Map the most probable column back to its class label
        label_preds = model.classes_[np.argmax(pred_probs, axis=1)]
        # Calculate the score for this fold and print it
        score = f1_score(yvl, label_preds, average="weighted")
        print("Weighted F1 Score: {}".format(score))
        # Save scores
        cv_scores.append(score)
    return cv_scores
This function takes in the model we want to build and the features we have. Let’s now build our first model, which is the multinomial Naive Bayes.
We pass the word features we created to the function cv_score.
cv_score(MultinomialNB(alpha=.01), Url_bow)
This gives us the scores for all five folds: 0.59 for the first fold, then 0.63, and so on. The highest score is 0.68.
Now we’ll create some more features, so let’s look at how we can create features using the characters.
Previously, we created bag-of-words features from each individual word, and we also grouped words into bigrams and trigrams, taking two and three words together. Now we can do the same for characters.
The scores so far are low. Since the URLs are not regular sentences, it would be a good idea to also build features using character n-grams.
For sequences of characters, the 3-grams that can be generated from “good morning” are “goo”, “ood”, “od ”, “d m”, “ mo”, “mor”, and so on.
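The character n-grams listed above can be reproduced with a sliding window over the string. This is a plain-Python sketch with a hypothetical helper name; it is not how CountVectorizer is implemented internally.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


trigrams = char_ngrams("good morning", 3)
print(trigrams)
```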
Let’s do that and check performance again.
# Word and character BOW on URLs
vec1 = CountVectorizer(analyzer='char', ngram_range=(1, 5), min_df=500)
vec2 = CountVectorizer(analyzer='word', ngram_range=(1, 3), min_df=400)
vec_bow = FeatureUnion([("char", vec1), ("word", vec2)])
vec_bow.fit(df['Url'])
Url_bow = vec_bow.transform(df['Url'])
These are our new features, built from both the characters and the words. Now we’ll again build the same multinomial Naive Bayes model and see if there is any improvement in the score.
cv_score(MultinomialNB(alpha=.01), Url_bow)
So the scores have improved. We have a score of 0.67 for the first fold, and the best is 0.72. We created the bag-of-words features using CountVectorizer, and we see a significant improvement from using the character n-grams. Now, let’s try the TF-IDF features.
We have a TF-IDF vectorizer here, and we’ll use it with both the character and the word analyzers. We’ll create the features with an n-gram range of 1 to 5 for characters and 1 to 3 for words.
# Word and character TFIDF on URLs
vec1 = TfidfVectorizer(analyzer='char', ngram_range=(1, 5), min_df=500)
vec2 = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=400)
vec_tfidf = FeatureUnion([("char", vec1), ("word", vec2)])
vec_tfidf.fit(df['Url'])
Url_tfidf = vec_tfidf.transform(df['Url'])
Again, we build the same model and see whether these new features help, that is, whether the score improves or stays the same.
nb = cv_score(MultinomialNB(alpha=.01), Url_tfidf)
So, the scores have again improved significantly. So far, we have built Naive Bayes models on the different feature sets that we created.
Let’s try a few different models. We are getting better performance from the TF-IDF features, so we keep them and try logistic regression next.
Here, I have set class_weight to balanced. This weights the samples inversely proportional to the frequency of their class, meaning the classes with fewer samples get more weight.
log_reg = cv_score(LogisticRegression(C=0.1,class_weight="balanced"), Url_tfidf)
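To illustrate what class_weight="balanced" does, the sketch below computes the weights by hand with the formula n_samples / (n_classes * count of each class) and compares them with scikit-learn’s compute_class_weight. The toy labels are made up and do not reflect the real class counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: class 1 is frequent, class 3 is rare
toy_y = np.array([1] * 6 + [2] * 3 + [3] * 1)
classes = np.unique(toy_y)

# n_samples / (n_classes * number of samples in each class)
counts = np.array([(toy_y == c).sum() for c in classes])
manual_weights = len(toy_y) / (len(classes) * counts)

sk_weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_y)
print(dict(zip(classes.tolist(), manual_weights.round(3).tolist())))
```

The rarest class receives the largest weight, so its errors count for more in the loss.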
So, in this case, the logistic regression model has actually outperformed our Naive Bayes models. For the first few folds, the score is similar to that of the Naive Bayes models, but for the other two folds it is higher, with a best score of 0.8.
Now, we will try tree-based methods to check performance. Here I’m building a decision tree model on the same set of features.
dtree = cv_score(DecisionTreeClassifier(min_samples_leaf=25,
min_samples_split=25), Url_tfidf)
Looks like the decision tree model hasn’t performed really well on the given data set and has not been able to classify the URLs very well.
Let’s try the random forest model next.
Let’s see whether a random forest model can classify our URLs better using the same features. We have set the number of estimators to 100 and the maximum depth to 50.
rf_params = {'random_state': 0, 'n_jobs': -1, 'n_estimators': 100,
             'max_depth': 50}
rf = cv_score(RandomForestClassifier(**rf_params), Url_tfidf)
The random forest has shown a similar performance to the decision tree model. Although the performance has improved slightly, we still cannot say that the random forest has performed well on this dataset.
So let’s compare the performance of all the models we have built so far.
results_df = pd.DataFrame({'Random Forest':rf, 'Decision Tree': dtree,
'Logistic Regression': log_reg, 'Naive Bayes':nb})
results_df.plot(y=["Random Forest", "Decision Tree","Logistic Regression",
"Naive Bayes"], kind="bar")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
Here, each model is shown in a different color, and the x-axis shows the fold number.
The y-axis shows the respective weighted F1 scores. We can clearly see that logistic regression has performed well in all the folds. The next best is Naive Bayes, followed by the random forest. We can also add more features to improve the model’s performance.
A web page is a document that is accessible through the World Wide Web and typically contains text, images, and other multimedia
content, along with interactive elements that allow users to interact with the page. Web pages play a central role in the World Wide Web and
are used for various purposes.
Classification of web pages is an important technique for web mining because a first step of web mining is sorting the web pages into different classes.
Web page classification uses machine learning algorithms like decision trees, random forests, support vector machines, and neural networks. These algorithms take as input features extracted from the web pages and learn to map these features to class labels based on a training dataset.
Webpage classification is a supervised learning problem that categorizes a webpage into predefined categories based on labeled training data. We observed these key points while building a webpage classifier with different machine learning algorithms below.
The accuracy of web page classification depends on the quality and diversity of the training data and the choice of the machine learning
algorithm. Word clouds can be used to gain insight into the common words used in the URLs of a set of web pages and can be useful for
understanding the most common themes or topics.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.