This article was published as a part of the blog.
In this article, we work with the Restaurant Reviews dataset, which contains customer reviews that are either positive or negative. We will build a machine learning model that combines a Count Vectorizer with a Support Vector Classifier (SVC), and use it to predict whether a given review is positive or negative.
Let’s start by looking into the dataset.
Here is the link for the dataset. You can download it and proceed.
https://drive.google.com/file/d/1TgqU0Q_wyEy250ed5xm3lAggYSKU71wN/view?usp=sharing
The dataset has two columns, Review and Liked. The Review column holds the text of each customer review. The Liked column is either 0 or 1: 1 indicates a positive review and 0 indicates a negative review.
We have to import some basic important libraries before working on the machine learning model.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Next, we create a data frame. Download the dataset from the link above and load it with pandas.
# Import the Restaurant Reviews dataset
df = pd.read_table(r"C:\Users\Admin\Downloads\Restaurant_Reviews.csv")
Between the quotation marks, paste the path of the Restaurant Reviews dataset on your computer. The data frame is then stored in the df variable.
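By default, pd.read_table splits columns on tabs, so the call above assumes the file is tab-separated even though its name ends in .csv. Here is a minimal sketch of that behaviour, using two made-up rows held in memory instead of the real file (the sample text and the name df_sample are illustrative assumptions):

```python
import io

import pandas as pd

# Two invented rows in the same layout as the dataset: Review <TAB> Liked
sample = (
    "Review\tLiked\n"
    "The pasta was wonderful.\t1\n"
    "Our order never arrived.\t0\n"
)

# read_table uses a tab as the default separator
df_sample = pd.read_table(io.StringIO(sample))

print(df_sample.columns.tolist())
print(df_sample.shape)
```

If your copy of the file turns out to be comma-separated, pd.read_csv would be the call to try instead.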
Let’s view it.
df
Displaying the data frame shows its first five and last five rows, along with the total number of rows and columns.
df.info()
The info() method summarizes the data frame: the number of columns, the column labels, the count of non-null entries, the data type of each column, and the memory usage.
Statistical Description:
The describe() method reports, for each numeric column, the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.
df.describe()
Let’s list the columns of df.
df.columns
Index(['Review', 'Liked'], dtype='object')
The nunique() method gives the number of unique values in a particular column.
df['Liked'].nunique()
2
The unique() method gives the unique values in a particular column.
print(df['Liked'].unique())
[1 0]
The value_counts() method gives the number of times each value occurs in that column.
df['Liked'].value_counts()
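The three methods above can be compared side by side on a small stand-in frame. This is only a sketch: the five rows below are invented, and the name toy is an assumption used for illustration.

```python
import pandas as pd

# Hypothetical toy frame with the same two columns as the dataset
toy = pd.DataFrame({
    "Review": ["Great food", "Cold soup", "Loved it", "Never again", "Tasty"],
    "Liked": [1, 0, 1, 0, 1],
})

n_classes = toy["Liked"].nunique()    # how many distinct labels exist
classes = toy["Liked"].unique()       # the distinct labels, in order of first appearance
counts = toy["Liked"].value_counts()  # how many rows carry each label

print(n_classes)
print(classes)
print(counts)
```

Checking value_counts() early is a quick way to see whether the two classes are roughly balanced before training.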
Let’s see the top 5 entries of the data frame.
df.head()
and similarly, the tail() method is used to view the last 5 entries of the data frame.
Visualizations
plt.figure(figsize=(8,5))
sns.countplot(x=df.Liked);
Here we used the seaborn library to visualize the Liked column. A count plot counts how many rows fall under each value of the column and draws one bar per value.
Here, x is the input feature we give to the model, and y is the output the model should predict. In our dataset, the Review column is the input and the Liked column is the value to predict.
x = df['Review'].values
y = df['Liked'].values
Next, we import train_test_split from the scikit-learn library and divide the data into four sets: x_train, x_test, y_train, and y_test. Both x and y are split into training and test sets.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
x_train.shape
(750,)
x_test.shape
(250,)
y_train.shape
(750,)
y_test.shape
(250,)
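The 750/250 shapes above follow from train_test_split's default of holding out 25% of the rows for testing. The sketch below reproduces that on stand-in data of 1000 rows (the arrays x_demo and y_demo are invented for illustration) and also shows the optional test_size and stratify parameters, which the article itself does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 1000 rows and alternating 0/1 labels
x_demo = np.array([f"review {i}" for i in range(1000)])
y_demo = np.array([i % 2 for i in range(1000)])

# Default split: 25% of the rows go to the test set
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, random_state=0)
print(x_tr.shape, x_te.shape)

# Optional: an explicit test size, with stratify keeping the 0/1 ratio
# the same in the training and test parts
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
print(x_tr2.shape, x_te2.shape, y_te2.sum())
```

Fixing random_state makes the split reproducible, so the reported accuracy does not change from run to run.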
From the scikit-learn library we import CountVectorizer and store an instance of it in a variable, here vect, with stop_words set to 'english'.
The count vectorizer turns each text into a vector of word counts, that is, how many times each vocabulary word appears in the text. Setting stop_words='english' drops common English words such as "the" and "was" before counting.
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english')
x_train_vect = vect.fit_transform(x_train)
x_test_vect = vect.transform(x_test)
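To make the word counting concrete, here is a small sketch on three invented sentences (train_texts, test_texts, and vect_demo are illustrative names, not part of the article's code). It shows that fit_transform learns the vocabulary from the training texts, while transform only reuses it, so a word never seen in training is ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["the pizza was tasty tasty", "the waiter was rude"]
test_texts = ["tasty burger"]  # "burger" does not occur in the training texts

vect_demo = CountVectorizer(stop_words="english")
train_counts = vect_demo.fit_transform(train_texts)  # learn vocabulary, then count
test_counts = vect_demo.transform(test_texts)        # count with the learned vocabulary

# Columns of the count matrix follow the vocabulary in alphabetical order
vocab = sorted(vect_demo.vocabulary_)
print(vocab)
print(train_counts.toarray())
print(test_counts.toarray())
```

The stop words "the" and "was" never reach the vocabulary, and "tasty" gets a count of 2 in the first sentence because it appears twice.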
Import the Support Vector Classifier (SVC) from scikit-learn's svm module and assign an instance to a variable called model.
from sklearn.svm import SVC
model = SVC()
The fit method trains the model. We pass it the vectorized training data and the training labels as arguments.
model.fit(x_train_vect,y_train)
Use the predict method to predict the test results, passing in the vectorized x values of the test set.
y_pred=model.predict(x_test_vect)
Scikit-learn's metrics module offers various ways to evaluate a machine learning model. For this support vector classifier (SVC), we use the accuracy score.
Import accuracy_score from sklearn.metrics and pass it the two arrays to compare: the true labels of the test set and the predicted labels.
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
0.792
For my model, the accuracy is 79.2%.
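Accuracy is simply the fraction of predictions that match the true labels. As a small sketch with invented label arrays (y_true_demo and y_pred_demo are assumptions for illustration, not results from the model above), the manual calculation and accuracy_score agree:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels: 6 of the 8 predictions match the true values
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

manual = np.mean(y_true_demo == y_pred_demo)         # share of matching positions
library = accuracy_score(y_true_demo, y_pred_demo)   # same quantity via scikit-learn

print(manual, library)
```

By the same reasoning, an accuracy of 0.792 on 250 test reviews corresponds to 198 correct predictions.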
Before using a pipeline in our model, let us understand a little about what it does. A pipeline chains several processing steps, such as transformers and a final model, so that they can be fitted and applied together as one object. The code below makes this clearer.
First, we will see without using the pipeline.
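from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()
# Fit every step on the training set (ytrain holds the training labels)
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
clf.fit(tfidfX, ytrain)
# Now apply the already fitted steps to the test set
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)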
With a pipeline, the same work takes far fewer lines of code. We pass the steps we want to use to Pipeline as a list of (name, step) pairs.
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
# Fit all steps on the training set (ytrain holds the training labels)
pipeline.fit(Xtrain, ytrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)
Now, coming to our model, let’s use a pipeline. Import make_pipeline from scikit-learn's pipeline module and pass CountVectorizer and SVC instances to it as arguments.
from sklearn.pipeline import make_pipeline
text_model = make_pipeline(CountVectorizer(), SVC())
As before, the fit method trains the model. Because the pipeline vectorizes the text itself, we pass it the raw training reviews.
text_model.fit(x_train,y_train)
Similarly, predict the results using the predict method.
y_pred=text_model.predict(x_test)
And the outcome will be,
y_pred
Let’s evaluate our new model using accuracy_score.
accuracy_score(y_test, y_pred)
0.792
The accuracy of the model is 79.2%.
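A pipeline does not change what is computed, only how the steps are packaged. The sketch below checks this on a handful of invented reviews (all the texts, labels, and *_demo names are assumptions for illustration): vectorizing and fitting by hand gives the same predictions as the equivalent make_pipeline model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented training and test reviews
train_texts = [
    "great food and friendly staff", "loved the dessert",
    "tasty pizza", "wonderful evening",
    "cold soup and rude waiter", "terrible service",
    "the burger was awful", "never coming back",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["friendly staff and tasty food", "awful soup", "rude and terrible"]

# Manual route: vectorize, then fit and predict
vect_demo = CountVectorizer()
clf_demo = SVC()
clf_demo.fit(vect_demo.fit_transform(train_texts), train_labels)
manual_pred = clf_demo.predict(vect_demo.transform(test_texts))

# Pipeline route: the same two steps wrapped in one object
pipe_demo = make_pipeline(CountVectorizer(), SVC())
pipe_demo.fit(train_texts, train_labels)
pipe_pred = pipe_demo.predict(test_texts)

same = (manual_pred == pipe_pred).all()
print(same)
```

The practical gain is that the pipeline accepts raw text directly, so there is no separate vectorizer object to keep in sync with the model.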
We can save the model with joblib. Import joblib and call its dump method with two arguments: the model and the name of the file to write.
import joblib
joblib.dump(text_model, 'Project')
To use the model again, call the load method with the name of the file that was written by dump and assign the result to a variable.
import joblib
text_model = joblib.load('Project')
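As a quick check that saving and loading preserve the model, here is a minimal round-trip sketch. It trains a tiny pipeline on invented reviews and writes it to a temporary directory. The file name demo_model.joblib and the variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny stand-in model trained on invented reviews
demo_model = make_pipeline(CountVectorizer(), SVC())
demo_model.fit(
    ["great food", "awful service", "loved it", "never again"],
    [1, 0, 1, 0],
)

# Save to a temporary file, then load it back under a new name
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model should predict exactly like the original
samples = ["great service", "awful food"]
round_trip_ok = (demo_model.predict(samples) == restored.predict(samples)).all()
print(round_trip_ok)
```

The file name passed to load has to match the one passed to dump, otherwise joblib cannot find the saved model.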
Now our model is well trained and ready for implementation. Let us try with some examples.
text_model.predict(['hello!!Love Your Food'])
array([1], dtype=int64)
Here the review is a positive review and as expected our model predicted 1 for it which means positive.
Let’s try with a negative review and see what it will predict.
text_model.predict(["omg!!it was too spice and i asked you don't add too much "])
array([0], dtype=int64)
As expected it gave 0 as output which means negative.
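Note that predict expects a list of reviews rather than a single string, and it returns 0 or 1 instead of a readable label. A small helper can handle both points. This is only a sketch: label_reviews and LABELS are hypothetical names, and the tiny model here is trained on invented reviews so that the example runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Mapping from the Liked values to readable labels
LABELS = {0: "negative", 1: "positive"}


def label_reviews(model, reviews):
    """Return (review, label) pairs for a list of review strings."""
    predictions = model.predict(list(reviews))
    return [(text, LABELS[int(pred)]) for text, pred in zip(reviews, predictions)]


# Stand-in model trained on a few invented reviews
demo_model = make_pipeline(CountVectorizer(), SVC())
demo_model.fit(
    ["great food", "loved the pizza", "awful service", "never again"],
    [1, 1, 0, 0],
)

results = label_reviews(demo_model, ["the food was great", "awful, never again"])
for text, label in results:
    print(f"{label}: {text}")
```

With the article's own model, the same helper could be called as label_reviews(text_model, [...]).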
We have learned how to work with a support vector classifier and a count vectorizer, and how to combine the two in a single model using a pipeline. The resulting model predicts whether a review is positive or negative, which we checked with a few examples. We also saved the model with joblib and loaded it back for reuse.
Hope you guys found this article on restaurant reviews analysis useful. Share your views in the comments sections. Read more articles on our blog.
Connect with me on LinkedIn: https://www.linkedin.com/in/amrutha-k-6335231a6vl/
Thank you!
The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.