This article focuses primarily on feature engineering for text data. Along the way, we will go over the following techniques and processes:
Lemmatization / Stemming
Count Vectorizer
One Hot Encoding
Train Test Split
Principal Component Analysis
Some general text cleaning and null value imputation techniques
Exploratory Data Analysis
Latent Dirichlet Allocation (LDA)
Please note that the techniques used in our process were settled on after much trial and error. Whether these or similar techniques are appropriate will always depend on the use case.
The evaluation metric for the algorithms is the F1 score, as we want to balance precision and recall.
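As a quick illustration of how the F1 score balances precision and recall, here is a minimal sketch that computes it from raw counts. The function name `f1_from_counts` and the counts themselves are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 60/80 = 0.75, recall = 60/100 = 0.60
score = f1_from_counts(tp=60, fp=20, fn=40)
print(round(score, 3))
```

Because F1 is a harmonic mean, it is pulled towards the lower of the two values, so a model cannot score well by maximizing only one of them.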
Let’s start out by understanding a little about the dataset used within this process. The dataset was sourced from the following link, through Kaggle.
https://www.kaggle.com/deepcontractor/supreme-court-judgment-prediction
<
import pandas as pd

# Load the Kaggle CSV and drop every row that contains a missing value
df = pd.read_csv('../input/supreme-court-judgment-prediction/justice.csv', delimiter=',', encoding="utf8")
df.dropna(inplace=True)
df.head()
>
The CSV file contains 3,303 rows and 16 columns. The target column is first_party_winner.
The columns Unnamed: 0, docket, name, first_party, second_party, issue_area, facts_len, majority_vote, minority_vote, href, ID and term were dropped because they contribute little to predicting the target variable.
The remaining independent variables are decision_type, disposition and facts.
Missing values were dropped using .dropna(). Fewer than 5% of the values were null, so the affected rows were dropped directly, without the need for imputation.
<
# Drop the columns that are not used as features
df.drop(columns=['Unnamed: 0', 'docket', 'name', 'first_party', 'second_party',
                 'issue_area', 'facts_len', 'majority_vote', 'minority_vote',
                 'href', 'ID', 'term'], inplace=True)
>
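The null-value check described above can be sketched on a toy frame. The data, the `THRESHOLD` constant and the variable names below are assumptions made for illustration; only the 5% figure comes from the text.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "facts": ["some text"] * 39 + [None],
    "disposition": ["affirmed"] * 40,
})

# Percentage of missing values per column
null_pct_per_column = toy.isna().mean() * 100
# Percentage of rows that contain at least one missing value
rows_with_null_pct = toy.isna().any(axis=1).mean() * 100

THRESHOLD = 5.0  # hypothetical cut-off, mirroring the 5% mentioned in the text
if rows_with_null_pct < THRESHOLD:
    cleaned = toy.dropna()  # few nulls: dropping loses little data
else:
    cleaned = toy           # many nulls: imputation would be worth considering

print(null_pct_per_column)
print(len(toy), "rows before,", len(cleaned), "rows after")
```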
We separate the dataset into the target variable and two groups of independent variables: one (df_cat) that requires one-hot encoding to be machine-readable, and another (df_nlp) that holds text data which needs to be cleaned before features can be engineered from it.
<
df_cat = df[['decision_type', 'disposition']]   # categorical features
df_target = df['first_party_winner']            # target
df_nlp = df['facts']                            # free-text feature
>
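To make the one-hot encoding that the categorical group needs more concrete, here is a small sketch on a toy column. The category values are illustrative and not necessarily the ones present in the dataset.

```python
import pandas as pd

# Toy categorical column (values are illustrative only)
toy_cat = pd.DataFrame(
    {"decision_type": ["majority opinion", "per curiam", "majority opinion"]}
)

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy_cat["decision_type"], dtype=int)
print(encoded)
```

Each row has exactly one 1, in the column that matches its original category.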
Next, we reset the indices to avoid NaN values during concatenation. Resetting the indices also enables us to perform one-hot encoding on categorical data without raising errors.
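<
# Assign the result instead of using inplace=True, which can trigger a
# SettingWithCopyWarning because df_cat is a slice of df
df_cat = df_cat.reset_index(drop=True)
df_target = df_target.reset_index(drop=True)
df_nlp = df_nlp.reset_index(drop=True)
>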
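The reason for resetting the indices can be seen on a toy example. After rows are dropped, the surviving rows keep their original labels, while a freshly built frame gets a default 0..n-1 index, so concatenating the two side by side misaligns them. The data below is made up for illustration.

```python
import pandas as pd

# After dropna(), surviving rows keep their original labels (here 0, 2, 3)
features = pd.Series(["a", "b", "c"], index=[0, 2, 3], name="facts")
# A freshly built frame gets a default 0..n-1 index
target = pd.DataFrame({"first_party_winner": [1, 0, 1]})

# Default outer join: rows are matched by label, leaving NaN where labels differ
misaligned = pd.concat([features, target], axis=1)
# Inner join: unmatched rows are silently dropped instead
inner = pd.concat([features, target], axis=1, join="inner")
# Resetting the index first lines the rows up by position
aligned = pd.concat([features.reset_index(drop=True), target], axis=1)

print(misaligned)
print(inner)
print(aligned)
```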
We begin by label encoding our target column ‘first_party_winner’, converting its values from True or False to 1 and 0 respectively.
<
from sklearn import preprocessing

# Encode the boolean target as integers (False becomes 0, True becomes 1)
label_encoder = preprocessing.LabelEncoder()
df_target = label_encoder.fit_transform(df_target)
df_target1 = pd.DataFrame(df_target, columns=['first_party_winner'])
df_target1
>
Next, we perform feature engineering on the ‘facts’ column.
<
# Remove anything enclosed in angle brackets (HTML-style tags) from the facts text
df_nlp1 = pd.DataFrame(df_nlp, columns=['facts'])
df_nlp1['facts'] = df_nlp1['facts'].str.replace(r'<[^<>]*>', '', regex=True)
df_nlp1
>
The code above performs an initial cleaning of our ‘facts’ feature by removing anything enclosed in angle brackets, such as HTML tags.
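To see what that regular expression does, here is a small standalone sketch using Python's `re` module on a made-up sample string. The pattern is the same as in the article; the constant name `TAG_PATTERN` and the sample are illustrative.

```python
import re

# Same pattern as in the article: "<", then any run of non-bracket characters, then ">"
TAG_PATTERN = re.compile(r"<[^<>]*>")

sample = "<p>The petitioner <i>appealed</i> the ruling.</p>\n"
cleaned = TAG_PATTERN.sub("", sample).strip()
print(cleaned)
```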
Next, we tokenize our corpus and define a function that cleans our text further using regex and applies lemmatization. Remember that you should run either stemming or lemmatization on your data, never both.
<
import re
import nltk

# Requires the NLTK tokenizer, stopwords and WordNet resources (see nltk.download)
corpus = df_nlp1["facts"]
lst_tokens = nltk.tokenize.word_tokenize(corpus.str.cat(sep=" "))
ps = nltk.stem.porter.PorterStemmer()
lem = nltk.stem.wordnet.WordNetLemmatizer()
lst_stopwords = nltk.corpus.stopwords.words("english")

def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    ## clean (convert to lowercase, remove punctuation and special characters, then strip)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())
    ## Tokenize (convert from string to list)
    lst_text = text.split()
    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in lst_stopwords]
    ## Stemming (remove -ing, -ly, ...)
    if flg_stemm:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]
    ## Lemmatisation (convert the word into root word)
    if flg_lemm:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]
    ## back to string from list
    text = " ".join(lst_text)
    return text

df_nlp1["facts_clean"] = df_nlp1["facts"].apply(
    lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=lst_stopwords))
>
The following code snippets let us visualise words and their frequencies across our whole dataset. We also filter our data by the value of the target variable to draw insights from the shape and distribution of our dataset. Finally, we judge how well an LDA (Latent Dirichlet Allocation) model can discriminate between different topics within our dataset. Note that the code should be run once with y = 0 and once with y = 1 to visualise both values of the binary target variable.
<
import gensim
import matplotlib.pyplot as plt
import seaborn as sns
import wordcloud

df_nlp2 = pd.concat([df_nlp1, df_target1['first_party_winner']], axis=1, join='inner')

## word frequencies for one class (run once with y = 0 and once with y = 1)
y = 1
corpus = df_nlp2[df_nlp2["first_party_winner"] == y]["facts_clean"]
lst_tokens = nltk.tokenize.word_tokenize(corpus.str.cat(sep=" "))
fig, ax = plt.subplots(nrows=2, ncols=1)
fig.suptitle("Most frequent words", fontsize=15)
#figure(figsize=(30, 24))

## unigrams
dic_words_freq = nltk.FreqDist(lst_tokens)
dtf_uni = pd.DataFrame(dic_words_freq.most_common(), columns=["Word", "Freq"])
dtf_uni.set_index("Word").iloc[:10, :].sort_values(by="Freq").plot(
    kind="barh", title="Unigrams", ax=ax[0], legend=False).grid(axis='x')
ax[0].set(ylabel=None)

## bigrams
dic_words_freq = nltk.FreqDist(nltk.ngrams(lst_tokens, 2))
dtf_bi = pd.DataFrame(dic_words_freq.most_common(), columns=["Word", "Freq"])
dtf_bi["Word"] = dtf_bi["Word"].apply(lambda x: " ".join(string for string in x))
dtf_bi.set_index("Word").iloc[:10, :].sort_values(by="Freq").plot(
    kind="barh", title="Bigrams", ax=ax[1], legend=False).grid(axis='x')
ax[1].set(ylabel=None)
plt.show()

## word cloud
wc = wordcloud.WordCloud(background_color='black', max_words=100, max_font_size=35)
## join the documents into one string; str(corpus) would only pass the truncated Series printout
wc = wc.generate(corpus.str.cat(sep=" "))
fig = plt.figure(num=1)
plt.axis('off')
plt.imshow(wc, cmap=None)
plt.show()

## topic modelling
y = 1
corpus = df_nlp2[df_nlp2["first_party_winner"] == y]["facts_clean"]
## pre-process corpus (group each document into non-overlapping word pairs)
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i + 2]) for i in range(0, len(lst_words), 2)]
    lst_corpus.append(lst_grams)
## map words to an id
id2word = gensim.corpora.Dictionary(lst_corpus)
## create dictionary word:freq
dic_corpus = [id2word.doc2bow(word) for word in lst_corpus]
## train LDA
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=dic_corpus, id2word=id2word, num_topics=7, random_state=123,
    update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True)
## output (top terms of the first three topics)
lst_dics = []
for i in range(0, 3):
    lst_tuples = lda_model.get_topic_terms(i)
    for tupla in lst_tuples:
        lst_dics.append({"topic": i, "id": tupla[0], "word": id2word[tupla[0]], "weight": tupla[1]})
dtf_topics = pd.DataFrame(lst_dics, columns=['topic', 'id', 'word', 'weight'])
## plot
fig, ax = plt.subplots()
sns.barplot(y="word", x="weight", hue="topic", data=dtf_topics, dodge=False, ax=ax).set_title('Main Topics')
ax.set(ylabel="", xlabel="Word Importance")
plt.show()
>
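The unigram and bigram counts behind those bar charts can also be sketched with the standard library alone, which may help clarify what is being counted. The helper `ngram_counts` and the toy sentence are hypothetical; the article itself relies on `nltk.FreqDist` and `nltk.ngrams`.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams in a list of tokens."""
    # zip over n shifted copies of the token list to form sliding windows
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


tokens = "court held court held state law".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

print(unigrams.most_common(2))
print(bigrams.most_common(1))
```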
We then use CountVectorizer to streamline the process of vectorizing our facts column. Then, we predict results using different machine learning models.
Before we make our predictions, however, we must perform train_test_split on our data so that the model can be evaluated on data it has not seen during training. Skipping this step would give us a very high accuracy on the given data while telling us nothing about how the model performs on unseen data.
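To illustrate what count vectorization produces, here is a minimal sketch on two made-up sentences. It uses scikit-learn's `CountVectorizer` with default settings, as in the article; the sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (made up for illustration)
docs = [
    "court affirmed the ruling",
    "court reversed the ruling of the court",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document
vocab = vectorizer.vocabulary_           # maps each word to its column index
dense = counts.toarray()

print(sorted(vocab))
print(dense)
```

Each column corresponds to one word of the vocabulary and each cell holds how often that word occurs in the document, which is why a large corpus leads to a very wide matrix.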
<
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

vectorizer = CountVectorizer()
xfeatures = df_nlp2['facts_clean']
ylabel = df_nlp2['first_party_winner']
# Hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(xfeatures, ylabel, test_size=0.25)
>
After performing the train-test split, we fit our pipeline on several models. The snippet shows three of them, namely Logistic Regression, Random Forest and KNeighbors; results for XGBoost are also reported, although its code is not shown.
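<
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Logistic Regression
pipe = Pipeline(steps=[('cv', CountVectorizer()), ('lr', LogisticRegression(solver='liblinear'))])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

# Random Forest
pipe1 = Pipeline(steps=[('cv', CountVectorizer()), ('rf', RandomForestClassifier())])
pipe1.fit(X_train, y_train)
pipe1.score(X_test, y_test)

# K-Nearest Neighbors (step renamed from 'rf' to 'knn' to match the model)
pipe2 = Pipeline(steps=[('cv', CountVectorizer()), ('knn', KNeighborsClassifier(n_neighbors=3))])
pipe2.fit(X_train, y_train)
pipe2.score(X_test, y_test)
>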
Logistic Regression Fit Summary: This model reaches an accuracy of 54%, which is very weak and only slightly better than random guessing.
XGBoost Fit Summary: This model reaches an accuracy of 63%, which reinforces the assumption that logistic regression is not capable of capturing the trend in the data properly. This model is better, but still quite weak.
KNN Fit Summary: This model reaches an accuracy of 59%, which is again quite weak but still able to capture more trend than logistic regression.
Random Forest Fit Summary: This model reaches an accuracy of 64%. This is the best response we get from among the chosen models.
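After fitting all these algorithms, we find that Random Forest gives us the best result, with an accuracy of 64%. Note that the figures reported here are accuracies, which is what the score method of a classifier pipeline returns, rather than F1 scores.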
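Since the F1 score is the stated evaluation metric, it may help to see how it can be computed next to accuracy for a fitted pipeline. The sketch below uses made-up toy sentences and labels, not the article's data, and only shows the mechanics: `score` on a classifier pipeline returns mean accuracy, while `f1_score` from `sklearn.metrics` is applied to the predictions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import Pipeline

# Made-up toy data, for illustration only
train_texts = ["court affirmed ruling", "court reversed ruling",
               "judgment affirmed", "judgment reversed"]
train_labels = [1, 0, 1, 0]
test_texts = ["ruling affirmed", "ruling reversed"]
test_labels = [1, 0]

pipe = Pipeline(steps=[("cv", CountVectorizer()),
                       ("lr", LogisticRegression(solver="liblinear"))])
pipe.fit(train_texts, train_labels)

pred = pipe.predict(test_texts)
acc = accuracy_score(test_labels, pred)  # same value as pipe.score(test_texts, test_labels)
f1 = f1_score(test_labels, pred)         # F1 for the positive class (label 1)

print("accuracy:", acc)
print("F1:", f1)
```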
Now that we have baseline accuracies, we will add back the one-hot encoded versions of the disposition and decision_type columns. This gives us a total of 20,375 columns (including the one-hot encoded columns and the CountVectorizer columns). While the computation time for the vectorized columns alone was not particularly high, such a wide dataset is a natural candidate for dimensionality reduction. Therefore, we apply LDA (Latent Dirichlet Allocation) and reduce our columns down to 200.
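<
from sklearn.decomposition import LatentDirichletAllocation

# One-hot encode the two categorical columns and attach the target
df_cat1 = pd.get_dummies(df_cat['decision_type'])
df_cat2 = pd.get_dummies(df_cat['disposition'])
df_cat3 = pd.concat([df_cat2, df_cat1], axis=1, join='inner')
# df_nlp2 holds both 'facts_clean' and 'first_party_winner'; the name df_nl1
# used here originally is never defined and is assumed to mean df_nlp2
df_cat3 = pd.concat([df_cat3, df_nlp2['first_party_winner']], axis=1, join='inner')

# Bag-of-words counts for the cleaned facts
vectorize = CountVectorizer()
count_matrix = vectorize.fit_transform(df_nlp2['facts_clean'])
count_array = count_matrix.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
data_final = pd.DataFrame(data=count_array, columns=vectorize.get_feature_names_out())
data_final = pd.concat([data_final, df_cat3], axis=1, join='inner')

# Assumption: the combined frame is split again before LDA (this step is not
# shown in the original, which reuses X_train and X_test)
X = data_final.drop(columns=['first_party_winner'])
y = data_final['first_party_winner']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Reduce the wide feature matrix to 200 topic weights per document
lda = LatentDirichletAllocation(n_components=200, random_state=0)
lda_data = lda.fit_transform(X_train)
lda_data_train = pd.DataFrame(data=lda_data)
lda_data_test = pd.DataFrame(data=lda.transform(X_test))
>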
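To show what this kind of reduction produces, here is a toy-sized sketch of Latent Dirichlet Allocation on a count matrix. The documents and the choice of two topics are assumptions for illustration; the article uses 200 components on the full feature matrix.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Four made-up documents
docs = [
    "court affirmed lower court ruling",
    "court reversed lower court ruling",
    "tax statute applied to company income",
    "company challenged tax statute",
]

counts = CountVectorizer().fit_transform(docs)

# Two topics instead of 200, purely to keep the example small
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_weights = lda.fit_transform(counts)

print(counts.shape, "->", topic_weights.shape)
print(np.round(topic_weights, 2))
```

Each document is now described by one weight per topic rather than one count per word, and the weights in each row sum to 1.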
Finally, we fit the above algorithms again to score our final models after one-hot encoded data has been included and LDA has been performed.
Logistic Regression Fit Summary: This model reaches an accuracy of 58%, which is very weak and only slightly better than random guessing.
XGBoost Fit Summary: This model reaches an accuracy of 65%, which reinforces the assumption that logistic regression is not capable of capturing the trend in the data properly. This model is better, but still quite weak.
KNN Fit Summary: This model reaches an accuracy of 63%, which is again quite weak but still able to capture more trends than logistic regression.
Random Forest Fit Summary: This model reaches an accuracy of 67%. This is the best response we get from among the chosen models.
Yet again, Random Forest performed the best amongst all the algorithms with an accuracy of 67%.
But let’s see if we can increase this accuracy by tuning a few hyperparameters of the Random Forest. The problem is that an exhaustive search would have to try every possible combination of parameter values to find the optimum.
To keep the computation manageable, we give GridSearchCV a limited range of values for each parameter and let it search over those combinations only.
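A grid search over a restricted set of values might look like the sketch below. The synthetic data from `make_classification`, the parameter grid and the choice of `scoring="f1"` are all assumptions made for illustration; the grid actually searched in the article is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the reduced feature matrix
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

# Hypothetical grid: 2 x 2 = 4 combinations, each evaluated with 3-fold cross-validation
param_grid = {
    "max_depth": [2, 4],
    "n_estimators": [10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The cost grows with the product of the number of values per parameter, which is why limiting each range keeps the search tractable.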
The best parameters given by GridSearchCV are:
max_depth= 8, max_features = 100, min_samples_leaf = 2, n_estimators = 200
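<
from sklearn.ensemble import RandomForestClassifier

rand = RandomForestClassifier(max_depth=8, max_features=100, min_samples_leaf=2, n_estimators=200)
rand.fit(lda_data_train, y_train)
# Accuracy on the training data (optimistic, since the model has seen these rows)
rand.score(lda_data_train, y_train)
# Accuracy on the held-out test data, which is the fairer estimate
rand.score(lda_data_test, y_test)
>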
Here we fit another Random Forest model using this combination of parameters.
The overall accuracy increased by 3%.
Thank you for taking the time to read through our process. We hope you could take something away to enhance your learning and help enable your own process.