Real-World Machine Learning Case Study: Clustering Transactions Based on Text Descriptions

Guest Blog Last Updated : 12 Jul, 2020

5 min read

Introduction

We are living in the era of digital technologies. When was the last time you walked into a shop that didn’t have a PayTM or BHIM UPI? These digital transaction technologies have quickly become a key part of our daily lives.

And not just at an individual level, these digital technologies are at the core of every financial institution. Executing a payment transaction or fund transfer has become very smooth with multiple possible options (like internet banking, ATM, credit or debit cards, UPI, POS Machines, etc.) having reliable systems running at the backend.

For every transaction we make, there would be an appropriate description message generated against it, like this:

In this article, we’ll talk about a real-world use case of a financial institution using clustering (a popular machine learning algorithm) to customize their product offerings for its customer base.

Motivation Behind this Case Study

As a financial institution, it’s always important to engage the existing customer base with customized offers based on their varying interests. It is a significant challenge for any financial institution to capture the ideal 360-degree view of a customer.

Social media platforms like Twitter, WhatsApp, Facebook, etc. have become primary sources of information for profiling a customer’s interests and preferences. A financial institution often incurs huge costs for availing data from third party sources. Even then, it becomes very difficult to map a social media account to a unique customer.

So how do we solve this?

A partial solution to the above problem can be addressed by using in-house transaction data available with the institution.

We can cluster the transactions performed by a customer into discrete categories based on the transaction description message. This approach can be used to flag whether a transaction was performed for Food, Sports, Clothes, Bill Payment, Household, Others etc. If a customer has most of the transactions appearing in a particular category, then we can have a better estimate of his/her preferences.

Here’s the Approach We Took

Let’s understand how we approached this problem statement and the key steps we took to figure out a solution.

Determining the number of Topics

We start the process with all transactions with their description messages mapped to each customer. To start with, we have an important task of finalizing the number of clusters (or) categories (or) topics. To achieve this objective, we use Topic Modelling.

Topic Modelling is a method for unsupervised classification of documents which finds natural groups of items even when we’re not sure what we are looking for. It mostly uses Latent Dirichlet Allocation (LDA) for fitting a topic model.

It treats each document (i.e. Transaction) as a mixture of topics, and each topic as a mixture of words. Here’s an example: the word budget might occur in the topics movie and in politics. The underlying assumption of this LDA is that every observation in the sample comes from an arbitrary unknown distribution that can be explained by a generative statistical model.

Let us view this methodology to address our problem. There exists a generative statistical model that has generated all the words in the transaction descriptions which came from unknown arbitrary distribution (i.e. unknown groups or topics). We try to estimate/build a statistical model so that it predicts the probability of a word belonging to a particular topic.

Topic Coherence

We have fixed the total number of topics by manually looking at the top keywords across topics. This might be slightly inconsistent, and we need a subjective way to assess the correct number of topics. We use the Topic Coherence measure to identify the correct number of topics.

Topic coherence is applied to the top N words from the topic. It is defined as the average/median of the pairwise word-similarity scores of the words in the topic. A good model will generate coherent topics, i.e., topics with high topic coherence scores.

Good topics are topics that can be described by a short label; therefore, this is what the topic coherence measure captures.

Time for Clustering!

We have fixed the total number of topics/clusters now (i.e. 7 topics in our case). We should start assigning each of these transaction description messages into topics. Topic modelling alone might not yield accurate results in assigning a document to a topic.

Here, we use the output of topic modeling along with a few more features to cluster transaction description messages using K-Means clustering. Here, we’ll concentrate on building a feature set for K-Means clustering.

Features:

Basic Features
- Words count, Digits count, Special symbol count
- Longest digits sequence length, digits-character ratio
- Average, Maximum word lengths etc.
- Week, Day and Month of Transaction, is date present, is weekend transaction, etc.
- Transaction performed in the last 5 days or First 5 days of the month
- Public holiday and festival transactions, etc.
Lookup Features – Top brands in the industry & common nouns are used as lookup names. Count the number of words in the transaction description related to a particular industry.
- - Food: Vegetables, Dominos, FreshDirect, Subway, etc.
  - Sports: Baseball, Adidas, soccer, cleats etc.
  - Health: Pharmacy, Hospital, Gym, etc.
  - Bill & EMI: Policy, power, statement, schedule, withdrawal, phone, etc.
  - Entertainment: Netflix, Prime shows, Spotify, Soundcloud, Bar
  - E-Commerce: Amazon, Walmart, eBay, Ticketmaster, etc.
Others: Uber, Airbus, packagers etc.
Topic Modelling Features
- Perform Topic modelling on DTM Matrices generated using TF-IDF measure for unigrams & bigrams. We get 2 sets of 7 different probabilities for every topic for both unigram and bigram DTM Matrices for a transaction description

Final Thoughts

Around 30 features are made for every transaction description and we perform K-Means clustering to assign each transaction description to one of the 7 Clusters.

Results show that observations near to the cluster centers are mostly labeled with the correct topic. Few observations far from cluster centers have been assigned the wrong topic label. Out of manually reviewed 350 transaction descriptions, around 240 (~69% Accuracy) transaction descriptions are correctly labeled with the appropriate topic.

Now we have at least a basic estimate of the in-house customers’ preferences and interests. We can send customized offers and options to keep them engaged and improve business.

Though the approach of using a topic model is relatively novel, the approach of using transactions for classifying interests of customers has been in use mostly by credit card issuers. For instance, American Express has been using this approach for creating interest graphs for its customers. Such interest graphs not only categorize transactions into major groups like food, travel, etc. but also creates micro-segments like Thai Food fans, Wildlife enthusiasts, etc. And all these from the wealth of the transaction data only!

About the Author

Ravindra Reddy Tamma – Data Scientist (Actify Data Labs)

Ravi is a machine learning expert at Actify Data Labs. His expertise spans across credit risk analytics, application fraud modelling, OCR, text mining and deployment of models as APIs. He has worked extensively with lenders for developing application, behaviour, and collection scorecards.

Ravi has also developed a national-level application fraud model for unsecured lending in India using unstructured credit bureau header information. In addition to credit risk, Ravi has deep expertise in OCR, image analytics and text mining. Ravi also brings-in deep expertise in automating production data pipelines and deployment of machine learning models as scalable APIs.

Guest Blog

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Introduction to NLP

Text Pre-processing

NLP Libraries

Regular Expressions

String Similarity

Spelling Correction

Topic Modeling

Text Representation

Information Retrieval System

Word Vectors

Word Senses

Dependency Parsing

Language Modeling

Getting Started with RNN

Different Variants of RNN

Machine Translation and Attention

Self Attention and Transformers

Transfomers and Pretraining

Question Answering

Text Summarization

Named Entity Recognition

Coreference Resolution

Audio Data

ASR

Audio Separation

Chatbot

Auto NLP

Real-World Machine Learning Case Study: Clustering Transactions Based on Text Descriptions

Introduction

Motivation Behind this Case Study

Here’s the Approach We Took

Determining the number of Topics

Topic Coherence

Time for Clustering!

Features:

Final Thoughts

About the Author

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics