Overview on Analytics Problem
Analytics Vidhya has long been at the forefront of imparting data science knowledge to its community. To make learning data science more engaging, we launched a new initiative: “DataHour”.
DataHour is a series of webinars by top industry experts where they teach and democratize data science knowledge. On 30th May 2022, we were joined by Amitayu Roy for a DataHour session on “Traversing the Journey of an Analytics Problem.”
Amitayu Ray is a Data Science leader with 16 years of experience in analytics consulting, AI solution design, client management, and business development across industry practice areas. He helps telecom and media giants around the world realize value from large-scale AI/ML implementations and empowers organizations with out-of-the-box AI solutions to solve modern, constantly evolving problems.
Amitayu currently works as a Senior Manager in Applied Intelligence, Strategy and Consulting at Accenture. He leads their analytics consulting, data practice, and AI and machine learning enablement for the North America geography.
Are you excited to dive deeper into the world of Data Science and Machine Learning? We’ve got you covered. Let’s get started with the major highlights of this session: Traversing the Journey of an Analytics Problem.
Introduction
With this session, you’ll learn:
What is the objective of this session? This session is a data science journey told as a story: it takes a problem from its raw business form to an industrialized, value-based solution that you deliver to the business.
We focus on the kind of value we realize from an analytics problem and on the processes associated with a typical analytics problem and its solution. The session covers the evolution of data science over the years and where we are heading, shows how to translate a business problem into an analytical form, and outlines the key responsibility areas for the different roles that exist today and are still evolving.
What does this session not cover? We are not doing a deep dive into AI/ML algorithms. This is not a technical session on data engineering or on big data cloud platforms. It is a process-oriented session to help you understand the end-to-end journey and where exactly your work fits in. We are not going to talk about model performance parameters, feature engineering, etc.; we stay at a high level without getting into the details. We are also not going to talk about the agile approach to delivering AI/ML projects.
Some foundational Updates from Data Science Industry
This is important because, to understand a journey, you need some basic foundational updates. For example, this session had a group of 250 individuals. If we plotted them on a maturity curve, some would already be advanced in analytics while others are just starting out. So the session was kept as generic as possible, and the presenter narrated the journey in a manner that is clear to everyone.
How Has Data Science Evolved over the Years?
So, let’s have a quick roundup of how data science has evolved over the years.
First Stage
In the 90s, Analytics 1.0 emerged. This is when BI emerged for the first time and data started becoming a very important asset. But the work was mostly done in Excel, VBA, etc. From a business perspective, whoever consumed the output was just looking at manual reports: somebody opened an Excel file, built some pivots, and so on.
Second Stage
In the early 2000s, on the data side we looked at data warehouses and ETL systems. On the reporting side we looked at dashboards, Tableau, and Power BI. The most dominant systems at that point were SAS and SPSS. The statistical models in use at that time were regression models, time series, and clustering. Analytics was consumed through reports and dashboards that were semi-automated. The models were run manually, and even feature engineering was done manually.
Third Stage
Here people started investing in building stronger data warehouses. As a result, Analytics 3.0 emerged in the 2010s. By this time people had already realized that data is an important asset, so big data had started to emerge and Hadoop had already come into the picture. Dashboards and Power BI were mostly automated by now.
ML models came into the picture for the first time in the late 2000s and eventually became a stronger alternative to SAS. From here the building of high-end ML models started. But the models themselves were not fully automated: you had to manually run models, check the results, the ROC curves, and the performance metrics, do model validations, etc. It was a fully manual check.
Fourth/Present Stage
Now we are in Analytics 4.0, or more precisely between 4.0 and 5.0. This stage began around 2015. The advancement of big data transformed the way we manage and compute data. All the big players, such as Amazon, Google, and Microsoft, started investing in integrated cloud platforms. They realized the tremendous profit potential of a cloud platform service that can do end-to-end data management, reporting, visualization, modeling, model implementation, etc.
Web-based and automated reports, deep learning, AI, NLP, TensorFlow, simulation models, and AutoML became very common. Everybody started thinking about productionizing analytics models. The MLOps world came into existence around 2018.
Future Stage
Analytics 5.0 is where we are heading. 2025 is when we are possibly going to reach a very different dimension of analytics. By then, quantum computing and big data on cloud will become normal for most organizations, and visualizations will mostly become AR/VR visualizations, although Tableau and Power BI will continue to exist alongside them. We’ll have production-ready AI implementations, and all new models will be industrialized end to end, i.e. automated end to end.
The “Must Have” AI knowledge Assets for tomorrow
The must-have and good-to-have skills are:
- Your ability to solve problems
- SQL and the fundamentals of querying your data
- Fundamentals of mathematics and statistical deduction
- The basic principles of computational algorithms
On top of these foundational skills, since this is a fast-evolving industry, the skills for the next three to five years are:
- Changing big data engineering frameworks: This is going to be a critical skill that will probably play a big role.
- Knowledge of cloud platform architecture: With AWS, Google Cloud, and Azure, almost an entire suite of analytics products is available on the cloud.
- Industry expertise: Build your capability in one domain.
- Explainable AI and ML methodologies: For almost a decade we have bypassed these questions from businesses by saying that all AI/ML models are black boxes. That is no longer going to hold, so we need to come up with approaches by which the model methodologies can be explained.
- Implementation of analytics and MLOps principles
One important takeaway: upskill and evolve consistently to stay relevant in the market.
Analytics Problem Journey
Typical industry problem we encounter
Let’s start from the problem in its very raw form. What does a problem look like?
- Industry problems are extremely vague. Many proposals and business meetings (with the CTOs, the CIOs, the CMOs) give you a high-level problem statement that might not make any sense.
- When you ask questions about that problem, you realize that there are too many unanswered questions.
Nobody is defining a clear outcome, so it’s your job to define that outcome.
- In many cases businesses do not give clarity around the issue.
So we as analytics consultants/data scientists need to have that clarity in our minds to be able to answer and address those questions.
Four Major Strategic Priorities
Businesses have four major strategic priorities. This is where they earn their bread and how they generate money.
- Enabling more revenue growth by employing different revenue-oriented strategies.
- Reducing/optimizing cost: Every business has operational and capital costs, and the focus here is to optimize them.
- Improving/re-engineering processes: There might be lots of inefficiencies in a process. To reduce them, you need to re-engineer the process.
- Improving customer experience
A Customer journey from Industrial Point of View and Role of Data
Imagine yourself as a customer. How does the industry view you as a customer, and what does it do? This is the journey of a customer.
- Prospect Assessment: The industry tries to identify who the right customer is. Example: Zomato has a very robust prospect assessment engine. It helps decide whom to onboard, i.e. they send very specific, niche messages to people who are likely customers to get them on board.
- Acquisition: How you bring that customer on board.
- Onboarding and Engagement: How to engage more with the customer. Example: You see a campaign that Amazon offers, and the next moment you join Amazon. Amazon then makes sure that you use their services by sending you niche messages (e.g. asking your opinion about a product). That is where the engagement part comes in.
- Growth Marketing: This typically happens while they engage with customers. They try to cross-sell or upsell something.
- Loyalty and Operations: On loyalty, if they figure out that you are also going to competitors such as Flipkart, they get that data and try to create differentiated services. On operations, they ensure that customer service is on its toes. Example: return requests.
- Churn and Retention: When you are not engaging enough with Amazon, or have just stopped using it, they sense a customer churn and figure out new ways to retain you.
- Feedback and Social Listening: The feedback you personally provide, or that the industry gathers from online media (e.g. you posted on Twitter that you are not happy with Amazon’s service). On this basis, they try to improve.
- Personalization: They want to give you a niche, personalized service.
Analytics is applicable everywhere, right from prospect assessment to acquisition, onboarding, growth, loyalty, etc.
Importance of Outcome and Value/Impact driven by the Solution
Source: Amitayu’s Presentation
Churn is industry agnostic, meaning it can occur in any service-providing industry (banking, telecom, retail, consumer goods, e-commerce). Loyalty management and churn and retention are the two major functions associated with churn.
There are five major questions for any churn problem anywhere in the world:
- Why are they leaving?
- Who is most likely to leave?
- Whom does the business want to retain?
- What kind of actions should the business take to retain them?
- Once the business has decided on an action, how should it target customers?
What we need to address here is: what is the impact or value that the business is trying to achieve by addressing this particular problem? The business approaches us, as data scientists, and asks for a befitting solution. The main reason behind churn analysis is the cost of acquisition: organizations typically do this kind of analysis because acquiring a customer is very expensive. So they need to retain customers, and they expect us, the data scientists, to be able to answer:
- what is the potential reason for churn
- how to retain customers
- what could be done to increase profit
- how to increase revenue with continuity
What Is a Probable Solution Here?
A. Prevent Revenue Loss
- Identify customers who are likely to churn
- Compute net value generated by a high-risk customer
- Retain high value – high risk customers with suitable retention offers
- Prevent revenue loss through retention.
B. Optimize Campaign Cost
- Compute cost of retention campaign
- Estimate the total budget of retention campaign (Cost*No. of leads)
- Identify the right customer to whom this campaign needs to be sent
- Calculate the ROI of the campaign, by calculating net revenue saved v/s net cost incurred
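The two tracks above can be tied together in a few lines of arithmetic. The following is a minimal sketch, not the presenter’s implementation: the `Customer` fields, the risk threshold, the flat `expected_save_rate`, and all figures are illustrative assumptions. It selects high-risk customers whose value exceeds the cost of contacting them, computes the budget as cost per lead times number of leads, and derives ROI as net revenue saved against net cost incurred.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    churn_probability: float  # assumed model output in [0, 1]
    net_value: float          # assumed revenue at risk if this customer leaves


def campaign_roi(customers, risk_threshold, cost_per_lead, expected_save_rate):
    """Return (targets, budget, revenue_saved, roi) for a retention campaign."""
    # Track A: keep high-risk customers whose value justifies an offer.
    targets = [
        c for c in customers
        if c.churn_probability >= risk_threshold and c.net_value > cost_per_lead
    ]
    # Track B: total budget = cost per lead * number of leads.
    budget = cost_per_lead * len(targets)
    # Expected revenue saved, assuming one flat save rate for every target.
    revenue_saved = expected_save_rate * sum(c.net_value for c in targets)
    # ROI = (net revenue saved - net cost incurred) / net cost incurred.
    roi = (revenue_saved - budget) / budget if budget else 0.0
    return targets, budget, revenue_saved, roi


# Made-up customers for illustration only.
customers = [
    Customer("C1", 0.80, 1200.0),
    Customer("C2", 0.75, 40.0),    # high risk, but worth less than the offer costs
    Customer("C3", 0.20, 2000.0),  # high value, but low risk
    Customer("C4", 0.90, 600.0),
]
targets, budget, revenue_saved, roi = campaign_roi(
    customers, risk_threshold=0.7, cost_per_lead=50.0, expected_save_rate=0.5
)
print([c.customer_id for c in targets], budget, revenue_saved, roi)
```

In practice the save rate would itself be estimated from past campaigns rather than fixed up front.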
Hypothesis Driven Approach-The Most Effective Way of Problem Formulation
The hypothesis-driven approach is a proven approach that consulting and analytics firms have adopted for many years.
A hypothesis seeks to explain why something has happened, or what might happen, under certain conditions. Hypotheses are often written as if-then statements. Any hypothesis-driven approach works through the following major questions to solve a problem:
- What is my end goal?
- How do I reach there?
- What is the journey towards that end goal?
- What kind of input information would be required through the journey?
- What are the key milestones that need to be delivered?
- I am building complex AI models today, but is the business able to consume them?
Now we’ll look at how to answer all these questions and draw out the analytics.
Stages of an Analytical Problem Journey-the Analytics Solution Hierarchy
The stages of the solution are:
- Problem Formulation: We build an issue tree, i.e. we break the problem down into simpler blocks that are easy to understand. Make sure you follow a MECE (mutually exclusive, collectively exhaustive) approach, although it is hard to be 100% compliant.
- Solution Deliveries: Build hypotheses from the hypothesis chart, define the analysis outcome expected from each of them, and lastly validate the hypotheses on data.
- Data Requirement: Define the key attributes/features required for validation and testing. Then identify the root variables and the sources from which they are collected. Lastly, assess data availability and accessibility.
- Analytical Approach: Perform hypothesis testing and either accept, reject, or iterate. Generate helpful insights from the testing. Combine these hypotheses and use them with AI/ML models. Then test the accuracy and stability of these models. Lastly, explain the model output, whatever it is, with proof.
- Implementation Roadmap: Scaling the solution is the end goal. Make sure that data and model pipelines have been built as part of MLOps. Then integrate the solution with the client’s cloud system and do SIT (system integration testing). Continue monitoring after implementation, and provide end-to-end enablement training to run operations.
Example: How to Build an Issue Tree?
Hypothesis Framework – Build on these Issue Trees
Source: Amitayu’s Presentation
For every problem we build an issue tree, and on top of it we build a hypothesis framework.
Key Roles Involved Across the Stages
Source: Amitayu’s Presentation
The kinds of roles involved through the journey are:
In problem formulation, it is mostly the business analyst and the domain experts, with a very small representation of data analysts and data scientists. All of them sit together and brainstorm how to translate the business problem into an analytical problem.
The business analyst, or a typical consultant, plays a big role here because they look at industry practices and benchmarks and come up with the hypotheses accordingly.
Data engineering is, as expected, the bulk of the data work. Apart from the data engineer, the data scientist and the data analyst also play a small role in that space.
The analytical approach is where the model building, the insights, and the visualization all happen. The data scientist has a big role to play, as does the data analyst. The data engineer also has a significant role because they are the ones building up the data.
The ML engineer’s role is to implement these models. This is a new role coming up in the industry: people who implement models into an existing framework. It requires an understanding of models as well as of technical architecture.
Therefore, the ML engineer plays a big part in solution implementation. The business analyst also plays a big part, because they are the ones who bring all the entities together to make the solution a success.
Data Engineering – The Power-house of Data Science Solutions
Purpose of Existence
- AI Data Foundation: Data engineering creates the foundation for all AI by bringing together data from multiple sources (structured or unstructured) and ensuring that it is correlated and ready for consumption. This also covers legacy modernization: if you have an old platform like DB2, the data engineer’s job is to ensure that it is modernized into a modern data architecture.
- Data Lake Mainstreaming: A data lake is where you store all your information. It is a consolidated view of all your data sources on a unified platform. This covers the design of a data lake followed by a data warehouse (customer 360). A lot of compliance-related operations, such as data security and governance, also come under data lake mainstreaming.
- Data on Cloud: In the next 10 years, all data will move to the cloud and no on-premise systems will remain. Even smaller organizations are moving to the cloud. So you need foundational knowledge of AWS, GCP, etc., and of how they integrate with other applications.
- Data and Analytics Consumption: The data engineer’s role is evolving towards the consumption of data, analytics, dashboards, etc. Data engineers now ensure that data can be pulled automatically from the source system and fed into a data lake, and that reports are created and published by pulling data from the data lake. Consumption means that the data engineering team enables this functionality end to end.
Why Do We Call Data Engineering the Power-house, and What Are Its Key Trends?
Data engineering is the powerhouse, or the mitochondria, of today’s data-driven world. We have been data scientists for many years, but the data engineers are the ones who hold everything together. The data engineering team ensures that all the data flows with no gap in the solution, which amounts to business self-service: the business does not have to do anything, because everything is enabled by the data engineers and data scientists.
Its key trends are:
- Cloud Deployments: The majority of organizations are leveraging cloud solutions to rapidly stand up analytics or operational environments.
- Governed Data Lakes: They are evolving to become central sources of data for the enterprise, with data catalogues to search and shop for data capabilities.
- Rapid Insights Discovery: Investing in data exploration capabilities to identify patterns, trends and unknown opportunities.
- Business-Self Service: Greater use of search, SQL, NLP and self-service tools for intelligent data preparation, operational intelligence and visualization.
- Modern-Hybrid Architecture: Companies are leveraging several technology components for accelerating data movement.
- Smart-Data Management: Evolution of intelligent solutions to data management challenges, automation and learning based solutions to integration and data quality.
Key-Pillars to Data Engineering Engagement
Assessment of Existing Data Architecture: To address a churn problem, the first thing is to assess the architecture of the business:
- Gain end-to-end knowledge of their data.
- Do a basic data discovery: for each hypothesis, what kind of data sources do I need?
- What kind of existing technology stack do they have?
- Is that data accessible, and how do I access it?
These are the initial steps of assessing the enterprise data architecture. Those who are a little advanced will know what a sandbox is. Setting up a sandbox and virtual workbench, and connecting the data sources and applications within the sandbox to the data warehouses and data lakes, is also part of the data engineer’s job.
Development of Analytical Data Record: For the churn problem we need to build an analytical data record, which is a customer-level data set that helps build models. The data engineer and the data scientists build this data together. The tasks include data acquisition, data quality assessment, feature creation, integration of the lakes (merging with the data warehouse), data dictionaries, and metadata management.
Deployment of Analytical Solution Framework: They create the data pipelines to automate the end-to-end solution (MLOps). They build scaling and automation into the data flow, ensuring that the code is configurable and deployable by using parameter-driven batch code. They also perform feature engineering based on use cases and apply metadata management.
What is a Customer360?
This is the output that the data engineer produces for modeling purposes: an analytical data record. A customer 360 is a view of the customer in which all the possible attributes of a customer, like financial usage, service products, channel, etc., are brought together under one platform.
When we build these exhaustive customer records, they have thousands and thousands of features that cater to many different kinds of use cases, not just churn. Churn is just one of those.
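To make the idea of a customer-level record concrete, here is a minimal sketch of merging several source systems into one wide row per customer. The function name `build_customer_360`, the source names, and the attributes are hypothetical; a real implementation would run on a data lake or warehouse rather than in-memory dictionaries.

```python
from collections import defaultdict


def build_customer_360(**sources):
    """Merge per-source records into one wide row per customer.

    Each source maps customer_id -> dict of attributes. Column names are
    prefixed with the source name so attributes from different systems
    never collide.
    """
    merged = defaultdict(dict)
    for source_name, records in sources.items():
        for customer_id, attributes in records.items():
            for key, value in attributes.items():
                merged[customer_id][f"{source_name}_{key}"] = value
    return dict(merged)


# Hypothetical source extracts, for illustration only.
billing = {"C1": {"monthly_revenue": 45.0}, "C2": {"monthly_revenue": 20.0}}
usage = {"C1": {"data_gb": 12.5}}
service = {"C2": {"open_tickets": 3}}

customer_360 = build_customer_360(billing=billing, usage=usage, service=service)
print(customer_360["C1"])
```

Customers missing from a source simply lack those columns here; how to treat such gaps (defaults, imputation, exclusion) is a modeling decision the sketch leaves open.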
From Diagnostics to AI problem- How Does the Problem Evolve?
EDA (Exploratory Data Analysis) – Know as much about the data at your disposal (Churn Problem)
We have already defined what churn is. Generally, if somebody stops their service, it’s called a churn. But a churn could be any of the following:
- Product churn or relationship churn: Suppose you are a telecom customer with many different products. If you decide to stop your postpaid service, that’s a product churn. But if you stop using all of the services (TV, broadband, etc.), that’s a relationship churn.
- Inactivity, or hard churn versus soft churn: Somebody who is inactive for a long time is often confused with a churn, because that person is not doing anything. We might predict that this person is not going to use the service any longer, but suddenly that person comes back. How you differentiate between the two is something you need to understand clearly.
- Returning customers: This is the same point. If a customer who has been inactive for a long time suddenly comes back, do you want to call that customer a churn?
- Voluntary versus involuntary churn: Suppose you are a telecom provider and you decide to remove someone yourself because they have been a very bad customer. That’s not a churn.
- Frauds and delinquents: Fraudsters carry out fraudulent activities. If they leave, do you really want to call them churn?
So all of these questions need to be answered. The churn definition must satisfy three major criteria:
- It has to be business aligned,
- clearly oriented towards an outcome,
- and clearly quantified.
Assume that we have defined who is a churn in our data and we are now trying to answer each of these questions.
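One way such a definition might be written down is as an explicit labelling rule. The sketch below is an assumption-laden illustration, not the presenter’s definition: the `Account` fields, the label names, and the 90-day inactivity window are all hypothetical and would have to be agreed with the business.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Account:
    customer_id: str
    last_activity: date
    closed_on: Optional[date] = None   # set when the relationship ended
    closed_by_provider: bool = False   # involuntary: the provider ended it
    flagged_fraud: bool = False


def churn_label(account, as_of, inactivity_days=90):
    """Return 'excluded', 'hard_churn', 'soft_churn', or 'active'."""
    # Frauds and provider-initiated closures are kept out of the churn target.
    if account.flagged_fraud or account.closed_by_provider:
        return "excluded"
    # The customer closed the relationship on or before the snapshot date.
    if account.closed_on is not None and account.closed_on <= as_of:
        return "hard_churn"
    # Long inactivity is only a soft churn: the customer may still return.
    if (as_of - account.last_activity).days >= inactivity_days:
        return "soft_churn"
    return "active"


as_of = date(2022, 5, 30)
accounts = [
    Account("A1", last_activity=date(2022, 5, 1)),
    Account("A2", last_activity=date(2022, 1, 10)),
    Account("A3", last_activity=date(2022, 4, 1), closed_on=date(2022, 4, 15)),
    Account("A4", last_activity=date(2022, 3, 1), closed_on=date(2022, 3, 5),
            closed_by_provider=True),
]
labels = {a.customer_id: churn_label(a, as_of) for a in accounts}
print(labels)
```

Keeping soft churn as a separate label lets returning customers be handled explicitly instead of being counted as lost.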
Testing Your Hypothesis
Once we know who counts as a churn, we’ll tackle the question: why are customers leaving or unwilling to continue?
We’ll follow six major steps:
- Define a hypothesis: Build a hypothesis following the issue tree criteria.
- Attribute extraction: After generating the hypothesis, ask the data engineer where you can get the data from. This is called attribute extraction.
- Feature creation: Create a variable or a derived variable according to your business problem or to test your hypothesis.
- Causal forensics: Understand the root cause behind a certain event.
- Statistical significance test: When you do a root cause analysis, the result has to hold statistically or mathematically before you can say that the hypothesis is genuinely true.
- Hypothesis verdict: Example: customers who have a lower volume of incidents or face fewer outages in their broadband are less likely to churn. That’s a hypothesis. To test it, take all customers, look at the churners, and then look at the volume of outages they experienced. If churn is high within the group where outages are high, and the difference is statistically significant, you can say that your hypothesis is true.
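The outage example can be checked with a standard chi-square test of independence on a 2x2 table. The session does not name a specific test, so this choice is an assumption, as are the counts and the 0.05 significance level; the sketch uses only the standard library and omits the continuity correction.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Chi-square test of independence for a 2x2 table (no continuity correction).

    Layout:            churned   stayed
      high outages        a         b
      low outages         c         d
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts for illustration only.
statistic, p_value = chi_square_2x2(a=120, b=380, c=60, d=440)
churn_rate_high = 120 / (120 + 380)   # churn rate among high-outage customers
churn_rate_low = 60 / (60 + 440)      # churn rate among low-outage customers

# Verdict: the direction must match the hypothesis AND be statistically significant.
hypothesis_supported = churn_rate_high > churn_rate_low and p_value < 0.05
print(round(statistic, 2), hypothesis_supported)
```

A significant association supports the hypothesis but does not by itself establish the root cause, which is why the causal forensics step precedes it.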
Value Segmentation – Who has the Most Potential?
Not every customer who is leaving is equally important to the business. We determine who matters most using an approach called value segmentation.
Here we categorize customers by the value they generate for the business today, called the net present value, and the value they will generate in the future, called the lifetime value. We build a variety of models here, from basic segmentation and clustering models to very advanced AI models, to get to this outcome. Typically the result looks something like the following.
First there are very high value customers. If you look at the data in more detail, these are very high revenue players: their tenure is already more than 60 months and they hold a lot of products. If anybody from this group is churning or has a high likelihood of churning, you will definitely retain them with customized offers.
Similarly, there are high-value, mid-value, and low-value customers. The big chunk of customers in a typical customer base, regardless of industry, always falls in the low-value category, and whether or not you want to retain them is a decision you need to take on a case-by-case basis. For the higher-value segments, this is where retention becomes important: you definitely want to retain them.
Source: Amitayu’s Presentation
AI-ML Powered Modelling Approach
By using this model engine we are able to answer two churn related questions:
- Which of the business’s customers are more likely to churn next?
- What action should the business take to retain them?
There are three major steps in this model engine:
- First, develop a model
- Second, validate a model and
- Last, ensure that your model is good enough to be deployed
Source: Amitayu’s Presentation
How will you do that? Whatever model is in development or deployment, you need to do your feature engineering and hypothesis testing. Try out ensemble algorithm selections and specialized algorithms for specialized use cases. Most importantly, iterate: keep iterating this process until your model looks good.
How do you know your model looks good? Through validation: you can check model accuracy, stability, and robustness.
Finally, value creation: model explainability and automated ML frameworks are important values for businesses today. Businesses today want an agile approach. What is an agile approach? Agile works in sprints, so we have to create deliverables in sprints, and a typical AI/ML project is a good example of how you can deliver different kinds of solutions to businesses sprint by sprint.
Churn Prediction and Recommendation Model
We are going to build three models for the purpose of this solution:
Churn Prediction Model: This is a propensity model that predicts which customers are likely to churn. On the basis of the data, you tell the business which customers have a high likelihood of leaving.
Value Segmentation Model: You overlay this on top of your churn prediction model. It tells you that these customers are likely to churn, and what value they carry on top of that. If a particular customer is both high risk and high value, you definitely want to retain that customer, because that is what is good for the business.
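The overlay of churn risk on value can be sketched as a simple decision rule. The segment names, the 0.6 risk threshold, and the action labels below are illustrative assumptions; in practice both inputs would come from the two models described above.

```python
def retention_priority(churn_probability, value_segment, risk_threshold=0.6):
    """Map a (churn risk, value segment) pair to a retention action."""
    high_risk = churn_probability >= risk_threshold
    if not high_risk:
        return "monitor"
    if value_segment in ("very_high", "high"):
        return "retain_with_custom_offer"
    if value_segment == "mid":
        return "retain_with_standard_offer"
    # Low value and high risk: retention is a case-by-case judgment call.
    return "review_case_by_case"


# (customer_id, churn probability, value segment): made-up scores.
scored = [
    ("C1", 0.85, "very_high"),
    ("C2", 0.70, "low"),
    ("C3", 0.30, "high"),
    ("C4", 0.65, "mid"),
]
actions = {cid: retention_priority(p, seg) for cid, p, seg in scored}
print(actions)
```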
How to Retain Them?
Offer Recommendation Model: For every segment, or even for every single customer, we create an offer that you can take to that customer in order to retain them.
Algorithms used in the models: A churn prediction model will be a classification problem, a regression problem, or a simulation. It requires churn-specific feature engineering.
Segmentation model: Use clustering, an unsupervised approach (hierarchical or non-hierarchical clustering), or use supervised segmentation like KNN or some of the advanced AI techniques.
In an offer recommendation model you typically use collaborative filtering or a rule engine. Collaborative filtering is the same algorithm that lies behind recommendation systems around the world, be it Amazon, Netflix, YouTube, or Google. You also need a test-control mechanism, which means that every time you send out a recommendation to somebody, the effectiveness of that recommendation needs to be tested.
So if you send out an offer to a customer, whether or not the customer accepts it tells you how good your recommendation was. This is what gives rise to reinforcement learning, which reinforces the performance of a model with the help of this feedback.
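A test-control check can be as simple as comparing outcome rates between customers who received the offer and a held-out group who did not. The sketch below assumes boolean outcomes and made-up groups; the function names are hypothetical, and a real evaluation would also randomize group assignment and test the difference for statistical significance.

```python
def acceptance_rate(outcomes):
    """Share of True outcomes in a list of booleans (0.0 for an empty list)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def offer_uplift(test_outcomes, control_outcomes):
    """Retention rate in the test group minus the rate in the control group."""
    return acceptance_rate(test_outcomes) - acceptance_rate(control_outcomes)


# True = customer was retained. The test group got the offer, the control did not.
test_group = [True, True, False, True, True, False, True, True]        # 6 of 8
control_group = [True, False, False, True, False, True, False, False]  # 3 of 8
uplift = offer_uplift(test_group, control_group)
print(uplift)
```

A positive uplift is the feedback signal that can be fed back into the recommendation model over successive campaigns.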
Note: Now imagine if we could build everything together in one model, in which for every single customer you could identify the right offer, when to send it, through which channel to send it, and what message to write in it. That is called personalization.
AI-ML Powered personalized Framework
Pivotal element of a client digital transformation across industries.
Source: Amitayu’s Presentation
The end goal for many churn problems could be a personalization outcome, in which every customer who is likely to churn gets a very specialized offer. For every single customer it depends on the customer’s value, the offer you plan to generate, the channel through which you will target that customer, and the message you will take to them.
Example: Zomato sends you emails with your name on top. The moment you see that, you notice that they’re sending you a very personalized email. These days they go one level deeper and send subject lines based on your other email topics. So Zomato may use a subject line that says “you have been browsing or you have been looking into amazon”. Immediately you wonder how they know that, and you click on the email. Such a subject line is called clickbait.
Generating Analytics Value through Implementation
Why Implement an Analytics Solution?
There are different levels of value that AI generates: an intelligent product, intelligent automation, enhanced interaction, enhanced judgment, and enhanced trust. All of these are individual value drivers which ensure that the outcome you are trying to get out of an analytics use case is multiplied. The first of these values is the intelligent product, followed by intelligent automation, and to realize them you should implement your AI/analytics solutions.
Source: Amitayu’s Presentation
Key Building Blocks for Value Realization
There are five steps you have to understand:
Source: Amitayu’s Presentation
- Know your domain: Unless you walk in the shoes of the business, you will never be able to solve a problem.
- Start simple: You cannot solve everything on day one. Start by delivering smaller values, and only then can you reach the goal.
- Have a digital mindset: Incorporate as much data as you can get, because only that will transform businesses.
- Use a value framework: It is a must when you are thinking about building these AI applications.
- There are no silver bullets: There are no shortcuts. If you are to build something credible, you have to follow the steps. It’s not a one-day job; it takes months to build that credibility, but that is what you need to do.
Introduction to MLOps
It is a process to package and deploy a whole solution into a production environment such that everything happens automatically. The entire journey of a machine learning life cycle, from data identification to data extraction, data analysis, model development, and model implementation, happens completely on its own in an iterative and experimental manner.
Machine Learning Deployment Life Cycle
Source: Amitayu’s Presentation
We have to build an MLOps framework. First we need the data, which is the basic data framework. Then we perform feature engineering and train and test models. But at the core of it, we have to ensure that the ML engineering operations, the experiments, and all the use cases come together into a framework that the client can easily consume.
Conclusion
I hope you enjoyed the session and understood it very well. Major Takeaways from the session are:
- What the session does and does not cover
- How data science has evolved over the years
- The must have skills for tomorrow.
- Analytics Problem Journey
- Data Engineering – The power-house of Data Science Solutions.
- From Diagnostics to AI problem- How does the problem evolve.
- AI-ML Powered Modelling Approach.
- Churn Prediction and Recommendation model
- AI-ML Powered personalized Framework
- Generating Analytics Value through Implementation