Analytics Vidhya has long been at the forefront of imparting data science knowledge to its community. With the intent of making learning data science more engaging, we launched our new initiative, “DataHour”.
DataHour is a series of webinars by top industry experts where they teach and democratize data science knowledge. On 29th March 2022, we were joined by Anastasiia Molodoria for a DataHour session on “How to Stay Relevant in the Booming World of AI?”
Anastasiia has a strong math background and experience in predictive modelling, NLP (Natural Language Processing), data processing, and deep learning. She has successfully integrated ML, DL, and NLP solutions for retailers and product tech companies, focusing on optimizing and automating routine daily tasks and increasing business efficiency.
Currently, she’s working at MobiDev as the Data Science Team Leader.
Are you excited to dive deeper into the world of Data Science and Machine Learning? We’ve got you covered. Let’s get started with the major highlights of this session: How to Stay Relevant in the Booming World of AI?
From this session, you’ll take away two key learnings:
Anastasiia covered these topics through business cases, giving a deeper understanding of the value of integrating AI to solve real-world problems. The session also gives insight into the main steps needed to deliver an ML product to a client successfully, and into how to set the right expectations with the business when you don’t know the exact output of your ML research in advance.
Prerequisites: Some basic understanding of Data Science.
So, let’s dive into the ocean of AI.
With a few basic examples, we’ll try to understand what estimates and expectations we need to set to kick-start a project. So, let’s begin.
To do that, we first need to figure out two things: PoC vs MVP.
What exactly are these two terms, and why do we need them?
We need these to know:
Now, let’s look at how PoC and MVP work.
A PoC (Proof of Concept) takes an input (a task we need to solve) and helps us get the desired output.
Input → Output
The outputs a PoC gives:
An MVP (Minimum Viable Product) helps maintain the balance between the “minimum” and “viable” sides of a product; through it, we can arrive at an optimal set of features to start with.
For example, take donuts in the market. The market is flooded with donuts, and suppose you are a newcomer in this business. Only knowing how to make donuts will not benefit your business, because so many brands already exist. What you need in order to stand out is the idea that will boost your business and make it successful. That idea is the extra feature you’ll add to your donut, and that extra feature is what makes it an MVP solution.
Now, let’s see how to (and how not to) build an MVP with another example.
Explanation of the example: The first way of building a car is not a minimum viable product, because if something goes wrong at step three, it’s not possible to deliver the desired product, and all the money, time, and effort invested go in vain.
The second way, by contrast, is the right way to build a minimum viable product, because at every step you already have something usable to deliver.
This was all about PoC and MVP individually.
Here, we’ll answer a basic question: do we know where to start?
Case 1: If yes, the next question is: are the goal and all the steps clear?
Case 2: If no, and you don’t have any idea, then you’ll have to start with a PoC.
A data science project is not sequential development. In a sequential project, for example developing a mobile application, we know what the next step will be and can reach the desired result step by step. Data science projects are iterative. For this type of project, we need to go through:
Business Understanding: Ask your client what he/she wants to develop.
Data Understanding: Once the client hands over the data, there might be situations where you cannot get insights from it. In that case, connect with the client again to build a proper understanding of the data.
Data Preparation: This is one of the major stages a data scientist has to invest in, whatever project you are handling.
Modelling: Select the model that fits your idea best. If you observe a problem with the model, you can go back to data preparation, correct it, and then build a new model or fix the existing one.
Evaluation: Evaluate whether the model will actually work. If it does, go for deployment. If not, you may need to revisit the business understanding.
Deployment: Deploy the AI project.
A data science project is a different kind of project, yet we still need to estimate it somehow. With a mobile application, for example, it’s more or less clear: we target a specific operating system, we need to add some buttons, and so on. But how do you properly estimate a data science project when you don’t know the results in advance? You don’t know whether the approach will work or what accuracy you will get, yet “please estimate it” is one of the most frequent requests.
Why? Because if, for example, the idea is to develop an AI-based mobile application and the core AI task cannot be solved, there is no point in gathering all the other developers at all, since the main functionality cannot be delivered. So if you are not sure, ask the client to start with a PoC; that is completely fine.
Why? Clients may ask you for accuracy commitments. Don’t make them, because you don’t know the results in advance, and explain this to the client. What you can do, for example, is select several models you are going to try at the first stage; these models are usually benchmarked on open-source datasets, so you can share those published metrics with the client. Make it clear that you are not sure how they will perform on the client’s data, because you haven’t tried yet and the model still needs to be developed, and that this is fine.
Why? Imagine, for example, that you test your PoC or experiment on audio files and it looks good, so you are sure it will work in production. But when it becomes a product, the data you see is completely different: there is a lot of background noise and the approach stops working. It is better to write this down as a risk. If some output goes wrong, you can refer back to the point you raised with the client at the very beginning (“as we described to you”). Describing the risks to the client up front keeps you on the same page.
So, the more understanding you have, the better: first of all, you will be more confident in your estimates and in achieving the goal. That’s the main point here; the more details you have, the better you can reach your goal.
Runtime is important even if the client doesn’t mention it; it’s a true story, and we’ll cover it a bit more later on. If the client doesn’t talk about runtime, it doesn’t mean it doesn’t matter, so it’s better to clarify it on your own.
Showing something abstract, i.e., applying a model to some open-source data, is one story. It’s another story entirely when the client has given you only a little data (e.g., three pictures) but still sees insights from you on their own data: you will definitely earn more trust. So even if you have only a small amount of the client’s data, try to use it for the demo, and make sure the client understands your point.
We all work in data science, where it’s really easy to lose people because there are a lot of technical details, and the client is often not as technical as we are. So we have to describe complicated things in simple words. Make sure the client understands your plan; ask whether it was clear, or paraphrase it if needed. It’s better to do this every time to avoid miscommunication later, which would be far worse.
And the last recommendation: provide the client with a report containing all the details of your work. It’s a great practice, because when you finish a meeting you can share the report and the client can go through the details on their own. The most interesting directions covered next include optimization, computer vision, NLP, and time series.
Let’s understand this with an example of a Cafe Chain Owner.
The main goal of this owner is to support and maintain the business. This owner came to you with two questions:
As input data, the client provides you with a SQL database containing a bunch of tables, along with the connections between those tables. What is the expected solution here? The client wants to know the number of products that will be sold.
But this isn’t enough; try to think wider and deeper, as a competent data scientist should.
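To make the first request concrete, here is a minimal sketch of how a demand-prediction model could be wired up. The database file, table, and column names (cafe_chain.db, sales, date, store_id, units_sold) are assumptions for illustration only, not something taken from the session:

```python
import pandas as pd
from sqlalchemy import create_engine
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical database file and schema -- the client's real tables will differ
engine = create_engine("sqlite:///cafe_chain.db")
sales = pd.read_sql("SELECT * FROM sales", engine)

# Simple calendar features; real feature engineering would go much deeper
sales["date"] = pd.to_datetime(sales["date"])
sales["day_of_week"] = sales["date"].dt.dayofweek
sales["month"] = sales["date"].dt.month

X = sales[["store_id", "day_of_week", "month"]]
y = sales["units_sold"]  # the number of products sold -- what the client wants to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```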
We’ll understand this with another example. There is a business owner who has a product, and the main goal of this person is:
Let’s imagine customer support via chats, since we need to have text somewhere in this business case. The text will be in the chats, and this client came to you with two requests:
We propose to the client:
Tabular data, or structured data, is data organised into rows and columns. We can say tabular data is a table that stores values of different types, whether boolean, numeric, or text. Tabular data makes extracting insights more efficient. With tabular data we usually deal with three main tasks – regression, classification, and clusterization (a minimal code sketch of all three follows the list below).
Regression: This task is about predicting a specific number, such as a price or sales units – a numeric, possibly decimal, value.
Classification: This task is about assigning a class to each observation, for example disease detection or sentiment (whether a review is good or bad).
Clusterization: Here we don’t know the classes in advance, but we want to group our data into some number of groups – for example the customer-group detection we discussed previously, grouping people by their behaviour.
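As a quick, hedged illustration of these three task types, a sketch using scikit-learn’s built-in toy datasets (not the presenter’s data) could look like this:

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

# Regression: predict a continuous number
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_reg, y_reg)
print("Regression R^2:", reg.score(X_reg, y_reg))

# Classification: assign a class label to each observation
X_clf, y_clf = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X_clf, y_clf)
print("Classification accuracy:", clf.score(X_clf, y_clf))

# Clusterization: group observations without knowing the labels in advance
clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_clf)
print("Cluster sizes:", [int((clusters == k).sum()) for k in range(3)])
```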
It’s true that expectations and reality are different. We expect that there will be no missing data, that all variables are well understood, that the data is clean and everything is fine, and that our only job is to apply a model and tune it. But this is not reality. To get good insights from data, we need to follow these steps (a short code sketch follows the list):
Understanding the data: In a classical tabular data task, the first and most important step is data understanding. Anything unknown should become completely clear at this stage; if something isn’t clear, ask the client, because if you are not sure about the data, it’s not possible to develop a really valuable model.
Data cleaning and preparation: Get the data ready for modelling.
Feature engineering: Before modelling, feature engineering is a great and interesting step. You can generate new features, get new insights, and discuss them with the client.
Modelling: All the experiments are up to you, and it’s really interesting. From the modelling step you can always go back to feature engineering or data cleaning again – we are working with an iterative process, not a sequential one.
Evaluation: Validate the model on your dataset. Make sure you are not overfitting and that everything works as you expected.
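A hedged sketch of that cleaning → feature engineering → modelling → evaluation loop, using synthetic data invented purely for illustration (none of it comes from the session), might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the client's table, with a few missing prices to clean
rng = np.random.default_rng(0)
n = 200
price = rng.uniform(2, 5, n)
promo = rng.integers(0, 2, n)
units_sold = 200 - 20 * price + 30 * promo + rng.normal(0, 5, n)
df = pd.DataFrame({"price": price, "promo": promo, "units_sold": units_sold})
df.loc[::20, "price"] = np.nan

# Data cleaning: fill the missing prices with the median
df["price"] = df["price"].fillna(df["price"].median())

# Feature engineering: a derived feature you could discuss with the client
df["price_with_promo"] = df["price"] * df["promo"]

X, y = df.drop(columns="units_sold"), df["units_sold"]

# Modelling + evaluation: compare the fit on training data with cross-validated scores
model = GradientBoostingRegressor(random_state=42).fit(X, y)
print("Train R^2:", round(model.score(X, y), 3))
print("CV R^2:   ", round(cross_val_score(model, X, y, cv=5).mean(), 3))
# A large gap between the two numbers is a classic sign of overfitting
# and a reason to loop back to data preparation or feature engineering.
```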
How can this result be improved further?
NLP is basically the area of working with text. The presenter found some really great research, and you can follow the link if you are interested in reading more about the NLP market. The main idea is that, given the amount of investment in NLP in 2020, huge growth is expected, and considering the number of NLP projects we are already working on, this area will definitely grow. The research covers two key growth directions.
What does this mean? We have all got used to using Siri, for example; if you want to find something on YouTube when you turn on the TV, you don’t want to type it, you just want to say it, which is much easier. That is still NLP, and this kind of functionality will keep developing. Within NLP there are also a lot of directions.
First, we need to split the text, with or without punctuation – that’s a separate question – but the idea is to split the text and then assign a token to each unique value. Of course, there are different options for splitting and for tokenization, but the high-level idea is to convert the text into digits so we can then work with that sequence of numbers. If you haven’t worked with text yet, hopefully this shows that it isn’t really complicated under the hood.
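A bare-bones sketch of that idea in plain Python (real tokenizers handle punctuation, casing, subwords, and unknown words far more carefully):

```python
# Split the text into words, assign each unique word an integer id,
# and turn the sentence into a sequence of digits.
text = "the cat sat on the mat"

tokens = text.lower().split()                      # splitting (here simply by whitespace)
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
ids = [vocab[word] for word in tokens]             # the text as a sequence of numbers

print(vocab)   # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(ids)     # [4, 0, 3, 2, 4, 1]
```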
We have two approaches generally:
Text classification and text similarity can be solved with both approaches. But for some tasks, even if we can solve them with classical approaches, the result may not be good enough to be relevant in the real world.
Now the main focus is on deep learning, and models trained as neural networks produce really great results. For example, we can enrich such a model with our own custom data, which is why this area is developing much faster right now. A minimal sketch of the classical route is shown below for contrast.
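Here is that sketch: a classical text classification pipeline built from TF-IDF features and a linear model. The tiny texts and labels are invented purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "great product, works perfectly",
    "terrible support, very slow",
    "love it, highly recommend",
    "broken on arrival, waste of money",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["slow and broken"]))  # most likely [0] on this toy data
# The deep learning route would typically swap this pipeline for a pretrained
# transformer model fine-tuned (enriched) with your own labelled chat data.
```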
Let’s say the task is to generate summaries for a given input text.
There are two approaches: extractive and abstractive.
Two years ago, the extractive summary approach was everywhere, while the abstractive approach was really rare and did not produce good results. Right now the situation is completely the opposite. The core intuition behind extractive summarization: we split the text into sentences, score and rank each sentence, and select the most relevant ones. But the output contains the original sentences, and most likely it will read out of context.
The abstractive summary approach can paraphrase and generate new sentences: with, say, 10 sentences as input, the output can be a single short, paraphrased sentence summarizing the text. This approach is more popular right now and produces really great results, although several years ago it was the complete opposite.
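As one hedged illustration of the abstractive approach, a short sketch with the open-source transformers library might look like this (an extractive baseline would instead score and return whole original sentences). The default summarization model is downloaded on first use, and the input paragraph is invented for illustration:

```python
from transformers import pipeline

text = (
    "The cafe chain opened three new locations this quarter. Sales grew in the city centre "
    "but declined in the suburbs. Management is considering a loyalty program to win "
    "suburban customers back and plans to review pricing before the summer season."
)

summarizer = pipeline("summarization")
summary = summarizer(text, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])  # a short paraphrased summary, not original sentences
```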
Image captioning: The idea here is that we take an image as input, detect the objects in it, and then generate a textual description. This can be useful for the automatic creation of text or descriptions for photos. An interesting use case is annotation for blind people: we can convert the caption to voice and describe what is shown in the picture.
Computer vision is the area of AI that works with images and pictures. Let’s look at a classic computer vision task.
Example: Let’s try to understand how to detect whether a cat or a dog is in a picture. We’ll approach this task with computer vision techniques.
Pictures need to be converted to numbers, because models work with numbers. Under the hood, when you read an ordinary JPG or PNG picture, it looks like a three-dimensional matrix: two dimensions are the height and width in pixels, and the third dimension holds the three RGB channels.
A picture usually has three channels – the number can differ, but it’s usually three: red, green, and blue. Most of the pictures we look at on our phones contain these three layers, and the final colour of each pixel comes from the intensity of each channel.
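A short sketch of what that three-dimensional matrix looks like in code, assuming a placeholder image path (photo.jpg):

```python
import numpy as np
from PIL import Image

img = np.array(Image.open("photo.jpg").convert("RGB"))
print(img.shape)  # (height, width, 3) -- the third axis holds the red, green, blue channels
print(img[0, 0])  # R, G, B intensities of the top-left pixel, each between 0 and 255
```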
But what if you don’t know anything about computer vision? In that scenario:
What if we don’t have a dataset – the client doesn’t provide one but still wants us to detect a person? What will you, as a data scientist, do here? Don’t say no to the client. You have three options:
These are:
LiDAR: This is really interesting; some of you may already have this camera on your mobile phone. The idea is that the camera uses a laser, and the output is not just a picture like an ordinary camera gives: it’s a picture together with depth information about how far the light travelled, so there is plenty of information to work with. LiDAR is highly developed right now.
GANs: These are a really interesting type of neural network in computer vision. The main idea is that we take an input picture and can modify it – apply a smile, or change the hairstyle, hair colour, or other aspects of appearance. GANs are useful in different areas, from generating new data samples for your datasets to photo editing and face animation.
Human pose estimation: The idea here is to detect key points on the human body. This can be useful for fitness apps, for example to identify whether you are doing an exercise properly. The use case can be studied further in the guide linked below; a small code sketch follows it.
Human pose estimation guide
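As one illustration, here is a minimal sketch using the open-source MediaPipe library (one of several pose estimation options, not necessarily the one used in the session). It assumes the mediapipe and opencv-python packages are installed, and exercise.jpg is a placeholder path to a photo of a person:

```python
import cv2
import mediapipe as mp

image = cv2.imread("exercise.jpg")
with mp.solutions.pose.Pose(static_image_mode=True) as pose:
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # Each landmark is a key point (nose, shoulders, elbows, knees, ...) with
    # x and y given as fractions of the image width and height.
    for i, lm in enumerate(results.pose_landmarks.landmark):
        print(i, round(lm.x, 3), round(lm.y, 3))
```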
In this scenario, how do we meet this type of client expectation?
A few things you must consider when writing code for AI:
I hope you have enjoyed the session and that it helps you stay relevant in the world of AI. The layman’s examples should also have complemented your learning. Good luck – learn more, grow high.