As the day was coming to a close, I thought of fitting in another meeting. Two analysts in my team had been working on creating a data set for one of the predictive models we wanted to build. Between them, the analysts had ~5 years of work experience in predictive modeling. I expected to breeze through the meeting and leave for the day.
So, the meeting started. Five minutes in, I knew it would take much longer than I had initially thought!
The reason? Let’s go through the discussion as it happened:
Kunal: How many rows do you have in the data set?
Analyst 1: (After going through the data set) X rows
Kunal: How many rows do you expect?
Analyst 1 & 2: Blank looks on their faces
Kunal: How many events / data points do you expect in the period / every month?
Analyst 1 & 2: …. (None of them had a clue)
The number of rows in the data set looked higher than I expected. The analysts had clearly missed it because they had not benchmarked it against a business expectation (or did not have one in the first place). On digging deeper, we found that some events had multiple rows in the data set, hence the higher row count.
A high percentage of analysts would have gone through a similar experience at some point or other in their careers.
At times, due to timeline pressures or other reasons, we skip basic sanity checks on the dataset we are working on. However, overlooking data accuracy in the initial stages of a project can prove very costly, so it usually pays to be paranoid about data accuracy.
I usually follow a simple framework for checking the accuracy of data points. In this article, I’ll share the process I typically use for checking data sanity. The framework works top down, which suits this purpose well: if there are any glaring mistakes in the data set, they will become evident early in the process.
Please note that the rest of the article assumes you are working on a structured data set. For unstructured data sets, the principles would still apply, but the process would change.
[stextbox id=”section”]Step 1: Check number of columns and rows against expectations [/stextbox]
The first step, as soon as you get any data set, is to check whether it has all the required rows and columns. The number of columns is dictated by the number of hypotheses you have and the variables you need to prove / disprove these hypotheses.
The number of rows, on the other hand, is dictated by the number of events you expect in the chosen period. The easiest benchmark comes from your business understanding.
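As an illustration, here is a minimal pandas sketch of this check. The file name, column names and expected row range are hypothetical placeholders, not values from any real project:

```python
import pandas as pd

# Hypothetical data set and expectations -- adjust to your own project.
df = pd.read_csv("transactions.csv")

expected_columns = {"customer_id", "event_date", "amount"}   # driven by your hypotheses
expected_rows = (90_000, 110_000)                            # e.g. ~100k events per month

missing_cols = expected_columns - set(df.columns)
if missing_cols:
    print(f"Missing columns: {missing_cols}")

n_rows = len(df)
if not (expected_rows[0] <= n_rows <= expected_rows[1]):
    print(f"Row count {n_rows} is outside the expected range {expected_rows}")
```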
[stextbox id=”section”]Step 2: Check for duplicates at id level (and not for entire row) [/stextbox]
Once you are sure all the columns are present and the number of rows looks within the expected range, quickly check for duplicates at the level of your id (or whatever level at which the rows should be unique – it could be a combination of variables).
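Continuing with the hypothetical DataFrame df from the Step 1 sketch, a quick way to do this in pandas (with an illustrative id-level key) could look like:

```python
# Key columns at which rows should be unique -- could be a single id or a combination.
key_cols = ["customer_id", "event_date"]

dupes = df[df.duplicated(subset=key_cols, keep=False)]
print(f"{len(dupes)} rows share an id-level key with another row")
print(dupes.sort_values(key_cols).head())   # inspect a few offending rows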
[stextbox id=”section”]Step 3: Check for blank columns, large % of blank data, high % of same data [/stextbox]
Now that you know all the columns are there and there are no duplicates, check whether any columns are entirely blank. This can happen if a join fails or if there is an error in data extraction. If none of the columns are blank, look at the % of blank cases in each column and at frequency distributions to find out whether the same value is repeated in more cases than expected.
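A rough sketch of this check, again on the hypothetical df and with arbitrary example thresholds:

```python
# Share of blank values per column.
blank_share = df.isna().mean()
print("Entirely blank columns:", blank_share[blank_share == 1.0].index.tolist())
print("Columns with more than 30% blanks:")
print(blank_share[blank_share > 0.30])

# Columns dominated by a single repeated value (95% is an arbitrary example threshold).
for col in df.columns:
    top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
    if top_share > 0.95:
        print(f"{col}: one value accounts for {top_share:.0%} of rows")
```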
[stextbox id=”section”]Step 4: Look at the distribution across various segments – check against business understanding and use pivot tables[/stextbox]
This step continues where Step 3 finishes. Instead of looking at the frequencies of data points individually, look at their distributions. Do you expect a normal, bimodal or uniform distribution? Does the distribution look like what you expect?
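For example, a pivot table and a quick histogram on hypothetical 'segment' and 'amount' columns could look like this (matplotlib is assumed to be available for the plot):

```python
import matplotlib.pyplot as plt

# Distribution of a hypothetical 'amount' column across a hypothetical 'segment' column.
summary = df.pivot_table(index="segment", values="amount",
                         aggfunc=["count", "mean", "median"])
print(summary)

df["amount"].hist(bins=50)   # eyeball the shape: normal, bimodal, uniform?
plt.show()
```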
[stextbox id=”section”]Step 5: Check outliers on all key variables – especially the computed ones[/stextbox]
Once the distributions look fine, check for outliers, especially in cases where you have computed columns. Do the values at the extremes look the way you wanted? Make sure there are no divisions by zero and that you have capped the values you intended to cap.
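As a sketch, here is what such a check might look like for a hypothetical computed column (utilisation = spend / credit_limit); the column names and the cap are illustrative assumptions:

```python
import numpy as np

# Hypothetical computed column: utilisation = spend / credit_limit.
# Replace zero limits with NaN to avoid division by zero.
df["utilisation"] = df["spend"] / df["credit_limit"].replace(0, np.nan)

print(df["utilisation"].describe(percentiles=[0.01, 0.05, 0.95, 0.99]))
print(df.nlargest(5, "utilisation")[["customer_id", "spend", "credit_limit", "utilisation"]])

# Cap values where the business logic requires it.
df["utilisation"] = df["utilisation"].clip(upper=1.0)
```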
[stextbox id=”section”]Step 6: Check if values of a few test cases are in sync[/stextbox]
Once you have checked all the columns individually, check whether they are in sync with each other. Check whether the various dates for a case are in chronological order (e.g. an account should be opened before any transactions appear on it). Do the balances, spend and credit limit look in sync with each other for your credit card customers?
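A couple of such cross-column checks, sketched on hypothetical credit card columns (the column names are assumptions, not from the article):

```python
# Assumes the date columns can be parsed; column names are hypothetical.
open_dt = pd.to_datetime(df["account_open_date"])
txn_dt = pd.to_datetime(df["transaction_date"])

bad_dates = df[txn_dt < open_dt]
print(f"{len(bad_dates)} rows where the transaction predates the account opening")

over_limit = df[df["spend"] > df["credit_limit"]]
print(f"{len(over_limit)} rows where spend exceeds the credit limit")
```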
[stextbox id=”section”]Step 7: Pick up a few rows and check out their values in the underlying systems[/stextbox]
Once all the previous steps are done, it is time to check a few samples by querying the underlying systems or databases. If there was any error in the data, you should ideally have identified it by now. This step just ensures that the data is as it is in the underlying systems.
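Picking the sample itself is simple; for instance, something like the following (column names hypothetical) gives you a handful of rows to look up by id in the source system and compare field by field:

```python
# Pull a small, reproducible sample for manual cross-checking against the source system.
sample = df.sample(n=5, random_state=42)[["customer_id", "event_date", "amount"]]
print(sample)
```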
Please note that some of these errors can also be spotted through the logs provided by your tool. Looking at the logs usually provides a lot of information about errors and warnings.
These are the steps I use to check the accuracy of data, and they usually help me spot any glaring errors. Obviously, they are not the answer to every possible error, but they should give you a good starting point and direction. What do you think about this framework? Are there other frameworks / methods you use to check the accuracy of data? If so, please add them in the comments below.
Hi Kunal, One thing I have realized is that even within the realm of structured thinking, there are different structures possible (say S1 and S2). There obviously has to be some analysis which results in a person deciding that S1 is better/worse than S2. For example, say somebody asks about the number of pizza shops – should the first question be "is it a city" or "does it have more youngsters"? Obviously both of them are important, but which one comes as the first filter is more often than not a victim of personal discretion. What do you think?
Hi Ayush, Very valid question. The way I would look at it is that both of these structures ultimately fall under a single structure. The overall question is what factors influence the location of a pizza shop. There would be one set of hypotheses / variables related to geography (city, state, tier of cities), another set with economic variables of the city (average income, spend, credit, number of jobs, number of students or target market, etc.), another set of variables with the type of markets available (e.g. malls, residential markets), and so on and so forth. Ultimately all these variables would go into a model, which would highlight which ones have higher influence and which do not. I don't see any need to judge whether S1 is better or S2. In fact, a good analyst should never start creating a solution from a biased position. Hope this answers what you had asked. Kunal
Great points, Kunal. Very well explained and written.
Hi Kunal, I am an HR Analyst working for a financial services organization. I want to understand how to create a data repository for a startup organization for a business function, and how to perform process monitoring and performance analytics. Regards, Ram
Hi Ram, Unfortunately, the process is relatively long and complex, and one where decisions should be taken with all stakeholders. There are various parameters which impact the outcome (for example, the size of the data, the granularity of the data, and how long you want to retain the data). Depending on these answers, the outcome would vary. If your data is relatively small and you want to perform simple analysis and reporting, tools like Qlikview and Tableau can be good value for money. Thanks, Kunal