How are the fields of machine learning and statistics related? Are they different fields in their own right? How important is statistics for machine learning? Does the phenomenal rise of machine learning in recent decades indicate some problem in the fundamentals of statistical theory as applied to real-world problems?
There are several blogs and articles that discuss this topic. The answers range from assertions that "machine learning is just a marketing gimmick that repackages age-old statistics" to giving machine learning "a glorified status that has nothing to do with classical statistics". In general, no clear-cut view seems to exist in public opinion, which is unfortunate.
While the topic is wide, in this article I shall point out four basic misconceptions about machine learning with reference to statistics. Here are those misconceptions:
Misconception 1: “Machine learning models are different from Statistical Models”
Misconception 2: “Machine learning is about training a model with a huge population dataset itself while statistics is about making statistical inferences on population based on sample data”
Misconception 3: “An in-depth understanding of statistics is not required for data scientists”
Misconception 4: “Machine learning models learn over time compared to statistical models”
I have read articles on hugely popular data science platforms like medium.com, towardsdatascience.com, machinelearningmastery.com, and other similar forums where the idea that machine learning models are different from statistical models is impressed upon readers. This view is not entirely correct and could also be misleading.
The truth is that they are not different, at least in the case of widely used statistical models.
Take the example of linear regression or classification algorithms. Do these models in machine learning use a different methodology from that used in statistical packages like SAS, SPSS, etc.? The answer is NO. The machine learning models use the same statistical models and require the same assumptions.
For example, simple linear regression using the scikit-learn library uses the same least-squares optimization method that statistical packages use, based on the same underlying assumptions, such as the independence of features.
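To illustrate the point, here is a minimal sketch on synthetic data (the dataset and numbers are only illustrative) showing that scikit-learn's LinearRegression and a plain least-squares solve with NumPy recover the same coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3*x1 - 2*x2 + 5 + noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=200)

# scikit-learn's "machine learning" linear regression
lr = LinearRegression().fit(X, y)

# Classical ordinary least squares via lstsq on the design matrix
X_design = np.column_stack([np.ones(len(X)), X])  # add intercept column
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("sklearn:", lr.intercept_, lr.coef_)
print("lstsq  :", beta[0], beta[1:])
# Both print (approximately) the same intercept and coefficients.
```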
However, an important fact to note is that numerical methods that perform well with huge datasets, like stochastic gradient descent, are also part of machine learning libraries. This is one major difference from traditional statistical packages. If you want to do linear regression with the stochastic gradient descent method, you should use the SGDRegressor or SGDClassifier classes of the scikit-learn library in Python.
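A hedged sketch of what that looks like in scikit-learn (feature scaling is included because SGD is sensitive to it; the synthetic data and hyperparameters are only illustrative):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=len(X))

# Linear regression fitted with stochastic gradient descent
# instead of a closed-form least-squares solve.
sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
sgd.fit(X, y)
print(sgd.score(X, y))  # R^2 should be close to 1 on this synthetic data
```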
Apart from that, it is a misnomer to think that because machine learning models use huge datasets, they must also be using some different kind of solver that requires a huge dataset, as compared to statistical models which require just sample data.
In fact, the startling truth is that machine learning models like linear regression will work perfectly fine with just a sample of the data! Go and try using a training set of just 30% of your data from any public dataset. You will typically get an accuracy within about +/-2% of that obtained with 80% training data.
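As a quick sanity check, here is a hedged sketch (using scikit-learn's California housing dataset as a stand-in for "any public dataset"; it downloads on first use, and your exact numbers will differ) comparing a 30% training split with an 80% training split:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)

for train_size in (0.3, 0.8):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    print(f"train_size={train_size:.0%}  test R^2 = {model.score(X_test, y_test):.3f}")
# For a simple linear model, the two scores are usually very close.
```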
You may ask: if machine learning models can also perform well with a small dataset, what then of the assumptions required for linear regression in statistical models? Statistical models require assumptions such as a linear relationship, multivariate normality, no multicollinearity, no autocorrelation, and homoscedasticity. Are these assumptions not required when machine learning algorithms are used for regression? The truth is that both will give some result if you feed them a dataset while ignoring the assumptions, but the validity of the results depends on whether those assumptions hold. Since both use the same methodology to solve the linear equations, if the assumptions were not required for machine learning, they would not be required for statistical models either!
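These assumptions can be checked regardless of which library fitted the model. Below is a minimal sketch, on synthetic data, of two common diagnostics applied to the residuals of a fitted regression (the interpretations in the comments are rules of thumb, not hard cutoffs):

```python
import numpy as np
from scipy.stats import shapiro
from sklearn.linear_model import LinearRegression
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Shapiro-Wilk: a large p-value is consistent with normally distributed residuals.
stat, p_value = shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)

# Durbin-Watson: values near 2 suggest no first-order autocorrelation.
print("Durbin-Watson statistic:", durbin_watson(residuals))
```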
What about using training and testing datasets in machine learning as against sampling in statistical models? That we separate data into training and test datasets is not a requirement of any of the models involved. If the model works fine with a small sample in a statistical package like SPSS, then it will work the same with a Python machine learning library. You will get the same accuracy in SPSS (or any statistical package) and in a machine learning exercise as long as they use the same solvers.
What is "training" here? If we use a sample of size 30, is it not training? It is just nomenclature. The distinction matters only for some of the machine learning algorithms associated with artificial intelligence, where the larger the dataset, the greater the accuracy. This is not true for classical problems like linear regression, where accuracy does not improve simply because the dataset is larger.
Since we use a small set of samples in statistical models, we cannot be sure of the actual characteristics of the population involved. Hence, we make an inference about the population based on sample statistics, using the central limit theorem. We basically evaluate the statistical validity of the model constructed using sample data.
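As a reminder of what such an inference looks like, here is a minimal sketch (synthetic data, normal approximation; a t-interval would be more appropriate for small samples) of estimating a population mean from a sample with a 95% confidence interval:

```python
import numpy as np

rng = np.random.default_rng(7)
population = rng.normal(loc=50, scale=10, size=1_000_000)  # unknown in practice
sample = rng.choice(population, size=100, replace=False)

mean = sample.mean()
std_err = sample.std(ddof=1) / np.sqrt(len(sample))

# 95% confidence interval using the normal approximation (z = 1.96)
low, high = mean - 1.96 * std_err, mean + 1.96 * std_err
print(f"Sample mean {mean:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
# The interval covers the true population mean (50) in roughly 95% of repeated samples.
```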
Don't we need to evaluate the statistical validity of the results of machine learning models? Suppose we use machine learning to analyze the relationship between two variables using linear regression. We finally get as output some coefficients and an intercept. We test the model with the test dataset and get the same output as with the training dataset. But how relevant are these values? Just because we use the entire population, it does not mean we need not make any inference about the validity of the model. Statistics like the coefficient of determination, p-values, etc. are required to understand the validity of the model, and they do not change with the size of the training dataset: they depend on the nature of the variables, their probability distributions, their interdependence, and so on, and these characteristics remain the same irrespective of the size of the training dataset.
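These validity statistics are readily available from a classical statistics library even when the same regression could have been fitted with a "machine learning" library. A hedged sketch using statsmodels, whose OLS results report R-squared and coefficient p-values:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 4.0 * x + 1.0 + rng.normal(scale=0.5, size=500)

# Ordinary least squares with an explicit intercept term
X_design = sm.add_constant(x)
results = sm.OLS(y, X_design).fit()

print(results.rsquared)   # coefficient of determination
print(results.pvalues)    # p-values for intercept and slope
print(results.summary())  # full diagnostic summary
```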
Let us take another example. Suppose we are dealing with a dataset with numerous features, and we find there is a high possibility of multicollinearity between the independent variables. The regression model will fail if there is multicollinearity, even if you use the entire dataset. In other words, being able to use the entire population dataset in machine learning, as against sample data in statistical packages, does not in any way solve problems like multicollinearity.
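A common way to detect this is the variance inflation factor (VIF); values well above 5-10 are often taken as a warning sign. A minimal sketch using statsmodels (the synthetic data below is deliberately collinear):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=300)
x2 = 0.95 * x1 + rng.normal(scale=0.05, size=300)  # nearly a copy of x1
x3 = rng.normal(size=300)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
# x1 and x2 show very large VIFs, flagging multicollinearity.
```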
We may resort to dimensionality reduction using principal component analysis (PCA). We may want to check whether PCA can be used and, if so, how many components should be retained. This requires knowledge of Bartlett's test of sphericity. We may get the p-value for the chi-square test from Python, but how do we evaluate this figure? If the test says PCA can be used, we need a scree plot, Kaiser's rule, etc., which require not only an understanding of statistical inference methods but also knowledge of linear algebra concepts like SVD (the condition index derived from SVD can suggest the need for PCA) and eigenvectors.
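For example, here is a hedged sketch (Kaiser's rule applied to the eigenvalues obtained from PCA on standardized data; the cutoff of 1 is a heuristic, not a theorem) of deciding how many components to keep:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 6))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=500)  # introduce correlated features
X[:, 4] = X[:, 1] + 0.1 * rng.normal(size=500)

# PCA on standardized data so the eigenvalues refer to the correlation structure
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

eigenvalues = pca.explained_variance_
print("Eigenvalues:", np.round(eigenvalues, 2))
# Kaiser's rule: keep components whose eigenvalue exceeds 1
print("Components to keep:", int(np.sum(eigenvalues > 1)))
# A scree plot of these eigenvalues would show the same "elbow" visually.
```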
Hence, it is necessary for data scientists to also have a sound grounding in statistics. The skill set expected of a data scientist typically includes:
1. Statistics
2. Linear Algebra
3. Advanced Mathematics and Calculus
4. Machine Learning libraries
5. Python or equivalent language
6. SQL, ETL tools and cloud-based ML pipelines knowledge
What about the claim that machine learning models learn over time compared to statistical models? Those in the data science field know well that this is generally not true; it holds only for some of the algorithms dealing with cognitive domains like images, speech, audio, and other visuals. In the case of quantitative models like regression, once the model is built, the results are obtained based on that model. Unless we retrain the model, the model remains the same.
However, since we are supposed to have built the model using the population dataset itself, any new data (the test dataset and the production data) is only expected to mimic the training dataset and to share its characteristics. Hence, we cannot change the model based on the production data, there is no evolution of the machine learning model, and there is really no learning involved. It is a misnomer to say that such machine learning models allow machines to learn over time.
This statement, however, is applicable and true in the case of AI-related algorithms where the performance of the model improves with the size of the training dataset.
Machine learning has several significant advantages. As an independent field, it has evolved tremendously over the years to facilitate real-world problem solving and decision making.
For those problems where the algorithms or methodologies are the same in both the machine learning and statistical models, the major advantage of machine learning libraries over statistical packages is in the application of those models to real-world problems. In the past, statistical packages were only used for predictive analytics and business intelligence on historical transaction data.
Machine learning models, on the other hand, now facilitate applying these models to real-world problems in a real-time production environment. While statistical packages are stand-alone packages used for analysis, applications built with machine learning libraries can be integrated with other business applications to enable real-time, automated decision making, as against just providing analytical insights.
Machine learning libraries in programming languages like Python can also handle all kinds of data, including unstructured data such as images, audio, and video, along with quantitative data, while statistical analytical packages generally deal with historical, structured transactional data.
Some of the widely believed notions and ideas about machine learning are not equally applicable to all types of machine learning models. For example, there are machine learning models that perform better with larger datasets providing higher accuracy while there are other traditional algorithms (which are also used in statistical packages) that can perform well even with small datasets.