Using Predictive Power Score to Pinpoint Non-linear Correlations

Guest Blog · Last Updated: 24 Oct, 2024
11 min read
  • Is there a score that tells us whether there is any relationship between two columns, no matter whether the relationship is linear, non-linear, Gaussian, or something else entirely?
  • The score should be asymmetric, because I want to detect all the one-directional relationships between two variables: column A may predict column B well even when B tells us little about A.
  • The score should be 0 if there is no relationship and 1 if there is a perfect relationship.
  • The score should support the usual workflow: answer the question "Are there relationships between the columns?" with a matrix, then make a scatter plot of the promising pairs to inspect the relationship directly.
  • And, as the icing on the cake, the score should handle both categorical and numerical columns out of the box.
!pip3 install ppscore

 

Calculating the Predictive Power Score

  • When the target is numeric, we can fit a regression decision tree and calculate the Mean Absolute Error (MAE).
  • When the target is categorical, we can fit a classification decision tree and calculate the weighted F1 score.
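The numeric-target recipe can be sketched by hand. The following is a simplified illustration only (the ppscore library itself adds cross-validated evaluation, sampling, and edge-case handling; the median baseline and the random seed here are assumptions for the demo):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic quadratic relationship, as in the article's example.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 5000)
y = x ** 2 + rng.uniform(-0.5, 0.5, 5000)

x_train, x_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, random_state=0
)

# Model error: a regression decision tree predicting y from x.
model = DecisionTreeRegressor().fit(x_train, y_train)
model_mae = mean_absolute_error(y_test, model.predict(x_test))

# Naive baseline error: always predict the median of the target.
baseline_mae = mean_absolute_error(
    y_test, np.full_like(y_test, np.median(y_train))
)

# PPS = 1 - model error / baseline error, floored at 0.
pps = max(0.0, 1 - model_mae / baseline_mae)
print(round(pps, 2))
```

On data like this the score should land well above 0.5, close to what the library reports for the same setup later in the post.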

Predictive Power Score VS Correlation

import pandas as pd
import numpy as np
import ppscore as pps
df = pd.DataFrame()
df
df["x"] = np.random.uniform(-2, 2, 10000)
df.head()
df["error"] = np.random.uniform(-0.5, 0.5, 10000)
df.head()
df["y"] = df["x"] * df["x"] + df["error"]
df.head()
  • 1. On a pair of columns:
df["x"].corr(df["y"])
-0.0115046561021449
  • 2. On the whole DataFrame:
df.corr()
pps.score(df, "x", "y")
{'x': 'x',
 'y': 'y',
 'ppscore': 0.675090383548477,
 'case': 'regression',
 'is_valid_score': True,
 'metric': 'mean absolute error',
 'baseline_score': 1.025540102508908,
 'model_score': 0.33320784136182485,
 'model': DecisionTreeRegressor()}
pps.score(df, "y", "x")
{'x': 'y',
 'y': 'x',
 'ppscore': 0,
 'case': 'regression',
 'is_valid_score': True,
 'metric': 'mean absolute error',
 'baseline_score': 1.0083196087945172,
 'model_score': 1.1336852173737795,
 'model': DecisionTreeRegressor()}
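The asymmetry is visible directly in the two result dicts. For the regression case, the reported `ppscore` can be reproduced from the two error figures in the dict: it is 1 minus the ratio of the model MAE to the baseline MAE, floored at 0:

```python
# Values copied from the pps.score(df, "x", "y") output above.
baseline_score = 1.025540102508908
model_score = 0.33320784136182485

# PPS for x -> y: the tree beats the naive baseline by a wide margin.
pps_xy = max(0.0, 1 - model_score / baseline_score)
print(round(pps_xy, 6))  # 0.67509, matching the dict above

# PPS for y -> x: the tree is *worse* than the baseline
# (model MAE 1.1337 vs baseline MAE 1.0083), so the floor
# kicks in and the score is exactly 0.
pps_yx = max(0.0, 1 - 1.1336852173737795 / 1.0083196087945172)
print(pps_yx)  # 0.0
```

This is why y cannot predict x here: for a given y, there are two equally plausible values of x (positive and negative), so the tree does no better than always guessing the median.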
pps.predictors(df, "y")
pps.predictors(df, "x")
pps.matrix(df)

Analyzing & Visualizing Results

import seaborn as sns
predictors_df = pps.predictors(df, y="y")
sns.barplot(data=predictors_df, x="x", y="ppscore")
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
matrix_df
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

Example with Categorical Features

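A minimal sketch of the classification branch with synthetic data. The three-class labels and thresholds below are invented for illustration, and the normalization shown is one natural way to rescale a weighted F1 score against a most-frequent-class baseline (the library's exact handling may differ in details):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical data: a numeric feature that determines a categorical label.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 5000)
label = np.where(x < -1, "low", np.where(x < 1, "mid", "high"))

x_train, x_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), label, random_state=0
)

# Model score: a classification decision tree, evaluated with weighted F1.
model = DecisionTreeClassifier().fit(x_train, y_train)
model_f1 = f1_score(y_test, model.predict(x_test), average="weighted")

# Baseline score: always predict the most frequent class.
most_frequent = pd.Series(y_train).mode()[0]
baseline_f1 = f1_score(
    y_test, np.full(len(y_test), most_frequent), average="weighted"
)

# Rescale so 0 means "no better than the baseline" and 1 means perfect.
pps = max(0.0, (model_f1 - baseline_f1) / (1 - baseline_f1))
print(round(pps, 2))
```

Because the label is a deterministic function of x, the tree recovers the thresholds almost perfectly and the score comes out close to 1.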


Limitations

  • The calculation is slower than the correlation (matrix).
  • The score cannot be interpreted as easily as the correlation because it does not tell you anything about the type of relationship that was found. Therefore, the PPS is better at finding patterns but the correlation is better at communicating the linear relationships found.
  • You cannot compare the scores of different target variables in a strictly mathematical way because they are calculated using different evaluation metrics. Scores are still valuable in the real world, but this must be kept in mind.
  • There are limitations in the components used under the hood, such as the decision tree models and the scoring metrics.

 

Conclusions

  • In addition to your usual feature selection mechanism, you can use the PPS to find good predictors for your target column.
  • You can also remove features that only add random noise; such features sometimes still score high on feature importance metrics.
  • You can remove features that can be predicted by other features, because they do not add new information.
  • You can identify pairs of mutually predictive features in the PPS matrix; this includes strongly correlated features but will also surface non-linear relationships.
  • Detect leakage: Use the PPS matrix to detect leakage between variables — even if the leakage is mediated by other variables.
  • Data normalization: Find entity structures in the data by interpreting the PPS matrix as a directed graph. This can be surprising when the data contain latent structures that were previously unknown. For example, the TicketID in the Titanic data set is often a flag for a family.
