Data summarization is an essential first step in any data analysis workflow. While Pandas’ describe() function has been a go-to tool for many, it covers only numeric columns by default and provides only basic statistics. Enter Skimpy, a Python library designed to offer detailed, visually appealing, and comprehensive data summaries for all column types.
In this article, we’ll explore why Skimpy is a worthy alternative to Pandas describe(). You’ll learn how to install and use Skimpy, explore its features, and compare its output with describe() through examples. By the end, you’ll have a complete understanding of how Skimpy enhances exploratory data analysis (EDA).
The describe() function in Pandas is widely used to summarize data quickly. While it serves as a powerful tool for exploratory data analysis (EDA), its utility is limited in several aspects. Here’s a detailed breakdown of its shortcomings and why users often seek alternatives like Skimpy:
By default, describe() summarizes only numeric columns; other column types must be requested explicitly.
Example:
import pandas as pd
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "Salary": [70000, 80000, 120000, 90000],
}
df = pd.DataFrame(data)
print(df.describe())
Output:
             Age         Salary
count   4.000000       4.000000
mean   32.500000   90000.000000
std     6.454972   21602.468995
min    25.000000   70000.000000
25%    28.750000   77500.000000
50%    32.500000   85000.000000
75%    36.250000   97500.000000
max    40.000000  120000.000000
Key Issue:
Non-numeric columns (Name and City) are ignored unless you explicitly call describe(include='all'). Even then, the output remains limited in scope for non-numeric columns.
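When non-numeric columns are included using include='all', the summary is minimal. It shows only the count, the number of unique values, the most frequent value (top), and that value’s frequency (freq).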
Example:
print(df.describe(include="all"))
Output:
         Name        Age      City         Salary
count       4   4.000000         4       4.000000
unique      4        NaN         4            NaN
top     Alice        NaN  New York            NaN
freq        1        NaN         1            NaN
mean      NaN  32.500000       NaN   90000.000000
std       NaN   6.454972       NaN   21602.468995
min       NaN  25.000000       NaN   70000.000000
25%       NaN  28.750000       NaN   77500.000000
50%       NaN  32.500000       NaN   85000.000000
75%       NaN  36.250000       NaN   97500.000000
max       NaN  40.000000       NaN  120000.000000
Key Issue:
Non-numeric columns (Name and City) are summarized using overly basic metrics (e.g., top, freq).
Pandas’ describe() does not explicitly show the percentage of missing data for each column. Identifying missing data requires separate commands:
print(df.isnull().sum())
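To see both the count and the percentage of missing values in one place, the two can be combined into a small table. This is a minimal Pandas sketch, not part of describe() or Skimpy; the Rating column with a missing entry and the names missing_count and missing_pct are illustrative assumptions.

```python
import pandas as pd

# Illustrative data: "Rating" has one missing value out of four rows.
df = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40],
        "Rating": [4.5, None, 4.7, 4.8],
    }
)

# isnull().sum() counts missing cells per column;
# isnull().mean() gives the missing fraction, scaled here to a percentage.
missing = pd.DataFrame(
    {
        "missing_count": df.isnull().sum(),
        "missing_pct": df.isnull().mean() * 100,
    }
)
print(missing)
```

This works, but it is one more step to remember on every new dataset.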
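The default metrics provided by describe() are basic. For numeric data, it shows the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum.
However, it lacks more advanced statistical details such as skewness and kurtosis.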
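Skewness and kurtosis are two shape statistics that describe() does not report. As a rough sketch, they can be computed separately with Pandas’ own skew() and kurt() methods; the name extra is just an illustrative label for the combined result.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40],
        "Salary": [70000, 80000, 120000, 90000],
    }
)

# Shape statistics that describe() leaves out.
# skew() measures asymmetry; kurt() measures tail weight (excess kurtosis).
extra = pd.DataFrame(
    {
        "skew": df.skew(),
        "kurtosis": df.kurt(),
    }
)
print(extra)
```

Age is evenly spaced, so its skewness is zero, while the single high salary pulls the Salary distribution to the right.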
describe() outputs a plain text summary, which, while functional, is not visually engaging or easy to interpret in some cases. Visualizing trends or distributions requires additional libraries like Matplotlib or Seaborn.
Example: A histogram or boxplot would better represent distributions, but describe() doesn’t provide such visual capabilities.
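Short of reaching for a plotting library, a rough text histogram gives a quick feel for a distribution. The sketch below uses NumPy’s histogram() function; the choice of three bins and the bar format are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd

salaries = pd.Series([70000, 80000, 120000, 90000], name="Salary")

# Bucket the values into three equal-width bins between min and max.
counts, edges = np.histogram(salaries, bins=3)

# Print one bar per bin, with one '#' per observation.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:>9,.0f} - {right:>9,.0f} | {'#' * int(count)}")
```

Even this crude view shows two salaries clustered at the low end, which the quartiles in describe() only hint at.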
Skimpy is a Python library designed to simplify and enhance exploratory data analysis (EDA). It provides detailed and concise summaries of your data, handling both numeric and non-numeric columns effectively. Unlike Pandas’ describe(), Skimpy includes advanced metrics, missing data insights, and a cleaner, more intuitive output. This makes it an excellent tool for quickly understanding datasets, identifying data quality issues, and preparing for deeper analysis.
Install Skimpy Using pip:
Run the following command in your terminal or command prompt:
pip install skimpy
Verify the Installation:
After installation, you can verify that Skimpy is installed correctly by importing it in a Python script or Jupyter Notebook:
from skimpy import skim
print("Skimpy installed successfully!")
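If you also want to know which version was installed, the standard library’s importlib.metadata can report it. This is a small sketch; the helper name installed_version is hypothetical and not part of Skimpy.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or None when Skimpy is not installed.
print("skimpy version:", installed_version("skimpy"))
```

Returning None instead of raising keeps the check usable at the top of a notebook without a try/except at every call site.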
Let us now look in detail at why Skimpy is often the better choice:
Skimpy treats all data types with equal importance, providing rich summaries for both numeric and non-numeric columns in a single report, with the columns grouped by data type.
Example:
from skimpy import skim
import pandas as pd
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "Salary": [70000, 80000, 120000, 90000],
}
df = pd.DataFrame(data)
skim(df)
Output:
Skimpy generates a concise, well-structured table with information such as:
Skimpy automatically highlights missing data in its summary, showing the percentage and count of missing values for each column. This eliminates the need for additional commands like df.isnull().sum().
Why This Matters:
Skimpy goes beyond basic descriptive statistics by including additional metrics that provide deeper insights:
For non-numeric data like strings, Skimpy delivers detailed summaries that Pandas describe() cannot match:
Example Output for Text Columns:
Column | Unique Values | Most Frequent Value | Mode Count | Avg Length |
---|---|---|---|---|
Name | 4 | Alice | 1 | 5.00 |
City | 4 | New York | 1 | 8.25 |
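Text metrics of this kind can be cross-checked by hand with Pandas, which is a useful sanity check on any summary. The sketch below is not Skimpy’s implementation; the column labels in text_summary are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    }
)

text_summary = pd.DataFrame(
    {
        # Number of distinct values per column.
        "unique": df.nunique(),
        # Most frequent value; when every value is unique, which one is
        # reported is arbitrary.
        "most_frequent": df.apply(lambda col: col.value_counts().index[0]),
        # How often the most frequent value occurs.
        "mode_count": df.apply(lambda col: col.value_counts().iloc[0]),
        # Average string length in characters (spaces included).
        "avg_length": df.apply(lambda col: col.str.len().mean()),
    }
)
print(text_summary)
```

For this data the average lengths come out to 5.0 characters for Name and 8.25 for City.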
Skimpy uses color-coded and tabular outputs that are easier to interpret, especially for large datasets. These visuals highlight:
This visual appeal makes Skimpy’s summaries presentation-ready, which is particularly useful for reporting findings to stakeholders.
Skimpy provides specific metrics for categorical data that Pandas’ describe() does not, such as:
This makes Skimpy particularly valuable for datasets involving demographic, geographic, or other categorical variables.
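To make the idea of categorical metrics concrete, here is a minimal Pandas sketch of a frequency breakdown for a categorical column. It illustrates the kind of information involved rather than Skimpy’s actual output; the data with a repeated city and the names count and share_pct are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data in which one category repeats.
cities = pd.Series(
    ["New York", "Chicago", "New York", "Houston"],
    name="City",
    dtype="category",
)

# Occurrences of each category, most frequent first.
counts = cities.value_counts()

freq = pd.DataFrame(
    {
        "count": counts,
        # Share of all rows that fall in each category.
        "share_pct": counts / counts.sum() * 100,
    }
)
print(freq)
```

A breakdown like this quickly reveals dominant or rare categories that a single top/freq pair hides.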
Below, we explore how to use Skimpy effectively for data summarization.
To use Skimpy, you first need to import it alongside your dataset. Skimpy integrates seamlessly with Pandas DataFrames.
Example Dataset:
Let’s work with a simple dataset containing numeric, categorical, and text data.
import pandas as pd
from skimpy import skim
# Sample dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "Salary": [70000, 80000, 120000, 90000],
    "Rating": [4.5, None, 4.7, 4.8],
}
df = pd.DataFrame(data)
The core function of Skimpy is skim(). When applied to a DataFrame, it provides a detailed summary of all columns.
Usage:
skim(df)
Let’s break down what Skimpy’s output means:
Column | Data Type | Missing (%) | Mean | Median | Min | Max | Unique | Most Frequent Value | Mode Count |
---|---|---|---|---|---|---|---|---|---|
Name | Text | 0.0% | — | — | — | — | 4 | Alice | 1 |
Age | Numeric | 0.0% | 32.5 | 32.5 | 25 | 40 | — | — | — |
City | Text | 0.0% | — | — | — | — | 4 | New York | 1 |
Salary | Numeric | 0.0% | 90000 | 85000 | 70000 | 120000 | — | — | — |
Rating | Numeric | 25.0% | 4.67 | 4.7 | 4.5 | 4.8 | — | — | — |
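The numeric rows in the table can be reproduced with plain Pandas, which helps when you want to double-check a figure. This is a sketch of such a cross-check, not something Skimpy requires; the variable name check is arbitrary.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40],
        "Salary": [70000, 80000, 120000, 90000],
        "Rating": [4.5, None, 4.7, 4.8],
    }
)

# One row per column, one column per statistic; missing values are skipped.
check = df.agg(["mean", "median", "min", "max"]).T
print(check.round(2))
```

Note that the Rating statistics are computed over the three available values only, since the missing entry is ignored.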
Skimpy is particularly useful for identifying:
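Skimpy allows some flexibility to adjust its output depending on your needs. To summarize only selected columns, pass a subset of the DataFrame:
skim(df[["Age", "Salary"]])
Note that skim() prints its report and returns None, so its result cannot be sliced like a DataFrame. To focus on missing data alone, compute the percentages with Pandas:
print(df.isnull().mean().mul(100).round(1))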
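Another way to narrow the report is to select columns by data type before calling skim(). The sketch below relies on Pandas’ select_dtypes(); the guarded import is only there so the example still runs where Skimpy is not installed.

```python
import pandas as pd

try:
    from skimpy import skim
except ImportError:  # keep the sketch runnable without Skimpy installed
    skim = None

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [25, 30, 35, 40],
        "City": ["New York", "Los Angeles", "Chicago", "Houston"],
        "Salary": [70000, 80000, 120000, 90000],
        "Rating": [4.5, None, 4.7, 4.8],
    }
)

# Split the DataFrame by data type.
numeric_df = df.select_dtypes(include="number")
text_df = df.select_dtypes(exclude="number")

# Summarize only the numeric columns.
if skim is not None:
    skim(numeric_df)
```

Selecting by type scales better than listing column names when a dataset has dozens of columns.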
Skimpy simplifies data summarization by offering detailed, human-readable insights into datasets of all types. Unlike Pandas describe(), it doesn’t restrict its focus to numeric data and provides a more enriched summary experience. Whether you’re cleaning data, exploring trends, or preparing reports, Skimpy’s features make it an indispensable tool for data professionals.
Q. What is Skimpy?
A. It is a Python library designed for comprehensive data summarization, offering insights beyond Pandas describe().
Q. Can Skimpy replace describe()?
A. Yes, it provides enhanced functionality and can effectively replace describe().
Q. Does Skimpy work with large datasets?
A. Yes, it is optimized for handling large datasets efficiently.
Q. How do I install Skimpy?
A. Install it using pip: pip install skimpy.
Q. What makes Skimpy better than describe()?
A. It summarizes all data types, includes missing value insights, and presents outputs in a more user-friendly format.