Skimpy: Alternative to Pandas describe() for Data Summarization

Ayushi Trivedi Last Updated : 26 Nov, 2024
Data summarization is an essential first step in any data analysis workflow. While Pandas’ describe() function has been a go-to tool for many, by default it summarizes only numeric columns and reports only basic statistics. Enter Skimpy, a Python library designed to offer detailed, visually appealing, and comprehensive data summaries for all column types.

In this article, we’ll explore why Skimpy is a worthy alternative to Pandas describe(). You’ll learn how to install and use Skimpy, explore its features, and compare its output with describe() through examples. By the end, you’ll have a complete understanding of how Skimpy enhances exploratory data analysis (EDA).

Learning Outcomes

  • Understand the limitations of Pandas’ describe() function.
  • Learn how to install and implement Skimpy in Python.
  • Explore Skimpy’s detailed outputs and insights with examples.
  • Compare outputs from Skimpy and Pandas describe().
  • Understand how to integrate Skimpy into your data analysis workflow.

Why Pandas describe() Is Not Enough

The describe() function in Pandas is widely used to summarize data quickly. While it serves as a powerful tool for exploratory data analysis (EDA), its utility is limited in several aspects. Here’s a detailed breakdown of its shortcomings and why users often seek alternatives like Skimpy:

Focus on Numeric Data by Default

For a DataFrame with mixed column types, describe() summarizes only the numeric columns unless you pass the include argument.

Example:

import pandas as pd  

data = {  
    "Name": ["Alice", "Bob", "Charlie", "David"],  
    "Age": [25, 30, 35, 40],  
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],  
    "Salary": [70000, 80000, 120000, 90000],  
}  

df = pd.DataFrame(data)  
print(df.describe())  

Output:

             Age         Salary
count   4.000000       4.000000
mean   32.500000   90000.000000
std     6.454972   21602.468995
min    25.000000   70000.000000
25%    28.750000   77500.000000
50%    32.500000   85000.000000
75%    36.250000   97500.000000
max    40.000000  120000.000000

Key Issue:

Non-numeric columns (Name and City) are ignored unless you explicitly call describe(include='all'). Even then, the output remains limited in scope for non-numeric columns.

Limited Summary for Non-Numeric Data

When non-numeric columns are included using include='all', the summary is minimal. It shows only:

  • Count: Number of non-missing values.
  • Unique: Count of unique values.
  • Top: The most frequently occurring value.
  • Freq: Frequency of the top value.

Example:

print(df.describe(include="all"))  

Output:

         Name        Age      City         Salary
count       4   4.000000         4       4.000000
unique      4        NaN         4            NaN
top     Alice        NaN  New York            NaN
freq        1        NaN         1            NaN
mean      NaN  32.500000       NaN   90000.000000
std       NaN   6.454972       NaN   21602.468995
min       NaN  25.000000       NaN   70000.000000
25%       NaN  28.750000       NaN   77500.000000
50%       NaN  32.500000       NaN   85000.000000
75%       NaN  36.250000       NaN   97500.000000
max       NaN  40.000000       NaN  120000.000000

Key Issues:

  • String columns (Name and City) are summarized using overly basic metrics (e.g., top, freq).
  • No insights into string lengths, patterns, or missing data proportions.

No Information on Missing Data

Pandas’ describe() does not explicitly show the percentage of missing data for each column. Identifying missing data requires separate commands:

print(df.isnull().sum())  
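
The count alone still leaves the percentage to be worked out by hand. A minimal Pandas-only sketch of that extra step is shown here; the sample frame (with a Rating column that has one missing entry) is illustrative, and the helper name missing_report is hypothetical:

```python
import pandas as pd

# Illustrative data: "Rating" has one missing entry out of four rows.
df = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40],
        "Rating": [4.5, None, 4.7, 4.8],
    }
)


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages."""
    return pd.DataFrame(
        {
            "missing": frame.isna().sum(),
            # The mean of a boolean mask is the fraction of True values.
            "missing_pct": frame.isna().mean() * 100,
        }
    )


report = missing_report(df)
print(report)
```

Taking the mean of the boolean mask avoids dividing by len(df) manually and keeps the result aligned with the column names.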

Lack of Advanced Metrics

The default metrics provided by describe() are basic. For numeric data, it shows:

  • Count, mean, and standard deviation.
  • Minimum, maximum, and quartiles (25%, 50%, and 75%).

However, it lacks advanced statistical details such as:

  • Kurtosis and skewness: Indicators of data distribution.
  • Outlier detection: No indication of extreme values beyond typical ranges.
  • Custom aggregations: Limited flexibility for applying user-defined functions.
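
As a rough illustration of what filling that gap by hand looks like, the sketch below uses DataFrame.agg with the names of built-in Pandas reductions. The data mirrors the numeric columns of the earlier example; the variable name shape_stats is our own:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40],
        "Salary": [70000, 80000, 120000, 90000],
    }
)

# DataFrame.agg accepts the names of built-in reductions, so skewness and
# kurtosis can be reported next to the usual describe() statistics.
shape_stats = df.agg(["mean", "median", "skew", "kurt"])
print(shape_stats)
```

With only four rows the skewness and kurtosis values are unstable, so treat them as a demonstration of the call rather than as meaningful statistics.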

Poor Visualization of Data

describe() outputs a plain text summary, which, while functional, is not visually engaging or easy to interpret in some cases. Visualizing trends or distributions requires additional libraries like Matplotlib or Seaborn.

Example: A histogram or boxplot would better represent distributions, but describe() doesn’t provide such visual capabilities.
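
For comparison, here is a rough text-only sketch of the extra work a quick distribution view takes without a summary tool. It assumes NumPy is available (it is installed alongside Pandas), and the choice of five equal-width bins is arbitrary:

```python
import numpy as np
import pandas as pd

salaries = pd.Series([70000, 80000, 120000, 90000], name="Salary")

# Split the value range into 5 equal-width bins and count values per bin.
counts, edges = np.histogram(salaries, bins=5)

# Print one row per bin, with one '#' per value that falls inside it.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:>9,.0f} - {right:>9,.0f} | {'#' * int(count)}")
```

A plotting library such as Matplotlib or Seaborn gives a nicer result, but even this small version needs binning logic that describe() does not provide.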

Getting Started with Skimpy

Skimpy is a Python library designed to simplify and enhance exploratory data analysis (EDA). It provides detailed and concise summaries of your data, handling both numeric and non-numeric columns effectively. Unlike Pandas’ describe(), Skimpy includes advanced metrics, missing data insights, and a cleaner, more intuitive output. This makes it an excellent tool for quickly understanding datasets, identifying data quality issues, and preparing for deeper analysis.

Install Skimpy Using pip:
Run the following command in your terminal or command prompt:

pip install skimpy

Verify the Installation:
After installation, you can verify that Skimpy is installed correctly by importing it in a Python script or Jupyter Notebook:

from skimpy import skim  
print("Skimpy installed successfully!")

Why Skimpy Is Better

Let us look at the main reasons to prefer Skimpy:

Unified Summary for All Data Types

Skimpy treats all data types with equal importance, providing rich summaries for both numeric and non-numeric columns in a single, unified table.

Example:

from skimpy import skim  
import pandas as pd  

data = {  
    "Name": ["Alice", "Bob", "Charlie", "David"],  
    "Age": [25, 30, 35, 40],  
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],  
    "Salary": [70000, 80000, 120000, 90000],  
}  

df = pd.DataFrame(data)  
skim(df)  

Output:

Skimpy generates a concise, well-structured table with information such as:

  • Numeric Data: Count, mean, median, standard deviation, minimum, maximum, and quartiles.
  • Non-Numeric Data: Unique values, most frequent value (mode), missing values, and character count distributions.

[Image: Skimpy output]

Built-In Handling of Missing Data

Skimpy automatically highlights missing data in its summary, showing the percentage and count of missing values for each column. This eliminates the need for additional commands like df.isnull().sum().

Why This Matters:

  • Helps users identify data quality issues upfront.
  • Encourages quick decisions about imputation or removal of missing data.

Advanced Statistical Insights

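Skimpy goes beyond describe() mainly in how it presents numeric columns: each row lists the mean, standard deviation, and the percentiles p0, p25, p50, p75, and p100 next to a small inline histogram, so the shape of a distribution is visible without plotting. As far as we know, dedicated shape statistics are not part of the standard skim() table, so check the output of your installed version and compute the following separately if you need them:

  • Kurtosis: Indicates the “tailedness” of a distribution.
  • Skewness: Measures asymmetry in the data distribution.
  • Outlier Flags: Compare the minimum and maximum against the quartiles to spot potential outliers.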

Rich Summary for Text Columns

For non-numeric data like strings, Skimpy delivers detailed summaries that Pandas describe() cannot match:

  • String Length Distribution: Provides insights into minimum, maximum, and average string lengths.
  • Patterns and Variations: Identifies common patterns in text data.
  • Unique Values and Modes: Gives a clearer picture of text diversity.

Example Output for Text Columns:

Column  Unique Values  Most Frequent Value  Mode Count  Avg Length
Name    4              Alice                1           5.00
City    4              New York             1           8.25
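
To make the string-length figures concrete, the sketch below recomputes them with plain Pandas. It is an illustration of the metrics, not Skimpy's own implementation, and the row labels (unique, min_len, avg_len, max_len) are our own naming:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    }
)

# For each text column: distinct values plus min / mean / max string length.
text_summary = pd.DataFrame(
    {
        col: {
            "unique": df[col].nunique(),
            "min_len": df[col].str.len().min(),
            "avg_len": df[col].str.len().mean(),
            "max_len": df[col].str.len().max(),
        }
        for col in ["Name", "City"]
    }
).T  # transpose so that each text column becomes one row
print(text_summary)
```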

Compact and Intuitive Visuals

Skimpy uses color-coded and tabular outputs that are easier to interpret, especially for large datasets. These visuals highlight:

  • Missing values.
  • Distributions.
  • Summary statistics, all in a single glance.

This visual appeal makes Skimpy’s summaries presentation-ready, which is particularly useful for reporting findings to stakeholders.

Built-In Support for Categorical Variables

Skimpy provides specific metrics for categorical data that Pandas’ describe() does not, such as:

  • Distribution of categories.
  • Frequency and proportions for each category.

This makes Skimpy particularly valuable for datasets involving demographic, geographic, or other categorical variables.
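
For reference, the same category frequencies and proportions can be sketched in plain Pandas with value_counts. The city values below are hypothetical and chosen so that one category clearly dominates:

```python
import pandas as pd

# Hypothetical categorical column for illustration.
cities = pd.Series(
    ["Chicago", "Houston", "Chicago", "New York", "Chicago", "Houston"],
    name="City",
    dtype="category",
)

# Absolute counts and relative shares per category, side by side.
category_summary = pd.DataFrame(
    {
        "count": cities.value_counts(),
        "proportion": cities.value_counts(normalize=True),
    }
)
print(category_summary)
```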

Using Skimpy for Data Summarization

Below, we explore how to use Skimpy effectively for data summarization.

Step 1: Import Skimpy and Prepare Your Dataset

To use Skimpy, you first need to import it alongside your dataset. Skimpy integrates seamlessly with Pandas DataFrames.

Example Dataset:
Let’s work with a simple dataset containing numeric, categorical, and text data.

import pandas as pd
from skimpy import skim

# Sample dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "Salary": [70000, 80000, 120000, 90000],
    "Rating": [4.5, None, 4.7, 4.8],
}

df = pd.DataFrame(data)

Step 2: Apply the skim() Function

The core function of Skimpy is skim(). When applied to a DataFrame, it provides a detailed summary of all columns.

Usage:

skim(df)

[Image: Skimpy output]
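
When a script has to run on machines where Skimpy may not be installed, one possible pattern is to wrap the call in a small helper. The sketch below is an assumption about how you might structure this, not part of Skimpy itself; the function name summarize is hypothetical, and it falls back to describe(include="all") when the import fails:

```python
import pandas as pd

try:
    from skimpy import skim  # optional dependency: pip install skimpy
except ImportError:
    skim = None


def summarize(frame: pd.DataFrame) -> None:
    """Print a Skimpy summary, or fall back to describe() if Skimpy is missing."""
    if skim is not None:
        skim(frame)  # skim() prints its report to the console
    else:
        print(frame.describe(include="all"))


df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [25, 30, 35, 40],
        "Rating": [4.5, None, 4.7, 4.8],
    }
)
summarize(df)
```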

Step 3: Interpret Skimpy’s Summary

Let’s break down what Skimpy’s output means:

Column  Data Type  Missing (%)  Mean   Median  Min    Max     Unique  Most Frequent Value  Mode Count
Name    Text       0.0%         -      -       -      -       4       Alice                1
Age     Numeric    0.0%         32.5   32.5    25     40      -       -                    -
City    Text       0.0%         -      -       -      -       4       New York             1
Salary  Numeric    0.0%         90000  85000   70000  120000  -       -                    -
Rating  Numeric    25.0%        4.67   4.7     4.5    4.8     -       -                    -

  • Missing Values: The “Rating” column has 25% missing values (1 of 4 rows), indicating a potential data quality issue.
  • Numeric Columns: The mean for “Salary” (90000) is higher than its median (85000), which hints at a mild right skew caused by the 120000 value, whereas the mean and median for “Age” are equal, consistent with evenly spaced values.
  • Text Columns: The “City” column has 4 unique values, each occurring once, so the listed value “New York” reflects a tie rather than a dominant category.

Step 4: Focus on Key Insights

Skimpy is particularly useful for identifying:

  • Data Quality Issues:
    • Missing values in columns like “Rating.”
    • Outliers through metrics like min, max, and quartiles.
  • Patterns in Categorical Data:
    • Most frequent categories in columns like “City.”
  • String Length Insights:
    • For text-heavy datasets, Skimpy provides average string lengths, helping in preprocessing tasks like tokenization.
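
Reading outliers off the min, max, and quartiles usually means applying a rule of thumb. A minimal sketch of the common 1.5 × IQR rule follows; the rule is our choice for illustration and is not something the summary applies for you, and the helper name iqr_outliers and the extreme 400000 entry are hypothetical:

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]


# Illustrative salaries with one deliberately extreme entry.
salaries = pd.Series([70000, 80000, 120000, 90000, 400000], name="Salary")
flagged = iqr_outliers(salaries)
print(flagged)
```

Applied to the original four salaries, the same rule flags nothing, because 120000 lies inside the upper fence of 127500.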

Step 5: Customizing Skimpy Output

Skimpy allows some flexibility to adjust its output depending on your needs:

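  • Subset Columns: Analyze only specific columns by passing them as a subset of the DataFrame:
skim(df[["Age", "Salary"]])
  • Focus on Missing Data: skim() prints its report to the console rather than returning a DataFrame, so its result cannot be indexed with .loc. To pull out just the missing-data percentages, compute them with Pandas:
df.isna().mean().mul(100).round(1)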

Advantages of Using Skimpy

  • All-in-One Summary: Skimpy consolidates numeric and non-numeric insights into a single table.
  • Time-Saving: Eliminates the need to write multiple lines of code for exploring different data types.
  • Improved Readability: Clean, visually appealing summaries make it easier to identify trends and outliers.
  • Readable for Wide Datasets: Skimpy groups columns by data type and gives each column a single row, so datasets with numerous columns do not overwhelm the user.

Conclusion

Skimpy simplifies data summarization by offering detailed, human-readable insights into datasets of all types. Unlike Pandas describe(), it doesn’t restrict its focus to numeric data and provides a more enriched summary experience. Whether you’re cleaning data, exploring trends, or preparing reports, Skimpy’s features make it an indispensable tool for data professionals.

Key Takeaways

  • Skimpy handles both numeric and non-numeric columns seamlessly.
  • It provides additional insights, such as missing values and unique counts.
  • The output format is more intuitive and visually appealing than Pandas describe().

Frequently Asked Questions

Q1. What is Skimpy?

A. It is a Python library designed for comprehensive data summarization, offering insights beyond Pandas describe().

Q2. Can Skimpy replace describe()?

A. Yes, it provides enhanced functionality and can effectively replace describe().

Q3. Does Skimpy support large datasets?

A. Yes, it is optimized for handling large datasets efficiently.

Q4. How do I install Skimpy?

A. Install it using pip: pip install skimpy.

Q5. What makes Skimpy better than describe()?

A. It summarizes all data types, includes missing value insights, and presents outputs in a more user-friendly format.

My name is Ayushi Trivedi. I am a B. Tech graduate. I have 3 years of experience working as an educator and content editor. I have worked with various python libraries, like numpy, pandas, seaborn, matplotlib, scikit, imblearn, linear regression and many more. I am also an author. My first book named #turning25 has been published and is available on amazon and flipkart. Here, I am technical content editor at Analytics Vidhya. I feel proud and happy to be AVian. I have a great team to work with. I love building the bridge between the technology and the learner.
