Skimpy: Alternative to Pandas describe() for Data Summarization

Ayushi Trivedi Last Updated : 26 Nov, 2024

6 min read

Data summarization is an essential first step in any data analysis workflow. Pandas’ describe() function has long been a go-to tool, but by default it summarizes only numeric columns and reports only basic statistics. Enter Skimpy, a Python library designed to offer detailed, visually appealing summaries for all column types.

In this article, we’ll explore why Skimpy is a worthy alternative to Pandas describe(). You’ll learn how to install and use Skimpy, explore its features, and compare its output with describe() through examples. By the end, you’ll have a complete understanding of how Skimpy enhances exploratory data analysis (EDA).

Learning Outcomes

Understand the limitations of Pandas’ describe() function.
Learn how to install and implement Skimpy in Python.
Explore Skimpy’s detailed outputs and insights with examples.
Compare outputs from Skimpy and Pandas describe().
Understand how to integrate Skimpy into your data analysis workflow.

Why Pandas describe() is Not Enough?
Getting Started with Skimpy
Why Skimpy is Better?
Using Skimpy for Data Summarization
Advantages of Using Skimpy
Conclusion
Frequently Asked Questions

Why Pandas describe() is Not Enough?

The describe() function in Pandas is widely used to summarize data quickly. While it serves as a powerful tool for exploratory data analysis (EDA), its utility is limited in several aspects. Here’s a detailed breakdown of its shortcomings and why users often seek alternatives like Skimpy:

Focus on Numeric Data by Default

By default, describe() only works on numeric columns unless explicitly configured otherwise.

Example:

import pandas as pd  

data = {  
    "Name": ["Alice", "Bob", "Charlie", "David"],  
    "Age": [25, 30, 35, 40],  
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],  
    "Salary": [70000, 80000, 120000, 90000],  
}  

df = pd.DataFrame(data)  
print(df.describe())

Output:

             Age        Salary  
count   4.000000      4.000000  
mean   32.500000  90000.000000  
std     6.454972  21602.468995
min    25.000000  70000.000000  
25%    28.750000  77500.000000  
50%    32.500000  85000.000000  
75%    36.250000  97500.000000  
max    40.000000 120000.000000

Key Issue:

Non-numeric columns (Name and City) are ignored unless you explicitly call describe(include='all'). Even then, the output remains limited in scope for non-numeric columns.

Limited Summary for Non-Numeric Data

When non-numeric columns are included using include='all', the summary is minimal. It shows only:

Count: Number of non-missing values.
Unique: Count of unique values.
Top: The most frequently occurring value.
Freq: Frequency of the top value.

Example:

print(df.describe(include="all"))

Output:

         Name        Age      City         Salary
count       4   4.000000         4       4.000000
unique      4        NaN         4            NaN
top     Alice        NaN  New York            NaN
freq        1        NaN         1            NaN
mean      NaN  32.500000       NaN   90000.000000
std       NaN   6.454972       NaN   21602.468995
min       NaN  25.000000       NaN   70000.000000
25%       NaN  28.750000       NaN   77500.000000
50%       NaN  32.500000       NaN   85000.000000
75%       NaN  36.250000       NaN   97500.000000
max       NaN  40.000000       NaN  120000.000000

Key Issues:

String columns (Name and City) are summarized using overly basic metrics (e.g., top, freq).
No insights into string lengths, patterns, or missing data proportions.
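To make the gap concrete, the string-length statistics that describe() leaves out can be computed by hand. The following is a minimal sketch using the text columns of the sample data; the names lengths and length_summary are illustrative and not part of any library.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    }
)

# Character count of every value in each text column.
lengths = df[["Name", "City"]].apply(lambda col: col.str.len())

# One row per column: shortest, longest, and average string length.
length_summary = lengths.agg(["min", "max", "mean"]).T
print(length_summary)
```

Writing this for every text column in every project is exactly the kind of repetitive step a summary tool is meant to remove.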

No Information on Missing Data

Pandas’ describe() does not explicitly show the percentage of missing data for each column. Identifying missing data requires separate commands:

print(df.isnull().sum())
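Counts alone do not give the proportion of missing data, so a second step is usually needed. A small sketch of that extra work is shown here, assuming a hypothetical Rating column with one missing entry added to the sample data; missing_report is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40],
        "Rating": [4.5, None, 4.7, 4.8],  # assumed column with one missing value
    }
)

# Combine the absolute count and the percentage of missing values per column.
missing_report = pd.DataFrame(
    {
        "missing_count": df.isna().sum(),
        "missing_pct": df.isna().mean() * 100,
    }
)
print(missing_report)
```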

Lack of Advanced Metrics

The default metrics provided by describe() are basic. For numeric data, it shows:

Count, mean, and standard deviation.
Minimum, maximum, and quartiles (25%, 50%, and 75%).

However, it lacks advanced statistical details such as:

Kurtosis and skewness: Indicators of data distribution.
Outlier detection: No indication of extreme values beyond typical ranges.
Custom aggregations: Limited flexibility for applying user-defined functions.
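These metrics are available in pandas, but each needs its own call. The sketch below computes skewness and kurtosis and applies one common outlier rule (values more than 1.5 times the interquartile range beyond the quartiles). The rule and the variable names are assumptions chosen for illustration, not something describe() or this article prescribes.

```python
import pandas as pd

numeric = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40],
        "Salary": [70000, 80000, 120000, 90000],
    }
)

# Shape statistics that describe() does not report.
shape_stats = pd.DataFrame(
    {"skewness": numeric.skew(), "kurtosis": numeric.kurt()}
)

# Interquartile-range rule: flag values far outside the middle 50%.
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
is_outlier = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()

print(shape_stats)
print(outlier_counts)
```

With only four rows these statistics are not very meaningful; the point is the amount of extra code a fuller summary requires.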

Poor Visualization of Data

describe() outputs a plain text summary, which, while functional, is not visually engaging or easy to interpret in some cases. Visualizing trends or distributions requires additional libraries like Matplotlib or Seaborn.

Example: A histogram or boxplot would better represent distributions, but describe() doesn’t provide such visual capabilities.

Getting Started with Skimpy

Skimpy is a Python library designed to simplify and enhance exploratory data analysis (EDA). It provides detailed and concise summaries of your data, handling both numeric and non-numeric columns effectively. Unlike Pandas’ describe(), Skimpy includes advanced metrics, missing data insights, and a cleaner, more intuitive output. This makes it an excellent tool for quickly understanding datasets, identifying data quality issues, and preparing for deeper analysis.

Install Skimpy Using pip:
Run the following command in your terminal or command prompt:

pip install skimpy

Verify the Installation:
After installation, you can verify that Skimpy is installed correctly by importing it in a Python script or Jupyter Notebook:

from skimpy import skim  
print("Skimpy installed successfully!")
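If you also want to know which version was installed, the standard library can report it without importing the package. This is a minimal sketch; installed_version is a hypothetical helper name, and it returns None when the package is not found.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("skimpy version:", installed_version("skimpy"))
```

Knowing the version helps when comparing your output with examples, since the report layout can change between releases.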

Why Skimpy is Better?

The following features explain why Skimpy is often a better fit than describe() for a first look at a dataset:

Unified Summary for All Data Types

Skimpy treats all data types with equal importance, providing rich summaries for both numeric and non-numeric columns in a single, unified table.

Example:

from skimpy import skim  
import pandas as pd  

data = {  
    "Name": ["Alice", "Bob", "Charlie", "David"],  
    "Age": [25, 30, 35, 40],  
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],  
    "Salary": [70000, 80000, 120000, 90000],  
}  

df = pd.DataFrame(data)  
skim(df)

Output:

Skimpy prints a concise, well-structured report with one table per data type, including:

Numeric Data: Missing-value count and percentage, mean, standard deviation, percentiles (minimum, quartiles, median, maximum), and a small inline histogram.
Non-Numeric Data: Missing-value count and percentage, plus type-specific statistics such as word counts for string columns.

Built-In Handling of Missing Data

Skimpy automatically highlights missing data in its summary, showing the percentage and count of missing values for each column. This eliminates the need for additional commands like df.isnull().sum().

Why This Matters:

Helps users identify data quality issues upfront.
Encourages quick decisions about imputation or removal of missing data.

Advanced Statistical Insights

Skimpy’s numeric summary goes a step beyond the describe() table while staying compact:

Percentiles: p0, p25, p50, p75, and p100, covering the minimum, quartiles, and maximum.
Inline Histogram: A small text histogram per column that makes skew and heavy tails visible at a glance.
Skewness and Kurtosis: Not printed in the report; compute them with pandas’ skew() and kurt() when you need exact values.

Rich Summary for Text Columns

For string columns, Skimpy reports text-oriented statistics that Pandas describe() does not include:

Word Counts: Words per row and total words in the column.
String Lengths: Depending on the installed version, character counts per row and the shortest and longest values.
Missing Values: Count and percentage of missing entries, as for every other column type.

Illustrative Summary for Text Columns (computed by hand from the sample data; Skimpy’s own column names differ):

Column	Unique Values	Most Frequent Value	Mode Count	Avg Length
Name	4	Alice	1	5.00
City	4	New York	1	8.25

Compact and Intuitive Visuals

Skimpy uses color-coded and tabular outputs that are easier to interpret, especially for large datasets. These visuals highlight:

Missing values.
Distributions.
Summary statistics, all in a single glance.

This visual appeal makes Skimpy’s summaries presentation-ready, which is particularly useful for reporting findings to stakeholders.

Built-In Support for Categorical Variables

For columns stored with the pandas category dtype, Skimpy reports details in a dedicated section, such as:

Whether the categories are ordered.
The number of unique categories, alongside missing-value counts.

Per-category frequencies and proportions still come from pandas’ value_counts(). Even so, the dedicated section makes Skimpy useful for datasets involving demographic, geographic, or other categorical variables.
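As a point of comparison, a frequency-and-proportion table for a single categorical column can be built directly in pandas. The sketch below uses a hypothetical City column with repeated values (the article’s sample data has no repeats), and category_table is an illustrative name.

```python
import pandas as pd

# Hypothetical sample with repeated categories.
city = pd.Series(
    ["Chicago", "Chicago", "Houston", "Chicago", "New York", "Houston"],
    name="City",
)

# Count each category, then convert the counts to shares of the total.
counts = city.value_counts()
category_table = pd.DataFrame(
    {"count": counts, "share": counts / counts.sum()}
)
print(category_table)
```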

Using Skimpy for Data Summarization

Below, we explore how to use Skimpy effectively for data summarization.

Step 1: Import Skimpy and Prepare Your Dataset

To use Skimpy, you first need to import it alongside your dataset. Skimpy integrates seamlessly with Pandas DataFrames.

Example Dataset:
Let’s work with a simple dataset containing numeric, categorical, and text data.

import pandas as pd
from skimpy import skim

# Sample dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    "Salary": [70000, 80000, 120000, 90000],
    "Rating": [4.5, None, 4.7, 4.8],
}

df = pd.DataFrame(data)

Step 2: Apply the skim() Function

The core function of Skimpy is skim(). When applied to a DataFrame, it provides a detailed summary of all columns.

Usage:

skim(df)

Step 3: Interpret Skimpy’s Summary

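The table below condenses the key figures for this dataset into a single view; Skimpy’s printed report groups columns by data type, so its exact layout differs: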

Column	Data Type	Missing (%)	Mean	Median	Min	Max	Unique	Most Frequent Value	Mode Count
Name	Text	0.0%	—	—	—	—	4	Alice	1
Age	Numeric	0.0%	32.5	32.5	25	40	—	—	—
City	Text	0.0%	—	—	—	—	4	New York	1
Salary	Numeric	0.0%	90000	85000	70000	120000	—	—	—
Rating	Numeric	25.0%	4.67	4.7	4.5	4.8	—	—	—

Missing Values: The “Rating” column has 25% missing values, indicating potential data quality issues.
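Numeric Columns: The mean “Salary” (90,000) sits above the median (85,000), hinting at a mild right skew caused by the 120,000 value, whereas “Age” has an identical mean and median (32.5) and is evenly spaced across its range.
Text Columns: The “City” column has 4 unique values, each appearing once, so no city dominates; “New York” is listed as most frequent only because it appears first.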
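The numeric figures in the table can be cross-checked with a few lines of pandas. This is a minimal sketch under the assumption that only the three numeric columns are of interest; check is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [25, 30, 35, 40],
        "Salary": [70000, 80000, 120000, 90000],
        "Rating": [4.5, None, 4.7, 4.8],
    }
)

# One row per column; missing values are skipped by these aggregations.
check = df.agg(["mean", "median", "min", "max"]).T
check["missing_pct"] = df.isna().mean() * 100
print(check.round(2))
```

Comparing the mean and median columns side by side is a quick way to spot skew before plotting anything.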

Step 4: Focus on Key Insights

Skimpy is particularly useful for identifying:

Data Quality Issues:
- Missing values in columns like “Rating.”
- Outliers through metrics like min, max, and quartiles.
Patterns in Categorical Data:
- Most frequent categories in columns like “City.”
String Length Insights:
- For text-heavy datasets, Skimpy provides average string lengths, helping in preprocessing tasks like tokenization.

Step 5: Customizing Skimpy Output

Skimpy allows some flexibility to adjust its output depending on your needs:

Subset Columns: Analyze only specific columns by passing them as a subset of the DataFrame:

skim(df[["Age", "Salary"]])

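Focus on Missing Data: skim() prints its report and does not return a DataFrame, so its result cannot be sliced with .loc. To concentrate on missing data, pass only the columns that contain missing values:

skim(df.loc[:, df.isna().any()])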
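Column subsets do not have to be typed out by name. The sketch below selects columns by data type with pandas’ select_dtypes() and skims each group separately. The import is guarded so the selection logic still runs when Skimpy is not installed; numeric_only and text_only are illustrative names.

```python
import pandas as pd

try:
    from skimpy import skim
except ImportError:  # Skimpy is optional for this sketch
    skim = None

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [25, 30, 35, 40],
        "City": ["New York", "Los Angeles", "Chicago", "Houston"],
        "Salary": [70000, 80000, 120000, 90000],
        "Rating": [4.5, None, 4.7, 4.8],
    }
)

# Split the DataFrame by data type instead of listing columns manually.
numeric_only = df.select_dtypes(include="number")
text_only = df.select_dtypes(exclude="number")

for label, subset in [("numeric", numeric_only), ("text", text_only)]:
    print(label, list(subset.columns))
    if skim is not None:
        skim(subset)
```

Selecting by type scales better than hard-coded column lists when a dataset has dozens of columns.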

Advantages of Using Skimpy

All-in-One Summary: Skimpy consolidates numeric and non-numeric insights into a single table.
Time-Saving: Eliminates the need to write multiple lines of code for exploring different data types.
Improved Readability: Clean, visually appealing summaries make it easier to identify trends and outliers.
Efficient for Large Datasets: Skimpy is optimized to handle datasets with numerous columns without overwhelming the user.

Conclusion

Skimpy simplifies data summarization by offering detailed, human-readable insights into datasets of all types. Unlike Pandas describe(), it doesn’t restrict its focus to numeric data and provides a more enriched summary experience. Whether you’re cleaning data, exploring trends, or preparing reports, Skimpy’s features make it an indispensable tool for data professionals.

Key Takeaways

Skimpy handles both numeric and non-numeric columns seamlessly.
It provides additional insights, such as missing values and unique counts.
The output format is more intuitive and visually appealing than Pandas describe().

Frequently Asked Questions

Q1. What is Skimpy?

A. It is a Python library designed for comprehensive data summarization, offering insights beyond Pandas describe().

Q2. Can Skimpy replace describe()?

A. Yes, it provides enhanced functionality and can effectively replace describe().

Q3. Does Skimpy support large datasets?

A. Yes, it is optimized for handling large datasets efficiently.

Q4. How do I install Skimpy?

A. Install it using pip: pip install skimpy.

Q5. What makes Skimpy better than describe()?

A. It summarizes all data types, includes missing value insights, and presents outputs in a more user-friendly format.

Ayushi Trivedi

My name is Ayushi Trivedi. I am a B. Tech graduate. I have 3 years of experience working as an educator and content editor. I have worked with various python libraries, like numpy, pandas, seaborn, matplotlib, scikit, imblearn, linear regression and many more. I am also an author. My first book named #turning25 has been published and is available on amazon and flipkart. Here, I am technical content editor at Analytics Vidhya. I feel proud and happy to be AVian. I have a great team to work with. I love building the bridge between the technology and the learner.
