The STAR schema is an efficient database design used in data warehousing and business intelligence. It organizes data into a central fact table linked to surrounding dimension tables. This star-like structure simplifies complex queries, enhances performance, and is ideal for large datasets requiring fast retrieval and simplified joins.
A major advantage of the STAR schema is its ability to minimize the number of query joins, improving readability and performance, especially for data aggregation and reporting. Its straightforward design supports rapid data summarization, which is essential for generating business insights.
The STAR schema also offers scalability, allowing new dimension tables to be added without disrupting the existing structure. This supports ongoing growth and adaptability. Separating fact and dimension tables minimizes data redundancy and maintains consistency.
In this blog, we’ll explore the STAR schema, demonstrate its setup for optimal query performance with simulated data, and compare it with the Snowflake schema, offering a streamlined approach to data management and analysis.
This article was published as a part of the Data Science Blogathon.
The STAR schema is a database schema consisting of a central fact table surrounded by dimension tables. Fact tables store measurable, quantitative data, such as sales transactions and customer orders. In contrast, dimension tables store descriptive attributes, such as customer details, product categories, and time information.
A STAR has a structure that resembles a star and is created by connecting the fact and dimension tables using foreign keys. This design is highly optimized for read-heavy operations, especially in reporting and analytical environments.
This structure enables fast querying by simplifying the joins between tables and reducing complexity when extracting insights from data.
Also read: A Brief Introduction to the Concept of Data Warehouse
To illustrate how the STAR schema works, we’ll generate a simulated dataset representing customer orders in an online store. This data will populate our fact and dimension tables.
We’ll create a simulated customer dataset, including key information such as their ID, name, location, and membership type. The Customer Data dimension table details each customer and allows us to link orders to specific customers to analyze customer behavior, preferences, and demographic trends.
import pandas as pd
import numpy as np
def generate_customer_data(n_customers=1000):
np.random.seed(42)
customer_ids = np.arange(1, n_customers + 1)
first_names = np.random.choice(['Thato', 'Jane', 'Alice', 'Bob'], size=n_customers)
last_names = np.random.choice(['Smith', 'Mkhize', 'Brown', 'Johnson'], size=n_customers)
locations = np.random.choice(['South Africa', 'Canada', 'UK', 'Germany'], size=n_customers)
membership_levels = np.random.choice(['Standard', 'Premium'], size=n_customers)
customers = pd.DataFrame({
'customer_id': customer_ids,
'first_name': first_names,
'last_name': last_names,
'location': locations,
'membership_level': membership_levels
})
return customers
customers_df = generate_customer_data()
customers_df.head()
Output:
Also read: A Complete Guide to Data Warehousing in 2024
Next, we’ll create a dataset for products that are available for purchase. This data will include fields like product ID, product name, category, and price.
def generate_product_data(n_products=500):
product_ids = np.arange(1, n_products + 1)
product_names = np.random.choice(['Laptop', 'Phone', 'Tablet', 'Headphones'], size=n_products)
categories = np.random.choice(['Electronics', 'Accessories'], size=n_products)
prices = np.random.uniform(50, 1000, size=n_products)
products = pd.DataFrame({
'product_id': product_ids,
'product_name': product_names,
'category': categories,
'price': prices
})
return products
products_df = generate_product_data()
products_df.head()
Output:
The dates dimension table is crucial for time-based analysis in any data warehousing or business intelligence scenario. It allows you to aggregate and analyze data based on specific periods such as year, month, day, or quarter. This table will reference the transaction’s time, allowing us to link each order to its corresponding date.
import pandas as pd
def generate_dates_data(start_date='2023-01-01', end_date='2024-02-21'):
# Create a date range
date_range = pd.date_range(start=start_date, end=end_date, freq='D')
# Create a DataFrame with date parts
dates_df = pd.DataFrame({
'order_date': date_range,
'year': date_range.year,
'month': date_range.month,
'day': date_range.day,
'week': date_range.isocalendar().week,
'quarter': date_range.quarter
})
return dates_df
# Generate the Dates dimension table
dates_df = generate_dates_data()
dates_df.head()
Output:
Also read: What is Data Warehousing?
Finally, we’ll generate the order data that acts as the fact table. This dataset will track customer orders, including the order date, total price, and product information. Each row in the Orders fact table represents a unique order placed by a customer, and it links directly to the relevant dimension tables (Customers, Products, and Dates) through foreign keys. This allows for detailed analysis, such as tracking how much each customer spends, which products are most popular, and how order activity varies over time.
def generate_order_data(n_orders=10000):
order_ids = np.arange(1, n_orders + 1)
customer_ids = np.random.randint(1, 1000, size=n_orders)
product_ids = np.random.randint(1, 500, size=n_orders)
order_dates = pd.date_range('2023-01-01', periods=n_orders, freq='H')
quantities = np.random.randint(1, 5, size=n_orders)
total_prices = quantities * np.random.uniform(50, 1000, size=n_orders)
orders = pd.DataFrame({
'order_id': order_ids,
'customer_id': customer_ids,
'product_id': product_ids,
'order_date': order_dates,
'quantity': quantities,
'total_price': total_prices
})
return orders
orders_df = generate_order_data()
orders_df.head()
Output:
We can now construct the STAR schema using the simulated customer order data. The primary fact table will consist of orders, while the associated dimension tables will encompass customers, products, and dates.
The STAR schema design simplifies queries, as each dimension table directly relates to the fact table, reducing the complexity of SQL joins.
Also read: Understanding the Basics of Data Warehouse and its Structure
Now that our schema is in place assume these 4 tables (orders, customers, products, dates) have been created and stored in a SQL database with the same schema as the above dataframes generated for each respective table. With this setup, we can run SQL queries to gain valuable business insights from the data.
We can easily retrieve total sales by product category using the Orders fact table and the Products dimension table. This query sums the total_price from the Orders table and groups the results by the product category from the Products table:
SELECT
p.category,
SUM(o.total_price) AS total_sales
FROM
orders o
JOIN
products p
ON
o.product_id = p.product_id
GROUP BY
p.category
ORDER BY
total_sales DESC;
We can join the orders and customers tables to understand how different membership levels affect order value. This query shows whether premium members spend more on average than standard members.
SELECT
c.membership_level,
AVG(o.total_price) AS avg_order_value
FROM
orders o
JOIN
customers c
ON
o.customer_id = c.customer_id
GROUP BY
c.membership_level
ORDER BY
avg_order_value DESC;
The primary difference between the STAR schema and the Snowflake schema is found in the organization of dimension tables, specifically regarding the degree of normalization implemented within those tables.
A Snowflake schema is a type of database schema that organizes dimension tables through normalization into several interconnected tables. Unlike the STAR schema, which features denormalized dimension tables, the Snowflake schema further divides dimension tables into sub-dimensions. For instance, a dimension table representing locations may be further segmented into distinct tables for cities and countries. This arrangement leads to a more intricate, hierarchical structure that resembles a snowflake, which is the origin of its name.
Below is a comparison that outlines when to use each schema:
Here’s the structure:
STAR Schema:
Snowflake Schema:
Here’s the query performance of STAR Schema and Snowflake Schema:
STAR Schema:
Snowflake Schema:
Here is the storage efficiency of STAR Schema and Snowflake Schema:
STAR Schema:
Snowflake Schema:
Here’s the scalability of STAR Schema and Snowflake Schema:
STAR Schema:
Snowflake Schema:
Let’s extend the customer orders data example to a Snowflake schema. Instead of storing all customer information in one Customer table, we will break it down to normalize data and reduce redundancy.
In a Snowflake schema for the same customer order data, we would have the following:
The Orders fact table still contains transactional data, but the customer and product information are normalized across multiple tables (e.g., customer location will link to different levels of geographic data).
To retrieve total sales by product category in a Snowflake schema, you would join several tables to get the final results. Here’s an example SQL query:
SELECT
c.category_name,
SUM(o.total_price) AS total_sales
FROM
orders o
JOIN
products p
ON
o.product_id = p.product_id
JOIN
categories c
ON
p.category_id = c.category_id
GROUP BY
c.category_name
ORDER BY
total_sales DESC;
As you can see, due to the normalized dimension tables, the Snowflake schema requires additional joins compared to the STAR schema. This results in more complex queries but minimizes redundancy in storage.
In summary, the STAR schema is optimized for fast query performance and simplicity in analytical queries, while the Snowflake schema is designed to reduce redundancy by normalizing dimension tables. The choice between the two depends on the dataset’s specific needs and the organization’s priorities, whether that be query performance or storage efficiency.
In this article, we illustrated constructing a STAR and Snowflake schema utilizing a simulated dataset of customer orders. We fact and dimension tables for customers, products, orders, and dates, demonstrating the essential function of each table in organizing data for effective querying and analysis. This schema allows for the connection of the fact table (orders) to the dimension tables (customers, products, and dates) via foreign keys such as product_id and customer_id, thereby streamlining data retrieval and promoting versatile querying.
We also highlighted key benefits of the STAR schema:
The Snowflake schema reduces data redundancy by normalizing dimension tables, improving storage efficiency but requiring more complex queries. It is ideal for managing hierarchical relationships or optimizing storage space. In contrast, the STAR schema simplifies data management and speeds up query performance, making it better for quick insights and efficient analysis. The choice between the two depends on whether you prioritize query performance or storage efficiency.
Explore the code behind this article on GitHub.
The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.
Ans. A STAR schema is a database schema design commonly used in data warehousing and business intelligence applications. It consists of a central fact table containing transactional or measurable data, surrounded by dimension tables containing descriptive information. This star-like structure optimizes query performance and simplifies data retrieval by minimizing complex joins and making queries more intuitive. The name “STAR” comes from the shape of the schema, where the fact table is at the center, and the dimension tables radiate outward like the points of a star.
Ans. A fact table is characterized by its inclusion of transactional or quantifiable data, such as sales figures, order counts, or revenue metrics. Dimension tables provide descriptive attributes like customer names, demographics, product classifications, or dates. The fact table holds the quantitative data, while the dimension tables provide the context.
Ans. The STAR schema optimizes query performance by reducing the number of joins required, as the fact table is directly connected to each dimension table. This simplifies queries and reduces the computational cost, leading to faster query execution times, especially for large datasets.
Ans. Indeed, the STAR schema is designed to be both scalable and flexible. New dimension tables or additional attributes can be integrated into the existing schema without causing any disruption. This adaptability allows the STAR schema to accommodate expanding datasets and evolving business needs.
Ans. If query performance and simplicity are your priorities, choose a STAR schema. If your goal is to minimize data redundancy and optimize storage efficiency, particularly for large datasets with hierarchical relationships, opt for a Snowflake schema.