Automating Data Quality Checks with Dagster and Great Expectations

Jurgita Motus · 23 Sep, 2024 · 13 min read

Introduction

Ensuring data quality is paramount for businesses relying on data-driven decision-making. As data volumes grow and sources diversify, manual quality checks become increasingly impractical and error-prone. This is where automated data quality checks come into play, offering a scalable solution to maintain data integrity and reliability.

At my organization, which collects large volumes of public web data, we’ve developed a robust system for automated data quality checks using two powerful open-source tools: Dagster and Great Expectations. These tools are the cornerstone of our approach to data quality management, allowing us to efficiently validate and monitor our data pipelines at scale.

In this article, I’ll explain how we use Dagster, an open-source data orchestrator, and Great Expectations, a data validation framework, to implement comprehensive automated data quality checks. I’ll also explore the benefits of this approach and provide practical insights into our implementation process, including a GitLab demo, to help you understand how these tools can enhance your own data quality assurance practices.

Let’s discuss each of them in more detail before moving to practical examples.

Learning Outcomes

  • Understand the importance of automated data quality checks in data-driven decision-making.
  • Learn how to implement data quality checks using Dagster and Great Expectations.
  • Explore different testing strategies for static and dynamic data.
  • Gain insights into the benefits of real-time monitoring and compliance in data quality management.
  • Discover practical steps to set up and run a demo project for automated data quality validation.

This article was published as a part of the Data Science Blogathon.

Understanding Dagster: An Open-Source Data Orchestrator

Used for ETL, analytics, and machine learning workflows, Dagster lets you build, schedule, and monitor data pipelines. This Python-based tool allows data scientists and engineers to easily debug runs, inspect assets, or get details about their status, metadata, or dependencies.

As a result, Dagster makes your data pipelines more reliable, scalable, and maintainable. It can be deployed on Azure, Google Cloud, AWS, and integrated with many other tools you may already be using. Airflow and Prefect are often named as Dagster’s competitors, but I personally see more advantages in Dagster, and you can find plenty of comparisons online before committing.
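To give a flavor of how this looks in code, here is a minimal, hypothetical sketch of two Dagster software-defined assets; the names and data are purely illustrative and not part of our pipelines:

python
from dagster import asset, Definitions

@asset
def raw_profiles():
    # Illustrative upstream asset: in a real pipeline this would load scraped records.
    return [{"name": "Acme"}, {"name": ""}]

@asset
def cleaned_profiles(raw_profiles):
    # Downstream asset: Dagster infers the dependency from the argument name.
    return [p for p in raw_profiles if p["name"]]

defs = Definitions(assets=[raw_profiles, cleaned_profiles])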


Exploring Great Expectations: A Data Validation Framework

A great tool with a great name, Great Expectations is an open-source platform for maintaining data quality. This Python library uses “Expectation” as its in-house term for an assertion about data.

Great Expectations provides validations based on both schema and values; examples of such rules include max or min value checks and count validations. It can also profile the input data and generate expectations from it automatically. Of course, this feature usually requires some tweaking, but it definitely saves time.

Another useful aspect is that Great Expectations can be integrated with Google Cloud, Snowflake, Azure, and over 20 other tools. While it can be challenging for data users without technical knowledge, it’s nevertheless worth attempting.
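As a quick, hypothetical illustration of such rules (this snippet assumes a pre-1.0 Great Expectations release; the API differs between versions):

python
import pandas as pd
import great_expectations as ge

# Wrap a pandas DataFrame so it exposes expectation methods (pre-1.0 API).
df = ge.from_pandas(pd.DataFrame({"followers": [120, 0, 56]}))

df.expect_column_values_to_be_between("followers", min_value=0)        # value-based rule
df.expect_table_row_count_to_be_between(min_value=1, max_value=10000)  # count validation

print(df.validate().success)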


Why are Automated Data Quality Checks Necessary?

Automated quality checks have multiple benefits for businesses that handle large volumes of critical data. If the information must be accurate, complete, and consistent, automation will always beat manual labor, which is prone to errors. Let’s take a quick look at the five main reasons why your organization might need automated data quality checks.

Data integrity

Your organization can collect reliable data with a set of predefined quality criteria. This reduces the chance of wrong assumptions and decisions that are error-prone and not data-driven. Tools like Great Expectations and Dagster can be very helpful here.

Error minimization

While there’s no way to eradicate the possibility of errors, you can minimize the chance of them occurring with automated data quality checks. Most importantly, this will help identify anomalies earlier in the pipeline, saving precious resources. In other words, error minimization prevents tactical mistakes from becoming strategic.

Efficiency

Checking data manually is often time-consuming and may require more than one employee on the job. With automation, your data team can focus on more important tasks, such as finding insights and preparing reports.

Real-time monitoring

Automation comes with real-time tracking. This way, you can detect issues before they become bigger problems. In contrast, manual checking takes longer and will never catch an error at the earliest possible stage.

Compliance

Most companies that deal with public web data know about privacy-related regulations. In the same way, there may be a need for data quality compliance, especially if the data later goes on to be used in critical infrastructure, such as pharmaceuticals or the military. With automated data quality checks in place, you can provide concrete evidence of your data’s quality, and the client only needs to review the data quality rules rather than the data itself.

How to Test Data Quality?

As a public web data provider, we consider a well-oiled automated data quality check mechanism essential. So how do we do it? First, we differentiate our tests by the type of data. The test naming might seem somewhat confusing because it was originally conceived for internal use, but it helps us understand what we’re testing.

We have two types of data:

  • Static data. Static means that we don’t scrape the data in real-time but rather use a static fixture.
  • Dynamic data. Dynamic means that we scrape the data from the web in real-time.

Then, we further differentiate our tests by the type of data quality check:

  • Fixture tests. These tests use fixtures to check the data quality.
  • Coverage tests. These tests use a bunch of rules to check the data quality.

Let’s take a look at each of these tests in more detail.

Static Fixture Tests

As mentioned earlier, these tests belong to the static data category, meaning we don’t scrape the data in real-time. Instead, we use a static fixture that we have saved previously.

A static fixture is input data that we have saved previously. In most cases, it’s an HTML file of a web page that we want to scrape. For every static fixture, we have a corresponding expected output. This expected output is the data that we expect to get from the parser.

Steps for Static Fixture Tests

The test works like this:

  • The parser receives the static fixture as an input.
  • The parser processes the fixture and returns the output.
  • The test checks if the output is the same as the expected output. This is not a simple JSON comparison because some fields are expected to change (such as the last updated date), but it is still a simple process.

We run this test in our CI/CD pipeline on merge requests to check if the changes we made to the parser are valid and if the parser works as expected. If the test fails, we know we have broken something and need to fix it.

Static fixture tests are the most basic tests both in terms of process complexity and implementation because they only need to run the parser with a static fixture and compare the output with the expected output using a rather simple Python script.
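For illustration, a comparison helper along these lines could look as follows (a simplified sketch, not our actual script; the volatile field names are hypothetical):

python
import json

# Fields that are allowed to differ between runs, e.g. timestamps (hypothetical names).
VOLATILE_FIELDS = {"last_updated"}

def assert_matches_fixture(output: dict, expected_path: str) -> None:
    with open(expected_path) as f:
        expected = json.load(f)
    # Drop volatile fields from both sides before comparing.
    for field in VOLATILE_FIELDS:
        output.pop(field, None)
        expected.pop(field, None)
    assert output == expected, "Parser output differs from the expected fixture"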

Nevertheless, they are still really important because they are the first line of defense against breaking changes.

However, a static fixture test cannot check whether scraping is working as expected or whether the page layout remains the same. This is where the dynamic tests category comes in.

Dynamic Fixture Tests

Basically, dynamic fixture tests are the same as static fixture tests, but instead of using a static fixture as an input, we scrape the data in real-time. This way, we check not only the parser but also the scraper and the layout.

Dynamic fixture tests are more complex than static fixture tests because they need to scrape the data in real-time and then run the parser with the scraped data. This means that we need to launch both the scraper and the parser in the test run and manage the data flow between them. This is where Dagster comes in.

Dagster is an orchestrator that helps us to manage the data flow between the scraper and the parser.

Steps for Dynamic Fixture Tests

There are four main steps in the process:

  • Seed the queue with the URLs we want to scrape
  • Scrape
  • Parse
  • Check the parsed document against the saved fixture

The last step is the same as in static fixture tests, and the only difference is that instead of using a static fixture, we scrape the data during the test run.
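A rough sketch of how such a test pipeline could be wired together with Dagster ops is shown below; the op bodies are placeholders, not our production code:

python
from dagster import op, job

@op
def seed_urls():
    # Placeholder: in practice, this would push target URLs into a scraping queue.
    return ["https://example.com/profile/1"]

@op
def scrape(urls):
    # Placeholder for the real scraper.
    return [{"url": u, "html": "<html>...</html>"} for u in urls]

@op
def parse(pages):
    # Placeholder for the real parser.
    return [{"url": p["url"], "name": "Example"} for p in pages]

@op
def check_against_fixture(parsed):
    # Compare the parsed documents to a saved fixture, ignoring volatile fields.
    assert parsed, "No documents were parsed"

@job
def dynamic_fixture_test():
    check_against_fixture(parse(scrape(seed_urls())))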

Dynamic fixture tests play a very important role in our data quality assurance process because they check both the scraper and the parser. Also, they help us understand if the page layout has changed, which is impossible with static fixture tests. This is why we run dynamic fixture tests in a scheduled manner instead of running them on every merge request in the CI/CD pipeline.

However, dynamic fixture tests have a significant limitation: they can only check the data quality of profiles over which we have control. For example, if we don’t control the profile used in the test, we can’t know what data to expect because it can change at any time. In other words, dynamic fixture tests only work for websites on which we have a profile. To overcome this limitation, we have dynamic coverage tests.

Dynamic Coverage Tests

Dynamic coverage tests also belong to the dynamic data category, but they differ from dynamic fixture tests in what they check. Dynamic fixture tests verify the data quality of profiles we control, which is limiting because that isn’t possible for all targets. Dynamic coverage tests, in contrast, can check data quality without the need to control a profile. This is possible because they don’t check exact values; instead, they validate the data against a set of rules we have defined. This is where Great Expectations comes in.

Dynamic coverage tests are the most complex tests in our data quality assurance process. Dagster orchestrates them as well, just like dynamic fixture tests. However, here we use Great Expectations instead of a simple Python script to execute the test.

First, we need to select the profiles we want to test. Usually, we select profiles from our database that have high field coverage, because we want the test to cover as many fields as possible. Then, we use Great Expectations to generate rules from the selected profiles. These rules are essentially the constraints that we want to check against the data. Here are some examples (see the sketch after the list for how they translate into expectations):

  • All profiles must have a name.
  • At least 50% of the profiles must have a last name.
  • The education count value cannot be lower than 0.
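Expressed as Great Expectations rules, these examples could look roughly like this (hypothetical column names, pre-1.0 API):

python
import pandas as pd
import great_expectations as ge

profiles = ge.from_pandas(pd.DataFrame({
    "name": ["Alice", "Bob"],
    "last_name": ["Smith", None],
    "education_count": [2, 0],
}))

# All profiles must have a name.
profiles.expect_column_values_to_not_be_null("name")
# At least 50% of the profiles must have a last name.
profiles.expect_column_values_to_not_be_null("last_name", mostly=0.5)
# The education count value cannot be lower than 0.
profiles.expect_column_values_to_be_between("education_count", min_value=0)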

Steps for Dynamic Coverage Tests

After we have generated the rules, called expectations in Great Expectations, we can run the test pipeline, which consists of the following steps:

  • Seed the queue with the URLs we want to scrape
  • Scrape
  • Parse
  • Validate parsed documents using Great Expectations

This way, we can check the data quality of profiles over which we have no control. Dynamic coverage tests are the most important tests in our data quality assurance process because they check the whole pipeline from scraping to parsing and validate the data quality of profiles over which we have no control. This is why we run dynamic coverage tests in a scheduled manner for every target we have.

However, implementing dynamic coverage tests from scratch can be challenging because it requires some knowledge about Great Expectations and Dagster. This is why we have prepared a demo project showing how to use Great Expectations and Dagster to implement automated data quality checks.

Implementing Automated Data Quality Checks

In this GitLab repository, you can find a demo of how to use Dagster and Great Expectations to test data quality. The real dynamic coverage test graph has more steps, such as seed_urls, scrape, parse, and so on, but for the sake of simplicity, some operations are omitted in this demo. However, it contains the most important part of the dynamic coverage test: data quality validation. The demo graph consists of the following operations:

  • load_items: loads the data from the file and parses it as JSON objects.
  • load_structure: loads the data structure from the file.
  • get_flat_items: flattens the data.
  • load_dfs: loads the data as Spark DataFrames, using the structure from the load_structure operation.
  • ge_validation: executes the Great Expectations validation for every DataFrame.
  • postprocess_ge_validation: checks whether the Great Expectations validation passed or failed.

While some of the operations are self-explanatory, let’s examine some that might require further detail.

Generating a Structure

The load_structure operation itself is not complicated. What matters is the type of structure it loads: the structure is represented as a Spark schema because we will use it to load the data into Spark DataFrames, which is what Great Expectations works with. Every nested object in the Pydantic model is represented as an individual Spark schema because Great Expectations doesn’t handle nested data well.

For example, a Pydantic model like this:

python
class CompanyHeadquarters(BaseModel):
    city: str
    country: str

class Company(BaseModel):
    name: str
    headquarters: CompanyHeadquarters

This would be represented as two Spark schemas:

json
{
    "company": {
        "fields": [
            {
                "metadata": {},
                "name": "name",
                "nullable": false,
                "type": "string"
            }
        ],
        "type": "struct"
    },
    "company_headquarters": {
        "fields": [
            {
                "metadata": {},
                "name": "city",
                "nullable": false,
                "type": "string"
            },
            {
                "metadata": {},
                "name": "country",
                "nullable": false,
                "type": "string"
            }
        ],
        "type": "struct"
    }
}
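For illustration, a simplified generator for such flat schemas could look like the sketch below (this is not the demo’s gx structure implementation; it assumes Pydantic v2 and, for brevity, maps every scalar field to a non-nullable string):

python
from pydantic import BaseModel
from pyspark.sql.types import StructField, StructType, StringType

class CompanyHeadquarters(BaseModel):
    city: str
    country: str

class Company(BaseModel):
    name: str
    headquarters: CompanyHeadquarters

def flat_spark_schema(model: type[BaseModel]) -> StructType:
    # Keep only scalar fields; nested models get their own schema elsewhere.
    fields = []
    for name, info in model.model_fields.items():
        if isinstance(info.annotation, type) and issubclass(info.annotation, BaseModel):
            continue  # e.g. headquarters becomes the separate company_headquarters schema
        fields.append(StructField(name, StringType(), nullable=False))
    return StructType(fields)

schemas = {
    "company": flat_spark_schema(Company),
    "company_headquarters": flat_spark_schema(CompanyHeadquarters),
}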

The demo already contains data, structure, and expectations for Owler company data. However, if you want to generate a structure for your own data, you can do so by following the steps below. Run the following command to generate an example of the Spark structure:

docker run -it --rm -v $(pwd)/gx_demo:/gx_demo gx_demo /bin/bash -c "gx structure"

This command generates the Spark structure for the Pydantic model and saves it as example_spark_structure.json in the gx_demo/data directory.

Preparing and Validating Data

After we have the structure loaded, we need to prepare the data for validation. That leads us to the get_flat_items operation, which is responsible for flattening the data. We need to flatten the data because each nested object will be represented as a row in a separate Spark DataFrame. So, if we have a list of companies that looks like this:

json
[
    {
        "name": "Company 1",
        "headquarters": {
            "city": "City 1",
            "country": "Country 1"
        }
    },
    {
        "name": "Company 2",
        "headquarters": {
            "city": "City 2",
            "country": "Country 2"
        }
    }
]

After flattening, the data will look like this:

json
{
    "company": [
        {
            "name": "Company 1"
        },
        {
            "name": "Company 2"
        }
    ],
    "company_headquarters": [
        {
            "city": "City 1",
            "country": "Country 1"
        },
        {
            "city": "City 2",
            "country": "Country 2"
        }
    ]
}
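A minimal sketch of this kind of flattening (illustrative only, not the demo’s actual get_flat_items implementation) could be:

python
def flatten_companies(companies: list[dict]) -> dict[str, list[dict]]:
    # Each nested object becomes its own list of flat rows, keyed by schema name.
    flat: dict[str, list[dict]] = {"company": [], "company_headquarters": []}
    for company in companies:
        flat["company"].append({"name": company["name"]})
        flat["company_headquarters"].append(dict(company["headquarters"]))
    return flat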

Then, the flattened data from the get_flat_items operation will be loaded into each Spark DataFrame based on the structure that we loaded in the load_structure operation in the load_dfs operation.

The load_dfs operation uses DynamicOut, which allows us to create a dynamic graph based on the structure that we loaded in the load_structure operation.

Basically, we will create a separate Spark DataFrame for every nested object in the structure. Dagster will then create a separate ge_validation operation for each DataFrame and run the Great Expectations validations in parallel. Parallelization is useful not only because it speeds up the process but also because the resulting graph adapts to any kind of data structure.

So, if we scrape a new target, we can easily add a new structure, and the graph will be able to handle it.
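Here is a hedged sketch of how a DynamicOut-based fan-out can look in Dagster (op names and sample data are illustrative, not the demo’s exact code):

python
from dagster import DynamicOut, DynamicOutput, job, op

@op(out=DynamicOut())
def load_dfs():
    # Emit one dynamic output per nested object, using the schema name as the mapping key.
    flat_items = {
        "company": [{"name": "Acme"}],
        "company_headquarters": [{"city": "City 1", "country": "Country 1"}],
    }
    for schema_name, rows in flat_items.items():
        yield DynamicOutput(rows, mapping_key=schema_name)

@op
def ge_validation(rows):
    # Placeholder for running the Great Expectations suite on one DataFrame.
    return len(rows)

@job
def demo_coverage_sketch():
    # .map() fans out ge_validation once per dynamic output, in parallel.
    load_dfs().map(ge_validation)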

Generating Expectations

Expectations for the demo data are also already generated, just like the structure. However, this section will show you how to generate the structure and expectations for your own data.

Make sure to delete previously generated expectations if you’re generating new ones with the same name. To generate expectations for the gx_demo/data/owler_company.json data, run the following command using gx_demo Docker image:

docker run -it --rm -v $(pwd)/gx_demo:/gx_demo gx_demo /bin/bash -c "gx expectations /gx_demo/data/owler_company_spark_structure.json /gx_demo/data/owler_company.json owler company"

The command above generates expectations for the data (gx_demo/data/owler_company.json) based on the flattened data structure (gx_demo/data/owler_company_spark_structure.json). In this case, we have 1,000 records of Owler company data. It’s structured as a list of objects, where each object represents a company.

After running the above command, the expectation suites will be generated in the gx_demo/great_expectations/expectations/owler directory. There will be as many expectation suites as there are nested objects in the data: in this case, 13.

Each suite will contain expectations for the data in the corresponding nested object. The expectations are generated based on the structure of the data and the data itself. Keep in mind that after Great Expectations generates the expectation suite, which contains expectations for the data, some manual work might be needed to tweak or improve some of the expectations.

Generated Expectations for Followers

Let’s take a look at the 6 generated expectations for the followers field in the company suite:

  • expect_column_min_to_be_between
  • expect_column_max_to_be_between
  • expect_column_mean_to_be_between
  • expect_column_median_to_be_between
  • expect_column_values_to_not_be_null
  • expect_column_values_to_be_in_type_list

We know that the followers field represents the number of followers of the company. Knowing that, we can say that this field can change over time, so we can’t expect the maximum value, mean, or median to be the same.

However, we can expect the minimum value to be at least 0 and the values to be integers. We can also expect the values to never be null, because if a company has no followers, the value should be 0 rather than missing. So, we need to remove the expectations that are not suitable for this field: expect_column_max_to_be_between, expect_column_mean_to_be_between, and expect_column_median_to_be_between.
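One way to prune such expectations is to edit the generated suite JSON directly; the sketch below assumes a pre-1.0 suite layout and a hypothetical file name:

python
import json

# Hypothetical suite file; generated suites live under gx_demo/great_expectations/expectations/owler.
SUITE_PATH = "gx_demo/great_expectations/expectations/owler/company.json"

UNSUITABLE = {
    "expect_column_max_to_be_between",
    "expect_column_mean_to_be_between",
    "expect_column_median_to_be_between",
}

with open(SUITE_PATH) as f:
    suite = json.load(f)

# Drop the expectations that don't make sense for a value that changes over time.
suite["expectations"] = [
    exp for exp in suite["expectations"]
    if not (exp["expectation_type"] in UNSUITABLE and exp["kwargs"].get("column") == "followers")
]

with open(SUITE_PATH, "w") as f:
    json.dump(suite, f, indent=2)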

However, every field is different, and the expectations might need to be adjusted accordingly. For example, the field completeness_score represents the company’s completeness score. For this field, it makes sense to expect the values to be between 0 and 100, so we can keep not only expect_column_min_to_be_between but also expect_column_max_to_be_between.

Take a look at the Gallery of Expectations to see what kind of expectations you can use for your data.

Running the Demo

To see everything in action, go to the root of the project and run the following commands:

  • Build Docker image:
docker build -t gx_demo .
  • Run Docker container:
docker compose up

After running the above commands, the Dagit (Dagster UI) will be available at localhost:3000. Run the demo_coverage job with the default configuration from the launchpad. After the job execution, you should see dynamically generated ge_validation operations for every nested object.


In this case, the data passed all the checks, and everything is beautiful and green. If data validation for any nested object fails, the corresponding postprocess_ge_validation operation is marked as failed (and shown in red instead of green). Let’s say the company_ceo validation failed: the postprocess_ge_validation[company_ceo] operation would be marked as failed. To see exactly which expectations failed, click on the ge_validation[company_ceo] operation and open “Expectation Results” by clicking the “[Show Markdown]” link. It will open a validation results overview modal with all the data about the company_ceo dataset.

Conclusion

Depending on the stage of the data pipeline, there are many ways to test data quality. However, it is essential to have a well-oiled automated data quality check mechanism to ensure the accuracy and reliability of the data. Tools like Great Expectations and Dagster aren’t strictly necessary (our static fixture tests use neither), but they greatly help build a more robust data quality assurance process. Whether you’re looking to enhance your existing data quality processes or build a new system from scratch, we hope this guide has provided valuable insights.

Key Takeaways

  • Data quality is crucial for accurate decision-making and avoiding costly errors in analytics.
  • Dagster enables seamless orchestration and automation of data pipelines with built-in support for monitoring and scheduling.
  • Great Expectations provides a flexible, open-source framework to define, test, and validate data quality expectations.
  • Combining Dagster with Great Expectations allows for automated, real-time data quality checks and monitoring within data pipelines.
  • A robust data quality process ensures compliance and builds trust in the insights derived from data-driven workflows.

Frequently Asked Questions

Q1. What is Dagster used for?

A. Dagster is used for orchestrating, automating, and managing data pipelines, helping ensure smooth data workflows.

Q2. What are Great Expectations in data pipelines?

A. Great Expectations is a tool for defining, validating, and monitoring data quality expectations to ensure data integrity.

Q3. How do Dagster and Great Expectations work together?

A. Dagster integrates with Great Expectations to enable automated data quality checks within data pipelines, enhancing reliability.

Q4. Why is data quality important in analytics?

A. Good data quality ensures accurate insights, helps avoid costly errors, and supports better decision-making in analytics.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
