Most of us are familiar with SQL, and many of us have hands-on experience with it. Machine learning, meanwhile, is an increasingly popular and fast-developing field. BigQuery ML, shortened to BQML, brings the two together: it is a pure-SQL solution that leverages BigQuery to query massive datasets and train machine learning models on them using standard SQL queries. In this article, we’ll try out BQML, learn its principles and how it works, and then follow an example implementation.
We will proceed step by step, starting with an introduction to BigQuery, to better grasp the entire process and what happens behind the scenes.
Prerequisite: BigQuery
BigQuery is a highly scalable, serverless data warehouse that can process queries over petabytes of data in minutes. It is a cloud-based PaaS (platform as a service) data warehouse offered by Google. BigQuery features built-in capabilities such as geospatial analysis, real-time data ingestion, business intelligence, and integration with a range of Google Cloud Platform (GCP) services, in addition to machine learning, which we will focus on today.
Any business works with data, and if the data is modest enough, it can probably fit into spreadsheets. However, if the data grows to gigabytes, terabytes, or even petabytes, a more efficient solution, such as a data warehouse, is required. Traditional database management systems cannot handle such massive amounts of data. This is where BigQuery comes in. It is built to manage huge volumes of data, such as log data from thousands of retail systems or IoT data from millions of vehicle sensors worldwide. It can process 100 billion regular-expression matches at roughly 1 μs each. We can use BigQuery via clients like the BigQuery web UI, REST APIs, or the bq command-line tool.
It is built on top of Dremel, a technology Google has been developing internally since 2006; Dremel is the execution engine for BigQuery. The architecture described below comes from Google’s own write-up:
Source: https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood
First, the BigQuery client interacts with the Dremel engine via a client interface. Dremel converts the query into an execution tree with two kinds of nodes: branches and leaves. The branches are called mixers, and they perform aggregation. The leaves are called slots; they perform the necessary computation and read data from the BQ filesystem over the Jupiter network. Google’s Jupiter network can deliver 1 petabit/sec of total bisection bandwidth. Both mixers and slots are run by Borg, Google’s large-scale cluster management system, which allocates server resources to Dremel jobs. Unlike traditional relational databases, BigQuery uses columnar storage, where data is co-located by column rather than by row. Internally, BQ stores data in a proprietary file format called Capacitor, which encodes each column and reorders rows based on access patterns to improve compression and scan speed.
Databases such as MySQL and PostgreSQL use record-oriented storage to store data. It is effective for transactional modifications to a single or group of rows. In the case of aggregation, however, it must read the entire table into memory. Because BigQuery is focused on analytical use cases, its columnar storage allows it to read only a single column for aggregation.
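As a quick illustration (the table and column names here are hypothetical), an aggregate query like the one below only needs to scan the single column it references:

```sql
-- Hypothetical table, for illustration only. Because BigQuery stores data
-- column by column, this aggregate scans just `total_amount`,
-- not the entire orders table.
SELECT AVG(total_amount) AS avg_order_value
FROM `my-project.sales.orders`;
```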
All BigQuery files are stored in Colossus, a distributed file system used throughout Google; each Google data center has its own Colossus cluster. Colossus ensures durability using erasure coding, which breaks data into fragments and saves redundant pieces across a set of different disks.
Now that we have a general understanding of BigQuery and how it processes such massive quantities of data so quickly and effectively, we can move on to the machine learning portion.
We are all aware that machine learning is a field of study in which we feed data to computers and let them learn and improve from that data without being explicitly programmed. A typical machine learning project begins with identifying the business problem, then collecting an appropriate amount of data, preprocessing and splitting the data into train and test sets, training and evaluating the model, and finally deploying it to the cloud to make predictions.
BigQuery ML, on the other hand, greatly simplifies this process by automatically handling preprocessing and data splitting. It lets us focus only on formatting the data correctly and choosing which model to use. BQML allows us to train models on data already stored in BigQuery, evaluate them, and generate predictions, all with standard SQL and without moving data out of the warehouse.
BigQuery currently supports over ten model types, ranging from linear regression to K-Means clustering, and from time series to deep neural networks. The full list of supported models can be found in the BQML documentation.
Models in BQML fall into two categories: built-in models, which are trained within BigQuery, and external models, such as imported models, DNNs, or AutoML models. BQML pricing is on-demand and depends on data location and the type of operation (model creation, evaluation, or prediction), as well as the model used.
We will create a logistic regression model to predict the probability of a buyer adding a product to the cart, using a BigQuery sandbox environment. To set it up, go to console.cloud.google.com/bigquery and click the Create Project button.
BigQuery provides over 100 publicly available datasets to analyze, which can be found in the Marketplace section of the Google Cloud navigation panel. All the public datasets live under the project bigquery-public-data, and we will pin this project in our UI via the + ADD DATA button.
For our prediction, we will use the ga4_obfuscated_sample_ecommerce dataset. Its tables are named events_YYYYMMDD, i.e., each day’s data is a separate table. We can inspect the schema of any table and write a query like the one below.
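The original query was shown as a screenshot; a minimal sketch of such a query, counting events across all the daily tables, might look like this (column names follow the public GA4 export schema):

```sql
-- The trailing * wildcard matches every events_YYYYMMDD table in the dataset.
SELECT
  event_name,
  COUNT(*) AS event_count
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
GROUP BY event_name
ORDER BY event_count DESC;
```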
In the query above, we change the table name to events_* to select all the daily tables in the dataset at once. In the upper right corner of the editor, we can also check how much data the query will scan when run.
Before beginning ML training, the model must be stored in a dataset. We will create a new dataset using the BigQuery UI: click Create Dataset, give it a name, and choose a location.
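The original showed these steps as a UI screenshot. Equivalently, a dataset can be created with DDL; here is a sketch, where `your-project` and `bqml_demo` are placeholder names carried through the rest of the examples in this article:

```sql
-- Creates a dataset (schema) to hold our tables and models.
CREATE SCHEMA `your-project.bqml_demo`
OPTIONS (location = 'US');
```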
Now, we will create our training set with the query below (the original article linked to the full query rather than reproducing it).
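Since the exact query isn’t available, the statement below is only a plausible sketch. It assumes the placeholder dataset `bqml_demo`, picks a few columns from the GA4 export schema (`device`, `geo`, `event_name`) as features, and labels each event by whether it was an add_to_cart event:

```sql
CREATE OR REPLACE TABLE `bqml_demo.training_data` AS
SELECT
  device.category          AS device_category,
  device.operating_system  AS operating_system,
  geo.country              AS country,
  -- Label: 1 if the event was an add-to-cart, 0 otherwise.
  IF(event_name = 'add_to_cart', 1, 0) AS added_to_cart
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
-- Restrict the wildcard scan to the daily tables from 2020.
WHERE _TABLE_SUFFIX BETWEEN '20200101' AND '20201231';
```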
The resulting table, containing data from 2020, will be used as our training dataset; its schema can be viewed in the same manner as before. We will skip the data exploration and jump straight to model creation.
A little introduction to the BQML convention for creating a model:
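Per the BigQuery ML documentation, a model is created with a CREATE MODEL statement whose general shape is as follows (all names here are placeholders):

```sql
CREATE OR REPLACE MODEL `project.dataset.model_name`
OPTIONS (
  model_type = 'logistic_reg',         -- e.g. linear_reg, kmeans, ...
  input_label_cols = ['label_column']  -- the column the model predicts
) AS
-- The SELECT defines the training data: feature columns plus the label.
SELECT feature_1, feature_2, label_column
FROM `project.dataset.training_table`;
```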
Now, we will create (or replace) our model using a query like the one below.
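The article’s exact statement isn’t reproduced here; a minimal sketch consistent with the training table built earlier (placeholder names carried over) would be:

```sql
CREATE OR REPLACE MODEL `bqml_demo.cart_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['added_to_cart']
) AS
SELECT
  device_category,
  operating_system,
  country,
  added_to_cart
FROM `bqml_demo.training_data`;
```

By default, BQML automatically reserves a portion of the input as an evaluation split, so we don’t have to split the data ourselves.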
With the model trained, we can generate predictions from it (the original article attached this query as log_model_predict.sql).
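A sketch of such a prediction query, again using the placeholder names from the earlier examples: for a binary logistic regression, ML.PREDICT returns a predicted_&lt;label&gt; column plus a predicted_&lt;label&gt;_probs column holding the class probabilities.

```sql
SELECT
  predicted_added_to_cart,        -- the predicted class (0 or 1)
  predicted_added_to_cart_probs   -- per-class probabilities
FROM ML.PREDICT(
  MODEL `bqml_demo.cart_model`,
  (
    SELECT device_category, operating_system, country
    FROM `bqml_demo.training_data`
    LIMIT 10
  )
);
```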
We can also evaluate the trained model with the query below.
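A sketch, assuming the placeholder model name above; when ML.EVALUATE is called without an input table, it scores the model on the evaluation split that BQML set aside automatically during training (exact metrics will depend on the features used; the author reports roughly 93% accuracy):

```sql
-- For logistic regression this returns precision, recall, accuracy,
-- f1_score, log_loss, and roc_auc.
SELECT *
FROM ML.EVALUATE(MODEL `bqml_demo.cart_model`);
```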
So, we’ve just created a logistic regression model that can predict the probability of an add-to-cart event with an accuracy of 93%. Though this is a baseline model, many advanced techniques are available in BQML to tune it further, which we leave as future scope.
We’ve just seen how powerful BigQuery’s toolkit is. Still, it has several limitations and shortcomings that restrict BQ from general ML use. BQML is a good choice when the data already lives in BigQuery, when moving data out of the warehouse would be costly or slow, or when a SQL-first team needs quick baseline models without standing up a separate ML stack.
We’ve given a brief introduction to BigQuery ML. In this article, we’ve covered what BigQuery is and how it works under the hood (Dremel, Borg, Colossus, Jupiter, and Capacitor), what BQML offers and how it is priced, and an end-to-end example of training, predicting with, and evaluating a logistic regression model using only SQL.
I hope this article was as straightforward and interactive as possible and that it inspired you to explore BigQuery for ML. If you have any suggestions or corrections, please let me know.
I’d love to connect with you via LinkedIn.