Google’s BigQuery is a powerful cloud-based data warehouse that provides fast, flexible, and cost-effective data storage and analysis capabilities. One of its unique features is the ability to build and run machine learning models directly inside the database without extracting the data and moving it to another platform.
BigQuery was created to analyse data with billions of rows using SQL-like syntax. It is hosted on the Google Cloud Storage infrastructure and is accessible via a REST-oriented application programming interface (API).
Learning Objectives
In this article, we will:
This article was published as a part of the Data Science Blogathon.
This guide will provide a step-by-step tutorial on how to build a machine-learning model in BigQuery, covering the following five main stages:
The first step in building a machine learning model in BigQuery is to get the data into the database. BigQuery supports loading data from various sources, including cloud storage (such as Google Cloud Storage or AWS S3), local files, or even other databases (such as Google Cloud SQL).
For the purposes of this tutorial, we will assume that we have a dataset in Google Cloud Storage that we want to load into BigQuery. The data in this example will be a simple CSV file with two columns: ‘age’ and ‘income’. Our goal is to build a model that predicts income based on age.
To load the data into BigQuery, we first need to create a new table in the database. This can be done using the BigQuery web interface or the command line tool.
Once the table is created, we can use the `bq` command line tool to load the data from our Cloud Storage bucket:
bq load --source_format=CSV mydataset.mytable gs://analyticsvidya/myfile.csv age:INTEGER,income:FLOAT
This command specifies that the data is in CSV format and that the columns ‘age’ and ‘income’ should be treated as an integer and float values, respectively.
The next step is to select and preprocess the features we want to use in our model. In this case, we only have two columns in our dataset, ‘age’ and ‘income’, so there’s not much to do here. However, in real-world datasets, there may be many columns with various data types, missing values, or other issues that need to be addressed.
One common preprocessing step is to normalize the data. Normalization is the process of scaling the values of a column to a specific range, such as [0,1]. This is useful for avoiding biases in the model due to differences in the scales of different columns.
In BigQuery, we can normalize the data using the `NORMALIZE` function:
WITH data AS (
SELECT age, income
FROM mydataset.mytable
)
SELECT
age,
NORMALIZE(CAST(income AS STRING)) as income_norm
FROM data
This query creates a new table with the same data as the original table but with normalized income values.
Once the data is preprocessed, we can create the machine learning model. BigQuery supports a variety of machine learning models, including linear regression, logistic regression, decision trees, and more.
For this example, we will use a simple linear regression model, which predicts a continuous target variable based on one or more independent variables. In our case, the target variable is income, and the independent variable is age.
To create the linear regression model in BigQuery, we use the `CREATE MODEL` statement:
CREATE MODEL mydataset.mymodel
OPTIONS
(model_type='linear_reg',
input_label_cols=['income']) AS
SELECT
age,
income
FROM `mydataset.mytable`
This statement creates a new model called `mymodel` in the `mydataset` dataset. The `OPTIONS` clause specifies that the model type is a linear regression model, and the input label column is `income_norm.` The `AS` clause specifies the query that returns the data that will be used to train the model.
Following are the different models supported by BigQuery
In addition to these models, BigQuery also provides several pre-trained models and APIs that can be used for common machine learning tasks, such as sentiment analysis, entity recognition, and image labeling.
Once the model is created, we need to evaluate its performance to see how well it can predict the target variable based on the independent variable. This can be done by splitting the data into a training set and a test set and using the training set to train the model and the test set to evaluate its performance.
In BigQuery, we can use the `ML.EVALUATE` function to evaluate the performance of the model:
SELECT
*
FROM
ML.EVALUATE(MODEL mydataset.mymodel,
(
SELECT
age,
income
FROM
`mydataset.mytable`))
This query returns various metrics, including mean squared error, root mean squared error, mean absolute error, and R-squared, all of which measure how well the model can predict the target variable.
Finally, once we have evaluated the model’s performance, we can use it to predict new data. In BigQuery, we can use the `ML.PREDICT` function to make predictions:
SELECT
*
FROM
ML.PREDICT(MODEL `mydataset.mymodel`,
(
SELECT
age,
FROM
`mydataset.mytable`))
This query returns the original data along with the predicted income values.
In this tutorial, we showed how to build a machine learning model in BigQuery, from loading the data to making predictions. We covered the key steps of ETL, feature selection and preprocessing, model creation, performance evaluation, and prediction. By following these steps, you can build and run your own machine learning models directly inside BigQuery, taking advantage of its fast and cost-effective data processing capabilities.
Key Takwaways from this article:
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.