The 5 features are listed below:
1. Stages
2. Time Travel
3. Zero-Copy Cloning
4. Streams
5. Tasks
Let’s discuss these in detail.
Stages are locations used to store data. From a stage, the data is loaded into Snowflake tables; stages can also be used to unload data from Snowflake. Stages are classified as below:
Internal Stage, where the data that has to be loaded into Snowflake is stored inside Snowflake itself.
External Stage, where the data that has to be loaded into Snowflake is stored outside Snowflake, for example in AWS S3. Using an external stage, you can query the files that are available in AWS S3.
Internal stages are further divided into three types:
a. User Stages
User stages are created by default and are specific to a particular user. They are useful when a user wants to load data from local files into multiple tables.
b. Table Stages
Table stages are created by default when a table is created and are specific to that particular table.
c. Named Stages
Named stages are created explicitly by the user.
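As a quick sketch (the stage and table names here are purely illustrative), each kind of stage is referenced slightly differently:
create stage my_named_stage;   -- named stage, created explicitly
list @my_named_stage;          -- files in the named stage
list @~;                       -- files in the current user's stage
list @%my_table;               -- files in the table stage of the (hypothetical) table my_table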
Remember the time machine from the good old movies? You could go back in time and see how the world looked. The Time Travel feature offered by Snowflake is similar: it lets you query data as it existed at a point in the past.
The Time Travel window on an object is set using the parameter DATA_RETENTION_TIME_IN_DAYS. Once the retention period is crossed, the historical data moves into the Fail-safe region. Data in Fail-safe cannot be queried directly; if you need to retrieve it, you have to contact Snowflake support.
1. Create an employee table with a data retention period of 4 days. Note that I am using the DEMO_DB database and the PUBLIC schema.
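A table definition along these lines would satisfy step 1 (the column names EMPID and EMPNAME are inferred from the inserts below):
create table T_EMP_DETLS (EMPID number, EMPNAME varchar) data_retention_time_in_days = 4;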
---TIME TRAVEL EXAMPLE
insert into T_EMP_DETLS values(1,'Micheal');
insert into T_EMP_DETLS values(2,'Nick');
insert into T_EMP_DETLS values(3,'George');
insert into T_EMP_DETLS values(4,'Donald');
insert into T_EMP_DETLS values(5,'Vincent');
insert into T_EMP_DETLS values(6,'Leo');
select * from T_EMP_DETLS;
After 5 minutes I inserted another row with EMPID 7 as below
insert into T_EMP_DETLS values(7,'William');
Now there are 7 rows in the table, but let us go back 5 minutes and see how the table looked then.
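A Time Travel query of this form goes back 5 minutes (the OFFSET is given in seconds) and therefore returns the table without EMPID 7:
select * from T_EMP_DETLS at(offset => -60*5);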
Like this, you can go back in time and check what data the table contained at that point.
In a typical RDBMS, when you copy a table from one database to another, the data is copied along with the table definition, which means the data is stored twice and the storage cost doubles. Snowflake's cloning is zero-copy: every table in Snowflake is stored internally as micro-partitions, and a clone simply references the same micro-partitions as the original table instead of duplicating them. New micro-partitions are created only for the changes you apply afterwards, for example when you modify the clone in a development environment.
create or replace table T_EMP_DETLS_CLN clone T_EMP_DETLS ;
Let us see how much storage this cloned table occupies when compared to the original table T_EMP_DETLS.
select * from T_EMP_DETLS_CLN;
select * from DEMO_DB.information_schema.table_storage_metrics where TABLE_NAME like '%T_EMP_DETLS%';
We can see that the storage for the cloned table is 0 and both tables share the same CLONE_GROUP_ID.
A stream is a feature that lets you track the changes happening to the data in a table. You can capture all inserts, updates, and deletes. Changes are captured from the time the stream is created; you will not be able to capture changes made before that.
Creating the first stream on a table enables change tracking on that table behind the scenes (hidden metadata columns are added to the source table). When you query the stream, each change record is returned with three extra metadata columns, METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID, which describe the change. Please note that the stream itself does not store any table data.
Snowflake supports 3 types of streams: standard, append-only, and insert-only.
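As a quick illustration (the stream name below is only an example and is not used later), an append-only stream is created by adding a single parameter:
create stream st_emp_dtls_append on table T_EMP_DETLS append_only = true;
The walkthrough below uses a standard stream.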
create stream st_emp_dtls on table T_EMP_DETLS;
insert into T_EMP_DETLS values(8,'Arnold');
Let us see what we have in the stream now
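Querying the stream like a regular table lists the pending change records along with the metadata columns:
select * from st_emp_dtls;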
As you can see, the record for the newly inserted row is displayed in the stream. There is a difference between querying a stream and consuming it: querying, as we did above, does not advance the stream's offset, whereas consuming the stream in a DML statement does. We will see what happens when we consume the stream in the next section.
Tasks are one of the handiest features when building a data pipeline in Snowflake. If you want to schedule any SQL statement in Snowflake, use a task. Tasks can be arranged in a hierarchical manner. At any point in time, only a single instance of a task runs. For example, if you schedule a task for every hour and the run from the first hour is still executing, the second run will not be initiated; it will be skipped. Once you define a task, it does not run immediately: it is created in a suspended state and you need to resume it to start it.
CREATE TASK T_TASK1
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = '5 minute'
WHEN SYSTEM$STREAM_HAS_DATA('st_emp_dtls')
AS
  INSERT INTO test_emp_tabel(EMPID, EMPNAME)
  SELECT empid, empname FROM st_emp_dtls WHERE METADATA$ACTION = 'INSERT';
Here we are capturing the newly inserted records on T_EMP_DETLS into another table called test_emp_tabel, which is one form of change data capture. When it runs, the task consumes the data from the stream, which empties the stream. Once the task is created we need to start it, as by default it is in suspended mode. Ensure that the table test_emp_tabel exists before you resume the task.
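The target table's definition is not shown here; a minimal definition matching the columns the task inserts would be:
create table test_emp_tabel (EMPID number, EMPNAME varchar);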
Let us start the task
alter task T_TASK1 resume;
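Once the 5-minute schedule has elapsed, one way to confirm that the task actually ran is the TASK_HISTORY table function:
select * from table(information_schema.task_history(task_name => 'T_TASK1'));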
Now let us check the data in the target table and also the data in the stream. The stream will now be empty because the task has consumed it.
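The checks themselves are simple selects:
select * from test_emp_tabel;  -- contains the row(s) moved by the task
select * from st_emp_dtls;     -- empty, since the task consumed the stream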
I hope this gave you an overview of the most important features in Snowflake. These features will help you design your solutions better, especially when you are migrating from an on-premises environment to Snowflake or building continuous data pipelines. If you want to explore more, refer to the official documentation from Snowflake here.
Happy Learning!!
Let me know your comments below, if any.
If you would like to connect with me, then reach out here.