Introduction to Snowflake

Pavel Strnad
WhereScape Architect
5 minutes to read
November 28, 2022
Tags: Snowflake · Snowflake Data Cloud · Data engineering · Cloud Computing · Data Warehouse

In this article, we will cover the cloud-native platform Snowflake (SF) and look at its key features. We will also compare SaaS products with on-premises products and discuss some of the differences between OLAP-optimized database systems, such as SF, and traditional OLTP databases such as Oracle or PostgreSQL.

SF is essentially a cloud-based data warehousing platform and one of the many representatives of relational database management systems (RDBMS). It’s a cloud-agnostic service, which means it can be hosted on any of the three main cloud computing providers: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). The decision on where the storage will be held depends entirely on which of these providers the firm prefers.

As SF is a true Software-as-a-Service (SaaS) product, its big advantage over an on-premises solution is that there is no hardware or software to install and configure, so the firm does not have to spend resources on maintaining on-premises servers. Every aspect of software installation is handled automatically by Snowflake at no additional cost, which reduces the workload of the IT staff who would otherwise have to set up and maintain the hardware. The effortless scalability of storage and compute resources is another huge advantage over on-premises solutions on the market. On the downside, a SaaS solution brings a potential dependency on the cloud provider, and data security can be a concern, since the data resides outside the firm’s own infrastructure.

One factor differentiating Snowflake from more traditional databases is that the SF architecture separates storage (the storage layer) from compute (the compute layer), so the two scale independently of each other. If a company needs to store a high volume of data but high load performance is not crucial, there is no need to pay for an integrated bundle: the storage scales automatically without the company paying for more compute. Conversely, if certain loads in the firm’s data warehouse (DWH) need more processing power, whether because of high data volumes in the pipelines or complex data transformations, the compute resources can be scaled up seamlessly, without any downtime.

These compute resources are provided by SF virtual warehouses, which are responsible for processing queries. Each warehouse (or cluster) provides the resources (CPU, memory) for executing all DML and DDL statements. A warehouse can be resized along T-shirt sizes (X-Small to 6X-Large) without affecting storage, even at runtime, and it’s possible to create different warehouses for different purposes, such as separate warehouses for the DWH load, reporting, and data ingestion. The higher the T-shirt size, the more compute resources are used for query processing and, typically, the better the query performance.
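As a minimal sketch of how little ceremony this involves, the following statements create and then resize a warehouse (the warehouse name and parameter values are illustrative, not a recommendation):

```sql
-- Create a small warehouse that suspends itself after 60 s of inactivity
-- and wakes up automatically when a query arrives
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;

-- Scale up before a heavy transformation run; storage is unaffected,
-- and the resize does not require any downtime
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';
```

Because warehouses are independent of storage, several such warehouses can query the same tables concurrently without competing for resources.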

Other important features of Snowflake include:

  • Columnar storage — SF stores data in a columnar format (as opposed to the more common row format): data is organized by field, so the values of a given column sit next to each other in storage. The best-known representatives of these so-called C-store (column-oriented) databases are Snowflake, BigQuery, and AWS Redshift, and they all provide great query performance for data reads. Traditional databases, in contrast, keep all the data of a record next to each other; we call those row store (row-oriented) databases. While row-oriented DBs are a perfect fit for OLTP applications, columnar DBs such as Snowflake are the more suitable choice for OLAP use cases, where aggregations and ad hoc queries dominate.
  • Automatic compression — By default, Snowflake determines the optimal compression algorithm and compresses data out of the box, as opposed to some other DBs, where a compression method has to be specified explicitly to save storage.
  • Partitioning — Data is automatically divided into micro-partitions, small units of storage that enable outstanding DML query performance thanks to efficient scanning of individual attributes. SF micro-partitioning brings the advantages of the static partitioning known from other DBs such as Oracle or PostgreSQL without the need to manually create and maintain the partitions.
  • Snowflake Time Travel — This feature allows querying data that has been updated or deleted in the past. It also allows creating clones of a whole table, schema, or database as of a specific point in the past, within the Time Travel retention period, which ranges from one day up to a maximum of 90 days in the SF Enterprise edition. It’s also possible to restore a table, schema, or database that was dropped within the retention period.
  • Support of semi-structured data formats — SF provides above-average support for semi-structured data formats such as JSON, Avro, Parquet, or ORC, with dedicated data types for storing semi-structured data: VARIANT, ARRAY, and OBJECT.
  • Row-level policies — SF supports row access policies: schema-level objects that determine which rows are returned based on the role that executes the query. As a result, different roles can see different result sets for the same query, depending on what each role is supposed to see.
  • Dynamic data masking — One of SF’s data security features, dynamic data masking enables the anonymizing of attributes using a predefined masking strategy. By applying masking policies, sensitive values are replaced at query time while the underlying stored data remains intact and unchanged. These policies are schema-level objects and can be applied to the columns of a table or view.
  • Data sharing — Snowflake allows sharing databases between SF accounts. The interesting part is that no data is copied between the accounts; the whole process works through the metadata store. There are no additional storage costs; consumers only pay for the compute resources used to query the shared database. The only downside is that shared data is read-only for the consumer.
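To make the Time Travel bullet above concrete, here is a hedged sketch (the table name and timestamps are hypothetical, and the available retention depends on the account’s edition and settings):

```sql
-- Query the table as it looked one hour (3600 seconds) ago
SELECT * FROM orders AT (OFFSET => -3600);

-- Clone the table as of a specific point in time within the retention period
CREATE TABLE orders_restored CLONE orders
  AT (TIMESTAMP => '2022-11-01 12:00:00'::TIMESTAMP_LTZ);

-- Bring back a dropped table within the retention period
UNDROP TABLE orders;
```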
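The semi-structured support mentioned above boils down to storing raw documents in a VARIANT column and traversing them with path notation. A minimal sketch, with an illustrative table and JSON shape:

```sql
-- Land raw JSON in a VARIANT column (table and fields are made up)
CREATE TABLE events (payload VARIANT);

INSERT INTO events
  SELECT PARSE_JSON('{"user": {"name": "Alice"}, "type": "login"}');

-- Traverse the document with path notation and cast to a SQL type
SELECT payload:user.name::STRING AS user_name,
       payload:type::STRING      AS event_type
FROM   events;
```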
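A row access policy from the row-level policies bullet might look like the following sketch, where the table, column, and role names (SALES_EU, ADMIN) are purely hypothetical:

```sql
-- Return a row only to roles that are allowed to see its region
CREATE ROW ACCESS POLICY eu_only AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'ADMIN'
  OR (CURRENT_ROLE() = 'SALES_EU' AND region = 'EU');

-- Attach the policy to a table column; it now filters every query
ALTER TABLE sales ADD ROW ACCESS POLICY eu_only ON (region);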
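Dynamic data masking follows the same schema-level-policy pattern. A sketch, assuming a hypothetical users table and a PII_ADMIN role that may see clear-text values:

```sql
-- Show the real value only to the privileged role; mask it for everyone else
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val ELSE '*****' END;

-- Bind the policy to a column; the stored data itself is unchanged
ALTER TABLE users MODIFY COLUMN email SET MASKING POLICY email_mask;
```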
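On the producer side, the data sharing described above is set up with a share object and ordinary grants. A sketch with made-up database, schema, table, and account names:

```sql
-- Create the share and grant it access to the objects to be exposed
CREATE SHARE sales_share;
GRANT USAGE  ON DATABASE sales_db              TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.public       TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.public.sales TO SHARE sales_share;

-- Make the share visible to a consumer account; no data is copied
ALTER SHARE sales_share ADD ACCOUNTS = myorg.consumer_account;
```

The consumer then creates a database from the share and queries it read-only with their own compute.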

As you can see, Snowflake is a very versatile data warehousing solution that offers modern features and great support. Snowflake constantly releases new features (such as hybrid tables for OLTP workloads or Python procedure support via Snowpark for Python) that add more and more usability to the product.
