SnowPro Core Certification: Performance Concepts (3/8)


Performance

Snowflake's unique architecture and its underlying micro-partition storage technology mean that extensive query tuning is rarely required. There are, however, several performance improvement approaches available that can be used to increase Snowflake's overall performance.

These include:

  • Internal caching mechanisms that operate transparently in the background to increase performance.
  • Scaling up, i.e., increasing the capacity of a virtual warehouse so that more processing power is available for complex queries.
  • Scaling out (horizontal scaling), i.e., using a multi-cluster virtual warehouse to handle a large number of concurrent users and concurrent queries.
  • Automatic static and dynamic partition pruning, which skips unneeded partitions while processing a query.
  • Achieving better partition pruning by redistributing data in micro-partitions using clustering keys.
  • Pre-computing the results of complex, regularly executed queries using materialized views.
  • Using the Search Optimization service to improve the performance of specific types of lookup queries.

query profile

Query Profile provides query execution details.
It displays a graphical representation of the main components of the processing plan for a specified query, along with statistics for each component and overall query information and statistics.
Query Profile is available for all queries, whether running, completed, or failed. It is a valuable tool for learning how queries work and can be used whenever you want or need to know more about how a query executes. It is designed to help you spot typical mistakes in SQL query expressions, identify potential performance bottlenecks, and devise strategies to improve them.

https://docs.snowflake.com/en/user-guide/ui-query-profile

✅Micro-partitions

Snowflake uses HyperLogLog to estimate the approximate number of distinct values in a data set. HyperLogLog is a state-of-the-art cardinality estimation algorithm, capable of estimating distinct cardinalities of trillions of rows with an average relative error of a few percent.

HyperLogLog can be used in place of COUNT(DISTINCT …) in situations where estimating cardinality is acceptable.

Snowflake recommends using HyperLogLog whenever the input is potentially large and an approximate result is acceptable. The average relative error of Snowflake's HyperLogLog implementation is 1.62338% (i.e., the average relative difference from the corresponding COUNT(DISTINCT …) result).
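
As a minimal sketch (the table and column names here are hypothetical), the approximate functions can be used as drop-in replacements for COUNT(DISTINCT …):

  -- Exact distinct count: scans and de-duplicates all values
  SELECT COUNT(DISTINCT customer_id) FROM orders;

  -- HyperLogLog-based estimate: much cheaper on large inputs
  SELECT APPROX_COUNT_DISTINCT(customer_id) FROM orders;

  -- HLL(...) is an alias for APPROX_COUNT_DISTINCT(...)
  SELECT HLL(customer_id) FROM orders;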

Micro-partitions are small and typically store 50 MB to 500 MB of uncompressed data. https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html

The closer the ratio of scanned micro-partitions and columnar data to the ratio of data actually selected (relative to the total columnar data), the more efficient the pruning performed on the table.

If an operation cannot fit in memory, Snowflake spills data to the warehouse's local disk. Data spilling slows down queries because it requires additional I/O operations, and disk access is slower than memory access. "Bytes spilled to local storage" in the query profile indicates local spillage. If the local disk also fills up, Snowflake spills data to remote cloud storage, which is slower still than local disk, making the operation even more expensive. "Bytes spilled to remote storage" in the query profile indicates remote spillage.

One of the ways to avoid spilling is to use a larger warehouse, which will increase the overall available RAM, local storage, and parallelism and might be able to fit the query in memory. https://docs.snowflake.com/en/user-guide/ui-query-profile#queries-too-large-to-fit-in-memory
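
As a sketch, spilling can also be spotted retrospectively via the ACCOUNT_USAGE.QUERY_HISTORY view rather than inspecting profiles one by one:

  -- Find recent queries that spilled to local or remote storage
  SELECT query_id,
         warehouse_name,
         bytes_spilled_to_local_storage,
         bytes_spilled_to_remote_storage
  FROM snowflake.account_usage.query_history
  WHERE bytes_spilled_to_local_storage > 0
     OR bytes_spilled_to_remote_storage > 0
  ORDER BY start_time DESC;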

clustering




Snowflake clusters data automatically as it is added to a table. It is possible to manually specify a clustering key and redistribute the data based on that key. 


However, a table cannot have multiple clustering keys defined.

If you need to drop or alter the clustering keys for a table, you can run the ALTER table command and use the appropriate syntax to drop, add or change clustering keys. https://docs.snowflake.com/en/user-guide/tables-clustering-keys#dropping-the-clustering-keys-for-a-table
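
A minimal sketch of that syntax (the table and column names are hypothetical):

  -- Add or change the clustering key on an existing table
  ALTER TABLE t1 CLUSTER BY (c1, c2);

  -- Drop the clustering key
  ALTER TABLE t1 DROP CLUSTERING KEY;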

All Snowflake editions support data clustering. 

https://docs.snowflake.com/en/user-guide/intro-editions.html

https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html

When defining clustering keys, the initial candidate columns are those frequently used in the WHERE clause or other selective filters. Columns used for joining can also be considered. Furthermore, each column's cardinality (number of distinct values) is important: it is crucial to choose a column with a high enough cardinality to allow effective partition pruning, yet a low enough cardinality for Snowflake to group data into micro-partitions efficiently. A column with too few distinct values (e.g., gender) will result in minimal partition pruning. On the other hand, a column with too many distinct values (e.g., customer ID) will result in too much overhead when maintaining the partitions. When creating a multi-column clustering key, order the columns from lowest to highest cardinality; otherwise, the effectiveness of clustering will be reduced.

Snowflake recommends using a maximum of 3 or 4 columns in a clustering key. Any more columns in the clustering key result in more maintenance costs and do not provide enough benefits to justify the clustering costs. 

Clustering keys are not for every table; tables in the multi-terabyte range are good candidates. Both automatic clustering and reclustering consume credits. A single clustering key can contain one or more columns or expressions.

Example - CREATE OR REPLACE TABLE t1 (c1 DATE, c2 STRING, c3 NUMBER) CLUSTER BY (c1, c2);

https://docs.snowflake.com/en/user-guide/tables-clustering-keys

👉Benefit

Defining a clustering key generally benefits queries that filter or sort on the clustering key columns during query execution. ORDER BY, GROUP BY, and certain joins require sorting during query execution. Queries that use a WHERE clause on the clustering key columns will also benefit from an adequately defined clustering key.

https://docs.snowflake.com/en/user-guide/tables-clustering-keys
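
For instance, given the table t1 clustered by (c1, c2) above, a query like the following sketch can prune effectively (the literal value is hypothetical):

  -- The filter on the leading clustering column lets Snowflake
  -- skip micro-partitions whose c1 ranges do not match
  SELECT c1, c2, c3
  FROM t1
  WHERE c1 = '2023-06-01'
  ORDER BY c2;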

👉clustering depth

For a populated table, the clustering depth is the average depth of overlapping micro-partitions for the specified columns. The clustering depth starts at 1 (for a well-clustered table) and can grow from there; the smaller the average depth, the better clustered the data is for the specified columns.

https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions#label-clustering-depth

For an unpopulated table, the clustering depth is ZERO.
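
A quick sketch of how to check this (the table and column names are hypothetical; both are documented system functions):

  -- Average clustering depth for specific columns
  SELECT SYSTEM$CLUSTERING_DEPTH('t1', '(c1, c2)');

  -- Fuller clustering details, returned as JSON
  SELECT SYSTEM$CLUSTERING_INFORMATION('t1', '(c1, c2)');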

👉Cost

For tables with a clustering key defined, Automatic Clustering, a Snowflake service, re-clusters the micro-partitions as needed, redistributing data according to the clustering key to achieve good partition pruning. With Automatic Clustering, Snowflake internally manages the clustered tables and any resource requirements. Automatic Clustering only rewrites those micro-partitions that benefit from the re-clustering process. It does not need a virtual warehouse; it uses Snowflake-managed compute resources (CPU, RAM, etc.) and therefore has a cost attached, which appears under serverless costs. Clustering a table consumes credits like any other data modification (DML) action in Snowflake. Re-clustering also adds extra storage: when data is physically redistributed, new micro-partitions are created, and the original micro-partitions are retained for Time Travel and Fail-safe purposes, resulting in increased storage.

Defining clustering keys for very large tables (in the multi-terabyte range) helps optimize table maintenance and query performance. Small tables are not a good candidate for clustering.

https://docs.snowflake.com/en/user-guide/tables-auto-reclustering#credit-usage-and-warehouses-for-automatic-clustering

Query Processing

The query processing layer is the compute layer through which queries and data processing jobs are executed on the stored data. The compute layer can have multiple clusters for a given Snowflake instance simultaneously. The compute engines in Snowflake are known as virtual warehouses. 

Virtual Warehouse

An XS-sized warehouse can load eight files in parallel.

An S-sized warehouse can load sixteen files in parallel.


👉Virtual Warehouse Privileges:

OPERATE - Enables changing the state of a warehouse (stop, start, suspend, resume). In addition, enables viewing current and past queries executed on a warehouse and aborting any executing queries.

MODIFY - Enables altering any properties of a warehouse, including changing its size. Required for assigning a warehouse to a resource monitor. Note that only the ACCOUNTADMIN role can assign warehouses to resource monitors. 

MONITOR - Enables viewing of current and past queries executed on a warehouse as well as usage statistics on that warehouse.   

USAGE - Enables using a virtual warehouse and, as a result, executing queries on the warehouse. If the warehouse is configured to auto-resume when a SQL statement (e.g. query) is submitted to it, the warehouse resumes automatically and executes the statement. 

OWNERSHIP - Grants full control over a warehouse. Only a single role can hold this privilege on a specific object at a time. 

ALL [ PRIVILEGES ] - Grants all privileges, except OWNERSHIP, on the warehouse.
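
A minimal sketch of granting these privileges (the warehouse and role names are hypothetical):

  -- Allow the role to run queries on the warehouse and to start/stop it
  GRANT USAGE, OPERATE ON WAREHOUSE my_wh TO ROLE analyst;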

👉Multi-cluster Warehouse

---Enterprise edition

Scaling modes determine the number of clusters (Auto-scale vs. Maximized) and how that number is reached (Auto-scale: a Standard or Economy scaling policy; Maximized: a fixed number of clusters).

  1. Auto-scale: specify different values for the maximum and minimum number of clusters.

Scaling Policy

  • Standard (default): starts additional clusters as soon as queries begin to queue, favoring performance over credit conservation.
  • Economy: starts an additional cluster only if the system estimates there's enough query load to keep the cluster busy for at least 6 minutes.

  2. Maximized: specify the same value for both the maximum and minimum number of clusters (note that the specified value must be larger than 1).
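
A minimal sketch of creating a multi-cluster warehouse in auto-scale mode (the name and size are hypothetical; this is an Enterprise edition feature):

  CREATE WAREHOUSE my_mc_wh
    WAREHOUSE_SIZE = 'MEDIUM'
    MIN_CLUSTER_COUNT = 1       -- auto-scale mode: min < max
    MAX_CLUSTER_COUNT = 3
    SCALING_POLICY = 'ECONOMY'; -- or 'STANDARD' (the default)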

👉Scaling

The Economy scaling policy prioritizes conserving credits over performance and user experience. It does not start additional clusters as soon as queuing is observed; instead, it applies additional criteria to decide whether or not to start new clusters. With the scaling policy set to Standard, Snowflake starts extra clusters almost as soon as it detects that queries are beginning to queue. The Standard scaling policy aims to prevent or minimize queuing. https://docs.snowflake.com/en/user-guide/warehouses-multicluster#setting-the-scaling-policy-for-a-multi-cluster-warehouse

It is possible to resize a virtual warehouse to suit changing workloads. Assume a customer began with a small virtual warehouse that is no longer sufficient as query complexity increases. Scaling up the virtual warehouse can accommodate the increased complexity: when a virtual warehouse is scaled up, more nodes are added to the compute cluster. 

When a virtual warehouse is resized, any currently executing queries are not impacted; only new queries are affected by the new size. 

When we suspend a warehouse, Snowflake immediately shuts down all idle compute resources for the warehouse. However, it allows any compute resources executing statements to continue until the statements are complete. At this time, the resources are shut down, and the warehouse status changes to “Suspended”. Compute resources waiting to shut down are considered to be in “quiesce” mode.

https://docs.snowflake.com/en/user-guide/warehouses-tasks#resizing-a-warehouse
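
A minimal sketch of these operations (the warehouse name is hypothetical):

  -- Scale up: new queries run on the larger size
  ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';

  -- Suspend: idle resources shut down immediately; busy ones quiesce
  ALTER WAREHOUSE my_wh SUSPEND;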

✅metadata

Snowflake stores information about micro-partitions in the metadata. It stores the range of column values, including the maximum and minimum values for each column in each micro-partition, as well as the count of distinct values for each column and certain other information used to optimize queries. Because this information is stored in the metadata cache, Snowflake does not have to read data from the tables for certain queries; it can retrieve the answer directly from the metadata. These include row-count queries and queries containing functions like MIN or MAX. 

Note that the metadata cache will not be used if you execute MIN or MAX on a character (string) column.
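
A sketch of queries that can typically be answered from metadata alone (the table and column names are hypothetical; the MIN/MAX example assumes a date or numeric column):

  -- Row count: answered from micro-partition metadata
  SELECT COUNT(*) FROM orders;

  -- MIN/MAX on a date or numeric column: answered from
  -- the per-partition min/max values stored in metadata
  SELECT MIN(order_date), MAX(order_date) FROM orders;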

Materialized View

A materialized view is a view that pre-computes data based on a SELECT query. The query's results are pre-computed and physically stored to enhance performance for similar queries executed in the future. When the underlying table is updated, the materialized view refreshes automatically, requiring no user maintenance. A Snowflake-managed service performs the refresh in the background, transparently to the user and without interfering with the user's experience. 

https://docs.snowflake.com/en/user-guide/views-materialized
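
A minimal sketch (the names are hypothetical; materialized views are an Enterprise edition feature and can reference only a single table):

  -- Pre-compute an aggregate that is queried frequently
  CREATE MATERIALIZED VIEW daily_sales_mv AS
    SELECT sale_date, SUM(amount) AS total_amount
    FROM sales
    GROUP BY sale_date;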

✅cache

👉query result cache

When Snowflake executes a query, the result is cached for a period of time. The query result cache returns results for subsequent identical queries without re-executing them and without requiring an active virtual warehouse. Snowflake can also fulfill COUNT, MIN, and MAX queries from the metadata cache, eliminating the need for an active warehouse.

Typically, query results are reused if all of the following conditions are met:

  • The new query syntactically matches the previously-executed query.

  • The query does not include functions that must be evaluated at execution time (e.g. CURRENT_TIMESTAMP() and UUID_STRING()). Note that the CURRENT_DATE() function is an exception to this rule; even though CURRENT_DATE() is evaluated at execution time, queries that use CURRENT_DATE() can still use the query reuse feature.

  • The query does not include external functions.

  • The table data contributing to the query result has not changed.

  • The persisted result for the previous query is still available.

  • The role accessing the cached results has the required privileges.

    • If the query was a SELECT query, the role executing the query must have the necessary access privileges for all the tables used in the cached query.

    • If the query was a SHOW query, the role executing the query must match the role that generated the cached results.

  • Any configuration options that affect how the result was produced have not changed.

  • The table’s micro-partitions have not changed (e.g. been reclustered or consolidated) due to changes to other data in the table.

https://docs.snowflake.com/en/user-guide/querying-persisted-results

The query result cache is enabled by default but can be turned off at the session, user, or account level using the USE_CACHED_RESULT parameter.

https://docs.snowflake.com/en/user-guide/querying-persisted-results
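
A minimal sketch of toggling it for the current session:

  -- Disable result reuse (useful when benchmarking queries)
  ALTER SESSION SET USE_CACHED_RESULT = FALSE;

  -- Re-enable it
  ALTER SESSION SET USE_CACHED_RESULT = TRUE;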

👉When the query result cache is used

Snowflake uses the query result cache if the following conditions are met: a new query matches a previous query, the underlying data contributing to the query results remains unchanged, and the table micro-partitions have not changed as a result of reclustering or consolidation. The query must make no use of user-defined, external, or runtime functions. Note that queries using the CURRENT_DATE() function remain eligible for query result caching. 

https://docs.snowflake.com/en/user-guide/querying-persisted-results

👉metadata cache

To improve query performance, Snowflake employs a variety of caching techniques. When a new query is submitted for execution, Snowflake can often provide the results immediately using either the metadata cache or the query result cache. In addition, each virtual warehouse has its own cache, which it builds up over time while executing queries, copying relevant micro-partitions from cloud storage to local SSD storage.

👉Warehouse cache

A running warehouse maintains a cache of table data that can be accessed by queries running on the same warehouse. This can improve the performance of subsequent queries if they are able to read from the cache instead of from tables.

The auto-suspension setting of the warehouse can have a direct impact on query performance because the cache is dropped when the warehouse is suspended. If a warehouse is running frequent and similar queries, it might not make sense to suspend the warehouse in between queries because the cache might be dropped before the next query is executed.

You can use the following general guidelines when setting the auto-suspension time limit:

  • For tasks, Snowflake recommends immediate suspension.

  • For DevOps, DataOps and Data Science use cases, Snowflake recommends setting auto-suspension to approximately 5 minutes, as the cache is not as important for ad-hoc and unique queries.

  • For query warehouses, for example BI and SELECT use cases, Snowflake recommends setting auto-suspension to at least 10 minutes to maintain the cache for users.

Keep in mind that a running warehouse consumes credits even if it is not processing queries. Be sure that your auto-suspension setting matches your workload. For example, if a warehouse executes a query every 30 minutes, it does not make sense to set the auto-suspension setting to 10 minutes. The warehouse will consume credits while sitting idle without gaining the benefits of a cache because it will be dropped before the next query executes.
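
A minimal sketch of applying these guidelines (the warehouse name is hypothetical; AUTO_SUSPEND is specified in seconds):

  -- BI / query warehouse: keep the cache warm for at least 10 minutes
  ALTER WAREHOUSE bi_wh SET AUTO_SUSPEND = 600;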

search optimization service

ENTERPRISE EDITION FEATURE

The search optimization service can significantly enhance the performance of certain lookup and analytical queries that use many predicates for filtering. The search optimization service uses a persistent data structure as an optimized search access path to speed up point lookups. When the data in the table changes (for example, by loading new data sets or performing DML operations), a maintenance service updates the search access path to reflect the changes. Both the search optimization configuration on a table and the maintenance service are transparent to users.

Search optimization is a table-level property and applies to all columns with supported data types. The search optimization service aims to significantly improve the performance of selective point lookup queries on tables. A point lookup query returns only one or a small number of distinct rows. A user can register one or more tables to the search optimization service.

|---> ALTER TABLE MY_TABLE ADD SEARCH OPTIMIZATION;

 

|---> ALTER TABLE MY_TABLE DROP SEARCH OPTIMIZATION;

For equality and IN-list predicates, the search optimization service supports the following data types: fixed-point numbers (e.g. INTEGER, NUMERIC), DATE, TIME, and TIMESTAMP, VARCHAR, and BINARY. Floating-point data types are currently not supported.

The search optimization service currently supports specific types of queries for the following data types:

  • Fixed-point numbers (e.g. INTEGER, NUMERIC).

  • DATE, TIME, and TIMESTAMP.

  • VARCHAR.

  • BINARY.

  • VARIANT, OBJECT, and ARRAY.

  • GEOGRAPHY.

https://docs.snowflake.com/en/user-guide/search-optimization-service
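
As an illustration, once search optimization is enabled, a selective point lookup like the following sketch can benefit (the table and literal value are hypothetical):

  -- Returns one or a few rows; the search access path can locate
  -- the matching micro-partitions without a full table scan
  SELECT * FROM my_table WHERE order_id = 1234567;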


