SnowPro Core Certification: Performance Concepts (3/8)
✅Performance
Snowflake's unique architecture and its underlying micro-partition storage technology mean that little query tuning is required in most situations. There are, however, several performance improvement approaches available that can be used to increase Snowflake's overall performance.
These include:
· Internal caching mechanisms that operate transparently in the background to increase performance.
· Scaling up, i.e., increasing the capacity of a virtual warehouse so more processing power is available for complex queries.
· Scaling out (horizontal scaling) by using a multi-cluster virtual warehouse to handle a large number of concurrent users and concurrent queries.
· Automatic static and dynamic partition pruning, which skips unneeded micro-partitions while processing a query.
· Achieving better partition pruning by redistributing data in micro-partitions using clustering keys.
· Pre-computing the results of complex, regularly executed queries by using materialized views.
· Using the Search Optimization Service to improve the performance of specific types of lookup queries.
✅query profile
✅Micro-partitions
Micro-partitions are small and typically store 50 MB to 500 MB of uncompressed data. https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html
Pruning efficiency can be judged by comparing what is scanned against what is selected: the closer the ratio of scanned micro-partitions (and columnar data) is to the ratio of data actually selected, the more efficient the pruning performed on the table.
Snowflake saves data to the warehouse's local disk if an operation cannot fit in memory. Data spilling slows down queries because it requires additional IO operations, and disk access is slower than memory access. "Bytes spilled to local storage" in the query profile indicates local spillage. If the local disk also fills up, Snowflake spills data to remote cloud storage, which is slower still; "Bytes spilled to remote storage" in the query profile indicates remote spillage.
One of the ways to avoid spilling is to use a larger warehouse, which will increase the overall available RAM, local storage, and parallelism and might be able to fit the query in memory. https://docs.snowflake.com/en/user-guide/ui-query-profile#queries-too-large-to-fit-in-memory
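Outside the query profile, spillage can also be reviewed across past queries via the ACCOUNT_USAGE.QUERY_HISTORY view; a minimal sketch (columns as documented):
SELECT query_id, warehouse_size,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE bytes_spilled_to_remote_storage > 0  -- remote spillage is the slowest case
ORDER BY start_time DESC;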
✅clustering
Snowflake clusters data automatically as it is added to a table. It is possible to manually specify a clustering key and redistribute the data based on that key.
However, a table cannot have multiple clustering keys defined.
If you need to drop or alter the clustering key for a table, you can run the ALTER TABLE command with the appropriate syntax to drop, add, or change the clustering key. https://docs.snowflake.com/en/user-guide/tables-clustering-keys#dropping-the-clustering-keys-for-a-table
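For example, following the documented ALTER TABLE syntax (table and column names are illustrative):
|---> ALTER TABLE t1 CLUSTER BY (c1, c3); -- add or change the clustering key
|---> ALTER TABLE t1 DROP CLUSTERING KEY; -- remove the clustering key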
All Snowflake editions support data clustering.
https://docs.snowflake.com/en/user-guide/intro-editions.html
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html
When defining clustering keys, the initial candidate clustering columns are those frequently used in the WHERE clause or other selective filters. Columns used for joining can also be considered. Furthermore, a column's cardinality (number of distinct values) is important: choose a column with a cardinality high enough to allow effective partition pruning, yet low enough for Snowflake to group data into micro-partitions efficiently. A column with too few distinct values (e.g., gender) will result in minimal partition pruning, while a column with too many distinct values (e.g., customer id) will result in too much overhead when maintaining the partitions. When creating a multi-column clustering key, order the columns from lowest to highest cardinality; otherwise, the effectiveness of clustering will be reduced.
Clustering keys are not for every table; tables in the multi-terabyte range are good candidates, and both automatic clustering and reclustering consume credits. A single clustering key can contain one or more columns or expressions. Snowflake recommends a maximum of three or four columns (or expressions) per key for most tables; adding more tends to increase maintenance costs more than it provides benefits.
Example - create or replace table t1 (c1 date, c2 string, c3 number) cluster by (c1, c2);
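As a further illustration of the cardinality guidance above (hypothetical table and column names), the lower-cardinality column comes first:
Example - create or replace table orders (region string, order_date date, customer_id number) cluster by (region, order_date); -- region (few distinct values) before order_date (more); customer_id is excluded as too high-cardinality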
https://docs.snowflake.com/en/user-guide/tables-clustering-keys
👉Benefit
Defining a clustering key generally benefits queries that filter or sort on the clustering key columns during query execution. ORDER BY, GROUP BY, and certain joins require sorting during execution, and queries with WHERE clauses on the clustering key columns will also benefit from an adequately defined clustering key.
https://docs.snowflake.com/en/user-guide/tables-clustering-keys
👉clustering depth
For a populated table, the clustering depth is the average depth of overlapping micro-partitions for the specified columns. The clustering depth starts at 1 (for a well-clustered table) and can be a much larger number; the smaller the average depth, the better clustered the data is for the specified columns.
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions#label-clustering-depth
For an unpopulated table, the clustering depth is ZERO
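The depth for a table can be checked with the documented system functions (table and column names are illustrative):
|---> SELECT SYSTEM$CLUSTERING_DEPTH('t1', '(c1, c2)');
|---> SELECT SYSTEM$CLUSTERING_INFORMATION('t1', '(c1, c2)'); -- returns the depth plus partition-overlap details as JSON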
👉Cost
For tables with a clustering key defined, Automatic Clustering, a Snowflake service, re-clusters micro-partitions as needed, redistributing data according to the clustering key to achieve effective partition pruning. Snowflake maintains clustered tables and any resource requirements internally, and Automatic Clustering adjusts only those micro-partitions that benefit from re-clustering. Automatic Clustering does not need a virtual warehouse; it uses Snowflake-managed compute resources, so it has a cost attached that appears under serverless costs. Clustering a table consumes credits like any other data modification (DML) action in Snowflake. Re-clustering also adds storage costs: when data is physically redistributed, new micro-partitions are created, and the original micro-partitions are retained for Time Travel and Fail-safe purposes, resulting in increased storage.
Defining clustering keys for very large tables (in the multi-terabyte range) helps optimize table maintenance and query performance. Small tables are not a good candidate for clustering.
https://docs.snowflake.com/en/user-guide/tables-auto-reclustering#credit-usage-and-warehouses-for-automatic-clustering
✅Query Processing
The query processing layer is the compute layer through which queries and data processing jobs are executed on the stored data. The compute layer can have multiple clusters for a given Snowflake instance simultaneously. The compute engines in Snowflake are known as virtual warehouses.
✅Virtual Warehouse
An XS-sized warehouse can load eight files in parallel, and a Small warehouse can load sixteen; each increase in warehouse size doubles the number of files that can be loaded in parallel.
OPERATE - Enables changing the state of a warehouse (stop, start, suspend, resume). In addition, enables viewing current and past queries executed on a warehouse and aborting any executing queries.
MODIFY - Enables altering any properties of a warehouse, including changing its size. Required to assign a warehouse to a resource monitor. Note that only the ACCOUNTADMIN role can assign warehouses to resource monitors.
MONITOR - Enables viewing of current and past queries executed on a warehouse as well as usage statistics on that warehouse.
USAGE - Enables using a virtual warehouse and, as a result, executing queries on the warehouse. If the warehouse is configured to auto-resume when a SQL statement (e.g. query) is submitted to it, the warehouse resumes automatically and executes the statement.
OWNERSHIP - Grants full control over a warehouse. Only a single role can hold this privilege on a specific object at a time.
ALL [ PRIVILEGES ] - Grants all privileges, except OWNERSHIP, on the warehouse.
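A sketch of granting warehouse privileges (warehouse and role names are hypothetical):
|---> GRANT USAGE, OPERATE ON WAREHOUSE analytics_wh TO ROLE analyst;
|---> GRANT MONITOR ON WAREHOUSE analytics_wh TO ROLE auditor;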
👉Multi-cluster Warehouse
---Requires Enterprise edition (or higher)
Scaling Modes determine how many clusters run: Auto-scale (the number of clusters varies between a specified minimum and maximum, governed by a Standard or Economy scaling policy) or Maximized (a fixed number of clusters).
- Auto-scale: specify different values for the minimum and maximum number of clusters.
Scaling Policy
- Standard (default): spins up additional clusters as soon as queries start to queue, to prevent or minimize queuing.
- Economy: spins up an additional cluster only if the system estimates there's enough query load to keep the cluster busy for at least 6 minutes.
- Maximized: same value for both maximum and minimum number of clusters (note that the specified value must be larger than 1).
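A minimal sketch of creating an auto-scale multi-cluster warehouse (the warehouse name is hypothetical; the parameters are documented):
|---> CREATE WAREHOUSE reporting_wh WAREHOUSE_SIZE = 'MEDIUM' MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4 SCALING_POLICY = 'ECONOMY';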
👉Scaling
The Economy scaling policy favors conserving credits over performance and user experience. It does not spin up additional clusters as soon as queuing is observed but instead applies additional criteria to decide whether to do so. With the scaling policy set to Standard, Snowflake prefers to spin up extra clusters almost as soon as it detects that queries are starting to queue; the Standard scaling policy aims to prevent or minimize queuing. https://docs.snowflake.com/en/user-guide/warehouses-multicluster#setting-the-scaling-policy-for-a-multi-cluster-warehouse
It is possible to resize a virtual warehouse to suit changing workloads. Assume a customer began with a small virtual warehouse which may no longer be sufficient to meet increased query complexity. Scaling up the virtual warehouse can accommodate the increasing work complexity. When a virtual warehouse is scaled up, more nodes are added to the compute cluster.
When a virtual warehouse is resized, any currently executing queries are not impacted—only new queries are affected by the new size.
When we suspend a warehouse, Snowflake immediately shuts down all idle compute resources for the warehouse. However, it allows any compute resources executing statements to continue until the statements are complete. At this time, the resources are shut down, and the warehouse status changes to “Suspended”. Compute resources waiting to shut down are considered to be in “quiesce” mode.
https://docs.snowflake.com/en/user-guide/warehouses-tasks#resizing-a-warehouse
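A sketch of resizing and manually suspending/resuming a warehouse (the warehouse name is hypothetical):
|---> ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE'; -- new queries use the new size; running queries are unaffected
|---> ALTER WAREHOUSE reporting_wh SUSPEND;
|---> ALTER WAREHOUSE reporting_wh RESUME;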
✅metadata
✅Materialized View
A materialized view pre-computes the results of a SELECT query and physically stores them to enhance the performance of similar queries executed in the future. When the underlying table is updated, the materialized view refreshes automatically; no user maintenance is required. Snowflake-managed services perform the update in the background, transparently to the user and without interfering with the user's experience.
https://docs.snowflake.com/en/user-guide/views-materialized
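A minimal sketch of creating a materialized view (object names are hypothetical):
|---> CREATE MATERIALIZED VIEW mv_active_orders AS SELECT order_id, region FROM orders WHERE status = 'ACTIVE';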
✅cache
👉query result cache
When Snowflake executes a query, the result is cached for a period of time. The query result cache returns results for subsequent identical queries without re-executing them and without requiring an active virtual warehouse. Snowflake can also fulfill COUNT, MIN, and MAX queries using the metadata cache, again eliminating the need for an active warehouse.
Typically, query results are reused if all of the following conditions are met:
The new query syntactically matches the previously-executed query.
The query does not include functions that must be evaluated at execution time (e.g. CURRENT_TIMESTAMP() and UUID_STRING()). Note that the CURRENT_DATE() function is an exception to this rule; even though CURRENT_DATE() is evaluated at execution time, queries that use CURRENT_DATE() can still use the query reuse feature.
The query does not include external functions.
The table data contributing to the query result has not changed.
The persisted result for the previous query is still available.
The role accessing the cached results has the required privileges.
If the query was a SELECT query, the role executing the query must have the necessary access privileges for all the tables used in the cached query.
If the query was a SHOW query, the role executing the query must match the role that generated the cached results.
Any configuration options that affect how the result was produced have not changed.
The table’s micro-partitions have not changed (e.g. been reclustered or consolidated) due to changes to other data in the table.
https://docs.snowflake.com/en/user-guide/querying-persisted-results
Query result cache is enabled by default but can be turned off at a session, user, or account level using the USE_CACHED_RESULT parameter.
https://docs.snowflake.com/en/user-guide/querying-persisted-results
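For example, result reuse can be disabled for the current session while testing (the parameter is documented):
|---> ALTER SESSION SET USE_CACHED_RESULT = FALSE;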
When does Snowflake use the query result cache? In short: a new query matches a previous query; the underlying data contributing to the query results remains unchanged; the table micro-partitions have not changed as a result of reclustering or consolidation; and the query does not use user-defined, external, or runtime functions. Note that queries using CURRENT_DATE() remain eligible for query result caching.
https://docs.snowflake.com/en/user-guide/querying-persisted-results
👉metadata cache
To improve query performance, Snowflake employs a variety of caching techniques. When a new query is submitted for execution, Snowflake can sometimes provide the results immediately using either the metadata cache or the query result cache. In addition, each virtual warehouse has its own local cache, built over time as it executes queries and copies relevant micro-partitions from cloud storage to local SSD storage.
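For instance, simple aggregate queries of the kind mentioned above can often be answered from metadata without a running warehouse (table and column names are hypothetical):
|---> SELECT COUNT(*), MIN(order_date), MAX(order_date) FROM orders; -- can be served from the metadata cache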
👉Warehouse cache
The auto-suspension setting of the warehouse can have a direct impact on query performance because the cache is dropped when the warehouse is suspended. If a warehouse is running frequent and similar queries, it might not make sense to suspend the warehouse in between queries because the cache might be dropped before the next query is executed.
You can use the following general guidelines when setting the auto-suspension time limit:
For tasks, Snowflake recommends immediate suspension.
For DevOps, DataOps and Data Science use cases, Snowflake recommends setting auto-suspension to approximately 5 minutes, as the cache is not as important for ad-hoc and unique queries.
For query warehouses, for example BI and SELECT use cases, Snowflake recommends setting auto-suspension to at least 10 minutes to maintain the cache for users.
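A sketch of applying such a guideline (the value is in seconds; the warehouse name is hypothetical):
|---> ALTER WAREHOUSE bi_wh SET AUTO_SUSPEND = 600; -- keep the cache warm through 10 minutes of inactivity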
✅search optimization service
The search optimization service can significantly enhance the performance of some lookup and analytical queries that use many predicates for filtering. It uses a persistent data structure as an optimized search access path to speed up point lookups. When the data in the table changes (for example, when new data sets are loaded or DML operations are performed), a maintenance service updates the search access path to reflect the changes. Both the search optimization configuration on a table and the maintenance service are transparent to users.
Search optimization is a table-level property and applies to all columns with supported data types. The search optimization service aims to significantly improve the performance of selective point lookup queries on tables. A point lookup query returns only one or a small number of distinct rows. A user can register one or more tables to the search optimization service.
|---> ALTER TABLE MY_TABLE ADD SEARCH OPTIMIZATION;
|---> ALTER TABLE MY_TABLE DROP SEARCH OPTIMIZATION;
For equality and IN list predicates, the search optimization service supports fixed-point numbers (e.g. INTEGER, NUMERIC), DATE, TIME, TIMESTAMP, VARCHAR, and BINARY; floating-point data types are not supported.
The search optimization service currently supports specific types of queries for the following data types:
Fixed-point numbers (e.g. INTEGER, NUMERIC).
DATE, TIME, and TIMESTAMP.
VARCHAR.
BINARY.
VARIANT, OBJECT, and ARRAY.
GEOGRAPHY.
https://docs.snowflake.com/en/user-guide/search-optimization-service