SnowPro Core Certification: Data Loading and Unloading (4/8)

✅ Warehouse and parallel loading

An XS-sized warehouse can load eight files in parallel.

An S-sized warehouse can load sixteen files in parallel.

✅ Clone and load

Cloning doesn't copy the load metadata of the source table: a cloned table does not include the load history of the source table, so the load metadata for a cloned table is empty. As a result, files already loaded into the source table can be loaded again into the cloned table.

✅ Load

👉File size

To optimize the number of parallel operations for a load, Snowflake recommends producing compressed data files roughly 100-250 MB (or larger) in size.

👉Pattern matching

Pattern matching using a regular expression is generally the slowest of the three options for identifying/specifying data files to load from a stage; however, this option works well if you exported your files in named order from your external application and want to batch load the files in the same order.
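For illustration, the three ways of scoping a COPY to specific staged files might look like this when issued from Python through a Snowpark session (the stage, table, and file names are made up):

# `session` is an existing snowflake.snowpark.Session; connection setup omitted.

# 1. Narrow by path prefix (the fastest way to scope the file set)
session.sql("COPY INTO my_table FROM @my_stage/2023/01/").collect()

# 2. Name files explicitly with FILES
session.sql("""
    COPY INTO my_table FROM @my_stage
    FILES = ('data_001.csv.gz', 'data_002.csv.gz')
""").collect()

# 3. PATTERN regex (generally the slowest of the three)
session.sql("""
    COPY INTO my_table FROM @my_stage
    PATTERN = '.*data_[0-9]+[.]csv[.]gz'
""").collect()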

👉Loading time limit

If a data loading operation continues beyond the maximum allowed duration of 24 hours, it may be aborted without any portion of the file being committed.

👉Transformation in Loading


When loading data into a table using the COPY command, Snowflake allows you to perform simple transformations on the data as it is being loaded. During the load process, the COPY command can reorder columns, omit one or more columns, cast data into specified data types, and truncate values.

Complex transformations such as joins, filters, aggregations, and FLATTEN are not supported during loading. Joining, filtering, and aggregating the data are supported only after the data has been loaded into a table.

Table stages do not allow transformations during the COPY process; transformations may only be performed while loading data from external stages, named internal stages, or user stages.
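A minimal sketch of a transforming COPY (hypothetical named stage and table): positional references $1, $2, … select and reorder the file's columns, with casts applied inline.

# `session` is an existing snowflake.snowpark.Session; names are placeholders.
session.sql("""
    COPY INTO employees (emp_name, emp_id, hire_date)
    FROM (
        SELECT $2,                -- reorder: file column 2 -> emp_name
               $1::NUMBER,        -- cast file column 1 to a number
               $3::DATE           -- cast file column 3 to a date
        FROM @my_named_stage      -- named stage; table stages do not support this
    )
""").collect()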

👉File Format

If no file format object or options are provided to either the stage or the COPY statement, the default behaviour is to interpret the contents of the stage as CSV with UTF-8 encoding.

👉Parameters in copy into

ENFORCE_LENGTH:   
  • If TRUE, the COPY statement produces an error if a loaded string exceeds the target column length.

  • If FALSE, strings are automatically truncated to the target column length. 

TRUNCATECOLUMNS:

  • If TRUE, strings are automatically truncated to the target column length.

  • If FALSE, the COPY statement produces an error if a loaded string exceeds the target column length.
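As a quick illustration (hypothetical table and stage, run through a Snowpark session created elsewhere), either setting below loads oversized strings by truncating them instead of failing:

# Truncate oversized strings rather than raising an error; names are placeholders.
session.sql("""
    COPY INTO my_table FROM @my_stage
    TRUNCATECOLUMNS = TRUE    -- equivalent to ENFORCE_LENGTH = FALSE
""").collect()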

👉Load metadata

The load metadata stores a variety of information, such as the name of every file that was loaded into the table and the timestamp of when each file was loaded. Using this load metadata, Snowflake ensures that it does not reprocess a previously loaded file. The load metadata expires after 64 days.

When loading, Snowflake skips over any older files (beyond the 64-day window) for which the load status is undetermined.
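To see what the load metadata knows about a table, one option is the INFORMATION_SCHEMA.LOAD_HISTORY view; FORCE = TRUE deliberately bypasses the metadata and reloads files. A sketch with made-up names, assuming an existing Snowpark session:

# Which files were already loaded into MY_TABLE, and when?
session.sql("""
    SELECT file_name, last_load_time, row_count, status
    FROM my_db.information_schema.load_history
    WHERE table_name = 'MY_TABLE'
""").collect()

# Reload files even though the load metadata marks them as already loaded
session.sql("COPY INTO my_table FROM @my_stage FORCE = TRUE").collect()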

Try Data Engineering in Python with Snowpark in Just 10 Minutes [new feature]

Getting Started with Snowpark in Snowflake Python Worksheets

By completing this guide, you will be able to try Snowpark for Python all from within the Snowflake UI. At the end, you’ll have a better understanding of how to perform core data engineering tasks using Snowpark in Snowflake Python Worksheet.
What is Snowpark
It allows developers to query data and write data applications inside Snowflake with languages other than SQL using a set of APIs and DataFrame-style programming constructs in Python, Java, and Scala. These applications run on and take advantage of the same distributed computation on Snowflake's elastic engine as your SQL workloads. Learn more about Snowpark.
What is Python Worksheet
Python worksheets are a new type of worksheet in Snowsight that helps you get started with Snowpark faster. Users can develop data pipelines, ML models and applications directly inside Snowflake, no additional IDE (development UI) to spin up, set up or maintain for Python. These worksheets can be converted into procedures to schedule your Snowpark applications.
What you will learn
- Load data from Snowflake tables into Snowpark DataFrames

- Perform Exploratory Data Analysis on Snowpark DataFrames

- Pivot and Join data from multiple tables using Snowpark DataFrames

- Save transformed data into Snowflake table
What you will build
A prepared dataset that can be used in downstream analysis and applications. For example, training a machine learning model.
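A rough sketch of those steps with the Snowpark DataFrame API; the connection parameters, table names, and column names below are all placeholders:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg

# Connection parameters are placeholders for a real account/user.
session = Session.builder.configs({"account": "<acct>", "user": "<user>",
                                   "password": "<pwd>"}).create()

orders = session.table("ORDERS")             # load a table into a DataFrame
customers = session.table("CUSTOMERS")

orders.show()                                # quick exploratory look at the data

joined = orders.join(customers,              # join two DataFrames
                     orders["CUSTOMER_ID"] == customers["ID"])
summary = joined.group_by("REGION").agg(avg("AMOUNT").alias("AVG_AMOUNT"))

summary.write.save_as_table("ORDER_SUMMARY", mode="overwrite")  # save the result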


Setup Lab

You may start with a 30-day trial account ($400 credit).


SnowPro Core Certification: Performance Concepts (3/8)

Performance

Snowflake's unique architecture and the underlying micro-partition storage technology mean that little query tuning is required in most situations. There are, however, several approaches available to improve Snowflake's overall performance.

These include:

  • Internal caching mechanisms that operate transparently in the background to increase performance.
  • Scaling up, i.e. increasing the capacity of a virtual warehouse, to make more processing power available for complex queries.
  • Horizontal scaling, i.e. increasing capacity with a multi-cluster virtual warehouse, to handle a large number of concurrent users and concurrent queries.
  • Automatic static and dynamic partition pruning, which eliminates unneeded partitions while processing a query.
  • Better partition pruning achieved by redistributing data in micro-partitions using clustering keys.
  • Pre-computing the results of complex, regularly executed queries using materialized views.
  • Using the Search Optimization service to improve the performance of specific types of lookup queries.
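As one concrete illustration (the warehouse name is hypothetical; multi-cluster warehouses require Enterprise Edition or higher), both scaling directions are single statements, e.g. via a Snowpark session:

# `session` is an existing snowflake.snowpark.Session.

# Scale up: more compute per cluster for complex queries
session.sql("ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE'").collect()

# Scale out: let the warehouse add clusters for high concurrency
session.sql("""
    ALTER WAREHOUSE my_wh SET
        MIN_CLUSTER_COUNT = 1
        MAX_CLUSTER_COUNT = 4
""").collect()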

Query Profile

Query Profile provides query execution details
It displays a graphical representation of the main components of the processing plan for the specified query, as well as statistics for each component and overall query information and statistics. 
Query Profile is available for all queries, whether running, completed, or failed. It is a valuable tool for learning how queries work and can be used whenever you want or need to know more about how a query executes. It is designed to help you spot typical mistakes in SQL query expressions, identify potential performance bottlenecks, and devise strategies for improvement.

https://docs.snowflake.com/en/user-guide/ui-query-profile

✅Micro-partitions

Snowflake uses HyperLogLog to estimate the approximate number of distinct values in a data set. HyperLogLog is a state-of-the-art cardinality estimation algorithm, capable of estimating distinct cardinalities of trillions of rows with an average relative error of a few percent.

HyperLogLog can be used in place of COUNT(DISTINCT …) in situations where estimating cardinality is acceptable.

Snowflake recommends using HyperLogLog whenever the input is potentially large and an approximate result is acceptable. The average relative error of Snowflake's HyperLogLog implementation is 1.62338% (i.e., the average relative difference from the corresponding COUNT(DISTINCT …) result).
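For example, with a hypothetical events table, APPROX_COUNT_DISTINCT (which uses HyperLogLog; HLL() is a synonym) can stand in for COUNT(DISTINCT …):

# `session` is an existing snowflake.snowpark.Session; table/column are made up.

# Exact but expensive on very large inputs
session.sql("SELECT COUNT(DISTINCT user_id) FROM events").collect()

# HyperLogLog estimate, ~1.6% average relative error; HLL(user_id) is equivalent
session.sql("SELECT APPROX_COUNT_DISTINCT(user_id) FROM events").collect()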

Micro-partitions are small and typically store 50 MB to 500 MB of uncompressed data. https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html

The closer the ratio of scanned micro-partitions and columnar data is to the ratio of data actually selected (relative to the total), the more efficient the pruning performed on the table.

Snowflake saves data to the warehouse's local disk if an operation cannot fit in memory. Data spilling slows down queries because it requires extra IO operations, and disk access is slower than memory access. "Bytes spilled to local storage" in the query profile indicates local spillage. If the local disk also fills up, Snowflake spills data to remote cloud storage, which is slower still, making the operation even slower. "Bytes spilled to remote storage" in the query profile indicates remote spillage.

One of the ways to avoid spilling is to use a larger warehouse, which will increase the overall available RAM, local storage, and parallelism and might be able to fit the query in memory. https://docs.snowflake.com/en/user-guide/ui-query-profile#queries-too-large-to-fit-in-memory
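Besides the Query Profile, spillage can also be spotted in SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY (requires access to the shared SNOWFLAKE database); a sketch assuming an existing Snowpark session:

# Find recent queries that spilled to remote storage
session.sql("""
    SELECT query_id,
           bytes_spilled_to_local_storage,
           bytes_spilled_to_remote_storage
    FROM snowflake.account_usage.query_history
    WHERE bytes_spilled_to_remote_storage > 0
    ORDER BY start_time DESC
    LIMIT 20
""").collect()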


Variable Type Conversion in Python

⏩ Four type-conversion functions: int(), float(), str(), bool()

👉 int() can be used to convert other objects to integers

Rules:

#   Booleans: True -> 1, False -> 0
#   Floats: truncated toward the integer part; the fractional part is dropped
#   Strings: a valid integer string is converted directly to the corresponding number;
#            an invalid integer string raises ValueError: invalid literal for int() with base 10: '11.5'
#            for other bases, pass an extra base argument, for example:
                    print(int('0x123', 16))
                    print(int('0o123', 8))

#   Other objects that cannot be converted to an integer raise an exception (TypeError)

👉 float() works basically the same as int(), except that it converts the object to a float

👉 str() can convert an object to a string

#  True -> 'True'
#  False -> 'False'
#  123 -> '123' 

👉 bool() converts an object to a Boolean value; any object can be converted to a Boolean

#   Rule: all objects representing "emptiness" convert to False; everything else converts to True
#       "Empty" values include: 0, None, '', etc.

👉 Use type() for type checking

print(type(a))
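Putting the four functions together in one runnable snippet:

# int(): bool -> 0/1, float -> truncated, valid numeric string -> number
print(int(True), int(11.9), int('42'))           # 1 11 42
print(int('0x123', 16), int('0o123', 8))         # 291 83

# float() follows the same rules but yields a float
print(float('11.5'), float(True))                # 11.5 1.0

# str() renders any object as a string
print(str(True), str(123))                       # True 123

# bool(): "empty" values are falsy, everything else is truthy
print(bool(0), bool(None), bool(''), bool('a'))  # False False False True

# type() checks the resulting type
print(type(int('42')))                           # <class 'int'>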

SnowPro Core Certification: Account Access and Security (2/8)


#Account

Each Snowflake account is hosted in a particular Snowflake region.

To use Snowflake in multiple regions, a customer needs to maintain multiple Snowflake accounts, at least one for each region. For example, operating in two regions requires two different accounts, one per region.

 https://docs.snowflake.com/en/user-guide/intro-regions.html

#reader account

A reader account can only consume data from the data provider account that created it. 

https://docs.snowflake.com/en/user-guide/data-sharing-reader-create#what-is-restricted-allowed-in-a-reader-account

Sharing with a non-Snowflake user requires the creation of a reader account. The reader account provides the non-Snowflake user with a Snowflake account through which they can consume the shared data. The data providers own the reader account, and ❗all costs (including query costs) are charged to the data provider. 

https://docs.snowflake.com/en/user-guide/data-sharing-reader-create


#different cloud platform 

It is possible to share data with Snowflake accounts in another cloud platform, but the provider must enable replication and replicate your existing database to the other cloud platform. 

https://docs.snowflake.com/en/user-guide/secure-data-sharing-across-regions-plaforms

#user and #role

👍Yes, it is possible to create a user in Snowflake without a password. We cannot use the Snowflake web interface to create users with no passwords or remove passwords from existing users, and we must use CREATE USER or ALTER USER. 

❗Without a password in Snowflake, a user cannot log in using Snowflake authentication and must use federated authentication instead.
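A sketch of such a statement (the user name and role are made up), issued here through a Python session since the web UI cannot do it:

# `session` is an existing snowflake.snowpark.Session; requires USERADMIN or higher.
session.sql("""
    CREATE USER sso_only_user
        DEFAULT_ROLE = PUBLIC
        -- no PASSWORD property: Snowflake (password) login is impossible,
        -- so this user must sign in via federated authentication (SSO)
""").collect()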

#ACCOUNTADMIN

ACCOUNTADMIN is the most powerful role in a Snowflake account. Due to the role hierarchy and privilege inheritance, ACCOUNTADMIN inherits all the privileges that SECURITYADMIN and USERADMIN have.

## resource monitor

From a privilege perspective, only account administrators (users with the ACCOUNTADMIN role) can create new resource monitors. However, account administrators can grant privileges on an existing resource monitor to allow other users to view and modify it. The MONITOR and MODIFY privileges on a resource monitor allow other users to view and modify that specific resource monitor.

https://docs.snowflake.com/en/user-guide/resource-monitors#access-control-privileges-for-resource-monitors
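For instance (names and thresholds are illustrative, executed as ACCOUNTADMIN through a Python session):

# `session` is an existing snowflake.snowpark.Session.
session.sql("""
    CREATE RESOURCE MONITOR monthly_cap WITH
        CREDIT_QUOTA = 100
        TRIGGERS ON 90 PERCENT DO NOTIFY
                 ON 100 PERCENT DO SUSPEND
""").collect()
session.sql("ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = monthly_cap").collect()

# Allow another role to view and modify this monitor
session.sql("GRANT MONITOR, MODIFY ON RESOURCE MONITOR monthly_cap TO ROLE analyst").collect()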


#SYSADMIN

The SYSADMIN role has privileges to create warehouses and databases (and other objects) in an account. It also has the ability to grant privileges on warehouses, databases, and other objects to other roles.

https://docs.snowflake.com/en/user-guide/security-access-control-overview#system-defined-roles.

Snowflake's access control system

Snowflake's access control system is built on the 🔒 RBAC idea, which means that privileges are granted to roles and roles to users. The privileges associated with a role are given to all users assigned to it. Snowflake also supports 🔒 discretionary access control (DAC), which means that the role that created an object owns it and can grant access on that object to other roles.

https://docs.snowflake.com/en/user-guide/security-access-control-overview

#SECURITYADMIN  (aka Security Administrator)


Role that can manage any object grant globally, as well as create, monitor, and manage users and roles. More specifically, this role:

  • Is granted the MANAGE GRANTS security privilege to be able to modify any grant, including revoking it.

  • Inherits the privileges of the USERADMIN role via the system role hierarchy (i.e. USERADMIN role is granted to SECURITYADMIN).

#useradmin

Role that is dedicated to user and role management only. More specifically, this role:

  • Is granted the CREATE USER and CREATE ROLE security privileges.

  • Can create users and roles in the account.

    This role can also manage users and roles that it owns. Only the role with the OWNERSHIP privilege on an object (i.e. user or role), or a higher role, can modify the object properties.

The USERADMIN role is typically meant for creating and managing users. However, the privileges of the USERADMIN role are inherited by SECURITYADMIN and ACCOUNTADMIN; therefore, they also get the privileges to create users. ACCOUNTADMIN is the most powerful role anyway and can do anything in a Snowflake account.

#Private connectivity

Private connectivity enables you to ensure that access to your Snowflake instance is via a secure connection and, potentially, to block internet-based access completely. Private connectivity to Snowflake requires the Business-Critical edition as a minimum.

#Replication

Database and share replication are available in all editions, including the Standard edition. 

Replication of all other objects is only available for Business Critical Edition (or higher).

https://docs.snowflake.com/en/user-guide/account-replication-intro


Security, Governance, and Data Protection

Choose the geographical location where your data is stored, based on your region.

User authentication through standard user/password credentials.

Enhanced authentication:

  • Multi-factor authentication (MFA).
  • Federated authentication and single sign-on (SSO).
  • Snowflake OAuth.
  • External OAuth.

  • All communication (in transit) between clients and the server protected through TLS.

  • Encryption of all static data (data at rest).

  • Deployment inside a cloud platform VPC (AWS or GCP) or VNet (Azure).

  • Isolation of data (for loading and unloading) using:
      • Amazon S3 policy controls.
      • Azure storage access controls.
      • Google Cloud Storage access permissions.

  • Support for PHI data (in compliance with HIPAA and HITRUST CSF regulations) — requires Business Critical Edition (or higher).

  • Automatic data encryption by Snowflake using Snowflake-managed keys.

  • Object-level access control.

  • Snowflake Time Travel (1 day standard for all accounts; additional days, up to 90, allowed with Snowflake Enterprise) for:
      • Querying historical data in tables.
      • Restoring and cloning historical data in databases, schemas, and tables.

  • Snowflake Fail-safe (7 days standard for all accounts) for disaster recovery of historical data.

  • Column-level Security to apply masking policies to columns in tables or views — requires Enterprise Edition (or higher).

  • Row-level Security to apply row access policies to tables and views — requires Enterprise Edition (or higher).

  • Object Tagging to apply tags to Snowflake objects to facilitate tracking sensitive data and resource usage — requires Enterprise Edition (or higher).


Important

Each Snowflake account is hosted in a single region. If you wish to use Snowflake across multiple regions, you must maintain a Snowflake account in each of the desired regions.


Connecting to Snowflake

Snowflake supports multiple ways of connecting to the service:

  • A web-based user interface from which all aspects of managing and using Snowflake can be accessed.

  • Command line clients (e.g. SnowSQL) which can also access all aspects of managing and using Snowflake.

  • ODBC and JDBC drivers that can be used by other applications (e.g. Tableau) to connect to Snowflake.

  • Native connectors (e.g. Python, Spark) that can be used to develop applications for connecting to Snowflake.

  • Third-party connectors that can be used to connect applications such as ETL tools (e.g. Informatica) and BI tools (e.g. ThoughtSpot) to Snowflake.

Snowflake provides clients including SnowSQL (a command-line interface for Linux, Windows, and macOS), connectors for Python, Kafka, and Spark, and drivers for Node.js, JDBC, ODBC, and more.

https://developers.snowflake.com/snowsql/

The Snowflake SQL API is a REST API that you can use to access and update data in a Snowflake database.

Snowflake SQL API provides operations that we can use to:

  • Submit SQL statements for execution.
  • Check the status of the execution of a statement.
  • Cancel the execution of a statement.
  • Fetch query results concurrently. 
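A minimal sketch of submitting a statement with Python's requests library; the account URL, token, and table name below are placeholders:

import requests

ACCOUNT_URL = "https://<account_identifier>.snowflakecomputing.com"  # placeholder
TOKEN = "<oauth-or-jwt-token>"                                       # placeholder

# Submit a SQL statement for execution via the SQL API
resp = requests.post(
    f"{ACCOUNT_URL}/api/v2/statements",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
        # For key-pair JWTs this header is also required:
        # "X-Snowflake-Authorization-Token-Type": "KEYPAIR_JWT",
    },
    json={"statement": "SELECT COUNT(*) FROM my_table", "timeout": 60},
)
print(resp.status_code, resp.json())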

#secure view

You may create a secure view if you need to share data from many tables in separate databases. Because several databases cannot be added to a single share, Snowflake suggests creating secure views within a single database and sharing that database. 

https://docs.snowflake.com/en/user-guide/data-sharing-mutiple-db

authentication methods

Snowflake supports the following authentication methods: Username and password, MFA, SAML, and other authentication methods such as OAuth, Key-pair, and SCIM.


Snowflake supports SCIM 2.0 and is compatible with Okta and Azure Active Directory. SCIM is an open standard that provides automatic user provisioning and role synchronization based on identity provider information. When a new user is created in the identity provider, the SCIM automatically provisions the user in Snowflake. Additionally, SCIM can sync groups defined in an identity provider with Snowflake roles. https://docs.snowflake.com/en/user-guide/scim

#Securable Object

Securable Object is an entity to which access can be granted. Unless allowed by a grant, access will be denied.

#NETWORK POLICY

Only security administrators (i.e., users with the SECURITYADMIN role) or higher, or a role with the global CREATE NETWORK POLICY privilege, can create network policies using Snowsight, the Classic Web Interface, or SQL.

The SHOW PARAMETERS command shows whether a network policy is set on the account or for a specific user.

Account level: SHOW PARAMETERS LIKE 'network_policy' IN ACCOUNT;

User level: SHOW PARAMETERS LIKE 'network_policy' IN USER <username>;

        Example: SHOW PARAMETERS LIKE 'network_policy' IN USER john;


Network policies currently support only Internet Protocol version 4 (i.e. IPv4) addresses.

##Tri-Secret Secure 

Tri-Secret Secure refers to the combination of a Snowflake-managed key and a customer-managed key, which results in the creation of a composite master key to protect your data. Tri-Secret Secure requires the Business Critical edition as a minimum and can be activated by contacting Snowflake support. 

https://docs.snowflake.com/en/user-guide/security-encryption-manage

encryption keys

By default, Snowflake manages encryption keys automatically, requiring no customer intervention:

--> Snowflake-managed keys are rotated regularly (at 30-day intervals), and

--> an annual rekeying process re-encrypts data with new keys.

The data encryption and key management processes are entirely transparent to users.

--> Snowflake uses AES 256-bit encryption to encrypt data at rest.

https://docs.snowflake.com/en/user-guide/security-encryption-manage

Snowflake encrypts all data in transit using Transport Layer Security (TLS) 1.2. This applies to all Snowflake connections, including those made through the Snowflake Web interface, JDBC, ODBC, and the Python connector. 

Snowflake-managed keys

All Snowflake-managed keys are automatically rotated by Snowflake when they are more than 30 days old. Active keys are retired, and new keys are created. When Snowflake determines the retired key is no longer needed, the key is automatically destroyed. When active, a key is used to encrypt data and is available for usage by the customer. When retired, the key is used solely to decrypt data and is only available for accessing the data.

https://docs.snowflake.com/en/user-guide/security-encryption-end-to-end

##Multi-factor authentication

MFA is enabled by default for all Snowflake accounts and is available in all Snowflake editions. 

All Snowflake client tools, including the web interface, SnowSQL, and the various connectors and drivers, support MFA. 

Snowpipe is a Snowflake-managed serverless service. A Snowflake user cannot log into it; therefore, it doesn't require MFA. https://docs.snowflake.com/en/user-guide/security-mfa

Multi-factor authentication adds additional protection to the login process in Snowflake. Snowflake provides key pair authentication as a more secure alternative to the traditional username/password authentication approach. Additionally, Snowflake offers federated authentication, enabling users to access their accounts via a single sign-on (SSO). Users authenticate using SAML 2.0-compliant single sign-on (SSO) via an external identity provider (IdP).

Snowflake strongly recommends that all users with the ACCOUNTADMIN role be required to use MFA.

Okta and Microsoft ADFS provide native Snowflake support for federated authentication and SSO.

After a specified period of time (defined by the IdP), a user’s session in the IdP automatically times out, but this does not affect their Snowflake sessions. Any Snowflake sessions that are active at the time remain open and do not require re-authentication. However, to initiate any new Snowflake sessions, the user must log into the IdP again.

The Snowflake SQL API supports OAuth and key-pair authentication.

external Tokenization

Snowflake supports masking policies that may be applied to columns and enforced at the column level to provide column-level security. Column-level security is achieved through Dynamic Data Masking or External Tokenization.

https://docs.snowflake.com/en/user-guide/security-column

Database failover and failback

Database failover and failback between Snowflake accounts are provided first in the Business Critical edition and are also available in the Virtual Private Snowflake (VPS) edition.

https://docs.snowflake.com/en/user-guide/intro-editions.html

#IP Addresses 

If you provide both Allowed IP Addresses and Blocked IP Addresses, Snowflake applies the blocked list first; be careful not to include your own IP address in it, as this would block your own access. Additionally, to block all IP addresses except a select list, you only need to add IP addresses to ALLOWED_IP_LIST; Snowflake automatically blocks all IP addresses not included in the allowed list.
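For example, allowing only a corporate range (the addresses and policy name are made up) needs nothing in the blocked list:

# `session` is an existing snowflake.snowpark.Session; requires SECURITYADMIN or higher.
# Everything outside ALLOWED_IP_LIST is blocked automatically.
session.sql("""
    CREATE NETWORK POLICY corp_only
        ALLOWED_IP_LIST = ('192.168.1.0/24')
""").collect()
session.sql("ALTER ACCOUNT SET NETWORK_POLICY = corp_only").collect()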
