SnowPro Core Certification: Data Loading and Unloading (4/8)




✅ Warehouse and parallel loading

An XS-sized warehouse can load eight files in parallel.

An S-sized warehouse can load sixteen files in parallel.

✅ Clone and load

Cloning does not copy the load metadata of the source table; a cloned table does not include the source table's load history. The load metadata for a cloned table is therefore empty, so files already loaded into the source table can be loaded again into the clone.
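A minimal sketch (table and stage names are hypothetical):

  -- The clone starts with empty load metadata, so a file already loaded
  -- into the source table is not skipped when loaded into the clone.
  CREATE TABLE orders_clone CLONE orders;
  COPY INTO orders_clone FROM @orders_stage/orders_2023_01.csv.gz;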

✅ Load

👉File size

To optimize the number of parallel operations for a load, Snowflake recommends aiming to produce data files roughly 100-250 MB (or larger) in size when compressed.

👉Pattern matching

Pattern matching using a regular expression is generally the slowest of the three options for identifying/specifying data files to load from a stage (the other two being a path prefix and an explicit file list); however, this option works well if you exported your files in named order from your external application and want to batch load the files in the same order.
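A quick sketch of the three options, with hypothetical names; pattern matching is the last and slowest:

  COPY INTO sales FROM @sales_stage/2023/;                              -- by path prefix
  COPY INTO sales FROM @sales_stage FILES = ('jan.csv', 'feb.csv');     -- by explicit file list
  COPY INTO sales FROM @sales_stage PATTERN = '.*sales_[0-9]+[.]csv';   -- by regular expression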

👉Loading time limit

If a data loading operation continues beyond the maximum allowed duration of 24 hours, it could be aborted without any portion of the file being committed.

👉Transformation in Loading


When loading data into a table using the COPY command, Snowflake allows you to do simple transformations on the data as it is being loaded. During the load process, the COPY command allows for modifying the order of columns, omitting one or more columns, casting data into specified data types, and truncating values. 

Complex transformations such as joins, filters, aggregations, and the use of FLATTEN are not supported while loading. Joining, filtering, and aggregating the data are supported ONLY after the data has been loaded into a table.

Table stages do not support transformations during the COPY process; basic transformations can only be performed when loading data from external stages, named internal stages, or user stages.
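A minimal sketch of a COPY transformation, with hypothetical names: columns are reordered and cast while the file is loaded from a named stage.

  COPY INTO employees (emp_id, hire_date)
    FROM (
      SELECT $2::NUMBER,   -- cast the second file column
             $4::DATE      -- pick the fourth file column; remaining columns are omitted
      FROM @hr_stage
    )
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);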

👉File Format

If no file format object or options are provided to either the stage or the COPY statement, the default behaviour is to interpret the contents of the stage as CSV with UTF-8 encoding.
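A minimal sketch with hypothetical names, showing both places a file format can be supplied:

  CREATE FILE FORMAT my_csv_format TYPE = 'CSV' FIELD_DELIMITER = '|' SKIP_HEADER = 1;

  -- attach the file format to the stage...
  CREATE STAGE my_stage FILE_FORMAT = my_csv_format;

  -- ...or pass it on the COPY statement itself
  COPY INTO customers FROM @my_stage FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');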

👉Parameters in copy into

ENFORCE_LENGTH:   
  • If TRUE, the COPY statement produces an error if a loaded string exceeds the target column length.

  • If FALSE, strings are automatically truncated to the target column length. 

TRUNCATECOLUMNS:

  • If TRUE, strings are automatically truncated to the target column length.

  • If FALSE, the COPY statement produces an error if a loaded string exceeds the target column length.
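The two options are mirror images of one another. A minimal sketch with hypothetical names; both statements truncate over-length strings instead of raising an error:

  COPY INTO products FROM @product_stage TRUNCATECOLUMNS = TRUE;
  COPY INTO products FROM @product_stage ENFORCE_LENGTH = FALSE;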

👉Load metadata

The load metadata stores a variety of information, such as the name of every file that was loaded into that table and the timestamp of when each file was loaded. Using this load metadata, Snowflake ensures that it will not reprocess a previously loaded file. The load metadata expires after 64 days.

When the load metadata has expired, Snowflake skips over any older files for which the load status is undetermined. Options for reloading such files are shown below.
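A minimal sketch with hypothetical names, for cases where you deliberately need to load files despite the load metadata:

  -- FORCE = TRUE reloads all files regardless of load metadata (may create duplicate rows).
  COPY INTO web_events FROM @events_stage FORCE = TRUE;

  -- LOAD_UNCERTAIN_FILES = TRUE loads older files whose load status is unknown
  -- (e.g. staged more than 64 days ago).
  COPY INTO web_events FROM @events_stage LOAD_UNCERTAIN_FILES = TRUE;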


✅ Unload

The OBJECT_CONSTRUCT function can be used in combination with the COPY command to convert the rows in a relational table to a single VARIANT column and unload the rows into a file.
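A minimal sketch with hypothetical names; each row is converted to a JSON object and unloaded:

  COPY INTO @json_stage/customers/
    FROM (SELECT OBJECT_CONSTRUCT(*) FROM customers)
    FILE_FORMAT = (TYPE = 'JSON');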

The unloading process automatically exports to multiple files so that it can take advantage of the parallelism offered by Snowflake. However, if needed, you can set the SINGLE parameter to true to ensure the export goes to a single file. The default size of each output file is 16 MB but can be changed using the MAX_FILE_SIZE parameter. The maximum allowed size per file is 5 GB if you export data to cloud storage.
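A minimal sketch with hypothetical names:

  -- raise the per-file size from the 16 MB default (value is in bytes)
  COPY INTO @export_stage/orders_ FROM orders MAX_FILE_SIZE = 104857600;

  -- force the entire export into a single file
  COPY INTO @export_stage/orders.csv.gz FROM orders SINGLE = TRUE;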

When data is unloaded from Snowflake, it is automatically compressed using gzip compression. This is the default behavior; however, you can specify alternate compression methods or turn off compression entirely.

For semi-structured data, only the JSON and Parquet file formats are supported for unloading; Avro, ORC, and XML are not supported for unloading at this time, although all of these formats are supported for data loading. (Delimited formats such as CSV are also supported for unloading.)

👉Stage

Similar to how data warehouses use staging, Snowflake uses a Stage object. Snowflake uses stages to aid in the loading and unloading of data. To load data into a Snowflake table, the data must first be available in a stage; the COPY command can then be used to load it into the table. Data unloading or exporting is also performed via a Stage object; the data can only be extracted to a stage, internal or external.
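A minimal sketch with hypothetical names, showing data moving through a stage in both directions:

  CREATE STAGE my_stage;                    -- named internal stage
  COPY INTO my_table FROM @my_stage/in/;    -- load: stage -> table
  COPY INTO @my_stage/out/ FROM my_table;   -- unload: table -> stage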

👉Validate

The VALIDATE table function validates the files loaded in a past execution of the COPY INTO <table> command and returns all the errors encountered during the load, rather than just the first error (see the example after the usage notes).

Usage Notes
  • The validation returns no results for COPY statements that specify ON_ERROR = ABORT_STATEMENT (default value).

  • Validation fails if:

    • SELECT statements are used to transform data during a COPY INTO <table> operation.

    • The current user does not have access to table_name.

    • The current user is not the user who executed query_id and does not have access control privileges on this user.

    • The copy history metadata has expired. For more information, refer to Load Metadata.

  • If new files have been added to the stage used by query_id since the load was executed, the new files added are ignored during the validation.

  • If files have been removed from the stage used by query_id since the load was executed, the files removed are reported as missing.
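A minimal sketch with a hypothetical table name and query ID:

  SELECT * FROM TABLE(VALIDATE(my_table, JOB_ID => '01a2b3c4-0000-1111-0002-00000000f00d'));

Passing JOB_ID => '_last' validates the most recent COPY INTO execution in the current session.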

👉VALIDATION_MODE

No data is loaded when this copy option is specified.

VALIDATION_MODE instructs the COPY command to validate the data files instead of loading them into the specified table; i.e., the COPY command tests the files for errors but does not load them.

The command validates the data to be loaded and returns results based on the validation option specified: 

Syntax: VALIDATION_MODE = RETURN_n_ROWS | RETURN_ERRORS | RETURN_ALL_ERRORS

RETURN_n_ROWS (e.g. RETURN_10_ROWS) - Validates the specified number of rows, if no errors are encountered; otherwise, fails at the first error encountered in the rows. 

RETURN_ERRORS - Returns all errors (parsing, conversion, etc.) across all files specified in the COPY statement. 

RETURN_ALL_ERRORS - Returns all errors across all files specified in the COPY statement, including files with errors that were partially loaded during an earlier load because the ON_ERROR copy option was set to CONTINUE during the load.
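A minimal sketch with hypothetical names; no rows are loaded by either statement:

  COPY INTO customers FROM @customer_stage VALIDATION_MODE = RETURN_10_ROWS;
  COPY INTO customers FROM @customer_stage VALIDATION_MODE = RETURN_ERRORS;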


✅ Put/Get


The PUT command uploads data from an on-premises system to an internal stage (including named internal stages, table stages, and user stages).

The GET command is used to download data from an internal stage to an on-premises system.

To download or upload data to an external stage, cloud provider utilities or other tools are used to interact with the data in the cloud storage pointed to by the external stage.

File Staging Commands – PUT (to a stage), GET (from a stage), LIST, and REMOVE. These commands are specific for working with stages.
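A minimal sketch with hypothetical local paths, run from a client such as SnowSQL:

  PUT file:///tmp/data/orders.csv @%orders;   -- upload to the table stage of ORDERS
  LIST @%orders;                              -- list the staged files
  GET @%orders file:///tmp/downloads/;        -- download from the table stage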

✅ Snowpipe

Snowflake allows continuous data loading using Snowpipe, a serverless service. Snowpipe loads data in micro-batches, ingesting small volumes of data on each execution. Micro-batch loading is used when a continuous stream of data, such as transactions or events, must be loaded and made available to the business quickly; Snowpipe can load data within a few minutes after it arrives in a stage. Because Snowpipe is serverless and has its own compute capability, it does not rely on virtual warehouses for processing: Snowflake automatically manages the compute required by a pipe and scales it up and down as the data load requires. Snowpipe costs are therefore charged separately from virtual warehouse costs.

AUTO_INGEST = TRUE enables automatic data loading. Snowpipe supports loading from external stages (Amazon S3, Google Cloud Storage, or Microsoft Azure). AUTO_INGEST = FALSE disables automatic data loading. You must make calls to the Snowpipe REST API endpoints to load data files.
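A minimal sketch with hypothetical names; the pipe loads new files from an external stage as cloud event notifications arrive:

  CREATE PIPE events_pipe
    AUTO_INGEST = TRUE
    AS
    COPY INTO raw_events
      FROM @s3_events_stage
      FILE_FORMAT = (TYPE = 'JSON');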

👉List the pipes for which you have access privileges

show pipes;

Snowflake uses file loading metadata to prevent reloading the same files (and duplicating data) in a table. Snowpipe prevents loading files with the same name even if they were later modified (i.e. have a different eTag).

The file loading metadata is associated with the pipe object rather than the table. As a result:

  • Staged files with the same name as files that were already loaded are ignored, even if they have been modified, e.g. if new rows were added or errors in the file were corrected.

  • Truncating the table using the TRUNCATE TABLE command does not delete the Snowpipe file loading metadata.

However, note that pipes only maintain the load history metadata for 14 days.

In addition, duplicate records can be loaded into the target table if COPY INTO <table> statements are executed that reference the same bucket/container, path, and target table as in your active Snowpipe loads. The load histories for the COPY command and Snowpipe are stored separately in Snowflake.

👉Automating 

Automating Snowpipe using cloud messaging (event notifications) and calling the Snowpipe REST API endpoints are the two methods for triggering loads with Snowpipe.


