SnowPro Core Certification: Data Loading and Unloading (4/8)
✅ Warehouse and parallel loading
✅ Clone and load
✅ Load
👉File size
👉Pattern matching
👉Loading time limit
👉Transformation in Loading
Table stages do not support basic transformations during the COPY process; basic transformations can only be performed when loading data from external stages, named internal stages, or user stages, as in the sketch below.
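A minimal sketch of a transformation during a load from a named internal stage (the table, stage, and column positions are illustrative):
-- Illustrative names: reorder columns and cast fields while loading
COPY INTO home_sales (city, zip, sale_date, price)
FROM (SELECT t.$1, t.$2, TO_DATE(t.$3), t.$4::NUMBER
      FROM @my_named_stage t)
FILE_FORMAT = (TYPE = 'CSV');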
👉File Format
👉Parameters in COPY INTO
ENFORCE_LENGTH:
If TRUE, the COPY statement produces an error if a loaded string exceeds the target column length.
If FALSE, strings are automatically truncated to the target column length.
TRUNCATECOLUMNS:
If TRUE, strings are automatically truncated to the target column length.
If FALSE, the COPY statement produces an error if a loaded string exceeds the target column length.
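As a sketch (illustrative table and stage names), truncating oversized strings instead of failing the load:
-- Strings longer than the target column are silently truncated
-- (TRUNCATECOLUMNS = TRUE behaves like ENFORCE_LENGTH = FALSE)
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = 'CSV')
TRUNCATECOLUMNS = TRUE;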
👉Load metadata
The load metadata stores a variety of information, such as the name of every file loaded into the table and the timestamp of when each file was loaded. Using this metadata, Snowflake ensures that it does not reprocess a previously loaded file. The load metadata expires after 64 days.
Once the metadata has expired, the load status of older files becomes undetermined, and Snowflake skips over such files by default.
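Files whose load status cannot be determined because the 64-day metadata has expired can still be picked up with the LOAD_UNCERTAIN_FILES copy option; a sketch with illustrative names:
-- Load files even when the 64-day load metadata has expired
-- and their load status is therefore unknown
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = 'CSV')
LOAD_UNCERTAIN_FILES = TRUE;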
✅ Unload
The OBJECT_CONSTRUCT function can be used in combination with the COPY command to convert the rows in a relational table to a single VARIANT column and unload the rows into a file. The unloading process automatically exports to multiple files to take advantage of the parallelism offered by Snowflake; if needed, you can set the SINGLE parameter to TRUE to ensure the export goes to a single file. The default size of each output file is 16 MB, which can be changed using the MAX_FILE_SIZE parameter; the maximum allowed size per file is 5 GB when exporting to cloud storage.
When data is unloaded from Snowflake, it is automatically compressed using gzip compression. This is the default behavior; however, you can specify alternate compression methods or turn off compression entirely.
Among the semi-structured formats, only JSON and Parquet are supported for data unloading; Avro, ORC, and XML are not. All of these file formats are, however, supported for data loading.
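A sketch that combines these unloading options (table, stage, and path are illustrative):
-- Unload each row as one JSON document into a single file of up to ~50 MB
COPY INTO @my_stage/unload/out.json
FROM (SELECT OBJECT_CONSTRUCT(*) FROM my_table)
FILE_FORMAT = (TYPE = 'JSON' COMPRESSION = NONE)  -- override the default gzip
SINGLE = TRUE
MAX_FILE_SIZE = 52428800;  -- in bytes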
👉Stage
Similar to how traditional data warehouses use staging areas, Snowflake uses a stage object to aid in loading and unloading data. Data must first be available in a Snowflake stage before it can be loaded into a Snowflake table; the COPY command then loads it from the stage into the table. Unloading (exporting) works the same way: data can only be extracted to a stage, internal or external.
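The three internal stage types are addressed with different prefixes; a sketch with illustrative names:
CREATE STAGE my_named_stage;               -- named internal stage
COPY INTO my_table FROM @my_named_stage;   -- load from the named stage
COPY INTO my_table FROM @%my_table;        -- table stage
COPY INTO my_table FROM @~/staged_files/;  -- user stage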
👉Validate
The validation returns no results for COPY statements that specify ON_ERROR = ABORT_STATEMENT (the default value). Validation fails if:
SELECT statements were used to transform data during the COPY INTO <table> operation.
The current user does not have access to table_name.
The current user is not the user who executed query_id and does not have access control privileges on that user.
The copy history metadata has expired (see Load Metadata above).
If new files have been added to the stage used by query_id since the load was executed, the added files are ignored during the validation. If files have been removed from that stage since the load was executed, the removed files are reported as missing.
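Validation after the fact is done with the VALIDATE table function; a sketch (the table name is illustrative; '_last' refers to the most recent COPY executed by the current user):
-- Return the rows that were rejected during the most recent load into my_table
SELECT * FROM TABLE(VALIDATE(my_table, JOB_ID => '_last'));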
👉VALIDATION_MODE
VALIDATION_MODE instructs the COPY command to validate the data files instead of loading them into the specified table; i.e., the COPY command tests the files for errors but does not load them.
The command validates the data to be loaded and returns results based on the validation option specified:
Syntax: VALIDATION_MODE = RETURN_n_ROWS | RETURN_ERRORS | RETURN_ALL_ERRORS
RETURN_n_ROWS (e.g. RETURN_10_ROWS) - Validates the specified number of rows if no errors are encountered; otherwise, fails at the first error encountered in those rows.
RETURN_ERRORS - Returns all errors (parsing, conversion, etc.) across all files specified in the COPY statement.
RETURN_ALL_ERRORS - Returns all errors across all files specified in the COPY statement, including files with errors that were partially loaded during an earlier load because the ON_ERROR copy option was set to CONTINUE during that load.
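A dry-run sketch (illustrative names) that checks the staged files without loading any rows:
-- Report all parsing/conversion errors across the staged files; nothing is loaded
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = 'CSV')
VALIDATION_MODE = RETURN_ERRORS;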
✅ Put/Get
The PUT command uploads files from the local machine to an internal stage (named internal stages, table stages, and user stages). The GET command downloads files from an internal stage to the local machine. PUT and GET do not work with external stages; cloud provider utilities or other tools are used to interact with the data in the cloud storage location that an external stage points to.
File Staging Commands – PUT (to a stage), GET (from a stage), LIST, and REMOVE. These commands are specific to working with stages.
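Run from a client such as SnowSQL, the staging commands look roughly like this (paths and stage names are illustrative):
PUT file:///tmp/sales.csv @my_internal_stage;                -- upload; gzip-compressed by default
LIST @my_internal_stage;                                     -- list the staged files
GET @my_internal_stage/sales.csv.gz file:///tmp/downloads/;  -- download to the local machine
REMOVE @my_internal_stage/sales.csv.gz;                      -- delete the staged file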
✅ Snowpipe
Snowflake supports continuous data loading through Snowpipe, a serverless service. Snowpipe loads data in micro-batches, ingesting small volumes of data on each execution; this suits a continuous stream of data, such as transactions or events, that must be loaded and made available to the business quickly. Snowpipe can typically load data within a few minutes of its arrival in a stage. Because Snowpipe is serverless, it does not rely on virtual warehouses for processing: Snowflake automatically manages the compute a pipe requires and scales it up and down with the data load, and Snowpipe costs are billed separately from virtual warehouse charges.
AUTO_INGEST = TRUE enables automatic data loading. Snowpipe supports loading from external stages (Amazon S3, Google Cloud Storage, or Microsoft Azure). AUTO_INGEST = FALSE disables automatic data loading. You must make calls to the Snowpipe REST API endpoints to load data files.
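A sketch of an auto-ingesting pipe over an external stage (all names are illustrative):
-- Cloud event notifications trigger the COPY automatically as files arrive
CREATE PIPE my_pipe AUTO_INGEST = TRUE AS
COPY INTO my_table
FROM @my_external_stage
FILE_FORMAT = (TYPE = 'JSON');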
👉List the pipes for which you have access privileges
SHOW PIPES;
Snowflake uses file loading metadata to prevent reloading the same files (and duplicating data) in a table. Snowpipe prevents loading files with the same name even if they were later modified (i.e. have a different eTag).
The file loading metadata is associated with the pipe object rather than the table. As a result:
Staged files with the same name as files that were already loaded are ignored, even if they have been modified, e.g. if new rows were added or errors in the file were corrected.
Truncating the table using the TRUNCATE TABLE command does not delete the Snowpipe file loading metadata.
However, note that pipes only maintain the load history metadata for 14 days.
In addition, duplicate records can be loaded into the target table if COPY INTO <table> statements are executed that reference the same bucket/container, path, and target table as in your active Snowpipe loads. The load histories for the COPY command and Snowpipe are stored separately in Snowflake.
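Both histories can be inspected with the COPY_HISTORY Information Schema table function, which covers bulk COPY and Snowpipe loads (arguments are illustrative):
-- Load activity for a table over the last 24 hours
SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'my_table',
  START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())));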