Data Engineer :: Certifications (Lifelong Learning)

01 AWS Certified Developer - Associate


02 Data Vault 2.0 Practitioner



03 SnowPro Core Certification



ReadingList :: Data Engineer (Lifelong Learning)

Reading books is like a martial artist training their inner strength: a good book can boost your skill enormously, as if you had apprenticed yourself to a grandmaster of the martial arts world;

and if you can learn face to face, that is even more powerful!

This is my book list and reading progress for training my Data Engineer skills. Let's go!

01 Building a Scalable Data Warehouse with Data Vault 2.0 [Paperback]


Reading Plan (to finish by 17/Dec)


Starting from 08/Oct/2023 ✅


15-Oct

  • c1 ✅  09/Oct
  • c2 ✅  10/Oct
  • c3 ✅  16/Oct

Notes: 

  1. Kimball: two-layer data warehouse
  2. Inmon: three-layer data warehouse model
  3. Data sources: ERP, CRM, Files etc
  4. TDQM (Total Data Quality Management), DWQ (Data Warehouse Quality)



💪💪😄😄

22-Oct

  • c4 ✅  19/Oct
  • c5 ✅  22/Oct

Entity definitions (Hub, Link, Satellite)

SAL (same-as link), various link and satellite variants
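
To make these notes concrete, here is a minimal DDL sketch of the three core entity types. This is my own illustration in Snowflake-flavoured SQL; the table and column names (HubCustomer, CustomerNumber, and so on) are hypothetical, not the book's exact examples.

-- Hub: the unique list of business keys
CREATE TABLE HubCustomer (
  CustomerHashKey BINARY(16)    NOT NULL PRIMARY KEY, -- hash of the business key
  CustomerNumber  VARCHAR(20)   NOT NULL,             -- the business key itself
  LoadDate        TIMESTAMP_NTZ NOT NULL,
  RecordSource    VARCHAR(50)   NOT NULL
);

-- Link: a relationship between hubs (here: a customer placed an order)
CREATE TABLE LinkCustomerOrder (
  CustomerOrderHashKey BINARY(16)    NOT NULL PRIMARY KEY,
  CustomerHashKey      BINARY(16)    NOT NULL,
  OrderHashKey         BINARY(16)    NOT NULL,
  LoadDate             TIMESTAMP_NTZ NOT NULL,
  RecordSource         VARCHAR(50)   NOT NULL
);

-- Satellite: descriptive attributes, historized by load date
CREATE TABLE SatCustomer (
  CustomerHashKey BINARY(16)    NOT NULL, -- parent hub key
  LoadDate        TIMESTAMP_NTZ NOT NULL,
  HashDiff        BINARY(16)    NOT NULL, -- change-detection hash of the payload
  CustomerName    VARCHAR(100),
  PRIMARY KEY (CustomerHashKey, LoadDate)
);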

29-Oct

  • c6 ✅  24/Oct
  • c7 ✅  26/Oct

Slowly Changing Dimensions (SCD)

Star schemas

Snowflake schema design (indirect dimension tables)
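
As a quick refresher on what SCD Type 2 means in practice, a minimal sketch (DimCustomer and its columns are hypothetical, not the book's example):

-- SCD Type 2: every attribute change inserts a new row; validity dates preserve history
CREATE TABLE DimCustomer (
  CustomerKey    INTEGER     NOT NULL PRIMARY KEY, -- surrogate key, one per version
  CustomerNumber VARCHAR(20) NOT NULL,             -- business key, repeats across versions
  CustomerName   VARCHAR(100),
  ValidFrom      DATE        NOT NULL,
  ValidTo        DATE,                             -- NULL marks the current version
  IsCurrent      BOOLEAN     NOT NULL
);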

5-Nov

  • c8 ✅  02/Nov
  • c9 ✅  05/Nov

MDM 

12-Nov

  • c10 ✅  27/Nov ✌

Metrics and Error Marts 

19-Nov

  • c11 ✅  12/Nov

Data Extraction (stage loading: historical/batch) 

26-Nov

  • c12  ✅  16/Nov

DV loading 

3-Dec

  • c13 ✅  28/Nov

Data Quality 

10-Dec

  • c14  ✅  01/Dec 😀😀😀

17-Dec

  • c15 ✅  17/Oct

Business users: accessing the Information Mart to build a multidimensional database

Merry Christmas!

Hmm, I should think about how to reward myself; I'm going to finish the task ahead of schedule! 😀

I'll read a few books just for fun over this holiday:
[The Summer Job]





SnowPro Core Certification: Data Transformations (5/8)



Please subscribe to the YouTube channel: Data Driven Wealth 数说财富 DDW - YouTube

✅Transforming Data During a Load

https://docs.snowflake.com/en/user-guide/data-load-transform

Snowflake supports transforming data while loading it into a table using the COPY INTO <table> command, dramatically simplifying your ETL pipeline for basic transformations. This lets you avoid temporary tables for storing pre-transformed data when reordering columns during a load, and it applies to both bulk loading and Snowpipe.

The COPY command supports:

  • Column reordering, column omission, and casts using a SELECT statement. There is no requirement for your data files to have the same number and ordering of columns as your target table.

  • The ENFORCE_LENGTH | TRUNCATECOLUMNS option, which can truncate text strings that exceed the target column length.

When loading data into a table using the COPY command, Snowflake allows you to do simple transformations on the data as it is being loaded. During the load process, the COPY command allows for modifying the order of columns, omitting one or more columns, casting data into specified data types, and truncating values. 
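
For example, a hypothetical load that reorders the staged columns, omits one, and casts a value (the stage, file, and table names here are made up for illustration):

-- Staged CSV columns: $1 = id, $2 = city, $3 = price
-- Load only city and price, reordered relative to the file, casting price to a number
COPY INTO home_sales (price, city)
  FROM (SELECT t.$3::NUMBER(10,2), t.$2 FROM @my_stage/sales.csv t)
  FILE_FORMAT = (TYPE = 'CSV');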

Complex transformations such as joins, filters, aggregations, and the use of FLATTEN are not supported while the data is being loaded; the COPY feature covers only simple, column-level transformations. Joining, filtering, and aggregating the data are therefore supported ONLY after the data has been loaded into a table.
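
So a join or aggregation runs as a separate step once the rows have landed, for example (table names hypothetical):

-- Complex transformations happen after the load completes
CREATE OR REPLACE TABLE sales_by_city AS
SELECT city, COUNT(*) AS num_sales, SUM(price) AS total_price
FROM home_sales
GROUP BY city;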

Table stages do not allow transformations during the COPY process; transformations may be performed only while loading data from external stages, named internal stages, or user stages.

👉Supported File Formats

The following file format types are supported for COPY transformations:

  • CSV

  • JSON

  • Avro

  • ORC

  • Parquet

  • XML

To parse a staged data file, it is necessary to describe its file format:

CSV

The default format is character-delimited UTF-8 text. The default field delimiter is a comma character (,). The default record delimiter is the new line character. If the source data is in another format, specify the file format type and options.

When querying staged data files, the ERROR_ON_COLUMN_COUNT_MISMATCH option is ignored. There is no requirement for your data files to have the same number and ordering of columns as your target table.

JSON

To transform JSON data during a load operation, you must structure the data files in NDJSON (“Newline delimited JSON”) standard format; otherwise, you might encounter the following error:

Error parsing JSON: more than one document in the input
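
A sketch of a conforming load, assuming a hypothetical staged NDJSON file; each line is one self-contained JSON document, addressed through the single VARIANT column $1:

-- events.ndjson (one JSON object per line):
--   {"id": 1, "name": "alice"}
--   {"id": 2, "name": "bob"}
COPY INTO users (id, name)
  FROM (SELECT $1:id::NUMBER, $1:name::VARCHAR FROM @json_stage/events.ndjson)
  FILE_FORMAT = (TYPE = 'JSON');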

All other file format types

Specify the format type and options that match your data files.

To explicitly specify file format options, set them in one of the following ways:

Querying staged data files using a SELECT statement:

  • As file format options specified for a named file format or stage object. The named file format/stage object can then be referenced in the SELECT statement.

Loading columns from staged data files using a COPY INTO <table> statement:

  • As file format options specified directly in the COPY INTO <table> statement.

  • As file format options specified for a named file format or stage object. The named file format/stage object can then be referenced in the COPY INTO <table> statement.
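
Putting those options together in one sketch (the format, stage, and table names are hypothetical):

-- A named file format, reusable by both queries and loads
CREATE FILE FORMAT my_csv_format TYPE = 'CSV' FIELD_DELIMITER = '|' SKIP_HEADER = 1;

-- Referenced when querying staged files with SELECT
SELECT t.$1, t.$2 FROM @my_stage (FILE_FORMAT => 'my_csv_format') t;

-- Referenced by name in COPY INTO <table> ...
COPY INTO target_table FROM @my_stage FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');

-- ... or with the options specified directly in the statement
COPY INTO target_table FROM @my_stage FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '|' SKIP_HEADER = 1);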

If no file format object or options are provided to either the stage or the COPY statement, the default behaviour is to interpret the contents of the stage as CSV with UTF-8 encoding.

👉Parameters in COPY INTO

ENFORCE_LENGTH:

  • If TRUE, the COPY statement produces an error if a loaded string exceeds the target column length.

  • If FALSE, strings are automatically truncated to the target column length. 

TRUNCATECOLUMNS:

  • If TRUE, strings are automatically truncated to the target column length.

  • If FALSE, the COPY statement produces an error if a loaded string exceeds the target column length.
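
The two options are mirror images of each other. For example (stage and table names hypothetical):

-- Truncate over-length strings instead of failing the load;
-- TRUNCATECOLUMNS = TRUE behaves like ENFORCE_LENGTH = FALSE
COPY INTO target_table
  FROM @my_stage/data.csv
  FILE_FORMAT = (TYPE = 'CSV')
  TRUNCATECOLUMNS = TRUE;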
