In Halodoc, we extract and transform millions of data points on our platform every day. Data transformation is a critical step in the data management process: it converts raw data into a format that is ready for analysis or further processing. We use Apache Airflow to run scheduled transformation SQL (ETL) and Amazon Redshift as our data warehouse.

As the data grows, scan time increases when the ETL performs a query, and the queries consume more CPU and disk space, pushing the instance to its limits.

When we execute a query in Redshift, the query planner determines how to retrieve the data the query needs. If the query requires data from one or more tables, Redshift has to scan those tables to locate the required data. In summary, a data scan in Redshift refers to the process of reading and processing data from one or more tables in the cluster to satisfy a query.

In our pipeline, we execute the SQL file and read data from the Redshift dimensional/fact schema before ingesting the results into the presentation layer. A single SQL file can contain multiple join conditions for transforming the results.

Here is a sample from part of our ETL query before optimization (the query was shown as a screenshot in the original post). In that query, we fetch the consultation ID and the prescription time. This information is useful for the analytics team to analyze how many consultations end with a prescription. We use a LEFT JOIN clause to combine the two pieces of data, since not all consultations end with a doctor's prescription. Before we optimized it, this simple query took 4 seconds of runtime and scanned 149.44 MB of data in total.

The AWS documentation suggests several ways to optimize a query, such as:

- **Implementing a distribution key.** The distribution key, also known as the distribution column, is a column or set of columns that determines how data is distributed across the compute nodes in a Redshift cluster.
- **Implementing a sort key.** The sort key is a column or a combination of columns that determines the order in which data is stored within each block on disk.

We have implemented distribution keys and sort keys in our data warehouse. The distribution key should help us reduce the data scanned when joining two tables, provided the join is applied on the distribution key of both tables, while the sort key should help us when we are filtering the data. Observing our current data warehouse tables, we always include the updated_at column of the main driving table.
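To make the discussion concrete, here is a minimal sketch of the kind of pre-optimization query described above. The table and column names (`fact_consultation`, `fact_prescription`, `consultation_id`, `prescription_time`) are illustrative assumptions, not the actual Halodoc schema:

```sql
-- Hypothetical sketch: fetch each consultation and, if one exists,
-- its prescription time. A LEFT JOIN keeps consultations that ended
-- without a doctor's prescription (prescription_time will be NULL).
SELECT
    fc.consultation_id,
    fp.prescription_time
FROM fact_consultation AS fc
LEFT JOIN fact_prescription AS fp
    ON fc.consultation_id = fp.consultation_id;
```

Counting the NULL and non-NULL `prescription_time` values in this result is what lets the analytics team measure how many consultations have a prescription.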
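The distribution-key and sort-key guidance above can be sketched as Redshift DDL. This is an illustrative example under the same assumed schema, not the actual warehouse definition: both tables are distributed on the join column so matching rows are co-located on the same compute node, and sorted on `updated_at` since that is the column the ETL filters on:

```sql
-- Illustrative: distribute on the join key, sort on the filter column.
CREATE TABLE fact_consultation (
    consultation_id BIGINT,
    doctor_id       BIGINT,
    updated_at      TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (consultation_id)
SORTKEY (updated_at);

CREATE TABLE fact_prescription (
    prescription_id   BIGINT,
    consultation_id   BIGINT,
    prescription_time TIMESTAMP,
    updated_at        TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (consultation_id)
SORTKEY (updated_at);
```

Because both tables share `consultation_id` as the distribution key, the join can be resolved node-locally without redistributing data, and a predicate such as `WHERE updated_at >= '...'` lets Redshift skip disk blocks outside the sort-key range, reducing the data scan.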