Big Data ETL - Massively Scalable Data Pipelines

Ethan
CEO, Portable

Data volumes are exploding, and to process massive amounts of data, you need pipelines that can stand up to the challenge.

What is ETL in the Context of Big Data?

Definition: Extract, Transform, Load (ETL) is the process of pulling datasets from one system, packaging the data for delivery, and loading the data into a data warehouse or data lake.

Use Cases: Data teams use ETL or ELT connectors to extract data from source systems into a data warehouse for data analytics, automation, or product development.

Big Data ETL Tools: There are many cloud data integration tools that you can use for data pipelines. Portable includes 300+ no-code connectors out-of-the-box. Fivetran is commonly used for data flows that extract data from databases. The largest enterprises may use a legacy, drag-and-drop solution like Talend or Informatica.

Steps In The ETL Process: There are three steps in the traditional ETL process: extract, transform, and load. We'll outline each step in more detail below.

Extract: How Do You Extract Information from Data Sources?

To pull data from your source system, you need to understand how the data can be extracted using an ETL pipeline.

It's most common for systems to expose data for ETL in one of 4 formats:

  1. A public API

  2. A database interface

  3. Webhooks

  4. Flat file exports

Let's walk through the big data extraction process in all four scenarios.

How do you extract data from an API?

To extract large amounts of data from an API, you need to leverage pagination, respect API rate limits, and be able to checkpoint the progress of data extraction in case your processing job fails midway through.

At Portable, we have built over 300 no-code ETL connectors that extract data from SaaS applications, so we understand the nuances of extracting large amounts of data.

When data volumes become very large, scalability becomes the bottleneck. In these scenarios, it is common for APIs to expose query capabilities and metadata that make it possible to ask for only a subset of the data. These could be pagination parameters, time filters, or capabilities that look more like a full-blown query language.
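As an illustration, here is a minimal sketch of that extraction pattern in Python - the endpoint, parameter names, and checkpoint file are hypothetical, and real APIs differ in how they expose pagination, rate limits, and time filters:

```python
import json
import time
import requests

API_URL = "https://api.example.com/v1/records"  # hypothetical endpoint
CHECKPOINT_FILE = "checkpoint.json"             # stores the last cursor processed


def load_checkpoint():
    """Resume from the last saved cursor if a previous run failed midway."""
    try:
        with open(CHECKPOINT_FILE) as f:
            return json.load(f).get("cursor")
    except FileNotFoundError:
        return None


def save_checkpoint(cursor):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"cursor": cursor}, f)


def extract(api_key):
    cursor = load_checkpoint()
    while True:
        params = {"limit": 1000}
        if cursor:
            params["cursor"] = cursor  # pagination parameter (name varies by API)

        resp = requests.get(
            API_URL,
            params=params,
            headers={"Authorization": f"Bearer {api_key}"},
        )

        if resp.status_code == 429:  # respect the API's rate limit
            time.sleep(int(resp.headers.get("Retry-After", 30)))
            continue
        resp.raise_for_status()

        payload = resp.json()
        yield from payload["records"]  # hand each record to the next stage

        cursor = payload.get("next_cursor")
        save_checkpoint(cursor)        # checkpoint progress after every page
        if cursor is None:
            break                      # no more pages to fetch
```

Because the cursor is checkpointed after every page, a failed job can resume where it left off instead of re-extracting the full history.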

How do you extract data from a database?

To extract massive data sets from databases, it's common to use change data capture as the technical approach to syncing data.

Change data capture (CDC) involves reading the write-ahead logs of the database to identify changes that have been made and then syncing those changes in near real-time to your target database or data warehouse to apply the updates.

For the largest enterprises, extracting data from databases like Oracle, SAP, or IBM can introduce even more complexity. For instance, if your database is located on-premises, you will need a way for the data to be exposed to your cloud ETL solution, or you will need to deploy your ETL pipeline within your on-premises environment.
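For a concrete flavor of log-based CDC, the sketch below uses psycopg2's logical replication support against a PostgreSQL source with the wal2json output plugin; the connection string, slot name, and downstream handling are assumptions, and production CDC tooling layers on batching, schema handling, and failure recovery:

```python
import json
import psycopg2
import psycopg2.extras

# Assumes a PostgreSQL source with logical replication enabled, the wal2json
# output plugin installed, and a replication slot named 'etl_slot' already
# created. The DSN below is a placeholder.
conn = psycopg2.connect(
    "dbname=appdb user=replicator host=db.internal",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="etl_slot", decode=True)


def handle_change(msg):
    """Apply each change from the write-ahead log to the destination."""
    change = json.loads(msg.payload)  # wal2json emits JSON change records
    print(change)                     # replace with a write to your warehouse
    # Acknowledge the LSN so the database can recycle WAL segments.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)


# Blocks and streams inserts, updates, and deletes in near real time.
cur.consume_stream(handle_change)
```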

Can you use webhooks for ETL?

Webhooks are not typically used for ETL pipelines, especially when data volumes become large.

It is most common to use webhooks as a way of passing notifications between systems, not as a way of moving data sets at scale.

That being said, many upstream systems do expose APIs that allow users to create export jobs that deliver data at scale (as JSON, XML, CSV, etc.).

While webhooks aren't used to deliver the data in these scenarios, they are a great way to notify your ETL pipeline when a job is ready for processing.
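As a sketch of that notification pattern, the Flask handler below listens for a hypothetical "export finished" webhook and hands the exported file off for processing; the route, payload fields, and queueing logic are assumptions about what the upstream system provides:

```python
from flask import Flask, request

app = Flask(__name__)


@app.route("/webhooks/export-complete", methods=["POST"])
def export_complete():
    """The upstream system notifies us that an export job has finished."""
    event = request.get_json()
    export_url = event["file_url"]  # hypothetical payload field
    # The webhook only carries the notification; the bulk data itself
    # is fetched separately from the export URL.
    enqueue_download(export_url)
    return "", 204


def enqueue_download(url):
    # Placeholder: in practice this would push a job onto a queue or
    # trigger an orchestrator task that downloads and loads the file.
    print(f"queueing download of {url}")


if __name__ == "__main__":
    app.run(port=8080)
```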

How do you read files into an ETL pipeline?

To extract files for an ETL pipeline, it's most common to pull the data from either 1) an SFTP server, 2) a cloud data storage bucket like Amazon S3, Google Cloud Storage, or Azure Blob Storage, or 3) a spreadsheet like Excel or Google Sheets.

Files can be one of the most effective means of transferring large volumes of data in a timely and cost-effective manner.

Depending on the latency requirements (i.e. real-time vs. batch), files can be used to communicate data as streams, they can be processed incrementally, or they can be processed on a cadence.
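For example, a batch pipeline that reads newly landed CSV exports from a cloud storage bucket might look like the boto3 sketch below; the bucket name, key prefix, and file format are illustrative assumptions:

```python
import csv
import io
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket
PREFIX = "exports/2024/"      # hypothetical key prefix for this batch


def read_csv_exports():
    """Iterate over every CSV object under the prefix and yield its rows."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            reader = csv.DictReader(io.StringIO(body.decode("utf-8")))
            for row in reader:
                yield row  # hand each record to the transform step


for record in read_csv_exports():
    print(record)
```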

Transform: How is data transformed in an ETL pipeline?

After extracting raw data from your source systems, the next step in the ETL process is data transformation.

At a small scale, data transformation could be done manually, or even in a spreadsheet like Microsoft Excel or Google Sheets.

As the volume of data increases and you need to handle more complex data sets, you will need a more scalable approach to data processing. Sometimes it can even make sense to engage a data integration consultant to help.

Let's walk through the types of data transformation, and the specifics of using SQL and Python for data transformation.

What types of transformation take place in the ETL process?

At the highest level, there are two points in the ETL process where data is transformed:

  1. While information is in motion from the source to the destination

  2. After the data has landed at the destination

While data is in motion, it's common to filter rows or columns out of the data set. It's also common to apply security and privacy techniques to tokenize, mask, or hash data on its way to the destination. A third common in-flight transformation is aggregation - i.e. converting a large volume of data into a more manageable amount before loading it into the destination.
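A minimal Python sketch of those three in-flight transformations - filtering, hashing, and aggregation - is shown below; the field names and the choice of SHA-256 hashing are illustrative assumptions:

```python
import hashlib
from collections import defaultdict


def mask_email(record):
    """Hash a sensitive column before it ever reaches the destination."""
    record["email"] = hashlib.sha256(record["email"].encode()).hexdigest()
    return record


def in_flight_transform(records):
    # 1) Filter out rows and columns we don't need downstream.
    kept = (
        {"email": r["email"], "country": r["country"], "amount": r["amount"]}
        for r in records
        if r.get("status") == "active"
    )
    # 2) Tokenize / hash sensitive values on the way to the destination.
    masked = (mask_email(r) for r in kept)

    # 3) Aggregate: collapse many events into per-country totals before loading.
    totals = defaultdict(float)
    for r in masked:
        totals[r["country"]] += r["amount"]
    return dict(totals)
```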

Most transformation takes place once data has been delivered to your data warehouse, data lake, or database. This is where SQL is used to process the data into useful insights and models. Teams can then use visualization tools and reporting interfaces to build insights on top of the transformed data.

Let's dive into the two most common programming languages involved in data transformation.

How is SQL used in data transformation?

Structured Query Language (SQL) is the lingua franca of the transformation world. It allows data analysts and data engineers to convert raw data into modeled data in a simple, scalable, and automated manner.

For analytics purposes, you can think of SQL as a simple means of selecting:

  1. The data sources you care about

  2. How those data sources relate to one another

  3. The rows to include or exclude

  4. The columns you want to include or exclude, and

  5. The calculated fields you want to create

By writing SQL queries that convert raw data into packaged data sets, you can create entire pipelines that automatically convert complex data sets into bundled insights efficiently.

SQL can also be used in the extract and load phases of an ETL pipeline - specifically when working with relational databases - but in most scenarios, SQL does the heaviest lifting during the transformation phase.
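To keep the examples in a single language, the sketch below runs a small SQL transformation through Python's built-in sqlite3 module; in practice the same SELECT / JOIN / WHERE / calculated-field pattern would run directly inside your warehouse, and the table and column names here are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection

# Raw tables as they might land from the extract/load steps (illustrative schema).
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL, status TEXT);
    CREATE TABLE customers (id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 120.0, 'complete'), (2, 11, 75.0, 'refunded');
    INSERT INTO customers VALUES (10, 'EMEA'), (11, 'AMER');
""")

# The transformation: choose sources, relate them, filter rows, pick columns,
# and add calculated fields -- the five steps listed above.
transform_sql = """
    SELECT
        c.region,                       -- column we want to keep
        COUNT(*)      AS order_count,   -- calculated field
        SUM(o.amount) AS revenue        -- calculated field
    FROM orders o
    JOIN customers c ON c.id = o.customer_id  -- how the sources relate
    WHERE o.status = 'complete'               -- rows to include
    GROUP BY c.region
"""

for row in conn.execute(transform_sql):
    print(row)  # modeled data, ready for dashboards or reporting tools
```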

How is Python used in the ETL process?

Python is used by data engineers to write custom ETL logic as code. While data transformation can be done with Python, the language is more commonly used for the extraction and loading of data, especially when working with complex interfaces.

For companies that need the ETL process deployed on-premises or within their cloud environment, there are open-source frameworks like Meltano and Airbyte that can help organize your Python or Java code to accelerate development.

That being said, nowadays there are companies like Portable that are entirely focused on building custom API connectors. As a result, it's less common for data teams to manage their ETL jobs using Python, and the language is increasingly reserved for truly custom requirements.

Load: How do you prepare data for visualization, automation, or product development?

By loading data into your database, data warehouse, or data lake, it can be processed, transformed, and curated into dashboards, process automation, or data products for consumption.

When you load data - especially at a massive scale - it is important to ensure schemas are created, data quality is validated, and information is ready to query.
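As one illustration, the sketch below loads a small batch into a PostgreSQL-compatible destination with pandas and SQLAlchemy, creating the table if it doesn't exist and running a basic row-count check before the data is queried; the connection string, table name, and columns are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string; swap in your warehouse or database URL.
engine = create_engine("postgresql://etl_user:secret@warehouse.internal/analytics")

records = [
    {"order_id": 1, "region": "EMEA", "revenue": 120.0},
    {"order_id": 2, "region": "AMER", "revenue": 75.0},
]
df = pd.DataFrame(records)

# Load: to_sql creates the table (and its schema) if it doesn't exist,
# then appends the new rows.
df.to_sql("orders_staging", engine, if_exists="append", index=False)

# Validate: a basic data quality check before the data is queried downstream.
with engine.connect() as conn:
    loaded = conn.execute(text("SELECT COUNT(*) FROM orders_staging")).scalar()
assert loaded >= len(df), "fewer rows landed than were extracted"
```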

Let's quickly walk through the most common data warehouses, data lakes, and databases used for analytics.

What are the most common destinations for an ETL pipeline?

The most common destinations for an ETL pipeline are:

  • Data warehouses

  • Data lakes

  • Relational databases

Portable already supports Snowflake, Google BigQuery, Amazon Redshift, PostgreSQL, and MySQL as destinations.

Other common analytics environments on our roadmap include Microsoft Azure Synapse, SQL Server, Databricks, NoSQL environments, and cloud storage environments like Amazon S3, Google Cloud Storage, and Azure Blob Storage.

Next Steps

Whether you're a business intelligence analyst building your first dashboard, a data engineer conducting real-time data analysis, or a data scientist implementing a machine learning model, ETL is critical to any data management strategy.

Ready to get started? Try Portable today!