ETL Tools

ETL can be defined as the process of integrating data from different places into a single location, where it can be used by people in your company to draw insights.

What is ETL/ELT?

With a multitude of marketing applications, sales tools, and business intelligence software, we constantly run headfirst into the problem of moving data from many different places into one or more centralized locations. ETL, short for Extract, Transform, and Load, arose from the need to move data out of siloed locations and into databases or data warehouses where it could be leveraged.

Extract refers to acquiring data from an original source.

Transform refers to normalizing and/or sanitizing this data.

Load refers to moving the data into a destination where it will be leveraged.
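To make the three steps concrete, here is a minimal sketch in Python, not tied to any particular product: it extracts rows from a hypothetical CRM export, transforms them by dropping incomplete records and normalizing email addresses, and loads the result into a local SQLite table standing in for a warehouse. The file name and column names are assumptions made for illustration.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a CSV export of a source system."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: drop incomplete records and normalize email addresses."""
    cleaned = []
    for row in rows:
        if not row.get("email") or not row.get("signup_date"):
            continue  # sanitize: skip records missing required fields
        cleaned.append({
            "email": row["email"].strip().lower(),
            "signup_date": row["signup_date"],
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: write the transformed rows into a destination table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS users (email TEXT, signup_date TEXT)")
    con.executemany(
        "INSERT INTO users (email, signup_date) VALUES (:email, :signup_date)",
        rows,
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    # E -> T -> L, in that order: the data is reshaped before it reaches the warehouse.
    load(transform(extract("crm_export.csv")))
```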

ETL and ELT solutions are generally used for the following reasons:

  • Business Insights: Organizations are empowered by the ability to move all their marketing data into one place, allowing powerful analytics to be performed on their now centralized data.
  • Migration: Whether it’s moving data from one database provider to another or loading data into the cloud, ETL/ELT tools enable data engineering teams to be flexible with the location of their data.
  • Machine Learning: Businesses require large amounts of data to perform machine learning, and this is only possible with efficient and reliable ways of transporting that data. ETL/ELT tools enable learning patterns on large datasets by quickly transporting data to where it needs to be.

While ETL originated in the 1970s, the early tools weren’t built to handle the volumes they are required to process today. We’ll give a brief overview of how we got here and which solutions are looking to take us into the future.

A short history of data integration to enlighten the future

Pre-2000: Data pipelines? Definitely an IT problem.

In this era, we had dedicated teams managing a very small number of data integration pipelines. These teams generally built their solutions in house, as that was cost effective for the period.

2000 to 2010: Social networks usher in a rebirth of data.

Here’s the issue: we have way more data than we know what to do with, but none of the capability to handle this volume. The tools aren’t reliable, everything is still custom made and managed in-house, and cloud infrastructure hasn’t hit its stride yet. Without the reliability or scalability needed to support this level of ingestion and processing, the majority of analytics is still being done at a small scale. We do see the emergence of tools like Hadoop for parallelizing computational workloads, but it was only around 2008 that it really took hold.

2010 to 2015: The advent of cloud computing and ETL solutions.

We’re starting to see the explosion of cloud tools and SaaS solutions everywhere. We now have the ability to scale our resources quickly, and it showed: the number of cloud tools went through the roof, and the adoption of ETL solutions followed quickly, as building these pipelines in house made less and less sense (though many companies continued to do so anyway). Marketing and sales teams, for example, need to compute their cost of acquisition by consolidating data from several sources (ads, marketing, CRM, payments, etc.), and that’s only one use case among hundreds within a company. For the time, ETL was great! The paradigm gave businesses strong insight into their legacy data, it allowed visualization to be built on top of a unified view once loading was complete, and it enabled the rise of the data analyst, who enjoyed this new playground of sanitized data once the pipelines had been set up.

2015 to 2020: The rise of modern cloud data warehouses and ELT

For a while, ETL was great, and it still does have some modern applications. But in most cases, we started to see a few holes and disadvantages in the second half of the last decade:

  • Inflexibility: ETL is inherently rigid. It forces data analysts to know beforehand every way they are going to use the data, every report they are going to produce. Any change they make can be costly. It can potentially affect data consumers downstream of the initial extraction.
  • Lack of visibility: Every transformation performed on the data obscures some of the underlying information. Analysts won’t see all the data in the warehouse, only what was kept during the transformation phase. This is risky, as conclusions might be drawn from data that hasn’t been properly sliced.
  • Lack of autonomy for analysts: Building an ETL-based data pipeline is often beyond the technical capabilities of analysts. It typically requires the close involvement of engineering talent, along with additional code to extract and transform each source of data.

The puzzle pieces are starting to come together. We have the resources and the capability to move large amounts of data into destinations. We have large amounts of data to move. But how does it all fit together? Different teams never had the full picture, and legacy data warehouses still created data silos across departments of an organization due to a lack of intelligent, cost-effective scaling solutions. In come the modern cloud data warehouses: BigQuery, Redshift, and Snowflake started to emerge. They quickly became the best place to consolidate that data, as they offered computation and storage at much lower costs than traditional data warehouses. Not only that, they allowed transformation to be done after loading, enabling the more powerful ELT paradigm:

  • ELT supports agile decision-making for analysts: When analysts can load data before transforming it, they don’t have to determine exactly which insights they want to generate before deciding on the exact schema they need.
  • ELT promotes data literacy across the whole company: When used in combination with cloud-based business intelligence tools such as Looker, Mode, and Tableau, the ELT approach also broadens access to a common set of analytics across organizations. Business intelligence dashboards become accessible even to relatively nontechnical users.
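To make the ELT distinction concrete, here is a small sketch that uses SQLite as a stand-in for a cloud warehouse such as BigQuery, Redshift, or Snowflake (which have their own clients); the table and field names are illustrative. The raw records are landed untouched first, and the transformation runs afterwards as SQL inside the warehouse, so analysts can add or change views later without re-extracting anything.

```python
import json
import sqlite3

# SQLite stands in for a cloud warehouse here; it needs the built-in JSON
# functions, which ship with modern Python/SQLite builds.
con = sqlite3.connect("warehouse.db")

# Load first: land the raw payloads as-is, with no upfront schema decisions.
con.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
raw_records = [
    {"user": "A@Example.com", "event": "signup", "ts": "2021-07-01"},
    {"user": "b@example.com", "event": "login", "ts": "2021-07-02"},
]
con.executemany(
    "INSERT INTO raw_events (payload) VALUES (?)",
    [(json.dumps(r),) for r in raw_records],
)

# Transform later, in the warehouse: views like this can be added or changed
# at any time without touching the extraction or loading steps.
con.executescript("""
DROP VIEW IF EXISTS signups;
CREATE VIEW signups AS
SELECT lower(json_extract(payload, '$.user')) AS user_email,
       json_extract(payload, '$.ts')          AS signup_date
FROM raw_events
WHERE json_extract(payload, '$.event') = 'signup';
""")
con.commit()

print(con.execute("SELECT * FROM signups").fetchall())
```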

2020 to 2021: The rise of dbt and the analytics engineer

An open-source project named dbt emerged as the data transformation standard, making it much easier for data analysts to handle data transformation on their own. Enabling data analysts in this way paved the path for the “analytics engineer.” However, while the number of tools companies were using internally continued to grow, the ETL and ELT solutions didn’t keep up. They all seem to plateau at around 150-200 connectors. The reason is simple: the hard part of data integration is not building the connectors but maintaining them. That is costly, and any cloud-based closed-source solution will be restricted by ROI (return on investment) evaluations. It isn’t profitable for them to support the long tail of connectors, so they focus only on the most popular integrations. And that brings us to today.

Data integration in 2021 and beyond

As we see it, ETL/ELT now faces these three problems:

  • Lack of integrations: As mentioned, closed-source solutions may be high quality, but aren’t incentivized to create connectors for every data source.
  • Security: Companies are constantly fighting to remain compliant with their data and cannot trust black-box solutions that move data outside of their VPC.
  • Cost Efficiency: Volume-based pricing is an anti-pattern that doesn’t enable data analysts and data engineers to do their job. Companies shouldn’t be punished for leveraging more data in the modern landscape.

Decoupling EL from T

For every specific need within a company, there are unique transformations to be done: unique schema normalizations, deduplication needs, and sanitization processes for each and every tool. Decoupling EL from T allows the creation of general-purpose connectors and enables the industry to start covering the long tail of connectors. And, especially with an open-source approach, it lets anyone build a connector far more easily than starting from scratch in-house. This has led to the emergence of open-source solutions and connector development kits that allow users to easily support their unique data sources and destinations. With open-source EL(T), you can imagine a world where there is far less pressure on data engineering teams and where data integration could be commoditized.
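As a rough sketch of what that decoupling looks like in code, and not the actual API of any particular connector development kit, the classes below are hypothetical: a source connector only extracts raw records and wraps them in a common envelope, a destination connector only loads that envelope, and the T step (SQL models, dbt, etc.) happens later in the warehouse.

```python
import json
from typing import Dict, Iterator

class GreetingsSource:
    """Toy source connector: extract raw records and wrap them in a
    common envelope so any destination can consume them unchanged."""
    def read(self) -> Iterator[Dict]:
        # A real connector would page through an API or a database here.
        for data in ({"id": 1, "message": "hello"}, {"id": 2, "message": "world"}):
            yield {"stream": "greetings", "record": data}

class JsonLinesDestination:
    """Toy destination connector: load envelopes as JSON lines.
    No transformation happens on the way through."""
    def __init__(self, path: str) -> None:
        self.path = path

    def write(self, envelopes: Iterator[Dict]) -> None:
        with open(self.path, "a") as f:
            for envelope in envelopes:
                f.write(json.dumps(envelope) + "\n")

if __name__ == "__main__":
    # EL only: move raw records from source to destination.
    # Transformation is left to downstream tooling inside the warehouse.
    JsonLinesDestination("raw_greetings.jsonl").write(GreetingsSource().read())
```

Because both sides agree only on the envelope, any new source can be paired with any existing destination, which is what makes covering the long tail of connectors tractable.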

An overview of the current solutions in the modern data stack

Here’s a non-exhaustive list of current ETL/ELT solutions that showcases the different approaches:

Fivetran - a closed-source ELT solution

Fivetran is a closed-source, managed ELT service that was created in 2012. As of July 2021, they have built 150 data connectors and have over 1,000 users syncing data with them. They are considered a reliable choice and have a large customer base. Fivetran charges based on volume transferred.

Airbyte - an open-source ELT solution

Airbyte is an open-source ELT tool that was created in July 2020. As of July 2021, they have built over 90 data connectors, and have 3,500 companies using them to sync data. Their ambition is to support 200 connectors, and to release an open-core cloud solution that will not use volume-based pricing by the end of 2021.

Stitch Data - a closed-source ELT solution

Stitch Data is a managed ELT service that sets itself apart from Fivetran by boasting some open-source integrations, options for self-service, and higher customer satisfaction. Stitch also charges based on usage.

Matillion - a closed-source self-hosted ETL solution

Matillion, just like Fivetran and Stitch, is closed-source, but it is self-hosted, so data never needs to leave your VPC. Because of this, they use a credit-based pricing model tied to the size of the instances you use to run their software.

Abhi Vaidyanatha
Senior Developer Advocate
Airbyte

Featured Companies

Here are some amazing companies in the ETL Tools space.

Airbyte is an open-source data integration engine that helps you conso ...

Zero-Code Data aggregation platform for analytics and visualization ...

RudderStack is the CDP for developers.

Hevo is a fully managed data pipeline solution that saves a large part ...