ETL can be defined as the process of integrating data from different places into a single location, where it can be used by people in your company to draw insights.
With a multitude of marketing applications, sales tools, and business intelligence software, we are constantly running headfirst into the problem of needing to move data from a lot of different places into one or more centralized locations. E(xtract), T(ransform), and L(oad) came from a necessity of moving data from siloed locations into databases or data warehouses in where they could be leveraged.
Extract refers to acquiring data from an original source.
Transform refers to normalizing and/or sanitizing this data
Load refers to moving the data into a destination where it will be leveraged.
ETL and ELT solutions are generally used for the following reasons:.
While the origins of ETL came about in the 70s, the early tools weren’t built to handle the volume that they are required to process today. We’ll do a brief overview of how we got here and what solutions are looking to take us into the future.
In this era, we had dedicated teams managing a very small number of data integration pipelines. Generally, these teams would always build these solutions in house, as it was cost effective for the time period.
Here’s the issue. We have way more data than we know what to do with, but we have none of the capabilities of handling this volume. The tools aren’t reliable, everything is still custom made and managed in-house, and cloud infrastructure has not hit its stride yet. Without the reliability or scalability needed to support this level of ingestion and processing, the majority of analytics is still being done at a small scale. We do see the emergence of tools like Hadoop for parallelizing computational workloads, but it was only in 2008 where it really took hold..
We’re starting to see the explosion of cloud tools and SaaS solutions everywhere. We definitely have the capability of scaling our resources quickly and it showed; the number of cloud tools went through the roof, and the adoption of ETL solutions followed quickly, as building these pipelines in house was starting to make less and less sense. Although, many companies continued to do so anyway. Marketing and sales teams need to compute their cost of acquisition by consolidating data from several sources (ads, marketing, CRM, payment, etc.). And that’s only one use case amid hundreds of them within the company. For the time, ETL was great! The paradigm gave businesses strong insight into their legacy data, it allowed visualization to be done on top of a unified view once loading had been completed, and it enabled the rise of the data analyst, who enjoyed this new playground of sanitized data once the pipelines had been set up.
For a while, ETL was great, and it still does have some modern applications. But in most cases, we started to see a few holes and disadvantages in the second half of the last decade:
The puzzle pieces are starting to come together. We have the resources and capabilities of moving large amounts of data into destinations. We have large amounts of data to move. But how does it all fit together? Different teams never had the full picture and legacy data warehouses still created data silos in different departments of an organization due to lack of intelligent and cost effective scaling solutions. In comes the modern cloud data warehouses; BigQuery, Redshift, Snowflake started to emerge. They quickly became the best place to consolidate that data, as they were offering data computation and storage at much lower costs than traditional data warehouses. Not only that, they enabled transformation to be done after loading, enabling the more powerful ELT paradigm:
An open-source project named dbt emerged as the data transformation standard, making it much easier for data analysts to handle data transformation on their own. Enabling data analysts in this way paved the path for the “analytics engineer.” However, while the number of tools companies were using internally continued to grow, the ETL and ELT solutions didn’t keep up. They seem to all plateau at around 150-200 connectors. The reason is simple, the hard part about data integration is not about building the connectors, but maintaining them. It is costly, and any cloud-based closed-source solution will be restricted by ROI (return on investment) evaluations. It isn’t profitable for them to support the long tail of connectors, so they only focus on the most popular integrations. And that brings us to today.
As we see it, ETL/ELT now faces these three problems:
For every specific need within any company, there are unique transformations to be done; there will be unique schema normalizations, deduplication needs, and sanitization processes for each and every one of the tools. Decoupling EL from the T allows the creation of general purpose connectors and it enables the industry to cover start covering the long tail of connectors., And, especially with an open-source approach, it enables anyone to build connectors in a much easier way than starting it from scratch by doing it in-house. This led to the emergence of open-source solutions and connector development kits that allow users to easily support their unique data sources and destinations. With open-source EL(T), you can imagine a world where there is a lot less pressure on data engineering teams and where data integration could be commoditized.
Here’s a non-exhaustive listing of current ETL/ELT solutions that showcase the different approaches:
Fivetran is a closed-source, managed ELT service that was created in 2012. As of July 2021, they have built 150 data connectors and have over 1000 users using them to sync data. They are considered a reliable choice and have a large customer base. Fivetran charges on volume transferred.
Airbyte is an open-source ELT tool that was created in July 2020. As of July 2021, they have built over 90 data connectors, and have 3,500 companies using them to sync data. Their ambition is to support 200 connectors, and to release an open-core cloud solution that will not use volume-based pricing by the end of 2021.
Stitch Data is a managed ELT service that sets itself apart from Fivetran by boasting some open-source integrations, options for self-service, and higher customer satisfaction. Stitch also charges based on usage.
Matillion, just like Fivetran and Stitch, is closed-source, but is self hosted, so data never needs to leave your VPC. Because of this, they use a credit pricing model based on the size of the instances that you use to run their software.
Here are some amazing companies in the ETL Tools.