ETL: What are the challenges of traditional ETL tools?
"Data volumes are exploding faster than ever, and by 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet."
With so much data generated every second, extracting, transforming, and loading (storing) it into third-party systems, and then deriving useful information from it, has become a necessity for every business.
However, because they rely on a dedicated transformation engine, traditional ETL tools face the following limitations when managing large volumes of information:
The more data flows are developed, the heavier the load on the transformation engine
This can penalize performance
It also generates significant network traffic
This type of architecture can be expensive because of:
The cost usually tied to the power of the transformation engine (billing based on processors, for example)
The cost of the resources used (the larger your data volumes, the more dedicated resources are needed)
The need to physically locate the transformation engine:
Cloud architecture: data must pass through the engine
Hadoop architecture: the same observation applies
Traditional ETLs suffer "bottleneck" effects in proportion to the amount of data to be processed, while the volume of data created grows by an estimated 30-40% each year.
A small illustration of the limits of the ETL approach
To see the performance issues of an ETL approach, let's take a concrete example.
Imagine feeding a business intelligence database: say, a target table that already contains millions or billions of records accumulated over time.
Incrementally feeding this target with a few thousand new records from the source (that is, loading only the changes since the last run) requires comparing the new source records with the records already in the target.
In a traditional transformation engine-based approach, the execution would look like the diagram on the left, involving lookup (or equivalent) features to compare the data. This lookup must deal with millions or billions of records.
In this example, performance very often becomes a major challenge.
In such cases, in order to remain performant, traditional ETLs disavow their "engine" logic by executing instructions on the databases rather than on their own system.
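The engine-side lookup pattern can be sketched as follows. This is a minimal illustration, not a real ETL tool: Python's built-in sqlite3 module stands in for the target database, the table name and record counts are invented for the example, and the external "engine" is simulated by plain Python code.

```python
# Minimal sketch of the engine-based (ETL) pattern, with SQLite as a
# stand-in target: the engine pulls the target's keys out of the database
# and performs the lookup itself, record by record.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)",
                 [(i, f"old-{i}") for i in range(1000)])  # stands in for billions of rows

new_records = [(5, "new-5"), (2000, "new-2000")]  # the incremental source batch

# The "engine" materializes the target's keys in its own memory -- the step
# that stops scaling once the target holds billions of records.
existing = {row[0] for row in conn.execute("SELECT id FROM target")}

for rec_id, value in new_records:
    if rec_id in existing:  # the lookup happens in the engine, not the database
        conn.execute("UPDATE target SET value = ? WHERE id = ?", (value, rec_id))
    else:
        conn.execute("INSERT INTO target VALUES (?, ?)", (rec_id, value))
conn.commit()
```

The `existing` set is the weak point: with a real target of billions of rows, materializing the keys in the engine's memory, or streaming lookups one by one over the network, is exactly the bottleneck described above.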
The delegated transformation approach (ELT): what is it?
The ELT approach (delegation of transformations) takes advantage of existing information systems by having the technologies already in place (underlying technologies) do the transformation work, rather than using a separate proprietary engine.
The transformations are carried out by databases or other technologies (Hadoop cluster, cloud cluster, MOLAP systems, operating systems, etc.).
Data extraction and integration can be performed with the native, high-performance tools available on these technologies.
In this architecture, the load is distributed among the different systems: unlike the centralized ETL approach, ELT relies on a non-centralized architecture.
The timing and location of the transformation are the two key elements that characterize an ELT approach.
A simple illustration of the benefits of an ELT approach
Let's revisit the ETL example above, this time with an ELT approach.
In the previous example, we had to collect a few thousand "new" source records, then compare them with the billion records already in the target.
The ELT approach consists in loading the few thousand new source records into a temporary space on the target database side (the three gray tables in the schema).
The comparison is then done directly within the target database, without any lookup or other operations that are expensive for the database.
Indeed, the database will use its indexes and other internal mechanisms, which are optimized for this type of operation.
Likewise, once the data has been integrated into the target table (the first green table at the top of the schema), any aggregation into another table within the same database (the second green table of the schema) only requires actions inside the database: no external engine is needed.
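The same incremental load can be sketched in the delegated style. Again a minimal illustration with Python's sqlite3 module: the staging and target table names are invented, and SQLite's UPSERT syntax stands in for the MERGE (or equivalent) statement a production database would use.

```python
# Minimal sketch of the delegated (ELT) pattern on the same scenario: load
# the small incremental batch into a staging table on the target database,
# then express the comparison as set-based SQL that the database executes
# with its own indexes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)",
                 [(i, f"old-{i}") for i in range(1000)])  # stands in for billions of rows

# 1. Load the few thousand new records into a temporary/staging table.
conn.execute("CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [(5, "new-5"), (2000, "new-2000")])

# 2. Let the database compare and integrate in a single statement.
#    (SQLite UPSERT; the "WHERE true" avoids a parsing ambiguity when
#    ON CONFLICT follows a SELECT. Other databases would use MERGE.)
conn.execute("""
    INSERT INTO target (id, value)
    SELECT id, value FROM staging WHERE true
    ON CONFLICT(id) DO UPDATE SET value = excluded.value
""")
conn.commit()
```

No row of the billion-record target ever leaves the database: the comparison runs where the indexes live, and only the small batch travels over the network.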
For which use case is the ELT approach better?
ELT architecture, best approach for cloud projects and hybrid architectures?
As stated above, an ETL approach requires a transformation engine. What happens with applications hosted in the cloud? Where will the engine be located?
Cloud adoption is rarely at 100%, so an ETL approach implies installing multiple engines (on-premises and in the cloud) to handle the different cases.
The situation is even more complicated in multi-cloud architectures (Amazon, Azure, Google ..., Salesforce, Oracle ...): should we install multiple ETL engines? Should we invest in cloud-specific ETLs and lose the ability to rationalize and govern data?
In contrast, an ELT approach works in all these use cases, because it does not require installing proprietary engines. It lets the underlying technologies do the work and has a very light footprint on the systems.
The ELT approach, champion of large data volume handling?
The ELT approach minimizes the use of intermediate resources (server, network, disk ...).
It also makes full use of the native, high-performance functions of existing environments (file loaders, database transformation capabilities, Big Data environments).
The ELT approach avoids taking data out of its environment to be processed by an external engine.
Take the example of aggregating a set of data within a database.
Thanks to the delegated transformation approach (ELT), no external engine is required.
The data will remain in its own environment.
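The in-database aggregation can be sketched the same way. This is a minimal illustration using Python's sqlite3 module; the table and column names are invented for the example.

```python
# Minimal sketch of ELT-style aggregation: the GROUP BY runs entirely
# inside the database engine instead of an external transformation engine.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10.0), ("north", 5.0), ("south", 7.5)])

# The aggregation is delegated to the database; only the (small) aggregated
# result ever needs to leave it.
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""")
conn.commit()
```

With an engine-based approach, every detail row would first have to travel to the engine to be summed there; here the detail data never leaves its environment.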
The benefits of using an ELT solution for your data integration projects
Summary of the benefits of the ELT
The ELT approach allows you to:
Simplify and rationalize the architecture
  Effective: no intermediate technologies
  Fast: native communication with existing technologies
  Non-intrusive: no additional system to install
  Distributed: the load can be spread across the different systems
Optimize the use of existing technologies
  Cost-effective: the available power of existing systems is used
  Better experience: the customer's knowledge and control of existing technologies are consolidated
  Cost reduction: infrastructure costs are reduced and existing investments are capitalized on
Evolve naturally
  Automatic scalability: the available power of the underlying systems is used; if they evolve, the ELT benefits from it
  Fast deployment: adding sources or targets does not require adding engines; the tool is operational immediately
Benefits of the Stambia ELT
A top down approach, designing with the universal data mapping
An ELT tool is valuable only if it simplifies developers' work by abstracting the underlying technologies and giving them a more business-oriented view.
In this sense, Stambia's universal data mapping vision makes the ELT mode accessible without the need for advanced technical knowledge.
The top-down approach lets developers focus on business rules while Stambia generates the transformations on the appropriate platforms. This automatic generation follows the best practices of each technology, in order to guarantee the best levels of performance.