ETL Tools – What Are They and How to Start Using Them?

Michał Maliszewski

Jan 25, 2022

Business Intelligence

ETL

When working with data, we often encounter many challenges. The most significant are the dispersion of data across different systems and the quality of that data. Dispersed data doesn’t allow us to view all aspects of the business at once. Data from accounting systems, CRM, sales, and local user files – in such an environment, analyzing that data and drawing the right conclusions becomes very difficult. This raises the question: how do we integrate all of this?

The optimal result would be a single database containing all the data so that we could easily process, visualize, and analyze it.

What is GIGO?

Another aspect of data-related issues is their quality. This problem is perfectly illustrated by the saying: garbage in, garbage out (GIGO). This evocative phrase means that if we don’t take care of the quality of the data fed into the analytical tool, we will end up with an equally low-quality result.

ETL Specifics

ETL is an acronym for the three stages of data processing that ultimately make the data available in a single, high-quality source. The stages are (a minimal sketch of the full flow follows the list):

  • E (Extraction) – extracting data from dispersed sources;
  • T (Transformation) – transforming, cleaning, and applying business rules;
  • L (Loading) – loading data into a single source, e.g., a data warehouse.
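To make these stages tangible, below is a minimal sketch of the whole flow in Python with pandas. The file name, column names, and target table are hypothetical examples chosen for illustration, not a prescribed implementation.

```python
# A minimal ETL sketch; file, column, and table names are hypothetical.
import sqlite3
import pandas as pd

# E - extract: pull raw data from a dispersed source (here, a flat file)
raw = pd.read_csv("sales_export.csv")

# T - transform: clean the data and apply simple business rules
clean = (
    raw.drop_duplicates()                          # remove duplicate rows
       .dropna(subset=["order_id"])                # drop records without a key
       .assign(net_value=lambda d: d["quantity"] * d["unit_price"])  # derived column
)

# L - load: write the result into a single target source (a local warehouse table)
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("fact_sales", conn, if_exists="append", index=False)
```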

Stage 1. Extraction

The extraction process involves pulling data from all available sources. The most commonly encountered sources include databases, flat files, web services, and other types such as RSS.
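For illustration, the sketch below pulls data from a few such sources in Python. The connection string, SQL query, API endpoint, and file path are all assumptions made for the example.

```python
# Extraction sketch; connection details, queries, and paths are hypothetical.
import pandas as pd
import requests
from sqlalchemy import create_engine

# Source 1: a relational database (e.g., the CRM backend)
engine = create_engine("postgresql+psycopg2://user:password@crm-db/crm")
customers = pd.read_sql("SELECT customer_id, name, created_at FROM customers", engine)

# Source 2: a flat file maintained by a business user
budgets = pd.read_csv("local_budgets.csv")

# Source 3: a web service returning JSON
response = requests.get("https://example.com/api/orders", timeout=30)
orders = pd.json_normalize(response.json())
```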

If you intend to create a quality data model, you must first familiarize yourself with the data.

This stage is often overlooked, but it is incredibly important if we care about the quality of the data on which our later analysis will be based. Getting to know the data involves recognizing its sources, problems related to data refreshing, and errors it may contain. Special attention should be paid to data entered by humans in applications that do not validate it. Excel workbooks and other applications with manually entered data pose potential risks to data quality and timeliness. A good practice for getting to know the data is to audit it.

Data auditing should cover the three most essential aspects:

  1. Identifying data owners. Data dispersed throughout the organization has different owners, and only they can provide comprehensive information about how it is collected, stored, refreshed, and used, so recognizing them is key.
  2. Mapping data locations. This means describing where the data resides, its architecture, its users, and how it is utilized.
  3. Evaluating the quality, scope, and logic of the data. Key issues here include date formats, null handling, data increments, and the need to create new dimensions, measures, or functionalities (a simple profiling sketch follows this list).
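Part of such an audit can be scripted. Below is a simple profiling sketch using pandas; the file and column names are hypothetical.

```python
# Quick data-quality audit of an extracted dataset; names are illustrative.
import pandas as pd

df = pd.read_csv("crm_export.csv")

print(df.dtypes)                                        # are dates and numbers typed correctly?
print(df.isna().mean().sort_values(ascending=False))    # share of nulls per column
print(df.duplicated().sum(), "duplicate rows")          # exact duplicates
print(df["created_at"].min(), df["created_at"].max())   # time range / increment check
```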

Stage 2. Transformation

In the next stage, we will primarily focus on data cleaning. Given the multiple data sources and the potential human factor, there’s a need to automate the process of improving data quality. Since such a process will likely run continuously, it’s beneficial for the chosen tool to carry it out in a transparent, automated manner that we can also easily modify as needed.

At the same time, as the volume of data grows, the tool must clean it efficiently, removing unnecessary duplicates and correcting errors while saving space and maintaining the desired processing speed.

After cleaning the data, the next step is to transform it. For this, we will need proper data mapping and appropriate standardization of formats. Deduplication of rows, adding calculations (new columns), splitting field contents, grouping, and summarization are also essential.
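As an illustration of these operations, here is a short pandas sketch; the columns, formats, and rules are assumed for the example.

```python
# Transformation sketch: deduplication, derived columns, splitting fields, grouping.
import pandas as pd

df = pd.read_csv("orders_raw.csv")                      # hypothetical extract

df = df.drop_duplicates(subset=["order_id"])            # deduplicate rows

# Standardize formats
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Add a calculated column
df["net_value"] = df["quantity"] * df["unit_price"]

# Split one field's content into two columns
df[["first_name", "last_name"]] = df["customer_name"].str.split(" ", n=1, expand=True)

# Grouping and summarization
monthly = (
    df.groupby(df["order_date"].dt.to_period("M"))["net_value"]
      .sum()
      .reset_index(name="monthly_net_value")
)
```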

Stage 3. Loading

Once we have clean and transformed data, we can begin the loading process. Data can be loaded in its entirety (full load) or incrementally (incremental load).

Loading the data in its entirety means processing an ever-growing volume, which over time can significantly slow down the entire process. Incremental loading, on the other hand, processes only new or changed records and therefore allows for rapid processing.
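A common way to implement incremental loading is to keep a "watermark", for example the timestamp of the last loaded record, and pull only newer rows. The sketch below assumes an updated_at column in the source and an existing fact_sales table in the target; both names are hypothetical.

```python
# Incremental load sketch: only rows newer than the last loaded watermark are processed.
import sqlite3
import pandas as pd

source = sqlite3.connect("source_system.db")   # hypothetical source database
target = sqlite3.connect("warehouse.db")       # hypothetical target warehouse

# Read the current watermark from the target (assumes fact_sales already exists)
watermark = pd.read_sql("SELECT MAX(updated_at) AS w FROM fact_sales", target)["w"].iloc[0]
if pd.isna(watermark):
    watermark = "1900-01-01"                   # first run: load everything

# Extract only the increment and append it to the warehouse
increment = pd.read_sql(
    "SELECT * FROM sales WHERE updated_at > ?", source, params=(watermark,)
)
increment.to_sql("fact_sales", target, if_exists="append", index=False)
```

Because only the increment travels through the pipeline, load times stay roughly proportional to the amount of new data rather than to the full history.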

We can load the data into a Data Warehouse, Data Hub, or Data Lake – a single source with substantial computational capabilities that provides integrated data to the business user.

ETL and Business Intelligence

The process of preparing, transforming, and loading data is especially crucial when it comes to Business Intelligence. Using the right ETL tool within a well-prepared process, we can deliver high-quality data quickly and enable comprehensive analytics.

With an optimally arranged process, we save time. The time needed for manual data preparation can then be devoted to proper data analysis and deriving business conclusions – which is primarily what we aim for.

A well-prepared ETL process allows for rapid analysis of all data, provides historical context, and also frees up time for other crucial activities in the organization, such as innovation.

ETL Tools

In the era of dispersed data in large volumes, ETL tools become key in the Business Intelligence process.

Previously, the ETL process was most often conducted internally by IT, relying heavily on coding and requiring knowledge of the target schema. The traditional ETL process is also challenging to scale, and it is often necessary to forgo detailed data granularity in favor of processing speed. As a result, traditional data warehouses then require further processing or additional tools for detailed analysis.

Another issue can be the process’s low flexibility, making changes required by a dynamically changing environment very difficult to implement.

Modern ETL tools are designed to easily and transparently build the process, scale it, and operate in real-time. They also handle semi-structured data from various sources efficiently.

Another significant advantage is the ability to run on the organization’s own infrastructure (traditional tools require dedicated infrastructure of their own, which makes implementation costs much higher). Scaling can therefore be tailored to the organization’s size.

A trend in BI, and consequently in ETL, is designing tools so that the process can be graphically arranged and dynamically modified.

ELT – Modified Process

With large volumes of data and the need to retain access to raw data, the ELT process is used. It postpones data transformation until after loading, precisely so that unprocessed data remains available for analysis. Additionally, the transformation takes place in the target tool itself, eliminating the need to store processed data elsewhere.
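A minimal ELT sketch, assuming a SQLite file stands in for the target platform and with invented table and column names: the raw extract is loaded as-is, and the transformation runs afterwards inside the target as SQL.

```python
# ELT sketch: load raw data first, then transform it inside the target with SQL.
import sqlite3
import pandas as pd

target = sqlite3.connect("warehouse.db")        # hypothetical target platform

# L - load the raw, untransformed extract so it remains available for analysis
raw = pd.read_csv("orders_export.csv")          # hypothetical raw extract
raw.to_sql("raw_orders", target, if_exists="replace", index=False)

# T - transform on demand, inside the target, using its own engine
target.executescript("""
    DROP TABLE IF EXISTS orders_clean;
    CREATE TABLE orders_clean AS
    SELECT DISTINCT order_id,
           customer_id,
           quantity * unit_price AS net_value
    FROM raw_orders
    WHERE order_id IS NOT NULL;
""")
target.commit()
```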

Summary