

Combine data from multiple sources into a single dataset.

Data consolidation definition:

Data consolidation is the process of combining multiple datasets into a single dataset to create a more comprehensive view of the data. When designing data pipelines, it enables us to work with large volumes of data from different sources, which may have different structures, formats, or quality, in a more efficient and effective manner.

In the context of modern data engineering, data consolidation typically involves the following steps:

  1. Identify data sources: Data engineers need to identify the sources of data they want to consolidate, which can include internal and external data sources, such as databases, files, APIs, and streaming data.

  2. Extract data: The data from the identified sources is extracted, either in batches or in real time, and stored in a data lake or staging area.

  3. Transform data: Data from different sources may have different formats, structures, or quality. Data engineers need to transform the data to ensure consistency, quality, and compatibility. This step may involve data cleaning, normalization, deduplication, and enrichment.

  4. Integrate data: Data from different sources is integrated into a single dataset by mapping data attributes, resolving data conflicts, and creating a unified schema.

  5. Load data: The consolidated data is loaded into a data warehouse or data mart, where it can be queried, analyzed, and visualized.
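Under simplifying assumptions, the five steps above can be sketched in a few lines of pandas. The two small in-memory dataframes stand in for a database extract and an API extract, and the file name `consolidated.csv` is a hypothetical load target:

```python
import pandas as pd

# 1-2. Identify and extract: two small extracts standing in for real sources
orders = pd.DataFrame({'order_id': [1, 2],
                       'region': ['north', 'SOUTH'],   # inconsistent casing
                       'amount': [100, 200]})           # e.g. a database export
clicks = pd.DataFrame({'Region': ['North', 'South'],
                       'clicks': [10, 20]})             # e.g. an API export

# 3. Transform: normalize the join key so the two sources are compatible
orders['Region'] = orders.pop('region').str.title()
clicks['Region'] = clicks['Region'].str.title()

# 4. Integrate: merge on the common key of the unified schema
combined = pd.merge(orders, clicks, on='Region', how='outer')

# 5. Load: write the consolidated dataset to a (hypothetical) staging file
combined.to_csv('consolidated.csv', index=False)
```

In a production pipeline each step would of course be more involved (incremental extraction, data-quality checks, a real warehouse loader), but the shape of the flow is the same.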

Data consolidation is a key component of modern data engineering because it enables organizations to leverage the power of big data and analytics to gain insights, improve decision-making, and drive innovation. By consolidating data from multiple sources into a single dataset, data engineers can create a more comprehensive and accurate view of their business operations, customers, and markets, which can help them identify patterns, trends, and opportunities for growth.

Example of Data Consolidation using Python

Let's create two sample datasets, sales_data and marketing_data, each sharing the common columns Date and Region and carrying a different metric: Sales and Impressions, respectively.

Next, we create two Pandas DataFrames from the sample data using the pd.DataFrame() constructor.

Finally, we consolidate the data from both dataframes into a single dataset by merging them on the common columns, Date and Region, using the pd.merge() function with an outer join. An outer join keeps every row from both inputs, so the resulting dataframe consolidated_df contains all the data from the sales_df and marketing_df dataframes.

Please note that you need to have Pandas installed in your Python environment to run this code.

import pandas as pd

# create sample data
sales_data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03'],
              'Region': ['North', 'South', 'East'],
              'Sales': [1000, 2000, 1500]}

marketing_data = {'Date': ['2022-01-02', '2022-01-03', '2022-01-04'],
                  'Region': ['South', 'East', 'West'],
                  'Impressions': [500, 1000, 800]}

# create pandas dataframes from sample data
sales_df = pd.DataFrame(sales_data)
marketing_df = pd.DataFrame(marketing_data)

# consolidate data by merging dataframes
consolidated_df = pd.merge(sales_df, marketing_df, on=['Date', 'Region'], how='outer')

# display the consolidated dataset
print(consolidated_df)


This data consolidation script will print out:

         Date Region   Sales  Impressions
0  2022-01-01  North  1000.0          NaN
1  2022-01-02  South  2000.0        500.0
2  2022-01-03   East  1500.0       1000.0
3  2022-01-04   West     NaN        800.0
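For comparison, a merge with how='inner' keeps only the Date and Region pairs that appear in both sources, dropping the rows that produce NaN above. A quick sketch reusing the same sample data:

```python
import pandas as pd

sales_df = pd.DataFrame({'Date': ['2022-01-01', '2022-01-02', '2022-01-03'],
                         'Region': ['North', 'South', 'East'],
                         'Sales': [1000, 2000, 1500]})
marketing_df = pd.DataFrame({'Date': ['2022-01-02', '2022-01-03', '2022-01-04'],
                             'Region': ['South', 'East', 'West'],
                             'Impressions': [500, 1000, 800]})

# inner join: only Date/Region pairs present in BOTH dataframes survive,
# so the result has no missing values but covers fewer rows
inner_df = pd.merge(sales_df, marketing_df, on=['Date', 'Region'], how='inner')
```

Choosing between outer and inner joins is a consolidation decision in itself: an outer join preserves all the data at the cost of gaps, while an inner join yields a complete but narrower dataset.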

In the output above, NaN stands for "Not a Number."

NaN is a special floating-point value that represents undefined or unrepresentable values, such as the result of an invalid mathematical operation or missing data.

In this specific case, the NaN values represent missing data: the NaN in the Impressions column for the North region on January 1 means no marketing impressions data was available for that date and region, while the NaN in the Sales column for the West region on January 4 means no sales data was available there.
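Depending on what a missing value means for the analysis, the NaNs left behind by an outer join are often either filled with a default or dropped. A small sketch using pandas' fillna() and dropna() on a simplified version of the consolidated data:

```python
import pandas as pd
import numpy as np

# simplified consolidated dataset with the NaN pattern from an outer join
consolidated_df = pd.DataFrame({
    'Date': ['2022-01-01', '2022-01-02', '2022-01-04'],
    'Region': ['North', 'South', 'West'],
    'Sales': [1000.0, 2000.0, np.nan],
    'Impressions': [np.nan, 500.0, 800.0]})

# replace NaN with 0 where "missing" means "no activity recorded"
filled = consolidated_df.fillna(0)

# or keep only rows where every source contributed data
complete = consolidated_df.dropna()
```

Filling with 0 is only appropriate when absence genuinely means zero activity; for other semantics (e.g. "not measured"), keeping the NaN or dropping the row is usually safer.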
