Data Ingestion in Data Pipelines | Dagster Glossary


Data Ingestion in Data Pipelines

The initial collection and import of data from various sources into your processing environment.


Data ingestion definition:

Data ingestion refers to the initial collection and import of data from various sources into a system where it can be processed, analyzed, and/or stored. In a data pipeline, this data typically originates from SaaS applications, external databases, APIs, social media, or streams from IoT devices; ingestion generally implies centralizing the data in one system, such as a data warehouse or data lake.

The process of data ingestion can involve both batch and real-time ingestion. Batch ingestion involves importing data in large, scheduled groups, while real-time ingestion, often used in time-sensitive scenarios, involves importing data immediately as it is generated.
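As a rough illustration (not a production pattern), the sketch below contrasts the two modes in plain Python: a batch function that pulls a full extract in one scheduled call, and a generator that polls a source and yields records as they arrive. The URL and polling interval are placeholders; a real streaming setup would more likely consume from a platform like Kafka or Kinesis.

import time
import requests

# Batch ingestion: pull a full extract in one scheduled call.
def ingest_batch(url: str) -> list:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()  # one large, scheduled load

# Real-time-style ingestion: poll the source and hand records downstream
# as they arrive. A production system would more likely consume from a
# streaming platform such as Kafka or Kinesis, or receive webhooks.
def ingest_stream(url: str, poll_seconds: int = 5):
    while True:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        for record in response.json():
            yield record  # emit each record immediately
        time.sleep(poll_seconds)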

In designing your data pipelines, data ingestion is an important first step as it provides the raw materials for subsequent steps such as data transformation, storage, and analysis. It needs to be robust and efficient to ensure the timely availability of data for further processing.

The goal of data ingestion in data engineering is to facilitate seamless data integration and ensure a reliable flow of data from source systems to destinations, such as a data lake or data warehouse, where it can be harnessed to derive business insights, power predictive analytics, or drive data-driven decision-making.

But there are also costs to running ingestion jobs, so you need to map this out carefully. In designing your data pipeline, you will need to make trade-offs between the frequency of data updates and the cost of operating the data platform.

Data ingestion: build or buy?

Since the early days of the so-called Modern Data Stack, commercial services have emerged to allow for low/no-code data ingestion services. These systems provide integrations with popular data sources and data destinations. Some only offer proprietary integrations, while others support running open-source or custom integrations.

Popular services for traditional SaaS analytics include Matillion, Airbyte, Fivetran, improvado, Stitch, Meltano, and many others.

Ingestion of streaming data is more specialized and includes services such as Amazon Kinesis, Apache Kafka, Confluent (hosted Kafka), Streamsets, and GCP Dataflow.

Designing your own data ingestion process and using a commercial ingestion service each come with their own benefits. The right choice often depends on your specific requirements, such as the complexity of your data sources, your need to customize the ingestion, the frequency of updates, the criticality of the data, your team's technical expertise, budget constraints, and other factors. For less common data sources, you might find that commercial ingestion services simply don't support your source or support it poorly (although services like Portable.io specialize in 'rare' data sources).

Benefits of Designing Your Own Ingestion Process:

  1. Customization: You can build a solution perfectly tailored to your specific needs, including handling unique data sources, custom transformation logic, or special security requirements.

  2. Cost Control: Building your own ingestion process could potentially save costs in the long term, especially if your data volumes are high and commercial service costs are proportional to the data processed (as many are).

  3. Keep the knowledge in-house: It provides your team with an opportunity to deeply understand the data ingestion process and the architecture of the data pipeline, which could be beneficial in troubleshooting and future enhancements.

  4. No Vendor Lock-in: You're not tied to the technologies, pricing changes, upgrade cycles, or terms of a third-party provider.

  5. Data sovereignty: In some cases, security requirements prevent you from using a third-party data processor.

Benefits of Using a Commercial Ingestion Service:

  1. Speed and Ease of Deployment: Commercial services usually offer plug-and-play solutions that can quickly be set up, reducing time to value. They typically offer free historical backfills, which is handy.

  2. Expertise and Best Practices: These services are built by teams of experts who incorporate industry best practices, which may translate to more efficient and reliable data ingestion.

  3. Scalability and Performance: Commercial solutions often provide built-in mechanisms for handling large amounts of data, ensuring high availability and performance.

  4. Support and Maintenance: Vendors generally provide ongoing support and handle maintenance, upgrades, and security, which can significantly reduce the burden on your team as well as avoid fire drills. That said, many customers remain frustrated at the slow pace of development as vendors try to support a very large library of integrations.

  5. Focus on Core Competencies: By using a commercial service, your team can focus on other aspects of data processing and analysis rather than spending time on the complexities of data ingestion.

In general, teams will try these data ingestion services using their free trial options and see if the off-the-shelf integrations meet the specifics of their use case. Large data platforms often lean on different ingestion services depending on the quality of each integration. It's not unusual to find that a commercial ingestion tool hits a snag if your data contains unexpected values, such as nested JSON objects or uncommon character encodings.
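As a hedged illustration of that last point, the records below are made up, but they show how a few lines of pandas can flatten nested JSON that an off-the-shelf connector might reject:

import pandas as pd

# Made-up records with a nested "address" object: the kind of shape that
# can trip up an off-the-shelf connector.
records = [
    {"id": 1, "name": "Ada", "address": {"city": "London", "zip": "EC1A"}},
    {"id": 2, "name": "Grace", "address": {"city": "Arlington", "zip": "22201"}},
]

# json_normalize flattens the nested objects into columns such as "address.city".
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['id', 'name', 'address.city', 'address.zip']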

Data ingestion with Dagster

Python is the most popular programming language for data ingestion solutions. For example, the Singer standard (a standard for developing ingestion plug-ins) favors Python. So naturally, writing an ingestion step into your pipeline using Dagster is fairly trivial. Dagster provides integrations for several ingestion solutions such as Airbyte, Fivetran, and Meltano.
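For a hand-rolled ingestion step, a minimal sketch of a Dagster software-defined asset might look like the following. The asset name is our own choosing, and the source is the public JSONPlaceholder API that the example below also uses; swap in your own source and any authentication it requires.

import pandas as pd
import requests
from dagster import asset

# A minimal software-defined asset that ingests from a public API.
# The asset name and source URL are illustrative.
@asset
def raw_posts() -> pd.DataFrame:
    response = requests.get("https://jsonplaceholder.typicode.com/posts", timeout=30)
    response.raise_for_status()
    return pd.json_normalize(response.json())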

You will find several examples of data ingestion in our tutorials, such as ingesting data from the Bureau of Labor Statistics, from the GitHub API, or your local ski resort!

A Python example of data ingestion

In this example, we'll use Python's requests module to call a public API, specifically the JSONPlaceholder API, which is a simple fake REST API for testing and prototyping. We'll then parse the returned JSON data into a pandas DataFrame for further analysis. JSONPlaceholder doesn't require any authentication, so you can run this code as it is.

import requests
import pandas as pd

def ingest_data(url):
    try:
        # Send a GET request to the API
        response = requests.get(url)

        # Raise an exception if the request was unsuccessful
        response.raise_for_status()

    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return None

    else:
        # If the request was successful, load the data into a DataFrame
        data = pd.json_normalize(response.json())
        print("Data ingestion successful. Here are the first few rows of your data:")
        print(data.head())

        return data

# Define the data source
url = 'https://jsonplaceholder.typicode.com/posts'

# Ingest data
df = ingest_data(url)

In this script, we send a GET request to the "/posts" endpoint of the JSONPlaceholder API. The API returns a list of posts, each represented as a JSON object. We then use pandas.json_normalize to flatten these JSON objects and convert them into a tabular format.

Here is the output produced:

Data ingestion successful. Here are the first few rows of your data:
   userId  id                                              title                                               body
0       1   1  sunt aut facere repellat provident occaecati e...  quia et suscipit\nsuscipit recusandae consequu...
1       1   2                                       qui est esse  est rerum tempore vitae\nsequi sint nihil repr...
2       1   3  ea molestias quasi exercitationem repellat qui...  et iusto sed quo iure\nvoluptatem occaecati om...
3       1   4                               eum et est occaecati  ullam et saepe reiciendis voluptatem adipisci\...
4       1   5                                 nesciunt quas odio  repudiandae veniam quaerat sunt sed\nalias aut...

Again, remember that in a real-world scenario, data ingestion might involve more complexity, such as error handling, data validation, and transformation steps. In this simple example, we're assuming that the API is available, that it returns data in the expected format, and that there are no network errors - assumptions that might not hold in a production environment.

Such error handling, scheduling, and staying on top of changing APIs are some of the top reasons that teams will adopt third-party data ingestion services.
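For completeness, here is a hedged sketch of how Dagster's retry policies and schedules could cover the scheduling and transient-failure handling mentioned above, reusing the raw_posts asset from the earlier sketch. The cron expression, retry settings, and job name are illustrative choices, not requirements.

import pandas as pd
import requests
from dagster import (
    Definitions,
    RetryPolicy,
    ScheduleDefinition,
    asset,
    define_asset_job,
)

# Retry transient failures a few times before giving up.
@asset(retry_policy=RetryPolicy(max_retries=3, delay=30))
def raw_posts() -> pd.DataFrame:
    response = requests.get("https://jsonplaceholder.typicode.com/posts", timeout=30)
    response.raise_for_status()
    return pd.json_normalize(response.json())

# Materialize the asset on a schedule (here, hourly).
ingest_job = define_asset_job("ingest_posts_job", selection="raw_posts")
hourly_ingest = ScheduleDefinition(job=ingest_job, cron_schedule="0 * * * *")

defs = Definitions(assets=[raw_posts], jobs=[ingest_job], schedules=[hourly_ingest])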

