Upserting Records

Data Upsert: A definition

'Upsert' is a operation that combines the actions of 'update' and 'insert' when working with any system of records. It refers to the process of either updating an existing record if it already exists or inserting a new record if it does not, based on specified criteria, typically using key fields to identify records uniquely.

Jump to the code example ↓

Data Upserting in Data Engineering

Record upserting is a key concept in data engineering. It ensures that the latest information is correctly stored and that duplicates are avoided.

In a typical database, you have two primary operations: "INSERT" to add new records and "UPDATE" to modify existing records. However, in many real-world scenarios, you may not always know whether a record already exists. This is where upserting comes into play.

If a record with the specified key field already exists, an upsert operation will update the existing record with new data.
If a record with the specified key fields doesn't exist, an upsert operation will insert a new record with the provided data.

An illustration of a basic approach to UPSERT in database records

Efficient upsert operations are critical in high-throughput systems. Data engineers need to optimize their database schemas, indexing strategies, and query execution plans to ensure fast upserts, especially in scenarios where a large volume of data is involved.

Depending on the specific use case and database system, different strategies for upserting data may be employed. These can include using database-specific syntax like "MERGE" or "UPSERT," or using conditional logic within code to perform upserts. We provide database-specific examples below.

Basic upserting considerations:

Key Fields: Upserting usually relies on one or more key fields that uniquely identify a record. These key fields are used to determine if a record already exists in the database.

Upserting and Concurrency Control: concurrency control is crucial. Multiple users or processes may attempt upsert operations simultaneously. To prevent conflicts and ensure data consistency, database systems often provide mechanisms like locks, transactions, or optimistic concurrency control.

Dedicated upsert commands

Many modern database management systems (DBMS) offer specific upsert commands or functionalities to simplify the process of inserting or updating records. The specific syntax and implementation can vary between different database systems. Here are some examples of upsert commands in popular DBMS:

MySQL/MariaDB, you can use the INSERT INTO ... ON DUPLICATE KEY UPDATE statement to perform an upsert operation. If a duplicate key violation occurs, it updates the existing record with new values. See the docs.
PostgreSQL provides the INSERT INTO ... ON CONFLICT DO UPDATE syntax to perform an upsert. You specify the conflict resolution criteria and the update action. See the docs.
SQL Server offers the MERGE statement, which allows you to perform both insert and update actions based on a specified condition. It is a versatile way to implement upserts. See the docs.
Oracle Database supports the MERGE INTO statement, similar to SQL Server's MERGE, which allows you to perform conditional insert and update operations in a single statement. See the docs.
SQLite, you can use the INSERT OR REPLACE INTO statement to perform an upsert. It replaces the existing record if a conflict occurs. The specific UPSERT syntax was added to SQLite with version 3.24.0. See the docs
Cassandra, you can use the INSERT INTO ... IF NOT EXISTS statement to insert a record if it does not exist. This is a specific use case of upserting in a NoSQL context. See the docs.
In MongoDB, you typically use the updateOne() or updateMany() methods to perform upserts. These methods allow you to specify update criteria and insert new documents if no matching documents are found. The update command modifies documents in a collection. See the docs.
AWS DynamoDB, you can use the PutItem operation with conditional expressions to perform upserts. It allows you to specify conditions for inserting or updating an item. See the docs.

The syntax and behavior of upsert commands may vary, and not all database systems provide dedicated upsert statements. In some cases, you may need to implement custom logic to achieve upsert functionality, combining conditional inserts and updates within application code or stored procedures.

DuckDB for instance does not (at time of writing - version 0.9.2) have a dedicated UPSERT command, but rather encourages the use of INSERT and UPDATE.

Data Upsert: Files vs. databases

While the concept originates from database management systems, upserting can equally be applied to the large file formats associated with data engineering. When dealing with files, an "upsert" operation would involve reading the file, performing the insert or update operation in memory (like in a DataFrame), and then writing the modified DataFrame back to file.

However, it's important to note that a file-based upsert process is not as straightforward as in a database system. File formats do not natively support transactions or concurrent modifications like a database does. Each upsert operation would typically involve reading the entire file, making the necessary changes, and then writing it back, which can be resource-intensive for large datasets. Additionally, handling concurrent updates can be challenging and might require additional mechanisms like file locks or versioning.

For instance, as a columnar storage format, Parquet is not designed for in-place updates or insertions. In Parquet, data is typically written once and then read many times, and updates usually involve creating new versions of the entire Parquet file rather than modifying existing records.

When designing a process requiring upsert steps, you would be advised to adopt an extension of Parquet like Delta Lake that adds support for ACID (Atomicity, Consistency, Isolation, Durability) transactions. It provides features like INSERT, UPDATE, DELETE, and MERGE operations on data stored in Parquet files.

A basic example of record upserting in Python using SQLite

Please note that you need to have the necessary Python libraries installed in your Python environment to run this code.

About SQLite

SQLite is a C-language library that implements a small, fast, self-contained SQL database engine. From this perspective, it's great for a simple tutorial, but not reflective of tools used in Data Engineering.

About our example

In this example we:

Create a simple database called languages.db with a table records
Define a function for the upsert operation. This function will insert a new record if the ID doesn't exist, or update the existing record if the ID is found.
Use this function to perform upsert operations
retrieve and print all records from the table records.

Note that here we are not using SQLite's INSERT OR REPLACE INTO or UPSERT in order to illustrate the basic logic of an upsert command, namely check if a record exists, update it if it does, or create it if it does not.

import sqlite3

# Connect to SQLite database (it will be created if it doesn't exist)
conn = sqlite3.connect('languages.db')

# Create a cursor object using the cursor() method
cursor = conn.cursor()

# Create table
cursor.execute('''CREATE TABLE IF NOT EXISTS records
               (id INT PRIMARY KEY, name TEXT, release_year INT)''')

# Commit the changes then close the connection
conn.commit()
conn.close()


def upsert_record(conn, id, name, release_year):
    cursor = conn.cursor()

    # Check if the record exists
    cursor.execute('SELECT * FROM records WHERE id=?', (id,))
    data = cursor.fetchone()

    if data is None:
        # Insert a new record
        cursor.execute('INSERT INTO records (id, name, release_year) VALUES (?, ?, ?)', (id, name, release_year))
    else:
        # Update the existing record
        cursor.execute('UPDATE records SET name=?, release_year=? WHERE id=?', (name, release_year, id))

    # Commit the changes
    conn.commit()

def fetch_all_records(conn):
    cursor = conn.cursor()

    # Execute a query to fetch all records
    cursor.execute('SELECT * FROM records')

    # Fetch all rows from the database
    return cursor.fetchall()

# Reconnect to the database
conn = sqlite3.connect('example.db')

# Upsert some records
upsert_record(conn, 1, 'Plankalkul', 1940)
upsert_record(conn, 2, 'SQL', 1970)
upsert_record(conn, 3, 'Python', 1991)
upsert_record(conn, 4, 'Rust', 2015)

# Upsert an existing record
upsert_record(conn, 2, 'SQL', 1972)  # This will update the existing record

# Fetch and print all records
records = fetch_all_records(conn)
for record in records:
    print(record)

# Close the connection
conn.close()

The output from this script will be:

(1, 'Plankalkul', 1940)
(2, 'SQL', 1972)
(3, 'Python', 1991)
(4, 'Rust', 2015)

As you will see the entry with id of 2 properly reflects the first release date of SQL in 1972.

Advanced upserting considerations

While upserting is a basic and very common operation, there are some nuances to improving its performance and reliability. If you look at the database specific commands, you will notice that many use a different logic than the one we spelt out.

The basic approach is:

Lookup a record with a given id
Update it if it exists
Create it if it does not exist.

A repeat of the earlier illustration of a basic approach to UPSERT in database records

While this logic is sound, it requires several calls, creating more overhead. It also opens up some potential issues of concurrency and interpretation. Therefore, it is generally more efficient to use the following logic:

Update a record with a given id
If the update failed, create the record.

An illustration of a more performant approach to UPSERT in database records

Aaron Bertrand explains the benefits of this approach in this article. You will notice that the database specific approaches listed above often use this pattern.