Database Sharding | Dagster Glossary

Definition of Sharding in database administration

Sharding is the practice of partitioning a database into smaller, more manageable pieces called shards. Each shard is essentially a standalone database that stores a subset of the data.

Sharding improves performance, scalability, and manageability in databasing. In data engineering, sharding is a crucial concept when dealing with massive datasets that can't be effectively handled by a single machine or database instance. However, it's essential to approach sharding with a clear understanding of both its benefits and challenges.

Sharding is a technique used at firms like Dropbox.com to handle 10 million requests per second across petabytes of metadata.

When to Use Sharding:

Not all databases or storage situations benefit from sharding. It's essential to assess whether the complexity introduced by sharding will be offset by the performance and scalability gains.

Databases with very high write loads, massive datasets, or those anticipated to grow significantly in the future are good candidates.

Before jumping into sharding, consider other scalability techniques like replication, caching, or using specialized database systems optimized for certain types of workloads.

Benefits and Challenges of Sharding:

The key benefits of sharing are:

Scalability: As datasets grow, the demands on a database's CPU, RAM, and storage can overwhelm a single server. By sharding, you distribute the data across multiple servers, thereby distributing the load as well.
Performance: Large datasets can lead to slow query performance, especially in distributed multi-tenant systems. Breaking the data into shards can lead to faster query responses since each query can be executed on a smaller subset of the data.
High Availability: With data distributed across multiple servers, the failure of one server doesn't bring the entire system down.

But sharding also presents challenges:

Complexity: Sharding introduces operational and architectural complexity. Querying across shards, for example, can be more challenging than querying a single, unsharded database.
Rebalancing: Over time, as data grows or access patterns change, there might be a need to redistribute the data. This rebalancing can be a non-trivial operation.
Join Operations: Performing joins between tables that are on different shards can be complex and inefficient.
Backups and Maintenance: Each shard essentially being a separate database means backups, updates, and other maintenance tasks need to be managed for multiple databases.

Database sharding concepts

Shard Key:

Central to the sharding strategy is the choice of a shard key. This key determines how data is distributed across the shards. A good shard key ensures that data is distributed evenly and access patterns are efficient.
The choice of a shard key is crucial. A poorly chosen shard key can lead to uneven data distribution (some shards have way more data than others) or inefficient query patterns.

Types of Sharding:

Horizontal Sharding: Data rows are partitioned across shards. For example, if you have a table with user profiles, you might shard based on user ID, where IDs 1-1000 are on shard A, 1001-2000 on shard B, and so on.
Vertical Sharding: Different tables or columns of a table are stored in different shards. For example, a user profile table might be split such that names and addresses are on one shard, while user preferences are on another.

Two examples of database sharding in Python

Example 1: Sharding in SQLite

Let look at an example where we need to shard user data based on the user's ID across multiple databases using Python with the sqlite3 library. While SQLite might not be used for sharding in production, it's useful for illustrating the concept. In a real-world scenario, you'd probably be working with distributed databases like Cassandra, MongoDB (illustrated in the next example below), or even sharded PostgreSQL or MySQL setups.

Steps:

Create multiple SQLite databases (representing 10 shards).
- We're simulating the insertion of 1000 users.
- The random_name function generates random user names.
Write functions to determine which shard to write to or read from based on a user's ID.
Demonstrate writing and reading data by randomly retrieving and printing details of 5 users to demonstrate retrieval from the large dataset.

We'll generate a large volume of data using random functions to simulate many users and we'll insert them into the sharded databases. Feel free to change the value of NUM_USERS to make the example run faster.

In a real-world scenario, you'd also need to think about optimizing batch inserts, handling potential errors, and so forth. But this example is meant to illustrate the basic principle of sharding with a larger dataset.

This is a simplified example, and real-world sharding introduces many complexities, such as:

Handling joins across shards.
Ensuring balanced data distribution.
Managing failures, backups, and replication.
Rebalancing shards.

However, this SQLite example illustrates the basic principles of determining where data should reside based on a sharding key (user ID, in this case) and how to interact with the appropriate shard when reading or writing data.

import sqlite3
import os
import random
import string

# Number of shards
NUM_SHARDS = 10

# Shard database names
SHARD_NAMES = [f"shard_{i}.db" for i in range(NUM_SHARDS)]

# Number of users to simulate (change this to a lower number if you want the example to run faster)
NUM_USERS = 1000

# Cleanup: Delete shard files if they exist to start fresh
for shard_name in SHARD_NAMES:
    if os.path.exists(shard_name):
        os.remove(shard_name)

def setup_shards():
    """Initialize the databases."""
    for shard_name in SHARD_NAMES:
        conn = sqlite3.connect(shard_name)
        cursor = conn.cursor()
        cursor.execute('''CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)''')
        conn.commit()
        conn.close()

def get_shard(user_id):
    """Determine which shard a user ID should go to."""
    shard_num = user_id % NUM_SHARDS
    return SHARD_NAMES[shard_num]

def insert_user(user_id, name):
    shard_name = get_shard(user_id)
    conn = sqlite3.connect(shard_name)
    cursor = conn.cursor()
    cursor.execute('INSERT INTO users (id, name) VALUES (?, ?)', (user_id, name))
    conn.commit()
    conn.close()

def get_user(user_id):
    shard_name = get_shard(user_id)
    conn = sqlite3.connect(shard_name)
    cursor = conn.cursor()
    cursor.execute('SELECT name FROM users WHERE id=?', (user_id,))
    result = cursor.fetchone()
    conn.close()
    return None if result is None else result[0]

# Generate random user names
def random_name(length=5):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(length))

# Setup shards
setup_shards()

# Insert many users
for user_id in range(1, NUM_USERS + 1):
    insert_user(user_id, random_name())

# Retrieve and print some user names by random IDs
for _ in range(5):
    random_id = random.randint(1, NUM_USERS)
    print(f"User ID: {random_id}, Name: {get_user(random_id)}")

Because the data is randomly generated, the exact output will vary but will look something like this:

User ID: 97057, Name: zkrlu
User ID: 59872, Name: tnyvr
User ID: 1760, Name: rtzpp
User ID: 58910, Name: azkkm
User ID: 48120, Name: iwmlc

Example 2: Sharding with MongoDB

Assuming you have MongoDB running locally, we can also issutrate sharding on this NoSQL solution.

MongoDB is a popular database that has built-in support for sharding. Here's a simple example using Python and the pymongo library to illustrate how sharding could be done:

To follow this example, you need to have pymongo and MongoDB running.

This example will:

Start MongoDB (multiple instances for shards, config servers, etc.).
Set up sharding using MongoDB's commands.
Use pymongo in Python to insert and retrieve data.


from pymongo import MongoClient, ASCENDING
from pymongo.errors import OperationFailure

# Connect to the Mongo client
client = MongoClient("localhost", 27017)

# Access the database
db = client.shard_example_db

# Drop previous collection if it exists (for this example to be repeatable)
db.users.drop()

# Here we assume the sharding environment is set up.
# Normally, this step requires MongoDB configuration servers, adding shards, etc.
# Let's enable sharding on our database and collection:
try:
    client.admin.command("enableSharding", "shard_example_db")
except OperationFailure:
    # The database is already sharded
    pass

# Create an index, which will be used as the shard key
db.users.create_index([("user_id", ASCENDING)])

# Shard the collection
try:
    client.admin.command("shardCollection", "shard_example_db.users", key={"user_id": "hashed"})
except OperationFailure:
    # The collection is already sharded
    pass

# Insert data into the collection
for user_id in range(1, 10001):  # Insert 10,000 documents
    db.users.insert_one({"user_id": user_id, "name": f"User_{user_id}"})

# Retrieve and print some user data
print(db.users.find_one({"user_id": 5000}))

client.close()

The final print(db.users.find_one({"user_id": 5000})) command will output something like this (note, the ObjectId will vary):

{'_id': ObjectId('650b561b2e46072998072507'), 'user_id': 5000, 'name': 'User_5000'}

Definition of Sharding in database administration

When to Use Sharding:

Database sharding concepts

Two examples of database sharding in Python

Example 1: Sharding in SQLite

Example 2: Sharding with MongoDB

Backpressure

Compact

Compress

Ingest

Load

NoSQL

Pickle

Rebalance

Repartition

Scrape

Shuffle

Spill

Stored Procedure

Upsert