Use of NoSQL in data engineering:
NoSQL databases emerged as an alternative to traditional relational databases, especially for applications that require scalability, speed, and flexible schema design.
NoSQL stands for "Not Only SQL", indicating that while SQL querying can sometimes be used, these databases don't exclusively rely on SQL for data manipulation.
NoSQL encompasses a variety of database technologies that were developed as a response to the limitations of traditional relational databases. Note that over time, the boundaries between SQL and NoSQL have become blurrier. Some traditional relational databases have incorporated NoSQL features and vice versa. For instance, PostgreSQL now supports JSON columns, allowing for some NoSQL-like operations. The choice between SQL and NoSQL should always be based on the specific needs of a project.
Why use NoSQL and not a traditional database or data warehouse?
- Scalability: Traditional relational databases are typically scaled vertically (adding more power to the existing machine), while many NoSQL databases scale horizontally (adding more machines to the system). This horizontal scaling makes it easier to support huge amounts of data and traffic.
- Flexible Schema: NoSQL databases often allow for flexible and dynamic schemas, which means that data structures can change over time without the need to modify the whole database.
- Performance: For specific use-cases, NoSQL databases can offer optimized performance, especially when read and write operations are more frequent.
Types of NoSQL Databases:
- Document-Based (e.g., MongoDB, CouchDB): Store data as documents, typically using JSON or BSON format. Each document is a self-contained unit with data and does not need to adhere to a strict schema.
- Column-Based (e.g., Cassandra, HBase): Organized by columns rather than rows. This setup can be more efficient for read and write operations for datasets that have many columns.
- Key-Value Stores (e.g., Redis, Riak): The simplest type, where every item in the database is stored as an attribute name (or 'key') with its value.
- Graph-Based (e.g., Neo4j, OrientDB): Designed to represent and store data as graphs. They are particularly suitable for relationships and connections between data points.
Best use cases for NoSQL:
While NoSQL databases are versatile, they excel in specific scenarios:
- Large-scale, high-speed, dynamic and semi-structured data: Due to their scalability and schema flexibility.
- Real-time analytics and high-speed caching.
- Mobile apps, content management, and IoT applications: Because of variable data structures and real-time requirements.
Limitations of NoSQL in data engineering:
- Not all NoSQL databases support ACID (Atomicity, Consistency, Isolation, Durability) transactions.
- Some NoSQL databases might have limited querying capabilities compared to SQL databases.
- The diverse landscape of NoSQL databases means there's a steeper learning curve since there's no one-size-fits-all approach.
CAP Theorem:
CAP theorem is fundamental to understanding NoSQL databases. It states that it's impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
- Consistency: Every read receives the most recent write.
- Availability: Every request (read or write) receives a response without the guarantee that it contains the most recent version.
- Partition Tolerance: The system continues to function even when network partitions occur.
Depending on the use case, some NoSQL databases might prioritize two of these properties over the third.
An example of NoSQL in Python:
A simple way to illustrate NoSQL in Python is by using MongoDB, a popular document-based NoSQL database, along with the pymongo library. Let's demonstrate the creation, insertion, querying, and updating operations in MongoDB:
Unlike SQL databases, where you'd typically define a schema first, in MongoDB you can simply insert documents. If the database or the collection doesn't exist, MongoDB will create it for you.The flexibility of NoSQL databases like MongoDB is evident in how you can easily nest dictionaries (like the "address" field) without needing to create a separate table or relationship.
Before running this code, you need to have MongoDB installed and running on your machine. Installation instructions for MongoDB can be found here.
Here are the commands for MacOS:
xcode-select --install
brew tap mongodb/brew
brew update
brew install mongodb-community@7.0
brew services start mongodb/brew/mongodb-community
That final services start
command will run MongoDB in the background as a service, and we can now tap into that in our code:
# First ensure MongeDB is installed and running on port 27017 (the default port)
# Ensure pymongo is installed:
# pip install pymongo
import pymongo
# Connect to the local instance of MongoDB
client = pymongo.MongoClient("mongodb://localhost:27017/")
# Create a new database called "advancedDB"
db = client["advancedDB"]
# Create a new collection called "users"
users = db["users"]
# Insert users with different attributes (heterogeneous data formats)
dagster_data = {
"name": "Dagster",
"founded": 2018,
"skills": ["python", "data engineering", "data orchestration"],
"address": {
"street": "548 Market St Ste 50093",
"city": "San Francisco",
"zipcode": "94104"
}
}
dbt_data = {
"name": "dbt",
"founded": 2016,
"expertise": ["SQL", "data transformation"],
"email": "support@getdbt.com"
}
mongodb_data = {
"name": "MongoDB",
"projects": ["NoSQL", "data storage"],
"phone": "1 (844) 666-4632"
}
users.insert_many([dagster_data, dbt_data, mongodb_data])
# Query the database for users with specific attributes
# Find users with an "email" attribute
email_users = users.find({"email": {"$exists": True}})
print("Users with an email:")
for user in email_users:
print(user)
# Find users with a "skills" attribute containing "python"
python_skilled_users = users.find({"skills": "python"})
print("\nUsers skilled in Python:")
for user in python_skilled_users:
print(user)
# Clean up (delete the database)
client.drop_database("advancedDB")
Note that unlike a traditional SQL approach, we are entering heterogeneous data per record.
This will yield:
Users with an email:
{'_id': ObjectId('64fccc45d30491489f1c7158'), 'name': 'dbt', 'founded': 2016, 'expertise': ['SQL', 'data transformation'], 'email': 'support@getdbt.com'}
Users skilled in Python:
{'_id': ObjectId('64fccc45d30491489f1c7157'), 'name': 'Dagster', 'founded': 2018, 'skills': ['python', 'data engineering', 'data orchestration'], 'address': {'street': '548 Market St Ste 50093', 'city': 'San Francisco', 'zipcode': '94104'}}
Naturally this is a very simple example, but hopefully illustrative, of the flexibility of NoSQL tools in data engineering.