Dagster Data Engineering Glossary:
Data Fragmentation
Data fragmentation definition:
Data fragmentation is the practice of splitting a dataset into smaller pieces, or fragments, that are stored and managed separately. In modern data engineering, fragmentation is used to spread processing and storage across multiple workers or machines, which can improve data availability and scalability.
Fragmentation can be applied to various types of data, such as files, databases, or objects, and can be done in various ways, such as horizontal fragmentation (splitting data by rows), vertical fragmentation (splitting data by columns), or hybrid fragmentation (a combination of horizontal and vertical fragmentation).
Data fragmentation example using Python:
Note that the pandas library must be installed in your Python environment to run the following code example.
Here's a self-contained Python example that demonstrates horizontal fragmentation of a pandas DataFrame:
import pandas as pd
# Create a sample DataFrame with 10 rows and 3 columns
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'B': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],
                   'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})
# Split the DataFrame into two fragments: one with rows 0 to 4 and one with rows 5 to 9
df1 = df.iloc[:5]
df2 = df.iloc[5:]
# Print the original DataFrame and the two fragments
print('Original DataFrame:')
print(df)
print('Fragment 1:')
print(df1)
print('Fragment 2:')
print(df2)
This will yield the following output:
Original DataFrame:
    A  B    C
0   1  a  0.1
1   2  b  0.2
2   3  c  0.3
3   4  d  0.4
4   5  e  0.5
5   6  f  0.6
6   7  g  0.7
7   8  h  0.8
8   9  i  0.9
9  10  j  1.0
Fragment 1:
   A  B    C
0  1  a  0.1
1  2  b  0.2
2  3  c  0.3
3  4  d  0.4
4  5  e  0.5
Fragment 2:
    A  B    C
5   6  f  0.6
6   7  g  0.7
7   8  h  0.8
8   9  i  0.9
9  10  j  1.0
In this example, we create a sample DataFrame with 10 rows and 3 columns. We then split it into two fragments using the pandas iloc indexer, which selects rows by integer position: df1 contains rows 0 to 4, and df2 contains rows 5 to 9. Finally, we print the original DataFrame and the two fragments to verify that the data has been fragmented correctly. This is horizontal fragmentation, because the data is split by rows.
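The definition above also mentions vertical and hybrid fragmentation. The following is a minimal sketch of both, not a definitive implementation. It assumes a small hypothetical DataFrame with an `id` key column that is repeated in every vertical fragment so the original rows can be rebuilt with a join; the column names and split points are chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data; "id" acts as the key shared by all fragments.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["a", "b", "c", "d"],
    "score": [0.1, 0.2, 0.3, 0.4],
})

# Vertical fragmentation: split by columns, keeping the key in each fragment.
names = df[["id", "name"]]
scores = df[["id", "score"]]

# Hybrid fragmentation: take a vertical fragment and split it again by rows.
scores_first = scores.iloc[:2]
scores_second = scores.iloc[2:]

# Reconstruction: stack the row fragments, then join the column fragments on the key.
rebuilt = names.merge(pd.concat([scores_first, scores_second]), on="id")

# Check that the fragments carry exactly the original data.
print(rebuilt.equals(df))
```

Keeping the key column in every vertical fragment costs some extra storage, but without it there would be no reliable way to line the fragments back up.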
In practice, data fragmentation can be used in a variety of data engineering scenarios, such as distributed computing, parallel processing, or database sharding, to name a few. By breaking down data into smaller, more manageable fragments, we can distribute data processing and storage more efficiently, reduce data access times, and improve overall system performance and scalability.
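To make the sharding and parallel-processing scenario more concrete, here is a rough sketch using only the Python standard library. It assumes hypothetical `(user_id, amount)` records and a shard count of 3, and it uses a CRC32 hash to assign rows to shards; real systems may choose a different partitioning scheme. The thread pool merely stands in for separate workers or nodes.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3  # hypothetical shard count, chosen for illustration


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


# Hypothetical records: (user_id, amount)
records = [
    ("alice", 10), ("bob", 5), ("carol", 7),
    ("alice", 3), ("dave", 1), ("bob", 4),
]

# Horizontal fragmentation by key: every row for a given user lands in one shard.
shards = [[] for _ in range(NUM_SHARDS)]
for user_id, amount in records:
    shards[shard_for(user_id)].append((user_id, amount))


def total_per_user(shard):
    """Aggregate a single shard without looking at any other shard."""
    totals = {}
    for user_id, amount in shard:
        totals[user_id] = totals.get(user_id, 0) + amount
    return totals


# Process the shards independently, one task per shard.
with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
    partials = list(pool.map(total_per_user, shards))

# Merging is a plain union because no user appears in more than one shard.
totals = {}
for partial in partials:
    totals.update(partial)

print(totals)
```

Hashing on the key rather than splitting by position means each per-user total can be computed inside a single shard, so the final merge needs no further aggregation.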