Dagster Data Engineering Glossary:
Data Tokenizing
Tokenization definition:
Tokenization is the process of breaking a piece of text into smaller units called tokens, typically words and sometimes punctuation marks or subwords. It is a common step in data engineering for preparing text data for analysis.
Tokenizing example using Python:
Here are some practical examples of tokenization in data engineering using Python. The split() and regular expression examples use only the standard library; the NLTK example requires the nltk package to be installed in your Python environment:
Using the split() method: The string split() method splits a string into a list of substrings based on a delimiter, such as a space or a comma. When called with no argument, it splits on any run of whitespace.
For example:
text = "This is a sample sentence."
tokens = text.split()
print(tokens)
This would output:
['This', 'is', 'a', 'sample', 'sentence.']
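When the delimiter is something other than whitespace, it can be passed to split() explicitly. The following is a minimal sketch, assuming a made-up comma-separated string (the sample values are illustrative only):

```python
# Hypothetical comma-separated record, used only for illustration.
record = "apple, banana ,cherry"

# split(",") cuts on every comma but leaves the surrounding spaces in place,
# so strip() is applied to each piece to tidy the tokens.
tokens = [piece.strip() for piece in record.split(",")]
print(tokens)
```

This should print ['apple', 'banana', 'cherry']. Without the strip() call, the stray spaces around the commas would remain part of the tokens.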
Using the word_tokenize() function from the NLTK library: The Natural Language Toolkit (NLTK) is a popular Python library for natural language processing. Its word_tokenize() function splits text into individual words and punctuation marks, so punctuation becomes a separate token instead of staying attached to a word. For example:
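import nltk
from nltk.tokenize import word_tokenize
# Download the Punkt tokenizer data (only needed once per environment).
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')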
text = "This is a sample sentence."
tokens = word_tokenize(text)
print(tokens)
This will output:
['This', 'is', 'a', 'sample', 'sentence', '.']
Using regular expressions: Regular expressions can be used to define patterns for tokenizing text data. For example, the following code uses re.findall() to extract every run of word characters (letters, digits, and underscores) from a string, which drops whitespace and punctuation:
import re
text = "This is a sample sentence."
tokens = re.findall(r'\b\w+\b', text)
print(tokens)
This code would produce the following output in the terminal:
['This', 'is', 'a', 'sample', 'sentence']
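If punctuation should be kept as separate tokens rather than dropped, the pattern can be extended with an alternative. This is a sketch of one possible pattern, not the only way to do it:

```python
import re

text = "This is a sample sentence."

# \w+ matches a run of word characters; [^\w\s] matches a single character
# that is neither a word character nor whitespace (that is, punctuation).
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

For this sentence the result is ['This', 'is', 'a', 'sample', 'sentence', '.'], with the final period as its own token.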
Using the re.split() function with a regular expression pattern: The re.split() function from the re module splits a string wherever a regular expression pattern matches, which makes it possible to split on several delimiters at once. For example:
import re
text = "This is a sample sentence."
tokens = re.split(r'\W+', text)
print(tokens)
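This will yield the following. Note the trailing empty string, which appears because the text ends with a delimiter (the period):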
['This', 'is', 'a', 'sample', 'sentence', '']
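The empty string at the end of that output is usually unwanted. One common way to remove it, shown here as a minimal sketch, is to filter out empty tokens after splitting:

```python
import re

text = "This is a sample sentence."

# re.split() emits an empty string when the text starts or ends with a
# delimiter; the "if token" condition drops those empty strings.
tokens = [token for token in re.split(r"\W+", text) if token]
print(tokens)
```

This prints ['This', 'is', 'a', 'sample', 'sentence'].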
These are just a few examples of how tokenization can be used to prepare data for analysis and extract insights from text.
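As an illustration of what that preparation enables, the sketch below counts how often each token occurs. The helper name count_tokens and the sample text are hypothetical and chosen only for this example:

```python
import re
from collections import Counter


def count_tokens(text: str) -> Counter:
    """Lowercase the text, extract word tokens, and count each one."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


# Made-up sample text, used only for illustration.
counts = count_tokens("The cat sat. The cat slept.")
print(counts.most_common(2))
```

Lowercasing before tokenizing means "The" and "the" are counted as the same token, which is often (though not always) what a frequency analysis needs.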
Align
Clean or Cleanse
Cluster
Curate
Denoise
Denormalize
Derive
Discretize
ETL
Encode
Filter
Fragment
Homogenize
Impute
Linearize
Munge
Normalize
Reduce
Reshape
Serialize
Shred
Skew
Split
Standardize
Transform
Wrangle