Data Threading | Dagster Glossary

Data Threading

Enable concurrent execution in Python by decoupling tasks which are not sequentially dependent.

Definition of threading in a Python environment

Threading refers to the use of multiple threads of execution within a single process. Each thread runs independently while sharing the same memory space, which allows tasks to execute concurrently.
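As a minimal sketch of what sharing the same memory space means in practice, the threads below all update a single module-level counter. The names counter, counter_lock, and add_many are hypothetical and chosen only for illustration. The lock is included because an unsynchronized read-modify-write on shared state can lose updates.

```python
import threading

counter = 0                      # shared by every thread in the process
counter_lock = threading.Lock()  # guards the read-modify-write below

def add_many(n):
    """Increment the shared counter n times (hypothetical helper)."""
    global counter
    for _ in range(n):
        with counter_lock:
            counter += 1

# Four threads, all writing to the same variable.
threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

Every thread sees and modifies the same counter object, which is convenient but also the reason shared state usually needs some form of synchronization.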

Threading can make a program more efficient, particularly for I/O-bound tasks, where threads spend most of their time waiting. Note, however, that because of Python's Global Interpreter Lock (GIL), threads do not achieve true parallelism for CPU-bound tasks.

Python's threading module is built on top of the lower-level _thread module, which in CPython is implemented in C and wraps the operating system's native thread support.

Threading is a technique for decoupling tasks that are not sequentially dependent. It is well suited to I/O-bound work, such as making multiple requests to an API or performing read/write operations on a database.
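The sketch below illustrates the I/O-bound case using ThreadPoolExecutor from the standard library's concurrent.futures module. The fake_request function is a hypothetical stand-in that sleeps instead of calling a real API, so the example runs without a network connection. Treat it as an illustration of the pattern rather than a recommended client implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(request_id):
    """Hypothetical stand-in for an I/O-bound call; sleeps instead of using a network."""
    time.sleep(0.2)
    return f"response-{request_id}"

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() yields results in input order, regardless of which call finishes first.
    responses = list(pool.map(fake_request, range(5)))
elapsed = time.perf_counter() - start

print(responses)
print(f"elapsed: {elapsed:.2f}s")
```

Because the five calls wait at the same time rather than one after another, the total elapsed time should be much closer to a single 0.2 second wait than to the full second a sequential loop would take.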

Example of threading in Python

Here's a very simple example to illustrate the use of threading in Python:

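import threading
import time

def print_numbers():
    # Print 0-9 on one line, pausing briefly after each digit.
    for i in range(10):
        print(i, flush=True, end="")
        time.sleep(0.1)

def print_letters():
    # Print a-j on one line, pausing briefly after each letter.
    for letter in 'abcdefghij':
        print(letter, flush=True, end="")
        time.sleep(0.1)

# Create one thread per task; nothing runs until start() is called.
t1 = threading.Thread(target=print_numbers)
t2 = threading.Thread(target=print_letters)

t1.start()
t2.start()

# Block the main thread until both worker threads have finished.
t1.join()
t2.join()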

In this code, two threads are created: one that prints the numbers from 0 to 9, and another that prints the letters 'a' to 'j'. These threads are then started with the start method, and the program waits for both threads to finish with the join method. The result is that numbers and letters are printed out concurrently.

Your output will look something like this:

0a1b2c3d4e5f6g7h8i9j

The flush=True argument to the print() function ensures that each character is written immediately, and time.sleep(0.1) introduces a small delay that makes the overlapping execution of the threads easy to observe. The neat alternation of numbers and letters is an artifact of those delays: the order in which threads execute is not guaranteed. The thread scheduler determines the order, and it can vary every time you run the program. Feel free to remove the time.sleep(0.1) calls and experiment.

When considering the use of threading in your pipeline design, you should also research multiprocessing and understand the tradeoffs between the two approaches. Python's Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time in a single process, which can be a limiting factor for CPU-bound tasks. This is one of the reasons why Python's threading module isn't suitable for tasks that require heavy CPU computation. For these tasks, multiprocessing is generally a better option.
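For comparison, here is a minimal sketch of handing CPU-bound work to separate processes with ProcessPoolExecutor from concurrent.futures. The count_primes and count_in_processes functions are hypothetical examples, and the input sizes are arbitrary. Whether this is actually faster depends on the workload and on the cost of starting processes and passing data between them.

```python
from concurrent.futures import ProcessPoolExecutor

def count_primes(limit):
    """CPU-bound work: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def count_in_processes(limits):
    """Run count_primes for each limit in a pool of worker processes."""
    with ProcessPoolExecutor() as pool:
        return list(pool.map(count_primes, limits))

if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this module.
    print(count_in_processes([20_000, 20_000, 20_000, 20_000]))
```

Each worker process has its own interpreter and its own GIL, so the four calculations can run on separate CPU cores at the same time.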

