Type Hinting in Python

The following article is part of a series on Python for data engineering aimed at helping data engineers, data scientists, data analysts, Machine Learning engineers, or others who are new to Python master the basics. To date, this beginner's guide consists of:

Part 1: Python Packages: a Primer for Data People (part 1 of 2), explored the basics of Python modules, Python packages and how to import modules into your own projects.
Part 2: Python Packages: a Primer for Data People (part 2 of 2), covered dependency management and virtual environments.
Part 3: Best Practices in Structuring Python Projects, covered 9 best practices and examples on structuring your projects.
Part 4: From Python Projects to Dagster Pipelines, we explore setting up a Dagster project, and the key concept of Data Assets.
Part 5: Environment Variables in Python, we cover the importance of environment variables and how to use them.
Part 6: Type Hinting, or how type hints reduce errors.
Part 7: Factory Patterns, or learning design patterns, which are reusable solutions to common problems in software design.
Part 8: Write-Audit-Publish in data pipelines, a design pattern frequently used in ETL to ensure data quality and reliability.
Part 9: CI/CD and Data Pipeline Automation (with Git), learn to automate data pipelines and deployments with Git.
Part 10: High-performance Python for Data Engineering, learn how to code data pipelines in Python for performance.
Part 11: Breaking Packages in Python, in which we explore the sharp edges of Python’s system of imports, modules, and packages.


One of the powerful tools Python provides to promote clear and reliable code is the concept of 'type hints'. You might wonder, "Python is a dynamically-typed language, so why should I bother with types?"

As a data engineer or a Python beginner interested in coding best practices, understanding and applying type hints in your Python code can be a real asset.

In this article, we will delve deeper into type hints, their applications, and their benefits in Python programming. As Dagster is a type-annotated framework, we’ll also explain how types can be used in data engineering pipelines to improve their readability and make them less error-prone. It's like providing a map to your future self and other developers who may interact with your code - a map that details the types of data flowing in and out of your functions and classes.

What is dynamic typing?
Basic type hints
Built-in types in Python
Function annotations
Complex types
User-defined types
Generics
Type checking with pyright
Type hints and docstrings
Appendix: Common Python types

What is dynamic typing?

Python is a dynamically-typed language. In statically-typed languages like Java or C++, you have to declare the type of variables before using them. For example, you need to specify whether a variable is an integer, a float, a string, etc. In Python, you can code without giving a second thought to data types until runtime, which is one of the features that make Python particularly beginner-friendly.

For example, you can declare a variable and directly assign a value to it without specifying its type, hence the term 'dynamically-typed'. The Python interpreter implicitly binds the value and its type at runtime.

x = 10  # x is an integer
x = "Hello"  # now x is a string

In the first line, x is an integer. In the second line, the same x becomes a string. Python handles this transition seamlessly thanks to its dynamic typing nature.

However, this dynamic nature can also lead to bugs that are difficult to debug, especially in large codebases or complex data processing pipelines, where the data flow might not be immediately obvious.
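As a minimal sketch of the kind of bug described above, consider a hypothetical double function (not part of any library) that is meant to receive a number but is handed a string, as can easily happen when values are read from a CSV file:

```python
def double(value):
    # No type hint: nothing signals that a number is expected.
    return value * 2

print(double(21))    # 42
print(double("21"))  # 2121 (string repetition, no error raised)
```

The second call does not crash. It quietly produces the wrong result, which is what makes this class of bug hard to trace in a long pipeline.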

Type hints, introduced in Python 3.5 as a part of the standard library via PEP 484, allow you to specify the expected type of a variable, function parameter, or return value.

Why use type hints?

While dynamic typing offers flexibility, it also creates room for potential bugs. Here's where type hints come in. They can significantly enhance code readability and prevent type-related errors.

Improved code readability: Type hints act as a form of documentation that helps developers understand the types of arguments a function expects and what it returns. This enhanced clarity makes the code more readable and easier to understand.

Error detection: Tools like pyright and mypy can be used to statically analyze your Python code. They check the consistency of types in your code based on the type hints and alert you about type-related errors before runtime. Learn why the Dagster team recommends skipping mypy entirely and just using pyright.

Better IDE support: Many Integrated Development Environments (IDEs) and linters can utilize type hints to provide better code completion, error checking, and refactoring.

Facilitates large-scale projects: For larger projects with multiple developers, type hints can be very beneficial in understanding the structure and flow of data throughout the codebase. We’ve published a guide on how to include and maintain type annotations for public Python projects.

Limitations

Not enforced at runtime: Python's type hints are not enforced but are merely hints, and the Python interpreter will not raise errors if the provided types do not match the actual values. This might lead to a misconception that type hints can enforce type safety, which they cannot.

Over-complicated: For small or simple scripts, type hints might seem like overkill, and could complicate code that should be straightforward and simple.

Not flexible: One of the reasons for Python's popularity is its dynamic nature and type hints can restrict this.
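To illustrate the first limitation, here is a small sketch using a hypothetical add_one function. The hint says int, but Python runs the call with a float without complaint and only stores the hints as metadata:

```python
def add_one(n: int) -> int:
    return n + 1

# The interpreter does not check the argument against the hint.
print(add_one(1.5))             # 2.5
# The hints are stored on the function but never enforced.
print(add_one.__annotations__)  # {'n': <class 'int'>, 'return': <class 'int'>}
```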

Basic type hints

Python's typing module contains several functions and classes that are used to provide type hints for your Python code. Here's how you can apply type hints in different scenarios.

Declare types for variables

To provide type hints for variables, you can use the colon : symbol followed by the type. Here's an example:

age: int = 20
name: str = "Alice"
is_active: bool = True

Here, age is hinted as an integer, name as a string, and is_active as a boolean.

Function annotations

You can provide type hints for function parameters and return values. This helps other developers understand what types of arguments are expected by the function and what type the function returns.

def greet(name: str) -> str:
    return f"Hello, {name}"

In this example, the function greet expects name to be a string and will return a string.

Built-in types in Python

Python has several built-in types. The most commonly used are:

int: Represents an integer
float: Represents a floating-point number
bool: Represents a boolean value (True or False)
str: Represents a string

There are also complex types such as lists, tuples, and dictionaries that can be used to provide more detailed type hints that we will look at later on.

You will also find a list of Python's main types in the appendix.

Atomic vs. composite types

In Python, there is a distinction between atomic and composite types when it comes to type hinting. Atomic types, such as int, float, and str, are simple and indivisible, and their type annotations can be provided directly using the type itself, like str.

def my_function(my_string: str) -> int:
    return len(my_string)

On the other hand, composite types like List and Dict are composed of other types, and before Python 3.9, they often required importing specific definitions from the typing module, such as typing.List[int] for a list of integers.

from typing import List

def my_function(numbers: List[int]) -> int:
    return sum(numbers)

In Python 3.9 and later, you can use the built-in types directly and write list[int] instead of typing.List[int].
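As a sketch, the earlier function can be written without any import on Python 3.9 or later. The mean_by_key helper is a hypothetical addition that shows the same idea with a dictionary:

```python
def my_function(numbers: list[int]) -> int:
    return sum(numbers)

def mean_by_key(readings: dict[str, list[float]]) -> dict[str, float]:
    # Average each list of readings; keys with empty lists are skipped.
    return {key: sum(vals) / len(vals) for key, vals in readings.items() if vals}

print(my_function([1, 2, 3]))                      # 6
print(mean_by_key({"a": [1.0, 3.0], "b": []}))     # {'a': 2.0}
```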

Function annotations

Type hints can be particularly useful when incorporated into function signatures. This not only allows developers to understand what types of arguments a function expects but also gives them an idea of what the function will return.

How to specify argument types and return type of a function

You can specify the types of arguments and the return type of a function using the : symbol for the arguments and the -> symbol for the return type. Here's the general syntax:

def function_name(arg1: type1, arg2: type2, ...) -> return_type:
    # function body

In this syntax, arg1, arg2, etc. are the function arguments, and type1, type2, etc. are the types of these arguments. return_type is the type of value the function returns.

Examples of using type hints in function signatures

Let's consider a function that calculates the area of a rectangle:

def area_rectangle(length: float, breadth: float) -> float:
    return length * breadth

In this function, length and breadth are expected to be floats, and the function also returns a float. The function will still run if you pass integers, because Python does not check the hints, but passing two strings raises a TypeError at runtime because strings cannot be multiplied together. The type hint makes it clear that the function is designed to handle floating-point numbers.

Another example can be a function that accepts a list of integers and returns their sum as an integer:

def sum_elements(numbers: list[int]) -> int:
    return sum(numbers)

In this example, the numbers parameter is hinted as a list of integers, and the return type is an integer.

Note that these type hints do not enforce type checking at runtime. They are hints for developers, and Python will not raise a TypeError if the actual types do not match the specified types.
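If a function must reject bad input at runtime, the check has to be written explicitly. The following is one possible sketch, reusing sum_elements from above with a hand-written isinstance guard. The guard and its error message are illustrative assumptions, not something type hints provide:

```python
def sum_elements(numbers: list[int]) -> int:
    # Type hints alone do nothing at runtime, so validate by hand.
    if not all(isinstance(n, int) for n in numbers):
        raise TypeError("numbers must contain only integers")
    return sum(numbers)

print(sum_elements([1, 2, 3]))  # 6
```

Calling sum_elements([1, "2"]) now raises the TypeError from the guard instead of failing somewhere inside sum.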

Complex types

The typing module in Python provides several classes that can be used to provide more complex type hints. Below are some of the most commonly used classes:

List, dict, tuple, set

The list, dict, tuple, and set classes can be used to provide type hints for lists, dictionaries, tuples, and sets respectively. They can be parameterized to provide even more detailed type hints.

# A list of integers
numbers: list[int] = [1, 2, 3]

# A dictionary with string keys and float values
weights: dict[str, float] = {"apple": 0.182, "banana": 0.120}

# A tuple with an integer and a string
student: tuple[int, str] = (1, "John")

# A set of strings
flags: set[str] = {"apple", "banana", "cherry"}

In these examples, numbers is hinted as a list of integers, weights is a dictionary with string keys and float values, student is a tuple with an integer and a string, and flags is a set of strings.
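Parameterized types can also be nested, and a tuple of arbitrary length can be written with an ellipsis. A short sketch with made-up data (the names scores, grades, and best_course are hypothetical):

```python
# A variable-length tuple in which every element is an int
scores: tuple[int, ...] = (90, 85, 77)

# A dictionary mapping a student name to a list of (course, grade) pairs
grades: dict[str, list[tuple[str, float]]] = {
    "John": [("math", 3.7), ("physics", 3.3)],
}

def best_course(records: list[tuple[str, float]]) -> str:
    # Return the course name with the highest grade.
    return max(records, key=lambda record: record[1])[0]

print(best_course(grades["John"]))  # math
```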

Optional

The Optional type hint can be used to indicate that a variable can be either a specific type or None.

from typing import Optional

def find_student(student_id: int) -> Optional[dict[str, str]]:
    # If the student is found, return a dictionary containing their data
    # If the student is not found, return None
    ...
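A minimal runnable sketch of this pattern, assuming a hypothetical in-memory STUDENTS dictionary in place of a real data store. The caller has to handle the None case before using the result:

```python
from typing import Optional

# Hypothetical in-memory data, for illustration only.
STUDENTS: dict[int, dict[str, str]] = {1: {"name": "John", "major": "Physics"}}

def find_student(student_id: int) -> Optional[dict[str, str]]:
    # dict.get returns None when the key is missing.
    return STUDENTS.get(student_id)

student = find_student(2)
if student is None:
    print("Student not found")
else:
    print(student["name"])
```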

Union

The Union type hint is used to indicate that a variable can be one of several types. For example, if a variable can be either a str or an int, you can provide a type hint like this:

from typing import Union

def process(data: Union[str, int]) -> None:
    # This function can handle either a string or an integer
    ...

In newer versions of Python, you can use the pipe (|) operator to indicate a type that can be one of several options, replacing the need for Union:

def process(data: str | int) -> None:
    # This function can handle either a string or an integer
    ...

Any

The Any type is used to indicate that a variable can be of any type. To a type checker, this is effectively the same as not providing a type hint at all, although it makes the intent explicit.

from typing import Any

def process(data: Any) -> None:
    # This function can handle data of any type
    ...
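One consequence is worth seeing in code: because Any switches off checking, mistakes involving an Any value only surface at runtime. A sketch with a hypothetical length function:

```python
from typing import Any

def length(data: Any) -> int:
    # A type checker accepts this for any argument, since data is Any.
    return len(data)

print(length([1, 2, 3]))  # 3

try:
    length(5)  # passes static checking, fails at runtime
except TypeError as exc:
    print(f"Runtime error: {exc}")
```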

These tools from the typing module can help you provide detailed type hints that make your code easier to understand and debug.

However, remember that Python's type hints are optional and not enforced at runtime. They are intended as a tool for developers, not a way to enforce type safety.

User-Defined types

In Python, you can define your own types using classes, which is the fundamental mechanism to create custom types. You can use these classes in type hints just like you'd use the built-in types. The typing module also provides additional tools for creating more specific types, including Type and NewType.

Defining your own types using classes

You can create a class and use it as a type hint. Here's an example:

class Student:
    def __init__(self, name: str, age: int):
        self.name = name
        self.age = age

def print_student_details(student: Student) -> None:
    print(student.name, student.age)

Student is a user-defined type, and it's used as a type hint in the print_student_details function.

Using `Type` for type hinting

The Type class from the typing module can be used to indicate that a variable will be a class, not an instance of a class. This is commonly used when a function argument is expected to be a class, for example in factory functions.

from typing import Type

def create_student(cls: Type[Student], name: str, age: int) -> Student:
    return cls(name, age)

In this example, create_student expects a Student class (or a subclass of Student) as its first argument.
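For instance, a subclass can be passed to the same factory. The GraduateStudent class below is a hypothetical addition for illustration:

```python
from typing import Type

class Student:
    def __init__(self, name: str, age: int):
        self.name = name
        self.age = age

class GraduateStudent(Student):
    pass

def create_student(cls: Type[Student], name: str, age: int) -> Student:
    return cls(name, age)

grad = create_student(GraduateStudent, "Alice", 24)
print(type(grad).__name__)  # GraduateStudent
```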

Using `NewType` to create distinct types

NewType is used to create distinct types. It's useful when you want to distinguish between two types that would otherwise be the same.

For example, let's say you're dealing with student IDs and course IDs in your program, and you want to make sure you don't mix them up. Both are represented as integers, so you can use NewType to create two distinct types:

from typing import NewType

StudentID = NewType('StudentID', int)
CourseID = NewType('CourseID', int)

def get_student(student_id: StudentID) -> None:
    # Fetch student data...
    ...

def enroll_in_course(student_id: StudentID, course_id: CourseID) -> None:
    # Enroll the student in the course...
    ...

Even though StudentID and CourseID are both integers, they are considered distinct types and cannot be used interchangeably. However, remember that this check is not enforced at runtime, but during static type checking using tools like pyright or mypy.
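A short sketch of what this looks like in practice, with a return value added to enroll_in_course for illustration. At runtime the NewType wrapper simply returns the value it was given, so the distinction exists only for the type checker:

```python
from typing import NewType

StudentID = NewType('StudentID', int)
CourseID = NewType('CourseID', int)

def enroll_in_course(student_id: StudentID, course_id: CourseID) -> str:
    return f"student {student_id} enrolled in course {course_id}"

sid = StudentID(42)
cid = CourseID(7)

print(enroll_in_course(sid, cid))  # student 42 enrolled in course 7
print(type(sid))                   # <class 'int'>

# Swapping the arguments still runs, but a static type checker
# is expected to report it as an error:
# enroll_in_course(cid, sid)
```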

Generics

Generics allow you to define a function, class, or data structure that works with different types. The Generic class and the TypeVar function from the typing module are used to define generic types. For example, a list is a generic data structure because it can contain elements of any type.

TypeVar

TypeVar is used to define a type variable, which can be any type, and the specific type is determined by the client code. Here's an example:

from typing import TypeVar

T = TypeVar('T')

def first_element(lst: list[T]) -> T:
    return lst[0]

Here, T is a type variable that can be any type. The first_element function works with a list of any type and returns an element of that type. The specific type of T would be determined by the list you pass to the function.
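As a usage sketch (written with the built-in list[T] form, which assumes Python 3.9 or later), the same function serves lists of different element types:

```python
from typing import TypeVar

T = TypeVar('T')

def first_element(lst: list[T]) -> T:
    return lst[0]

number = first_element([1, 2, 3])  # a type checker infers int
word = first_element(["a", "b"])   # a type checker infers str
print(number, word)  # 1 a
```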

Generic

Generic is used to define generic classes. A generic class can be initialized with a variety of types, and those types are used in type hints within the class.

from typing import Generic, TypeVar

T = TypeVar('T')

class Box(Generic[T]):
    def __init__(self, value: T):
        self.value = value

    def get(self) -> T:
        return self.value

Here, Box is a generic class that works with any type T. When you create an instance of Box, you can specify the type of T, and that type is used in the value attribute and the get method.

box1 = Box[int](10)
box2 = Box[str]("Hello")

box1 is a Box that contains an integer, and box2 is a Box that contains a string.
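A type checker uses the type parameter to work out what get returns. A brief sketch, repeating the class so the snippet is self-contained:

```python
from typing import Generic, TypeVar

T = TypeVar('T')

class Box(Generic[T]):
    def __init__(self, value: T):
        self.value = value

    def get(self) -> T:
        return self.value

box1 = Box[int](10)
box2 = Box("Hello")  # T is inferred as str from the argument

print(box1.get() + 1)      # 11
print(box2.get().upper())  # HELLO
```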

Type checking with `pyright`

A type checker like pyright is a tool that checks your code against its type hints. At Dagster, we really like pyright because it is faster than alternatives such as mypy.

Python itself is a dynamically-typed language, which means type checks happen at runtime and it does not enforce type hinting rules. If you try to perform an operation that's not supported for a given data type, Python will raise an error during runtime. For example, calling an undefined method on an object will only trigger an error during runtime.

However, when developing large or complex systems, enforcing type consistency can help catch potential bugs early. pyright performs static type checking, meaning it checks the types of your variables, function arguments, and return values before the code is actually run. It uses the type hints you've provided in your code to do this. It's important to understand that pyright does not execute or run your code; it simply reads and analyzes it.

How to use a type checker to verify your types

To use pyright, you first need to install it:

pip install pyright

Then, to check a Python file, you run pyright with the file as an argument:

pyright my_file.py

Pyright will then analyze the file and report any type errors it finds.

For example, if you have a function that's annotated to receive a str as an argument and you pass an int, pyright will catch this.
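As a sketch, a file like the following (a hypothetical my_file.py) runs without error, yet contains the kind of mismatch a static type checker is designed to report:

```python
def greet(name: str) -> str:
    return f"Hello, {name}"

# Runs fine at runtime, but the argument is an int where a str is
# expected, so a static type checker should flag this call.
message = greet(42)
print(message)  # Hello, 42
```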

Static vs dynamic type checking

Static type checking is the process of verifying the type safety of a program based on analysis of a program's text (source code). Static type checking is done at compile-time (before the program is run). Languages that enforce static type checking include C++, Java, and Rust.

Dynamic type checking, on the other hand, is the process of verifying the type safety of a program at runtime. Dynamic type checking occurs while the program is running. Languages that use dynamic type checking include Python, Ruby, and JavaScript.

In static type checking, types are checked before the program runs, which makes it easier to catch and prevent type errors. This makes the program safer to run, as most type-related bugs have been caught at compile-time. However, it also requires the programmer to explicitly declare the types of all variables and function return values, which can be seen as reducing flexibility.

Dynamic type checking provides more flexibility, as you don't have to explicitly declare the type of every variable. However, this also means that type errors can occur at runtime, which could potentially cause the program to crash.

Python is a dynamically-typed language, but it also supports optional static type checking through tools like pyright and type hints. This provides Python programmers with a unique flexibility, allowing them to choose when they want the safety of static type checking and when they prefer the flexibility of dynamic typing.

Type hints and docstrings

Type hints, as we've discussed, indicate the types of variables, function parameters, and return values. They can help other developers understand what types of data your function expects and what it will return.

Docstrings, on the other hand, are used to provide a description of what your function, class, or module does. A docstring can include a description of the function's purpose, its arguments, its return value, and any exceptions it may raise.

Here's an example of how you can use type hints and docstrings together:

def filter_and_sort_products(products: list[dict[str, int]], attribute: str, min_value: int) -> list[dict[str, int]]:
    """
    Filters a list of products by a given attribute and minimum value, and then sorts the filtered products by the attribute.

    Args:
        products (list[dict[str, int]]): A list of products represented as dictionaries.
        attribute (str): The attribute to filter and sort by.
        min_value (int): The minimum acceptable value of the specified attribute.

    Returns:
        list[dict[str, int]]: A list of filtered and sorted products.

    Raises:
        KeyError: If the specified attribute is missing from any product.

    Examples:
        >>> products = [{"id": 1, "price": 10}, {"id": 2, "price": 5}]
        >>> filter_and_sort_products(products, "price", 6)
        [{'id': 1, 'price': 10}]
    """
    filtered_products = [product for product in products if product[attribute] >= min_value]
    return sorted(filtered_products, key=lambda x: x[attribute])

Here, the function signature shows that the function takes a list of dictionaries representing products, a string representing an attribute, and an integer representing a minimum value. It returns a list of filtered and sorted dictionaries.

The docstring explains the purpose of the function, its parameters, return value, and possible exceptions (such as a KeyError if the given attribute is not present), and includes an example of how to call the function.

This combination of type hints and docstrings can greatly improve the readability and maintainability of your code.

Conclusion

Building on Python programming best practices, we’ve looked at how type hints improve the readability and maintainability of your code.

If you have any questions or need further clarification, feel free to join the Dagster Slack and ask the community for help. Thank you for reading!

Our next article builds on these data engineering concepts and explores how Factory Patterns help you automate steps in your pipeline.

Appendix: Python types

To summarize, here are the most common types used in Python. You might also come across or utilize custom data types from external libraries or those defined by other developers.

Numeric Types:
- int: Integer, e.g., 5, -3, 42
- float: Floating-point number, e.g., 3.14, -0.001, 2.71
- complex: Complex number, e.g., 3+4j, 2-5j
Text Type:
- str: String, e.g., "Hello, World!", 'Python'
Sequence Types:
- list: List, e.g., [1, 2, 3], ["a", "b", "c"]
- tuple: Tuple, e.g., (1, 2, 3), ("a", "b", "c")
- range: Range, e.g., range(5), range(0, 5, 2)
Mapping Type:
- dict: Dictionary, e.g., {"name": "John", "age": 30}, {1: "one", 2: "two"}
Set Types:
- set: Set, e.g., {1, 2, 3}, {"apple", "banana", "cherry"}
- frozenset: Immutable set, created using frozenset()
Boolean Type:
- bool: Boolean, e.g., True, False
Binary Types:
- bytes: Immutable sequence of bytes, e.g., b'hello', bytes([65, 66, 67])
- bytearray: Mutable sequence of bytes, e.g., bytearray([65, 66, 67])
- memoryview: Memory view object, created using memoryview()
None Type:
- NoneType: Represents the absence of a value or a null value. The only value it can have is None.

Python also has many built-in modules that offer additional data types, and you can also define custom data types using classes. Some of the other noteworthy types/modules include:

datetime Module:
- datetime.date: Represents a date.
- datetime.datetime: Represents date and time.
- datetime.time: Represents time.
- datetime.timedelta: Represents a duration or difference between two dates/times.
- datetime.tzinfo: Base for dealing with time zones.
collections Module:
- collections.namedtuple: Returns a new tuple subclass named 'typename'.
- collections.deque: Double-ended queue.
- collections.Counter: Dict subclass for counting hashable objects.
- collections.OrderedDict: Dict subclass that remembers the order entries were added.
- collections.defaultdict: Dict subclass that calls a factory function to supply missing values.
array Module:
- array.array: A space-efficient array with type specification.
struct Module:
- Used for packing and unpacking binary data.
json Module:
- Helps in encoding and decoding JSON data.
enum Module:
- enum.Enum: Base class for creating enumerated constants.
- enum.IntEnum: Base class for creating enumerated constants that are also subclasses of int.

