March 21, 2023 • 5 minute read

Best Practices in Structuring Python Projects

We cover 9 best practices and examples on structuring your Python projects for collaboration and productivity.

Name: Elliot Gunn
Handle: @elliot

The following article is part of a series on Python for data engineering aimed at helping data engineers, data scientists, data analysts, Machine Learning engineers, or others who are new to Python master the basics. To date, this beginner's guide consists of:

Part 1: Python Packages: a Primer for Data People (part 1 of 2), explored the basics of Python modules, Python packages and how to import modules into your own projects.
Part 2: Python Packages: a Primer for Data People (part 2 of 2), covered dependency management and virtual environments.
Part 3: Best Practices in Structuring Python Projects, covered 9 best practices and examples on structuring your projects.
Part 4: From Python Projects to Dagster Pipelines, we explore setting up a Dagster project, and the key concept of Data Assets.
Part 5: Environment Variables in Python, we cover the importance of environment variables and how to use them.
Part 6: Type Hinting, or how type hints reduce errors.
Part 7: Factory Patterns, or learning design patterns, which are reusable solutions to common problems in software design.
Part 8: Write-Audit-Publish in data pipelines, a design pattern frequently used in ETL to ensure data quality and reliability.
Part 9: CI/CD and Data Pipeline Automation (with Git), learn to automate data pipelines and deployments with Git.
Part 10: High-performance Python for Data Engineering, learn how to code data pipelines in Python for performance.
Part 11: Breaking Packages in Python, in which we explore the sharp edges of Python’s system of imports, modules, and packages.

Python is a versatile and widely adopted programming language used for everything from data analysis to web development. However, as Python projects grow in complexity, it can become challenging to keep track of all the moving parts and ensure that everything stays organized.

That's where having an understanding of Python file and project directory structures comes in. In this article, we'll review some key concepts in structuring Python projects and how to best apply them.

If you're just starting out, these best practices will not only help you write better code, but they can also help you better maintain and scale your Python codebases or data pipelines.

Why break up your projects?
9 Best practices for structuring Python projects
Demo: A simple data engineering project
Structuring projects for easy collaboration
Python project folder structure and key files

Why break up your projects?

As Python programs become larger, they become harder to manage. This is especially true if a team is collaborating on the project and making changes to multiple aspects of the project at once.

In order to make the Python project more manageable and maintainable, we break large, unwieldy programming tasks into more manageable subtasks or modules. Modules, as their name suggests, can then be reused in multiple places.

Breaking the project down into modules is called "modular programming." As discussed in previous articles, functions, modules, and packages are all mechanisms used in modular programming. Modular programming provides several advantages:

It simplifies your work by allowing you to focus on one module at a time.
It makes your project more maintainable. If a team is working together on a project, a modular approach reduces the likelihood that one person's changes will conflict with another's, because each person tends to edit different files.
It makes your code more reusable. If your project is one large monolith, anybody looking to reuse it has to parse through a lot of code. If your code is organized in modules, importing just the parts that are needed becomes easier.
It reduces duplication. As per the point above, modular code is more reusable, meaning it is less likely you will end up duplicating functions.
It helps avoid namespace conflicts as each module can define a separate namespace.

9 Best practices for structuring Python projects

When you look for advice online about how to organize and write your Python code, you will find a lot of different ideas. But really, it comes down to a few basic things.

1. Organize your code

Properly organizing your code is crucial when working on a Python project. You can start by creating separate folders for different parts of the project, such as one for the code itself, one for data, one for testing, and one for project documentation. This structure will help you find what you need more quickly and make it easier for others to navigate your code.
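As a rough illustration, the folder layout described above can be created with a few lines of Python. This is only a sketch: the folder names (my_project, data, tests, docs) and the helper create_project_skeleton are hypothetical choices for this example, not a required convention.

```python
import tempfile
from pathlib import Path

# Hypothetical folder names: code, data, tests, and documentation
PROJECT_FOLDERS = ("my_project", "data", "tests", "docs")


def create_project_skeleton(root):
    """Create the top-level folders under `root` and return their paths."""
    root = Path(root)
    created = []
    for name in PROJECT_FOLDERS:
        folder = root / name
        # exist_ok=True makes the function safe to run more than once
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    # An __init__.py file marks the code folder as an importable package
    (root / "my_project" / "__init__.py").touch()
    return created


# Try it in a throwaway directory so nothing is written to a real project
with tempfile.TemporaryDirectory() as tmp:
    folders = create_project_skeleton(Path(tmp) / "my-project")
    print(sorted(folder.name for folder in folders))
```

Keeping the folder names in one tuple makes it easy to adapt the layout to your own project.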

2. Use consistent naming

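It's important to use consistent names for files and folders throughout your project. Try to follow the PEP 8 conventions: lowercase words separated by underscores for variables, functions, and modules; uppercase words separated by underscores for constants; and capitalized words with no separator (CapWords) for classes. This will make it easier to read and understand your code.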
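A short sketch of these conventions side by side may help. The names below (MAX_RETRIES, SalesReport, build_title) are made up for illustration.

```python
# Constant: uppercase words separated by underscores
MAX_RETRIES = 3


# Class: capitalized words with no separator (CapWords)
class SalesReport:
    def __init__(self, region_name):
        # Variable or attribute: lowercase words separated by underscores
        self.region_name = region_name

    # Function or method: lowercase words separated by underscores
    def build_title(self):
        return f"Sales report for {self.region_name}"


report = SalesReport("north")
print(report.build_title())
```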

3. Use version control

Even when working alone, using a tool like Git to keep track of changes to your code is a recommended best practice. It allows you to keep a record of your changes and easily back up your work to a cloud-based repository. Most cloud-based Git solutions offer a free tier for solo practitioners.

4. Use a package manager

Using a package manager like pip to manage dependencies will help you install and keep track of all the different pieces of software your project needs to run. This is especially important when working with large projects with many dependencies.
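Packages installed with pip leave metadata behind that Python's standard library can read through importlib.metadata. The sketch below uses that to check which dependencies are missing from the current environment. The helper name find_missing_packages and the package names passed to it are assumptions for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def find_missing_packages(package_names):
    """Return the names in `package_names` that are not installed."""
    missing = []
    for name in package_names:
        try:
            # version() raises PackageNotFoundError for unknown packages
            version(name)
        except PackageNotFoundError:
            missing.append(name)
    return missing


# "pip" is usually present; the second name is deliberately made up
print(find_missing_packages(["pip", "a-package-that-does-not-exist-12345"]))
```

A check like this can run at the start of a script to fail early with a clear message instead of an ImportError halfway through a pipeline.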

5. Create virtual environments

To keep your Python project isolated from other projects on your computer, you can use a virtual environment. A virtual environment will prevent conflicts between packages used in different projects.
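Virtual environments are usually created from the command line with python -m venv, but the same standard-library venv module can be called from Python. Here is a minimal sketch; the helper name create_virtual_environment and the .venv folder name are assumptions for this example.

```python
import venv
from pathlib import Path


def create_virtual_environment(env_dir, with_pip=True):
    """Create a virtual environment at `env_dir` and return its path."""
    env_dir = Path(env_dir)
    # with_pip=True also installs pip into the new environment
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


# Example usage (uncomment to create ./.venv in the current folder):
# create_virtual_environment(".venv")
```

After the environment exists, you activate it from your shell and install the project's dependencies into it rather than into the system-wide Python.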

6. Comment your code

Adding comments to explain what your code is doing and how to use it is important for making your code more accessible to others. It also helps you remember what you were thinking when you wrote the code.
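As a small illustration, a docstring can describe what a function does and how to call it, while an inline comment explains why a particular line exists. The function average_sale below is a made-up example.

```python
def average_sale(sales):
    """Return the mean of a list of sale amounts.

    Returns 0.0 for an empty list instead of raising ZeroDivisionError.
    """
    # Guard first: a month with no sales is valid input, not an error
    if not sales:
        return 0.0
    return sum(sales) / len(sales)


print(average_sale([100, 200, 300]))
# The docstring is available at runtime, for example through help()
print(average_sale.__doc__)
```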

7. Test, test, test

Using automated tests to check that your code works as expected is essential for catching bugs early on. This will save you time and prevent issues down the line.

8. Lint and style

Using a tool like Ruff or Flake8 to make sure your code looks consistent and catch common mistakes will help you write better code. These tools check your code for consistency with PEP 8, the official Python style guide. You can also use a tool like Black to ensure that your code looks the same across your project. This will help you maintain consistency and make it easier to read and understand your code.

9. Package your code

Using a tool like setuptools to package and distribute your Python code will make it easier to share your work with others. This will also help ensure that others can install and use your code without running into missing files or dependencies.

Demo: A simple data engineering project

Let’s apply these tips to a hands-on example. We’ll use several best practices, including organized code, consistent naming, helpful comments, and tests.

We’ll write a script to extract data from a database, transform it, and load it into another database. Let’s first define variables and functions in a module named my_module.py:

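# Constants
SOURCE_DB = "source_db"
DESTINATION_DB = "destination_db"
TABLE_NAME = "table_name"
DATE_COLUMN = "date_column"
SALES_COLUMN = "sales_column"

# Functions
def extract_data(table_name):
    """Extract table_name from the source database."""
    # Code to extract data from the source database
    raise NotImplementedError

def transform_data(data_frame):
    """Transform the extracted data."""
    # Code to transform the data
    raise NotImplementedError

def load_data(data_frame):
    """Load the data into the destination database."""
    # Code to load the data into the destination database
    raise NotImplementedError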

In this example, we use consistent names for the variables and functions. Constants like SOURCE_DB, DESTINATION_DB, TABLE_NAME, DATE_COLUMN, and SALES_COLUMN are all capitalized and separated by underscores. Functions like extract_data, transform_data, and load_data are all lowercase and separated by underscores.

Next, let’s write a simple unit test in a test.py file to make sure our code works the way we want it to:

import unittest
import pandas as pd
from my_module import transform_data

class TestDataTransformation(unittest.TestCase):
    def test_transform_data(self):
        # Create a sample input DataFrame
        input_data = pd.DataFrame({
            "date_column": ["2022-01-01", "2022-02-01", "2022-03-01"],
            "sales_column": [100, 200, 300]
        })

        # Call the transform_data function
        output_data = transform_data(input_data)

        # Check the output DataFrame
        expected_output = pd.DataFrame({
            "year": [2022, 2022, 2022],
            "month": [1, 2, 3],
            "sales": [100, 200, 300]
        })

        pd.testing.assert_frame_equal(output_data, expected_output)

Here, we imported the unittest and pandas modules, as well as the transform_data function from my_module.py.

Then, we defined a test class called TestDataTransformation that inherits from unittest.TestCase. In this class, we define a single test function called test_transform_data. This function creates a sample input DataFrame with three rows of data, calls the transform_data function with this input data, and then checks the output DataFrame to make sure it has the expected values.

We define the expected output DataFrame with the same three rows of data and use the pd.testing.assert_frame_equal function to check that the output DataFrame matches the expected DataFrame. If the output DataFrame matches the expected DataFrame, the test will pass. If not, the test will fail and provide information on where the mismatch occurred.
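The test can only pass once transform_data has a real body, and the module listing leaves that body unwritten. One possible implementation is sketched below. It is an assumption inferred from the columns the test expects (year, month, sales), not the only way to write the function.

```python
import pandas as pd

DATE_COLUMN = "date_column"
SALES_COLUMN = "sales_column"


def transform_data(data_frame):
    """Split the date column into year and month and rename the sales column."""
    # Parse the date strings so the .dt accessor becomes available
    dates = pd.to_datetime(data_frame[DATE_COLUMN])
    return pd.DataFrame({
        # Cast to int64 so the dtypes match a DataFrame built from plain ints
        "year": dates.dt.year.astype("int64"),
        "month": dates.dt.month.astype("int64"),
        "sales": data_frame[SALES_COLUMN],
    })


sample = pd.DataFrame({
    "date_column": ["2022-01-01", "2022-02-01", "2022-03-01"],
    "sales_column": [100, 200, 300],
})
print(transform_data(sample))
```

The explicit cast matters because assert_frame_equal compares column dtypes by default, so matching values alone are not enough for the test to pass.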

Structuring projects for easy collaboration

When you code with other developers, the most important things to consider are:

having the correct version of the code, and
having the correct software dependencies.

This is where version control comes in handy. Suppose a team of developers is working on a Python project that involves developing a web application. Each developer is working on a specific feature or module of the project. Without version control, the team would have to coordinate their work manually, which can be daunting and error-prone. One developer may accidentally overwrite another's work, leading to lost progress or conflicts.

However, by using a version control system like Git, the team can easily track changes, collaborate on code, and ensure that everyone is working with the same version of the codebase. With Git, each developer can work on their own branch, which allows them to make changes independently without affecting the main codebase.

When it's time to integrate everyone's changes, Git provides tools for merging code and resolving conflicts, making it much easier for the team to work together. Git also provides a complete history of the codebase, including all changes made by each developer, which can be useful for debugging and understanding how the code has evolved over time.

To ensure that changes to shared code don't break other projects, it's a good idea to use automated testing tools like pytest. It's also a good practice to review code changes before merging them into the main code base.
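For comparison with the unittest example earlier, here is a rough sketch of the same idea in pytest style, where tests are plain functions that use assert. The function clean_column_name is a made-up piece of shared code. If this were saved in a file whose name starts with test_ (for example a hypothetical test_cleaning.py), running pytest in that folder would collect the two test functions.

```python
def clean_column_name(name):
    """Normalize a column header, e.g. '  Sales Column ' -> 'sales_column'."""
    return "_".join(name.strip().lower().split())


# pytest collects functions whose names start with "test_"
def test_clean_column_name_strips_and_lowercases():
    assert clean_column_name("  Sales Column ") == "sales_column"


def test_clean_column_name_keeps_clean_names():
    assert clean_column_name("date_column") == "date_column"
```

Running tests like these before every merge gives reviewers quick evidence that a change did not alter behavior other code relies on.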

Finally, when it comes to ‘easy collaboration’ and as mentioned in our list of best practices, it's important to document your code as you write it. This makes it easier for new team members to understand the code and get up to speed.

Recommended Python Project Structure: folder structure and key files

You might have wondered why Python projects have the structure my-project/my_project. This way of organizing a project is popular in Python because it helps keep everything neat and tidy. The top folder, called "my-project," is the main folder for the entire project, while the inner folder, "my_project," is the Python package that holds the code. The dash and the underscore help differentiate between the two levels. The outer name can use a dash because it is never imported. The inner name cannot: a package name must be a valid Python identifier, and Python reads a dash as a minus sign, so a statement like import my-project is a syntax error. The inner level, which defines a Python package, must therefore use the underscore.
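You can check whether a folder name would work as a package name with a couple of standard-library calls. The helper is_importable_name below is a hypothetical name for this sketch.

```python
import keyword


def is_importable_name(name):
    """Return True if `name` can be used in an import statement."""
    # A package name must be a valid identifier and not a reserved word
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_importable_name("my_project"))  # True
print(is_importable_name("my-project"))  # False
```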

There are loose norms for how to organize files inside "my-project". The top level typically contains a README.md file and various configuration files. The most important part of your project is the one or more Python packages it contains. A Python package is a directory constituting a valid target for Python's import system, typically containing an __init__.py file. In this example, there is one Python package named "my_project", which is placed directly in the top level.

From here, the key files and subfolders found in most projects are:

A dependency management file (typically setup.py or pyproject.toml): This file is used to configure your project and its dependencies. We explored dependency management in part two of this blog series.
README.md: This file, written in Markdown, provides a brief introduction to your project, its purpose, and any installation or usage instructions.
LICENSE: This plain text file specifies the license under which your code is released. There are many open-source licenses to choose from, so be sure to select one that fits your needs.

In addition, many projects will have the following two folders:

src/: This directory is an alternative location for Python packages if you don't want to place them directly in the project root.
tests/: This directory contains test code that verifies the functionality of your project. You can use a testing framework like pytest or unittest to write and run tests.

Up next…

We’ve shared how organizing Python projects in a structured manner is crucial as the project's complexity grows, especially if you are part of a team or plan on sharing your work with other developers. By following the recommended Python project structure outlined in this article, you can maintain and scale your codebases or data pipelines more easily.

In part 4 of our guide, From Python Projects to Dagster Pipelines, we explore setting up a Dagster project, and the key concept of Data Assets.

We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!
