March 21, 2023 • 5 minute read •
Best Practices in Structuring Python Projects
- Name
- Elliot Gunn
- Handle
- @elliot
Python is a versatile and widely-adopted programming language used for everything from data analysis to web development. However, as Python projects grow in complexity, it can become challenging to keep track of all the moving parts and ensure that everything stays organized.
That's where having an understanding of Python file and project directory structures comes in. In this article, we'll review some key concepts in structuring Python projects and how to best apply them.
If you're just starting out, these best practices will not only help you write better code, but they can also help you better maintain and scale your Python codebases or data pipelines.
Table of Contents
- Why break up your projects?
- 9 Best practices for structuring Python projects
- Demo: A simple data engineering project
- Structuring projects for easy collaboration
- Python project folder structure and key files
Why break up your projects?
As Python programs become larger, they become harder to manage. This is especially true if a team is collaborating on the project and making changes to multiple aspects of the project at once.
In order to make the Python project more manageable and maintainable, we break large, unwieldy programming tasks into more manageable subtasks or modules. Modules, as their name suggests, can then be reused in multiple places.
Breaking the project down into modules is called “modular programming.” As discussed in previous articles, functions, modules, and packages are all mechanisms used in modular programming, which provides many advantages:
- It simplifies your work by allowing you to focus on one module at a time.
- It makes your project more maintainable. If a team is working together on a project, adopting a modular approach reduces the likelihood your work will end up in version conflicts.
- It makes your code more reusable. If your project is one large monolith, anybody looking to reuse it has to parse through a lot of code. If your code is organized in modules, importing just the parts that are needed becomes easier.
- It reduces duplication. As per the point above, modular code is more reusable, meaning it is less likely you will end up duplicating functions.
- It helps avoid namespace conflicts as each module can define a separate namespace.
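The namespace point can be seen with two standard-library modules. This is a small illustration rather than anything specific to this project: math and cmath each define a function named sqrt, and the module prefix keeps the two apart.

```python
import cmath
import math

# Both modules define sqrt, but each lives in its own namespace,
# so the two functions never collide.
real_root = math.sqrt(4)        # real-valued square root
complex_root = cmath.sqrt(-4)   # complex-valued square root

print(real_root)     # 2.0
print(complex_root)  # 2j
```

Your own modules behave the same way: two files can each define a function with the same name, and callers pick one by its module name.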
9 Best practices for structuring Python projects
When you look for advice online about how to organize and write your Python code, you will find a lot of different ideas. But really, it comes down to a few basic things.
1. Organize your code
Properly organizing your code is crucial when working on a Python project. You can start by creating separate folders for different parts of the project, such as one for the code itself, one for data, one for testing, and one for project documentation. This structure will help you find what you need more quickly and make it easier for others to navigate your code.
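As a rough sketch of that layout, the snippet below creates one folder per concern inside a temporary directory. The folder names (src, data, tests, docs) and the create_skeleton helper are assumptions chosen for illustration, not a required convention.

```python
import tempfile
from pathlib import Path

# Hypothetical folder names: one each for code, data, tests, and documentation
FOLDERS = ["src", "data", "tests", "docs"]


def create_skeleton(root):
    """Create one folder per concern under root and return the sorted folder names."""
    root = Path(root)
    for name in FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Build the skeleton in a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    created = create_skeleton(Path(tmp) / "my-project")

print(created)  # ['data', 'docs', 'src', 'tests']
```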
2. Use consistent naming
It's important to use consistent names for files and folders throughout your project. Try to follow the PEP 8 conventions: lowercase words separated by underscores for variables and functions, and capitalized words (CapWords) for classes. This will make it easier to read and understand your code.
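Here is a short sketch of those conventions side by side. All of the names (MAX_RETRIES, SalesReport, format_report) are made up for this example.

```python
MAX_RETRIES = 3  # constant: upper case, words separated by underscores


class SalesReport:  # class: CapWords
    def __init__(self, total_sales):
        self.total_sales = total_sales  # attribute: lowercase with underscores


def format_report(report):  # function: lowercase with underscores
    return f"Total sales: {report.total_sales}"


summary = format_report(SalesReport(600))
print(summary)  # Total sales: 600
```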
3. Use version control
Even when working alone, using a tool like Git to keep track of changes to your code is a recommended best practice. It allows you to keep a record of your changes and easily back up your work to a cloud-based repository. Most cloud-based Git solutions offer a free tier for solo practitioners.
4. Use a package manager
Using a package manager like pip to manage dependencies will help you install and keep track of all the different pieces of software your project needs to run. This is especially important when working with large projects with many dependencies.
5. Create virtual environments
To keep your Python project isolated from other projects on your computer, you can use a virtual environment. A virtual environment will prevent conflicts between packages used in different projects.
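Virtual environments are usually created from the command line with python -m venv, but the same standard-library venv module can be called from Python. The sketch below is only a demonstration: it builds an environment in a temporary directory, and the .venv folder name is a common choice rather than a requirement.

```python
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / ".venv"
    # with_pip=False keeps this sketch fast; a real environment usually wants pip
    venv.create(env_dir, with_pip=False)
    # Every virtual environment has a pyvenv.cfg file at its root
    has_config = (env_dir / "pyvenv.cfg").is_file()

print(has_config)
```

Packages installed inside an activated environment stay in that environment's own folder, which is what keeps projects from interfering with each other.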
6. Comment your code
Adding comments to explain what your code is doing and how to use it is important for making your code more accessible to others. It also helps you remember what you were thinking when you wrote the code.
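For example, a docstring can state what a function does and how to use it, while an inline comment explains a decision that is not obvious from the code. The average_sales function below is a hypothetical helper written for this illustration.

```python
def average_sales(sales):
    """Return the mean of a list of sales figures.

    Raises ValueError if the list is empty, because the average is undefined.
    """
    if not sales:
        raise ValueError("sales must contain at least one value")
    # "/" is true division in Python 3, so integer inputs can give a fractional mean
    return sum(sales) / len(sales)


print(average_sales([100, 200, 300]))  # 200.0
```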
7. Test, test, test
Using automated tests to check that your code works as expected is essential for catching bugs early on. This will save you time and prevent issues down the line.
8. Lint and Style
Using a tool like Ruff or Flake8 to make sure your code looks consistent and catch common mistakes will help you write better code. These tools check your code for consistency with PEP 8, the official Python style guide. You can also use a tool like Black to ensure that your code looks the same across your project. This will help you maintain consistency and make it easier to read and understand your code.
9. Package to share
Using a tool like setuptools to package and distribute your Python code will make it easier to share your work with others. This will also help ensure that others can use your code without encountering any issues.
Demo: A simple data engineering project
Let’s apply these tips to a hands-on example. We’ll use several best practices, including organized code, consistent naming, helpful comments, and tests.
We’ll write a script to extract data from a database, transform it, and load it into another database. Let’s first define variables and functions in a module named my_module.py:
### Constants
SOURCE_DB = "source_db"
DESTINATION_DB = "destination_db"
TABLE_NAME = "table_name"
DATE_COLUMN = "date_column"
SALES_COLUMN = "sales_column"

### Functions
def extract_data(table_name):
    # Code to extract data from the source database
    pass


def transform_data(data_frame):
    # Code to transform the data
    pass


def load_data(data_frame):
    # Code to load the data into the destination database
    pass
In this example, we use consistent names for the variables and functions. Constants like SOURCE_DB, DESTINATION_DB, TABLE_NAME, DATE_COLUMN, and SALES_COLUMN are all capitalized and separated by underscores. Functions like extract_data, transform_data, and load_data are all lowercase and separated by underscores.
Next, let’s write a simple unit test in a test.py file to make sure our code works the way we want it to:
import unittest

import pandas as pd

from my_module import transform_data


class TestDataTransformation(unittest.TestCase):
    def test_transform_data(self):
        # Create a sample input DataFrame
        input_data = pd.DataFrame({
            "date_column": ["2022-01-01", "2022-02-01", "2022-03-01"],
            "sales_column": [100, 200, 300]
        })
        # Call the transform_data function
        output_data = transform_data(input_data)
        # Check the output DataFrame
        expected_output = pd.DataFrame({
            "year": [2022, 2022, 2022],
            "month": [1, 2, 3],
            "sales": [100, 200, 300]
        })
        pd.testing.assert_frame_equal(output_data, expected_output)


# Run the tests when this file is executed directly
if __name__ == "__main__":
    unittest.main()
Here, we imported the unittest and pandas modules, as well as the transform_data function from my_module.py.
Then, we defined a test class called TestDataTransformation that inherits from unittest.TestCase. In this class, we define a single test function called test_transform_data. This function creates a sample input DataFrame with three rows of data, calls the transform_data function with this input data, and then checks the output DataFrame to make sure it has the expected values.
We define the expected output DataFrame with the same three rows of data, with each date split into year and month columns, and use the pd.testing.assert_frame_equal function to check that the output DataFrame matches the expected DataFrame. If the output DataFrame matches the expected DataFrame, the test will pass. If not, the test will fail and provide information on where the mismatch occurred.
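The module shown earlier leaves the body of transform_data unwritten, so the test cannot pass yet. One possible implementation, inferred only from the expected output in the test, is sketched below. It assumes the date column holds ISO-formatted date strings and that the sales values pass through unchanged; the original article does not show its own version.

```python
import pandas as pd

DATE_COLUMN = "date_column"
SALES_COLUMN = "sales_column"


def transform_data(data_frame):
    """Split the date column into year and month, and rename the sales column."""
    dates = pd.to_datetime(data_frame[DATE_COLUMN])
    return pd.DataFrame({
        # Cast to int64 because newer pandas versions return int32 here,
        # which would not match a DataFrame built from plain Python ints
        "year": dates.dt.year.astype("int64"),
        "month": dates.dt.month.astype("int64"),
        "sales": data_frame[SALES_COLUMN],
    })
```

The explicit cast matters because assert_frame_equal compares column dtypes as well as values by default.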
Structuring projects for easy collaboration
When you code with other developers, the most important things to get right are that everyone has:
- the correct version of the code, and
- the correct software dependencies.
This is where version control comes in handy. Suppose a team of developers is working on a Python project that involves developing a web application. Each developer is working on a specific feature or module of the project. Without version control, the team would have to coordinate their work manually, which can be daunting and error-prone. One developer may accidentally overwrite another's work, leading to lost progress or conflicts.
However, by using a version control system like Git, the team can easily track changes, collaborate on code, and ensure that everyone is working with the same version of the codebase. With Git, each developer can work on their own branch, which allows them to make changes independently without affecting the main codebase.
When it's time to integrate everyone's changes, Git provides tools for merging code and resolving conflicts, making it much easier for the team to work together. Git also provides a complete history of the codebase, including all changes made by each developer, which can be useful for debugging and understanding how the code has evolved over time.
To ensure that changes to shared code don't break other projects, it's a good idea to use automated testing tools like pytest. It's also a good practice to review code changes before merging them into the main code base.
Finally, as mentioned in our list of best practices, easy collaboration depends on documenting your code as you write it. This makes it easier for new team members to understand the code and get up to speed.
Recommended Python Project Structure: folder structure and key files
You might have wondered why Python projects have the structure my-project/my_project. This way of organizing a project is popular in Python because it helps keep everything neat and tidy. The top folder, called "my-project," is the main folder for the entire project. The dash and the underscore help differentiate between the two levels of the project. The inner level, which defines a Python package, must use the underscore: a package name has to be a valid Python identifier so that it can be imported, and Python reads a dash as a minus sign.
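A quick way to check whether a name can be used in an import statement is the built-in str.isidentifier method, shown here with the two names from this section:

```python
# A dashed name is not a valid identifier, so "import my-project" is a syntax error
dashed_ok = "my-project".isidentifier()
underscored_ok = "my_project".isidentifier()

print(dashed_ok)       # False
print(underscored_ok)  # True
```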
There are loose norms for how to organize files inside "my-project". The top level typically contains a README.md file and various configuration files. The most important part of your project is the one or more Python packages it contains. A Python package is a directory constituting a valid target for Python's import system, typically containing an __init__.py file. In this example, there is one Python package named "my_project", which is placed in the top level.
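To make the idea of a package as an import target concrete, the sketch below builds the my-project/my_project layout in a temporary directory and imports the inner package. The GREETING variable and the file contents are placeholders invented for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp) / "my-project"
    package_dir = project_root / "my_project"
    package_dir.mkdir(parents=True)

    # The __init__.py file marks my_project as a package and runs on import
    (package_dir / "__init__.py").write_text('GREETING = "hello from my_project"\n')
    (project_root / "README.md").write_text("# my-project\n")

    # Put the project root on the import path so the package can be found
    sys.path.insert(0, str(project_root))
    try:
        package = importlib.import_module("my_project")
        greeting = package.GREETING
    finally:
        # Undo the changes so the demo leaves the interpreter state clean
        sys.path.remove(str(project_root))
        sys.modules.pop("my_project", None)

print(greeting)  # hello from my_project
```

In a real project you would not edit sys.path by hand. Installing the project into your virtual environment makes the package importable in the same way.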
From here, the key files and subfolders found in most projects are:
- A dependency management file (typically setup.py or pyproject.toml): This file is used to configure your project and its dependencies. We explored dependency management in part two of this blog series.
- README.md: This file, written as markdown, provides a brief introduction to your project, its purpose, and any installation or usage instructions.
- LICENSE: This plain text file specifies the license under which your code is released. There are many open-source licenses to choose from, so be sure to select one that fits your needs.
In addition, many projects will have the following two folders:
- src/: This directory is an alternative location for Python packages if you don't want to place them directly in the project root.
- tests/: This directory contains test code that verifies the functionality of your project. You can use a testing framework like pytest or unittest to write and run tests.
Up next…
We’ve shared how organizing Python projects in a structured manner is crucial as a project's complexity grows, especially if you are part of a team or plan on sharing your work with other developers. By following the recommended Python project structure outlined in this article, you can maintain and scale your codebases or data pipelines more easily.
In part 4 of our guide, From Python Projects to Dagster Pipelines, we explore setting up a Dagster project, and the key concept of Data Assets.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!