August 7, 2023 • 7 minute read

Environment Variables in Python

Name: Elliot Gunn
Handle: @elliot

The following article is part of a series on Python for data engineering aimed at helping data engineers, data scientists, data analysts, machine learning engineers, or others who are new to Python master the basics. To date, this beginner's guide consists of:

Part 1: Python Packages: a Primer for Data People (part 1 of 2), explored the basics of Python modules, Python packages and how to import modules into your own projects.
Part 2: Python Packages: a Primer for Data People (part 2 of 2), covered dependency management and virtual environments.
Part 3: Best Practices in Structuring Python Projects, covered 9 best practices and examples on structuring your projects.
Part 4: From Python Projects to Dagster Pipelines, we explore setting up a Dagster project, and the key concept of Data Assets.
Part 5: Environment Variables in Python, we cover the importance of environment variables and how to use them.
Part 6: Type Hinting, or how type hints reduce errors.
Part 7: Factory Patterns, or learning design patterns, which are reusable solutions to common problems in software design.
Part 8: Write-Audit-Publish in data pipelines, a design pattern frequently used in ETL to ensure data quality and reliability.
Part 9: CI/CD and Data Pipeline Automation (with Git), learn to automate data pipelines and deployments with Git.
Part 10: High-performance Python for Data Engineering, learn how to code data pipelines in Python for performance.
Part 11: Breaking Packages in Python, in which we explore the sharp edges of Python’s system of imports, modules, and packages.

Today, we’ll take a look at managing environment variables in Python. Environment variables provide a way to configure applications in a non-hardcoded manner, enabling you to modify application behavior without changing the actual code.

The use of environment variables becomes especially important when parameterizing data pipelines in a production environment. They allow sensitive information like database credentials or API keys to be stored outside the codebase. This not only enhances security but also makes the code more portable and easier to manage.

In this article, we'll demystify the concept of environment variables in Python, explain how they work, why they matter, and how to effectively utilize them to bolster your Python programming skills. We'll walk you through practical examples, introducing techniques and best practices that will enable you to set up your Python environment, configure paths to important tools, or set environment variables that your scripts rely on.

What are environment variables?
Scope and lifecycle of environment variables
Environment variables and configuration
Best practices for managing environment variables

What are environment variables?

Environment variables are named string values that the operating system makes available to each running process. Python provides a built-in module named os as an interface to the underlying operating system. This module provides a dictionary-like object, os.environ, that allows you to interact with environment variables.

os.environ acts as a mapping between environment variable names and their values. You can use it to access, modify, or create environment variables in a Python program. Let's break it down.

Reading Python Environment Variables

To read the value of an environment variable in Python, you treat os.environ as a dictionary and access the variable by its name. Here's an example:

import os

# Accessing an environment variable
print(os.environ['HOME'])

This prints your home directory, such as /Users/username on macOS or /home/username on Linux:

/Users/elliot

Here, HOME is the name of the environment variable that typically stores the path to the current user's home directory. If the environment variable exists, its value will be printed to the console. If the environment variable does not exist, however, this will raise a KeyError.

To avoid this error, you might choose to use the get method, which allows you to provide a default value that will be returned if the environment variable is not found:

import os

# Accessing an environment variable with a default value
print(os.environ.get('HOME', '/home/default'))

In this case, if the HOME variable does not exist, the string /home/default will be returned instead.
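
One detail worth keeping in mind when supplying defaults is that values in os.environ are always strings, so numeric or boolean settings need an explicit conversion. The following is a minimal sketch of that idea; the helper names env_int and env_bool and the variables BATCH_SIZE and DEBUG_MODE are hypothetical and chosen only for illustration.

```python
import os

# Hypothetical variables, set here only so the example is self-contained.
os.environ["BATCH_SIZE"] = "500"
os.environ.pop("DEBUG_MODE", None)  # make sure this one is unset


def env_int(name: str, default: int) -> int:
    """Read an integer setting, falling back to a default when it is unset."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw)


def env_bool(name: str, default: bool = False) -> bool:
    """Treat '1', 'true', 'yes', and 'on' (in any case) as True."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


batch_size = env_int("BATCH_SIZE", default=100)  # converted from the string "500"
debug = env_bool("DEBUG_MODE")  # unset, so the default is used
print(batch_size, debug)
```

Which strings count as "true" is a design choice; the set used here is one common convention, not a standard.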

You can print out and explore all environment variables with the following script:

import os

for name, value in os.environ.items():
    print(f"{name}: {value}")

Modifying and Adding Environment Variables

You can modify the value of an existing environment variable, or add a new one, by assigning a value to a key in the os.environ object:

import os

# Setting an environment variable
os.environ['MY_VARIABLE'] = 'foo'

# Now, the new variable is accessible via os.environ
print(os.environ['MY_VARIABLE'])  # Outputs: foo

In this case, the MY_VARIABLE environment variable is created and set to foo. However, this variable will be available for only as long as the current process is running.

Remember, changes made to the os.environ object are local to the process where they are made. If you set an environment variable in your Python script, it will be available to that script and any subprocesses it creates, but it will not be visible in the broader environment or to other unrelated processes.

The scope and lifecycle of environment variables set in this way are limited to the current process and any subprocesses it launches. As this can be a subtle point, consider our example above.

Our Python script has set an environment variable called MY_VARIABLE to foo. Let’s say the script then launches a second Python script as a subprocess (Process 1). This second script will also be able to access MY_VARIABLE. But if you tried to access MY_VARIABLE from a different Python script that was not launched by the first script (Process 2), or from the command line after running the script, you would find that MY_VARIABLE is not available: it is not part of the broader environment. Both processes, however, have access to any variables defined at the user or system level, such as a hypothetical GLOBAL_VARIABLE set in your shell configuration.
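
The inheritance described here can be checked with a short experiment. This is a minimal sketch that uses the standard library's subprocess module to start a child Python interpreter; the one-line child script is made up for the demonstration.

```python
import os
import subprocess
import sys

# Set the variable in the current (parent) process.
os.environ["MY_VARIABLE"] = "foo"

# Launch a child Python process. It receives a copy of the parent's environment.
child_code = "import os; print(os.environ.get('MY_VARIABLE', 'not set'))"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)

child_output = result.stdout.strip()
print(child_output)  # the child sees the value set by its parent
```

Running a separate, unrelated Python interpreter from a fresh terminal would instead report the variable as not set, because that process did not descend from this one.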

Scope and lifecycle of environment variables

To persist environment variables across different sessions or processes, you'll need to set them in your operating system's environment outside of Python. The way to do this varies based on your operating system and shell but generally involves editing shell configuration files or using command-line utilities.

Let’s review what we mean when we talk about operating systems, shells, and command-line utilities.

Operating systems, shells, and command-line utilities

An operating system is the most fundamental piece of software that runs on a computer. It acts as an intermediary between the user and the computer hardware, making it possible for users to execute programs, manage files, and interact with the device. Operating systems also manage processes, memory, and security. Popular operating systems include Microsoft Windows, macOS, and Linux distributions (like Ubuntu, Fedora, etc.).

Shells allow you to interact with your computer's operating system. As a data engineer, you'll primarily work with command-line shells, which let you issue commands to the operating system by typing them in as text.

When you open a terminal window (on Linux or Mac) or command prompt (on Windows), you interact with a shell. You might type a command like python my_script.py to run a Python script, or ls (on Linux or Mac) or dir (on Windows) to list the files in the current directory. These are all examples of interacting with a command-line shell.

Shell configuration files are special files that the shell reads when it starts up. As a data engineer, you’ll probably use them to set environment variables that should be available every time you open a new terminal window. This is convenient because the shell executes these files automatically every time it starts.

Command-line utilities are programs that are designed to be used from a text interface. Instead of clicking on buttons, you type commands: for example, you can use pip to install Python packages, jupyter to start a Jupyter notebook server, or Git for version control. Other examples include Unix command-line utilities such as grep, awk, and sed to search, filter, and transform text data; curl and wget to download files from the web; and ssh to connect with remote servers when working with distributed systems or cloud-based resources.

Process-level scope

The environment variables set or modified via os.environ in your Python script are accessible within the same process. If your script starts another process (called a subprocess), that subprocess will inherit the environment of its parent, including any environment variables set by the parent. But if two separate processes both set an environment variable with the same name, they won't interfere with each other; each process has its own independent copy of the environment.
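
To make the "independent copy" point concrete, here is a small sketch in which a child process overwrites a variable without affecting its parent. It also shows the env parameter of subprocess.run, which lets you hand a child a tailored environment. The values parent-value, child-value, and custom-value are placeholders.

```python
import os
import subprocess
import sys

os.environ["MY_VARIABLE"] = "parent-value"

# The child overwrites MY_VARIABLE, but only in its own copy of the environment.
child_code = (
    "import os; "
    "os.environ['MY_VARIABLE'] = 'child-value'; "
    "print(os.environ['MY_VARIABLE'])"
)
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)
child_value = result.stdout.strip()
parent_value = os.environ["MY_VARIABLE"]  # unchanged by the child

# Pass a modified copy of the environment to a child without touching our own.
custom_env = {**os.environ, "MY_VARIABLE": "custom-value"}
custom = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['MY_VARIABLE'])"],
    env=custom_env,
    capture_output=True,
    text=True,
    check=True,
)
custom_value = custom.stdout.strip()

print(child_value, parent_value, custom_value)
```

Building custom_env from a copy of os.environ keeps variables such as PATH available to the child, which it usually needs in order to run at all.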

User-level and system-level scope

If you want to set environment variables that are accessible to all processes of a particular user, or to all processes on the system, you need to do this outside of Python, in the configuration files for your shell or operating system. These changes have a broader scope but are not immediate; they typically require you to start a new shell session or reboot the system.

Persistence of changes

Changes to environment variables made using os.environ in a Python script are not persisted after the script finishes executing. This means that if you run a script that sets an environment variable and then run another script or return to your shell, the environment variable changes from the first script will not be visible.

This non-persistence can actually be an advantage. It allows your script to modify its environment as needed for its own purposes without affecting other processes or leaving lasting changes that could affect future shell sessions.

If you want to set an environment variable that persists across multiple shell sessions or is available to other applications, you must do so at the shell level or via the operating system's interfaces for setting environment variables. The exact method for doing so depends on your operating system and shell. However, this should be done carefully, as it can potentially have wide-ranging effects on your system's behavior.

Environment variables and configuration

Utilizing environment variables for configuration is considered a best practice in software development for two reasons.

First, they allow you to avoid storing sensitive information, like passwords, API keys, or database URIs, directly in your code. This prevents such information from being inadvertently exposed, for example, by being included in version control repositories.

Second, they make your code more portable. When different settings are needed for development, testing, and production environments, these can be controlled through environment variables without modifying the code. If a configuration value needs to change, you can update the environment variable without changing the application's code; in most cases the application only needs to be restarted to pick up the new value, because a process reads its environment when it starts.

Using environment variables for sensitive information

To use environment variables for sensitive data like API keys or database URIs, you simply need to ensure that the relevant environment variables are set in the environment where your Python code is running.

For example, you might have a Python script that connects to a database and uses an API. Instead of hardcoding the database URI and the API key in the script, you could use environment variables:

import os

# Accessing sensitive data from environment variables
db_uri = os.environ.get('DATABASE_URI')
api_key = os.environ.get('API_KEY')

In this case, the DATABASE_URI and API_KEY environment variables would need to be set in the environment where the script is run. This could be done in the shell before starting the script, or in a configuration file that the shell reads when it starts.
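
Because os.environ.get returns None when a variable is missing, a misconfigured environment can go unnoticed until the value is first used. One possible approach is to fail early with a clear message. The sketch below assumes a hypothetical helper named require_env; it is not part of the os module, and the demonstration value is a placeholder.

```python
import os


def require_env(name: str) -> str:
    """Return the variable's value, or fail with a clear message if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Required environment variable {name!r} is not set")
    return value


# For demonstration only: real values would be set outside the script.
os.environ.setdefault("DATABASE_URI", "postgres://localhost:5432/example")
os.environ.pop("SOME_UNSET_VARIABLE_FOR_DEMO", None)

db_uri = require_env("DATABASE_URI")

try:
    require_env("SOME_UNSET_VARIABLE_FOR_DEMO")
except RuntimeError as exc:
    error_message = str(exc)
    print(error_message)
```

Note that the error message includes only the variable's name, never its value, so it is safe to show in logs.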

Don’t hardcode!

Hardcoding sensitive information like API keys or database URIs (that is, embedding the values directly in your code) is a significant security risk. If your code is shared or made public, for example, on GitHub, anyone who can see the code can see these sensitive details. This could allow unauthorized access to your database or misuse of your API keys.

Even if you don't plan to share your code, there's still a risk. Hardcoded information stays in version control history, so anyone who later gets access to the repository could find sensitive data in old commits.

Hardcoding also makes your code less flexible. If you want to use a different database for testing or need to change API keys, you would need to change and redeploy your code. By contrast, if you use environment variables for such details, you can simply change the environment variables when needed without touching your code.

Best practices for managing environment variables

With so-called "dotenv" files, you can store your configuration in a .env file which loads when your application starts. The python-dotenv library provides a way to load environment variables from .env files into the os.environ object in Python. Here's how you can use it:

Install the python-dotenv library if you haven't done so already. You can install it via pip:

pip install python-dotenv

Create a .env file in your project root directory and add some variables to it. The file could look something like this:

DATABASE_URI=postgres://myuser:mypassword@localhost:5432/mydatabase
API_KEY=abcdef123456

Note: this file should not be committed to your version control system. You should add .env to your .gitignore file to ensure Git ignores it.

In your Python script, you can use python-dotenv to load the .env file:

import os

from dotenv import load_dotenv

# Load the .env file
load_dotenv()

# Now you can access the variables
db_uri = os.getenv('DATABASE_URI')
api_key = os.getenv('API_KEY')

The load_dotenv() function reads the .env file and loads its contents into os.environ.
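
To build intuition for what loading a .env file involves, here is a simplified, standard-library-only sketch. It is not python-dotenv's actual implementation, and the real library handles many more cases. The function name load_env_file and the variables DEMO_API_KEY and DEMO_ALREADY_SET are hypothetical. Like load_dotenv, whose override parameter defaults to False, this sketch leaves variables that are already set in the environment untouched unless told otherwise.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path: Path, override: bool = False) -> dict[str, str]:
    """Parse simple KEY=VALUE lines and copy them into os.environ."""
    loaded = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and malformed lines
        key, _, value = line.partition("=")  # split on the first "=" only
        key = key.strip()
        value = value.strip().strip("'\"")  # drop surrounding quotes, if any
        loaded[key] = value
        if override or key not in os.environ:
            os.environ[key] = value
    return loaded


# Demonstration with a temporary .env file and made-up values.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "# demo values\n"
        "DEMO_API_KEY=abcdef123456\n"
        "DEMO_ALREADY_SET=from-file\n"
    )
    os.environ.pop("DEMO_API_KEY", None)
    os.environ["DEMO_ALREADY_SET"] = "from-shell"
    values = load_env_file(env_path)

print(os.environ["DEMO_API_KEY"])      # loaded from the file
print(os.environ["DEMO_ALREADY_SET"])  # the existing value is kept
```

Not overriding by default means that a value set explicitly in the shell or by a deployment platform takes precedence over the file, which is usually what you want when the same code runs in several environments.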

Environment variables in production

In production, you need a simple and secure way to set the environment variables that your application uses. This can look very different depending on your specific production environment.

You can leverage specific services for managing secrets in cloud environments:

Amazon Web Services (AWS) Secrets Manager or Parameter Store
Azure Key Vault
Google Cloud Secret Manager

You can also use specific secret management objects in container orchestration systems:

Kubernetes Secrets or integrations with tools like HashiCorp Vault
Docker secrets

We’ll see how this works in Amazon Elastic Container Service (Amazon ECS) and Kubernetes.

Amazon ECS task definitions and AWS Secrets Manager

You can define environment variables directly within the task definition. Here's an example in JSON format:

{
  "containerDefinitions": [
    {
      "name": "my-container",
      "image": "my-image",
      "environment": [
        {
          "name": "ENV_VARIABLE_NAME",
          "value": "value"
        }
      ]
    }
  ]
}

For sensitive data, you can store the secrets in AWS Secrets Manager and reference them in your task definition:

{
  "containerDefinitions": [
    {
      "name": "my-container",
      "image": "my-image",
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:region:aws_account_id:secret:secret_name"
        }
      ]
    }
  ]
}

Kubernetes ConfigMaps and Secrets

In Kubernetes, you can manage environment variables using ConfigMaps for non-sensitive data and Secrets for sensitive data.

Here's an example of defining environment variables using a ConfigMap. First, define a ConfigMap in a YAML file:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  ENV_VARIABLE_NAME: value

Then, reference the ConfigMap in a Pod:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-container
      image: my-image
      envFrom:
        - configMapRef:
            name: app-config

For sensitive data, use Kubernetes Secrets instead. First, create the Secret with kubectl:

kubectl create secret generic app-secrets --from-literal=DB_PASSWORD=secretpassword

Then, reference the Secret in a Pod:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-container
      image: my-image
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: DB_PASSWORD

When to use environment variables

Environment variables are best used for configuration data that varies between deployment environments and for sensitive data that should not be stored directly in the code.

You should consider using environment variables for:

Database URLs and other related settings
API keys, tokens, or secrets
Hostnames or URLs for external services
Any kind of sensitive data that should not be exposed in the code

However, environment variables may not be the best choice for:

Fine-tuned configuration options that may need to be changed often
Large amounts of binary data
Data that would be better stored in a database or other storage service

Additional considerations for managing secrets and sensitive information

Managing secrets and sensitive information is one of the key uses of environment variables. By storing this data in environment variables, you keep it out of your code. This prevents secrets from being exposed in your version control system and allows you to change the secrets without modifying your code.

However, just using environment variables isn't enough to keep your secrets secure. You should also:

Avoid logging environment variables, as logs can often be accessed by people who should not see the secrets.
Be cautious about error messages that might expose environment variables.
Ensure that your .env file is ignored by your version control system.
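
As an illustration of the first point, the sketch below masks values whose names look sensitive before they are logged. The function name redacted_environment and the list of markers are assumptions made for this example, and the sample values are made up.

```python
import os

# Name fragments that suggest a value is sensitive (an illustrative list, not exhaustive).
SENSITIVE_MARKERS = ("KEY", "SECRET", "PASSWORD", "TOKEN")


def redacted_environment(environ=os.environ) -> dict[str, str]:
    """Return a copy of the environment with sensitive-looking values masked."""
    safe = {}
    for name, value in environ.items():
        if any(marker in name.upper() for marker in SENSITIVE_MARKERS):
            safe[name] = "***"
        else:
            safe[name] = value
    return safe


# Demonstration with a made-up environment instead of the real one.
sample = {
    "API_KEY": "abcdef123456",
    "DB_PASSWORD": "secretpassword",
    "HOME": "/home/username",
}
safe_sample = redacted_environment(sample)
print(safe_sample)
```

A name-based heuristic like this is imperfect. A variable such as DATABASE_URI can embed a password without matching any marker, so the safest default remains not logging the environment at all.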

Environment variables foster consistency

Keeping development and production environments consistent can help prevent bugs that appear when your code behaves differently in different environments. Environment variables can help with this.

By using environment variables for configurations that vary between environments, you can ensure that your code itself remains consistent across all environments. Only the settings in the environment variables change.

This means that you can run your code in a development, testing, or production environment just by setting the appropriate environment variables. If your code works in one environment, it's more likely to work in others.

However, this requires discipline. You should resist the temptation to hardcode configuration data and instead always use environment variables. Also, all members of your team should understand how to set and use environment variables in their development environments. This can be facilitated by using tools like python-dotenv, which simplify the process of managing environment variables.
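
One way to keep that discipline is to read all configuration in a single place. The sketch below shows the idea of identical code receiving different settings per environment; the Settings class and the APP_ENV variable are hypothetical, and the URIs and keys are placeholders.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """All environment-dependent configuration, read in one place."""

    environment: str
    database_uri: str
    api_key: str

    @classmethod
    def from_env(cls, environ=os.environ) -> "Settings":
        return cls(
            environment=environ.get("APP_ENV", "development"),  # optional, with a default
            database_uri=environ["DATABASE_URI"],  # required: raises KeyError if missing
            api_key=environ["API_KEY"],  # required: raises KeyError if missing
        )


# The same code, configured by two different (made-up) environments.
dev = Settings.from_env(
    {"DATABASE_URI": "postgres://localhost:5432/dev_db", "API_KEY": "dev-key"}
)
prod = Settings.from_env(
    {
        "APP_ENV": "production",
        "DATABASE_URI": "postgres://db.internal:5432/prod_db",
        "API_KEY": "prod-key",
    }
)
print(dev.environment, prod.environment)
```

In a real application you would call Settings.from_env() with no arguments so that it reads os.environ; passing a plain dictionary, as done here, is mainly useful for tests.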

Conclusion

We’ve looked at how using environment variables can help manage production applications more securely and parameterize your data engineering pipelines.

If you have any questions or need further clarification, please join the Dagster Slack and ask the community for help. Thank you for reading!

Check out our next article in our Python series which builds on these data engineering concepts and explores type hints in Python.
