Data Engineering Glossary: Package install commands

List of Python packages and install instructions

Package | Description | Install command | Docs link


Asynchronous HTTP Client/Server for asyncio and Python.

pip install aiohttp


Pull data out of HTML and XML files.

pip install beautifulsoup4


Compress and decompress files using the bzip2 compression algorithm.

Bz2 comes packaged with Python. It is unlikely you will need to install it.
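
Because bz2 ships with Python, it can be tried without installing anything. The following is a minimal sketch of a one-shot compress and decompress round trip; the sample payload is made up for illustration.

```python
import bz2

# Hypothetical sample payload; any bytes object works.
original = b"data engineering " * 100

compressed = bz2.compress(original)    # one-shot compression
restored = bz2.decompress(compressed)  # one-shot decompression

print(len(original), len(compressed))  # repetitive input compresses well
print(restored == original)
```

For files on disk, `bz2.open` offers a file-like interface in place of the one-shot functions.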



A lightweight and extensible data validation library that supports type checking, value constraints, and schema-based validation.

pip install Cerberus


Provides various cryptographic services, including encryption, decryption, hashing, and key management.

pip install cryptography


Parallel computing: allows you to scale Python computations across multiple CPUs or even across a cluster of machines.

python -m pip install dask


Data preparation, cleaning, and exploration: provides tools for handling missing data, encoding categorical variables, and visualizing data distributions.

pip install dataprep


Exploratory data analysis and feature engineering: provides tools for data visualization, data profiling, and data cleaning.

pip install dora


A version control system for data science projects that enables you to track changes to your data and models, and collaborate with others on your project.

brew install dvc - or - pip install dvc


Profile and analyze data. Provides statistics and visualizations to help you understand your data and identify potential issues.

pip install DataProfiler


A Python implementation of the Parquet format, aiming to integrate into Python-based big data workflows.

pip install fastparquet


Python Avro (data serialization and data exchange services for Apache Hadoop), but fast.

pip install fastavro


Fast tools for functional programming.

functools is part of the Python standard library, so no installation is needed.
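
As an illustration of the functional-programming helpers in the standard-library functools module, here is a small sketch using `lru_cache`, `partial`, and `reduce`. The functions `fib` and `scale` are hypothetical examples, not part of any library.

```python
from functools import lru_cache, partial, reduce


@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Naive recursive Fibonacci; the cache makes repeated calls cheap."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def scale(value: float, factor: float) -> float:
    """Multiply a value by a factor."""
    return value * factor


# partial() fixes the 'factor' argument, producing a one-argument function.
double = partial(scale, factor=2)

# reduce() folds a sequence into a single value, here a running sum.
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)

print(fib(30), double(21), total)
```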


Work with geospatial data: provides tools for reading, writing, and manipulating spatial data in a pandas-like framework.

pip install geopandas

Great Expectations

Data validation, testing, and documentation: enables you to define and enforce expectations about your data.

python -m pip install great_expectations


Compress and decompress files using the gzip compression algorithm.

gzip is part of the Python 3 standard library, so it is unlikely you will need to install it.
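
A minimal sketch of writing and reading a gzip-compressed text file with the standard-library gzip module. The file name and CSV rows are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import gzip
import tempfile
from pathlib import Path

# Hypothetical CSV content used only for this illustration.
lines = ["id,value", "1,alpha", "2,beta"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.csv.gz"

    # "wt" opens the compressed file in text mode for writing.
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write("\n".join(lines))

    # "rt" transparently decompresses while reading text.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        round_trip = fh.read().splitlines()

print(round_trip)
```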



Secure hashing: provides various hash functions such as SHA-256 and SHA-512.

hashlib is part of the Python standard library, so no installation is needed.
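
The hashlib module is included in the Python standard library. The sketch below computes SHA-256 and SHA-512 digests and shows that hashing in chunks, as one might do for a large file, gives the same result as hashing all at once. The payload is a made-up example.

```python
import hashlib

# Hypothetical payload split into two chunks.
chunks = [b"hello ", b"world"]
payload = b"".join(chunks)

# One-shot hashing of the full payload.
sha256_hex = hashlib.sha256(payload).hexdigest()
sha512_hex = hashlib.sha512(payload).hexdigest()

# Incremental hashing: feed the data piece by piece with update().
incremental = hashlib.sha256()
for chunk in chunks:
    incremental.update(chunk)
incremental_hex = incremental.hexdigest()

print(sha256_hex)
print(incremental_hex == sha256_hex)
```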


Clean and transform data with tools for renaming columns, removing duplicate rows, and filling missing values.

pip install pyjanitor


Data visualization: provides tools for creating various types of charts and plots.

python3 -m pip install -U matplotlib

Natural Language Toolkit (NLTK) library

Natural language processing (NLP): provides tools for tokenization, part-of-speech tagging, and sentiment analysis.

pip install --user -U nltk


A Python interface for storing multidimensional scientific data (variables) such as temperature, humidity, pressure, wind speed, and direction.

pip install netCDF4


A leading platform for building Python programs to work with human language data.

pip install nltk


Scientific computing: provides tools for working with arrays, matrices, and numerical operations.

pip install numpy

OpenCV (cv2)

Computer vision: provides tools for image and video processing, object detection, and feature extraction.

pip install opencv-python


Data manipulation and analysis: provides tools for working with tabular data, handling missing values, and performing aggregations.

pip install pandas

Pandas Profiling

**Note: Obsolete** Generate data profiling reports: provides statistics and visualizations to help you understand your data and identify potential issues.

pip install pandas_profiling (Note this package is obsolete and you should use ydata-profiling instead.)



A blazingly fast and memory-efficient Python library for data manipulation and analysis: provides a DataFrame API similar to Pandas but optimized for performance on large datasets.

pip install polars


Time series forecasting: provides a simple yet powerful model based on decomposable time series components.

pip install prophet (the package was published as fbprophet before version 1.0)


The most popular PostgreSQL database adapter for Python.

pip install psycopg2


The Python API of Apache Arrow. Apache Arrow is a development platform for in-memory analytics.

pip install pyarrow


Data validation and settings management: provides a declarative syntax for defining data models with type hints and validation rules.

pip install pydantic


Work with MongoDB: provides tools for connecting to a MongoDB instance, querying data, and performing CRUD operations.

pip install pymongo


Apache Spark: provides tools for distributed computing, data processing, and machine learning on large datasets.

pip install pyspark - or - brew install apache-spark


A Python interface to Stan, a probabilistic programming language for Bayesian inference and statistical modeling.

pip install pystan


A publish-subscribe API to facilitate event-based programming and decoupling an application’s in-memory components.

pip install pypubsub


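Work with SQLite databases: provides tools for connecting to a database, querying data, and performing CRUD operations.

The sqlite3 module is part of the Python standard library, so it is unlikely you will need to install it.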
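
The connect, query, and CRUD workflow described above can be sketched with the standard-library sqlite3 module and an in-memory database. The `events` table and its rows are hypothetical.

```python
import sqlite3

# ":memory:" creates a throwaway database that lives only in RAM.
conn = sqlite3.connect(":memory:")

# Create: define a table and insert rows with parameterized queries.
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO events (name) VALUES (?)", [("load",), ("transform",)])

# Update and delete, again using placeholders rather than string formatting.
conn.execute("UPDATE events SET name = ? WHERE name = ?", ("extract", "load"))
conn.execute("DELETE FROM events WHERE name = ?", ("transform",))
conn.commit()

# Read: fetch the remaining rows as a list of tuples.
rows = conn.execute("SELECT id, name FROM events ORDER BY id").fetchall()
conn.close()

print(rows)
```

Using `?` placeholders keeps values separate from the SQL text, which avoids quoting mistakes and SQL injection.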



Wavelet transforms and signal processing: provides tools for time-frequency analysis, denoising, and compression.

pip install PyWavelets

Re (regex)

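Regular expressions: provides tools for pattern matching and string manipulation based on a specified pattern.

The re module is part of the Python standard library and needs no installation. The separate third-party regex package can be installed with: pip install regex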
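
A short sketch of pattern matching and string manipulation with the standard-library re module. The log-line format and the `LOG_LINE` pattern are assumptions made up for this example.

```python
import re

# Hypothetical log format: "<date> <LEVEL> <message>".
LOG_LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.+)"
)

line = "2024-01-15 ERROR disk quota exceeded"

# match() anchors at the start of the string; named groups become dict keys.
match = LOG_LINE.match(line)
fields = match.groupdict() if match else {}

# sub() replaces every run of whitespace with a single space.
normalized = re.sub(r"\s+", " ", "too    many   spaces").strip()

print(fields)
print(normalized)
```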


Elegant mapping of the HTTP protocol onto Python's object-oriented semantics.

pip install requests


Scientific and technical computing: provides tools for optimization, interpolation, signal processing, and statistics.

pip install scipy


Machine learning tools for classification, regression, clustering, dimensionality reduction, and model selection.

pip install -U scikit-learn


Natural Language Processing (NLP) library.

pip install -U spacy


Statistical modeling and analysis tools for regression analysis, time series analysis, and hypothesis testing.

pip install statsmodels


Machine learning and deep learning tools for building and training neural networks and other machine learning models.

python3 -m pip install tensorflow

Ydata Profiling

Data profiling tools for generating data profiling reports with statistics and visualizations.

pip install ydata-profiling


Compress and decompress files using the ZIP compression algorithm.

The zipfile module is part of the Python 3 standard library, so it is unlikely you will need to install it.
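
A minimal sketch of creating and reading a ZIP archive with the standard-library zipfile module. The archive is built in memory with `io.BytesIO`, and the member names and contents are hypothetical.

```python
import io
import zipfile

# An in-memory buffer stands in for a file on disk.
buffer = io.BytesIO()

# Write two members; ZIP_DEFLATED enables compression (the default is stored).
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("report.txt", "rows processed: 42\n")
    archive.writestr("data/values.csv", "id,value\n1,alpha\n")

# Reopen the same buffer for reading and inspect its contents.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    report = archive.read("report.txt").decode("utf-8")

print(names)
print(report)
```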