Data Engineering Glossary: Package install commands

List of Python packages and install instructions


Package | Description | Install command | Docs link

aiohttp

Asynchronous HTTP Client/Server for asyncio and Python.

pip install aiohttp

docs

beautifulsoup

Pull data out of HTML and XML files.

pip install beautifulsoup4

docs

bz2

Compress and decompress files using the bzip2 compression algorithm.

Bz2 comes packaged with Python. It is unlikely you will need to install it.

docs
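Since bz2 ships with Python, a quick way to confirm it is available is a compress and decompress round trip. The following is a minimal sketch; the sample payload is made up for illustration.

```python
import bz2

# Hypothetical sample payload; any bytes object works.
payload = b"data engineering " * 100

# Compress the bytes, then restore them.
compressed = bz2.compress(payload)
restored = bz2.decompress(compressed)

print(len(payload), len(compressed))  # original size vs. compressed size
print(restored == payload)            # the round trip should be lossless
```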

Cerberus

A lightweight and extensible data validation library that supports type checking, value constraints, and schema-based validation.

pip install Cerberus

docs

Cryptography

Provides various cryptographic services, including encryption, decryption, hashing, and key management.

pip install cryptography

docs

Dask

Parallel computing: allows you to scale Python computations across multiple CPUs or even across a cluster of machines.

python -m pip install dask

docs

dataprep

Data preparation, cleaning, and exploration: provides tools for handling missing data, encoding categorical variables, and visualizing data distributions.

pip install dataprep

docs

dora

Exploratory data analysis and feature engineering: provides tools for data visualization, data profiling, and data cleaning.

pip install dora

docs

DVC

A version control system for data science projects that enables you to track changes to your data and models, and collaborate with others on your project.

brew install dvc or pip install dvc

docs

DataProfiler

Profile and analyze data. Provides statistics and visualizations to help you understand your data and identify potential issues.

pip install DataProfiler

docs

Fastparquet

A Python implementation of the Parquet format, aiming to integrate into Python-based big data workflows.

pip install fastparquet

docs

Fastavro

Python Avro (data serialization and data exchange services for Apache Hadoop), but fast.

pip install fastavro

docs

Functools

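Tools for functional programming: provides higher-order functions for caching, partial application, and reduction.

functools is part of the Python standard library. It is unlikely you will need to install it.

docs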
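functools is available in any standard Python installation, so it can simply be imported. The sketch below shows three commonly used helpers; the function and variable names are illustrative only.

```python
from functools import lru_cache, partial, reduce


@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Fibonacci with memoization, so repeated subproblems are computed once."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


# partial() pre-fills arguments: here, int() with base=2 fixed.
to_int_base2 = partial(int, base=2)

# reduce() folds a sequence into a single value, starting from 0.
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)

print(fib(30), to_int_base2("1010"), total)
```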

GeoPandas

Work with geospatial data: provides tools for reading, writing, and manipulating spatial data in a pandas-like framework.

pip install geopandas

docs

Great Expectations

Data validation, testing, and documentation: enables you to define and enforce expectations about your data.

python -m pip install great_expectations

docs

gzip

Compress and decompress files using the gzip compression algorithm.

gzip is part of the Python 3 standard library, and it's unlikely you will need to install it.

docs
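Because gzip needs no installation, it can be tried directly. This is a minimal in-memory sketch; the CSV text is a made-up sample.

```python
import gzip

# Hypothetical sample data.
text = "id,name\n1,alice\n2,bob\n"

# gzip works on bytes, so encode before compressing and decode after.
raw = text.encode("utf-8")
packed = gzip.compress(raw)
unpacked = gzip.decompress(packed).decode("utf-8")

print(unpacked == text)
```

For files on disk, gzip.open() accepts a path and behaves like the built-in open().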

hashlib

Secure hashing: provides various hash functions such as SHA-256 and SHA-512.

hashlib is part of the Python standard library. It is unlikely you will need to install it.

docs
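In current Python versions hashlib is included in the standard library, so importing it is usually all that is required. A minimal sketch follows; the helper name sha256_hex is hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


digest = sha256_hex(b"hello")
long_digest = hashlib.sha512(b"hello").hexdigest()

print(len(digest), len(long_digest))  # hex digest lengths for SHA-256 and SHA-512
```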

janitor

Clean and transform data with tools for renaming columns, removing duplicate rows, and filling missing values.

pip install pyjanitor

docs

Matplotlib

Data visualization: provides tools for creating various types of charts and plots.

python3 -m pip install -U matplotlib

docs

Natural Language Toolkit (NLTK) library

Natural language processing (NLP): provides tools for tokenization, part-of-speech tagging, and sentiment analysis.

pip install --user -U nltk

docs

NetCDF4

A Python interface for storing multidimensional scientific data (variables) such as temperature, humidity, pressure, wind speed, and direction.

pip install netCDF4

docs

nltk

A leading platform for building Python programs to work with human language data.

pip install nltk

docs

numpy

Scientific computing: provides tools for working with arrays, matrices, and numerical operations.

pip install numpy

docs

OpenCV (cv2)

Computer vision: provides tools for image and video processing, object detection, and feature extraction.

pip install opencv-python

docs

Pandas

Data manipulation and analysis: provides tools for working with tabular data, handling missing values, and performing aggregations.

pip install pandas

docs

Pandas Profiling

**Note: Obsolete** Generate data profiling reports: provides statistics and visualizations to help you understand your data and identify potential issues.

pip install pandas_profiling (Note: this package is obsolete; use ydata-profiling instead.)

docs

Polars

A blazingly fast and memory-efficient Python library for data manipulation and analysis: provides a DataFrame API similar to Pandas but optimized for performance on large datasets.

pip install polars

docs

Prophet

Time series forecasting: provides a simple yet powerful model based on decomposable time series components.

pip install prophet (the package was published as fbprophet before version 1.0)

docs

Psycopg2

The most popular PostgreSQL database adapter for Python.

pip install psycopg2

docs

Pyarrow

The Python API of Apache Arrow. Apache Arrow is a development platform for in-memory analytics.

pip install pyarrow

docs

Pydantic

Data validation and settings management: provides a declarative syntax for defining data models with type hints and validation rules.

pip install pydantic

docs

pymongo

Work with MongoDB: provides tools for connecting to a MongoDB instance, querying data, and performing CRUD operations.

pip install pymongo

docs

PySpark

Apache Spark: provides tools for distributed computing, data processing, and machine learning on large datasets.

pip install pyspark or brew install apache-spark

docs

PyStan

A Python interface to Stan, a probabilistic programming language for Bayesian inference and statistical modeling.

pip install pystan

docs

PyPubSub

A publish-subscribe API to facilitate event-based programming and decoupling an application’s in-memory components.

pip install pypubsub

docs

pysqlite3

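Work with SQLite databases: provides tools for connecting to a database, querying data, and performing CRUD operations.

Python's built-in sqlite3 module covers most uses without any installation. SQLite itself can be downloaded from https://www.sqlite.org/download.html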

docs
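The connect, query, and CRUD workflow described for SQLite can be sketched as follows. This example assumes the standard library's sqlite3 module rather than pysqlite3 itself, and the users table and its rows are made up for illustration.

```python
import sqlite3

# An in-memory database keeps the example free of side effects on disk.
conn = sqlite3.connect(":memory:")

# Create
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# Update and delete, using parameter placeholders instead of string formatting.
conn.execute("UPDATE users SET name = ? WHERE id = ?", ("carol", 2))
conn.execute("DELETE FROM users WHERE id = ?", (1,))

# Read
rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
conn.close()

print(rows)
```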

pywt

Wavelet transforms and signal processing: provides tools for time-frequency analysis, denoising, and compression.

pip install PyWavelets

docs

Re (regex)

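Regular expressions: provides tools for pattern matching and string manipulation based on a specified pattern.

The re module is part of the Python standard library and does not need to be installed. The third-party regex package, an extended alternative, installs with pip install regex.

docs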
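Pattern matching and string manipulation with the standard library's re module might look like the sketch below. The ISO_DATE pattern and the sample log line are illustrative assumptions.

```python
import re

# Hypothetical pattern: ISO-style dates with year, month, and day groups.
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

line = "loaded 2024-01-15, reloaded 2024-02-01"

# findall() returns one tuple of captured groups per match.
dates = ISO_DATE.findall(line)

# sub() replaces every match with the given text.
masked = ISO_DATE.sub("<date>", line)

print(dates)
print(masked)
```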

requests

Elegant mapping of the HTTP protocol onto Python's object-oriented semantics.

pip install requests

docs

SciPy

Scientific and technical computing: provides tools for optimization, interpolation, signal processing, and statistics.

pip install scipy

docs

sklearn

Machine learning tools for classification, regression, clustering, dimensionality reduction, and model selection.

pip install -U scikit-learn

docs

spaCy

Natural Language Processing (NLP) library.

pip install -U spacy

docs

statsmodels

Statistical modeling and analysis tools for regression analysis, time series analysis, and hypothesis testing.

pip install statsmodels

docs

Tensorflow

Machine learning and deep learning tools for building and training neural networks and other machine learning models.

python3 -m pip install tensorflow

docs

Ydata Profiling

Data profiling tools for generating data profiling reports with statistics and visualizations.

pip install ydata-profiling

docs

zipfile

Create, read, and extract ZIP archives, with optional compression of the files they contain.

zipfile is part of the Python 3 standard library, and it's unlikely you will need to install it.

docs
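As with the other built-in modules, zipfile can be used without installing anything. The sketch below writes and reads an archive entirely in memory; the member name rows.csv and its content are made up for illustration.

```python
import io
import zipfile

# An in-memory buffer stands in for a file on disk.
buffer = io.BytesIO()

# Write one member into the archive using DEFLATE compression.
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("rows.csv", "id,name\n1,alice\n")

# Reopen the same buffer for reading and pull the member back out.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    content = archive.read("rows.csv").decode("utf-8")

print(names)
print(content)
```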