
Dagster Data Engineering Glossary:

Data Vectorization

Executing a single operation on multiple data points simultaneously.

Definition of data vectorization

Vectorization refers to improving the performance of operations on data, particularly with large data sets, by executing a single operation on multiple data points simultaneously — a significant characteristic of array programming.

Traditionally, many data management tasks are done in a loop, where each operation is performed sequentially on each data item. This is known as scalar computing. However, this approach can be quite slow when dealing with large amounts of data.

Vectorization, on the other hand, allows for the parallelization of computations. Instead of performing operations one data point at a time, vectorization allows these operations to be performed on whole arrays of data at once. This technique takes advantage of the architecture of modern CPUs and GPUs, which are designed to perform these types of parallel operations more efficiently. This is particularly prominent in libraries such as NumPy in Python, which is optimized for vectorized operations.

For example, let's say we have two large arrays of numbers, and we want to add the corresponding numbers in these arrays together. Without vectorization, we might use a for loop to go through each pair of numbers one by one and add them together. With vectorization, we can add the entire arrays together in one operation, which can result in a significant speedup.

When vectorization should be used...

Here are some conditions where you should consider vectorizing your data:

  1. Handling Large Data Sets: If you're working with large amounts of data, the speedup provided by vectorization can be substantial. The larger the dataset, the more noticeable the speedup usually is.

  2. Performing Mathematical and Statistical Operations: Vectorization is very beneficial when performing mathematical or statistical operations on arrays of data. NumPy, for example, provides vectorized versions of many functions that operate on whole arrays of data at once (a short sketch illustrating this follows the list).

  3. When Code Readability is Important: Vectorized operations can often be written more concisely than their non-vectorized counterparts, leading to more readable code. For example, adding two arrays together with a + b is more readable than using a loop to add corresponding elements together.

  4. In Machine Learning Applications: Many machine learning algorithms require operations on large matrices or arrays. Libraries like NumPy or scikit-learn, which are optimized for vectorized operations, are usually the best tools for these kinds of tasks.
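
To make points 2 and 3 concrete, here is a minimal sketch (with made-up data; the array size is an arbitrary illustrative choice) contrasting a scalar loop with the equivalent vectorized NumPy calls:

import numpy as np

data = np.random.rand(100_000)

# Scalar approach: accumulate a sum of squares one element at a time
total = 0.0
for x in data:
    total += x * x

# Vectorized approach: the same reduction expressed over the whole array at once
total_vec = np.sum(data ** 2)

# Common statistics are likewise single vectorized calls
mean = data.mean()
std = data.std()

print(np.isclose(total, total_vec))  # True, up to floating-point rounding

The vectorized lines are both faster and arguably easier to read than the loop.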

...and when you shouldn't.

Vectorization is not always the best solution. There are situations where it might not be beneficial or even possible to use vectorized operations. This includes cases where the operations on the data cannot be performed in parallel because each operation depends on the result of the previous one. In such cases, traditional loop-based operations might be necessary.
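
For example, a recurrence such as an exponential moving average uses the previous output at every step, so it cannot be rewritten as a single element-wise array expression. Here is a minimal sketch (the data and smoothing factor are made up for illustration):

import numpy as np

values = np.random.rand(1_000)
alpha = 0.1  # smoothing factor, arbitrary illustrative value

# Each output depends on the previously computed output, so this loop
# cannot simply be replaced by one vectorized array expression
ema = np.empty_like(values)
ema[0] = values[0]
for i in range(1, len(values)):
    ema[i] = alpha * values[i] + (1 - alpha) * ema[i - 1]

Some special cases do have vectorized helpers (prefix sums via np.cumsum, for instance), but general recurrences like this one typically stay as loops or require specialized tools.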

An example of data vectorization in Python using NumPy

Let's look at an example of data vectorization in Python using NumPy. In this example, we're going to generate two large arrays of random numbers and then perform some mathematical operations on them.

import numpy as np
import time

# Create two large arrays of random numbers
array_size = 10**6
A = np.random.rand(array_size)
B = np.random.rand(array_size)

# Perform operations using loops (non-vectorized approach)
start_time = time.time()
C = np.empty(array_size, np.float64)
for i in range(array_size):
    C[i] = A[i] + 2*B[i] - A[i]*B[i]
end_time = time.time()

print("Time taken using loops: {:.6f} seconds".format(end_time - start_time))

# Perform the same operations using vectorized approach
start_time = time.time()
C_vectorized = A + 2*B - A*B
end_time = time.time()

print("Time taken using vectorized operations: {:.6f} seconds".format(end_time - start_time))

# Check if the results of the two methods are close enough
print("Are the results close? ", np.allclose(C, C_vectorized))

This script creates two large arrays of random numbers. It then performs some calculations on these arrays using a for loop (the non-vectorized approach) and measures the time taken. Next, it performs the same calculations using vectorized operations and again measures the time. Finally, it checks that the results of the two methods are close enough to be considered equal.

You'll see that the vectorized operations are significantly faster than the loop. This is a simple example, but it demonstrates the power of vectorization when working with large arrays of data. Here are the results on my M1 Mac:

Time taken using loops: 0.412291 seconds
Time taken using vectorized operations: 0.007932 seconds
Are the results close?  True

An example of vectorization in Python using scikit-learn

Next, let's look at a slightly more advanced example of using vectorized operations in machine learning with scikit-learn.

In this example, we'll create a simple classifier for the Iris dataset, which is a built-in dataset in scikit-learn. The Iris dataset contains measurements of 150 iris flowers from three different species.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data  # all four features: sepal length/width and petal length/width
y = iris.target

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (mean=0, variance=1) using a vectorized operation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a logistic regression classifier
clf = LogisticRegression(random_state=42)
clf.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred = clf.predict(X_test_scaled)

# Output the accuracy of the classifier
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))

In this script:

  • We first load the Iris dataset, and split it into a training set and a test set.
  • We then standardize the features to have zero mean and unit variance. This is done using vectorized operations inside the StandardScaler class. The fit_transform method computes the mean and standard deviation of the training set, subtracts the mean from each feature, and then divides by the standard deviation. The transform method applies the same transformation to the test set. (A plain-NumPy sketch of this step, and of the prediction and accuracy steps, follows this list.)
  • We then train a logistic regression classifier on the training data. This involves several vectorized operations, including the computation of the gradient of the loss function.
  • We use the trained classifier to predict the species of each flower in the test set. This again involves vectorized operations, as the prediction for each instance is computed as a weighted sum of its features.
  • Finally, we compute the accuracy of the classifier, which is the proportion of test instances that were correctly classified. This again is a vectorized operation, as it involves comparing two arrays of predictions and true labels.
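
To make the vectorization visible, here is a minimal plain-NumPy sketch of the standardization, prediction, and accuracy steps. This is a simplification for illustration, not scikit-learn's actual implementation, and the small arrays and weights are made up:

import numpy as np

# Tiny stand-ins for X_train / X_test
X_train = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 3.4]])
X_test = np.array([[5.8, 2.7]])

# Standardization: column-wise mean and std are computed over the whole
# array, then broadcast across every row in one operation
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # reuse training statistics, like transform()

# Prediction as a vectorized weighted sum: one matrix product scores every
# instance against every class at once (these weights are invented)
W = np.array([[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]])  # shape (features, classes)
scores = X_test_scaled @ W
predicted_class = scores.argmax(axis=1)

# Accuracy: an element-wise comparison of two arrays, reduced to a proportion
y_test = np.array([0, 1, 2, 1])
y_pred = np.array([0, 1, 1, 1])
accuracy = np.mean(y_test == y_pred)  # 0.75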

The final print() command will output the accuracy of the logistic regression model on the test dataset.

The accuracy is calculated as the proportion of correct predictions out of total predictions. It ranges from 0 to 1, with 1 indicating that every prediction made by the model was correct and 0 indicating that every prediction was incorrect. For a well-tuned logistic regression model on the Iris dataset, which is relatively simple and well-behaved, we might expect an accuracy in the 0.90 to 1.00 range.

Keep in mind that train_test_split shuffles the data before splitting. Because random_state=42 is fixed in this example, the split (and therefore the accuracy) is reproducible; if you change or omit random_state, the result may differ slightly from run to run.

Here's what the output might look like:

Accuracy: 0.97

This means that the model correctly predicted the species of the iris flowers for 97% of the instances in the test set, which indicates that the model is performing well on this particular dataset. Different test_size or random_state values can shift the accuracy slightly, sometimes reaching 1.00 on this small, easy dataset.

Vectorizing vs. vector databases

The term "vector" as used in vector databases is related to, but not exactly the same as, the concept of "vectorization" in programming and data science.

Vector databases (also known as "vector search engines" or, more broadly, "vector stores") are a type of database, or a component within a larger system, optimized for performing operations on vectors in high-dimensional spaces. These vectors often represent complex, multidimensional data, such as images, sound, and text, in a form that machine learning models can work with. Such databases are designed to efficiently perform nearest neighbor search in high-dimensional spaces, a common operation in many machine learning tasks.

So in the context of vector databases, a "vector" is a representation of some piece of data in a high-dimensional space, as opposed to our definition above (performing operations on entire arrays of data at once).

So while both concepts involve the manipulation of arrays or lists of numbers, they're used in different contexts and for different purposes.
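
The two ideas do meet in practice: a brute-force nearest-neighbor search over stored vectors is itself naturally written as a vectorized computation. Here is a minimal NumPy sketch with made-up embeddings (production vector databases use approximate indexes rather than this exhaustive scan):

import numpy as np

# A toy "store" of 10,000 embedding vectors in a 128-dimensional space
vectors = np.random.rand(10_000, 128)
query = np.random.rand(128)

# Squared Euclidean distance from the query to every stored vector,
# computed with one broadcast expression instead of a Python loop
distances = np.sum((vectors - query) ** 2, axis=1)

# Index of the closest stored vector
nearest = np.argmin(distances)
print(nearest, distances[nearest])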


Other data engineering terms related to
Data Processing: