
Cosine Similarity

A measure of similarity between two entities, used in text analysis, natural language processing, and other areas.

Cosine Similarity definition:

Cosine similarity is a metric used to measure how similar two entities (e.g., documents, vectors, data points) are irrespective of their size.

In the context of data engineering, it's primarily used in text analysis and natural language processing, but also in other areas like recommendation systems. Cosine similarity is particularly useful because it is independent of the magnitude of the vectors. This makes it a good choice for cases where the magnitudes of the vectors are not relevant or are misleading. For example, in text analysis, a longer document might have higher term frequencies, but that doesn't necessarily mean it's more "similar" to another document compared to a shorter one.
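A quick way to see this magnitude independence is to compare a document's term-count vector against the vector of that same document repeated twice. Here is a minimal sketch with made-up counts, computing the cosine directly with NumPy:

import numpy as np

# Term-count vector for a short document, and the same document repeated twice.
# Doubling the document doubles every count but leaves the direction unchanged.
doc = np.array([2, 1, 0, 3])
doc_doubled = doc * 2

cosine = np.dot(doc, doc_doubled) / (np.linalg.norm(doc) * np.linalg.norm(doc_doubled))
print(cosine)  # ~1.0: identical direction despite different magnitudes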

How cosine similarity works:

Here's a breakdown of how cosine similarity works:

Vector Representation of Data: The first step in applying cosine similarity is to represent the data as vectors. In text analysis, this could mean representing documents or sentences as vectors of term frequencies (TF) or TF-IDF (Term Frequency-Inverse Document Frequency) scores. Each dimension of the vector represents a unique term (word) from the corpus (collection of texts). In other contexts, the data points might be represented in different ways, but the key is that they need to be converted into vectors.
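As a minimal sketch of this step (using made-up documents), scikit-learn's TfidfVectorizer turns a list of texts into a matrix with one row per document and one column per unique term:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["data pipelines move data", "pipelines orchestrate tasks"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary: one dimension per term
print(matrix.toarray())                    # each row is one document's TF-IDF vector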

Calculating the Cosine of the Angle: Once you have two vectors, the cosine similarity is calculated by taking the dot product of these vectors and dividing by the product of their magnitudes (or lengths). This is equivalent to finding the cosine of the angle between the two vectors in a multi-dimensional space. The formula is as follows:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

where A and B are vectors, A · B is the dot product, and ||A|| and ||B|| are the magnitudes (lengths) of the vectors A and B.
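The formula translates directly into a few lines of NumPy. This is a minimal sketch with hand-picked vectors:

import numpy as np

# Dot product of the vectors divided by the product of their magnitudes
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
print(cosine(a, b))  # ~0.5: the vectors share one of their two nonzero terms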

Interpreting the Results: The cosine similarity will always give a value between -1 and 1, where:

  • 1 indicates that the two vectors are identical (they point in the same direction).
  • 0 indicates that the two vectors are orthogonal (unrelated).
  • -1 indicates that the two vectors are diametrically opposed.

Note that in the context of text represented as TF-IDF vectors, cosine similarity ranges from 0 to 1 rather than -1 to 1, because TF-IDF vectors are non-negative.
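To make these three cases concrete, here is a small sketch (reusing the same cosine helper as above) with hand-picked 2-D vectors:

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 2.0])
print(cosine(v, 3 * v))                  # ~1.0: same direction, larger magnitude
print(cosine(v, np.array([-2.0, 1.0])))  # 0.0: orthogonal vectors
print(cosine(v, -v))                     # ~-1.0: diametrically opposed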

An example of cosine similarity in Python:

As an example of cosine similarity in Python, let's consider a scenario where we have multiple documents and we want to find how similar they are to a query. This kind of analysis is typical in information retrieval systems or recommendation engines.

Please note that to run this code you need the necessary Python libraries installed in your environment, namely scikit-learn (sklearn), numpy, and nltk.

Also, the code uses NLTK's stopwords and tokenizer, so you'll need to download the necessary NLTK data using nltk.download() as we do in the code below.

In this example, we'll demonstrate:

  1. Preprocessing of Text: Including tokenization, removal of stopwords, and stemming/lemmatization.
  2. Vectorization of Text: Using TF-IDF vectors.
  3. Cosine Similarity Calculation: Between a query document and a set of other documents.
  4. Ranking Documents: Based on their similarity to the query.

Here's the Python example:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Define a basic preprocessor
def preprocess(document):
    # Tokenize and lowercase
    tokens = word_tokenize(document.lower())
    # Remove stopwords and stem each remaining token
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    return ' '.join(stemmer.stem(word) for word in tokens if word not in stop_words)

# Sample "documents" and a query document
documents = [
    "Dagster helps data engineers tame complexity. Elevate your data pipelines with software-defined assets, first-class testing, and deep integration with the modern data stack.",
    "Dagster is a cloud-native data pipeline orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.",
    "Declare—as Python functions—the data assets that you want to build. Dagster then helps you run your functions at the right time and keep your assets up-to-date.",
    "Dagster is an open-source, cloud-native data orchestration engine to build, deploy, and manage data pipelines. It is inspired heavily by Airflow and started as an open-source library for building ETL/ELT processes and ML pipelines."
]
query = ["Dagster is a cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability."]

# Preprocess the documents and the query
processed_docs = [preprocess(doc) for doc in documents]
processed_query = preprocess(query[0])

# Vectorize the text
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(processed_docs + [processed_query])

# Calculate Cosine Similarity
cosine_similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])

# Display the similarity scores
print("Cosine Similarity Scores:", cosine_similarities)

# Optionally, rank the documents based on similarity
ranking = np.argsort(cosine_similarities[0])[::-1]
print("Ranking of documents based on similarity to query:", ranking)

In this code:

  • The preprocess function tokenizes the documents, removes stopwords, and applies stemming. This is a common preprocessing step in text analysis.
  • TfidfVectorizer is used to convert the preprocessed text into TF-IDF vectors.
  • Cosine similarity is calculated between the query and each of the documents.
  • The documents are then ranked based on their similarity to the query.

Here is the output of the code above, indicating that the second document in our array (index 1) is the closest to the reference 'query' document:

Cosine Similarity Scores: [[0.09165726 0.96072959 0.01847582 0.07214393]]
Ranking of documents based on similarity to query: [1 0 3 2]
