Data splitting definition:
Data splitting is the process of dividing a dataset into training, validation, and testing sets in preparation for the training and testing of a machine learning model.
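The worked example in the next section performs a two-way train/test split only. When a validation set is also needed, one common pattern is to call scikit-learn's train_test_split twice. The following is a minimal sketch of that idea; the 60/20/20 proportions and the intermediate X_temp/y_temp names are arbitrary choices for illustration, not requirements:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split: hold out 20% of the data as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Second split: carve a validation set out of the remaining 80%.
# 0.25 of that 80% equals 20% of the original data, giving a
# 60/20/20 train/validation/test split overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape)  # (90, 4) (30, 4) (30, 4)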
Example of data splitting for machine learning, using Python
Here is a simple example of data splitting using the train_test_split function from the sklearn library. We will use the iris dataset from the sklearn.datasets module for simplicity. Please note that you may need to have the necessary Python libraries installed in your Python environment to run this code.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
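# Print the shapes of the resulting splits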
print(f"Training data size (X_train): {X_train.shape}")
print(f"Test data size (X_test): {X_test.shape}")
print(f"Training labels size (y_train): {y_train.shape}")
print(f"Test labels size (y_test): {y_test.shape}")
In the above code:
We first load the iris dataset, which is a simple, pre-cleaned classification dataset. X contains the feature values and y contains the corresponding labels.
We then use the train_test_split function from sklearn.model_selection to split our data. The test_size argument determines what fraction of the original data is used for the test set. In this case, we're using 20% of the data for testing and the remaining 80% for training.
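As an aside, test_size also accepts an absolute instance count rather than a fraction; this is standard train_test_split behavior, and the sizes in this sketch are chosen arbitrarily for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# A float test_size is read as a fraction of the dataset...
_, X_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_test.shape)  # (30, 4): 20% of 150 instances

# ...while an integer is read as an absolute number of test instances.
_, X_test, _, _ = train_test_split(X, y, test_size=45, random_state=42)
print(X_test.shape)  # (45, 4)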
The train_test_split function shuffles the dataset and then splits it. Shuffling is important to ensure that our model doesn't learn patterns from the order of the data. The function returns four values: the training data, the testing data, the training labels, and the testing labels.
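To see why shuffling matters here, consider this brief sketch using the shuffle parameter of train_test_split to disable it. Because the iris data happens to be sorted by class, the unshuffled test set degenerates to a single class:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# With shuffle=False, the last 20% of rows become the test set in their
# original order. The iris data is sorted by class, so the test set ends
# up holding only the final class, a concrete illustration of the problem.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print(np.unique(y_test))  # [2]: only one of the three classes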
The random_state argument is used for initializing the internal random number generator, which will decide the splitting of data into train and test indices (provided shuffle=True, which is the default). This is important because it ensures that the split you generate is reproducible. In other words, every time you run your code you will get the exact same split of data, which can be important for debugging and comparison purposes.
The value you provide to random_state can be any integer (we use 42, but that is arbitrary). The specific value is not inherently important; what matters is that using the same random_state value across different runs will generate the same random splits, given that the other parameters to train_test_split are the same.
If you do not specify the random_state parameter, or if you set it to None, then each time you run the code a different split will be generated, based on NumPy's random number state at the time train_test_split is called.
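To make this concrete, here is a small check; the array comparison below is just an illustration of the reproducibility behavior:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two splits with the same random_state are identical.
X_train_a, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_b, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_train_a, X_train_b))  # True

# Splits with random_state=None use a fresh random state each time,
# so repeated calls will almost always differ.
X_train_c, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=None)
X_train_d, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=None)
print(np.array_equal(X_train_c, X_train_d))  # False (with overwhelming probability)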
The final four lines of the original program print the following:
Training data size (X_train): (120, 4)
Test data size (X_test): (30, 4)
Training labels size (y_train): (120,)
Test labels size (y_test): (30,)
You can now use X_train and y_train to train your model, and X_test and y_test to evaluate its performance.
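For instance, here is one way to close the loop with a simple classifier; the choice of LogisticRegression is arbitrary, and any scikit-learn classifier with fit and score methods would work the same way:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on the training split only; the test split stays unseen until evaluation.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() returns mean accuracy on the held-out test data.
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")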
In the above Python script, y_test contains the labels for the test set. These are the true values that a machine learning model will aim to predict accurately.
The expression y_test.shape gives you the dimensions of this array. Since y_test is a one-dimensional array, the shape will be expressed as (n,), where n is the number of instances (or rows) in the test set.
When we performed the train/test split with test_size=0.2, we reserved 20% of the data for the test set. The iris dataset contains 150 instances, so y_test contains 20% of 150, or 30 instances.
These labels will be used to evaluate the performance of a machine learning model. Once the model has made predictions for the X_test data, these predictions can be compared to the true labels in y_test to calculate metrics such as accuracy, precision, and recall.
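As a sketch of that final step: iris is a three-class problem, so precision and recall need an averaging strategy, and 'macro' below is one arbitrary choice among the options scikit-learn offers:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare predictions against the true test labels. 'macro' averages
# the per-class precision and recall scores without class weighting.
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, average='macro'):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred, average='macro'):.3f}")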