Data splitting definition:
Data splitting is the process of dividing a dataset into training, validation, and testing sets in preparation for the training and testing of a machine learning model.
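The worked example in the next section performs a two-way train/test split only. When a validation set is also needed, one common pattern is to call scikit-learn's train_test_split twice. The following is a minimal sketch of that idea; the 60/20/20 proportions and the intermediate X_temp/y_temp names are arbitrary choices for illustration, not requirements:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split: hold out 20% of the data as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Second split: carve a validation set out of the remaining 80%.
# 0.25 of that 80% equals 20% of the original data, giving a
# 60/20/20 train/validation/test split overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape)  # (90, 4) (30, 4) (30, 4)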
Example of data splitting for machine learning, using Python
Here is a simple example of data splitting using the train_test_split function from the sklearn library. We will use the iris dataset from the sklearn.datasets module for simplicity. Please note that you may need to have the necessary Python libraries installed in your Python environment to run this code.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
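# Print the shapes of the resulting splits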
print(f"Training data size (X_train): {X_train.shape}")
print(f"Test data size (X_test): {X_test.shape}")
print(f"Training labels size (y_train): {y_train.shape}")
print(f"Test labels size (y_test): {y_test.shape}")
In the above code:
We first load the iris dataset, which is a simple, pre-cleaned classification dataset. X contains the feature values and y contains the corresponding labels.
We then use the train_test_split function from sklearn.model_selection to split our data. The test_size argument determines what fraction of the original data is used for the test set. In this case, we're using 20% of the data for testing and the remaining 80% for training.
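As an aside, test_size also accepts an absolute instance count rather than a fraction; this is standard train_test_split behavior, and the sizes in this sketch are chosen arbitrarily for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# A float test_size is read as a fraction of the dataset...
_, X_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_test.shape)  # (30, 4): 20% of 150 instances

# ...while an integer is read as an absolute number of test instances.
_, X_test, _, _ = train_test_split(X, y, test_size=45, random_state=42)
print(X_test.shape)  # (45, 4)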
The train_test_split function shuffles the dataset and then splits it. Shuffling is important to ensure that our model doesn't learn patterns from the order of the data. The function returns four values: the training data, the testing data, the training labels, and the testing labels.
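To see why shuffling matters here, consider this brief sketch using the shuffle parameter of train_test_split to disable it. Because the iris data happens to be sorted by class, the unshuffled test set degenerates to a single class:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# With shuffle=False, the last 20% of rows become the test set in their
# original order. The iris data is sorted by class, so the test set ends
# up holding only the final class, a concrete illustration of the problem.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print(np.unique(y_test))  # [2]: only one of the three classes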
The random_state argument is used for initializing the internal random number generator, which will decide the splitting of data into train and test indices (provided shuffle=True, which is the default). This is important because it ensures that the split you generate is reproducible. In other words, every time you run your code you will get the exact same split of data, which can be important for debugging and comparison purposes.
The value you provide to random_state can be any integer (we use 42, but that is arbitrary). The specific value is not inherently important; what matters is that using the same random_state value across different runs will generate the same random splits, given that the other parameters to train_test_split are the same.
If you do not specify the random_state parameter, or if you set it to None, then each time you run the code a different split will be generated, based on NumPy's random number state at the time train_test_split is called.
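To make this concrete, here is a small check; the array comparison below is just an illustration of the reproducibility behavior:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two splits with the same random_state are identical.
X_train_a, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_b, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_train_a, X_train_b))  # True

# Splits with random_state=None use a fresh random state each time,
# so repeated calls will almost always differ.
X_train_c, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=None)
X_train_d, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=None)
print(np.array_equal(X_train_c, X_train_d))  # False (with overwhelming probability)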
The final four lines of the original program print the following:
Training data size (X_train): (120, 4)
Test data size (X_test): (30, 4)
Training labels size (y_train): (120,)
Test labels size (y_test): (30,)
You can now use X_train and y_train to train your model, and X_test and y_test to evaluate its performance.
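For instance, here is one way to close the loop with a simple classifier; the choice of LogisticRegression is arbitrary, and any scikit-learn classifier with fit and score methods would work the same way:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on the training split only; the test split stays unseen until evaluation.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() returns mean accuracy on the held-out test data.
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")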
In the above Python script, y_test contains the labels for the test set. These are the true values that a machine learning model will aim to predict accurately.
The expression y_test.shape gives you the dimensions of this array. Since y_test is a one-dimensional array, the shape will be expressed as (n,), where n is the number of instances (or rows) in the test set.
When we performed the train/test split with test_size=0.2, we reserved 20% of the data for the test set. The iris dataset contains 150 instances, so y_test contains 20% of 150, or 30 instances.
These labels will be used to evaluate the performance of a machine learning model. Once the model has made predictions for the X_test data, these predictions can be compared to the true labels in y_test to calculate metrics such as accuracy, precision, and recall.
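As a sketch of that final step: iris is a three-class problem, so precision and recall need an averaging strategy, and 'macro' below is one arbitrary choice among the options scikit-learn offers:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare predictions against the true test labels. 'macro' averages
# the per-class precision and recall scores without class weighting.
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, average='macro'):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred, average='macro'):.3f}")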