Using PyTorch DataLoaders with Datasets

Today, I will demonstrate how to use PyTorch's Dataset and DataLoader classes to feed data into PyTorch models. If you are not familiar with Dataset classes, please take a look at my previous article here. I recommend going through that article first, as I will be using the same movie review dataset here.

DataLoaders

When I started using datasets and dataloaders, I could not make a distinction between the two. If I already have a dataset object, can't I get all my data from the dataset class? Why do I need additional code? Well, for starters, a dataset only produces one sample at a time and has to be indexed. A dataloader, on the other hand, controls how those samples are loaded: maybe you want to sample randomly, load samples in batches of a certain size, or even oversample your original data. Dataloaders make all of this very easy!
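
To make the distinction concrete, here is a minimal sketch of what a dataset alone gives you, assuming (as with the movie review dataset from my previous article) that each sample is a dictionary with 'text' and 'label' keys:

# Indexing returns a single sample; there is no built-in batching
sample = dataset[0]  # equivalent to dataset.__getitem__(0)
print(sample['text'], sample['label'])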

Let's see this with an example:

Loading batched samples without using DataLoaders

The code below shows how you can use just the dataset to load data in batches of 16. Two loops? Just to load a batch?? Siddhi! You must be kidding me! Show me something easy! Right? Let me also tell you that the following code will not load the last batch if the sample size is not exactly divisible by the batch size.

batch_size = 16

# Calculate the total number of batches (integer division silently
# drops the last partial batch)
total_batches = len(dataset) // batch_size

# Iterate over batches
for batch_idx in range(total_batches):
    # Calculate the start and end index for the current batch
    start_idx = batch_idx * batch_size
    end_idx = (batch_idx + 1) * batch_size

    # Get the data for the current batch
    batch_data = []
    for i in range(start_idx, end_idx):
        batch_data.append(dataset[i])  # same as dataset.__getitem__(i)
    print(batch_data)
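
And if you do want that last partial batch, the manual version needs even more bookkeeping. Here is a rough sketch:

import math

batch_size = 16

# Round up so the final, smaller batch is included
total_batches = math.ceil(len(dataset) / batch_size)

for batch_idx in range(total_batches):
    start_idx = batch_idx * batch_size
    # Clamp the end index so we never read past the end of the dataset
    end_idx = min((batch_idx + 1) * batch_size, len(dataset))
    batch_data = [dataset[i] for i in range(start_idx, end_idx)]
    print(batch_data)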

Let's see how this can be solved by using the DataLoader class.

Loading using DataLoaders

from torch.utils.data import DataLoader
dataloader = DataLoader(dataset, batch_size=16, shuffle=True, drop_last=True)

for batch in dataloader:
    data, target = batch['text'], batch['label']
    print(data)
    print(target)

Using dataloaders in PyTorch can save you a lot of headache, because a lot of the details are abstracted away. For example, you do not have to worry about whether the length of the dataset is exactly divisible by the batch size (passing drop_last=True simply drops the incomplete final batch). You can also reshuffle the dataset on every epoch by passing shuffle=True.
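
The oversampling I mentioned earlier is just another dataloader option. As a rough sketch, assuming a binary label where class 1 is the minority class, you can pass a WeightedRandomSampler (sampler and shuffle are mutually exclusive, so shuffle is omitted here):

from torch.utils.data import DataLoader, WeightedRandomSampler

# Hypothetical weights: draw minority-class (label == 1) samples twice as often
weights = [2.0 if dataset[i]['label'] == 1 else 1.0 for i in range(len(dataset))]

sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
dataloader = DataLoader(dataset, batch_size=16, sampler=sampler, drop_last=True)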

Splitting the datasets

Since we have already implemented the dataset class, it becomes a breeze to split a dataset into training, validation, and testing sets. We can do this using torch.utils.data.random_split (passing fractions instead of absolute lengths requires PyTorch 1.13 or newer), and then wrap each split in a dataloader exactly as we did above.

import torch

train_set, val_set, test_set = \
    torch.utils.data.random_split(dataset, [0.7, 0.2, 0.1])
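
If you want the split to be reproducible across runs, random_split also accepts a seeded generator. A minimal sketch (the seed value 42 is arbitrary):

generator = torch.Generator().manual_seed(42)
train_set, val_set, test_set = torch.utils.data.random_split(
    dataset, [0.7, 0.2, 0.1], generator=generator)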

train_dataloader = \
    DataLoader(train_set, batch_size=16, shuffle=True, drop_last=True)

for batch in train_dataloader:
    data, target = batch['text'], batch['label']
    print(data)
    print(target)
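
The validation and test splits get their own dataloaders in exactly the same way. One sketch of a common choice: for evaluation you typically turn shuffling off and keep every sample, so shuffle and drop_last are left at their defaults (False):

val_dataloader = DataLoader(val_set, batch_size=16)
test_dataloader = DataLoader(test_set, batch_size=16)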