Skip to main content

Command Palette

Search for a command to run...

Multithreading in python.

Can this thing go any faster?

Published
3 min read
S

Hi there!

I'm a machine learning, python guy. Reach out to me for collaboration and stuff! :D

I could go on and on about multi-threading, but let’s start with this—it can make some of your tasks way faster. Instead of diving into the usual ‘what’ and ‘how’ of multi-threading, I want to focus on the ‘why’ and most importantly, ‘where’ you should use it. And, I intend to go through a real life example of multi-threading. And not just.an example with print statements to check the speed improvement. I bet you have seen that a lot.

So, you’ve written a slow piece of code, and now you’re wondering if multithreading can speed it up. Before diving in, ask yourself two key questions: Is the task sequential? And is it CPU-bound or I/O-bound? These will determine whether multithreading is the right solution—or if you should consider an alternative like multiprocessing or asynchronous programming.

If your task is sequential (each step depends on previous tasks), then multithreading is not the right solution. Threads work when you have unrelated tasks that can run in parallel. For example, like downloading multiple files. If your task is /O bound. An I/O-bound task is one where the program spends most of its time waiting for input/output operations rather than using the CPU. For example, waiting for an file read/write, database read/write, etc. If you have a program that is heavily CPU bound and/or needs to run in sequential, multithreading may not be for you.

Example

Let’s jump straight into a simple example of multithreading. As a rule of thumb, whenever I see a list of objects that can be processed independently—without them needing to communicate with each other—I immediately think about multithreading. For this example, I’ll show you how to download images using a list of links. I found a GitHub repository that contains URLs of dog images. Instead of downloading them one by one, we’ll use multithreading to speed up the process. Here is the link to the file in github.

Import necessary libraries

We use ThreadPoolExecutor from concurrent.futures. This provides a higher level abstraction and keeps the code clean.

import os
import requests
import time
from concurrent.futures import ThreadPoolExecutor

Getting the list of urls

Using the github link, we can download the links of images. I’ve used request library to download the text file from github. We will use the same library to download images.

url = "https://raw.githubusercontent.com/iblh/not-cat/refs/heads/master/urls/not-cat/dog-urls.txt"
response = requests.get(url)
if response.status_code == 200:
    image_url_list = response.text.split("\n")
    image_url_list = [url for url in image_url_list if url]

else:
    print(f"Failed to fetch file. Status code: {response.status_code}")

Downloading the image

def download_image_from_url(url):
    try:
        response = requests.get(url)
        print(f"Downloading image from {url}")
        if response.status_code == 200:
            with open(os.path.join("dog_image", url.split("/")[-1]), "wb") as f:
                f.write(response.content)
        else:
            print(
                f"Failed to download image from {url}. Status code: {response.status_code}"
            )
    except Exception as e:
        print(f"Failed to download image from {url}. Exception: {e}")

This is a simple function that downloads the image file from a url, and writes to dog_image folder.

Without multithreading, you can simply loop through the list.

start_time = time.time()
for url in image_url_list:
    download_image_from_url(url)
print(f"Time taken without multithreading: {time.time() - start_time}")

Now, to use multithreading, it is a very simple change:

max_workers = os.cpu_count()
print(f"Number of CPUs: {max_workers}")
start_time = time.time()

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    executor.map(download_image_from_url, image_url_list) # function name, argument
print(f"Time taken with multithreading: {time.time() - start_time}")

As you can see, the change required are very simple. And this will speed up your script by a lot! These are the examples from my experiments:

Without multithreading

156.25

With Multithreading

15.97

That is 9.7X faster than regular for loop. I bet you did not expect it to be that much faster. Here is the complete script.