Multiprocessing is a fundamental idea in computer science that lets programs run several tasks or processes concurrently, using the full capabilities of modern multi-core processors. By splitting work across several processes, each with its own memory space, multiprocessing lets software overcome the performance limits of conventional single-threaded execution. Because processes are isolated from one another, memory conflicts are avoided, which improves stability and security. Multiprocessing is especially valuable for CPU-bound jobs that require extensive computation. It can make a large difference in Python applications where speed matters, such as data processing, scientific simulations, image and video processing, and machine learning.
Multiprocessing is a powerful approach in computer programming that lets programs carry out several tasks or processes simultaneously on modern multi-core processors. Unlike multi-threading, which runs multiple threads within a single process, multiprocessing creates several processes, each with its own memory space. This isolation prevents processes from interfering with one another’s memory, which enhances stability and security.
This article was published as a part of the Data Science Blogathon.
An important objective in software development is to optimize code execution. The processing capability of a single core can be a constraint for traditional sequential programming. By permitting the allocation of tasks across several cores, multiprocessing overcomes this limitation and makes the most of the capabilities of contemporary processors. As a result, jobs requiring a lot of processing run faster and with significantly better performance.
The achievement of concurrency and parallelism depends heavily on using processes and threads, the basic units of execution in a computer program.
Processes:
A process is an isolated instance of a running program. Each process has its own execution environment, memory space, and resources. Because processes are segregated, they do not directly share memory, so they must exchange data through inter-process communication (IPC) mechanisms. Given their size and inherent separation, processes excel at handling heavyweight tasks, such as executing numerous independent programs.
Threads:
Threads are smaller units of execution within a process. Multiple threads can exist within a single process and share its resources and memory. Because they share the same memory environment, threads running in the same process can communicate via shared variables. Compared to processes, threads are lighter and better suited for activities that involve large amounts of shared data and need little isolation.
CPython, the most popular Python implementation, uses a mutex called the Global Interpreter Lock (GIL) to synchronize access to Python objects and to stop several threads from running Python bytecode at the same time within one process. This means that even on systems with several cores, only one thread can run Python code at a time within a given process.
I/O-Bound Tasks: I/O-bound operations, where threads frequently wait for external resources like file I/O or network responses, are less significantly affected by the GIL. The lock and release actions of the GIL have a comparatively smaller effect on performance in such circumstances.
Threads: Threads are advantageous for I/O-bound activities, where the software must wait a long time for external resources. They can operate in the background without blocking the main thread, making them suitable for applications that demand responsive user interfaces.
Processes: For CPU-bound operations or when you wish to utilize multiple CPU cores fully, processes are more appropriate. Multiprocessing enables parallel execution across several cores without the restrictions of the GIL because each process has its own GIL.
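The rule of thumb above can be sketched with the standard library's concurrent.futures module, which this article does not otherwise use; it is swapped in here only because it exposes thread and process pools through one interface. The functions fake_download and sum_of_squares are hypothetical stand-ins for I/O-bound and CPU-bound work.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_download(item_id):
    """I/O-bound stand-in: sleeping releases the GIL, like waiting on a socket."""
    time.sleep(0.2)
    return item_id


def sum_of_squares(limit):
    """CPU-bound stand-in: a pure-Python loop that holds the GIL while it runs."""
    total = 0
    for value in range(limit):
        total += value * value
    return total


if __name__ == "__main__":
    # Threads suit the I/O-bound calls: the waits overlap inside one process.
    with ThreadPoolExecutor(max_workers=4) as executor:
        downloads = list(executor.map(fake_download, range(4)))
    print("Downloaded:", downloads)

    # Processes suit the CPU-bound calls: each worker has its own interpreter and GIL.
    with ProcessPoolExecutor(max_workers=4) as executor:
        totals = list(executor.map(sum_of_squares, [200_000] * 4))
    print("Totals:", totals)
```

Whether the process pool actually wins depends on how much work each call does, since starting processes and passing data between them has a cost of its own.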
Python’s multiprocessing module is a powerful tool for achieving concurrency and parallelism by creating and managing several processes. It offers a high-level interface for launching and managing processes, enabling programmers to run parallel activities on multi-core machines.
Enabling Concurrent Execution Through Multiple Processes:
By creating several distinct processes, each with its own Python interpreter and memory space, the multiprocessing module makes it possible to run Python code in parallel. This allows real parallel execution on multi-core platforms, because it sidesteps the Global Interpreter Lock (GIL) restrictions that apply to the standard threading module.
Process Class:
The Process class is the core of the multiprocessing module. Each Process object represents one independent process, which you can create and manage through it. Essential methods include:
start(): Starts the process, causing the target function to run in a new process.
terminate(): Terminates the process forcefully.
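A minimal sketch of this lifecycle, assuming a hypothetical sleepy_worker function that only sleeps. It also uses is_alive(), join(), and exitcode, which are part of the Process class but are not listed above.

```python
import multiprocessing
import time


def sleepy_worker(seconds):
    """Hypothetical long-running task that just sleeps."""
    time.sleep(seconds)


if __name__ == "__main__":
    process = multiprocessing.Process(target=sleepy_worker, args=(60,), name="sleepy")
    print("Alive before start():", process.is_alive())  # not started yet

    process.start()
    print("Alive after start():", process.is_alive())  # running while the worker sleeps

    process.terminate()  # forceful stop; the worker gets no chance to clean up
    process.join()       # wait until the process has actually exited
    print("Alive after terminate():", process.is_alive())
    print("Exit code:", process.exitcode)  # negative on Unix when killed by a signal
```

Because terminate() gives the target no chance to release locks or flush queues, it is usually kept as a last resort; letting the function return and calling join() is the gentler path.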
Queue Class: The Queue class offers a secure method of interprocess communication via a synchronized queue. It supports adding and removing items from the queue using methods like put() and get().
Pool Class: The Pool class manages a pool of worker processes, which makes it possible to parallelize the execution of a function across many input values. Key methods include:
Pool(processes): Constructor that creates a process pool with the specified number of worker processes.
Lock Class: When several processes use the same shared resource, the Lock class provides mutual exclusion so that race conditions can be avoided.
Value and Array Classes: These classes create objects in shared memory that several processes can read and write, which is useful for safely sharing data between processes.
Manager Class: Multiple processes can access shared objects and data structures created using the Manager class. It provides more complex abstractions like namespaces, dictionaries, and lists.
Pipe Function:
The Pipe() function constructs a pair of connection objects for two-way communication between processes.
current_process() Function:
Returns the Process object that represents the currently running process, which you can use to identify it (for example, by its name or PID).
cpu_count() Function:
Returns the number of CPU cores in the system, which is useful for deciding how many worker processes to run simultaneously.
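As a small sketch of the last three helpers working together, the hypothetical describe_worker function below uses multiprocessing.current_process() to identify itself and reports back over a connection created by Pipe(), while the parent prints the value of multiprocessing.cpu_count(). The process name "pipe-worker" is an arbitrary choice for illustration.

```python
import multiprocessing


def describe_worker(connection):
    """Send this process's name and PID back through the pipe."""
    me = multiprocessing.current_process()
    connection.send({"name": me.name, "pid": me.pid})
    connection.close()


if __name__ == "__main__":
    print("CPUs reported by cpu_count():", multiprocessing.cpu_count())
    print("Parent process name:", multiprocessing.current_process().name)

    parent_end, child_end = multiprocessing.Pipe()  # duplex (two-way) by default
    worker = multiprocessing.Process(
        target=describe_worker, args=(child_end,), name="pipe-worker"
    )
    worker.start()
    print("Worker reported:", parent_end.recv())  # blocks until the worker sends
    worker.join()
```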
You may construct and control different processes in Python using the Process class from the multiprocessing package. Here is a step-by-step explanation of how to establish processes using the Process class and how to provide the function to run in a new process using the target parameter:
import multiprocessing

# Example function that will run in the new process
def worker_function(number):
    print(f"Worker process {number} is running")

if __name__ == "__main__":
    # Create a list of processes
    processes = []
    num_processes = 4
    for i in range(num_processes):
        # Create a new process, specifying the target function and its arguments
        process = multiprocessing.Process(target=worker_function, args=(i,))
        processes.append(process)
        process.start()  # Start the process
    # Wait for all processes to finish
    for process in processes:
        process.join()
    print("All processes have finished")
Output (the order of the worker lines may vary between runs):
Worker process 0 is running
Worker process 1 is running
Worker process 2 is running
Worker process 3 is running
All processes have finished
In a multi-process environment, processes can synchronize their operations and share data using various techniques and procedures known as inter-process communication (IPC). Communication is crucial in a multiprocessing environment, where numerous processes operate simultaneously. This enables processes to cooperate, share information, and plan their operations.
Pipes:
Pipes are a fundamental IPC structure for passing data between two processes: one process writes data into the pipe while the other reads from it. Pipes can be either named or anonymous. A pipe, however, connects only two endpoints, so it is suited to communication between a pair of processes.
Queues:
The multiprocessing module’s queues offer a more flexible IPC method. They enable communication between several processes by passing messages through the queue: the sending process adds messages to the queue, and the receiving process retrieves them. Queues handle data integrity and synchronization automatically.
Shared Memory:
Shared memory lets multiple processes access the same region of memory, which allows efficient data sharing and communication. Working with shared memory requires careful synchronization to avoid race conditions and guarantee data consistency.
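To make the shared memory option concrete, here is a minimal sketch using multiprocessing.Array with the lock it carries by default. The add_to_all function and the amounts passed to it are assumptions made for illustration.

```python
import multiprocessing


def add_to_all(shared, amount):
    """Add `amount` to every slot of the shared array under its built-in lock."""
    with shared.get_lock():
        for index in range(len(shared)):
            shared[index] += amount


if __name__ == "__main__":
    # "i" is the typecode for a C signed int; the array lives in shared memory.
    shared = multiprocessing.Array("i", [0, 0, 0])

    workers = [
        multiprocessing.Process(target=add_to_all, args=(shared, amount))
        for amount in (1, 10, 100)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    print("Shared array:", shared[:])  # every slot has received 1 + 10 + 100
```

Holding the lock for the whole loop keeps each update atomic with respect to the other workers, at the price of serializing them; that is the synchronization cost the paragraph above refers to.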
Due to their simplicity and built-in synchronization, queues are a popular IPC technique in Python’s multiprocessing module. Here is an illustration showing how to use queues for interprocess communication:
import multiprocessing

# Worker function that puts data into the queue
def producer(queue):
    for i in range(5):
        queue.put(i)
        print(f"Produced: {i}")

# Worker function that retrieves data from the queue
def consumer(queue):
    while True:
        data = queue.get()
        if data is None:  # Sentinel value to stop the loop
            break
        print(f"Consumed: {data}")

if __name__ == "__main__":
    # Create a queue for communication
    queue = multiprocessing.Queue()
    # Create producer and consumer processes
    producer_process = multiprocessing.Process(target=producer, args=(queue,))
    consumer_process = multiprocessing.Process(target=consumer, args=(queue,))
    # Start the processes
    producer_process.start()
    consumer_process.start()
    # Wait for the producer to finish
    producer_process.join()
    # Signal the consumer to stop by adding a sentinel value to the queue
    queue.put(None)
    # Wait for the consumer to finish
    consumer_process.join()
    print("All processes have finished")
In this example, the producer process uses the put() method to add data to the queue, and the consumer process retrieves data from the queue using the get() method. Once the producer has finished, the main process tells the consumer to stop by putting a sentinel value (None) on the queue. The join() method waits for each process to complete. This shows how queues give processes a practical and safe way to exchange data without explicit synchronization.
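The same sentinel pattern extends to several consumers, as long as one sentinel is sent per consumer. The sketch below is an assumption-laden variation on the example above: the second results queue, the squaring step, and the number of consumers are all chosen only for illustration.

```python
import multiprocessing

SENTINEL = None  # same stop marker as in the example above


def consumer(tasks, results):
    """Square numbers from `tasks` until the sentinel arrives."""
    while True:
        item = tasks.get()
        if item is SENTINEL:
            break
        results.put(item * item)


if __name__ == "__main__":
    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()
    num_consumers = 3

    consumers = [
        multiprocessing.Process(target=consumer, args=(tasks, results))
        for _ in range(num_consumers)
    ]
    for process in consumers:
        process.start()

    numbers = list(range(10))
    for number in numbers:
        tasks.put(number)
    for _ in range(num_consumers):
        tasks.put(SENTINEL)  # one sentinel per consumer

    # Drain the results before join(): a process that still has queued items
    # buffered may not exit until they are consumed.
    squares = sorted(results.get() for _ in numbers)
    for process in consumers:
        process.join()

    print("Squares:", squares)
```

The results are sorted because the consumers finish in no particular order.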
The Pool class in the multiprocessing module manages a pool of worker processes, which lets you parallelize the execution of a function across many input values. It simplifies both the assignment of tasks and the gathering of their results. The Pool class’s map() and apply() methods are the ones most commonly used to achieve parallel execution.
map() Function:
The map() method applies the supplied function to each member of an iterable and divides the burden among the available processes. A list of outcomes is returned in the same order that the input values were entered. Here’s an illustration:
import multiprocessing

def square(number):
    return number ** 2

if __name__ == "__main__":
    input_data = [1, 2, 3, 4, 5]
    with multiprocessing.Pool() as pool:
        results = pool.map(square, input_data)
    print("Squared results:", results)
apply() Function:
Use the apply() method when you need to run a function once, with a single set of arguments, in one of the pool’s worker processes. It blocks until the call finishes and returns the function’s result. Here’s an illustration:
import multiprocessing

def cube(number):
    return number ** 3

if __name__ == "__main__":
    number = 4
    with multiprocessing.Pool() as pool:
        result = pool.apply(cube, (number,))
    print(f"{number} cubed is:", result)
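Because apply() blocks until the call returns, it does not by itself run anything in parallel. As a hedged sketch of two related Pool methods that the article does not cover, apply_async() returns a handle immediately and starmap() maps over tuples of arguments. The power function is a hypothetical example.

```python
import multiprocessing


def power(base, exponent):
    return base ** exponent


if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        # apply_async() returns immediately with an AsyncResult handle.
        pending = pool.apply_async(power, (2, 10))

        # starmap() unpacks each tuple into positional arguments.
        cubes = pool.starmap(power, [(1, 3), (2, 3), (3, 3)])

        # get() blocks until the result is ready (or the timeout expires).
        print("2 ** 10 =", pending.get(timeout=10))
        print("Cubes:", cubes)
```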
CPU-Bound Tasks: The Pool class can run CPU-heavy work, such as simulations or calculations, in parallel. Distributing the load across the worker processes makes effective use of multiple CPU cores.
Data processing: The Pool class can handle many dataset components simultaneously when dealing with data processing tasks like data transformation, filtering, or analysis. The processing time may be significantly shortened as a result.
Web scraping: The Pool class can simultaneously request data from various URLs while scraping information from multiple websites. This speeds up the data-gathering process.
Synchronization and Locking: In a multiprocessing system, race conditions occur when two or more processes access the same shared resources or variables simultaneously, resulting in unpredictable or incorrect behavior. Race conditions can cause data corruption, crashes, and incorrect program output. Synchronization techniques such as locks preserve data integrity and prevent race conditions.
A lock, also called a mutex (short for “mutual exclusion”), is a synchronization primitive that ensures only one process can execute a critical section of code or access a shared resource at any given moment. Once a process holds a lock, it has sole access to the protected region, and other processes cannot enter that region until the lock is released.
By requiring that processes access resources one at a time, locks enforce a form of cooperation that avoids race conditions.
Examples of Locks Used to Protect Data Integrity
import multiprocessing

def increment(counter, lock):
    for _ in range(100000):
        with lock:
            counter.value += 1

if __name__ == "__main__":
    counter = multiprocessing.Value("i", 0)
    lock = multiprocessing.Lock()
    processes = []
    for _ in range(4):
        process = multiprocessing.Process(target=increment, args=(counter, lock))
        processes.append(process)
        process.start()
    for process in processes:
        process.join()
    print("Final counter value:", counter.value)
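The `with lock:` statement in the example above is shorthand for an acquire() and release() pair. The sketch below spells that out in a single process, purely to illustrate the lock's semantics: while the lock is held, a non-blocking acquire() reports failure instead of waiting.

```python
import multiprocessing

lock = multiprocessing.Lock()

# Explicit form of `with lock:`; the finally clause guarantees the release.
lock.acquire()
try:
    # While the lock is held, a non-blocking attempt to take it again fails.
    second_attempt = lock.acquire(block=False)
    print("Second acquire succeeded:", second_attempt)
finally:
    lock.release()

# Once released, the lock can be taken again.
third_attempt = lock.acquire(block=False)
print("Acquire after release succeeded:", third_attempt)
lock.release()
```

In real code the competing acquire() would come from another process, which by default blocks until the holder releases the lock.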
CPU-Bound Tasks: A CPU-bound task makes heavy use of the CPU’s processing capabilities. Examples include intricate calculations, mathematical operations, simulations, and data processing. CPU-bound jobs rarely interact with external resources like files and networks and spend most of their time executing code.
I/O-Bound tasks: I/O-bound tasks include reading and writing files, sending requests across networks, and communicating with databases, all of which need a substantial amount of waiting time for I/O operations to finish. These jobs spend more time “waiting” for I/O operations to complete than actively using the CPU.
Process pools are beneficial for controlling CPU-intensive workloads. Process pools divide CPU-bound tasks over numerous processes so they can run concurrently on various CPU cores because, most of the time, they involve computations that can be parallelized. This considerably shortens the execution time and effectively utilizes the available CPU resources.
Using process pools, you can ensure that multi-core processors are fully utilized to finish CPU-bound tasks more quickly. The multiprocessing module’s Pool class makes creating and managing these worker processes easier.
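As one possible sketch of this approach, the example below splits a deliberately slow prime count into one chunk per reported CPU and hands the chunks to a Pool. The helper names count_primes and split_range, and the limit of 200,000, are assumptions chosen for illustration.

```python
import multiprocessing


def count_primes(bounds):
    """Count primes in [start, stop) by trial division (deliberately CPU-heavy)."""
    start, stop = bounds
    count = 0
    for candidate in range(max(start, 2), stop):
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            count += 1
    return count


def split_range(stop, parts):
    """Split [0, stop) into at most `parts` contiguous (start, stop) chunks."""
    step = -(-stop // parts)  # ceiling division
    return [(low, min(low + step, stop)) for low in range(0, stop, step)]


if __name__ == "__main__":
    workers = multiprocessing.cpu_count()
    chunks = split_range(200_000, workers)
    with multiprocessing.Pool(processes=workers) as pool:
        total = sum(pool.map(count_primes, chunks))
    print(f"Primes below 200000: {total}")
```

Equal-sized chunks are the simplest split but not the most balanced one here, because larger candidates take longer to test; using more, smaller chunks than workers is a common way to even out the load.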
Asynchronous programming is an appropriate strategy for I/O-bound jobs, where the main bottleneck is waiting for I/O operations (such as reading/writing files or making network requests). By effectively transitioning between activities while waiting for I/O, asynchronous programming enables a single thread to manage numerous tasks concurrently rather than using multiple processes.
Setting up separate processes, such as process pools, is unnecessary while using asynchronous programming. Instead, it employs a cooperative multitasking strategy, where activities give up control to the event loop while they wait for I/O to happen so that other tasks can carry on with their work. This can significantly enhance I/O-bound apps’ responsiveness.
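A minimal sketch of this cooperative model with the standard asyncio module, where asyncio.sleep() stands in for a real network or file operation. The fetch coroutine and its delays are hypothetical.

```python
import asyncio
import time


async def fetch(name, delay):
    """Stand-in for a network request; awaiting yields control to the event loop."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    # gather() runs the coroutines concurrently on one thread.
    return await asyncio.gather(fetch("a", 0.5), fetch("b", 0.5), fetch("c", 0.5))


if __name__ == "__main__":
    started = time.perf_counter()
    results = asyncio.run(main())
    elapsed = time.perf_counter() - started
    print(results)
    # Expect roughly the longest single delay, not the sum of all three.
    print(f"Elapsed: {elapsed:.2f}s")
```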
Several factors influence the performance of multiprocessing solutions:
Here’s an illustrative comparison of different implementations using a simple CPU-bound task of calculating factorials:
import time
import multiprocessing
import threading
import math

def factorial(n):
    return math.factorial(n)

def single_thread():
    for _ in range(4):
        factorial(5000)

def multi_thread():
    threads = []
    for _ in range(4):
        thread = threading.Thread(target=factorial, args=(5000,))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()

def multi_process():
    processes = []
    for _ in range(4):
        process = multiprocessing.Process(target=factorial, args=(5000,))
        processes.append(process)
        process.start()
    for process in processes:
        process.join()

if __name__ == "__main__":
    start_time = time.time()
    single_thread()
    print("Single-threaded:", time.time() - start_time)
    start_time = time.time()
    multi_thread()
    print("Multi-threaded:", time.time() - start_time)
    start_time = time.time()
    multi_process()
    print("Multi-processing:", time.time() - start_time)
Multiprocessing has drawbacks even if it can significantly boost performance for CPU-bound tasks:
An effective technique for comprehending the behavior and effects of multiprocessing is visualization. You may follow the progress of processes, evaluate data for various scenarios, and visually show the performance gains from parallel processing by making graphs and charts.
Here are two examples of how you can use Matplotlib to visualize multiprocessing execution and speedup:
Example 1: Visualising Process Execution
Let’s consider a scenario where you’re processing a batch of images using multiple processes. You can visualize the progress of each process using a bar chart:
import multiprocessing
import time
import matplotlib.pyplot as plt

def process_image(image):
    time.sleep(2)  # Simulating image processing
    return f"Processed {image}"

if __name__ == "__main__":
    images = ["image1.jpg", "image2.jpg", "image3.jpg", "image4.jpg"]
    num_processes = 4
    with multiprocessing.Pool(processes=num_processes) as pool:
        results = pool.map(process_image, images)
    plt.bar(range(len(images)), [1] * len(images), align="center", color="blue",
            label="Processing")
    plt.bar(range(len(results)), [1] * len(results), align="center", color="green",
            label="Processed")
    plt.xticks(range(len(results)), images)
    plt.ylabel("Progress")
    plt.title("Image Processing Progress")
    plt.legend()
    plt.show()
Example 2: Speedup Comparison
import time
import threading
import multiprocessing
import matplotlib.pyplot as plt

def task():
    time.sleep(1)  # Simulating work

def run_single_thread():
    for _ in range(4):
        task()

def run_multi_thread():
    threads = []
    for _ in range(4):
        thread = threading.Thread(target=task)
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()

def run_multi_process():
    processes = []
    for _ in range(4):
        process = multiprocessing.Process(target=task)
        processes.append(process)
        process.start()
    for process in processes:
        process.join()

if __name__ == "__main__":
    times = []
    start_time = time.time()
    run_single_thread()
    times.append(time.time() - start_time)
    start_time = time.time()
    run_multi_thread()
    times.append(time.time() - start_time)
    start_time = time.time()
    run_multi_process()
    times.append(time.time() - start_time)
    labels = ["Single Thread", "Multi Thread", "Multi Process"]
    plt.bar(labels, times)
    plt.ylabel("Execution Time (s)")
    plt.title("Speedup Comparison")
    plt.show()
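The chart above plots raw execution times. A speedup factor is simply the baseline time divided by each measured time, which the hypothetical speedups helper below computes. The timings in example_times are invented for illustration and are not measured results. Note also that because task() only sleeps, threads and processes can both overlap the waits; a CPU-bound task would be needed to see the GIL hold the threaded version back.

```python
def speedups(times, baseline_label="Single Thread"):
    """Return baseline_time / time for every label in `times`."""
    baseline = times[baseline_label]
    return {label: baseline / elapsed for label, elapsed in times.items()}


# Hypothetical timings in seconds, for illustration only (not measured results).
example_times = {"Single Thread": 4.0, "Multi Thread": 1.0, "Multi Process": 1.25}

for label, factor in speedups(example_times).items():
    print(f"{label}: {factor:.2f}x")
```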
In many sectors where tasks may be broken down into smaller work units that can be completed concurrently, multiprocessing is vital. Here are a few real-world scenarios where multiprocessing is crucial:
Multiprocessing can speed up web scraping and crawling, where data is gathered from numerous websites. Multiple processes can retrieve and analyze data from different sources at the same time.
Deep learning and machine learning: Training machine learning models on massive datasets frequently involves computationally demanding work. Spreading data preparation and training operations across several cores or GPUs can reduce training time.
Batch processing is necessary for many applications, such as rendering animation frames or generating reports in business programs. Multiprocessing allows these jobs to run efficiently in parallel.
Complex financial simulations, risk analysis, and scenario modeling can involve many calculations. Multiprocessing speeds up these computations, enabling faster decision-making and analysis.
Exploring Python’s multiprocessing capabilities gives you the power to improve the performance of your code and speed up your applications. This article has covered the interplay of threads and processes and the tools the multiprocessing module provides. Multiprocessing can give new life to slow code by offering efficiency and optimization. Remember that it is a key to speed and efficiency for CPU-bound work. Your newly acquired skills prepare you for demanding projects, including complex simulations and data-intensive tasks. Let this knowledge fuel your enthusiasm for coding and help you make your applications more effective. With multiprocessing at your disposal, there is a great deal more your code can do.
A. In contrast to multi-threading, which involves executing numerous threads within a single process while sharing the same memory, multiprocessing includes operating multiple independent processes, each with its own memory space. While multi-threading may be constrained by the Global Interpreter Lock (GIL), multiprocessing can achieve real parallelism across several CPU cores.
A. Multiprocessing is appropriate for CPU-bound jobs that demand intense computation and can benefit from parallel execution. Use multi-threading for I/O-bound operations, where most of the time is spent waiting for outside resources. For CPU-bound jobs, multiprocessing is more effective because it bypasses the GIL.
A. The GIL is a mutex that permits only one thread at a time to run Python code within a single process. This restricts the parallel execution of multi-threaded Python programs on multi-core platforms. Multiprocessing gets around this restriction by using distinct processes, each with its own GIL.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.