Matrix Multiplication Using Pthreads in C

Matrix multiplication is a fundamental operation in many scientific and engineering applications, especially in fields like computer graphics, machine learning, and physics simulations. For large matrices the operation is computationally expensive, which makes it a perfect candidate for parallel computing.

One popular way to achieve parallelism in C is by using POSIX Threads (Pthreads). This article will explain how to implement matrix multiplication using Pthreads to divide the workload across multiple threads, achieving faster computation on multi-core systems.

What is Matrix Multiplication?

Matrix multiplication involves taking two matrices, say A and B, where the number of columns of A equals the number of rows of B, and producing a resultant matrix C in which each element is the dot product of the corresponding row of A with the corresponding column of B:

C[i][j] = A[i][0]*B[0][j] + A[i][1]*B[1][j] + ... + A[i][n-1]*B[n-1][j]

where C[i][j] is the element in the i-th row and j-th column of matrix C.
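In code, this definition is a straightforward triple loop. The sketch below is a sequential reference version; the function name and the int element type are illustrative choices that match the threaded program later in this article, and it uses C99 variable-length array parameters to keep the signature general:

// Sequential reference: C = A * B, where A is m x n and B is n x p
void matmul_serial(int m, int n, int p,
                   const int A[m][n], const int B[n][p], int C[m][p]) {
    for (int i = 0; i < m; i++) {           // each row of A
        for (int j = 0; j < p; j++) {       // each column of B
            C[i][j] = 0;
            for (int k = 0; k < n; k++) {   // dot product of row i and column j
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}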

Why Use Pthreads for Matrix Multiplication?

Matrix multiplication is computationally intensive, especially for large matrices. Using Pthreads allows us to parallelize this computation by distributing the work across multiple threads, reducing execution time.

In the case of matrix multiplication, each element of the resulting matrix C can be computed independently, making it a perfect candidate for parallelism.

Pthreads Basics

  • Thread creation: In Pthreads, threads are created using pthread_create().
  • Thread synchronization: When threads finish their execution, you can use pthread_join() to wait for them to complete. A minimal example of both calls follows.
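
Here is a minimal sketch of these two calls; the worker function and its argument are placeholders for illustration:

#include <stdio.h>
#include <pthread.h>

// Entry point for the thread: every Pthread worker takes and returns a void pointer
void* worker(void* arg) {
    int id = *(int*)arg;
    printf("Hello from thread %d\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid;
    int id = 1;

    pthread_create(&tid, NULL, worker, &id);  // spawn the thread
    pthread_join(tid, NULL);                  // block until it finishes
    return 0;
}

This compiles with the same -pthread flag used later in this article.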

We’ll divide the task of calculating matrix C across multiple threads, where each thread will compute a portion of the matrix.

Matrix Multiplication Using Pthreads: Step-by-Step Guide

Step 1: Define the Problem

Let’s consider two matrices A (of size m x n) and B (of size n x p). The goal is to compute the product C = A * B (of size m x p) using multiple threads.

Step 2: Divide the Task

We can divide the task of computing the matrix C in several ways:

  • Row-wise division: Assign each thread a set of rows to compute.
  • Element-wise division: Assign each thread the task of computing one or more elements in the resulting matrix.

For simplicity, let’s go with row-wise division, where each thread computes one or more rows of the resulting matrix C.

Step 3: Code Implementation

Here’s a C program to multiply two matrices using Pthreads:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

// Define matrix sizes
#define M 4  // Rows in A and C
#define N 4  // Columns in A and Rows in B
#define P 4  // Columns in B and C

// Matrices
int A[M][N];
int B[N][P];
int C[M][P];

// Number of threads (ideally <= M)
#define NUM_THREADS 4

// Thread argument structure
typedef struct {
    int start_row;
    int end_row;
} thread_arg_t;

// Function to perform matrix multiplication for a subset of rows
void* multiply_matrices(void* arg) {
    thread_arg_t* thread_arg = (thread_arg_t*)arg;
    int start_row = thread_arg->start_row;
    int end_row = thread_arg->end_row;

    for (int i = start_row; i < end_row; i++) {
        for (int j = 0; j < P; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

    pthread_exit(NULL);
}

int main() {
    pthread_t threads[NUM_THREADS];
    thread_arg_t thread_args[NUM_THREADS];

    // Initialize matrices A and B with some values
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
        }
    }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < P; j++) {
            B[i][j] = i * j;
        }
    }

    // Divide rows among threads
    int rows_per_thread = M / NUM_THREADS;
    for (int i = 0; i < NUM_THREADS; i++) {
        thread_args[i].start_row = i * rows_per_thread;
        thread_args[i].end_row = (i == NUM_THREADS - 1) ? M : (i + 1) * rows_per_thread;

        // Create a thread to compute a portion of matrix C
        pthread_create(&threads[i], NULL, multiply_matrices, &thread_args[i]);
    }

    // Wait for all threads to finish
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    // Print the result matrix C
    printf("Result matrix C:\n");
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < P; j++) {
            printf("%d ", C[i][j]);
        }
        printf("\n");
    }

    return 0;
}

Explanation of the Code

  1. Matrix Initialization:
    • Matrices A and B are initialized with some sample values. In a real-world scenario, these matrices could be populated with user input or data from files.
  2. Thread Division:
    • The rows of matrix C are divided among the available threads; the last thread also picks up any remainder when M is not evenly divisible by NUM_THREADS. Each thread is responsible for computing one or more rows of the result matrix.
  3. Thread Function (multiply_matrices):
    • Each thread computes the dot products of its assigned rows of matrix A with the columns of matrix B to fill in its portion of matrix C.
  4. Thread Creation and Joining:
    • pthread_create() is used to spawn threads, and pthread_join() is used to ensure that the main program waits for all threads to complete their execution before proceeding.
  5. Result Matrix:
    • After all threads finish their computation, the resultant matrix C is printed. A quick sanity check against a serial recomputation is sketched below.
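
As an optional sanity check (not part of the listing above), the threaded result can be compared against a straightforward serial recomputation. The helper below is a hypothetical addition that reuses the global matrices and size macros from the program:

// Hypothetical helper: recompute C serially and compare with the threaded result
int verify_result(void) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < P; j++) {
            int expected = 0;
            for (int k = 0; k < N; k++) {
                expected += A[i][k] * B[k][j];
            }
            if (C[i][j] != expected) {
                return 0;  // mismatch at C[i][j]
            }
        }
    }
    return 1;  // every element matches the serial result
}

Calling it after the pthread_join() loop and before printing would confirm that the parallel computation produced the expected values.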

Step 4: Compiling and Running the Code

To compile and run the above code on a system with Pthreads support:

gcc -pthread matrix_multiply_pthreads.c -o matrix_multiply
./matrix_multiply

Sample Output

Result matrix C:
0 14 28 42
0 20 40 60
0 26 52 78
0 32 64 96

With A[i][j] = i + j and B[i][j] = i * j as initialized above, this is the matrix product of A and B. Since NUM_THREADS equals M here, each thread computes exactly one row of the result.

Performance Considerations

  • Thread Overhead: While Pthreads allow for parallelism, creating too many threads can lead to overhead. If the matrix is small or the number of threads is too high, the overhead of managing threads may negate the performance gains.
  • Core Affinity: The system’s architecture plays a role in how efficiently threads are executed. For best results, the number of threads should ideally match the number of CPU cores (see the sketch after this list).
  • Load Balancing: When dividing the matrix into chunks, ensure that the work is balanced across threads. In the example, each thread gets an equal number of rows, but for matrices where rows take different times to compute, other strategies might be necessary.
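
A common way to act on the core-affinity point, on Linux and most POSIX systems where sysconf() reports processor counts, is to size the thread pool from the number of online cores rather than hard-coding it. A minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    // _SC_NPROCESSORS_ONLN reports the number of processors currently online
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores < 1) {
        cores = 1;  // fall back to one thread if the query fails
    }
    printf("Spawning %ld threads (one per online core)\n", cores);
    return 0;
}

In the program above, NUM_THREADS could then be chosen as the smaller of this value and M.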

Conclusion

Using Pthreads to parallelize matrix multiplication can significantly improve performance for large matrices. By dividing the work across multiple threads, each processing a different part of the result matrix, we can utilize multi-core processors more effectively. While this implementation uses row-wise division, other approaches like element-wise division can be used depending on the problem size and complexity.

Parallel computing is a powerful tool, and Pthreads provide a lightweight, efficient way to achieve this in C programs.
