admin管理员组

文章数量:1344297

The task was to implement various matrix multiplication algorithms using OpenMP. It turned out that with num_threads(1), the program runs faster than with any other number of threads. Is this due to the overhead of data transfer taking too much time, or is the issue something else?

#include <omp.h>
#include <stdio.h>
#include <cstdlib>
#include <ctime>
int main(){
    const int n = 1024;
    const int m = 128;

    int** A = new int* [n];
    int** B = new int* [n];
    int** C = new int* [n];
    for (int i=0; i<n; i++){
        A[i] = new int[n];
        B[i] = new int[n];
        C[i] = new int[n];
    }

    for (int i=0; i<n; i++){
        for (int j=0; j<n; j++){
            A[i][j] = rand() % 10;
            B[i][j] = rand() % 10;
        }
    }

    clock_t start = clock();
    #pragma omp parallel for num_threads(6) schedule (dynamic,m)
    for (int j=0; j<n; j++){
        for (int i=0; i<n; i++){
            for (int k=0; k<n; k++)
                C[i][j] += A[i][k]*B[k][j];
        }
    }
    clock_t end = clock();
    printf("Time for j-i-k: %f\n", (double) (end - start) / CLOCKS_PER_SEC);

    start = clock();
    #pragma omp parallel for num_threads(6) schedule (dynamic,m)
    for (int k=0; k<n; k++){
        for (int i=0; i<n; i++){
            for (int j=0; j<n; j++)
                C[i][j] += A[i][k]*B[k][j];
        }
    }
    end = clock();
    printf("Time for k-i-j: %f\n", (double) (end - start) / CLOCKS_PER_SEC);

    for (int i=0; i<n; i++){
        delete[] A[i];
        delete[] B[i];
        delete[] C[i];
    }
    delete[] A;
    delete[] B;
    delete[] C;
}
    

The task was to implement various matrix multiplication algorithms using OpenMP. It turned out that with num_threads(1), the program runs faster than with any other number of threads. Is this due to the overhead of data transfer taking too much time, or is the issue something else?

#include <omp.h>
#include <stdio.h>
#include <cstdlib>
#include <ctime>
int main(){
    const int n = 1024;
    const int m = 128;

    int** A = new int* [n];
    int** B = new int* [n];
    int** C = new int* [n];
    for (int i=0; i<n; i++){
        A[i] = new int[n];
        B[i] = new int[n];
        C[i] = new int[n];
    }

    for (int i=0; i<n; i++){
        for (int j=0; j<n; j++){
            A[i][j] = rand() % 10;
            B[i][j] = rand() % 10;
        }
    }

    clock_t start = clock();
    #pragma omp parallel for num_threads(6) schedule (dynamic,m)
    for (int j=0; j<n; j++){
        for (int i=0; i<n; i++){
            for (int k=0; k<n; k++)
                C[i][j] += A[i][k]*B[k][j];
        }
    }
    clock_t end = clock();
    printf("Time for j-i-k: %f\n", (double) (end - start) / CLOCKS_PER_SEC);

    start = clock();
    #pragma omp parallel for num_threads(6) schedule (dynamic,m)
    for (int k=0; k<n; k++){
        for (int i=0; i<n; i++){
            for (int j=0; j<n; j++)
                C[i][j] += A[i][k]*B[k][j];
        }
    }
    end = clock();
    printf("Time for k-i-j: %f\n", (double) (end - start) / CLOCKS_PER_SEC);

    for (int i=0; i<n; i++){
        delete[] A[i];
        delete[] B[i];
        delete[] C[i];
    }
    delete[] A;
    delete[] B;
    delete[] C;
}
    
Share Improve this question asked yesterday Илья АнненковИлья Анненков 112 bronze badges New contributor Илья Анненков is a new contributor to this site. Take care in asking for clarification, commenting, and answering. Check out our Code of Conduct. 11
  • 8 Your arrays are tiny (1024 elements are nothing. Try 100 million or a billion). I would expect the overhead of spinning up a new thread, scheduling it to run, letting it do its work, synchronizing with other threads etc, is far more work than the actual work done in the thread. So it doesn't surprise me that a single thread does this faster than multiple threads. – Jesper Juhl Commented yesterday
  • 2 Side note : Why do you tag your code as C++ and then use "C"? In C++ do not use ` int** A = new int* [n];` but use std::vector<std::vector<int>> (you're already leaking memory). As for using clock to measure performance... I would recommend you use a profiler, that will tell you where your bottlenecks really are. – Pepijn Kramer Commented yesterday
  • 2 Side note: Why all the manual memory management? Manual new/new[] and delete/delete[] doesn't belong in modern C++. Use smart pointers, RAII etc. Don't write C++ like we did in 1998. – Jesper Juhl Commented yesterday
  • 3 Also, please read clock() - I don't think it does what you think it does. – Jesper Juhl Commented yesterday
  • 3 The problem is likely clock indeed. It measure the CPU time which means it increase with the number of threads running. Note OpenMP spawn threads and not processes. Oh and dynamic schedule is generally not a good idea, especially for matrix multiplications with a very stable uniform execution. Besides using num_threads(6) is a bad practice here. You should really use the environment variable OMP_NUM_THREADS instead. – Jérôme Richard Commented yesterday
 |  Show 6 more comments

1 Answer 1

Reset to default 0

I believe the problem might be with you clock usage. OpenMP has a function omp_get_wtime() that gets the wall clock time and allows time measurements in parallel regions which is more accurate. Furthermore after running you code with omp_get_wtime() and modifying the code with num_threads(1) and (6) the time difference is quite drastic.

Comparison on times

You can also set the number of threads using OMP_NUM_THREADS=1 or OMP_NUM_THREADS=6 instead of in the code, since it allows for faster testing.

Also make sure to compile with the -fopenmp flag

本文标签: cWhy do OpenMP programs run faster on a single process than on multiple processesStack Overflow