A little research. Parallel matrix multiplication or cache magic

Your problem is due to a race condition on the inner loop variable j. It needs to be made private.

In C89 I would do something like this:

#pragma omp parallel
{
    int i, j, k;
    #pragma omp for
    for(i=0; ...
}

For C++ or C99, use mixed declarations

#pragma omp parallel for
for(int i=0; ...

By doing this, you don't have to explicitly declare anything shared or private.
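For completeness, here is a minimal compilable sketch of both variants (the size N = 512, the static arrays and the function names are placeholders chosen purely for illustration; compile with -fopenmp):

#include <omp.h>

enum { N = 512 };                     /* arbitrary example size */
static double a[N][N], b[N][N], c[N][N];

void mul_c89_style(void)
{
    /* C89 style: declare the loop counters inside the parallel region,
       so each thread gets its own private copy of i, j, k. */
    #pragma omp parallel
    {
        int i, j, k;
        #pragma omp for
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                double dot = 0;
                for (k = 0; k < N; k++)
                    dot += a[i][k] * b[k][j];
                c[i][j] = dot;
            }
    }
}

void mul_c99_style(void)
{
    /* C99/C++ style: counters declared in the for statements are
       automatically private, so no extra clauses are needed. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double dot = 0;
            for (int k = 0; k < N; k++)
                dot += a[i][k] * b[k][j];
            c[i][j] = dot;
        }
}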

Some additional comments on your code. When using B[k][j], your single-threaded code is not cache friendly: it reads one cache line, then moves on to the next cache line, and so on until the dot product is done, by which time the earlier cache lines have been evicted. Instead, you should first transpose B and access it as BT[j][k]. Also, you've allocated arrays of arrays, not one contiguous 2D array. I've corrected your code to use the transpose and a contiguous 2D array.

Here are the times I get for size = 512:

no transpose, no OpenMP: 0.94 s
no transpose, OpenMP: 0.23 s
transpose, no OpenMP: 0.27 s
transpose, OpenMP: 0.08 s

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void transpose(double *A, double *B, int n) {
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            B[j*n + i] = A[i*n + j];
}
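The rest of the listing was cut off; a sketch of how the corrected multiplication might look, using contiguous n*n arrays and a transposed copy of B (the function name and exact structure are a reconstruction, not the original code):

/* Sketch: A, B, C are contiguous n*n arrays, e.g. malloc(n*n*sizeof(double)).
   Uses the transpose() helper above; compile with -fopenmp. */
void gemm_omp(double *A, double *B, double *C, int n) {
    double *BT = (double*)malloc(n*n*sizeof(double));
    transpose(B, BT, n);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double dot = 0;
            for (int k = 0; k < n; k++)
                dot += A[i*n + k] * BT[j*n + k];   /* sequential walk over BT */
            C[i*n + j] = dot;
        }
    }
    free(BT);
}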

This article covers the POSIX Threads and OpenMP standards: namely, parallel multiplication of NxN matrices using each of them, and a comparison of the two.

1) Matrix multiplication code using OpenMP.

/* build with: gcc -fopenmp matmul_omp.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <omp.h>

int *alloc_matrix(int size){
    int *matrix = (int*)malloc(size * size * sizeof(int));
    printf("matrix %dx%d allocated\n", size, size);
    return matrix;
}

void del_matrix(int *matrix){
    free(matrix);
}

double dtime(){
    struct timeval tv;
    struct timezone tz;
    double t;
    gettimeofday(&tv, &tz);
    t = (double)tv.tv_sec + (double)tv.tv_usec*1e-6;
    return(t);
}

int main(){
    int N = 50;
    int THR = 2;
    int i, j, k;
    printf("Enter the number of threads\n");
    scanf("%i", &THR);
    omp_set_dynamic(0);
    omp_set_num_threads(THR);
    printf("Enter matrix size\n");
    scanf("%i", &N);
    int *A = alloc_matrix(N);
    int *B = alloc_matrix(N);
    int *C = alloc_matrix(N);
    int rand_num;
    srand(time(NULL));
    /* fill A and B with pseudorandom integers from 1 to 1000 */
    for(int i = 0; i < N * N; i++){
        rand_num = rand() % 1000 + 1;
        A[i] = rand_num;
        rand_num = rand() % 1000 + 1;
        B[i] = rand_num;
    }
    /* multiply: the outer loop over rows of C is parallelized by OpenMP */
    double t0 = dtime();
    #pragma omp parallel for private(j, k)
    for(i = 0; i < N; i++){
        for(j = 0; j < N; j++){
            int tmp = 0;
            for(k = 0; k < N; k++)
                tmp += A[i * N + k] * B[k * N + j];
            C[i * N + j] = tmp;
        }
    }
    printf("multiplication took %fs\n", dtime() - t0);
    del_matrix(A);
    del_matrix(B);
    del_matrix(C);
    return 0;
}

2) Code for matrix multiplication using Pthread.

/* build with: gcc matmul_pthread.c -pthread */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <pthread.h>

#define SIZE 50

int *alloc_matrix(int size){
    int *matrix = (int*)malloc(size * size * sizeof(int));
    printf("matrix %dx%d allocated\n", size, size);
    return matrix;
}

void del_matrix(int *matrix){
    free(matrix);
}

double dtime(){
    struct timeval tv;
    struct timezone tz;
    double t;
    gettimeofday(&tv, &tz);
    t = (double)tv.tv_sec + (double)tv.tv_usec*1e-6;
    return(t);
}

struct matrix_args{
    int *m1;
    int *m2;
    int *ans;
    int size;
    int start;
    int end;
};

/* each thread computes the rows [start, end) of the result matrix */
void *multiply_matrix(void *pargs){
    struct matrix_args *args = (struct matrix_args *) pargs;
    int i, j, k, l, m, tmp = 0;
    double t0 = dtime();
    int *m1 = args->m1;
    int *m2 = args->m2;
    int *ans = args->ans;
    int size = args->size;
    int start = args->start;
    int end = args->end;
    for(i = start; i < end; i++){
        m = i * size;
        for(j = 0; j < size; j++){
            l = 0;
            for(k = 0; k < size; k++){
                tmp += m1[m + k] * m2[l + j];
                l += size;
            }
            ans[m + j] = tmp;
            tmp = 0;
        }
    }
    printf("thread works %fs\n", dtime() - t0);
    pthread_exit(NULL);
}

int main(int argc, char** argv){
    int i, size, step, pos;
    int THR_NUM, THR, N;
    printf("Enter the number of threads\n");
    scanf("%i", &THR);
    printf("Enter matrix size\n");
    scanf("%i", &N);
    THR_NUM = THR;
    size = N;
    printf("allocating\n");
    int *m1 = alloc_matrix(size);
    int *m2 = alloc_matrix(size);
    int *ans = alloc_matrix(size);
    srand(time(NULL));
    /* fill m1 and m2 with pseudorandom integers from 1 to 1000 */
    for(int i = 0; i < size * size; i++){
        m1[i] = rand() % 1000 + 1;
        m2[i] = rand() % 1000 + 1;
    }
    /* divide the rows of the result between THR_NUM threads */
    pthread_t *threads = (pthread_t*)malloc(THR_NUM * sizeof(pthread_t));
    struct matrix_args *args =
        (struct matrix_args*)malloc(THR_NUM * sizeof(struct matrix_args));
    step = size / THR_NUM;
    pos = 0;
    double t0 = dtime();
    for(i = 0; i < THR_NUM; i++){
        args[i].m1 = m1;
        args[i].m2 = m2;
        args[i].ans = ans;
        args[i].size = size;
        args[i].start = pos;
        pos += step;
        args[i].end = (i == THR_NUM - 1) ? size : pos;
        pthread_create(&threads[i], NULL, multiply_matrix, &args[i]);
    }
    for(i = 0; i < THR_NUM; i++)
        pthread_join(threads[i], NULL);
    printf("multiplication took %fs\n", dtime() - t0);
    free(threads);
    free(args);
    del_matrix(m1);
    del_matrix(m2);
    del_matrix(ans);
    return 0;
}

3) Testing

Multiplying 1000x1000-element matrices on an Intel Atom processor

Based on the data obtained, it is clear that a slight performance increase occurs at 2 threads thanks to the Hyper-Threading technology that this processor supports. Increasing the number of threads further did not improve performance, but rather slowed it down. It is also clear that manual parallelization with Pthreads is more effective than automatic parallelization with OpenMP.

Multiplying 1000x1000-element matrices on an Intel Xeon processor

Based on the data obtained, it is clear that performance increases at 2 and 4 threads, since the processor can dedicate a separate core to each thread. But performance starts to drop when we try to run the program with 10 threads; this happens because of the time spent on switching between threads. From this table we can conclude that it is most efficient to run one thread per core (the application showed the best performance with 4 threads). It is also clear that manual parallelization with Pthreads turned out to be more effective than automatic parallelization with OpenMP.

funny barbel July 25, 2011 at 00:59

Parallel matrix multiplication or cache magic. A little research

Introduction

This article is about a small study that was conducted out of pure curiosity. But since the results were interesting, I decided to publish them. The basics of parallel computing were discussed using a simple and convenient task: matrix multiplication.

There are several pictures under the cut!

Formally, the task is stated as follows:
Given two matrices A and B of dimension NxN. It is necessary to calculate their product: matrix C.
C[i][j] = sum over k from 1 to N A[i][k]*B[k][j].

Solution without parallel computing

The simplest algorithm for solving this problem, with O(N³) asymptotic complexity, is as follows:
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j){
        c[i][j] = 0;
        for (int k = 0; k < n; ++k)
            c[i][j] += a[i][k]*b[k][j];
    }

There is a well-known trick that speeds up the above code by a factor of 4-5: replacing array b with array bt, which contains the transposed matrix B:
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        bt[i][j] = b[j][i];

for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j){
        c[i][j] = 0;
        for (int k = 0; k < n; ++k)
            c[i][j] += a[i][k]*bt[j][k];
    }



The acceleration in this case is related to the design of the processor cache. When a memory cell is accessed, not only that cell but also several neighboring cells are loaded into the cache. In the first case, the elements of array b that we access in the loop over k lie in different rows of the same column, which means that in RAM they are separated by a distance equal to the size of one row of the array. In the second case, we sequentially access the elements of a single row of the bt array, which lie in memory one after another.
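To observe just this effect in isolation, one can time the two traversal orders over the same data. A small self-contained sketch (the size and the use of std::chrono are illustrative, not taken from the original test program):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 2000;                       // arbitrary size for illustration
    std::vector<double> b(n * n, 1.0);        // same data, two traversal orders
    double sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            sum += b[k * n + j];              // column walk: stride of n doubles
    auto t1 = std::chrono::steady_clock::now();
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            sum += b[j * n + k];              // row walk: consecutive doubles
    auto t2 = std::chrono::steady_clock::now();

    std::printf("column walk: %lld ms, row walk: %lld ms (checksum %f)\n",
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count(),
        sum);
    return 0;
}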

Parallel Computing Solution

Since each element of matrix C is calculated independently of the others, and matrices A and B do not change, to compute the product in parallel it is enough to specify which elements of C each thread should calculate.
For this purpose, a function was created that receives two numbers, lf and rg. The function calculates the rows of matrix C from lf to rg inclusive, using the original matrix B:
#define forn(i, n) for (int i = 0; i < int(n); ++i)

struct matrixMulParam{
    int lf, rg;
    matrixMulParam(int cLf = 0, int cRg = 0) : lf(cLf), rg(cRg){}
};

DWORD matrixMulNoHack(LPVOID p){
    for (int i = ((matrixMulParam*)p)->lf; i <= ((matrixMulParam*)p)->rg; ++i)
        forn(j, n){
            c[i][j] = 0;
            forn(k, n)
                c[i][j] += a[i][k] * b[k][j];
        }

    delete (matrixMulParam*)p;
    return 0;
}
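The article does not reproduce the code that starts the threads. Since the worker uses the WinAPI DWORD/LPVOID signature, the launch presumably relied on CreateThread; a rough sketch, building on the definitions above, might look like this (the function name multiplyParallel and the even split of rows are assumptions made here for illustration):

#include <windows.h>
#include <vector>

// Hypothetical launcher: splits the n rows of C into threadCount contiguous
// blocks and runs matrixMulNoHack on each block in its own thread.
// Each worker deletes its own matrixMulParam, as in the function above.
void multiplyParallel(int threadCount){
    std::vector<HANDLE> handles;
    int rowsPerThread = n / threadCount;
    for (int t = 0; t < threadCount; ++t){
        int lf = t * rowsPerThread;
        int rg = (t == threadCount - 1) ? n - 1 : lf + rowsPerThread - 1;
        // The cast assumes the calling conventions match; the original
        // worker may well have been declared with WINAPI.
        handles.push_back(CreateThread(NULL, 0,
            (LPTHREAD_START_ROUTINE)matrixMulNoHack,
            new matrixMulParam(lf, rg), 0, NULL));
    }
    WaitForMultipleObjects((DWORD)handles.size(), handles.data(), TRUE, INFINITE);
    for (size_t t = 0; t < handles.size(); ++t)
        CloseHandle(handles[t]);
}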


In the same way, a function was written using a transposed matrix.
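That function is not reproduced in the article; a sketch of what it might look like (the name matrixMulHack is made up here), mirroring matrixMulNoHack but reading from the transposed copy bt:

// Illustrative sketch only: identical to matrixMulNoHack, except that the
// inner loop reads bt[j][k], i.e. walks one row of the transposed matrix
// sequentially instead of a column of b.
DWORD matrixMulHack(LPVOID p){
    for (int i = ((matrixMulParam*)p)->lf; i <= ((matrixMulParam*)p)->rg; ++i)
        forn(j, n){
            c[i][j] = 0;
            forn(k, n)
                c[i][j] += a[i][k] * bt[j][k];
        }

    delete (matrixMulParam*)p;
    return 0;
}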

Collecting operating time data

The calculations were carried out on 4 different computers with different processors:
  1. Intel Atom N550
    • Clock frequency: 1.5 GHz
    • Number of cores: 2
    • Number of threads: 4
      This is not a typo. Here is a link to the Intel website, and here are screenshots of my netbook's Task Manager and Device Manager
    • L2 cache size: 1 MB
  2. Intel Core2Duo P8600
    • Clock frequency: 2.4 GHz
    • Number of cores: 2
    • Number of threads: 2
    • L2 cache size: 3 MB
  3. Intel Core2Quad Q6600
    • Clock frequency: 2.4 GHz
    • Number of cores: 4
    • Number of threads: 4
    • L2 cache size: 8 MB
  4. Amazon EC2 High-CPU Extra Large Instance,
    approximately equal in power to two Intel Xeon E5410 processors
    • Clock frequency: 2.33 GHz
    • Number of cores: 4 (8 in total)
    • Number of threads: 4 (8 in total)
    • L2 cache size: 12 (24 in total) MB
To obtain statistics, the following parameters were varied:
  • Matrix size: 1000x1000 or 2000x2000.
    The matrices were filled with pseudorandom integers from 1 to 1000.
  • Number of threads from 1 to 16.
  • Using the original B matrix or a transposed one.
For each combination of the specified parameters, calculations were performed at least 5 times, and the arithmetic mean of the program running times was used as the result.
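A measurement harness for this can be as simple as the following sketch (the timer and the dummy workload are placeholders; in the real tests the workload would be one multiplication with a particular parameter combination):

#include <chrono>
#include <cstdio>

// Runs `work` `repeats` times and returns the arithmetic mean running time
// in milliseconds.
template <class F>
double measureMeanMs(F work, int repeats = 5){
    double total = 0;
    for (int r = 0; r < repeats; ++r){
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        total += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return total / repeats;
}

int main(){
    volatile double s = 0;
    // Dummy workload standing in for the matrix multiplication.
    double ms = measureMeanMs([&]{ for (int i = 0; i < 10000000; ++i) s += i; });
    std::printf("mean time: %.1f ms\n", ms);
    return 0;
}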

Results

Since I tested different processors with different parameters, the absolute running time of the program in milliseconds cannot be used for comparison. Instead, we introduce the relative running time: the ratio, in percent, of the time to compute the product of matrices with given parameters to the time to compute the product of the same matrices with the same parameters in a single thread.
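For example (the numbers here are made up purely to illustrate the definition): if some configuration takes 8000 ms in one thread and 2300 ms in four threads, then the relative running time for four threads is 2300 / 8000 ≈ 29%.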

Let's look at the graphs of relative running time versus the number of threads for the algorithm using the original matrix:

This graph meets all expectations: the running time drops as the number of threads increases up to the number of cores, and then remains approximately constant. However, while the results for the Core2Quad and Xeon differ from the ideal by less than 1%, the results for the Core2Duo and Atom are worse.

When the data size increases by a factor of 4, the lag becomes even more pronounced, especially on the Atom N550 (netbooks are clearly not designed for multiplying 2000x2000 matrices). The explanation is the same as for the speedup from the transposed matrix: when several threads access different areas of memory, each individual thread works more slowly than a single thread traversing memory in order. Nevertheless, several threads still finish faster than one.

The following graphs represent the relative running time versus the number of threads for the algorithm using a transposed matrix:


Since the speedup from the transposed matrix comes from more efficient use of the cache, it starts to perform worse when the number of threads grows and the memory access sequence is disrupted. Atom and Core2Quad behave interestingly: the Atom graph remained virtually unchanged, while the Core2Quad results deteriorated sharply when moving from 1000x1000 to 2000x2000.
This is explained as follows: with a matrix size of 1000x1000 we have 2 arrays (a and b) with 1000*1000 elements of 4 bytes each (int), which is approximately 8 megabytes. The 1 MB cache of the Atom N550 holds only about 1/8 of this data, while the Core2Quad cache fits all of it. When we grow the arrays to 2000x2000, the amount of data increases by a factor of 4. Now only about 1/4 of the data fits into the Core2Quad cache, so the transposition trick becomes less effective; on the Atom the data never fit in the first place, so nothing changes there.
To illustrate the effect described above, consider two more graphs.
Let us call the efficiency of using a transposed matrix the ratio of the running time of the algorithm using the transposed matrix under certain parameters to the running time of the algorithm using the original matrix under the same parameters.


This graph illustrates all of the above well. The efficiency on the Core2Quad and Xeon decreases because many concurrent threads access different areas of memory and tear the cache apart. The efficiency on the Core2Duo barely changes, since it runs only 2 threads at a time and its cache is not as large as those of the previous processors.
The following figure contains the same graph, only with the addition of Atom N550 indicators.


I was never able to explain this increase in the efficiency of the cache and transposition trick. I hope we can get to the bottom of it in the comments.

Conclusion

Firstly, the processor cache significantly affects the speed of program execution, although in most cases we don't even think about it.
Secondly, the source code, and in general everything I used for this article, can be downloaded from the link as a single archive.
Thirdly, the compiler used was g++ 4.5.2, the compilation line was g++ -Wall -O,
and the systems were Windows 7 and Windows Server 2008 R2 on Amazon.
Thank you for your attention.
Best regards, Ivan.

P.S. This is what the data collection process looked like

Tags: parallel computing, multiplication, matrix, c plus plus
