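[Question] CUDA matrix multiplication gives wrong values for a few elements
Problem encountered: (please describe the problem clearly)
I recently started learning CUDA
and wrote a matrix multiplication exercise following the tutorial from the National Center for High-performance Computing (NCHC).
However, some elements of the result come out wrong,
and with the same input and the same code,
the output can differ from run to run (the answer shifts among a handful of versions).
These are the two input matrices: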
M =
1.000000 3.000000 2.000000 0.000000
1.000000 0.000000 2.000000 2.000000
2.000000 0.000000 1.000000 1.000000
1.000000 3.000000 1.000000 3.000000
N =
3.000000 2.000000 3.000000 0.000000
3.000000 0.000000 2.000000 1.000000
0.000000 2.000000 1.000000 0.000000
2.000000 3.000000 3.000000 2.000000
Expected correct result:
P =
12.000000 6.000000 11.000000 3.000000
7.000000 12.000000 11.000000 4.000000
8.000000 9.000000 10.000000 2.000000
18.000000 13.000000 19.000000 9.000000
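
The expected P above can be checked independently of the GPU. The following is a minimal CPU reference sketch, assuming square row-major matrices stored in a flat array; the name `MatrixMulReference` is hypothetical and does not come from the original post or the tutorial.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain CPU reference: P = M * N for square, row-major width x width matrices.
// MatrixMulReference is a hypothetical helper name used only for illustration.
std::vector<float> MatrixMulReference(const std::vector<float>& M,
                                      const std::vector<float>& N,
                                      int width) {
    const std::size_t count = static_cast<std::size_t>(width) * width;
    assert(M.size() == count && N.size() == count);

    std::vector<float> P(count, 0.0f);
    for (int row = 0; row < width; ++row) {
        for (int col = 0; col < width; ++col) {
            float sum = 0.0f;
            // Dot product of row `row` of M with column `col` of N.
            for (int k = 0; k < width; ++k)
                sum += M[row * width + k] * N[k * width + col];
            P[row * width + col] = sum;
        }
    }
    return P;
}
```

Calling it with the M and N listed above and `width = 4` should reproduce the expected P, which gives a fixed baseline to compare every GPU run against.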
Incorrect results produced by the program:
The output mainly varies among the following versions…
P =
12.000000 6.000000 11.000000 3.000000
8.000000 5.000000 11.000000 4.000000
8.000000 5.000000 10.000000 0.000000
9.000000 5.000000 17.000000 5.000000
P =
12.000000 6.000000 11.000000 3.000000
8.000000 5.000000 11.000000 4.000000
8.000000 9.000000 8.000000 3.000000
18.000000 13.000000 5.000000 3.000000
P =
12.000000 6.000000 11.000000 3.000000
8.000000 5.000000 11.000000 4.000000
8.000000 9.000000 10.000000 2.000000
7.000000 4.000000 19.000000 9.000000
P =
12.000000 6.000000 7.000000 6.000000
7.000000 12.000000 7.000000 6.000000
8.000000 9.000000 8.000000 3.000000
18.000000 13.000000 5.000000 3.000000
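
When the output changes from run to run, it can help to list exactly which elements differ from the expected result instead of comparing the printed matrices by eye. A small sketch of such a check follows; `ReportMismatches` and its tolerance parameter are hypothetical names introduced here for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Prints every element where `got` differs from `expected` by more than `tol`
// and returns how many such elements were found.
// ReportMismatches is a hypothetical helper, not part of the original code.
int ReportMismatches(const std::vector<float>& expected,
                     const std::vector<float>& got,
                     int width,
                     float tol = 1e-5f) {
    const std::size_t count = static_cast<std::size_t>(width) * width;
    assert(expected.size() == count && got.size() == count);

    int mismatches = 0;
    for (int row = 0; row < width; ++row) {
        for (int col = 0; col < width; ++col) {
            const float e = expected[row * width + col];
            const float g = got[row * width + col];
            if (std::fabs(e - g) > tol) {
                std::printf("P[%d][%d]: expected %f, got %f\n", row, col, e, g);
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

Applied to the first incorrect version listed above, this reports 8 differing elements, all in rows 1 to 3. Logging the mismatched positions over several runs makes it easier to see whether the errors follow a pattern.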
Development platform: (e.g. VC++ or gcc/g++ or Dev-C++, Windows or Linux)
XP - VS2005
GeForce GT 220 - CUDA 2.3
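Problematic code: (please make good use of the color-highlighting feature described in the pinned post)
I actually think the code itself is probably fine,
but I am posting the part that processes the data anyway,
in case I wrote something wrong ->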
// Forward declarations so the kernel can call the helpers defined below
__device__ float* GetSubMatrix(float* Md, int x, int y, int Width);
__device__ float GetMatrixElement(float* Mdsub, int x, int y, int Width);
__device__ void SetMatrixElement(float* Psub, int x, int y, float Pvalue, int Width);

// Matrix multiplication kernel - per-thread code
// Note: the indexing below assumes Width is a multiple of TILE_WIDTH and a launch with
// TILE_WIDTH x TILE_WIDTH threads per block on a (Width/TILE_WIDTH) x (Width/TILE_WIDTH) grid
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {
    // Block ID
    int bx = blockIdx.x;
    int by = blockIdx.y;
    // Thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue stores the element of the block sub-matrix
    // that is computed by the thread - automatic variable!
    float Pvalue = 0;

    // Loop over all the sub-matrices of M and N
    // required to compute the block sub-matrix
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Get a pointer to the current sub-matrix Msub of M
        float *Mdsub = GetSubMatrix(Md, m, by, Width);
        // Get a pointer to the current sub-matrix Nsub of N
        float *Ndsub = GetSubMatrix(Nd, bx, m, Width);

        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        // Each thread loads one element of the M sub-matrix
        Mds[ty][tx] = GetMatrixElement(Mdsub, tx, ty, Width);
        // Each thread loads one element of the N sub-matrix
        Nds[ty][tx] = GetMatrixElement(Ndsub, tx, ty, Width);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Each thread computes one element of the block sub-matrix
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of M and N in the next iteration
        __syncthreads();
    }

    // Get a pointer to the block sub-matrix of P
    float *Psub = GetSubMatrix(Pd, bx, by, Width);
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    SetMatrixElement(Psub, tx, ty, Pvalue, Width);
}

// Returns a pointer to the tile at tile column x, tile row y
__device__ float* GetSubMatrix(float* Md, int x, int y, int Width) {
    return (Md + y*TILE_WIDTH*Width + x*TILE_WIDTH);
}

// Reads the element at column x, row y of a tile
__device__ float GetMatrixElement(float* Mdsub, int x, int y, int Width) {
    return *(Mdsub + y*Width + x);
}

// Writes Pvalue to column x, row y of a tile
__device__ void SetMatrixElement(float* Psub, int x, int y, float Pvalue, int Width) {
    *(Psub + y*Width + x) = Pvalue;
}
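
One way to test the belief that the kernel is fine is to replay its index arithmetic serially on the CPU, one block and one thread at a time. The sketch below does that under stated assumptions: `TILE_WIDTH` is set to 2 because the post does not give its value, and `EmulateTiledKernel`, `GetSubMatrixHost`, and `GetMatrixElementHost` are hypothetical host-side stand-ins for the device code above. The load phase and compute phase are run as separate loops, which is what the two `__syncthreads()` calls enforce on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int TILE_WIDTH = 2;  // assumption: the original post does not state this value

// Host copies of the device helpers, with identical index arithmetic.
static const float* GetSubMatrixHost(const float* base, int x, int y, int width) {
    return base + y * TILE_WIDTH * width + x * TILE_WIDTH;
}
static float GetMatrixElementHost(const float* sub, int x, int y, int width) {
    return *(sub + y * width + x);
}

// Serial emulation of the tiled kernel: loops stand in for blocks and threads.
std::vector<float> EmulateTiledKernel(const std::vector<float>& M,
                                      const std::vector<float>& N,
                                      int width) {
    assert(width % TILE_WIDTH == 0);
    const std::size_t count = static_cast<std::size_t>(width) * width;
    assert(M.size() == count && N.size() == count);

    std::vector<float> P(count, 0.0f);
    const int tiles = width / TILE_WIDTH;
    for (int by = 0; by < tiles; ++by) {
        for (int bx = 0; bx < tiles; ++bx) {
            // One accumulator per emulated thread of this block.
            float Pvalue[TILE_WIDTH][TILE_WIDTH] = {};
            for (int m = 0; m < tiles; ++m) {
                const float* Msub = GetSubMatrixHost(M.data(), m, by, width);
                const float* Nsub = GetSubMatrixHost(N.data(), bx, m, width);
                float Mds[TILE_WIDTH][TILE_WIDTH];
                float Nds[TILE_WIDTH][TILE_WIDTH];
                // Load phase: every "thread" loads one element of each tile.
                for (int ty = 0; ty < TILE_WIDTH; ++ty)
                    for (int tx = 0; tx < TILE_WIDTH; ++tx) {
                        Mds[ty][tx] = GetMatrixElementHost(Msub, tx, ty, width);
                        Nds[ty][tx] = GetMatrixElementHost(Nsub, tx, ty, width);
                    }
                // Compute phase: starts only after all loads have finished.
                for (int ty = 0; ty < TILE_WIDTH; ++ty)
                    for (int tx = 0; tx < TILE_WIDTH; ++tx)
                        for (int k = 0; k < TILE_WIDTH; ++k)
                            Pvalue[ty][tx] += Mds[ty][k] * Nds[k][tx];
            }
            // Write this block's tile of P.
            float* Psub = P.data() + by * TILE_WIDTH * width + bx * TILE_WIDTH;
            for (int ty = 0; ty < TILE_WIDTH; ++ty)
                for (int tx = 0; tx < TILE_WIDTH; ++tx)
                    *(Psub + ty * width + tx) = Pvalue[ty][tx];
        }
    }
    return P;
}
```

If this emulation reproduces the expected P for the posted M and N, the tile indexing itself is consistent, and the remaining suspects lie in the parts the post does not show: the actual value of `TILE_WIDTH`, the grid and block dimensions passed at launch, and the host-side allocation and copies.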
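Additional notes:
From other posts I have read, some people seem to say that hardware-related factors can also have an effect,
but I do not know how to address that.
Any help would be appreciated, thanks!!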
--
※ Posted from: PTT, 批踢踢實業坊 (ptt.cc)
◆ From: 163.22.18.83