[Question] CUDA matrix multiplication: a few elements come out wrong

Board: C_and_CPP (C/C++)  Author: caifu (C將。)  Time: 2009/12/15 23:03  Score: 3 (3 push, 0 boo, 10 →)

13 comments, 3 participants, thread 1/1

Problem:

I recently started learning CUDA and wrote a matrix multiplication exercise by following the tutorial from the NCHC (National Center for High-performance Computing). Some elements of the result come out wrong. Worse, with the same input and the same code, the output can differ from run to run (it switches among a handful of wrong versions).

These are the two input matrices:

M =
    1.000000 3.000000 2.000000 0.000000
    1.000000 0.000000 2.000000 2.000000
    2.000000 0.000000 1.000000 1.000000
    1.000000 3.000000 1.000000 3.000000

N =
    3.000000 2.000000 3.000000 0.000000
    3.000000 0.000000 2.000000 1.000000
    0.000000 2.000000 1.000000 0.000000
    2.000000 3.000000 3.000000 2.000000

The correct result I expect:

P =
    12.000000 6.000000 11.000000 3.000000
    7.000000 12.000000 11.000000 4.000000
    8.000000 9.000000 10.000000 2.000000
    18.000000 13.000000 19.000000 9.000000

The wrong results the program produces. The output mainly switches among these versions:

P =
    12.000000 6.000000 11.000000 3.000000
    8.000000 5.000000 11.000000 4.000000
    8.000000 5.000000 10.000000 0.000000
    9.000000 5.000000 17.000000 5.000000

P =
    12.000000 6.000000 11.000000 3.000000
    8.000000 5.000000 11.000000 4.000000
    8.000000 9.000000 8.000000 3.000000
    18.000000 13.000000 5.000000 3.000000

P =
    12.000000 6.000000 11.000000 3.000000
    8.000000 5.000000 11.000000 4.000000
    8.000000 9.000000 10.000000 2.000000
    7.000000 4.000000 19.000000 9.000000

P =
    12.000000 6.000000 7.000000 6.000000
    7.000000 12.000000 7.000000 6.000000
    8.000000 9.000000 8.000000 3.000000
    18.000000 13.000000 5.000000 3.000000

Development platform:

XP - VS2005
GeForce GT 220 - CUDA 2.3

Problematic code:

I don't think there is anything wrong with the code itself, but I am posting the part that processes the data anyway, in case I wrote something badly:

    // Matrix multiplication kernel - per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Block ID
        int bx = blockIdx.x;
        int by = blockIdx.y;

        // Thread ID
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Pvalue stores the element of the block sub-matrix
        // that is computed by the thread - automatic variable!
        float Pvalue = 0;

        // Loop over all the sub-matrices of M and N
        // required to compute the block sub-matrix
        for (int m = 0; m < Width/TILE_WIDTH; ++m) {
            // Get a pointer to the current sub-matrix Msub of M
            float *Mdsub = GetSubMatrix(Md, m, by, Width);

            // Get a pointer to the current sub-matrix Nsub of N
            float *Ndsub = GetSubMatrix(Nd, bx, m, Width);

            __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
            __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

            // each thread loads one element of the sub-matrix
            Mds[ty][tx] = GetMatrixElement(Mdsub, tx, ty, Width);

            // each thread loads one element of the sub-matrix
            Nds[ty][tx] = GetMatrixElement(Ndsub, tx, ty, Width);

            // synchronize to make sure the sub-matrices are loaded
            // before starting the computation
            __syncthreads();

            // each thread computes one element of the block sub-matrix
            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[ty][k] * Nds[k][tx];

            // Synchronize to make sure that the preceding
            // computation is done before loading two new
            // sub-matrices of M and N in the next iteration
            __syncthreads();
        }

        // Get a pointer to the block sub-matrix of P
        float *Psub = GetSubMatrix(Pd, bx, by, Width);

        // Write the block sub-matrix to device memory;
        // each thread writes one element
        SetMatrixElement(Psub, tx, ty, Pvalue, Width);
    }

    __device__ float* GetSubMatrix(float* Md, int x, int y, int Width)
    {
        return (Md + y*TILE_WIDTH*Width + x*TILE_WIDTH);
    }

    __device__ float GetMatrixElement(float* Mdsub, int x, int y, int Width)
    {
        return *(Mdsub + y*Width + x);
    }

    __device__ void SetMatrixElement(float* Psub, int x, int y, float Pvalue, int Width)
    {
        *(Psub + y*Width + x) = Pvalue;
    }

Additional notes:

While searching other posts I saw someone mention that hardware-related factors can also have an effect, but I don't know how to deal with that side of things. Any help would be appreciated, thanks!!

--
※ Posted from: 批踢踢實業坊 (ptt.cc)
◆ From: 163.22.18.83
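
One possible way to narrow this down is to replay the kernel's index arithmetic on the CPU and compare it with a plain triple-loop product. The sketch below is only an illustration under stated assumptions: the post does not show `TILE_WIDTH`, so 2 is assumed here; `EmulateTiledKernel` and `MultiplyReference` are hypothetical names that do not appear in the original code; and the helpers are `const`-qualified copies of the posted `__device__` functions. Because the threads are visited one after another, this replay cannot reproduce a race or a bad launch configuration. It only checks whether the tiling and pointer offsets are right.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: TILE_WIDTH is not shown in the post; 2 is chosen here so that
// a 4x4 matrix is split into four 2x2 tiles.
const int TILE_WIDTH = 2;

// Same pointer arithmetic as the __device__ helpers in the post
// (const added because this replay only reads M and N).
const float* GetSubMatrix(const float* base, int x, int y, int Width)
{
    return base + y * TILE_WIDTH * Width + x * TILE_WIDTH;
}

float GetMatrixElement(const float* sub, int x, int y, int Width)
{
    return *(sub + y * Width + x);
}

// Hypothetical helper: visits every (block, thread) pair sequentially and
// repeats what that thread of MatrixMulKernel would compute.
std::vector<float> EmulateTiledKernel(const std::vector<float>& M,
                                      const std::vector<float>& N,
                                      int Width)
{
    // The kernel implicitly relies on these conditions.
    assert(Width > 0 && Width % TILE_WIDTH == 0);
    assert(M.size() == static_cast<std::size_t>(Width) * Width);
    assert(N.size() == M.size());

    std::vector<float> P(M.size(), 0.0f);
    const int tiles = Width / TILE_WIDTH;  // the grid is tiles x tiles blocks

    for (int by = 0; by < tiles; ++by)
    for (int bx = 0; bx < tiles; ++bx)
    for (int ty = 0; ty < TILE_WIDTH; ++ty)
    for (int tx = 0; tx < TILE_WIDTH; ++tx) {
        float Pvalue = 0;
        for (int m = 0; m < tiles; ++m) {
            const float* Msub = GetSubMatrix(&M[0], m, by, Width);
            const float* Nsub = GetSubMatrix(&N[0], bx, m, Width);
            // Stands in for Mds[ty][k] * Nds[k][tx]; the values are read
            // directly from the source matrices instead of shared memory.
            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += GetMatrixElement(Msub, k, ty, Width) *
                          GetMatrixElement(Nsub, tx, k, Width);
        }
        // Same destination as SetMatrixElement(Psub, tx, ty, Pvalue, Width).
        P[(by * TILE_WIDTH + ty) * Width + bx * TILE_WIDTH + tx] = Pvalue;
    }
    return P;
}

// Hypothetical helper: plain row-by-column product used as the reference.
std::vector<float> MultiplyReference(const std::vector<float>& M,
                                     const std::vector<float>& N,
                                     int Width)
{
    std::vector<float> P(M.size(), 0.0f);
    for (int row = 0; row < Width; ++row)
        for (int col = 0; col < Width; ++col)
            for (int k = 0; k < Width; ++k)
                P[row * Width + col] += M[row * Width + k] * N[k * Width + col];
    return P;
}
```

If the replay agrees with the reference for the matrices in the post, the index arithmetic is probably not the culprit. The next things worth checking would then be on the host side: the kernel only fills `Mds` and `Nds` completely when each block is launched with exactly `TILE_WIDTH` x `TILE_WIDTH` threads, and it only covers the whole matrix when the grid has `Width/TILE_WIDTH` blocks in each direction and `Width` is a multiple of `TILE_WIDTH`.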