Re: [Question] CUDA large matrix multiplication

Board: C_and_CPP (C/C++)  Author: lgen7604  Time: 2011/03/28 10:42  Votes: 1 (1 push, 0 boo, 1 neutral)

2 comments, 1 participant. Thread post 2/2

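I ported your kernel function into a test program I wrote a while ago, and after fixing parts of the code it runs correctly. The modified version is here:
http://paste.plurk.com/show/408923/

I'm guessing your machine has only one graphics card?

I ran your kernel function on a 5003x5003 matrix and it produced the correct result. Timings:
CPU time: 1598789.734000 msec
GPU time: 3879.848877 msec

Test platform
CPU: Intel i7 930
GPU: GTX 480

You said 4000x4000 runs fine. How long does it take to execute? A GPU that is driving a display has a 5-second limit on execution time. See
http://tinyurl.com/4j7hhsx

Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit usually causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second runtime restriction. For this reason it is recommended that CUDA be run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.

One more thing: in general you don't need to worry much about running out of registers, unless the kernel function uses a very large number of variables. If you want to know how many registers are used, pass -Xptxas=-v when compiling. I checked your kernel function and it uses only 18 registers, so each block uses 18432 registers (18 per thread x 1024 threads). The GTS 450 architecture (Compute capability 2.1) should have 32k registers, so there are plenty left.

Beyond that, you're right that each block can have at most 1024 threads, but each grid can have up to 65535x65535 blocks, so matrix multiplication should be solvable without iter.

Finally, I have to second akasan's advice about making good use of cudaGetLastError() and cuda-memcheck. Once you've been writing CUDA for a while you'll find they help enormously with debugging. It pays to build the habit gradually.

--
※ Origin: 批踢踢實業坊 (ptt.cc)
◆ From: 122.120.34.95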