=== CPU (OpenMP), different thread counts ===
CPU matrix multiplication performance test (OpenMP multithreading)
=================================================================
Matrix       Threads    Time(ms)     GFLOPS    Speedup
-----------------------------------------------------------------
256x256      8          90.372       0.37      1.07
256x256      64         83.707       0.40      1.16
256x256      256        84.262       0.40      1.15
-----------------------------------------------------------------
512x512      8          815.295      0.33      1.01
512x512      64         813.476      0.33      1.01
512x512      256        812.463      0.33      1.01
-----------------------------------------------------------------
1024x1024    8          6571.000     0.33      1.00
1024x1024    64         6586.094     0.33      1.00
1024x1024    256        6569.582     0.33      1.00
-----------------------------------------------------------------
2048x2048    8          55244.488    0.31      1.00
2048x2048    64         55211.832    0.31      1.00
2048x2048    256        55239.930    0.31      1.00
-----------------------------------------------------------------

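The throughput column in the CPU table above is consistent with the usual operation count for a dense N x N matrix multiplication, 2*N^3 floating-point operations, divided by the elapsed time. The sketch below re-derives it from the Time(ms) column. The 2*N^3 count and the helper name `gflops` are assumptions, since the benchmark source is not part of this log.

```python
def gflops(n: int, time_ms: float) -> float:
    """Throughput in GFLOPS for an n x n matrix multiply, assuming 2*n^3 operations."""
    flops = 2.0 * n ** 3            # one multiply and one add per inner-loop step
    seconds = time_ms / 1000.0
    return flops / seconds / 1e9


# (matrix size N, Time(ms)) pairs copied from the 8-thread rows of the table above
rows = [(256, 90.372), (512, 815.295), (1024, 6571.000), (2048, 55244.488)]
for n, t in rows:
    print(f"{n}x{n}: {gflops(n, t):.2f} GFLOPS")
```

Rounded to two decimals, the results match the reported 0.37, 0.33, 0.33 and 0.31 values for those rows.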
ASCII charts: CPU performance analysis
=================================================================
1. Speedup trend across thread counts
Matrix       Threads=8    Threads=64    Threads=256

2. Performance trend across matrix sizes
Threads      256x256    512x512    1024x1024    2048x2048

Note: full charts are best generated with Python (matplotlib).
Recommended charts:
- Line chart: speedup vs. matrix size for each thread count
- Bar chart: GFLOPS across configurations
- Heatmap: performance over thread count × matrix size
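As a starting point for the line chart recommended above, the following is a minimal sketch that pivots the speedup values into one series per thread count and plots them with matplotlib. The names `SPEEDUP`, `series_by_threads`, `plot_speedup` and the output file name are hypothetical; the speedup values are copied from the first CPU table in this log. matplotlib is imported inside the plotting function so the pivot helper can be used without it.

```python
# Speedup values copied from the CPU table: {matrix size N: {threads: speedup}}
SPEEDUP = {
    256: {8: 1.07, 64: 1.16, 256: 1.15},
    512: {8: 1.01, 64: 1.01, 256: 1.01},
    1024: {8: 1.00, 64: 1.00, 256: 1.00},
    2048: {8: 1.00, 64: 1.00, 256: 1.00},
}


def series_by_threads(speedup):
    """Pivot the table into one (sizes, speedups) series per thread count."""
    sizes = sorted(speedup)
    threads = sorted({t for row in speedup.values() for t in row})
    return {t: (sizes, [speedup[n][t] for n in sizes]) for t in threads}


def plot_speedup(speedup, path="cpu_speedup.png"):
    """Draw speedup vs. matrix size, one line per thread count (needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")           # render to a file, no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for t, (sizes, values) in series_by_threads(speedup).items():
        ax.plot(sizes, values, marker="o", label=f"{t} threads")
    ax.set_xscale("log", base=2)    # matrix sizes double from row to row
    ax.set_xlabel("Matrix size N (N x N)")
    ax.set_ylabel("Speedup")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

Calling `plot_speedup(SPEEDUP)` writes the chart to `cpu_speedup.png`. The same pivoted series could feed the bar chart or heatmap with a different plotting call.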
=== CUDA Kernel1 (baseline version) ===
CUDA Kernel1 matrix multiplication performance test results
=================================
Matrix Size    Time(s)     Time(ms)    GFLOPS
---------------------------------
512x512        0.000312    0.312       860.70
1024x1024      0.002373    2.373       905.03
2048x2048      0.019180    19.180      895.72
4096x4096      0.129868    129.868     1058.30
=================================
=== CUDA Kernel2 (shared-memory optimization) ===
CUDA Kernel2 (shared-memory optimization) matrix multiplication performance test results
=================================
Matrix Size    Time(s)     Time(ms)    GFLOPS
---------------------------------
512x512        0.000826    0.826       324.87
1024x1024      0.006479    6.479       331.43
2048x2048      0.053598    53.598      320.53
4096x4096      0.432496    432.496     317.78
=================================
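The two CUDA tables above can be compared size by size. The sketch below computes the ratio of Kernel2 to Kernel1 throughput from the reported GFLOPS values; the names `KERNEL1`, `KERNEL2` and `throughput_ratio` are hypothetical and used only for illustration.

```python
# GFLOPS values copied from the two CUDA tables above, keyed by matrix size N
KERNEL1 = {512: 860.70, 1024: 905.03, 2048: 895.72, 4096: 1058.30}
KERNEL2 = {512: 324.87, 1024: 331.43, 2048: 320.53, 4096: 317.78}


def throughput_ratio(a, b):
    """Return {N: a[N] / b[N]} for the sizes present in both tables."""
    return {n: a[n] / b[n] for n in sorted(a.keys() & b.keys())}


for n, r in throughput_ratio(KERNEL2, KERNEL1).items():
    print(f"{n}x{n}: Kernel2 reaches {r:.0%} of Kernel1 throughput")
```

In these measurements the shared-memory kernel reports lower GFLOPS than the baseline kernel at every matrix size.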
=== CPU (OpenMP), different thread counts ===
CPU matrix multiplication performance test (OpenMP multithreading)
=================================================================
Matrix       Threads    Time(ms)     GFLOPS    Speedup
-----------------------------------------------------------------
256x256      8          90.532       0.37      1.08
256x256      64         83.896       0.40      1.17
256x256      256        83.807       0.40      1.17
-----------------------------------------------------------------
512x512      8          814.564      0.33      1.00
512x512      64         817.633      0.33      1.00
512x512      256        812.408      0.33      1.01
-----------------------------------------------------------------
1024x1024    8          6639.308     0.32      1.00
1024x1024    64         6627.468     0.32      1.00
1024x1024    256        6656.504     0.32      1.00
-----------------------------------------------------------------
2048x2048    8          55719.875    0.31      1.00
2048x2048    64         55636.734    0.31      1.00
2048x2048    256        55657.629    0.31      1.00
-----------------------------------------------------------------

ASCII charts: CPU performance analysis
=================================================================
1. Speedup trend across thread counts
Matrix       Threads=8    Threads=64    Threads=256

2. Performance trend across matrix sizes
Threads      256x256    512x512    1024x1024    2048x2048

Note: full charts are best generated with Python (matplotlib).
Recommended charts:
- Line chart: speedup vs. matrix size for each thread count
- Bar chart: GFLOPS across configurations
- Heatmap: performance over thread count × matrix size
=== CUDA Kernel1 (baseline version) ===
CUDA Kernel1 matrix multiplication performance test results
=================================
Matrix Size    Time(s)     Time(ms)    GFLOPS
---------------------------------
512x512        0.000316    0.316       848.68
1024x1024      0.002367    2.367       907.12
2048x2048      0.019190    19.190      895.24
4096x4096      0.138181    138.181     994.63
=================================
=== CUDA Kernel2 (shared-memory optimization) ===
CUDA Kernel2 (shared-memory optimization) matrix multiplication performance test results
=================================
Matrix Size    Time(s)     Time(ms)    GFLOPS
---------------------------------
512x512        0.000828    0.828       324.24
1024x1024      0.006483    6.483       331.27
2048x2048      0.053603    53.603      320.50
4096x4096      0.432285    432.285     317.94
=================================
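This log contains two runs of the same CUDA benchmarks, so the run-to-run variation can be estimated from the Time(ms) columns. The sketch below is one way to do that; the names `RUN1`, `RUN2` and `relative_change` are hypothetical, and the values are copied from the Kernel1 and Kernel2 tables of each run.

```python
# Time(ms) per matrix size N, copied from the first and second runs in this log
RUN1 = {"Kernel1": {512: 0.312, 1024: 2.373, 2048: 19.180, 4096: 129.868},
        "Kernel2": {512: 0.826, 1024: 6.479, 2048: 53.598, 4096: 432.496}}
RUN2 = {"Kernel1": {512: 0.316, 1024: 2.367, 2048: 19.190, 4096: 138.181},
        "Kernel2": {512: 0.828, 1024: 6.483, 2048: 53.603, 4096: 432.285}}


def relative_change(first, second):
    """Percent change of the second measurement relative to the first."""
    return (second - first) / first * 100.0


for kernel in RUN1:
    for n, t1 in RUN1[kernel].items():
        change = relative_change(t1, RUN2[kernel][n])
        print(f"{kernel} {n}x{n}: {change:+.2f}%")
```

The largest difference is for Kernel1 at 4096x4096, where the second run is about 6.4% slower than the first.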