hpc-lab-code/lab4/experiment_data/matrixmul_comparison.txt
2026-01-21 18:30:58 +08:00

=== CPU (OpenMP), varying thread counts ===
CPU matrix multiplication performance test (OpenMP multithreading)
=================================================================
Matrix       Threads      Time(ms)    GFLOPS   Speedup
-----------------------------------------------------------------
256x256            8        90.372      0.37      1.07
256x256           64        83.707      0.40      1.16
256x256          256        84.262      0.40      1.15
-----------------------------------------------------------------
512x512            8       815.295      0.33      1.01
512x512           64       813.476      0.33      1.01
512x512          256       812.463      0.33      1.01
-----------------------------------------------------------------
1024x1024          8      6571.000      0.33      1.00
1024x1024         64      6586.094      0.33      1.00
1024x1024        256      6569.582      0.33      1.00
-----------------------------------------------------------------
2048x2048          8     55244.488      0.31      1.00
2048x2048         64     55211.832      0.31      1.00
2048x2048        256     55239.930      0.31      1.00
-----------------------------------------------------------------
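ASCII chart: CPU performance analysis
=================================================================
1. Speedup trend across thread counts
Matrix       Threads=8   Threads=64   Threads=256
2. Performance trend across matrix sizes
Threads      256x256   512x512   1024x1024   2048x2048
Note: generating the full charts with Python (matplotlib) is recommended.
Recommended charts:
- Line chart: speedup vs. matrix size for each thread count
- Bar chart: GFLOPS comparison across configurations
- Heatmap: performance distribution over thread count × matrix size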
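The first recommended chart (speedup vs. matrix size, one line per thread count) could be produced along these lines. This is only a sketch: the speedup values are copied from the CPU table above, while the names `SIZES`, `SPEEDUP`, `plot_speedup`, and the output file `cpu_speedup.png` are hypothetical and not part of the lab code.

```python
# Speedup values copied from the CPU (OpenMP) table above (first run).
SIZES = [256, 512, 1024, 2048]
SPEEDUP = {
    8: [1.07, 1.01, 1.00, 1.00],
    64: [1.16, 1.01, 1.00, 1.00],
    256: [1.15, 1.01, 1.00, 1.00],
}


def plot_speedup(path="cpu_speedup.png"):
    """Draw one speedup-vs-size line per thread count and save it as a PNG."""
    # Imported here so the data above can be used without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for threads, values in SPEEDUP.items():
        ax.plot(SIZES, values, marker="o", label=f"Threads={threads}")
    ax.set_xscale("log", base=2)  # matrix sizes double at each step
    ax.set_xticks(SIZES)
    ax.set_xticklabels([f"{n}x{n}" for n in SIZES])
    ax.set_xlabel("Matrix size")
    ax.set_ylabel("Speedup")
    ax.legend()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return path
```

Calling `plot_speedup()` writes the PNG and returns its path. The bar chart and heatmap could reuse the same data layout with `ax.bar` and `ax.imshow` respectively.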
=== CUDA Kernel1 (baseline version) ===
CUDA Kernel1 matrix multiplication performance results
==========================================
Matrix Size    Time(s)  Time(ms)    GFLOPS
------------------------------------------
512x512       0.000312     0.312    860.70
1024x1024     0.002373     2.373    905.03
2048x2048     0.019180    19.180    895.72
4096x4096     0.129868   129.868   1058.30
==========================================
=== CUDA Kernel2 (shared-memory optimization) ===
CUDA Kernel2 (shared-memory optimization) matrix multiplication performance results
==========================================
Matrix Size    Time(s)  Time(ms)    GFLOPS
------------------------------------------
512x512       0.000826     0.826    324.87
1024x1024     0.006479     6.479    331.43
2048x2048     0.053598    53.598    320.53
4096x4096     0.432496   432.496    317.78
==========================================
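The GFLOPS columns in the two CUDA tables above appear to follow the usual 2·N³ operation count for an N×N matrix multiply divided by the elapsed time; that formula is an assumption here, but it reproduces the reported values to within the rounding of the printed timings. A small sketch that recomputes them and the Kernel2/Kernel1 time ratio (the names `gflops`, `KERNEL1`, and `KERNEL2` are hypothetical):

```python
def gflops(n, seconds):
    """GFLOPS for an n x n matrix multiply, assuming 2*n^3 floating-point operations."""
    return 2 * n**3 / seconds / 1e9


# (matrix size, time in seconds) copied from the CUDA tables above (first run).
KERNEL1 = [(512, 0.000312), (1024, 0.002373), (2048, 0.019180), (4096, 0.129868)]
KERNEL2 = [(512, 0.000826), (1024, 0.006479), (2048, 0.053598), (4096, 0.432496)]

for (n, t1), (_, t2) in zip(KERNEL1, KERNEL2):
    # A time ratio above 1 means Kernel2 took longer than Kernel1 for that size.
    print(f"{n}x{n}: Kernel1 {gflops(n, t1):.1f} GFLOPS, "
          f"Kernel2 {gflops(n, t2):.1f} GFLOPS, time ratio {t2 / t1:.2f}")
```

The same function applies to the CPU tables once the millisecond timings are converted to seconds.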
=== CPU (OpenMP), varying thread counts ===
CPU matrix multiplication performance test (OpenMP multithreading)
=================================================================
Matrix       Threads      Time(ms)    GFLOPS   Speedup
-----------------------------------------------------------------
256x256            8        90.532      0.37      1.08
256x256           64        83.896      0.40      1.17
256x256          256        83.807      0.40      1.17
-----------------------------------------------------------------
512x512            8       814.564      0.33      1.00
512x512           64       817.633      0.33      1.00
512x512          256       812.408      0.33      1.01
-----------------------------------------------------------------
1024x1024          8      6639.308      0.32      1.00
1024x1024         64      6627.468      0.32      1.00
1024x1024        256      6656.504      0.32      1.00
-----------------------------------------------------------------
2048x2048          8     55719.875      0.31      1.00
2048x2048         64     55636.734      0.31      1.00
2048x2048        256     55657.629      0.31      1.00
-----------------------------------------------------------------
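ASCII chart: CPU performance analysis
=================================================================
1. Speedup trend across thread counts
Matrix       Threads=8   Threads=64   Threads=256
2. Performance trend across matrix sizes
Threads      256x256   512x512   1024x1024   2048x2048
Note: generating the full charts with Python (matplotlib) is recommended.
Recommended charts:
- Line chart: speedup vs. matrix size for each thread count
- Bar chart: GFLOPS comparison across configurations
- Heatmap: performance distribution over thread count × matrix size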
=== CUDA Kernel1 (baseline version) ===
CUDA Kernel1 matrix multiplication performance results
==========================================
Matrix Size    Time(s)  Time(ms)    GFLOPS
------------------------------------------
512x512       0.000316     0.316    848.68
1024x1024     0.002367     2.367    907.12
2048x2048     0.019190    19.190    895.24
4096x4096     0.138181   138.181    994.63
==========================================
=== CUDA Kernel2 (shared-memory optimization) ===
CUDA Kernel2 (shared-memory optimization) matrix multiplication performance results
==========================================
Matrix Size    Time(s)  Time(ms)    GFLOPS
------------------------------------------
512x512       0.000828     0.828    324.24
1024x1024     0.006483     6.483    331.27
2048x2048     0.053603    53.603    320.50
4096x4096     0.432285   432.285    317.94
==========================================