# MPI+OpenMP Hybrid Parallel Matrix Multiplication Experiments

## Overview
This document summarizes the experimental analysis of MPI+OpenMP hybrid parallel matrix multiplication performance.
## Generated Files

### Analysis Scripts

- `analyze_mpi_openmp.py` - Python script for data analysis and visualization

### Figures (All labels in English)
- `experiment1_analysis.png` - Experiment 1: Varying MPI Processes (OpenMP threads = 1)
  - Execution Time vs MPI Processes
  - Speedup vs MPI Processes
  - Parallel Efficiency vs MPI Processes
  - Parallel Efficiency Heatmap
- `experiment2_analysis.png` - Experiment 2: Varying Both MPI and OpenMP
  - Efficiency Comparison (Total Processes = 16)
  - Best Configuration Efficiency vs Matrix Size
  - MPI Process Impact on Efficiency
  - Speedup Comparison for Different Configurations
- `experiment3_analysis.png` - Experiment 3: Optimization Results
  - Execution Time Comparison (Before/After)
  - Efficiency Comparison (Before/After)
  - Optimization Effect for Different Matrix Sizes
  - Best Configuration Efficiency Comparison
### Data Files

- `experiment_results.csv` - Complete experimental data
- `serial_results.csv` - Serial baseline performance

### Reports (in Chinese)

- `MPI_OpenMP实验分析报告.md` - Detailed analysis report
- `实验总结.md` - Summary of key findings
## Key Findings

### Experiment 1: MPI Process Scaling
- Optimal configuration: 6 MPI processes
- Parallel efficiency: 75%-89% for 1-6 processes (metric definitions are sketched below)
- Performance bottleneck: communication overhead grows significantly beyond 6 processes
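
The speedup and efficiency figures quoted in this document are assumed to follow the standard definitions (the report does not restate them), where T_1 is the serial baseline time and T_p the wall-clock time with p workers in total (MPI processes × OpenMP threads):

```latex
% Assumed metric definitions (standard; not restated in the original report)
% T_1 = serial baseline time, T_p = parallel time, p = total workers
S_p = \frac{T_1}{T_p}                 % speedup
E_p = \frac{S_p}{p} \times 100\%      % parallel efficiency; E_p > 100\% indicates superlinear speedup
```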
### Experiment 2: MPI+OpenMP Configuration
- Optimal configuration: 4×4 (4 MPI processes × 4 OpenMP threads)
- Superlinear speedup: achieved for large matrices (4096×4096), reaching 107% efficiency
- Key insight: balancing MPI-level (inter-process) and OpenMP-level (intra-process) parallelism is crucial; a hybrid launch sketch follows this list
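
The sketch below shows how a 4×4 hybrid run is commonly structured: each MPI process requests `MPI_THREAD_FUNNELED` and runs OpenMP threads inside its local compute kernel. This is a minimal sketch, not the project's actual source; the executable name `matmul_hybrid` and the choice of threading level are assumptions.

```c
/* Minimal hybrid MPI+OpenMP setup sketch (assumed structure, not the
 * project's actual code).  A 4x4 run would be launched with, e.g.:
 *   OMP_NUM_THREADS=4 mpirun -np 4 ./matmul_hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nprocs;

    /* FUNNELED: only the main thread makes MPI calls; the compute loops
     * are parallelized with OpenMP inside each process. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    #pragma omp parallel
    {
        #pragma omp single
        printf("rank %d of %d running %d OpenMP threads\n",
               rank, nprocs, omp_get_num_threads());
    }

    /* ... distribute matrix blocks over ranks, multiply each local block
     *     with an OpenMP-parallel kernel, then gather the results ... */

    MPI_Finalize();
    return 0;
}
```

With this layout the MPI process count controls communication volume while OpenMP uses the cores inside each process, which is the balance the 4×4 result reflects. Compilation would typically use `mpicc -fopenmp`.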
### Experiment 3: Optimization Results
- Performance improvement: 1.1-2.3x speedup over the unoptimized version
- Optimization techniques (see the kernel sketch after this list):
  - Loop tiling (64×64 blocks)
  - Loop unrolling
  - Memory access optimization
- Best result: the 4×4 configuration reaches 107% efficiency for the 4096×4096 matrix
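
A minimal sketch of the kind of tiled, unrolled OpenMP kernel these techniques describe is shown below. The 64×64 tile size matches the report; the function name, loop order, unroll factor of 4, and the assumption that `n` is a multiple of the tile size are illustrative choices, not the project's actual implementation.

```c
#include <stddef.h>

/* Loop-tiled (64x64), partially unrolled matmul kernel sketch.
 * Computes C += A * B for n x n row-major matrices; C is assumed to be
 * zero-initialized and n a multiple of TILE for brevity. */
#define TILE 64

void matmul_tiled(int n, const double *A, const double *B, double *C) {
    /* Each thread owns whole 64x64 blocks of C, so there are no races. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[(size_t)i * n + k];
                        /* i-k-j order keeps B and C accesses contiguous;
                         * the j loop is unrolled by 4 to cut loop overhead. */
                        for (int j = jj; j < jj + TILE; j += 4) {
                            C[(size_t)i * n + j]     += a * B[(size_t)k * n + j];
                            C[(size_t)i * n + j + 1] += a * B[(size_t)k * n + j + 1];
                            C[(size_t)i * n + j + 2] += a * B[(size_t)k * n + j + 2];
                            C[(size_t)i * n + j + 3] += a * B[(size_t)k * n + j + 3];
                        }
                    }
}
```

Tiling keeps each block's working set resident in cache across the `kk` loop, and unrolling reduces branch overhead in the innermost loop.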
## Recommendations

### Configuration Selection
- Small matrices (n < 1024): 2×2 or 4×2 configuration
- Medium matrices (1024 ≤ n ≤ 2048): 4×4 configuration
- Large matrices (n > 2048): 4×4 or 8×2 configuration
### Avoid
- 1×N configurations (too few MPI processes)
- N×1 configurations (too few OpenMP threads)
- Excessive total parallelism (MPI processes × OpenMP threads > 48)
## Running the Analysis

```bash
cd /home/yly/dev/hpc-lab-code/work
python3 analyze_mpi_openmp.py
```
### Requirements
- Python 3.x
- pandas
- matplotlib
- numpy
## Notes
- All figures have been regenerated with English labels
- Font: DejaVu Sans (covers all characters used in the figure labels)
- Resolution: 300 DPI for publication quality