# MPI+OpenMP Hybrid Parallel Matrix Multiplication Experiments

## Overview

This document summarizes the experimental analysis of MPI+OpenMP hybrid parallel matrix multiplication performance.
## Generated Files

### Analysis Scripts

- `analyze_mpi_openmp.py` - Python script for data analysis and visualization
### Figures (All labels in English)

1. **experiment1_analysis.png** - Experiment 1: Varying MPI Processes (OpenMP threads=1)
   - Execution Time vs MPI Processes
   - Speedup vs MPI Processes
   - Parallel Efficiency vs MPI Processes
   - Parallel Efficiency Heatmap

2. **experiment2_analysis.png** - Experiment 2: Varying Both MPI and OpenMP
   - Efficiency Comparison (Total Processes=16)
   - Best Configuration Efficiency vs Matrix Size
   - MPI Process Impact on Efficiency
   - Speedup Comparison for Different Configurations

3. **experiment3_analysis.png** - Experiment 3: Optimization Results
   - Execution Time Comparison (Before/After)
   - Efficiency Comparison (Before/After)
   - Optimization Effect for Different Matrix Sizes
   - Best Configuration Efficiency Comparison
### Data Files

- `experiment_results.csv` - Complete experimental data
- `serial_results.csv` - Serial baseline performance

### Reports (in Chinese)

- `MPI_OpenMP实验分析报告.md` - Detailed analysis report
- `实验总结.md` - Summary of key findings
## Key Findings

### Experiment 1: MPI Process Scaling

- **Optimal configuration**: 6 MPI processes
- **Efficiency**: 75%-89% for 1-6 processes
- **Performance bottleneck**: Communication overhead increases significantly beyond 6 processes (see the sketch after this list)
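The sketch below illustrates the kind of row-block decomposition whose communication costs drive this bottleneck: distributing the rows of A, replicating B, and collecting the rows of C are the steps whose cost grows as more processes are added. It is a minimal illustration under simplifying assumptions (rank 0 holds the full matrices, N is divisible by the process count), not the project's actual implementation.

```c
/* Hypothetical row-block MPI decomposition for C = A * B (N x N, row-major).
 * Illustrative only: assumes rank 0 owns the full A and C, N % nprocs == 0,
 * and B is allocated as an N*N buffer on every rank. */
#include <mpi.h>
#include <stdlib.h>

void matmul_rowblock(const double *A, double *B, double *C, int N)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = N / nprocs;                          /* rows owned by this rank */
    double *A_loc = malloc((size_t)rows * N * sizeof(double));
    double *C_loc = calloc((size_t)rows * N, sizeof(double));

    /* Communication: scatter row blocks of A, replicate B on every rank. */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, A_loc, rows * N, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Local computation: each rank multiplies its row block by B. */
    for (int i = 0; i < rows; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C_loc[i * N + j] += A_loc[i * N + k] * B[k * N + j];

    /* Communication: collect the row blocks of C back on rank 0. */
    MPI_Gather(C_loc, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    free(A_loc);
    free(C_loc);
}
```

In this sketch the function would be called between `MPI_Init` and `MPI_Finalize`; `A` and `C` need only be valid on rank 0, while `B` must be allocated everywhere so the broadcast can fill it.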
### Experiment 2: MPI+OpenMP Configuration

- **Optimal configuration**: 4×4 (4 MPI processes × 4 OpenMP threads)
- **Superlinear speedup**: Achieved for large matrices (4096×4096) with 107% efficiency
- **Key insight**: Balancing between-node (MPI) and within-node (OpenMP) parallelism is crucial (a setup sketch follows this list)
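A P×T configuration means P MPI ranks, each running T OpenMP threads; for example, the 4×4 case would typically be launched with something like `OMP_NUM_THREADS=4 mpirun -np 4 ./hybrid_matmul` (the executable name here is illustrative). The sketch below shows the corresponding hybrid initialization as an assumed general pattern, not the project's code.

```c
/* Sketch of a hybrid MPI+OpenMP setup (illustrative, not the project code).
 * Example launch for the assumed 4x4 configuration:
 *   OMP_NUM_THREADS=4 mpirun -np 4 ./hybrid_matmul
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* MPI_THREAD_FUNNELED: only the main thread makes MPI calls;
     * the OpenMP threads do pure computation inside each rank. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int nthreads = omp_get_max_threads();   /* set via OMP_NUM_THREADS */
    if (rank == 0)
        printf("Configuration: %d MPI processes x %d OpenMP threads\n",
               nprocs, nthreads);

    /* ... distribute work across ranks, then inside each rank use
     *     #pragma omp parallel for
     * around the local compute loop ... */

    MPI_Finalize();
    return 0;
}
```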
### Experiment 3: Optimization Results

- **Performance improvement**: 1.1x-2.3x speedup over the pre-optimization version
- **Optimization techniques** (illustrated in the sketch after this list):
  - Loop tiling (64×64 blocks)
  - Loop unrolling
  - Memory access optimization
- **Best result**: The 4×4 configuration achieves 107% efficiency for the 4096×4096 matrix
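The listing below is a minimal sketch of how these three techniques can combine in the local multiplication kernel: the loops are blocked into 64×64 tiles, the innermost loop is unrolled by four, and the i-k-j loop order keeps accesses to `B` and `C` unit-stride for row-major storage. It assumes `C` is zero-initialized and that N is a multiple of the tile size; the project's actual kernel may differ.

```c
/* Sketch of a tiled, unrolled, OpenMP-parallel kernel for C += A * B
 * (row-major, N x N). Illustrative only; assumes N is a multiple of 64
 * so no remainder loops are needed, and that C starts zeroed. */
#define TILE 64

void matmul_tiled(const double *A, const double *B, double *C, int N)
{
    /* Parallelize over row tiles; each thread owns distinct rows of C. */
    #pragma omp parallel for schedule(static)
    for (int ii = 0; ii < N; ii += TILE) {
        for (int kk = 0; kk < N; kk += TILE) {
            for (int jj = 0; jj < N; jj += TILE) {
                /* 64x64 tiles of A, B, and C stay resident in cache. */
                for (int i = ii; i < ii + TILE; i++) {
                    for (int k = kk; k < kk + TILE; k++) {
                        const double a = A[i * N + k];
                        /* Innermost loop unrolled by 4; B and C are
                         * accessed with unit stride (row-major). */
                        for (int j = jj; j < jj + TILE; j += 4) {
                            C[i * N + j]     += a * B[k * N + j];
                            C[i * N + j + 1] += a * B[k * N + j + 1];
                            C[i * N + j + 2] += a * B[k * N + j + 2];
                            C[i * N + j + 3] += a * B[k * N + j + 3];
                        }
                    }
                }
            }
        }
    }
}
```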
## Recommendations

### Configuration Selection

- **Small matrices (N < 1024)**: 2×2 or 4×2 configuration
- **Medium matrices (N = 1024-2048)**: 4×4 configuration
- **Large matrices (N > 2048)**: 4×4 or 8×2 configuration
### Avoid

- 1×N configurations (only one MPI process, so no process-level parallelism)
- N×1 configurations (only one OpenMP thread per process, so no thread-level parallelism)
- Excessive total parallelism (MPI processes × OpenMP threads > 48)
## Running the Analysis

```bash
cd /home/yly/dev/hpc-lab-code/work
python3 analyze_mpi_openmp.py
```
## Requirements

- Python 3.x
- pandas
- matplotlib
- numpy
## Notes

- All figures have been regenerated with English labels
- Font: DejaVu Sans (covers all characters used in the labels)
- Resolution: 300 DPI for publication quality