# MPI+OpenMP Hybrid Parallel Matrix Multiplication Experiments

## Overview

This document summarizes the experimental analysis of MPI+OpenMP hybrid parallel matrix multiplication performance.
## Generated Files

### Analysis Scripts

- `analyze_mpi_openmp.py` - Python script for data analysis and visualization
### Figures (All labels in English)

1. **experiment1_analysis.png** - Experiment 1: Varying MPI Processes (OpenMP threads=1)
   - Execution Time vs MPI Processes
   - Speedup vs MPI Processes
   - Parallel Efficiency vs MPI Processes
   - Parallel Efficiency Heatmap

2. **experiment2_analysis.png** - Experiment 2: Varying Both MPI and OpenMP
   - Efficiency Comparison (Total Processes=16)
   - Best Configuration Efficiency vs Matrix Size
   - MPI Process Impact on Efficiency
   - Speedup Comparison for Different Configurations

3. **experiment3_analysis.png** - Experiment 3: Optimization Results
   - Execution Time Comparison (Before/After)
   - Efficiency Comparison (Before/After)
   - Optimization Effect for Different Matrix Sizes
   - Best Configuration Efficiency Comparison
### Data Files

- `experiment_results.csv` - Complete experimental data
- `serial_results.csv` - Serial baseline performance

### Reports (in Chinese)

- `MPI_OpenMP实验分析报告.md` - Detailed analysis report
- `实验总结.md` - Summary of key findings
## Key Findings

### Experiment 1: MPI Process Scaling

- **Optimal configuration**: 6 MPI processes
- **Efficiency**: 75%-89% for 1-6 processes
- **Performance bottleneck**: Communication overhead increases significantly beyond 6 processes (see the sketch after this list)
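The sketch below illustrates the kind of row-block decomposition whose communication costs drive this bottleneck: distributing the rows of A, replicating B, and collecting the rows of C are the steps whose cost grows as more processes are added. It is a minimal illustration under simplifying assumptions (rank 0 holds the full matrices, N is divisible by the process count), not the project's actual implementation.

```c
/* Hypothetical row-block MPI decomposition for C = A * B (N x N, row-major).
 * Illustrative only: assumes rank 0 owns the full A and C, N % nprocs == 0,
 * and B is allocated as an N*N buffer on every rank. */
#include <mpi.h>
#include <stdlib.h>

void matmul_rowblock(const double *A, double *B, double *C, int N)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = N / nprocs;                          /* rows owned by this rank */
    double *A_loc = malloc((size_t)rows * N * sizeof(double));
    double *C_loc = calloc((size_t)rows * N, sizeof(double));

    /* Communication: scatter row blocks of A, replicate B on every rank. */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, A_loc, rows * N, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Local computation: each rank multiplies its row block by B. */
    for (int i = 0; i < rows; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C_loc[i * N + j] += A_loc[i * N + k] * B[k * N + j];

    /* Communication: collect the row blocks of C back on rank 0. */
    MPI_Gather(C_loc, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    free(A_loc);
    free(C_loc);
}
```

In this sketch the function would be called between `MPI_Init` and `MPI_Finalize`; `A` and `C` need only be valid on rank 0, while `B` must be allocated everywhere so the broadcast can fill it.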
### Experiment 2: MPI+OpenMP Configuration

- **Optimal configuration**: 4×4 (4 MPI processes × 4 OpenMP threads)
- **Superlinear speedup**: Achieved for large matrices (4096×4096) with 107% efficiency
- **Key insight**: Balancing between-node (MPI) and within-node (OpenMP) parallelism is crucial (a setup sketch follows this list)
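A P×T configuration means P MPI ranks, each running T OpenMP threads; for example, the 4×4 case would typically be launched with something like `OMP_NUM_THREADS=4 mpirun -np 4 ./hybrid_matmul` (the executable name here is illustrative). The sketch below shows the corresponding hybrid initialization as an assumed general pattern, not the project's code.

```c
/* Sketch of a hybrid MPI+OpenMP setup (illustrative, not the project code).
 * Example launch for the assumed 4x4 configuration:
 *   OMP_NUM_THREADS=4 mpirun -np 4 ./hybrid_matmul
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* MPI_THREAD_FUNNELED: only the main thread makes MPI calls;
     * the OpenMP threads do pure computation inside each rank. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int nthreads = omp_get_max_threads();   /* set via OMP_NUM_THREADS */
    if (rank == 0)
        printf("Configuration: %d MPI processes x %d OpenMP threads\n",
               nprocs, nthreads);

    /* ... distribute work across ranks, then inside each rank use
     *     #pragma omp parallel for
     * around the local compute loop ... */

    MPI_Finalize();
    return 0;
}
```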
### Experiment 3: Optimization Results

- **Performance improvement**: 1.1x-2.3x speedup over the pre-optimization version
- **Optimization techniques** (illustrated in the sketch after this list):
  - Loop tiling (64×64 blocks)
  - Loop unrolling
  - Memory access optimization
- **Best result**: The 4×4 configuration achieves 107% efficiency for the 4096×4096 matrix
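The listing below is a minimal sketch of how these three techniques can combine in the local multiplication kernel: the loops are blocked into 64×64 tiles, the innermost loop is unrolled by four, and the i-k-j loop order keeps accesses to `B` and `C` unit-stride for row-major storage. It assumes `C` is zero-initialized and that N is a multiple of the tile size; the project's actual kernel may differ.

```c
/* Sketch of a tiled, unrolled, OpenMP-parallel kernel for C += A * B
 * (row-major, N x N). Illustrative only; assumes N is a multiple of 64
 * so no remainder loops are needed, and that C starts zeroed. */
#define TILE 64

void matmul_tiled(const double *A, const double *B, double *C, int N)
{
    /* Parallelize over row tiles; each thread owns distinct rows of C. */
    #pragma omp parallel for schedule(static)
    for (int ii = 0; ii < N; ii += TILE) {
        for (int kk = 0; kk < N; kk += TILE) {
            for (int jj = 0; jj < N; jj += TILE) {
                /* 64x64 tiles of A, B, and C stay resident in cache. */
                for (int i = ii; i < ii + TILE; i++) {
                    for (int k = kk; k < kk + TILE; k++) {
                        const double a = A[i * N + k];
                        /* Innermost loop unrolled by 4; B and C are
                         * accessed with unit stride (row-major). */
                        for (int j = jj; j < jj + TILE; j += 4) {
                            C[i * N + j]     += a * B[k * N + j];
                            C[i * N + j + 1] += a * B[k * N + j + 1];
                            C[i * N + j + 2] += a * B[k * N + j + 2];
                            C[i * N + j + 3] += a * B[k * N + j + 3];
                        }
                    }
                }
            }
        }
    }
}
```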
## Recommendations

### Configuration Selection

- **Small matrices (N < 1024)**: 2×2 or 4×2 configuration
- **Medium matrices (N = 1024-2048)**: 4×4 configuration
- **Large matrices (N > 2048)**: 4×4 or 8×2 configuration
### Avoid

- 1×N configurations (only one MPI process, so no process-level parallelism)
- N×1 configurations (only one OpenMP thread per process, so no thread-level parallelism)
- Excessive total parallelism (MPI processes × OpenMP threads > 48)
## Running the Analysis

```bash
cd /home/yly/dev/hpc-lab-code/work
python3 analyze_mpi_openmp.py
```
## Requirements

- Python 3.x
- pandas
- matplotlib
- numpy
## Notes

- All figures have been regenerated with English labels
- Font: DejaVu Sans (covers all characters used in the labels)
- Resolution: 300 DPI for publication quality