# MPI+OpenMP Hybrid Parallel Matrix Multiplication Experiments

## Overview

This document summarizes the experimental analysis of MPI+OpenMP hybrid parallel matrix multiplication performance.

## Generated Files

### Analysis Scripts

- `analyze_mpi_openmp.py` - Python script for data analysis and visualization

### Figures (All labels in English)

1. **experiment1_analysis.png** - Experiment 1: Varying MPI Processes (OpenMP threads=1)
   - Execution Time vs MPI Processes
   - Speedup vs MPI Processes
   - Parallel Efficiency vs MPI Processes
   - Parallel Efficiency Heatmap

2. **experiment2_analysis.png** - Experiment 2: Varying Both MPI and OpenMP
   - Efficiency Comparison (Total Processes=16)
   - Best Configuration Efficiency vs Matrix Size
   - MPI Process Impact on Efficiency
   - Speedup Comparison for Different Configurations

3. **experiment3_analysis.png** - Experiment 3: Optimization Results
   - Execution Time Comparison (Before/After)
   - Efficiency Comparison (Before/After)
   - Optimization Effect for Different Matrix Sizes
   - Best Configuration Efficiency Comparison

### Data Files

- `experiment_results.csv` - Complete experimental data
- `serial_results.csv` - Serial baseline performance

### Reports (in Chinese)

- `MPI_OpenMP实验分析报告.md` - Detailed analysis report
- `实验总结.md` - Summary of key findings

## Key Findings

### Experiment 1: MPI Process Scaling

- **Optimal configuration**: 6 MPI processes
- **Efficiency**: 75%-89% for 1-6 processes
- **Performance bottleneck**: Communication overhead increases significantly beyond 6 processes

### Experiment 2: MPI+OpenMP Configuration

- **Optimal configuration**: 4×4 (4 MPI processes × 4 OpenMP threads)
- **Superlinear speedup**: Achieved for large matrices (4096×4096) with 107% efficiency
- **Key insight**: Balancing node-level (MPI) and intra-node (OpenMP) parallelism is crucial (see the hybrid sketch in the appendix at the end of this document)

### Experiment 3: Optimization Results

- **Performance improvement**: 1.1-2.3x speedup
- **Optimization techniques** (sketched in the appendix at the end of this document):
  - Loop tiling (64×64 blocks)
  - Loop unrolling
  - Memory access optimization
- **Best result**: 4×4 configuration achieves 107% efficiency for the 4096×4096 matrix

## Recommendations

### Configuration Selection

- **Small matrices (<1024)**: 2×2 or 4×2 configuration
- **Medium matrices (1024-2048)**: 4×4 configuration
- **Large matrices (>2048)**: 4×4 or 8×2 configuration

### Avoid

- 1×N configurations (too few MPI processes)
- N×1 configurations (too few OpenMP threads)
- Excessive total process counts (>48)

## Running the Analysis

```bash
cd /home/yly/dev/hpc-lab-code/work
python3 analyze_mpi_openmp.py
```

## Requirements

- Python 3.x
- pandas
- matplotlib
- numpy

## Notes

- All figures have been regenerated with English labels
- Font: DejaVu Sans (supports all characters)
- Resolution: 300 DPI for publication quality
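
## Appendix: Illustrative Code Sketches

The two sketches below are illustrative only; they are not the experiment code from this repository. The first shows one common way to structure the hybrid decomposition discussed in Experiment 2: each MPI rank owns a contiguous block of rows of the result matrix, and OpenMP threads share the loop over those rows inside the rank. The matrix size `N`, the binary name `hybrid_matmul`, and the row-block decomposition are assumptions made for illustration.

```c
/*
 * Minimal sketch of a hybrid MPI+OpenMP matrix multiply (assumed structure,
 * not the experiment code): each rank owns a block of rows of C, and OpenMP
 * threads share the loop over those rows.
 *
 * Build/run (names are illustrative):
 *   mpicc -fopenmp -O3 hybrid_matmul.c -o hybrid_matmul
 *   OMP_NUM_THREADS=4 mpirun -np 4 ./hybrid_matmul
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024  /* matrix size, assumed for illustration */

int main(int argc, char **argv) {
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Row-block decomposition: this rank computes rows [row0, row1) of C. */
    int rows = N / nprocs;          /* assumes N is divisible by nprocs */
    int row0 = rank * rows, row1 = row0 + rows;

    double *A = malloc((size_t)N * N * sizeof(double));   /* full A/B kept locally for simplicity */
    double *B = malloc((size_t)N * N * sizeof(double));
    double *C = malloc((size_t)rows * N * sizeof(double));
    for (size_t i = 0; i < (size_t)N * N; i++) { A[i] = 1.0; B[i] = 1.0; }

    double t0 = MPI_Wtime();

    /* MPI parallelizes across row blocks; OpenMP parallelizes within a block. */
    #pragma omp parallel for schedule(static)
    for (int i = row0; i < row1; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[(size_t)i * N + k] * B[(size_t)k * N + j];
            C[(size_t)(i - row0) * N + j] = sum;
        }
    }

    double local = MPI_Wtime() - t0, elapsed;
    MPI_Reduce(&local, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d ranks x %d threads: %.3f s\n", nprocs, omp_get_max_threads(), elapsed);

    free(A); free(B); free(C);
    MPI_Finalize();
    return 0;
}
```

With this shape, the 4×4 configuration recommended above corresponds to `mpirun -np 4` with `OMP_NUM_THREADS=4`.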
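
Experiment 3 credits its 1.1-2.3x improvement to loop tiling (64×64 blocks), loop unrolling, and memory access optimization. The kernel below sketches the first two techniques in generic form; it is not the project's optimized routine. The `TILE` value follows the 64×64 figure quoted above, while the i-k-j loop order and the unroll factor of 4 are assumptions, and `n` is assumed to be a multiple of both.

```c
/* Generic sketch of loop tiling + unrolling (not the project's kernel).
 * A, B, C are n x n row-major matrices; C must be zero-initialized.
 * TILE matches the 64x64 blocks quoted above; n is assumed to be a
 * multiple of TILE, and TILE a multiple of the unroll factor 4. */
#include <stddef.h>

#define TILE 64

static void matmul_tiled(const double *A, const double *B, double *C, int n) {
    for (int ii = 0; ii < n; ii += TILE)            /* tile over rows of C   */
        for (int kk = 0; kk < n; kk += TILE)        /* tile over the k range */
            for (int jj = 0; jj < n; jj += TILE)    /* tile over cols of C   */
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[(size_t)i * n + k];
                        /* Inner loop unrolled by 4 to expose instruction-level
                         * parallelism; B and C are walked contiguously. */
                        for (int j = jj; j < jj + TILE; j += 4) {
                            C[(size_t)i * n + j]     += a * B[(size_t)k * n + j];
                            C[(size_t)i * n + j + 1] += a * B[(size_t)k * n + j + 1];
                            C[(size_t)i * n + j + 2] += a * B[(size_t)k * n + j + 2];
                            C[(size_t)i * n + j + 3] += a * B[(size_t)k * n + j + 3];
                        }
                    }
}
```

Keeping one 64×64 tile of A, B, and C resident in cache during the inner loops is what reduces memory traffic relative to the naive triple loop in the first sketch.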