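1 Problem scenario
When measuring the program's parallel efficiency, the Release build turned out to scale poorly while the Debug build scaled as expected. The suspicion is that the program is limited by memory bandwidth and therefore cannot reach the performance it should. This plausible guess still needs to be verified with real data.
Picture memory bandwidth as a water pipe: data is the water, and the CPU waits for water to arrive before it can work. The Debug build draws water slowly and the Release build draws it faster. With multiple processes, the Debug build's total throughput roughly doubles each time the process count doubles (it stays far from the pipe's flow limit). Once the Release build reaches the flow limit, every additional process lowers the flow each one gets: in a parallel run, multiple processes/cores access memory at the same time and compete for the same bandwidth, so each individual process slows down.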
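The water-pipe picture can be written as a tiny saturation model. The sketch below is only an illustration: the bandwidth cap and the two per-process demands are hypothetical numbers chosen to show the shape of the curves, not measurements from this machine.

```python
def per_process_bandwidth(demand_gbs: float, n_procs: int, cap_gbs: float) -> float:
    """Bandwidth each process gets when n_procs share one memory system.

    Each process asks for demand_gbs; the pipe delivers at most cap_gbs in total,
    shared evenly among the processes.
    """
    return min(demand_gbs, cap_gbs / n_procs)


# Hypothetical values, for illustration only.
CAP_GBS = 20.0        # assumed total bandwidth limit
DEBUG_DEMAND = 1.0    # assumed per-process demand of a slow (Debug-like) build
RELEASE_DEMAND = 5.0  # assumed per-process demand of a fast (Release-like) build

for n in (1, 2, 4, 8):
    debug = per_process_bandwidth(DEBUG_DEMAND, n, CAP_GBS)
    release = per_process_bandwidth(RELEASE_DEMAND, n, CAP_GBS)
    print(f"{n} procs: debug {debug:.2f} GB/s each, release {release:.2f} GB/s each")
```

With these assumed numbers the slow build keeps its full per-process rate up to 8 processes, while the fast build is throttled once the summed demand exceeds the cap.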
2 Background
Arithmetic Intensity (AI) = number of floating-point operations (FLOPs) / bytes of memory accessed
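For example, sparse matrix-vector multiplication (SpMV) is the core operation of iterative methods. Assuming CSR storage, each nonzero requires:
- reading the value (8 bytes, double)
- reading the column index (4 or 8 bytes)
- reading the corresponding vector element (8 bytes)
- only one multiplication and one addition of computation (2 FLOPs)
Result: only 2 FLOPs for every 20 to 24 bytes of data → AI ≈ 0.1 FLOPs/Byte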
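The per-nonzero accounting above can be checked with a few lines of Python. This is a minimal sketch of the same estimate; like the text, it ignores the CSR row pointers and the writes to the result vector, and it assumes no cache reuse of the input vector. The function name is made up for this illustration.

```python
def csr_spmv_intensity(value_bytes: int = 8,
                       index_bytes: int = 4,
                       vector_bytes: int = 8,
                       flops_per_nnz: int = 2) -> float:
    """Estimated arithmetic intensity (FLOPs/Byte) of CSR SpMV per nonzero."""
    bytes_per_nnz = value_bytes + index_bytes + vector_bytes
    return flops_per_nnz / bytes_per_nnz


ai_int32 = csr_spmv_intensity(index_bytes=4)  # 2 FLOPs / 20 bytes
ai_int64 = csr_spmv_intensity(index_bytes=8)  # 2 FLOPs / 24 bytes
print(f"4-byte indices: {ai_int32:.3f} FLOPs/Byte")
print(f"8-byte indices: {ai_int64:.3f} FLOPs/Byte")
```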
Max Memory Bandwidth: the maximum rate at which the CPU can read data from memory or store data to memory (in GB/s).
For example, the Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz has an official maximum memory bandwidth of 76.8 GB/s.
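FLOPS (floating-point operations per second) is the number of floating-point operations executed per second.
It can be estimated as FLOPS = number of cores × per-core clock frequency × floating-point operations per cycle per core. The first two parameters are listed in the CPU specifications; the third follows from the Instruction Set Extensions: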
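| Name | Instruction set | 64-bit operations per clock cycle |
|---|---|---|
| Nehalem | SSE (128-bit) | 2 |
| Sandy Bridge | AVX (256-bit) | 4 |
| Haswell | AVX2 (256-bit) | 4 |
| Purley | AVX-512 (512-bit) | 16 (2 FMA units), 8 (1 FMA unit) |

For example, the Intel® Xeon® Processor E5-2650 v4 has a Processor Base Frequency of 2.20 GHz, 12 Total Cores, and Intel® AVX2 under Instruction Set Extensions, so FLOPS = 2.2 × 12 × 8 = 211.2 GFLOPS. The factor 8 here corresponds to the 4 lanes from the table counted twice, since a fused multiply-add performs one multiplication and one addition. Counting both FMA units of each core would double the estimate to 422.4 GFLOPS.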
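The estimate is a plain product, so it is easy to script for other CPUs. The sketch below reproduces the E5-2650 v4 example; the function name is made up for this illustration, and the per-cycle factor is the assumption that most deserves checking for a given microarchitecture.

```python
def peak_gflops(cores: int, base_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFLOPS: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * base_ghz * flops_per_cycle


# Values from the example: 12 cores, 2.2 GHz base clock, 8 FLOPs per cycle.
e5_2650_v4 = peak_gflops(cores=12, base_ghz=2.2, flops_per_cycle=8)
print(f"E5-2650 v4 estimated peak: {e5_2650_v4:.1f} GFLOPS")
```

The base frequency is used here, as in the text; sustained clocks under heavy vector load can differ, so the result is an upper-bound estimate rather than a measured value.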
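Roofline Model: for a kernel with A floating-point operations and B bytes of memory traffic, running on a platform with peak compute C and memory bandwidth D, what is the theoretical performance ceiling E?
FLOPS ≤ min(peak FLOPS, memory bandwidth × arithmetic intensity)
- Compute-bound region: when arithmetic intensity is high, performance is limited by peak FLOPS (the horizontal line).
- Memory-bound region: when arithmetic intensity is low, bandwidth (the slope) sets the FLOPS ceiling, and FLOPS ≈ actual memory bandwidth × arithmetic intensity.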
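The roofline bound can be evaluated directly. The sketch below plugs in the nominal figures quoted earlier for the E5-2650 v4 (211.2 GFLOPS, 76.8 GB/s); these are datasheet-style numbers, so the measured ceiling will be lower, and the function name is only illustrative.

```python
def roofline_gflops(ai: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Attainable GFLOPS for a kernel with arithmetic intensity ai (FLOPs/Byte)."""
    return min(peak_gflops, bandwidth_gbs * ai)


PEAK_GFLOPS = 211.2   # nominal peak from the estimate in the text
BANDWIDTH_GBS = 76.8  # nominal maximum memory bandwidth from the text

# Arithmetic intensity at which the two limits meet (the "ridge point").
ridge_ai = PEAK_GFLOPS / BANDWIDTH_GBS

for ai in (0.1, 1.0, ridge_ai, 10.0):
    bound = roofline_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
    region = "memory-bound" if ai < ridge_ai else "compute-bound"
    print(f"AI = {ai:5.2f} FLOPs/Byte -> at most {bound:7.2f} GFLOPS ({region})")
```

For an SpMV-like kernel with AI ≈ 0.1 the bound is about 7.68 GFLOPS, far below the compute peak, which is why such a kernel is expected to be memory-bound on this platform.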
3 Verification tests
3.1 Memory bandwidth ceiling test (cs-roofline-toolkit)
git clone https://bitbucket.org/berkeleylab/cs-roofline-toolkit.git
- Run:
cd cs-roofline-toolkit/Empirical_Roofline_Tool-1.1.0
cp Config/config.madonna.lbl.gov.01 Config/myPCtest
- Edit Config/myPCtest, mainly to change the compiler from Intel to GNU:
ERT_RESULTS Results.myPCtest
ERT_DRIVER driver1
ERT_KERNEL kernel1
ERT_MPI True
ERT_MPI_CFLAGS
ERT_MPI_LDFLAGS
ERT_OPENMP True
ERT_OPENMP_CFLAGS -fopenmp
ERT_OPENMP_LDFLAGS -fopenmp
ERT_FLOPS 1,2,4,8,16
ERT_ALIGN 32
ERT_CC mpic++
ERT_CFLAGS -O3 -march=native -msse3
ERT_LD mpic++
ERT_LDFLAGS
ERT_LDLIBS
ERT_PRECISION FP64
ERT_RUN export OMP_NUM_THREADS=ERT_OPENMP_THREADS; mpirun -np ERT_MPI_PROCS
ERT_CODE
ERT_PROCS_THREADS 1-8
ERT_MPI_PROCS 1,2,4,8
ERT_OPENMP_THREADS 1,2,4,8
ERT_NUM_EXPERIMENTS 3
ERT_MEMORY_MAX 1073741824
ERT_WORKING_SET_MIN 1
ERT_TRIALS_MIN 1
ERT_GNUPLOT gnuplot
When plotting with a newer version of gnuplot, one line must be changed in every file in the Plot folder:
# original command
set clabel '%8.3g'
# replacement
set cntrlabel format '%8.3g'
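Editing every file by hand is tedious, so the substitution can be scripted. The following is a minimal sketch, assuming the files in the Plot folder are plain-text gnuplot scripts and that the folder is reachable by the path you pass in; `patch_line` and `patch_plot_dir` are names made up for this example.

```python
import re
from pathlib import Path

# Matches "set clabel <args>" and captures the leading indent and the arguments.
CLABEL = re.compile(r"^(\s*)set\s+clabel\s+(.+?)\s*$")


def patch_line(line: str) -> str:
    """Rewrite the deprecated 'set clabel ...' into 'set cntrlabel format ...'."""
    match = CLABEL.match(line)
    if match is None:
        return line
    return f"{match.group(1)}set cntrlabel format {match.group(2)}"


def patch_plot_dir(plot_dir: str) -> int:
    """Patch every regular file directly inside plot_dir; return how many changed."""
    changed = 0
    for path in Path(plot_dir).iterdir():
        if not path.is_file():
            continue
        text = path.read_text()
        patched = "\n".join(patch_line(line) for line in text.split("\n"))
        if patched != text:
            path.write_text(patched)
            changed += 1
    return changed
```

Calling `patch_plot_dir("Plot")` from inside the tool's directory would apply the change in place, so keeping a copy of the folder first is a sensible precaution.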
Run the test:
./ert Config/myPCtest
Then view the file Results.myPCtest/Run.001/roofline.ps
3.2 Actual memory bandwidth test (likwid)
Install likwid from source:
git clone https://github.com/RRZE-HPC/likwid.git
cd likwid
vi config.mk
make -j8
sudo make install
Load the msr kernel module:
sudo modprobe msr
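Run the program:
- Serial:
likwid-perfctr -g MEM_DP ./xxx |tee log.perf.np1
- Parallel (here $1 is the number of processes, passed as the first argument of the wrapping script):
likwid-mpirun -np $1 -mpi openmpi -g MEM_DP ./xxx |tee log.perf.np$1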
Sample results:
# Debug build
## 1 process
+----------------------------------------+--------------+--------------+-----------------+--------------+
| Metric | Sum | Min | Max | Avg |
+----------------------------------------+--------------+--------------+-----------------+--------------+
| Runtime (RDTSC) [s] STAT | 1479.7008 | 30.8271 | 30.8271 | 30.8271 |
| Runtime unhalted [s] STAT | 31.1694 | 2.426207e-05 | 27.6407 | 0.6494 |
| Clock [MHz] STAT | 72421.4641 | 1199.9661 | 2497.0138 | 1508.7805 |
| CPI STAT | 66.3224 | 0.4918 | 5.3376 | 1.3817 |
| Energy [J] STAT | 2597.8198 | 0 | 1348.1428 | 54.1212 |
| Power [W] STAT | 84.2708 | 0 | 43.7325 | 1.7556 |
| Energy DRAM [J] STAT | 183.7499 | 0 | 107.3479 | 3.8281 |
| Power DRAM [W] STAT | 5.9607 | 0 | 3.4823 | 0.1242 |
| MFLOP/s STAT | 95.8335 | 1.427318e-06 | 90.9389 | 1.9965 |
| AVX [MFLOP/s] STAT | 0 | 0 | 0 | 0 |
| Packed [MUOPS/s] STAT | 6.487808e-08 | 0 | 6.487808e-08 | 1.351627e-09 |
| Scalar [MUOPS/s] STAT | 95.8335 | 1.427318e-06 | 90.9389 | 1.9965 |
| Memory read bandwidth [MBytes/s] STAT | 820.1712 | 0 | 820.1712 | 17.0869 |
| Memory read data volume [GBytes] STAT | 2.361183e+12 | 0 | 2361183000000 | 4.919131e+10 |
| Memory write bandwidth [MBytes/s] STAT | 231.5628 | 0 | 231.5628 | 4.8242 |
| Memory write data volume [GBytes] STAT | 2.361183e+12 | 0 | 2361183000000 | 4.919131e+10 |
| Memory bandwidth [MBytes/s] STAT | 1.531890e+14 | 0 | 153189000000000 | 3.191438e+12 |
| Memory data volume [GBytes] STAT | 4.722366e+12 | 0 | 4722366000000 | 9.838263e+10 |
| Operational intensity [FLOP/Byte] STAT | 0.0912 | 1.000000e-20 | 0.0865 | 0.0019 |
| Vectorization ratio [%] STAT | 0.1795 | 0 | 0.1795 | 0.0037 |
+----------------------------------------+--------------+--------------+-----------------+--------------+
## 2 processes
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| Metric | Sum | Min | Max | Avg | %ile 25 | %ile 50 | %ile 75 |
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| Runtime (RDTSC) [s] STAT | 35.5453 | 17.7625 | 17.7828 | 17.7726 | 17.7625 | 17.7625 | 17.7828 |
| Runtime unhalted [s] STAT | 37.1324 | 18.5648 | 18.5676 | 18.5662 | 18.5648 | 18.5648 | 18.5676 |
| Clock [MHz] STAT | 4994.1269 | 2497.0395 | 2497.0874 | 2497.0634 | 2497.0395 | 2497.0395 | 2497.0874 |
| CPI STAT | 0.9934 | 0.4885 | 0.5049 | 0.4967 | 0.4885 | 0.4885 | 0.5049 |
| Energy [J] STAT | 922.7993 | 0 | 922.7993 | 461.3997 | 0 | 0 | 922.7993 |
| Power [W] STAT | 51.9522 | 0 | 51.9522 | 25.9761 | 0 | 0 | 51.9522 |
| Energy DRAM [J] STAT | 79.1592 | 0 | 79.1592 | 39.5796 | 0 | 0 | 79.1592 |
| Power DRAM [W] STAT | 4.4565 | 0 | 4.4565 | 2.2283 | 0 | 0 | 4.4565 |
| MFLOP/s STAT | 190.9296 | 94.4443 | 96.4853 | 95.4648 | 94.4443 | 94.4443 | 96.4853 |
| AVX [MFLOP/s] STAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Packed [MUOPS/s] STAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Scalar [MUOPS/s] STAT | 190.9296 | 94.4443 | 96.4853 | 95.4648 | 94.4443 | 94.4443 | 96.4853 |
| Memory read bandwidth [MBytes/s] STAT | 1562.0553 | 0 | 1562.0553 | 781.0276 | 0 | 0 | 1562.0553 |
| Memory read data volume [GBytes] STAT | 27.7459 | 0 | 27.7459 | 13.8729 | 0 | 0 | 27.7459 |
| Memory write bandwidth [MBytes/s] STAT | 413.7612 | 0 | 413.7612 | 206.8806 | 0 | 0 | 413.7612 |
| Memory write data volume [GBytes] STAT | 7.3494 | 0 | 7.3494 | 3.6747 | 0 | 0 | 7.3494 |
| Memory bandwidth [MBytes/s] STAT | 1975.8165 | 0 | 1975.8165 | 987.9082 | 0 | 0 | 1975.8165 |
| Memory data volume [GBytes] STAT | 35.0954 | 0 | 35.0954 | 17.5477 | 0 | 0 | 35.0954 |
| Operational intensity [FLOP/Byte] STAT | 0.0488 | 0 | 0.0488 | 0.0244 | 0 | 0 | 0.0488 |
| Vectorization ratio [%] STAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
## 4 processes
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| Metric | Sum | Min | Max | Avg | %ile 25 | %ile 50 | %ile 75 |
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| Runtime (RDTSC) [s] STAT | 36.0001 | 8.9877 | 9.0053 | 9.0000 | 8.9877 | 9.0029 | 9.0042 |
| Runtime unhalted [s] STAT | 37.8236 | 9.4377 | 9.4879 | 9.4559 | 9.4377 | 9.4459 | 9.4521 |
| Clock [MHz] STAT | 9988.1020 | 2497.0171 | 2497.0344 | 2497.0255 | 2497.0171 | 2497.0228 | 2497.0277 |
| CPI STAT | 1.9578 | 0.4859 | 0.4919 | 0.4894 | 0.4859 | 0.4893 | 0.4907 |
| Energy [J] STAT | 549.6915 | 0 | 549.6915 | 137.4229 | 0 | 0 | 0 |
| Power [W] STAT | 61.1607 | 0 | 61.1607 | 15.2902 | 0 | 0 | 0 |
| Energy DRAM [J] STAT | 46.4047 | 0 | 46.4047 | 11.6012 | 0 | 0 | 0 |
| Power DRAM [W] STAT | 5.1632 | 0 | 5.1632 | 1.2908 | 0 | 0 | 0 |
| MFLOP/s STAT | 375.9728 | 92.8303 | 96.2008 | 93.9932 | 92.8303 | 93.1243 | 93.8174 |
| AVX [MFLOP/s] STAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Packed [MUOPS/s] STAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Scalar [MUOPS/s] STAT | 375.9728 | 92.8303 | 96.2008 | 93.9932 | 92.8303 | 93.1243 | 93.8174 |
| Memory read bandwidth [MBytes/s] STAT | 3151.1117 | 0 | 3151.1117 | 787.7779 | 0 | 0 | 0 |
| Memory read data volume [GBytes] STAT | 28.3211 | 0 | 28.3211 | 7.0803 | 0 | 0 | 0 |
| Memory write bandwidth [MBytes/s] STAT | 837.0949 | 0 | 837.0949 | 209.2737 | 0 | 0 | 0 |
| Memory write data volume [GBytes] STAT | 7.5235 | 0 | 7.5235 | 1.8809 | 0 | 0 | 0 |
| Memory bandwidth [MBytes/s] STAT | 3988.2066 | 0 | 3988.2066 | 997.0516 | 0 | 0 | 0 |
| Memory data volume [GBytes] STAT | 35.8447 | 0 | 35.8447 | 8.9612 | 0 | 0 | 0 |
| Operational intensity [FLOP/Byte] STAT | 0.0241 | 0 | 0.0241 | 0.0060 | 0 | 0 | 0 |
| Vectorization ratio [%] STAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
## 8 processes
+----------------------------------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+
| Metric | Sum | Min | Max | Avg | %ile 25 | %ile 50 | %ile 75 |
+----------------------------------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+
| Runtime (RDTSC) [s] STAT | 37.2697 | 4.6400 | 4.6783 | 4.6587 | 4.6538 | 4.6568 | 4.6605 |
| Runtime unhalted [s] STAT | 38.8445 | 4.8415 | 4.8793 | 4.8556 | 4.8419 | 4.8510 | 4.8547 |
| Clock [MHz] STAT | 19975.2132 | 2496.5400 | 2497.0420 | 2496.9017 | 2496.8017 | 2496.9713 | 2496.9762 |
| CPI STAT | 3.9348 | 0.4852 | 0.4938 | 0.4919 | 0.4900 | 0.4926 | 0.4937 |
| Energy [J] STAT | 337.8464 | 0 | 337.8464 | 42.2308 | 0 | 0 | 0 |
| Power [W] STAT | 72.8114 | 0 | 72.8114 | 9.1014 | 0 | 0 | 0 |
| Energy DRAM [J] STAT | 29.4452 | 0 | 29.4452 | 3.6806 | 0 | 0 | 0 |
| Power DRAM [W] STAT | 6.3459 | 0 | 6.3459 | 0.7932 | 0 | 0 | 0 |
| MFLOP/s STAT | 722.5548 | 88.2318 | 93.2152 | 90.3194 | 89.0568 | 89.3543 | 91.0526 |
| AVX [MFLOP/s] STAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Packed [MUOPS/s] STAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Scalar [MUOPS/s] STAT | 722.5548 | 88.2318 | 93.2152 | 90.3194 | 89.0568 | 89.3543 | 91.0526 |
| Memory read bandwidth [MBytes/s] STAT | 6235.6248 | 0 | 6235.6248 | 779.4531 | 0 | 0 | 0 |
| Memory read data volume [GBytes] STAT | 28.9334 | 0 | 28.9334 | 3.6167 | 0 | 0 | 0 |
| Memory write bandwidth [MBytes/s] STAT | 1648.2151 | 0 | 1648.2151 | 206.0269 | 0 | 0 | 0 |
| Memory write data volume [GBytes] STAT | 7.6478 | 0 | 7.6478 | 0.9560 | 0 | 0 | 0 |
| Memory bandwidth [MBytes/s] STAT | 7883.8399 | 0 | 7883.8399 | 985.4800 | 0 | 0 | 0 |
| Memory data volume [GBytes] STAT | 36.5812 | 0 | 36.5812 | 4.5727 | 0 | 0 | 0 |
| Operational intensity [FLOP/Byte] STAT | 0.0118 | 0 | 0.0118 | 0.0015 | 0 | 0 | 0 |
| Vectorization ratio [%] STAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----------------------------------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+
# Release build
## 1 process
+-----------------------------------+--------------+
| Metric | simforge:0:0 |
+-----------------------------------+--------------+
| Runtime (RDTSC) [s] | 6.3306 |
| Runtime unhalted [s] | 5.0403 |
| Clock [MHz] | 2497.0459 |
| CPI | 0.7337 |
| Energy [J] | 314.7977 |
| Power [W] | 49.7263 |
| Energy DRAM [J] | 32.3864 |
| Power DRAM [W] | 5.1158 |
| MFLOP/s | 467.1491 |
| AVX [MFLOP/s] | 0 |
| Packed [MUOPS/s] | 64.0155 |
| Scalar [MUOPS/s] | 339.1181 |
| Memory read bandwidth [MBytes/s] | 3835.0151 |
| Memory read data volume [GBytes] | 24.2780 |
| Memory write bandwidth [MBytes/s] | 1000.8994 |
| Memory write data volume [GBytes] | 6.3363 |
| Memory bandwidth [MBytes/s] | 4835.9145 |
| Memory data volume [GBytes] | 30.6143 |
| Operational intensity [FLOP/Byte] | 0.0966 |
| Vectorization ratio [%] | 15.8795 |
+-----------------------------------+--------------+
## 2 processes
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| Metric | Sum | Min | Max | Avg | %ile 25 | %ile 50 | %ile 75 |
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| Runtime (RDTSC) [s] STAT | 7.3827 | 3.6819 | 3.7008 | 3.6913 | 3.6819 | 3.6819 | 3.7008 |
| Runtime unhalted [s] STAT | 5.9300 | 2.9573 | 2.9727 | 2.9650 | 2.9573 | 2.9573 | 2.9727 |
| Clock [MHz] STAT | 4993.3202 | 2496.5943 | 2496.7259 | 2496.6601 | 2496.5943 | 2496.5943 | 2496.7259 |
| CPI STAT | 1.3248 | 0.6557 | 0.6691 | 0.6624 | 0.6557 | 0.6557 | 0.6691 |
| Energy [J] STAT | 203.2261 | 0 | 203.2261 | 101.6131 | 0 | 0 | 203.2261 |
| Power [W] STAT | 55.1963 | 0 | 55.1963 | 27.5982 | 0 | 0 | 55.1963 |
| Energy DRAM [J] STAT | 23.0437 | 0 | 23.0437 | 11.5219 | 0 | 0 | 23.0437 |
| Power DRAM [W] STAT | 6.2587 | 0 | 6.2587 | 3.1294 | 0 | 0 | 6.2587 |
| MFLOP/s STAT | 919.2860 | 453.8133 | 465.4727 | 459.6430 | 453.8133 | 453.8133 | 465.4727 |
| AVX [MFLOP/s] STAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Packed [MUOPS/s] STAT | 120.4004 | 59.2092 | 61.1912 | 60.2002 | 59.2092 | 59.2092 | 61.1912 |
| Scalar [MUOPS/s] STAT | 678.4853 | 335.3949 | 343.0904 | 339.2427 | 335.3949 | 335.3949 | 343.0904 |
| Memory read bandwidth [MBytes/s] STAT | 7451.4953 | 0 | 7451.4953 | 3725.7476 | 0 | 0 | 7451.4953 |
| Memory read data volume [GBytes] STAT | 27.4355 | 0 | 27.4355 | 13.7178 | 0 | 0 | 27.4355 |
| Memory write bandwidth [MBytes/s] STAT | 1951.3573 | 0 | 1951.3573 | 975.6786 | 0 | 0 | 1951.3573 |
| Memory write data volume [GBytes] STAT | 7.1847 | 0 | 7.1847 | 3.5924 | 0 | 0 | 7.1847 |
| Memory bandwidth [MBytes/s] STAT | 9402.8526 | 0 | 9402.8526 | 4701.4263 | 0 | 0 | 9402.8526 |
| Memory data volume [GBytes] STAT | 34.6202 | 0 | 34.6202 | 17.3101 | 0 | 0 | 34.6202 |
| Operational intensity [FLOP/Byte] STAT | 0.0495 | 0 | 0.0495 | 0.0248 | 0 | 0 | 0.0495 |
| Vectorization ratio [%] STAT | 30.1405 | 15.0047 | 15.1358 | 15.0702 | 15.0047 | 15.0047 | 15.1358 |
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
## 4 processes
+----------------------------------------+------------+-----------+------------+-----------+-----------+-----------+-----------+
| Metric | Sum | Min | Max | Avg | %ile 25 | %ile 50 | %ile 75 |
+----------------------------------------+------------+-----------+------------+-----------+-----------+-----------+-----------+
| Runtime (RDTSC) [s] STAT | 8.2769 | 2.0532 | 2.0822 | 2.0692 | 2.0532 | 2.0701 | 2.0714 |
| Runtime unhalted [s] STAT | 6.8580 | 1.7066 | 1.7271 | 1.7145 | 1.7066 | 1.7098 | 1.7145 |
| Clock [MHz] STAT | 9987.9943 | 2496.8601 | 2497.1454 | 2496.9986 | 2496.8601 | 2496.9319 | 2497.0569 |
| CPI STAT | 2.9292 | 0.7068 | 0.7549 | 0.7323 | 0.7068 | 0.7220 | 0.7455 |
| Energy [J] STAT | 126.1763 | 0 | 126.1763 | 31.5441 | 0 | 0 | 0 |
| Power [W] STAT | 61.4538 | 0 | 61.4538 | 15.3635 | 0 | 0 | 0 |
| Energy DRAM [J] STAT | 16.3530 | 0 | 16.3530 | 4.0883 | 0 | 0 | 0 |
| Power DRAM [W] STAT | 7.9647 | 0 | 7.9647 | 1.9912 | 0 | 0 | 0 |
| MFLOP/s STAT | 1635.4090 | 401.4841 | 421.1108 | 408.8523 | 401.4841 | 404.7493 | 408.0648 |
| AVX [MFLOP/s] STAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Packed [MUOPS/s] STAT | 214.6206 | 52.6419 | 55.4525 | 53.6552 | 52.6419 | 52.7975 | 53.7287 |
| Scalar [MUOPS/s] STAT | 1206.1680 | 296.2004 | 310.2059 | 301.5420 | 296.2004 | 299.1543 | 300.6074 |
| Memory read bandwidth [MBytes/s] STAT | 13624.7225 | 0 | 13624.7225 | 3406.1806 | 0 | 0 | 0 |
| Memory read data volume [GBytes] STAT | 27.9741 | 0 | 27.9741 | 6.9935 | 0 | 0 | 0 |
| Memory write bandwidth [MBytes/s] STAT | 3568.9472 | 0 | 3568.9472 | 892.2368 | 0 | 0 | 0 |
| Memory write data volume [GBytes] STAT | 7.3277 | 0 | 7.3277 | 1.8319 | 0 | 0 | 0 |
| Memory bandwidth [MBytes/s] STAT | 17193.6696 | 0 | 17193.6696 | 4298.4174 | 0 | 0 | 0 |
| Memory data volume [GBytes] STAT | 35.3019 | 0 | 35.3019 | 8.8255 | 0 | 0 | 0 |
| Operational intensity [FLOP/Byte] STAT | 0.0245 | 0 | 0.0245 | 0.0061 | 0 | 0 | 0 |
| Vectorization ratio [%] STAT | 60.4202 | 15.0014 | 15.1651 | 15.1050 | 15.0014 | 15.0905 | 15.1632 |
+----------------------------------------+------------+-----------+------------+-----------+-----------+-----------+-----------+
## 8 processes
+----------------------------------------+------------+-----------+------------+-----------+-----------+-----------+-----------+
| Metric | Sum | Min | Max | Avg | %ile 25 | %ile 50 | %ile 75 |
+----------------------------------------+------------+-----------+------------+-----------+-----------+-----------+-----------+
| Runtime (RDTSC) [s] STAT | 11.9013 | 1.4603 | 1.4996 | 1.4877 | 1.4830 | 1.4848 | 1.4990 |
| Runtime unhalted [s] STAT | 10.1243 | 1.2500 | 1.3127 | 1.2655 | 1.2549 | 1.2599 | 1.2614 |
| Clock [MHz] STAT | 19973.1141 | 2496.3416 | 2497.0836 | 2496.6393 | 2496.6011 | 2496.6143 | 2496.6251 |
| CPI STAT | 8.7761 | 0.9795 | 1.1799 | 1.0970 | 0.9848 | 1.1228 | 1.1395 |
| Energy [J] STAT | 105.2261 | 0 | 105.2261 | 13.1533 | 0 | 0 | 0 |
| Power [W] STAT | 72.0554 | 0 | 72.0554 | 9.0069 | 0 | 0 | 0 |
| Energy DRAM [J] STAT | 13.8533 | 0 | 13.8533 | 1.7317 | 0 | 0 | 0 |
| Power DRAM [W] STAT | 9.4863 | 0 | 9.4863 | 1.1858 | 0 | 0 | 0 |
| MFLOP/s STAT | 2263.1222 | 274.1682 | 296.1761 | 282.8903 | 277.2784 | 279.2491 | 285.9232 |
| AVX [MFLOP/s] STAT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Packed [MUOPS/s] STAT | 298.0918 | 36.0281 | 39.0467 | 37.2615 | 36.5567 | 36.7915 | 37.5327 |
| Scalar [MUOPS/s] STAT | 1666.9384 | 202.1119 | 218.0828 | 208.3673 | 203.9546 | 205.6660 | 211.3595 |
| Memory read bandwidth [MBytes/s] STAT | 19786.0005 | 0 | 19786.0005 | 2473.2501 | 0 | 0 | 0 |
| Memory read data volume [GBytes] STAT | 28.8945 | 0 | 28.8945 | 3.6118 | 0 | 0 | 0 |
| Memory write bandwidth [MBytes/s] STAT | 5186.3296 | 0 | 5186.3296 | 648.2912 | 0 | 0 | 0 |
| Memory write data volume [GBytes] STAT | 7.5739 | 0 | 7.5739 | 0.9467 | 0 | 0 | 0 |
| Memory bandwidth [MBytes/s] STAT | 24972.3301 | 0 | 24972.3301 | 3121.5413 | 0 | 0 | 0 |
| Memory data volume [GBytes] STAT | 36.4683 | 0 | 36.4683 | 4.5585 | 0 | 0 | 0 |
| Operational intensity [FLOP/Byte] STAT | 0.0119 | 0 | 0.0119 | 0.0015 | 0 | 0 | 0 |
| Vectorization ratio [%] STAT | 121.3594 | 14.9942 | 15.2373 | 15.1699 | 15.1290 | 15.1848 | 15.2175 |
+----------------------------------------+------------+-----------+------------+-----------+-----------+-----------+-----------+
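Reading numbers out of these tables by eye gets error-prone across many runs. As a rough sketch, the rows can be pulled out of a saved log with a small parser. It assumes the log contains tables laid out exactly like the ones above; real logs also contain other tables (for example raw event counts), which this simple version does not distinguish. `parse_metric_rows` is a name made up for this example, and the sample text is copied from the 8-process Release table.

```python
def parse_metric_rows(text: str) -> dict:
    """Map each metric name to its list of cell strings for '| name | v | ... |' rows."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skips '+-----+' separators and any non-table text
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        name, values = cells[0], cells[1:]
        if name == "Metric":
            continue  # header row
        rows[name] = values  # a later table with the same metric name overwrites
    return rows


SAMPLE = """\
+----------------------------------+------------+--------+------------+-----------+
| Metric | Sum | Min | Max | Avg | %ile 25 | %ile 50 | %ile 75 |
| Runtime (RDTSC) [s] STAT | 11.9013 | 1.4603 | 1.4996 | 1.4877 | 1.4830 | 1.4848 | 1.4990 |
| Memory bandwidth [MBytes/s] STAT | 24972.3301 | 0 | 24972.3301 | 3121.5413 | 0 | 0 | 0 |
"""

rows = parse_metric_rows(SAMPLE)
# Column 0 is "Sum" in the multi-process tables.
bandwidth_gbs = float(rows["Memory bandwidth [MBytes/s] STAT"][0]) / 1000.0
print(f"total memory bandwidth: {bandwidth_gbs:.2f} GB/s")
```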
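- Analysis of the sample results: with the Debug build's low floating-point throughput, even 8 processes do not saturate the bandwidth (about 7.9 GB/s in total), per-process MFLOP/s stays at roughly 90 to 96, and parallel efficiency is normal. For the Release build, going from 4 to 8 processes the per-process floating-point rate drops markedly (average MFLOP/s falls from about 409 to about 283), and total bandwidth rises only from about 17.2 GB/s to about 25.0 GB/s instead of doubling. Contention is clearly occurring, and the total bandwidth is already close to this machine's limit.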
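The scaling behaviour can be quantified from the tables above. The sketch below hard-codes the average "Runtime (RDTSC)" and the total "Memory bandwidth" values from the sample results and computes speedup and parallel efficiency. For the 1-process Debug run, the bandwidth is taken as read plus write bandwidth, because the total-bandwidth row of that table holds an implausible value.

```python
# Average Runtime (RDTSC) [s] per process count, copied from the tables above.
RUNTIME_S = {
    "Debug":   {1: 30.8271, 2: 17.7726, 4: 9.0000, 8: 4.6587},
    "Release": {1: 6.3306, 2: 3.6913, 4: 2.0692, 8: 1.4877},
}

# Total memory bandwidth [MBytes/s]; Debug/1 is read + write (820.1712 + 231.5628).
BANDWIDTH_MBS = {
    "Debug":   {1: 1051.7340, 2: 1975.8165, 4: 3988.2066, 8: 7883.8399},
    "Release": {1: 4835.9145, 2: 9402.8526, 4: 17193.6696, 8: 24972.3301},
}


def speedup(build: str, n: int) -> float:
    """Speedup relative to the 1-process run of the same build."""
    return RUNTIME_S[build][1] / RUNTIME_S[build][n]


def efficiency(build: str, n: int) -> float:
    """Parallel efficiency: speedup divided by the number of processes."""
    return speedup(build, n) / n


for build in ("Debug", "Release"):
    for n in (2, 4, 8):
        bw_gbs = BANDWIDTH_MBS[build][n] / 1000.0
        print(f"{build:7s} np={n}: speedup {speedup(build, n):.2f}, "
              f"efficiency {efficiency(build, n):.0%}, bandwidth {bw_gbs:.1f} GB/s")
```

On these numbers the Debug build holds an efficiency above 80% up to 8 processes, while the Release build falls to roughly half at 8 processes as its bandwidth growth flattens.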