Analyzing Memory-Access Performance Bottlenecks in a Program

1 Problem Scenario

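While measuring the parallel efficiency of a program, I found that the Release build scaled poorly, whereas the Debug build scaled as expected. My suspicion was that the program is limited by memory bandwidth and therefore cannot reach the performance it should. That is a reasonable guess, but it needs real measurements to back it up.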

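Think of memory bandwidth as a water pipe: the data is the water, and the CPU waits for water to arrive before it can do any work. The Debug build draws water slowly and the Release build draws it quickly. With multiple processes, the Debug build's total throughput roughly doubles each time the process count doubles, because it is still far from the pipe's capacity. Once the Release build reaches that capacity, each process gets slower as more processes are added: in a parallel run, multiple processes/cores access memory at the same time and compete for the same bandwidth, so the speed of each individual process drops.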
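The water-pipe picture can be turned into a toy model. The sketch below is only an illustration: the capacity of 25 units and the per-process demands of 1 (Debug) and 5 (Release) are hypothetical numbers chosen to show the two regimes, not measurements, and the function names are made up for this example.

```python
def per_process_rate(demand, cap, n_procs):
    """Rate one process gets when n_procs share a pipe of capacity cap.

    Each process wants `demand`, but the pipe is split evenly once it is full.
    """
    return min(demand, cap / n_procs)


def total_rate(demand, cap, n_procs):
    """Combined rate of all processes (never exceeds cap)."""
    return n_procs * per_process_rate(demand, cap, n_procs)


CAP = 25.0  # hypothetical pipe capacity (arbitrary units)
for name, demand in (("Debug", 1.0), ("Release", 5.0)):  # hypothetical demands
    for n in (1, 2, 4, 8):
        print(f"{name:8s} n={n}: per-process={per_process_rate(demand, CAP, n):.3f} "
              f"total={total_rate(demand, CAP, n):.1f}")
```

In this model the low-demand case scales linearly up to 8 processes, while the high-demand case hits the cap between 4 and 8 processes and the per-process rate falls.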

2 Background

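  • Arithmetic Intensity (AI) = floating-point operations (FLOPs) / bytes of memory accessed

    For example, sparse matrix-vector multiplication (SpMV) is the core operation of iterative methods. Assuming CSR storage, each nonzero element requires:

    1. Reading the value (8 bytes, double)
    2. Reading the column index (4 or 8 bytes)
    3. Reading the corresponding vector element (8 bytes)
    4. Only one multiplication and one addition of computation (2 FLOPs)
      Result: only 2 FLOPs per 20~24 bytes of data → AI ≈ 0.1 FLOPs/Byte
  • Max Memory Bandwidth: the maximum rate at which the CPU can read data from memory or store data to memory (in GB/s)
    The officially specified maximum memory bandwidth of the Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz is 76.8 GB/s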

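  • FLOPS (floating-point operations per second): the number of floating-point operations executed per second

    1. Estimate: FLOPS = number of cores × per-core clock frequency × floating-point operations per CPU cycle. The first two parameters are listed in the CPU specification; the third depends on the Instruction Set Extensions

      Name          Instruction set     64-bit operations per clock cycle
      Nehalem       SSE (128-bit)       2
      Sandy Bridge  AVX (256-bit)       4
      Haswell       AVX2 (256-bit)      4
      Purley        AVX512 (512-bit)    16 (fused multiply-add, FMA=2), 8 (FMA=1)
    2. Example: Intel® Xeon® Processor E5-2650 v4, Processor Base Frequency 2.20 GHz, Total Cores 12, Instruction Set Extensions Intel® AVX2, so FLOPS = 2.2 × 12 × 8 = 211.2 GFLOPS. Here 8 operations per cycle are assumed, twice the AVX2 entry in the table above.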

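  • Roofline Model: for a model with compute volume A and memory traffic B, what is the theoretical performance ceiling E it can reach on a platform with compute capability C and bandwidth D?

    • Compute-bound region: when arithmetic intensity is high, performance is limited by the peak FLOPS (the horizontal line)
    • Memory-bound region: the bandwidth (the slope) determines the FLOPS ceiling
    • FLOPS ≤ min(peak FLOPS, memory bandwidth × arithmetic intensity)
    • When arithmetic intensity is very low: FLOPS ≈ actual memory bandwidth × arithmetic intensity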
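The definitions above combine into a small calculation. The following is a minimal sketch that plugs in the figures quoted in this section (12 cores, 2.2 GHz, 8 operations per cycle, 76.8 GB/s nominal bandwidth, 2 FLOPs per 20 to 24 bytes for SpMV). The function names are illustrative, and the nominal bandwidth is used only because no measured value is given here; a measured bandwidth would normally be lower.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def peak_gflops(cores, ghz, flops_per_cycle):
    """Peak compute estimate: cores x clock (GHz) x operations per cycle."""
    return cores * ghz * flops_per_cycle


def roofline_gflops(peak, bandwidth_gbs, ai):
    """Roofline bound: min(peak FLOPS, bandwidth x arithmetic intensity)."""
    return min(peak, bandwidth_gbs * ai)


PEAK = peak_gflops(cores=12, ghz=2.2, flops_per_cycle=8)  # E5-2650 v4 estimate
BANDWIDTH = 76.8                                           # GB/s, nominal value

# CSR SpMV: 2 FLOPs per nonzero, 20 bytes (4-byte index) or 24 bytes (8-byte index)
for bytes_per_nnz in (20, 24):
    ai = arithmetic_intensity(2, bytes_per_nnz)
    bound = roofline_gflops(PEAK, BANDWIDTH, ai)
    print(f"{bytes_per_nnz} B/nnz: AI={ai:.3f} FLOP/Byte, bound={bound:.2f} GFLOPS")

# Arithmetic intensity at which the two roofs meet (the ridge point)
print(f"peak={PEAK:.1f} GFLOPS, ridge point={PEAK / BANDWIDTH:.2f} FLOP/Byte")
```

Under these assumptions the SpMV bound is roughly 6.4 to 7.7 GFLOPS, far below the 211.2 GFLOPS peak, which places the kernel deep in the memory-bound region.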

3 Verification Tests

3.1 Memory bandwidth ceiling test (cs-roofline-toolkit)

  1. git clone https://bitbucket.org/berkeleylab/cs-roofline-toolkit.git
  2. Run:
cd cs-roofline-toolkit/Empirical_Roofline_Tool-1.1.0
cp Config/config.madonna.lbl.gov.01 Config/myPCTest
  3. Edit Config/myPCTest; the main change is switching the compiler from intel to gnu
ERT_RESULTS Results.myPCtest

ERT_DRIVER  driver1
ERT_KERNEL  kernel1

ERT_MPI         True
ERT_MPI_CFLAGS
ERT_MPI_LDFLAGS

ERT_OPENMP         True
ERT_OPENMP_CFLAGS  -fopenmp
ERT_OPENMP_LDFLAGS -fopenmp

ERT_FLOPS   1,2,4,8,16
ERT_ALIGN   32

ERT_CC      mpic++
ERT_CFLAGS  -O3 -march=native -msse3

ERT_LD      mpic++
ERT_LDFLAGS
ERT_LDLIBS

ERT_PRECISION FP64

ERT_RUN     export OMP_NUM_THREADS=ERT_OPENMP_THREADS; mpirun -np ERT_MPI_PROCS
ERT_CODE

ERT_PROCS_THREADS  1-8
ERT_MPI_PROCS      1,2,4,8
ERT_OPENMP_THREADS 1,2,4,8

ERT_NUM_EXPERIMENTS 3

ERT_MEMORY_MAX 1073741824
ERT_WORKING_SET_MIN 1

ERT_TRIALS_MIN 1

ERT_GNUPLOT gnuplot
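  4. When plotting with a newer version of gnuplot, one line has to be changed in every file in the Plot folder:

    # Original command
    set clabel '%8.3g'
    # Replacement:
    set cntrlabel format '%8.3g'
  5. Run the test: ./ert Config/myPCTest

  6. Inspect the file Results.myPCtest/Run.001/roofline.ps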
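The gnuplot edit described above has to be repeated for every file in the Plot folder, so it can be scripted. The sketch below is one possible way to do it; the helper names are made up for this example, and it assumes every file directly inside the folder is a plain-text gnuplot script.

```python
import re
from pathlib import Path

# Matches an (optionally indented) "set clabel" at the start of a line.
_CLABEL = re.compile(r"^([ \t]*)set[ \t]+clabel\b", re.MULTILINE)


def patch_gnuplot_text(text):
    """Replace the old 'set clabel' command with 'set cntrlabel format'."""
    return _CLABEL.sub(r"\1set cntrlabel format", text)


def patch_plot_dir(plot_dir):
    """Patch every regular file in plot_dir in place; return the changed file names."""
    changed = []
    for path in sorted(Path(plot_dir).iterdir()):
        if not path.is_file():
            continue
        old = path.read_text()
        new = patch_gnuplot_text(old)
        if new != old:
            path.write_text(new)
            changed.append(path.name)
    return changed
```

Calling `patch_plot_dir("Plot")` from the Empirical_Roofline_Tool-1.1.0 directory would apply the change; keeping a copy of the folder first makes it easy to revert.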

3.2 Measuring the actual memory bandwidth (likwid)

  1. Install likwid from source

    git clone https://github.com/RRZE-HPC/likwid.git
    cd likwid
    vi config.mk
    make -j8
    sudo make install
  2. Load the MSR kernel module: sudo modprobe msr

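  3. Run the program (in the parallel command, $1 is the number of processes, passed as the first argument of the wrapper script):

    • Serial: likwid-perfctr -g MEM_DP ./xxx |tee log.perf.np1
    • Parallel: likwid-mpirun -np $1 -mpi openmpi -g MEM_DP ./xxx |tee log.perf.np$1
  4. Example results: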

# Debug build

## 1 process
+----------------------------------------+--------------+--------------+-----------------+--------------+
|                 Metric                 |      Sum     |      Min     |       Max       |      Avg     |
+----------------------------------------+--------------+--------------+-----------------+--------------+
|        Runtime (RDTSC) [s] STAT        |    1479.7008 |      30.8271 |         30.8271 |      30.8271 |
|        Runtime unhalted [s] STAT       |      31.1694 | 2.426207e-05 |         27.6407 |       0.6494 |
|            Clock [MHz] STAT            |   72421.4641 |    1199.9661 |       2497.0138 |    1508.7805 |
|                CPI STAT                |      66.3224 |       0.4918 |          5.3376 |       1.3817 |
|             Energy [J] STAT            |    2597.8198 |            0 |       1348.1428 |      54.1212 |
|             Power [W] STAT             |      84.2708 |            0 |         43.7325 |       1.7556 |
|          Energy DRAM [J] STAT          |     183.7499 |            0 |        107.3479 |       3.8281 |
|           Power DRAM [W] STAT          |       5.9607 |            0 |          3.4823 |       0.1242 |
|              MFLOP/s STAT              |      95.8335 | 1.427318e-06 |         90.9389 |       1.9965 |
|           AVX [MFLOP/s] STAT           |            0 |            0 |               0 |            0 |
|          Packed [MUOPS/s] STAT         | 6.487808e-08 |            0 |    6.487808e-08 | 1.351627e-09 |
|          Scalar [MUOPS/s] STAT         |      95.8335 | 1.427318e-06 |         90.9389 |       1.9965 |
|  Memory read bandwidth [MBytes/s] STAT |     820.1712 |            0 |        820.1712 |      17.0869 |
|  Memory read data volume [GBytes] STAT | 2.361183e+12 |            0 |   2361183000000 | 4.919131e+10 |
| Memory write bandwidth [MBytes/s] STAT |     231.5628 |            0 |        231.5628 |       4.8242 |
| Memory write data volume [GBytes] STAT | 2.361183e+12 |            0 |   2361183000000 | 4.919131e+10 |
|    Memory bandwidth [MBytes/s] STAT    | 1.531890e+14 |            0 | 153189000000000 | 3.191438e+12 |
|    Memory data volume [GBytes] STAT    | 4.722366e+12 |            0 |   4722366000000 | 9.838263e+10 |
| Operational intensity [FLOP/Byte] STAT |       0.0912 | 1.000000e-20 |          0.0865 |       0.0019 |
|      Vectorization ratio [%] STAT      |       0.1795 |            0 |          0.1795 |       0.0037 |
+----------------------------------------+--------------+--------------+-----------------+--------------+

## 2 processes

+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|                 Metric                 |    Sum    |    Min    |    Max    |    Avg    |  %ile 25  |  %ile 50  |  %ile 75  |
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|        Runtime (RDTSC) [s] STAT        |   35.5453 |   17.7625 |   17.7828 |   17.7726 |   17.7625 |   17.7625 |   17.7828 |
|        Runtime unhalted [s] STAT       |   37.1324 |   18.5648 |   18.5676 |   18.5662 |   18.5648 |   18.5648 |   18.5676 |
|            Clock [MHz] STAT            | 4994.1269 | 2497.0395 | 2497.0874 | 2497.0634 | 2497.0395 | 2497.0395 | 2497.0874 |
|                CPI STAT                |    0.9934 |    0.4885 |    0.5049 |    0.4967 |    0.4885 |    0.4885 |    0.5049 |
|             Energy [J] STAT            |  922.7993 |         0 |  922.7993 |  461.3997 |         0 |         0 |  922.7993 |
|             Power [W] STAT             |   51.9522 |         0 |   51.9522 |   25.9761 |         0 |         0 |   51.9522 |
|          Energy DRAM [J] STAT          |   79.1592 |         0 |   79.1592 |   39.5796 |         0 |         0 |   79.1592 |
|           Power DRAM [W] STAT          |    4.4565 |         0 |    4.4565 |    2.2283 |         0 |         0 |    4.4565 |
|              MFLOP/s STAT              |  190.9296 |   94.4443 |   96.4853 |   95.4648 |   94.4443 |   94.4443 |   96.4853 |
|           AVX [MFLOP/s] STAT           |         0 |         0 |         0 |         0 |         0 |         0 |         0 |
|          Packed [MUOPS/s] STAT         |         0 |         0 |         0 |         0 |         0 |         0 |         0 |
|          Scalar [MUOPS/s] STAT         |  190.9296 |   94.4443 |   96.4853 |   95.4648 |   94.4443 |   94.4443 |   96.4853 |
|  Memory read bandwidth [MBytes/s] STAT | 1562.0553 |         0 | 1562.0553 |  781.0276 |         0 |         0 | 1562.0553 |
|  Memory read data volume [GBytes] STAT |   27.7459 |         0 |   27.7459 |   13.8729 |         0 |         0 |   27.7459 |
| Memory write bandwidth [MBytes/s] STAT |  413.7612 |         0 |  413.7612 |  206.8806 |         0 |         0 |  413.7612 |
| Memory write data volume [GBytes] STAT |    7.3494 |         0 |    7.3494 |    3.6747 |         0 |         0 |    7.3494 |
|    Memory bandwidth [MBytes/s] STAT    | 1975.8165 |         0 | 1975.8165 |  987.9082 |         0 |         0 | 1975.8165 |
|    Memory data volume [GBytes] STAT    |   35.0954 |         0 |   35.0954 |   17.5477 |         0 |         0 |   35.0954 |
| Operational intensity [FLOP/Byte] STAT |    0.0488 |         0 |    0.0488 |    0.0244 |         0 |         0 |    0.0488 |
|      Vectorization ratio [%] STAT      |         0 |         0 |         0 |         0 |         0 |         0 |         0 |
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+

## 4 processes

+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|                 Metric                 |    Sum    |    Min    |    Max    |    Avg    |  %ile 25  |  %ile 50  |  %ile 75  |
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|        Runtime (RDTSC) [s] STAT        |   36.0001 |    8.9877 |    9.0053 |    9.0000 |    8.9877 |    9.0029 |    9.0042 |
|        Runtime unhalted [s] STAT       |   37.8236 |    9.4377 |    9.4879 |    9.4559 |    9.4377 |    9.4459 |    9.4521 |
|            Clock [MHz] STAT            | 9988.1020 | 2497.0171 | 2497.0344 | 2497.0255 | 2497.0171 | 2497.0228 | 2497.0277 |
|                CPI STAT                |    1.9578 |    0.4859 |    0.4919 |    0.4894 |    0.4859 |    0.4893 |    0.4907 |
|             Energy [J] STAT            |  549.6915 |         0 |  549.6915 |  137.4229 |         0 |         0 |         0 |
|             Power [W] STAT             |   61.1607 |         0 |   61.1607 |   15.2902 |         0 |         0 |         0 |
|          Energy DRAM [J] STAT          |   46.4047 |         0 |   46.4047 |   11.6012 |         0 |         0 |         0 |
|           Power DRAM [W] STAT          |    5.1632 |         0 |    5.1632 |    1.2908 |         0 |         0 |         0 |
|              MFLOP/s STAT              |  375.9728 |   92.8303 |   96.2008 |   93.9932 |   92.8303 |   93.1243 |   93.8174 |
|           AVX [MFLOP/s] STAT           |         0 |         0 |         0 |         0 |         0 |         0 |         0 |
|          Packed [MUOPS/s] STAT         |         0 |         0 |         0 |         0 |         0 |         0 |         0 |
|          Scalar [MUOPS/s] STAT         |  375.9728 |   92.8303 |   96.2008 |   93.9932 |   92.8303 |   93.1243 |   93.8174 |
|  Memory read bandwidth [MBytes/s] STAT | 3151.1117 |         0 | 3151.1117 |  787.7779 |         0 |         0 |         0 |
|  Memory read data volume [GBytes] STAT |   28.3211 |         0 |   28.3211 |    7.0803 |         0 |         0 |         0 |
| Memory write bandwidth [MBytes/s] STAT |  837.0949 |         0 |  837.0949 |  209.2737 |         0 |         0 |         0 |
| Memory write data volume [GBytes] STAT |    7.5235 |         0 |    7.5235 |    1.8809 |         0 |         0 |         0 |
|    Memory bandwidth [MBytes/s] STAT    | 3988.2066 |         0 | 3988.2066 |  997.0516 |         0 |         0 |         0 |
|    Memory data volume [GBytes] STAT    |   35.8447 |         0 |   35.8447 |    8.9612 |         0 |         0 |         0 |
| Operational intensity [FLOP/Byte] STAT |    0.0241 |         0 |    0.0241 |    0.0060 |         0 |         0 |         0 |
|      Vectorization ratio [%] STAT      |         0 |         0 |         0 |         0 |         0 |         0 |         0 |
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+

## 8 processes

+----------------------------------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+
|                 Metric                 |     Sum    |    Min    |    Max    |    Avg    |  %ile 25  |  %ile 50  |  %ile 75  |
+----------------------------------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+
|        Runtime (RDTSC) [s] STAT        |    37.2697 |    4.6400 |    4.6783 |    4.6587 |    4.6538 |    4.6568 |    4.6605 |
|        Runtime unhalted [s] STAT       |    38.8445 |    4.8415 |    4.8793 |    4.8556 |    4.8419 |    4.8510 |    4.8547 |
|            Clock [MHz] STAT            | 19975.2132 | 2496.5400 | 2497.0420 | 2496.9017 | 2496.8017 | 2496.9713 | 2496.9762 |
|                CPI STAT                |     3.9348 |    0.4852 |    0.4938 |    0.4919 |    0.4900 |    0.4926 |    0.4937 |
|             Energy [J] STAT            |   337.8464 |         0 |  337.8464 |   42.2308 |         0 |         0 |         0 |
|             Power [W] STAT             |    72.8114 |         0 |   72.8114 |    9.1014 |         0 |         0 |         0 |
|          Energy DRAM [J] STAT          |    29.4452 |         0 |   29.4452 |    3.6806 |         0 |         0 |         0 |
|           Power DRAM [W] STAT          |     6.3459 |         0 |    6.3459 |    0.7932 |         0 |         0 |         0 |
|              MFLOP/s STAT              |   722.5548 |   88.2318 |   93.2152 |   90.3194 |   89.0568 |   89.3543 |   91.0526 |
|           AVX [MFLOP/s] STAT           |          0 |         0 |         0 |         0 |         0 |         0 |         0 |
|          Packed [MUOPS/s] STAT         |          0 |         0 |         0 |         0 |         0 |         0 |         0 |
|          Scalar [MUOPS/s] STAT         |   722.5548 |   88.2318 |   93.2152 |   90.3194 |   89.0568 |   89.3543 |   91.0526 |
|  Memory read bandwidth [MBytes/s] STAT |  6235.6248 |         0 | 6235.6248 |  779.4531 |         0 |         0 |         0 |
|  Memory read data volume [GBytes] STAT |    28.9334 |         0 |   28.9334 |    3.6167 |         0 |         0 |         0 |
| Memory write bandwidth [MBytes/s] STAT |  1648.2151 |         0 | 1648.2151 |  206.0269 |         0 |         0 |         0 |
| Memory write data volume [GBytes] STAT |     7.6478 |         0 |    7.6478 |    0.9560 |         0 |         0 |         0 |
|    Memory bandwidth [MBytes/s] STAT    |  7883.8399 |         0 | 7883.8399 |  985.4800 |         0 |         0 |         0 |
|    Memory data volume [GBytes] STAT    |    36.5812 |         0 |   36.5812 |    4.5727 |         0 |         0 |         0 |
| Operational intensity [FLOP/Byte] STAT |     0.0118 |         0 |    0.0118 |    0.0015 |         0 |         0 |         0 |
|      Vectorization ratio [%] STAT      |          0 |         0 |         0 |         0 |         0 |         0 |         0 |
+----------------------------------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+

# Release build

## 1 process

+-----------------------------------+--------------+
|               Metric              | simforge:0:0 |
+-----------------------------------+--------------+
|        Runtime (RDTSC) [s]        |       6.3306 |
|        Runtime unhalted [s]       |       5.0403 |
|            Clock [MHz]            |    2497.0459 |
|                CPI                |       0.7337 |
|             Energy [J]            |     314.7977 |
|             Power [W]             |      49.7263 |
|          Energy DRAM [J]          |      32.3864 |
|           Power DRAM [W]          |       5.1158 |
|              MFLOP/s              |     467.1491 |
|           AVX [MFLOP/s]           |            0 |
|          Packed [MUOPS/s]         |      64.0155 |
|          Scalar [MUOPS/s]         |     339.1181 |
|  Memory read bandwidth [MBytes/s] |    3835.0151 |
|  Memory read data volume [GBytes] |      24.2780 |
| Memory write bandwidth [MBytes/s] |    1000.8994 |
| Memory write data volume [GBytes] |       6.3363 |
|    Memory bandwidth [MBytes/s]    |    4835.9145 |
|    Memory data volume [GBytes]    |      30.6143 |
| Operational intensity [FLOP/Byte] |       0.0966 |
|      Vectorization ratio [%]      |      15.8795 |
+-----------------------------------+--------------+

## 2 processes

+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|                 Metric                 |    Sum    |    Min    |    Max    |    Avg    |  %ile 25  |  %ile 50  |  %ile 75  |
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|        Runtime (RDTSC) [s] STAT        |    7.3827 |    3.6819 |    3.7008 |    3.6913 |    3.6819 |    3.6819 |    3.7008 |
|        Runtime unhalted [s] STAT       |    5.9300 |    2.9573 |    2.9727 |    2.9650 |    2.9573 |    2.9573 |    2.9727 |
|            Clock [MHz] STAT            | 4993.3202 | 2496.5943 | 2496.7259 | 2496.6601 | 2496.5943 | 2496.5943 | 2496.7259 |
|                CPI STAT                |    1.3248 |    0.6557 |    0.6691 |    0.6624 |    0.6557 |    0.6557 |    0.6691 |
|             Energy [J] STAT            |  203.2261 |         0 |  203.2261 |  101.6131 |         0 |         0 |  203.2261 |
|             Power [W] STAT             |   55.1963 |         0 |   55.1963 |   27.5982 |         0 |         0 |   55.1963 |
|          Energy DRAM [J] STAT          |   23.0437 |         0 |   23.0437 |   11.5219 |         0 |         0 |   23.0437 |
|           Power DRAM [W] STAT          |    6.2587 |         0 |    6.2587 |    3.1294 |         0 |         0 |    6.2587 |
|              MFLOP/s STAT              |  919.2860 |  453.8133 |  465.4727 |  459.6430 |  453.8133 |  453.8133 |  465.4727 |
|           AVX [MFLOP/s] STAT           |         0 |         0 |         0 |         0 |         0 |         0 |         0 |
|          Packed [MUOPS/s] STAT         |  120.4004 |   59.2092 |   61.1912 |   60.2002 |   59.2092 |   59.2092 |   61.1912 |
|          Scalar [MUOPS/s] STAT         |  678.4853 |  335.3949 |  343.0904 |  339.2427 |  335.3949 |  335.3949 |  343.0904 |
|  Memory read bandwidth [MBytes/s] STAT | 7451.4953 |         0 | 7451.4953 | 3725.7476 |         0 |         0 | 7451.4953 |
|  Memory read data volume [GBytes] STAT |   27.4355 |         0 |   27.4355 |   13.7178 |         0 |         0 |   27.4355 |
| Memory write bandwidth [MBytes/s] STAT | 1951.3573 |         0 | 1951.3573 |  975.6786 |         0 |         0 | 1951.3573 |
| Memory write data volume [GBytes] STAT |    7.1847 |         0 |    7.1847 |    3.5924 |         0 |         0 |    7.1847 |
|    Memory bandwidth [MBytes/s] STAT    | 9402.8526 |         0 | 9402.8526 | 4701.4263 |         0 |         0 | 9402.8526 |
|    Memory data volume [GBytes] STAT    |   34.6202 |         0 |   34.6202 |   17.3101 |         0 |         0 |   34.6202 |
| Operational intensity [FLOP/Byte] STAT |    0.0495 |         0 |    0.0495 |    0.0248 |         0 |         0 |    0.0495 |
|      Vectorization ratio [%] STAT      |   30.1405 |   15.0047 |   15.1358 |   15.0702 |   15.0047 |   15.0047 |   15.1358 |
+----------------------------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+

## 4 processes

+----------------------------------------+------------+-----------+------------+-----------+-----------+-----------+-----------+
|                 Metric                 |     Sum    |    Min    |     Max    |    Avg    |  %ile 25  |  %ile 50  |  %ile 75  |
+----------------------------------------+------------+-----------+------------+-----------+-----------+-----------+-----------+
|        Runtime (RDTSC) [s] STAT        |     8.2769 |    2.0532 |     2.0822 |    2.0692 |    2.0532 |    2.0701 |    2.0714 |
|        Runtime unhalted [s] STAT       |     6.8580 |    1.7066 |     1.7271 |    1.7145 |    1.7066 |    1.7098 |    1.7145 |
|            Clock [MHz] STAT            |  9987.9943 | 2496.8601 |  2497.1454 | 2496.9986 | 2496.8601 | 2496.9319 | 2497.0569 |
|                CPI STAT                |     2.9292 |    0.7068 |     0.7549 |    0.7323 |    0.7068 |    0.7220 |    0.7455 |
|             Energy [J] STAT            |   126.1763 |         0 |   126.1763 |   31.5441 |         0 |         0 |         0 |
|             Power [W] STAT             |    61.4538 |         0 |    61.4538 |   15.3635 |         0 |         0 |         0 |
|          Energy DRAM [J] STAT          |    16.3530 |         0 |    16.3530 |    4.0883 |         0 |         0 |         0 |
|           Power DRAM [W] STAT          |     7.9647 |         0 |     7.9647 |    1.9912 |         0 |         0 |         0 |
|              MFLOP/s STAT              |  1635.4090 |  401.4841 |   421.1108 |  408.8523 |  401.4841 |  404.7493 |  408.0648 |
|           AVX [MFLOP/s] STAT           |          0 |         0 |          0 |         0 |         0 |         0 |         0 |
|          Packed [MUOPS/s] STAT         |   214.6206 |   52.6419 |    55.4525 |   53.6552 |   52.6419 |   52.7975 |   53.7287 |
|          Scalar [MUOPS/s] STAT         |  1206.1680 |  296.2004 |   310.2059 |  301.5420 |  296.2004 |  299.1543 |  300.6074 |
|  Memory read bandwidth [MBytes/s] STAT | 13624.7225 |         0 | 13624.7225 | 3406.1806 |         0 |         0 |         0 |
|  Memory read data volume [GBytes] STAT |    27.9741 |         0 |    27.9741 |    6.9935 |         0 |         0 |         0 |
| Memory write bandwidth [MBytes/s] STAT |  3568.9472 |         0 |  3568.9472 |  892.2368 |         0 |         0 |         0 |
| Memory write data volume [GBytes] STAT |     7.3277 |         0 |     7.3277 |    1.8319 |         0 |         0 |         0 |
|    Memory bandwidth [MBytes/s] STAT    | 17193.6696 |         0 | 17193.6696 | 4298.4174 |         0 |         0 |         0 |
|    Memory data volume [GBytes] STAT    |    35.3019 |         0 |    35.3019 |    8.8255 |         0 |         0 |         0 |
| Operational intensity [FLOP/Byte] STAT |     0.0245 |         0 |     0.0245 |    0.0061 |         0 |         0 |         0 |
|      Vectorization ratio [%] STAT      |    60.4202 |   15.0014 |    15.1651 |   15.1050 |   15.0014 |   15.0905 |   15.1632 |
+----------------------------------------+------------+-----------+------------+-----------+-----------+-----------+-----------+

## 8 processes

+----------------------------------------+------------+-----------+------------+-----------+-----------+-----------+-----------+
|                 Metric                 |     Sum    |    Min    |     Max    |    Avg    |  %ile 25  |  %ile 50  |  %ile 75  |
+----------------------------------------+------------+-----------+------------+-----------+-----------+-----------+-----------+
|        Runtime (RDTSC) [s] STAT        |    11.9013 |    1.4603 |     1.4996 |    1.4877 |    1.4830 |    1.4848 |    1.4990 |
|        Runtime unhalted [s] STAT       |    10.1243 |    1.2500 |     1.3127 |    1.2655 |    1.2549 |    1.2599 |    1.2614 |
|            Clock [MHz] STAT            | 19973.1141 | 2496.3416 |  2497.0836 | 2496.6393 | 2496.6011 | 2496.6143 | 2496.6251 |
|                CPI STAT                |     8.7761 |    0.9795 |     1.1799 |    1.0970 |    0.9848 |    1.1228 |    1.1395 |
|             Energy [J] STAT            |   105.2261 |         0 |   105.2261 |   13.1533 |         0 |         0 |         0 |
|             Power [W] STAT             |    72.0554 |         0 |    72.0554 |    9.0069 |         0 |         0 |         0 |
|          Energy DRAM [J] STAT          |    13.8533 |         0 |    13.8533 |    1.7317 |         0 |         0 |         0 |
|           Power DRAM [W] STAT          |     9.4863 |         0 |     9.4863 |    1.1858 |         0 |         0 |         0 |
|              MFLOP/s STAT              |  2263.1222 |  274.1682 |   296.1761 |  282.8903 |  277.2784 |  279.2491 |  285.9232 |
|           AVX [MFLOP/s] STAT           |          0 |         0 |          0 |         0 |         0 |         0 |         0 |
|          Packed [MUOPS/s] STAT         |   298.0918 |   36.0281 |    39.0467 |   37.2615 |   36.5567 |   36.7915 |   37.5327 |
|          Scalar [MUOPS/s] STAT         |  1666.9384 |  202.1119 |   218.0828 |  208.3673 |  203.9546 |  205.6660 |  211.3595 |
|  Memory read bandwidth [MBytes/s] STAT | 19786.0005 |         0 | 19786.0005 | 2473.2501 |         0 |         0 |         0 |
|  Memory read data volume [GBytes] STAT |    28.8945 |         0 |    28.8945 |    3.6118 |         0 |         0 |         0 |
| Memory write bandwidth [MBytes/s] STAT |  5186.3296 |         0 |  5186.3296 |  648.2912 |         0 |         0 |         0 |
| Memory write data volume [GBytes] STAT |     7.5739 |         0 |     7.5739 |    0.9467 |         0 |         0 |         0 |
|    Memory bandwidth [MBytes/s] STAT    | 24972.3301 |         0 | 24972.3301 | 3121.5413 |         0 |         0 |         0 |
|    Memory data volume [GBytes] STAT    |    36.4683 |         0 |    36.4683 |    4.5585 |         0 |         0 |         0 |
| Operational intensity [FLOP/Byte] STAT |     0.0119 |         0 |     0.0119 |    0.0015 |         0 |         0 |         0 |
|      Vectorization ratio [%] STAT      |   121.3594 |   14.9942 |    15.2373 |   15.1699 |   15.1290 |   15.1848 |   15.2175 |
+----------------------------------------+------------+-----------+------------+-----------+-----------+-----------+-----------+
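  5. Analysis of the example results: With the Debug build's floating-point workload, even 8 processes do not saturate the bandwidth (about 7.9 GB/s in total), and the parallel efficiency is normal: the aggregate rate grows from 95.8 MFLOP/s with 1 process to 722.6 MFLOP/s with 8. With the Release build, floating-point efficiency drops noticeably from 4 to 8 processes: the per-process average falls from about 409 MFLOP/s to about 283 MFLOP/s, while the total bandwidth only rises from about 17.2 GB/s to about 25.0 GB/s (roughly 1.45× for twice the processes). The processes are clearly competing for memory, and the total bandwidth is already close to this machine's limit.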
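The scaling behaviour can be quantified from the tables above. The sketch below computes the parallel efficiency, T(1) / (p × T(p)), from the maximum "Runtime (RDTSC) [s]" of each run, and the growth of the total "Memory bandwidth [MBytes/s]" from 4 to 8 processes. The numbers are copied from the likwid output; the dictionary layout and function names are just for illustration.

```python
# Maximum "Runtime (RDTSC) [s]" per run, copied from the likwid tables.
RUNTIME_S = {
    "Debug":   {1: 30.8271, 2: 17.7828, 4: 9.0053, 8: 4.6783},
    "Release": {1: 6.3306, 2: 3.7008, 4: 2.0822, 8: 1.4996},
}
# "Memory bandwidth [MBytes/s]" Sum column for the 4- and 8-process runs.
BANDWIDTH_MBS = {
    "Debug":   {4: 3988.2066, 8: 7883.8399},
    "Release": {4: 17193.6696, 8: 24972.3301},
}


def speedup(t1, tp):
    """Speedup of a p-process run relative to the 1-process run."""
    return t1 / tp


def parallel_efficiency(t1, tp, n_procs):
    """Speedup divided by the number of processes (1.0 is ideal)."""
    return speedup(t1, tp) / n_procs


def summarize(build):
    """Parallel efficiency for every process count measured for `build`."""
    t = RUNTIME_S[build]
    return {p: parallel_efficiency(t[1], t[p], p) for p in sorted(t)}


for build in RUNTIME_S:
    eff = summarize(build)
    bw = BANDWIDTH_MBS[build]
    print(build, {p: round(e, 3) for p, e in eff.items()},
          "bandwidth growth 4->8:", round(bw[8] / bw[4], 2))
```

With these numbers the Debug build keeps about 82% efficiency at 8 processes, while the Release build drops to about 53%, and its bandwidth grows by only about 1.45× when going from 4 to 8 processes.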
Author: zcp
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: zcp.