Some HPC notes excerpted from the web

Date: 2021-03-21 19:21:29

Source: https://computing.llnl.gov

Factors that determine a large-scale program's performance

* Application-related factors:
    * Algorithms
    * Dataset size
    * Memory usage pattern
    * Use of I/O
    * Communication patterns
    * Task granularity
    * Load balancing
    * Amdahl's Law

* Hardware factors:
    * Processor architecture
    * Memory hierarchy
    * I/O configuration
    * Network

* Software factors:
    * OS
    * Compiler
    * Preprocessor
    * Communication protocols
    * Libraries
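Amdahl's Law, listed among the application factors, is easy to check numerically. The sketch below assumes the standard form, speedup = 1 / ((1 - p) + p / n), where p is the fraction of runtime that can be parallelized and n is the number of processors. The function name `amdahl_speedup` and the 95% figure are illustrative choices, not taken from the source.

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Speedup predicted by Amdahl's Law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_procs < 1:
        raise ValueError("n_procs must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


# Illustrative assumption: 95% of the runtime is parallelizable.
for n in (1, 8, 64, 1024):
    print(f"{n:5d} processors -> speedup {amdahl_speedup(0.95, n):.2f}")
```

With p = 0.95 the speedup can never exceed 1 / (1 - p) = 20, no matter how many processors are added, which is why the serial fraction matters so much at scale.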

Performance analysis:

  Timers, profiles, system statistics, memory tools
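As a minimal sketch of the first two tools (timers and profiles), the snippet below uses only the Python standard library: `time.perf_counter` for wall-clock timing and `cProfile`/`pstats` for a per-function profile. The helper names `time_call` and `profile_call` and the toy `work` function are hypothetical, chosen for illustration. Real HPC codes would more likely rely on MPI timers or a dedicated profiler.

```python
import cProfile
import io
import pstats
import time


def work(n):
    """Toy workload standing in for a real compute kernel."""
    return sum(i * i for i in range(n))


def time_call(fn, *args, repeats=5):
    """Return the best wall-clock time (seconds) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def profile_call(fn, *args, top=5):
    """Run fn under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn(*args)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


print(f"best of 5 runs: {time_call(work, 100_000):.6f} s")
print(profile_call(work, 100_000))
```

Taking the minimum over several runs reduces noise from other activity on the machine. The profile then shows where the time goes inside the call.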

Learn a bit about hardware architecture:

Intel Xeon 5500/5600

  4-core / 6-core
  2.4/2.8 GHz
  Cache
    L1 data: 32 KB, private
    L1 instruction: 32 KB, private
    L2: 256 KB, private
    L3: 8 MB / 12 MB, shared
  CPU-memory bandwidth: 32 GB/s

Intel Xeon E5-2670

  8-core, 2.6 GHz
  Cache
    L1 data: 32 KB, private
    L1 instruction: 32 KB, private
    L2: 256 KB, private
    L3: 20 MB, shared
  CPU-memory bandwidth: 51.2 GB/s

AMD processors

  2.2 GHz
  Cache
    L1 data: 64 KB (2-way)
    L1 instruction: 64 KB (2-way)
    L2: 512 KB, private
    L3: 2 MB, shared
  Direct Connect Architecture
    CPU-memory bandwidth: 10.7 GB/s per socket (Socket F)
    Socket-to-socket link bandwidth: 8 GB/s (2-way)
  4x InfiniBand interconnect
    * SDR: 1.25 GB/s
    * DDR: 2.5 GB/s
    * QDR: 5 GB/s
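The bandwidth figures in these hardware notes can be sanity-checked with simple arithmetic. The sketch below assumes peak memory bandwidth = channels x transfer rate x 8 bytes per transfer, and raw InfiniBand rate = lanes x per-lane signaling rate / 8 bits. The channel counts and memory speeds (3 x DDR3-1333, 4 x DDR3-1600, 2 x DDR2-667) are assumptions about typical configurations, not stated in the source. The function names are illustrative.

```python
def peak_memory_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1000 MB)."""
    return channels * mega_transfers_per_s * bus_bytes / 1000.0


def infiniband_raw_gbs(lanes, lane_gbits_per_s):
    """Raw InfiniBand signaling rate in GB/s (before encoding overhead)."""
    return lanes * lane_gbits_per_s / 8.0


# Assumed memory configurations (not from the source notes).
memory_configs = {
    "Xeon 5500/5600 (3 x DDR3-1333)": (3, 1333),
    "Xeon E5-2670 (4 x DDR3-1600)": (4, 1600),
    "AMD Socket F (2 x DDR2-667)": (2, 667),
}
for name, (channels, rate) in memory_configs.items():
    print(f"{name}: {peak_memory_bandwidth_gbs(channels, rate):.1f} GB/s")

# Per-lane signaling rates in Gb/s for a 4x link.
for name, lane_rate in (("SDR", 2.5), ("DDR", 5.0), ("QDR", 10.0)):
    print(f"4x {name}: {infiniband_raw_gbs(4, lane_rate):.2f} GB/s")
```

These are theoretical peaks. Sustained bandwidth measured by a benchmark is lower, and the usable InfiniBand data rate is below the raw signaling rate because of link encoding overhead.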

Learn something about NUMA  

  - Physical layout: each node has several (2-4) sockets, and each socket has several (4-8) CPU cores. Cores on the same socket share the L3 cache; socket-to-socket communication goes over the CPU-memory bus and is roughly 2x to 5x slower.

  - Design considerations: CPU affinity (numactl --cpunodebind), local memory policy, and other compiler/runtime options (mpirun --bind-to-socket -bynode).
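As a small sketch of the affinity and local-memory advice, the helper below builds a `numactl` command line that binds a program's CPUs to one NUMA node (`--cpunodebind`) and asks for allocations on the local node (`--localalloc`). The helper name `numactl_command` and the application name `./my_app` are hypothetical. The code only constructs and prints the command, it does not run it.

```python
import shlex


def numactl_command(app_argv, node):
    """Build a numactl command that pins CPUs to one NUMA node
    and requests allocation on the local node."""
    if node < 0:
        raise ValueError("node must be non-negative")
    return ["numactl", f"--cpunodebind={node}", "--localalloc", *app_argv]


# Hypothetical application and arguments, for illustration only.
cmd = numactl_command(["./my_app", "--input", "data.bin"], node=0)
print(shlex.join(cmd))
```

On a machine with `numactl` installed, the resulting list could be passed to `subprocess.run`. Running `numactl --hardware` first shows which nodes and cores actually exist.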

Finally, and most importantly: use a good algorithm.