Hot Topics on Data Center (HotDC)

Time: 2024-01-22 08:48:41

Keynote Session

Accelerate Machine Intelligence: An Edge to Cloud Continuum

Hadi Esmaeilzadeh - UCSD

Background

open source: http://act-lab.org/artifacts

CoSMIC stack

how to distribute
  • understanding machine learning - solving an optimization problem
  • abstraction between algorithm and acceleration system - a parallelized stochastic gradient descent solver (targeting FPGA, GPU, ASIC, CGRA, Xeon Phi)
  • leverage linearity of differentiation for distributed learning
  • programming and compilation
    • build a new language for math
    • dataflow graph generation
how to design customizable accelerators
  • multi-threading acceleration
  • connectivity and bussing
  • PE architecture - make hardware simple
how to reduce overhead of distributed coordination
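The "leverage linearity of differentiation" bullet above is the key enabler for distributing the solver: the gradient of a sum of per-sample losses equals the sum of per-shard gradients, so each accelerator can compute a partial gradient independently and a coordinator simply adds them. A minimal sketch (illustrative Python, not the actual CoSMIC stack):

```python
def gradient(w, shard):
    """Gradient of the squared loss sum((w*x - y)^2) over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard)

def distributed_sgd_step(w, shards, lr=0.01):
    # Each "node" computes its partial gradient on its own shard...
    partials = [gradient(w, s) for s in shards]
    # ...and linearity of differentiation lets the coordinator just sum them.
    return w - lr * sum(partials)

def centralized_sgd_step(w, data, lr=0.01):
    # Reference: the same step computed on all data at once.
    return w - lr * gradient(w, data)
```

Both steps produce the same updated weight (up to floating-point rounding), which is what lets the partial-gradient work be farmed out to heterogeneous accelerators.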

specialized system software in CoSMIC

benchmarks
  • 16-node CoSMIC with UltraScale+ FPGAs offers an 18.8x speedup over 16-node Spark with E3 Skylake CPUs
  • the speedup comes from the FPGA (66%) and from software (34%)

RoboX Accelerator Architecture

DNNs tolerate low-bitwidth operations - bit-level
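The low-bitwidth observation can be made concrete with a uniform quantizer (an illustrative sketch, not RoboX's actual scheme): weights survive being mapped to small signed integers with little loss.

```python
def quantize(values, bits):
    """Uniform symmetric quantization to signed `bits`-bit integers."""
    qmax = (1 << (bits - 1)) - 1              # e.g. 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Map quantized integers back to approximate real values."""
    return [v * scale for v in q]
```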

Making Cloud Systems Reliable and Dependable: Challenges and Opportunities

Lidong Zhou - MSRA

Background

system reliability:

  • Fault Tolerance
  • Redundancies
  • State Machine Replication
  • Paxos
  • Erasure Coding

Real-World Gray Failures in Cloud

  • redundancies in data center networking
  • active device and link failure localization in data center
  • NetBouncer: Large-Scale path probing and diagnosis
  • NetBouncer: leverage the power of scale
  • root cause of gray failures - requests stuck due to network issues while heartbeats still look normal
  • Insight: detect the errors that requesters see
    • critical gray failures are observable
    • from error handling to error reporting
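The NetBouncer idea of localizing failures from path probes can be sketched as follows (intuition only; the real system infers per-link success probabilities from large-scale probing data):

```python
def localize_bad_links(probes):
    """probes: list of (path_links, ok) pairs from end-to-end probing.
    A path succeeds only if every link on it is healthy, so links on any
    successful path are exonerated; whatever remains on failed paths is
    the suspect set."""
    good = set()
    for links, ok in probes:
        if ok:
            good.update(links)
    suspects = set()
    for links, ok in probes:
        if not ok:
            suspects.update(l for l in links if l not in good)
    return suspects
```

This is how "leverage the power of scale" pays off: with enough probed paths, the suspect set shrinks to the actual faulty links.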

Solution - Panorama

  • Analysis - automatically convert a software component into an in-situ observer
  • Runtime - observers send observations to a local observation store (LOS)
    • locate ob-boundary
    • observations not always direct
    • observations split to ob-origin & ob-sink
    • match ob-origin & ob-sink
  • Detect what "requesters" see
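A toy rendition of this pipeline (class and method names like `LocalObservationStore.report` are illustrative, not Panorama's real API):

```python
from collections import defaultdict

class LocalObservationStore:
    """Minimal sketch: in-situ observers report what they see about the
    components they call; a subject is flagged unhealthy when recent
    requester-side observations are mostly negative, even if the
    subject's own heartbeats still look fine."""
    def __init__(self):
        self.obs = defaultdict(list)   # subject -> list of (observer, ok)

    def report(self, observer, subject, ok):
        self.obs[subject].append((observer, ok))

    def verdict(self, subject, window=5):
        recent = self.obs[subject][-window:]
        if not recent:
            return "unknown"
        bad = sum(1 for _, ok in recent if not ok)
        return "unhealthy" if bad > len(recent) / 2 else "healthy"
```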

Reliability of Large-Scale Distributed Systems

  • foundation reliability
  • rethink cloud reliability: new theory & new method
  • understand gray failure
  • systematic and comprehensive observations

paper: Gray Failure: The Achilles' Heel of Cloud-Scale Systems

Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!

Haibo Chen - SJTU

Background

  • (Distributed) Transactions were slow
  • High cost for distributed TX - usually tens to hundreds of thousands of TPS (SIGMOD'12)
  • only 4% of wall-clock time spent in useful data processing

new features:

  • RDMA: remote direct memory access
    • ultra-low latency (5us)
    • ultra-high throughput
  • NVM: Non-volatile memory

An Active Line of Research on RDMA-enabled TXs

  • DrTM - DrTM(SOSP 2015) DrTM-R(EuroSys 2016) DrTM-B(USENIX ATC 2017)
  • FaRM - FaRM-KV(NSDI 2014) FaRM-TX(SOSP 2015)
  • FaSST(OSDI 2016)
  • LITE(SOSP 2017)

Transaction(TX)s

  • protocols - OCC, 2PL, SI, ...
  • implementations on hardware devices - CX3, CX4, CX5, RoCE; one-sided, two-sided, ...
  • OLTP workloads - TPC-C, TPC-E, TATP, SmallBank

Main: Use RDMA in TXs

outline:

  • RDMA primitive-level analysis
  • Phase-by-phase analysis for TX
  • DrTM+H: Putting it all together

content:

  • phase: Exe/Val/Log/Commit
  • offloading with one-sided primitives improves performance
  • one-sided primitives have good scalability on modern RNICs
  • Execution framework & DrTM+H: https://github.com/SJTU-IPADS/drtmh
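The four phases can be sketched with a local stand-in (in the real DrTM+H each phase is issued over one-sided or two-sided RDMA primitives; this Python version shows only the OCC control flow):

```python
class Record:
    def __init__(self, value):
        self.value, self.version = value, 0

def run_tx(store, reads, writes):
    """Sketch of the Exe/Val/Log/Commit phases of an OCC transaction."""
    # Execution: read values and remember the versions seen.
    read_set = {k: store[k].version for k in reads}
    buffered = dict(writes)                       # writes stay local
    # Validation: abort if any read record changed underneath us.
    if any(store[k].version != v for k, v in read_set.items()):
        return False
    # Logging: (elided) persist a redo log before making writes visible.
    # Commit: install buffered writes and bump versions.
    for k, val in buffered.items():
        store[k].value = val
        store[k].version += 1
    return True
```

The talk's point is that each of these phases can independently use one-sided or two-sided verbs, and the best hybrid beats any single choice.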

RDMA in Data Centers: from Cloud Computing to Machine Learning

Chuanxiong Guo - ByteDance

Background

  • Data Center Networks (DCNs) offer lots of services
    • single ownership
    • large scale
    • bisection bandwidth
  • TCP/IP does not work well
    • latency
    • bandwidth
    • processing overhead (at 40G) - 12% CPU at the receiver & 6% CPU at the sender

RDMA over Commodity Ethernet (RoCEv2)

  • no CPU overhead
  • single QP: 88Gb/s at 1.7% CPU usage (TCP with 8 connections: 30-50Gb/s, client 2.6% & server 4.3% CPU)
  • RoCEv2 needs a lossless ethernet network
    • PFC (Priority-based Flow Control) - hop-by-hop flow control
    • DCQCN - sender-switch-receiver (RP-CP-NP)
  • the slow-receiver symptom - ToR to NIC is 40Gb/s & NIC to server is 64Gb/s; the NIC may generate a large number of PFC pause frames
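The DCQCN rate logic at the sender (reaction point) can be sketched as follows (constants and stages simplified from the real algorithm, which also has byte/timer counters and additive/hyper increase phases):

```python
class DCQCNSender:
    """Toy sketch of the DCQCN reaction-point (RP) rate logic."""
    def __init__(self, line_rate):
        self.rate = self.target = line_rate
        self.alpha = 1.0          # estimated congestion fraction
        self.g = 1.0 / 16         # EWMA gain

    def on_cnp(self):
        # Congestion notification from the NP: remember current rate
        # as the recovery target, then cut proportionally to alpha.
        self.target = self.rate
        self.rate *= (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_timer_no_cnp(self):
        # No congestion seen: decay alpha and recover toward target.
        self.alpha = (1 - self.g) * self.alpha
        self.rate = (self.target + self.rate) / 2
```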

RDMA for DNN Training Acceleration

  • understanding DNN usage
  • DNN training: backpropagation (BP)
  • distributed ML training on GPUs with mini-batches
  • RDMA acceleration: ResNet / RNNs / DNNs (RDMA performance is better than TCP)

Highlighted Research Session

Congestion Control Mechanisms in Data Center Networks

Wei Bai - MSRA

Achieving low latency in DCNs

  • queueing delay - PIAS (NSDI 2015)
  • packet-loss retransmission delay - TLT

PIAS

  • Flow Completion Time (FCT) is the key metric
  • no assumption that flow information is known in advance; can be deployed quickly on existing hardware
  • PIAS performs Multi-Level Feedback Queue (MLFQ) scheduling to emulate Shortest Job First (SJF)
  • three components in PIAS:
    • packet tagging
    • switch
    • rate control
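The packet-tagging component can be sketched as a threshold walk (thresholds here are illustrative; PIAS derives them from the workload):

```python
def pias_tag(bytes_sent, thresholds):
    """End-host packet tagging sketch: demote a flow to a lower priority
    once its bytes-sent crosses each threshold, so short flows finish in
    high-priority queues - emulating SJF without knowing flow sizes in
    advance."""
    for prio, limit in enumerate(thresholds):
        if bytes_sent < limit:
            return prio            # 0 = highest priority
    return len(thresholds)         # beyond all thresholds: lowest priority
```

The switch then just does strict priority queueing on the tags, which is why PIAS runs on existing hardware.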

TLT

  • achieves the benefits of both lossy and lossless networks
  • uses PFC to eliminate congestion packet losses
  • packet loss:
    • middle - fast retransmissions
    • tail - timeout retransmissions
    • identify important packets; when the switch queue exceeds a threshold, drop only non-important packets
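The last bullet's drop policy can be sketched as (an intuition-level model, not TLT's actual switch logic):

```python
def enqueue(queue, packet, threshold):
    """TLT intuition sketch: `packet` is a (payload, important) pair.
    When the queue exceeds the threshold, drop only unimportant packets,
    so the losses that would trigger timeout retransmissions (tail
    losses) are avoided while congestion is still relieved."""
    if len(queue) >= threshold and not packet[1]:
        return False               # unimportant packet dropped
    queue.append(packet)           # important packets always get through
    return True
```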

Understanding the challenges of Scaling Distributed DNN Training

Cheng Li - USTC

  • Deep learning is growing fast
  • DNN - Deep Neural Networks
  • benefit: more data / bigger models / more computation
  • Jeff Dean - Google

Distributed DNN

  • Model or data parallelism
    • data parallelism is a primary choice
  • BSP / ASP - BSP is the choice (ASP may not converge)
    • Bulk Synchronous Parallel - synchronize at fixed points
    • Asynchronous Parallel
  • network / server / other bottlenecks for parallelism
  • determine the constraints on computing capability via measurement
    • compression overhead introduced by compressed data transfer
  • system design
    • elastic system design
    • the weakest link determines the final computation speed
    • how to resize the system quickly - message-bus stream processing with a producer-consumer model
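The BSP choice above can be sketched as a synchronous update step (a toy model: `worker_grads` stands for the gradients gathered at the barrier):

```python
def bsp_step(weights, worker_grads, lr):
    """Bulk Synchronous Parallel step sketch: every worker finishes its
    mini-batch, gradients are synchronized (here a simple average, i.e.
    the all-reduce result), then all workers apply the same update -
    so every replica stays identical, unlike ASP where replicas drift
    and convergence may suffer."""
    avg = [sum(g) / len(worker_grads) for g in zip(*worker_grads)]
    return [w - lr * g for w, g in zip(weights, avg)]
```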

Octopus: an RDMA-enabled Distributed Persistent Memory File System

Youyou Lu - Tsinghua

  • distributed file system design
  • non-volatile memory - storage in memory
  • DRAM limitations
    • cell density
    • refresh - performance/power
  • NVDIMM - retains data across power loss
  • Intel 3D XPoint - near-DRAM latency, high capacity, non-volatile across power loss
  • RDMA - used in high-performance environments
  • DiskGluster - latency comes from the HDD | MemGluster - latency comes from software
  • RDMA-enabled Distributed File System
    • shared data management
    • new data flow strategies
    • efficient RPC design
    • concurrency control

Design

  • I/O handling
    • organize all NVMM into a single shared space
    • reduce data copies in the DFS (from 7 down to 4)
    • the server looks up the data's storage address; the client then fetches the data itself (shifting the work to the client)
  • Metadata RPC
  • Collect-Dispatch Distributed Transaction
  • performance evaluation
    • LAN server tests - bandwidth reaches 88% of the network bandwidth
    • tested on the Hadoop platform
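The client-active I/O idea (the server returns only the address; the client fetches the data itself) can be sketched with a dict standing in for the shared NVMM pool and a function standing in for a one-sided RDMA read (all names illustrative):

```python
NVMM = {}  # stand-in for the shared persistent-memory pool: address -> bytes

def server_lookup(metadata, filename):
    """Server-side work is only a metadata lookup: where does the data live?"""
    return metadata[filename]

def rdma_read(address):
    """Stand-in for a one-sided RDMA read: no server CPU involved."""
    return NVMM[address]

def client_read(metadata, filename):
    """Client fetches the data itself after learning its address,
    shifting the transfer work from the server to its many clients."""
    addr = server_lookup(metadata, filename)
    return rdma_read(addr)
```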

Short Talk

Computer Organization and Design Course with FPGA Cloud, Ke Zhang (ICT, CAS)

New technologies: AI / IoT
Improving hardware-software co-design skills - CPU / GPU / FPGA / ASIC
ZyForce platform - virtual FPGA experiments

ActionFlow: A Framework for Fast Multi-Robots Application Development, Jimin Han (UCAS)

UCAS senior undergraduate - started in August 2018
Rapid development of robot applications

Labeled Network Stack, Yifan Shen (ICT, CAS)

Caching or Not: Rethinking Virtual File System for Non-Volatile Main Memory, Ying Wang (ICT, CAS)

Data Motif-based Proxy Benchmarks for Big Data and AI Workloads, Chen Zheng (ICT, CAS)