Keynote Session
Accelerate Machine Intelligence: An Edge to Cloud Continuum
Hadi Esmaeilzadeh - UCSD
Background
open source: http://act-lab.org/artifacts
Data grows at an unprecedented rate
new landscape of computing: personalized and targeted experiences for users
growing gap between data and compute
power/energy efficiency is a primary concern
approximate computing
machines learn to extract insights from data - two disjoint solutions for ML
distributed computing + FPGA / ASIC chips
the full stack should not require normal users to write VHDL / Verilog
CoSMIC stack
how to distribute
- understanding machine learning - solving an optimization problem
- abstraction between algorithm and acceleration system - parallelized stochastic gradient descent solver (targeting FPGA, GPU, ASIC, CGRA, Xeon Phi)
- leverage the linearity of differentiation for distributed learning (see the sketch after this list)
- programming and compilation
- build a new language for math
- dataflow graph generation
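A minimal sketch of the linearity property above, assuming a simple least-squares loss: because differentiation is linear, per-partition gradients computed on separate nodes sum to the full-dataset gradient. All names and the partitioning scheme here are illustrative, not CoSMIC's actual templates.

```python
import numpy as np

# Because differentiation is linear, the gradient of a sum of per-example
# losses equals the sum of per-partition gradients, so each node can
# compute its share locally and the results are simply added up.

def gradient(w, X, y):
    """Least-squares gradient on one data partition: X^T (Xw - y)."""
    return X.T @ (X @ w - y)

def distributed_sgd_step(w, partitions, lr=0.005):
    # Each simulated "node" computes a partial gradient on its partition;
    # the sum equals the gradient over the whole dataset.
    total_grad = sum(gradient(w, X, y) for X, y in partitions)
    return w - lr * total_grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 4)), rng.normal(size=100)
partitions = [(X[i::4], y[i::4]) for i in range(4)]  # 4 simulated nodes
w = np.zeros(4)
for _ in range(100):
    w = distributed_sgd_step(w, partitions)
```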
how to design a customizable accelerator
- multi-threading acceleration
- connectivity and bussing
- PE architecture - make hardware simple
how to reduce the overhead of distributed coordination
specialized system software in CoSMIC
benchmarks
- 16-node CoSMIC with UltraScale+ FPGAs offers 18.8x speedup over 16-node Spark with E3 Skylake CPUs
- the speedup comes from the FPGA (66%) and the software stack (34%)
RoboX Accelerator Architecture
DNNs tolerate low-bitwidth operations - bit-level acceleration (quantization sketch below)
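A hedged sketch of the low-bitwidth tolerance claim: uniform symmetric quantization to k bits keeps weights close to their full-precision values. The scheme below is a generic illustration, not the accelerator's actual datapath.

```python
import numpy as np

# Quantize values to k-bit signed integers with a shared scale factor;
# DNN inference typically tolerates the resulting small error.

def quantize(x, bits):
    levels = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed
    scale = np.max(np.abs(x)) / levels
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q * scale

w = np.random.default_rng(1).normal(size=1000)
q, s = quantize(w, bits=4)
err = np.abs(w - dequantize(q, s)).mean()  # small relative to |w|
print(f"mean abs quantization error: {err:.4f}")
```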
Making Cloud Systems Reliable and Dependable: Challenges and Opportunities
Lidong Zhou - MSRA
Background
system reliability:
- Fault Tolerance
- Redundancies
- State Machine Replication
- Paxos
- Erasure Coding (toy parity example after this list)
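To make the erasure-coding item above concrete, here is a toy single-parity example (RAID-5 style): one XOR parity block lets any single lost data block be reconstructed; production systems use Reed-Solomon codes over larger groups.

```python
# A single XOR parity block over N data blocks tolerates one lost block.

def make_parity(blocks):
    parity = bytes(len(blocks[0]))
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return parity

def recover(surviving_blocks, parity):
    # XOR of the parity with all surviving blocks yields the missing one.
    missing = parity
    for b in surviving_blocks:
        missing = bytes(x ^ y for x, y in zip(missing, b))
    return missing

data = [b"aaaa", b"bbbb", b"cccc"]
p = make_parity(data)
assert recover([data[0], data[2]], p) == data[1]
```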
Real-World Gray Failures in Cloud
- redundancies in data center networking
- active device and link failure localization in data center
- NetBouncer: large-scale path probing and diagnosis
- NetBouncer: leverage the power of scale
- root cause of gray failures: requests get stuck (e.g., on a network issue) while heartbeats still look normal
- insight: detect the errors that requesters see
- critical gray failures are observable
- from error handling to error reporting
Solution - Panorama
- Analysis - automatically convert a software component into an in-situ observer
- Runtime - observers send observations to a local observation store (LOS)
- locate the observability boundary (ob-boundary)
- observations are not always direct
- split observations into ob-origin & ob-sink
- match ob-origin & ob-sink
- Detect what "requesters" see
- failures that matter are observable to requesters
- turn error handlers into error reporters (sketch after this list)
- enables construction of in-situ observers
- https://github.com/ryanphuang/panorama
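A hedged sketch of the "error handlers become error reporters" idea: at an observability boundary, the handler also emits a first-class observation to the local observation store. The class and method names below (LocalObservationStore, report, fetch_block) are illustrative, not Panorama's actual API.

```python
import time

class LocalObservationStore:
    """Illustrative stand-in for Panorama's LOS."""
    def __init__(self):
        self.observations = []

    def report(self, observer, subject, status, context):
        self.observations.append(
            {"time": time.time(), "observer": observer,
             "subject": subject, "status": status, "context": context})

LOS = LocalObservationStore()

def fetch_block(datanode, block_id):
    try:
        return datanode.read(block_id)   # hypothetical remote call
    except TimeoutError as e:
        # Before: the handler would silently retry or swallow the error.
        # After: it also reports what this requester observed, so gray
        # failures (alive heartbeat, stuck requests) become visible.
        LOS.report(observer="dfs-client", subject=datanode.name,
                   status="UNHEALTHY", context=str(e))
        raise
```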
Reliability of Large-Scale Distributed Systems
- reliability foundations
- rethink cloud reliability: new theory & new method
- understand gray failure
- systematic and comprehensive observations
paper: Gray Failure: The Achilles' Heel of Cloud-Scale Systems
Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!
Haibo Chen - SJTU
Background
- (Distributed) Transactions were slow
- High cost for distributed TX - usually 10s~100s of thousands of TPS (SIGMOD'12)
- only 4% of wall-clock time spent in useful data processing
new features:
- RDMA: remote direct memory access
- ultra-low latency (5us)
- ultra high throughput
- NVM: Non-volatile memory
An Active Line of Research on RDMA-enabled TX
- DrTM - DrTM(SOSP 2015) DrTM-R(EuroSys 2016) DrTM-B(USENIX ATC 2017)
- FaRM - FaRM-KV(NSDI 2014) FaRM-TX(SOSP 2015)
- FaSST(OSDI 2016)
- LITE(SOSP 2017)
Transactions (TXs)
- protocols - OCC, 2PL, SI, ...
- implementations on hardware devices - CX3, CX4, CX5, RoCE; one-sided, two-sided, ...
- OLTP workloads - TPC-C, TPC-E, TATP, SmallBank
Main topic: using RDMA in TXs
outline:
- RDMA primitive-level analysis
- Phase-by-phase analysis for TX
- DrTM+H: Putting it all together
content:
- TX phases: Execution / Validation / Logging / Commit
- offloading with one-sided primitives improves performance (see the sketch after this list)
- one-sided primitives have good scalability on modern RNICs
- Execution framework & DrTM+H: https://github.com/SJTU-IPADS/drtmh
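An illustrative sketch (not DrTM+H's actual code) of the "hybrid" takeaway: choose the best RDMA primitive per transaction phase instead of using one kind everywhere. The phase-to-primitive table below is hypothetical; the talk's point is that the optimal mapping depends on workload and RNIC.

```python
# "Hybrid" means: per phase, pick one-sided (remote-CPU-bypassing READ/
# WRITE/CAS) or two-sided (RPC over SEND/RECV) depending on what wins.

ONE_SIDED = "one-sided"   # bypasses the remote CPU entirely
TWO_SIDED = "two-sided"   # RPC, involves the remote CPU

# Hypothetical mapping for illustration only.
PRIMITIVE_FOR_PHASE = {
    "execution":  ONE_SIDED,   # offloaded reads scale well on modern RNICs
    "validation": ONE_SIDED,
    "logging":    ONE_SIDED,
    "commit":     TWO_SIDED,   # RPC can batch and apply updates efficiently
}

def run_phase(phase, op):
    primitive = PRIMITIVE_FOR_PHASE[phase]
    print(f"{phase}: issuing {op} via {primitive} primitive")

for phase in ("execution", "validation", "logging", "commit"):
    run_phase(phase, op="tx-42")
```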
RDMA in Data Centers: from Cloud Computing to Machine Learning
Chuanxiong Guo - ByteDance
Background
- Data Center Networks (DCNs) offer many services
- single ownership
- large scale
- bisection bandwidth
- TCP/IP not working well
- latency
- bandwidth
- processing overhead (40G) - 12% CPU at receiver & 6% CPU at sender
RDMA over Commodity Ethernet (RoCEv2)
- no CPU overhead
- a single QP reaches 88Gb/s at 1.7% CPU usage (TCP with 8 connections: 30-50Gb/s, 2.6% client & 4.3% server CPU)
- RoCEv2 needs a lossless Ethernet network
- PFC (priority-based flow control) - hop-by-hop flow control (a toy model follows this list)
- DCQCN - sender-switch-receiver (RP-CP-NP)
- the slow-receiver symptom - ToR to NIC is 40Gb/s & NIC to server is 64Gb/s, yet the NIC may still generate a large number of PFC pause frames
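A toy model of the PFC behavior above, assuming made-up XOFF/XON thresholds: when an ingress queue exceeds XOFF the switch pauses the upstream hop, and it resumes once the queue drains below XON.

```python
# Hop-by-hop PFC, simplified: PAUSE above XOFF, RESUME below XON.
# Real thresholds depend on buffer headroom and link bandwidth-delay.

XOFF_KB = 384   # assumed value for illustration
XON_KB = 256

class IngressQueue:
    def __init__(self):
        self.depth_kb = 0
        self.paused_upstream = False

    def enqueue(self, pkt_kb):
        self.depth_kb += pkt_kb
        if self.depth_kb > XOFF_KB and not self.paused_upstream:
            self.paused_upstream = True
            print("send PFC PAUSE upstream")

    def dequeue(self, pkt_kb):
        self.depth_kb = max(0, self.depth_kb - pkt_kb)
        if self.depth_kb < XON_KB and self.paused_upstream:
            self.paused_upstream = False
            print("send PFC RESUME upstream")
```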
RDMA for DNN Training Acceleration
- understanding and using DNNs
- DNN training: back-propagation (BP)
- distributed ML training on GPUs with mini-batches
- RDMA acceleration: ResNet / RNNs / DNNs (RDMA outperforms TCP)
Highlighted Research Session
Congestion Control Mechanisms in Data Center Networks
Wei Bai - MSRA
Achieving low latency in DCNs
- queueing delay - PIAS (NSDI 2015)
- packet loss & retransmission delay - TLT
PIAS
- Flow Completion Time (FCT) is the key metric
- flow information cannot be assumed known; must be quickly deployable on existing hardware
- PIAS performs Multi-Level Feedback Queue (MLFQ) scheduling to emulate Shortest Job First (SJF) - see the sketch after this list
- three functions in PIAS:
- packet tagging
- switch
- rate control
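A sketch of the packet-tagging function under assumed demotion thresholds: priorities demote as a flow sends more bytes, so short flows finish in high-priority queues and MLFQ approximates SJF without prior flow-size knowledge.

```python
# PIAS-style tagging: priority depends only on bytes the flow has
# already sent. Thresholds below are hypothetical illustrative values.

THRESHOLDS_KB = [100, 1024, 10240]

def tag_priority(bytes_sent):
    """Return a DSCP-like priority; 0 is the highest queue."""
    for prio, threshold in enumerate(THRESHOLDS_KB):
        if bytes_sent < threshold * 1024:
            return prio
    return len(THRESHOLDS_KB)   # lowest-priority queue

sent = 0
for pkt_size in [1500] * 100:
    prio = tag_priority(sent)   # switch does strict priority queueing on this
    sent += pkt_size
```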
TLT
- achieves the benefits of both lossy & lossless networks
- using PFC to eliminate congestion packet losses
- packet loss :
- middle - fast retransmissions
- tail - Timeout retransmissions
- identify important packets; when the switch queue exceeds a threshold, drop only unimportant packets (see the sketch below)
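A minimal sketch of that admission rule, with an illustrative queue threshold: important packets (whose loss would cause a timeout) are always admitted, while unimportant ones are dropped under congestion and recovered by fast retransmission.

```python
# TLT-style admission: protect packets whose loss would stall the flow
# with a timeout; shed the rest when the queue is congested.

QUEUE_THRESHOLD = 100   # assumed value for illustration

def admit(queue, pkt):
    if len(queue) < QUEUE_THRESHOLD or pkt["important"]:
        queue.append(pkt)   # important packets are never dropped here
        return True
    return False            # unimportant packet dropped under congestion
```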
Understanding the challenges of Scaling Distributed DNN Training
Cheng Li - USTC
- Deep Learning is growing fast
- DNN - Deep Neural Networks
- benefit: more data / bigger models / more computation
- Jeff Dean - Google
Distributed DNN
- Model or data parallelism
- data parallelism is a primary choice
- BSP / ASP - BSP is the common choice (ASP may fail to converge)
- Bulk Synchronous Parallel - synchronizes all workers at fixed points (see the sketch after this list)
- Asynchronous Parallel
- network / server / other bottlenecks for parallelism
- determine through testing which constraints limit compute capability
- compression overhead introduced by compressing data for transfer
- system design
- elastic system design
- weakest-link effect - ultimately constrains computation speed
- how to quickly resize the system - stream processing over a message bus - producer-consumer model
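A minimal BSP sketch using threads as stand-in workers: every worker computes its "gradient", waits at a barrier, and only then is the shared model updated, so all workers always see the same model version (unlike ASP). The gradient computation is a placeholder, not a real DNN.

```python
import threading
import numpy as np

N_WORKERS = 4
barrier = threading.Barrier(N_WORKERS)
model = np.zeros(8)
grads = [None] * N_WORKERS

def worker(rank, shard):
    global model
    for _ in range(10):
        grads[rank] = shard.mean(axis=0) - model   # stand-in "gradient"
        barrier.wait()                             # bulk-synchronous step
        if rank == 0:                              # one worker applies update
            model = model + 0.1 * sum(grads) / N_WORKERS
        barrier.wait()                             # all wait for the update

shards = [np.random.default_rng(i).normal(size=(32, 8))
          for i in range(N_WORKERS)]
threads = [threading.Thread(target=worker, args=(r, shards[r]))
           for r in range(N_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
```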
Octopus: an RDMA-enabled Distributed Persistent Memory File System
Youyou Lu - Tsinghua
- distributed file system design
- non-volatile memory - storage in memory
- DRAM Limitations
- Cell Density
- Refresh - performance/power
- NVDIMM - retains data across power loss
- Intel 3D XPoint - near-DRAM latency, high capacity, non-volatile across power loss
- RDMA - used in high-performance environments
- DiskGluster - latency comes from the HDD | MemGluster - latency comes from software
- RDMA-enabled Distributed File System
- shared data management
- New data flow strategies
- Efficient RPC design
- Concurrency control
Design
- I/O handling
- organize all NVMM into one shared space
- reduce data copies in the DFS (from 7 down to 4)
- the server looks up the data's storage address; the client then fetches the data itself using that address (shifting work to the client) - see the sketch after this list
- Metadata RPC
- Collect-Dispatch Distributed Transaction
- performance evaluation
- tested between servers on a LAN - throughput reaches 88% of the network bandwidth
- also tested on the Hadoop platform
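A hedged sketch of the client-active data flow described above: the server does only the cheap metadata lookup and returns an address in the shared NVMM space; the client then pulls the bytes itself (in the real system via a one-sided RDMA READ). All names here are illustrative.

```python
# Client-active I/O: server returns (offset, length); client reads the
# shared NVMM region directly, shifting the transfer work to the client.

shared_nvmm = bytearray(1 << 20)      # stand-in for the shared NVMM pool
file_index = {"/a.txt": (4096, 11)}   # path -> (offset, length)

def server_lookup(path):
    # Server does cheap metadata work only: find where the data lives.
    return file_index[path]

def client_read(path):
    offset, length = server_lookup(path)      # small RPC
    # Client pulls the data itself (one-sided RDMA READ in the real system).
    return bytes(shared_nvmm[offset:offset + length])

shared_nvmm[4096:4107] = b"hello world"
assert client_read("/a.txt") == b"hello world"
```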
Short Talk
Computer Organization and Design Course with FPGA Cloud, Ke Zhang (ICT, CAS)
New technologies: AI / IoT
Improve hardware-software co-design skills - CPU / GPU / FPGA / ASIC
ZyForce platform - virtual FPGA labs
ActionFlow:A Framework for Fast Multi-Robots Application Development, Jimin Han (UCAS)
Senior undergraduate at UCAS - project started August 2018
Rapid development of multi-robot applications