Hot Topics on Data Center (HotDC)

Time: 2024-01-22 08:48:41

Keynote Session

Accelerate Machine Intelligence: An Edge to Cloud Continuum

Hadi Esmaeilzadeh - UCSD

Background

open source: http://act-lab.org/artifacts

CoSMIC stack

how to distribute
  • understanding machine learning - solving an optimization problem
  • abstraction between algorithm and acceleration system - a parallelized stochastic gradient descent solver (targeting FPGA, GPU, ASIC, CGRA, Xeon Phi)
  • leverage linearity of differentiation for distributed learning
  • programming and compilation
    • build a new language for math
    • dataflow graph generation
how to design customizable accelerators
  • multi-threading acceleration
  • connectivity and bussing
  • PE architecture - make hardware simple
how to reduce overhead of distributed coordination
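The "leverage linearity of differentiation" bullet above is the key enabler for distributing the solver: the gradient of a sum of per-sample losses equals the sum of per-shard gradients, so each accelerator can compute a partial gradient independently and a coordinator simply adds them. A minimal sketch (illustrative Python, not the actual CoSMIC stack):

```python
def gradient(w, shard):
    """Gradient of the squared loss sum((w*x - y)^2) over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard)

def distributed_sgd_step(w, shards, lr=0.01):
    # Each "node" computes its partial gradient on its own shard...
    partials = [gradient(w, s) for s in shards]
    # ...and linearity of differentiation lets the coordinator just sum them.
    return w - lr * sum(partials)

def centralized_sgd_step(w, data, lr=0.01):
    # Reference: the same step computed on all data at once.
    return w - lr * gradient(w, data)
```

Both steps produce the same updated weight (up to floating-point rounding), which is what lets the partial-gradient work be farmed out to heterogeneous accelerators.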

specialized system software in CoSMIC

benchmarks
  • 16-node CoSMIC with UltraScale+ FPGAs offers an 18.8x speedup over 16-node Spark with E3 Skylake CPUs
  • the speedup comes from the FPGA (66%) and from software (34%)

RoboX Accelerator Architecture

DNNs tolerate low-bitwidth operations - bit-level
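The low-bitwidth observation can be made concrete with a uniform quantizer (an illustrative sketch, not RoboX's actual scheme): weights survive being mapped to small signed integers with little loss.

```python
def quantize(values, bits):
    """Uniform symmetric quantization to signed `bits`-bit integers."""
    qmax = (1 << (bits - 1)) - 1              # e.g. 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Map quantized integers back to approximate real values."""
    return [v * scale for v in q]
```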

Making Cloud Systems Reliable and Dependable: Challenges and Opportunities

Lidong Zhou - MSRA

Background

system reliability:

  • Fault Tolerance
  • Redundancies
  • State Machine Replication
  • Paxos
  • Erasure Coding

Real-World Gray Failures in Cloud

  • redundancies in data center networking
  • active device and link failure localization in data center
  • NetBouncer: Large-Scale path probing and diagnosis
  • NetBouncer: leverage the power of scale
  • root cause of gray failures - requests stuck due to network issues while heartbeats still look normal
  • Insight: detect the errors that requesters see
    • critical gray failures are observable
    • from error handling to error reporting
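The NetBouncer idea of localizing failures from path probes can be sketched as follows (intuition only; the real system infers per-link success probabilities from large-scale probing data):

```python
def localize_bad_links(probes):
    """probes: list of (path_links, ok) pairs from end-to-end probing.
    A path succeeds only if every link on it is healthy, so links on any
    successful path are exonerated; whatever remains on failed paths is
    the suspect set."""
    good = set()
    for links, ok in probes:
        if ok:
            good.update(links)
    suspects = set()
    for links, ok in probes:
        if not ok:
            suspects.update(l for l in links if l not in good)
    return suspects
```

This is how "leverage the power of scale" pays off: with enough probed paths, the suspect set shrinks to the actual faulty links.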

Solution - Panorama

  • Analysis - automatically convert a software component into an in-situ observer
  • Runtime - observers send observations to a local observation store (LOS)
    • locate ob-boundary
    • observations not always direct
    • observations split to ob-origin & ob-sink
    • match ob-origin & ob-sink
  • Detect what "requesters" see
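A toy rendition of this pipeline (class and method names like `LocalObservationStore.report` are illustrative, not Panorama's real API):

```python
from collections import defaultdict

class LocalObservationStore:
    """Minimal sketch: in-situ observers report what they see about the
    components they call; a subject is flagged unhealthy when recent
    requester-side observations are mostly negative, even if the
    subject's own heartbeats still look fine."""
    def __init__(self):
        self.obs = defaultdict(list)   # subject -> list of (observer, ok)

    def report(self, observer, subject, ok):
        self.obs[subject].append((observer, ok))

    def verdict(self, subject, window=5):
        recent = self.obs[subject][-window:]
        if not recent:
            return "unknown"
        bad = sum(1 for _, ok in recent if not ok)
        return "unhealthy" if bad > len(recent) / 2 else "healthy"
```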

Reliability of Large-Scale Distributed Systems

  • foundation reliability
  • rethink cloud reliability: new theory & new method
  • understand gray failure
  • systematic and comprehensive observations

paper: Gray Failure: The Achilles' Heel of Cloud-Scale Systems

Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!

Haibo Chen - SJTU

Background

  • (Distributed) Transactions were slow
  • High cost for distributed TX - usually tens to hundreds of thousands of TPS (SIGMOD'12)
  • only 4% of wall-clock time spent in useful data processing

new features:

  • RDMA: remote direct memory access
    • ultra-low latency (5us)
    • ultra-high throughput
  • NVM: Non-volatile memory

An Active Line of Research on RDMA-enabled TXs

  • DrTM - DrTM(SOSP 2015) DrTM-R(EuroSys 2016) DrTM-B(USENIX ATC 2017)
  • FaRM - FaRM-KV(NSDI 2014) FaRM-TX(SOSP 2015)
  • FaSST(OSDI 2016)
  • LITE(SOSP 2017)

Transaction(TX)s

  • protocols - OCC, 2PL, SI, ...
  • implementations on hardware devices - CX3, CX4, CX5, RoCE; one-sided, two-sided, ...
  • OLTP workloads - TPC-C, TPC-E, TATP, SmallBank

Main: Use RDMA in TXs

outline:

  • RDMA primitive-level analysis
  • Phase-by-phase analysis for TX
  • DrTM+H: Putting it all together

content:

  • phase: Exe/Val/Log/Commit
  • offloading with one-sided primitives improves performance
  • one-sided primitives have good scalability on modern RNICs
  • Execution framework & DrTM+H: https://github.com/SJTU-IPADS/drtmh
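The four phases can be sketched with a local stand-in (in the real DrTM+H each phase is issued over one-sided or two-sided RDMA primitives; this Python version shows only the OCC control flow):

```python
class Record:
    def __init__(self, value):
        self.value, self.version = value, 0

def run_tx(store, reads, writes):
    """Sketch of the Exe/Val/Log/Commit phases of an OCC transaction."""
    # Execution: read values and remember the versions seen.
    read_set = {k: store[k].version for k in reads}
    buffered = dict(writes)                       # writes stay local
    # Validation: abort if any read record changed underneath us.
    if any(store[k].version != v for k, v in read_set.items()):
        return False
    # Logging: (elided) persist a redo log before making writes visible.
    # Commit: install buffered writes and bump versions.
    for k, val in buffered.items():
        store[k].value = val
        store[k].version += 1
    return True
```

The talk's point is that each of these phases can independently use one-sided or two-sided verbs, and the best hybrid beats any single choice.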

RDMA in Data Centers: from Cloud Computing to Machine Learning

Chuanxiong Guo - ByteDance

Background

  • Data Center Networks (DCNs) offer lots of services
    • single ownership
    • large scale
    • bisection bandwidth
  • TCP/IP does not work well
    • latency
    • bandwidth
    • processing overhead (at 40G) - 12% CPU at the receiver & 6% CPU at the sender

RDMA over Commodity Ethernet (RoCEv2)

  • no CPU overhead
  • single QP: 88Gb/s at 1.7% CPU usage (TCP with 8 connections: 30-50Gb/s, client 2.6% & server 4.3% CPU)
  • RoCEv2 needs a lossless ethernet network
    • PFC (Priority-based Flow Control) - hop-by-hop flow control
    • DCQCN - sender-switch-receiver (RP-CP-NP)
  • the slow-receiver symptom - ToR to NIC is 40Gb/s & NIC to server is 64Gb/s; the NIC may generate a large number of PFC pause frames
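The DCQCN rate logic at the sender (reaction point) can be sketched as follows (constants and stages simplified from the real algorithm, which also has byte/timer counters and additive/hyper increase phases):

```python
class DCQCNSender:
    """Toy sketch of the DCQCN reaction-point (RP) rate logic."""
    def __init__(self, line_rate):
        self.rate = self.target = line_rate
        self.alpha = 1.0          # estimated congestion fraction
        self.g = 1.0 / 16         # EWMA gain

    def on_cnp(self):
        # Congestion notification from the NP: remember current rate
        # as the recovery target, then cut proportionally to alpha.
        self.target = self.rate
        self.rate *= (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_timer_no_cnp(self):
        # No congestion seen: decay alpha and recover toward target.
        self.alpha = (1 - self.g) * self.alpha
        self.rate = (self.target + self.rate) / 2
```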

RDMA for DNN Training Acceleration

  • understanding DNN usage
  • DNN training: backpropagation (BP)
  • distributed ML training on GPUs with mini-batches
  • RDMA acceleration: ResNet / RNNs / DNNs (RDMA performance is better than TCP)

Highlighted Research Session

Congestion Control Mechanisms in Data Center Networks

Wei Bai - MSRA

Achieving low latency in DCNs

  • queueing delay - PIAS (NSDI 2015)
  • packet-loss retransmission delay - TLT

PIAS

  • Flow Completion Time (FCT) is the key metric
  • no assumption that flow information is known in advance; can be deployed quickly on existing hardware
  • PIAS performs Multi-Level Feedback Queue (MLFQ) scheduling to emulate Shortest Job First (SJF)
  • three components in PIAS:
    • packet tagging
    • switch
    • rate control
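The packet-tagging component can be sketched as a threshold walk (thresholds here are illustrative; PIAS derives them from the workload):

```python
def pias_tag(bytes_sent, thresholds):
    """End-host packet tagging sketch: demote a flow to a lower priority
    once its bytes-sent crosses each threshold, so short flows finish in
    high-priority queues - emulating SJF without knowing flow sizes in
    advance."""
    for prio, limit in enumerate(thresholds):
        if bytes_sent < limit:
            return prio            # 0 = highest priority
    return len(thresholds)         # beyond all thresholds: lowest priority
```

The switch then just does strict priority queueing on the tags, which is why PIAS runs on existing hardware.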

TLT

  • achieves the benefits of both lossy and lossless networks
  • uses PFC to eliminate congestion packet losses
  • packet loss:
    • middle - fast retransmissions
    • tail - timeout retransmissions
    • identify important packets; when the switch queue exceeds a threshold, drop only non-important packets
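The last bullet's drop policy can be sketched as (an intuition-level model, not TLT's actual switch logic):

```python
def enqueue(queue, packet, threshold):
    """TLT intuition sketch: `packet` is a (payload, important) pair.
    When the queue exceeds the threshold, drop only unimportant packets,
    so the losses that would trigger timeout retransmissions (tail
    losses) are avoided while congestion is still relieved."""
    if len(queue) >= threshold and not packet[1]:
        return False               # unimportant packet dropped
    queue.append(packet)           # important packets always get through
    return True
```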

Understanding the challenges of Scaling Distributed DNN Training

Cheng Li - USTC

  • Deep learning is growing fast
  • DNN - Deep Neural Networks
  • benefit: more data / bigger models / more computation
  • Jeff Dean - Google

Distributed DNN

  • Model or data parallelism
    • data parallelism is a primary choice
  • BSP / ASP - BSP is the choice (ASP may not converge)
    • Bulk Synchronous Parallel - synchronize at fixed points
    • Asynchronous Parallel
  • network / server / other bottlenecks for parallelism
  • determine the constraints on computing capability via measurement
    • compression overhead introduced by compressed data transfer
  • system design
    • elastic system design
    • the weakest link determines the final computation speed
    • how to resize the system quickly - message-bus stream processing with a producer-consumer model
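The BSP choice above can be sketched as a synchronous update step (a toy model: `worker_grads` stands for the gradients gathered at the barrier):

```python
def bsp_step(weights, worker_grads, lr):
    """Bulk Synchronous Parallel step sketch: every worker finishes its
    mini-batch, gradients are synchronized (here a simple average, i.e.
    the all-reduce result), then all workers apply the same update -
    so every replica stays identical, unlike ASP where replicas drift
    and convergence may suffer."""
    avg = [sum(g) / len(worker_grads) for g in zip(*worker_grads)]
    return [w - lr * g for w, g in zip(weights, avg)]
```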

Octopus: an RDMA-enabled Distributed Persistent Memory File System

Youyou Lu - Tsinghua

  • distributed file system design
  • non-volatile memory - storage in memory
  • DRAM limitations
    • cell density
    • refresh - performance/power
  • NVDIMM - retains data across power loss
  • Intel 3D XPoint - near-DRAM latency, high capacity, non-volatile across power loss
  • RDMA - used in high-performance environments
  • DiskGluster - latency comes from the HDD | MemGluster - latency comes from software
  • RDMA-enabled Distributed File System
    • shared data management
    • new data flow strategies
    • efficient RPC design
    • concurrency control

Design

  • I/O handling
    • organize all NVMM into a single shared space
    • reduce data copies in the DFS (from 7 down to 4)
    • the server looks up the data's storage address; the client then fetches the data itself (shifting the work to the client)
  • Metadata RPC
  • Collect-Dispatch Distributed Transaction
  • performance evaluation
    • LAN server tests - bandwidth reaches 88% of the network bandwidth
    • tested on the Hadoop platform
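The client-active I/O idea (the server returns only the address; the client fetches the data itself) can be sketched with a dict standing in for the shared NVMM pool and a function standing in for a one-sided RDMA read (all names illustrative):

```python
NVMM = {}  # stand-in for the shared persistent-memory pool: address -> bytes

def server_lookup(metadata, filename):
    """Server-side work is only a metadata lookup: where does the data live?"""
    return metadata[filename]

def rdma_read(address):
    """Stand-in for a one-sided RDMA read: no server CPU involved."""
    return NVMM[address]

def client_read(metadata, filename):
    """Client fetches the data itself after learning its address,
    shifting the transfer work from the server to its many clients."""
    addr = server_lookup(metadata, filename)
    return rdma_read(addr)
```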

Short Talk

Computer Organization and Design Course with FPGA Cloud, Ke Zhang (ICT, CAS)

New technologies: AI / IoT
Improving hardware-software co-design skills - CPU / GPU / FPGA / ASIC
ZyForce platform - virtual FPGA experiments

ActionFlow: A Framework for Fast Multi-Robots Application Development, Jimin Han (UCAS)

UCAS senior undergraduate - started in August 2018
Rapid development of robot applications

Labeled Network Stack, Yifan Shen (ICT, CAS)

Caching or Not: Rethinking Virtual File System for Non-Volatile Main Memory, Ying Wang (ICT, CAS)

Data Motif-based Proxy Benchmarks for Big Data and AI Workloads, Chen Zheng (ICT, CAS)