How To Size Your Apache Flink® Cluster: A Back-of-the-Envelope Calculation

January 11, 2018 - Apache Flink

A favorite session from Flink Forward Berlin 2017 was Robert Metzger’s “Keep It Going: How to Reliably and Efficiently Operate Apache Flink”. One of the topics that Robert touches on is how to roughly size an Apache Flink cluster. Flink Forward attendees mentioned that his cluster sizing guidelines were helpful to them, and so we’ve converted that section of his talk into a blog post. Enjoy!

One of the most frequently asked questions in the Flink community is how to size a cluster when moving from development to production. The definitive answer to this question is, of course, “it depends,” but that’s not a helpful answer. This post outlines a series of questions to ask to arrive at some numbers you can use as guidance.

Do the Math and Establish a Baseline

The first step is to think through your application’s operational metrics to arrive at a baseline of required resources.

The key metrics to consider are:

The number of records per second and the size per record
The number of distinct keys you have and the state size per key
The number of state updates and the access patterns of your state backend

Finally, a more pragmatic concern is your service-level agreements (SLAs) around downtime, latency, and max throughput with your customers as these directly influence your capacity planning.

Next, look at what resources you have available based on your budget. For example:

The network capacity, taking into account any external services that also use the network, such as Kafka, HDFS, etc.
Your disk bandwidth, if you are relying on a disk-based state backend like RocksDB (and considering other disk use like Kafka or HDFS)
The number of machines and the CPU and memory they have available

Based on all these factors, you can now build a baseline for normal operation, plus a buffer of resources used for recovery catch-up or to handle load spikes. I recommend you also consider the resources used during checkpointing when establishing the baseline.

Example: Let’s run some numbers

I will now plan a job deployment on a hypothetical cluster to visualize the process of establishing a resource usage baseline. These numbers are rough “back-of-the-envelope” values, and they’re not comprehensive–at the end of the post, I’ll also identify some of the aspects that I ignored while making this calculation.

Example Flink Streaming Job and Hardware

Example Flink Streaming job topology

For this example, I am going to deploy a typical Flink streaming job that reads data from a Kafka topic using Flink’s Kafka consumer. The stream is then transformed using a keyed, aggregating window operator. The window operator performs aggregations on time windows of 5 minutes. As there is always fresh data, I’ll configure the window to be a sliding window with a 1-minute slide.

This means I’ll get the aggregates for the past 5 minutes updated every minute. The streaming job creates an aggregate per userId. The messages consumed from the Kafka topic have a size (on average) of 2 KB.

The throughput is 1 million messages per second. To understand the state size of the window operator, you need to know the number of distinct keys. In this case, it’s the number of userIds, which is 500,000,000 unique users. For each user, you are computing four numbers, stored as longs (8 bytes).

Let’s summarize the job’s key metrics:

Message size: 2KB
Throughput: 1,000,000 msg/sec
Distinct keys: 500,000,000 (aggregation in window: 4 longs per key)
Checkpointing: Once every minute.
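
One way to keep these figures in one place while working through the rest of the calculation is a small parameter object. This is a minimal Python sketch: the class and field names are illustrative and not part of any Flink API, and decimal units (1 KB = 1,000 bytes) are assumed, matching the arithmetic used in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobMetrics:
    """Key metrics of the example job (names are illustrative only)."""
    message_size_bytes: int = 2_000        # 2 KB per message, decimal units
    messages_per_sec: int = 1_000_000      # throughput
    distinct_keys: int = 500_000_000       # unique userIds
    longs_per_key: int = 4                 # aggregate values kept per key
    checkpoint_interval_sec: int = 60      # one checkpoint per minute


metrics = JobMetrics()

# Size of the aggregate values alone: 4 longs of 8 bytes each.
aggregate_bytes_per_key = metrics.longs_per_key * 8
print(aggregate_bytes_per_key)  # 32
```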

Hypothetical Hardware Setup

There are five machines running the job, each running a Flink TaskManager (Flink’s worker nodes). Disks are network-attached (common in cloud setups), and there is a 10 Gigabit Ethernet connection from the main switch to each machine running a TaskManager. The Kafka brokers are running on separate machines.

Each machine has 16 CPU cores. For simplicity, I won’t consider CPU and memory requirements. In the real world, depending on your application logic and the state backend in use, you would need to pay attention to memory. This example uses a RocksDB-based state backend, which is robust and has low memory requirements.

A Single Machine’s Perspective

To understand the resource requirements of the whole job deployment, it’s easiest to focus on the operations in one machine and one TaskManager first. You can then use the numbers derived from one machine to calculate the overall resource requirements.

By default (if all operators have the same parallelism and there are no special scheduling restrictions), all operators of a streaming job are running on each machine.

In this case, the Kafka source (or consumer), window operator, and Kafka sink (or producer) are all running on each of the five machines.
A machine perspective – TaskManager n

keyBy is a separate operator in the figure above so that calculating the resource requirements is easier. In reality, keyBy is an API construct and translates into a configuration attribute for the connection between the Kafka source and window operator.

I will now go through each of these operators from top to bottom to understand their network resource requirements.

The Kafka source

To calculate the amount of data received by an individual Kafka source, first, compute the aggregate Kafka input. The sources receive 1,000,000 messages per second that are 2KB each.

2KB x 1,000,000/s = 2GB/s

Dividing 2GB/s by the number of machines (5) leads to the following result:

2GB/s ÷ 5 machines = 400MB/s

Each of the 5 Kafka sources running in the cluster receives data with an average throughput of 400 MB/s.

The Kafka source calculation
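
The Kafka source arithmetic can be reproduced in a few lines of Python. This is only a sketch of the calculation in the text; the variable names are made up for illustration, and it assumes decimal units and an even spread of Kafka partitions across the five machines.

```python
MESSAGE_SIZE_BYTES = 2_000      # 2 KB per message (decimal units)
MESSAGES_PER_SEC = 1_000_000
MACHINES = 5
MB = 1_000_000                  # bytes per megabyte

# Aggregate input from Kafka across the whole cluster.
total_ingest_mb_s = MESSAGE_SIZE_BYTES * MESSAGES_PER_SEC / MB

# Share read by the Kafka source on one machine, assuming an even split.
per_machine_ingest_mb_s = total_ingest_mb_s / MACHINES

print(f"cluster ingest: {total_ingest_mb_s:.0f} MB/s")        # 2000 MB/s
print(f"per machine:    {per_machine_ingest_mb_s:.0f} MB/s")  # 400 MB/s
```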

The Shuffle / keyBy

Next, you need to ensure that all events with the same key (in this case the userId) end up on the same machine. The data in the Kafka topic you are reading from might be partitioned according to a different partitioning scheme.

The shuffling process sends all data with the same key to one machine, so you are splitting the 400MB/s stream of data coming from Kafka into a userId-partitioned stream:

400MB/s ÷ 5 machines = 80MB/s

On average, you have to send 80 MB/s of data to each of the machines. This analysis is from the perspective of a single machine which means that some of the data is already on the designated target machine, so subtract 80MB/s to account for that:

400MB/s – 80MB/s = 320MB/s

Each machine receives and sends user data at a rate of 320MB/s.

The shuffle calculation
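
The shuffle estimate generalizes to any number of machines. The helper below is a hypothetical sketch rather than anything Flink provides, and it assumes keys are distributed uniformly, so that each machine keeps 1/n of its input locally and sends the rest over the network.

```python
def shuffle_traffic_mb_s(source_mb_s: float, machines: int) -> float:
    """Data one machine sends to (and, by symmetry, receives from) its peers."""
    local_share = source_mb_s / machines   # records whose key maps to this machine
    return source_mb_s - local_share


print(shuffle_traffic_mb_s(400, 5))  # 320.0
```

With a single machine the function returns 0, since nothing has to cross the network.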

Window Emit and Kafka Sink

The next question to ask is how much data the window operator emits and sends through to the Kafka sink. It’s 67MB/s; here is how to arrive at this number.

The window operator keeps an aggregate of 4 numbers (represented as longs) for each key. Once every minute, the operator emits the current aggregate values. Each key emits 2 ints (user_id, window_ts) and 4 longs from the aggregation:

(2 x 4 bytes) + (4 x 8 bytes) = 40 bytes per key

Then factor in the keys (500,000,000 divided by the number of machines):

100,000,000 keys x 40 bytes = 4GB

…from each machine.

Then calculate the per-second size:

4GB/min ÷ 60 = 67MB/s

…emitted by each TaskManager.

This means that each TaskManager emits on average 67 MB/s of user data from the window operators. Since there is a Kafka sink running on each TaskManager (next to the window operator), and there’s no further repartitioning, this is the amount of data emitted from Flink to Kafka.
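
The same window output estimate, written out as a short Python sketch. The names are illustrative, and the record layout (2 ints plus 4 longs, with no serialization overhead) is taken directly from the description above.

```python
INT_BYTES, LONG_BYTES = 4, 8
MACHINES = 5
DISTINCT_KEYS = 500_000_000
EMIT_INTERVAL_SEC = 60          # the window emits once per minute

# One emitted record: user_id and window_ts as ints, plus 4 long aggregates.
record_bytes = 2 * INT_BYTES + 4 * LONG_BYTES

keys_per_machine = DISTINCT_KEYS // MACHINES
emit_bytes_per_minute = keys_per_machine * record_bytes

# Average rate if the output were spread evenly over the minute.
emit_mb_s = emit_bytes_per_minute / EMIT_INTERVAL_SEC / 1_000_000

print(record_bytes)                                  # 40
print(emit_bytes_per_minute)                         # 4000000000
print(f"{emit_mb_s:.0f} MB/s per TaskManager")       # 67 MB/s per TaskManager
```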

User data: From Kafka, shuffled to the window operators and back to Kafka

The emission of data from the window operators is expected to be “bursty,” because they are emitting the data once every minute. In practice, the operator will not send data at a constant rate of 67 MB/s, but rather max out the available bandwidth for a few seconds every minute.
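
To get a feel for "a few seconds," the burst length can be bounded from below. The sketch assumes, optimistically, that the entire 10 Gigabit link (1250 MB/s) is free for the window output, which is not the case in practice because shuffle and state traffic share the link, so the real burst will last longer.

```python
LINK_MB_S = 10_000 / 8        # 10 Gbit/s expressed in MB/s
WINDOW_OUTPUT_MB = 4_000      # 4 GB emitted per machine every minute

# Lower bound on the burst duration at full line rate.
min_burst_seconds = WINDOW_OUTPUT_MB / LINK_MB_S

print(f"{min_burst_seconds:.1f} s")  # 3.2 s
```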

This all totals to:

Data in: 720MB/s (400 + 320) per machine
Data out: 387MB/s (320 + 67) per machine

State Access and Checkpointing

That’s not everything. So far, I’ve only looked at the user data that Flink is processing. You need to include the overhead from disk access to RocksDB for storing state and checkpointing. To understand the disk access costs, you look at how the window operator accesses state. The Kafka source also keeps some state, but it is negligible compared to the window operator.

To understand the state size of the window operator, look at it from a different angle. Flink is computing five-minute windows with a 1-minute slide. Flink implements sliding windows by maintaining five windows, one for each “slide.” As mentioned earlier, you maintain 40 bytes of state for each window and each key for the aggregations when using a window implementation which is performing an eager aggregation. For every incoming event, you first need to retrieve the current aggregation values from disk (read 40 bytes), update the aggregates, and then write the new value back (write 40 bytes).

Window State

This means:

40 bytes of state x 5 windows x 200,000 msg/s per machine = 40MB/s

…of read or write disk access per machine. As mentioned at the beginning, the disks are network-attached, so I need to add these numbers to the overall throughput calculations.

The totals are now:

Data in: 760MB/s (400 MB/s data in + 320 MB/s shuffle + 40 MB/s state)
Data out: 427MB/s (320 MB/s shuffle + 67 MB/s data out + 40 MB/s state)
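
A short Python sketch of the state access estimate and the updated totals. The variable names are illustrative; the 400, 320, and 67 MB/s inputs are the per-machine figures derived earlier in the post.

```python
STATE_BYTES = 40                  # aggregate state per key and window
WINDOWS = 5                       # 5-minute window with a 1-minute slide
MSGS_PER_SEC_PER_MACHINE = 1_000_000 // 5

# Every event reads and then rewrites the aggregate in each of its 5 windows,
# so reads and writes each amount to this many MB/s.
state_mb_s = STATE_BYTES * WINDOWS * MSGS_PER_SEC_PER_MACHINE / 1_000_000

data_in_mb_s = 400 + 320 + state_mb_s    # Kafka in + shuffle in + state reads
data_out_mb_s = 320 + 67 + state_mb_s    # shuffle out + Kafka out + state writes

print(state_mb_s)      # 40.0
print(data_in_mb_s)    # 760.0
print(data_out_mb_s)   # 427.0
```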

The above considerations are for the state access, which happens consistently as new events arrive at the window operator. You also have checkpointing enabled for fault-tolerance. If a machine or anything else fails, you want to restore your window contents and continue processing.

Checkpointing is set to an interval of one checkpoint per minute, and each checkpoint copies the entire state of the job into a network-attached file system.

Let’s quickly see how big the entire state on each machine is:

40 bytes of state x 5 windows x 100,000,000 keys = 20GB

And, to get the per-second value:

20GB/min ÷ 60 = 333MB/s

Similar to the window operator, checkpointing has a bursty pattern, and once every minute, it tries to send its data at full speed to external storage. Checkpointing causes additional state access to RocksDB (which in this example is located on network-attached disks). Since Flink 1.3, the RocksDB state backend supports incremental checkpointing, which reduces the required network transfer on each checkpoint by conceptually sending only the “diff” since the last checkpoint. This feature is not used in this example.

This updates the totals to:

Data in: 760MB/s (400 + 320 + 40)
Data out: 760MB/s (320 + 67 + 40 + 333)
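
The checkpoint contribution can be checked the same way. This sketch assumes full (non-incremental) checkpoints, as in the example, and averages the once-a-minute transfer over the whole interval; the names are illustrative.

```python
STATE_BYTES = 40
WINDOWS = 5
KEYS_PER_MACHINE = 100_000_000
CHECKPOINT_INTERVAL_SEC = 60

# Full window state held on one machine.
state_bytes_per_machine = STATE_BYTES * WINDOWS * KEYS_PER_MACHINE

# Average outbound rate if each checkpoint copies the entire state.
checkpoint_mb_s = state_bytes_per_machine / CHECKPOINT_INTERVAL_SEC / 1_000_000

# Shuffle out + Kafka out + state writes + checkpoint upload.
data_out_mb_s = 320 + 67 + 40 + round(checkpoint_mb_s)

print(state_bytes_per_machine)   # 20000000000
print(round(checkpoint_mb_s))    # 333
print(data_out_mb_s)             # 760
```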

This means that the overall network traffic is:

(760 + 760) x 5 + 400 + 2335 = 10335 MB/s

The 400 is the total of the 80MB/s of state access (40MB/s read plus 40MB/s write) across the 5 machines, and 2335 is the total of the Kafka input and output across the cluster.

That is just over half the available network capacity in the hardware setup above.
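
The cluster-wide sum can be reproduced term by term. In this sketch, splitting the Kafka total into 2000 MB/s of input plus 5 x 67 MB/s of output is my reading of the 2335 figure, not something the text spells out.

```python
MACHINES = 5
PER_MACHINE_IN = 760     # MB/s
PER_MACHINE_OUT = 760    # MB/s

# 40 MB/s of state reads plus 40 MB/s of state writes on each machine.
state_access_total = (40 + 40) * MACHINES

# Assumed breakdown: 2000 MB/s into the cluster, 67 MB/s out of each machine.
kafka_total = 2000 + 67 * MACHINES

overall_mb_s = (PER_MACHINE_IN + PER_MACHINE_OUT) * MACHINES + state_access_total + kafka_total

print(state_access_total)  # 400
print(kafka_total)         # 2335
print(overall_mb_s)        # 10335
```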

Networking requirements

There’s a disclaimer I’d like to add. None of these calculations include protocol overhead such as TCP, Ethernet, and RPC calls from Flink, Kafka, or the file system. This is still a good starting point to understand what sort of hardware you will need for a job and to have an indication of performance.

Scale Your Way

Based on my analysis of this example, in typical operation on a 5-node cluster, each machine would need to handle 760 MB/s of data, both in and out, from a total capacity of 1250 MB/s. That reserves about 40% of the network capacity for the complexities I glossed over, such as network protocol overheads, heavy load during event replay when recovering from a checkpoint, and uneven load balancing across the cluster caused by data skew.

There’s no one-size-fits-all answer to whether 40% is an appropriate amount of headroom, but this arithmetic should give you a good starting point. Try the calculations above, swapping out the number of machines, the number of keys, or the messages per second to get a selection of values to consider and then balance that with your budget and operational factors. Happy scaling!
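
To make swapping out the inputs easier, the whole per-machine estimate can be folded into one function. This is a rough Python sketch under the same simplifying assumptions as the post (uniform key distribution, full checkpoints, no protocol overhead, decimal units); the `Sizing` class and `per_machine_traffic` function are hypothetical names, not part of Flink.

```python
from dataclasses import dataclass


@dataclass
class Sizing:
    machines: int = 5
    messages_per_sec: int = 1_000_000
    message_bytes: int = 2_000
    distinct_keys: int = 500_000_000
    record_bytes: int = 40            # per key: emitted record and per-window state
    windows: int = 5                  # window length divided by slide
    emit_interval_sec: int = 60
    checkpoint_interval_sec: int = 60
    link_mb_s: float = 1250.0         # 10 Gigabit Ethernet


def per_machine_traffic(s: Sizing) -> dict:
    """Average per-machine network traffic in MB/s, plus remaining headroom."""
    mb = 1_000_000
    msgs = s.messages_per_sec / s.machines
    keys = s.distinct_keys / s.machines

    source = msgs * s.message_bytes / mb                 # Kafka input
    shuffle = source * (s.machines - 1) / s.machines     # keyBy traffic, each way
    emit = keys * s.record_bytes / s.emit_interval_sec / mb
    state = s.record_bytes * s.windows * msgs / mb       # reads; writes are equal
    checkpoint = s.record_bytes * s.windows * keys / s.checkpoint_interval_sec / mb

    data_in = source + shuffle + state
    data_out = shuffle + emit + state + checkpoint
    headroom = 1 - max(data_in, data_out) / s.link_mb_s
    return {"in": data_in, "out": data_out, "headroom": headroom}


baseline = per_machine_traffic(Sizing())
print(f"in {baseline['in']:.0f} MB/s, out {baseline['out']:.0f} MB/s, "
      f"headroom {baseline['headroom']:.0%}")

# Same job on twice as many machines.
doubled = per_machine_traffic(Sizing(machines=10))
print(f"in {doubled['in']:.0f} MB/s, out {doubled['out']:.0f} MB/s, "
      f"headroom {doubled['headroom']:.0%}")
```

With the default values this reproduces the roughly 760 MB/s in and out derived above; doubling the machine count brings both directions down to about 400 MB/s per machine.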

Tags: capacity planning, cluster sizing, resource planning, throughput
