An nmap port scan caused FullGC and even OOM across many online Java services

Date: 2023-03-09 23:57:36

Recently the company ran into a puzzling round of online FullGC alerts: nearly every instance of several services reported FullGC at the same time, and a few instances even hit OOM and were killed outright by Docker.

Looking through the logs of the alerting services, all of them were full of this line:

*TNonblockingServer [ERROR] Read a frame size of ****, which is bigger than the maximum allowable buffer size for ALL connections.

We noticed that the alerting services all showed a sudden spike in net-in traffic, yet the HTTP and RPC request counts over the same window did not increase.

By this point we already suspected that someone was calling the interface maliciously. The log pointed clearly at the RPC interface, and since the RPC port cannot be reached from the public internet, the requests had to come from inside the internal network.

It later turned out that someone inside the company was running an nmap port scan; nmap sends packets with arbitrary payloads to the ports it probes. Our internal Java RPC framework is Thrift, and when Thrift parses such a malformed packet it can end up requesting a huge amount of memory, which in turn drives the service to OOM.

I went through the Thrift source, focusing on the AbstractNonblockingServer class and its FrameBuffer, the two classes responsible for parsing the incoming data stream.

Thrift handles network data with NIO: the data is placed into a buffer, which Thrift then reads from via the SocketChannel.

Thrift first reads 4 bytes from the SocketChannel and converts them into an int, frameSize. It takes this frameSize to be the size of the whole frame, allocates a buffer of that size, and then reads the rest of the data.
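
To make this concrete, here is a minimal sketch (the probe bytes are purely illustrative, not captured from the incident) of how four arbitrary bytes decode into an enormous frameSize, the same decoding FrameBuffer performs with buffer_.getInt(0):

import java.nio.ByteBuffer;

public class FrameSizeDemo {
    public static void main(String[] args) {
        // Four arbitrary bytes at the head of a non-Thrift packet; "GET " is just an example.
        byte[] probe = {'G', 'E', 'T', ' '};
        // Same interpretation FrameBuffer applies to the first 4 bytes of a frame.
        int frameSize = ByteBuffer.wrap(probe).getInt();
        // Prints 1195725856 (~1.1 GB): Thrift would treat this as the length of the frame
        // and, subject to MAX_READ_BUFFER_BYTES, try to allocate a buffer that large.
        System.out.println("frameSize = " + frameSize);
    }
}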

This is exactly where the problem lies. A packet that conforms to the Thrift IDL is fine, but the data in the TCP packets nmap sends is arbitrary, and in my tests a single scan fires off dozens of TCP requests at once.

Thrift presumably tries to guard against this: after obtaining frameSize and before allocating memory, it performs a check, and when frameSize > MAX_READ_BUFFER_BYTES the request is treated as invalid and is not processed further.

if (frameSize > MAX_READ_BUFFER_BYTES) {
    LOGGER.error("Read a frame size of " + frameSize
        + ", which is bigger than the maximum allowable buffer size for ALL connections.");
    return false;
}

MAX_READ_BUFFER_BYTES defaults to Long.MAX_VALUE and can be set through the arguments of Thrift's TThreadedSelectorServer constructor; our company sets it to 100 MB.
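
For reference, a minimal sketch of where this knob lives; the port and the 16 MB cap are assumptions for illustration, not our actual configuration, and the processor would come from the IDL-generated code:

import org.apache.thrift.TProcessor;
import org.apache.thrift.server.TThreadedSelectorServer;
import org.apache.thrift.transport.TNonblockingServerSocket;
import org.apache.thrift.transport.TTransportException;

public class RpcServerBootstrap {
    public static TThreadedSelectorServer build(TProcessor processor) throws TTransportException {
        TThreadedSelectorServer.Args args =
                new TThreadedSelectorServer.Args(new TNonblockingServerSocket(9090)); // port is an assumption
        // maxReadBufferBytes is inherited from AbstractNonblockingServerArgs and defaults to
        // Long.MAX_VALUE; capping it bounds how much memory a single bogus frame size can trigger.
        args.maxReadBufferBytes = 16 * 1024 * 1024L; // 16 MB, an assumed cap
        args.processor(processor);
        return new TThreadedSelectorServer(args);
    }
}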

But 100 MB is still far too large. As long as frameSize is under 100 MB, Thrift will happily allocate the memory, and nmap sends dozens of requests within a few seconds; in the extreme case every one of those requests allocates up to 100 MB, which is ultimately enough to OOM the service.
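
To put a rough number on it (the heap size here is an assumption): many such services run with heaps of only a few GB, so dozens of probe packets each requesting up to 100 MB of read buffer adds up to several gigabytes in the worst case, far more than the heap can absorb without FullGC or OOM.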

A stop-gap fix is to lower MAX_READ_BUFFER_BYTES, although if you lower it too far you also need to check whether legitimate RPC requests are affected.

All in all, given such a puzzling design in Thrift, it is probably better for ops to add more restrictions at the network level: first, block external access to the RPC ports so that bad actors cannot reach them; second, establish internal conventions as well, for example deliberately skipping the RPC ports when running scans.

Core code

public boolean read() {
    if (state_ == FrameBufferState.READING_FRAME_SIZE) {
        // try to read the frame size completely
        if (!internalRead()) {
            return false;
        }

        // if the frame size has been read completely, then prepare to read the
        // actual frame.
        if (buffer_.remaining() == 0) {
            // pull out the frame size as an integer.
            int frameSize = buffer_.getInt(0);
            if (frameSize <= 0) {
                LOGGER.error("Read an invalid frame size of " + frameSize
                    + ". Are you using TFramedTransport on the client side?");
                return false;
            }

            // if this frame will always be too large for this server, log the
            // error and close the connection.
            if (frameSize > MAX_READ_BUFFER_BYTES) {
                LOGGER.error("Read a frame size of " + frameSize
                    + ", which is bigger than the maximum allowable buffer size for ALL connections.");
                return false;
            }

            // if this frame will push us over the memory limit, then return.
            // with luck, more memory will free up the next time around.
            if (readBufferBytesAllocated.get() + frameSize > MAX_READ_BUFFER_BYTES) {
                return true;
            }

            // increment the amount of memory allocated to read buffers
            readBufferBytesAllocated.addAndGet(frameSize + 4);

            // reallocate the readbuffer as a frame-sized buffer
            buffer_ = ByteBuffer.allocate(frameSize + 4);
            buffer_.putInt(frameSize);

            state_ = FrameBufferState.READING_FRAME;
        } else {
            // this skips the check of READING_FRAME state below, since we can't
            // possibly go on to that state if there's data left to be read at
            // this one.
            return true;
        }
    }

    // it is possible to fall through from the READING_FRAME_SIZE section
    // to READING_FRAME if there's already some frame data available once
    // READING_FRAME_SIZE is complete.
    if (state_ == FrameBufferState.READING_FRAME) {
        if (!internalRead()) {
            return false;
        }

        // since we're already in the select loop here for sure, we can just
        // modify our selection key directly.
        if (buffer_.remaining() == 0) {
            // get rid of the read select interests
            selectionKey_.interestOps(0);
            state_ = FrameBufferState.READ_FRAME_COMPLETE;
        }

        return true;
    }

    // if we fall through to this point, then the state must be invalid.
    LOGGER.error("Read was called but state is invalid (" + state_ + ")");
    return false;
}

Reference: https://my.oschina.net/shipley/blog/422204