Fixing a NodeManager memory leak that triggers frequent Full GC after about six months of uptime

What happened

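Our cluster runs Hadoop 2.7.3. After roughly six months of uptime, a NodeManager starts triggering Full GC frequently and then has to be restarted. This turned out to be common across the cluster: every machine seemed to follow the same rhythm, raising Full GC alerts after about half a year and needing a restart.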

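To find out why, we picked a machine that was triggering Full GC, dumped a heap snapshot, and analyzed it with Mogujie's (蘑菇街) internal heap snapshot analysis tool.
After excluding reclaimable objects, about 2.5 GB of memory was not reclaimable.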
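Following the reference chains, memory consumption is concentrated in two places: one on the reference chain under DistributedFileSystem, the other on the reference chain under $ThreadLocalMap.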

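Let's first look at DistributedFileSystem and its reference chain to the GC root.
[Figure: DistributedFileSystem reference chain to the GC root, with its child attribute list]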

That covers the DistributedFileSystem reference chain. So what is the other big memory consumer, ThreadLocalMap? We went through the source code to find out.

@InterfaceAudience.LimitedPrivate({ "MapReduce", "HBase" })
@InterfaceStability.Unstable
public class DistributedFileSystem extends FileSystem {
  private Path workingDir;
  private URI uri;
  private String homeDirPrefix =
      DFSConfigKeys.DFS_USER_HOME_DIR_PREFIX_DEFAULT;

DistributedFileSystem extends FileSystem.
Now look at the FileSystem source:

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class FileSystem extends Configured implements Closeable {
  public static final String FS_DEFAULT_NAME_KEY = 
                   CommonConfigurationKeys.FS_DEFAULT_NAME_KEY;
  public static final String DEFAULT_FS = 
                   CommonConfigurationKeys.FS_DEFAULT_NAME_DEFAULT;
                   
// ... (N lines omitted here)

private final String scheme;

    /**
     * rootData is data that doesn't belong to any thread, but will be added
     * to the totals.  This is useful for making copies of Statistics objects,
     * and for storing data that pertains to threads that have been garbage
     * collected.  Protected by the Statistics lock.
     */
    private final StatisticsData rootData;

    /**
     * Thread-local data.
     */
    private final ThreadLocal<StatisticsData> threadData;

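So here is the ThreadLocal. Why is it not on the same reference chain as DistributedFileSystem? Anyone who has read the ThreadLocal source knows that ThreadLocal is just a utility class. It does not store any value itself; the actual objects live in the ThreadLocalMap held by each Thread, so the two end up on different reference chains.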
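The point about where ThreadLocal values live can be checked with a small experiment. The following is a minimal sketch, not Hadoop code: the class name `ThreadLocalDemo` and the field `DATA` are made up for illustration. It sets a value from one thread and reads the same ThreadLocal from another, which only sees its own initial value because each thread keeps its own map entry.

```java
import java.util.concurrent.atomic.AtomicReference;

class ThreadLocalDemo {
    // The ThreadLocal object acts only as a key; the values are stored per thread.
    static final ThreadLocal<String> DATA = ThreadLocal.withInitial(() -> "unset");

    /** Starts a second thread and returns the value that thread sees in DATA. */
    static String readFromOtherThread() {
        AtomicReference<String> seen = new AtomicReference<>();
        Thread t = new Thread(() -> seen.set(DATA.get()));
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    public static void main(String[] args) {
        DATA.set("main-thread value");
        // Value stored in the main thread's own map.
        System.out.println(DATA.get());            // prints "main-thread value"
        // The other thread has a separate entry, so it sees the initial value.
        System.out.println(readFromOtherThread()); // prints "unset"
    }
}
```

In a heap dump this is why the per-thread values show up under Thread objects rather than under the object that declares the ThreadLocal field.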

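Based on this analysis, we were fairly sure the leak was caused by too many DistributedFileSystem objects. To confirm, we took a machine that had been running for a while after a restart and compared it.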
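It had only 41 such objects, far fewer than the 2,700-plus on the leaking machine.
Next we asked why the memory was leaking at all. This leak is an odd one: with a serious leak you would expect Full GC and then an OOM within a few days at most, yet here the process holds out for months before anything shows up.
That suggested an HDFS client that was not closed or released correctly and therefore never got collected, and that this only happens in a small number of extreme cases. With that suspicion in mind we went back to the source code, focusing on what happens at close time.
We went through the DFSClient-related source and found two methods of interest (both in DFSOutputStream, judging by the trace scope name). The first is close: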

/**
   * Closes this output stream and releases any system 
   * resources associated with this stream.
   */
  @Override
  public void close() throws IOException {
    synchronized (this) {
      TraceScope scope = dfsClient.getPathTraceScope("DFSOutputStream#close",
          src);
      try {
        closeImpl();
      } finally {
        scope.close();
      }
    }
    dfsClient.endFileLease(fileId);
  }

If closeImpl() throws an exception in this method, dfsClient.endFileLease(fileId); is never reached.

The other one is abort:

/**
   * Aborts this output stream and releases any system 
   * resources associated with this stream.
   */
  void abort() throws IOException {
    synchronized (this) {
      if (isClosed()) {
        return;
      }
      streamer.setLastException(new IOException("Lease timeout of "
          + (dfsClient.getHdfsTimeout() / 1000) + " seconds expired."));
      closeThreads(true);
    }
    dfsClient.endFileLease(fileId);
  }

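Here, if closeThreads throws an exception, dfsClient.endFileLease(fileId); is not reached either.
If dfsClient.endFileLease(fileId) is not executed, the file is never released, and that leaks the client. With this in hand we checked the official issue tracker and found that a patch already exists: /jira/browse/HDFS-10549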
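The failure mode can be reproduced outside Hadoop. The sketch below is illustrative only and is not the HDFS-10549 patch: `Client`, `Stream`, `closeLeaky`, `closeSafe`, and the `openFiles` map are hypothetical stand-ins for the real client bookkeeping. It shows that a cleanup call placed after a throwing call is skipped, while the same call inside `finally` still runs.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class LeaseLeakSketch {
    /** Hypothetical stand-in for the client's record of files being written. */
    static final class Client {
        final Map<Long, String> openFiles = new ConcurrentHashMap<>();
        void beginFileLease(long fileId, String src) { openFiles.put(fileId, src); }
        void endFileLease(long fileId) { openFiles.remove(fileId); }
    }

    /** Hypothetical stream whose closeImpl() can be forced to fail. */
    static final class Stream {
        private final Client client;
        private final long fileId;
        private final boolean failOnClose;

        Stream(Client client, long fileId, boolean failOnClose) {
            this.client = client;
            this.fileId = fileId;
            this.failOnClose = failOnClose;
            client.beginFileLease(fileId, "/tmp/file-" + fileId);
        }

        private void closeImpl() throws IOException {
            if (failOnClose) throw new IOException("simulated close failure");
        }

        // Same shape as the excerpt: an exception in closeImpl() skips endFileLease().
        void closeLeaky() throws IOException {
            synchronized (this) { closeImpl(); }
            client.endFileLease(fileId);
        }

        // Cleanup moved into finally, so it runs even when closeImpl() throws.
        void closeSafe() throws IOException {
            try {
                synchronized (this) { closeImpl(); }
            } finally {
                client.endFileLease(fileId);
            }
        }
    }

    /** Closes one failing stream and returns how many entries stay registered. */
    static int remainingAfterFailedClose(boolean safe) {
        Client client = new Client();
        Stream s = new Stream(client, 1L, true);
        try {
            if (safe) s.closeSafe(); else s.closeLeaky();
        } catch (IOException expected) {
            // The caller still sees the failure in both variants.
        }
        return client.openFiles.size();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + remainingAfterFailedClose(false)); // prints "leaky: 1"
        System.out.println("safe:  " + remainingAfterFailedClose(true));  // prints "safe:  0"
    }
}
```

Because the entry only stays behind when close fails, a leak of this shape grows slowly, which is consistent with a process that takes months rather than days to run into trouble.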
And with that, the case is closed!
