HBase Source Code Study: The Client-Side Scan Process

Note: all code below is from HBase-1.0.1.1.

HTable tb = new HTable(conf, "test");
Scan scan = new Scan();
// restrict the scan to one column: family "colfam", qualifier "col"
scan.addColumn(Bytes.toBytes("colfam"), Bytes.toBytes("col"));

ResultScanner rs = tb.getScanner(scan);
try {
    Result r = null;
    while ((r = rs.next()) != null) {
        byte[] val = r.getValue(Bytes.toBytes("colfam"), Bytes.toBytes("col"));
        System.out.println("Get value: " + Bytes.toString(val));
    }
} finally {
    rs.close(); // release the scanner
    tb.close();
}

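The code above is a very simple scan of the test table. It first builds a Scan, which describes what to scan. It then calls HTable.getScanner to obtain a ResultScanner, and finally calls ResultScanner.next to fetch the results. That is the whole scan in brief; below we look at how each step actually works.

First, getScanner. It actually returns a ClientScanner or one of its subclasses. Depending on how the scan is configured, it returns one of four scanners (marked //1 to //4 in the code).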

  public ResultScanner getScanner(final Scan scan) throws IOException {
    if (scan.getBatch() > 0 && scan.isSmall()) {
          throw new IllegalArgumentException("Small scan should not be used with batching");
    }
    if (scan.getCaching() <= 0) {
      scan.setCaching(getScannerCaching());
    }

    if (scan.isReversed()) {
      if (scan.isSmall()) {
        return new ClientSmallReversedScanner(getConfiguration(), scan, getName(),
            this.connection, this.rpcCallerFactory, this.rpcControllerFactory,
            pool, tableConfiguration.getReplicaCallTimeoutMicroSecondScan());//1
      } else {
        return new ReversedClientScanner(getConfiguration(), scan, getName(),
            this.connection, this.rpcCallerFactory, this.rpcControllerFactory,
            pool, tableConfiguration.getReplicaCallTimeoutMicroSecondScan());//2
      }
    }

    if (scan.isSmall()) {
      return new ClientSmallScanner(getConfiguration(), scan, getName(),
          this.connection, this.rpcCallerFactory, this.rpcControllerFactory,
          pool, tableConfiguration.getReplicaCallTimeoutMicroSecondScan());//3
    } else {
      return new ClientScanner(getConfiguration(), scan, getName(), this.connection,
          this.rpcCallerFactory, this.rpcControllerFactory,
          pool, tableConfiguration.getReplicaCallTimeoutMicroSecondScan());//4
    }
  }
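
The branch structure above comes down to two flags on the Scan, reversed and small. As a minimal sketch (not HBase code: `ScannerChoiceSketch` and `choose` are hypothetical names, and the scanner classes are represented only by their names as strings), the decision table can be written as:

```java
class ScannerChoiceSketch {
    // Hypothetical helper: follows the branch order of getScanner above and
    // returns the simple name of the scanner class that would be created.
    static String choose(boolean reversed, boolean small, int batch) {
        if (batch > 0 && small) {
            throw new IllegalArgumentException("Small scan should not be used with batching");
        }
        if (reversed) {
            return small ? "ClientSmallReversedScanner" : "ReversedClientScanner";
        }
        return small ? "ClientSmallScanner" : "ClientScanner";
    }

    public static void main(String[] args) {
        // Print the scanner chosen for each combination of the two flags.
        for (boolean reversed : new boolean[] {true, false}) {
            for (boolean small : new boolean[] {true, false}) {
                System.out.println("reversed=" + reversed + ", small=" + small
                    + " -> " + choose(reversed, small, -1));
            }
        }
    }
}
```

A plain scan with neither flag set falls through to the last branch, which is why the rest of this article follows ClientScanner.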

ClientScanner implements the ResultScanner interface. Let's see how ClientScanner's next works.

    public Result next() throws IOException {
      // If the scanner is closed and there's nothing left in the cache, next is a no-op.
      if (cache.size() == 0 && this.closed) {
        return null;
      }
      if (cache.size() == 0) {
        loadCache();
      }

      if (cache.size() > 0) {
        return cache.poll();
      }

      // if we exhausted this scanner before calling close, write out the scan metrics
      writeScanMetrics();
      return null;
    }

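The code works as follows: it first checks whether the cache is empty and, if so, loads some data into it. If the cache then holds anything, there is data, so it polls the first entry and returns it. This shows that a scan does not go to the server on every next call. It prefetches a batch of results and only fetches again when the cache runs out.

Next, let's look at how loadCache works when the cache is empty. The loadCache code is long; I removed some of the original comments and added my own understanding as comments.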
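
To make the effect of the cache concrete, here is a minimal self-contained sketch of the same pattern. It is not HBase code: `ScanCacheSketch`, its fields, and the simulated server are all assumptions made for illustration. Each call to the stand-in loadCache counts as one round trip and fetches up to `caching` rows.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class ScanCacheSketch {
    private final Queue<String> cache = new ArrayDeque<>();
    private final int caching;      // rows requested per simulated round trip
    private final int totalRows;    // rows held by the simulated server
    private int nextRow = 0;        // next row the simulated server will return
    private boolean closed = false; // set once the simulated server is exhausted
    int rpcCount = 0;               // number of simulated round trips

    ScanCacheSketch(int totalRows, int caching) {
        this.totalRows = totalRows;
        this.caching = caching;
    }

    // Stand-in for loadCache(): one simulated round trip fetching up to `caching` rows.
    private void loadCache() {
        rpcCount++;
        for (int i = 0; i < caching && nextRow < totalRows; i++) {
            cache.add("row-" + nextRow);
            nextRow++;
        }
        if (nextRow >= totalRows) {
            closed = true;
        }
    }

    // Same shape as ClientScanner.next(): refill only when the cache is empty.
    String next() {
        if (cache.isEmpty() && closed) return null;
        if (cache.isEmpty()) loadCache();
        return cache.poll(); // null when nothing is left
    }

    public static void main(String[] args) {
        ScanCacheSketch scanner = new ScanCacheSketch(10, 4);
        int rows = 0;
        while (scanner.next() != null) rows++;
        System.out.println(rows + " rows read with " + scanner.rpcCount + " simulated round trips");
    }
}
```

In this model the number of round trips grows with the row count divided by the caching value rather than with the row count itself, which is the point of prefetching.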

   protected void loadCache() throws IOException {
    Result[] values = null; 
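    long remainingResultSize = maxScannerResultSize; // remaining cache capacity, in bytes
    int countdown = this.caching; // number of Results the cache can still hold

    // Set the caching size: the server returns up to this many rows per call, if it has them
    callable.setCaching(this.caching);

    boolean skipFirst = false; // whether one row must be skipped
    // whether to retry when an OutOfOrderScannerNextException occurs
    boolean retryAfterOutOfOrderException = true;

    boolean serverHasMoreResults = false; // whether the region still has more data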
    do {
      try {
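          // Skip one row. Why?
          // When an exception occurs we may need to retry, and a retry has to
          // reposition the start row even though some rows were already read.
          // lastResult records the last row that was read successfully, so the
          // scan restarts at lastResult and continues from the row after it.
          // In other words, this block skips lastResult.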
        if (skipFirst) {

          callable.setCaching(1); // skipping exactly one row, so set caching to 1
          values = call(scan, callable, caller, scannerTimeout);

          // The read from the current RS (region server) failed and we switched to a
          // replica RS, so the skip has to be redone.
          // switchedToADifferentReplica tells us whether a switch happened.
          if (values == null && callable.switchedToADifferentReplica()) {
            if (this.lastResult != null) { 
              skipFirst = true;
            }
            // record the new region info
            this.currentRegion = callable.getHRegionInfo();
            continue;
          }
          callable.setCaching(this.caching);
          skipFirst = false;
        }
        // Server returns a null values if scanning is to stop. Else,
        // returns an empty array if scanning is to go on and we've just
        // exhausted current region.
        // Fetch the data.
        // This ends up invoking callable's call;
        // callable is an instance of ScannerCallableWithReplicas.
        values = call(scan,  callable, caller, scannerTimeout);
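        // I believe this branch is never taken: the block above either clears skipFirst
        // after a successful skip or hits continue on failure.
        // It is not important here and can be ignored.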
        if (skipFirst && values != null && values.length == 1) {
          skipFirst = false; // Already skipped, unset it before scanning again
          values = call(scan, callable, caller, scannerTimeout);
        }

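        // Same as above: handle a failed call that switched replicas.
        // A few more notes:
        // When is values null? When the server stops the scan, or on the first call,
        // because the first call only runs openScanner and returns no data.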
        if (values == null && callable.switchedToADifferentReplica()) { 
          if (this.lastResult != null) { 
            skipFirst = true;
          }
          this.currentRegion = callable.getHRegionInfo();
          continue;
        }
        // reset to true after a successful call
        retryAfterOutOfOrderException = true;
      } catch (DoNotRetryIOException e) {
         // exception handling

        if (e instanceof UnknownScannerException) {
          long timeout = lastNext + scannerTimeout;

          if (timeout < System.currentTimeMillis()) {
            long elapsed = System.currentTimeMillis() - lastNext;
            ScannerTimeoutException ex =
                new ScannerTimeoutException(elapsed + "ms passed since the last invocation, "
                    + "timeout is currently set to " + scannerTimeout);
            ex.initCause(e);
            throw ex;
          }
        } else {

          Throwable cause = e.getCause();
          if ((cause != null && cause instanceof NotServingRegionException) ||
              (cause != null && cause instanceof RegionServerStoppedException) ||
              e instanceof OutOfOrderScannerNextException) {
            // Pass
            // It is easier writing the if loop test as list of what is allowed rather than
            // as a list of what is not allowed... so if in here, it means we do not throw.
          } else {
            throw e;
          }
        }
        // Else, its signal from depths of ScannerCallable that we need to reset the scanner.
        // This is why the skip mentioned above is needed:
        // after an exception we have to start over, so the start row must be repositioned
        if (this.lastResult != null) {

          // Set the start row to the last successfully read result; it will be skipped,
          // so the effective start row is the next row
          this.scan.setStartRow(this.lastResult.getRow());

          // Skip first row returned. We already let it out on previous
          // invocation.
          skipFirst = true;
        }
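        // The first time this exception occurs, retryAfterOutOfOrderException is true,
        // so the scan is retried once.
        // If it occurs a second time in a row, retryAfterOutOfOrderException has not
        // yet been reset to true, so the exception is thrown and there is no further retry.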
        if (e instanceof OutOfOrderScannerNextException) {
          if (retryAfterOutOfOrderException) {
            retryAfterOutOfOrderException = false;
          } else {
            // TODO: Why wrap this in a DNRIOE when it already is a DNRIOE?
            throw new DoNotRetryIOException("Failed after retry of " +
                "OutOfOrderScannerNextException: was there a rpc timeout?", e);
          }
        }
        // after an exception, reset the region info and the callable
        // Clear region.
        this.currentRegion = null;
        // Set this to zero so we don't try and do an rpc and close on remote server when
        // the exception we got was UnknownScanner or the Server is going down.
        callable = null;
        // This continue will take us to while at end of loop where we will set up new scanner.
        continue;
      }
      long currentTime = System.currentTimeMillis();
      if (this.scanMetrics != null) {
        this.scanMetrics.sumOfMillisSecBetweenNexts.addAndGet(currentTime - lastNext);
      }
      lastNext = currentTime;
      // add the results to the cache
      if (values != null && values.length > 0) {
        for (Result rs : values) {
          cache.add(rs);
          // We don't make Iterator here
          for (Cell cell : rs.rawCells()) {
            remainingResultSize -= CellUtil.estimatedHeapSizeOf(cell);
          }
          countdown--;
          this.lastResult = rs;
        }
      }

      // set serverHasMoreResults: true means this region still has data, false means it does not
      if (null != values && values.length > 0 && callable.hasMoreResultsContext()) {
        // Only adhere to more server results when we don't have any partialResults
        // as it keeps the outer loop logic the same.
        serverHasMoreResults = callable.getServerHasMoreResults();
      }

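      // The loop condition works like this:
      // If serverHasMoreResults is true, the region has more data, !serverHasMoreResults
      // is false, the whole condition is false, and the loop ends. The server returns
      // as much as the remaining capacity allows in one call, so the cache is already
      // full and the loop should stop.
      // Otherwise !serverHasMoreResults is true and possiblyNextScanner runs.
      // possiblyNextScanner looks for the next region and builds a new callable;
      // if it finds one it returns true and the loop continues.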
    } while (remainingResultSize > 0 && countdown > 0 && !serverHasMoreResults
        && possiblyNextScanner(countdown, values == null));
  }

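The main line of the whole process: invoke call to fetch Results, and retry if an exception occurs. On a retry, the start row is re-established by restarting at lastResult and skipping it.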
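
The restart-and-skip idea can be shown with a small self-contained model. This is a sketch under stated assumptions, not the HBase implementation: `SkipLastResultSketch`, `FakeScanner`, and the `failOnCall` parameter are hypothetical, the "region" is a fixed sorted list, and a failure is injected on one chosen call. A scanner opened at a start row is inclusive, so after reopening at lastResult the first row has to be dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SkipLastResultSketch {
    // Simulated region contents, sorted by row key.
    static final List<String> ROWS = Arrays.asList("a", "b", "c", "d", "e", "f", "g");

    // Simulated server-side scanner, positioned at the first row >= startRow.
    static class FakeScanner {
        private int pos = 0;

        FakeScanner(String startRow) {
            while (pos < ROWS.size() && ROWS.get(pos).compareTo(startRow) < 0) pos++;
        }

        List<String> next(int limit) {
            List<String> out = new ArrayList<>();
            while (out.size() < limit && pos < ROWS.size()) out.add(ROWS.get(pos++));
            return out;
        }
    }

    // Reads every row; the call numbered failOnCall (1-based, 0 = never) fails.
    static List<String> scanAll(int caching, int failOnCall) {
        List<String> seen = new ArrayList<>();
        String lastResult = null;
        boolean skipFirst = false;
        int calls = 0;
        FakeScanner scanner = new FakeScanner("");
        while (true) {
            if (++calls == failOnCall) {
                // Simulated failure: the scanner is lost, so reopen it.
                if (lastResult != null) {
                    scanner = new FakeScanner(lastResult); // start row = lastResult
                    skipFirst = true;                      // it was already handed out
                } else {
                    scanner = new FakeScanner("");
                }
                continue;
            }
            if (skipFirst) {
                scanner.next(1); // fetch one row and drop it: this is lastResult
                skipFirst = false;
            }
            List<String> values = scanner.next(caching);
            if (values.isEmpty()) return seen;
            seen.addAll(values);
            lastResult = values.get(values.size() - 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(scanAll(3, 2));
    }
}
```

Without the skip, the row stored in lastResult would be delivered twice after the retry.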

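Notes:
1. If call fails, it automatically switches to a replica RS, that is, it reads from a backup region server. This only happens when the scan's Consistency is set to Consistency.TIMELINE.
2. OutOfOrderScannerNextException is caused by a mismatch between a sequence number kept by the client and the one kept by the server. Each time a read returns a result, the client increments nextCallSeq. If the server receives a read request correctly but the client times out while waiting for the response, the server's sequence number ends up 1 greater than the client's.
3. The first invocation of call actually opens the scan on the server and obtains the scannerId for later use. If you keep tracing possiblyNextScanner, you will find that after locating the new region and building the corresponding ScannerCallableWithReplicas it makes one call on its own, for exactly this reason.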
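
Note 2 can be illustrated with a toy model of the two counters. This is only a sketch: `CallSeqSketch`, `FakeServer`, `FakeClient`, and the `responseLost` flag are hypothetical, and a plain IllegalStateException stands in for OutOfOrderScannerNextException. The assumption is that the server advances its counter when it processes a request, while the client advances its own only after it receives the response.

```java
class CallSeqSketch {
    // Simulated server side: tracks the sequence number it expects next.
    static class FakeServer {
        long nextCallSeq = 0;

        void next(long clientSeq) {
            if (clientSeq != nextCallSeq) {
                // Stand-in for OutOfOrderScannerNextException.
                throw new IllegalStateException(
                    "out of order: expected " + nextCallSeq + " but got " + clientSeq);
            }
            nextCallSeq++; // the request was processed
        }
    }

    // Simulated client side: only advances after a response arrives.
    static class FakeClient {
        long nextCallSeq = 0;

        // responseLost simulates a client-side timeout: the server processed
        // the request, but the client never saw the reply.
        void next(FakeServer server, boolean responseLost) {
            server.next(nextCallSeq);
            if (!responseLost) {
                nextCallSeq++;
            }
        }
    }

    // Returns true if the call that follows a lost response is rejected.
    static boolean retryAfterTimeoutIsRejected() {
        FakeServer server = new FakeServer();
        FakeClient client = new FakeClient();
        client.next(server, false); // normal call, both counters advance
        client.next(server, true);  // response lost, server is now one ahead
        try {
            client.next(server, false);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected after timeout: " + retryAfterTimeoutIsRejected());
    }
}
```

This is the situation loadCache recovers from by resetting the scanner and retrying once.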

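For more detail, keep tracing call. It actually invokes ScannerCallableWithReplicas.call, which maintains a thread pool for running tasks and ultimately invokes ScannerCallable.call. If the task fails, ScannerCallableWithReplicas builds a separate ScannerCallable for each replica RS and starts a task for each of them. Only the first replica RS to return successfully is selected, and subsequent reads are switched to that replica RS.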
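
The "first successful replica wins" rule can be sketched with the standard java.util.concurrent API. This is a simplification rather than the HBase code: `ReplicaRaceSketch`, `replicaCall`, and `demo` are hypothetical names, each task returns only the id of the replica that answered, and all tasks are submitted at once through ExecutorService.invokeAny instead of trying the primary first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ReplicaRaceSketch {
    // Hypothetical per-replica task: either fails or returns the id of the
    // replica that served the read.
    static Callable<Integer> replicaCall(final int replicaId, final boolean fails) {
        return () -> {
            if (fails) {
                throw new IllegalStateException("replica " + replicaId + " unavailable");
            }
            return replicaId;
        };
    }

    // Runs one task per replica and returns the result of a task that
    // completed successfully; failed tasks are ignored.
    static int firstSuccessfulReplica(List<Callable<Integer>> calls) {
        ExecutorService pool = Executors.newFixedThreadPool(calls.size());
        try {
            return pool.invokeAny(calls);
        } catch (Exception e) {
            throw new RuntimeException("no replica answered", e);
        } finally {
            pool.shutdownNow(); // cancel whatever is still running
        }
    }

    // Replicas 0 and 1 fail, replica 2 answers.
    static int demo() {
        List<Callable<Integer>> calls = new ArrayList<>();
        calls.add(replicaCall(0, true));
        calls.add(replicaCall(1, true));
        calls.add(replicaCall(2, false));
        return firstSuccessfulReplica(calls);
    }

    public static void main(String[] args) {
        System.out.println("reads now go to replica " + demo());
    }
}
```

After the winner is known, a real client would direct later reads to that replica, which is what switchedToADifferentReplica reports back to loadCache.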
