I would like to scan an HBase table and see integers as strings (not their binary representation). I can do the conversion, but I have no idea how to write the scan statement using the Java API from the HBase shell:
org.apache.hadoop.hbase.util.Bytes.toString(
"\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65".to_java_bytes)
org.apache.hadoop.hbase.util.Bytes.toString("Hello HBase".to_java_bytes)
I would be very happy to have examples of scan and get that search binary data (longs) and output normal strings. I am using the HBase shell, not Java.
3 Answers
#1 (score: 10)
HBase stores data as byte arrays (untyped). Therefore, if you perform a table scan, the data will be displayed in a common format (an escaped hexadecimal string), e.g.:
"\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65" -> Hello HBase
If you want to get back the typed value from the serialized byte array, you have to do this manually. You have the following options:
- Java code (Bytes.toString(...)); a one-liner example is shown after this list
- hack the to_string function in $HBASE_HOME/lib/ruby/hbase/table.rb: replace toStringBinary with toInt for non-meta tables
- write a get/scan JRuby function which converts the byte array to the appropriate type
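For instance, the first option works directly at the shell prompt, since the HBase shell is a JRuby interpreter. A minimal sketch for a 4-byte value (assuming it was written with Bytes.toBytes(int)):
org.apache.hadoop.hbase.util.Bytes.toInt("\x00\x00\x00\x0A".to_java_bytes)
# => 10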
Since you want it in the HBase shell, consider the last option:
Create a file get_result.rb:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.ResultScanner
import org.apache.hadoop.hbase.client.Result
import java.util.ArrayList

# Simple function equivalent to scan 'test', {COLUMNS => 'c:c2'}
def get_result()
  htable = HTable.new(HBaseConfiguration.new, "test")
  rs = htable.getScanner(Bytes.toBytes("c"), Bytes.toBytes("c2"))
  output = ArrayList.new
  output.add "ROW\t\t\t\t\t\tCOLUMN+CELL"
  rs.each { |r|
    r.raw.each { |kv|
      row = Bytes.toString(kv.getRow)
      fam = Bytes.toString(kv.getFamily)
      ql = Bytes.toString(kv.getQualifier)
      ts = kv.getTimestamp
      val = Bytes.toInt(kv.getValue)
      output.add " #{row} \t\t\t\t\t\t column=#{fam}:#{ql}, timestamp=#{ts}, value=#{val}"
    }
  }
  output.each { |line| puts "#{line}\n" }
end
Load it in the HBase shell and use it:
require '/path/to/get_result'
get_result
Note: modify/enhance/fix the code according to your needs.
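For example, if your values are 8-byte longs rather than 4-byte ints (as the question suggests), the conversion line in get_result would simply become:
val = Bytes.toLong(kv.getValue)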
#2 (score: 2)
You can add a scan_counter command to the HBase shell.
First, add the following to /usr/lib/hbase/lib/ruby/hbase/table.rb (after the scan function):
#----------------------------------------------------------------------------------------------
# Scans the whole table or a range of keys and returns rows matching specific criteria, with values rendered as numbers
def scan_counter(args = {})
  unless args.kind_of?(Hash)
    raise ArgumentError, "Arguments should be a hash. Failed to parse #{args.inspect}, #{args.class}"
  end
  limit = args.delete("LIMIT") || -1
  maxlength = args.delete("MAXLENGTH") || -1
  if args.any?
    filter = args["FILTER"]
    startrow = args["STARTROW"] || ''
    stoprow = args["STOPROW"]
    timestamp = args["TIMESTAMP"]
    columns = args["COLUMNS"] || args["COLUMN"] || get_all_columns
    cache = args["CACHE_BLOCKS"] || true
    versions = args["VERSIONS"] || 1
    timerange = args[TIMERANGE]
    # Normalize column names
    columns = [columns] if columns.class == String
    unless columns.kind_of?(Array)
      raise ArgumentError.new("COLUMNS must be specified as a String or an Array")
    end
    scan = if stoprow
      org.apache.hadoop.hbase.client.Scan.new(startrow.to_java_bytes, stoprow.to_java_bytes)
    else
      org.apache.hadoop.hbase.client.Scan.new(startrow.to_java_bytes)
    end
    columns.each { |c| scan.addColumns(c) }
    scan.setFilter(filter) if filter
    scan.setTimeStamp(timestamp) if timestamp
    scan.setCacheBlocks(cache)
    scan.setMaxVersions(versions) if versions > 1
    scan.setTimeRange(timerange[0], timerange[1]) if timerange
  else
    scan = org.apache.hadoop.hbase.client.Scan.new
  end
  # Start the scanner
  scanner = @table.getScanner(scan)
  count = 0
  res = {}
  iter = scanner.iterator
  # Iterate results
  while iter.hasNext
    if limit > 0 && count >= limit
      break
    end
    row = iter.next
    key = org.apache.hadoop.hbase.util.Bytes::toStringBinary(row.getRow)
    row.list.each do |kv|
      family = String.from_java_bytes(kv.getFamily)
      qualifier = org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getQualifier)
      column = "#{family}:#{qualifier}"
      cell = to_string_scan_counter(column, kv, maxlength)
      if block_given?
        yield(key, "column=#{column}, #{cell}")
      else
        res[key] ||= {}
        res[key][column] = cell
      end
    end
    # One more row processed
    count += 1
  end
  return ((block_given?) ? count : res)
end

#----------------------------------------------------------------------------------------
# Helper methods

# Returns a list of column names in the table
def get_all_columns
  @table.table_descriptor.getFamilies.map do |family|
    "#{family.getNameAsString}:"
  end
end

# Checks if the current table is one of the 'meta' tables
def is_meta_table?
  tn = @table.table_name
  org.apache.hadoop.hbase.util.Bytes.equals(tn, org.apache.hadoop.hbase.HConstants::META_TABLE_NAME) || org.apache.hadoop.hbase.util.Bytes.equals(tn, org.apache.hadoop.hbase.HConstants::ROOT_TABLE_NAME)
end

# Returns family and (when it has one) qualifier for a column name
def parse_column_name(column)
  split = org.apache.hadoop.hbase.KeyValue.parseColumn(column.to_java_bytes)
  return split[0], (split.length > 1) ? split[1] : nil
end

# Make a String of the passed kv
# Intercept cells whose format we know, such as the info:regioninfo in .META.
def to_string(column, kv, maxlength = -1)
  if is_meta_table?
    if column == 'info:regioninfo' or column == 'info:splitA' or column == 'info:splitB'
      hri = org.apache.hadoop.hbase.util.Writables.getHRegionInfoOrNull(kv.getValue)
      return "timestamp=%d, value=%s" % [kv.getTimestamp, hri.toString]
    end
    if column == 'info:serverstartcode'
      if kv.getValue.length > 0
        str_val = org.apache.hadoop.hbase.util.Bytes.toLong(kv.getValue)
      else
        str_val = org.apache.hadoop.hbase.util.Bytes.toStringBinary(kv.getValue)
      end
      return "timestamp=%d, value=%s" % [kv.getTimestamp, str_val]
    end
  end
  val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getValue)}"
  (maxlength != -1) ? val[0, maxlength] : val
end

# Same as to_string, but renders the cell value as a long instead of a binary string
def to_string_scan_counter(column, kv, maxlength = -1)
  if is_meta_table?
    if column == 'info:regioninfo' or column == 'info:splitA' or column == 'info:splitB'
      hri = org.apache.hadoop.hbase.util.Writables.getHRegionInfoOrNull(kv.getValue)
      return "timestamp=%d, value=%s" % [kv.getTimestamp, hri.toString]
    end
    if column == 'info:serverstartcode'
      if kv.getValue.length > 0
        str_val = org.apache.hadoop.hbase.util.Bytes.toLong(kv.getValue)
      else
        str_val = org.apache.hadoop.hbase.util.Bytes.toStringBinary(kv.getValue)
      end
      return "timestamp=%d, value=%s" % [kv.getTimestamp, str_val]
    end
  end
  val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toLong(kv.getValue)}"
  (maxlength != -1) ? val[0, maxlength] : val
end
Second, add the following file, called scan_counter.rb, to /usr/lib/hbase/lib/ruby/shell/commands/:
#
# Copyright 2010 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
module Shell
  module Commands
    class ScanCounter < Command
      def help
        return <<-EOF
Scan a table whose cell values are longs; pass a table name and optionally a dictionary of scanner
specifications. Scanner specifications may include one or more of:
TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH,
or COLUMNS. If no columns are specified, all columns will be scanned.
To scan all members of a column family, leave the qualifier empty as in
'col_family:'.
Some examples:
  hbase> scan_counter '.META.'
  hbase> scan_counter '.META.', {COLUMNS => 'info:regioninfo'}
  hbase> scan_counter 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
  hbase> scan_counter 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
  hbase> scan_counter 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
For experts, there is an additional option -- CACHE_BLOCKS -- which
switches block caching for the scanner on (true) or off (false). By
default it is enabled. Examples:
  hbase> scan_counter 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}
EOF
      end

      def command(table, args = {})
        now = Time.now
        formatter.header(["ROW", "COLUMN+CELL"])
        count = table(table).scan_counter(args) do |row, cells|
          formatter.row([ row, cells ])
        end
        formatter.footer(now, count)
      end
    end
  end
end
Finally, register scan_counter in /usr/lib/hbase/lib/ruby/shell.rb: replace the current 'dml' command group definition (you can find it by searching for 'DATA MANIPULATION COMMANDS') with the following:
Shell.load_command_group(
  'dml',
  :full_name => 'DATA MANIPULATION COMMANDS',
  :commands => %w[
    count
    delete
    deleteall
    get
    get_counter
    incr
    put
    scan
    scan_counter
    truncate
  ]
)
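After restarting the shell, the new command can be used just like the built-in scan, e.g. (with a hypothetical table t1 and column c:c2):
hbase> scan_counter 't1', {COLUMNS => 'c:c2', LIMIT => 10}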
#3 (score: 1)
Just for completeness' sake, it turns out that the call Bytes::toStringBinary gives the hex-escaped sequence you get in the HBase shell:
\x0B\x2_SOME_ASCII_TEXT_\x10\x00...
Whereas Bytes::toString will try to deserialize to a string assuming UTF-8, which will look more like:
\u8900\u0710\u0115\u0320\u0000_SOME_UTF8_TEXT_\u4009...
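A quick way to see the difference from the shell is to run both calls on the same byte array (a minimal sketch; the trailing bytes are arbitrary):
bytes = "Hello\x00\x01".to_java_bytes
org.apache.hadoop.hbase.util.Bytes.toStringBinary(bytes)  # => "Hello\x00\x01"
org.apache.hadoop.hbase.util.Bytes.toString(bytes)        # => "Hello" followed by two unprintable characters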