Choosing the Right Import Method
If the data is already in an HBase table:
- To move the data from one HBase cluster to another, take a snapshot and use either the clone_snapshot command or the ExportSnapshot utility; alternatively, use the CopyTable utility.
- To move the data from one HBase cluster to another without downtime on either cluster, use replication.
- To migrate data between HBase versions that are not wire compatible, such as from CDH 4 to CDH 5, see Importing HBase Data From CDH 4 to CDH 5.
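As a rough illustration of the first option, the following Python sketch only builds the command lines for ExportSnapshot and CopyTable; it does not run them. The snapshot name, table name, HDFS URI, and ZooKeeper quorum string are hypothetical placeholders, and the snapshot is assumed to have been taken already (for example with snapshot 'orders', 'orders_snap' in the HBase shell).

```python
import shlex


def export_snapshot_cmd(snapshot, dest_root, mappers=16):
    """Build an ExportSnapshot command that copies a snapshot to another cluster.

    dest_root is the HBase root directory on the destination cluster's HDFS.
    """
    return [
        "hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,
        "-copy-to", dest_root,
        "-mappers", str(mappers),
    ]


def copy_table_cmd(table, peer_zk):
    """Build a CopyTable command that writes a table to a peer cluster.

    peer_zk has the form quorum:client_port:znode_parent.
    """
    return [
        "hbase", "org.apache.hadoop.hbase.mapreduce.CopyTable",
        "--peer.adr=" + peer_zk,
        table,
    ]


# Hypothetical names and addresses, printed as shell-ready command lines.
print(shlex.join(export_snapshot_cmd("orders_snap", "hdfs://dest-nn:8020/hbase")))
print(shlex.join(copy_table_cmd("orders", "dest-zk:2181:/hbase")))
```

After an export, the snapshot still has to be restored on the destination cluster, for example with clone_snapshot in the HBase shell.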
If the data currently exists outside HBase:
- If possible, write the data to HFile format, and use a BulkLoad to import it into HBase. The data is immediately available to HBase and you can bypass the normal write path, increasing efficiency.
- If you prefer not to use bulk loads, and you are using a tool such as Pig, you can use it to import your data.
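One possible shape of the bulk-load path is sketched below, assuming the input is tab-separated text already in HDFS. ImportTsv is used here only as one way to produce HFiles; the source does not prescribe it. The table name, column mapping, and paths are hypothetical, and the class names are the ones shipped with HBase 1.x (the line used by CDH 5). The sketch builds the two commands without executing them.

```python
import shlex


def import_tsv_cmd(table, columns, input_dir, hfile_dir):
    """Build an ImportTsv command that writes HFiles instead of issuing Puts.

    columns maps TSV fields to HBase columns; exactly one entry must be
    HBASE_ROW_KEY. Setting importtsv.bulk.output makes the job emit HFiles.
    """
    return [
        "hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
        "-Dimporttsv.columns=" + ",".join(columns),
        "-Dimporttsv.bulk.output=" + hfile_dir,
        table,
        input_dir,
    ]


def bulk_load_cmd(hfile_dir, table):
    """Build the command that hands the generated HFiles over to the table."""
    return [
        "hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles",
        hfile_dir,
        table,
    ]


# Hypothetical table, column family, and HDFS paths.
steps = [
    import_tsv_cmd("orders", ["HBASE_ROW_KEY", "d:amount"], "/data/orders_tsv", "/tmp/orders_hfiles"),
    bulk_load_cmd("/tmp/orders_hfiles", "orders"),
]
for step in steps:
    print(shlex.join(step))
```

The second step moves the HFiles into the table's regions, which is why the data becomes available without passing through the normal write path.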
If you need to stream live data to HBase instead of importing it in bulk:
- Write a Java client using the Java API, or use the Apache Thrift Proxy API to write a client in a language supported by Thrift.
- Stream data directly into HBase using the REST Proxy API in conjunction with an HTTP client such as wget or curl.
- Use Flume or Spark.
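To make the REST option more concrete, here is a minimal Python sketch that prepares a single-cell write for the HBase REST server, using the standard library as the HTTP client in place of wget or curl. The host, port, table, and column names are hypothetical, and the request is only constructed, not sent. The JSON layout (base64-encoded row key, column, and value under "Row" and "Cell") follows the REST server's JSON representation as I understand it; verify it against your HBase version.

```python
import base64
import json
import urllib.parse
import urllib.request


def b64(text):
    """Base64-encode a string, as the REST JSON format expects for keys and values."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def build_put_request(base_url, table, row_key, column, value):
    """Prepare a PUT that writes one cell (column is 'family:qualifier')."""
    body = {
        "Row": [
            {
                "key": b64(row_key),
                "Cell": [{"column": b64(column), "$": b64(value)}],
            }
        ]
    }
    url = "{}/{}/{}".format(
        base_url.rstrip("/"), table, urllib.parse.quote(row_key, safe="")
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )


# Hypothetical REST endpoint and table; nothing is sent over the network here.
req = build_put_request("http://rest-host:8080", "orders", "row1", "d:amount", "42")
print(req.get_method(), req.full_url)
```

Passing the prepared request to urllib.request.urlopen(req) would send it; for a live stream you would call this once per incoming record, or batch several rows into one "Row" list.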
Most likely, at least one of these methods works in your situation. If not, you can use MapReduce directly. Test the most feasible methods with a subset of your data to determine which one is optimal.
Source: http://www.cloudera.com/documentation/enterprise/5-4-x/topics/admin_hbase_import.html