1、什么是 Sqoop？

Sqoop 是一种用于在 Hadoop 和关系数据库或大型机之间传输数据的工具。

您可以使用 Sqoop 将数据从关系数据库管理系统RDBMS（如 MySQL 或 Oracle）导入 Hadoop 分布式文件系统 HDFS，转换 Hadoop MapReduce 中的数据，然后将数据导出回 RDBMS。

Sqoop 自动执行此过程的大部分过程，依靠数据库来描述要导入的数据的模式。Sqoop 使用 MapReduce 导入和导出数据，提供并行操作和容错。

2、下载应用程序及配置环境变量

2.1、下载 Sqoop 1.4.7

通过以下命令下载 Sqoop，解压后，放到/home/work/_app/ 目录中：

[root@c0 _src]# pwd

/home/work/_src

[root@c0 _src]# wget http://mirrors.shu.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

[root@c0 _src]# tar -xzvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

[root@c0 _src]# mv sqoop-1.4.7.bin__hadoop-2.6.0 /home/work/_app/

2.2、设置环境变量

在每一台机器上设置 Sqoop 环境变量，运行以下命令

echo "" >> /etc/bashrc

echo "# Sqoop 1.4.7" >> /etc/bashrc

echo "export SQOOP_HOME=/home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0" >> /etc/bashrc

echo "" >> /etc/bashrc

echo "# Path" >> /etc/bashrc

echo "export PATH=\$PATH:\$SQOOP_HOME/bin" >> /etc/bashrc

source /etc/bashrc

2.3、设置安装所需环境

安装和运行 Sqoop 需要用到 Hive、MySql、Hadoop环境。可以参考文章：Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建

3、安装 Sqoop 1.4.7

3.1、修改 Sqoop 配置文件

3.1.1、修改配置文件 sqoop-env.sh

创建 /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/conf/sqoop-env.sh 文件编辑并保存，内容为空，因为我们在本文的配置环境变量章节中已经配置了环境变量，同时也在<Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建>一文中配置了 Hive 和 Hadoop 环境变量：

[root@c0 ~]# echo "" > /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/conf/sqoop-env.sh

将 /home/work/_app/hive-2.3.4/lib/ 目录下的 hive-hcatalog-core-2.3.4.jar、mysql-connector-java-5.1.47-bin.jar、hive-common-2.3.4.jar、libthrift-0.9.3.jar 文件，复制到/home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/lib 目录下

[root@c0 ~]# cp /home/work/_app/hive-2.3.4/lib/hive-hcatalog-core-2.3.4.jar /home/work/_app/hive-2.3.4/lib/mysql-connector-java-5.1.47-bin.jar /home/work/_app/hive-2.3.4/lib/libthrift-0.9.3.jar /home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/lib/

3.1.2、修改配置文件 configure-sqoop

编辑 /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/bin/configure-sqoop 文件并保存，内容如下：

[root@c0 _src]# cat /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/bin/configure-sqoop

#!/bin/bash

#

# Copyright 2011 The Apache Software Foundation

#

# Licensed to the Apache Software Foundation (ASF) under one

# or more contributor license agreements.  See the NOTICE file

# distributed with this work for additional information

# regarding copyright ownership.  The ASF licenses this file

# to you under the Apache License, Version 2.0 (the

# "License"); you may not use this file except in compliance

# with the License.  You may obtain a copy of the License at

#

#     http://www.apache.org/licenses/LICENSE-2.0

#

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.

# This is sourced in by bin/sqoop to set environment variables prior to

# invoking Hadoop.

bin="$1"

if [ -z "${bin}" ]; then

  bin=`dirname $0`

  bin=`cd ${bin} && pwd`

fi

if [ -z "$SQOOP_HOME" ]; then

  export SQOOP_HOME=${bin}/..

fi

SQOOP_CONF_DIR=${SQOOP_CONF_DIR:-${SQOOP_HOME}/conf}

if [ -f "${SQOOP_CONF_DIR}/sqoop-env.sh" ]; then

  . "${SQOOP_CONF_DIR}/sqoop-env.sh"

fi

# Find paths to our dependency systems. If they are unset, use CDH defaults.

if [ -z "${HADOOP_COMMON_HOME}" ]; then

  if [ -n "${HADOOP_HOME}" ]; then

    HADOOP_COMMON_HOME=${HADOOP_HOME}

  else

    if [ -d "/usr/lib/hadoop" ]; then

      HADOOP_COMMON_HOME=/usr/lib/hadoop

    else

      HADOOP_COMMON_HOME=${SQOOP_HOME}/../hadoop

    fi

  fi

fi

if [ -z "${HADOOP_MAPRED_HOME}" ]; then

  HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

  if [ ! -d "${HADOOP_MAPRED_HOME}" ]; then

    if [ -n "${HADOOP_HOME}" ]; then

      HADOOP_MAPRED_HOME=${HADOOP_HOME}

    else

      HADOOP_MAPRED_HOME=${SQOOP_HOME}/../hadoop-mapreduce

    fi

  fi

fi

# We are setting HADOOP_HOME to HADOOP_COMMON_HOME if it is not set

# so that hcat script works correctly on BigTop

if [ -z "${HADOOP_HOME}" ]; then

  if [ -n "${HADOOP_COMMON_HOME}" ]; then

     HADOOP_HOME=${HADOOP_COMMON_HOME}

     export HADOOP_HOME

  fi

fi

if [ -z "${HBASE_HOME}" ]; then

  if [ -d "/usr/lib/hbase" ]; then

    HBASE_HOME=/usr/lib/hbase

  else

    HBASE_HOME=${SQOOP_HOME}/../hbase

  fi

fi

#if [ -z "${HCAT_HOME}" ]; then

#  if [ -d "/usr/lib/hive-hcatalog" ]; then

#    HCAT_HOME=/usr/lib/hive-hcatalog

#  elif [ -d "/usr/lib/hcatalog" ]; then

#    HCAT_HOME=/usr/lib/hcatalog

#  else

#    HCAT_HOME=${SQOOP_HOME}/../hive-hcatalog

#    if [ ! -d ${HCAT_HOME} ]; then

#       HCAT_HOME=${SQOOP_HOME}/../hcatalog

#    fi

#  fi

#fi

#if [ -z "${ACCUMULO_HOME}" ]; then

#  if [ -d "/usr/lib/accumulo" ]; then

#    ACCUMULO_HOME=/usr/lib/accumulo

#  else

#    ACCUMULO_HOME=${SQOOP_HOME}/../accumulo

#  fi

#fi

if [ -z "${ZOOKEEPER_HOME}" ]; then

  if [ -d "/usr/lib/zookeeper" ]; then

    ZOOKEEPER_HOME=/usr/lib/zookeeper

  else

    ZOOKEEPER_HOME=${SQOOP_HOME}/../zookeeper

  fi

fi

if [ -z "${HIVE_HOME}" ]; then

  if [ -d "/usr/lib/hive" ]; then

    export HIVE_HOME=/usr/lib/hive

  elif [ -d ${SQOOP_HOME}/../hive ]; then

    export HIVE_HOME=${SQOOP_HOME}/../hive

  fi

fi

# Check: If we can't find our dependencies, give up here.

if [ ! -d "${HADOOP_COMMON_HOME}" ]; then

  echo "Error: $HADOOP_COMMON_HOME does not exist!"

  echo 'Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.'

  exit 1

fi

if [ ! -d "${HADOOP_MAPRED_HOME}" ]; then

  echo "Error: $HADOOP_MAPRED_HOME does not exist!"

  echo 'Please set $HADOOP_MAPRED_HOME to the root of your Hadoop MapReduce installation.'

  exit 1

fi

## Moved to be a runtime check in sqoop.

if [ ! -d "${HBASE_HOME}" ]; then

  echo "Warning: $HBASE_HOME does not exist! HBase imports will fail."

  echo 'Please set $HBASE_HOME to the root of your HBase installation.'

fi

## Moved to be a runtime check in sqoop.

#if [ ! -d "${HCAT_HOME}" ]; then

#  echo "Warning: $HCAT_HOME does not exist! HCatalog jobs will fail."

#  echo 'Please set $HCAT_HOME to the root of your HCatalog installation.'

#fi

#if [ ! -d "${ACCUMULO_HOME}" ]; then

#  echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."

#  echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'

#fi

if [ ! -d "${ZOOKEEPER_HOME}" ]; then

  echo "Warning: $ZOOKEEPER_HOME does not exist! Accumulo imports will fail."

  echo 'Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.'

fi

# Where to find the main Sqoop jar

SQOOP_JAR_DIR=$SQOOP_HOME

# If there's a "build" subdir, override with this, so we use

# the newly-compiled copy.

if [ -d "$SQOOP_JAR_DIR/build" ]; then

  SQOOP_JAR_DIR="${SQOOP_JAR_DIR}/build"

fi

function add_to_classpath() {

  dir=$1

  for f in $dir/*.jar; do

    SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:$f;

  done

  export SQOOP_CLASSPATH

}

# Add sqoop dependencies to classpath.

SQOOP_CLASSPATH=""

if [ -d "$SQOOP_HOME/lib" ]; then

  add_to_classpath $SQOOP_HOME/lib

fi

# Add HBase to dependency list

if [ -e "$HBASE_HOME/bin/hbase" ]; then

  TMP_SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:`$HBASE_HOME/bin/hbase classpath`

  SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}

fi

# Add HCatalog to dependency list

if [ -e "${HCAT_HOME}/bin/hcat" ]; then

  TMP_SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:`${HCAT_HOME}/bin/hcat -classpath`

  if [ -z "${HIVE_CONF_DIR}" ]; then

    TMP_SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}:${HIVE_CONF_DIR}

  fi

  SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}

fi

# Add Accumulo to dependency list

if [ -e "$ACCUMULO_HOME/bin/accumulo" ]; then

  for jn in `$ACCUMULO_HOME/bin/accumulo classpath | grep file:.*accumulo.*jar | cut -d':' -f2`; do

    SQOOP_CLASSPATH=$SQOOP_CLASSPATH:$jn

  done

  for jn in `$ACCUMULO_HOME/bin/accumulo classpath | grep file:.*zookeeper.*jar | cut -d':' -f2`; do

    SQOOP_CLASSPATH=$SQOOP_CLASSPATH:$jn

  done

fi

ZOOCFGDIR=${ZOOCFGDIR:-/etc/zookeeper}

if [ -d "${ZOOCFGDIR}" ]; then

  SQOOP_CLASSPATH=$ZOOCFGDIR:$SQOOP_CLASSPATH

fi

SQOOP_CLASSPATH=${SQOOP_CONF_DIR}:${SQOOP_CLASSPATH}

# If there's a build subdir, use Ivy-retrieved dependencies too.

if [ -d "$SQOOP_HOME/build/ivy/lib/sqoop" ]; then

  for f in $SQOOP_HOME/build/ivy/lib/sqoop/*/*.jar; do

    SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:$f;

  done

fi

add_to_classpath ${SQOOP_JAR_DIR}

HADOOP_CLASSPATH="${SQOOP_CLASSPATH}:${HADOOP_CLASSPATH}"

if [ ! -z "$SQOOP_USER_CLASSPATH" ]; then

  # User has elements to prepend to the classpath, forcibly overriding

  # Sqoop's own lib directories.

  export HADOOP_CLASSPATH="${SQOOP_USER_CLASSPATH}:${HADOOP_CLASSPATH}"

fi

export SQOOP_CLASSPATH

export SQOOP_CONF_DIR

export SQOOP_JAR_DIR

export HADOOP_CLASSPATH

export HADOOP_COMMON_HOME

export HADOOP_MAPRED_HOME

export HBASE_HOME

export HCAT_HOME

export HIVE_CONF_DIR

export ACCUMULO_HOME

export ZOOKEEPER_HOME

3.2、查看 Sqoop 版本

[root@c0 _src]# sqoop version

2019-03-11 22:30:16,837 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7

Sqoop 1.4.7

git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8

Compiled by maugli on Thu Dec 21 15:59:58 STD 2017

4、启动和测试 Sqoop 的数据导入、导出

4.1、Sqoop 通过 Hive 导入数据到 Sqoop

在 mysql 中创建数据库 testmshk 并授权给 root 用户，同时创建 hive2mysql_mshk 表

[root@c0 _src]# mysql -uroot -p123456

mysql: [Warning] Using a password on the command line interface can be insecure.

Welcome to the MySQL monitor.  Commands end with ; or \g.

Your MySQL connection id is 21

Server version: 5.7.25 MySQL Community Server (GPL)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its

affiliates. Other names may be trademarks of their respective

owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE DATABASE testmshk DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

Query OK, 1 row affected (0.00 sec)

mysql> show databases;

+--------------------+

| Database           |

+--------------------+

| information_schema |

| hive               |

| mysql              |

| performance_schema |

| sys                |

| testmshk           |

+--------------------+

6 rows in set (0.02 sec)

mysql> grant select,insert,update,delete,create on testmshk.* to root;

Query OK, 0 rows affected (0.01 sec)

mysql> flush  privileges;

Query OK, 0 rows affected (0.01 sec)

mysql> use testmshk;

Database changed

mysql> create table hive2mysql_mshk(id int,namea varchar(50),nameb varchar(50));

Query OK, 0 rows affected (0.02 sec)

mysql> quit;

Bye

通过 Sqoop 查询 Mysql 中表的内容，这时可以看到表中的内容是空的

[root@c0 ~]# sqoop eval --connect  jdbc:mysql://c0:3306/testmshk?useSSL=false  --username root --password 123456 --query "select * from hive2mysql_mshk"

2019-03-11 23:44:06,894 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7

2019-03-11 23:44:06,945 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

2019-03-11 23:44:07,100 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

-------------------------------------------------------------

| id          | namea                | nameb                |

-------------------------------------------------------------

-------------------------------------------------------------

在<Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建>一文中，我们在测试 Hive 时创建了测试数据 /hive/warehouse/testtable/testdata001.dat 我们将这个数据，导入到 Mysql

[root@c0 ~]# sqoop export --connect  jdbc:mysql://c0:3306/testmshk?useSSL=false  --username root --password 123456 --table hive2mysql_mshk  --export-dir /hive/warehouse/testtable/testdata001.dat  --input-fields-terminated-by ','

2019-03-11 23:47:10,400 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7

2019-03-11 23:47:10,437 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

2019-03-11 23:47:10,571 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

2019-03-11 23:47:10,574 INFO tool.CodeGenTool: Beginning code generation

2019-03-11 23:47:10,914 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1

2019-03-11 23:47:10,943 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1

2019-03-11 23:47:10,952 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/work/_app/hadoop-3.1.2

Note: /tmp/sqoop-root/compile/9ea7f54fe87f35ed071ed75c293f25d8/hive2mysql_mshk.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

2019-03-11 23:47:12,652 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/9ea7f54fe87f35ed071ed75c293f25d8/hive2mysql_mshk.jar

2019-03-11 23:47:12,669 INFO mapreduce.ExportJobBase: Beginning export of hive2mysql_mshk

2019-03-11 23:47:12,669 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2019-03-11 23:47:12,804 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar

2019-03-11 23:47:14,106 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative

2019-03-11 23:47:14,112 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative

2019-03-11 23:47:14,112 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps

2019-03-11 23:47:14,479 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2

2019-03-11 23:47:14,808 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1552315366846_0003

2019-03-11 23:47:16,429 INFO input.FileInputFormat: Total input files to process : 1

2019-03-11 23:47:16,432 INFO input.FileInputFormat: Total input files to process : 1

2019-03-11 23:47:16,513 INFO mapreduce.JobSubmitter: number of splits:4

2019-03-11 23:47:16,577 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative

2019-03-11 23:47:16,684 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552315366846_0003

2019-03-11 23:47:16,686 INFO mapreduce.JobSubmitter: Executing with tokens: []

2019-03-11 23:47:16,924 INFO conf.Configuration: resource-types.xml not found

2019-03-11 23:47:16,924 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.

2019-03-11 23:47:17,213 INFO impl.YarnClientImpl: Submitted application application_1552315366846_0003

2019-03-11 23:47:17,261 INFO mapreduce.Job: The url to track the job: http://c1:8088/proxy/application_1552315366846_0003/

2019-03-11 23:47:17,262 INFO mapreduce.Job: Running job: job_1552315366846_0003

2019-03-11 23:47:23,359 INFO mapreduce.Job: Job job_1552315366846_0003 running in uber mode : false

2019-03-11 23:47:23,360 INFO mapreduce.Job:  map 0% reduce 0%

2019-03-11 23:47:31,454 INFO mapreduce.Job:  map 75% reduce 0%

2019-03-11 23:47:32,462 INFO mapreduce.Job:  map 100% reduce 0%

2019-03-11 23:47:32,473 INFO mapreduce.Job: Job job_1552315366846_0003 completed successfully

2019-03-11 23:47:32,619 INFO mapreduce.Job: Counters: 32

	File System Counters

		FILE: Number of bytes read=0

		FILE: Number of bytes written=913424

		FILE: Number of read operations=0

		FILE: Number of large read operations=0

		FILE: Number of write operations=0

		HDFS: Number of bytes read=782

		HDFS: Number of bytes written=0

		HDFS: Number of read operations=19

		HDFS: Number of large read operations=0

		HDFS: Number of write operations=0

	Job Counters

		Launched map tasks=4

		Data-local map tasks=4

		Total time spent by all maps in occupied slots (ms)=23446

		Total time spent by all reduces in occupied slots (ms)=0

		Total time spent by all map tasks (ms)=23446

		Total vcore-milliseconds taken by all map tasks=23446

		Total megabyte-milliseconds taken by all map tasks=24008704

	Map-Reduce Framework

		Map input records=2

		Map output records=2

		Input split bytes=636

		Spilled Records=0

		Failed Shuffles=0

		Merged Map outputs=0

		GC time elapsed (ms)=582

		CPU time spent (ms)=3960

		Physical memory (bytes) snapshot=830259200

		Virtual memory (bytes) snapshot=11165683712

		Total committed heap usage (bytes)=454557696

		Peak Map Physical memory (bytes)=208502784

		Peak Map Virtual memory (bytes)=2793611264

	File Input Format Counters

		Bytes Read=0

	File Output Format Counters

		Bytes Written=0

2019-03-11 23:47:32,626 INFO mapreduce.ExportJobBase: Transferred 782 bytes in 18.5015 seconds (42.2668 bytes/sec)

2019-03-11 23:47:32,629 INFO mapreduce.ExportJobBase: Exported 2 records.

--export-dir 表示在 HDFS 对应的 Hive 数据库文件位置

–input-fields-terminated-by 表示要处理的间隔符

再次通过 Sqoop 查看 MySql 中的内容，可以看到数据已经成功导入

[root@c0 ~]# sqoop eval --connect  jdbc:mysql://c0:3306/testmshk?useSSL=false  --username root --password 123456 --query "select * from hive2mysql_mshk"

2019-03-11 23:48:56,848 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7

2019-03-11 23:48:56,884 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

2019-03-11 23:48:57,024 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

-------------------------------------------------------------

| id          | namea                | nameb                |

-------------------------------------------------------------

| 10086       | my.mshk.top          | you.mshk.top         |

| 12306       | mname                | yname                |

-------------------------------------------------------------

在 MySql 中能够看到我们创建的 hive2mysql_mshk 表有2行数据

通过 Sqoop1.4.7 将 Mysql5.7、Hive2.3.4、Hbase1.4.9 之间的数据导入导出

4.2、Sqoop 通过 MySql 导入数据到 Hive

刚刚我们创建的 hive2mysql_mshk 表没有任何主键，我们只是从 Hive 中添加了一些记录到 Mysql。

默认情况下，Sqoop 将识别表中的主键列（如果存在）并将其用作拆分列。

从数据库中检索拆分列的低值和高值，并且映射任务在总范围的大小均匀的组件上运行。

如果主键的实际值在其范围内不均匀分布，则可能导致任务不平衡。

您应该使用 --split-by 参数明确选择不同的列。例如-- split-by id。

在将 MySql 的数据导入到 Hive中的 Sqoop 命令添加了更多参数：

sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id  --table hive2mysql_mshk --target-dir /hive/warehouse/mysql2hive_mshk  --fields-terminated-by "," --hive-import --hive-table testmshk.mysql2hive_mshk

--split-by <column-name> 用哪个列来拆分

--table 告诉计算机您要从MySQL导入哪个表

--target-dir <dir> HDFS要存储的目录

--hive-import 将表导入Hive

--hive-overwrite 覆盖Hive表中的现有数据

--hive-table <table-name> 设置导入Hive时要使用的表名

--fields-terminated-by <char> 设置字段分隔符

接下来 Sqoop 的操作是一个 map-reduce 工作。

[root@c0 _src]# sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id  --table hive2mysql_mshk --target-dir /hive/warehouse/mysql2hive_mshk  --fields-terminated-by "," --hive-import --hive-table testmshk.mysql2hive_mshk --hive-overwrite

2019-03-12 20:21:05,060 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7

2019-03-12 20:21:05,137 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

2019-03-12 20:21:05,337 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

2019-03-12 20:21:05,348 INFO tool.CodeGenTool: Beginning code generation

2019-03-12 20:21:05,785 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1

2019-03-12 20:21:05,821 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1

2019-03-12 20:21:05,831 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/work/_app/hadoop-3.1.2

Note: /tmp/sqoop-root/compile/202a5bda3950a7ccc6782f3cfcc3a99d/hive2mysql_mshk.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

2019-03-12 20:21:08,747 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/202a5bda3950a7ccc6782f3cfcc3a99d/hive2mysql_mshk.jar

2019-03-12 20:21:08,761 WARN manager.MySQLManager: It looks like you are importing from mysql.

2019-03-12 20:21:08,761 WARN manager.MySQLManager: This transfer can be faster! Use the --direct

2019-03-12 20:21:08,761 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.

2019-03-12 20:21:08,762 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)

2019-03-12 20:21:08,764 INFO mapreduce.ImportJobBase: Beginning import of hive2mysql_mshk

2019-03-12 20:21:08,765 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2019-03-12 20:21:08,928 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar

2019-03-12 20:21:09,656 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps

2019-03-12 20:21:10,332 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2

2019-03-12 20:21:10,688 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1552315366846_0011

2019-03-12 20:21:12,618 INFO db.DBInputFormat: Using read commited transaction isolation

2019-03-12 20:21:12,619 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `hive2mysql_mshk`

2019-03-12 20:21:12,622 INFO db.IntegerSplitter: Split size: 555; Num splits: 4 from: 10086 to: 12306

2019-03-12 20:21:12,696 INFO mapreduce.JobSubmitter: number of splits:4

2019-03-12 20:21:13,137 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552315366846_0011

2019-03-12 20:21:13,140 INFO mapreduce.JobSubmitter: Executing with tokens: []

2019-03-12 20:21:13,443 INFO conf.Configuration: resource-types.xml not found

2019-03-12 20:21:13,443 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.

2019-03-12 20:21:13,533 INFO impl.YarnClientImpl: Submitted application application_1552315366846_0011

2019-03-12 20:21:13,593 INFO mapreduce.Job: The url to track the job: http://c1:8088/proxy/application_1552315366846_0011/

2019-03-12 20:21:13,594 INFO mapreduce.Job: Running job: job_1552315366846_0011

2019-03-12 20:21:20,705 INFO mapreduce.Job: Job job_1552315366846_0011 running in uber mode : false

2019-03-12 20:21:20,727 INFO mapreduce.Job:  map 0% reduce 0%

2019-03-12 20:21:29,927 INFO mapreduce.Job:  map 50% reduce 0%

2019-03-12 20:21:29,930 INFO mapreduce.Job: Task Id : attempt_1552315366846_0011_m_000000_0, Status : FAILED

[2019-03-12 20:21:28.236]Container [pid=19941,containerID=container_e15_1552315366846_0011_01_000002] is running 539445760B beyond the 'VIRTUAL' memory limit. Current usage: 199.9 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used. Killing container.

Dump of the process-tree for container_e15_1552315366846_0011_01_000002 :

	|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE

	|- 20026 19941 19941 19941 (java) 950 81 2678403072 50861 /opt/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx820m -Djava.io.tmpdir=/home/work/_data/hadoop-3.1.2/nm-local-dir/usercache/root/appcache/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.11.32 41274 attempt_1552315366846_0011_m_000000_0 16492674416642

	|- 19941 19939 19941 19941 (bash) 1 2 115900416 307 /bin/bash -c /opt/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN   -Xmx820m -Djava.io.tmpdir=/home/work/_data/hadoop-3.1.2/nm-local-dir/usercache/root/appcache/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.11.32 41274 attempt_1552315366846_0011_m_000000_0 16492674416642 1>/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/stdout 2>/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/stderr

[2019-03-12 20:21:28.324]Container killed on request. Exit code is 143

[2019-03-12 20:21:28.335]Container exited with a non-zero exit code 143.

2019-03-12 20:21:30,978 INFO mapreduce.Job:  map 75% reduce 0%

2019-03-12 20:21:37,021 INFO mapreduce.Job:  map 100% reduce 0%

2019-03-12 20:21:37,032 INFO mapreduce.Job: Job job_1552315366846_0011 completed successfully

2019-03-12 20:21:37,145 INFO mapreduce.Job: Counters: 33

	File System Counters

		FILE: Number of bytes read=0

		FILE: Number of bytes written=915840

		FILE: Number of read operations=0

		FILE: Number of large read operations=0

		FILE: Number of write operations=0

		HDFS: Number of bytes read=425

		HDFS: Number of bytes written=49

		HDFS: Number of read operations=24

		HDFS: Number of large read operations=0

		HDFS: Number of write operations=8

	Job Counters

		Failed map tasks=1

		Launched map tasks=5

		Other local map tasks=5

		Total time spent by all maps in occupied slots (ms)=31981

		Total time spent by all reduces in occupied slots (ms)=0

		Total time spent by all map tasks (ms)=31981

		Total vcore-milliseconds taken by all map tasks=31981

		Total megabyte-milliseconds taken by all map tasks=32748544

	Map-Reduce Framework

		Map input records=2

		Map output records=2

		Input split bytes=425

		Spilled Records=0

		Failed Shuffles=0

		Merged Map outputs=0

		GC time elapsed (ms)=318

		CPU time spent (ms)=6520

		Physical memory (bytes) snapshot=815542272

		Virtual memory (bytes) snapshot=11174408192

		Total committed heap usage (bytes)=437780480

		Peak Map Physical memory (bytes)=206934016

		Peak Map Virtual memory (bytes)=2795565056

	File Input Format Counters

		Bytes Read=0

	File Output Format Counters

		Bytes Written=49

2019-03-12 20:21:37,154 INFO mapreduce.ImportJobBase: Transferred 49 bytes in 27.4776 seconds (1.7833 bytes/sec)

2019-03-12 20:21:37,159 INFO mapreduce.ImportJobBase: Retrieved 2 records.

2019-03-12 20:21:37,159 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table hive2mysql_mshk

2019-03-12 20:21:37,188 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1

2019-03-12 20:21:37,208 INFO hive.HiveImport: Loading uploaded data into Hive

2019-03-12 20:21:37,220 INFO conf.HiveConf: Found configuration file file:/home/work/_app/hive-2.3.4/conf/hive-site.xml

2019-03-12 20:21:49,491 INFO hive.HiveImport:

2019-03-12 20:21:49,492 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true

2019-03-12 20:21:56,558 INFO hive.HiveImport: OK

2019-03-12 20:21:56,561 INFO hive.HiveImport: Time taken: 5.954 seconds

2019-03-12 20:21:57,005 INFO hive.HiveImport: Loading data to table testmshk.mysql2hive_mshk

2019-03-12 20:21:58,181 INFO hive.HiveImport: OK

2019-03-12 20:21:58,181 INFO hive.HiveImport: Time taken: 1.619 seconds

2019-03-12 20:21:58,681 INFO hive.HiveImport: Hive import complete.

最后，让我们验证 Hive 中的输出：

[root@c0 ~]# hive

Logging initialized using configuration in jar:file:/home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true

Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

hive> show databases;

OK

default

testmshk

Time taken: 3.512 seconds, Fetched: 2 row(s)

hive> use testmshk;

OK

Time taken: 0.032 seconds

hive> show tables;

OK

mysql2hive_mshk

testtable

Time taken: 0.083 seconds, Fetched: 2 row(s)

hive> select * from mysql2hive_mshk;

OK

10086	my.mshk.top	you.mshk.top

12306	mname	yname

Time taken: 1.634 seconds, Fetched: 2 row(s)

hive> quit;

同时我们在 HDFS 中也可以看到创建的数据：

通过 Sqoop1.4.7 将 Mysql5.7、Hive2.3.4、Hbase1.4.9 之间的数据导入导出

4.3、Sqoop 通过 MySql 导入数据到 Hbase

接下来我们将 MySql 中的表 hive2mysql_mshk 数据，导入到 Hbase ，同时在 Hbase 中创建表 mysql2hase_mshk

[root@c0 ~]# sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id --table hive2mysql_mshk --hbase-table mysql2hase_mshk --hbase-create-table --hbase-row-key id --column-family id

2019-03-13 12:04:33,647 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7

2019-03-13 12:04:33,694 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

2019-03-13 12:04:33,841 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

2019-03-13 12:04:33,841 INFO tool.CodeGenTool: Beginning code generation

2019-03-13 12:04:34,162 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1

2019-03-13 12:04:34,197 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1

...

2019-03-13 12:05:13,782 INFO mapreduce.Job:  map 75% reduce 0%

2019-03-13 12:05:15,813 INFO mapreduce.Job:  map 100% reduce 0%

2019-03-13 12:05:16,827 INFO mapreduce.Job: Job job_1552397454797_0002 completed successfully

2019-03-13 12:05:16,942 INFO mapreduce.Job: Counters: 33

	File System Counters

		FILE: Number of bytes read=0

		FILE: Number of bytes written=1041632

		FILE: Number of read operations=0

		FILE: Number of large read operations=0

		FILE: Number of write operations=0

		HDFS: Number of bytes read=425

		HDFS: Number of bytes written=0

		HDFS: Number of read operations=4

		HDFS: Number of large read operations=0

		HDFS: Number of write operations=0

	Job Counters

		Failed map tasks=5

		Launched map tasks=9

		Other local map tasks=9

		Total time spent by all maps in occupied slots (ms)=68882

		Total time spent by all reduces in occupied slots (ms)=0

		Total time spent by all map tasks (ms)=68882

		Total vcore-milliseconds taken by all map tasks=68882

		Total megabyte-milliseconds taken by all map tasks=70535168

	Map-Reduce Framework

		Map input records=2

		Map output records=2

		Input split bytes=425

		Spilled Records=0

		Failed Shuffles=0

		Merged Map outputs=0

		GC time elapsed (ms)=801

		CPU time spent (ms)=15480

		Physical memory (bytes) snapshot=1097326592

		Virtual memory (bytes) snapshot=11271196672

		Total committed heap usage (bytes)=629669888

		Peak Map Physical memory (bytes)=295751680

		Peak Map Virtual memory (bytes)=2828283904

	File Input Format Counters

		Bytes Read=0

	File Output Format Counters

		Bytes Written=0

2019-03-13 12:05:16,949 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 38.224 seconds (0 bytes/sec)

2019-03-13 12:05:16,954 INFO mapreduce.ImportJobBase: Retrieved 2 records.

这时，我们再用 shell 测试连接 Hbase ,查看我们刚刚导入的数据，能够看到 mysql2hase_mshk 已经存在，并且可以获取其中的数据

[root@c0 ~]# hbase shell

HBase Shell

Use "help" to get list of supported commands.

Use "exit" to quit this interactive shell.

Version 1.4.9, rd625b212e46d01cb17db9ac2e9e927fdb201afa1, Wed Dec  5 11:54:10 PST 2018

hbase(main):001:0> list

TABLE

mysql2hase_mshk

1 row(s) in 0.1870 seconds

=> ["mysql2hase_mshk"]

hbase(main):002:0> scan 'mysql2hase_mshk'

ROW                                                        COLUMN+CELL

 10086                                                     column=id:namea, timestamp=1552449912494, value=my.mshk.top

 10086                                                     column=id:nameb, timestamp=1552449912494, value=you.mshk.top

 12306                                                     column=id:namea, timestamp=1552449906986, value=mname

 12306                                                     column=id:nameb, timestamp=1552449906986, value=yname

2 row(s) in 0.1330 seconds

hbase(main):003:0> hbase(main):003:0> get 'mysql2hase_mshk','10086'

COLUMN                                                     CELL

 id:namea                                                  timestamp=1552449912494, value=my.mshk.top

 id:nameb                                                  timestamp=1552449912494, value=you.mshk.top

1 row(s) in 0.0230 seconds

hbase(main):004:0>

如何在 Hbase 和 Hive 中互相导入、导出数据，请参考文章：Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建中的 9.2.4 和 9.2.5 章节。

希望您发现它很有用，感谢您的支持和阅读我的博客。

博文作者：迦壹

博客地址：通过 Sqoop1.4.7 将 Mysql5.7、Hive2.3.4、Hbase1.4.9 之间的数据导入导出

转载声明：可以转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明，谢谢合作！

假设您认为这篇文章对您有帮助，可以通过以下方式进行捐赠，谢谢！

通过 Sqoop1.4.7 将 Mysql5.7、Hive2.3.4、Hbase1.4.9 之间的数据导入导出

比特币地址：1KdgydfKMcFVpicj5w4vyn3T88dwjBst6Y

以太坊地址：0xbB0a92d634D7b9Ac69079ed0e521CC2e0a97c420