目录
1、什么是 Sqoop?
Sqoop
是一种用于在 Hadoop
和关系数据库或大型机之间传输数据的工具。
您可以使用 Sqoop
将数据从关系数据库管理系统RDBMS
(如 MySQL
或 Oracle
)导入 Hadoop
分布式文件系统 HDFS
,转换 Hadoop MapReduce
中的数据,然后将数据导出回 RDBMS
。
Sqoop
自动执行此过程的大部分过程,依靠数据库来描述要导入的数据的模式。Sqoop
使用 MapReduce
导入和导出数据,提供并行操作和容错。
2、下载应用程序及配置环境变量
2.1、下载 Sqoop 1.4.7
通过以下命令下载 Sqoop
,解压后,放到/home/work/_app/
目录中:
[root@c0 _src]# pwd
/home/work/_src
[root@c0 _src]# wget http://mirrors.shu.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[root@c0 _src]# tar -xzvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[root@c0 _src]# mv sqoop-1.4.7.bin__hadoop-2.6.0 /home/work/_app/
2.2、设置环境变量
在每一台机器上设置 Sqoop
环境变量,运行以下命令
echo "" >> /etc/bashrc
echo "# Sqoop 1.4.7" >> /etc/bashrc
echo "export SQOOP_HOME=/home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0" >> /etc/bashrc
echo "" >> /etc/bashrc
echo "# Path" >> /etc/bashrc
echo "export PATH=\$PATH:\$SQOOP_HOME/bin" >> /etc/bashrc
source /etc/bashrc
2.3、设置安装所需环境
安装和运行 Sqoop
需要用到 Hive
、MySql
、Hadoop
环境。可以参考文章:Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建
3、安装 Sqoop 1.4.7
3.1、修改 Sqoop 配置文件
3.1.1、修改配置文件 sqoop-env.sh
创建 /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/conf/sqoop-env.sh
文件编辑并保存,内容为空,因为我们在本文的配置环境变量章节中已经配置了环境变量,同时也在<Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建>一文中配置了 Hive
和 Hadoop
环境变量:
[root@c0 ~]# echo "" > /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/conf/sqoop-env.sh
将 /home/work/_app/hive-2.3.4/lib/
目录下的 hive-hcatalog-core-2.3.4.jar
、mysql-connector-java-5.1.47-bin.jar
、hive-common-2.3.4.jar
、libthrift-0.9.3.jar
文件,复制到/home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/lib
目录下
[root@c0 ~]# cp /home/work/_app/hive-2.3.4/lib/hive-hcatalog-core-2.3.4.jar /home/work/_app/hive-2.3.4/lib/mysql-connector-java-5.1.47-bin.jar /home/work/_app/hive-2.3.4/lib/libthrift-0.9.3.jar /home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/lib/
3.1.2、修改配置文件 configure-sqoop
编辑 /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/bin/configure-sqoop
文件并保存,内容如下:
[root@c0 _src]# cat /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/bin/configure-sqoop
#!/bin/bash
#
# Copyright 2011 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This is sourced in by bin/sqoop to set environment variables prior to
# invoking Hadoop.
bin="$1"
if [ -z "${bin}" ]; then
bin=`dirname $0`
bin=`cd ${bin} && pwd`
fi
if [ -z "$SQOOP_HOME" ]; then
export SQOOP_HOME=${bin}/..
fi
SQOOP_CONF_DIR=${SQOOP_CONF_DIR:-${SQOOP_HOME}/conf}
if [ -f "${SQOOP_CONF_DIR}/sqoop-env.sh" ]; then
. "${SQOOP_CONF_DIR}/sqoop-env.sh"
fi
# Find paths to our dependency systems. If they are unset, use CDH defaults.
if [ -z "${HADOOP_COMMON_HOME}" ]; then
if [ -n "${HADOOP_HOME}" ]; then
HADOOP_COMMON_HOME=${HADOOP_HOME}
else
if [ -d "/usr/lib/hadoop" ]; then
HADOOP_COMMON_HOME=/usr/lib/hadoop
else
HADOOP_COMMON_HOME=${SQOOP_HOME}/../hadoop
fi
fi
fi
if [ -z "${HADOOP_MAPRED_HOME}" ]; then
HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
if [ ! -d "${HADOOP_MAPRED_HOME}" ]; then
if [ -n "${HADOOP_HOME}" ]; then
HADOOP_MAPRED_HOME=${HADOOP_HOME}
else
HADOOP_MAPRED_HOME=${SQOOP_HOME}/../hadoop-mapreduce
fi
fi
fi
# We are setting HADOOP_HOME to HADOOP_COMMON_HOME if it is not set
# so that hcat script works correctly on BigTop
if [ -z "${HADOOP_HOME}" ]; then
if [ -n "${HADOOP_COMMON_HOME}" ]; then
HADOOP_HOME=${HADOOP_COMMON_HOME}
export HADOOP_HOME
fi
fi
if [ -z "${HBASE_HOME}" ]; then
if [ -d "/usr/lib/hbase" ]; then
HBASE_HOME=/usr/lib/hbase
else
HBASE_HOME=${SQOOP_HOME}/../hbase
fi
fi
#if [ -z "${HCAT_HOME}" ]; then
# if [ -d "/usr/lib/hive-hcatalog" ]; then
# HCAT_HOME=/usr/lib/hive-hcatalog
# elif [ -d "/usr/lib/hcatalog" ]; then
# HCAT_HOME=/usr/lib/hcatalog
# else
# HCAT_HOME=${SQOOP_HOME}/../hive-hcatalog
# if [ ! -d ${HCAT_HOME} ]; then
# HCAT_HOME=${SQOOP_HOME}/../hcatalog
# fi
# fi
#fi
#if [ -z "${ACCUMULO_HOME}" ]; then
# if [ -d "/usr/lib/accumulo" ]; then
# ACCUMULO_HOME=/usr/lib/accumulo
# else
# ACCUMULO_HOME=${SQOOP_HOME}/../accumulo
# fi
#fi
if [ -z "${ZOOKEEPER_HOME}" ]; then
if [ -d "/usr/lib/zookeeper" ]; then
ZOOKEEPER_HOME=/usr/lib/zookeeper
else
ZOOKEEPER_HOME=${SQOOP_HOME}/../zookeeper
fi
fi
if [ -z "${HIVE_HOME}" ]; then
if [ -d "/usr/lib/hive" ]; then
export HIVE_HOME=/usr/lib/hive
elif [ -d ${SQOOP_HOME}/../hive ]; then
export HIVE_HOME=${SQOOP_HOME}/../hive
fi
fi
# Check: If we can't find our dependencies, give up here.
if [ ! -d "${HADOOP_COMMON_HOME}" ]; then
echo "Error: $HADOOP_COMMON_HOME does not exist!"
echo 'Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.'
exit 1
fi
if [ ! -d "${HADOOP_MAPRED_HOME}" ]; then
echo "Error: $HADOOP_MAPRED_HOME does not exist!"
echo 'Please set $HADOOP_MAPRED_HOME to the root of your Hadoop MapReduce installation.'
exit 1
fi
## Moved to be a runtime check in sqoop.
if [ ! -d "${HBASE_HOME}" ]; then
echo "Warning: $HBASE_HOME does not exist! HBase imports will fail."
echo 'Please set $HBASE_HOME to the root of your HBase installation.'
fi
## Moved to be a runtime check in sqoop.
#if [ ! -d "${HCAT_HOME}" ]; then
# echo "Warning: $HCAT_HOME does not exist! HCatalog jobs will fail."
# echo 'Please set $HCAT_HOME to the root of your HCatalog installation.'
#fi
#if [ ! -d "${ACCUMULO_HOME}" ]; then
# echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."
# echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'
#fi
if [ ! -d "${ZOOKEEPER_HOME}" ]; then
echo "Warning: $ZOOKEEPER_HOME does not exist! Accumulo imports will fail."
echo 'Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.'
fi
# Where to find the main Sqoop jar
SQOOP_JAR_DIR=$SQOOP_HOME
# If there's a "build" subdir, override with this, so we use
# the newly-compiled copy.
if [ -d "$SQOOP_JAR_DIR/build" ]; then
SQOOP_JAR_DIR="${SQOOP_JAR_DIR}/build"
fi
function add_to_classpath() {
dir=$1
for f in $dir/*.jar; do
SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:$f;
done
export SQOOP_CLASSPATH
}
# Add sqoop dependencies to classpath.
SQOOP_CLASSPATH=""
if [ -d "$SQOOP_HOME/lib" ]; then
add_to_classpath $SQOOP_HOME/lib
fi
# Add HBase to dependency list
if [ -e "$HBASE_HOME/bin/hbase" ]; then
TMP_SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:`$HBASE_HOME/bin/hbase classpath`
SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}
fi
# Add HCatalog to dependency list
if [ -e "${HCAT_HOME}/bin/hcat" ]; then
TMP_SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:`${HCAT_HOME}/bin/hcat -classpath`
if [ -z "${HIVE_CONF_DIR}" ]; then
TMP_SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}:${HIVE_CONF_DIR}
fi
SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}
fi
# Add Accumulo to dependency list
if [ -e "$ACCUMULO_HOME/bin/accumulo" ]; then
for jn in `$ACCUMULO_HOME/bin/accumulo classpath | grep file:.*accumulo.*jar | cut -d':' -f2`; do
SQOOP_CLASSPATH=$SQOOP_CLASSPATH:$jn
done
for jn in `$ACCUMULO_HOME/bin/accumulo classpath | grep file:.*zookeeper.*jar | cut -d':' -f2`; do
SQOOP_CLASSPATH=$SQOOP_CLASSPATH:$jn
done
fi
ZOOCFGDIR=${ZOOCFGDIR:-/etc/zookeeper}
if [ -d "${ZOOCFGDIR}" ]; then
SQOOP_CLASSPATH=$ZOOCFGDIR:$SQOOP_CLASSPATH
fi
SQOOP_CLASSPATH=${SQOOP_CONF_DIR}:${SQOOP_CLASSPATH}
# If there's a build subdir, use Ivy-retrieved dependencies too.
if [ -d "$SQOOP_HOME/build/ivy/lib/sqoop" ]; then
for f in $SQOOP_HOME/build/ivy/lib/sqoop/*/*.jar; do
SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:$f;
done
fi
add_to_classpath ${SQOOP_JAR_DIR}
HADOOP_CLASSPATH="${SQOOP_CLASSPATH}:${HADOOP_CLASSPATH}"
if [ ! -z "$SQOOP_USER_CLASSPATH" ]; then
# User has elements to prepend to the classpath, forcibly overriding
# Sqoop's own lib directories.
export HADOOP_CLASSPATH="${SQOOP_USER_CLASSPATH}:${HADOOP_CLASSPATH}"
fi
export SQOOP_CLASSPATH
export SQOOP_CONF_DIR
export SQOOP_JAR_DIR
export HADOOP_CLASSPATH
export HADOOP_COMMON_HOME
export HADOOP_MAPRED_HOME
export HBASE_HOME
export HCAT_HOME
export HIVE_CONF_DIR
export ACCUMULO_HOME
export ZOOKEEPER_HOME
3.2、查看 Sqoop 版本
[root@c0 _src]# sqoop version
2019-03-11 22:30:16,837 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017
4、启动和测试 Sqoop 的数据导入、导出
4.1、Sqoop 通过 Hive 导入数据到 Sqoop
在 mysql
中创建数据库 testmshk
并授权给 root
用户,同时创建 hive2mysql_mshk
表
[root@c0 _src]# mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 21
Server version: 5.7.25 MySQL Community Server (GPL)
Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE DATABASE testmshk DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| hive |
| mysql |
| performance_schema |
| sys |
| testmshk |
+--------------------+
6 rows in set (0.02 sec)
mysql> grant select,insert,update,delete,create on testmshk.* to root;
Query OK, 0 rows affected (0.01 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.01 sec)
mysql> use testmshk;
Database changed
mysql> create table hive2mysql_mshk(id int,namea varchar(50),nameb varchar(50));
Query OK, 0 rows affected (0.02 sec)
mysql> quit;
Bye
通过 Sqoop
查询 Mysql
中表的内容,这时可以看到表中的内容是空的
[root@c0 ~]# sqoop eval --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --query "select * from hive2mysql_mshk"
2019-03-11 23:44:06,894 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-11 23:44:06,945 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-11 23:44:07,100 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
-------------------------------------------------------------
| id | namea | nameb |
-------------------------------------------------------------
-------------------------------------------------------------
在<Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建>一文中,我们在测试 Hive
时创建了测试数据 /hive/warehouse/testtable/testdata001.dat
我们将这个数据,导入到 Mysql
[root@c0 ~]# sqoop export --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --table hive2mysql_mshk --export-dir /hive/warehouse/testtable/testdata001.dat --input-fields-terminated-by ','
2019-03-11 23:47:10,400 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-11 23:47:10,437 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-11 23:47:10,571 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-11 23:47:10,574 INFO tool.CodeGenTool: Beginning code generation
2019-03-11 23:47:10,914 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-11 23:47:10,943 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-11 23:47:10,952 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/work/_app/hadoop-3.1.2
Note: /tmp/sqoop-root/compile/9ea7f54fe87f35ed071ed75c293f25d8/hive2mysql_mshk.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2019-03-11 23:47:12,652 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/9ea7f54fe87f35ed071ed75c293f25d8/hive2mysql_mshk.jar
2019-03-11 23:47:12,669 INFO mapreduce.ExportJobBase: Beginning export of hive2mysql_mshk
2019-03-11 23:47:12,669 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2019-03-11 23:47:12,804 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2019-03-11 23:47:14,106 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
2019-03-11 23:47:14,112 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
2019-03-11 23:47:14,112 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2019-03-11 23:47:14,479 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2019-03-11 23:47:14,808 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1552315366846_0003
2019-03-11 23:47:16,429 INFO input.FileInputFormat: Total input files to process : 1
2019-03-11 23:47:16,432 INFO input.FileInputFormat: Total input files to process : 1
2019-03-11 23:47:16,513 INFO mapreduce.JobSubmitter: number of splits:4
2019-03-11 23:47:16,577 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
2019-03-11 23:47:16,684 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552315366846_0003
2019-03-11 23:47:16,686 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-03-11 23:47:16,924 INFO conf.Configuration: resource-types.xml not found
2019-03-11 23:47:16,924 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-03-11 23:47:17,213 INFO impl.YarnClientImpl: Submitted application application_1552315366846_0003
2019-03-11 23:47:17,261 INFO mapreduce.Job: The url to track the job: http://c1:8088/proxy/application_1552315366846_0003/
2019-03-11 23:47:17,262 INFO mapreduce.Job: Running job: job_1552315366846_0003
2019-03-11 23:47:23,359 INFO mapreduce.Job: Job job_1552315366846_0003 running in uber mode : false
2019-03-11 23:47:23,360 INFO mapreduce.Job: map 0% reduce 0%
2019-03-11 23:47:31,454 INFO mapreduce.Job: map 75% reduce 0%
2019-03-11 23:47:32,462 INFO mapreduce.Job: map 100% reduce 0%
2019-03-11 23:47:32,473 INFO mapreduce.Job: Job job_1552315366846_0003 completed successfully
2019-03-11 23:47:32,619 INFO mapreduce.Job: Counters: 32
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=913424
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=782
HDFS: Number of bytes written=0
HDFS: Number of read operations=19
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=4
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=23446
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=23446
Total vcore-milliseconds taken by all map tasks=23446
Total megabyte-milliseconds taken by all map tasks=24008704
Map-Reduce Framework
Map input records=2
Map output records=2
Input split bytes=636
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=582
CPU time spent (ms)=3960
Physical memory (bytes) snapshot=830259200
Virtual memory (bytes) snapshot=11165683712
Total committed heap usage (bytes)=454557696
Peak Map Physical memory (bytes)=208502784
Peak Map Virtual memory (bytes)=2793611264
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2019-03-11 23:47:32,626 INFO mapreduce.ExportJobBase: Transferred 782 bytes in 18.5015 seconds (42.2668 bytes/sec)
2019-03-11 23:47:32,629 INFO mapreduce.ExportJobBase: Exported 2 records.
--export-dir
表示在HDFS
对应的Hive
数据库文件位置–input-fields-terminated-by
表示要处理的间隔符
再次通过 Sqoop
查看 MySql
中的内容,可以看到数据已经成功导入
[root@c0 ~]# sqoop eval --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --query "select * from hive2mysql_mshk"
2019-03-11 23:48:56,848 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-11 23:48:56,884 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-11 23:48:57,024 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
-------------------------------------------------------------
| id | namea | nameb |
-------------------------------------------------------------
| 10086 | my.mshk.top | you.mshk.top |
| 12306 | mname | yname |
-------------------------------------------------------------
在 MySql
中能够看到我们创建的 hive2mysql_mshk
表有2行数据
4.2、Sqoop 通过 MySql 导入数据到 Hive
刚刚我们创建的 hive2mysql_mshk
表没有任何主键,我们只是从 Hive
中添加了一些记录到 Mysql
。
默认情况下,Sqoop
将识别表中的主键列(如果存在)并将其用作拆分列。
从数据库中检索拆分列的低值和高值,并且映射任务在总范围的大小均匀的组件上运行。
如果主键的实际值在其范围内不均匀分布,则可能导致任务不平衡。
您应该使用 --split-by
参数明确选择不同的列。例如-- split-by id
。
在将 MySql
的数据导入到 Hive
中的 Sqoop
命令添加了更多参数:
sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id --table hive2mysql_mshk --target-dir /hive/warehouse/mysql2hive_mshk --fields-terminated-by "," --hive-import --hive-table testmshk.mysql2hive_mshk
--split-by <column-name>
用哪个列来拆分--table
告诉计算机您要从MySQL导入哪个表--target-dir <dir>
HDFS要存储的目录--hive-import
将表导入Hive--hive-overwrite
覆盖Hive表中的现有数据--hive-table <table-name>
设置导入Hive时要使用的表名--fields-terminated-by <char>
设置字段分隔符
接下来 Sqoop
的操作是一个 map-reduce
工作。
[root@c0 _src]# sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id --table hive2mysql_mshk --target-dir /hive/warehouse/mysql2hive_mshk --fields-terminated-by "," --hive-import --hive-table testmshk.mysql2hive_mshk --hive-overwrite
2019-03-12 20:21:05,060 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-12 20:21:05,137 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-12 20:21:05,337 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-12 20:21:05,348 INFO tool.CodeGenTool: Beginning code generation
2019-03-12 20:21:05,785 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-12 20:21:05,821 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-12 20:21:05,831 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/work/_app/hadoop-3.1.2
Note: /tmp/sqoop-root/compile/202a5bda3950a7ccc6782f3cfcc3a99d/hive2mysql_mshk.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2019-03-12 20:21:08,747 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/202a5bda3950a7ccc6782f3cfcc3a99d/hive2mysql_mshk.jar
2019-03-12 20:21:08,761 WARN manager.MySQLManager: It looks like you are importing from mysql.
2019-03-12 20:21:08,761 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
2019-03-12 20:21:08,761 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
2019-03-12 20:21:08,762 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
2019-03-12 20:21:08,764 INFO mapreduce.ImportJobBase: Beginning import of hive2mysql_mshk
2019-03-12 20:21:08,765 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2019-03-12 20:21:08,928 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2019-03-12 20:21:09,656 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2019-03-12 20:21:10,332 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2019-03-12 20:21:10,688 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1552315366846_0011
2019-03-12 20:21:12,618 INFO db.DBInputFormat: Using read commited transaction isolation
2019-03-12 20:21:12,619 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `hive2mysql_mshk`
2019-03-12 20:21:12,622 INFO db.IntegerSplitter: Split size: 555; Num splits: 4 from: 10086 to: 12306
2019-03-12 20:21:12,696 INFO mapreduce.JobSubmitter: number of splits:4
2019-03-12 20:21:13,137 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552315366846_0011
2019-03-12 20:21:13,140 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-03-12 20:21:13,443 INFO conf.Configuration: resource-types.xml not found
2019-03-12 20:21:13,443 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-03-12 20:21:13,533 INFO impl.YarnClientImpl: Submitted application application_1552315366846_0011
2019-03-12 20:21:13,593 INFO mapreduce.Job: The url to track the job: http://c1:8088/proxy/application_1552315366846_0011/
2019-03-12 20:21:13,594 INFO mapreduce.Job: Running job: job_1552315366846_0011
2019-03-12 20:21:20,705 INFO mapreduce.Job: Job job_1552315366846_0011 running in uber mode : false
2019-03-12 20:21:20,727 INFO mapreduce.Job: map 0% reduce 0%
2019-03-12 20:21:29,927 INFO mapreduce.Job: map 50% reduce 0%
2019-03-12 20:21:29,930 INFO mapreduce.Job: Task Id : attempt_1552315366846_0011_m_000000_0, Status : FAILED
[2019-03-12 20:21:28.236]Container [pid=19941,containerID=container_e15_1552315366846_0011_01_000002] is running 539445760B beyond the 'VIRTUAL' memory limit. Current usage: 199.9 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e15_1552315366846_0011_01_000002 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 20026 19941 19941 19941 (java) 950 81 2678403072 50861 /opt/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx820m -Djava.io.tmpdir=/home/work/_data/hadoop-3.1.2/nm-local-dir/usercache/root/appcache/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.11.32 41274 attempt_1552315366846_0011_m_000000_0 16492674416642
|- 19941 19939 19941 19941 (bash) 1 2 115900416 307 /bin/bash -c /opt/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx820m -Djava.io.tmpdir=/home/work/_data/hadoop-3.1.2/nm-local-dir/usercache/root/appcache/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.11.32 41274 attempt_1552315366846_0011_m_000000_0 16492674416642 1>/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/stdout 2>/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/stderr
[2019-03-12 20:21:28.324]Container killed on request. Exit code is 143
[2019-03-12 20:21:28.335]Container exited with a non-zero exit code 143.
2019-03-12 20:21:30,978 INFO mapreduce.Job: map 75% reduce 0%
2019-03-12 20:21:37,021 INFO mapreduce.Job: map 100% reduce 0%
2019-03-12 20:21:37,032 INFO mapreduce.Job: Job job_1552315366846_0011 completed successfully
2019-03-12 20:21:37,145 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=915840
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=425
HDFS: Number of bytes written=49
HDFS: Number of read operations=24
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Failed map tasks=1
Launched map tasks=5
Other local map tasks=5
Total time spent by all maps in occupied slots (ms)=31981
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=31981
Total vcore-milliseconds taken by all map tasks=31981
Total megabyte-milliseconds taken by all map tasks=32748544
Map-Reduce Framework
Map input records=2
Map output records=2
Input split bytes=425
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=318
CPU time spent (ms)=6520
Physical memory (bytes) snapshot=815542272
Virtual memory (bytes) snapshot=11174408192
Total committed heap usage (bytes)=437780480
Peak Map Physical memory (bytes)=206934016
Peak Map Virtual memory (bytes)=2795565056
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=49
2019-03-12 20:21:37,154 INFO mapreduce.ImportJobBase: Transferred 49 bytes in 27.4776 seconds (1.7833 bytes/sec)
2019-03-12 20:21:37,159 INFO mapreduce.ImportJobBase: Retrieved 2 records.
2019-03-12 20:21:37,159 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table hive2mysql_mshk
2019-03-12 20:21:37,188 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-12 20:21:37,208 INFO hive.HiveImport: Loading uploaded data into Hive
2019-03-12 20:21:37,220 INFO conf.HiveConf: Found configuration file file:/home/work/_app/hive-2.3.4/conf/hive-site.xml
2019-03-12 20:21:49,491 INFO hive.HiveImport:
2019-03-12 20:21:49,492 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
2019-03-12 20:21:56,558 INFO hive.HiveImport: OK
2019-03-12 20:21:56,561 INFO hive.HiveImport: Time taken: 5.954 seconds
2019-03-12 20:21:57,005 INFO hive.HiveImport: Loading data to table testmshk.mysql2hive_mshk
2019-03-12 20:21:58,181 INFO hive.HiveImport: OK
2019-03-12 20:21:58,181 INFO hive.HiveImport: Time taken: 1.619 seconds
2019-03-12 20:21:58,681 INFO hive.HiveImport: Hive import complete.
最后,让我们验证 Hive
中的输出:
[root@c0 ~]# hive
Logging initialized using configuration in jar:file:/home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
testmshk
Time taken: 3.512 seconds, Fetched: 2 row(s)
hive> use testmshk;
OK
Time taken: 0.032 seconds
hive> show tables;
OK
mysql2hive_mshk
testtable
Time taken: 0.083 seconds, Fetched: 2 row(s)
hive> select * from mysql2hive_mshk;
OK
10086 my.mshk.top you.mshk.top
12306 mname yname
Time taken: 1.634 seconds, Fetched: 2 row(s)
hive> quit;
同时我们在 HDFS
中也可以看到创建的数据:
4.3、Sqoop 通过 MySql 导入数据到 Hbase
接下来我们将 MySql
中的表 hive2mysql_mshk
数据,导入到 Hbase
,同时在 Hbase
中创建表 mysql2hase_mshk
[root@c0 ~]# sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id --table hive2mysql_mshk --hbase-table mysql2hase_mshk --hbase-create-table --hbase-row-key id --column-family id
2019-03-13 12:04:33,647 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-13 12:04:33,694 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-13 12:04:33,841 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-13 12:04:33,841 INFO tool.CodeGenTool: Beginning code generation
2019-03-13 12:04:34,162 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-13 12:04:34,197 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
...
2019-03-13 12:05:13,782 INFO mapreduce.Job: map 75% reduce 0%
2019-03-13 12:05:15,813 INFO mapreduce.Job: map 100% reduce 0%
2019-03-13 12:05:16,827 INFO mapreduce.Job: Job job_1552397454797_0002 completed successfully
2019-03-13 12:05:16,942 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=1041632
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=425
HDFS: Number of bytes written=0
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=5
Launched map tasks=9
Other local map tasks=9
Total time spent by all maps in occupied slots (ms)=68882
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=68882
Total vcore-milliseconds taken by all map tasks=68882
Total megabyte-milliseconds taken by all map tasks=70535168
Map-Reduce Framework
Map input records=2
Map output records=2
Input split bytes=425
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=801
CPU time spent (ms)=15480
Physical memory (bytes) snapshot=1097326592
Virtual memory (bytes) snapshot=11271196672
Total committed heap usage (bytes)=629669888
Peak Map Physical memory (bytes)=295751680
Peak Map Virtual memory (bytes)=2828283904
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2019-03-13 12:05:16,949 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 38.224 seconds (0 bytes/sec)
2019-03-13 12:05:16,954 INFO mapreduce.ImportJobBase: Retrieved 2 records.
这时,我们再用 shell
测试连接 Hbase
,查看我们刚刚导入的数据,能够看到 mysql2hase_mshk
已经存在,并且可以获取其中的数据
[root@c0 ~]# hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
Version 1.4.9, rd625b212e46d01cb17db9ac2e9e927fdb201afa1, Wed Dec 5 11:54:10 PST 2018
hbase(main):001:0> list
TABLE
mysql2hase_mshk
1 row(s) in 0.1870 seconds
=> ["mysql2hase_mshk"]
hbase(main):002:0> scan 'mysql2hase_mshk'
ROW COLUMN+CELL
10086 column=id:namea, timestamp=1552449912494, value=my.mshk.top
10086 column=id:nameb, timestamp=1552449912494, value=you.mshk.top
12306 column=id:namea, timestamp=1552449906986, value=mname
12306 column=id:nameb, timestamp=1552449906986, value=yname
2 row(s) in 0.1330 seconds
hbase(main):003:0> hbase(main):003:0> get 'mysql2hase_mshk','10086'
COLUMN CELL
id:namea timestamp=1552449912494, value=my.mshk.top
id:nameb timestamp=1552449912494, value=you.mshk.top
1 row(s) in 0.0230 seconds
hbase(main):004:0>
如何在 Hbase
和 Hive
中互相导入、导出数据,请参考文章:Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建 中的 9.2.4
和 9.2.5
章节。
希望您发现它很有用,感谢您的支持和阅读我的博客。
博文作者:迦壹
博客地址:通过 Sqoop1.4.7 将 Mysql5.7、Hive2.3.4、Hbase1.4.9 之间的数据导入导出
转载声明:可以转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明,谢谢合作!
假设您认为这篇文章对您有帮助,可以通过以下方式进行捐赠,谢谢!
比特币地址:1KdgydfKMcFVpicj5w4vyn3T88dwjBst6Y
以太坊地址:0xbB0a92d634D7b9Ac69079ed0e521CC2e0a97c420