Sqoop1.4.4使用SQL语句形式将MySQL数据库表中数据导入到HDFS中

问题导读：

1、Sqoop使用SQL语句实现数据导入使用哪个参数？

2、使用--query参数执行数据导入，三个必须加上的参数是？

3、--split-by参数的作用？

4、Sqoop执行数据导入时，Map tasks的默认个数是？

5、--query后SQL语句双引号和单引号的区别？该怎么解决？

6、Sqoop执行数据导入有哪两种数据文件格式？默认的是哪个文件格式？

一、*查询形式导入

Sqoop还支持将任意的查询结果集导入，不使用--table、--columns和--where，使用SQL语句--query参数执行*查询导入，但是必须指定--target-dir目录，必须指定--split-by 分隔列，同时必须使用where且在其后加个$CONDITIONS，使Sqoop进程替代为一个唯一的条件表达式达到条件查询效果。如下：

[hadoopUser@secondmgt conf]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query  'select * from users where  id<60 and $CONDITIONS' --split-by id -m 1 --target-dir /output/query/
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
15/01/18 14:30:10 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/18 14:30:10 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/18 14:30:10 INFO tool.CodeGenTool: Beginning code generation
15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where  id<60 and  (1 = 0)
15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where  id<60 and  (1 = 0)
15/01/18 14:30:11 INFO manager.SqlManager: Executing SQL statement: select * from users where  id<60 and  (1 = 0)
15/01/18 14:30:11 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0
Note: /tmp/sqoop-hadoopUser/compile/3488270c7f7b23dd3b556d8d185f6a82/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/01/18 14:30:12 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoopUser/compile/3488270c7f7b23dd3b556d8d185f6a82/QueryResult.jar
15/01/18 14:30:12 INFO mapreduce.ImportJobBase: Beginning query import.
15/01/18 14:30:12 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hbase/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/01/18 14:30:12 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/01/18 14:30:13 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/01/18 14:30:13 INFO client.RMProxy: Connecting to ResourceManager at secondmgt/192.168.2.133:8032
15/01/18 14:30:14 INFO mapreduce.JobSubmitter: number of splits:1
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
15/01/18 14:30:14 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/01/18 14:30:14 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/01/18 14:30:14 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/01/18 14:30:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1421373857783_0016
15/01/18 14:30:15 INFO impl.YarnClientImpl: Submitted application application_1421373857783_0016 to ResourceManager at secondmgt/192.168.2.133:8032
15/01/18 14:30:15 INFO mapreduce.Job: The url to track the job: http://secondmgt:8088/proxy/application_1421373857783_0016/
15/01/18 14:30:15 INFO mapreduce.Job: Running job: job_1421373857783_0016
15/01/18 14:30:27 INFO mapreduce.Job: Job job_1421373857783_0016 running in uber mode : false
15/01/18 14:30:27 INFO mapreduce.Job:  map 0% reduce 0%
15/01/18 14:30:38 INFO mapreduce.Job:  map 100% reduce 0%
15/01/18 14:30:38 INFO mapreduce.Job: Job job_1421373857783_0016 completed successfully
15/01/18 14:30:38 INFO mapreduce.Job: Counters: 27
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=91814
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=87
                HDFS: Number of bytes written=123
                HDFS: Number of read operations=4
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Other local map tasks=1
                Total time spent by all maps in occupied slots (ms)=33944
                Total time spent by all reduces in occupied slots (ms)=0
        Map-Reduce Framework
                Map input records=3
                Map output records=3
                Input split bytes=87
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=44
                CPU time spent (ms)=2440
                Physical memory (bytes) snapshot=164503552
                Virtual memory (bytes) snapshot=888926208
                Total committed heap usage (bytes)=83886080
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=123
15/01/18 14:30:38 INFO mapreduce.ImportJobBase: Transferred 123 bytes in 25.6853 seconds (4.7887 bytes/sec)
15/01/18 14:30:38 INFO mapreduce.ImportJobBase: Retrieved 3 records.

Sqoop使用--split-by 列名，根据此分隔工作量，默认的Sqoop将表中的关键字作为分隔列，由上导入命令可知，此处我们是以“id”作为分隔列。

Sqoop从大部分的数据源并行的导入数据，我们可以使用-m参数控制Map tasks的数目，默认是4个，此处我们改成了1个Map task。Map task,根据整个范围的均衡大小进行操作。例如，你有一张表，关键字id范围是0-1000，默认Map tasks 是4个，Sqoop将会执行4个进程，每个进程以如下格式执行SELECT * FROM sometable WHERE id >= lo AND id < hi其中(lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) 在不同的任务中。

注意一： 如果你的表中关键字不是根据其范围均匀的分布，就可能导致不平衡的任务。这个时候你需要明确的选择一个不同的列使用--split-by指定分隔参数。目前，Sqoop，还不支持对各个列索引进行分隔，如果一个表没有索引列或者含有多个关键字列，你必须手动的指定一个分隔列。

注意二：如果SQL语句中使用双引号（“”），则必须使用\$CONDITIONS代替$CONDITIONS，使你的shell不将其识别为shell自身的变量。如下示例：

错误方式：

[hadoopUser@secondmgt ~]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query "select * from users where $CONDITIONS" --split-by id  --target-dir /output/query/
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
15/01/18 15:17:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/18 15:17:50 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/18 15:17:50 INFO tool.CodeGenTool: Beginning code generation
15/01/18 15:17:50 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Query [select * from users where ] must contain '$CONDITIONS' in WHERE clause.
        at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:352)
        at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1277)
        at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1089)
        at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:396)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:238)

正确如下：

[hadoopUser@secondmgt ~]$ sqoop import --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --query "select * from users where \$CONDITIONS" --split-by id  --target-dir /output/query/

注意三：目前版本的Sqoop中，使用*形式查询导入，只提供简单的查询，没有复杂的和“OR”条件查询在where子句中。

二、查看结果

[hadoopUser@secondmgt ~]$ hadoop fs -cat /output/query/*
56,hua,hanyun,男,开通,2013-12-02,0,1
58,feng,123456,男,开通,2013-11-22,0,0
59,test,123456,男,开通,2014-03-05,58,0

三、控制导入进程

有些数据库提供更加快捷、高效的方式用来将数据库表中的数据导入到其他的系统中，这个时候可以--direct 参数。如：mysql会调用 mysqldump和mysqlimport ，PostgreSQL 为psql。

四、控制映射类型

Sqoop预配置了Java和Hive典型的大部分SQL类型，然而，默认的类型有时候不一定完全适合用户需求。可以使用下面两个参数根据自己的应用修改映射类型

Argument Description
--map-column-java <mapping> Override mapping from SQL to Java type for configured columns.
--map-column-hive <mapping> Override mapping from SQL to Hive type for configured columns.

五、文件格式

Sqoop支持两种类型的文件格式导入：分隔符文本和序列文件（delimited text or SequenceFiles）。默认的是采用分隔符文本，由上面导入后查询的结果可知，默认采用逗号分隔的。可以使用--as-textfile参数修改默认的文件导入格式。

delimited text 是适合大多数非二进制数据类型。它也很容易支持进一步操纵其他工具,如Hive。

SequenceFiles是二进制格式以自定义记录特有的数据类型来存储个人记录的。

下一篇： Sqoop1.4.4使用增量导入模式将MySQL数据库中数据导入到HDFS中

秒客网

Sqoop1.4.4使用SQL语句形式将MySQL数据库表中数据导入到HDFS中

相关文章