Near-real-time data extraction from SQL Server to HDFS in Cloudera

Time: 2021-10-07 03:54:21

We have PLC data in SQL Server which gets updated every 5 minutes. We have to push the data to HDFS in a Cloudera distribution on the same time interval. Which tools are available for this?


3 Solutions

#1



I would suggest using Confluent Kafka for this task (https://www.confluent.io/product/connectors/).


The idea is as follows:


SQLServer --> [JDBC-Connector] --> Kafka --> [HDFS-Connector] --> HDFS


All of these connectors are already available on the Confluent web site.

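As an illustration, here is a minimal sketch of how the two connectors might be registered with a Kafka Connect worker over its REST API. The worker address, table name, credentials, and column names are placeholders; the config keys follow the Confluent JDBC source and HDFS sink connector documentation, so verify them against your connector versions.

```python
import json
import requests  # assumes the `requests` package is installed

CONNECT_URL = "http://connect-worker:8083/connectors"  # hypothetical worker address

# JDBC source: polls SQL Server every 5 minutes and publishes new rows to Kafka.
jdbc_source = {
    "name": "plc-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:sqlserver://sqlserver-host:1433;databaseName=plc",
        "connection.user": "etl_user",
        "connection.password": "********",
        "table.whitelist": "plc_data",          # hypothetical table name
        "mode": "timestamp",                    # pick up rows by last-modified column
        "timestamp.column.name": "updated_at",  # hypothetical audit column
        "poll.interval.ms": "300000",           # 5 minutes, matching the update cadence
        "topic.prefix": "plc-",
    },
}

# HDFS sink: drains the Kafka topic into HDFS on the Cloudera cluster.
hdfs_sink = {
    "name": "plc-hdfs-sink",
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": "plc-plc_data",           # topic.prefix + table name
        "hdfs.url": "hdfs://namenode:8020", # hypothetical NameNode address
        "flush.size": "1000",               # records per file before a commit
        "topics.dir": "/data/plc",
    },
}

for connector in (jdbc_source, hdfs_sink):
    resp = requests.post(CONNECT_URL, json=connector)
    resp.raise_for_status()
    print(json.dumps(resp.json(), indent=2))
```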

#2



I'm assuming your data is being written to some directory on the local FS. You may use a streaming engine for this task. Since you've tagged this with apache-spark, I'll give you the Spark Streaming solution.


Using Structured Streaming, your streaming consumer will watch your data directory. Spark Streaming reads and processes data in configurable micro-batches (stream wait time), which in your case would be 5 minutes long. You can save the data from each micro-batch as text files, using your Cloudera Hadoop cluster for storage.

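A minimal PySpark sketch of that pipeline, assuming the exported rows land as plain-text files in a local staging directory; the paths and app name are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plc-to-hdfs").getOrCreate()

# Watch the staging directory; each new file becomes part of the next micro-batch.
lines = (spark.readStream
         .format("text")
         .load("file:///data/plc_staging/"))  # hypothetical local export directory

# Trigger a micro-batch every 5 minutes and append the results to HDFS as text files.
query = (lines.writeStream
         .format("text")
         .option("path", "hdfs:///user/etl/plc/")  # hypothetical HDFS target
         .option("checkpointLocation", "hdfs:///user/etl/plc_ckpt/")
         .trigger(processingTime="5 minutes")
         .start())

query.awaitTermination()
```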

Let me know if this helped. Cheers.


#3



You can Google the tool named Sqoop. It is open-source software.

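For reference, a Sqoop incremental import run on a 5-minute schedule (e.g. from cron or Oozie) might look like the sketch below; the connection string, table, and column names are placeholders, and in practice a sqoop job would track --last-value for you:

```
sqoop import \
  --connect "jdbc:sqlserver://sqlserver-host:1433;databaseName=plc" \
  --username etl_user --password-file /user/etl/sqoop.pwd \
  --table plc_data \
  --target-dir /data/plc \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2021-10-07 03:50:00" \
  --merge-key id
```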
