How to create a data frame in PySpark using incremental data

Time: 2021-06-11 23:00:45

I have some tables in Hive that get data appended to them incrementally.

Today I created a data frame in PySpark from a Hive table, transposed it, and saved the transposed data frame as a new table in Hive.
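One common way to express such a transpose in PySpark is a pivot. This is a minimal sketch, assuming hypothetical column names `id`, `key`, and `value` and a hypothetical table `mydb.source_table`:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the source table and pivot it: one row per id, one column per key.
df = spark.table("mydb.source_table")        # hypothetical table name
transposed = (
    df.groupBy("id")                         # hypothetical row key
      .pivot("key")                          # hypothetical column to spread out
      .agg(F.first("value"))                 # hypothetical value column
)

# Persist the transposed result as a new Hive table.
transposed.write.mode("overwrite").saveAsTable("mydb.transposed_table")
```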

Say tomorrow 100 new rows are appended to that Hive table. I want to create a data frame from only those 100 new rows, transpose it, and append the result to the existing transposed Hive table.

How can I achieve that using PySpark?

1 answer

#1
The semantics of Hive by themselves are not enough to provide this functionality. The new data has to be identifiable by content, by file, or by a metadata process.

Identifiable by content: the data contains a time or date stamp that lets you query the table and filter for only the rows of interest.
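A minimal sketch of this approach, assuming a hypothetical ingest-date column `load_date` on the source table and a `last_processed` watermark that your job tracks between runs:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Watermark from the previous run, e.g. stored in a control table or file.
last_processed = "2021-06-10"                # hypothetical value

# Pull only the rows appended since the last run.
new_rows = (
    spark.table("mydb.source_table")         # hypothetical table name
         .filter(f"load_date > '{last_processed}'")
)

# Transpose the new rows the same way as the initial load, then append.
transposed = new_rows.groupBy("id").pivot("key").agg(F.first("value"))
transposed.write.mode("append").saveAsTable("mydb.transposed_table")
```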

Identifiable by file: skip the Hive interface and locate the data directly on HDFS/POSIX, for example by using the Modify or Change timestamps on individual files. Load those files directly as a new data frame.
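A minimal sketch of the file-based approach, assuming the table's files live under a known HDFS directory, the table is Parquet-backed, and files modified after the previous run hold the new rows (the path and watermark are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

table_dir = "hdfs:///warehouse/mydb.db/source_table"   # hypothetical path
last_run_ms = 1623456000000                            # epoch millis of previous run

# Use the Hadoop FileSystem API via the JVM gateway to find new files.
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
path = jvm.org.apache.hadoop.fs.Path(table_dir)
fs = path.getFileSystem(hadoop_conf)

new_files = [
    status.getPath().toString()
    for status in fs.listStatus(path)
    if status.getModificationTime() > last_run_ms
]

if new_files:
    # Load only the newly modified files as a data frame.
    new_df = spark.read.parquet(*new_files)            # assumes Parquet files
```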

Identifiable by metadata process: in the architecture I've built, I use Apache NiFi, Kafka, and Cloudera Navigator to provide metadata lineage for file and data ingestion. If your architecture records metadata about ingested data, you may be able to leverage it to identify the files/records you need.
