PySpark: selecting a subset of files in S3 using regex / glob

Time: 2022-09-01 23:38:46

I have a number of files, each segregated by date (date=yyyymmdd), on Amazon S3. The files go back 6 months, but I would like to restrict my script to only the last 3 months of data. I am unsure whether I can use regular expressions to do something like sc.textFile("s3://path_to_dir/yyyy[m1,m2,m3]*")

where m1, m2, m3 represent the 3 months, counting back from the current date, that I would like to use.

One discussion also suggested using something like sc.textFile("s3://path_to_dir/yyyym1*","s3://path_to_dir/yyyym2*","s3://path_to_dir/yyyym3*") but that doesn't seem to work for me.

Does sc.textFile() take regular expressions? I know you can use glob expressions, but I am unsure how to represent the above case as a glob expression.

1 solution

#1


2  

For your first option, use curly braces:

sc.textFile("s3://path_to_dir/yyyy{m1,m2,m3}*")
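Since m1, m2, m3 depend on the current date, the brace glob can be built in plain Python before it is handed to sc.textFile. The following is only a sketch: the helper names last_n_months and month_glob are made up for illustration, and it assumes that "the last 3 months" includes the current month and that file names start with yyyymm directly under the base path, as in the example above.

```python
from datetime import date


def last_n_months(today, n=3):
    """Return n 'yyyymm' strings ending with today's month, oldest first.

    Assumption: the current month counts as one of the n months.
    """
    months = []
    year, month = today.year, today.month
    for _ in range(n):
        months.append("%04d%02d" % (year, month))
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return months[::-1]


def month_glob(base, months):
    """Build a glob such as 's3://path_to_dir/{202207,202208,202209}*'."""
    return base + "/{" + ",".join(months) + "}*"


if __name__ == "__main__":
    pattern = month_glob("s3://path_to_dir", last_n_months(date.today()))
    print(pattern)
    # With a live SparkContext `sc`, the pattern would be used as:
    # rdd = sc.textFile(pattern)
```

If the current month should be excluded, the list would need to start one month earlier.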

For your second option, you can read each single glob into an RDD and then union those RDDs into a single one:

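# Read each month's files into its own RDD.
m1 = sc.textFile("s3://path_to_dir/yyyym1*")
m2 = sc.textFile("s3://path_to_dir/yyyym2*")
m3 = sc.textFile("s3://path_to_dir/yyyym3*")
# Combine them; the name avoids shadowing Python's built-in all().
combined = m1.union(m2).union(m3)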
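The same idea can be written for any number of months. A likely reason the call in the question fails is that the second positional parameter of sc.textFile is minPartitions, not another path, so each glob needs its own call. The sketch below uses a hypothetical helper, read_months, which takes an existing SparkContext as its first argument and combines the per-month RDDs with SparkContext.union; the month strings are assumed to be yyyymm prefixes.

```python
def read_months(sc, base, months):
    """Read one RDD per month prefix and union them into a single RDD.

    `sc` is an existing SparkContext; `months` is a list of 'yyyymm' strings.
    """
    # One glob per month, e.g. 's3://path_to_dir/202207*'
    rdds = [sc.textFile(base + "/" + m + "*") for m in months]
    # SparkContext.union accepts a list of RDDs and returns their union.
    return sc.union(rdds)

# Usage, assuming a live SparkContext named `sc`:
# combined = read_months(sc, "s3://path_to_dir", ["202207", "202208", "202209"])
```

The result is equivalent to chaining m1.union(m2).union(m3), just without hard-coding the number of months.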

You can use globs with sc.textFile but not full regular expressions.

