I have a GCS folder laid out as below:
gs://<bucket-name>/<folder-name>/dt=2017-12-01/part-0000.tsv
/dt=2017-12-02/part-0000.tsv
/dt=2017-12-03/part-0000.tsv
/dt=2017-12-04/part-0000.tsv
...
I want to match only the files under dt=2017-12-02 and dt=2017-12-03 using sc.textFile() in Scio, which, as far as I know, uses TextIO.Read.from() underneath.
I've tried
gs://<bucket-name>/<folder-name>/dt={2017-12-02,2017-12-03}/*.tsv
and
gs://<bucket-name>/<folder-name>/dt=2017-12-(02|03)/*.tsv
Both match zero files:
INFO org.apache.beam.sdk.io.FileBasedSource - Filepattern gs://<bucket-name>/<folder-name>/dt={2017-12-02,2017-12-03}/*.tsv matched 0 files with total size 0
INFO org.apache.beam.sdk.io.FileBasedSource - Filepattern gs://<bucket-name>/<folder-name>/dt=2017-12-(02|03)/*.tsv matched 0 files with total size 0
What is a valid filepattern for doing this?
2 Solutions
#1
1
You need to use the TextIO.readAll() transform, which reads a PCollection<String> of filepatterns. Create the collection of filepatterns either explicitly via Create.of() or by computing it with a ParDo.
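import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.transforms.{Create, PTransform}
import org.apache.beam.sdk.values.{PBegin, PCollection}
import scala.collection.JavaConverters._

// Wraps Create.of() and TextIO.readAll() so that Scio can use them as a custom input.
case class ReadPaths(paths: java.lang.Iterable[String]) extends PTransform[PBegin, PCollection[String]] {
  override def expand(input: PBegin): PCollection[String] =
    input.apply(Create.of(paths)).apply(TextIO.readAll())
}

// Each entry is read as a filepattern; a full path matches a single file.
val paths = Seq(
  "gs://<bucket-name>/<folder-name>/dt=2017-07-01/part-0000.tsv",
  "gs://<bucket-name>/<folder-name>/dt=2017-12-20/part-0000.tsv",
  "gs://<bucket-name>/<folder-name>/dt=2018-03-29/part-0000.tsv",
  "gs://<bucket-name>/<folder-name>/dt=2018-05-04/part-0000.tsv"
)

sc.customInput("Read Paths", ReadPaths(paths.asJava))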
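If the dates form a contiguous range, the list of filepatterns can be computed rather than typed out. The following is a minimal sketch using only the Scala and Java standard libraries; the helper name datePatterns and the dt=<date>/*.tsv layout are assumptions taken from the folder structure in the question, not part of Scio or Beam.

```scala
import java.time.LocalDate

// Hypothetical helper: builds one filepattern per day in [start, end], inclusive.
// LocalDate.toString yields ISO yyyy-MM-dd, which matches the dt= folder names.
def datePatterns(prefix: String, start: LocalDate, end: LocalDate): List[String] =
  Iterator
    .iterate(start)(_.plusDays(1))
    .takeWhile(d => !d.isAfter(end))
    .map(d => s"$prefix/dt=$d/*.tsv")
    .toList

val patterns = datePatterns(
  "gs://<bucket-name>/<folder-name>",
  LocalDate.parse("2017-12-02"),
  LocalDate.parse("2017-12-03")
)
```

The resulting list could then be passed to the transform above as ReadPaths(patterns.asJava).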
#2
1
This might work:
gs://bucket/folder/dt=2017-12-0[23]/*.tsv
Reference: https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames
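To illustrate why a character class can match where the brace and alternation patterns from the question matched zero files, here is a simplified model of glob matching. It assumes the matcher follows the wildcard rules on the referenced page, where only *, ? and [...] are wildcards and every other character is literal. The function globToRegex is a hypothetical sketch written for this illustration, not Beam's or gsutil's actual implementation.

```scala
import java.util.regex.Pattern

// Simplified model: translate a glob into a regex where only *, ? and [...]
// are special. Braces, parentheses and | are treated as literal characters.
def globToRegex(glob: String): String = {
  val sb = new StringBuilder
  var inClass = false
  glob.foreach {
    case '*' if !inClass => sb.append("[^/]*") // any run of characters within one path segment
    case '?' if !inClass => sb.append("[^/]")  // exactly one character within a segment
    case '[' if !inClass => inClass = true; sb.append('[')
    case ']' if inClass  => inClass = false; sb.append(']')
    case c if inClass    => sb.append(c)       // keep class contents such as 23 or 0-9
    case c               => sb.append(Pattern.quote(c.toString))
  }
  sb.toString
}

def matches(glob: String, path: String): Boolean = path.matches(globToRegex(glob))

val file = "gs://bucket/folder/dt=2017-12-02/part-0000.tsv"

matches("gs://bucket/folder/dt=2017-12-0[23]/*.tsv", file)               // true
matches("gs://bucket/folder/dt={2017-12-02,2017-12-03}/*.tsv", file)     // false: braces are literal
matches("gs://bucket/folder/dt=2017-12-(02|03)/*.tsv", file)             // false: parentheses are literal
```

Under this model, dt=2017-12-0[23] covers exactly the two requested days, while the other two patterns would only match folder names that literally contain braces or parentheses.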