I am trying to convert an EBCDIC file to ASCII using the CobolIoProvider class from JRecord in Apache Beam.
Code that I'm using:
CobolIoProvider ioProvider = CobolIoProvider.getInstance();
AbstractLineReader reader = ioProvider.getLineReader(
        Constants.IO_FIXED_LENGTH, Convert.FMT_MAINFRAME,
        CopybookLoader.SPLIT_NONE, copybookname, cobolfilename);
The code reads and converts the file as required. However, I am able to read cobolfilename and copybookname (the paths of the EBCDIC file and the copybook, respectively) only from the local system. When I try to read the files from GCS, it fails with a FileNotFoundException: "The filename, directory name, or volume label syntax is incorrect".
Is there a way to read a COBOL file (EBCDIC) from GCS using the CobolIoProvider class?
If not, is there any other class available that converts a COBOL file (EBCDIC) to ASCII and allows the files to be read from GCS?
Using ICobolIOBuilder:
Code that I’m using:
ICobolIOBuilder iob = JRecordInterface1.COBOL.newIOBuilder("copybook.cbl")
.setFileOrganization(Constants.IO_FIXED_LENGTH)
.setSplitCopybook(CopybookLoader.SPLIT_NONE);
AbstractLineReader reader = iob.newReader(bs); //bs is an InputStream object of my Cobol file
However, here are a few concerns:
1) I have to keep my copybook.cbl locally. Is there any way to read the copybook file from GCS? I tried the code below, reading my copybook from GCS into a stream and passing the stream to loadCopyBook(). But the code didn't work.
Sample code below:
InputStream bs2 = new ByteArrayInputStream(copybookfile.toString().getBytes());
LayoutDetail schema = new CobolCopybookLoader()
        .loadCopyBook(bs2, "copybook.cbl",
                CopybookLoader.SPLIT_NONE, 0, "",
                Constants.USE_STANDARD_COLUMNS,
                Convert.FMT_INTEL, 0, new TextLog())
        .asLayoutDetail();
AbstractLineReader reader = LineIOProvider.getInstance().getLineReader(schema);
reader.open(inputStream, schema);
2) Reading the EBCDIC file from a stream using newReader didn't convert my file to ASCII.
Thanks.
2 Answers
#1
I do not have a full answer. If you are using a recent version of JRecord, I suggest changing the code to use the JRecordInterface1. The IO-Builder is a lot more flexible than the older CobolIoProvider interface.
String encoding = "cp037"; // cp037/IBM037 US ebcdic; cp273 - German ebcdic
ICobolIOBuilder iob = JRecordInterface1.COBOL
.newIOBuilder("CopybookFile.cbl")
.setFileOrganization(Constants.IO_FIXED_LENGTH)
.setFont(encoding); // should set encoding if you can
AbstractLineReader reader = iob.newReader(datastream);
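As a side note (an addition for illustration, not part of the original answer): the cp037 charset mentioned above ships with the standard JDK, so plain text (PIC X) fields can be sanity-checked with java.nio.charset alone; packed-decimal and COMP fields still require JRecord and the copybook to decode.

```java
import java.nio.charset.Charset;

public class EbcdicDemo {
    public static void main(String[] args) {
        // "HELLO" encoded in US EBCDIC (cp037): C8 C5 D3 D3 D6
        byte[] ebcdic = {(byte) 0xC8, (byte) 0xC5, (byte) 0xD3,
                         (byte) 0xD3, (byte) 0xD6};
        // Decode the EBCDIC bytes into a Java String using the JDK charset
        String text = new String(ebcdic, Charset.forName("cp037"));
        System.out.println(text); // prints HELLO
        // Only character data converts this way; binary/COMP-3 fields
        // must be decoded by JRecord using the copybook layout.
    }
}
```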
With the IO-Builder interface you can use streams. The question Stream file from Google Cloud Storage covers creating a stream from GCS and may be useful. Hopefully someone with more knowledge of GCS can help.
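One possible shape for this (a sketch under assumptions, not a tested GCS integration: the Storage client calls in the comment use the google-cloud-storage library, and the bucket/object names are hypothetical) is to read the GCS object into memory and hand JRecord a ByteArrayInputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class GcsStreamSketch {
    // Hypothetical helper: wrap raw bytes as the InputStream the
    // IO-Builder's newReader(...) expects. In a real project the bytes
    // would come from GCS, e.g. with the google-cloud-storage client:
    //   Storage storage = StorageOptions.getDefaultInstance().getService();
    //   byte[] bytes = storage.readAllBytes(BlobId.of("my-bucket", "data.ebcdic"));
    static InputStream openAsStream(byte[] bytes) {
        return new ByteArrayInputStream(bytes);
    }

    public static void main(String[] args) throws Exception {
        // Local stand-in bytes so the sketch runs without GCS access
        InputStream in = openAsStream(new byte[] {1, 2, 3});
        System.out.println(in.read()); // first byte: 1
    }
}
```

Reading the whole object into memory is only reasonable for modest file sizes; for large files a channel-based approach (as in the Beam answer below) avoids buffering everything at once.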
Alternatively, you could read from GCS directly and create data lines (data records) using the newLine method of a JRecord IO-Builder:
AbstractLine l = iob.newLine(byteArray);
I will look at creating a basic Read/Write interface for JRecord so JRecord users can write their own interfaces to GCS, IBM's Mainframe Access (ZFile), etc. But this will take time.
#2
The easiest way to use Beam/Dataflow with new kinds of file-based sources is to first use FileIO to get a PCollection&lt;ReadableFile&gt;, and then use a DoFn to read that file. This will require implementing the code to read from a given channel. Something like the following:
Pipeline p = ...;
p.apply(FileIO.match().filepattern("..."))
 .apply(FileIO.readMatches())
 .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
     @ProcessElement
     public void processElement(ProcessContext c) throws IOException {
         try (ReadableByteChannel channel = c.element().open()) {
             // Use CobolIO/JRecord to read from the byte channel
         }
     }
 }));
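Since JRecord's reader consumes an InputStream while Beam's ReadableFile.open() yields a ReadableByteChannel, the standard java.nio.channels.Channels adapter bridges the two. A minimal local demonstration (using an in-memory channel in place of the one Beam would supply):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class ChannelAdapterDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the channel that ReadableFile.open() would return
        ReadableByteChannel channel =
                Channels.newChannel(new ByteArrayInputStream(new byte[] {65, 66}));
        // Adapt the channel to the InputStream JRecord's newReader(...) expects
        InputStream in = Channels.newInputStream(channel);
        System.out.println(in.read()); // first byte: 65
    }
}
```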