Reading a file from an AWS S3 bucket using Node fs

Time: 2022-09-25 18:39:25

I am attempting to read a file that is in an AWS S3 bucket using:

fs.readFile(file, function (err, contents) {
  var myLines = contents.Body.toString().split('\n')
})

I've been able to download and upload a file using the node aws-sdk, but I am at a loss as to how to simply read it and parse the contents.

Here is an example of how I am reading the file from S3:

var s3 = new AWS.S3();
var params = {Bucket: 'myBucket', Key: 'myKey.csv'}
var s3file = s3.getObject(params)

7 Answers

#1


57  

You have a couple options. You can include a callback as a second argument, which will be invoked with any error message and the object. This example is straight from the AWS documentation:

s3.getObject(params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else     console.log(data);           // successful response
});

Alternatively, you can convert the output to a stream. There's also an example in the AWS documentation:

var s3 = new AWS.S3({apiVersion: '2006-03-01'});
var params = {Bucket: 'myBucket', Key: 'myImageFile.jpg'};
var file = require('fs').createWriteStream('/path/to/file.jpg');
s3.getObject(params).createReadStream().pipe(file);
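
If your SDK version supports it, the same request can also be awaited as a promise. Here is a minimal sketch, assuming aws-sdk v2 (where request objects expose a .promise() helper) and an object body small enough to hold in memory:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function readLines(bucket, key) {
  // getObject(...).promise() resolves with the same data object the callback receives
  const data = await s3.getObject({ Bucket: bucket, Key: key }).promise();
  // data.Body is a Buffer; decode it and split it into lines
  return data.Body.toString('utf-8').split('\n');
}

readLines('myBucket', 'myKey.csv')
  .then(lines => console.log(lines.length + ' lines'))
  .catch(err => console.error(err));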

#2


28  

This will do it:

new AWS.S3().getObject({ Bucket: this.awsBucketName, Key: keyName }, function(err, data)
{
    if (!err)
        console.log(data.Body.toString()); // data.Body is a Buffer
});

#3


14  

Since you seem to want to process the S3 text file line by line, here is a Node version that uses the standard readline module and AWS's createReadStream():

const readline = require('readline');

const rl = readline.createInterface({
    input: s3.getObject(params).createReadStream()
});

rl.on('line', function(line) {
    console.log(line);
})
.on('close', function() {
});
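
On recent Node versions the readline interface is also async-iterable, so the same idea can be written with for await...of. A rough sketch, assuming Node 11.4+ and that s3 and params are defined as in the question:

const readline = require('readline');

async function printLines() {
  const rl = readline.createInterface({
    input: s3.getObject(params).createReadStream()
  });

  // readline.Interface supports async iteration on recent Node versions
  for await (const line of rl) {
    console.log(line);
  }
}

printLines().catch(console.error);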

#4


5  

I couldn't figure out why yet, but the createReadStream/pipe approach didn't work for me. I was trying to download a large CSV file (300 MB+) and got duplicated lines. It seemed like a random issue; the final file size varied on each download attempt.

I ended up using another way, based on AWS JS SDK examples:

var s3 = new AWS.S3();
var params = {Bucket: 'myBucket', Key: 'myImageFile.jpg'};
var file = require('fs').createWriteStream('/path/to/file.jpg');

s3.getObject(params).
    on('httpData', function(chunk) { file.write(chunk); }).
    on('httpDone', function() { file.end(); }).
    send();

This way, it worked like a charm.

#5


4  

Here is the example I used to retrieve and parse JSON data from S3:

var params = { Bucket: BUCKET_NAME, Key: KEY_NAME };
new AWS.S3().getObject(params, function (err, json_data) {
    if (!err) {
        // json_data.Body is a Buffer; decode it and parse the JSON
        var json = JSON.parse(json_data.Body.toString("utf8"));

        // PROCESS JSON DATA
    }
});

#6


2  

I had exactly the same issue when downloading very large files from S3.

The example solution from the AWS docs just does not work:

var file = fs.createWriteStream(options.filePath);
file.on('close', function () {
    if (self.logger) self.logger.info("S3Dataset file download saved to %s", options.filePath);
    return callback(null, done);
});
s3.getObject({ Bucket: this._options.s3.Bucket, Key: documentKey }).createReadStream()
    .on('error', function (err) {
        if (self.logger) self.logger.error("S3Dataset download error key:%s error:%@", options.fileName, err);
        return callback(err);
    })
    .pipe(file);

While this solution will work:

var file = fs.createWriteStream(options.filePath);
s3.getObject({ Bucket: this._options.s3.Bucket, Key: documentKey })
    .on('error', function (err) {
        if (self.logger) self.logger.error("S3Dataset download error key:%s error:%@", options.fileName, err);
        return callback(err);
    })
    .on('httpData', function (chunk) { file.write(chunk); })
    .on('httpDone', function () {
        file.end();
        if (self.logger) self.logger.info("S3Dataset file download saved to %s", options.filePath);
        return callback(null, done);
    })
    .send();

For some reason, the createReadStream attempt just does not fire the end, close, or error callback. See here about this.

I'm also using this solution for downloading gzipped archives and unpacking them, since the first one (the AWS example) does not work in this case either:

        var gunzip = zlib.createGunzip();
        var file = fs.createWriteStream( options.filePath );

        s3.getObject({ Bucket: this._options.s3.Bucket, Key: documentKey })
        .on('error', function (error) {
            if(self.logger) self.logger.error("%@",error);
            return callback(error);
        })
        .on('httpData', function (chunk) {
            file.write(chunk);
        })
        .on('httpDone', function () {

            file.end();

            if(self.logger) self.logger.info("downloadArchive downloaded %s", options.filePath);

            fs.createReadStream( options.filePath )
            .on('error', (error) => {
                return callback(error);
            })
            .on('end', () => {
                if(self.logger) self.logger.info("downloadArchive unarchived %s", options.fileDest);
                return callback(null, options.fileDest);
            })
            .pipe(gunzip)
            .pipe(fs.createWriteStream(options.fileDest))
        })
        .send();
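
If the intermediate file on disk is not needed, the same httpData/httpDone pattern can in principle feed the gunzip stream directly. This is only a rough, untested sketch under that assumption, with documentKey, options.fileDest and callback as above:

var zlib = require('zlib');
var fs = require('fs');

var gunzip = zlib.createGunzip();
var out = fs.createWriteStream(options.fileDest);

// decompressed output goes straight to the destination file
gunzip.pipe(out);

out.on('finish', function () {
    return callback(null, options.fileDest);
});

s3.getObject({ Bucket: this._options.s3.Bucket, Key: documentKey })
    .on('error', function (err) { return callback(err); })
    .on('httpData', function (chunk) { gunzip.write(chunk); })
    .on('httpDone', function () {
        // end the compressed input; gunzip flushes the remainder to fileDest
        gunzip.end();
    })
    .send();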

#7


0  

If you want to save memory and obtain each row as a JSON object, you can use fast-csv to create a read stream and read each row as a JSON object, as follows:

const csv = require('fast-csv');
const AWS = require('aws-sdk');

const credentials = new AWS.Credentials("ACCESSKEY", "SECRETKEY", "SESSIONTOKEN");
AWS.config.update({
    credentials: credentials, // credentials required for local execution
    region: 'your_region'
});
const dynamoS3Bucket = new AWS.S3();
const stream = dynamoS3Bucket.getObject({ Bucket: 'your_bucket', Key: 'example.csv' }).createReadStream();

var parser = csv.fromStream(stream, { headers: true }).on("data", function (data) {
    parser.pause();  //can pause reading using this at a particular row
    parser.resume(); // to continue reading
    console.log(data);
}).on("end", function () {
    console.log('process finished');
});
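
Note that csv.fromStream comes from older fast-csv releases; in fast-csv 3.x and later the equivalent entry point is parseStream. A small sketch under that assumption, reusing the stream from above:

const parser = csv.parseStream(stream, { headers: true })
    .on('error', err => console.error(err))
    .on('data', row => console.log(row))   // each row arrives as a plain object keyed by header
    .on('end', () => console.log('process finished'));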
