Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server?

Date: 2022-07-26 13:49:29

Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server? I want to import a big JSON file into my ES server.

9 solutions

#1


18  

You should use the Bulk API. Note that you will need to add a header line before each JSON document.

$ cat requests
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
$ curl -s -XPOST localhost:9200/_bulk --data-binary @requests; echo
{"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_version":1,"ok":true}}]}

#2


36  

As dadoonet already mentioned, the bulk API is probably the way to go. To transform your file for the bulk protocol, you can use jq.

Assuming the file contains just the documents themselves:

$ echo '{"foo":"bar"}{"baz":"qux"}' | 
jq -c '
{ index: { _index: "myindex", _type: "mytype" } },
. '

{"index":{"_index":"myindex","_type":"mytype"}}
{"foo":"bar"}
{"index":{"_index":"myindex","_type":"mytype"}}
{"baz":"qux"}

And if the file contains the documents in a top-level list, they have to be unwrapped first:

$ echo '[{"foo":"bar"},{"baz":"qux"}]' | 
jq -c '
.[] |
{ index: { _index: "myindex", _type: "mytype" } },
. '

{"index":{"_index":"myindex","_type":"mytype"}}
{"foo":"bar"}
{"index":{"_index":"myindex","_type":"mytype"}}
{"baz":"qux"}

jq's -c flag makes sure that each document is on a line by itself.

If you want to pipe straight to curl, you'll want to use --data-binary @-, and not just -d, otherwise curl will strip the newlines again.

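For example, a complete pipeline might look like this (just a sketch combining the commands above; docs.json, myindex and mytype are placeholder names, and the file is assumed to hold one document per line):

$ cat docs.json |
jq -c '{ index: { _index: "myindex", _type: "mytype" } }, .' |
curl -s -XPOST localhost:9200/_bulk --data-binary @-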

#3


11  

I'm sure someone wants this so I'll make it easy to find.

FYI - This is using Node.js (essentially as a batch script) on the same server as the brand new ES instance. Ran it on 2 files with 4000 items each and it only took about 12 seconds on my shared virtual server. YMMV

var elasticsearch = require('elasticsearch'),
    fs = require('fs'),
    pubs = JSON.parse(fs.readFileSync(__dirname + '/pubs.json')), // name of my first file to parse
    forms = JSON.parse(fs.readFileSync(__dirname + '/forms.json')); // and the second set
var client = new elasticsearch.Client({  // default is fine for me, change as you see fit
  host: 'localhost:9200',
  log: 'trace'
});

for (var i = 0; i < pubs.length; i++ ) {
  client.create({
    index: "epubs", // name your index
    type: "pub", // describe the data thats getting created
    id: i, // increment ID every iteration - I already sorted mine but not a requirement
    body: pubs[i] // *** THIS ASSUMES YOUR DATA FILE IS FORMATTED LIKE SO: [{prop: val, prop2: val2}, {prop:...}, {prop:...}] - I converted mine from a CSV so pubs[i] is the current object {prop:..., prop2:...}
  }, function(error, response) {
    if (error) {
      console.error(error);
      return;
    }
    else {
      console.log(response);  // I don't recommend this but I like having my console flooded with stuff. It looks cool. Like I'm compiling a kernel really fast.
    }
  });
}

for (var a = 0; a < forms.length; a++ ) {  // Same stuff here, just slight changes in type and variables
  client.create({
    index: "epubs",
    type: "form",
    id: a,
    body: forms[a]
  }, function(error, response) {
    if (error) {
      console.error(error);
      return;
    }
    else {
      console.log(response);
    }
  });
}
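
To run it, something like the following should work (a sketch; import.js is a hypothetical file name for the script above, placed next to pubs.json and forms.json):

$ npm install elasticsearch   # the client library required above
$ node import.js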

Hope I can help more than just myself with this. Not rocket science but may save someone 10 minutes.

Cheers

#4


8  

jq is a lightweight and flexible command-line JSON processor.

Usage:

cat file.json | jq -c '.[] | {"index": {"_index": "bookmarks", "_type": "bookmark", "_id": .id}}, .' | curl -XPOST localhost:9200/_bulk --data-binary @-

We’re taking the file file.json and piping its contents to jq first with the -c flag to construct compact output. Here’s the nugget: We’re taking advantage of the fact that jq can construct not only one but multiple objects per line of input. For each line, we’re creating the control JSON Elasticsearch needs (with the ID from our original object) and creating a second line that is just our original JSON object (.).

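For example (a sketch, assuming every object in file.json carries an id field), an input of [{"id":1,"title":"foo"}] would be rewritten into the two bulk lines:

{"index":{"_index":"bookmarks","_type":"bookmark","_id":1}}
{"id":1,"title":"foo"}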

At this point we have our JSON formatted the way Elasticsearch’s bulk API expects it, so we just pipe it to curl which POSTs it to Elasticsearch!

Credit goes to Kevin Marsh

#5


8  

There is no direct import as such, but you can index the documents by using the ES API.

You can use the index API to load each line (using some kind of code to read the file and make the curl calls) or the bulk API to load them all, assuming your data file can be formatted to work with it.

Read more here: ES API

A simple shell script would do the trick if you're comfortable with shell; something like this, maybe (not tested):

while read -r line
do
  curl -XPOST 'http://localhost:9200/<indexname>/<typeofdoc>/' -d "$line"
done < myfile.json
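
Note that newer Elasticsearch releases also require an explicit Content-Type header for JSON requests; if the posts are rejected, add one (same command, just with the extra header):

curl -XPOST 'http://localhost:9200/<indexname>/<typeofdoc>/' -H 'Content-Type: application/json' -d "$line"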

Personally, I would probably use Python, with either pyes or the official elasticsearch Python client.

pyes on github
elastic search python client

Stream2es is also very useful for quickly loading data into ES, and may have a way to simply stream a file in. (I have not tested it with a file, but have used it to load Wikipedia docs for ES performance testing.)

#6


4  

Stream2es is the easiest way IMO.

e.g. assuming a file "some.json" containing a list of JSON documents, one per line:

curl -O download.elasticsearch.org/stream2es/stream2es; chmod +x stream2es
cat some.json | ./stream2es stdin --target "http://localhost:9200/my_index/my_type"

#7


4  

You can use esbulk, a fast and simple bulk indexer:

$ esbulk -index myindex file.ldj

Here's an asciicast showing it loading Project Gutenberg data into Elasticsearch in about 11s.

Disclaimer: I'm the author.

#8


3  

You can use the Elasticsearch Gatherer Plugin.

The gatherer plugin for Elasticsearch is a framework for scalable data fetching and indexing. Content adapters are implemented in gatherer zip archives which are a special kind of plugins distributable over Elasticsearch nodes. They can receive job requests and execute them in local queues. Job states are maintained in a special index.

This plugin is under development.

Milestone 1 - deploy gatherer zips to nodes

Milestone 2 - job specification and execution

Milestone 3 - porting JDBC river to JDBC gatherer

Milestone 4 - gatherer job distribution by load/queue length/node name, cron jobs

Milestone 5 - more gatherers, more content adapters

Reference: https://github.com/jprante/elasticsearch-gatherer

#9


0  

One way is to create a bash script that does a bulk insert:

curl -XPOST http://127.0.0.1:9200/myindexname/type/_bulk?pretty=true --data-binary @myjsonfile.json
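
Note that this assumes myjsonfile.json is already in the bulk (newline-delimited) format the endpoint expects: an action line before each document and a trailing newline at the end, for example:

{ "index" : { "_id" : "1" } }
{ "field1" : "value1" }
{ "index" : { "_id" : "2" } }
{ "field1" : "value2" }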

After you run the insert, run this command to get the count:

curl http://127.0.0.1:9200/myindexname/type/_count
