Splitting a large XML file

Time: 2022-07-13 21:33:18

I have a 15 GB XML file that I want to split. It has approximately 300 million lines. It doesn't have any top nodes which are interdependent. Is there a tool available that readily does this for me?

8 solutions

#1


3  

I think you'll have to split manually unless you are interested in doing it programmatically. Here's a sample that does that, though it doesn't mention the max size of handled XML files. When doing it manually, the first problem that arises is how to open the file itself.

I would recommend a very simple text editor - something like Vim. When handling such large files, it is always useful to turn off all forms of syntax highlighting and/or folding.

Other options worth considering:

  1. EditPadPro - I've never tried it with anything this size, but if it's anything like other JGSoft products, it should work like a breeze. Remember to turn off syntax highlighting.

  2. VEdit - I've used this with files of 1GB in size, works as if it were nothing at all.

  3. EmEditor

#2


7  

XmlSplit - A Command-line Tool That Splits Large XML Files

xml_split - split huge XML documents into smaller chunks

Split that XML by bhayanakmaut (No source code and I could not get this one working)

A similar question: How do I split a large xml file?

#3


3  

Here is a low-memory-footprint script that does this in the free firstobject XML editor (foxe) using CMarkup file mode. I am not sure what you mean by no interdependent top nodes, or tag checking, but assume that under the root element you have millions of top-level elements containing object properties or rows, each of which must be kept together as a unit, and that you want, say, 1 million per output file. Then you could do this:

split_xml_15GB()
{
  int nObjectCount = 0, nFileCount = 0;
  CMarkup xmlInput, xmlOutput;
  xmlInput.Open( "15GB.xml", MDF_READFILE );
  xmlInput.FindElem(); // root
  str sRootTag = xmlInput.GetTagName();
  xmlInput.IntoElem();
  while ( xmlInput.FindElem() )
  {
    if ( nObjectCount == 0 )
    {
      ++nFileCount;
      xmlOutput.Open( "piece" + nFileCount + ".xml", MDF_WRITEFILE );
      xmlOutput.AddElem( sRootTag );
      xmlOutput.IntoElem();
    }
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
    ++nObjectCount;
    if ( nObjectCount == 1000000 )
    {
      xmlOutput.Close();
      nObjectCount = 0;
    }
  }
  if ( nObjectCount )
    xmlOutput.Close();
  xmlInput.Close();
  return nFileCount;
}

I posted a YouTube video and an article about this here:

http://www.firstobject.com/xml-splitter-script-video.htm

#4


1  

QXMLEdit has a dedicated function for that; I used it successfully with a Wikipedia dump. The ~2.7 GiB file became ~1,400,000 files (one per page). It can even distribute them into subfolders.

#5


0  

In what way do you need to split it? It's pretty easy to write code using XmlReader.ReadSubtree. It returns a new XmlReader instance over the current element and all its child elements. So, move to the first child of the root, call ReadSubtree, write all those nodes, call Read() using the original reader, and loop until done.
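
The same read-one-subtree-at-a-time loop can be sketched in Python as a rough analogue (this is not the .NET API). It uses the standard library's `xml.etree.ElementTree.iterparse`. The function name `split_by_subtree`, the `out_pattern` format string, and the `per_file` limit are hypothetical, and the sketch assumes the root element has no namespaces or attributes worth copying into the pieces.

```python
import xml.etree.ElementTree as ET


def split_by_subtree(src_path, out_pattern, per_file):
    """Write each direct child of the root to numbered output files.

    out_pattern is a format string such as "piece{}.xml" (hypothetical name).
    Returns the number of files written.
    """
    events = ET.iterparse(src_path, events=("start", "end"))
    _, root = next(events)            # first event: start of the root element
    root_tag = root.tag               # assumes no namespaces on the root
    depth = 0                         # nesting depth below the root
    in_file = 0                       # children written to the current file
    file_count = 0
    out = None
    for event, elem in events:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 0:
            continue                  # not the end of a direct child of the root
        if out is None:
            file_count += 1
            out = open(out_pattern.format(file_count), "w", encoding="utf-8")
            out.write("<?xml version='1.0' encoding='UTF-8'?>\n<%s>\n" % root_tag)
        elem.tail = None              # drop whitespace between siblings
        out.write(ET.tostring(elem, encoding="unicode") + "\n")
        in_file += 1
        root.clear()                  # detach processed children to bound memory
        if in_file == per_file:
            out.write("</%s>\n" % root_tag)
            out.close()
            out, in_file = None, 0
    if out is not None:               # close a partially filled last file
        out.write("</%s>\n" % root_tag)
        out.close()
    return file_count
```

Calling `split_by_subtree("15GB.xml", "piece{}.xml", 1000000)` would mirror the one-million-per-file split discussed in this thread. Only one child subtree needs to be serialized at a time, which is the point of the ReadSubtree approach.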

#6


0  

The open source library comma has several tools to find data in very large XML files and to split those files into smaller files.

https://github.com/acfr/comma/wiki/XML-Utilities

The tools were built using the expat SAX parser so that they do not fill memory with a DOM tree, as xmlstarlet and saxon do.
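
To illustrate why an event-driven parser keeps memory flat, here is a minimal sketch using Python's standard-library binding to expat (`xml.parsers.expat`). It is not taken from the comma tools. The function name `count_top_level_elements` is hypothetical. It only counts the direct children of the root, which is a cheap way to size a file before splitting it.

```python
import xml.parsers.expat


def count_top_level_elements(path):
    """Count direct children of the root without building any tree."""
    depth = 0
    count = 0

    def on_start(name, attrs):
        nonlocal depth, count
        depth += 1
        if depth == 2:                # depth 1 is the root itself
            count += 1

    def on_end(name):
        nonlocal depth
        depth -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    with open(path, "rb") as f:
        parser.ParseFile(f)           # expat reads the file in chunks
    return count
```

The handlers keep only two integers of state, so memory use should not grow with the size of the input.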

#7


0  

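I used this to split the Yahoo Q&A dataset:

    count = 0
    file_count = 1
    current_file = ""

    def write_piece(text, number):
        # Write one finished chunk to its own numbered file.
        with open('filepath/Split/file_' + str(number) + '.xml', 'w') as split:
            split.write(text)

    with open('filepath') as f:
        for line in f:
            current_file = current_file + line

            # Count one record each time its closing tag appears on a line.
            if "</your tag to split>" in line:
                count = count + 1

            if count == 50000:
                # Close the root element so the chunk is well-formed.
                write_piece(current_file + "</endTag>", file_count)
                file_count = file_count + 1
                # Start the next chunk with its own declaration and root tag.
                current_file = "<?xml version='1.0' encoding='UTF-8'?>\n<endTag>"
                count = 0

    # The tail of the input normally already ends with the root's closing
    # tag, so only append one if it is missing.
    if "</endTag>" not in current_file:
        current_file = current_file + "</endTag>"
    write_piece(current_file, file_count)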

#8


-1  

Not an XML tool, but UltraEdit could probably help. I've used it with 2 GB files and it didn't mind at all. Make sure you turn off the auto-backup feature, though.
