We export “records” to an xml file; one of our customers has complained that the file is too big for their other system to process. Therefore I need to split up the file, while repeating the “header section” in each of the new files.
So I am looking for something that will let me define some xpaths for the section(s) that should always be outputted, and another xpath for the “rows” with a parameter that says how many rows to put in each file and how to name the files.
Before I start writing custom .NET code to do this: is there a standard command-line tool that runs on Windows and does it?
(Since I know how to program in C#, I am more inclined to write code than to mess about with complex XSL and so on, but an off-the-shelf solution would be better than custom code.)
7 answers
#1
-2
"is there a standard command line tool that will work on windows that does it?"
Yes. http://xponentsoftware.com/xmlSplit.aspx
#2
3
There's no general-purpose solution to this, because there are so many different ways that your source XML could be structured.
It's reasonably straightforward to build an XSLT transform that will output a slice of an XML document. For instance, given this XML:
<header>
<data rec="1"/>
<data rec="2"/>
<data rec="3"/>
<data rec="4"/>
<data rec="5"/>
<data rec="6"/>
</header>
you can output a copy of the file containing only the data elements within a certain range with this XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:param name="startPosition"/>
<xsl:param name="endPosition"/>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="header">
<xsl:copy>
<xsl:apply-templates select="data"/>
</xsl:copy>
</xsl:template>
<xsl:template match="data">
<xsl:if test="position() &gt;= $startPosition and position() &lt;= $endPosition">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
(Note, by the way, that because this is based on the identity transform, it works even if header isn't the top-level element.)
You still need to count the data elements in the source XML, and run the transform repeatedly with the values of $startPosition and $endPosition that are appropriate for the situation.
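As a rough illustration of that counting step, here is a minimal Python sketch that counts the data elements and works out one ($startPosition, $endPosition) pair per output file. The function name slice_ranges and the chunk size of 4 are made up for this example, and it assumes all data elements are children of a single header element, so that the count lines up with position() in the stylesheet. Each pair would then be passed as parameters to whatever XSLT processor you use.

```python
import xml.etree.ElementTree as ET


def slice_ranges(xml_text, row_tag, rows_per_file):
    """Return 1-based, inclusive (start, end) pairs, one pair per output file."""
    root = ET.fromstring(xml_text)
    # Count every element with the given tag anywhere in the document.
    total = sum(1 for _ in root.iter(row_tag))
    return [(start, min(start + rows_per_file - 1, total))
            for start in range(1, total + 1, rows_per_file)]


sample = """<header>
  <data rec="1"/><data rec="2"/><data rec="3"/>
  <data rec="4"/><data rec="5"/><data rec="6"/>
</header>"""

for start, end in slice_ranges(sample, "data", 4):
    print(f"startPosition={start} endPosition={end}")
```

For the six-record sample this yields two ranges, 1 to 4 and 5 to 6. The whole document is parsed into memory here, which may not suit very large exports.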
#3
2
First, download the foxe XML editor from http://www.firstobject.com/foxe242.zip
Watch the video at http://www.firstobject.com/xml-splitter-script-video.htm; it explains how the split code works.
That page contains a script (it starts with split()). Copy it, then in the XML editor create a "New Program" under the "File" menu, paste the code, and save it. The code is:
split()
{
CMarkup xmlInput, xmlOutput;
xmlInput.Open( "**50MB.xml**", MDF_READFILE );
int nObjectCount = 0, nFileCount = 0;
while ( xmlInput.FindElem("//**ACT**") )
{
if ( nObjectCount == 0 )
{
++nFileCount;
xmlOutput.Open( "**piece**" + nFileCount + ".xml", MDF_WRITEFILE );
xmlOutput.AddElem( "**root**" );
xmlOutput.IntoElem();
}
xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
++nObjectCount;
if ( nObjectCount == **5** )
{
xmlOutput.Close();
nObjectCount = 0;
}
}
if ( nObjectCount )
xmlOutput.Close();
xmlInput.Close();
return nFileCount;
}
Change the bold (or ** **-marked) fields to suit your needs. (This is also explained on the video page.)
In the XML editor window, right-click and choose Run (or simply press F9). The output bar in the window shows the number of files that were generated.
Note: the input file name can be a full path such as "C:\\Users\\AUser\\Desktop\\a_xml_file.xml" (with doubled backslashes), and the output file can be "C:\\Users\\AUser\\Desktop\\anoutputfolder\\piece" + nFileCount + ".xml"
#4
2
As already mentioned, xml_split from the Perl package XML::Twig does a great job.
Usage
xml_split < bigFile.xml
#or if compressed e.g.
bzcat bigFile.xml.bz2 | xml_split
Without any arguments, xml_split creates a file per top-level child node.
There are parameters to specify the number of elements you want per file (-g) or the approximate size (-s <Kb|Mb|Gb>).
Installation
Windows
See here
Linux
sudo apt-get install xml-twig-tools
#5
1
There is nothing built in that can handle this situation easily.
Your approach sounds reasonable, though I would probably start with a "skeleton" document containing the elements that need to be repeated and generate several documents with the "records".
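The skeleton idea could look something like the following Python sketch. It is only an illustration under some assumptions: the rows are direct children of the root element, everything else under the root counts as the repeated header, and there are no XML namespaces. The names split_with_header, export, meta and record are invented for the example and do not come from the question.

```python
import copy
import xml.etree.ElementTree as ET


def split_with_header(xml_text, row_tag, rows_per_file):
    """Return a list of XML strings; each repeats the non-row children of the root."""
    root = ET.fromstring(xml_text)
    rows = [child for child in root if child.tag == row_tag]

    # Skeleton: the root element with its attributes and every child that is not a row.
    skeleton = ET.Element(root.tag, dict(root.attrib))
    for child in root:
        if child.tag != row_tag:
            skeleton.append(copy.deepcopy(child))

    pieces = []
    for i in range(0, len(rows), rows_per_file):
        doc = copy.deepcopy(skeleton)          # fresh copy of the header per file
        doc.extend(rows[i:i + rows_per_file])  # then this file's share of the rows
        pieces.append(ET.tostring(doc, encoding="unicode"))
    return pieces


sample = ('<export><meta source="crm"/>'
          '<record id="1"/><record id="2"/><record id="3"/></export>')

for n, piece in enumerate(split_with_header(sample, "record", 2), start=1):
    print(f"records_{n:03d}.xml: {piece}")
```

Each returned string can be written to its own file using whatever naming scheme is needed. Note that header elements that appear after the rows in the source end up before them in the output, and the whole input is held in memory.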
Update:
After a bit of digging, I found this article describing a way to split files using XSLT.
#6
1
xml_split - split huge XML documents into smaller chunks
http://www.perlmonks.org/index.pl?node_id=429707
http://metacpan.org/pod/XML::Twig
#7
0
Using UltraEdit, based on https://www.ultraedit.com/forums/viewtopic.php?f=52&t=6704
All I added was some XML header and footer bits. The first and last files need to be fixed manually (or remove the root element from your source).
// from https://www.ultraedit.com/forums/viewtopic.php?f=52&t=6704
var FoundsPerFile = 200; // Global setting for number of found split strings per file.
var SplitString = "</letter>"; // String where to split. The split occurs after next character.
var xmlHead = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>';
var xmlRootStart = '<letters xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" letterCode="OA01" >';
var xmlRootEnd = '</letters>';
/* Find the tab index of the active document */
// Copied from http://www.ultraedit.com/forums/viewtopic.php?t=4571
function getActiveDocumentIndex () {
var tabindex = -1; /* start value */
for (var i = 0; i < UltraEdit.document.length; i++)
{
if (UltraEdit.activeDocument.path==UltraEdit.document[i].path) {
tabindex = i;
break;
}
}
return tabindex;
}
if (UltraEdit.document.length) { // Is any file open?
// Set working environment required for this job.
UltraEdit.insertMode();
UltraEdit.columnModeOff();
UltraEdit.activeDocument.hexOff();
UltraEdit.ueReOn();
// Move cursor to top of active file and run the initial search.
UltraEdit.activeDocument.top();
UltraEdit.activeDocument.findReplace.searchDown=true;
UltraEdit.activeDocument.findReplace.matchCase=true;
UltraEdit.activeDocument.findReplace.matchWord=false;
UltraEdit.activeDocument.findReplace.regExp=false;
// If the string to split is not found in this file, do nothing.
if (UltraEdit.activeDocument.findReplace.find(SplitString)) {
// This file is probably the correct file for this script.
var FileNumber = 1; // Counts the number of saved files.
var StringsFound = 1; // Counts the number of found split strings.
var NewFileIndex = UltraEdit.document.length;
/* Get the path of the current file to save the new
files in the same directory as the current file. */
var SavePath = "";
var LastBackSlash = UltraEdit.activeDocument.path.lastIndexOf("\\");
if (LastBackSlash >= 0) {
LastBackSlash++;
SavePath = UltraEdit.activeDocument.path.substring(0,LastBackSlash);
}
/* Get active file index in case of more than 1 file is open and the
current file does not get back the focus after closing the new files. */
var FileToSplit = getActiveDocumentIndex();
// Always use clipboard 9 for this script and not the Windows clipboard.
UltraEdit.selectClipboard(9);
// Split the file after every x found split strings until source file is empty.
while (1) {
while (StringsFound < FoundsPerFile) {
if (UltraEdit.document[FileToSplit].findReplace.find(SplitString)) StringsFound++;
else {
UltraEdit.document[FileToSplit].bottom();
break;
}
}
// End the selection of the find command.
UltraEdit.document[FileToSplit].endSelect();
// Move the cursor right to include the next character and unselect the found string.
UltraEdit.document[FileToSplit].key("RIGHT ARROW");
// Select from this cursor position everything to top of the file.
UltraEdit.document[FileToSplit].selectToTop();
// Is the file not already empty?
if (UltraEdit.document[FileToSplit].isSel()) {
// Cut the selection and paste it into a new file.
UltraEdit.document[FileToSplit].cut();
UltraEdit.newFile();
UltraEdit.document[NewFileIndex].setActive();
UltraEdit.activeDocument.paste();
/* Add line termination on the last line and remove automatically added indent
spaces/tabs if auto-indent is enabled if the last line is not already terminated. */
if (UltraEdit.activeDocument.isColNumGt(1)) {
UltraEdit.activeDocument.insertLine();
if (UltraEdit.activeDocument.isColNumGt(1)) {
UltraEdit.activeDocument.deleteToStartOfLine();
}
}
// add headers and footers
UltraEdit.activeDocument.top();
UltraEdit.activeDocument.write(xmlHead);
UltraEdit.activeDocument.write(xmlRootStart);
UltraEdit.activeDocument.bottom();
UltraEdit.activeDocument.write(xmlRootEnd);
// Build the file name for this new file.
var SaveFileName = SavePath + "LETTER";
if (FileNumber < 10) SaveFileName += "0";
SaveFileName += String(FileNumber) + ".raw.xml";
// Save the new file and close it.
UltraEdit.saveAs(SaveFileName);
UltraEdit.closeFile(SaveFileName,2);
FileNumber++;
StringsFound = 0;
/* Delete the line termination in the source file
if last found split string was at end of a line. */
UltraEdit.document[FileToSplit].endSelect();
UltraEdit.document[FileToSplit].key("END");
if (UltraEdit.document[FileToSplit].isColNumGt(1)) {
UltraEdit.document[FileToSplit].top();
} else {
UltraEdit.document[FileToSplit].deleteLine();
}
} else break;
UltraEdit.outputWindow.write("Progress " + SaveFileName);
} // Loop executed until source file is empty!
// Close source file without saving and re-open it.
var NameOfFileToSplit = UltraEdit.document[FileToSplit].path;
UltraEdit.closeFile(NameOfFileToSplit,2);
/* The following code line could be commented if the source
file is not needed anymore for further actions. */
UltraEdit.open(NameOfFileToSplit);
// Free memory and switch back to Windows clipboard.
UltraEdit.clearClipboard();
UltraEdit.selectClipboard(0);
}
}
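For readers without UltraEdit, the same text-based idea (cut after every N occurrences of a closing tag, then wrap each chunk in a declaration and root element) might be sketched in Python as below. This is not a port of the script above; the function name split_text is invented, and it assumes the original root element has already been removed from the input, as the answer suggests. Being plain string handling, it will break if the split string also appears inside comments or CDATA.

```python
def split_text(body, split_string, per_file, xml_head, root_start, root_end):
    """Cut body after every per_file occurrences of split_string, then wrap each chunk."""
    chunks = []
    start = pos = found = 0
    while True:
        hit = body.find(split_string, pos)
        if hit == -1:
            break
        pos = hit + len(split_string)
        found += 1
        if found == per_file:
            chunks.append(body[start:pos])
            start, found = pos, 0
    if body[start:].strip():  # leftover records after the last full chunk
        chunks.append(body[start:])
    return [xml_head + "\n" + root_start + "\n" + c.strip() + "\n" + root_end + "\n"
            for c in chunks]


body = "<letter>a</letter>\n<letter>b</letter>\n<letter>c</letter>\n"
files = split_text(body, "</letter>", 2,
                   '<?xml version="1.0" encoding="UTF-8"?>', "<letters>", "</letters>")
for number, text in enumerate(files, start=1):
    print(f"LETTER{number:02d}.raw.xml has {text.count('</letter>')} letter(s)")
```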