What is the best/correct/most efficient way to store a data series in XML format?

Time: 2021-06-28 23:47:21

I have an application which will store a series of (float) values in an XML file. There could be upwards of 100,000 values so I am interested in keeping the size down, but I also want files to be readily accessible by third parties.

There seem to be various methods open to me as far as encoding the data within the XML:

1.

<data>
  <value>12.34</value>
  <value>56.78</value>
  ...
  <value>90.12</value>
</data>

2.

<data>
  <value v="12.34"/>
  <value v="56.78"/>
  ...
  <value v="90.12"/>
</data> 

3.

<data>12.34
56.78
  ...
90.12
</data> 

4.

<data>12.34, 56.78, ... 90.12</data> 

and there are probably more variations as well.

I'm just curious to know the drawbacks (if any) of each of these approaches. Some may not be standards-compliant, for example.

4 Solutions

#1


3  

I don't think there's a "better" way of doing it. Read my comment above for alternatives. But if you're hooked on XML, then go with whatever works for you. I personally prefer something like this:

<data>
   <item key="somekey1" value="somevalue1" />
   <item key="somekey2" value="somevalue2" />
   <item key="somekey3" value="somevalue3" />
</data>

Simply because it's nice and easy to read, and keeps the tags smaller.

EDIT:

Remember, the fewer characters there are in your XML, the smaller it will be (again, this is why I suggest JSON), so if you can get it nice and tight, by all means do it.

<d>
   <i k="somekey1" v="somevalue1" />
   <i k="somekey2" v="somevalue2" />
   <i k="somekey3" v="somevalue3" />
</d>

EDIT:

Also, I know you didn't ask, but I thought I'd show you what JSON would look like

   [{ "key": "somekey1", "value": "somevalue1"},
    { "key": "somekey2", "value": "somevalue2"}]

#2


3  

Semantically, there's no "difference" between 1 and 2. Similarly, there's no difference between 3 and 4, save for the delimiter. Also note that whitespace can be normalized or collapsed by XML tools (for example by a schema type that collapses whitespace), so if you read #3, it may well come back as "one long line" without any newlines separating the values.

As for which is better, it's up to your application, and how you plan on using the data.

The serialized version (with each number in its own element) gives the user "direct" access to the individual numbers.

Using the delimited "blob" requires the users to parse it themselves, so it depends on what kind of interface you wish to provide.
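
To make the difference concrete, here is a minimal Python sketch using the standard `xml.etree.ElementTree` module. The sample documents are shortened versions of forms 1 and 4 from the question; the variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

# Form 1: one <value> element per number.
per_element = "<data><value>12.34</value><value>56.78</value><value>90.12</value></data>"
# Form 4: a single comma-delimited text blob.
blob = "<data>12.34, 56.78, 90.12</data>"

# Form 1: the XML parser has already separated the values for us.
values_1 = [float(v.text) for v in ET.fromstring(per_element).iter("value")]

# Form 4: the consumer has to know the delimiter and split the text itself.
values_4 = [float(tok) for tok in ET.fromstring(blob).text.split(",")]

print(values_1 == values_4)
```

Both routes end up with the same list of floats; the only question is whether the splitting is done by the XML parser or by code the consumer has to write.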

Also, the "blob" technique tends to prevent the XML from being "streamed", since you'll have one enormous element rather than a bunch of little elements. That can have a large memory impact.
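
As a rough illustration of what streaming looks like with the per-element forms, here is a sketch built on `xml.etree.ElementTree.iterparse` from the Python standard library. The `stream_sum` helper and the generated test data are assumptions made for the example, not part of the question.

```python
import io
import xml.etree.ElementTree as ET

def stream_sum(source):
    """Sum the <value> elements one at a time instead of loading a full tree."""
    total = 0.0
    count = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "value":
            total += float(elem.text)
            count += 1
            # Release the element's text; the emptied element itself
            # stays attached to the root until parsing finishes.
            elem.clear()
    return total, count

# 1000 values: 0.5, 1.5, ..., 999.5
xml_bytes = b"<data>" + b"".join(b"<value>%d.5</value>" % i for i in range(1000)) + b"</data>"
total, count = stream_sum(io.BytesIO(xml_bytes))
print(total, count)
```

With the blob forms there is nothing comparable to iterate over: the text of `<data>` only becomes available as a whole once its end tag has been read.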

As for the overall file size, it may help to know that if you actually compress this data, the final compressed sizes will likely be very close to each other, regardless of the technique. Dunno if that property is important or not.
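
If you want to check this for your own data, a quick measurement is easy to write. The sketch below builds the four encodings from the question for 100,000 random two-decimal values (made-up test data) and compares raw and gzip-compressed sizes. The actual numbers depend on the data and the compressor, so treat this as a way to measure rather than a result.

```python
import gzip
import random

random.seed(0)
# Made-up sample data: 100,000 values formatted with two decimals.
values = [f"{random.uniform(0, 100):.2f}" for _ in range(100_000)]

encodings = {
    "1: <value>x</value>": "<data>" + "".join(f"<value>{v}</value>" for v in values) + "</data>",
    "2: <value v=\"x\"/>": "<data>" + "".join(f'<value v="{v}"/>' for v in values) + "</data>",
    "3: newline-delimited": "<data>" + "\n".join(values) + "</data>",
    "4: comma-delimited": "<data>" + ", ".join(values) + "</data>",
}

sizes = {}
for name, text in encodings.items():
    raw = text.encode("utf-8")
    sizes[name] = (len(raw), len(gzip.compress(raw)))
    print(f"{name:24s} raw={sizes[name][0]:>9,d}  gzip={sizes[name][1]:>9,d}")
```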

#3


2  

The first two forms are preferable to the final two, with the first being the best. The latter two would require reading the contents of the data and splitting it before you could use it. The first two, however, allow you to enumerate over the data and use only the piece or pieces you need at any given time. However, the second form embeds the value in yet another layer via an attribute, which makes it less desirable than the first (provided there aren't other elements/attributes for each particular data point).

#4


1  

If the only data your file will hold will always be just those float values, do not use XML. Use a plain text file with one value per line. It'll be many times faster to read and write, and it won't be any less self-descriptive than the XML samples you wrote.
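
For comparison, a one-value-per-line file takes only a few lines of Python to write and read back. This is a sketch; the `write_values` and `read_values` helpers and the `series.txt` file name are made up for the example.

```python
import os
import tempfile

def write_values(path, values):
    """Write one float per line; repr() keeps enough digits to round-trip."""
    with open(path, "w", encoding="ascii") as f:
        for v in values:
            f.write(repr(float(v)) + "\n")

def read_values(path):
    """Read the floats back, skipping blank lines."""
    with open(path, encoding="ascii") as f:
        return [float(line) for line in f if line.strip()]

values = [12.34, 56.78, 90.12]
path = os.path.join(tempfile.mkdtemp(), "series.txt")
write_values(path, values)
round_trip = read_values(path)
print(round_trip == values)
```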

XML may be a requirement, for example when the file will be shared between different applications/systems/users with different cultures (TR, EN, FR). Some write floats with '.' (12.34) while some write them with ',' (12,34). An XML parser will not convert between these by itself, but a schema can pin the format down: XML Schema's xs:float and xs:double always use '.' as the decimal separator. So, if XML is a requirement, the 3rd and 4th samples you wrote are totally missing the point of XML. In practice they're no different from using a plain text file, except with a slow XML parser in the way.
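
Whatever the tooling, the decimal-separator problem goes away if every writer emits '.' and every reader expects it. Here is a small Python sketch of that convention; `to_xml` and `from_xml` are hypothetical helper names, and the approach relies on Python's `repr()` and `float()` being independent of the user's locale.

```python
import xml.etree.ElementTree as ET

def to_xml(values):
    """Serialize floats as <value> elements, always with '.' as the separator."""
    root = ET.Element("data")
    for v in values:
        # repr() of a Python float uses '.', whatever the user's locale is.
        ET.SubElement(root, "value").text = repr(float(v))
    return ET.tostring(root, encoding="unicode")

def from_xml(text):
    """Parse the values back; float() only accepts '.', so "12,34" raises ValueError."""
    return [float(e.text) for e in ET.fromstring(text).iter("value")]

doc = to_xml([12.34, 56.78])
print(doc)
print(from_xml(doc))
```

A file written by a locale-sensitive tool (with "12,34") fails loudly in `from_xml` instead of being silently misread, which is usually what you want from an interchange format.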

The 1st and 2nd samples you wrote have only a subtle difference in meaning/interpretation. The first implies that the actual data you want to present is 12.34, and that it is a 'value'. The second implies that there is a 'value', and that the 'v' data associated with it is 12.34.
