Large XML parcels: attributes or elements?

I understand that there's no universal answer to the attribute vs. element debate (and I read through the other questions I saw on this), but any insight into this particular circumstance would be greatly appreciated.

In our case we're going to be receiving very large amounts of master and transactional data from a system of record to be merged into our own database (upwards of a gig, nightly). The information we receive is essentially a one-for-one with the records in our tables, so for example a list of customers would be (in our old version):

<Custs>
  <Cust ID="101" LongName="Large customer" ShortName="LgCust" Loc="SE"/>
  <Cust ID="102" LongName="Small customer" ShortName="SmCust" Loc="NE"/>
  ....
</Custs>

However we've been discussing the merits of moving to a structure that's more element based, for example:

<Custs>
  <Cust ID="101">
    <LongName>Large Customer</LongName>
    <ShortName>LgCust</ShortName>
    <Loc>SE</Loc>
  </Cust>
  <Cust ID="102">
    <LongName>Small Customer</LongName>
    <ShortName>SmCust</ShortName>
    <Loc>NE</Loc>
  </Cust>
  ....
</Custs>

Because the files are so large I don't think we'll be using a DOM parser to try to load these into memory, nor do we have any need of locating particular items in the files. So my question is: in this case, is one form (elements or attributes) generally preferred over the other when you've got large amounts of data and performance demands to consider?

5 answers

#1

If performance is the only requirement, I think you have to go with the attributes, just because it takes up less space. I don't see any advantage to the elements.
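
A rough way to check the space claim is to encode one record in each layout and compare byte counts. This is only a sketch using the sample customer from the question (the element form is written without indentation, which favors it slightly); real ratios depend on your field names and values.

```python
# One customer record from the question, in both layouts.
ATTR_FORM = '<Cust ID="101" LongName="Large customer" ShortName="LgCust" Loc="SE"/>'
ELEM_FORM = (
    '<Cust ID="101">'
    '<LongName>Large customer</LongName>'
    '<ShortName>LgCust</ShortName>'
    '<Loc>SE</Loc>'
    '</Cust>'
)


def encoded_size(xml_text: str) -> int:
    """Return the number of bytes the text occupies as UTF-8."""
    return len(xml_text.encode("utf-8"))


attr_bytes = encoded_size(ATTR_FORM)
elem_bytes = encoded_size(ELEM_FORM)

# Every element name is written twice (open and close tag),
# so the element form is larger for the same data.
print(attr_bytes, elem_bytes, round(elem_bytes / attr_bytes, 2))
```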

#2

I have used both methods with very large files both with DOM and with a line-by-line reader. Certainly you need to use a line-by-line reader to get good performance for very large files. Beyond this my gut feeling is that attributes are more efficient but I have no hard data to back that opinion up with!
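
As an illustration of the reader-style approach, here is a minimal sketch using Python's standard `xml.etree.ElementTree.iterparse`. The `iter_customers` helper and the `FIELDS` tuple are made up for this example; the code accepts either layout from the question and discards each record once it has been handled, so the whole file never sits in memory.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical field list, taken from the sample records in the question.
FIELDS = ("LongName", "ShortName", "Loc")


def iter_customers(source):
    """Yield one dict per <Cust>, for either layout, without keeping the tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "Cust":
            record = {"ID": elem.get("ID")}
            for name in FIELDS:
                value = elem.get(name)            # attribute layout
                if value is None:
                    value = elem.findtext(name)   # element layout
                record[name] = value
            yield record
            root.clear()  # drop processed records so memory use stays flat


attr_xml = b'<Custs><Cust ID="101" LongName="Large customer" ShortName="LgCust" Loc="SE"/></Custs>'
elem_xml = (
    b'<Custs><Cust ID="101"><LongName>Large customer</LongName>'
    b'<ShortName>LgCust</ShortName><Loc>SE</Loc></Cust></Custs>'
)

from_attrs = list(iter_customers(io.BytesIO(attr_xml)))
from_elems = list(iter_customers(io.BytesIO(elem_xml)))
print(from_attrs == from_elems)
```

Either way the consuming code sees the same records, so the layout choice mostly affects file size and the number of parse events, not the shape of the loader.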

#3

If someone's providing you with 1 GB of data at a time and you care about performance at all, you should really re-examine the decision to use XML as your transmission format. You're not parsing the data into a DOM, so you're not really able to make use of the benefits that XML gives you over (say) CSV -- ensuring well-formedness, schema validation, transformation, querying, etc.
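
For comparison, the same customer records as CSV might look like the sketch below, written and read back with Python's standard `csv` module. The column names are assumed from the question's sample data.

```python
import csv
import io

# Column names assumed from the sample records in the question.
COLUMNS = ["ID", "LongName", "ShortName", "Loc"]
rows = [
    {"ID": "101", "LongName": "Large customer", "ShortName": "LgCust", "Loc": "SE"},
    {"ID": "102", "LongName": "Small customer", "ShortName": "SmCust", "Loc": "NE"},
]

# Write a header line followed by one line per record.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Reading is row by row, so a large file never has to be held in memory.
read_back = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(csv_text)
```

The field names appear once, in the header, instead of once or twice per record.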

And now you're considering moving to a format where half of the data that you're going to be processing is markup. What kind of sense does that make?

I come from the when-the-only-tool-you-have-is-a-hammer-you-tend-to-perceive-all-problems-as-nails school of XML, and even I wouldn't use XML for this.

#4

The "attribute way" is preferable if you plan to validate your XML before processing it with a plain old DTD. A DTD cannot constrain the text content of an element, but it can apply some basic rules to attribute values.

If you plan to use XSD or no validation at all then I would choose the most readable form, which IMHO is the "element way".

No matter where the XML comes from, validation should be the first step in processing it. It makes your application safer and your code smaller, since many checks are made before your code even touches the XML data. XSD should be the preferred choice, since it can also check data types (e.g. float or date fields inside element or attribute content). The downside is that it is much more complex than a plain DTD file.

#5

Exchanging the data in XML format isn't necessarily bad just because it is a large data set.

However, if you are exchanging really big XML files you might want to consider compressing them before transmission using zip, GZIP, etc. in order to save time and bandwidth.
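
To sketch how that could work on the receiving side: Python's standard `gzip` module can decompress on the fly while `iterparse` streams through the records, so the uncompressed file never needs to exist on disk. The generated customer data below is made up purely for the demonstration.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Build a made-up sample document of 1000 attribute-style records.
xml_bytes = b"<Custs>" + b"".join(
    b'<Cust ID="%d" LongName="Customer %d" ShortName="C%d" Loc="SE"/>' % (i, i, i)
    for i in range(1000)
) + b"</Custs>"

# Repetitive markup compresses well, whichever layout is used.
compressed = gzip.compress(xml_bytes)
print(len(xml_bytes), len(compressed))

# Stream-parse straight from the compressed bytes.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
    count = sum(1 for _, elem in ET.iterparse(stream) if elem.tag == "Cust")

print(count)
```

For a real nightly file you would open the `.gz` file from disk with `gzip.open` instead of the in-memory buffer, and clear processed elements as you go.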

If you are exchanging database info, consider formatting the information as SQL statements (and even compressing those SQL files before sending), especially if that is what you wind up converting the XML into anyway.
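
To show what "converting the XML into SQL" tends to look like in practice, here is a minimal sketch that streams the attribute-style records into a database with parameterized statements. The `customers` table, its columns, and the use of an in-memory SQLite database are assumptions for the example, not anything from the original setup.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

xml_bytes = b"""<Custs>
  <Cust ID="101" LongName="Large customer" ShortName="LgCust" Loc="SE"/>
  <Cust ID="102" LongName="Small customer" ShortName="SmCust" Loc="NE"/>
</Custs>"""

# Hypothetical target table; a real schema would come from your own database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, long_name TEXT, short_name TEXT, loc TEXT)"
)


def cust_rows(source):
    """Yield one parameter tuple per <Cust> element as it is parsed."""
    for _, elem in ET.iterparse(source):
        if elem.tag == "Cust":
            yield (
                int(elem.get("ID")),
                elem.get("LongName"),
                elem.get("ShortName"),
                elem.get("Loc"),
            )


# INSERT OR REPLACE acts as a simple merge keyed on the primary key.
with conn:
    conn.executemany(
        "INSERT OR REPLACE INTO customers VALUES (?, ?, ?, ?)",
        cust_rows(io.BytesIO(xml_bytes)),
    )

rows = conn.execute(
    "SELECT id, long_name, short_name, loc FROM customers ORDER BY id"
).fetchall()
print(rows)
```

Binding values as parameters avoids the quoting and injection problems that come with shipping raw SQL text between systems.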
