How can I efficiently deep-compare two XML documents in MarkLogic?

I have a logging requirement to store the differences between old and new values when a (moderately complex) section of a document changes in our database. Only the changed data should be reported on. My current solution works reasonably well, but I have concerns that it's not optimal and may cause performance problems when updates start occurring in volume.

My current solution looks mostly like this:

for $element in $data/section//element()[text()]
return
  if (not($old-data//*[fn:name() = fn:name($element) and text() = $element/text()])) then
    element log:difference {
       ...
    }
  else ()

My problem is that the profiler shows this taking a (relatively) long time doing the thousands of comparisons that the //*[fn:name() = fn:name($element)] construct leads to. It's only a couple of tens of milliseconds, but with a lot of updates that's going to add up, and it feels like there should be a way to avoid it.

The structure of the xml is sufficiently well defined that I can be sure that a field in one document will have the same relative xpath as the other one, so technically my use of // could be removed, at the expense of manually walking the xml tree, but that's a reasonable amount of complexity and the structure is fairly flat so I'm not sure it would be very much more efficient.

Also, there is a finite set of fields that can be in this section of the document, so manually comparing each of them in turn (with fully qualified xpaths) would be an option, but I'd rather avoid it, since it would be best not to need to revisit this code in the future, should that list of fields change.

Are the solutions going to be along those lines, or is there something more obvious that I've missed?

Is there any way to construct the xpath using the string value of the element name directly without using a predicate? I'm assuming that would be more efficient, since xpath evaluation doesn't normally take as long as this.

Can I, perhaps, extract the relative xpath of an element then look at that precise place in the other document?

Am I missing a built-in XML comparison tool in MarkLogic itself?

2 Solutions

#1

Using fn:name is a bad idea because it can be fooled by differences in namespace prefixes. It would be better to use fn:node-name. I would also avoid '//' wherever possible.
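
To illustrate the prefix problem, here is a minimal sketch using made-up sample data (the namespace URI and element names are assumptions, not taken from the question). Two elements bound to the same namespace through different prefixes compare as different with fn:name, but as equal with fn:node-name:

```xquery
xquery version "1.0-ml";

(: Hypothetical sample data: same namespace URI, different prefixes. :)
let $old := <a:section xmlns:a="http://example.com/ns"><a:title>Same</a:title></a:section>
let $new := <b:section xmlns:b="http://example.com/ns"><b:title>Same</b:title></b:section>
let $x := $old/*[1]
let $y := $new/*[1]
return (
  (: fn:name returns the lexical name including the prefix, :)
  (: so this compares "a:title" with "b:title".             :)
  fn:name($x) = fn:name($y),
  (: fn:node-name returns an xs:QName; eq compares the      :)
  (: namespace URI and local name and ignores the prefix.   :)
  fn:node-name($x) eq fn:node-name($y)
)
```

The first comparison is false and the second is true, even though both elements carry the same expanded name.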

Getting back to the deep compare, this sounds like an XML diff. There is no XML diff tool built into MarkLogic, so it might be best to set one up as a REST-ish web service and use MarkLogic's xdmp:http-post (http://docs.marklogic.com/xdmp:http-post) to call it. There are quite a few XML diff tools out there.

If you want to stay in XQuery, the solution will probably be slower. I would start with a recursive tree-walk and fn:deep-equal. Whenever you find a diff for a simple element you can stop descending, which prunes the tree and limits the work to be done. Here's a very rough sketch of how that might work. It's a long way from a proper LCS-based diff (http://en.wikipedia.org/wiki/Diff) but it might be useful. On my laptop this runs in less than 10 ms.

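(: Arguments are optional (node()?) because the positional pairing :)
(: below passes an empty sequence when one side has fewer children. :)
declare function local:diff(
  $a as node()?, $b as node()?)
as element(diff)*
{
  (: Identical subtrees: nothing to report, stop descending. :)
  if (deep-equal($a, $b)) then ()
  (: At least one side is a leaf (or missing): report the pair. :)
  else if (empty($a/*) or empty($b/*)) then element diff {
    element a { $a }, element b { $b } }
  else
    (: Both sides have children: compare them pairwise by position. :)
    let $seq-a := $a/*
    let $seq-b := $b/*
    let $count := max((count($seq-a), count($seq-b)))
    return
      for $x in 1 to $count
      return local:diff($seq-a[$x], $seq-b[$x])
};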

let $a := xdmp:query-meters()
let $_ := xdmp:sleep(1)
let $b := xdmp:query-meters()
return local:diff($a, $b)
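
Because the sketch above pairs children purely by position, an inserted or reordered element shifts every later pair. As a hedged variation (not part of the original answer), the children could instead be paired by element name, which matches the asker's statement that a field sits at the same relative path in both documents. The function name local:diff-by-name and the sample data are hypothetical:

```xquery
xquery version "1.0-ml";

(: Variation on the tree-walk: pair children by QName, not position. :)
declare function local:diff-by-name(
  $a as element()?, $b as element()?)
as element(diff)*
{
  if (fn:deep-equal($a, $b)) then ()
  else if (fn:empty($a/*) or fn:empty($b/*)) then
    element diff { element a { $a }, element b { $b } }
  else
    (: Every distinct child name that occurs on either side. :)
    for $name in fn:distinct-values(
      for $c in ($a/*, $b/*) return fn:node-name($c))
    let $as := $a/*[fn:node-name(.) eq $name]
    let $bs := $b/*[fn:node-name(.) eq $name]
    (: Repeated elements with the same name are paired by position. :)
    for $i in 1 to fn:max((fn:count($as), fn:count($bs)))
    return local:diff-by-name($as[$i], $bs[$i])
};

(: Hypothetical sample data: same fields, different order, one change. :)
local:diff-by-name(
  <section><name>Ann</name><city>Oslo</city></section>,
  <section><city>Bergen</city><name>Ann</name></section>)
```

With this sample input only the city element is reported, since the two name elements are deep-equal despite their different positions. The trade-off is an extra name-filter per level, which should stay cheap as long as the structure is fairly flat.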

#2

I would think it's worthwhile to try building an index, and benchmarking that approach.

I'm not well versed in marklogic, but they have what I recognize as an XSL key function in their API docs

(Update: this seems to only fetch keys. To create them, I'd guess you'd need to use XSLT directly. This is a good how-to. A small stylesheet generating keys on element/@id would be feasible.)

You could even supply the stylesheet inline, and save a little I/O time:

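(: The xsl prefix must be bound to the XSLT namespace, and xsl:key must be closed. :)
xdmp:xslt-eval(
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:key name="element_ids" match="element" use="@id"/>
  </xsl:stylesheet>,
  doc("input.xml")
)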

If every element has an identifier you can use as a key, you can build an index when you parse the file, then compare that list against a stored (earlier) version of keys. From there, you have your list of locations to handle, and thanks to the index, they are found and accessed quite quickly.

If you'd rather stick with XQuery, MarkLogic's map functions (map:map, map:put, map:get) provide a similar key-based lookup interface.
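
As a rough sketch of that idea, the old values could be loaded into a map once and then looked up per field, rather than scanning the old tree for every new element. The sample data and the difference/old/new element names are hypothetical, and the sketch assumes each field name occurs only once in the section; with repeated names the key would need to include more context.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample data; field names are illustrative only. :)
let $old := <section><name>Ann</name><city>Oslo</city></section>
let $new := <section><name>Ann</name><city>Bergen</city></section>

(: Index the old leaf values once, keyed by element name.        :)
(: xdmp:key-from-QName yields a prefix-independent string key.   :)
let $index := map:map()
let $_ :=
  for $e in $old//*[text()]
  return map:put($index, xdmp:key-from-QName(fn:node-name($e)), fn:string($e))

(: One map lookup per new element instead of a scan of $old. :)
for $e in $new//*[text()]
let $key := xdmp:key-from-QName(fn:node-name($e))
let $previous := map:get($index, $key)
(: A missing key gives an empty $previous, which also counts as a change. :)
where fn:not($previous = fn:string($e))
return
  <difference field="{$key}">
    <old>{$previous}</old>
    <new>{fn:string($e)}</new>
  </difference>
```

Whether this actually beats the predicate-based version would need to be confirmed with the profiler on realistic documents.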
