Speeding up XML schema validation of a batch of XML files against the same XML schema (XSD)

Time: 2021-09-23 17:16:24

I would like to speed up the process of validating a batch of XML files against the same single XML schema (XSD). The only restriction is that I am in a PHP environment.

My current problem is that the schema I would like to validate against imports the fairly complex XHTML schema of 2755 lines (http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd). Even for very simple data this takes a long time (around 30 seconds per validation). As I have thousands of XML files in my batch, this doesn't scale well.

To validate the XML files I use both of these methods from the standard php-xml libraries:

  • DOMDocument::schemaValidate
  • DOMDocument::schemaValidateSource

I suspect that the PHP implementation fetches the XHTML schema via HTTP, builds some internal representation (possibly a DOMDocument), and throws it away when the validation is completed. I was hoping that some option for the XML libraries might change this behaviour so that something in this process is cached for reuse.

I've built a simple test setup that illustrates my problem:

test-schema.xsd

<xs:schema attributeFormDefault="unqualified"
    elementFormDefault="qualified"
    targetNamespace="http://myschema.example.com/"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:myschema="http://myschema.example.com/"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xs:import
        schemaLocation="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
        namespace="http://www.w3.org/1999/xhtml">
    </xs:import>
    <xs:element name="Root">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="MyHTMLElement">
                    <xs:complexType>
                        <xs:complexContent>
                            <xs:extension base="xhtml:Flow"></xs:extension>
                        </xs:complexContent>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

test-data.xml

<?xml version="1.0" encoding="UTF-8"?>
<Root xmlns="http://myschema.example.com/" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://myschema.example.com/ test-schema.xsd ">
  <MyHTMLElement>
    <xhtml:p>This is an XHTML paragraph!</xhtml:p>
  </MyHTMLElement>
</Root>

schematest.php

<?php
$data_dom = new DOMDocument();
$data_dom->load('test-data.xml');

// Multiple validations using the schemaValidate method.
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = time();
    echo "schemaValidate: Attempt #$attempt returns ";
    if (!$data_dom->schemaValidate('test-schema.xsd')) {
        echo "Invalid!";
    } else {
        echo "Valid!";
    }
    $end = time();
    echo " in " . ($end-$start) . " seconds.\n";
}

// Loading schema into a string.
$schema_source = file_get_contents('test-schema.xsd');

// Multiple validations using the schemaValidateSource method.
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = time();
    echo "schemaValidateSource: Attempt #$attempt returns ";
    if (!$data_dom->schemaValidateSource($schema_source)) {
        echo "Invalid!";
    } else {
        echo "Valid!";
    }
    $end = time();
    echo " in " . ($end-$start) . " seconds.\n";
}

Running this schematest.php file produces the following output:

schemaValidate: Attempt #1 returns Valid! in 30 seconds.
schemaValidate: Attempt #2 returns Valid! in 30 seconds.
schemaValidate: Attempt #3 returns Valid! in 30 seconds.
schemaValidateSource: Attempt #1 returns Valid! in 32 seconds.
schemaValidateSource: Attempt #2 returns Valid! in 30 seconds.
schemaValidateSource: Attempt #3 returns Valid! in 30 seconds.

Any help and suggestions on how to solve this issue are very welcome!

2 answers

#1


13  

You can safely subtract 30 seconds from the timing values as overhead.

Remote requests to the W3C servers are delayed on purpose because most libraries do not cache the documents, even though the HTTP headers suggest it. Read it in the W3C's own words:

The W3C servers are slow to return DTDs. Is the delay intentional?

Yes. Due to various software systems downloading DTDs from our site millions of times a day (despite the caching directives of our servers), we have started to serve DTDs and schema (DTD, XSD, ENT, MOD, etc.) from our site with an artificial delay. Our goals in doing so are to bring more attention to our ongoing issues with excessive DTD traffic, and to protect the stability and response time of the rest of our site. We recommend HTTP caching or catalog files to improve performance.

W3.org tries to keep the number of requests low, which is understandable. PHP's DOMDocument is based on libxml, and libxml allows you to set an external entity loader. The whole Catalog support section of the libxml documentation is also relevant in this case.

To solve the issue in question, set up a catalog.xml file:

<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <system systemId="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
            uri="xhtml1-transitional.xsd"/>
    <system systemId="http://www.w3.org/2001/xml.xsd"
            uri="xml.xsd"/>
</catalog>

Save a copy of the two .xsd files next to the catalog, using the names given in the catalog file. Relative paths as well as absolute file:///... paths work if you prefer a different directory.

Then ensure that your system's environment variable XML_CATALOG_FILES is set to the filename of the catalog.xml file. Once everything is set up, the validation just runs through:

schemaValidate: Attempt #1 returns Valid! in 0 seconds.
schemaValidate: Attempt #2 returns Valid! in 0 seconds.
schemaValidate: Attempt #3 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #1 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #2 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #3 returns Valid! in 0 seconds.

If it still takes a long time, that is a sign that the environment variable does not point to the right location. I have covered setting the variable, along with some edge cases, in a blog post.

It should take care of diverse edge cases, like filenames containing spaces.
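One way to set the variable from within PHP itself could look like the sketch below. This is not the code from the blog post: the helper name `useXmlCatalog` is hypothetical, and the sketch assumes that libxml only reads XML_CATALOG_FILES when it first initializes its catalogs, so the helper has to run before the first validation in the process. Because the variable holds a space-separated list, spaces in the path are percent-encoded here, which is also an assumption about how libxml parses the value.

```php
<?php
/**
 * Hypothetical helper: point libxml at a catalog file through the
 * XML_CATALOG_FILES environment variable. Call it before the first
 * validation (assumption: libxml reads the variable lazily, once).
 */
function useXmlCatalog(string $catalogFile): string
{
    $path = realpath($catalogFile);
    if ($path === false) {
        throw new RuntimeException("Catalog file not found: $catalogFile");
    }

    // Normalize Windows separators and make sure the path starts with "/".
    $path = str_replace('\\', '/', $path);
    if ($path[0] !== '/') {
        $path = '/' . $path;
    }

    // The variable is a space-separated list, so encode spaces in the path.
    $uri = 'file://' . str_replace(' ', '%20', $path);

    putenv("XML_CATALOG_FILES=$uri");

    return $uri;
}
```

Calling `useXmlCatalog(__DIR__ . '/catalog.xml')` at the top of schematest.php would then be the only change needed, provided the assumptions above hold for your libxml build.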

Alternatively, it is possible to create a simple external entity loader callback function that uses a URL => file mapping to the local file system in the form of an array:

$mapping = [
     'http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd'
         => 'schema/xhtml1-transitional.xsd',

     'http://www.w3.org/2001/xml.xsd'                          
         => 'schema/xml.xsd',
];

As this shows, I've placed a verbatim copy of these two XSD files into a subdirectory called schema. The next step is to use libxml_set_external_entity_loader to activate the callback function with the mapping. Files that already exist on disk are preferred and loaded directly. If the routine encounters a non-file that has no mapping, a RuntimeException is thrown with a detailed message:

libxml_set_external_entity_loader(
    function ($public, $system, $context) use ($mapping) {

        if (is_file($system)) {
            return $system;
        }

        if (isset($mapping[$system])) {
            return __DIR__ . '/' . $mapping[$system];
        }

        $message = sprintf(
            "Failed to load external entity: Public: %s; System: %s; Context: %s",
            var_export($public, 1), var_export($system, 1),
            strtr(var_export($context, 1), [" (\n  " => '(', "\n " => '', "\n" => ''])
        );

        throw new RuntimeException($message);
    }
);

After setting this external entity loader, the delay caused by the remote requests is gone.

And that's it. See the Gist. Take care: this external entity loader was written for loading the XML file to validate from disk and for "resolving" the XSD URIs to local filenames. Other kinds of operations (e.g. DTD-based validation) might need some code changes or extensions. The XML catalog is the preferable option, as it also works with other tools.
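With the remote fetch out of the way, the batch itself can be a plain loop. The following is a minimal sketch, not part of the original answer: the function name `validateBatch` and the shape of its return value are hypothetical. It collects the libxml errors per file instead of letting them surface as PHP warnings. As far as I know, PHP's DOM extension does not offer a way to reuse a parsed schema between calls, so the schema is still parsed once per file, only from local disk now.

```php
<?php
/**
 * Hypothetical helper: validate many XML files against one schema.
 * Returns an array of [filename => list of error messages]; an empty
 * list means the file is valid.
 */
function validateBatch(array $xmlFiles, string $schemaFile): array
{
    // Collect libxml errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $results = [];

    foreach ($xmlFiles as $file) {
        $errors = [];
        $dom = new DOMDocument();

        // A file that fails to parse is reported without validating it.
        if (!$dom->load($file) || !$dom->schemaValidate($schemaFile)) {
            foreach (libxml_get_errors() as $error) {
                $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
            }
        }

        // Reset the error buffer so errors do not leak into the next file.
        libxml_clear_errors();
        $results[$file] = $errors;
    }

    // Restore the previous error handling mode.
    libxml_use_internal_errors($previous);

    return $results;
}
```

A call such as `validateBatch(glob('batch/*.xml'), 'test-schema.xsd')` (the batch directory is an assumed name) should work with either the catalog or the external entity loader in place.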

#2


0  

As an alternative to @hakre's answer: download the external resource (DTD or XSD) on the first request and use the downloaded copy afterwards:

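libxml_set_external_entity_loader(
    function ($public, $system, $context) {
        if (is_file($system)) {
            return $system;
        }
        // Derive a stable cache path from the URL. tempnam() would create a
        // new, empty file on every call, so the cache would never be reused.
        $cached_file = sys_get_temp_dir() . '/' . md5($system);
        if (!is_file($cached_file)) {
            // First request: download once (requires allow_url_fopen).
            copy($system, $cached_file);
        }
        return $cached_file;
    }
);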
