How to convert (shred) arbitrary XML into a flat data structure?

Not actually a duplicate of Import arbitrary XML to SQL Server

My company has 20 GB of XML files that they want to do some data mining against. The analytics tool they will be using is SAS, which I have never used - someone else will be doing the actual mining. My job is to find a way to convert the XML files into a relatively flat data structure so they can be imported into SAS. The files have come from a half-dozen different sources over the course of six years. While they all nominally describe the same thing - the (very detailed) results of a credit inquiry - they don't follow a consistent format, even with files that come from the same source, because the version of the document has changed significantly over time. There are no XSL, XSD, or XSLT documents available.

It seems the answer would be "you want a document database", but apparently SAS needs either something flat, like a CSV or other wide-table structure, or something relational. My experience is primarily in SQL Server, but if there are solutions that target other platforms, we are definitely open to that. We've even looked into using Microsoft Excel, but it doesn't interpret the file correctly (it parses just fine, but it gives the columns nonsensical names).

I've entertained the idea of writing C# code to generate a SQL schema based on the XML data, and hoping that, at least within the scope of an individual source, the structures could be made consistent enough to fit all the files. I've looked into using SQLXML Bulk Load to generate the tables, but this requires a SQL-annotated XSD schema, and there doesn't appear to be any tool to generate this.

We've looked at using the xml Data Type Methods to get the data into a table like this:

CREATE TABLE ResponseData
(
    CustomerID INT,
    NodePath VARCHAR(500),
    Position SMALLINT,
    Value VARCHAR(500)
)

but feel there must be a way to get more useful information separation than that.
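
For reference, here is a minimal sketch of the kind of path/position/value shredding that table implies. It is written in Python only for brevity; the `shred` helper and the sample document are hypothetical and are not our actual code or data. Position here means the index among same-named siblings, which is an assumption about how the column would be filled.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def shred(xml_text, customer_id):
    """Yield (customer_id, node_path, position, value) rows for every
    attribute and every non-empty text value in the document."""
    root = ET.fromstring(xml_text)

    def walk(elem, path, position):
        # Attributes become rows whose path ends in /@name.
        for name, value in elem.attrib.items():
            yield (customer_id, f"{path}/@{name}", position, value)
        text = (elem.text or "").strip()
        if text:
            yield (customer_id, path, position, text)
        # Position = 1-based index among siblings with the same tag.
        seen = Counter()
        for child in elem:
            seen[child.tag] += 1
            yield from walk(child, f"{path}/{child.tag}", seen[child.tag])

    yield from walk(root, f"/{root.tag}", 1)


# Hypothetical sample document, not taken from the real files.
sample = "<inquiry><score>700</score><acct id='1'>A</acct><acct id='2'>B</acct></inquiry>"
rows = list(shred(sample, 42))
for row in rows:
    print(row)
```

Every value ends up as a string in one tall table, and the position only distinguishes same-named siblings, not repeated ancestors, which is exactly why this shape feels too weak for analysis.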

There's plenty of information out there (including several SO questions) about how to convert a known XML document to SQL, but I need to know how to import an arbitrary XML document. "Import XML with an unknown structure" turned up a few suggested tools, but their output isn't that helpful.

Any help would be appreciated!

2 Solutions

#1

This is probably obvious, but I think you're going to have to start by opening several files from different time periods to get a feel for how many "schemas" you're dealing with (in the XML sense). Then you could write some code to read the files systematically, trying to identify each file's "schema" and logging the files that don't match any of your known types. The goal is to figure out how many types of document you really have; after that, you can worry about how to get them into a database, one type at a time, hopefully settling on a single DB schema that can fully represent a document of any type. I realize I haven't said much that is technical here, but I think what you have right now is a strategy problem, not a technical one.
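
As a rough illustration of that classification pass, here is a minimal Python sketch. It assumes that "schema" can be approximated by the set of distinct element and attribute paths in a document; the function names and the sample documents are hypothetical, and real files would be read from disk rather than from a dictionary.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def structure_fingerprint(xml_text):
    """Hash the sorted set of distinct element/attribute paths.
    Values and repeat counts are ignored, only the shape matters."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(elem, path):
        paths.add(path)
        for name in elem.attrib:
            paths.add(f"{path}/@{name}")
        for child in elem:
            walk(child, f"{path}/{child.tag}")

    walk(root, f"/{root.tag}")
    joined = "\n".join(sorted(paths))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:12]


def group_by_structure(documents):
    """Map fingerprint -> names of the documents sharing that shape.
    Files that fail to parse are logged under 'unparseable'."""
    groups = defaultdict(list)
    for name, xml_text in documents.items():
        try:
            groups[structure_fingerprint(xml_text)].append(name)
        except ET.ParseError:
            groups["unparseable"].append(name)
    return dict(groups)


# Hypothetical stand-ins for files from different sources and years.
docs = {
    "a.xml": "<inquiry><score>700</score></inquiry>",
    "b.xml": "<inquiry><score>640</score></inquiry>",
    "c.xml": "<report version='2'><score>640</score></report>",
    "d.xml": "<broken>",
}
groups = group_by_structure(docs)
for fingerprint, names in groups.items():
    print(fingerprint, names)
```

An exact-match fingerprint like this will probably split documents that differ only by one optional element, so in practice you might compare path sets by overlap instead of by hash. It is only meant to show how quickly you can get a first count of document types.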

#2

I figured I'd list these here because, even if they don't help you, they may help someone else searching for a similar solution in the future.

We use the two macros below in SAS to extract certain attributes, elements, values, etc. from XML credit inquiries. I've provided examples which I hope will help explain how they work. I don't have time right now to go through them in detail, but I wanted to provide something that you may find useful in the meantime. If you give these to the analysts, they should be able to run the code as-is and, by working through the examples and parameters, extract some information for themselves to do preliminary investigations and maybe give you more concrete requirements.

The only conditions for the macros below are that the XML is no longer than 32767 characters and that it is held entirely in a single character variable on a single observation in SAS (i.e. not split across multiple observations).

The analysts shouldn't really need to understand how the macros work; they just need to understand how to call and use them.

/*****************************************************************************
**  PROGRAM: MACROS.PRXCOUNT.SAS
**
**  RETURNS THE NUMBER OF TIMES A SEGMENT IS FOUND IN AN XML FILE.
**  
**  PARAMETERS:
**  iElement      : The element to search through the blob for.
**  iXMLField     : The name of the field that contains the XML blob to parse.
**  iDelimiterType: (1, 2 or 3). Defaults to 1.  1 USES <> AS DELIMS. 2 USES [].
**                  3 USES THE MASKED HTML CODES FOR < AND > (AMPERSAND-LT AND AMPERSAND-GT).
**
******************************************************************************
**  HISTORY:
**  1.0 MODIFIED: 25-FEB-2011  BY:RP
**  - CREATED. 
**  1.1 MODIFIED: 14-MAR-2011  BY:RP
**  - MODIFIED TO ALLOW FOR OPTIONAL ATTRIBUTES ON THE ELEMENT BEING SEARCHED FOR.
*****************************************************************************/

%macro prxCount(iElement=, iXMLField=, iDelimiterType=1);

  %local delim_open delim_close;

  /* STRIP LINE FEED (10) AND CARRIAGE RETURN (13) CHARACTERS SO THE XML IS ON ONE LINE */
  crLf = byte(10) || byte(13);
  &iXMLField = compress(&iXMLField, crLf);

  %if &iDelimiterType eq 1 %then %do;
    %let delim_open  = <;
    %let delim_close = >;
  %end;
  %else %if &iDelimiterType eq 2 %then %do;
    %let delim_open  = \[;
    %let delim_close = \];
  %end;
  %else %if &iDelimiterType eq 3 %then %do;
    %let delim_open  = %nrbquote(&)lt%quote(%str(;)) ;
    %let delim_close = %nrbquote(&)gt%quote(%str(;)) ;
  %end;
  %else %do;
    %put ERR%str()ROR (prxCount.sas): You specified an incorrect option for the iDelimiterType parameter.;
  %end;

  prx_id = prxparse("/&delim_open&iElement(\s+.*?&delim_close|&delim_close){1}?(.*?)&delim_open\/&iElement&delim_close/i"); 

  prx_count = 0;
  prx_start = 1;
  prx_stop  = length(&iXMLField);
  call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
  do while (prx_pos > 0);
    prx_count = prx_count + 1;
    call prxposn(prx_id, 1, prx_pos, prx_length);
    call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
  end;

  drop crLf prx_:;

%mend;






/*****************************************************************************
**  PROGRAM: PRXEXTRACT.SAS
**
**  SEARCHES THROUGH AN XML (OR HTML) FILE FOR AN ELEMENT AND EXTRACTS THE 
**  VALUE BETWEEN AN ELEMENTS TAGS.
**  
**  PARAMETERS:
**  iElement      : The element to search through the blob for.
**  iField        : The fieldname to save the result to.
**  iType         : (N or C) for Numeric or Character.
**  iLength       : The length of the field to create.  
**  iXMLField     : The name of the field that contains the XML blob to parse.
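**  iDelimiterType: (1, 2 or 3). Defaults to 1.  1 USES <> AS DELIMS. 2 USES [].
**                  3 USES THE MASKED HTML CODES FOR < AND > (AMPERSAND-LT AND AMPERSAND-GT).
**  iSequence     : Which occurrence of the element to extract. Defaults to 1.
**  iAttributesField: Optional. The fieldname to save the element's attributes to.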
**
******************************************************************************
**  HISTORY:
**  1.0 MODIFIED: 14-FEB-2011  BY:RP
**  - CREATED. 
**  1.1 MODIFIED: 16-FEB-2011  BY:RP
**  - ADDED OPTION TO CHANGE DELIMITERS FROM <> TO []
**  1.1 MODIFIED: 17-FEB-2011  BY:RP
**  - CORRECTED ERROR WHEN MATCH RETURNS A LENGTH OF ZERO
**  - CORRECTED MISSING AMPERSAND FROM IDELIMITERTYPE CHECK.
**  - ADDED ESCAPING QUOTES TO [] DELIMITER TYPE
**  - CORRECTED WARNING WHEN MATCH RETURNS MISSING NUMERIC FIELD
**  1.2 MODIFIED: 25-FEB-2011  BY:RP
**  - ADDED DELIMITER TYPES TO WORK WITH MASKED HTML CODES
**  1.3 MODIFIED: 11-MAR-2011  BY:RP
**  - MODIFIED TO ALLOW FOR OPTIONAL ATTRIBUTES ON THE ELEMENT BEING SEARCHED FOR.
**  1.4 MODIFIED: 14-MAR-2011  BY:RP
**  - CORRECTED TO REMOVE FALSE MATCHES FROM PRIOR VERSION. ADDED EXAMPLE.
**  1.5 MODIFIED: 10-APR-2012  BY:RP
**  - CORRECTED PROBLEM WITH ZERO LENGTH STRING MATCHES
**  1.6 MODIFIED: 22-MAY-2012  BY:RP
**  - ADDED ABILITY TO CAPTURE ATTRIBUTES
*****************************************************************************/

%macro prxExtract(iElement=, iField=, iType=, iLength=, iXMLField=, iDelimiterType=1, iSequence=1, iAttributesField=);

  %local delim_open delim_close;

  /* STRIP LINE FEED (10) AND CARRIAGE RETURN (13) CHARACTERS SO THE XML IS ON ONE LINE */
  crLf = byte(10) || byte(13);
  &iXMLField = compress(&iXMLField, crLf);

  %if &iDelimiterType eq 1 %then %do;
    %let delim_open  = <;
    %let delim_close = >;
  %end;
  %else %if &iDelimiterType eq 2 %then %do;
    %let delim_open  = \[;
    %let delim_close = \];
  %end;
  %else %if &iDelimiterType eq 3 %then %do;
    %let delim_open  = %nrbquote(&)lt%quote(%str(;)) ;
    %let delim_close = %nrbquote(&)gt%quote(%str(;)) ;
  %end;
  %else %do;
    %put ERR%str()ROR (prxExtract.sas): You specified an incorrect option for the iDelimiterType parameter.;
  %end;

  %if %sysfunc(index(&iField,[)) %then %do;
    /* DONT DO THIS IF ITS AN ARRAY */
  %end;
  %else %do;
    %if "%upcase(&iType)" eq "N" %then %do;
      attrib &iField length=&iLength format=best.;
    %end;
    %else %do;
      attrib &iField length=$&iLength format=$&iLength..;
    %end;
  %end;

  /*
  ** BREAKDOWN OF REGULAR EXPRESSION (EXAMPLE USES < AND > AS DELIMS AND ANI AS THE ELEMENT BEING LOOKED FOR):
  **
  ** &delim_open&iElement                            -->  FINDS <ANI
  ** ((\s+.*?)&delim_close|&delim_close){1}?         -->  FINDS A SINGLE INSTANCE OF EITHER (CAPTURED INTO BUFFER 1):
  **                                                      - ONE OR MORE SPACES FOLLOWED BY THE SHORTEST RUN OF ANYTHING UNTIL A > CHARACTER
  **                                                        (THE PART BEFORE THE > IS THE ATTRIBUTES, CAPTURED INTO BUFFER 2)
  **                                                      - OR JUST A > CHARACTER
  ** (.*?)                                           -->  FINDS WHAT WE ARE SEARCHING FOR AND CAPTURES IT INTO BUFFER 3.
  ** &delim_open                                     -->  FINDS <
  ** \/                                              -->  FINDS THE / CHARACTER. THE FIRST SLASH ESCAPES IT SO IT KNOWS ITS NOT A SPECIAL REGEX SLASH
  ** &iElement&delim_close                           -->  FINDS ANI>
  */
  prx_id = prxparse("/&delim_open&iElement((\s+.*?)&delim_close|&delim_close){1}?(.*?)&delim_open\/&iElement&delim_close/i"); 

  prx_start = 1;
  prx_stop = length(&iXMLField);
  prx_sequence = 0;
  call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
  do while (prx_pos > 0);
    prx_sequence = prx_sequence + 1;
    if prx_sequence = &iSequence then do;
      if prx_length > 0 then do;

        call prxposn(prx_id, 3, prx_pos, prx_length);
        %if "%upcase(&iType)" eq "N" %then %do;
          length prx_tmp_n $200;
          prx_tmp_n = substr(&iXMLField, prx_pos, prx_length);
          if cats(prx_tmp_n) ne "" then do;
            &iField = input(substr(&iXMLField, prx_pos, prx_length), ?best.);
          end;
        %end;
        %else %do;          
          if prx_length ne 0 then do;
            &iField = substr(&iXMLField, prx_pos, prx_length);
          end;
          else do;
            &iField = "";
          end;
        %end;

        **
        ** ALSO SAVE THE ATTRIBUTES TO A FIELD IF REQUESTED
        *;
        %if "%upcase(&iAttributesField)" ne "" %then %do;
          call prxposn(prx_id, 2, prx_pos, prx_length);
          if prx_length ne 0 then do;
            &iAttributesField = substr(&iXMLField, prx_pos, prx_length);
          end;
          else do;
            &iAttributesField = "";
          end;
        %end;

      end;
    end;
    call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
  end;

  drop crLf prx:;

%mend;

Example for single element:

data example;

  xml = "<test><ANI2Digits>00</ANI2Digits><XNI xniattrib=1>7606256091</XNI><ANI>number2</ANI><ANI x=hmm y=yay>number3</ANI></test>"; * NOTE THE XML MUST BE ALL ON ONE LINE;

  %prxExtract(iElement=xni, iField=my_xni, iType=c, iLength=15, iXMLField=xml, iSequence=1, iAttributesField=my_xni_attribs);

run;

Example for repeating elements:

data example;

  xml = "<test><ANI2Digits>00</ANI2Digits><ANI>7606256091</ANI><ANI>number2</ANI><ANI x=hmm y=yay>number3</ANI></test>"; * NOTE THE XML MUST BE ALL ON ONE LINE;

  %prxExtract(iElement=ani2digits, iField=ani2digits, iType=c, iLength=50, iXMLField=xml);

  length ani1-ani6 $15;
  length attr1-attr6 $100;
  array arrani [1:6] $ ani1-ani6;
  array arrattr [1:6] $ attr1-attr6;
  %prxCount  (iElement=ani, iXMLField=xml, iDelimiterType=1);
  do cnt=1 to prx_count;
    %prxExtract(iElement=ani, iField=arrani[cnt], iType=c, iLength=15, iXMLField=xml, iSequence=cnt, iAttributesField=arrattr[cnt]);
  end;

run;
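
If it helps to see what the pattern matches without a SAS session, here is a rough Python sketch of the same regular expression idea, run against the XML string from the repeating-elements example. It is not a port of the macros: `extract_all` is a hypothetical helper, it returns every occurrence at once instead of using iSequence, and it trims the leading space from the attribute text.

```python
import re


def extract_all(xml_text, element):
    """Return (attributes, value) pairs for each <element ...>value</element>.
    Mirrors the macro regex: the opening tag is followed either by
    whitespace plus attribute text up to '>', or by '>' straight away."""
    name = re.escape(element)
    pattern = re.compile(
        rf"<{name}(?:(\s+.*?)>|>)(.*?)</{name}>",
        re.IGNORECASE,  # same effect as the /i flag in the SAS pattern
    )
    return [
        ((m.group(1) or "").strip(), m.group(2))
        for m in pattern.finditer(xml_text)
    ]


xml = (
    "<test><ANI2Digits>00</ANI2Digits><ANI>7606256091</ANI>"
    "<ANI>number2</ANI><ANI x=hmm y=yay>number3</ANI></test>"
)

ani_matches = extract_all(xml, "ani")
digits_matches = extract_all(xml, "ani2digits")
print(ani_matches)
print(digits_matches)
```

Searching for ani does not pick up ANI2Digits, because the element name must be followed by whitespace or the closing delimiter. Like the macros, this treats XML as plain text, so nested elements with the same name and self-closing tags are not handled.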
