Parsing an XML Spreadsheet workbook into JSON with PHP

Time: 2022-09-24 23:33:03

I, um, seem to have gotten lost.

I believe my problem is in parsing a PHP DOMDocument class correctly.

I have an XML spreadsheet coming from Excel which has headers for different columns. (It also has multiple worksheets, to help the end user in organizing the data.)

My end goal is markers on a map utilizing JavaScript.

Here is a simplified example of the XML file. Note: some of the data is strings, some is numeric, and some is HTML.

<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook>
 <Worksheet ss:Name="data">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">lat</Data></Cell>
    <Cell><Data ss:Type="String">lng</Data></Cell>
    <Cell><Data ss:Type="String">boolean_1</Data></Cell>
    <Cell><Data ss:Type="String">boolean_2</Data></Cell>
    <Cell><Data ss:Type="String">Source_documents</Data></Cell>
    <Cell><Data ss:Type="String">description</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">35.032139998</Data></Cell>
    <Cell><Data ss:Type="Number">-117.346952</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
    <Cell><Data ss:Type="Number">0</Data></Cell>
    <Cell><ss:Data ss:Type="String" xmlns="http://www.w3.org/TR/REC-html40"><Font html:Color="#000000">Copy here inside HTML </Font><I><Font html:Color="#000000">with more copy</Font></I></ss:Data></Cell>
    <Cell><Data ss:Type="String">Copy here without HTML</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">43.444</Data></Cell>
    <Cell><Data ss:Type="Number">-112.005</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
    <Cell><Data ss:Type="String">Diff Marker Src</Data></Cell>
    <Cell><Data ss:Type="String">Diff Marker Desc</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
 <Worksheet ss:Name="tags">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">tag_label</Data></Cell>
    <Cell><Data ss:Type="String">tag_category</Data></Cell>
    <Cell><Data ss:Type="String">tag_description</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="String">boolean_1</Data></Cell>
    <Cell><Data ss:Type="String">tag_cat_A</Data></Cell>
    <Cell><Data ss:Type="String">bool_1 desc</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="String">boolean_2</Data></Cell>
    <Cell><Data ss:Type="String">tag_cat_B</Data></Cell>
    <Cell><Data ss:Type="String">bool_2 desc</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>

I've been assuming that I need to convert the spreadsheet into either a JSON array, or a better-structured XML doc, that I can parse to create markers for a map. (JSON seems preferable to reduce data being transferred)

If that assumption is correct, I'd like to have a structure which looks kinda like this:

array => {
  data => {
    [0] => {
        lat => '35.032139998',
        lng => '-117.346952',
        booleans => {
            boolean_1 => true
        },
        Source_documents => '<Font html:Color="#000000">Copy here inside HTML </Font><I><Font html:Color="#000000">with more copy</Font></I>',
        'description' => 'Copy here without HTML'
    },
    [1] => {
        lat => '43.444',
        lng => '-112.005',
        booleans => {
            boolean_1 => true,
            boolean_2 => true
        },
        Source_documents => 'Diff Marker Src',
        'description' => 'Diff Marker Desc'
    }
  },
  tags => {
    'boolean_1' => {
        tag_category => 'tag_cat_A',
        'tag_description' => 'bool_1 desc'
    },
    'boolean_2' => {
        tag_category => 'tag_cat_B',
        'tag_description' => 'bool_2 desc'
    }
  }
}

I'm working in PHP, and attempting to transform the XML into JSON utilizing the DOMDocument class. SimpleXML worked fine for me until a new Excel doc was loaded which included the occasional HTML.

I have this PHP code so far:

function get_worksheet_table($file, $worksheet_name) {
  $dom = new DOMDocument;
  $dom->load($file);

  // returns a new instance of class DOMNodeList
  $worksheets = $dom->getElementsByTagName( 'Worksheet' );

  foreach($worksheets as $worksheet) {

    // check if right sheet
    if( $worksheet->getAttribute('ss:Name') == $worksheet_name) { 

      // trying to get entire node, or childNodeList, or ... ?
      // About here I am getting lost.
      $nodes = $worksheet->getElementsByTagName('Table')->item(0); 

      $table = new DOMDocument;
      $table->preserveWhiteSpace = false;
      $table->formatOutput = true;
      $table->createElement('Table');

      /*
         ITERATE THROUGH $nodes, ADD EACH CELL NODE'S CONTENTS 
         TO $table -- UNLESS IT HAS HTML, THEN USE DOMinnerHTML(node) 
         (DOMinnerHTML function @ http://php.net/manual/en/book.dom.php#89718)
       */

      return $table;
    }
  }
  return false;
}

$data = get_worksheet_table($file, 'data');
$tags = get_worksheet_table($file, 'tags');

From there, I'm trying to create associative arrays from $data and $tags, then output a big JSON statement to pass to my application.

But it is really a mess and, like I said, I'm lost.

Questions:

  1. Does this look like I'm at least on the right track?
  2. How do I access the nodes properly? I seem to be getting all subnodes as one big text value.
  3. How do I iterate through the DOM to read each cell's text content where appropriate, while getting any children of the <Data> nodes as a string rather than as child nodes?

Any pointers you might have toward better understanding how to parse the DOMDocument class would be appreciated. I keep reading through the documentation, but it's eluding me.

Thank you so much for your time.

1 solution

#1


After considerably more research, I found a way to achieve what I want. I am not going to claim that this is the best possible method, by a long shot.

However, I was able to:

  1. parse an XML Spreadsheet, generated from Excel, into an array structured as I wanted;
  2. output that as JSON; and
  3. maintain any text styling as HTML within the generated output.

To be fair, I have not pushed the limits of the HTML—for example, we're really only messing with <b> and <i> tags. Font tags were coming in as well, and I decided to strip them.

I would not be surprised if there are cleaner, more elegant ways to do this (I'm pretty much getting out of an object and into an array as soon as possible), and I should also note that in my case I'm dealing with a relatively small data load. YMMV for larger projects, but if you have read this far, then I hope this helps.

Here, then, is my function to generate an array of data from an XML Worksheet table:

/* array_from_worksheet_table()
 * Generate an array from an XML Worksheet
 * $file needs to be the full path to your file (e.g., '/Users/jeremy/www/cms/files/yourfile.xml')
 * $worksheet_name = the name of the worksheet tab
 */
function array_from_worksheet_table($file, $worksheet_name) {

  // https://*.com/questions/7082401/avoid-domdocument-xml-warnings-in-php
  $previous_errors = libxml_use_internal_errors(true);

  $dom = new DOMDocument;
  if( !$dom->load($file) ) {
    foreach (libxml_get_errors() as $error) {
      // print_r($error);
    }
  }

  libxml_clear_errors();
  libxml_use_internal_errors($previous_errors);


  // returns a new instance of class DOMNodeList
  $worksheets = $dom->getElementsByTagName( 'Worksheet' );

  foreach($worksheets as $worksheet) {
    if( $worksheet->getAttribute('ss:Name') == $worksheet_name) {

      // When we get a DOMNodeList, if we want to access the first item, we have to
      // then use ->item(0). Important once we want to access a deeper-level DOMNodeList
      $rows = $worksheet->getElementsByTagName('Table')->item(0)->getElementsByTagName('Row');

      $table = array();

      // Get our headings.
      // This assumes that the first row HAS our headings!
      $headings = $rows->item(0)->getElementsByTagName('Cell');

      // loop through table rows. Setting $i=1 instead of 0 means we skip the first row
      for( $i = 1; $i < $rows->length; $i++ ) {

        // this is our row of data
        $cells = $rows->item($i)->getElementsByTagName('Cell'); 

        // loop through each cell
        for( $c = 0; $c < $cells->length; $c++ ) {

          // check for data element in cell
          $celldata = $cells->item($c)->getElementsByTagName('Data');

          // If the cell has data, proceed
          if( $celldata->length ) {

            // Get HTML content of any strings
            if( $celldata->item(0)->getAttribute('ss:Type')== 'String' ) {

              // The DOMinnerHTML() call below does not work for PHP < 5.3.6.
              // If you HAVE PHP 5.3.6 or later, use the function @ https://*.com/questions/2087103/
              // $value = xml_to_json::DOMinnerHTML( $celldata->item(0) );

              // DOMNode::C14N canonicalizes nodes into strings
              // This workaround is required for PHP < 5.3.6
              $value = $celldata->item(0)->C14N();

              // hack. remove tags like <ss:Data foo...> and </Data>
              // Necessary because C14N leaves outer tags (saveHTML did not)
              $value = preg_replace('/<([s\/:]+)?Data([^>]+)?>/i', '', $value);

              // Remove font tags from HTML. Bleah.
              $value = preg_replace('/<\/?font([^>]+)?>/i', '', $value);
            } else {
              $value = $cells->item($c)->nodeValue;
            }

            // grab label from first row
            $label = $headings->item($c)->nodeValue;

            $table[$i][$label] = $value;
          }
        }
      }
      return $table;
    }
  }
  return false;
}

This returned an array for a worksheet table, which I was then able to further manipulate.
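
For anyone who would rather avoid the C14N-plus-regex hack, here is a minimal sketch of another way to get the markup inside a <Data> node: serialize each child node with saveXML() and concatenate the results. The helper name inner_xml() and the tiny sample cell are made up for illustration and are not part of my original code; I have only tried this idea on small inputs like the one shown.

```php
<?php
// Hypothetical helper (not from the original function): returns the markup
// inside a node by serializing each of its child nodes with saveXML().
function inner_xml(DOMNode $node): string {
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

// Tiny stand-in for one cell. The xmlns:html declaration is only here so
// that the sample is well-formed on its own.
$xml = '<Cell xmlns:html="http://www.w3.org/TR/REC-html40">'
     . '<Data><Font html:Color="#000000">Copy </Font><I>with more copy</I></Data>'
     . '</Cell>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$data = $dom->getElementsByTagName('Data')->item(0);

// Same font-stripping regex as in array_from_worksheet_table().
$value = preg_replace('/<\/?font([^>]+)?>/i', '', inner_xml($data));
echo $value, "\n";
```

Because only the children are serialized, there are no outer <Data> tags to strip afterwards, so the first preg_replace in the function would not be needed with this approach.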

One task was re-organizing the resulting array so that my boolean values were all in a sub-array. First I removed all zero values, using remove_element_by_value($data, '0') (Found that function @ https://*.com/a/4466181/156645)
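
Since remove_element_by_value() is not shown here, this is a rough stand-in built on array_filter(). The name drop_value() is hypothetical, and the sketch assumes the function is applied to one row at a time and that cell values are strings, as they are when read from nodeValue.

```php
<?php
// Hypothetical stand-in for remove_element_by_value(): drops every entry
// whose value is strictly equal to $unwanted and keeps the original keys.
function drop_value(array $row, $unwanted): array {
    return array_filter($row, function ($v) use ($unwanted) {
        return $v !== $unwanted;
    });
}

// One row shaped like the output of array_from_worksheet_table().
$row = array(
    'lat'       => '35.032139998',
    'lng'       => '-117.346952',
    'boolean_1' => '1',
    'boolean_2' => '0',
);

$row = drop_value($row, '0');
print_r($row);
```

The strict comparison matters: a loose comparison with '0' would also throw away empty strings or other falsy-looking values on some PHP versions.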

Then I compared array keys to the values found in my tags array, and appended them to each subarray, something like this ($long_codes was my simple array of the tag values):

if($data_array) {
  foreach($data_array as $key => $array) {
    foreach($array as $k => $val) {
      if( in_array($k, $long_codes)) {
        $data_array[$key]['Classify'][] = $k;
        unset($data_array[$key][$k]);
      }
    }
  }
}
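
The loop above relies on $long_codes, which I did not show being built. One way it could be derived from the tags worksheet is sketched below; the $tags_rows sample is assumed to have the shape returned by array_from_worksheet_table($file, 'tags'), and array_column() needs PHP 5.5 or later.

```php
<?php
// Sample rows, shaped like array_from_worksheet_table($file, 'tags') output.
$tags_rows = array(
    1 => array('tag_label' => 'boolean_1', 'tag_category' => 'tag_cat_A', 'tag_description' => 'bool_1 desc'),
    2 => array('tag_label' => 'boolean_2', 'tag_category' => 'tag_cat_B', 'tag_description' => 'bool_2 desc'),
);

// Flat list of labels, usable as $long_codes in the loop above.
$long_codes = array_column($tags_rows, 'tag_label');

// Tags keyed by label, with the label itself removed from each entry,
// which matches the 'tags' structure sketched in the question.
$tags_array = array();
foreach ($tags_rows as $row) {
    $label = $row['tag_label'];
    unset($row['tag_label']);
    $tags_array[$label] = $row;
}

print_r($long_codes);
print_r($tags_array);
```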

Output was just echo json_encode($the_big_array), where the big array was just array('data' => $data_array, 'tags' => $tags_array).
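
A slightly fuller sketch of that output step follows, with sample data standing in for the real arrays. One thing to watch: array_from_worksheet_table() keys its rows starting at 1, so json_encode() would emit "data" as a JSON object rather than a list unless the rows are re-indexed. The array_values() call, the JSON_UNESCAPED_SLASHES flag and the error fallback are my additions here, not part of the original one-liner.

```php
<?php
// Sample data standing in for the real arrays.
$data_array = array(
    1 => array('lat' => '43.444', 'lng' => '-112.005', 'Classify' => array('boolean_1', 'boolean_2')),
);
$tags_array = array('boolean_1' => array('tag_category' => 'tag_cat_A'));

// array_values() re-indexes from 0 so "data" is encoded as a JSON array
// instead of an object with keys "1", "2", ...
$the_big_array = array('data' => array_values($data_array), 'tags' => $tags_array);

// JSON_UNESCAPED_SLASHES keeps closing tags such as </I> readable.
$json = json_encode($the_big_array, JSON_UNESCAPED_SLASHES);
if ($json === false) {
    // json_last_error_msg() describes why encoding failed (e.g. invalid UTF-8).
    $json = json_encode(array('error' => json_last_error_msg()));
}

if (!headers_sent()) {
    header('Content-Type: application/json');
}
echo $json;
```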

Hope that helps somebody else!
