fgetcsv is eating the first letter of a string if it is an umlaut

Time: 2023-01-20 17:17:23

I am importing contents from an Excel-generated CSV-file into an XML document like:

$csv = fopen($csvfile, 'r');
$words = array();

while (($pair = fgetcsv($csv)) !== FALSE) {
    array_push($words, array('en' => $pair[0], 'de' => $pair[1]));
}

The inserted data are English/German expressions.

I insert these values into an XML structure and output the XML as following:

$dictionary = new SimpleXMLElement('<dictionary></dictionary>');
//do things
$dom = dom_import_simplexml($dictionary) -> ownerDocument;
$dom -> formatOutput = true;

header('Content-Type: text/xml; charset=utf-8'); // MIME type and UTF-8 charset belong in Content-Type; Content-Encoding is for compression

echo $dom -> saveXML();

This works fine, but I am running into one really strange problem. When the first letter of a string is an umlaut (as in Österreich or Ägypten), the character is omitted, resulting in sterreich or gypten. If the umlaut is in the middle of the string (Russische Föderation), it is transferred correctly. The same goes for characters like ß or é.

All files are UTF-8 encoded and served in UTF-8.

This seems rather strange and bug-like to me, but maybe I am missing something; there are a lot of smart people around here.

5 solutions

#1


4  

Ok, so this seems to be a bug in fgetcsv.

I am now processing the CSV data on my own (a little cumbersome), but it is working and I do not have any encoding issues at all.

This is (a not-yet-optimized version of) what I am doing:

$rawCSV = file_get_contents($csvfile);

$lines = preg_split ('/$\R?^/m', $rawCSV); //split on line breaks in all operating systems: http://*.com/a/7498886/797194

foreach ($lines as $line) {
    array_push($words, getCSVValues($line));
}

The getCSVValues function comes from here and is needed to deal with CSV lines like this one (commas!):

"I'm a string, what should I do when I need commas?",Howdy there

It looks like:

function getCSVValues($string, $separator=","){

    $elements = explode($separator, $string);

    for ($i = 0; $i < count($elements); $i++) {
        $nquotes = substr_count($elements[$i], '"');
        if ($nquotes %2 == 1) {
            for ($j = $i+1; $j < count($elements); $j++) {
                if (substr_count($elements[$j], '"') %2 == 1) { // Look for an odd-number of quotes
                    // Put the quoted string's pieces back together again
                    array_splice($elements, $i, $j-$i+1,
                        implode($separator, array_slice($elements, $i, $j-$i+1)));
                    break;
                }
            }
        }
        if ($nquotes > 0) {
            // Remove first and last quotes, then merge pairs of quotes
            $qstr =& $elements[$i];
            $qstr = substr_replace($qstr, '', strpos($qstr, '"'), 1);
            $qstr = substr_replace($qstr, '', strrpos($qstr, '"'), 1);
            $qstr = str_replace('""', '"', $qstr);
        }
    }
    return $elements;

}

Quite a bit of a workaround, but it seems to work fine.

EDIT:

There is also a filed bug for this; apparently the behavior depends on the locale settings.
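
Since the behavior reportedly depends on the locale, one thing worth trying is to switch LC_CTYPE to a UTF-8 locale while reading a UTF-8 file. The following is only a sketch: the function name readCsvWithUtf8Locale and the list of candidate locale names are assumptions, and which locales are actually installed differs from system to system.

```php
<?php
// Hypothetical helper (not from the original answer): read a UTF-8 CSV file
// while LC_CTYPE is temporarily set to a UTF-8 locale.
// The candidate locale names are assumptions; they must be installed on the host.
function readCsvWithUtf8Locale(
    string $path,
    array $candidates = ['en_US.UTF-8', 'C.UTF-8', 'de_DE.UTF-8']
): array {
    // Passing '0' only queries the current setting, it does not change it.
    $previous = setlocale(LC_CTYPE, '0');
    // The first installed candidate is applied; the call returns false if none is available.
    setlocale(LC_CTYPE, $candidates);

    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $rows = [];
    while (($fields = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        if ($fields === [null]) {
            continue; // fgetcsv reports a blank line as an array holding a single null
        }
        $rows[] = $fields;
    }
    fclose($handle);

    // Restore whatever locale was active before.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }
    return $rows;
}
```

If none of the candidate locales exists on the machine, setlocale() returns false and the file is read with the unchanged locale, so it is worth checking that return value while debugging.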

#2


3  

If the string comes from Excel (I had problems with the letter ø disappearing if it was in the beginning of the string) ... then this fixed it:

setlocale(LC_ALL, 'en_US.ISO-8859-1');

#3


2  

If other umlauts in the middle appear ok, then this is not a base encoding issue. The fact that it happens at the beginning of the line probably indicates some incompatibility with the newline mark. Perhaps the CSV was generated with a different newline encoding.

This happens when moving files between different operating systems:

  • Windows: \r\n (characters 13 and 10)
  • Linux: \n (character 10)
  • Classic Mac OS (before OS X): \r (character 13)

If I were you, I would verify the newline mark to be sure.

On Linux, run hexdump -C filename | more and inspect the output.

You can change the newline marks with a sed expression if that's the case.
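
The same check and conversion can also be done from PHP before the data is parsed. This is a minimal sketch; countLineEndings and normalizeLineEndings are hypothetical helper names, not something from this answer.

```php
<?php
// Hypothetical helpers, not part of the original answer.

// Count how often each line-ending style occurs in a raw string.
function countLineEndings(string $raw): array
{
    $crlf = substr_count($raw, "\r\n");
    return [
        'CRLF' => $crlf,
        'LF'   => substr_count($raw, "\n") - $crlf, // bare \n only
        'CR'   => substr_count($raw, "\r") - $crlf, // bare \r only
    ];
}

// Convert every line ending to \n. The array is processed in order,
// so \r\n pairs are replaced first and only lone \r characters remain for the second step.
function normalizeLineEndings(string $raw): string
{
    return str_replace(["\r\n", "\r"], "\n", $raw);
}
```

A file that reports mostly CR or CRLF endings can be passed through normalizeLineEndings() and then split with explode("\n", ...) before further processing.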

Hope that helped!

#4


2  

A bit simpler workaround (but pretty dirty):

//1. replace delimiter in input string with delimiter + some constant
$dataLine = str_replace($this->fieldDelimiter, $this->fieldDelimiter . $this->bugFixer, $dataLine);

//2. parse
$parsedLine = str_getcsv($dataLine, $this->fieldDelimiter);

//3. remove the constant from resulting strings.
foreach ($parsedLine as $i => $parsedField)
{
    $parsedLine[$i] = str_replace($this->bugFixer, '', $parsedField);
}

#5


0  

Could be some sort of utf8_encode() problem. This comment on the documentation page seems to indicate that encoding an umlaut when it is already encoded could cause issues.

Maybe test to see if the data is already utf-8 encoded with mb_detect_encoding().
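
A guard against double encoding could look roughly like the sketch below. It uses mb_check_encoding() rather than mb_detect_encoding(), because a plain yes/no validity test is enough here, and it requires the mbstring extension. The function name ensureUtf8 and the ISO-8859-1 fallback are assumptions made for illustration.

```php
<?php
// Hypothetical helper: return $value as UTF-8 and convert only when it is
// not already valid UTF-8. Treating every other input as ISO-8859-1 is an
// assumption; pass a different $fallbackEncoding if the source differs.
function ensureUtf8(string $value, string $fallbackEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, converting again would garble it
    }
    return mb_convert_encoding($value, 'UTF-8', $fallbackEncoding);
}
```

Running each parsed field through such a function leaves UTF-8 input untouched, so an umlaut that is already encoded is not encoded a second time.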
