I am importing the contents of an Excel-generated CSV file into an XML document like this:
$csv = fopen($csvfile, 'r');
$words = array();
while (($pair = fgetcsv($csv)) !== FALSE) {
    array_push($words, array('en' => $pair[0], 'de' => $pair[1]));
}
The inserted data are English/German expressions.
I insert these values into an XML structure and output the XML as follows:
$dictionary = new SimpleXMLElement('<dictionary></dictionary>');
// do things
$dom = dom_import_simplexml($dictionary)->ownerDocument;
$dom->formatOutput = true;
// The charset belongs in Content-Type; Content-Encoding is meant for compression (e.g. gzip).
header('Content-Type: text/xml; charset=utf-8');
echo $dom->saveXML();
This is working fine, yet I am encountering one really strange problem. When the first letter of a string is an umlaut (as in Österreich or Ägypten), the character is omitted, resulting in gypten or sterreich. If the umlaut is in the middle of the string (Russische Föderation), it is transferred correctly. The same goes for characters such as ß or é.
All files are UTF-8 encoded and served in UTF-8.
This seems rather strange and bug-like to me, yet maybe I am missing something; there are a lot of smart people around here.
5 solutions
#1
4
Ok, so this seems to be a bug in fgetcsv.
I am now processing the CSV data on my own (a little cumbersome), but it is working and I do not have any encoding issues at all.
This is (a not-yet-optimized version of) what I am doing:
$rawCSV = file_get_contents($csvfile);
// Split on line breaks from any operating system: http://*.com/a/7498886/797194
$lines = preg_split('/$\R?^/m', $rawCSV);
$words = array();
foreach ($lines as $line) {
    array_push($words, getCSVValues($line));
}
The getCSVValues function comes from here and is needed to deal with CSV lines like this (commas!):
"I'm a string, what should I do when I need commas?",Howdy there
It looks like:
function getCSVValues($string, $separator = ",") {
    $elements = explode($separator, $string);
    for ($i = 0; $i < count($elements); $i++) {
        $nquotes = substr_count($elements[$i], '"');
        if ($nquotes % 2 == 1) {
            for ($j = $i + 1; $j < count($elements); $j++) {
                if (substr_count($elements[$j], '"') % 2 == 1) { // Look for an odd number of quotes
                    // Put the quoted string's pieces back together again
                    array_splice($elements, $i, $j - $i + 1,
                        implode($separator, array_slice($elements, $i, $j - $i + 1)));
                    break;
                }
            }
        }
        if ($nquotes > 0) {
            // Remove first and last quotes, then merge pairs of quotes
            $qstr =& $elements[$i];
            $qstr = substr_replace($qstr, '', strpos($qstr, '"'), 1);
            $qstr = substr_replace($qstr, '', strrpos($qstr, '"'), 1);
            $qstr = str_replace('""', '"', $qstr);
        }
    }
    return $elements;
}
Quite a bit of a workaround, but it seems to work fine.
EDIT:
There's also a filed bug for this; apparently the behavior depends on the locale settings.
#2
3
If the string comes from Excel (I had the letter ø disappear whenever it was at the beginning of a string), then this fixed it:
setlocale(LC_ALL, 'en_US.ISO-8859-1');
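The PHP manual notes that fgetcsv() takes the locale settings into account, so the locale presumably needs to match the file's real encoding. Below is a minimal sketch of that idea, assuming the CSV really is UTF-8 as the question states (for a file in a single-byte encoding, the ISO-8859-1 locale would be the counterpart). The helper name readCsvRows and the list of candidate locale names are illustrative assumptions; which locales are installed varies from system to system.

```php
<?php
/**
 * Read all rows of a CSV file after selecting a locale for character handling.
 * readCsvRows() is a hypothetical helper, not part of PHP.
 */
function readCsvRows(string $path, array $locales = ['en_US.UTF-8', 'C.UTF-8']): array
{
    // setlocale() tries each name in turn and returns false if none is installed;
    // in that case the previous locale simply stays active.
    setlocale(LC_CTYPE, $locales);

    $rows = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $rows;
    }
    // Length 0 means "no line length limit"; delimiter, enclosure and escape are spelled out.
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);
    return $rows;
}
```

Only LC_CTYPE is changed here rather than LC_ALL, so number and date formatting elsewhere in the script is left alone.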
#3
2
If other umlauts in the middle appear ok, then this is not a base encoding issue. The fact that it happens at the beginning of the line probably indicates some incompatibility with the newline mark. Perhaps the CSV was generated with a different newline encoding.
This happens when moving files between different operating systems:
- Windows: \r\n (characters 13 and 10)
- Linux: \n (character 10)
- Classic Mac OS: \r (character 13)
If I were you, I would verify the newline mark to be sure.
On Linux, run hexdump -C filename | more and inspect the document.
You can change the newline marks with a sed expression if that's the case.
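The same check and conversion can also be done from within PHP. This is a minimal sketch; the function names detectNewlines and normalizeNewlines are made up for illustration.

```php
<?php
/** Count how often each newline convention occurs in a string. */
function detectNewlines(string $text): array
{
    $crlf = substr_count($text, "\r\n");
    return [
        'crlf' => $crlf,                             // Windows: \r\n
        'lf'   => substr_count($text, "\n") - $crlf, // Linux: bare \n
        'cr'   => substr_count($text, "\r") - $crlf, // classic Mac OS: bare \r
    ];
}

/** Convert every newline convention to a single \n. */
function normalizeNewlines(string $text): string
{
    // Replace \r\n first so that its \r is not converted a second time.
    return str_replace(["\r\n", "\r"], "\n", $text);
}
```

Running normalizeNewlines() on the file contents before parsing would rule the newline theory in or out.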
Hope that helped!
#4
2
A somewhat simpler (but pretty dirty) workaround:
// 1. Replace the delimiter in the input string with delimiter + some constant
$dataLine = str_replace($this->fieldDelimiter, $this->fieldDelimiter . $this->bugFixer, $dataLine);
// 2. Parse
$parsedLine = str_getcsv($dataLine, $this->fieldDelimiter);
// 3. Remove the constant from the resulting strings
foreach ($parsedLine as $i => $parsedField)
{
    $parsedLine[$i] = str_replace($this->bugFixer, '', $parsedField);
}
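The snippet above is a fragment of a class. A self-contained sketch of the same trick follows, assuming a marker that never occurs in the data. The function name parseCsvLineWithMarker and the default marker "\x01" are illustrative choices, not taken from the answer. Unlike the fragment, this version also puts the marker in front of the first field.

```php
<?php
/**
 * Parse one CSV line while shielding the first character of every field.
 * parseCsvLineWithMarker() and the "\x01" marker are hypothetical.
 */
function parseCsvLineWithMarker(string $line, string $delimiter = ',', string $marker = "\x01"): array
{
    // 1. Prefix every field with the marker so that no field begins with a multibyte character.
    $guarded = $marker . str_replace($delimiter, $delimiter . $marker, $line);

    // 2. Parse the guarded line (enclosure and escape are passed explicitly).
    $fields = str_getcsv($guarded, $delimiter, '"', '\\');

    // 3. Remove the marker from each parsed field.
    return array_map(
        static function ($field) use ($marker) {
            return str_replace($marker, '', (string) $field);
        },
        $fields
    );
}
```

One caveat: once a marker sits in front of a quoted field, the field no longer starts with its quote character, so it will most likely not be treated as quoted. The sketch is therefore only suitable for lines without quoted fields.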
#5
0
Could be some sort of utf8_encode() problem. This comment on the documentation page seems to indicate that encoding an umlaut that is already encoded can cause issues.
Maybe test whether the data is already UTF-8 encoded with mb_detect_encoding().
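One way to run such a test is sketched below. It uses mb_check_encoding(), which validates a string against a named encoding, rather than mb_detect_encoding(), which only guesses. Because the mbstring extension may not be installed, the sketch falls back to a PCRE match with the /u modifier, which fails on malformed UTF-8. The function name isValidUtf8 is hypothetical.

```php
<?php
/** Return true if $data is well-formed UTF-8. isValidUtf8() is a hypothetical helper. */
function isValidUtf8(string $data): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($data, 'UTF-8');
    }
    // Fallback: with the /u modifier, preg_match() returns false for malformed UTF-8 subjects.
    return preg_match('//u', $data) === 1;
}
```

If isValidUtf8(file_get_contents($csvfile)) returns false, the file is not actually UTF-8 and would need converting before it is parsed.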