I want to read an XML file that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<province name="北京市" id="11">
<city name="市辖区" id="110100000000">
<county name="东城区" id="110101000000">
<town name="珍珠泉乡" id="110229214000">
<village name="珍珠泉乡社区居委会" id="110229214001" type="220"/>
<village name="珍珠泉村委会" id="110229214200" type="210"/>
<village name="称沟湾村委会" id="110229214201" type="220"/>
<village name="庙梁村委会" id="110229214202" type="220"/>
<village name="下水沟村委会" id="110229214203" type="220"/>
<village name="上水沟村委会" id="110229214204" type="220"/>
<village name="下花楼村委会" id="110229214205" type="220"/>
<village name="八亩地村委会" id="110229214206" type="220"/>
<village name="转山子村委会" id="110229214207" type="220"/>
<village name="水泉子村委会" id="110229214208" type="220"/>
<village name="双金草村委会" id="110229214209" type="220"/>
<village name="小川村委会" id="110229214210" type="220"/>
<village name="小铺村委会" id="110229214211" type="220"/>
<village name="仓米道村委会" id="110229214212" type="220"/>
<village name="南天门村委会" id="110229214213" type="220"/>
<village name="桃条沟村委会" id="110229214214" type="220"/>
</town>
</county>
</city>
</province>
I set the system locale to Simplified Chinese with Sys.setlocale("LC_ALL", locale="Chinese (Simplified)") and read the document with the XML package, specifying UTF-8 encoding: doc = xmlParse(files[i], encoding = "UTF-8", useInternalNodes = TRUE). But when I print doc, the Chinese characters are not displayed properly:
<village id="110229214001" type="220" name="鐝嶇彔娉変埂绀惧尯灞呭浼?/>
<village id="110229214200" type="210" name="鐝嶇彔娉夋潙濮斾細"/>
<village id="110229214201" type="220" name="绉版矡婀炬潙濮斾細"/>
<village id="110229214202" type="220" name="搴欐鏉戝浼?/>
<village id="110229214203" type="220" name="涓嬫按娌熸潙濮斾細"/>
<village id="110229214204" type="220" name="涓婃按娌熸潙濮斾細"/>
<village id="110229214205" type="220" name="涓嬭姳妤兼潙濮斾細"/>
<village id="110229214206" type="220" name="鍏憨鍦版潙濮斾細"/>
<village id="110229214207" type="220" name="杞北瀛愭潙濮斾細"/>
<village id="110229214208" type="220" name="姘存硥瀛愭潙濮斾細"/>
<village id="110229214209" type="220" name="鍙岄噾鑽夋潙濮斾細"/>
<village id="110229214210" type="220" name="灏忓窛鏉戝浼?/>
<village id="110229214211" type="220" name="灏忛摵鏉戝浼?/>
<village id="110229214212" type="220" name="浠撶背閬撴潙濮斾細"/>
<village id="110229214213" type="220" name="鍗楀ぉ闂ㄦ潙濮斾細"/>
<village id="110229214214" type="220" name="妗冩潯娌熸潙濮斾細"/>
I also tried setting the system locale to English_United States.1252, but the problem remains. Strangely, some functions applied to doc, for example xmlRoot(doc) or getNodeSet(doc,"//village")[1], display the Chinese characters correctly, while others do not: xmlAttrs(getNodeSet(doc,"//village")[[1]]) shows the garbled text.
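One way to reason about the symptom, as a minimal sketch in base R only (it does not use the XML package, and the variables x and y are made up for illustration): a string's bytes and its declared encoding mark are separate things. If UTF-8 bytes carry no mark, R treats them as being in the native encoding, which on a Chinese (Simplified) Windows locale is typically CP936/GBK, and that would plausibly produce this kind of garbled display. Whether this is what happens inside xmlAttrs() is an assumption here, not something the output above proves.

```r
# Build the text "东城区" from \u escapes so the example does not depend on
# how this source file itself is encoded.
x <- "\u4e1c\u57ce\u533a"
print(Encoding(x))    # strings built from \u escapes are marked "UTF-8"
print(charToRaw(x))   # the underlying UTF-8 bytes

# Remove the mark: the bytes are untouched, but R now assumes they are
# in the native encoding of the current locale.
y <- x
Encoding(y) <- "unknown"
print(identical(charToRaw(x), charToRaw(y)))   # TRUE: same bytes, different mark

# Declaring the encoding again restores correct interpretation
# without converting any bytes.
Encoding(y) <- "UTF-8"
print(Encoding(y))
```

Comparing Encoding() and charToRaw() on a value returned by xmlAttrs() in the same way may help tell a display problem apart from a parsing problem.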
2 solutions
#1
0
Try LINQ to XML:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.IO;

namespace ConsoleApplication49
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";

        static void Main(string[] args)
        {
            // StreamReader decodes the file as UTF-8 by default.
            StreamReader reader = new StreamReader(FILENAME);
            // Read and discard the first line (the XML declaration).
            var z = reader.ReadLine();
            XDocument doc = XDocument.Load(reader);

            // Collect the attributes of every <village> element.
            var results = doc.Descendants("village").Select(x => new
            {
                name = (string)x.Attribute("name"),
                id = (long)x.Attribute("id"),
                type = (int)x.Attribute("type")
            }).ToList();
        }
    }
}
#2
0
This seems to be an encoding problem. My aim is to extract the village information from the XML file. After extracting it, I checked the encoding of the village name column, and it was reported as "unknown". So I added one command to mark that column as "UTF-8", and it works. My code is shown below.
But I still don't know why the encoding is unknown. I already specified encoding="UTF-8" at the very beginning, when I read the XML file using xmlParse(). Does anyone know why? Did I make a mistake when reading the XML file?
> village = as.data.frame(t(xmlSApply(doc["/province/city/county/town/village"],xmlAttrs)),stringsAsFactors=FALSE)
> View(village)
> Encoding(village[1,"name"])
[1] "unknown"
> Encoding(village[,"name"])="UTF-8" #added this line and the display is fine now
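The same fix can be applied to every character column at once. The following is a minimal sketch in base R; mark_utf8_columns is a hypothetical helper (not part of the XML package), and the small data frame is a stand-in for the extracted attributes, built by hand so the example runs without an XML file. It assumes the columns really do hold UTF-8 bytes, since Encoding<- only declares an encoding and does not convert anything.

```r
# Hypothetical helper: declare every character column of a data frame as UTF-8.
mark_utf8_columns <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    Encoding(col) <- "UTF-8"   # sets the mark only; bytes are not changed
    col
  })
  df
}

# Stand-in for the extracted attributes: two village names from the sample
# file, written as \u escapes, with the UTF-8 mark deliberately removed.
name <- c("\u73cd\u73e0\u6cc9\u6751\u59d4\u4f1a", "\u5e99\u6881\u6751\u59d4\u4f1a")
Encoding(name) <- "unknown"

village <- data.frame(
  name = name,
  id   = c("110229214200", "110229214202"),
  type = c("210", "220"),
  stringsAsFactors = FALSE
)

print(Encoding(village$name))   # "unknown" for both names
village <- mark_utf8_columns(village)
print(Encoding(village$name))   # "UTF-8" for both names
print(Encoding(village$id))     # pure ASCII strings stay "unknown"
```

Pure ASCII values such as the id and type columns keep reporting "unknown" after marking; that is how Encoding() treats ASCII strings and does not indicate a problem.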