I'm trying to get this table from Wikipedia. The page source claims it's UTF-8:
> <!DOCTYPE html> <html lang="en" dir="ltr" class="client-nojs"> <head>
> <meta charset="UTF-8"/> <title>List of cities in Colombia - Wikipedia,
> the free encyclopedia</title>
> ...
However, when I try to get the table with rvest, it shows weird characters where there should be accented (standard Spanish) ones like á, é, etc. This is what I attempted:
library(rvest)
theurl <- "https://en.wikipedia.org/wiki/List_of_cities_in_Colombia"
file <- read_html(theurl, encoding = "UTF-8")
tables <- html_nodes(file, "table")
pop <- html_table(tables[[2]])
head(pop)
## No. City Population Department
## 1 1 Bogotá 6.840.116 Cundinamarca
## 2 2 MedellÃn 2.214.494 Antioquia
## 3 3 Cali 2.119.908 Valle del Cauca
## 4 4 Barranquilla 1.146.359 Atlántico
## 5 5 Cartagena 892.545 BolÃvar
## 6 6 Cúcuta 587.676 Norte de Santander
I have attempted to repair the encoding, as suggested in other SO questions, with:
repair_encoding(pop)
## Best guess: UTF-8 (100% confident)
## Error in stringi::stri_conv(x, from = from) :
## all elements in `str` should be a raw vectors
I've tested several different encodings (latin1 and others suggested by guess_encoding()), but all of them produce similarly incorrect results.
How can I properly load this table?
1 solution
#1
3
It looks like you have to use repair_encoding on a character vector, not an entire data frame:
> repair_encoding(head(pop[,2]))
Best guess: UTF-8 (80% confident)
[1] "Bogotá" "Medellín" "Cali" "Barranquilla"
[5] "Cartagena" "Cúcuta"
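To fix every text column rather than one vector at a time, something like the following might work. This is a base R sketch, not the repair_encoding approach above: it assumes the damage is the usual case of UTF-8 bytes having been decoded as Latin-1, which is consistent with the "Ã" sequences in the output. The names fix_mojibake and pop_demo are made up for illustration, and pop_demo is a small stand-in for the scraped table so the example runs offline.

```r
# Hypothetical helper: undo "UTF-8 bytes read as Latin-1" damage.
# Apply it only to damaged text. Correctly decoded accented strings would
# be corrupted by it, and characters outside Latin-1 come back as NA.
fix_mojibake <- function(x) {
  # Map each mis-decoded character back to the single byte it came from...
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # ...then declare that those bytes are UTF-8, which is what they were.
  Encoding(bytes) <- "UTF-8"
  bytes
}

# Stand-in for the scraped table (assumption: the real `pop` looks similar).
good <- c("Bogot\u00e1", "Medell\u00edn", "Cali")
pop_demo <- data.frame(
  # Simulate the damage by decoding the UTF-8 bytes as Latin-1.
  City = iconv(good, from = "latin1", to = "UTF-8"),
  Population = c("6.840.116", "2.214.494", "2.119.908"),
  stringsAsFactors = FALSE
)

# Repair only the character columns; other column types are left alone.
is_chr <- vapply(pop_demo, is.character, logical(1))
pop_demo[is_chr] <- lapply(pop_demo[is_chr], fix_mojibake)

print(pop_demo$City)
```

The same two lines that select and rewrite the character columns should carry over to the real pop data frame, provided its columns are damaged in the same way.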