Remove non-ASCII characters from data files

Date: 2021-08-14 15:25:56

I've got a bunch of CSV files that I'm reading into R and including in a package's data folder in .rdata format. Unfortunately, the non-ASCII characters in the data fail the package check. The tools package has two functions to check for non-ASCII characters (showNonASCII and showNonASCIIfile), but I can't seem to locate one to remove/clean them.

Before I explore other UNIX tools, it would be great to do this all in R so I can maintain a complete workflow from raw data to final product. Are there any existing packages/functions to help me get rid of the non-ASCII characters?

2 Answers

#1


To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:

x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1"  # (just to make sure)
x
# [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm"        "Jreskog"       "bichen Zrcher"
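The same call scales to a whole data set. Here is a minimal sketch (the data frame is made up for illustration) that applies iconv() to every character column before saving, which is the step you'd want just before writing the .rdata file:

```r
# Hypothetical example data frame standing in for a CSV read with read.csv()
df <- data.frame(
  name  = c("Ekstr\xf8m", "J\xf6reskog"),
  value = c(1, 2),
  stringsAsFactors = FALSE
)
Encoding(df$name) <- "latin1"

# Clean every character column the same way, dropping non-ASCII bytes
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], iconv, from = "latin1", to = "ASCII", sub = "")

df$name
# [1] "Ekstrm"  "Jreskog"
```

In a real workflow you would replace the hand-built data frame with read.csv() and follow the cleaning step with save(df, file = "...").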

To locate non-ASCII characters, or to check whether your files contain any at all, you could adapt the following ideas:

## Do *any* lines contain non-ASCII characters? 
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
# [1] TRUE

## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
# [1] 1 2 3
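If you use this detection pattern more than once, it can be wrapped in a small helper (the function name here is just a suggestion):

```r
# Flag which elements of a character vector contain non-ASCII characters,
# by substituting a sentinel string for each non-ASCII byte and grepping for it
has_non_ascii <- function(x) {
  marker <- "I_WAS_NOT_ASCII"
  grepl(marker, iconv(x, "latin1", "ASCII", sub = marker), fixed = TRUE)
}

x <- c("Ekstr\xf8m", "plain ASCII", "J\xf6reskog")
Encoding(x) <- "latin1"

which(has_non_ascii(x))
# [1] 1 3
```

Applied to readLines() output, this gives you the line numbers to inspect in the original CSV.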

#2


These days, a slightly better approach is to use the stringi package, which provides a function for general Unicode transliteration. This lets you preserve as much of the original text as possible:

x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
x
#> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

stringi::stri_trans_general(x, "latin-ascii")
#> [1] "Ekstrom"          "Joreskog"         "bisschen Zurcher"
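The same transliterator can be applied column-wise for the package-data use case. A minimal sketch, assuming stringi is installed and using a made-up data frame in place of the real CSV:

```r
# Hypothetical data frame standing in for data read from a CSV file
df <- data.frame(
  author = c("Ekstr\u00f8m", "J\u00f6reskog"),
  n      = c(10, 20),
  stringsAsFactors = FALSE
)

# Transliterate every character column to its closest ASCII equivalent
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], stringi::stri_trans_general, id = "latin-ascii")

df$author
#> [1] "Ekstrom"  "Joreskog"
```

Unlike the iconv(..., sub = "") approach, accented letters are mapped to plain equivalents ("ø" to "o") rather than dropped, so the text stays readable.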
