Special characters when importing from BigQuery into R

Date: 2022-10-12 19:17:33

I have a script that scrapes some tweets and saves the results to Google BigQuery. When I view the stored data, special characters like ➕, ‍♂️, Ñ, áéíóú appear correctly, but when I import the data back into R they are replaced by strange characters. Here's an example.

# Create df

id_tweet <- 1023985670224785408
tweet <- "◉ Neuroeducación y entornos digitales de aprendizaje: un paso obligado para educadores, pedagogos y psicólogos"
descripcion <- "Desde las alturas se ve todo de otra manera... ️ ➕ ‍♂️"

data <- data.frame(id_tweet, tweet, descripcion, stringsAsFactors = FALSE)

# Save to Google BQ

library(bigrquery)

insert_upload_job("project-id", "dataset", "table", data, write_disposition = "WRITE_APPEND")

# Load from Google BQ

sql <- paste("SELECT *", "FROM", "`project-id.dataset.table`")
data <- query_exec(sql, project = "project-id", use_legacy_sql = FALSE)

My output is the following:

> data
               id_tweet
283 1023985670224785408
                                                                                                                                         tweet
283 ◉ Neuroeducación y entornos digitales de aprendizaje: un paso obligado para educadores, pedagogos y psicólogos
                                                                                        descripcion
283 Desde las alturas se ve todo de otra manera... ï¿½ï¿½ï¸ âž• ��<U+200D>â™‚ï¸ ï¿½ï¿½ ��

What I want is to keep the original format.

What should I do?

Thanks,

1 solution

#1


I tested a few things which may help.

Firstly, I saved a blank R script and ensured it was in UTF-8 encoding (File -> Save with Encoding -> UTF-8). I then saved just the special characters from your question, in double quotes, as a .csv (i.e. "➕, ‍♂️, Ñ, áéíóú"), and read the csv in with fileEncoding = "UTF-8", i.e.:

test <- read.csv("test.csv", fileEncoding = "UTF-8", header=FALSE, stringsAsFactors = FALSE)

Inside RStudio, test returns:

# > test
# V1
# 1 \u2795, ‍♂️, Ñ, áéíóú

So all but the ➕ display nicely in RStudio. However, many characters, even common ones like line breaks and tabs, are shown in escaped form in the RStudio console but come out normally when written to a file. These are no different.
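
One way to confirm that only the display differs, and not the stored value, is to look at the code point and the bytes directly. A minimal sketch using base R only (the variable names are just for illustration):

```r
# The heavy plus sign, written as an escape so the script itself stays ASCII-only
plus <- "\u2795"

# How R has marked the string internally
enc <- Encoding(plus)

# The Unicode code point, independent of how the console renders it
code_point <- utf8ToInt(plus)

# The underlying UTF-8 bytes
bytes <- charToRaw(plus)

print(enc)
print(code_point)
print(bytes)
```

If code_point is 10133 (hex 2795), the character is intact even when the console prints it as \u2795.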

When the csv is written (just using write.csv(test, 'test2.csv', row.names = FALSE)), it displays exactly as it did in the original csv (that is, when opened in Sublime Text).
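
The round trip can also be checked programmatically rather than by eye. The sketch below is an illustration, not the exact steps above: it writes to a temporary file, and it calls read.csv with encoding = "UTF-8" (which marks the strings as UTF-8) instead of fileEncoding = "UTF-8" (which re-encodes them into the session's native encoding), so that the result does not depend on the locale.

```r
# Test string built from escapes: heavy plus, N with tilde, accented vowels
original <- "\u2795, \u00d1, \u00e1\u00e9\u00ed\u00f3\u00fa"

# Write one quoted CSV field as raw UTF-8 bytes to a temporary file
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeLines(paste0('"', original, '"'), con, useBytes = TRUE)
close(con)

# Read it back, declaring the content to be UTF-8
test <- read.csv(path, header = FALSE, encoding = "UTF-8",
                 stringsAsFactors = FALSE)

# Compare code points instead of relying on how the console prints them
round_trip_ok <- identical(utf8ToInt(test$V1), utf8ToInt(original))
print(round_trip_ok)
```

Comparing code points sidesteps the console rendering issue entirely, so a TRUE here means the file contents survived even if test prints with escapes.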

After all this, I would suggest ensuring your encoding is UTF-8, and perhaps saving the BQ output as a csv (if possible) and inspecting it to see whether the issue comes from BQ or from R. If the text comes out of BQ correctly, it should simply be a matter of changing the encoding in RStudio. If it does not come out of BQ as intended, I'd suggest the encoding needs to be fixed on the BQ side (so that the data is stored as UTF-8).
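
As a possible aid for that inspection: sequences such as âž• in the question's output are what the UTF-8 bytes of ➕ look like when they are decoded as Windows-1252, so one quick check is whether re-encoding recovers the original text. The sketch below reproduces the garbling and then reverses it in base R. The helper name fix_mojibake is hypothetical, and the idea that Windows-1252 is the wrong decoder involved is inferred from the output, not confirmed.

```r
# Hypothetical helper: undo "UTF-8 bytes that were decoded as Windows-1252".
# iconv() returns NA for strings containing characters outside Windows-1252.
fix_mojibake <- function(x) {
  # Map each character back to its single Windows-1252 byte ...
  raw_bytes <- iconv(x, from = "UTF-8", to = "windows-1252")
  # ... then declare that those bytes are really UTF-8
  Encoding(raw_bytes) <- "UTF-8"
  raw_bytes
}

plus <- "\u2795"

# Reproduce the garbling: treat the UTF-8 bytes of the plus sign as Windows-1252
garbled <- iconv(plus, from = "windows-1252", to = "UTF-8")

# Reverse it
repaired <- fix_mojibake(garbled)

print(utf8ToInt(garbled))   # code points of the three garbled characters
print(utf8ToInt(repaired))  # code point of the recovered plus sign
```

Read the same way, the ï¿½ sequences in the output are the bytes of U+FFFD, the Unicode replacement character. That would suggest those emoji were already lost before the text was mis-decoded, in which case re-encoding cannot bring them back.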
