I need to read a text file that is encoded in GBK. Go's standard library assumes that all text is encoded in UTF-8.
How can I read files in other encodings?
2 solutions
#1
14
Previously (as mentioned in an older answer), the "easy" way to do this was to use third-party packages that required cgo and wrapped the iconv library. That is undesirable for many reasons. Thankfully, for quite a while now there has been a better, pure-Go way of doing this using only packages provided by the Go Authors (not in the standard library, but in the Go sub-repositories).
The golang.org/x/text/encoding package defines an interface for generic character encodings that can convert to and from UTF-8. The golang.org/x/text/encoding/simplifiedchinese sub-package provides GB18030, GBK and HZ-GB2312 encoding implementations.
Here is an example of reading and writing a GBK-encoded file. Note that the io.Reader and io.Writer returned by transform.NewReader and transform.NewWriter do the conversion "on the fly" as data is read or written.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	"golang.org/x/text/encoding/simplifiedchinese"
	"golang.org/x/text/transform"
)

// Encoding to use. Since this implements the encoding.Encoding
// interface from golang.org/x/text/encoding you can trivially
// change this out for any of the other implemented encodings,
// e.g. `traditionalchinese.Big5`, `charmap.Windows1252`,
// `korean.EUCKR`, etc.
var enc = simplifiedchinese.GBK

func main() {
	const filename = "example_GBK_file"
	exampleWriteGBK(filename)
	exampleReadGBK(filename)
}

func exampleReadGBK(filename string) {
	// Read UTF-8 from a GBK encoded file.
	f, err := os.Open(filename)
	if err != nil {
		log.Fatal(err)
	}
	// r decodes GBK bytes from f into UTF-8 as they are read.
	r := transform.NewReader(f, enc.NewDecoder())

	// Read converted UTF-8 from `r` as needed.
	// As an example we'll read line-by-line showing what was read:
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		fmt.Printf("Read line: %s\n", sc.Bytes())
	}
	if err = sc.Err(); err != nil {
		log.Fatal(err)
	}

	if err = f.Close(); err != nil {
		log.Fatal(err)
	}
}

func exampleWriteGBK(filename string) {
	// Write UTF-8 to a GBK encoded file.
	f, err := os.Create(filename)
	if err != nil {
		log.Fatal(err)
	}
	// w encodes UTF-8 written to it as GBK before passing it on to f.
	w := transform.NewWriter(f, enc.NewEncoder())

	// Write UTF-8 to `w` as desired.
	// As an example we'll write some text from the Wikipedia
	// GBK page that includes Chinese.
	_, err = fmt.Fprintln(w,
		`In 1995, China National Information Technology Standardization
Technical Committee set down the Chinese Internal Code Specification
(Chinese: 汉字内码扩展规范(GBK); pinyin: Hànzì Nèimǎ
Kuòzhǎn Guīfàn (GBK)), Version 1.0, known as GBK 1.0, which is a
slight extension of Codepage 936. The newly added 95 characters were not
found in GB 13000.1-1993, and were provisionally assigned Unicode PUA
code points.`)
	if err != nil {
		log.Fatal(err)
	}

	// Close the transform writer so any buffered data is flushed.
	// This does not close the underlying file.
	if err = w.Close(); err != nil {
		log.Fatal(err)
	}
	if err = f.Close(); err != nil {
		log.Fatal(err)
	}
}