I am trying to read and parse a file using Python 2. The creator of the file typed a number of lines in the terminal, with Ctrl-A characters within each line, and copied those lines into a text file. So the lines in the file look like "(something)^A(something)". When I read the file with Python's readlines(), those "^A" sequences are not recognized.
I tried to use io.open and codecs.open with the encoding set to UTF-8, but "^A" is clearly not a UTF-8 string. Does anyone know how to read these special control characters from a file using Python? Thank you very much!
2 solutions
#1
0
Simply read the file in binary mode like so: open('file.txt', 'rb'). Ctrl-A will be the byte with value 1.
with open('test.txt', 'rb') as f:
    text = f.read()

# In Python 2, iterating over a byte string yields one-character strings
for char in text:
    if char == b'\x01':  # \x01 stands for the byte with hex value 01 (Ctrl-A)
        # Do something
        pass
    else:
        # Do something else
        pass
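If the goal is to parse each line into fields rather than inspect individual bytes, the binary-mode approach above can be combined with a split on the Ctrl-A byte. The following is only a minimal sketch: the helper name split_ctrl_a_lines and the sample data are made up for illustration, and it assumes Ctrl-A acts as a field separator within each line.

```python
def split_ctrl_a_lines(data):
    """Split raw bytes into lines, then split each line on the Ctrl-A byte.

    Hypothetical helper for illustration; `data` is what f.read() returns
    for a file opened in 'rb' mode.
    """
    return [line.split(b'\x01') for line in data.splitlines()]


# Made-up sample standing in for the contents of the file
sample = b'name\x01alice\nname\x01bob\n'
fields = split_ctrl_a_lines(sample)
```

Because it works on byte strings only, this should behave the same on Python 2 and Python 3.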
#2
0
These control characters are part of the ASCII character set, with numeric codes ranging from 0 to 31 (or 00 to 1F in hexadecimal). To strip them out of a string, simply use regex substitution. Note that this range also covers tab, line feed, and carriage return, so those are removed as well:
import re
clean_string = re.sub(r'[\x00-\x1f]+', '', string_with_control_characters)
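If tabs and line breaks should survive the cleanup, a narrower character class is one option. This is a sketch under that assumption, not part of the original answer; the function name strip_controls and the sample string are hypothetical.

```python
import re

# Matches ASCII control characters 0x00-0x1F except tab (0x09),
# line feed (0x0A), and carriage return (0x0D)
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def strip_controls(s):
    """Remove control characters from s, keeping tabs and line breaks."""
    return CONTROL_CHARS.sub('', s)


# Made-up sample: Ctrl-A and 0x1F are dropped, tab and newline are kept
cleaned = strip_controls('a\x01b\tc\nd\x1f')
```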