Delete control characters from a file

Time: 2022-04-20 17:07:56

I want to delete all the control characters from my file using Linux bash commands.

Some control characters, especially EOF (0x1A), cause problems when I load the file in another program. I want to delete them.

Here is what I have tried so far:

This displays the control characters in caret notation:

cat -v -e -t file.txt | head -n 10

^A+^X$
^A1^X$
^D ^_$
^E-^D$
^E-^S$
^E1^V$
^F%^_$
^F-^D$
^F.^_$
^F/^_$
^F4EZ$
^G%$

This lists the lines that contain control characters, using grep:

$ cat file.txt | head -n 10 | grep '[[:cntrl:]]'
+
1

-
-
1
%
-
.
/

This matches the output of the cat command above.

Next, I ran the following command to show only the lines that do not contain control characters, but it still shows the same output as above (lines with control characters):

$ cat file.txt | head -n 10 | grep '[^[:cntrl:]]'
+
1

-
-
1
%
-
.
/
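
The output is unchanged because `[^[:cntrl:]]` matches any line that contains at least one non-control character, and every line here has one. To select lines with no control characters at all, the match has to be inverted instead. A minimal sketch, assuming a POSIX grep; `sample.txt` and its contents are hypothetical:

```shell
# Hypothetical sample: one clean line and one line containing 0x01.
printf 'clean line\nbad\001line\n' > sample.txt

# -v inverts the match: print only lines with no control character at all.
LC_ALL=C grep -v '[[:cntrl:]]' sample.txt
```

Only the first line should be printed. This filters whole lines, so it does not remove control characters from within a line.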

Here is the output in hex format:

$ cat file.txt | head -n 10 | grep '[[:cntrl:]]' | od -t x2
0000000 2b01 0a18 3101 0a18 2004 0a1f 2d05 0a04
0000020 2d05 0a13 3105 0a16 2506 0a1f 2d06 0a04
0000040 2e06 0a1f 2f06 0a1f
0000050

As you can see, the hex values 0x01 and 0x18 are control characters.
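
`od -t x2` groups the bytes into 16-bit words, so on a little-endian machine each pair is printed swapped: `2b01 0a18` is the byte sequence 01 2b 18 0a. A minimal sketch that dumps one byte at a time instead, using a hypothetical `sample.txt` holding just the first line of the dump above:

```shell
# One sample line: 0x01, '+', 0x18, LF (hypothetical file name).
printf '\001+\030\n' > sample.txt

# -t x1 prints single bytes in file order; -An drops the offset column.
od -An -t x1 sample.txt
```

With `-t x1` the bytes appear in the order they occur in the file, which makes the control characters easier to spot.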

I tried using the tr command to delete the control characters but got an error:

$ cat file.txt | tr -d "\r\n" "[:cntrl:]" >> test.txt
tr: extra operand `[:cntrl:]'
Only one string may be given when deleting without squeezing repeats.
Try `tr --help' for more information.

If I delete all control characters, I will also delete the newline and carriage return, which together form the line ending on Windows. How do I delete all the control characters while keeping only the required ones, such as "\r\n"?

Thanks.

4 answers

#1


20  

Instead of using the predefined [:cntrl:] set, which as you observed includes \n and \r, just list (in octal) the control characters you want to get rid of:

$ tr -d '\000-\011\013\014\016-\037' < file.txt > newfile.txt
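
The command can be tried on a small sample before running it on real data. The following is a minimal sketch, assuming a POSIX `tr`; `sample.txt`, `cleaned.txt`, and the sample bytes are hypothetical. The trailing `\177` (DEL) is an addition that is not in the command above, whose ranges stop at `\037`. Note also that the range `\000-\011` removes tab (`\011`).

```shell
# Hypothetical sample: control bytes 0x01, 0x18, 0x1A and CRLF line endings.
printf '\001+\030\r\nab\032c\r\n' > sample.txt

# Delete the control characters \000-\037 except LF (\012) and CR (\015).
# \177 (DEL) is added here as an assumption; drop it to match the original.
tr -d '\000-\011\013\014\016-\037\177' < sample.txt > cleaned.txt

# Inspect the result byte by byte.
od -c cleaned.txt
```

To keep tabs as well, the first range can be shortened to `\000-\010`.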

#2


3  

Try grep, like:

grep -o "[[:print:][:space:]]*" in.txt > out.txt

This prints only printable characters (alphanumerics, punctuation, and space) plus whitespace characters such as tab, newline, vertical tab, form feed, and carriage return. Note that -o prints each match on its own line, so a line with a control character in the middle is split at that point.

To be less restrictive and remove only control characters ([:cntrl:]), delete them with:

tr -d "[:cntrl:]"

If you want to keep \r and \n (which are part of [:cntrl:]), temporarily replace them with something else, e.g.:

tr '\r\n' '\275\276' < file.txt | tr -d '[:cntrl:]' | tr '\275\276' '\r\n'
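
The placeholder approach only works if the bytes `\275` and `\276` never occur in the input; both can appear inside UTF-8 multibyte characters, in which case they would be turned into CR and LF on the way back. A minimal sketch on ASCII-only data, assuming a POSIX `tr`; `sample.txt` and `cleaned.txt` are hypothetical names, and `LC_ALL=C` is set as an assumption so that `[:cntrl:]` covers only the ASCII control bytes:

```shell
# Hypothetical ASCII-only sample with 0x01, 0x1A and CRLF line endings.
printf 'a\001b\r\nc\032d\r\n' > sample.txt

# 1. Park CR and LF on placeholder bytes that [:cntrl:] does not cover.
# 2. Delete every remaining control character.
# 3. Map the placeholders back to CR and LF.
LC_ALL=C tr '\r\n' '\275\276' < sample.txt \
  | LC_ALL=C tr -d '[:cntrl:]' \
  | LC_ALL=C tr '\275\276' '\r\n' > cleaned.txt

od -c cleaned.txt
```

For input that may contain non-ASCII bytes, listing the characters to delete explicitly avoids the placeholder collision.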

#3


2  

Based on this answer on unix.stackexchange, this should do the trick:

$ cat scriptfile.raw | col -b > scriptfile.clean

#4


1  

A little late to the party: cat -v <file>, which I think is the easiest of the lot to remember!
