Removing hidden characters from within strings

Time: 2022-08-30 00:05:18

My problem:

I have a .NET application that sends out newsletters via email. When the newsletters are viewed in Outlook, Outlook displays a question mark in place of a hidden character it can’t recognize. These hidden characters come from end users who copy and paste the HTML that makes up the newsletters into a form and submit it. A C# Trim() removes these hidden characters if they occur at the beginning or end of the string. When the newsletter is viewed in Gmail, Gmail does a good job of ignoring them. When I paste these hidden characters into a Word document and turn on the “show paragraph marks and hidden symbols” option, the symbols appear as one rectangle inside a bigger rectangle. The text that makes up the newsletters can be in any language, so accepting Unicode characters is a must. I've tried looping through the string to detect the character, but the loop doesn't recognize it and passes over it. Asking the end user to paste the HTML into Notepad before submitting it is out of the question.

My question:
How can I detect and eliminate these hidden characters using C#?

8 solutions

#1


46  

You can remove all control characters from your input string with something like this:

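// requires: using System.Linq;
string input = "abc\u0007def"; // sample value only; use your own input string here
// Note: char.IsControl is also true for tab, line feed, and carriage return.
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());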

Here is the documentation for the IsControl() method.

Or, if you want to keep letters and digits only, you can use the IsLetter and IsDigit functions:

string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray());

#2


12  

I usually use this regular expression to replace all non-printable characters.

By the way, most people consider tab, line feed, and carriage return to be non-printable characters, but I don't.

So here is the expression:

string output = Regex.Replace(input, @"[^\u0009\u000A\u000D\u0020-\u007E]", "*");
  • ^ at the start of the brackets negates the set: the expression matches any character that is not one of the following:
  • \u0009 is tab
  • \u000A is line feed
  • \u000D is carriage return
  • \u0020-\u007E means everything from space to ~, that is, every printable ASCII character.

See an ASCII table if you want to make changes. Remember that the expression also replaces every non-ASCII character (here with "*").

To test the expression above, you can build a string yourself like this:

    string input = string.Empty;

    for (int i = 0; i < 255; i++)
    {
        input += (char)(i);
    }
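
Because the expression above also replaces every non-ASCII character, it does not fit text that must keep letters from other languages. A possible variant, sketched here under the assumption that only control (Cc) and format (Cf) characters are unwanted, uses Unicode categories and .NET character class subtraction instead. The names HiddenCharRegex and Strip are made up for illustration and are not part of the original answer.

```csharp
using System.Text.RegularExpressions;

public static class HiddenCharRegex
{
    // [\p{Cc}\p{Cf}-[\t\n\r]] matches any control or format character,
    // minus tab, line feed, and carriage return (.NET class subtraction).
    private static readonly Regex Hidden =
        new Regex(@"[\p{Cc}\p{Cf}-[\t\n\r]]", RegexOptions.Compiled);

    // Removes the matched characters; letters of any script are untouched.
    public static string Strip(string input)
    {
        return Hidden.Replace(input, string.Empty);
    }
}
```

Note that the format category also contains characters such as the zero-width joiner, which some scripts rely on, so the set may need narrowing for real newsletters.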

#3


3  

You can do this:

var hChars = new char[] {...};
var result = new string(yourString.Where(c => !hChars.Contains(c)).ToArray());

#4


3  

new string(input.Where(c => !char.IsControl(c)).ToArray());

IsControl misses some invisible characters such as the left-to-right mark (LRM, U+200E), which is a format character rather than a control character and commonly sneaks into a string during copy and paste. If you are sure that your string contains only letters and digits, you can use IsLetterOrDigit

new string(input.Where(c => char.IsLetterOrDigit(c)).ToArray())

If your string also contains special characters such as punctuation, you can keep only the ASCII range instead (this drops every non-ASCII character):

new string(input.Where(c => c < 128).ToArray())
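
Neither filter above keeps punctuation and non-English text at the same time. One way to get both, shown as a minimal sketch rather than as this answer's method, is to drop characters by Unicode category: control (Cc) and format (Cf, which includes LRM) characters go, everything else stays. The class and method names are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class HiddenCharFilter
{
    // Removes control and format characters but keeps tab, LF, and CR.
    // Letters, digits, punctuation, and symbols of any script are kept.
    public static string Strip(string input)
    {
        return new string(input.Where(c => !IsHidden(c)).ToArray());
    }

    private static bool IsHidden(char c)
    {
        if (c == '\t' || c == '\n' || c == '\r')
            return false; // ordinary whitespace that is usually worth keeping

        UnicodeCategory category = char.GetUnicodeCategory(c);
        return category == UnicodeCategory.Control
            || category == UnicodeCategory.Format;
    }
}
```

Removing every format character is an assumption that may be too aggressive: marks such as LRM or the zero-width joiner can affect how right-to-left and some other scripts render.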

#5


2  

What worked best for me is:

string result = new string(value.Where(c =>  char.IsLetterOrDigit(c) || (c >= ' ' && c <= byte.MaxValue)).ToArray());

Here I keep the character if it is a letter or digit, so that non-English letters are not dropped. If it is not a letter or digit, I keep it only when it lies between space and byte.MaxValue (255), which drops the ASCII control characters below space while keeping punctuation. Note that this range is wider than ASCII and still includes U+007F to U+009F, which are control characters.

Some suggest using IsControl to check whether a character is non-printable, but IsControl does not catch the left-to-right mark, for example.

#6


1  

If you know what these characters are you can use string.Replace:

newString = oldString.Replace("?", "");

where "?" represents the character you want to strip out.

The drawback with this approach is that you need to make this call repeatedly if there are multiple characters that you want to remove.
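
If you do not yet know which characters they are, a small diagnostic can list them. The sketch below reports every character outside printable ASCII with its index, code point, and Unicode category; HiddenCharReport and Describe are hypothetical names, and the output format is an arbitrary choice.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class HiddenCharReport
{
    // Returns one entry per character outside printable ASCII, formatted as
    // "index <i>: U+<hex code point> <UnicodeCategory>".
    public static List<string> Describe(string input)
    {
        var lines = new List<string>();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c >= ' ' && c <= '~')
                continue; // printable ASCII, nothing to report

            lines.Add(string.Format("index {0}: U+{1:X4} {2}",
                i, (int)c, char.GetUnicodeCategory(c)));
        }
        return lines;
    }
}
```

Legitimate non-English letters are reported too (with a letter category), so the entries to look for are the ones whose category is Control or Format.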

#7


0  

It has been a while, but this hasn't been answered yet.

How do you include the HTML content in the sending code? If you are reading it from a file, check the file encoding. If you are using UTF-8 with signature, also known as a byte order mark or BOM (the name varies slightly between editors), this may cause the weird character at the beginning of the mail.
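
If the signature does end up as the first character of the string, it shows up as U+FEFF. A minimal sketch for dropping it, assuming the content is already loaded into a string (BomHelper and StripLeadingBom are hypothetical names):

```csharp
public static class BomHelper
{
    // U+FEFF at the very start of a text is the byte order mark ("signature").
    // Only a leading occurrence is removed; the rest of the text is untouched.
    public static string StripLeadingBom(string text)
    {
        if (text.Length > 0 && text[0] == '\uFEFF')
            return text.Substring(1);
        return text;
    }
}
```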

#8


0  

string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

This should solve the problem. I had a non-printable substitute character (ASCII 26) in a string that was causing my app to break, and this line of code removed it.
