I have been working on creating a product feed for a third-party company. The data I am working with has all sorts of invalid or special characters, double spacing, etc. They have also requested that the data be HTML encoded wherever special characters are used.
An example of some data that would be passed = "Buy Kitchen
Aid Artisan™ Stand Mixer 4.8L "
try
{
    var removeDoubleSpace = Regex.Replace(stringInput, @"\s+", " ");
    var encodedString = HttpUtility.HtmlEncode(removeDoubleSpace).Trim();
    var encodedAndLineBreaksRemoved = encodedString.Replace(Environment.NewLine, "");
    var finalStringOutput = Regex.Replace(encodedAndLineBreaksRemoved, @"(™)|(’)|(”)|(–)", "");
    return finalStringOutput;
}
catch (Exception)
{
    return stringInput;
}
I was trying to come up with a single method that could be called to do all of the above in a cleaner way, rather than using several Regex expressions. Or, perhaps, is there just one regex that covers everything?
3 Solutions
#1
Use a white list rather than a blacklist: it is easier to know which characters are acceptable than to anticipate every unacceptable character that might show up. A white list is simply a list of acceptable characters. Create your white list and remove everything that is not on it. In your case, a potential white list could include all ASCII characters.
The following white list keeps ASCII letters and digits plus punctuation characters (\p{P}). Everything else is dropped, and the matched runs are joined with single spaces.
using System;
using System.Text;
using System.Text.RegularExpressions;

public class Program
{
    private static string input = @"Buy Kitchen
Aid Artisan™ Stand Mixer 4.8L ";

    public static void Main()
    {
        var match = Regex.Match(input, @"[a-zA-Z0-9\p{P}]+");
        StringBuilder builder = new StringBuilder();
        while (match.Success)
        {
            // add a space after each match
            builder.Append(match + " ");
            match = match.NextMatch();
        }
        Console.WriteLine(builder.ToString());
    }
}
Output
Buy Kitchen Aid Artisan Stand Mixer 4.8L
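As a variation on the same white-list idea, the "all ASCII characters" suggestion can be sketched with a single Regex.Replace and a negated character class instead of a match loop. This is only a sketch: the class name WhitelistSketch and the method KeepAllowed are made up for illustration, and the allowed range (printable, non-space ASCII) is an assumption you would adjust to whatever the feed actually permits.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitelistSketch
{
    // Hypothetical helper: keeps printable ASCII (0x21-0x7E) and turns every
    // run of other characters (whitespace, line breaks, symbols such as the
    // trademark sign) into one space.
    public static string KeepAllowed(string input)
    {
        string spaced = Regex.Replace(input, @"[^\x21-\x7E]+", " ");
        // Remove the leading/trailing space left by runs at either end.
        return spaced.Trim();
    }

    public static void Main()
    {
        string input = "Buy  Kitchen\r\nAid Artisan\u2122 Stand Mixer 4.8L ";
        Console.WriteLine(KeepAllowed(input));
    }
}
```

Because whitespace is outside the allowed range, the same replacement also collapses double spaces and line breaks, so no separate \s+ pass is needed here.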
#2
Here is a slightly enhanced version of the code:
var removeDoubleSpace = Regex.Replace(stringInput, @"\s+", " ");
var encodedString = System.Web.HttpUtility.HtmlEncode(removeDoubleSpace).Trim().Replace("™", string.Empty).Replace("’", string.Empty).Replace("”", string.Empty).Replace("–", string.Empty);
You do not need var encodedAndLineBreaksRemoved = encodedString.Replace(Environment.NewLine, ""); because the newline characters have already been removed by the \s+ regex (\s matches any whitespace character, including space, tab, form feed, and line breaks; for ASCII input it is equivalent to [ \f\n\r\t\v]).
Also, there is no need for a second regex unless you plan to remove a certain range of characters or a whole class (e.g. all characters in the \p{S} category), so I just chained several string.Replace calls directly onto the trimmed and encoded string.
Output:
Buy Kitchen Aid Artisan Stand Mixer 4.8L
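Since the question asks for one method to call, the steps above could be wrapped up roughly as follows. Treat it as a sketch under a few assumptions: the names FeedTextSketch and Clean are hypothetical, System.Net.WebUtility.HtmlEncode is swapped in for HttpUtility.HtmlEncode so the example runs without a System.Web reference, and the unwanted characters are stripped before encoding rather than after.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class FeedTextSketch
{
    // Hypothetical helper combining the steps discussed above.
    public static string Clean(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        // 1. Drop the four characters from the question:
        //    trademark sign, right single quote, right double quote, en dash.
        string stripped = Regex.Replace(input, "[\u2122\u2019\u201D\u2013]", "");

        // 2. Collapse every whitespace run (including line breaks) to one space.
        string collapsed = Regex.Replace(stripped, @"\s+", " ").Trim();

        // 3. HTML-encode what is left (assumption: WebUtility instead of HttpUtility).
        return WebUtility.HtmlEncode(collapsed);
    }

    public static void Main()
    {
        Console.WriteLine(Clean("Buy  Kitchen\r\nAid Artisan\u2122 Stand Mixer 4.8L "));
        Console.WriteLine(Clean("Salt & Pepper"));
    }
}
```

Stripping first means the result does not depend on how the encoder happens to treat those characters. If the feed needs some of them kept as entities instead, that step would have to change.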
#3
You don't need a regex; LINQ will do as well:
var str = "Buy Kitchen Aid Artisan™ Stand Mixer 4.8L";
var newStr = new string(str.Where(c => !Char.IsSymbol(c)).ToArray());
Console.WriteLine(newStr); // Buy Kitchen Aid Artisan Stand Mixer 4.8L
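One thing to watch with this approach: Char.IsSymbol covers only the Unicode symbol categories, so it removes ™ but leaves characters such as ’ and –, which are classified as punctuation. If those need to go as well, the predicate can be replaced with a white-list test. The sketch below assumes printable ASCII is the allowed set; LinqFilterSketch and KeepPrintableAscii are hypothetical names.

```csharp
using System;
using System.Linq;

public static class LinqFilterSketch
{
    // Hypothetical helper: keeps only characters from space (0x20) to tilde (0x7E).
    public static string KeepPrintableAscii(string input)
    {
        return new string(input.Where(c => c >= ' ' && c <= '~').ToArray());
    }

    public static void Main()
    {
        string str = "Chef\u2019s Artisan\u2122 Mixer \u2013 4.8L";

        // Symbol filter: drops the trademark sign, keeps the quote and the dash.
        Console.WriteLine(new string(str.Where(c => !char.IsSymbol(c)).ToArray()));

        // White-list filter: drops all three non-ASCII characters.
        Console.WriteLine(KeepPrintableAscii(str));
    }
}
```

Neither filter touches spacing, so a removed dash can leave a double space behind. A whitespace-collapsing pass would still be needed afterwards.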