I need to remove emoji from a string using a regex, but keep CJK (Chinese, Japanese, Korean) characters. I tried this regex:
REGEX = /[^\u1F600-\u1F6FF\s]/i
This regex works fine, except that it also matches Chinese, Japanese and Korean characters, which I need to keep. Any idea how to solve this?
8 answers
#1
25
Karol S already provided a solution, but the reason might not be clear:
"\u1F600" is actually "\u1F60" followed by "0":
"\u1F60" # => "ὠ"
"\u1F600" # => "ὠ0"
You have to use curly braces for code points above FFFF:
"\u{1F600}" #=> "😀"
Therefore the character class [\u1F600-\u1F6FF] is interpreted as [\u1F60 0-\u1F6F F], i.e. it matches "\u1F60", the range "0".."\u1F6F" and "F".
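The misparse can be checked directly. The following is a minimal sketch; BROKEN and in_broken_class? are names made up for illustration, and the class is the one from the question without its negation. Ruby may print a "duplicated range" warning for this class, which is itself a hint that the escape was not read as intended.

```ruby
# Assumption: BROKEN reproduces the question's character class, minus the leading ^ and \s.
BROKEN = /[\u1F600-\u1F6FF]/

# Hypothetical helper: does a single character fall inside the broken class?
def in_broken_class?(char)
  BROKEN.match?(char)
end

p "\u1F600".chars                # two characters: U+1F60 followed by "0"
p in_broken_class?("A")          # true: ASCII letters lie inside "0".."\u1F6F"
p in_broken_class?("中")          # false: CJK ideographs lie above U+1F6F
p in_broken_class?("\u{1F600}")  # false: so does the emoji itself
```

Because CJK characters and emoji are both outside the class, the negated form used in the question matches both of them.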
Using curly braces solves the issue:
/[\u{1F600}-\u{1F6FF}]/
This matches (emoji) characters in these unicode blocks:
- U+1F600..U+1F64F Emoticons
- U+1F650..U+1F67F Ornamental Dingbats
- U+1F680..U+1F6FF Transport and Map Symbols
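As a small sketch of the braces form in use (the constant and method names are illustrative, not from the original answer), removing those blocks leaves CJK text untouched:

```ruby
# Matches code points in U+1F600..U+1F6FF only.
EMOJI_BLOCKS = /[\u{1F600}-\u{1F6FF}]/

# Hypothetical helper: delete every character from the three blocks listed above.
def strip_emoji_blocks(text)
  text.gsub(EMOJI_BLOCKS, "")
end

puts strip_emoji_blocks("你好 \u{1F600} こんにちは \u{1F680} 안녕")
```

The spaces that surrounded each emoji remain, so a follow-up squeeze(' ') may be wanted.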
You can also use unpack, pack, and between? to achieve a similar result. This also works for Ruby 1.8.7, which doesn't support Unicode in regular expressions.
s = 'Hi!😀'
#=> "Hi!\360\237\230\200"
s.unpack('U*').reject{ |e| e.between?(0x1F600, 0x1F6FF) }.pack('U*')
#=> "Hi!"
Regarding your Rubular example, emoji are single characters:
"😀".length #=> 1
"😀".chars #=> ["😀"]
Whereas kaomoji are a combination of multiple characters:
"^_^".length #=> 3
"^_^".chars #=> ["^", "_", "^"]
Matching these is a very different task (and you should ask that in a separate question).
#2
13
This regex matches all 845 emoji, taken from Emoji unicode characters for use on the web:
[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]
I generated this regex directly from the raw list of Unicode emoji. The algorithm is here: https://github.com/franklsf95/ruby-emoji-regex.
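For readers who want to regenerate such a class from their own list, here is a hedged sketch of the general idea. It is not the code from the linked repository; build_char_class is a hypothetical helper that assumes a non-empty array of integer code points.

```ruby
# Hypothetical helper: collapse code points into contiguous ranges and
# emit a character class using the \u{...} escape form.
def build_char_class(codepoints)
  runs = codepoints.uniq.sort.slice_when { |a, b| b != a + 1 }
  body = runs.map do |run|
    first, last = run.first, run.last
    if first == last
      format('\u{%X}', first)
    else
      format('\u{%X}-\u{%X}', first, last)
    end
  end.join
  Regexp.new("[#{body}]")
end

sample = build_char_class([0x203C, 0x2194, 0x2195, 0x2196, 0x1F600])
puts sample.source
puts "a\u{2195}\u{1F600}中".gsub(sample, "")
```

Consecutive code points become a single range, which keeps the generated class short.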
Example usage:
regex = /[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]/
str = "I am a string with emoji ???????????????????????????????? and other Unicode characters 比如中文."
str.gsub regex, ''
# "I am a string with emoji and other Unicode characters 比如中文."
Other Unicode characters, such as Asian characters, are preserved.
EDIT: I updated the regex to exclude ASCII numbers and symbols. See comments for details.
#3
12
I am using one based on this script.
def strip_emoji(text)
  text = text.force_encoding('utf-8').encode
  clean = ""
  # symbols & pics
  regex = /[\u{1f300}-\u{1f5ff}]/
  clean = text.gsub regex, ""
  # enclosed chars
  regex = /[\u{2500}-\u{2BEF}]/ # I changed this to exclude Chinese characters
  clean = clean.gsub regex, ""
  # emoticons
  regex = /[\u{1f600}-\u{1f64f}]/
  clean = clean.gsub regex, ""
  # dingbats
  regex = /[\u{2702}-\u{27b0}]/
  clean = clean.gsub regex, ""
end
Results:
irb> strip_emoji("????????☂❤华み원❤")
=> "华み원"
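The same ranges can also be applied in one pass. This is a sketch rather than the answerer's code: the names are made up, the input is assumed to be valid UTF-8 already, and the dingbats range is left out because U+2702..U+27B0 already lies inside U+2500..U+2BEF.

```ruby
# Assumption: one character class covering the ranges used by strip_emoji.
EMOJI_CLASS = /[\u{1F300}-\u{1F5FF}\u{1F600}-\u{1F64F}\u{2500}-\u{2BEF}]/

# Hypothetical single-pass variant: one gsub instead of four.
def strip_emoji_single_pass(text)
  text.gsub(EMOJI_CLASS, "")
end

puts strip_emoji_single_pass("\u{1F600}\u{2602}\u{2764}华み원\u{2764}")
```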
#4
8
REGEX = /[^\u{1F600}-\u{1F6FF}\s]/
or
REGEX = /[\u{1F600}-\u{1F6FF}\s]/
REGEX = /[\u{1F600}-\u{1F6FF}]/
REGEX = /[^\u{1F600}-\u{1F6FF}]/
because your original regex seems to indicate that you are trying to find everything that is neither an emoji nor whitespace, and I don't know why you would want to do that.
Also:
- the emoji are 1F300-1F6FF rather than 1F600-1F6FF; you may want to change that
- if you want to remove all astral characters (for example, you deal with software that doesn't support all of Unicode), you should use 10000-10FFFF.
EDIT: You almost certainly want REGEX = /[\u{1F600}-\u{1F6FF}]/ or similar. Your original regex matched everything that is not whitespace and not in the range 0-\u1F6F. Since spaces are whitespace, and English letters are in the range 0-\u1F6F, and Chinese characters are in neither, the regex matched Chinese characters and removed them.
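A minimal sketch of the astral-range suggestion, with made-up names. One thing to keep in mind: rarer CJK ideographs (for example those in CJK Extension B, starting at U+20000) are astral too, so they are removed along with the emoji.

```ruby
# Every code point outside the Basic Multilingual Plane.
ASTRAL = /[\u{10000}-\u{10FFFF}]/

# Hypothetical helper: drop all supplementary-plane characters.
def strip_astral(text)
  text.gsub(ASTRAL, "")
end

puts strip_astral("中\u{1F600}文")   # the BMP ideographs survive
puts strip_astral("\u{20BB7}野家")   # U+20BB7 is an astral ideograph and is dropped
```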
#5
1
Instead of removing emoji characters, you can keep only letters and digits. A simple tr should do the trick: .tr('^A-Za-z0-9', ''). Of course, this also removes all punctuation, whitespace and non-ASCII text (including CJK characters), but you can always modify the character set to suit your specific needs.
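A quick sketch of what that call keeps (the method name is illustrative). With an empty replacement string, tr deletes the selected characters, so it behaves like delete here:

```ruby
# Hypothetical helper: keep ASCII letters and digits, drop everything else.
def keep_ascii_alnum(text)
  text.tr('^A-Za-z0-9', '')
end

puts keep_ascii_alnum("Hi 中文 \u{1F600} 123!")
puts "Hi 中文 \u{1F600} 123!".delete('^A-Za-z0-9')
```

Both lines print the same ASCII-only result.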
#6
1
This very short Regex covers all Emoji in getemoji.com so far:
[\u{1F300}-\u{1F5FF}\u{1F1E6}-\u{1F1FF}\u{2700}-\u{27BF}\u{1F900}-\u{1F9FF}\u{1F600}-\u{1F64F}\u{1F680}-\u{1F6FF}\u{2600}-\u{26FF}]
#7
0
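I converted the regex from the Ruby project above to a JavaScript-friendly regex (shown here as a C# string constant). Note that a \u escape in this syntax takes exactly four hex digits, so an entry such as \u1F004 is read as \u1F00 followed by a literal 4; the supplementary-plane ranges below would need surrogate pairs to match as intended.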
/// <summary>
/// Emoji symbols character sets (added \s and +)
/// Unicode with עברית Delete the emoji to match ????
/// https://regex101.com/r/jP5jC5/3
/// https://github.com/franklsf95/ruby-emoji-regex
/// http://*.com/questions/24672834/how-do-i-remove-emoji-from-string
/// </summary>
public const string Emoji = @"^[\s\u00A9\u00AE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9-\u21AA\u231A-\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA-\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614-\u2615\u2618\u261D\u2620\u2622-\u2623\u2626\u262A\u262E-\u262F\u2638-\u263A\u2648-\u2653\u2660\u2663\u2665-\u2666\u2668\u267B\u267F\u2692-\u2694\u2696-\u2697\u2699\u269B-\u269C\u26A0-\u26A1\u26AA-\u26AB\u26B0-\u26B1\u26BD-\u26BE\u26C4-\u26C5\u26C8\u26CE-\u26CF\u26D1\u26D3-\u26D4\u26E9-\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733-\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763-\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934-\u2935\u2B05-\u2B07\u2B1B-\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299\u1F004\u1F0CF\u1F170-\u1F171\u1F17E-\u1F17F\u1F18E\u1F191-\u1F19A\u1F201-\u1F202\u1F21A\u1F22F\u1F232-\u1F23A\u1F250-\u1F251\u1F300-\u1F321\u1F324-\u1F393\u1F396-\u1F397\u1F399-\u1F39B\u1F39E-\u1F3F0\u1F3F3-\u1F3F5\u1F3F7-\u1F4FD\u1F4FF-\u1F53D\u1F549-\u1F54E\u1F550-\u1F567\u1F56F-\u1F570\u1F573-\u1F579\u1F587\u1F58A-\u1F58D\u1F590\u1F595-\u1F596\u1F5A5\u1F5A8\u1F5B1-\u1F5B2\u1F5BC\u1F5C2-\u1F5C4\u1F5D1-\u1F5D3\u1F5DC-\u1F5DE\u1F5E1\u1F5E3\u1F5EF\u1F5F3\u1F5FA-\u1F64F\u1F680-\u1F6C5\u1F6CB-\u1F6D0\u1F6E0-\u1F6E5\u1F6E9\u1F6EB-\u1F6EC\u1F6F0\u1F6F3\u1F910-\u1F918\u1F980-\u1F984\u1F9C0}]+$";
Usage:
if (!Regex.IsMatch(vm.NameFull, RegExKeys.Emoji)) // no match: the name is not made up entirely of emoji and whitespace
#8
0
One more alternative
"Scheiße! I hate emoji ???? (123)".gsub(/[^\p{L}\s]+/, '').squeeze(' ').strip
=> "Scheiße I hate emoji"
This regex removes every character that is neither a letter nor whitespace (e.g. !, the emoji and (123)) but keeps Unicode letters (e.g. ß in this example), where:
\p{...} - matches characters with a given Unicode property (a general category or a script)
\p{L} - any character in the general category 'Letter'
^ - at the start of a character class, negates the class
\s - any whitespace character
Live demo
More info about regexp
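If digits should survive as well, the same idea extends to the Number category. This is a sketch under that assumption; keep_letters_and_digits is a made-up name.

```ruby
# Hypothetical variant: keep letters (\p{L}), numbers (\p{N}) and whitespace,
# then collapse the runs of spaces left behind.
def keep_letters_and_digits(text)
  text.gsub(/[^\p{L}\p{N}\s]+/, '').squeeze(' ').strip
end

puts keep_letters_and_digits("Scheiße! 你好 \u{1F600} (123)")
```

CJK ideographs belong to the Letter category, so they are kept while the emoji and punctuation are dropped.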