Best Regular Expression for Email Format Validation with ASP.NET 3.5 Validation

Date: 2021-06-16 16:34:56

I've used both of the following regular expressions to test for a valid email address with ASP.NET validation controls. I was wondering which is the better expression from a performance standpoint, or if someone has a better one.

 - \w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
 - ^([0-9a-zA-Z]([-\.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$

I'm trying to avoid the "exponentially slow expression" problem described on the BCL Team Blog.

UPDATE

Based on feedback I ended up creating a function to test if an email is valid:

Public Function IsValidEmail(ByVal emailString As String, Optional ByVal isRequired As Boolean = False) As Boolean
    Dim emailSplit As String()
    Dim isValid As Boolean = True
    Dim localPart As String = String.Empty
    Dim domainPart As String = String.Empty
    Dim domainSplit As String()
    Dim tld As String

    If emailString.Length >= 80 Then
        isValid = False
    ElseIf emailString.Length > 0 AndAlso emailString.Length < 6 Then
        'Email is too short
        isValid = False
    ElseIf emailString.Length > 0 Then
        'Email is optional, only test value if provided
        emailSplit = emailString.Split("@"c)

        If emailSplit.Length <> 2 Then
            'Only 1 @ should exist
            isValid = False
        Else
            localPart = emailSplit(0)
            domainPart = emailSplit(1)
        End If

        If isValid = False OrElse domainPart.Contains(".") = False Then
            'Needs at least 1 period after @
            isValid = False
        Else
            'Test Local-Part Length and Characters
            If localPart.Length > 64 OrElse ValidateString(localPart, ValidateTests.EmailLocalPartSafeChars) = False OrElse _
               localPart.StartsWith(".") OrElse localPart.EndsWith(".") OrElse localPart.Contains("..") Then
                isValid = False
            End If

            'Validate Domain Name Portion of email address
            If isValid = False OrElse _
               ValidateString(domainPart, ValidateTests.HostNameChars) = False OrElse _
               domainPart.StartsWith("-") OrElse domainPart.StartsWith(".") OrElse domainPart.Contains("..") Then
                isValid = False
            Else
                domainSplit = domainPart.Split(CChar("."))
                tld = domainSplit(UBound(domainSplit))

                ' Top Level Domains must be at least two characters
                If tld.Length < 2 Then
                    isValid = False
                End If
            End If
        End If
    Else
        'If no value is passed review if required
        If isRequired = True Then
            isValid = False
        Else
            isValid = True
        End If
    End If

    Return isValid
End Function

Notes:

  • IsValidEmail is more restrictive about the characters it allows than the RFC, but it doesn't test for every possible invalid use of those characters

9 Solutions

#1


12  

If you're wondering why this question is generating so little activity, it's because there are so many other issues that should be dealt with before you start thinking about performance. Foremost among those is whether you should be using regexes to validate email addresses at all--and the consensus is that you should not. It's much trickier than most people expect, and probably pointless anyway.

Another problem is that your two regexes vary hugely in the kinds of strings they can match. For example, the second one is anchored at both ends, but the first isn't; it would match ">>>>foo@bar.com<<<<" because there's something that looks like an email address embedded in it. Maybe the framework forces the regex to match the whole string, but if that's the case, why is the second one anchored?
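
To make the anchoring point concrete, here is a minimal sketch (the module and function names are made up for this example). It runs the first pattern through Regex.IsMatch as posted and again wrapped in ^(?:...)$. As far as I know, the server-side RegularExpressionValidator only accepts a match that covers the whole input, which would explain why the unanchored pattern behaves in practice, but treat that as something to verify rather than a given.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module AnchorDemo
    ' First pattern from the question, exactly as posted (no anchors).
    Public Const Unanchored As String = "\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"

    ' Same pattern wrapped in ^(?:...)$ so it has to cover the whole input.
    Public Const Anchored As String = "^(?:" & Unanchored & ")$"

    Public Function Matches(ByVal pattern As String, ByVal input As String) As Boolean
        Return Regex.IsMatch(input, pattern)
    End Function

    Public Sub Main()
        Dim noisy As String = ">>>>foo@bar.com<<<<"
        Console.WriteLine(Matches(Unanchored, noisy))         ' True: finds the embedded address
        Console.WriteLine(Matches(Anchored, noisy))           ' False: the whole string must match
        Console.WriteLine(Matches(Anchored, "foo@bar.com"))   ' True
    End Sub
End Module
```

If you call Regex.IsMatch yourself instead of going through the validator control, the anchored form is the one you want.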

Another difference is that the first regex uses \w throughout, while the second uses [0-9a-zA-Z] in many places. In most regex flavors, \w matches the underscore in addition to letters and digits, but in some (including .NET) it also matches letters and digits from every writing system known to Unicode.
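
A quick way to see the \w difference for yourself is the sketch below (names are illustrative, not from the question). It checks a string of Cyrillic letters against ^\w+$ with the default options and again with RegexOptions.ECMAScript, which restricts \w to [a-zA-Z_0-9].

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module WordCharDemo
    ' True when every character of the input is matched by \w under the given options.
    Public Function IsAllWordChars(ByVal input As String, ByVal options As RegexOptions) As Boolean
        Return Regex.IsMatch(input, "^\w+$", options)
    End Function

    Public Sub Main()
        Dim cyrillic As String = "почта" ' letters outside the ASCII range
        Console.WriteLine(IsAllWordChars(cyrillic, RegexOptions.None))        ' True
        Console.WriteLine(IsAllWordChars(cyrillic, RegexOptions.ECMAScript))  ' False
        Console.WriteLine(IsAllWordChars("user_01", RegexOptions.ECMAScript)) ' True
    End Sub
End Module
```

Whether you want the Unicode behavior depends on whether you intend to accept internationalized addresses; either way it is worth choosing deliberately rather than by accident.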

There are many other differences, but that's academic; neither of those regexes is very good. See here for a good discussion of the topic, and a much better regex.

Getting back to the original question, I don't see a performance problem with either of those regexes. Aside from the nested-quantifiers anti-pattern cited in that BCL blog entry, you should also watch out for situations where two or more adjacent parts of the regex can match the same set of characters--for example,

([A-Za-z]+|\w+)@

There's nothing like that in either of the regexes you posted. Parts that are controlled by quantifiers are always broken up by other parts that aren't quantified. Both regexes will experience some avoidable backtracking, but there are many better reasons than performance to reject them.
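
If backtracking cost is the worry, one defensive option is to put a time budget on the match. This is only a sketch and it assumes .NET Framework 4.5 or later, where Regex.IsMatch has an overload that takes a TimeSpan; that overload is not available on the 3.5 runtime the question targets. SafeIsMatch and the 250 ms budget are made up for the example.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module TimeoutDemo
    ' Returns False instead of hanging when the match exceeds the time budget.
    Public Function SafeIsMatch(ByVal input As String, ByVal pattern As String, ByVal budget As TimeSpan) As Boolean
        Try
            Return Regex.IsMatch(input, pattern, RegexOptions.None, budget)
        Catch ex As RegexMatchTimeoutException
            Return False
        End Try
    End Function

    Public Sub Main()
        ' Second pattern from the question.
        Dim pattern As String = "^([0-9a-zA-Z]([-\.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$"
        Dim budget As TimeSpan = TimeSpan.FromMilliseconds(250)

        Console.WriteLine(SafeIsMatch("user@example.com", pattern, budget)) ' True

        ' No "@" anywhere, so this can never match. The result is False
        ' whether the engine finishes on its own or hits the timeout.
        Console.WriteLine(SafeIsMatch(New String("a"c, 40), pattern, budget))
    End Sub
End Module
```

Treating a timeout as "invalid" is a design choice; you may prefer to log it, since it usually means the pattern needs fixing rather than the input.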

EDIT: So the second regex is subject to catastrophic backtracking; I should have tested it thoroughly before shooting my mouth off. Taking a closer look at that regex, I don't see why you need the outer asterisk in the first part:

[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*

All that bit does is make sure the first and last characters are alphanumeric while allowing some additional characters in between. This version does nearly the same thing (it requires at least two characters, where the original also accepts a single one), but it fails much more quickly when no match is possible:

[0-9a-zA-Z][-.\w]*[0-9a-zA-Z]
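
One thing worth checking before swapping the two fragments: the original accepts a single alphanumeric character, while the simplified one needs at least two. The sketch below compares them on a few inputs. The anchors and the OptionalTail variant are my additions for the test, not part of the question or the fragments above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module LocalPartDemo
    ' Fragment from the question, anchored for testing.
    Public Const Original As String = "^[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*$"

    ' Simplified fragment without the outer asterisk.
    Public Const Simplified As String = "^[0-9a-zA-Z][-.\w]*[0-9a-zA-Z]$"

    ' Hypothetical variant: an optional tail keeps one-character inputs valid
    ' without reintroducing the nested quantifier.
    Public Const OptionalTail As String = "^[0-9a-zA-Z](?:[-.\w]*[0-9a-zA-Z])?$"

    Public Function Matches(ByVal pattern As String, ByVal input As String) As Boolean
        Return Regex.IsMatch(input, pattern)
    End Function

    Public Sub Main()
        For Each sample As String In New String() {"a", "a.b-c", "a."}
            Console.WriteLine("{0}: original={1} simplified={2} optionalTail={3}", _
                sample, Matches(Original, sample), Matches(Simplified, sample), Matches(OptionalTail, sample))
        Next
        ' a:     original=True  simplified=False optionalTail=True
        ' a.b-c: original=True  simplified=True  optionalTail=True
        ' a.:    original=False simplified=False optionalTail=False
    End Sub
End Module
```

If one-character local parts such as a@example.com matter to you, the optional-tail form is probably the safer replacement.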

That would probably suffice to eliminate the backtracking problem, but you could also make the part after the "@" more efficient by using an atomic group:

(?>(?:[0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+)[a-zA-Z]{2,9}

In other words, if you've matched all you can of substrings that look like domain components with trailing dots, and the next part doesn't look like a TLD, don't bother backtracking. The first character you would have to give up is the final dot, and you know [a-zA-Z]{2,9} won't match that.
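
Here is a small sketch of the atomic-group version applied to just the domain part. The ^ and $ anchors and the names are added for the example; the pattern body is the one shown above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module AtomicDomainDemo
    ' Atomic group around the dotted labels, followed by the TLD.
    Public Const DomainPattern As String = "^(?>(?:[0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+)[a-zA-Z]{2,9}$"

    Public Function IsDomain(ByVal input As String) As Boolean
        Return Regex.IsMatch(input, DomainPattern)
    End Function

    Public Sub Main()
        Console.WriteLine(IsDomain("mail.example.com")) ' True
        Console.WriteLine(IsDomain("example.c0m"))      ' False: the TLD part allows letters only
        Console.WriteLine(IsDomain("x.com"))            ' False: each label needs at least two characters
    End Sub
End Module
```

Note the last case: the label rule in this pattern rejects one-character domain labels, which do exist in the wild, so that may be another reason to loosen it.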

#2


8  

We use this RegEx which has been tested in-house against 1.5 million addresses. It correctly identifies better than 98% of ours, but there are some formats that I'm aware of that it would error on.

^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$

We also make sure that there are no EOL characters in the data since an EOL can fake out this RegEx. Our Function:

Public Function IsValidEmail(ByVal strEmail As String) As Boolean
    ' Check An eMail Address To Ensure That It Is Valid
    Const cValidEmail = "^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$"   ' 98% Of All Valid eMail Addresses
    IsValidEmail = False
    ' Take Care Of Blanks, Nulls & EOLs
    strEmail = Replace(Replace(Trim$(strEmail & " "), vbCr, ""), vbLf, "")
    ' Blank eMail Is Invalid
    If strEmail = "" Then Exit Function
    ' RegEx Test The eMail Address
    Dim regEx As New System.Text.RegularExpressions.Regex(cValidEmail)
    IsValidEmail = regEx.IsMatch(strEmail)
End Function
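
The EOL issue comes from how $ works in .NET: by default it matches at the very end of the string and also just before a final line feed. The sketch below shows that with the pattern above, and shows that ending the pattern with \z instead closes the gap without stripping anything. The module and constant names are made up for the example.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Public Module EndAnchorDemo
    ' Pattern from the answer, ending in $.
    Public Const DollarAnchored As String = "^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$"

    ' Same pattern ending in \z, which only matches at the true end of the input.
    Public Const StrictAnchored As String = "^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)\z"

    Public Function Matches(ByVal pattern As String, ByVal input As String) As Boolean
        Return Regex.IsMatch(input, pattern)
    End Function

    Public Sub Main()
        Dim withLineFeed As String = "user@example.com" & vbLf
        Console.WriteLine(Matches(DollarAnchored, withLineFeed))       ' True: $ tolerates a trailing line feed
        Console.WriteLine(Matches(StrictAnchored, withLineFeed))       ' False
        Console.WriteLine(Matches(StrictAnchored, "user@example.com")) ' True
    End Sub
End Module
```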

#3


2  

I am a newbie, but I tried the following, and it seems to limit the ".xxx" part to two occurrences or fewer after the '@' symbol.

^([a-zA-Z0-9]+[a-zA-Z0-9._%-]*@(?:[a-zA-Z0-9-])+(\.+[a-zA-Z]{2,4}){1,2})$

Note: I had to substitute single '\' with double '\\' as I am using this reg expr in R.

#4


1  

These don't check for all allowable email addresses according to the email address RFC.

#5


1  

I let MS do the work for me:

Public Function IsValidEmail(ByVal emailString As String) As Boolean
    Dim retval As Boolean = True
    Try
        Dim address As New System.Net.Mail.MailAddress(emailString)
    Catch ex As Exception
        retval = False
    End Try
    Return retval
End Function
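
One caveat with this approach: MailAddress also parses display-name forms such as "Jane Doe <jane@example.com>", so a successful parse does not mean the input was a bare address. A possible tightening, sketched below with a made-up function name, is to compare the parsed Address back to the input.

```vbnet
Imports System
Imports System.Net.Mail

Public Module MailAddressDemo
    ' True only when the input parses and is nothing more than the address itself.
    Public Function IsBareAddress(ByVal input As String) As Boolean
        If input Is Nothing OrElse input.Trim().Length = 0 Then Return False
        Try
            Dim parsed As New MailAddress(input)
            ' Reject inputs where the parser stripped a display name or angle brackets.
            Return String.Equals(parsed.Address, input, StringComparison.OrdinalIgnoreCase)
        Catch ex As FormatException
            Return False
        End Try
    End Function

    Public Sub Main()
        Console.WriteLine(IsBareAddress("user@example.com"))            ' True
        Console.WriteLine(IsBareAddress("Jane Doe <jane@example.com>")) ' False
        Console.WriteLine(IsBareAddress("not-an-email"))                ' False
    End Sub
End Module
```

Catching only FormatException, after guarding against empty input, keeps unrelated failures from being silently reported as "invalid email".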

#6


1  

For server side validation, I found Phil Haack's solution to be one of the better ones. His attempt was to stick to the RFC:

string pattern = @"^(?!\.)(""([^""\r\\]|\\[""\r\\])*""|"
            + @"([-a-z0-9!#$%&'*+/=?^_`{|}~]|(?<!\.)\.)*)(?<!\.)"
            + @"@[a-z0-9][\w\.-]*[a-z0-9]\.[a-z][a-z\.]*[a-z]$";

Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
return regex.IsMatch(emailAddress);

Details: http://blog.degree.no/2013/01/email-validation-finally-a-net-regular-expression-that-works/

#7


0  

Just to contribute, I am using this regex.

^([a-zA-Z0-9]+[a-zA-Z0-9._%-]*@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,4})$

#8


0  

The thing about it is the specifications are changing with each domain extension that is introduced.

You sit here modifying your regex, testing, testing, and testing some more. You finally get what you "think" is accurate, then the specification changes... You update your regex to account for the new requirements...

Then someone enters aa@aa.aa, and you've done all that work for what? It walks right through your fancy regex. Bummer!

You may as well just check for a single @ and a "." and move on. I assure you, you will not get someone's email if they do not want to give it up. You'll get garbage or their hotmail account they never check and couldn't care less about.
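
For what it's worth, the "single @ and a dot" check needs no regex at all. A rough sketch follows; the function name and the exact rules (the dot must come after the @, with at least one character on each side) are my own reading of the idea, not a standard.

```vbnet
Imports System

Public Module MinimalCheckDemo
    Public Function LooksLikeEmail(ByVal input As String) As Boolean
        If String.IsNullOrEmpty(input) Then Return False

        ' Exactly one "@", and not as the first character.
        Dim atIndex As Integer = input.IndexOf("@"c)
        If atIndex <= 0 OrElse atIndex <> input.LastIndexOf("@"c) Then Return False

        ' The last dot must sit after the "@" with a character between them,
        ' and must not be the final character.
        Dim dotIndex As Integer = input.LastIndexOf("."c)
        Return dotIndex > atIndex + 1 AndAlso dotIndex < input.Length - 1
    End Function

    Public Sub Main()
        Console.WriteLine(LooksLikeEmail("aa@aa.aa"))      ' True
        Console.WriteLine(LooksLikeEmail("a@b@c.com"))     ' False: two "@"
        Console.WriteLine(LooksLikeEmail("user@example.")) ' False: nothing after the dot
        Console.WriteLine(LooksLikeEmail("user@"))         ' False: no dot after "@"
    End Sub
End Module
```

It lets plenty of junk through, which is the point: anything stricter is better handled by sending a confirmation message.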

I've seen this go horribly wrong in many cases: a client calls up because their own email address was rejected by a poorly crafted regex check, which, as mentioned, shouldn't even have been attempted.

#9


0  

TextBox:

<asp:TextBox ID="txtemail" runat="server" CssClass="form-control pantxt" Placeholder="Enter Email Address"></asp:TextBox>

Required field validator:

<asp:RequiredFieldValidator ID="RequiredFieldValidator9" runat="server" ControlToValidate="txtemail" ErrorMessage="Required"></asp:RequiredFieldValidator>

Regular expression for email validation:

<asp:RegularExpressionValidator ID="validateemail" runat="server" ControlToValidate="txtemail" ValidationExpression="\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*" ErrorMessage="Invalid Email"></asp:RegularExpressionValidator>

Use this regular expression for email validation in ASP.NET.
