Parsing the "From" address out of email text

Date: 2022-10-23 18:17:44

I'm trying to extract email addresses from plain text transcripts of emails. I've cobbled together a bit of code to find the addresses themselves, but I don't know how to make it discriminate between them; right now it just spits out all email addresses in the file. I'd like to make it so it only spits out addresses that are preceded by "From:" and a few wildcard characters, and that end with ">" (because the emails are set up as From [name]<[email]>).

Here's the code now:

import re  # allows program to use regular expressions

foundemail = []  # this is an empty list

# do not currently know exact meaning of this expression but assuming
# it means something like "[stuff]@[stuff][stuff1-4 letters]"
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

# "line" is a variable that is set to a single line read from the file
# ("text.txt"):
for line in open("text.txt"):
    # this extends the previously named list with the matches found by
    # the "mailsrch" pattern, which was compiled above
    foundemail.extend(mailsrch.findall(line))

print(foundemail)

8 answers

#1


If your goal is actually to extract email addresses from text, you should use a library built for that purpose. Regular expressions are not well suited to match arbitrary email addresses.

But if you're doing this as an exercise to understand regular expressions better, I'd take the approach of expanding the expression you're using to include the extra text you want to match. So first, let me explain what that regex does:

[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}
  • [\w\-] matches any "word" character (letter, number, or underscore), or a hyphen
  • [\w\-\.]+ matches (any word character or hyphen or period) one or more times
  • @ matches a literal '@'
  • [\w\-] matches any word character or hyphen
  • [\w\-\.]+ matches (any word character or hyphen or period) one or more times
  • [a-zA-Z]{1,4} matches 1, 2, 3, or 4 lowercase or uppercase letters

So this matches a sequence of a "word" that may contain hyphens or periods but doesn't start with a period, followed by an @ sign, followed by another "word" (same sense as before) that ends with a letter.
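
To see that description in action, here is a small sketch that runs the original pattern over a made-up sentence. The sample text and addresses are assumptions for illustration, not taken from the asker's file.

```python
import re

# The pattern from the question, unchanged.
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

# Hypothetical sample text (not from the asker's file).
sample = "Reach alice@example.com, or .bob@test.org."

# findall returns every non-overlapping match as a string.
found = mailsrch.findall(sample)
print(found)  # ['alice@example.com', 'bob@test.org']
```

Note how the leading period before "bob" is skipped (the first character class does not allow a period) and the sentence-ending period is dropped (the match has to end with a letter).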

Now, to modify this for your purposes, let's add regex parts to match "From", the name, and the angle brackets:

From: [\w\s]+?<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>
  • From: matches the literal text "From: "
  • [\w\s]+? matches one or more consecutive word characters or space characters. The question mark makes the match non-greedy, so it will match as few characters as possible while still allowing the whole regular expression to match (in this case, it's probably not necessary, but it does make the match more efficient since the thing that comes immediately afterwards is not a word character or space character).
  • < matches a literal less-than sign (opening angle bracket)
  • The same regular expression you had before is now surrounded by parentheses. This makes it a capturing group, so you can call m.group(1) to get the text matched by that part of the regex.
  • > matches a literal greater-than sign (closing angle bracket)

Since the regex now uses capturing groups, your code will need to change a little as well:

import re
foundemail = []

mailsrch = re.compile(r'From: [\w\s]+?<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>')

for line in open("text.txt"):
    foundemail.extend([m.group(1) for m in mailsrch.finditer(line)])

print(foundemail)

The code [m.group(1) for m in mailsrch.finditer(line)] produces a list out of the first capturing group (remember, that was the part in parentheses) from each match found by the regular expression.
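
As a quick check of that behaviour, the sketch below runs the same pattern over a few in-memory lines instead of text.txt. The sample lines are hypothetical; the third one is included to show a limitation, namely that [\w\s] does not cover punctuation inside the display name.

```python
import re

# The extended pattern from this answer, unchanged.
mailsrch = re.compile(
    r'From: [\w\s]+?<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>'
)

# Hypothetical transcript lines standing in for text.txt.
lines = [
    "From: Van Gale <vg@m.com>\n",
    "To: Bob Smith <bob@example.com>\n",   # not a From line
    "From: O'Brien <pat@example.com>\n",   # apostrophe is not in [\w\s]
]

foundemail = []
for line in lines:
    # group(1) is the part of each match inside the parentheses
    foundemail.extend([m.group(1) for m in mailsrch.finditer(line)])

print(foundemail)  # ['vg@m.com']
```

If names with apostrophes, commas, or quotes matter for your data, the name part of the pattern would need to be loosened, or one of the library-based answers may be a better fit.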

#2


Try this out:

>>> from email.utils import parseaddr

>>> parseaddr('From: vg@m.com')
('', 'vg@m.com')

>>> parseaddr('From: Van Gale <vg@m.com>')
('Van Gale', 'vg@m.com')

>>> parseaddr('    From: Van Gale <vg@m.com>   ')
('Van Gale', 'vg@m.com')

>>> parseaddr('blah abdf    From: Van Gale <vg@m.com>   and this')
('Van Gale', 'vg@m.com')

Unfortunately it only finds the first email address on each line because it's expecting header lines, but maybe that's OK?
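
If several addresses can appear in one header value, the standard library's email.utils.getaddresses may be worth a look. This is only a sketch: the header value is made up, and it assumes the "From:" label has already been stripped from the line.

```python
from email.utils import getaddresses

# Hypothetical header value with two addresses (label already removed).
header_value = "Van Gale <vg@m.com>, Ann Lee <ann@example.org>"

# getaddresses takes a list of header values and returns
# a list of (display name, address) pairs.
pairs = getaddresses([header_value])
addresses = [addr for name, addr in pairs]
print(addresses)
```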

#3


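import email
from email.utils import parseaddr

# raw_text holds the full text of one message (headers, blank line, body)
msg = email.message_from_string(raw_text)

# or, to read the message from a file instead:
# with open(path) as f:
#     msg = email.message_from_file(f)

msg['from']  # the raw From header; header lookup is case-insensitive

# and optionally split it into (display name, address)
addr = parseaddr(msg['from'])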
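
For a self-contained run of the same idea, here is a minimal sketch with a made-up message. The message text and the variable names raw_text and sender are assumptions for illustration.

```python
import email
from email.utils import parseaddr

# Hypothetical message: headers, a blank line, then the body.
raw_text = (
    "From: Van Gale <vg@m.com>\n"
    "To: someone@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Body text goes here.\n"
)

msg = email.message_from_string(raw_text)

# Split the From header into (display name, address).
sender = parseaddr(msg['from'])
print(sender)
```

This only works when the file really is laid out like a mail message, with the headers at the top; for loose transcripts the line-by-line answers are probably easier to adapt.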

#4


mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

Expression breakdown:

[\w-]: any word character (alphanumeric, plus underscore) or a dash

[\w-.]+: any word character, a dash, or a period/dot, one or more times

@: literal @ symbol

[\w-][\w-.]+: any word char or dash, followed by any word char, dash, or period one or more times.

[a-zA-Z]{1,4}: any alphabetic character 1-4 times.

To make this match only lines starting with From:, and wrapped in < and > symbols:

import re

foundemail = []
mailsrch = re.compile(r'^From:\s+.*<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>', re.I | re.M)
foundemail.extend(mailsrch.findall(open('text.txt').read()))

print(foundemail)
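
To illustrate what the re.I and re.M flags contribute, here is the same pattern applied to a multi-line string held in memory. The sample text is hypothetical and stands in for the contents of text.txt.

```python
import re

mailsrch = re.compile(
    r'^From:\s+.*<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>',
    re.I | re.M,
)

# Hypothetical file contents.
text = (
    "From: Van Gale <vg@m.com>\n"
    "To: Bob Smith <bob@example.com>\n"
    "> From: quoted <old@example.net>\n"   # does not start the line
    "from: Ann Lee <ann@example.org>\n"    # lower case, matched via re.I
)

# With re.M, ^ matches at the start of every line, not just the string.
foundemail = mailsrch.findall(text)
print(foundemail)  # ['vg@m.com', 'ann@example.org']
```

Because the pattern has exactly one capturing group, findall returns just the captured addresses rather than the whole matched line.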

#5


Use the email and mailbox packages to parse the plain text version of the email. This will convert it to an object that will enable you to extract all the addresses in the 'From' field.

You can also do a lot of other analysis on the message, if you need to process other header fields, or the message body.

As a quick example, the following (untested) code should read all the messages in a Unix-style mailbox and print all the 'From' headers.

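import mailbox

# mailbox.mbox reads a Unix-style mbox file; it takes the place of the
# Python 2-only PortableUnixMailbox class.
# filename is the path to your mbox file.
mbox = mailbox.mbox(filename)

for msg in mbox:
    # "from" is a reserved word in Python, so use another variable name
    sender = msg['From']
    print(sender)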
#6


Roughly speaking, you can:

from email.utils import parseaddr

foundemail = []
for line in open("text.txt"):
    if not line.startswith("From:"): continue
    n, e = parseaddr(line)
    foundemail.append(e)
print(foundemail)

This uses Python's built-in parseaddr function to parse the address out of the From line (as demonstrated by other answers), without necessarily incurring the overhead of parsing the entire message (e.g. by using the more full-featured email and mailbox packages). The script simply skips any line that does not begin with "From:". Whether the overhead matters to you depends on how big your input is and how often you will be doing this operation.
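
A slightly more defensive variant, sketched below, removes the "From:" label before calling parseaddr rather than relying on parseaddr to tolerate it, and skips lines where no address was found. The function name from_addresses and the sample lines are hypothetical.

```python
from email.utils import parseaddr

def from_addresses(lines):
    """Collect the address from every line that starts with 'From:'."""
    found = []
    for line in lines:
        if not line.startswith("From:"):
            continue
        # Drop the header label so parseaddr sees only the value.
        header_value = line[len("From:"):].strip()
        name, addr = parseaddr(header_value)
        if addr:  # parseaddr gives an empty string when parsing fails
            found.append(addr)
    return found

sample_lines = [
    "From: Van Gale <vg@m.com>\n",
    "To: someone@example.com\n",
    "From: ann@example.org\n",
    "Hello there\n",
]
print(from_addresses(sample_lines))
```

Passing open("text.txt") instead of sample_lines would give the same line-by-line behaviour as the script above.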

#7


If you can be reasonably sure that lines containing these email addresses start with optional whitespace followed by "From:", you can simply do this:

addresslines = []
for line in open("text.txt"):
    if line.strip().startswith("From:"):
        addresslines.append(line)

Then later, or while adding them to the list, you can refine the addresslines items to give exactly what you want.
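
One possible refinement step, shown here only as a sketch, is to keep the text between the angle brackets. The helper name address_between_brackets and the sample lines are hypothetical, and the approach assumes the [name]<[email]> layout described in the question.

```python
def address_between_brackets(line):
    """Return the text between the first '<' and the next '>', or None."""
    start = line.find("<")
    end = line.find(">", start + 1)
    if start == -1 or end == -1:
        return None
    return line[start + 1:end].strip()

# Hypothetical contents of addresslines after the loop above.
addresslines = [
    "   From: Van Gale <vg@m.com>\n",
    "From: no brackets here\n",
]
refined = [address_between_brackets(line) for line in addresslines]
print(refined)
```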

#8


"[stuff]@[stuff][stuff1-4 letters]" is about right, but if you wanted to you could decode the regular expression using a trick I just found out about, here. Do the compile() in an interactive Python session like this:

mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', 128)

It will print out the following:

in 
  category category_word
  literal 45
max_repeat 1 65535 
  in 
    category category_word
    literal 45
    literal 46
literal 64 
in 
  category category_word
  literal 45
max_repeat 1 65535 
  in 
    category category_word
    literal 45
    literal 46
max_repeat 1 4 
  in 
    range (97, 122)
    range (65, 90)

Which, if you can kind of get used to it, shows you exactly how the RE works.
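
For reference, the number 128 passed to compile() above is the value of the named flag re.DEBUG, so the same trick can be written a little more readably. This is a small sketch; the exact layout of the debug dump differs between Python versions, so no output is reproduced here, and the test sentence is made up.

```python
import re

pattern = r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}'

# re.DEBUG is the named form of the flag value 128;
# compiling with it prints the parsed pattern to standard output.
mailsrch = re.compile(pattern, re.DEBUG)

# The compiled pattern still works as usual afterwards.
match = mailsrch.search("write to vg@m.com today")
print(match.group(0))
```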
