Parsing email content with regular expressions

Every day I receive thousands of emails, and I want to parse the body of each one and load the data into a database.

My problem is that I currently parse the email body manually, and I would like to replace that logic with a regular expression in C#.

Here is the body of the emails:

Gentilissima Agenzia Nexity Residenziale

il nostro utente:

Sig./Sig.ra :Pablo Azorin

Email: pabloazorin@gmail.com

Tel.: 02322-498900

sta cercando un immobile con le seguenti caratteristiche:

Categoria: Residenziale

Tipologia: Villa

Tipo di contratto: Vendita

Comune: Assago Prov. Milano

Zona: non specificata

Fascia di prezzo: non specificata

I need to extract the text in bold (the value after each label), and I think a regex is what I need for this.

I look forward to your suggestions on how to make this work.

Thanks!

--Pablo

7 solutions

#1

Assuming the non-bold parts of the email (the labels) appear exactly like this in every message, you can capture all the values with this regex:

Sig\./Sig\.ra :(.*)
Email: (.*)
Tel\.: (.*)
sta cercando un immobile con le seguenti caratteristiche:
Categoria: (.*)
Tipologia: (.*)
Tipo di contratto: (.*)
Comune: (.*)
Zona: (.*)
Fascia di prezzo: (.*)

In C#

Regex regexObj = new Regex(@"Sig\./Sig\.ra :(.*)
Email: (.*)
Tel\.: (.*)
sta cercando un immobile con le seguenti caratteristiche:
Categoria: (.*)
Tipologia: (.*)
Tipo di contratto: (.*)
Comune: (.*)
Zona: (.*)
Fascia di prezzo: (.*)");
Match matchObj = regexObj.Match(subjectString);
string Sig = matchObj.Groups[1].Value;
string Email = matchObj.Groups[2].Value;
// and so on to get all the other parts

#2

Read Mastering Regular Expressions. It will teach you everything you need to know to complete this and other similar regex problems, and will give you enough understanding and insight to get you started writing much more complicated regular expressions.

#3

For email downloading I used Mailbee .Net objects. This library is quite easy to use and is well documented. But if you want to avoid programming you can also use an email parser like EmailParser2Database.

#4

If the emails always have the same format, you can do this in a number of ways. A simple one is to split on newlines and take a substring of each line, starting after the label.
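
The split-and-substring idea could look roughly like the sketch below. It is only an illustration: the class and method names (LineSplitParser, ExtractValues) are made up here, and it assumes every value sits on the same line as its label, right after the first colon.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class LineSplitParser
{
    // Returns the trimmed text after the first ':' of every line that has one.
    public static List<string> ExtractValues(string body)
    {
        var values = new List<string>();
        // Accept both Windows and Unix line endings; drop empty lines.
        string[] lines = body.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            if (colon < 0) continue;                  // no label on this line
            string value = line.Substring(colon + 1).Trim();
            if (value.Length > 0) values.Add(value);  // skips lines such as "il nostro utente:"
        }
        return values;
    }
}
```

The values come back in the order they appear in the email, so this only works while the format stays fixed.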

With regexes, you would probably write a pattern with a number of named captures. You can then index into the Groups property of the match by each group's name to get its value. This is a little more complex, of course.
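
As a rough sketch of the named-capture approach, here is a pattern covering only the first three fields of the sample email. The group names (name, email, phone) and the class name are illustrative choices, and the pattern assumes the fields always appear in this order, separated only by whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class NamedGroupParser
{
    // [^\r\n]* captures the rest of the line; \s+ skips the line break(s)
    // and any blank lines before the next label.
    private static readonly Regex ContactRegex = new Regex(
        @"Sig\./Sig\.ra\s*:[ \t]*(?<name>[^\r\n]*)\s+" +
        @"Email:[ \t]*(?<email>[^\r\n]*)\s+" +
        @"Tel\.:[ \t]*(?<phone>[^\r\n]*)");

    // Returns the Match; check Success before reading the groups.
    public static Match MatchContact(string body) => ContactRegex.Match(body);
}
```

After a successful match, m.Groups["email"].Value holds the address; calling Trim() on each value removes any trailing spaces. The remaining fields (Categoria, Tipologia, and so on) could be added to the pattern in the same way.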

#5

I think it would be much better to split the string into an array of lines. Initialize a dictionary with all the titles as keys, search each line for a title from the dictionary ("Email:", for example), and store the rest of the line in the dictionary as that key's value. At the end you will have a dictionary of all the titles and their values. I don't think you need a regex for that, and this way the order of the titles won't matter.
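
A minimal sketch of this dictionary approach follows. The label list is copied from the sample email in the question; the class and method names are hypothetical, and a label that never appears simply keeps a null value.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class LabelDictionaryParser
{
    // Labels as they appear in the sample email, without the trailing colon.
    private static readonly string[] Labels =
    {
        "Sig./Sig.ra", "Email", "Tel.", "Categoria", "Tipologia",
        "Tipo di contratto", "Comune", "Zona", "Fascia di prezzo"
    };

    public static Dictionary<string, string> Parse(string body)
    {
        var result = new Dictionary<string, string>();
        foreach (string label in Labels) result[label] = null;

        foreach (string rawLine in body.Split('\n'))
        {
            string line = rawLine.Trim();   // also removes a trailing '\r'
            foreach (string label in Labels)
            {
                if (!line.StartsWith(label, StringComparison.Ordinal)) continue;
                // Allow optional spaces between the label and its colon.
                string rest = line.Substring(label.Length).TrimStart();
                if (!rest.StartsWith(":", StringComparison.Ordinal)) continue;
                result[label] = rest.Substring(1).Trim();
                break;
            }
        }
        return result;
    }
}
```

Because each line is matched against the label list independently, the fields can arrive in any order, and unknown lines are ignored.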

#6

We found that for spam filtering and other high-volume applications, regular expressions are a bit slow for parsing MIME headers, which is what you want to do. The code is somewhat specialized, but I wrote a C state machine for doing the parsing which is as fast as you'll get without going to something like re2c. The code is not for the faint of heart, but it is blindingly fast.

For emails I think you'll find an explicit state machine is easier to work with than regular expressions. It's also the last refuge of the goto statement!

#7

You really don't want to do this manually, or with regular expressions. There are many different ways to encode data in an email, and many emails that don't strictly conform to the spec can still be parsed. I have had success with AnPOP in a .NET environment.