I am running Python 2.7 on Windows.
I have a large (2 GB) text file containing records for 500K+ emails. The file has no explicit file type and is in the format:
email_message#: 1
email_message_sent: 10/10/1991 02:31:01
From: tomf@abc.com| Tom Foo |abc company|
To: adee@abc.com| Alex Dee |abc company|
To: benfor12@xyz.com| Ben For |xyz company|
email_message#: 2
email_message_sent: 10/12/1991 01:28:12
From: timt@abc.com| Tim Tee |abc company|
To: tomf@abc.com| Tom Foo |abc company|
To: adee@abc.com| Alex Dee |abc company|
To: benfor12@xyz.com| Ben For|xyz company|
email_message#: 3
email_message_sent: 10/13/1991 12:01:16
From: benfor12@xyz.com| Ben For |xyz company|
To: tomfoo@abc.com| Tom Foo |abc company|
To: t212@123.com| Tatiana Xocarsky |numbers firm |
...
As you can see, each email has the following data associated with it:
1) the time it was sent
2) the email address that sent it
3) the name of the person who sent it
4) the company that person works for
5) every email address that received the email
6) the name of every person who received the email
7) the company of every person who received the email
In the text file there are 500K+ emails, and a single email can have up to 16K recipients. There is no consistent pattern in how the emails refer to people's names or the companies they work for.
I would like to take this large file and manipulate it in Python so that it ends up as a pandas DataFrame. I would like the DataFrame in a format like the Excel screenshot below:
EDIT
My plan to solve this is to write a "parser" that takes this text file, reads each line, and assigns the text in each line to particular columns of a pandas DataFrame.
I plan to write something like the below. Can someone confirm that this is the correct way to go about executing this? I want to make sure I am not missing a built-in pandas function or a function from a different module.
#connect to object
data = open('.../Emails', 'r')

#build empty dataframe
import pandas as pd
df = pd.DataFrame()

#read lines of the object and put pieces of text into the
# correct column of the dataframe
for line in data:
    # iterating the file already yields one line at a time; calling
    # data.readline() inside this loop would skip every other line
    if line.startswith("email_message#:"):
        pass  #put a slice of the text into the dataframe
    elif line.startswith("email_message_sent:"):
        pass  #put a slice of the text into the dataframe
    elif line.startswith("From:"):
        pass  #put slices of the text into the dataframe
    elif line.startswith("To:"):
        pass  #put slices of the text into the dataframe
2 Answers
#1
I could not resist the itch, so here is my approach.
from __future__ import unicode_literals
import io
import pandas as pd
from pandas.compat import string_types

def iter_fields(buf):
    # split each line into a (key, value) pair at the first colon
    for l in buf:
        yield l.rstrip('\n\r').split(':', 1)

def iter_messages(buf):
    it = iter_fields(buf)
    k, v = next(it)
    while True:
        # header fields of one message
        n = int(v)
        _, v = next(it)
        date = pd.Timestamp(v)
        _, v = next(it)
        from_add, from_name, from_comp = [s.strip() for s in v.split('|')[:-1]]
        k, v = next(it)
        # one output row per recipient; the StopIteration raised by
        # next(it) at end of file terminates the generator (Python 2)
        while k == 'To':
            to_add, to_name, to_comp = [s.strip() for s in v.split('|')[:-1]]
            yield (n, date, from_add, from_name, from_comp,
                   to_add, to_name, to_comp)
            k, v = next(it)

def _read_email_headers(buf):
    columns = ['email_message#', 'email_message_sent',
               'from_address', 'from_name', 'from_company',
               'to_address', 'to_name', 'to_company']
    return pd.DataFrame(iter_messages(buf), columns=columns)

def read_email_headers(path_or_buf):
    # accept either a file path or an open file-like object
    close_buf = False
    if isinstance(path_or_buf, string_types):
        path_or_buf = io.open(path_or_buf)
        close_buf = True
    try:
        return _read_email_headers(path_or_buf)
    finally:
        if close_buf:
            path_or_buf.close()
This is how you would use it:
df = read_email_headers('.../data_file')
Just call it with the path to your file and you have your dataframe.
Now, what follows is for test purposes only; you wouldn't do this to work with your actual data in real life.
Since I (or a random reader) do not have a copy of your file, I have to fake it using a string:
text = '''email_message#: 1
email_message_sent: 10/10/1991 02:31:01
From: tomf@abc.com| Tom Foo |abc company|
To: adee@abc.com| Alex Dee |abc company|
To: benfor12@xyz.com| Ben For |xyz company|
email_message#: 2
email_message_sent: 10/12/1991 01:28:12
From: timt@abc.com| Tim Tee |abc company|
To: tomf@abc.com| Tom Foo |abc company|
To: adee@abc.com| Alex Dee |abc company|
To: benfor12@xyz.com| Ben For|xyz company|'''
Then I can create a file-like object and pass it to the function:
df = read_email_headers(io.StringIO(text))
print(df.to_string())
email_message# email_message_sent from_address from_name from_company to_address to_name to_company
0 1 1991-10-10 02:31:01 tomf@abc.com Tom Foo abc company adee@abc.com Alex Dee abc company
1 1 1991-10-10 02:31:01 tomf@abc.com Tom Foo abc company benfor12@xyz.com Ben For xyz company
2 2 1991-10-12 01:28:12 timt@abc.com Tim Tee abc company tomf@abc.com Tom Foo abc company
3 2 1991-10-12 01:28:12 timt@abc.com Tim Tee abc company adee@abc.com Alex Dee abc company
4 2 1991-10-12 01:28:12 timt@abc.com Tim Tee abc company benfor12@xyz.com Ben For xyz company
Or, if I wanted to work with an actual file:
with io.open('test_file.txt', 'w') as f:
    f.write(text)

df = read_email_headers('test_file.txt')
print(df.to_string())  # Same output as before.
But, again, you do not have to do this to use the function with your data. Just call it with a file path.
#2
I don't know the absolute best way to do this. You're certainly not overlooking an obvious one-liner, which may reassure you.
It looks like your current parser (call it my_parse) does all the processing. In pseudocode:
finished_df = my_parse(original_text_file)
However, for such a large file, this is a little like cleaning up after a hurricane with tweezers. A two-stage solution may be faster: first roughly hew the file into the structure you want, then use pandas Series operations to refine the rest. Continuing the pseudocode, you could do something like the following:
rough_df = rough_parse(original_text_file)
finished_df = refine(rough_df)
where rough_parse uses Python standard-library stuff, and refine uses pandas Series operations, particularly the Series.str methods.
I would suggest that the main goal of rough_parse be simply to achieve a one-email-per-row structure. So basically you'd go through and replace every newline character with some unique delimiter that appears nowhere else in the file, like "$%$%$", except where the text after the newline starts with "email_message#:".
Then Series.str is really good at wrangling the rest of the strings into whatever shape you want.
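For illustration, here is a rough sketch of that two-stage idea. The names rough_parse and refine come from the pseudocode above; the file name is a placeholder, the regexes assume the exact field layout shown in the question, and the extract calls assume pandas >= 0.18 for the expand keyword:

import pandas as pd

DELIM = '$%$%$'  # a delimiter that appears nowhere else in the file

def rough_parse(path):
    # Stage 1 (standard library only): build one string per email by
    # joining each email's lines with DELIM instead of newlines.
    records = []
    with open(path) as f:
        for line in f:
            line = line.rstrip('\r\n')
            if line.startswith('email_message#:'):
                records.append(line)         # start a new email record
            elif records:
                records[-1] += DELIM + line  # continue the current record
    return pd.DataFrame({'raw': records})

def refine(rough_df):
    # Stage 2 (vectorized Series.str operations): pull fields out of the
    # one-email-per-row strings. Since '$' never occurs in the data,
    # [^$]+ captures everything up to the next DELIM.
    raw = rough_df['raw']
    out = pd.DataFrame()
    out['email_message#'] = raw.str.extract(r'email_message#:\s*(\d+)', expand=False)
    out['email_message_sent'] = raw.str.extract(r'email_message_sent:\s*([^$]+)', expand=False)
    out['from'] = raw.str.extract(r'From:\s*([^$]+)', expand=False)
    # the To: fields would still need a Series.str.split on DELIM (an
    # email can have thousands of recipients), which is omitted here
    return out

finished_df = refine(rough_parse('emails.txt'))  # placeholder file name

Whether this beats a single-pass parser depends on the data: the Series.str pass avoids Python-level loops, but it does materialize all the 'raw' strings in memory.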