I have 400 million tweets (actually I think it's closer to 450 million, but never mind), in the form:
T "timestamp"
U "username"
W "actual tweet"
I want to write them to a file, initially in the form "username \t tweet", and then load that into a DB. The problem is that before loading into the DB there are a few things I need to do: 1. preprocess the tweet to remove RT@[names] and URLs, and 2. take the username out of "http://twitter.com/username".
I am using Python, and this is the code. Please let me know how it can be made faster :)
'''The aim is to take all the tweets of a user and store them in a table. Do this for all the users and then let's see what we can do with it.
What you want to do is get enough information about a user so that you can profile them better. So, let's get started.
'''
def regexSub(line):
    line = re.sub(regRT, '', line)
    line = re.sub(regAt, '', line)
    line = line.lstrip(' ')
    line = re.sub(regHttp, '', line)
    return line
def userName(line):
    return line.split('http://twitter.com/')[1]
import sys, os, itertools, re
data = open(sys.argv[1], 'r')
processed = open(sys.argv[2], 'w')
global regRT
regRT = 'RT'
global regHttp
regHttp = re.compile('(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')
global regAt
regAt = re.compile('@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*')
for line1, line2, line3 in itertools.izip_longest(*[data]*3):
    line1 = line1.split('\t')[1]
    line2 = line2.split('\t')[1]
    line3 = line3.split('\t')[1]
    #print 'line1', line1
    #print 'line2 =', line2
    #print 'line3 =', line3
    #print 'line3 before preprocessing', line3
    try:
        tweet = regexSub(line3)
        user = userName(line2)
    except:
        print 'Line2 is ', line2
        print 'Line3 is', line3
        continue  # skip this record; otherwise user/tweet may be undefined below
    #print 'line3 after processing', line3
    processed.write(user.strip("\n") + "\t" + tweet)
I ran the code in the following manner:
python -m cProfile -o profile_dump TwitterScripts/Preprocessing.py DATA/Twitter/t082.txt DATA/Twitter/preprocessed083.txt
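The dump can then be loaded and sorted by call count with pstats; a minimal sketch, assuming the profile_dump file name from the command above:
import pstats
stats = pstats.Stats('profile_dump')
stats.sort_stats('calls').print_stats()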
This is the output I get (warning: it's pretty big, and I did not filter out the small values, thinking they may also hold some significance):
Sat Jan 7 03:28:51 2012 profile_dump
3040835560 function calls (3040835523 primitive calls) in 2500.613 CPU seconds
Ordered by: call count
ncalls tottime percall cumtime percall filename:lineno(function)
528840744 166.402 0.000 166.402 0.000 {method 'split' of 'str' objects}
396630560 81.300 0.000 81.300 0.000 {method 'get' of 'dict' objects}
396630560 326.349 0.000 439.737 0.000 /usr/lib64/python2.7/re.py:229(_compile)
396630558 255.662 0.000 1297.705 0.000 /usr/lib64/python2.7/re.py:144(sub)
396630558 602.307 0.000 602.307 0.000 {built-in method sub}
264420442 32.087 0.000 32.087 0.000 {isinstance}
132210186 34.700 0.000 34.700 0.000 {method 'lstrip' of 'str' objects}
132210186 27.296 0.000 27.296 0.000 {method 'strip' of 'str' objects}
132210186 181.287 0.000 1513.691 0.000 TwitterScripts/Preprocessing.py:4(regexSub)
132210186 79.950 0.000 79.950 0.000 {method 'write' of 'file' objects}
132210186 55.900 0.000 113.960 0.000 TwitterScripts/Preprocessing.py:10(userName)
313/304 0.000 0.000 0.000 0.000 {len}
I removed the rows whose counts were really low (like 1, 3, and so on).
Please tell me what other changes can be made. Thanks!
4 Answers
#1
7
This is what multiprocessing is for.
You have a pipeline that can be broken into a large number of small steps. Each step is a Process which gets an item from the pipe, does a small transformation, and puts an intermediate result onto the next pipe.
You'll have a Process which reads the raw file three lines at a time and puts the three lines into a Pipe. That's all.
You'll have a Process which gets a (T, U, W) triple from the pipe, cleans up the user line, and puts it into the next pipe.
Etc., etc.
Don't build too many steps to start with. Read - transform - write is a good beginning, to be sure you understand the multiprocessing module. After that, it's an empirical study to find out what the optimum mix of processing steps is.
When you fire this thing up, it will spawn a number of communicating sequential processes that will consume all of your CPU resources but process the file relatively quickly.
Generally, more processes working concurrently is faster. You eventually reach a limit because of OS overheads and memory limitations.
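Here is a minimal sketch of such a read - transform - write pipeline, using multiprocessing.Queue rather than raw Pipes for simplicity; regexSub and userName are assumed to be the functions from the question, and the file names are just placeholders:
import itertools
import multiprocessing

def reader(path, out_q):
    # read the raw file three lines (T, U, W) at a time and push each triple
    data = open(path)
    for t, u, w in itertools.izip_longest(*[data] * 3):
        out_q.put((t, u, w))
    data.close()
    out_q.put(None)                       # sentinel: no more input

def transformer(in_q, out_q):
    # clean each triple and push a "username \t tweet" record;
    # userName and regexSub are the functions defined in the question
    while True:
        triple = in_q.get()
        if triple is None:                # propagate the shutdown signal
            out_q.put(None)
            break
        t, u, w = triple
        user = userName(u.split('\t')[1])
        tweet = regexSub(w.split('\t')[1])
        out_q.put(user.strip('\n') + '\t' + tweet)

def writer(in_q, path):
    out = open(path, 'w')
    while True:
        record = in_q.get()
        if record is None:
            break
        out.write(record)
    out.close()

if __name__ == '__main__':
    raw_q = multiprocessing.Queue()
    clean_q = multiprocessing.Queue()
    stages = [multiprocessing.Process(target=reader, args=('tweets.txt', raw_q)),
              multiprocessing.Process(target=transformer, args=(raw_q, clean_q)),
              multiprocessing.Process(target=writer, args=(clean_q, 'processed.txt'))]
    for p in stages:
        p.start()
    for p in stages:
        p.join()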
#2
3
Until you run it through a profiler, it is difficult to know what needs to be changed. However, I would suggest that the most likely slowdowns occur where you are creating and running the regular expressions.
Since your file follows a specific format, you may see significant speed increases by using a lex+yacc combo. If you use python lex+yacc, you won't see as much of a speed increase, but you won't need to muck about with c code.
If this seems like overkill, try compiling the regular expressions before you start the loop. You can also have chunks of the file run by independent worker threads/processes.
Again though, profiling will reveal what actually is causing the bottleneck. Find that out first, then see if these options will solve the problem.
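A minimal sketch of the precompilation point: in the question's code regRT is still a plain string, so every re.sub call goes through re's internal cache lookup (the re.py:229(_compile) and dict.get rows in the profile); compiling all three patterns once and calling .sub on the compiled objects avoids that per-call overhead.
import re

# compile all three patterns once, at module level
regRT = re.compile(r'RT')
regAt = re.compile(r'@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*')
regHttp = re.compile(r'(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')

def regexSub(line):
    # calling .sub on the compiled objects skips re.sub's per-call lookup
    line = regRT.sub('', line)
    line = regAt.sub('', line)
    line = line.lstrip(' ')
    return regHttp.sub('', line)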
#3
3
str.lstrip is probably not doing what you were expecting:
>>> 'http://twitter.com/twitty'.lstrip('http://twitter.com/')
'y'
from the docs:
S.lstrip([chars]) -> string or unicode
Return a copy of the string S with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping
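In other words, the argument to lstrip is treated as a set of characters, not as a prefix; a minimal sketch of removing the prefix explicitly instead:
prefix = 'http://twitter.com/'
line = 'http://twitter.com/twitty'
if line.startswith(prefix):
    username = line[len(prefix):]    # 'twitty', not 'y'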
#4
1
Looking at the profiling information, you're spending a lot of time in regexSub. You may find that you can combine your regexps into a single one, and do a single substitution.
Something like:
regAll = re.compile(r'RT|(^[ \t]+)|((http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?)|...')
(The intention is to replace not only all the things you are doing with re.sub, but also the lstrip.) I've ended the pattern with ...; you'll have to fill in the details yourself.
Then replace regexSub with just:
line = regAll.sub('', line)
Of course, only profiling will show if this is faster, but I expect that it will as there will be fewer intermediate strings being generated.
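A rough sketch of what the filled-in version might look like, assuming the alternation simply joins the question's existing sub-patterns (the exact pattern is an assumption and would need testing):
import re

# one alternation covering RT, leading whitespace, @-mentions and URLs;
# the sub-patterns are copied from the question's regexes
regAll = re.compile(r'RT'
                    r'|^[ \t]+'
                    r'|@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*'
                    r'|(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')

def regexSub(line):
    # a single pass over the line, replacing every match with ''
    return regAll.sub('', line)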