I am going to be undertaking some logfile analyses in R (unless I can't do it in R), and I understand that my data needs to fit in RAM (unless I use some kind of fix like an interface to a key-value store, maybe?). So I am wondering how to tell ahead of time how much room my data is going to take up in RAM, and whether I will have enough. I know how much RAM I have (not a huge amount - 3GB under XP), and I know how many rows and columns my logfile will end up as and what data types the column entries ought to be (which presumably I need to check as it reads).
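My rough back-of-envelope thinking so far is something like the sketch below (the row/column counts and the all-numeric assumption are placeholders for my real data, and "logfile.csv" is a hypothetical file name):

```r
# Rough a priori estimate: rows x cols x bytes-per-value.
# Assumes all-numeric columns (8 bytes each); placeholder dimensions.
rows <- 1e7
cols <- 10
est_bytes <- rows * cols * 8
est_bytes / 2^20                  # estimated size in MB (~763 MB here)

# Cross-check by reading a small sample and scaling up:
sample_df <- read.csv("logfile.csv", nrows = 10000)   # hypothetical file
per_row <- as.numeric(object.size(sample_df)) / nrow(sample_df)
per_row * rows / 2^20             # projected MB for the full file
```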
How do I put this together into a go/no-go decision for undertaking the analysis in R? (Presumably R needs some RAM to do operations, as well as to hold the data!) My immediate required output is a bunch of simple summary stats, frequencies, contingencies, etc., so I could probably write some kind of parser/tabulator that will give me the output I need in the short term, but I also want to play around with lots of different approaches to this data as a next step, so I am looking at the feasibility of using R.
I have seen lots of useful advice about large datasets in R here, which I have read and will reread, but for now I would like to understand better how to figure out whether I should (a) go there at all, (b) go there but expect to have to do some extra stuff to make it manageable, or (c) run away before it's too late and do something in some other language/environment (suggestions welcome...!). Thanks!
1 Answer
#1
R is well suited to big datasets, either using out-of-the-box solutions like bigmemory or the ff package (especially read.csv.ffdf), or by processing your stuff in chunks using your own scripts. In almost all cases a little programming makes processing large datasets (>> memory, say 100 GB) very possible. Doing this kind of programming yourself takes some time to learn (I don't know your level), but it makes you really flexible. Whether this is your cup of tea, or whether you need it at all, depends on the time you want to invest in learning these skills. But once you have them, they will make your life as a data analyst much easier.
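As a rough illustration of the chunked approach, here is a minimal sketch in base R; the file name, column names, and chunk size are made-up placeholders, and read.csv.ffdf essentially wraps this kind of loop for you:

```r
# Minimal chunked-processing sketch: compute frequencies of one column
# without ever holding the whole file in RAM. File and column names
# below are hypothetical placeholders.
con <- file("logfile.csv", open = "r")
counts <- integer(0)                       # running named counts
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = 50000,
             col.names = c("timestamp", "user", "action"),
             stringsAsFactors = FALSE),
    error = function(e) NULL)              # read.csv errors once the connection is drained
  if (is.null(chunk) || nrow(chunk) == 0) break
  tab <- table(chunk$action)
  for (k in names(tab)) {
    prev <- counts[k]                      # NA if this key hasn't been seen yet
    counts[k] <- if (is.na(prev)) tab[[k]] else prev + tab[[k]]
  }
}
close(con)
counts                                     # frequencies for the whole file
```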
In regard to analyzing logfiles, I know that the stats pages generated from Call of Duty 4 (a multiplayer computer game) work by parsing the log file iteratively into a database and then retrieving the statistics per user from the database. See here for an example of the interface. The iterative (in chunks) approach means that logfile size is (almost) unlimited. However, getting good performance is not trivial.
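A hedged sketch of that parse-into-a-database pattern using DBI/RSQLite follows; parse_log_lines() stands in for whatever parsing your log format needs, and the file and table names are invented:

```r
# Sketch: stream a logfile into SQLite in chunks, then summarise with
# SQL so the full dataset never has to fit in R's memory.
library(DBI)                               # plus the RSQLite package

db <- dbConnect(RSQLite::SQLite(), "logstats.db")
log_con <- file("game.log", open = "r")    # hypothetical logfile
repeat {
  lines <- readLines(log_con, n = 50000)
  if (length(lines) == 0) break
  chunk <- parse_log_lines(lines)          # placeholder: raw lines -> data frame
  dbWriteTable(db, "events", chunk, append = TRUE)
}
close(log_con)

# Per-user frequencies straight from the database:
dbGetQuery(db, "SELECT user, COUNT(*) AS n FROM events GROUP BY user")
dbDisconnect(db)
```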
A lot of the stuff you can do in R you can also do in Python or Matlab, even C++ or Fortran. But only if one of those tools has out-of-the-box support for what you want would I see a distinct advantage over R. For processing large data see the HPC Task View. See also an earlier answer of mine about reading a very large text file in chunks. Other related links that might be interesting for you:
- Quickly reading very large tables as dataframes in R
- https://*.com/questions/1257021/suitable-functional-language-for-scientific-statistical-computing (the discussion includes which language to use for large data processing).
- Trimming a huge (3.5 GB) csv file to read into R
- A blog post of mine showing how to estimate the RAM usage of a dataset. Note that this assumes the data will be stored in a matrix or array, and contains just a single data type.
- Log file processing with R
In regard to choosing R or some other tool, I'd say if it's good enough for Google it is good enough for me ;).