How can I get quick access to the data in many large CSV files from Perl?

Date: 2022-09-15 15:22:54

I have a number of scripts that currently read in a lot of data from some .CSV files. For efficiency, I use the Text::CSV_XS module to read them in and then create a hash using one of the columns as an index. However, I have a lot of files and they are quite large. And each of the scripts needs to read in the data all over again.
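
A stripped-down sketch of the kind of loader each script runs today, assuming Text::CSV_XS and a hash keyed on one column (the file name and key-column index below are just placeholders):

use strict;
use warnings;
use Text::CSV_XS;

# Parse a CSV file and index its rows by one column.
sub load_csv_into_hash {
    my ($file, $key_col) = @_;
    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %by_key;
    while (my $row = $csv->getline($fh)) {
        $by_key{ $row->[$key_col] } = $row;   # one entry per CSV row
    }
    close $fh;
    return \%by_key;
}

# Each script currently repeats something like this for every file.
my $data = load_csv_into_hash('big_data.csv', 0);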

The question is: How can I have persistent storage of these Perl hashes so that all of them can be read back in with a minimum of CPU?

Combining the scripts is not an option. I wish...

I applied the second rule of optimization and used profiling to find that the vast majority of the CPU time (about 90%) was spent in:

Text::CSV_XS::fields
Text::CSV_XS::Parse
Text::CSV_XS::parse

So, I made a test script that read in all the .CSV files (Text::CSV_XS), dumped them with the Storable module, and then went back and read them back in with Storable. I profiled this so I could see the CPU times:

$ c:/perl/bin/dprofpp.bat
Total Elapsed Time = 1809.397 Seconds
  User+System Time = 950.5560 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 25.6   243.6 243.66    126   1.9338 1.9338  Storable::pretrieve
 20.5   194.9 194.92 893448   0.0002 0.0002  Text::CSV_XS::fields
 9.49   90.19 90.198 893448   0.0001 0.0001  Text::CSV_XS::Parse
 7.48   71.07 71.072    126   0.5641 0.5641  Storable::pstore
 4.45   42.32 132.52 893448   0.0000 0.0001  Text::CSV_XS::parse
 (the rest was in terms of 0.07% or less and can be ignored)

So, loading back in with Storable costs about 25.6% of the CPU time, compared to about 35% for Text::CSV_XS. Not a lot of savings...
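
For reference, the Storable dump/reload step in that test was essentially the pattern below (the hash contents and file name are stand-ins, not the real data):

use strict;
use warnings;
use Storable qw(nstore retrieve);

# Stand-in for a hash built from one of the CSV files.
my %by_key = ( 'A123' => [ 'A123', 42, 'foo' ] );

# Write pass: serialize the whole hash in one call (Storable::pstore in the profile).
nstore(\%by_key, 'big_data.sto');

# Read pass in a later run: pull it back in one call (Storable::pretrieve in the profile).
my $restored = retrieve('big_data.sto');
print $restored->{'A123'}[1], "\n";   # prints 42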

Has anybody got a suggestion on how I can read in these data more efficiently?

Thanks for your help.

5 solutions

#1 (9 votes)

Parse the data once and put it in an SQLite db. Query using DBI.
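
A rough sketch of that approach, assuming DBD::SQLite is installed; the table layout, column names and file name here are made up for illustration:

use strict;
use warnings;
use DBI;

# One-time loader: create the table and insert the parsed CSV rows.
my $dbh = DBI->connect('dbi:SQLite:dbname=data.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, col1 TEXT, col2 TEXT)');

my $ins = $dbh->prepare('INSERT OR REPLACE INTO records (id, col1, col2) VALUES (?, ?, ?)');
$ins->execute('A123', '42', 'foo');   # in practice: one execute per parsed CSV row
$dbh->commit;

# Later scripts query only the rows they need instead of re-reading every CSV.
my $row = $dbh->selectrow_hashref('SELECT * FROM records WHERE id = ?', undef, 'A123');
print $row->{col1}, "\n";
$dbh->disconnect;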

#2 (11 votes)

The easiest way to put a very large hash on disk, IMHO, is with BerkeleyDB. It's fast, time-tested and rock-solid, and the CPAN module provides a tied API. That means you can continue using your hash as if it were an in-memory data structure, but it will automatically read and write through BerkeleyDB to disk.
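
A minimal sketch of the tied-hash usage (the file name is illustrative; note that a plain tied BerkeleyDB hash stores flat string values, so nested rows would need to be serialized, e.g. with Storable's freeze/thaw):

use strict;
use warnings;
use BerkeleyDB;

# Tie a Perl hash to an on-disk BerkeleyDB file.
tie my %by_key, 'BerkeleyDB::Hash',
    -Filename => 'data.bdb',
    -Flags    => DB_CREATE
    or die "Cannot open data.bdb: $BerkeleyDB::Error";

# Reads and writes go through BerkeleyDB transparently.
$by_key{'A123'} = 'some flat value';
print $by_key{'A123'}, "\n";

untie %by_key;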

#3 (3 votes)

Well, I've taken the suggestion of Sinan Ünür (thanks!), made an SQLite database, and re-ran my test program to compare getting the data via CSV files with getting it out of the SQLite database:

$ c:/perl/bin/dprofpp.bat
Total Elapsed Time = 1705.947 Seconds
  User+System Time = 1084.296 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 19.5   212.2 212.26 893448   0.0002 0.0002  Text::CSV_XS::fields
 15.7   170.7 224.45    126   1.3549 1.7814  DBD::_::st::fetchall_hashref
 9.14   99.15 99.157 893448   0.0001 0.0001  Text::CSV_XS::Parse
 6.03   65.34 164.49 893448   0.0001 0.0002  Text::CSV_XS::parse
 4.93   53.41 53.412 893574   0.0001 0.0001  DBI::st::fetch
   [ *removed the items of less than 0.01 percent* ]

The total for CSV_XS is 34.67%, compared to 20.63% for SQLite, which is somewhat better than the Storable solution I tried before. However, this isn't a fair comparison, since with the CSV_XS solution I have to load the entire CSV file, but with the SQLite interface I can load just the parts I want. Thus in practice I expect even more improvement than this simple-minded test shows.
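
For illustration, loading everything keyed by the index column versus letting SQLite filter first might look like this (table and column names are placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=data.db', '', '', { RaiseError => 1 });

# Whole-table load keyed by the index column -- roughly what the
# fetchall_hashref calls in the profile above correspond to.
my $all = $dbh->selectall_hashref('SELECT * FROM records', 'id');

# The fairer comparison: push the filtering into SQLite and load only what is needed.
my $some = $dbh->selectall_hashref(
    'SELECT * FROM records WHERE col1 = ?', 'id', undef, '42');

print scalar(keys %$some), " rows loaded\n";
$dbh->disconnect;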

I have not tried using BerkeleyDB (sorry, friedo) instead of SQLite, mostly because I didn't see that suggestion until I was well involved with trying out SQLite. Setting up the test was a non-trivial task since I almost never have occasion to use SQL databases.

Still, the solution is clearly to load all the data into a database and access it via the DBI module. Thanks for everyone's help. All responses are greatly appreciated.

#4 (2 votes)

It's vastly preferable not to pull the entire list into memory every time you run the script, and an on-disk database lets you avoid that. If, for some reason, you have to touch every entry in the CSV files each time you run, I might recommend storing them on a RAM disk instead of a physical disk. Since the data obviously fits in memory, I don't think you'll get much improvement by changing the on-disk format you store it in; the only real way to speed it up is to store it on a faster medium.

#5 (1 vote)

If you only need to access part of the data in each script, rather than ALL of it, DBM::Deep is probably your best bet.
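
A tiny sketch of what that looks like (the file name and keys are invented):

use strict;
use warnings;
use DBM::Deep;

# A hash backed by a single file; nested structures are supported and
# only the entries you touch are read from disk.
my $db = DBM::Deep->new('records.deep');

$db->{A123} = { col1 => 42, col2 => 'foo' };   # written straight to disk
print $db->{A123}{col1}, "\n";                 # reads back just this entry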

Disk I/O is likely to be your biggest bottleneck no matter what you do. Perhaps you could use a data provider that keeps all the data available in an mmapped cache, using something like Sys::Mmap::Simple. I've never needed to do this sort of thing, so I don't have much else to offer.
