Haskell: parsing a large XML file runs out of memory

Time: 2022-12-20 15:25:45

I've played around with several Haskell XML libraries, including hexpat and xml-enumerator. After reading the IO chapter in Real World Haskell (http://book.realworldhaskell.org/read/io.html), I was under the impression that if I run the following code, the input would be read lazily and garbage collected as I go through it.

However, when I run it on a big file, memory usage keeps climbing as it runs.

runghc parse.hs bigfile.xml

What am I doing wrong? Is my assumption wrong? Does the map/filter force it to evaluate everything?

import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.UTF8 as U
import Prelude hiding (readFile)
import Text.XML.Expat.SAX
import System.Environment (getArgs)

main :: IO ()
main = do
    args <- getArgs
    -- Lazy readFile: the file is read in chunks on demand
    contents <- BSL.readFile (head args)
    -- putStrLn $ U.toString contents
    let events = parse defaultParseOptions contents
    mapM_ print $ map getTMSId $ filter isEvent events

-- True only for <event ...> start tags
isEvent :: SAXEvent String String -> Bool
isEvent (StartElement "event" _) = True
isEvent _ = False

-- Look up the TMSId attribute of a start tag
getTMSId :: SAXEvent String String -> Maybe String
getTMSId (StartElement _ as) = lookup "TMSId" as
getTMSId _ = Nothing

My end goal is to parse a huge XML file with a simple SAX-like interface. I don't want to have to be aware of the whole structure just to get notified that I've found an "event".
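
On the map/filter part of the question: map and filter are themselves lazy, so on their own they should not force the whole list. A minimal sketch using only base, where the Event type and endlessEvents are hypothetical stand-ins for the hexpat event stream (they are not part of hexpat), shows the same pipeline shape finishing on an infinite list:

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical stand-in for SAXEvent; not the hexpat type.
data Event = Start String [(String, String)] | Other

-- An infinite event stream: the pipeline below can only finish
-- if filter and mapMaybe are evaluated lazily.
endlessEvents :: [Event]
endlessEvents = concatMap mk [1 :: Int ..]
  where
    mk i = [Start "event" [("TMSId", show i)], Other]

isEvent :: Event -> Bool
isEvent (Start "event" _) = True
isEvent _ = False

tmsId :: Event -> Maybe String
tmsId (Start _ attrs) = lookup "TMSId" attrs
tmsId Other = Nothing

-- Same filter-then-map shape as in the question, cut off after n results.
firstIds :: Int -> [String]
firstIds n = take n $ mapMaybe tmsId $ filter isEvent endlessEvents

main :: IO ()
main = mapM_ putStrLn (firstIds 3)
```

Since this terminates, the list functions are not what forces everything; if memory still climbs, something else is probably holding on to the already-consumed part of the stream.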

2 answers

#1


8  

I'm the maintainer of hexpat. This is a bug, which I have now fixed in hexpat-0.19.8. Thanks for drawing it to my attention.

The bug is new on ghc-7.2.1, and it's to do with an interaction that I didn't expect between a where clause binding to a triple, and unsafePerformIO, which I need to make the interaction with the C code appear pure in Haskell.

#2


3  

This appears to be an issue with hexpat. Even when the program is compiled with optimization and does something as simple as taking the length of the event list, memory use grows linearly with the size of the input.
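
A reduced test of the kind described here might look like the sketch below. It is an assumption about the shape of the test, not the exact program that was run; countEvents is a name made up for illustration, while parse and defaultParseOptions are the same hexpat calls used in the question.

```haskell
import qualified Data.ByteString.Lazy as BSL
import System.Environment (getArgs)
import Text.XML.Expat.SAX (SAXEvent, defaultParseOptions, parse)

-- Count the SAX events in a document. length walks the list once and
-- keeps nothing, so consumed events should be collectable as it goes.
countEvents :: BSL.ByteString -> Int
countEvents contents = length events
  where
    events :: [SAXEvent String String]
    events = parse defaultParseOptions contents

main :: IO ()
main = do
    [path] <- getArgs
    contents <- BSL.readFile path  -- lazy read, chunk by chunk
    print (countEvents contents)
```

Compiling with something like ghc -O2 -rtsopts and running the binary with +RTS -s prints the maximum residency, which should stay roughly flat as the file grows if the parse is streaming properly.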

Looking at hexpat, I think there is excessive caching going on (see the parseG function). I suggest contacting the hexpat maintainer(s) and asking if this is expected behavior. It should have been mentioned in the haddocks either way, but resource consumption seems to get ignored too often in library documentation.
