在Windows上使用包XML时内存泄漏

时间:2022-02-15 08:59:14

Having read Memory leaks parsing XML in r (including linked posts) and this post on R Help and given that some time has passed again, I still think this is an unresolved issue that deserves attention as the XML package is widely used throughout the R universe.

阅读内存泄漏解析r中的XML(包括链接的帖子)和R帮助上的这篇文章,并且鉴于已经过了一段时间,我仍然认为这是一个值得关注的未解决的问题,因为XML包在整个R宇宙中被广泛使用。

Thus please consider this as a follow up post and/or reference with a hopefully informative yet concise illustration of the problem.

因此,请将此视为后续帖子和/或参考,并希望提供信息丰富而简洁的问题说明。

Issue

Parsing XML/HTML documents in a way that they can be searched with XPath afterwards requires the internal use of C pointers (AFAIU). And it seems that at least on MS Windows (I'm running on Windows 8.1, 64 Bit) these references are not properly recognized by the garbage collector. Thus consumed memory is not properly released which leads to a freeze of an R process at some point.

以后可以使用XPath搜索XML / HTML文档的方式需要内部使用C指针(AFAIU)。而且似乎至少在MS Windows上(我在Windows 8.1,64位运行)这些引用都没有被垃圾收集器正确识别。因此消耗的存储器未被正确释放,这导致在某个时刻冻结R过程。

Central findings so far

To me it seems that XML:free and/or gc does/do not recognize all memory involved when parsing XML/HTML docs via xmlParse or htmlParse and subsequently processing them with xpathApply or the like:

对我来说,似乎XML:free和/或gc在通过xmlParse或htmlParse解析XML / HTML文档并随后使用xpathApply等处理它们时,会识别/不识别所涉及的所有内存:

The reported memory usage of the OS task (Rterm.exe) is adding up significantly fast while the reported memory of the R process as "seen from within R" (function memory.size) increases moderately (in comparison, that is). See list elements mem_r, mem_os and ratio before and after a substantial parsing cycle below.

报告的OS任务(Rterm.exe)的内存使用量显着增加,而R进程的报告内存“从R内部看”(函数memory.size)适度增加(相比之下)。请参阅下面的实质解析周期之前和之后的列表元素mem_r,mem_os和ratio。

All in all and throwing in everything that has been recommended (free, rm and gc), memory usage still always increases when xmlParse and the like are called. It's just a question of how much. So IMHO there must still be something that's not working correctly.

总而言之,抛出所有推荐的内容(free,rm和gc),当调用xmlParse等时,内存使用总是会增加。这只是一个多少的问题。所以恕我直言必须仍然有一些不能正常工作的东西。


Illustration

I borrowed the profiling code from the Duncan's Omegahat git repository.

我从Duncan的Omegahat git存储库中借用了性能分析代码。

Some preparations:

一些准备:

Sys.setenv("LANGUAGE"="en")   
require("compiler")
require("XML")

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] XML_3.98-1.1

Functions we need:

我们需要的功能:

getTaskMemoryByPid <- cmpfun(function(
    pid=Sys.getpid()
) {
    cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
    mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
    mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
    mem
}, options=list(suppressAll=TRUE))  

memoryLeak <- cmpfun(function(
    x=system.file("exampleData", "mtcars.xml", package="XML"),
    n=10000,
    use_text=FALSE,
    xpath=FALSE,
    free_doc=FALSE,
    clean_up=FALSE,
    detailed=FALSE
) {
    if(use_text) {
        x <- readLines(x)
    }
    ## Before //
    mem_os  <- getTaskMemoryByPid()
    mem_r   <- memory.size()
    prof_1  <- memory.profile()
    mem_before <- list(mem_r=mem_r,
        mem_os=mem_os, ratio=mem_os/mem_r)

    ## Per run //
    mem_perrun <- lapply(1:n, function(ii) {
        doc <- xmlParse(x, asText=use_text)
        if (xpath) {
            res <- xpathApply(doc=doc, path="/blah", fun=xmlValue)
            rm(res)
        }
        if (free_doc) {
            free(doc)
        }
        rm(doc)
        out <- NULL
        if (detailed) {
            out <- list(
                profile=memory.profile(),
                size=memory.size()
            )
        } 
        out
    })
    has_perrun <- any(sapply(mem_perrun, length) > 0)
    if (!has_perrun) {
        mem_perrun <- NULL
    } 

    ## Garbage collect //
    mem_gc <- NULL
    if(clean_up) {
        gc()
        tmp <- gc()
        mem_gc <- list(gc_mb=tmp["Ncells", "(Mb)"])
    }

    ## After //
    mem_os  <- getTaskMemoryByPid()
    mem_r   <- memory.size()
    prof_2  <- memory.profile()
    mem_after <- list(mem_r=mem_r,
        mem_os=mem_os, ratio=mem_os/mem_r)
    list(
        before=mem_before, 
        perrun=mem_perrun, 
        gc=mem_gc, 
        after=mem_after, 
        comparison_r=data.frame(
            before=prof_1, 
            after=prof_2, 
            increase=round((prof_2/prof_1)-1, 4)
        ),
        increase_r=(mem_after$mem_r/mem_before$mem_r)-1,
        increase_os=(mem_after$mem_os/mem_before$mem_os)-1
    )
}, options=list(suppressAll=TRUE))  

Results

Scenario 1

Quick facts: garbage collection enabled, XML doc is parsed n times but not searched via xpathApply

快速事实:启用垃圾收集,XML文档解析n次但不通过xpathApply搜索

Notice the ratios of OS memory vs. R memory:

注意OS内存与R内存的比率:

Before: 1.364832

之前:1.364832

After: 1.322702

之后:1.322702

res <- memoryLeak(clean_up=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-1.rdata"))

> res
$before
$before$mem_r
[1] 37.42

$before$mem_os
[1] 51.072

$before$ratio
[1] 1.364832


$perrun
NULL

$gc
$gc$gc_mb
[1] 45


$after
$after$mem_r
[1] 63.21

$after$mem_os
[1] 83.608

$after$ratio
[1] 1.322702


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7387   7392   0.0007
pairlist    190383 390633   1.0518
closure       5077  55085   9.8499
environment   1032  51032  48.4496
promise       5226 105226  19.1351
language     54675  54791   0.0021
special         44     44   0.0000
builtin        648    648   0.0000
char          8746   8763   0.0019
logical       9081   9084   0.0003
integer      22804  22807   0.0001
double        2773   2783   0.0036
complex          1      1   0.0000
character    44522  94569   1.1241
...              0      0      NaN
any              0      0      NaN
list         19946  19951   0.0003
expression       1      1   0.0000
bytecode     16049  16050   0.0001
externalptr   1487   1487   0.0000
weakref        391    391   0.0000
raw            392    392   0.0000
S4            1392   1392   0.0000

$increase_r
[1] 0.6892036

$increase_os
[1] 0.6370614

Scenario 2

Quick facts: garbage collection enabled, free is explicitly called, XML doc is parsed n times but not searched via xpathApply.

快速事实:启用垃圾收集,显式调用free,XML doc被解析n次但不通过xpathApply搜索。

Notice the ratios of OS memory vs. R memory:

注意OS内存与R内存的比率:

Before: 1.315249

之前:1.315249

After: 1.222143

之后:1.222143

res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-2.rdata"))
> res

$before    
$before$mem_r
[1] 63.48

$before$mem_os
[1] 83.492

$before$ratio
[1] 1.315249


$perrun
NULL

$gc
$gc$gc_mb
[1] 69.3


$after
$after$mem_r
[1] 95.92

$after$mem_os
[1] 117.228

$after$ratio
[1] 1.222143


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7454   7454   0.0000
pairlist    392455 592466   0.5096
closure      55104 105104   0.9074
environment  51032 101032   0.9798
promise     105226 205226   0.9503
language     55592  55592   0.0000
special         44     44   0.0000
builtin        648    648   0.0000
char          8847   8848   0.0001
logical       9141   9141   0.0000
integer      23109  23111   0.0001
double        2802   2807   0.0018
complex          1      1   0.0000
character    94775 144781   0.5276
...              0      0      NaN
any              0      0      NaN
list         20174  20177   0.0001
expression       1      1   0.0000
bytecode     16265  16265   0.0000
externalptr   1488   1487  -0.0007
weakref        392    391  -0.0026
raw            393    392  -0.0025
S4            1392   1392   0.0000

$increase_r
[1] 0.5110271

$increase_os
[1] 0.4040627

Scenario 3

Quick facts: garbage collection enabled, free is explicitly called, XML doc is parsed n times and searched via xpathApply each time.

快速事实:启用垃圾收集,显式调用free,每次解析XML文档n次并通过xpathApply搜索。

Notice the ratios of OS memory vs. R memory:

注意OS内存与R内存的比率:

Before: 1.220429

之前:1.220429

After: 13.15629 (!)

之后:13.15629(!)

res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, xpath=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-3.rdata"))
res
$before
$before$mem_r
[1] 95.94

$before$mem_os
[1] 117.088

$before$ratio
[1] 1.220429


$perrun
NULL

$gc
$gc$gc_mb
[1] 93.4


$after
$after$mem_r
[1] 124.64

$after$mem_os
[1] 1639.8

$after$ratio
[1] 13.15629


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7454   7460   0.0008
pairlist    592458 793042   0.3386
closure     105104 155110   0.4758
environment 101032 151032   0.4949
promise     205226 305226   0.4873
language     55592  55882   0.0052
special         44     44   0.0000
builtin        648    648   0.0000
char          8847   8867   0.0023
logical       9142   9162   0.0022
integer      23109  23112   0.0001
double        2802   2832   0.0107
complex          1      1   0.0000
character   144775 194819   0.3457
...              0      0      NaN
any              0      0      NaN
list         20174  20177   0.0001
expression       1      1   0.0000
bytecode     16265  16265   0.0000
externalptr   1488   1487  -0.0007
weakref        392    391  -0.0026
raw            393    392  -0.0025
S4            1392   1392   0.0000

$increase_r
[1] 0.2991453

$increase_os
[1] 13.00485

I also tried different versions. Well, I tried to try ;-)

我也试过不同的版本。好吧,我试过试试;-)

From source, from omegahat.org

FYI: latest Rtools 3.1 is installed and included in the Windows PATH (e.g. installing stringr form the source code worked just fine).

仅供参考:最新的Rtools 3.1已安装并包含在Windows PATH中(例如,安装stringr表单,源代码工作正常)。

> install.packages("XML", repos="http://www.omegahat.org/R", type="source")
trying URL 'http://www.omegahat.org/R/src/contrib/XML_3.98-1.tar.gz'
Content type 'application/x-gzip' length 1543387 bytes (1.5 Mb)
opened URL
downloaded 1.5 Mb

* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'

The downloaded source packages are in
    'C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\downloaded_packages'
Warning messages:
1: running command '"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" CMD INSTALL -l "R:\home\apps\lsqmapps\apps\r\R-3.1.0\library" C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/downloaded_packages/XML_3.98-1.tar.gz' had status 1 
2: In install.packages("XML", repos = "http://www.omegahat.org/R",  :
  installation of package 'XML' had non-zero exit status

Github

I did not follow the recommendations in the README on the github repo as it points to this directory that only contains a tar.gz of version 3.94-0 (while we're at 3.98-1.1 on CRAN).

我没有按照github repo上README中的建议,因为它指向的目录只包含版本3.94-0的tar.gz(而我们在CRAN上的时候是3.98-1.1)。

Even though it is stated that the gihub repo is not in a standard R package structure, I tried it anyway with install_github - and failed ;-)

尽管声明gihub repo不在标准的R包结构中,但无论如何我用install_github尝试了它 - 并且失败了;-)

require("devtools")
> install_github(repo="XML", username="omegahat")
Installing github repo XML/master from omegahat
Downloading master.zip from https://github.com/omegahat/XML/archive/master.zip
Installing package from C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/master.zip
Installing XML
"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" --vanilla CMD INSTALL  \
  "C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\devtools15c82d7c2b4c\XML-master"  \
  --library="R:/home/apps/lsqmapps/apps/r/R-3.1.0/library" --with-keep.source  \
  --install-tests 

* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
Error: Command failed (1)

3 个解决方案

#1


4  

Whilst it is still in its infancy (only a couple of months old!), and has a few quirks, Hadley Wickham has written a library for XML parsing, xml2, that can be found on Github at https://github.com/hadley/xml2. It is restricted to reading rather than writing XML, but for parsing XML I've been experimenting and it looks like it will do the job, without the memory leaks of the xml package! It provides functions including:

虽然它仍然处于起步阶段(只有几个月的时间!),并且有一些怪癖,但Hadl​​ey Wickham编写了一个用于XML解析的库xml2,可以在Github上找到https://github.com/哈德利/ XML2。它仅限于阅读而不是编写XML,但是对于解析XML我一直在尝试,看起来它会完成工作,没有xml包的内存泄漏!它提供的功能包括:

  • read_xml() to read an XML file
  • read_xml()读取XML文件
  • xml_children() to get the child nodes of a node
  • xml_children()获取节点的子节点
  • xml_text() to get the text within a tag
  • xml_text()获取标记内的文本
  • xml_attrs() to get a character vector of the attributes and values of a node, that can be cast to a named list with as.list()
  • xml_attrs()获取节点的属性和值的字符向量,可以使用as.list()强制转换为命名列表

Note that you still need to ensure that you rm() the XML node objects after you're done with them, and force a garbage collection with gc(), but the memory then does actually get released to the O/S (Disclaimer: Only tested on Windows 7 but this seems to be the most 'memory leaky' platform anyway).

请注意,您仍然需要确保在完成XML节点对象后使用它们,并使用gc()强制执行垃圾收集,但内存实际上会释放到O / S(免责声明:仅在Windows 7上测试过,但这似乎是最“内存泄漏”的平台。

Hope this helps someone!

希望这有助于某人!

#2


1  

Following Matthew Wise's answer above for using xml2, I found the function that really releases memory is xml_remove() followed by gc(), not rm().

按照Matthew Wise上面的回答使用xml2,我发现真正释放内存的函数是xml_remove(),后跟gc(),而不是rm()。

#3


0  

Nothing really happened since I posted the question, so I thought I'd try to raise attention again.

自从我发布问题以来没有发生任何事情,所以我想我会再次引起注意。

Here's a slightly updated version of my investigations

这是我的调查的略微更新版本

Preliminaries

require("rvest")
require("XML")

Functions

getTaskMemoryByPid <- function(
  pid = Sys.getpid()
) {
  cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
  mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
  mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
  mem
}  

getCurrentMemoryStatus <- function() {
  mem_os  <- getTaskMemoryByPid()
  mem_r   <- memory.size()
  prof_1  <- memory.profile()
  list(r = mem_r, os = mem_os, ratio = mem_os/mem_r)
}

memoryLeak <- function(
  x = system.file("exampleData", "mtcars.xml", package="XML"),
  n = 10000,
  use_text = FALSE,
  xpath = FALSE,
  free_doc = FALSE,
  clean_up = FALSE,
  detailed = FALSE,
  use_rvest = FALSE,
  user_agent = httr::user_agent("Mozilla/5.0")
) {
  if(use_text) {
    x <- readLines(x)
  }
  ## Before //
  prof_1  <- memory.profile()
  mem_before <- getCurrentMemoryStatus()

  ## Per run //
  mem_perrun <- lapply(1:n, function(ii) {
    doc <- if (!use_rvest) {
      xmlParse(x, asText = use_text)
    } else {
      if (file.exists(x)) {
      ## From disk //        
        rvest::html(x)  
      } else {
      ## From web //
        rvest::html_session(x, user_agent)  
      }
    }
    if (xpath) {
      res <- xpathApply(doc = doc, path = "/blah", fun = xmlValue)
      rm(res)
    }
    if (free_doc) {
      free(doc)
    }
    rm(doc)
    out <- NULL
    if (detailed) {
      out <- list(
        profile = memory.profile(),
        size = memory.size()
      )
    } 
    out
  })
  has_perrun <- any(sapply(mem_perrun, length) > 0)
  if (!has_perrun) {
    mem_perrun <- NULL
  } 

  ## Garbage collect //
  mem_gc <- NULL
  if(clean_up) {
    gc()
    tmp <- gc()
    mem_gc <- list(gc_mb = tmp["Ncells", "(Mb)"])
  }

  ## After //
  prof_2  <- memory.profile()
  mem_after <- getCurrentMemoryStatus()

  ## Return value //
  if (detailed) {
    list(
      before = mem_before, 
      perrun = mem_perrun, 
      gc = mem_gc, 
      after = mem_after, 
      comparison_r = data.frame(
        before = prof_1, 
        after = prof_2, 
        increase = round((prof_2/prof_1)-1, 4)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  } else {
    list(
      before_after = data.frame(
        r = c(mem_before$r, mem_after$r),
        os = c(mem_before$os, mem_after$os)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  }
}

Memory status before anything has ever been requested

getCurrentMemoryStatus()

Generate additional offline example content

s <- html_session("http://had.co.nz/")
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "hadley.html")
# html("hadley.html")

s <- html_session(
  "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd",
  httr::user_agent("Mozilla/5.0"))
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "amazon.html")
# html("amazon.html")

getCurrentMemoryStatus()

Profiling

################
## Mtcars.xml ##
################

res <- memoryLeak(n = 50000, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.1.rdata")
save(res, file = fpath)

res <- memoryLeak(n = 50000, clean_up = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.2.rdata")
save(res, file = fpath)

res <- memoryLeak(n = 50000, clean_up = TRUE, free_doc = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.3.rdata")
save(res, file = fpath)

###################
## www.had.co.nz ##
###################

## Offline //
res <- memoryLeak(x = "hadley.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE, 
  detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.3.rdata")
save(res, file = fpath)

## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://had.co.nz/"
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.3.rdata")
save(res, file = fpath)

####################
## www.amazon.com ##
####################

## Offline //
res <- memoryLeak(x = "amazon.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE, 
  detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.3.rdata")
save(res, file = fpath)

## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd"
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.3.rdata")
save(res, file = fpath)

#1


4  

Whilst it is still in its infancy (only a couple of months old!), and has a few quirks, Hadley Wickham has written a library for XML parsing, xml2, that can be found on Github at https://github.com/hadley/xml2. It is restricted to reading rather than writing XML, but for parsing XML I've been experimenting and it looks like it will do the job, without the memory leaks of the xml package! It provides functions including:

虽然它仍然处于起步阶段(只有几个月的时间!),并且有一些怪癖,但Hadl​​ey Wickham编写了一个用于XML解析的库xml2,可以在Github上找到https://github.com/哈德利/ XML2。它仅限于阅读而不是编写XML,但是对于解析XML我一直在尝试,看起来它会完成工作,没有xml包的内存泄漏!它提供的功能包括:

  • read_xml() to read an XML file
  • read_xml()读取XML文件
  • xml_children() to get the child nodes of a node
  • xml_children()获取节点的子节点
  • xml_text() to get the text within a tag
  • xml_text()获取标记内的文本
  • xml_attrs() to get a character vector of the attributes and values of a node, that can be cast to a named list with as.list()
  • xml_attrs()获取节点的属性和值的字符向量,可以使用as.list()强制转换为命名列表

Note that you still need to ensure that you rm() the XML node objects after you're done with them, and force a garbage collection with gc(), but the memory then does actually get released to the O/S (Disclaimer: Only tested on Windows 7 but this seems to be the most 'memory leaky' platform anyway).

请注意,您仍然需要确保在完成XML节点对象后使用它们,并使用gc()强制执行垃圾收集,但内存实际上会释放到O / S(免责声明:仅在Windows 7上测试过,但这似乎是最“内存泄漏”的平台。

Hope this helps someone!

希望这有助于某人!

#2


1  

Following Matthew Wise's answer above for using xml2, I found the function that really releases memory is xml_remove() followed by gc(), not rm().

按照Matthew Wise上面的回答使用xml2,我发现真正释放内存的函数是xml_remove(),后跟gc(),而不是rm()。

#3


0  

Nothing really happened since I posted the question, so I thought I'd try to raise attention again.

自从我发布问题以来没有发生任何事情,所以我想我会再次引起注意。

Here's a slightly updated version of my investigations

这是我的调查的略微更新版本

Preliminaries

require("rvest")
require("XML")

Functions

getTaskMemoryByPid <- function(
  pid = Sys.getpid()
) {
  cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
  mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
  mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
  mem
}  

getCurrentMemoryStatus <- function() {
  mem_os  <- getTaskMemoryByPid()
  mem_r   <- memory.size()
  prof_1  <- memory.profile()
  list(r = mem_r, os = mem_os, ratio = mem_os/mem_r)
}

memoryLeak <- function(
  x = system.file("exampleData", "mtcars.xml", package="XML"),
  n = 10000,
  use_text = FALSE,
  xpath = FALSE,
  free_doc = FALSE,
  clean_up = FALSE,
  detailed = FALSE,
  use_rvest = FALSE,
  user_agent = httr::user_agent("Mozilla/5.0")
) {
  if(use_text) {
    x <- readLines(x)
  }
  ## Before //
  prof_1  <- memory.profile()
  mem_before <- getCurrentMemoryStatus()

  ## Per run //
  mem_perrun <- lapply(1:n, function(ii) {
    doc <- if (!use_rvest) {
      xmlParse(x, asText = use_text)
    } else {
      if (file.exists(x)) {
      ## From disk //        
        rvest::html(x)  
      } else {
      ## From web //
        rvest::html_session(x, user_agent)  
      }
    }
    if (xpath) {
      res <- xpathApply(doc = doc, path = "/blah", fun = xmlValue)
      rm(res)
    }
    if (free_doc) {
      free(doc)
    }
    rm(doc)
    out <- NULL
    if (detailed) {
      out <- list(
        profile = memory.profile(),
        size = memory.size()
      )
    } 
    out
  })
  has_perrun <- any(sapply(mem_perrun, length) > 0)
  if (!has_perrun) {
    mem_perrun <- NULL
  } 

  ## Garbage collect //
  mem_gc <- NULL
  if(clean_up) {
    gc()
    tmp <- gc()
    mem_gc <- list(gc_mb = tmp["Ncells", "(Mb)"])
  }

  ## After //
  prof_2  <- memory.profile()
  mem_after <- getCurrentMemoryStatus()

  ## Return value //
  if (detailed) {
    list(
      before = mem_before, 
      perrun = mem_perrun, 
      gc = mem_gc, 
      after = mem_after, 
      comparison_r = data.frame(
        before = prof_1, 
        after = prof_2, 
        increase = round((prof_2/prof_1)-1, 4)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  } else {
    list(
      before_after = data.frame(
        r = c(mem_before$r, mem_after$r),
        os = c(mem_before$os, mem_after$os)
      ),
      increase_r = (mem_after$r/mem_before$r)-1,
      increase_os = (mem_after$os/mem_before$os)-1
    )
  }
}

Memory status before anything has ever been requested

getCurrentMemoryStatus()

Generate additional offline example content

s <- html_session("http://had.co.nz/")
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "hadley.html")
# html("hadley.html")

s <- html_session(
  "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd",
  httr::user_agent("Mozilla/5.0"))
tmp <- capture.output(httr::content(s$response))
write(tmp, file = "amazon.html")
# html("amazon.html")

getCurrentMemoryStatus()

Profiling

################
## Mtcars.xml ##
################

res <- memoryLeak(n = 50000, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.1.rdata")
save(res, file = fpath)

res <- memoryLeak(n = 50000, clean_up = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.2.rdata")
save(res, file = fpath)

res <- memoryLeak(n = 50000, clean_up = TRUE, free_doc = TRUE, detailed = FALSE)
fpath <- file.path(tempdir(), "memory-profile-1.3.rdata")
save(res, file = fpath)

###################
## www.had.co.nz ##
###################

## Offline //
res <- memoryLeak(x = "hadley.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE, 
  detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-2.3.rdata")
save(res, file = fpath)

## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://had.co.nz/"
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-3.3.rdata")
save(res, file = fpath)

####################
## www.amazon.com ##
####################

## Offline //
res <- memoryLeak(x = "amazon.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE, 
  detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.3.rdata")
save(res, file = fpath)

## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
.url <- "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd"
res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.1.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.2.rdata")
save(res, file = fpath)

res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, 
    free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
fpath <- file.path(tempdir(), "memory-profile-4.3.rdata")
save(res, file = fpath)