Memory leaks when using package XML on Windows

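Having read "Memory leaks parsing XML in R" (including the linked posts) and this post on R-help, and given that quite some time has passed, I still think this is an unresolved issue that deserves attention, because the XML package is widely used throughout the R universe.

Thus, please regard this as a follow-up post and/or reference with a hopefully informative and concise illustration of the issue.

The problem

Parsing XML/HTML documents in a way that lets them be searched afterwards with XPath requires the internal use of C pointers (AFAIU). At least on MS Windows (I am running Windows 8.1, 64-bit), it seems that these references are not properly recognized by the garbage collector. The consumed memory is therefore not released correctly, which at some point leads to a freeze of the R process.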
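To make the mechanism concrete: R holds C-level objects through references whose cleanup depends on finalizers, which the garbage collector runs once a reference becomes unreachable. The sketch below shows that mechanism in base R only. It uses an environment as a stand-in for a C-level handle and is not the XML package's actual implementation; the names `finalized`, `on_finalize` and `make_handle` are made up for this illustration.

```r
# Illustration only: an environment stands in for a C-level handle.
finalized <- FALSE

# Defined at top level so that the finalizer does not capture the handle.
on_finalize <- function(e) {
  finalized <<- TRUE  # a real finalizer would release C memory here
}

make_handle <- function() {
  handle <- new.env()
  reg.finalizer(handle, on_finalize)
  invisible(NULL)     # the handle is unreachable once we return
}

make_handle()
invisible(gc())       # a collection runs the pending finalizer
finalized
```

If the C library behind a reference holds memory that no finalizer knows about, `gc()` has no way to return it. That would be consistent with a gap between what R reports and what the OS reports, but it is only one possible explanation.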

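Central findings so far

To me it seems that XML::free and/or gc do not recognize all memory involved when parsing XML/HTML documents via xmlParse or htmlParse and subsequently processing them with xpathApply or the like:

The reported memory usage of the OS task (Rterm.exe) increases significantly, while the reported memory of the R process "as seen from within R" (function memory.size) increases only moderately by comparison. See the list elements mem_r, mem_os and ratio before and after a large number of parsing cycles below.

All in all, even when throwing everything at it that is recommended (free, rm and gc), memory usage still always increases when xmlParse or the like is called. It is just a matter of how much. So IMHO there must still be something that is not working properly.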


Illustration

I borrowed the profiling code from Duncan's Omegahat git repository.

Some preparations:

    Sys.setenv("LANGUAGE"="en")
    require("compiler")
    require("XML")

    > sessionInfo()
    R version 3.1.0 (2014-04-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
    [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
    [5] LC_TIME=German_Germany.1252

    attached base packages:
    [1] compiler  stats     graphics  grDevices utils     datasets  methods
    [8] base

    other attached packages:
    [1] XML_3.98-1.1

The functions we need:

    getTaskMemoryByPid <- cmpfun(function(
      pid=Sys.getpid()
    ) {
      cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
      mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
      mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
      mem
    }, options=list(suppressAll=TRUE))

    memoryLeak <- cmpfun(function(
      x=system.file("exampleData", "mtcars.xml", package="XML"),
      n=10000,
      use_text=FALSE,
      xpath=FALSE,
      free_doc=FALSE,
      clean_up=FALSE,
      detailed=FALSE
    ) {
      if(use_text) {
        x <- readLines(x)
      }

      ## Before //
      mem_os <- getTaskMemoryByPid()
      mem_r <- memory.size()
      prof_1 <- memory.profile()
      mem_before <- list(mem_r=mem_r, mem_os=mem_os, ratio=mem_os/mem_r)

      ## Per run //
      mem_perrun <- lapply(1:n, function(ii) {
        doc <- xmlParse(x, asText=use_text)
        if (xpath) {
          res <- xpathApply(doc=doc, path="/blah", fun=xmlValue)
          rm(res)
        }
        if (free_doc) {
          free(doc)
        }
        rm(doc)
        out <- NULL
        if (detailed) {
          out <- list(
            profile=memory.profile(),
            size=memory.size()
          )
        }
        out
      })
      has_perrun <- any(sapply(mem_perrun, length) > 0)
      if (!has_perrun) {
        mem_perrun <- NULL
      }

      ## Garbage collect //
      mem_gc <- NULL
      if(clean_up) {
        gc()
        tmp <- gc()
        mem_gc <- list(gc_mb=tmp["Ncells", "(Mb)"])
      }

      ## After //
      mem_os <- getTaskMemoryByPid()
      mem_r <- memory.size()
      prof_2 <- memory.profile()
      mem_after <- list(mem_r=mem_r, mem_os=mem_os, ratio=mem_os/mem_r)

      list(
        before=mem_before,
        perrun=mem_perrun,
        gc=mem_gc,
        after=mem_after,
        comparison_r=data.frame(
          before=prof_1,
          after=prof_2,
          increase=round((prof_2/prof_1)-1, 4)
        ),
        increase_r=(mem_after$mem_r/mem_before$mem_r)-1,
        increase_os=(mem_after$mem_os/mem_before$mem_os)-1
      )
    }, options=list(suppressAll=TRUE))
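Note that getTaskMemoryByPid() depends on the number format that tasklist prints. The pattern `\\.|\\s|K` fits output such as `51.072 K` (as on my German locale), whereas an English locale typically prints a comma as the thousands separator, which that pattern would leave in place. Below is a minimal sketch of a locale-tolerant variant of the parsing step. The sample CSV lines and the name `parse_tasklist_mem` are assumptions for illustration, not captured output.

```r
# Hypothetical tasklist CSV output: a header plus one process row.
sample_csv <- c(
  '"Image Name","PID","Session Name","Session#","Mem Usage"',
  '"Rterm.exe","1234","Console","1","51,072 K"'
)

# Keep digits only, then divide by 1000 as the original function does.
parse_tasklist_mem <- function(lines) {
  mem <- read.csv(text = lines, stringsAsFactors = FALSE)[, 5]
  as.numeric(gsub("[^0-9]", "", mem)) / 1000
}

parse_tasklist_mem(sample_csv)
```

Stripping every non-digit handles both `51.072 K` and `51,072 K`, at the cost of assuming the value never carries a fractional part.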

Results

Scenario 1

Quick facts: garbage collection enabled, XML document parsed n times but not searched via xpathApply.

Note the ratio of OS memory to R memory:

Before: 1.364832

After: 1.322702

    res <- memoryLeak(clean_up=TRUE, n=50000)
    save(res, file=file.path(tempdir(), "memory-profile-1.rdata"))
    > res
    $before
    $before$mem_r
    [1] 37.42

    $before$mem_os
    [1] 51.072

    $before$ratio
    [1] 1.364832

    $perrun
    NULL

    $gc
    $gc$gc_mb
    [1] 45

    $after
    $after$mem_r
    [1] 63.21

    $after$mem_os
    [1] 83.608

    $after$ratio
    [1] 1.322702

    $comparison_r
                before  after increase
    NULL             1      1   0.0000
    symbol        7387   7392   0.0007
    pairlist    190383 390633   1.0518
    closure       5077  55085   9.8499
    environment   1032  51032  48.4496
    promise       5226 105226  19.1351
    language     54675  54791   0.0021
    special         44     44   0.0000
    builtin        648    648   0.0000
    char          8746   8763   0.0019
    logical       9081   9084   0.0003
    integer      22804  22807   0.0001
    double        2773   2783   0.0036
    complex          1      1   0.0000
    character    44522  94569   1.1241
    ...              0      0      NaN
    any              0      0      NaN
    list         19946  19951   0.0003
    expression       1      1   0.0000
    bytecode     16049  16050   0.0001
    externalptr   1487   1487   0.0000
    weakref        391    391   0.0000
    raw            392    392   0.0000
    S4            1392   1392   0.0000

    $increase_r
    [1] 0.6892036

    $increase_os
    [1] 0.6370614

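Scenario 2

Quick facts: garbage collection enabled, free called explicitly, XML document parsed n times but not searched via xpathApply.

Note the ratio of OS memory to R memory:

Before: 1.315249

After: 1.222143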

    res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, n=50000)
    save(res, file=file.path(tempdir(), "memory-profile-2.rdata"))
    > res
    $before
    $before$mem_r
    [1] 63.48

    $before$mem_os
    [1] 83.492

    $before$ratio
    [1] 1.315249

    $perrun
    NULL

    $gc
    $gc$gc_mb
    [1] 69.3

    $after
    $after$mem_r
    [1] 95.92

    $after$mem_os
    [1] 117.228

    $after$ratio
    [1] 1.222143

    $comparison_r
                before  after increase
    NULL             1      1   0.0000
    symbol        7454   7454   0.0000
    pairlist    392455 592466   0.5096
    closure      55104 105104   0.9074
    environment  51032 101032   0.9798
    promise     105226 205226   0.9503
    language     55592  55592   0.0000
    special         44     44   0.0000
    builtin        648    648   0.0000
    char          8847   8848   0.0001
    logical       9141   9141   0.0000
    integer      23109  23111   0.0001
    double        2802   2807   0.0018
    complex          1      1   0.0000
    character    94775 144781   0.5276
    ...              0      0      NaN
    any              0      0      NaN
    list         20174  20177   0.0001
    expression       1      1   0.0000
    bytecode     16265  16265   0.0000
    externalptr   1488   1487  -0.0007
    weakref        392    391  -0.0026
    raw            393    392  -0.0025
    S4            1392   1392   0.0000

    $increase_r
    [1] 0.5110271

    $increase_os
    [1] 0.4040627

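Scenario 3

Quick facts: garbage collection enabled, free called explicitly, XML document parsed n times and searched via xpathApply each time.

Note the ratio of OS memory to R memory:

Before: 1.220429

After: 13.15629 (!)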

    res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, xpath=TRUE, n=50000)
    save(res, file=file.path(tempdir(), "memory-profile-3.rdata"))
    res
    $before
    $before$mem_r
    [1] 95.94

    $before$mem_os
    [1] 117.088

    $before$ratio
    [1] 1.220429

    $perrun
    NULL

    $gc
    $gc$gc_mb
    [1] 93.4

    $after
    $after$mem_r
    [1] 124.64

    $after$mem_os
    [1] 1639.8

    $after$ratio
    [1] 13.15629

    $comparison_r
                before  after increase
    NULL             1      1   0.0000
    symbol        7454   7460   0.0008
    pairlist    592458 793042   0.3386
    closure     105104 155110   0.4758
    environment 101032 151032   0.4949
    promise     205226 305226   0.4873
    language     55592  55882   0.0052
    special         44     44   0.0000
    builtin        648    648   0.0000
    char          8847   8867   0.0023
    logical       9142   9162   0.0022
    integer      23109  23112   0.0001
    double        2802   2832   0.0107
    complex          1      1   0.0000
    character   144775 194819   0.3457
    ...              0      0      NaN
    any              0      0      NaN
    list         20174  20177   0.0001
    expression       1      1   0.0000
    bytecode     16265  16265   0.0000
    externalptr   1488   1487  -0.0007
    weakref        392    391  -0.0026
    raw            393    392  -0.0025
    S4            1392   1392   0.0000

    $increase_r
    [1] 0.2991453

    $increase_os
    [1] 13.00485
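The headline figures of the three scenarios can be recomputed from the reported mem_r and mem_os values. The sketch below does only that arithmetic on the numbers printed in the three result listings; the data frame layout and its column names are chosen here for illustration.

```r
# Reported values (R-side and OS-side memory) before and after each scenario.
runs <- data.frame(
  scenario  = 1:3,
  r_before  = c(37.42, 63.48, 95.94),
  os_before = c(51.072, 83.492, 117.088),
  r_after   = c(63.21, 95.92, 124.64),
  os_after  = c(83.608, 117.228, 1639.8)
)

# Same definitions as in memoryLeak(): ratio = OS / R, increase = after / before - 1.
runs$ratio_before <- runs$os_before / runs$r_before
runs$ratio_after  <- runs$os_after / runs$r_after
runs$increase_r   <- runs$r_after / runs$r_before - 1
runs$increase_os  <- runs$os_after / runs$os_before - 1

round(runs[, c("scenario", "ratio_before", "ratio_after",
               "increase_r", "increase_os")], 4)
```

Only scenario 3 stands out: the R-side figure grows by roughly 0.30, while the OS-side figure grows by roughly 13, i.e. to about fourteen times its starting value.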

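I also tried different versions of the package. Well, I tried to try ;-)

From source from omegahat.org

FYI: the latest Rtools 3.1 is installed and included in the Windows PATH (e.g. installing stringr from source works just fine).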

    > install.packages("XML", repos="http://www.omegahat.org/R", type="source")
    trying URL 'http://www.omegahat.org/R/src/contrib/XML_3.98-1.tar.gz'
    Content type 'application/x-gzip' length 1543387 bytes (1.5 Mb)
    opened URL
    downloaded 1.5 Mb

    * installing *source* package 'XML' ...
    Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
    Warning: running command 'sh ./configure.win' had status 1
    ERROR: configuration failed for package 'XML'
    * removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
    * restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'

    The downloaded source packages are in
        'C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\downloaded_packages'
    Warning messages:
    1: running command '"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" CMD INSTALL -l "R:\home\apps\lsqmapps\apps\r\R-3.1.0\library" C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/downloaded_packages/XML_3.98-1.tar.gz' had status 1
    2: In install.packages("XML", repos = "http://www.omegahat.org/R", :
      installation of package 'XML' had non-zero exit status

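From GitHub

I did not follow the recommendation in the README of the GitHub repo, because the directory it points to only contains the tar.gz of version 3.94-0 (while CRAN is at 3.98-1.1).

Even though the README states that the GitHub repo is not in a standard R package structure, I tried it with install_github nonetheless, and failed ;-)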

    require("devtools")
    > install_github(repo="XML", username="omegahat")
    Installing github repo XML/master from omegahat
    Downloading master.zip from https://github.com/omegahat/XML/archive/master.zip
    Installing package from C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/master.zip
    Installing XML
    "R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" --vanilla CMD INSTALL \
      "C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\devtools15c82d7c2b4c\XML-master" \
      --library="R:/home/apps/lsqmapps/apps/r/R-3.1.0/library" --with-keep.source \
      --install-tests

    * installing *source* package 'XML' ...
    Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
    Warning: running command 'sh ./configure.win' had status 1
    ERROR: configuration failed for package 'XML'
    * removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
    * restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
    Error: Command failed (1)

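Although it is still in its infancy (only a few months old!) and has a few quirks, Hadley Wickham has written an XML parsing library, xml2, which can be found on GitHub at https://github.com/hadley/xml2. It is limited to reading rather than writing XML, but for parsing XML I have been experimenting with it, and it looks like it gets the job done without the memory leaks of the XML package! The functions it provides include:

  • read_xml() to read an XML file
  • xml_children() to get the child nodes of a node
  • xml_text() to get the text within a tag
  • xml_attrs() to get a character vector of a node's attributes and values, which can be converted to a named list with as.list()

Note that you still need to make sure you rm() the XML node objects once you are done with them and force a garbage collection with gc(), but the memory then actually is released to the OS (disclaimer: only tested on Windows 7, but that seems to be the most "memory leaky" platform anyway).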
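As a rough sketch of how those functions fit together (not a benchmark, and not a claim about how much memory is returned), the snippet below parses a small inline document, reads text and attributes, then drops the objects and collects. The XML string is made up for this example, and the code only runs if xml2 is installed.

```r
if (requireNamespace("xml2", quietly = TRUE)) {
  # Made-up inline document; read_xml() also accepts a file path or URL.
  doc <- xml2::read_xml(
    '<cars><car name="Mazda RX4" cyl="6">21.0</car><car name="Datsun 710" cyl="4">22.8</car></cars>'
  )

  kids   <- xml2::xml_children(doc)              # node set of the <car> elements
  values <- xml2::xml_text(kids)                 # text inside each tag
  attrs  <- as.list(xml2::xml_attrs(kids[[1]]))  # named list of the first node's attributes

  # Drop the node objects and the document, then force a collection.
  rm(kids, doc)
  invisible(gc())
}
```

After this, `values` and `attrs` are plain R objects that no longer depend on the parsed document.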

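Hope this helps somebody!

Following up on Matthew Wise's answer above about using xml2, I found that the function that really frees the memory is xml_remove() followed by gc(), not rm().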
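A minimal sketch of that pattern, assuming xml2 is installed; the inline document and the XPath are made up for illustration. Passing free = TRUE asks xml2 to release the nodes' memory as well, which means the removed node objects must not be touched afterwards.

```r
if (requireNamespace("xml2", quietly = TRUE)) {
  doc   <- xml2::read_xml("<root><a>1</a><a>2</a></root>")
  nodes <- xml2::xml_find_all(doc, "//a")
  n_before <- xml2::xml_length(doc)     # number of child elements of the root

  xml2::xml_remove(nodes, free = TRUE)  # detach the nodes and free their memory
  rm(nodes)                             # the R handles are unusable after free = TRUE
  n_after <- xml2::xml_length(doc)

  rm(doc)
  invisible(gc())
}
```

Whether this returns more memory to the OS than rm() plus gc() alone is the observation reported here, not something this snippet demonstrates.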

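Since nothing much has happened since I posted this question, I thought I would raise attention again.

Here is a slightly updated version of my investigation.

Preliminaries

    require("rvest")
    require("XML")

Functions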

    getTaskMemoryByPid <- function(
      pid = Sys.getpid()
    ) {
      cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
      mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
      mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
      mem
    }

    getCurrentMemoryStatus <- function() {
      mem_os <- getTaskMemoryByPid()
      mem_r <- memory.size()
      prof_1 <- memory.profile()
      list(r = mem_r, os = mem_os, ratio = mem_os/mem_r)
    }

    memoryLeak <- function(
      x = system.file("exampleData", "mtcars.xml", package="XML"),
      n = 10000,
      use_text = FALSE,
      xpath = FALSE,
      free_doc = FALSE,
      clean_up = FALSE,
      detailed = FALSE,
      use_rvest = FALSE,
      user_agent = httr::user_agent("Mozilla/5.0")
    ) {
      if(use_text) {
        x <- readLines(x)
      }

      ## Before //
      prof_1 <- memory.profile()
      mem_before <- getCurrentMemoryStatus()

      ## Per run //
      mem_perrun <- lapply(1:n, function(ii) {
        doc <- if (!use_rvest) {
          xmlParse(x, asText = use_text)
        } else {
          if (file.exists(x)) {
            ## From disk //
            rvest::html(x)
          } else {
            ## From web //
            rvest::html_session(x, user_agent)
          }
        }
        if (xpath) {
          res <- xpathApply(doc = doc, path = "/blah", fun = xmlValue)
          rm(res)
        }
        if (free_doc) {
          free(doc)
        }
        rm(doc)
        out <- NULL
        if (detailed) {
          out <- list(
            profile = memory.profile(),
            size = memory.size()
          )
        }
        out
      })
      has_perrun <- any(sapply(mem_perrun, length) > 0)
      if (!has_perrun) {
        mem_perrun <- NULL
      }

      ## Garbage collect //
      mem_gc <- NULL
      if(clean_up) {
        gc()
        tmp <- gc()
        mem_gc <- list(gc_mb = tmp["Ncells", "(Mb)"])
      }

      ## After //
      prof_2 <- memory.profile()
      mem_after <- getCurrentMemoryStatus()

      ## Return value //
      if (detailed) {
        list(
          before = mem_before,
          perrun = mem_perrun,
          gc = mem_gc,
          after = mem_after,
          comparison_r = data.frame(
            before = prof_1,
            after = prof_2,
            increase = round((prof_2/prof_1)-1, 4)
          ),
          increase_r = (mem_after$r/mem_before$r)-1,
          increase_os = (mem_after$os/mem_before$os)-1
        )
      } else {
        list(
          before_after = data.frame(
            r = c(mem_before$r, mem_after$r),
            os = c(mem_before$os, mem_after$os)
          ),
          increase_r = (mem_after$r/mem_before$r)-1,
          increase_os = (mem_after$os/mem_before$os)-1
        )
      }
    }

Memory status before any requests

    getCurrentMemoryStatus()
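A side note for anyone trying to reproduce this elsewhere: memory.size() is Windows-specific and, as far as I know, no longer functional in recent R releases. A rough R-side figure can instead be derived from gc(). This is a substitute under stated assumptions (it triggers a collection itself and only counts memory that R's own allocator tracks), and `r_memory_mb` is a name invented here.

```r
# R-side memory in Mb from gc(): the second column of the returned matrix
# holds the "used (Mb)" figures for Ncells and Vcells.
r_memory_mb <- function() {
  usage <- gc()
  sum(usage[, 2])
}

r_memory_mb()
```

Memory allocated by C libraries outside R's heap does not appear in this number, so the OS-side measurement is still needed to see the discrepancy discussed here.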

Generating additional offline example content

    s <- html_session("http://had.co.nz/")
    tmp <- capture.output(httr::content(s$response))
    write(tmp, file = "hadley.html")
    # html("hadley.html")

    s <- html_session(
      "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd",
      httr::user_agent("Mozilla/5.0"))
    tmp <- capture.output(httr::content(s$response))
    write(tmp, file = "amazon.html")
    # html("amazon.html")

    getCurrentMemoryStatus()

Profiling

    ################
    ## Mtcars.xml ##
    ################

    res <- memoryLeak(n = 50000, detailed = FALSE)
    fpath <- file.path(tempdir(), "memory-profile-1.1.rdata")
    save(res, file = fpath)

    res <- memoryLeak(n = 50000, clean_up = TRUE, detailed = FALSE)
    fpath <- file.path(tempdir(), "memory-profile-1.2.rdata")
    save(res, file = fpath)

    res <- memoryLeak(n = 50000, clean_up = TRUE, free_doc = TRUE, detailed = FALSE)
    fpath <- file.path(tempdir(), "memory-profile-1.3.rdata")
    save(res, file = fpath)

    ###################
    ## www.had.co.nz ##
    ###################

    ## Offline //
    res <- memoryLeak(x = "hadley.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
    fpath <- file.path(tempdir(), "memory-profile-2.1.rdata")
    save(res, file = fpath)

    res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE,
      detailed = FALSE, use_rvest = TRUE)
    fpath <- file.path(tempdir(), "memory-profile-2.2.rdata")
    save(res, file = fpath)

    res <- memoryLeak(x = "hadley.html", n = 50000, clean_up = TRUE,
      free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
    fpath <- file.path(tempdir(), "memory-profile-2.3.rdata")
    save(res, file = fpath)

    ## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
    .url <- "http://had.co.nz/"
    res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
    fpath <- file.path(tempdir(), "memory-profile-3.1.rdata")
    save(res, file = fpath)

    res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
    fpath <- file.path(tempdir(), "memory-profile-3.2.rdata")
    save(res, file = fpath)

    res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, free_doc = TRUE,
      detailed = FALSE, use_rvest = TRUE)
    fpath <- file.path(tempdir(), "memory-profile-3.3.rdata")
    save(res, file = fpath)

    ####################
    ## www.amazon.com ##
    ####################

    ## Offline //
    res <- memoryLeak(x = "amazon.html", n = 50000, detailed = FALSE, use_rvest = TRUE)
    fpath <- file.path(tempdir(), "memory-profile-4.1.rdata")
    save(res, file = fpath)

    res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE,
      detailed = FALSE, use_rvest = TRUE)
    fpath <- file.path(tempdir(), "memory-profile-4.2.rdata")
    save(res, file = fpath)

    res <- memoryLeak(x = "amazon.html", n = 50000, clean_up = TRUE,
      free_doc = TRUE, detailed = FALSE, use_rvest = TRUE)
    fpath <- file.path(tempdir(), "memory-profile-4.3.rdata")
    save(res, file = fpath)

    ## Online (PLEASE USE "POLITE" VALUE FOR `n`!!!) //
    ## Saved as 5.x so the offline 4.x results above are not overwritten.
    .url <- "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=ssd"
    res <- memoryLeak(x = .url, n = 50, detailed = FALSE, use_rvest = TRUE)
    fpath <- file.path(tempdir(), "memory-profile-5.1.rdata")
    save(res, file = fpath)

    res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, detailed = FALSE, use_rvest = TRUE)
    fpath <- file.path(tempdir(), "memory-profile-5.2.rdata")
    save(res, file = fpath)

    res <- memoryLeak(x = .url, n = 50, clean_up = TRUE, free_doc = TRUE,
      detailed = FALSE, use_rvest = TRUE)
    fpath <- file.path(tempdir(), "memory-profile-5.3.rdata")
    save(res, file = fpath)