How can I make PowerShell parse XML faster, or further optimize my script?

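I have a setup containing 7 million XML files, ranging in size from a few KB to several MB. All in all, that is about 180 GB of XML files. The job I need to do is analyze each XML file and determine whether it contains the string <ref>, and if it does not, move it out of the Chunk folder it currently sits in and into the Referenceless folder.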

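The script I have created works well enough, but it is far too slow for my purposes. At roughly 3 files per second, it is projected to take about 24 days to analyze all 7 million files. Is there anything I can change in my script to squeeze out more performance?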

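To complicate matters further, I do not have the permissions needed to run .PS1 files on my server, so the script has to be runnable as commands typed into PowerShell. If I had the authority to change the permissions, I would.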

    # This script will iterate through the Chunk folders, removing pages that contain no
    # references and putting them into the Referenceless folder.

    # Change this variable to start the program on a different chunk. This is the first
    # command to be run in Windows PowerShell.
    $chunknumber = 1

    # This while loop is the second command to be run in Windows PowerShell.
    # It will stop after completing Chunk 113.
    while($chunknumber -le 113){
        # Jumps the terminal to the correct folder.
        cd C:\Wiki_Pages
        # Creates an index for the chunk being worked on.
        $items = Get-ChildItem -Path "Chunk_$chunknumber"
        echo "Chunk $chunknumber Indexed"
        # Jumps to chunk folder.
        cd C:\Wiki_Pages\Chunk_$chunknumber
        # Loops through the index. Each entry is one of the pages.
        foreach ($page in $items){
            # Creates a variable holding the page's content.
            $content = Get-Content $page
            # If the page has a reference, then it's echoed.
            if($content | Select-String "<ref>" -quiet){echo "Referenced!"}
            # If the page doesn't have a reference, it's copied to Referenceless then deleted.
            else{
                Copy-Item $page C:\Wiki_Pages\Referenceless -force
                Remove-Item $page -force
                echo "Moved to Referenceless!"
            }
        }
        # The chunk number is increased by one and the cycle continues.
        $chunknumber = $chunknumber + 1
    }

I know very little about PowerShell; yesterday was the first time I ever opened the program.

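You will want to add the -ReadCount 0 parameter to your Get-Content command to speed it up (it helps tremendously). I learned this tip from this great article, which shows that running a foreach over a whole file's contents is faster than parsing it through a pipeline.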
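As a minimal sketch of that tip: with -ReadCount 0, Get-Content hands back all lines of the file as one array, which can then be walked with a plain foreach statement instead of being piped to Select-String. The function name Test-HasRef is made up for this illustration, and the -like comparison is my choice to keep the match case-insensitive, like Select-String's default.

```powershell
# Hypothetical helper (the name is an assumption, not from the question).
function Test-HasRef {
    param([string]$Path)

    # -ReadCount 0 returns every line of the file in a single array
    # rather than sending the lines down the pipeline one by one.
    $lines = Get-Content -LiteralPath $Path -ReadCount 0

    # An empty file yields $null, so there is nothing to search.
    if ($null -eq $lines) { return $false }

    # Plain foreach statement; stop at the first line containing <ref>.
    foreach ($line in $lines) {
        if ($line -like '*<ref>*') { return $true }
    }
    return $false
}
```

Inside the question's loop this would replace the Get-Content / Select-String pair, for example `if (Test-HasRef -Path $page.FullName) { echo "Referenced!" }`.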

Also, you can use Set-ExecutionPolicy Bypass -Scope Process to run scripts in your current PowerShell session without needing any extra permissions!
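For illustration, the session-only override might look like the following. The script name Remove-Referenceless.ps1 is a placeholder I made up. One caveat I am fairly sure of: an execution policy enforced through Group Policy takes precedence over the Process scope, so this may not help on a locked-down server.

```powershell
# Applies only to the current PowerShell process; nothing is written
# to the registry and the setting is gone when the window is closed.
Set-ExecutionPolicy -ExecutionPolicy Bypass -Scope Process -Force

# Show the policy at every scope to confirm what is in effect.
Get-ExecutionPolicy -List

# Placeholder script name (an assumption for this example).
# .\Remove-Referenceless.ps1
```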

The PowerShell pipeline can be significantly slower than native system calls.

PowerShell: Pipeline Performance

In that article, a performance test is run between two equivalent commands, one executed in PowerShell and one in the classic Windows command prompt.

    PS> grep [0-9] numbers.txt | wc -l > $null
    CMD> cmd /c "grep [0-9] numbers.txt | wc -l > nul"

Here is a sample of its output.

    PS C:\temp> 1..5 | % { .\perf.ps1 ([Math]::Pow(10, $_)) }

    10 iterations
       30 ms  (   0 lines / ms)  grep in PS
       15 ms  (   1 lines / ms)  grep in cmd.exe

    100 iterations
       28 ms  (   4 lines / ms)  grep in PS
       12 ms  (   8 lines / ms)  grep in cmd.exe

    1000 iterations
      147 ms  (   7 lines / ms)  grep in PS
       11 ms  (  89 lines / ms)  grep in cmd.exe

    10000 iterations
     1347 ms  (   7 lines / ms)  grep in PS
       13 ms  ( 786 lines / ms)  grep in cmd.exe

    100000 iterations
    13410 ms  (   7 lines / ms)  grep in PS
       22 ms  (4580 lines / ms)  grep in cmd.exe
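The article's perf.ps1 is not reproduced here. As a rough stand-in that needs no grep or wc, the sketch below times the same kind of digit match done once through the pipeline and once with a plain foreach statement over lines already in memory. The test data and variable names are invented for the example, and the timings will vary from machine to machine.

```powershell
# Invented test data: two out of every three lines contain a digit.
$lines = foreach ($i in 1..50000) {
    if ($i % 3 -eq 0) { 'no digits here' } else { "line $i" }
}

# Variant 1: send every line through the pipeline.
$sw = [System.Diagnostics.Stopwatch]::StartNew()
$pipelineCount = @($lines | Where-Object { $_ -match '[0-9]' }).Count
$pipelineMs = $sw.Elapsed.TotalMilliseconds

# Variant 2: plain foreach statement, no pipeline involved.
$sw.Reset(); $sw.Start()
$loopCount = 0
foreach ($line in $lines) {
    if ($line -match '[0-9]') { $loopCount++ }
}
$loopMs = $sw.Elapsed.TotalMilliseconds

'{0,8:N0} ms  pipeline ({1} matches)' -f $pipelineMs, $pipelineCount
'{0,8:N0} ms  foreach  ({1} matches)' -f $loopMs, $loopCount
```

Both variants should report the same number of matches; only the elapsed time is expected to differ.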

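Edit: My original answer to this question mentioned pipeline performance along with several other suggestions. To keep this post concise, I have removed the suggestions that had nothing to do with pipeline performance.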

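Before you start optimizing, you need to determine where optimization is actually needed. Are you I/O bound (how long does it take to read each file)? Memory bound (probably not)? CPU bound (how long does it take to search the content)?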
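One way to get a first answer is to time the read and the search separately on a few representative files. The sketch below is only an assumption about how that could be done; the function name Measure-PageSteps and its output properties are invented for the example.

```powershell
# Hypothetical helper: times reading and searching one file separately.
function Measure-PageSteps {
    param([string]$Path)

    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    # Step 1 (I/O): read the whole file as one array of lines.
    $lines = Get-Content -LiteralPath $Path -ReadCount 0
    $readMs = $sw.Elapsed.TotalMilliseconds

    $sw.Reset(); $sw.Start()
    # Step 2 (CPU): -like on an array returns the matching lines,
    # so casting to [bool] is true when at least one line matched.
    $hasRef = [bool]($lines -like '*<ref>*')
    $searchMs = $sw.Elapsed.TotalMilliseconds

    New-Object PSObject -Property @{
        ReadMs   = $readMs
        SearchMs = $searchMs
        HasRef   = $hasRef
    }
}
```

If ReadMs dominates across a sample of files, the job is probably I/O bound and a faster search will not help much. Keep in mind that a second run over the same file may be served from the file system cache and look faster than a cold read.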

You say these are XML files; have you tested reading each file into an XML object (instead of plain text) and locating the <ref> node via XPath? You would have:

    $content = [xml](Get-Content $page)
    # If the page has a reference, then it's echoed.
    # SelectSingleNode returns $null when nothing matches, so no -quiet switch is needed.
    if($content.SelectSingleNode("//ref")){echo "Referenced!"}
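A slightly fuller sketch of the same idea follows, under a few assumptions of mine: the function name Test-HasRefNode is invented, the file is loaded with XmlDocument.Load instead of the [xml] cast, and the XPath uses local-name() so that a default XML namespace on the document does not hide the element. Note that this also matches elements such as `<ref name="x">`, which the plain-text search for the exact string <ref> would miss, and that a malformed file makes Load throw.

```powershell
# Hypothetical helper: true when the document contains a <ref> element.
function Test-HasRefNode {
    param([string]$Path)

    # XmlDocument.Load resolves relative paths against the process
    # working directory, not PowerShell's current location, so pass
    # a fully resolved path.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $doc = New-Object System.Xml.XmlDocument
    $doc.Load($fullPath)

    # Namespace-agnostic XPath: any element whose local name is "ref".
    $node = $doc.SelectSingleNode("//*[local-name()='ref']")
    return ($null -ne $node)
}
```

Whether this beats the text search is something to measure; building a full DOM for a multi-MB file is more work than scanning lines for a substring.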

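If you have the CPU, memory and I/O resources to spare, you may see some improvement by searching multiple files in parallel. See the discussion on running several jobs in parallel. Obviously you cannot run a huge number at once, but with some testing you can find the sweet spot (probably somewhere around 3-5). Everything inside the foreach ($page in $items){ loop would become the script block for the job.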

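I would try using the Start-Job cmdlet to parse 5 files at a time. There are many excellent articles about PowerShell jobs. If for some reason that does not help, and you are hitting I/O or genuine resource bottlenecks, you could even use Start-Job and WinRM to spin up workers on other machines.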
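A minimal sketch of the job-based approach, with several assumptions: the function name Find-ReferencelessPages and its parameters are invented, the files are dealt into batches so that each job handles many files (starting one job per file would likely cost more than it saves), and the jobs only report which files lack <ref> instead of moving them. Paths should be absolute, because a background job does not necessarily start in the caller's current directory.

```powershell
# Hypothetical helper: returns the paths of files that contain no <ref>.
function Find-ReferencelessPages {
    param(
        [string[]]$Paths,      # absolute file paths
        [int]$JobCount = 5     # number of parallel background jobs
    )
    if (-not $Paths) { return }

    # Deal the paths round-robin into at most $JobCount batches.
    $batches = @{}
    for ($i = 0; $i -lt $Paths.Count; $i++) {
        $key = $i % $JobCount
        if (-not $batches.ContainsKey($key)) {
            $batches[$key] = New-Object System.Collections.ArrayList
        }
        [void]$batches[$key].Add($Paths[$i])
    }

    # Start one background job per batch.
    $jobs = foreach ($batch in $batches.Values) {
        $batchArray = $batch.ToArray()
        # The leading comma passes the array as ONE argument.
        Start-Job -ArgumentList (,$batchArray) -ScriptBlock {
            param($batchPaths)
            foreach ($path in $batchPaths) {
                $lines = Get-Content -LiteralPath $path -ReadCount 0
                # -like on an array returns the matching lines;
                # an empty result means the file has no <ref>.
                if (-not ($lines -like '*<ref>*')) { $path }
            }
        }
    }

    # Wait for all jobs, collect their output, then clean up.
    $jobs | Wait-Job | Receive-Job
    $jobs | Remove-Job
}
```

For one chunk this could be called as `Find-ReferencelessPages -Paths (Get-ChildItem C:\Wiki_Pages\Chunk_1 | ForEach-Object { $_.FullName })`, and the returned paths then copied or moved to the Referenceless folder from the main session.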