I have a setup that contains 7 million XML files, varying in size from a few KB to multiple MB; all in all, it's about 180 GB of XML files. The job I need performed is to analyze each XML file and determine whether it contains the string <ref>, and, if it does not, to move it out of the Chunk folder it is currently in and into the Referenceless folder.
The script I have created works well enough, but it's extremely slow for my purposes. It's slated to finish analyzing all 7 million files in about 24 days, going at a rate of about 3 files per second. Is there anything I can change in my script to eke out more performance?
Also, to make matters even more complicated, I do not have the correct permissions on my server box to run .PS1 files, so the script needs to be runnable from the PowerShell prompt as a single command. I would set the permissions myself if I had the authorization to.
# This script will iterate through the Chunk folders, removing pages that contain no
# references and putting them into the Referenceless folder.
# Change this variable to start the program on a different chunk. This is the first
# command to be run in Windows PowerShell.
$chunknumber = 1
# This while loop is the second command to be run in Windows PowerShell. It will stop after completing Chunk 113.
while ($chunknumber -le 113) {
    # Jumps the terminal to the correct folder.
    cd C:\Wiki_Pages
    # Creates an index for the chunk being worked on.
    $items = Get-ChildItem -Path "Chunk_$chunknumber"
    echo "Chunk $chunknumber Indexed"
    # Jumps to the chunk folder.
    cd C:\Wiki_Pages\Chunk_$chunknumber
    # Loops through the index. Each entry is one of the pages.
    foreach ($page in $items) {
        # Creates a variable holding the page's content.
        $content = Get-Content $page
        # If the page has a reference, then it's echoed.
        if ($content | Select-String "<ref>" -Quiet) { echo "Referenced!" }
        # If the page doesn't have a reference, it's copied to Referenceless, then deleted.
        else {
            Copy-Item $page C:\Wiki_Pages\Referenceless -Force
            Remove-Item $page -Force
            echo "Moved to Referenceless!"
        }
    }
    # The chunk number is increased by one and the cycle continues.
    $chunknumber = $chunknumber + 1
}
I have very little knowledge of PowerShell; yesterday was the first time I had ever even opened the program.
4 Answers
#1
4
You will want to add the -ReadCount 0 argument to your Get-Content commands to speed them up (it helps tremendously). I learned this tip from this great article, which shows that running a foreach over a whole file's contents is faster than trying to parse it through a pipeline.
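Applied to the loop body in the question, that might look roughly like this (a sketch only; $page and the "<ref>" test come from the original script):
# Read the whole file in one batch instead of streaming it line by line through the pipeline.
$content = Get-Content $page -ReadCount 0
$referenced = $false
foreach ($line in $content) {
    if ($line -like "*<ref>*") { $referenced = $true; break }
}
if ($referenced) { echo "Referenced!" }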
Also, you can use Set-ExecutionPolicy Bypass -Scope Process to run scripts in your current PowerShell session without needing extra permissions!
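For example (the script file name below is just a placeholder for whatever you save the script as):
# Relax the execution policy for this PowerShell session only, then run the script file normally.
Set-ExecutionPolicy Bypass -Scope Process
.\Find-Referenceless.ps1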
#2
2
The PowerShell pipeline can be markedly slower than native system calls.
PowerShell: pipeline performance
In this article, a performance test is run between two equivalent commands, one executed in PowerShell and one in a classic Windows command prompt.
PS> grep [0-9] numbers.txt | wc -l > $null
CMD> cmd /c "grep [0-9] numbers.txt | wc -l > nul"
Here's a sample of its output.
PS C:\temp> 1..5 | % { .\perf.ps1 ([Math]::Pow(10, $_)) }
10 iterations
30 ms ( 0 lines / ms) grep in PS
15 ms ( 1 lines / ms) grep in cmd.exe
100 iterations
28 ms ( 4 lines / ms) grep in PS
12 ms ( 8 lines / ms) grep in cmd.exe
1000 iterations
147 ms ( 7 lines / ms) grep in PS
11 ms ( 89 lines / ms) grep in cmd.exe
10000 iterations
1347 ms ( 7 lines / ms) grep in PS
13 ms ( 786 lines / ms) grep in cmd.exe
100000 iterations
13410 ms ( 7 lines / ms) grep in PS
22 ms (4580 lines / ms) grep in cmd.exe
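In the same spirit, one way to hand the actual string search to a native tool inside your loop is findstr.exe. This is only a sketch, reusing $page and the Referenceless path from the question, with Move-Item standing in for the Copy-Item/Remove-Item pair:
# findstr exits with 0 when the literal string is found and 1 when it is not.
# /m prints only matching file names; /c: treats the pattern as a literal string.
findstr /m /c:"<ref>" $page.FullName | Out-Null
if ($LASTEXITCODE -eq 1) {
    Move-Item $page.FullName C:\Wiki_Pages\Referenceless -Force
}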
EDIT: The original answer to this question mentioned pipeline performance along with some other suggestions. To keep this post succinct I've removed the other suggestions that didn't actually have anything to do with pipeline performance.
#3
1
Before you start optimizing, you need to determine exactly where you need to optimize. Are you I/O bound (how long it takes to read each file)? Memory bound (likely not)? CPU bound (time to search the content)?
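A quick way to get a feel for this is Measure-Command on a handful of representative files, timing the read alone and then the read plus the search (the sample path below is hypothetical):
# Time just the read, then a read-plus-search, on one sample file.
$sample = "C:\Wiki_Pages\Chunk_1\some_page.xml"
Measure-Command { Get-Content $sample -ReadCount 0 | Out-Null }
Measure-Command { Select-String -Path $sample -Pattern "<ref>" -Quiet | Out-Null }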
You say these are XML files; have you tested reading the files into an XML object (instead of plain text) and locating the <ref> node via XPath? You would then have:
$content = [xml](Get-Content $page)
# If the page has a reference, then it's echoed.
if ($content.SelectSingleNode("//ref")) { echo "Referenced!" }
If you have CPU, memory & I/O resources to spare, you may see some improvement by searching multiple files in parallel. See this discussion on running several jobs in parallel. Obviously you can't run a large number simultaneously, but with some testing you can find the sweet spot (probably in the neighborhood of 3-5). Everything inside foreach ($page in $items){ would be the scriptblock for the job.
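For concreteness, here is a rough sketch of what that scriptblock might look like, with the per-file work from the question wrapped up so the file path can be passed in as an argument ($checkPage is just a name chosen for this sketch):
# Hypothetical per-file scriptblock for use with Start-Job; it repeats the loop body from the question.
$checkPage = {
    param($pagePath)
    $content = Get-Content $pagePath -ReadCount 0
    if (-not ($content | Select-String "<ref>" -Quiet)) {
        Copy-Item $pagePath C:\Wiki_Pages\Referenceless -Force
        Remove-Item $pagePath -Force
    }
}
# Example: run the check for a single page as a background job.
Start-Job -ScriptBlock $checkPage -ArgumentList $page.FullName | Out-Null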
#4
0
I would experiment with parsing 5 files at once using the Start-Job cmdlet. There are many excellent articles on PowerShell Jobs. If for some reason that doesn't help, and you're experiencing I/O or actual resource bottlenecks, you could even use Start-Job and WinRM to spin up workers on other machines.
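If you do try this, you will want to cap how many jobs run at once. A hedged sketch of one way to throttle to roughly 5 concurrent jobs, assuming a per-file scriptblock like the $checkPage one sketched above (in practice you would also batch many files into each job, since Start-Job has noticeable per-job overhead):
# Start one job per file, but never let more than 5 run at the same time.
foreach ($page in $items) {
    while ((Get-Job -State Running).Count -ge 5) {
        Start-Sleep -Milliseconds 200
    }
    Start-Job -ScriptBlock $checkPage -ArgumentList $page.FullName | Out-Null
}
# Wait for the remaining jobs to finish, then clean up.
Get-Job | Wait-Job | Remove-Job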