Input file:
column1;column2;column3
data1a;data2a;data3a
data1b;data2b;data3b
Goal: output file with reordered columns, say
column1;column3;column2
...
UPDATED Question: What is a good way of using PowerShell to solve this problem? I am aware of the CSV-related cmdlets, but they have limitations. Note that the order of records does not need to change, so loading the entire input/output file into memory should not be necessary.
5 Answers
#1
17
Here is a solution suitable for millions of records (assuming your data does not contain embedded ';' characters):
$reader = [System.IO.File]::OpenText('data1.csv')
$writer = New-Object System.IO.StreamWriter 'data2.csv'
for(;;) {
    $line = $reader.ReadLine()
    if ($null -eq $line) {
        break
    }
    $data = $line.Split(";")
    $writer.WriteLine('{0};{1};{2}', $data[0], $data[2], $data[1])
}
$reader.Close()
$writer.Close()
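A small variation (a sketch of my own, not part of the original answer): wrapping the loop in try/finally guarantees that both file handles are released even if a read or write throws partway through:
$reader = [System.IO.File]::OpenText('data1.csv')
$writer = New-Object System.IO.StreamWriter 'data2.csv'
try {
    for(;;) {
        $line = $reader.ReadLine()
        if ($null -eq $line) { break }
        # emit column1;column3;column2
        $data = $line.Split(';')
        $writer.WriteLine('{0};{1};{2}', $data[0], $data[2], $data[1])
    }
}
finally {
    # runs even on error, so the files are never left open
    $reader.Close()
    $writer.Close()
}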
#2
15
Import-CSV C:\Path\To\Original.csv | Select-Object Column1, Column3, Column2 | Export-CSV C:\Path\To\Newfile.csv
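A caveat worth noting (my addition, not part of the original answer): Import-CSV assumes comma delimiters by default, and in Windows PowerShell Export-CSV prepends a #TYPE header line. For the semicolon-separated sample above, something along these lines should work:
Import-CSV C:\Path\To\Original.csv -Delimiter ';' |
    Select-Object column1, column3, column2 |
    Export-CSV C:\Path\To\Newfile.csv -Delimiter ';' -NoTypeInformation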
#3
5
It's great that people came up with solutions based on pure .NET. However, I would fight for simplicity, if possible. That's why I upvoted all of you ;)
Why? I generated 1,000,000 records, stored them in a CSV file, and then reordered the columns. In my case, generating the CSV was much more demanding than the reordering. Look at the results.
It took only 1.8 minutes to reorder the columns. For me that's a pretty decent result. Is it OK for me? -> Yes, I don't need to try to find a quicker solution, it's good enough -> saved my time for some other interesting stuff ;)
# generate some csv; objects have several properties
measure-command {
    1..1mb |
    % {
        $date = get-date
        New-Object PsObject -Property @{
            Column1 = $date
            Column2 = $_
            Column3 = $date.Ticks / $_
            Hour = $date.Hour
            Minute = $date.Minute
            Second = $date.Second
            ReadableTime = $date.ToLongTimeString()
            ReadableDate = $date.ToLongDateString()
        }
    } |
    Export-Csv d:\temp\exported.csv
}
TotalMinutes : 6,100025295
# reorder the columns
measure-command {
    Import-Csv d:\temp\exported.csv |
        Select ReadableTime, ReadableDate, Hour, Minute, Second, Column1, Column2, Column3 |
        Export-Csv d:\temp\exported2.csv
}
TotalMinutes : 2,33151559833333
#4
5
Edit: Benchmarking info below.
I would not use the PowerShell CSV-related cmdlets. I would use either System.IO.StreamReader or Microsoft.VisualBasic.FileIO.TextFieldParser to read the file in line by line, avoiding loading the entire thing into memory, and I would use System.IO.StreamWriter to write it back out. The TextFieldParser internally uses a StreamReader, but it handles parsing delimited fields so you don't have to, making it very useful when the CSV format is not straightforward (e.g., has delimiter characters in quoted fields).
I would also not do this in PowerShell at all, but rather in a .NET application, as it will be much faster than a PowerShell script even if both use the same objects.
Here's a simple C# version, assuming no quoted fields and ASCII encoding:
// requires a reference to Microsoft.VisualBasic.dll
using System.Text;

static class Program {
    static void Main() {
        string source = @"D:\test.csv";
        string dest = @"D:\test2.csv";
        using ( var reader = new Microsoft.VisualBasic.FileIO.TextFieldParser( source, Encoding.ASCII ) ) {
            using ( var writer = new System.IO.StreamWriter( dest, false, Encoding.ASCII ) ) {
                reader.SetDelimiters( ";" );
                while ( !reader.EndOfData ) {
                    var fields = reader.ReadFields();
                    swap( fields, 1, 2 ); // exchange columns 2 and 3
                    writer.WriteLine( string.Join( ";", fields ) );
                }
            }
        }
    }

    // swap two array elements in place
    static void swap( string[] arr, int a, int b ) {
        string t = arr[ a ];
        arr[ a ] = arr[ b ];
        arr[ b ] = t;
    }
}
Here's the PowerShell version:
[void][reflection.assembly]::loadwithpartialname("Microsoft.VisualBasic")
$source = 'D:\test.csv'
$dest = 'D:\test2.csv'
$reader = new-object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$writer = new-object System.IO.StreamWriter $dest
function swap($f,$a,$b){ $t = $f[$a]; $f[$a] = $f[$b]; $f[$b] = $t }
$reader.SetDelimiters(';')
while ( !$reader.EndOfData ) {
    $fields = $reader.ReadFields()
    swap $fields 1 2
    $writer.WriteLine([string]::join(';', $fields))
}
$reader.close()
$writer.close()
I benchmarked both of these against a 3-column csv file with 10,000,000 rows. The C# version took 171.132 seconds (just under 3 minutes). The Powershell version took 2,364.995 seconds (39 minutes, 25 seconds).
Edit: Why mine takes so darn long.
The swap function is a huge bottleneck in my PowerShell version. Replacing it with '{0};{1};{2}'-style output, as in Roman Kuzmin's answer, cut it down to less than 9 minutes. Replacing TextFieldParser more than halved the remaining time, to under 4 minutes.
However, a .NET console app version of Roman Kuzmin's answer took 20 seconds.
#5
1
I'd do it this way:
$new_csv = new-object system.collections.ArrayList
get-content mycsv.csv | % {
    # Add() returns the new index; discard it
    $new_csv.Add((($_ -split ";")[0,2,1]) -join ";") > $null
}
$new_csv | out-file myreordered.csv
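Note that the ArrayList buffers the entire output in memory, which the question hoped to avoid. A streaming variant (my sketch, not part of the original answer) pipes each rewritten line straight to the output file instead:
get-content mycsv.csv |
    % { (($_ -split ';')[0,2,1]) -join ';' } |
    set-content myreordered.csv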