Intereting Posts

什么时候在Azure诊断中使用EndOnDemandTransfer（）？如何避免序列化浮点数组属性模拟阻塞系统调用中的进程从netcatstream到前端为什么使用gunicorn与反向代理？如何使用没有root权限的PAM来validation用户名/密码没有findShell命令的shell脚本。忽略shebang？在* nix系统中创build临时命名的fifo 使用TCPREPLAY将时间戳添加到数据包有效负载有没有Windows徽标添加徽章到任务栏图标？如何将gperf添加到Windows 7 classpath（并在DOS中识别它）如何获取与打开的句柄关联的名称如何select最高质量的照片版本？从bat文件运行Java .jar和Windows .exe DTCPing和DTCTester有什么区别？

如何修剪文件 – 删除具有相同值的列

我想通过删除具有相同值的列修剪文件的帮助。

# the file I have (tab-delimited, millions of columns) jack 1 5 9 john 3 5 0 lisa 4 5 7

 # the file I want (remove the columns with the same value in all lines) jack 1 9 john 3 0 lisa 4 7

你能给我一些关于这个问题的方向吗？我更喜欢sed或awk解决scheme，或者perl解决scheme。

提前致谢。最好，

 #!/usr/bin/perl $/="\t"; open(R,"<","/tmp/filename") || die; while (<R>) { next if (($. % 4) == 3); print; }

那么，这是假设它是第三列。如果它是有价值的：

 #!/usr/bin/perl $/="\t"; open(R,"<","/tmp/filename") || die; while (<R>) { next if (($_ == 5); print; }

随着问题编辑，OP的愿望变得清晰。怎么样：

 #!/usr/bin/perl open(R,"<","/tmp/filename") || die; my $first = 1; my (@cols); while (<R>) { my (@this) = split(/\t/); if ($. == 1) { @cols = @this; } else { for(my $x=0;$x<=$#cols;$x++) { if (defined($cols[$x]) && !($cols[$x] ~~ $this[$x])) { $cols[$x] = undef; } } } next if (($_ == 5)); # print; } close(R); my(@del); print "Deleting columns: "; for(my $x=0;$x<=$#cols;$x++) { if (defined($cols[$x])) { print "$x ($cols[$x]), "; push(@del,$x-int(@del)); } } print "\n"; open(R,"<","/tmp/filename") || die; while (<R>) { chomp; my (@this) = split(/\t/); foreach my $col (@del) { splice(@this,$col,1); } print join("\t",@this)."\n"; } close(R);

这里有一个快速的perl脚本来确定哪些列可以被剪切。

 open FH, "file" or die $!; my @baseline = split /\t/,<FH>; #snag the first row my @linemap = 0..$#baseline; #list all equivalent columns (all of them) while(<FH>) { #loop over the file my @line = split /\t/; @linemap = grep {$baseline[$_] eq $line[$_]} @linemap; #filter out any that aren't equal } print join " ", @linemap; print "\n";

您可以使用上述许多建议来实际删除列。我最喜欢的可能是切割实现，部分原因是上面的perl脚本可以修改为给你精确的命令（甚至为你运行）。

 @linemap = map {$_+1} @linemap; #Cut is 1-index based print "cut --complement -f ".join(",",@linemap)." file\n";

如果你知道哪一列要提前消除，那么cut将是有帮助的：

 cut --complement -d' ' -f 3 filename

据我所知，你想通过每一行，并检查某些列中的值是否有差异，然后我可以删除该列。如果是这种情况，我有一个建议，但没有准备好的脚本，但我想你可以弄明白。你应该看看cut 。它提取部分行。你可以用它来提取第一列，然后在输出的数据上运行uniq ，然后如果唯一后只有一个值，则意味着该列中的所有值都是相同的。这样你可以收集没有差异的列数。你将需要shell脚本来查看你有多少列（我猜是使用head -n 1和计数分隔符的数量），并在每一列上运行这样的过程，将列号存储在数组中，然后在最终的工匠剪切行中删除不感兴趣的专栏。授予它不awk或perl，但应该工作，并将只使用传统的Unix工具。那么你可以在Perl脚本中使用它们，如果你想:)

那么，如果我误解了这个问题也许削减仍然是有用的:)它似乎是一个鲜为人知的工具。

据我所知，你需要做一个多通程序来满足你的需求，而不是通过记忆。对于初学者，将一行文件加载到一个数组中。

 open FH,'datafile.txt' or die "$!"; my @mask; my @first_line= split(/\s+/,<FH>);

然后，你会想顺序阅读其他行

 while(my @next_line= split(/\s+/,<FH>)) { /* compare each member of @first_line to @next_line * any match, make a mark in mask to true */

当到达文件底部时，返回顶部并使用掩码确定要打印的柱。

您可以选择要剪切的列

 # using bash/awk # I had used 1000000 here, as you had written millions of columns but you should adjust it for cols in `seq 2 1000000` ; do cut -d DELIMITER -f $cols FILE | awk -vc=$cols '{s+=$0} END {if (s/NR==$0) {printf("%i,",c)}}' done | sed 's/,$//' > tmplist cut --complement -d DELIMITER -f `cat tmplist` FILE

但它可能会非常慢，因为它没有被优化，并且多次读取文件…所以要注意大文件。

或者你可以用awk读取整个文件一次，然后选择可以打印的列，然后使用剪切。

 cut --complement -d DELIMITER -f `awk '{for (i=1;i<=NF;i++) {sums[i]+=$i}} END {for (i=1;i<=NF; i++) {if (sums[i]/NR==$i) {printf("%i,",c)}}}' FILE | sed 's/,$//'` FILE

HTH

没有完全测试，但这似乎工作提供的测试集，请注意，它破坏了原始文件…

 #!/bin/bash #change 4 below to match number of columns for i in {2..4}; do cut -f $i input | sort | uniq -c > tmp while read ab; do if [ $a -ge 2 ]; then awk -vfield=$i '{$field="_";print}' input > tmp2 $(mv tmp2 input) fi done < tmp done $ cat input jack 1 5 9 john 3 5 0 lisa 4 5 7 $ ./cnt.sh $ cat input jack 1 _ 9 john 3 _ 0 lisa 4 _ 7

使用_使输出更清晰

这里的主要问题是，你说“数百万列”，并没有指定多少行。为了检查每一行中的每个值与其他列中的对应值，您正在查看大量的检查。

当然，你可以减少列的数量，但是你仍然需要检查每一个到最后一行。所以…很多处理。

我们可以从两个第一行开始“种子”散列：

 use strict; use warnings; open my $fh, '<', "inputfile.txt" or die; my %matches; my $line = <$fh>; my $nextline = <$fh>; my $i=0; while ($line =~ s/\t(\d+)//) { my $num1 = $1; if ($nextline =~ s/\t(\d+)//) { if ($1 == $num1) { $matches{$i} = $num1 } } else { die "Mismatched line at line $."; } $i++; }

然后用这个“seed”散列，你可以读取其余的行，并从散列中删除不匹配的值，例如：

 while($line = <$fh>) { my $i = 0; while ($line =~ s/\t(\d+)//) { if (defined $matches{$i}) { $matches{$i} = undef if ($matches{$i} != $1); } $i++; } }

人们可以想象一个解决方案，其中一个已经被证明是独一无二的行被剥离了，但为了做到这一点，你需要做一个行的数组，或做一个正则表达式，我不知道这不会只要简单地穿过字符串就可以同样长。

然后，在处理完所有行之后，您将得到一个包含重复数字值的哈希值，以便您可以重新打开该文件并进行打印：

 open my $fh, '<', "inputfile.txt" or die; open my $outfile, '>', "outfile.txt" or die; while ($line = <$fh>) { my $i = 0; if ($line =~ s/^([^\t]+)(?=\t)//) { print $outfile $1; } else { warn "Missing header at line $.\n"; } while ($line =~ s/(\t\d+)//) { if (defined $matches{$i}) { print $1 } $i++; } print "\n"; }

这是一个相当繁重的操作，这个代码是未经测试的。这会给你一个解决方案的提示，它可能需要一段时间来处理整个文件。我建议运行一些测试，看看它是否适用于你的数据，并调整它。

如果你只有几个匹配的列，那么简单地从行中提取它们就容易得多，但是我犹豫在这么长的行上使用split 。就像是：

 while ($line = <$fh>) { my @line = split /\t/, $line; for my $key (sort { $b <=> $a } keys %matches) { splice @line, $key + 1, 1; } $line = join ("\t", @line); $line =~ s/\n*$/\n/; # awkward way to make sure to get a single newline print $outfile $line; }

请注意，我们必须按降序排序键，以便从最后修剪值。否则，我们搞砸了后续数组的唯一性。