Bash - Swapping values in a column

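I have some CSV/tabular data in a file, like so:

    1,7,3,2
    8,3,8,0
    4,9,5,3
    8,5,7,3
    5,6,1,9

(They're not always numbers, just random comma-separated values. Single-digit numbers are easier for an example, though.)

I want to shuffle a random 40% of any one of the columns. As an example, say the third one. So perhaps 3 and 1 get swapped with each other. Now the third column is:

    1 << Came from the last position
    8
    5
    7
    3 << Came from the first position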

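I am trying to do this in place, in a file, from within a bash script that I am working on, and I am not having much luck. I keep wandering down some pretty crazy and fruitless grep rabbit holes that leave me thinking I'm going the wrong way (the constant failure is what's tipping me off).

I tagged this question with a litany of things because I'm not entirely sure which tool(s) I should even be using for this.

Edit: I'm probably going to end up accepting Rubens' answer, however wacky it is, because it directly contains the swapping concept (which I guess I could have emphasized more in the original question), and it lets me specify a percentage of the column to swap. It also happens to work, which is always a plus.

For anyone who doesn't need that and just wants a basic shuffle, Jim Garrison's answer also works (I tested it).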

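A word of warning on Rubens' solution, however. I took this:

    for (i = 1; i <= NF; ++i) {
        delim = (i != NF) ? "," : "";
        ...
    }
    printf "\n";

removed the printf "\n"; and moved the newline character like this:

    for (i = 1; i <= NF; ++i) {
        delim = (i != NF) ? "," : "\n";
        ...
    }

because having just "" in the else case was causing awk to write broken characters (\00) at the end of each line. At one point, it even managed to replace my entire file with Chinese characters. Although, honestly, that probably involved me doing something extra stupid on top of this problem.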
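The \00 bytes are consistent with the separator being formatted with %c while it holds an empty string, which some awk implementations appear to turn into a NUL byte. Here is a minimal sketch of a variant that avoids the question by formatting the separator with %s, so an empty string prints nothing. The function name print_fields is made up for illustration.

```shell
# print_fields: read comma-separated lines on stdin and print them back,
# building each line field by field the way the loop above does.
print_fields() {
    awk -F, '{
        for (i = 1; i <= NF; ++i) {
            # Comma between fields, newline after the last one.
            delim = (i != NF) ? "," : "\n"
            # %s prints the separator as a string, whatever its length.
            printf "%s%s", $i, delim
        }
    }'
}

printf '1,7,3,2\n8,3,8,0\n' | print_fields
```

With %s, either placement of the newline (inside delim or in a separate printf "\n") should behave the same.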

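Algorithm:

Create a vector of n pairs, pairing each line number from 1 to the number of lines with that line's value in the selected column, then sort the vector randomly;
Find how many lines should be randomized: num_random = percentage * num_lines / 100;
Select the first num_random entries from the randomized vector;
You may sort the selected lines randomly again, but they should already be in random order;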
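A quick worked example of the num_random formula may help, since the division is an integer one and rounds down. The sketch below uses shell arithmetic rather than bc (bc with its default scale of 0 truncates the same way), and the function name num_random is borrowed from the formula purely for illustration.

```shell
# num_random PERCENTAGE NUM_LINES: how many lines will be shuffled.
# Shell arithmetic is integer-only, so the quotient is rounded down.
num_random() {
    echo $(( $1 * $2 / 100 ))
}

num_random 40 5    # prints 2
num_random 30 5    # prints 1 (150 / 100, rounded down)
num_random 10 5    # prints 0, so nothing would be shuffled
```

For small files, low percentages can therefore select fewer lines than expected, or none at all.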

Printing output:

    i = 0
    for num_line, value in column; do
        if num_line not in random_vector:
            print value;              # printing non-randomized value
        else:
            print random_vector[i];   # randomized entry
            i++;
    done

Implementation:

    #!/bin/bash

    infile=$1
    col=$2
    n_lines=$(wc -l < "$infile")
    # Number of lines to shuffle (bc truncates the division).
    prob=$(bc <<< "$3 * $n_lines / 100")

    # Selected lines: "line_number,value" pairs in random order
    tmp=$(mktemp)
    paste -d ',' <(seq 1 "$n_lines") <(cut -d ',' -f "$col" "$infile") \
        | sort -R | head -n "$prob" > "$tmp"

    # Rewriting file
    awk -v "col=$col" -F "," '
        # First file: remember the selected line numbers and keep the
        # values in the shuffled order produced by sort -R.
        (NR == FNR) {id[$1]; value[NR] = $2; next}
        (FNR == 1) {c = 1}
        {
            for (i = 1; i <= NF; ++i) {
                delim = (i != NF) ? "," : "";
                # %s rather than %c for the delimiter: %c with an empty
                # string can emit a NUL byte in some awk implementations.
                if (i == col && (FNR in id)) {printf "%s%s", value[c++], delim;}
                else {printf "%s%s", $i, delim;}
            }
            printf "\n";
        }
    ' "$tmp" "$infile"

    rm "$tmp"

If you want something close to in-place editing, you can use sponge to pipe the output back into the input file.
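As a rough sketch of what that could look like: the helper below (rewrite_in_place is a hypothetical name, not part of the script) runs a command and replaces a file with its output. It uses sponge when it is installed and otherwise falls back to a temporary file, on the assumption that mktemp is available.

```shell
# rewrite_in_place FILE COMMAND [ARGS...]
# Run COMMAND (which is expected to read FILE itself) and replace the
# contents of FILE with whatever COMMAND prints.
rewrite_in_place() {
    file=$1
    shift
    if command -v sponge >/dev/null 2>&1; then
        # sponge soaks up all of its input before it writes the file.
        "$@" | sponge "$file"
    else
        # Fallback: collect the output first, then copy it over FILE.
        tmp=$(mktemp) && "$@" > "$tmp" && cat "$tmp" > "$file"
        rm -f "$tmp"
    fi
}
```

With the script above this would be called as `rewrite_in_place infile ./script.sh infile 3 40`. A plain `./script.sh infile 3 40 > infile` would not work, because the shell truncates infile before the script gets to read it.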

Execution:

To execute, simply use:

    $ ./script.sh <inpath> <column> <percentage>

For example:

    $ ./script.sh infile 3 40
    1,7,3,2
    8,3,8,0
    4,9,1,3
    8,5,7,3
    5,6,5,9

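Conclusion:

This lets you select a column, randomly reorder a percentage of the entries in that column, and put the new column back in place of the original one.

This script is proof like no other, not only that shell scripting is extremely entertaining, but also that there are cases where it definitely should not be used. (: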

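This only works for one specifically designated column, but it should be enough to point you in the right direction. It works on modern bash shells, including Cygwin:

    paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)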

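The operative feature is "process substitution".

The paste command joins files horizontally. The three pieces are split out of the original file with cut, and the second piece (the column to be randomized) is run through the shuf command to reorder its lines. Here is the output from a couple of runs:

    $ cat test.dat
    1,7,3,2
    8,3,8,0
    4,9,5,3
    8,5,7,3
    5,6,1,9

    $ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
    1,7,1,2
    8,3,8,0
    4,9,7,3
    8,5,3,3
    5,6,5,9

    $ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
    1,7,8,2
    8,3,1,0
    4,9,3,3
    8,5,7,3
    5,6,5,9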
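The one-liner hard-codes column 3. Here is a sketch of one way to generalize it to any column number, assuming GNU shuf is available; the function name shuffle_column is made up here. Instead of cutting the file into three pieces, it pastes the shuffled column in front of each line and lets awk move it into position, which also covers the first and last columns. It uses a plain pipe and paste's `-` operand rather than process substitution.

```shell
# shuffle_column FILE COL: print FILE with comma-separated column COL shuffled.
shuffle_column() {
    cut -d, -f"$2" "$1" | shuf | paste -d, - "$1" |
        awk -F, -v OFS=, -v col="$2" '{
            $(col + 1) = $1        # move the shuffled value into place
            sub(/^[^,]*,/, "")     # drop the helper field at the front
            print
        }'
}
```

For the sample file, `shuffle_column test.dat 3` should give the same kind of output as the paste command above. Like that command, it shuffles the whole column rather than a percentage of it.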

I'd use a 2-pass approach: first get a count of the lines and read the file into an array, then use awk's rand() function to generate random numbers that identify the lines you'll change, then rand() again to determine which pairs of those lines you will swap, and then swap the array elements before printing. Something like this PSEUDO-CODE, as a rough algorithm:

    awk -F, -v pct=40 -v col=3 '
    NR == FNR {
        array[++totNumLines] = $0
        next
    }
    FNR == 1 {
        pctNumLines = totNumLines * pct / 100
        srand()
        for (i = 1; i <= (pctNumLines / 2); i++) {
            oldLineNr = rand() * some factor to produce a line number that is in
                        the 1 to totNumLines range but is not already recorded
                        as processed in the "swapped" array.
            newLineNr = ditto plus must not equal oldLineNr
            swap field $col between array[oldLineNr] and array[newLineNr]
            swapped[oldLineNr]
            swapped[newLineNr]
        }
    }
    { print array[FNR] }
    ' "$file" "$file" > tmp && mv tmp "$file"
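The pseudo-code can be fleshed out along the following lines. This is only a sketch under a few assumptions: the names swap_pct, pick and join_fields are illustrative, pct is between 0 and 100 (otherwise pick would run out of unused lines and loop forever), and every line has at least col fields.

```shell
# swap_pct FILE COL PCT: swap column COL between randomly chosen pairs of
# lines so that about PCT percent of the lines are touched. Prints to stdout.
swap_pct() {
    awk -F, -v OFS=, -v col="$2" -v pct="$3" '
    # Pass 1: load every line into an array.
    NR == FNR { line[++n] = $0; next }

    # Start of pass 2: do all the swaps once, before anything is printed.
    FNR == 1 {
        srand()
        pairs = int(n * pct / 100 / 2)
        for (i = 1; i <= pairs; i++) {
            a = pick(); used[a] = 1
            b = pick(); used[b] = 1
            na = split(line[a], fa, FS)
            nb = split(line[b], fb, FS)
            tmp = fa[col]; fa[col] = fb[col]; fb[col] = tmp
            line[a] = join_fields(fa, na)
            line[b] = join_fields(fb, nb)
        }
    }

    { print line[FNR] }

    # Random line number in 1..n that has not been used in a swap yet.
    function pick(   k) {
        do { k = int(rand() * n) + 1 } while (k in used)
        return k
    }

    # Rebuild a line from its first cnt fields.
    function join_fields(f, cnt,   s, j) {
        s = f[1]
        for (j = 2; j <= cnt; j++) s = s OFS f[j]
        return s
    }
    ' "$1" "$1"
}
```

For example, `swap_pct test.dat 3 40 > tmp && mv tmp test.dat` swaps one pair of values in the third column of the five-line sample. Note that srand() with no argument seeds from the clock, so two runs within the same second may pick the same pair.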