Bash: split a file in Linux on blank lines only

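I'm currently parsing some files with a Scala application. The problem is that the files are too large, so the parser always ends up throwing a heap-size exception (I've tried the largest heap size I can, still to no avail).

Right now the files look like this: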

    This is one paragraph
    for Scala to parse

    This is another paragraph
    for Scala to parse

    Yet another paragraph

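And so on. Basically I want to split each of these files into 10 or 20 pieces, but I have to make sure that no paragraph gets cut in half in the results. Is there a way to do this?

Thanks!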

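Here is an awk script that breaks each input file into batch_size chunks (each chunk ends with a superfluous trailing record-separator newline). Put it in a file and make that file executable: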

    #!/usr/bin/awk -f

    # RS="" is paragraph mode: records are separated by blank lines
    BEGIN {RS=""; ORS="\n\n"; last_f=""; batch_size=20}

    # perform setup whenever the filename changes
    # (fnum is reset before incr_out() so every input starts at _1)
    FILENAME!=last_f {r_per_f=calc_r_per_f(); fnum=0; incr_out(); last_f=FILENAME}

    # write a record to an output file
    {print $0 > out}

    # after a batch, change the file name
    (FNR%r_per_f)==0 {incr_out()}

    # function to roll the file name
    function incr_out() {close(out); fnum++; out=FILENAME"_"fnum".out"}

    # function to get the number of records per file
    function calc_r_per_f() {
        cmd=sprintf( "grep \"^$\" %s | wc -l", FILENAME )
        cmd | getline rcnt
        close(cmd)
        # never return 0, which would make FNR%r_per_f divide by zero
        return (rcnt < batch_size) ? 1 : int(rcnt/batch_size)
    }

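You can change the batch_size variable in the BEGIN block to adjust the number of output files per input file, and you can change the output file names themselves by changing the out= assignment in incr_out().

If you put the script in a file named awko, you would run it like awko data1 data2 and get files like data2_7.out. Of course, if your input file names have extensions, the output names get even uglier.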
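Since the requirement is that no paragraph is cut in half, one rough sanity check is to compare the paragraph count of the input with the total paragraph count of the pieces: a paragraph cut in the middle shows up as an extra paragraph. The following is only a sketch under assumptions: the file names and contents are made-up stand-ins for an input and its pieces, and count_paragraphs is a hypothetical helper, not part of the script above.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up stand-ins for an input file and the pieces produced from it.
printf 'a\nb\n\nc\n\nd\n' > data1
printf 'a\nb\n\n' > data1_1.out
printf 'c\n\nd\n\n' > data1_2.out

# count_paragraphs is a hypothetical helper: in paragraph mode (RS="")
# awk's NR at END is the number of paragraphs across all files given.
count_paragraphs() { awk -v RS= 'END { print NR }' "$@"; }

before=$(count_paragraphs data1)
after=$(count_paragraphs data1_*.out)
echo "input: $before paragraphs, pieces: $after paragraphs"
```

Equal counts do not prove the pieces are identical to the input, only that no paragraph boundary was added or lost.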

    csplit file.txt '/^$/' '{*}'

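csplit splits a file into pieces at the lines that match the given pattern.

/^$/ matches an empty line.

{*} repeats the previous pattern as many times as possible.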
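A small demonstration of what that command produces, assuming GNU csplit (the {*} repeat count is a GNU extension). The sample file and the part_ prefix are made up for illustration; the -s, -z and -f options only tidy up the output and are not required.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# sample.txt is a made-up input: three paragraphs separated by blank lines.
printf 'one\ntwo\n\nthree\n\nfour\nfive\n' > sample.txt

# -s: do not print the size of each piece
# -z: do not keep empty output files
# -f part_: name the pieces part_00, part_01, ... instead of xx00, xx01, ...
csplit -s -z -f part_ sample.txt '/^$/' '{*}'

ls part_*
```

Each split happens just before a matching line, so every piece after the first starts with the blank line that separated it from the previous paragraph. Note that this yields one file per paragraph, not a fixed number of files.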

To split every three paragraphs:

    awk 'BEGIN{nParMax=3;npar=0;nFile=0} /^$/{npar++;if(npar==nParMax){nFile++;npar=0;next}} {print $0 > ("foo." nFile)}' foo.orig

To split roughly every 10 lines (at the first blank line once at least 10 lines have been written):

    awk 'BEGIN{nLineMax=10;nline=0;nFile=0} /^$/{if(nline>=nLineMax){nFile++;nline=0;next}} {nline++;print $0 > ("foo." nFile)}' foo.orig

You could use the "split" command, but since you want to split on paragraphs, you can use a script like this:

    awk -v RS="\n\n" 'BEGIN {n=1} {print $0 > ("file" n++ ".txt")}' yourfile.txt

This writes each paragraph of the file to its own file, named "file1.txt", "file2.txt", and so on.

To increment n only once every N paragraphs, you can do this:

    awk -v RS="\n\n" -v ORS="\n\n" 'BEGIN{n=1; i=0; nbp=100} {if (i == nbp) {i=0; n++} i++; print $0 > ("file" n ".txt")}' yourfile.txt

Just change the "nbp" value to set the number of paragraphs per file.
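A multi-character RS such as "\n\n" is not guaranteed by POSIX awk, whereas an empty RS (paragraph mode) is. Below is a minimal sketch of the same batching idea in paragraph mode, which also closes each output file before opening the next one. The names input.txt, chunk_NNN.txt and per_file are assumptions made for this example, not taken from the answers above.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# input.txt is a made-up sample with five paragraphs.
printf 'p1 line1\np1 line2\n\np2\n\np3\n\np4\n\np5\n' > input.txt

# RS="" turns on awk's paragraph mode: records are separated by one or
# more blank lines. per_file (hypothetical name) is paragraphs per chunk.
awk -v RS= -v per_file=2 '
    # Start a new output file before paragraphs 1, 3, 5, ...
    (NR - 1) % per_file == 0 {
        if (out != "") close(out)   # avoid running out of open files
        out = sprintf("chunk_%03d.txt", ++n)
        sep = ""
    }
    # Put a blank line between paragraphs, but not after the last one.
    { printf "%s%s\n", sep, $0 > out; sep = "\n" }
' input.txt

ls chunk_*.txt
```

With per_file=2 the five sample paragraphs end up in three files, the last of which holds a single paragraph. Runs of several blank lines in the input are collapsed to one separator in the output.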