Counting word occurrences from files in bash

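Sorry for the very basic question, but I'm new to bash programming (I started a few days ago). Basically, what I want to do is keep one file that holds all the word occurrences of another file.

I know I can do this: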

    sort | uniq -c | sort
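For a single file, that pipeline could be spelled out roughly as follows. This is only a sketch: the function name `count_words` is made up for illustration, and treating runs of ASCII letters and apostrophes as words is an assumption.

```shell
#!/bin/bash
# count_words: hypothetical helper, not part of any standard tool.
# Reads text on stdin and prints one "count word" line per distinct word.
count_words() {
    tr -cs "A-Za-z'" '\n' |   # turn every run of non-word characters into one newline
    tr 'A-Z' 'a-z' |          # fold to lower case so "The" and "the" match
    sort |                    # uniq only collapses adjacent duplicates
    uniq -c                   # prefix each distinct word with its count
}

# Example call (page.txt is a placeholder file name):
#   count_words < page.txt | sort -rn
```

The final `sort -rn` in the example call orders the result from most to least frequent.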

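The thing is, after that I want to take a second file, count the occurrences again, and update the first one. Then I take a third file, and so on.

What I'm doing at the moment works without any problem (I'm using grep, sed and awk), but it looks pretty slow.

I'm pretty sure there is a very efficient way to do this with just a command or so, using uniq, but I can't figure it out.

Could you point me in the right direction?

I'm also pasting the code I wrote: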

    #!/bin/bash
    # count the number of word occurrences from a file and writes to another file #
    # the words are listed from the most frequent to the less one #

    touch .check            # used to check the occurrances. Temporary file
    touch distribution.txt  # final file with all the occurrences calculated

    page=$1         # contains the file I'm calculating
    occurrences=$2  # temporary file for the occurrences

    # takes all the words from the file $page and orders them by occurrences
    cat $page | tr -cs A-Za-z\' '\n'| tr A-Z a-z > .check

    # loop to update the old file with the new information
    # basically what I do is check word by word and add them to the old file as an update
    cat .check | while read words
    do
        word=${words}    # word I'm calculating
        strlen=${#word}  # word's length
        # I use a black list to not calculate banned words (for example very small ones or inunfluent words, like articles and prepositions
        if ! grep -Fxq $word .blacklist && [ $strlen -gt 2 ]
        then
            # if the word was never found before it writes it with 1 occurrence
            if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
            then
                echo "$word: 1" | cat >> $occurrences
            # else it calculates the occurrences
            else
                old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
                let "new=old+1"
                sed -i "s/^$word: $old$/$word: $new/g" $occurrences
            fi
        fi
    done

    rm .check

    # finally it orders the words
    awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution.txt

Well, I'm not sure I've understood exactly what you're trying to do, but I would do it this way:

    while read file
    do
      cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
    done < file-list

Now you have the statistics for each of your files, and you can simply aggregate them:

    while read file
    do
      cat stat.$file
    done < file-list \
    | sort -k2 \
    | awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}'
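The aggregation step can also be written with an awk associative array, which sums the counts per word without needing the input sorted by word first. A minimal sketch, assuming the input is `count word` lines as produced by `uniq -c`; the name `merge_counts` is hypothetical:

```shell
#!/bin/bash
# merge_counts: hypothetical helper. Reads "count word" lines on stdin
# (for example several concatenated stat.* files) and prints one summed
# line per word, most frequent first, ties broken alphabetically.
merge_counts() {
    awk 'NF == 2 { total[$2] += $1 }                  # add this count to the word total
         END { for (w in total) print total[w], w }' |
    sort -k1,1nr -k2,2                                # awk array order is unspecified, so sort
}

# Example call:
#   cat stat.*.txt | merge_counts | head
```

The `NF == 2` guard skips any line that has a count but no word, which `uniq -c` can produce when the text starts with a non-word character.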

Usage example:

    $ for i in ls bash cp; do man $i > $i.txt ; done
    $ cat <<EOF > file-list
    > ls.txt
    > bash.txt
    > cp.txt
    > EOF
    $ while read file; do
    > cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
    > done < file-list
    $ while read file
    > do
    > cat stat.$file
    > done < file-list \
    > | sort -k2 \
    > | awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}' | sort -rn | head
    3875 the
    1671 is
    1137 to
    1118 a
    1072 of
    793 if
    744 and
    533 command
    514 in
    507 shell
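If the aim is to keep a single running totals file and fold each new page into it, the same approach can be wrapped in a function. This is a sketch under assumptions: the name `update_counts` is made up, and the totals file uses the `count word` layout of `uniq -c` rather than the `word: count` layout from the question.

```shell
#!/bin/bash
# update_counts TOTALS PAGE: hypothetical helper.
# Adds the word counts of PAGE to the running totals stored in TOTALS
# ("count word" per line), creating TOTALS if it does not exist yet.
update_counts() {
    local totals="$1" page="$2" tmp
    tmp=$(mktemp) || return 1
    touch "$totals"
    {
        cat "$totals"                                   # counts accumulated so far
        tr -cs "A-Za-z'" '\n' < "$page" | tr 'A-Z' 'a-z' | sort | uniq -c
    } |
    awk 'NF == 2 { total[$2] += $1 }                    # sum old and new counts per word
         END { for (w in total) print total[w], w }' |
    sort -k1,1nr -k2,2 > "$tmp" && mv "$tmp" "$totals"  # replace only if the pipeline succeeded
}

# Example calls (file names are placeholders):
#   update_counts distribution.txt ls.txt
#   update_counts distribution.txt bash.txt
```

Each call processes the new page once, instead of running grep, awk and sed for every single word, which is where most of the time goes in a word-by-word loop. A blacklist or minimum word length could be added as an extra condition in the awk pattern.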