在UNIX中查找包含字符的所有单词

给定一个单词W，我想从/ usr / dict / words中find包含W中所有字母的单词。例如，“bat”应该返回“bat”和“tab”（而不是“table”）。

这是一个解决scheme，涉及sortinginput单词和匹配：

word=$1 sortedWord=`echo $word | grep -o . | sort | tr -d '\n'` while read line do sortedLine=`echo $line | grep -o . | sort | tr -d '\n'` if [ "$sortedWord" == "$sortedLine" ] then echo $line fi done < /usr/dict/words

有没有更好的办法？我宁愿使用基本的命令（而不是perl / awk等），但所有的解决scheme都欢迎！

为了澄清，我想find原始单词的所有排列。不允许添加或删除字符。

这是一个awk的实现。它找到那些在“W”中的字母。

 dict="/usr/share/dict/words" word=$1 awk -vw="$word" 'BEGIN{ m=split(w,c,"") for(p=1;p<=m;p++){ chars[c[p]]++ } } length($0)==length(w){ f=0;g=0 n=split($0,t,"") for(o=1;o<=n;o++){ if (!( t[o] in chars) ){ f=1; break }else{ st[t[o]]++ } } if (!f || $0==w){ for(z in st){ if ( st[z] != chars[z] ) { g=1 ;break} } if(!g){ print "found: "$0 } } delete st }' $dict

产量

 $ wc -l < /usr/share/dict/words 479829 $ time ./shell.sh look found: kolo found: look real 0m1.361s user 0m1.074s sys 0m0.015s

更新：改变算法，使用排序

 dict="/usr/share/dict/words" awk 'BEGIN{ w="table" m=split(w,c,"") b=asort(c,chars) } length($0)==length(w){ f=0 n=split($0,t,"") e=asort(t,d) for(i=1;i<=e;i++) { if(d[i]!=chars[i]){ f=1;break } } if(!f) print $0 }' $dict

产量

 $ time ./shell.sh #looking for table ablet batel belat blate bleat tabel table real 0m1.416s user 0m1.343s sys 0m0.014s $ time ./shell.sh #looking for chairs chairs ischar rachis real 0m1.697s user 0m1.660s sys 0m0.014s $ time perl perl.pl #using beamrider's Perl script table tabel ablet batel blate bleat belat real 0m2.680s user 0m1.633s sys 0m0.881s $ time perl perl.pl # looking for chairs chairs ischar rachis real 0m14.044s user 0m8.328s sys 0m5.236s

这是一个shell解决方案。最好的算法似乎是＃4。它过滤掉所有长度不正确的单词。然后，它使用一个简单的替换密码（a = 1，b = 2，A = 27，…）将这些词相加。如果和数相符，那么它实际上会进行原始排序和比较。在我的系统上，它可以通过约235k字找到“蝙蝠”在不到1/2秒。我提供了我所有的解决方案，以便您可以看到不同的方法。

更新：没有显示，但我也尝试把总和在我尝试的直方图方法的第一个bin，但它比没有直方图的速度更慢。我以为它会作为一个短路，但它没有奏效。

Update2：我尝试了awk解决方案，它运行在我最好的shell解决方案的大约三分之一的时间内，或者大约0.126s，大约是0.490s。 perl解决方案运行〜1.1s。

 #!/bin/bash word=$1 #dict=words dict=/usr/share/dict/words #dict=/usr/dict/words alg1() { sortedWord=`echo $word | grep -o . | sort | tr -d '\n'` while read line do sortedLine=`echo $line | grep -o . | sort | tr -d '\n'` if [ "$sortedWord" == "$sortedLine" ] then echo $line fi done < $dict } check_sorted_versus_not() { local word=$1 local line=`echo $2 | grep -o . | sort | tr -d '\n'` if [ "$word" == "$line" ] then echo $2 fi } # Filter out all words of incorrect length alg2() { sortedWord=`echo $word | grep -o . | sort | tr -d '\n'` grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$" grep "$grep_string" "$dict" | \ while read line do sortedLine=`echo $line | grep -o . | sort | tr -d '\n'` if [ "$sortedWord" == "$sortedLine" ] then echo $line fi done } # Create a lot of variables like this: # _a=1, _b=2, ... _z=26, _A=27, _B=28, ... _Z=52 gen_chars() { # [ -n "$GEN_CHARS" ] && return GEN_CHARS=1 local alpha="abcdefghijklmnopqrstuvwxyz" local upperalpha=`echo -n $alpha | tr 'az' 'A-Z'` local both="$alpha$upperalpha" for ((i=0; i < ${#both}; i++)) do ACHAR=${both:i:1} eval "_$ACHAR=$((i+1))" done } # I think it's faster to return the value in a var then to echo it in a sub process. # Try summing the word one char at a time by building an arithmetic expression # and then evaluate that expression. # Requires: gen_chars sum_word() { SUM=0 local s="" # parsing input one character at a time for ((i=0; i < ${#1}; i++)) do ACHAR=${1:i:1} s="$s\$_$ACHAR+" done SUM=$(( $(eval echo -n ${s}0) )) } # I think it's faster to return the value in a var then to echo it in a sub process. # Try summing the word one char at a time using a case statement. sum_word2() { SUM=0 local s="" # parsing input one character at a time for ((i=0; i < ${#1}; i++)) do ACHAR=${1:i:1} case $ACHAR in a) SUM=$((SUM+ 1));; b) SUM=$((SUM+ 2));; c) SUM=$((SUM+ 3));; d) SUM=$((SUM+ 4));; e) SUM=$((SUM+ 5));; f) SUM=$((SUM+ 6));; g) SUM=$((SUM+ 7));; h) SUM=$((SUM+ 8));; i) SUM=$((SUM+ 9));; j) SUM=$((SUM+ 10));; k) SUM=$((SUM+ 11));; l) SUM=$((SUM+ 12));; m) SUM=$((SUM+ 13));; n) SUM=$((SUM+ 14));; o) SUM=$((SUM+ 15));; p) SUM=$((SUM+ 16));; q) SUM=$((SUM+ 17));; r) SUM=$((SUM+ 18));; s) SUM=$((SUM+ 19));; t) SUM=$((SUM+ 20));; u) SUM=$((SUM+ 21));; v) SUM=$((SUM+ 22));; w) SUM=$((SUM+ 23));; x) SUM=$((SUM+ 24));; y) SUM=$((SUM+ 25));; z) SUM=$((SUM+ 26));; A) SUM=$((SUM+ 27));; B) SUM=$((SUM+ 28));; C) SUM=$((SUM+ 29));; D) SUM=$((SUM+ 30));; E) SUM=$((SUM+ 31));; F) SUM=$((SUM+ 32));; G) SUM=$((SUM+ 33));; H) SUM=$((SUM+ 34));; I) SUM=$((SUM+ 35));; J) SUM=$((SUM+ 36));; K) SUM=$((SUM+ 37));; L) SUM=$((SUM+ 38));; M) SUM=$((SUM+ 39));; N) SUM=$((SUM+ 40));; O) SUM=$((SUM+ 41));; P) SUM=$((SUM+ 42));; Q) SUM=$((SUM+ 43));; R) SUM=$((SUM+ 44));; S) SUM=$((SUM+ 45));; T) SUM=$((SUM+ 46));; U) SUM=$((SUM+ 47));; V) SUM=$((SUM+ 48));; W) SUM=$((SUM+ 49));; X) SUM=$((SUM+ 50));; Y) SUM=$((SUM+ 51));; Z) SUM=$((SUM+ 52));; *) SUM=0; return;; esac done } # I think it's faster to return the value in a var then to echo it in a sub process. # Try summing the word by building an arithmetic expression using sed and then evaluating # the expression. # Requires: gen_chars sum_word3() { SUM=$(( $(eval echo -n `echo -n $1 | sed -E -ne 's,.,$_&+,pg'`) 0)) #echo "SUM($1)=$SUM" } # Filter out all words of incorrect length # Sum the characters in the word: ie a=1, b=2, ... and "abbc" = 1+2+2+3 = 8 alg3() { gen_chars sortedWord=`echo $word | grep -o . | sort | tr -d '\n'` sum_word $word word_sum=$SUM grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$" grep "$grep_string" "$dict" | \ while read line do sum_word $line line_sum=$SUM if [ $word_sum == $line_sum ] then check_sorted_versus_not $sortedWord $line fi done } # Filter out all words of incorrect length # Sum the characters in the word: ie a=1, b=2, ... and "abbc" = 1+2+2+3 = 8 # Use sum_word2 alg4() { sortedWord=`echo $word | grep -o . | sort | tr -d '\n'` sum_word2 $word word_sum=$SUM grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$" grep "$grep_string" "$dict" | \ while read line do sum_word2 $line line_sum=$SUM if [ $word_sum == $line_sum ] then check_sorted_versus_not $sortedWord $line fi done } # Filter out all words of incorrect length # Sum the characters in the word: ie a=1, b=2, ... and "abbc" = 1+2+2+3 = 8 # Use sum_word3 alg5() { gen_chars sortedWord=`echo $word | grep -o . | sort | tr -d '\n'` sum_word3 $word word_sum=$SUM grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$" grep "$grep_string" "$dict" | \ while read line do sum_word3 $line line_sum=$SUM if [ $word_sum == $line_sum ] then check_sorted_versus_not $sortedWord $line fi done } # I think it's faster to return the value in a var then to echo it in a sub process. # Try summing the word one char at a time using a case statement. # Place results in a histogram sum_word4() { SUM=(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0) # parsing input one character at a time for ((i=0; i < ${#1}; i++)) do ACHAR=${1:i:1} case $ACHAR in a) SUM[1]=$((SUM[ 1] + 1));; b) SUM[2]=$((SUM[ 2] + 1));; c) SUM[3]=$((SUM[ 3] + 1));; d) SUM[4]=$((SUM[ 4] + 1));; e) SUM[5]=$((SUM[ 5] + 1));; f) SUM[6]=$((SUM[ 6] + 1));; g) SUM[7]=$((SUM[ 7] + 1));; h) SUM[8]=$((SUM[ 8] + 1));; i) SUM[9]=$((SUM[ 9] + 1));; j) SUM[10]=$((SUM[10] + 1));; k) SUM[11]=$((SUM[11] + 1));; l) SUM[12]=$((SUM[12] + 1));; m) SUM[13]=$((SUM[13] + 1));; n) SUM[14]=$((SUM[14] + 1));; o) SUM[15]=$((SUM[15] + 1));; p) SUM[16]=$((SUM[16] + 1));; q) SUM[17]=$((SUM[17] + 1));; r) SUM[18]=$((SUM[18] + 1));; s) SUM[19]=$((SUM[19] + 1));; t) SUM[20]=$((SUM[20] + 1));; u) SUM[21]=$((SUM[21] + 1));; v) SUM[22]=$((SUM[22] + 1));; w) SUM[23]=$((SUM[23] + 1));; x) SUM[24]=$((SUM[24] + 1));; y) SUM[25]=$((SUM[25] + 1));; z) SUM[26]=$((SUM[26] + 1));; A) SUM[27]=$((SUM[27] + 1));; B) SUM[28]=$((SUM[28] + 1));; C) SUM[29]=$((SUM[29] + 1));; D) SUM[30]=$((SUM[30] + 1));; E) SUM[31]=$((SUM[31] + 1));; F) SUM[32]=$((SUM[32] + 1));; G) SUM[33]=$((SUM[33] + 1));; H) SUM[34]=$((SUM[34] + 1));; I) SUM[35]=$((SUM[35] + 1));; J) SUM[36]=$((SUM[36] + 1));; K) SUM[37]=$((SUM[37] + 1));; L) SUM[38]=$((SUM[38] + 1));; M) SUM[39]=$((SUM[39] + 1));; N) SUM[40]=$((SUM[40] + 1));; O) SUM[41]=$((SUM[41] + 1));; P) SUM[42]=$((SUM[42] + 1));; Q) SUM[43]=$((SUM[43] + 1));; R) SUM[44]=$((SUM[44] + 1));; S) SUM[45]=$((SUM[45] + 1));; T) SUM[46]=$((SUM[46] + 1));; U) SUM[47]=$((SUM[47] + 1));; V) SUM[48]=$((SUM[48] + 1));; W) SUM[49]=$((SUM[49] + 1));; X) SUM[50]=$((SUM[50] + 1));; Y) SUM[51]=$((SUM[51] + 1));; Z) SUM[52]=$((SUM[52] + 1));; *) SUM[53]=-1; return;; esac done #echo ${SUM[*]} } # Check if two histograms are equal hist_are_equal() { # Array sizes differ? [ ${#_h1[*]} != ${#SUM[*]} ] && return 1 # parsing input one index at a time for ((i=0; i < ${#_h1[*]}; i++)) do [ ${_h1[i]} != ${SUM[i]} ] && return 1 done return 0 } # Check if two histograms are equal hist_are_equal2() { # Array sizes differ? local size=${#_h1[*]} [ $size != ${#SUM[*]} ] && return 1 # parsing input one index at a time for ((i=0; i < $size; i++)) do [ ${_h1[i]} != ${SUM[i]} ] && return 1 done return 0 } # Filter out all words of incorrect length # Use sum_word4 which generates a histogram of character frequency alg6() { sum_word4 $word _h1=${SUM[*]} grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$" grep "$grep_string" "$dict" | \ while read line do sum_word4 $line if hist_are_equal then echo $line fi done } # Filter out all words of incorrect length # Use sum_word4 which generates a histogram of character frequency alg7() { sum_word4 $word _h1=${SUM[*]} grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$" grep "$grep_string" "$dict" | \ while read line do sum_word4 $line if hist_are_equal2 then echo $line fi done } run_test() { echo alg$1 eval time alg$1 } #run_test 1 #run_test 2 #run_test 3 run_test 4 #run_test 5 run_test 6 #run_test 7

 #!/usr/bin/perl $myword=join("", sort split (//, $ARGV[0])); shift; while (<>) { chomp; print "$_\n" if (join("", sort split (//)) eq $myword); }

像这样使用它： bla.pl < /usr/dict/words searchword

你想找到只包含一组给定的字符的单词。一个正则表达式是：

 '^[letters_you_care_about]*$'

所以，你可以这样做：

 grep "^[$W]*$" /usr/dict/words

'^'匹配行的开始; '$'表示行结束。这意味着我们必须有一个精确的匹配，而不只是一个部分匹配（如“表”）。

'['和']'用于在输入文件的一个字符空间中定义一组可能的字符。我们用它来查找/ usr / dict / word中只包含$ W中字符的单词。

'*'重复前一个字符（'[']'规则），该字符表示查找任何长度的字，其中所有字符均为$ W。

所以我们有以下几点：

n =输入单词的长度
L =字典文件中的行

如果n趋向于小而L趋向于巨大，那么我们应该更好地查找输入词的所有排列并寻找这些排列，而不是对词典文件的所有L行进行某种操作（如排序）？（实际上，由于找到一个单词的所有排列是O（n！），我们必须为每个单词遍历整个字典文件，也许不是，但我仍然编写代码。

这是Perl – 我知道你想要的命令行操作，但我没有办法做到这一点在shell脚本不超级哈西：

 sub dedupe { my (@list) = @_; my (@new_list, %seen_entries, $entry); foreach $entry (@list) { if (!(defined($seen_entries{$entry}))) { push(@new_list, $entry); $seen_entries{$entry} = 1; } } return @new_list; } sub find_all_permutations { my ($word) = @_; my (@permutations, $subword, $letter, $rest_of_word, $i); if (length($word) == 1) { push(@permutations, $word); } else { for ($i=0; $i<length($word); $i++) { $letter = substr($word, $i, 1); $rest_of_word = substr($word, 0, $i) . substr($word, $i + 1); foreach $subword (find_all_permutations($rest_of_word)) { push(@permutations, $letter . $subword); } } } return @permutations; } $words_file = '/usr/share/dict/words'; $word = 'table'; @all_permutations = dedupe(find_all_permutations($word)); foreach $permutation (@all_permutations) { if (`grep -c -m 1 ^$permutation\$ $words_file` == 1) { print $permutation . "\n"; } }

这个工具可能会让你感兴趣

 an -w "tab" -m 3

只给bat和tab 。

原作者似乎已经不在身边了，但是你可以在http://packages.qa.debian.org/a/an.html上找到信息（即使你自己不想使用它，值得一看）。