Perl: remove the words listed in file2 from file1

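I am using a Perl script to remove all stop words from a text. The stop words are stored one per line. I am using the Mac OS X command line, and Perl is installed correctly.

The script does not work properly; it has a word-boundary problem.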

 #!/usr/bin/env perl -w
 # usage: script.pl words text >newfile
 use English;

 # poor man's argument handler
 open(WORDS, shift @ARGV) || die "failed to open words file: $!";
 open(REPLACE, shift @ARGV) || die "failed to open replacement file: $!";

 my @words;
 # get all words into an array
 while ($_=<WORDS>) {
   chop; # strip eol
   push @words, split; # break up words on line
 }

 # (optional)
 # sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the"
 @words=sort { length($b) <=> length($a) } @words;

 # slurp text file into one variable.
 undef $RS;
 $text = <REPLACE>;

 # now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space.
 foreach $word (@words) {
   $text =~ s/\b\Q$word\E\s?//sg;
 }

 # output "fixed" text
 print $text;

For sample.txt:

 $ cat sample.txt
 how about i decide to look at it afterwards what across do you think is it a good idea to go out and about i think id rather go up and above

In stopWords.txt:

 I a about an are as at be by com for from how in is it ..

Output:

 $ ./remove.pl stopwords.txt sample.txt
 i decide look fterwards cross do you think good idea go out di think id rather go up d bove

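As you can see, the stop word a is being stripped out of afterwards (and out of across and above). I think it is a regular expression problem. Could someone please help me patch it quickly? Thanks for all the help :J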

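Use word boundaries on both sides of $word. Currently, you are only checking for one at the beginning.

You won't need the \s? condition once the \b is in place: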

 $text =~ s/\b\Q$word\E\b//sg;
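
One side effect worth checking, shown here as a minimal sketch: without the \s?, the space that followed the removed word stays behind, so two spaces can end up next to each other. The input string is made up for illustration, and squeezing the spaces with tr/ //s afterwards is an optional extra step, not part of the answer above.

```perl
use strict;
use warnings;

my $word = 'about';
my $text = 'go out and about i think';

# Whole-word removal with \b on both sides and no trailing \s?
(my $raw = $text) =~ s/\b\Q$word\E\b//g;

# Optional clean-up (an assumption, not from the answer):
# collapse each run of spaces into a single space.
(my $squeezed = $raw) =~ tr/ //s;

print "[$raw]\n";       # [go out and  i think]  (two spaces after "and")
print "[$squeezed]\n";  # [go out and i think]
```

Whether the leftover spaces matter depends on what reads the output next; if it is split on whitespace anyway, the clean-up can be skipped.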

Your regex is not strict enough.

 $text =~ s/\b\Q$word\E\s?//sg;

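When $word is a, the command is effectively s/\ba\s?//sg. This means: remove an a that starts a new word, followed by zero or one whitespace character. In afterwards, this successfully matches the first a.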
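
To see this in isolation, here is a small sketch that runs the question's pattern for the single stop word a. The input string is made up for illustration.

```perl
use strict;
use warnings;

my $word = 'a';
my $text = 'is it a good idea afterwards';

# The pattern from the question: \b before the word, nothing after it.
(my $result = $text) =~ s/\b\Q$word\E\s?//g;

print "$result\n";   # is it good idea fterwards
```

The standalone a is removed as intended, but so is the leading a of afterwards, because a word boundary sits in front of it too. The a at the end of idea survives only because there is no boundary before it.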

You can make the match stricter by ending the word with another \b. Like this:

 $text =~ s/\b\Q$word\E\b\s?//sg;
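
As a rough sketch of how that line fits into the loop, the following wraps it in a subroutine and runs it on the first part of the sample text. The name remove_stopwords is hypothetical, and the stop list is only the visible part of stopWords.txt from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: remove each stop word from $text, whole words only.
sub remove_stopwords {
    my ($text, @words) = @_;
    # Longest words first, as in the original script.
    for my $word (sort { length($b) <=> length($a) } @words) {
        # \b on both sides; \s? also drops one following whitespace character.
        $text =~ s/\b\Q$word\E\b\s?//g;
    }
    return $text;
}

my @stop  = qw(I a about an are as at be by com for from how in is it);
my $clean = remove_stopwords('how about i decide to look at it afterwards', @stop);

print "$clean\n";   # i decide to look afterwards
```

Note that the match is case-sensitive, so the lowercase i is kept even though I is in the list; adding the /i modifier would change that. The word to stays here only because it is not in the visible part of the stop list.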