Perl: remove words from file1 using file2

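I am using a Perl script to remove all stop words from a text. The stop words are stored one per line. I am working on the Mac OS X command line, and Perl is installed correctly.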

The script does not work properly: it has a word-boundary problem.

    #!/usr/bin/env perl -w
    # usage: script.pl words text >newfile
    use English;

    # poor man's argument handler
    open(WORDS, shift @ARGV) || die "failed to open words file: $!";
    open(REPLACE, shift @ARGV) || die "failed to open replacement file: $!";

    my @words;
    # get all words into an array
    while ($_=<WORDS>) {
        chop; # strip eol
        push @words, split; # break up words on line
    }

    # (optional)
    # sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the"
    @words=sort { length($b) <=> length($a) } @words;

    # slurp text file into one variable.
    undef $RS;
    $text = <REPLACE>;

    # now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space.
    foreach $word (@words) {
        $text =~ s/\b\Q$word\E\s?//sg;
    }

    # output "fixed" text
    print $text;

For sample.txt:

    $ cat sample.txt
    how about i decide to look at it afterwards what across do you think is it a good idea to go out and about i think id rather go up and above

In stopWords.txt:

    I
    a
    about
    an
    are
    as
    at
    be
    by
    com
    for
    from
    how
    in
    is
    it
    ..

Output:

    $ ./remove.pl stopwords.txt sample.txt
    i decide look fterwards cross do you think good idea go out di think id rather go up d bove

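As you can see, the stop word "a" is also stripped from the start of "afterwards", leaving "fterwards". I think it is a regular expression problem. Could someone please help me patch it quickly? Thanks for all the help :J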

Use word boundaries on both sides of $word. Currently, you are only checking for one at the beginning.

You won't need the \s? condition with the \b in place:

 $text =~ s/\b\Q$word\E\b//sg; 
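The effect of the second \b can be checked on a short string. This is only a minimal sketch: strip_whole_word is a hypothetical helper written for illustration and is not part of the original script, and the input string is made up.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): remove one stop word,
# requiring a word boundary on both sides of it.
sub strip_whole_word {
    my ($text, $word) = @_;
    $text =~ s/\b\Q$word\E\b//g;
    return $text;
}

# The standalone "a" is removed; the "a" that starts "afterwards" is not,
# because it is followed by a word character rather than a boundary.
my $cleaned = strip_whole_word('go a bit afterwards', 'a');
print "[$cleaned]\n";
```

Note that this pattern removes only the word itself, so the spaces on either side of the removed "a" both stay in the result.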

Your regex is not strict enough.

 $text =~ s/\b\Q$word\E\s?//sg; 

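When $word is "a", the command is effectively s/\ba\s?//sg. This means: remove an "a" at the start of a word, followed by at most one whitespace character. In "afterwards", this successfully matches the first "a".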
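The behaviour described above is easy to reproduce in isolation. In this small sketch, strip_prefix_only is a hypothetical helper that wraps the pattern from the question for a single word, and the input string is made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the pattern from the question, which checks for a
# word boundary only in front of the stop word.
sub strip_prefix_only {
    my ($text, $word) = @_;
    $text =~ s/\b\Q$word\E\s?//sg;
    return $text;
}

# The standalone "a" goes away, but so does the leading "a" of every other
# word that happens to start with it.
my $result = strip_prefix_only('a look afterwards across', 'a');
print "$result\n";   # prints "look fterwards cross"
```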

You can make the match stricter by ending the word with another \b, like this:

 $text =~ s/\b\Q$word\E\b\s?//sg;
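As a rough end-to-end check, the stricter pattern can be run over part of the sample text. This is a sketch under assumptions: remove_stop_words is a hypothetical helper, the text and stop words are held in memory instead of being read from files, and only the stop words visible in the question are used (the real list is longer).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: strip every stop word
# from $text, matching whole words only and swallowing one trailing whitespace.
sub remove_stop_words {
    my ($text, @stop_words) = @_;
    for my $word (@stop_words) {
        $text =~ s/\b\Q$word\E\b\s?//g;
    }
    return $text;
}

# Only the stop words shown in the question; the full file is not known here.
my @stop_words = qw(I a about an are as at be by com for from how in is it);

my $sample = 'how about i decide to look at it afterwards what across do you think is it a good idea';
print remove_stop_words($sample, @stop_words), "\n";
# prints "i decide to look afterwards what across do you think good idea"
```

The match is case-sensitive, so the stop word "I" does not remove the lowercase "i" in the sample; adding the /i modifier would be one way to change that. Words such as "to" and "what" stay here only because they are not in the visible part of the list.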