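I'm using a Perl script to remove all stop words from a text. The stop words are stored one per line. I'm on the Mac OS X command line, and Perl is installed correctly.
The script doesn't work properly: it has a word-boundary problem.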
#!/usr/bin/env perl -w
# usage: script.pl words text >newfile
use English;

# poor man's argument handler
open(WORDS, shift @ARGV) || die "failed to open words file: $!";
open(REPLACE, shift @ARGV) || die "failed to open replacement file: $!";

my @words;
# get all words into an array
while ($_=<WORDS>) {
  chop; # strip eol
  push @words, split; # break up words on line
}

# (optional)
# sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the"
@words=sort { length($b) <=> length($a) } @words;

# slurp text file into one variable.
undef $RS;
$text = <REPLACE>;

# now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space.
foreach $word (@words) {
  $text =~ s/\b\Q$word\E\s?//sg;
}

# output "fixed" text
print $text;
$ cat sample.txt
how about i decide to look at it afterwards what across do you think is it a good idea to go out and about i think id rather go up and above
I a about an are as at be by com for from how in is it ..
$ ./remove.pl stopwords.txt sample.txt
i decide look fterwards cross do you think good idea go out di think id rather go up d bove
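As you can see, it treats the a in afterwards as a stop word and removes it. I think it's a regex problem. Could someone please help me patch this quickly? Thanks for all the help :J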
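Use a word boundary on both sides of $word. Currently, you are only checking for one at the beginning. You don't need the \s? condition with \b in place: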
$text =~ s/\b\Q$word\E\b//sg;
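To see what the two-sided boundary does in isolation, here is a minimal sketch. The helper name remove_stopwords and the short input string are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): removes each stop
# word only where it stands alone as a whole word.
sub remove_stopwords {
    my ($text, @words) = @_;
    for my $word (@words) {
        # \b on both sides: the word must not be part of a longer word
        $text =~ s/\b\Q$word\E\b//g;
    }
    return $text;
}

my $cleaned = remove_stopwords('look at it afterwards', qw(a at it));
print "[$cleaned]\n";
```

Note that with \s? dropped, the spaces that surrounded each removed word stay behind, so the result has runs of spaces between "look" and "afterwards". If that matters, one option is to squeeze them afterwards with something like s/ {2,}/ /g.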
Your regex is not strict enough.
$text =~ s/\b\Q$word\E\s?//sg;
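When $word is a, the command is effectively s/\ba\s?//sg. That means: remove every a that begins a word, followed by zero or one whitespace character. In afterwards, this successfully matches the first a.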
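The difference can be reproduced on three of the words that were mangled in the question. This is only a small sketch; the variable names $sample, $loose and $strict are chosen here for illustration.

```perl
use strict;
use warnings;

my $sample = 'afterwards across above';

# Pattern from the question: \b only on the left, so the leading "a"
# of any word matches.
(my $loose = $sample) =~ s/\ba\s?//g;

# Boundary on both sides: "a" only matches as a whole word.
(my $strict = $sample) =~ s/\ba\b\s?//g;

print "loose:  $loose\n";   # fterwards cross bove
print "strict: $strict\n";  # afterwards across above
```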
You can make the match stricter by ending it with another \b, like this:
$text =~ s/\b\Q$word\E\b\s?//sg;
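As a rough check, the stricter pattern can be run against the sample text from the question. This sketch assumes only the stop words that are visible in the question; the real list is truncated there, so the output will not match a run with the full file. The words are inlined instead of being read from stopwords.txt to keep the example self-contained.

```perl
use strict;
use warnings;

# Assumption: only the stop words visible in the question; the actual
# stopwords.txt is longer (the listing is cut off with "..").
my @stopwords = qw(I a about an are as at be by com for from how in is it);

my $text = 'how about i decide to look at it afterwards what across do you think is it a good idea to go out and about i think id rather go up and above';

for my $word (@stopwords) {
    # whole-word match, plus at most one trailing whitespace character
    $text =~ s/\b\Q$word\E\b\s?//g;
}

print "$text\n";
```

With this pattern, afterwards, across and above come through intact. The lowercase i is kept because the list contains I and the match is case-sensitive; adding the i modifier to the substitution would be one way to change that.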