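Efficiently remove lines from fileA that contain a string from fileB

FileA contains lines; FileB contains words.

How can I efficiently remove the lines from FileA that contain any of the words found in FileB?

I tried the following, but I don't even know whether they work, because they take so long to run.

Tried grep: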

 grep -v -f <(awk '{print $1}' FileB.txt) FileA.txt > out

Also tried Python:

    import sys

    f = open(sys.argv[1], 'r')
    out = open(sys.argv[2], 'w')
    bad_words = f.read().splitlines()
    with open('FileA') as master_lines:
        for line in master_lines:
            if not any(bad_word in line for bad_word in bad_words):
                out.write(line)

FileA:

    abadan refinery is one of the largest in the world.
    a bad apple spoils the barrel.
    abaiara is a city in the south region of brazil.
    a ban has been imposed on the use of faxes

FileB:

    abadan
    abaiara

Desired output:

    a bad apple spoils the barrel.
    a ban has been imposed on the use of faxes

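Your commands look fine, so it may be time to try a good scripting language. Try running the following Perl script and see whether it finishes any faster.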

    #!/usr/bin/perl
    use strict;
    use warnings;

    # fileB holds the words to look up; fileA holds the lines to filter.
    open my $LOOKUP, "<", "fileB" or die "Cannot open lookup file: $!";
    open my $MASTER, "<", "fileA" or die "Cannot open Master file: $!";
    open my $OUTPUT, ">", "out"   or die "Cannot create Output file: $!";

    # Load every word into a hash so each lookup takes constant time.
    my %words;
    while (my $word = <$LOOKUP>) {
        chomp($word);
        ++$words{$word};
    }

    # Copy a line to the output only if none of its words is in the hash.
    LOOP_FILE_A: while (my $line = <$MASTER>) {
        my @l = split /\s+/, $line;
        for my $w (@l) {
            next LOOP_FILE_A if defined $words{$w};
        }
        print $OUTPUT $line;
    }

    close $OUTPUT or die "Cannot close Output file: $!";

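I refuse to believe that Python can't match Perl's performance here. This is my quick attempt at a more efficient Python version. I use sets to optimize the search part of the problem. The & operator returns a new set containing the elements common to both sets; the code below uses isdisjoint, which makes the same check without building that set.

On my machine this solution takes 12 seconds to run on a fileA with 3M lines and a fileB with 200K words, and the Perl script takes 9. The biggest slowdown seems to be re.split, even though it seems to be faster than string.split in this case.

If you have any suggestions for improving the speed, please comment on this answer.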

    import re

    splitter = re.compile(r"\s")

    # Load the bad words into a set for fast membership tests.
    with open('Downloads/fileB.txt') as fileb:
        bad_words = set(line.strip() for line in fileb)

    with open('Downloads/fileA.txt') as filea, open('output.txt', 'w') as output:
        for line in filea:
            line_words = set(splitter.split(line))
            # Keep the line only if it shares no word with bad_words.
            if bad_words.isdisjoint(line_words):
                output.write(line)
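
To make the set logic concrete, here is a small sketch run on the sample data from the question. The helper name `keep_line` is made up for illustration and is not part of the script above; it also uses plain `str.split` rather than `re.split`, which is an assumption on my part and not something I have timed.

```python
def keep_line(line, bad_words):
    """Return True when no whitespace-separated token of line is a bad word."""
    line_words = set(line.split())
    # `line_words & bad_words` would build the intersection set;
    # isdisjoint answers the same yes/no question without building it.
    return bad_words.isdisjoint(line_words)


bad_words = {"abadan", "abaiara"}
lines = [
    "abadan refinery is one of the largest in the world.",
    "a bad apple spoils the barrel.",
    "abaiara is a city in the south region of brazil.",
    "a ban has been imposed on the use of faxes",
]

# Only the second and fourth sample lines survive the filter.
kept = [line for line in lines if keep_line(line, bad_words)]
```

Note that a token with punctuation attached, such as `abadan.`, is a different string from `abadan`, so a line containing only that form would be kept.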

Using grep:

 grep -v -Fwf fileB fileA
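
Here -v drops the matching lines, -F treats the patterns as fixed strings and not as regular expressions, -w matches whole words only, and -f reads the patterns from fileB. One difference from the whitespace-splitting scripts above is that with -w a word followed by punctuation, such as `abadan.`, still counts as a match. A rough Python sketch of that behaviour, assuming every entry in fileB consists only of letters, digits, and underscores (the name `filter_lines` is hypothetical):

```python
import re

# A "word" here is a maximal run of letters, digits, and underscores.
WORD = re.compile(r"\w+")


def filter_lines(lines, bad_words):
    """Keep the lines in which no whole word is one of bad_words."""
    bad = set(bad_words)
    return [line for line in lines if bad.isdisjoint(WORD.findall(line))]


sample = ["the refinery at abadan.", "a bad apple", "several abadans"]
result = filter_lines(sample, ["abadan"])
```

In this sketch the first sample line is dropped because `abadan` appears as a whole word before the full stop, while `abadans` is a different word and its line is kept.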