Find all lines in text files that share at least two words (Bash)

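I have several large text files generated by different people. Each file contains a list with one title per line. Every sentence is different, yet the titles supposedly refer to items that are not otherwise identified.

Because the format and wording vary, I am trying to generate a shorter file of likely matches for manual checking. I am new to Bash, and I have tried several commands to compare each line against titles that share two or more keywords. The comparison should be case-insensitive, and keywords should be longer than 4 characters so that articles and similar short words are excluded.

Example:

Input text file #1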

 Investigating Amusing King : Expl and/in the Proletariat
 Managing Self-Confident Legacy: The Harlem Renaissance and/in the Abject
 Inventing Sarcastic Silence: The Harlem Renaissance and/in the Invader
 Inventing Random Ethos: The Harlem Renaissance and/in the Marginalized
 Loss: Supplementing Transgressive Production and Assimilation

Input text file #2

 Loss: Judging Foolhardy Historicism and Homosexuality
 Loss: Developping Homophobic Textuality and Outrage
 Loss: Supplement of transgressive production
 Loss: Questioning Diligent Verbiage and Mythos
 Me Against You: Transgressing Easygoing Materialism and Dialectic

Output text file

 File #1-->Loss: Supplementing Transgressive Production and Assimilation
 File #2-->Loss: Supplement of transgressive production
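To make the matching rule concrete, here is a minimal Python sketch of why these two titles would pair up. The helper name `keywords` and its `min_len` parameter are illustrative assumptions, not part of the question; the rule it applies (lowercase everything, keep only words longer than 4 characters) follows the description above.

```python
import re

def keywords(title, min_len=5):
    """Return the lowercase words of `title` with at least `min_len` characters.

    `keywords` and `min_len` are hypothetical names used for illustration.
    """
    # Lowercase first so the comparison is case-insensitive, then pull out
    # runs of letters, digits, and apostrophes as words.
    words = re.findall(r"[a-z0-9']+", title.lower())
    return {w for w in words if len(w) >= min_len}

a = keywords("Loss: Supplementing Transgressive Production and Assimilation")
b = keywords("Loss: Supplement of transgressive production")

shared = a & b          # keywords present in both titles
print(sorted(shared))   # ['production', 'transgressive']
```

"Loss" has only 4 characters and "and"/"of" are shorter still, so none of them counts as a keyword; the pair matches on the two remaining shared words.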

So far, I have been able to weed out several entries that are exact duplicates…

 cat FILE_num*.txt | sort | uniq -d > berbatim_duplicates.txt

…and several others that have the same annotation between curly braces

  cat FILE_num*.txt | sort | cut -d "{" -f2 | cut -d "}" -f1 | uniq -d > same_annotations.txt
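One thing worth checking in the second pipeline: `uniq -d` only reports duplicates that sit on adjacent lines, and here the sort happens before the `cut`, so the extracted annotations are not necessarily in sorted order when they reach `uniq`. As a rough cross-check, the same two filters can be sketched in Python with a counter, which does not depend on ordering. The function names and the sample lines below are made up for illustration, and unlike `cut`, this sketch skips lines that contain no braces.

```python
import re
from collections import Counter

def repeated(items):
    """Return the items that occur more than once, in sorted order."""
    counts = Counter(items)
    return sorted(item for item, n in counts.items() if n > 1)

def annotation(line):
    """Return the text inside the first {...} pair on the line, or None."""
    match = re.search(r"\{([^{}]*)\}", line)
    return match.group(1) if match else None

# Hypothetical sample lines; the real files are not shown in the question.
lines = [
    "Title A {note-1}",
    "Title B {note-2}",
    "Another title {note-1}",
    "Title B {note-2}",
]

# Verbatim duplicate lines (what `sort | uniq -d` reports).
print(repeated(lines))        # ['Title B {note-2}']

# Annotations shared by more than one line, regardless of line order.
annotations = [a for a in map(annotation, lines) if a is not None]
print(repeated(annotations))  # ['note-1', 'note-2']
```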

One approach that looked promising was using regular expressions, but I could not get it to work properly.

Thanks in advance.

In Python 3:

 from sys import argv
 from re import sub

 def getWordSet(line):
     # Drop bracketed/parenthesized text and punctuation, then split into words.
     line=sub(r'\[.*\]|\(.*\)|[.,!?:]','',line).split()
     s=set()
     for word in line:
         # Keep only words longer than 4 characters, lowercased.
         if len(word)>4:
             word=word.lower()
             s.add(word)
     return s

 def compare(file1, file2):
     file1 = file1.split('\n')
     file2 = file2.split('\n')
     for line1,set1 in zip(file1,map(getWordSet,file1)):
         for line2,set2 in zip(file2,map(getWordSet,file2)):
             # A pair matches when it shares at least two keywords.
             if len(set1.intersection(set2))>1:
                 print("File #1-->",line1,sep='')
                 print("File #2-->",line2,sep='')

 if __name__=='__main__':
     with open(argv[1]) as file1, open(argv[2]) as file2:
         compare(file1.read(),file2.read())

This gives the expected output. It prints the matching pairs of lines from the two files.
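For large files, one possible variation is to build the word sets for the second file only once instead of on every pass of the outer loop, and to report which keywords were shared so the manual check goes faster. The sketch below is an assumption-laden rewrite, not the script above: `word_set`, `matching_pairs`, and `min_shared` are names introduced here, and the titles are the sample data from the question.

```python
import re

def word_set(line):
    """Lowercase words longer than 4 characters, using the same cleanup
    regex as the script above (bracketed text and punctuation removed)."""
    cleaned = re.sub(r'\[.*\]|\(.*\)|[.,!?:]', '', line)
    return {w.lower() for w in cleaned.split() if len(w) > 4}

def matching_pairs(lines1, lines2, min_shared=2):
    """Return (line1, line2, shared_keywords) for every cross-file pair
    that shares at least `min_shared` keywords."""
    # Compute the second file's word sets once, not once per line of file 1.
    sets2 = [(line, word_set(line)) for line in lines2]
    pairs = []
    for line1 in lines1:
        set1 = word_set(line1)
        for line2, set2 in sets2:
            shared = set1 & set2
            if len(shared) >= min_shared:
                pairs.append((line1, line2, sorted(shared)))
    return pairs

file1 = [
    "Investigating Amusing King : Expl and/in the Proletariat",
    "Managing Self-Confident Legacy: The Harlem Renaissance and/in the Abject",
    "Inventing Sarcastic Silence: The Harlem Renaissance and/in the Invader",
    "Inventing Random Ethos: The Harlem Renaissance and/in the Marginalized",
    "Loss: Supplementing Transgressive Production and Assimilation",
]
file2 = [
    "Loss: Judging Foolhardy Historicism and Homosexuality",
    "Loss: Developping Homophobic Textuality and Outrage",
    "Loss: Supplement of transgressive production",
    "Loss: Questioning Diligent Verbiage and Mythos",
    "Me Against You: Transgressing Easygoing Materialism and Dialectic",
]

for line1, line2, shared in matching_pairs(file1, file2):
    print("File #1-->" + line1)
    print("File #2-->" + line2)
    print("shared: " + ", ".join(shared))
```

The comparison is still quadratic in the number of lines; only the repeated set construction is avoided.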

Save this script in a file. I will call it script.py, but you can name it whatever you like. You can launch it with

 python3 script.py file1 file2

You can even use an alias:

 alias comp="python3 script.py"

and then

 comp file1 file2

I have included the discussion below.