Extracting strings with regular expression groups in the terminal

I have a text file that contains HTML like this:

<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<li><a href="https://www.youtube.com/watch?v=4u3zvtfqb7w&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: Hierarchical Clustering</a> (6:33)</li>
<li><a href="https://www.youtube.com/watch?v=jk9S3RTAl38&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with John Chambers</a> (10:20)</li>
<li><a href="https://www.youtube.com/watch?v=6l9V1sINzhE&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with Bradley Efron</a> (12:08)</li>
<li><a href="https://www.youtube.com/watch?v=79tR7BvYE6w&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with Jerome Friedman</a> (10:29)</li>
<li><a href="https://www.youtube.com/watch?v=MEMGOlJxxz0&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interviews with statistics graduate students</a> (7:44)</li>

I extract the links with grep -oP "https:\/\/www.youtube.com\/watch\?v=([A-Za-z0-9-_]+)" list > links, where list is the HTML file. I also need to extract the title of each video, that is, I need a second list like this:

 Lab: K-means Clustering
 Lab: Hierarchical Clustering
 Interview with John Chambers
 Interview with Bradley Efron
 Interview with Jerome Friedman
 Interviews with statistics graduate students

The problem is that the file also contains other links, such as <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a>, so I cannot use a pattern that relies on the tags alone. I need something like pattern grouping, where $1 refers to the first matched group, $2 to the second, and so on, applied to a pattern like https:\/\/www.youtube.com\/watch\?v=([A-Za-z0-9-_]+)/[SOME INFORMATION ON URL HERE]/([A-Za-z0-9-_]+). How can I do this in a terminal (Bash)?

You can do the following:

 grep -oP "(?<=\">).*(?=</a)" your_file

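This prints:

 Lab: K-means Clustering
 Lab: Hierarchical Clustering
 Interview with John Chambers
 Interview with Bradley Efron
 Interview with Jerome Friedman
 Interviews with statistics graduate students

Since there is no simple way to print only a captured group with grep alone, I used lookbehind and lookahead assertions to make sure only the desired part is printed. Note that .* is greedy, so this relies on each link being on its own line.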
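If the capture groups themselves are needed, as the question's mention of $1 and $2 suggests, sed is one option, because its substitution command can refer to groups as \1 and \2. The following is a minimal sketch and not part of the answer above. The file name sample.html and its two lines are made up for illustration, and it assumes one link per line and a sed that accepts -E (GNU and BSD sed both do).

```shell
# Hypothetical sample input: one YouTube link and one unrelated link.
cat > sample.html <<'EOF'
<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<li><a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a></li>
EOF

# \1 is the video ID and \2 is the link text.
# -n together with the p flag prints only the lines where the substitution matched,
# so the non-YouTube link is skipped.
pairs=$(sed -nE 's#.*<a href="https://www\.youtube\.com/watch\?v=([A-Za-z0-9_-]+)[^"]*">([^<]+)</a>.*#\1 \2#p' sample.html)
printf '%s\n' "$pairs"
```

Each output line holds the video ID followed by the title, which can then be split further with cut or read.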

You can use \K to discard everything matched up to that point, so that only what follows it is printed:

 grep -oP "a href=\"[^>]+>\K[^<]+" file
 Lab: K-means Clustering
 Lab: Hierarchical Clustering
 Interview with John Chambers
 Interview with Bradley Efron
 Interview with Jerome Friedman
 Interviews with statistics graduate students

Or, assuming "> does not appear anywhere else:

 grep -oP "\">\K[^<]+" file
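Because the text before \K still has to match, it can also serve as a filter. The sketch below narrows the prefix to YouTube watch URLs so that other links are skipped. The file name k_demo.html and its contents are illustrative assumptions, and -P requires a grep built with PCRE support (GNU grep usually is).

```shell
# Hypothetical sample file: two YouTube links and one unrelated link.
cat > k_demo.html <<'EOF'
<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<li><a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a></li>
<li><a href="https://www.youtube.com/watch?v=4u3zvtfqb7w&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: Hierarchical Clustering</a> (6:33)</li>
EOF

# Everything before \K must match but is dropped from the reported match,
# so the YouTube prefix filters the links without being printed.
titles=$(grep -oP 'href="https://www\.youtube\.com/watch\?[^"]*">\K[^<]+' k_demo.html)
printf '%s\n' "$titles"
```

Only the two YouTube titles are printed; the "An Introduction to Statistical Learning" link does not match the prefix.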

You can use a non-greedy regex like this:

 >([^<]+?)</a>


Or, more precisely, you can use lookarounds:

 (?<=>)([^<]+?)(?=</a>)

Result:

 Lab: K-means Clustering
 Lab: Hierarchical Clustering
 Interview with John Chambers
 Interview with Bradley Efron
 Interview with Jerome Friedman
 Interviews with statistics graduate students
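The practical difference between a bounded pattern such as [^<]+ and a greedy .* only shows up when several links share one line. The sketch below compares the two on a made-up one-line input (the URLs u1 and u2 are placeholders) and assumes a grep with -P support. Note that with [^<]+ the trailing ? makes no difference to the result, since the character class already stops at the next <.

```shell
# Hypothetical input: two links on the same line.
line='<li><a href="u1">First</a> (1:00)</li> <li><a href="u2">Second</a> (2:00)</li>'

# Greedy .* runs to the last </a on the line, giving one oversized match.
greedy=$(printf '%s\n' "$line" | grep -oP '(?<=">).*(?=</a)')

# [^<]+ cannot cross a "<", so each link text is matched separately.
bounded=$(printf '%s\n' "$line" | grep -oP '(?<=>)[^<]+(?=</a>)')

printf 'greedy:\n%s\n\nbounded:\n%s\n' "$greedy" "$bounded"
```

The greedy variant returns a single match that spans from First to Second, markup included, while the bounded variant returns First and Second on separate lines.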

A portable awk solution:

 awk -F '<a href[^>]*>|</a>' '{print $2}' file.html
 Lab: K-means Clustering
 Lab: Hierarchical Clustering
 Interview with John Chambers
 Interview with Bradley Efron
 Interview with Jerome Friedman
 Interviews with statistics graduate students
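The same field-splitting idea can be stretched to return the URL alongside the title, which is close to what the question asks for with $1 and $2. This is a sketch under assumptions: the file name awk_demo.html and its contents are invented for illustration, each line holds at most one link, and the separator set is a variation on the one above that also splits at the opening quote of the href.

```shell
# Hypothetical sample file: one YouTube link and one unrelated link.
cat > awk_demo.html <<'EOF'
<li><a href="https://www.youtube.com/watch?v=jk9S3RTAl38&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with John Chambers</a> (10:20)</li>
<li><a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a></li>
EOF

# Splitting on <a href=", "> and </a> leaves the URL in $2 and the title in $3.
# The /youtube.../ condition keeps only lines that contain a YouTube watch link.
rows=$(awk -F '<a href="|">|</a>' '/youtube\.com\/watch/ { print $3 "\t" $2 }' awk_demo.html)
printf '%s\n' "$rows"
```

Each output line is the title and the URL separated by a tab. The URL still contains the HTML entity &amp;, which would need decoding before use.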