Extracting strings with regular expression groups in the terminal

I have a text file that contains HTML like this:

<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<li><a href="https://www.youtube.com/watch?v=4u3zvtfqb7w&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: Hierarchical Clustering</a> (6:33)</li>
<li><a href="https://www.youtube.com/watch?v=jk9S3RTAl38&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with John Chambers</a> (10:20)</li>
<li><a href="https://www.youtube.com/watch?v=6l9V1sINzhE&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with Bradley Efron</a> (12:08)</li>
<li><a href="https://www.youtube.com/watch?v=79tR7BvYE6w&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with Jerome Friedman</a> (10:29)</li>
<li><a href="https://www.youtube.com/watch?v=MEMGOlJxxz0&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interviews with statistics graduate students</a> (7:44)</li>

I extract the links with grep -oP "https:\/\/www.youtube.com\/watch\?v=([A-Za-z0-9-_]+)" list > links, where list is the HTML file. In addition, I need to extract the title of each video, i.e. I need another list like this:

 Lab: K-means Clustering
 Lab: Hierarchical Clustering
 Interview with John Chambers
 Interview with Bradley Efron
 Interview with Jerome Friedman
 Interviews with statistics graduate students

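The problem is that the file also contains other tags, such as <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a>, so I cannot use a pattern that relies on the tags alone. I think I need pattern grouping, so that I can refer to the first matched group as $1, the second as $2, and so on, with something like https:\/\/www.youtube.com\/watch\?v=([A-Za-z0-9-_]+)/[SOME INFORMATION ON URL HERE]/([A-Za-z0-9-_]+). How can I do this in the terminal (Bash)?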

You can do the following:

 grep -oP "(?<=\">).*(?=</a)" your_file 

This prints:

 Lab: K-means Clustering
 Lab: Hierarchical Clustering
 Interview with John Chambers
 Interview with Bradley Efron
 Interview with Jerome Friedman
 Interviews with statistics graduate students

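Since there is no simple way to print only a captured group using grep alone, I used lookahead and lookbehind assertions to make sure that only the desired part is printed.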
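If real capture groups are wanted, as the question asks with $1 and $2, sed can reference them as \1 and \2. The following is only a sketch, assuming one link per line; the sample variable and the extract_pairs function name are made up for illustration:

```shell
# Hypothetical sample input; in practice this would be the HTML file.
sample='<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<li><a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a></li>'

# extract_pairs is a hypothetical helper name.
# -n together with the trailing p prints only lines where the substitution
# succeeded, so links that are not YouTube watch URLs are skipped.
# \1 is the video ID, \2 is the link text.
extract_pairs() {
  sed -nE 's#.*<a href="https://www\.youtube\.com/watch\?v=([A-Za-z0-9_-]+)[^"]*">([^<]+)</a>.*#\1 \2#p'
}

printf '%s\n' "$sample" | extract_pairs
# prints: YDubYJsZ9iM Lab: K-means Clustering
```

Because the leading .* is greedy, a line holding several links would yield only the last one, so this fits input with one <li> per line.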

You can use \K to discard everything matched up to that point, so that only the text matched after it is printed:

 grep -oP "a href=\"[^>]+>\K[^<]+" file
 Lab: K-means Clustering
 Lab: Hierarchical Clustering
 Interview with John Chambers
 Interview with Bradley Efron
 Interview with Jerome Friedman
 Interviews with statistics graduate students

Or, assuming that "> does not appear anywhere else:

 grep -oP "\">\K[^<]+" file 
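To skip the non-YouTube links mentioned in the question, the part of the pattern before \K can be tied to the YouTube URL. This is a minimal sketch, assuming GNU grep built with PCRE support (the -P option); the sample variable and the youtube_titles function name are made up for illustration:

```shell
# Hypothetical sample input with one YouTube link and one unrelated link.
sample='<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<li><a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a></li>'

# youtube_titles is a hypothetical helper name.
# Everything before \K must match but is dropped from the output;
# [^<]+ then captures the link text up to the closing tag.
youtube_titles() {
  grep -oP 'href="https://www\.youtube\.com/watch[^>]*>\K[^<]+'
}

printf '%s\n' "$sample" | youtube_titles
# prints: Lab: K-means Clustering
```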

You can use a non-greedy regular expression like this:

 >([^<]+?)</a> 

Or, more precisely, you can use lookarounds:

 (?<=>)([^<]+?)(?=</a>) 

Result:

 Lab: K-means Clustering
 Lab: Hierarchical Clustering
 Interview with John Chambers
 Interview with Bradley Efron
 Interview with Jerome Friedman
 Interviews with statistics graduate students

A portable awk solution:

 awk -F '<a href[^>]*>|</a>' '{print $2}' file.html
 Lab: K-means Clustering
 Lab: Hierarchical Clustering
 Interview with John Chambers
 Interview with Bradley Efron
 Interview with Jerome Friedman
 Interviews with statistics graduate students
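The same field-separator idea can be limited to YouTube links by adding a line filter, which may help with the other anchors mentioned in the question. A sketch only, assuming one link per line; the sample variable and the youtube_titles_awk function name are made up for illustration:

```shell
# Hypothetical sample input: two YouTube links and one unrelated link.
sample='<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&amp;list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<li><a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a></li>
<li><a href="https://www.youtube.com/watch?v=jk9S3RTAl38&amp;list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with John Chambers</a> (10:20)</li>'

# youtube_titles_awk is a hypothetical helper name.
# The opening and closing anchor tags act as field separators, so $2 is the
# link text; the /.../ condition keeps only lines with a YouTube watch URL.
youtube_titles_awk() {
  awk -F '<a href[^>]*>|</a>' '/youtube\.com\/watch/ { print $2 }'
}

printf '%s\n' "$sample" | youtube_titles_awk
# prints:
# Lab: K-means Clustering
# Interview with John Chambers
```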