I have a text file that contains HTML like this:
<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<li><a href="https://www.youtube.com/watch?v=4u3zvtfqb7w&list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: Hierarchical Clustering</a> (6:33)</li>
<li><a href="https://www.youtube.com/watch?v=jk9S3RTAl38&list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with John Chambers</a> (10:20)</li>
<li><a href="https://www.youtube.com/watch?v=6l9V1sINzhE&list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with Bradley Efron</a> (12:08)</li>
<li><a href="https://www.youtube.com/watch?v=79tR7BvYE6w&list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with Jerome Friedman</a> (10:29)</li>
<li><a href="https://www.youtube.com/watch?v=MEMGOlJxxz0&list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interviews with statistics graduate students</a> (7:44)</li>
I use
grep -oP "https:\/\/www.youtube.com\/watch\?v=([A-Za-z0-9-_]+)" list > links
to extract the links, where list is the HTML file. I also need to extract the name of each video, i.e. I need another list like this:
Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students
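The problem is that the file also contains other tags, such as <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a>, so I can't use a pattern that matches on the tags alone. I think I need something like pattern grouping, so that I can refer to the first matched group as $1, the second as $2, and so on, with a pattern like https:\/\/www.youtube.com\/watch\?v=([A-Za-z0-9-_]+)/[SOME INFORMATION ON URL HERE]/([A-Za-z0-9-_]+). How can I do this in the terminal (Bash)?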
You can do the following:
grep -oP "(?<=\">).*(?=</a)" your_file
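This will print:
Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students
Since there is no simple way to print only a captured group using grep alone, I used lookbehind and lookahead assertions to make sure that only the part between them is printed.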
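One caveat worth sketching: if several links sit on the same line, the greedy .* in the command above runs from the first "> to the last </a on that line. Replacing it with [^<]+ keeps each match inside a single tag. This is only a sketch; it assumes GNU grep built with PCRE support (-P), and the sample variable with its shortened URLs is made up for illustration.

```shell
# Two links on ONE line (hypothetical sample, URLs shortened).
sample='<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM">Lab: K-means Clustering</a> (6:31)</li> <li><a href="https://www.youtube.com/watch?v=4u3zvtfqb7w">Lab: Hierarchical Clustering</a> (6:33)</li>'

# (?<=">)  lookbehind: the match must be preceded by ">
# [^<]+    the link text itself; it cannot run past the next tag
# (?=</a>) lookahead: the match must be followed by </a>
titles=$(printf '%s\n' "$sample" | grep -oP '(?<=">)[^<]+(?=</a>)')

printf '%s\n' "$titles"
```

Each title comes out on its own line, regardless of how the input lines are broken.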
You can use \K to discard everything matched up to that point, so that only the text after it is printed:
grep -oP "a href=\"[^>]+>\K[^<]+" file
Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students
Or, assuming "> does not appear anywhere else:
grep -oP "\">\K[^<]+" file
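To skip unrelated links such as the ISL book page mentioned in the question, the part before \K can be narrowed to YouTube URLs. A minimal sketch, assuming GNU grep with -P; the temporary file and its three sample lines are illustrative only.

```shell
# Build a small sample file: one non-YouTube link and two YouTube links.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a>
<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<li><a href="https://www.youtube.com/watch?v=jk9S3RTAl38&list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with John Chambers</a> (10:20)</li>
EOF

# Everything up to \K must match (a YouTube watch URL inside href="..."),
# but only the link text after \K is printed.
yt_titles=$(grep -oP 'href="https://www\.youtube\.com/watch\?v=[^"]*">\K[^<]+' "$tmp")

printf '%s\n' "$yt_titles"
rm -f "$tmp"
```

The ISL link is ignored because its href does not match the prefix in front of \K.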
You can use a non-greedy regular expression like this:
>([^<]+?)</a>
Or, more precisely, you can use lookarounds:
(?<=>)([^<]+?)(?=</a>)
Result:
Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students
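The question also asked about $1/$2-style groups. grep cannot print individual groups, but sed can refer to them as \1 and \2 in the replacement. The following is a sketch rather than a general HTML parser: it assumes at most one link per line and a sed that supports -E (GNU and BSD sed do), and the sample file is illustrative.

```shell
# Sample input: one non-YouTube link and two YouTube links, one per line.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning with Applications in R</a>
<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM&list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2">Lab: K-means Clustering</a> (6:31)</li>
<li><a href="https://www.youtube.com/watch?v=jk9S3RTAl38&list=PL5-da3qGB5IC8_kWZXDcmLx7_n4RTBkAS">Interview with John Chambers</a> (10:20)</li>
EOF

# Group 1: the video id after watch?v=
# Group 2: the link text between > and </a>
# -n plus the trailing p prints only lines where the substitution happened.
pairs=$(sed -nE 's#.*watch\?v=([A-Za-z0-9_-]+)[^>]*>([^<]+)</a>.*#\1 \2#p' "$tmp")

printf '%s\n' "$pairs"
rm -f "$tmp"
```

Each output line holds the video id, a space, and the title, so the two lists from the question can be produced in one pass.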
Using a portable awk solution:
awk -F '<a href[^>]*>|</a>' '{print $2}' file.html
Lab: K-means Clustering
Lab: Hierarchical Clustering
Interview with John Chambers
Interview with Bradley Efron
Interview with Jerome Friedman
Interviews with statistics graduate students
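Printing $2 picks up only the first link on each line. If a line may hold several links, one possible variation is to loop over the even-numbered fields, since the opening and closing tags alternate as separators. This is a sketch under that assumption; the line variable and its shortened URLs are hypothetical.

```shell
# Two links on one line (hypothetical sample, URLs shortened).
line='<li><a href="https://www.youtube.com/watch?v=YDubYJsZ9iM">Lab: K-means Clustering</a> (6:31)</li> <li><a href="https://www.youtube.com/watch?v=4u3zvtfqb7w">Lab: Hierarchical Clustering</a> (6:33)</li>'

# With <a href...> and </a> as field separators, the fields alternate:
# $1 = text before the first link, $2 = first title, $3 = text between links,
# $4 = second title, and so on. Lines without links have NF < 2 and print nothing.
awk_titles=$(printf '%s\n' "$line" |
  awk -F '<a href[^>]*>|</a>' '{ for (i = 2; i <= NF; i += 2) print $i }')

printf '%s\n' "$awk_titles"
```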