Iterating over a text file of domains with a bash script

Hey guys, I wrote a script that reads a web page's href tags, grabs the links on that page, and writes them to a text file. Now I have a text file containing links like these:

 http://news.bbc.co.uk/2/hi/health/default.stm
 http://news.bbc.co.uk/weather/
 http://news.bbc.co.uk/weather/forecast/8?area=London
 http://newsvote.bbc.co.uk/1/shared/fds/hi/business/market_data/overview/default.stm
 http://purl.org/dc/terms/
 http://static.bbci.co.uk/bbcdotcom/0.3.131/style/3pt_ads.css
 http://static.bbci.co.uk/frameworks/barlesque/2.8.7/desktop/3.5/style/main.css
 http://static.bbci.co.uk/frameworks/pulsesurvey/0.7.0/style/pulse.css
 http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie6.css
 http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie7.css
 http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie8.css
 http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/main.css
 http://img.zgserver.com/linux/iphone.png
 http://www.bbcamerica.com/
 http://www.bbc.com/future
 http://www.bbc.com/future/
 http://www.bbc.com/future/story/20120719-how-to-land-on-mars
 http://www.bbc.com/future/story/20120719-road-opens-for-connected-cars
 http://www.bbc.com/future/story/20120724-in-search-of-aliens
 http://www.bbc.com/news/

I would like to be able to filter them so that I get back something like this:

 http://www.bbc.com : 6
 http://static.bbci.co.uk : 15

The value on the right is the number of times the domain appears in the file. How can I achieve this in bash, given that I will be looping through the file? I am new to bash shell scripting.
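Since the question asks about doing this with a loop, here is a minimal pure-bash sketch (it assumes bash 4+ for associative arrays; the function name `count_domains` and the file name `urls.txt` are just illustrative choices):

```shell
#!/usr/bin/env bash
# Read URLs on stdin, one per line, and count occurrences of each
# scheme://host prefix using a bash associative array.
count_domains() {
    declare -A counts
    local url proto rest host domain
    while IFS= read -r url; do
        [[ $url == *://* ]] || continue    # skip lines without a scheme
        proto=${url%%://*}                 # e.g. "http"
        rest=${url#*://}                   # "host/path..."
        host=${rest%%/*}                   # drop everything after the host
        counts["$proto://$host"]=$(( ${counts["$proto://$host"]:-0} + 1 ))
    done
    for domain in "${!counts[@]}"; do
        printf '%s : %d\n' "$domain" "${counts[$domain]}"
    done
}
```

You would call it as `count_domains < urls.txt`. That said, the answers below show that the standard `cut`/`grep` + `sort` + `uniq -c` pipelines do the same job with much less code.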

 $ cut -d/ -f-3 urls.txt | sort | uniq -c
       3 http://news.bbc.co.uk
       1 http://newsvote.bbc.co.uk
       1 http://purl.org
       8 http://static.bbci.co.uk
       1 http://www.bbcamerica.com
       6 http://www.bbc.com
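The `-f-3` works because, splitting a URL on `/`, field 1 is `http:`, field 2 is the empty string between the two slashes, and field 3 is the host; `cut` then re-joins fields 1 through 3 with the delimiter:

```shell
# Fields of "http://news.bbc.co.uk/weather/" split on "/":
#   1: "http:"   2: ""   3: "news.bbc.co.uk"   4: "weather" ...
echo 'http://news.bbc.co.uk/weather/' | cut -d/ -f-3
# prints: http://news.bbc.co.uk
```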

Like this:

 egrep -o '^http://[^/]+' domain.txt | sort | uniq -c 

On your example data, this outputs:

 3 http://news.bbc.co.uk
 1 http://newsvote.bbc.co.uk
 1 http://purl.org
 8 http://static.bbci.co.uk
 6 http://www.bbc.com
 1 http://www.bbcamerica.com

This solution also works if a line consists of just a bare URL with no trailing slash, so

 http://www.bbc.com/news
 http://www.bbc.com/
 http://www.bbc.com

will all end up in the same group.

If you want to allow https as well, you can write:

 egrep -o '^https?://[^/]+' domain.txt | sort | uniq -c 

If other protocols are possible, such as ftp, mailto, and so on, you can be even more permissive and write:

 egrep -o '^[^:]+://[^/]+' domain.txt | sort | uniq -c
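To get output closer to the `domain : count` format asked for in the question, sorted with the most frequent domain first, you can append `sort -rn` and a small `awk` reformatting step. A sketch, using a few sample lines in place of the real `domain.txt`:

```shell
# Stand-in sample data for the domain.txt used above.
printf '%s\n' \
  'http://www.bbc.com/news/' \
  'http://www.bbc.com/future' \
  'https://static.bbci.co.uk/a.css' > domain.txt

egrep -o '^[^:]+://[^/]+' domain.txt \
  | sort | uniq -c | sort -rn \
  | awk '{print $2 " : " $1}'
# prints:
#   http://www.bbc.com : 2
#   https://static.bbci.co.uk : 1
```

`uniq -c` puts the count first, so the `awk` step just swaps the two columns and inserts the ` : ` separator.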