Periodically download all images from a website that uses JavaScript auto-scrolling

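I found a website with lots of high-quality free images hosted on Tumblr (it says you can do whatever you want with the images :P).

I'm running Ubuntu 12.04 LTS. I need to write a script that runs periodically (say, every day) and downloads only the images that have not been downloaded before.

Additional note: the site has a JavaScript auto-scroller, and more images are loaded when you reach the bottom of the page.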

The fantastic original script by TMS no longer works with the new unsplash website. Here is an updated, working version.

    #!/bin/bash
    mkdir -p imgs
    I=1
    while true ; do # for all the pages
        wget "https://unsplash.com/grid?page=$I" -O tmppage
        grep 'img.*src.*unsplash.imgix.net' tmppage | cut -d'?' -f1 | cut -d'"' -f2 > tmppage.imgs
        if [ ! -s tmppage.imgs ] ; then # empty page - end the loop
            break
        fi
        echo "Reading page $I:"
        cat tmppage.imgs | while read -r IMG; do # for all the images on page
            TARGET=imgs/$(basename "$IMG")
            echo -n "Photo $TARGET: "
            if [ -f "$TARGET" ] ; then # we already have this image
                echo "file already exists"
                continue
            fi
            echo -n "downloading (PAGE $I)"
            wget "$IMG" -O "$TARGET"
        done
        I=$((I+1))
    done

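First, you have to find out how the auto-scrolling script works. The easiest way to do that is not to reverse-engineer the JavaScript but to look at the network activity, for example with the Firebug add-on for Firefox and its "Net" panel. You will quickly see that the site is organized into pages: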

    unsplash.com/page/1
    unsplash.com/page/2
    unsplash.com/page/3
    ...

As you scroll, the script requests the subsequent pages.
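The page-by-page structure can be walked with a simple loop. The following is a minimal Python 3 sketch, not the site's actual API: `iter_pages`, the `is_empty` check, and the `max_pages` safety limit are hypothetical names introduced for illustration, and the URL scheme is only the one observed above, which the site may no longer serve.

```python
import urllib.request


def fetch_page(index):
    """Download one page using the URL scheme observed in the Net panel."""
    url = "http://unsplash.com/page/%d" % index
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def iter_pages(fetch=fetch_page,
               is_empty=lambda html: "<img" not in html,
               max_pages=1000):
    """Yield (index, html) for pages 1, 2, 3, ... until an empty page appears.

    Treating a page without any <img tag as "empty" is an assumption;
    max_pages is a safety limit so the loop always terminates.
    """
    for index in range(1, max_pages + 1):
        html = fetch(index)
        if is_empty(html):
            break
        yield index, html
```

Passing `fetch` as a parameter keeps the loop independent of the network, so it can be tried against saved pages before pointing it at the real site.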

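So we can write a script that downloads all the pages, parses the HTML for the images, and downloads them. If you look at the HTML code, you will see that the images appear in a convenient and unique form:

    <a href="http://bit.ly/14nUvzx"><img src="http://img.zgserver.com/javascript/tumblr_mq7bnogm3e1st5lhmo1_1280.jpg" alt="Download &nbsp;/ &nbsp;By Tony&nbsp;Naccarato" title="http://unsplash.com/post/55904517579/download-by-tony-naccarato" class="photo_img" /></a>

The `<a href` contains the URL of the full-resolution image. The `title` attribute contains a nice unique URL that also leads to the image. We will use it to build a nice unique name for the image, much nicer than the name under which it is stored. This unique name also ensures that no image is downloaded twice.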

Shell script (unsplash.sh)

    #!/bin/bash
    mkdir -p imgs
    I=1
    while true ; do # for all the pages
        wget "unsplash.com/page/$I" -O tmppage
        grep '<a href.*<img src.*title' tmppage > tmppage.imgs
        if [ ! -s tmppage.imgs ] ; then # empty page - end the loop
            break
        fi
        echo "Reading page $I:"
        sed 's/^.*<a href="\([^"]*\)".*title="\([^"]*\)".*$/\1 \2/' tmppage.imgs | while read -r IMG POST ; do
            # for all the images on page
            TARGET=imgs/$(echo "$POST" | sed 's|.*post/\(.*\)$|\1|' | sed 's|/|_|g').jpg
            echo -n "Photo $TARGET: "
            if [ -f "$TARGET" ] ; then # we already have this image
                echo "already have"
                continue
            fi
            echo "downloading"
            wget "$IMG" -O "$TARGET"
        done
        I=$((I+1))
    done
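To make the naming scheme of the script above easier to follow, here is a rough Python 3 equivalent of its two `sed` steps. It is only a sketch: `ANCHOR_RE` and `parse_photo_line` are hypothetical names, the sample line is shortened, and unlike `sed` the function returns `None` when the title contains no `post/` part.

```python
import re

# Mirrors the sed expression: capture the href of the link and the
# title attribute that appears later on the same line.
ANCHOR_RE = re.compile(r'<a href="([^"]*)".*title="([^"]*)"')


def parse_photo_line(line):
    """Return (image_url, file_name) for one photo line, or None."""
    match = ANCHOR_RE.search(line)
    if match is None:
        return None
    image_url, post_url = match.groups()
    # Keep what follows the last "post/" and turn slashes into underscores.
    _, sep, tail = post_url.rpartition("post/")
    if not sep:
        return None
    return image_url, tail.replace("/", "_") + ".jpg"


if __name__ == "__main__":
    sample = ('<a href="http://bit.ly/14nUvzx"><img src="photo.jpg" '
              'title="http://unsplash.com/post/55904517579/'
              'download-by-tony-naccarato" class="photo_img" /></a>')
    print(parse_photo_line(sample))
    # ('http://bit.ly/14nUvzx', '55904517579_download-by-tony-naccarato.jpg')
```

Because the file name is derived from the post URL rather than from the storage name, the same photo always maps to the same local file, which is what makes the "already have" check work.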

To make sure this runs every day:

Create a wrapper script unsplash.cron:

    #!/bin/bash

    export PATH=... # might not be needed, but sometimes the PATH is not set
                    # correctly in cron-called scripts. Copy the PATH setting you
                    # normally see under console.

    cd YOUR_DIRECTORY # the directory where the script and imgs directory is located

    {
        echo "========================"
        echo -n "run unsplash.sh from cron "
        date

        ./unsplash.sh

    } >> OUT.log 2>> ERR.log

Then add this line to your crontab (after running `crontab -e` in the console):

    10 3 * * * PATH_to_the/unsplash.cron

This runs the script every day at 3:10.

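Below is a small Python version of the download part. The getImageURLs function scans the data from http://unsplash.com/page/X for lines that contain the word "Download" and looks for the image's "src" attribute there. It also looks for the strings current_page and total_pages (which appear in the JavaScript code) to find out how long to keep going.

Currently, it first retrieves all the URLs from all the pages and then downloads each one whose corresponding file does not exist locally. Depending on how the page numbering changes over time, it may be more efficient to stop looking for image URLs as soon as a local copy of a file is found. The files are stored in the directory in which the script is executed.

The other answer explains very well how to make sure something like this is executed daily.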
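The early-stop idea mentioned above could look roughly like this in Python 3. This is a sketch under two assumptions that the original does not confirm: that the newest images come first, and that earlier runs finished completely. `new_image_urls` and its `exists` parameter are hypothetical names introduced for illustration.

```python
import os


def new_image_urls(pages, exists=os.path.exists):
    """Collect URLs whose files are missing locally, stopping at the first hit.

    `pages` is an iterable of lists of image URLs, newest page first.
    If it is a generator, no further pages are requested after the stop.
    """
    fresh = []
    for urls in pages:
        for url in urls:
            file_name = url[url.rfind("/") + 1:]
            if exists(file_name):
                # Assumption: everything older was downloaded by a previous run.
                return fresh
            fresh.append(url)
    return fresh
```

If page contents shift in a way that breaks the newest-first assumption, this shortcut would miss images, so the full scan remains the safer default.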

    #!/usr/bin/env python

    import urllib
    import pprint
    import os

    def getImageURLs(pageIndex):
        f = urllib.urlopen('http://unsplash.com/page/' + str(pageIndex))
        data = f.read()
        f.close()

        curPage = None
        numPages = None
        imgUrls = [ ]

        for l in data.splitlines():
            if 'Download' in l and 'src=' in l:
                idx = l.find('src="')
                if idx >= 0:
                    idx2 = l.find('"', idx+5)
                    if idx2 >= 0:
                        imgUrls.append(l[idx+5:idx2])
            elif 'current_page = ' in l:
                idx = l.find('=')
                idx2 = l.find(';', idx)
                curPage = int(l[idx+1:idx2].strip())
            elif 'total_pages = ' in l:
                idx = l.find('=')
                idx2 = l.find(';', idx)
                numPages = int(l[idx+1:idx2].strip())

        return (curPage, numPages, imgUrls)

    def retrieveAndSaveFile(fileName, url):
        f = urllib.urlopen(url)
        data = f.read()
        f.close()
        g = open(fileName, "wb")
        g.write(data)
        g.close()

    if __name__ == "__main__":

        allImages = [ ]
        done = False
        page = 1
        while not done:
            print "Retrieving URLs on page", page
            res = getImageURLs(page)
            allImages += res[2]

            if res[0] >= res[1]:
                done = True
            else:
                page += 1

        for url in allImages:
            idx = url.rfind('/')
            fileName = url[idx+1:]
            if not os.path.exists(fileName):
                print "File", fileName, "not found locally, downloading from", url
                retrieveAndSaveFile(fileName, url)

        print "Done."