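This is a work in progress, and I'm looking for advice from people who know more than I do (computers are my hobby, not my profession).
This script is meant to organize a TV show directory (renaming each file to the convention S01E01.Title of Episode.ext, and creating a symbolic link under the original name).
I'm enjoying writing this, and I don't want anyone to spend too much time on it. I think my biggest hang-up right now is:
Grabbing the correct block of text from the wiki with awk (according to the season #)
(If anything looks inefficient, please let me know; I'm learning.)
I've been all over these forums, making progress as I go. I've exhausted most of the similar questions. (These forums are 100% of how I got this far.)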
    ## Find show name and season (directories nested: /show/season)
    show1=$(cd .. ; pwd)
    show="${show1##*/}"
    season=("${PWD##*/}")
    IFS=$'\n'

    ## Download list of episodes for given season
    wget -q -O- --header\="Accept-Encoding: gzip" https://en.wikipedia.org/wiki/List_of_$show\_episodes | gunzip > tmp.html

    ## Working on first awk/sed command to grab textblock of only specific season
    ## grep command works great, except when episode is hyperlinked ('a href' tag gets cut)
    if [ "$season" == 'Season 1' ]; then
      listing=( $(awk '/\(season_1\)/,/rellink/' tmp.html | grep "summary.*[\"]<" | cut -d'"' -f6) )
      unset IFS
    elif [ "$season" == 'Season 2' ]; then
      listing=( $(awk '/\(season_2\)/,/rellink/' tmp.html | grep "summary.*[\"]<" | cut -d'"' -f6) )
      unset IFS
    #..........................continued 20 times or so
    fi
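I've tweaked the code above a lot, so the second half still needs to be finished. Before that, though, it was about 90% working. The only problem was that episodes hyperlinked on the Wikipedia page (because of the cut) ended up with names like S01E05.ahref = .mkv.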
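The failure can be reproduced offline. The sketch below uses two sample rows that are made up to mirror the markup the script greps for (the `href` value in particular is an assumption), and the helper name `sixth_field` is hypothetical. It shows that the sixth `"`-delimited field is the title only when the title is not wrapped in a link.

```bash
#!/usr/bin/env bash
# Two sample rows, made up to mirror the markup the script greps for.
plain='<td class="summary" style="text-align: left;">"Mole Hunt"</td>'
linked='<td class="summary" style="text-align: left;">"<a href="/wiki/Skytanic" title="Skytanic (Archer)">Skytanic</a>"</td>'

# Same extraction step as in the script: sixth field when splitting on ".
sixth_field() {
  cut -d'"' -f6 <<< "$1"
}

sixth_field "$plain"    # prints: Mole Hunt
sixth_field "$linked"   # prints: <a href=
```

In the linked row the extra quotes around the `href` value shift every later field, so a fixed field number cannot work for both kinds of row.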
    ## Parse filename for season/episode descriptor
    ## Rename file with season/episode and name from wikipedia database
    for file in *
    do
      name=$(ls "$file" | grep -o "S[0-9][0-9]E[0-9][0-9]")
      episode=$(ls "$file" | grep -o "E[0-9][0-9]")
      if [ "$episode" == 'E01' ]; then
        mv "$file" "$name.${listing[0]}.mkv"
        ln -s "$name.${listing[0]}.mkv" "$file"
        echo "Renamed '$file' and created a symbolic link."
      #..........................continued
      fi
    done
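I agree with the comments: bash is not the way to go when parsing web pages or HTML. But if you've already started and want to do it in bash, it's not impossible. Looking at your code, I like your use of bash substitution and globbing, but I was a bit confused about how it all fits together, so I wrote a simple version of my own that you can adapt or take pieces from.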
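    #!/bin/bash

    show="Archer"
    url="http://en.wikipedia.org/wiki/List_of_${show}_episodes"

    while read -r line; do
      # A season heading starts a new season: reset the episode counter.
      [[ $line =~ "<h3><span class=\"mw-headline\" id=\"Season" ]] && episode= && ((season+=1))
      # A summary cell holds the episode title between literal double quotes.
      if [[ $line =~ "<td class=\"summary\" style=\"text-align: left;\">\""(.*)"\"" ]]; then
        title="${BASH_REMATCH[1]}"
        # Hyperlinked title: take the link's title attribute instead.
        [[ "$title" =~ "title=\""(.*)"\"" ]] && title="${BASH_REMATCH[1]}"
        # Drop everything from the first remaining double quote onwards.
        title="${title%%\"*}"
        # Remove a "(<show>)" disambiguation suffix.
        title="$(echo ${title/($show)/})"
        echo "Season [$season] Episode [$((episode+=1))] Title [$title]"
      fi
    done < <(wget -qO- "$url")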
Sample output (also tested with scrubs and simpsons, which gave correct results):
    Season [1] Episode [1] Title [Mole Hunt]
    Season [1] Episode [2] Title [Training Day]
    Season [1] Episode [3] Title [Diversity Hire]
    Season [1] Episode [4] Title [Killing Utne]
    Season [1] Episode [5] Title [Honeypot]
    Season [1] Episode [6] Title [Skorpio]
    Season [1] Episode [7] Title [Skytanic]
    Season [1] Episode [8] Title [The Rock]
    Season [1] Episode [9] Title [Job Offer]
    Season [1] Episode [10] Title [Dial M for Mother]
    Season [2] Episode [1] Title [Swiss Miss]
    Season [2] Episode [2] Title [A Going Concern]
    Season [2] Episode [3] Title [Blood Test]
    Season [2] Episode [4] Title [Pipeline Fever]
    Season [2] Episode [5] Title [The Double Deuce]
    Season [2] Episode [6] Title [Tragical History]
    Season [2] Episode [7] Title [Movie Star]
    Season [2] Episode [8] Title [Stage Two]
    Season [2] Episode [9] Title [Placebo Effect]
    Season [2] Episode [10] Title [El Secuestro]
    Season [2] Episode [11] Title [Jeu Monégasque]
    Season [2] Episode [12] Title [White Nights]
    Season [2] Episode [13] Title [Double Trouble]
    Season [3] Episode [1] Title [Heart of Archness: Part I]
    Season [3] Episode [2] Title [Heart of Archness: Part II]
    Season [3] Episode [3] Title [Heart of Archness: Part III]
    Season [3] Episode [4] Title [The Man from Jupiter]
    Season [3] Episode [5] Title [El Contador]
    Season [3] Episode [6] Title [The Limited]
    Season [3] Episode [7] Title [Drift Problem]
    Season [3] Episode [8] Title [Lo Scandalo]
    Season [3] Episode [9] Title [Bloody Ferlin]
    Season [3] Episode [10] Title [Crossing Over]
    Season [3] Episode [11] Title [Skin Game]
    Season [3] Episode [12] Title [Space Race]
    Season [3] Episode [13] Title [Space Race]
    Season [4] Episode [1] Title [Fugue and Riffs]
    Season [4] Episode [2] Title [The Wind Cries Mary]
    Season [4] Episode [3] Title [Legs]
    Season [4] Episode [4] Title [Midnight Ron]
    Season [4] Episode [5] Title [Viscous Coupling]
    Season [4] Episode [6] Title [Once Bitten]
    Season [4] Episode [7] Title [Live and Let Dine]
    Season [4] Episode [8] Title [Coyote Lovely]
    Season [4] Episode [9] Title [The Honeymooners]
    Season [4] Episode [10] Title [Un Chien Tangerine]
    Season [4] Episode [11] Title [The Papal Chase]
    Season [4] Episode [12] Title [Sea Tunt: Part I]
    Season [4] Episode [13] Title [Sea Tunt: Part II]
    Season [5] Episode [1] Title [White Elephant]
    Season [5] Episode [2] Title [Archer Vice: A Kiss While Dying]
    Season [5] Episode [3] Title [Archer Vice: A Debt of Honor]
    Season [5] Episode [4] Title [Archer Vice: House Call]
    Season [5] Episode [5] Title [Archer Vice: Southbound and Down]
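Explanation:
I find BASH_REMATCH useful in a lot of situations, for example when you have to match a substring and don't want to work out some crazy regular expression.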
BASH_REMATCH An array variable whose members are assigned by the =~ binary operator to the [[ conditional command. The element with index 0 is the portion of the string matching the entire regular expression. The element with index n is the portion of the string matching the nth parenthesized subexpression. This variable is read-only.
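As a small illustration of the quoted description, the sketch below matches an `SxxEyy` tag in a file name and prints elements 0, 1, and 2 of `BASH_REMATCH`. The function name `describe_match` and the file name are made up for the example.

```bash
#!/usr/bin/env bash
# Minimal BASH_REMATCH demo; the file name used below is a made-up example.

describe_match() {
  local filename=$1
  # Keeping the regex in a variable avoids quoting surprises inside [[ ]].
  local re='S([0-9][0-9])E([0-9][0-9])'
  if [[ $filename =~ $re ]]; then
    # [0] is the whole match; [1] and [2] are the parenthesized groups.
    echo "whole=${BASH_REMATCH[0]} season=${BASH_REMATCH[1]} episode=${BASH_REMATCH[2]}"
  else
    echo "no match"
  fi
}

describe_match "Archer.S02E07.720p.mkv"   # prints: whole=S02E07 season=02 episode=07
```

Note that `BASH_REMATCH` is overwritten by every `=~` test, so copy any group you need into its own variable before matching again.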
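Otherwise, the main problem is, as you said, that the title format can vary. So I just did another BASH_REMATCH for when the title is an a href (that is, when it has a title attribute), and removed the trailing text in the odd case where the episode hasn't aired yet. There may be some other cases, but this worked for all 3 shows I tested.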
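Let me suggest a multi-platform web-scraping CLI that doesn't seem to get enough attention: xidel
It supports the XPath 2, CSS 3, XQuery 1, and JSONiq query languages.
Applied to a simplified version of the scenario at hand, we get the following, which hopefully gives you an idea of how much easier scraping becomes: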
    #!/usr/bin/env bash

    # Example values.
    show='Archer'
    season=2

    # Synthesize the URL to scrape.
    url="http://en.wikipedia.org/wiki/List_of_${show}_episodes"

    # The XPath expression for extracting the specified season's episode titles
    qryEpisodeTitles="//*[matches(@id, '^Season_$season')]
                        /../following-sibling::table[1]//td[@class='summary']"

    # Scrape the page at the URL and read all episode titles
    # (including enclosing " chars.) into an array.
    IFS=$'\n' read -d '' -ra episodeTitles <<<"$(xidel -e "$qryEpisodeTitles" "$url")"

    # Enumerate all episode titles with an index.
    # Note: Typically, episodes are enclosed in literal `"` chars.; additionally,
    #       after the closing `"`, they may contain footnote references, such as
    #       `[2]` or `†`, so some cleaning-up may be required.
    i=0
    for episodeTitle in "${episodeTitles[@]}"; do
      echo "Episode $((++i)): $episodeTitle"
    done
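The clean-up mentioned in the note could look roughly like the sketch below. It assumes each raw title has the shape described there (an opening `"`, the title, a closing `"`, then optional footnote text); the helper name `clean_title` and the sample strings are made up.

```bash
#!/usr/bin/env bash
# Sketch: strip the enclosing quotes and any footnote text after them.
# clean_title is a hypothetical helper; the sample titles are made up.

clean_title() {
  local title=$1
  title=${title#\"}     # drop one leading double quote, if present
  title=${title%\"*}    # drop the last double quote and everything after it
  printf '%s\n' "$title"
}

clean_title '"Swiss Miss"'            # prints: Swiss Miss
clean_title '"A Going Concern"[2]'    # prints: A Going Concern
clean_title '"Blood Test"†'           # prints: Blood Test
```

Using `%` (shortest match) rather than `%%` keeps quotes that are part of the title itself and removes only the final one with its trailing footnote.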
Thanks to BroSlow's help (and code!), the script is complete.
    #!/bin/bash

    show1=$(cd .. ; pwd)
    show="${show1##*/}"
    seas1="${PWD##*/}"
    seas=$(echo $seas1 | grep -o "[0-9][0-9]*")
    url=http://en.wikipedia.org/wiki/List_of_$show\_episodes
    IFS=$'\n'

    while read line; do
      [[ $line =~ "<h3><span class=\"mw-headline\" id=\"Season" ]] && episode= && ((season+=1))
      if [[ $line =~ "<td class=\"summary\" style=\"text-align: left;\">\""(.*)"\"" ]]; then
        title="${BASH_REMATCH[1]}"
        [[ "$title" =~ "title=\""(.*)"\"" ]] && title="${BASH_REMATCH[1]}"
        title="${title%%\"*}"
        title="$(echo ${title/($show)/})"
        arrTitle+=( "${season}.${title}" )
      fi
    done < <(wget -qO- "$url")

    ## Make new array of only specific season ($seas=current dirname).
    ## Strip the leading "<season>." prefix; ${i#*.} keeps any later dots in the title
    for i in "${arrTitle[@]}"; do
      if [[ $i == $seas.* ]]; then
        arrNewTitle+=( "${i#*.}" )
      fi
    done

    n=-1
    for file in *; do
      n=$((n+1))   # a bare $((n+=1)) would be run as a command named after its result
      name=$(grep -o "S[0-9][0-9]E[0-9][0-9]" <<< "$file")
      mv "$file" "$name.${arrNewTitle[n]}.mkv"
      ln -s "$name.${arrNewTitle[n]}.mkv" "$file"
      echo "Renamed '$file' and created a symbolic link."
    done

    ## Remove script when done, and its symbolic link (z to be at bottom of filelist)
    rm 'z_rename.sh' '..mkv'
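The loop above pairs files with titles by their position in the directory listing, so a missing episode or an extra file shifts every later name. One possible alternative, sketched here as a dry run that only prints names, is to pick the title from the episode number in each file name. The function `new_name_for` and the sample file names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: choose the title by the episode number found in the file name.
# Usage: new_name_for FILE TITLE...   (titles given in episode order)
# new_name_for is a hypothetical helper; it only prints, it renames nothing.

new_name_for() {
  local file=$1; shift
  local titles=("$@")
  local re='(S[0-9]{2}E([0-9]{2}))'

  [[ $file =~ $re ]] || return 1            # no SxxEyy tag: skip this file
  local tag=${BASH_REMATCH[1]}
  local ep=$((10#${BASH_REMATCH[2]}))       # 10# keeps 08 and 09 from being read as octal

  (( ep >= 1 && ep <= ${#titles[@]} )) || return 1   # no title for this episode
  printf '%s.%s.%s\n' "$tag" "${titles[ep-1]}" "${file##*.}"
}

titles=("Swiss Miss" "A Going Concern" "Blood Test")
new_name_for "archer.S02E03.hdtv.mkv" "${titles[@]}"   # prints: S02E03.Blood Test.mkv
```

Because files without a tag are skipped, the script itself would no longer be renamed, and the original extension is kept instead of assuming `.mkv`.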