Extracting data from an HTML table

I am looking for a way to extract certain information from HTML in a Linux shell environment.

This is the bit I am interested in:

    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
      <tr valign="top">
        <th>Tests</th>
        <th>Failures</th>
        <th>Success Rate</th>
        <th>Average Time</th>
        <th>Min Time</th>
        <th>Max Time</th>
      </tr>
      <tr valign="top" class="Failure">
        <td>103</td>
        <td>24</td>
        <td>76.70%</td>
        <td>71 ms</td>
        <td>0 ms</td>
        <td>829 ms</td>
      </tr>
    </table>

I want to store these values in shell variables, or echo them as key-value pairs extracted from the HTML above. For example:

    Tests : 103
    Failures : 24
    Success Rate : 76.70 %
    and so on..

What I can do right now is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this information.

However, using Java here seems like overhead: I would have to ship a runnable JAR inside the "wrapper" script that is going to be executed.

I am sure there must be "shell" languages out there that can do the same thing, i.e. Perl, Python, Bash, etc.

My problem is that I have no experience with any of them. Could someone help me solve this "fairly easy" problem?

Quick update:

I forgot to mention that there are more tables and more rows in the .html document (sorry).

Update #2:

I tried to install BeautifulSoup like this, since I do not have root access:

    $ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
    $ tar -zxvf beautifulsoup4-4.1.0.tar.gz
    $ cp -r beautifulsoup4-4.1.0/bs4 .
    $ vi htmlParse.py    # pasted the code from Tichodromas' answer (http://pastebin.com/4Je11Y9q is what I pasted, just in case)
    $ python htmlParse.py

The error:

    $ python htmlParse.py
    Traceback (most recent call last):
      File "htmlParse.py", line 1, in ?
        from bs4 import BeautifulSoup
      File "/home/gdd/setup/py/bs4/__init__.py", line 29
        from .builder import builder_registry
                      ^
    SyntaxError: invalid syntax
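A likely cause worth checking (this note is not part of the original question): the failing line uses the `from .builder import ...` relative-import syntax, which only exists in Python 2.5 and later, so an older interpreter would reject it with exactly this SyntaxError (an answer further down notes the question was later updated to say Python 2.4 is in use). A quick way to confirm the interpreter version:

    import sys
    # Print the interpreter version; anything older than (2, 5) cannot parse
    # bs4's relative imports and fails as shown above.
    print(sys.version_info)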

Update #3:

Running Tichodromas' answer gives this error:

    Traceback (most recent call last):
      File "test.py", line 27, in ?
        headings = [th.get_text() for th in table.find("tr").find_all("th")]
    TypeError: 'NoneType' object is not callable

Any ideas?

Solutions collected from the web for "Extracting data from an HTML table"

A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit 3: selects the table using class="details"):

    from bs4 import BeautifulSoup

    html = """
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
    </tr>
    <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
    </tr>
    </table>"""

    soup = BeautifulSoup(html)
    table = soup.find("table", attrs={"class":"details"})

    # The first tr contains the field names.
    headings = [th.get_text() for th in table.find("tr").find_all("th")]

    datasets = []
    for row in table.find_all("tr")[1:]:
        dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
        datasets.append(dataset)

    print datasets

The result looks like this:

    [[(u'Tests', u'103'), (u'Failures', u'24'), (u'Success Rate', u'76.70%'), (u'Average Time', u'71 ms'), (u'Min Time', u'0 ms'), (u'Max Time', u'829 ms')]]

Edit 2: To produce the desired output, use something like this:

    for dataset in datasets:
        for field in dataset:
            print "{0:<16}: {1}".format(field[0], field[1])

Result:

    Tests           : 103
    Failures        : 24
    Success Rate    : 76.70%
    Average Time    : 71 ms
    Min Time        : 0 ms
    Max Time        : 829 ms
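Since the original goal was to get these values into shell variables, one possible follow-up (this sketch is not part of the answer above, and the NAME="value" convention is only an assumption about what the wrapper script expects) is to print them as assignments a shell script could eval:

    # Hypothetical follow-up: turn each heading into an upper-case,
    # underscore-separated name and print NAME="value" lines.
    for dataset in datasets:
        for name, value in dataset:
            shell_name = name.upper().replace(" ", "_")   # e.g. "Success Rate" -> "SUCCESS_RATE"
            print('%s="%s"' % (shell_name, value))

A wrapper could then capture that output with something like eval "$(python htmlParse.py)", though whether that fits the surrounding setup is an assumption.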

Assuming your HTML is stored in a file called mycode.html, here is a bash way to do it:

    paste -d: <(grep '<th>' mycode.html | sed -e 's,</*th>,,g') <(grep '<td>' mycode.html | sed -e 's,</*td>,,g')

Note: the output is not perfectly aligned.

    undef $/;
    $text = <DATA>;

    @tabs = $text =~ m!<table.*?>(.*?)</table>!gms;
    for (@tabs) {
        @th = m!<th>(.*?)</th>!gms;
        @td = m!<td>(.*?)</td>!gms;
    }
    for $i (0..$#th) {
        printf "%-16s\t: %s\n", $th[$i], $td[$i];
    }

    __DATA__
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
    </tr>
    <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
    </tr>
    </table>

The output looks like this:

    Tests           	: 103
    Failures        	: 24
    Success Rate    	: 76.70%
    Average Time    	: 71 ms
    Min Time        	: 0 ms
    Max Time        	: 829 ms

A Python solution that uses only the standard library (taking advantage of the fact that this HTML happens to be well-formed XML). It can handle multiple rows of data.

(Tested with Python 2.6 and 2.7. The question was updated to say that the OP uses Python 2.4, so this answer may not be very useful in that case; ElementTree was added in Python 2.5.)

    from xml.etree.ElementTree import fromstring

    HTML = """
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
    </tr>
    <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
    </tr>
    <tr valign="top" class="whatever">
    <td>A</td>
    <td>B</td>
    <td>C</td>
    <td>D</td>
    <td>E</td>
    <td>F</td>
    </tr>
    </table>"""

    tree = fromstring(HTML)
    rows = tree.findall("tr")
    headrow = rows[0]
    datarows = rows[1:]

    for num, h in enumerate(headrow):
        data = ", ".join([row[num].text for row in datarows])
        print "{0:<16}: {1}".format(h.text, data)

Output:

    Tests           : 103, A
    Failures        : 24, B
    Success Rate    : 76.70%, C
    Average Time    : 71 ms, D
    Min Time        : 0 ms, E
    Max Time        : 829 ms, F
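If looking values up by name is more convenient than printing them, a small variation on the answer above (a sketch, not part of the original answer) collects each data row into a dict keyed by the headings:

    # Build one dict per data row, keyed by the column headings.
    records = []
    for row in datarows:
        records.append(dict((h.text, row[i].text) for i, h in enumerate(headrow)))

    print(records[0]["Tests"])   # -> 103 for the first data row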

Here is the top answer, adapted for Python 3 compatibility and improved by stripping the whitespace in each cell:

    from bs4 import BeautifulSoup

    html = """
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
    </tr>
    <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
    </tr>
    </table>"""

    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find("table")

    # The first tr contains the field names.
    headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]
    print(headings)

    datasets = []
    for row in table.find_all("tr")[1:]:
        dataset = dict(zip(headings, (td.get_text() for td in row.find_all("td"))))
        datasets.append(dataset)

    print(datasets)
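To get from those dicts back to the "Name : value" listing the question asked for, a short follow-up loop (a sketch that assumes the datasets list built above) would be:

    # Print each row's cells in aligned "Name : value" form.
    for dataset in datasets:
        for name, value in dataset.items():
            print("{0:<16}: {1}".format(name, value.strip()))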

Below is a regex-based Python solution that I have tested with Python 2.7. It does not depend on the xml module, so it works even when the XML is not well-formed.

    import re

    # input args: html string
    # output: tables as a list, column max length
    def extract_html_tables(html):
        tables = []
        maxlen = 0
        rex1 = r'<table.*?/table>'
        rex2 = r'<tr.*?/tr>'
        rex3 = r'<(td|th).*?/(td|th)>'
        s = re.search(rex1, html, re.DOTALL)
        while s:
            t = s.group()                       # the table
            s2 = re.search(rex2, t, re.DOTALL)
            table = []
            while s2:
                r = s2.group()                  # the row
                s3 = re.search(rex3, r, re.DOTALL)
                row = []
                while s3:
                    d = s3.group()              # the cell
                    #row.append(strip_tags(d).strip())
                    row.append(d.strip())
                    r = re.sub(rex3, '', r, 1, re.DOTALL)
                    s3 = re.search(rex3, r, re.DOTALL)
                table.append(row)
                if maxlen < len(row):
                    maxlen = len(row)
                t = re.sub(rex2, '', t, 1, re.DOTALL)
                s2 = re.search(rex2, t, re.DOTALL)
            html = re.sub(rex1, '', html, 1, re.DOTALL)
            tables.append(table)
            s = re.search(rex1, html, re.DOTALL)
        return tables, maxlen

    html = """
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
    </tr>
    <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
    </tr>
    </table>"""

    print extract_html_tables(html)
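One thing to note about this function: the returned cells still contain their <td>/<th> markup. A small, hedged follow-up (not part of the answer above) can strip the tags with a helper along the lines of the strip_tags call commented out in the code, and then print the familiar key/value pairs, assuming the first row of each table holds the headings:

    # Strip markup from each cell and pair the data rows with the heading row.
    def strip_tags(cell):
        return re.sub(r'<[^>]+>', '', cell).strip()

    tables, maxlen = extract_html_tables(html)
    for table in tables:
        headings = [strip_tags(c) for c in table[0]]
        for row in table[1:]:
            for name, value in zip(headings, (strip_tags(c) for c in row)):
                print("{0:<16}: {1}".format(name, value))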