Parsing HTML Tables in Python with BeautifulSoup
Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree.
Here’s some code demonstrating how to extract data from HTML tables using Beautiful Soup.
Include Beautiful Soup in your application with a line like the following:
    from BeautifulSoup import BeautifulSoup  # For processing HTML
Getting started with BeautifulSoup:
    #!/usr/bin/python
    from BeautifulSoup import BeautifulSoup

    html = ['<html><body><table><tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr><tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr></table></html>']
    soup = BeautifulSoup(''.join(html))
    print soup.prettify()
The prettify method adds strategic newlines and spacing to make the structure of the document obvious. It also strips out text nodes that contain only whitespace.
The output will be:
<html>
<body>
<table>
<tr>
<td>
row 1, cell 1
</td>
<td>
row 1, cell 2
</td>
</tr>
<tr>
<td>
row 2, cell 1
</td>
<td>
row 2, cell 2
</td>
</tr>
</table>
</body>
</html>
Here is one way to navigate the soup:
    table = soup.find('table')
    rows = table.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            # Take the first text node inside the cell
            text = ''.join(td.find(text=True))
            print text+"|",
        print
and the output will be:
row 1, cell 1| row 1, cell 2|
row 2, cell 1| row 2, cell 2|
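The nested loop above can also collect the cells into a list of rows instead of printing them. The following is a minimal sketch, assuming Python 3 and the newer `bs4` package (Beautiful Soup 4) rather than the `BeautifulSoup` 3 module used in this post; `table_to_rows` and `HTML` are names made up for the illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BS3 module used in the post

# Same sample table as in the post (hypothetical constant name).
HTML = (
    "<html><body><table>"
    "<tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr>"
    "<tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr>"
    "</table></body></html>"
)


def table_to_rows(html):
    """Return the first table in `html` as a list of rows, each a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        # No <table> in the document: return an empty result instead of failing.
        return []
    rows = []
    for tr in table.find_all("tr"):
        # get_text() joins all text inside the cell; strip=True trims surrounding whitespace.
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows


if __name__ == "__main__":
    for row in table_to_rows(HTML):
        print("|".join(row))
```

Returning plain lists keeps the parsing step separate from whatever you do with the data afterwards, such as printing it or writing it to a CSV file.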
You can search the soup for tags with certain properties:
Get the table with id ‘mytable’:
    table = soup.find('table', id="mytable")
Or the table that is aligned center:
    table = soup.find('table', align="center")
Or the table whose border is 3:
    table = soup.find('table', border="3")
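To see these attribute searches side by side, here is a small sketch, again assuming Python 3 and the `bs4` package instead of the BS3 module used in this post. The sample markup and the helper `first_cell` are invented for the example. Note that `find` returns `None` when nothing matches, so it is worth checking the result before using it.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

# Hypothetical sample document with three differently tagged tables.
HTML = """
<table id="mytable"><tr><td>by id</td></tr></table>
<table align="center"><tr><td>centered</td></tr></table>
<table border="3"><tr><td>bordered</td></tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")


def first_cell(table):
    """Return the text of the first <td> in `table`, or None if there is no table or cell."""
    if table is None:
        return None
    td = table.find("td")
    return td.get_text(strip=True) if td is not None else None


by_id = soup.find("table", id="mytable")              # match on the id attribute
centered = soup.find("table", align="center")         # match on the align attribute
bordered = soup.find("table", attrs={"border": "3"})  # same idea, using an attrs dictionary
missing = soup.find("table", id="no-such-table")      # no match, so this is None

if __name__ == "__main__":
    print(first_cell(by_id), first_cell(centered), first_cell(bordered), missing)
```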
If you want to fetch data across the web, use the urllib module.
    import urllib

    f = urllib.urlopen("http://somesite/table.html")
    html = f.read()
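On Python 3 the same fetch is done through `urllib.request`, since `urllib.urlopen` exists only in Python 2. The sketch below assumes Python 3; the function name `fetch_html`, the 10-second timeout, and the UTF-8 fallback are choices made for this illustration, not something the post prescribes.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download `url` and return the body decoded to a string."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header if there is one;
        # falling back to UTF-8 is an assumption of this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage, with the placeholder address from the post:
# html = fetch_html("http://somesite/table.html")
```

The returned string can then be passed to Beautiful Soup in the same way as the hard-coded `html` earlier in the post.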
Installing BeautifulSoup in Debian/Ubuntu:
    $ sudo aptitude install python-beautifulsoup
Read more on the Beautiful Soup documentation page.
That’s it! Have fun!
Thanks a lot for this demonstration, it was very useful for me. Still, I need to know how to deal with empty td’s, which raise a TypeError on this line:
    text = ''.join(td.find(text=True))
Thanks again
Claudio
@Maclaud
Check if ‘text’ is None first. If ‘td’ is empty, td.find() will return None.
    text = td.find(text=True)
    if text is not None:
        ....
        ....
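A fuller version of that check might look like the sketch below. It assumes Python 3 and the `bs4` package rather than the BS3 module from the post, where the search argument is spelled `string=True` instead of `text=True`. The function `cell_texts`, its `empty` parameter, and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4

# Hypothetical sample row: one filled cell and one empty cell.
HTML = "<table><tr><td>filled</td><td></td></tr></table>"


def cell_texts(html, empty=""):
    """Return the first text node of every <td>, using `empty` for cells without text."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for td in soup.find_all("td"):
        first = td.find(string=True)  # None when the cell contains no text node
        texts.append(empty if first is None else str(first))
    return texts


if __name__ == "__main__":
    print(cell_texts(HTML, empty="-"))
```

In Beautiful Soup 4, `td.get_text()` is another option: it returns an empty string for an empty cell, so no `None` check is needed.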