July 16, 2010 at 12:46 pm | PROGRAMMING, PYTHON | No comment
Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree.
Here’s some code demonstrating how to extract data from HTML tables using Beautiful Soup.
Include Beautiful Soup in your application with a line like the following:
from BeautifulSoup import BeautifulSoup # For processing HTML
Getting started with BeautifulSoup:
#!/usr/bin/python
from BeautifulSoup import BeautifulSoup

html = ['<html><body><table><tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr><tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
print soup.prettify()
The prettify method adds strategic newlines and spacing to make the structure of the document obvious. It also strips out text nodes that contain only whitespace.
The output will be:
<html>
 <body>
  <table>
   <tr>
    <td>
     row 1, cell 1
    </td>
    <td>
     row 1, cell 2
    </td>
   </tr>
   <tr>
    <td>
     row 2, cell 1
    </td>
    <td>
     row 2, cell 2
    </td>
   </tr>
  </table>
 </body>
</html>
Here is how to navigate the soup:
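table = soup.find('table')
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        # findAll(text=True) collects every text node in the cell, so empty
        # cells and cells with nested tags are handled too
        text = ''.join(td.findAll(text=True))
        print text+"|",
    print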
and the output will be:
row 1, cell 1| row 1, cell 2|
row 2, cell 1| row 2, cell 2|
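If you would rather collect the cells than print them, the same walk can build a list of rows. This is only a minimal sketch, and it makes an assumption this post does not: it uses Python 3 with the newer bs4 package (Beautiful Soup 4) rather than the BeautifulSoup 3 module shown here. The helper name table_to_rows is made up for illustration.

```python
# Assumes Python 3 with the third-party bs4 package (Beautiful Soup 4),
# not the BeautifulSoup 3 module used elsewhere in this post.
from bs4 import BeautifulSoup

HTML = (
    "<html><body><table>"
    "<tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr>"
    "<tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr>"
    "</table></body></html>"
)


def table_to_rows(table):
    """Return the table's cell text as a list of rows (hypothetical helper)."""
    rows = []
    for tr in table.find_all("tr"):
        # get_text() joins every text node in the cell, including nested tags
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append(cells)
    return rows


soup = BeautifulSoup(HTML, "html.parser")
rows = table_to_rows(soup.find("table"))
for row in rows:
    print("| ".join(row))
```

A list of lists like this is easy to hand to the csv module or to any other code that expects tabular data.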
You can search the soup for tags with certain properties:
Get the table with id ‘mytable’:
table = soup.find('table', id="mytable")
Or the table aligned to the center:
table = soup.find('table', align="center")
Or the table whose border is 3:
table = soup.find('table', border="3")
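To see the three searches side by side, here is a small sketch that runs them against a made-up document containing one table of each kind. As an assumption on my part, it is written for Python 3 and the newer bs4 package (Beautiful Soup 4), not the BeautifulSoup 3 module used in this post. The sample markup and variable names are invented for illustration.

```python
# Assumes Python 3 with the third-party bs4 package (Beautiful Soup 4).
from bs4 import BeautifulSoup

# Made-up sample markup: one table per attribute being searched for
HTML = (
    '<table id="mytable"><tr><td>by id</td></tr></table>'
    '<table align="center"><tr><td>centered</td></tr></table>'
    '<table border="3"><tr><td>bordered</td></tr></table>'
)

soup = BeautifulSoup(HTML, "html.parser")

by_id = soup.find("table", id="mytable")
centered = soup.find("table", align="center")
bordered = soup.find("table", border="3")
# The same filter written as an explicit attribute dictionary
bordered_again = soup.find("table", attrs={"border": "3"})
# find() returns None when nothing matches, so check before using the result
missing = soup.find("table", id="no-such-table")

for table in (by_id, centered, bordered):
    print(table.td.get_text())
print(missing)
```

Because find() hands back None for a miss, it is worth testing the result before calling methods on it when you are scraping pages you do not control.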
If you want to fetch data from the web, use the urllib module:
import urllib

f = urllib.urlopen("http://somesite/table.html")
html = f.read()
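The urllib.urlopen call above is Python 2 only. On Python 3 the same function lives in urllib.request, so a rough equivalent might look like the sketch below. The helper name fetch_html, the 10-second timeout, and the UTF-8 fallback are my own assumptions, not something from the original code.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return it as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response headers, else assume UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call, using the placeholder URL from this post:
# html = fetch_html("http://somesite/table.html")
```

The returned string can be passed straight to the BeautifulSoup constructor, just like the html variable above.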
Installing BeautifulSoup in Debian/Ubuntu:
$ sudo aptitude install python-beautifulsoup
Read more on the Beautiful Soup documentation page.
That’s it! Have fun!