Parsing HTML tables in Python with BeautifulSoup

Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree.

Here’s some code demonstrating how to extract data from HTML tables using Beautiful Soup.

Include Beautiful Soup in your application with the following import:

from BeautifulSoup import BeautifulSoup          # For processing HTML

Getting started with BeautifulSoup:

from BeautifulSoup import BeautifulSoup
html = ['<html><body><table><tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr><tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
print soup.prettify()

The prettify method adds strategic newlines and spacing to make the structure of the document obvious. It also strips out text nodes that contain only whitespace.

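The output will be:

<html>
 <body>
  <table>
   <tr>
    <td>
     row 1, cell 1
    </td>
    <td>
     row 1, cell 2
    </td>
   </tr>
   <tr>
    <td>
     row 2, cell 1
    </td>
    <td>
     row 2, cell 2
    </td>
   </tr>
  </table>
 </body>
</html>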

Here is one way to navigate the soup:

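table = soup.find('table')
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        # find(text=True) returns the first text node inside the cell
        text = ''.join(td.find(text=True))
        print text + "|",
    print  # end the output line after each table row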

and the output will be:

row 1, cell 1| row 1, cell 2|
row 2, cell 1| row 2, cell 2|
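The same traversal can be sketched so that it returns the cells as a list of rows instead of printing them. This version is an assumption-laden sketch, not the code used in this post: it targets Python 3 and the bs4 package (Beautiful Soup 4), where findAll is spelled find_all, and table_to_rows is a hypothetical helper name. The sample markup has one empty cell to show that get_text() yields an empty string there.

```python
# Assumes Python 3 and the bs4 package (Beautiful Soup 4), not the
# BeautifulSoup 3 module used elsewhere in this post.
from bs4 import BeautifulSoup

html = (
    "<html><body><table>"
    "<tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr>"
    "<tr><td>row 2, cell 1</td><td></td></tr>"
    "</table></body></html>"
)


def table_to_rows(table):
    """Hypothetical helper: return the table as a list of rows of cell strings."""
    rows = []
    for tr in table.find_all("tr"):
        # get_text() returns "" for an empty cell instead of None
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows


soup = BeautifulSoup(html, "html.parser")
rows = table_to_rows(soup.find("table"))
for row in rows:
    print("|".join(row))
```

Returning plain lists keeps the parsing step separate from whatever you do with the data next, such as writing it to a CSV file.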

You can search the soup for tags with certain properties:

Get the table with id 'mytable':

table = soup.find('table', id="mytable")

Get the table whose align attribute is "center":

table = soup.find('table', align="center")

Get the table whose border attribute is "3":

table = soup.find('table', border="3")
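To see these searches pick out different tables, here is a small self-contained sketch. It assumes Python 3 and the bs4 package rather than the BeautifulSoup 3 module used above, and the two-table markup is invented for the illustration. Attribute filters can be passed as keyword arguments or through the attrs dictionary, and find() returns None when nothing matches.

```python
# Assumes Python 3 and the bs4 package (Beautiful Soup 4).
from bs4 import BeautifulSoup

# Made-up markup with two tables that differ only in their attributes
html = (
    '<table id="mytable" border="3"><tr><td>a</td></tr></table>'
    '<table align="center"><tr><td>b</td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find("table", id="mytable")
by_align = soup.find("table", align="center")
by_border = soup.find("table", attrs={"border": "3"})  # dictionary form
missing = soup.find("table", id="nope")                # no match gives None

print(by_id.find("td").get_text())
print(by_align.find("td").get_text())
print(missing)
```

Checking the result against None before using it avoids an AttributeError when the page does not contain the table you expected.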

If you want to fetch the HTML from the web, use the urllib module.

import urllib
f = urllib.urlopen("http://somesite/table.html")
html = f.read()  # the page source as a string, ready to pass to BeautifulSoup
f.close()
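On Python 3 the same call lives in urllib.request. The sketch below is one possible way to wrap it, under stated assumptions: fetch_html is a hypothetical helper name, the 10 second timeout is arbitrary, and UTF-8 is used as a fallback when the response does not declare a charset. A data: URL stands in for a real address so the example runs without network access.

```python
# Assumes Python 3; urllib.urlopen from Python 2 moved to urllib.request.
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the response declares no charset (assumption)
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A data: URL keeps the example offline; use a real http(s) URL in practice.
sample_url = "data:text/html," + quote("<table><tr><td>cell</td></tr></table>")
html = fetch_html(sample_url)
print(html)
```

The returned string can be handed straight to the BeautifulSoup constructor.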

Installing BeautifulSoup in Debian/Ubuntu:

$ sudo aptitude install python-beautifulsoup

Read more on the Beautiful Soup documentation page.

That’s it! Have fun!


Maclaud October 11, 2010

Thanks a lot for this demonstration, it was very useful for me. I still need to know how to deal with empty td's, ex

which return text = ''.join(td.find(text=True))

thanks again

segfault October 11, 2010


Check if ‘text’ is None first. If ‘td’ is empty td.find() will return None.

text = td.find(text=True)
if text is not None: