
Parsing HTML tables in Python with BeautifulSoup

July 16th, 2010

Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree.

Here’s some code demonstrating how to extract data from HTML tables using Beautiful Soup.

Include Beautiful Soup in your application with a line like the following:

from BeautifulSoup import BeautifulSoup          # For processing HTML

Getting started with BeautifulSoup:

#!/usr/bin/python
 
from BeautifulSoup import BeautifulSoup
 
html = ['<html><body><table><tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr><tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr></table></html>']
 
soup = BeautifulSoup(''.join(html))
 
print soup.prettify()

The prettify method adds strategic newlines and spacing to make the structure of the document obvious. It also strips out text nodes that contain only whitespace.

The output will be:

<html>
 <body>
  <table>
   <tr>
    <td>
     row 1, cell 1
    </td>
    <td>
     row 1, cell 2
    </td>
   </tr>
   <tr>
    <td>
     row 2, cell 1
    </td>
    <td>
     row 2, cell 2
    </td>
   </tr>
  </table>
 </body>
</html>

Here is one way to navigate the soup:

table = soup.find('table')

rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text+"|",
    print

and the output will be:

row 1, cell 1| row 1, cell 2|
row 2, cell 1| row 2, cell 2|
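
The same row-by-row walk can also be sketched without Beautiful Soup, using the html.parser module from Python 3's standard library. This is a different technique from the one above, not Beautiful Soup's API: the class name TableExtractor and its attributes are made up for illustration, and unlike Beautiful Soup it assumes every <tr> and <td> is properly closed.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


html = ('<html><body><table>'
        '<tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr>'
        '<tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr>'
        '</table></html>')

parser = TableExtractor()
parser.feed(html)
parser.close()

for row in parser.rows:
    print("|".join(row))
```

Keeping the cells in parser.rows as a list of lists, rather than printing them straight away, makes it easy to pass the table on to other code.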

You can search the soup for tags with certain properties:

Get the table with id ‘mytable’:

table = soup.find('table', id="mytable")

Or a table aligned to the center:

table = soup.find('table', align="center")

Or a table whose border is 3:

table = soup.find('table', border="3")

If you want to fetch data from the web, use the urllib module.

import urllib
f = urllib.urlopen("http://somesite/table.html")
html = f.read()
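
The snippet above is written for Python 2. Under Python 3 the same call lives in urllib.request and returns bytes, so a rough equivalent might look like the sketch below. The helper name fetch_html and the UTF-8 default are assumptions, and a data: URL stands in for http://somesite/table.html so the example runs offline.

```python
import urllib.request


def fetch_html(url, encoding="utf-8"):
    """Download a page and return its body decoded to text.

    The encoding is an assumption; real pages may declare another one.
    """
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")


# A data: URL is used here instead of a real site so nothing is downloaded.
html = fetch_html("data:text/html,<table><tr><td>cell</td></tr></table>")
print(html)
```

For a real page, pass its http:// or https:// address to fetch_html instead.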

Installing BeautifulSoup in Debian/Ubuntu:

$ sudo aptitude install python-beautifulsoup

Read more on the Beautiful Soup documentation page.

That’s it! Have fun!

Categories: PYTHON
  1. Maclaud
    October 11th, 2010 at 16:07 | #1

    Thanks a lot for this demonstration, it was very useful for me. I still need to know how to deal with empty td’s, for which

    text = ''.join(td.find(text=True))
    raises a TypeError.

    thanks again
    Claudio

    • October 11th, 2010 at 16:33 | #2

      @Maclaud

      Check whether ‘text’ is None first. If ‘td’ is empty, td.find() will return None.

      text = td.find(text=True)
      if text is not None:
          ....
          ....
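
The check suggested in the reply above can be wrapped in a small helper so the printing loop stays short. This is only a sketch: cell_text is a made-up name, and plain strings and None stand in for what td.find(text=True) returns, so that it runs without Beautiful Soup installed.

```python
def cell_text(found, default=""):
    """Return the text found in a cell, or `default` for an empty cell.

    `found` is whatever td.find(text=True) returned: a string, or None
    when the <td> had no text in it.
    """
    if found is None:
        return default
    return found


# Stand-ins for the results of td.find(text=True) on three cells,
# the middle one being an empty <td></td>.
results = ["row 1, cell 1", None, "row 1, cell 3"]

line = "| ".join(cell_text(r) for r in results) + "|"
print(line)
```

Another option, which I believe behaves the same way in Beautiful Soup 3, is ''.join(td.findAll(text=True)): findAll returns a list, which is simply empty for an empty cell, so the join never sees None.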