Parsing HTML table in Python with BeautifulSoup

July 16th, 2010

Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree.

Here’s some code demonstrating how to extract data from HTML tables using Beautiful Soup.

Include Beautiful Soup in your application with the following line:

from BeautifulSoup import BeautifulSoup          # For processing HTML

Getting started with BeautifulSoup:

#!/usr/bin/python
 
from BeautifulSoup import BeautifulSoup
 
html = ['<html><body><table><tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr><tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr></table></html>']
 
soup = BeautifulSoup(''.join(html))
 
print soup.prettify()

The prettify method adds strategic newlines and spacing to make the structure of the document obvious. It also strips out text nodes that contain only whitespace.

The output will be:

<html>
 <body>
  <table>
   <tr>
    <td>
     row 1, cell 1
    </td>
    <td>
     row 1, cell 2
    </td>
   </tr>
   <tr>
    <td>
     row 2, cell 1
    </td>
    <td>
     row 2, cell 2
    </td>
   </tr>
  </table>
 </body>
</html>

Here is one way to navigate the soup:

table = soup.find('table')
 
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        # join every text node in the cell, not just the first one
        text = ''.join(td.findAll(text=True))
        print text+"|",
    print

The output will be:

row 1, cell 1| row 1, cell 2|
row 2, cell 1| row 2, cell 2|
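For readers on Python 3, the same loop can be sketched with the bs4 package, the successor to the BeautifulSoup 3 module used in this post. This is only a sketch and assumes bs4 is installed; table_to_rows is a hypothetical helper name, and find_all and get_text are the bs4 spellings of the calls above.

```python
from bs4 import BeautifulSoup

# Same sample markup as in the post, written as one string.
HTML = (
    "<html><body><table>"
    "<tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr>"
    "<tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr>"
    "</table></body></html>"
)


def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_all("tr"):
        # get_text collects all text inside the cell; strip trims the edges
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append(cells)
    return rows


# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(HTML, "html.parser")
for row in table_to_rows(soup.find("table")):
    print("|".join(row))
```

Returning a list of lists rather than printing inside the loop keeps the extracted cells available for later use, for example writing them out with the csv module.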

You can search the soup for tags with certain properties:

Get the table with id ‘mytable’:

table = soup.find('table', id="mytable")

Or the table aligned to the center:

table = soup.find('table', align="center")

Or the table whose border is 3:

table = soup.find('table', border="3")
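The sample table used earlier has none of these attributes, so each of the lookups above would return None on it. The sketch below shows the same searches against markup that does have them. It assumes Python 3 with the bs4 package installed; the three small tables are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table per attribute searched for in the post.
HTML = """
<table id="mytable"><tr><td>by id</td></tr></table>
<table align="center"><tr><td>centered</td></tr></table>
<table border="3"><tr><td>bordered</td></tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

by_id = soup.find("table", id="mytable")
centered = soup.find("table", align="center")
# attrs= is the dictionary form of the same keyword filter
bordered = soup.find("table", attrs={"border": "3"})
# find returns None when nothing matches, so check before using the result
missing = soup.find("table", id="no-such-table")

for table in (by_id, centered, bordered):
    print(table.td.get_text())
print(missing)
```

Attribute values are compared as strings, which is why the border is written as "3" rather than 3.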

If you want to fetch the HTML from the web, use the urllib module:

import urllib
f = urllib.urlopen("http://somesite/table.html")
html = f.read()
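On Python 3, urllib.urlopen moved to urllib.request.urlopen, and read() returns bytes rather than text. A minimal sketch under those assumptions follows; fetch_html is a hypothetical helper name, and the timeout and the UTF-8 fallback are choices made for this example rather than anything from the original code.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its body decoded to text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header when the server sends one;
        # falling back to UTF-8 is an assumption, not a guarantee.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage with the placeholder address from the post:
# html = fetch_html("http://somesite/table.html")
```

The returned string can be passed straight to the BeautifulSoup constructor in place of the hard-coded sample markup.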

Installing BeautifulSoup in Debian/Ubuntu:

$ sudo aptitude install python-beautifulsoup

Read more on the Beautiful Soup documentation page.

That’s it! Have fun!
