Archive for July, 2010

Check Your IMAP Quota Using Python Imaplib

July 16th, 2010

The imaplib module defines three classes, IMAP4, IMAP4_SSL and IMAP4_stream, which encapsulate a connection to an IMAP4 server and implement a large subset of the IMAP4rev1 client protocol as defined in RFC 2060.

Here’s a code sample that returns your Gmail quota details.

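import getpass, imaplib, re

# Matches each run of digits in the quota response
p = re.compile(r'\d+')

IMAP_SERVER = 'imap.gmail.com'
IMAP_PORT = 993
IMAP_USERNAME = 'username@gmail.com'

M = imaplib.IMAP4_SSL(IMAP_SERVER, IMAP_PORT)
M.login(IMAP_USERNAME, getpass.getpass())
# getquotaroot() returns (status, [quotaroot_data, quota_data]);
# the first quota_data item looks like: "" (STORAGE <used> <limit>), in KB
quotaStr = M.getquotaroot("INBOX")[1][1][0]
M.logout()

# imaplib returns bytes on Python 3, and None when no quota is reported
if quotaStr is None:
    quotaStr = ''
elif isinstance(quotaStr, bytes):
    quotaStr = quotaStr.decode()

r = p.findall(quotaStr)
if not r:
    print("Unlimited Quota Account")
else:
    print('Allotted = %f MB' % (float(r[1]) / 1024))
    print('Used = %f MB' % (float(r[0]) / 1024))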

The script will prompt for your Gmail password, and the output will look something like this:

Allotted = 7476.760000 MB
Used = 961.390000 MB
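
The \d+ pattern above picks up every run of digits in the response, so a quota root name that contains digits would shift the results. A slightly stricter parse can be sketched as below, assuming Python 3 and a response shaped like the RFC 2087 example "" (STORAGE 10 512), where the numbers are in units of 1024 octets. The function name parse_storage_quota and the sample numbers are made up for illustration.

```python
import re

# Matches the STORAGE resource inside a QUOTA response, e.g.
#   "" (STORAGE 10 512)
# where the first number is the usage and the second is the limit.
STORAGE_RE = re.compile(r'STORAGE\s+(\d+)\s+(\d+)', re.IGNORECASE)


def parse_storage_quota(response):
    """Return (used_kb, limit_kb), or None if no STORAGE quota is reported.

    `response` may be bytes (as imaplib returns on Python 3) or str.
    """
    if isinstance(response, bytes):
        response = response.decode('ascii', 'replace')
    match = STORAGE_RE.search(response)
    if match is None:
        return None
    used_kb, limit_kb = (int(n) for n in match.groups())
    return used_kb, limit_kb


# Hypothetical response line, not real account data
quota = parse_storage_quota(b'"" (STORAGE 2048 10240)')
if quota is None:
    print('No storage quota reported')
else:
    used_kb, limit_kb = quota
    print('Used = %.2f MB of %.2f MB' % (used_kb / 1024, limit_kb / 1024))
```

Returning None instead of appending zeros keeps the "no quota" case distinct from a real quota of zero.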
Categories: PYTHON

Parsing HTML table in Python with BeautifulSoup

July 16th, 2010

Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree.

Here’s some code demonstrating how to extract data from HTML tables using Beautiful Soup.

Include Beautiful Soup in your application with a line like the following:

from BeautifulSoup import BeautifulSoup          # For processing HTML

Getting started with BeautifulSoup:

#!/usr/bin/python

from BeautifulSoup import BeautifulSoup

html = ['<html><body><table><tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr><tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr></table></body></html>']

soup = BeautifulSoup(''.join(html))

print(soup.prettify())

The prettify method adds strategic newlines and spacing to make the structure of the document obvious. It also strips out text nodes that contain only whitespace.

The output will be:

<html>
 <body>
  <table>
   <tr>
    <td>
     row 1, cell 1
    </td>
    <td>
     row 1, cell 2
    </td>
   </tr>
   <tr>
    <td>
     row 2, cell 1
    </td>
    <td>
     row 2, cell 2
    </td>
   </tr>
  </table>
 </body>
</html>

Here is one way to navigate the soup:

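table = soup.find('table')

# Walk every row, then every cell in that row
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    # find(text=True) returns the first text node inside the cell
    cells = [td.find(text=True) for td in cols]
    # Print one line per row, ending each cell with "|"
    print(' '.join(cell + '|' for cell in cells))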

and the output will be:

row 1, cell 1| row 1, cell 2|
row 2, cell 1| row 2, cell 2|
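
The same row-by-row walk can also be sketched without Beautiful Soup, using the standard library's html.parser module on Python 3. This is a different technique from the one above, not Beautiful Soup's API. The class name TableCellParser is made up for illustration, and unlike Beautiful Soup it makes no attempt to repair invalid markup.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <td>
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html = ('<html><body><table>'
        '<tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr>'
        '<tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr>'
        '</table></body></html>')

parser = TableCellParser()
parser.feed(html)
parser.close()
for row in parser.rows:
    print(' '.join(cell + '|' for cell in row))
```

Collecting the cells into parser.rows first, rather than printing as they arrive, leaves the data in a list of lists that can be reused for other output formats.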

You can search the soup for tags with certain properties:

Get the table with id ‘mytable’:

table = soup.find('table', id="mytable")

Or the table aligned to the center:

table = soup.find('table', align="center")

Or the table whose border is 3:

table = soup.find('table', border="3")

If you want to fetch data across the web, use the urllib module.

import urllib
f = urllib.urlopen("http://somesite/table.html")
html = f.read()
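
On Python 3, urlopen lives in urllib.request rather than urllib. Below is a minimal sketch of the same fetch, assuming Python 3. The helper name fetch_html and the 10-second timeout are illustrative choices, and the URL is the placeholder used above.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download `url` and return its body decoded to str."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, 'replace')


# "http://somesite/table.html" is the placeholder URL from the text above:
# html = fetch_html("http://somesite/table.html")
```

Decoding to str before handing the page to a parser avoids mixing bytes and text later on.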

Installing BeautifulSoup in Debian/Ubuntu:

$ sudo aptitude install python-beautifulsoup

Read more on the Beautiful Soup documentation page.

That’s it! Have fun!

Categories: PYTHON

PDF Manipulations And Conversions From Linux Command Prompt

July 14th, 2010


If PDF is electronic paper, then pdftk is an electronic staple-remover, hole-punch, binder, secret-decoder-ring, and X-Ray-glasses. Pdftk is a simple tool for doing everyday things with PDF documents. Pdftk allows you to manipulate PDF easily and freely. It does not require Acrobat, and it runs on Linux, Windows, Mac OS X, FreeBSD and Solaris.

In Debian/Ubuntu you can install pdftk via apt:

$ sudo aptitude install pdftk

Examples:
Merge Two or More PDFs into a New Document

$ pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf

or (Using Handles):

$ pdftk A=1.pdf B=2.pdf cat A B output 12.pdf

or (Using Wildcards):

$ pdftk *.pdf cat output combined.pdf

Split Select Pages from Multiple PDFs into a New Document

$ pdftk A=one.pdf B=two.pdf cat A1-7 B1-5 A8 output combined.pdf
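
When these merges need to be scripted, the command line can be assembled in Python and handed to subprocess. The sketch below assumes Python 3.7 or later (so the dict keeps its insertion order) and only builds the handle-based cat form shown above. The function name pdftk_cat_command is made up for illustration.

```python
import subprocess


def pdftk_cat_command(inputs, page_ranges, output):
    """Build the argument list for a `pdftk ... cat ... output ...` call.

    inputs      -- dict mapping a handle (e.g. 'A') to a PDF path
    page_ranges -- page specs such as 'A1-7' or 'B1-5'
    output      -- path of the PDF to write
    """
    command = ['pdftk']
    command += ['%s=%s' % (handle, path) for handle, path in inputs.items()]
    command.append('cat')
    command += list(page_ranges)
    command += ['output', output]
    return command


command = pdftk_cat_command(
    {'A': 'one.pdf', 'B': 'two.pdf'},
    ['A1-7', 'B1-5', 'A8'],
    'combined.pdf',
)
print(' '.join(command))

# To actually run it (requires pdftk and the input files to exist):
# subprocess.run(command, check=True)
```

Passing the arguments as a list, rather than one shell string, means file names containing spaces need no extra quoting.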

Rotate the First Page of a PDF to 90 Degrees Clockwise

$ pdftk in.pdf cat 1E 2-end output out.pdf

Rotate an Entire PDF Document’s Pages to 180 Degrees

$ pdftk in.pdf cat 1-endS output out.pdf

Encrypt a PDF using 128-Bit Strength (the Default) and Withhold All Permissions (the Default)

$ pdftk mydoc.pdf output mydoc.128.pdf owner_pw foopass

Same as Above, Except a Password is Required to Open the PDF

$ pdftk mydoc.pdf output mydoc.128.pdf owner_pw foo user_pw baz

Same as Above, Except Printing is Allowed (after the PDF is Open)

$ pdftk mydoc.pdf output mydoc.128.pdf owner_pw foo user_pw baz allow printing

Decrypt a PDF

$ pdftk secured.pdf input_pw foopass output unsecured.pdf

Join Two Files, One of Which is Encrypted (the Output is Not Encrypted)

$ pdftk A=secured.pdf mydoc.pdf input_pw A=foopass cat output combined.pdf

Uncompress PDF Page Streams for Editing the PDF Code in a Text Editor

$ pdftk mydoc.pdf output mydoc.clear.pdf uncompress

Repair a PDF’s Corrupted XREF Table and Stream Lengths (If Possible)

$ pdftk broken.pdf output fixed.pdf

Burst a Single PDF Document into Single Pages and Report its Data to doc_data.txt

$ pdftk mydoc.pdf burst

Report on PDF Document Metadata, Bookmarks and Page Labels

$ pdftk mydoc.pdf dump_data output report.txt

Poppler is a PDF rendering library based on the xpdf-3.0 code base. The poppler-utils package contains pdftops (PDF to PostScript converter), pdfinfo (PDF document information extractor), pdfimages (PDF image extractor), pdftohtml (PDF to HTML converter), pdftotext (PDF to text converter), and pdffonts (PDF font analyzer).

Debian/Ubuntu users can install poppler-utils via apt:

$ sudo aptitude install poppler-utils


Convert PDF to TEXT

Pdftotext converts Portable Document Format (PDF) files to plain text.

$ pdftotext example.pdf example.txt

If the text file is not specified, pdftotext converts file.pdf to file.txt. If the text file is ‘-’, the text is sent to stdout.

To convert pages 3 to 7 (inclusive), use:

$ pdftotext -f 3 -l 7 example.pdf example.txt

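To extract only the 3rd page:

$ pdftotext -f 3 -l 3 example.pdf example.txt

Use the -layout option to maintain (as best as possible) the original physical layout of the text. By default, pdftotext undoes the physical layout (columns, hyphenation, etc.) and outputs the text in reading order:

$ pdftotext -layout example.pdf example.txt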

Use the -nopgbrk option if you don’t want page breaks inserted between pages.

Use the -opw (owner password) or -upw (user password) option if the PDF file is password protected.
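
The pdftotext options above can be combined from a script. The sketch below assumes Python 3 and uses only the flags described in this section. The function names pdftotext_command and extract_text are made up for illustration, and decoding the output as UTF-8 is an assumption about the tool's default encoding.

```python
import subprocess


def pdftotext_command(pdf_path, text_path=None, first=None, last=None,
                      layout=False, page_breaks=True):
    """Build a pdftotext argument list from the options described above."""
    command = ['pdftotext']
    if first is not None:
        command += ['-f', str(first)]
    if last is not None:
        command += ['-l', str(last)]
    if layout:
        command.append('-layout')
    if not page_breaks:
        command.append('-nopgbrk')
    command.append(pdf_path)
    if text_path is not None:
        command.append(text_path)
    return command


def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text (requires poppler-utils)."""
    # '-' as the text file sends the text to stdout, which is captured here
    command = pdftotext_command(pdf_path, '-', **options)
    result = subprocess.run(command, check=True, stdout=subprocess.PIPE)
    # Assumption: the output is UTF-8; undecodable bytes are replaced
    return result.stdout.decode('utf-8', 'replace')


print(' '.join(pdftotext_command('example.pdf', 'example.txt', first=3, last=7)))
```

For example, extract_text('example.pdf', first=3, last=3, layout=True) would return the text of page 3 with its layout preserved, provided pdftotext is installed and the file exists.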


Extract Images From PDF

Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files.

Pdfimages reads the PDF file, scans one or more pages, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg).

Pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.

$ pdfimages example.pdf exampleimage

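The above command will extract all images from example.pdf. The images will be saved in PPM format (PBM for monochrome images).

Use the -j option to save DCT-encoded (JPEG) images as JPG files; all other images are still written as PPM or PBM: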

$ pdfimages -j example.pdf exampleimage

Use the -f and -l options to specify the startpage and lastpage to scan. To scan pages 3 to 7 (including 3 and 7) use:

$ pdfimages -f 3 -l 7 example.pdf exampleimage

To scan only one specific page use:

$ pdfimages -f 3 -l 3 example.pdf exampleimage

If the PDF file is password protected, use the -opw or -upw option:

-opw Owner password
-upw User password


Convert PDF to HTML

pdftohtml is a program that converts PDF documents into HTML. It generates its output in the current working directory.

Usage:

$ pdftohtml file.pdf file.html

If you want to see graphics, you’ll need to use the -c (as in “complex”) option:

$ pdftohtml -c file.pdf file.html


Convert PDF to Image

First, you need to have ImageMagick installed on your machine.
To install ImageMagick in Debian/Ubuntu, run the following command:

$ sudo aptitude install imagemagick

To convert a PDF file to an image, use the ‘convert’ command:

$ convert doc.pdf doc.jpeg

To convert to TIFF:

$ convert doc.pdf doc.tiff

for more information look here.

Categories: HOW-TOS
