Archive for August, 2010

Data Compression and Archiving Using Python

August 30th, 2010

bzip2 compression

bzip2 is a freely available, patent-free, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), whilst being around twice as fast at compression and six times faster at decompression.

Compressing a file:

import bz2
import fileinput
output = bz2.BZ2File('a.txt.bz2', 'wb')
for line in fileinput.input('a.txt'):
    # write each line into the compressed stream
    output.write(line)
output.close()

This will compress a.txt to a.txt.bz2.

Decompress file.

import bz2
input_file = bz2.BZ2File('a.txt.bz2', 'rb')
print input_file.read()
input_file.close()
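
The same round trip can be sketched for Python 3 (the listings in this post are Python 2). This is only a minimal sketch: the helper names compress_file and decompress_file are made up for illustration, and it assumes bz2.open, which is available from Python 3.3 onwards.

```python
import bz2
import os
import shutil
import tempfile


def compress_file(src_path, dst_path):
    """Compress src_path into dst_path using bzip2."""
    # Binary mode on both sides keeps the bytes untouched.
    with open(src_path, 'rb') as src, bz2.open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)


def decompress_file(src_path):
    """Return the decompressed contents of a .bz2 file as bytes."""
    with bz2.open(src_path, 'rb') as src:
        return src.read()


# Demo on a throwaway file in a temporary directory.
workdir = tempfile.mkdtemp()
plain = os.path.join(workdir, 'a.txt')
packed = os.path.join(workdir, 'a.txt.bz2')
with open(plain, 'wb') as f:
    f.write(b'hello bzip2\n' * 100)

compress_file(plain, packed)
restored = decompress_file(packed)
print(len(restored), 'bytes restored from', os.path.getsize(packed), 'compressed bytes')
shutil.rmtree(workdir)
```

Using with blocks means the files are closed even if something goes wrong halfway through.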

gzip compression

gzip (GNU zip) is a compression utility designed to be a replacement for compress. Its main advantages over compress are much better compression and freedom from patented algorithms.

Compress file using gzip

import gzip
import fileinput
output = gzip.open('a.txt.gz', 'wb')
for line in fileinput.input('a.txt'):
    # write each line into the compressed stream
    output.write(line)
output.close()

Decompress the file.

import gzip
input_file = gzip.open('a.txt.gz', 'rb')
print input_file.read()
input_file.close()
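
If you are on Python 3, gzip.open can also work in text mode, which is handy for line-oriented files. Here is a small sketch under that assumption. The helper names write_lines_gz and read_lines_gz are hypothetical, and the file lives in a temporary directory.

```python
import gzip
import os
import tempfile


def write_lines_gz(path, lines):
    """Write an iterable of strings to a gzip file, one per line."""
    with gzip.open(path, 'wt', encoding='utf-8') as out:
        for line in lines:
            out.write(line + '\n')


def read_lines_gz(path):
    """Read a gzip file back as a list of lines without the newlines."""
    with gzip.open(path, 'rt', encoding='utf-8') as src:
        return [line.rstrip('\n') for line in src]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, 'a.txt.gz')
    write_lines_gz(path, ['first line', 'second line'])
    lines_back = read_lines_gz(path)

print(lines_back)
```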

Tar archive access
List the contents of a tar file.

import tarfile
tar = tarfile.open("sample.tar", "r")
for tarinfo in tar:
    print tarinfo.name, "is", tarinfo.size, "bytes in size and is",
    if tarinfo.isreg():
        print "a regular file."
    elif tarinfo.isdir():
        print "a directory."
    else:
        print "something else."
tar.close()

Untar an archive file.

import tarfile
tar = tarfile.open("sample.tar")
tar.extractall()
tar.close()

Create an archive file

import tarfile
import os, fnmatch
tar = tarfile.open("sample.tar", "w")
files = os.listdir('.')
for file in files:
    if os.path.isdir(file):
        print file,' is a dir.'
        continue
    if fnmatch.fnmatch ( file, '*.txt' ):
        print file
        # add matching files to the archive
        tar.add(file)
tar.close()
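
To check that an archive really contains what you expect, you can create it and list it again in one go. The following is a Python 3 sketch, not the only way to do it. The helpers archive_matching and list_archive and the sample file names are assumptions made for the example.

```python
import fnmatch
import os
import tarfile
import tempfile


def archive_matching(directory, pattern, tar_path):
    """Add regular files in directory whose names match pattern to a new tar."""
    added = []
    with tarfile.open(tar_path, 'w') as tar:
        for name in sorted(os.listdir(directory)):
            full = os.path.join(directory, name)
            if os.path.isfile(full) and fnmatch.fnmatch(name, pattern):
                # arcname stores just the file name, not the full path
                tar.add(full, arcname=name)
                added.append(name)
    return added


def list_archive(tar_path):
    """Return (name, size) pairs for every member of the archive."""
    with tarfile.open(tar_path, 'r') as tar:
        return [(info.name, info.size) for info in tar]


with tempfile.TemporaryDirectory() as workdir:
    samples = [('notes.txt', b'first note\n'),
               ('todo.txt', b'buy milk\n'),
               ('image.png', b'\x89PNG')]
    for name, content in samples:
        with open(os.path.join(workdir, name), 'wb') as f:
            f.write(content)
    os.mkdir(os.path.join(workdir, 'subdir'))
    tar_path = os.path.join(workdir, 'sample.tar')
    added = archive_matching(workdir, '*.txt', tar_path)
    listing = list_archive(tar_path)

print(listing)
```

Only the two .txt files end up in the archive. The directory and the .png file are skipped.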

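Using tar with gzip and bz2
You can use tar together with gzip or bz2 by adding the compression type to the mode string. Use

tar = tarfile.open("sample.tar.gz", "r:gz")

or

tar = tarfile.open("sample.tar.bz2", "r:bz2")

to work with a gzip or bz2 compressed archive.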
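
The write modes follow the same pattern ("w:gz", "w:bz2"), and when reading, the mode "r:*" lets tarfile detect the compression by itself. Below is a Python 3 sketch that builds the archives in memory so nothing is written to disk. The helper names make_tar_bytes and read_tar_bytes are made up for the example.

```python
import io
import tarfile


def make_tar_bytes(files, mode):
    """Build an in-memory tar archive; mode is 'w', 'w:gz' or 'w:bz2'."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def read_tar_bytes(data):
    """Read an archive back; 'r:*' auto-detects gzip or bz2 compression."""
    with tarfile.open(fileobj=io.BytesIO(data), mode='r:*') as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}


files = {'a.txt': b'alpha\n' * 50}
sizes = {}
for mode in ('w', 'w:gz', 'w:bz2'):
    data = make_tar_bytes(files, mode)
    sizes[mode] = len(data)
    assert read_tar_bytes(data) == files

print(sizes)
```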

Categories: PYTHON

Playing With Python And Gmail – Part 2

August 19th, 2010

This is the second part of the article series ‘Playing With Python And Gmail’. If you haven’t read the first part, I recommend reading it first.

This time we will see how to fetch mails from Gmail using Python.

Reading Mails

The IMAP4.fetch method fetches (parts of) messages. Its message_parts argument should be a string of message part names enclosed in parentheses, e.g. “(UID BODY[TEXT])”. The returned data are tuples of message part envelope and data.

Here is a minimal example (without error checking) that opens a mailbox and retrieves and prints all messages:

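import imaplib
# Gmail serves IMAP over SSL on port 993, so IMAP4_SSL is needed
M = imaplib.IMAP4_SSL('imap.gmail.com', 993)
# the first argument is your full Gmail address
M.login('', 'pa$$word')
# select the default mailbox (INBOX) before searching
M.select()
typ, data = M.search(None, 'ALL')
for num in data[0].split():
    typ, data = M.fetch(num, '(RFC822)')
    print 'Message %s\n%s\n' % (num, data[0][1])
M.close()
M.logout()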
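
It may help to see why the message sits at data[0][1]. The sketch below is Python 3 and uses a hand-written sample shaped like a fetch result, not one captured from Gmail. Each message comes back as an (envelope, payload) tuple, followed by a bare closing parenthesis item. The helper name fetched_messages is hypothetical.

```python
def fetched_messages(data):
    """Return the raw message bytes found in a fetch() data list."""
    messages = []
    for item in data:
        # Each literal arrives as an (envelope, payload) tuple; the
        # closing b')' of the response is a plain bytes item and is skipped.
        if isinstance(item, tuple):
            envelope, payload = item
            messages.append(payload)
    return messages


# Hand-written sample shaped like a one-message RFC822 fetch result.
raw = b'Subject: hi\r\n\r\nHello from a sample.\r\n'
sample_data = [(b'1 (RFC822 {%d}' % len(raw), raw), b')']

print(fetched_messages(sample_data))
```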

The email package provides a standard parser that understands most email document structures, including MIME documents. You can pass the parser a string or a file object, and the parser will return to you the root Message instance of the object structure. For simple, non-MIME messages the payload of this root object will likely be a string containing the text of the message. For MIME messages, the root object will return True from its is_multipart() method, and the subparts can be accessed via the get_payload() and walk() methods.
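
Here is a small Python 3 sketch of that difference, using two sample messages built locally instead of fetched ones. The addresses and texts are made up. In Python 3 the bytes returned by fetch would go through email.message_from_bytes instead of message_from_string.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# A simple, non-MIME message: the payload is just a string.
simple_raw = 'From: a@example.com\nSubject: plain\n\nJust text.\n'
simple = email.message_from_string(simple_raw)

# A MIME message with two alternative bodies.
multi = MIMEMultipart('alternative')
multi['Subject'] = 'both'
multi.attach(MIMEText('plain body', 'plain'))
multi.attach(MIMEText('<p>html body</p>', 'html'))
parsed = email.message_from_string(multi.as_string())

print(simple.is_multipart(), repr(simple.get_payload()))
print(parsed.is_multipart(), [p.get_content_type() for p in parsed.walk()])
```

Note that walk() also yields the multipart container itself before its subparts.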

Extract Mail Headers
Here is a way to retrieve the From, To and Subject headers from an email message:

from email.parser import HeaderParser
resp, data = M.FETCH(1, '(RFC822)')
msg = HeaderParser().parsestr(data[0][1])
print msg['From']
print msg['To']
print msg['Subject']

The output will look something like this:

Gmail Team
My Name
Gmail is different. Here's what you need to know.
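
Subjects with non-ASCII characters arrive in an encoded form such as =?utf-8?q?...?=, so printing the header directly can look odd. The following Python 3 sketch decodes such headers with email.header. The sample message and the helper name header_text are made up for illustration.

```python
from email.header import decode_header, make_header
from email.parser import HeaderParser

# Hand-written sample; the Subject uses RFC 2047 encoding.
raw = ('From: Sender Name <sender@example.com>\r\n'
       'To: My Name <me@example.com>\r\n'
       'Subject: =?utf-8?q?Caf=C3=A9_menu?=\r\n'
       '\r\n'
       'The body is not parsed by HeaderParser.\r\n')


def header_text(msg, name):
    """Return a header as readable text, or '' if the header is missing."""
    value = msg[name]
    if value is None:
        return ''
    return str(make_header(decode_header(value)))


msg = HeaderParser().parsestr(raw)
for name in ('From', 'To', 'Subject'):
    print(name + ':', header_text(msg, name))
```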

Identifying the content type
The Content-Type header indicates the Internet media type of the message content, consisting of a type and a subtype. For example, text/plain is the default value of Content-Type.
Gmail uses alternative content, that is, the same message sent in both plain text and another format such as HTML (multipart/alternative with the same content in text/plain and text/html forms).

import email
resp, data = M.FETCH(1, '(RFC822)')
mail = email.message_from_string(data[0][1])
for part in mail.walk():
  print 'Content-Type:',part.get_content_type()
  print 'Main Content:',part.get_content_maintype()
  print 'Sub Content:',part.get_content_subtype()

The output will be:

Content-Type: multipart/alternative
Main Content: multipart
Sub Content: alternative
Content-Type: text/plain
Main Content: text
Sub Content: plain
Content-Type: text/html
Main Content: text
Sub Content: html

Extract Message Body.
Using the walk() method we can iterate through Message parts. The get_payload() method will return the current payload, which will be a list of Message objects when is_multipart() is True, or a string when is_multipart() is False.

import email
resp, data = M.FETCH(1, '(RFC822)')
mail = email.message_from_string(data[0][1])
for part in mail.walk():
  # multipart are just containers, so we skip them
  if part.get_content_maintype() == 'multipart':
    continue
  # we are interested only in the simple text messages
  if part.get_content_subtype() != 'plain':
    continue
  payload = part.get_payload()
  print payload
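
One thing to watch out for: without decode=True, get_payload() returns the text still in its transfer encoding (base64 or quoted-printable) when the mail contains non-ASCII characters. Here is a Python 3 sketch that decodes the first text/plain part using the charset declared in the message. The helper name plain_text_body and the sample message are assumptions for the example.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def plain_text_body(msg):
    """Return the first text/plain part as a str, or None if there is none."""
    for part in msg.walk():
        # multipart containers never have this content type, so they are skipped
        if part.get_content_type() != 'text/plain':
            continue
        raw = part.get_payload(decode=True)  # undo base64 / quoted-printable
        charset = part.get_content_charset() or 'utf-8'
        return raw.decode(charset, errors='replace')
    return None


# Locally built sample with a non-ASCII body in both forms.
sample = MIMEMultipart('alternative')
sample.attach(MIMEText('Gr\u00fc\u00dfe aus Python', 'plain', 'utf-8'))
sample.attach(MIMEText('<p>Gr\u00fc\u00dfe aus Python</p>', 'html', 'utf-8'))
parsed_mail = email.message_from_string(sample.as_string())

print(plain_text_body(parsed_mail))
```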

Extracting Attachments
The code below extracts attached images and saves them to disk.

import os
import re
name_pat = re.compile('name=\".*\"')
# counter for generated file names, kept outside the loop
counter = 1
for part in mail.walk():
  # only image parts are saved
  if part.get_content_maintype() != 'image':
    continue
  file_type = part.get_content_type().split('/')[1]
  if not file_type:
    file_type = 'jpg'
  filename = part.get_filename()
  if not filename:
    # fall back to the name="..." parameter of the Content-Type header
    names = name_pat.findall(part.get('Content-Type'))
    if names:
      filename = names[0][6:-1]
  if not filename:
    filename = 'img-%03d.%s' % (counter, file_type)
    counter += 1
  payload = part.get_payload(decode=True)
  if not os.path.isfile(filename) :
      # finally write the stuff
      fp = open(filename, 'wb')
      fp.write(payload)
      fp.close()
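
Since the file name comes from the sender, it is safer not to trust it blindly. The Python 3 sketch below strips any directory part from the name and writes into a directory you choose. The helper name save_images, the fake image bytes and the sample message are all made up for the example.

```python
import email
import os
import tempfile
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def save_images(msg, directory):
    """Save every image part of msg into directory; return the names used."""
    saved = []
    count = 0
    for part in msg.walk():
        if part.get_content_maintype() != 'image':
            continue
        count += 1
        # basename() drops directory components a sender may have put in the name
        name = os.path.basename(part.get_filename() or '')
        if not name:
            name = 'img-%03d.%s' % (count, part.get_content_subtype())
        with open(os.path.join(directory, name), 'wb') as fp:
            fp.write(part.get_payload(decode=True))
        saved.append(name)
    return saved


# Locally built sample: one named image, one without a file name.
fake_png = b'\x89PNG\r\n\x1a\n' + b'not a real image'
sample = MIMEMultipart()
sample.attach(MIMEText('see attached', 'plain'))
named = MIMEImage(fake_png, _subtype='png')
named.add_header('Content-Disposition', 'attachment', filename='../logo.png')
sample.attach(named)
sample.attach(MIMEImage(fake_png, _subtype='png'))
parsed_mail = email.message_from_bytes(sample.as_bytes())

with tempfile.TemporaryDirectory() as outdir:
    saved_names = save_images(parsed_mail, outdir)
    with open(os.path.join(outdir, 'logo.png'), 'rb') as fp:
        first_image = fp.read()

print(saved_names)
```

Unlike the listing above, this sketch overwrites an existing file of the same name. Add an os.path.isfile check if you want to keep the original behaviour.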

That’s it. In the next part I will explain searching and moving your mails using Python. Don’t forget to subscribe :-)

Categories: PYTHON