
In my previous line of work as an archivist, the question of what to do about email archives was an ongoing and deeply considered topic. Email is everywhere. Yes, even Gen Z and millennials use it, despite thousands of think pieces that would have you believe that the old ways are giving way to business meetings conducted on fixed-gear bicycles, over avocado toast and Instagram.

Email is still prevalent, and accessing those emails when they’re stored in Microsoft's PST (Personal Storage Table) format can be … tricky. There are some very good commercial solutions for getting information out of PSTs, but they require Windows operating environments and are closed source. Follow along with me here, dear reader, and I'll show you a tool that makes extracting email from PSTs a lot less tricky.

The Project

First, a few words about why we’ve been thinking about emails here at Caktus. The University of North Carolina's School of Information and Library Science (UNC-SILS), in partnership with the State Archives of North Carolina, brought us on board to help with their joint project called Review, Appraisal, and Triage of Mail, or RATOM.

Funded by the Andrew W. Mellon Foundation, the RATOM project aims to apply machine learning to the thorny problem of archival email processing. Archivists are tasked with cataloging the contents of email inboxes and digging out relevant records when citizens and journalists request them, but dealing with this data is never easy. To help with this, Caktus is building a web app for processing an archived email account. It will ingest messages from a PST, automatically classify them using natural language processing (and eventually machine learning), and give archivists an interface for discovering and processing the contents of the inbox. By the way, all of the code developed by both Caktus and the UNC-SILS team is available on GitHub, released under the MIT open source license.

To develop the app, we’ve been leaning on the Python tool libratom, developed by UNC-SILS as a parallel effort in the RATOM project. This post explores one of its most fundamental purposes: getting emails out of a PST file in the first place. Other members of our team will be blogging about some of the other issues we’ve run across and how we’ve solved them, so stay tuned!

Okay, now let's extract some emails.

Get Thee a PST

There are only a few good sources of publicly available PSTs for us to use, but the ones that are available are quite good. One of the best sources is the Enron Dataset. For this example, I'll use the account of Bill Rapp, which is one of the smaller accounts. You can also do this with your own PST files if you happen to have them.

Extract PST messages to .EML

The first order of business is to install the library using pip.

pip install libratom

Let's assume that the PST file is in the same directory as the script. The goal is a folder of plain text email messages (.eml files) that can be opened with email programs like Thunderbird, Outlook, or Apple Mail.

I generally use the subject line to name the file. To make sure that I don't clobber messages that happen to have the same subject, I prepend each message's identifier to the filename, like this:

from pathlib import Path

from libratom.lib.pff import PffArchive

archive = PffArchive("bill_rapp_000_1_1.pst")
eml_out = Path.cwd() / "emls"

# Create the output directory if it doesn't already exist.
eml_out.mkdir(exist_ok=True)

print("Writing messages to .eml")
for folder in archive.folders():
    if folder.get_number_of_sub_messages() != 0:
        for message in folder.sub_messages:
            # A message may have no subject, so fall back to a placeholder.
            name = (message.subject or "no_subject").replace(" ", "_")
            # A slash would otherwise be read as a path separator.
            name = name.replace("/", "-")
            # Prefix the message identifier so identical subjects don't collide.
            filename = eml_out / f"{message.identifier}_{name}.eml"
            filename.write_text(archive.format_message(message), encoding="utf-8")
print("Done!")
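
The two replace calls above handle spaces and slashes, but real-world subject lines can contain other characters that some filesystems reject (colons and question marks on Windows, for example), and they can be very long. Here is a minimal sketch of a more defensive naming helper. The function name safe_eml_name, the placeholder "no_subject", and the 100-character cap are my own assumptions for illustration; none of this is part of libratom.

```python
import re
from typing import Optional

# Characters that are commonly rejected in filenames, plus control characters.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_eml_name(identifier: int, subject: Optional[str], max_len: int = 100) -> str:
    """Build an .eml filename from a message identifier and subject (hypothetical helper)."""
    # Treat a missing subject as empty, and trim surrounding whitespace.
    cleaned = (subject or "").strip()
    # Collapse each run of whitespace into a single underscore.
    cleaned = re.sub(r"\s+", "_", cleaned)
    # Replace every remaining unsafe character with a hyphen.
    cleaned = _UNSAFE.sub("-", cleaned)
    # Cap the subject length, and use a placeholder if nothing is left.
    cleaned = cleaned[:max_len] or "no_subject"
    return f"{identifier}_{cleaned}.eml"


print(safe_eml_name(2097188, "Re: Q3/Q4 numbers?"))  # prints 2097188_Re-_Q3-Q4_numbers-.eml
```

In the extraction loop, the two lines that build name could then be swapped for a single call such as safe_eml_name(message.identifier, message.subject).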

You should now have a directory full of .eml messages that can be opened by an email reader. If you are a Python developer and you need to extract emails from a PST, libratom is an excellent tool to use. Easy, right?
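
Before opening the results in a mail client, a quick sanity check from Python can be useful. The following is a minimal sketch that uses only the standard library's email package, not libratom. The function name summarize_emls is hypothetical, and the sketch assumes each file begins with standard message headers, which may not hold for every message in a PST; a file without a Subject header is reported with an empty subject.

```python
from email import policy
from email.parser import BytesParser
from pathlib import Path
from typing import Dict


def summarize_emls(directory: Path) -> Dict[str, str]:
    """Map each .eml filename in `directory` to its parsed Subject header."""
    summary = {}
    for path in sorted(directory.glob("*.eml")):
        with path.open("rb") as handle:
            # headersonly=True stops after the headers instead of parsing the body structure.
            message = BytesParser(policy=policy.default).parse(handle, headersonly=True)
        # A missing Subject header comes back as None, so fall back to an empty string.
        summary[path.name] = str(message["subject"] or "")
    return summary


if __name__ == "__main__":
    # Assumes the "emls" directory written by the extraction script above.
    for name, subject in summarize_emls(Path.cwd() / "emls").items():
        print(f"{name}: {subject}")
```

If the subjects printed here line up with the filenames, the messages were written in a form that standard email tooling can read.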

If you have any questions or feedback, please leave us a comment below.
