Scraping Data and Web Standards

We're currently involved in a project with the UNC School of Journalism that hopes to help rural newspapers in North Carolina leverage OpenBlock.  The project is called OpenRural, and if you're a software developer you can find the latest code on GitHub.

OpenBlock needs geographic data to display, and that data can come from a variety of sources.  We've found a number of web sites that offer geographically interesting data to NC residents, and in this post I'd like to discuss my experience attempting to scrape (that is, programmatically navigate and extract data from) the Chapel Hill Police Department's (CHPD's) online database of crime reports.

The CHPD site advertises itself as powered by "Sungard Public Sector OSSI's P2C engine," and a quick Google for "P2C engine" shows that Chapel Hill is not the only city or county in North Carolina that happens to use this product.  Unfortunately, scraping the data on this site proved to be a non-trivial endeavor.

I opted to host and run my scraper script on ScraperWiki, which is a great tool for writing, testing, and running scraper scripts in a variety of scripting languages.  The site even exposes the scraped data through an API, so it could potentially serve as an abstraction layer between the scraped sites and OpenBlock (or any other consumer of the data).  The current state of the script can be found here:

https://scraperwiki.com/scrapers/chapel_hill_police_reports/

The script uses the Python mechanize library to navigate the site being scraped, and BeautifulSoup to find and extract data on the pages retrieved.  After telling mechanize to click the "I Agree" button on the CHPD web site's landing page, it was easy enough to submit the search form for the current day and return a listing of results.
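
Here's a condensed sketch of that first step.  The landing-page URL, form indexes, and control names below are illustrative guesses rather than the exact values from the CHPD site, but the shape of the mechanize/BeautifulSoup interaction is the same:

import mechanize
from BeautifulSoup import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)

# Load the disclaimer page and "click" I Agree, which is really just
# submitting the only form on the page.
br.open('http://example.com/p2c/main.aspx')  # placeholder, not the real URL
br.select_form(nr=0)
br.submit()

# Fill in the date range on the search form and submit it.
br.select_form(nr=0)
br.form['txtDateFrom'] = '06/01/2012'  # assumed control names
br.form['txtDateTo'] = '06/01/2012'
response = br.submit()

# Pull the rows out of the results table with BeautifulSoup.
soup = BeautifulSoup(response.read())
for row in soup.findAll('tr'):
    cells = [''.join(td.findAll(text=True)).strip() for td in row.findAll('td')]
    if cells:
        print cells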

While getting the initial list of results was fairly trivial, one issue I ran into when writing the scraper is that the site uses an odd method of retrieving and paginating results.  Looking at the HTML source, you will see that the search form is submitted by a small piece of JavaScript, like so:

function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}

It turns out this little method is used to do quite a lot.  There are calls to it to do everything from sorting, to pagination, to linking to other pages on the site.  It works by stuffing the event name and argument into two hidden form inputs on the page (__EVENTTARGET and __EVENTARGUMENT) and then calling submit() on the form.
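
Emulating that from a script means doing by hand what the JavaScript does: fill in the same two hidden inputs, then submit the form.  A rough mechanize version might look like the following (the event target name here is a stand-in for whatever the real pagination links pass, not the actual value on the CHPD site):

br.select_form(nr=0)
br.form.set_all_readonly(False)  # mechanize marks hidden inputs readonly
br.form['__EVENTTARGET'] = 'gvIncidents'  # stand-in control name
br.form['__EVENTARGUMENT'] = 'Page$2'     # e.g. request page 2 of results
response = br.submit()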

You may have also noticed that the form has method="post", rather than method="get", which means the web browser will send an HTTP POST (rather than an HTTP GET) every time you modify the form and click the Search button.  Per the HTTP/1.1 specification, POST requests should be used for requests that modify data on the server, whereas GET requests should be used to retrieve information at a given URL.  You can also tell that the site uses POST instead of GET by inspecting the URL in your browser; pages that use GET will typically have a portion of their URL that starts with a question mark and is followed by key/value pairs.  The link to the Google search above is an example of the GET method.  Searching a site is by definition a retrieval operation (and typically does not involve modifying data on the server), so well-written search forms should use the GET rather than the POST HTTP method.
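
For comparison, here's what a GET-based search looks like from Python; the parameters are encoded directly into the URL, which is what makes the results bookmarkable and shareable:

import urllib

params = urllib.urlencode({'q': 'P2C engine'})
print 'http://www.google.com/search?' + params
# prints: http://www.google.com/search?q=P2C+engine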

Confusing POST and GET is a fairly elementary problem, but it's one that we see far too often on the web.  If your browser has ever prompted you to re-submit a form after hitting the back button, warning that doing so may modify data on the server, the site you're using is probably not using the GET and POST HTTP methods properly.

In the case of the CHPD site, it was easy enough to set the values of the hidden form inputs and re-submit the form using POST (after finding this post on StackOverflow, at least).  For some reason, though, the site still returns the first page of results to mechanize, even though it paginates properly in a real web browser.  I'm still working on it, but in the meantime, check out the code and let me know if you have any ideas. :-)
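
If you want to poke at it yourself, one way to hunt for the discrepancy is to dump every control mechanize is about to send and compare that against the POST body a real browser sends (visible in your browser's network inspector).  ASP.NET pages in particular expect hidden fields like __VIEWSTATE and __EVENTVALIDATION to be echoed back exactly:

br.select_form(nr=0)
for control in br.form.controls:
    print control.name, repr(control.value)[:60]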
