December 16, 2011
by Colin Copeland
Categories:
Technical

OpenBlock Geocoder, Part 2: Text Parsing and Entity Extraction

This is the second post in our OpenRural series reviewing OpenBlock and its geocoder. OpenBlock Geocoder, Part 1: Data Model and Geocoding covers the internals of the OpenBlock geocoder and its geocoding capabilities. As this post builds upon topics covered there, you may wish to read Part 1 before proceeding. In this post we step back from the internals of the geocoder and explore how to use it along with other OpenBlock tools to parse unstructured text.

I'd also like to give a shout-out here to Paul Winkler, who was kind enough to answer questions and point me in the right direction on the topics below. Thanks, Paul!

The Problem

OpenBlock's original design is centered around providing news at a hyper-local level, that is, down to your own city block. This lets interested citizens see events ranging from police incidents to restaurant inspections to local news articles, all aggregated on a map of their block. OpenBlock provides scraping tools to assist in downloading this data from the web, but the obvious problem is that most data isn't packaged or tagged with geographic information. Let's look at an example article teaser from The Daily Tar Heel in Chapel Hill, NC:

No. 4 North Carolina led Evansville 63-27 with just more than 14 minutes to go in the first half when senior forward Tyler Zeller scored his 999th career point at the Smith Center on Tuesday night.

The article mentions the game at the Smith Center, which is the location we want to extract and plot on a map. This is where OpenBlock's utilities for ingesting unstructured text help.

Places

Places are simple models containing only a name and geographic point. OpenBlock implements a mechanism to find places defined in the database from a body of text. For example, say we have the following string we'd like to parse:

>>> message = 'A good movie is playing at the Varsity Theater in Chapel Hill tonight.'

OpenBlock can extract "Varsity Theater" if we define it as a Place. You can create and import places in the OpenBlock admin, but to keep things simple, we'll just create one here:

Here we created a new Point of Interest place (a place type that is loaded by default on any OpenBlock install) geocoded to 123 East Franklin Street. Now we need a way to parse places from strings. Most of this functionality is found in ebdata, which contains a Natural Language Processing package, nlp. We can use its place_grabber to extract matching places:

We can feed this right back into the Place model to retrieve the database objects and their geographic locations:

The parser is case sensitive, however, so it will fail unless the text is an exact match:

>>> grabber("VARSITY THEATER")
[]

Obviously this is a brute-force method, and it requires you to pre-load all places of interest into the database beforehand. It's pretty rudimentary, but OpenBlock does provide this functionality out of the box.
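
To make the idea concrete without an OpenBlock install, here is a minimal pure-Python sketch of a brute-force grabber. It is not OpenBlock's implementation: make_place_grabber is a hypothetical helper, and the list of names stands in for the places stored in the database. It only assumes the (start, end, name) tuple format shown elsewhere in this post.

```python
import re

def make_place_grabber(place_names):
    """Return a function that finds known place names in a string.

    Hypothetical stand-in for a database-backed grabber. Matching is
    case sensitive, and results are (start, end, name) tuples.
    """
    # Try longer names first so a longer place name wins over its prefix.
    ordered = sorted(place_names, key=len, reverse=True)
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(n) for n in ordered))

    def grabber(text):
        return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

    return grabber

message = 'A good movie is playing at the Varsity Theater in Chapel Hill tonight.'
grab = make_place_grabber(['Varsity Theater'])
print(grab(message))            # [(31, 46, 'Varsity Theater')]
print(grab('VARSITY THEATER'))  # [] because the case does not match
```

Every name has to be known up front, which is exactly the pre-loading requirement described above.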

Locations

OpenBlock can also extract locations defined in the database. We already have cities loaded, so we'll use them in this example. Just like the place grabber, the location grabber is case sensitive, so we'll define a location synonym with the proper case:

>>> from ebpub.db.models import Location, LocationSynonym
>>> ch = Location.objects.get(name='CHAPEL HILL')
>>> LocationSynonym(pretty_name='Chapel Hill', location=ch).save()

By default, the location grabber ignores types of "city" and "borough". To keep things simple, we'll just create one that includes all location types:

>>> grabber = places.location_grabber(ignore_location_types=[])

Now we can use the grabber to extract locations:

>>> grabber(message)
[(50, 61, 'Chapel Hill')]

The OpenBlock grabbers cache the locations/places on instantiation, so if you plan to parse a lot of text in succession, you won't hit the database after the initial run. Cool!
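
The caching behavior can be illustrated with a small sketch. This is an assumption-laden toy, not OpenBlock code: CachingGrabber and fake_query are hypothetical names, and fake_query stands in for whatever database query a real grabber would run when it is created.

```python
class CachingGrabber:
    """Load names once at construction and reuse them on every call."""

    def __init__(self, load_names):
        # load_names is a hypothetical stand-in for the database query;
        # it is called exactly once, here.
        self.names = list(load_names())

    def __call__(self, text):
        found = []
        for name in self.names:
            start = text.find(name)
            while start != -1:
                found.append((start, start + len(name), name))
                start = text.find(name, start + 1)
        return sorted(found)

calls = []  # records each time the fake "database" is queried

def fake_query():
    calls.append('query')
    return ['Chapel Hill', 'Carrboro']

location_grab = CachingGrabber(fake_query)
first = location_grab('Chapel Hill borders Carrboro.')
second = location_grab('Tonight in Chapel Hill.')
print(first)       # [(0, 11, 'Chapel Hill'), (20, 28, 'Carrboro')]
print(second)      # [(11, 22, 'Chapel Hill')]
print(len(calls))  # 1, the names were loaded only at instantiation
```

The trade-off in a design like this is that names added after the grabber is created won't be seen until a new grabber is built.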

Addresses

ebdata.nlp can also parse addresses. For example, let's use a simple string:

>>> from ebdata.nlp.addresses import parse_addresses
>>> parse_addresses('The Varsity Theater is located at 123 N Franklin St')
[('123 N Franklin St', '')]

Under the hood, OpenBlock uses a large regular expression to do this, so it's not actually hitting the database or attempting to do any geocoding. You'll notice that it returns a list of 2-item tuples. The second item in each tuple is for the city:

>>> parse_addresses('The individual was seen on 123 N Franklin St in Chapel Hill')
[('123 N Franklin St', 'Chapel Hill')]

It can parse block locations too:

>>> parse_addresses('The construction is on the 100 block of Franklin St.')
[('100 block of Franklin St.', '')]

And intersections:

>>> parse_addresses('The incident occurred at the intersection of Franklin and Hillsborough')
[('Franklin and Hillsborough', '')]
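
To get a feel for regex-only address parsing, here is a deliberately tiny sketch. It is far simpler than OpenBlock's actual expression and is not derived from it: ADDRESS_RE and find_addresses are hypothetical names, the pattern covers only numbered and "block of" addresses with a handful of street suffixes, and it ignores cities and intersections entirely.

```python
import re

# A toy pattern, not OpenBlock's: number or "NNN block of", an optional
# one-letter direction, capitalized street words, then a short suffix list.
ADDRESS_RE = re.compile(r"""
    \b
    (?:\d+\s+block\s+of\s+|\d+\s+)   # "100 block of " or a house number
    (?:[NSEW]\.?\s+)?                # optional direction such as "N"
    (?:[A-Z][a-z]+\s+)+?             # one or more capitalized street words
    (?:St|Ave|Rd|Blvd|Dr|Ln)\b\.?    # a few common street suffixes
""", re.VERBOSE)

def find_addresses(text):
    """Return every substring that looks like a street address."""
    return [m.group(0) for m in ADDRESS_RE.finditer(text)]

print(find_addresses('The Varsity Theater is located at 123 N Franklin St'))
# ['123 N Franklin St']
print(find_addresses('The construction is on the 100 block of Franklin St.'))
# ['100 block of Franklin St.']
print(find_addresses('Send mail to 999 Nowhere Rd'))
# ['999 Nowhere Rd'] even though no such street was ever checked
```

Because nothing is looked up, anything shaped like an address matches, which is where false positives come from.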

It all comes together with the geocoder:

Conclusion

As you can see, OpenBlock provides a few useful utilities to parse unstructured text. They're fairly limited and, especially in the case of the address parser, will most likely return a lot of false positives. But I think OpenBlock has provided a great starting point. Stay tuned for more posts on the inner workings of the OpenBlock project!
