As Tobias mentioned in Scraping Data and Web Standards, Caktus is collaborating with the UNC School of Journalism to help develop Open Rural (the code is on GitHub). Open Rural hopes to help rural newspapers in North Carolina leverage OpenBlock. This blog post is the first of several covering the internals of OpenBlock and, specifically, the geocoder.
OpenBlock Data Model
The OpenBlock geocoder can only geocode from the data it has. It doesn't leverage a third-party API or service. It only uses what's loaded in PostgreSQL (with PostGIS and GeoDjango) and, in this example, what comes from the US Census Bureau and local city and county GIS offices.
Further, the imported data is typically filtered by a bounding box setting in METRO_LIST. The extent setting is a tuple of four values: leftmost longitude, lower latitude, rightmost longitude, and upper latitude. Together they define a bounding box, the range of latitudes and longitudes that are relevant to your area. A small or restrictive box will limit imported ZIP code and block data to areas that fall within the box.
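To make the extent ordering concrete, here is a minimal sketch of a point-in-box test in plain Python. The helper name `in_extent` and the two sample coordinates are assumptions for illustration only. The real importers presumably compare whole geometries rather than single points, so treat this as an approximation of the filtering idea, not OpenBlock's code.

```python
def in_extent(extent, lon, lat):
    """Return True if (lon, lat) falls inside the bounding box.

    extent is (leftmost longitude, lower latitude,
               rightmost longitude, upper latitude).
    """
    min_lon, min_lat, max_lon, max_lat = extent
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


# An extent in the order described above.
extent = (-79.165922, 35.829095, -78.978468, 36.02426)

# Approximate city-center coordinates, assumed for this example only.
chapel_hill = (-79.0558, 35.9132)
raleigh = (-78.6382, 35.7796)

print(in_extent(extent, *chapel_hill))
print(in_extent(extent, *raleigh))
```

Note that the test only works if the lower latitude really is the smaller of the two latitude values; a box with the latitudes swapped would match nothing.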
Let's look at an example with these shapefiles:
- North Carolina's 5-Digit ZIP Code Tabulation Area
- North Carolina's Place (Current)
- Orange County's All Lines
- Orange County's Topological Faces (Polygons With All Geocodes)
- Orange County's Feature Names Relationship File
- Orange County's City Boundaries
We'll start with a restrictive extent that only consists of downtown Chapel Hill:
METRO_LIST = (
    {
        # Extent of the region, as a longitude/latitude bounding box.
        'extent': (-79.066272, 35.910663, -79.040481, 35.91671),
        # ...
    },
)
This selection loaded 2 ZIP codes:
$ django-admin.py import_nc_zips
Importing zip codes...
# ...
Skipping 27511, out of bounds
Skipping 27513, out of bounds
Created ZIP Code 27514
Created ZIP Code 27516
Skipping 27517, out of bounds
Skipping 27519, out of bounds
# ...
Created 2 zipcodes.
And limited the block data as well:
$ django-admin.py import_county_streets 37135
Importing blocks, this may take several minutes ...
Created 73 blocks
Populating streets and fixing addresses, these can take several minutes...
Populating the streets table
streets: created: 28
block_intersections: created: 160
Done.
Restricting the area will limit the ability of the geocoder. In this case, for example, it can geocode the intersection of Franklin and Henderson, which is right downtown, but not Franklin and Estes (don't worry, we'll get into more geocoding details in the next section). A map helps illustrate this more clearly. Below you can see the bounding box with pins on the two intersections:
[Embedded map: OpenRural - Downtown Chapel Hill]
If we increase the bounding box, we'll get a lot more data:
METRO_LIST = (
    {
        # Extent of the region, as a longitude/latitude bounding box.
        'extent': (-79.165922, 35.829095, -78.978468, 36.02426),
        # ...
    },
)
With an extent that encompasses all of Chapel Hill, the importer loaded 9 ZIP codes, 4302 blocks, 1699 streets, and 7189 intersections. Here's a map illustrating the larger extent:
[Embedded map: OpenRural - Orange County, NC]
It's up to the maintainer of an OpenBlock install to determine which extent to use as it is based on the specifics of the application. A large extent will import more ZIP codes and blocks and, therefore, will slow down geospatial queries and may include unwanted geographic areas.
Street
Now that we have the Orange County, NC data loaded, let's investigate it with the OpenBlock models.
The Street model contains a catalog of all loaded streets. It's a simple model with only a few fields:
- street
- pretty_name
- street_slug
- suffix
- city
- state
In Orange County, NC, we can see that the street data spans 4 cities (plus some entries with no city set):
>>> from ebpub.streets.models import Street
>>> Street.objects.order_by('city').values_list('city', flat=True).distinct()
[u'', u'CARRBORO', u'CHAPEL HILL', u'DURHAM', u'HILLSBOROUGH']
Some streets cross city lines and therefore contain two entries:
>>> Street.objects.filter(street_slug='rosemary-st').values_list('city', flat=True)
[u'CARRBORO', u'CHAPEL HILL']
And, for example, if we're looking for Franklin St. in Chapel Hill, NC, we can filter for it here:
Blocks
Blocks are fundamental to OpenBlock and are used by the geocoder. OpenBlock defines a block as "a segment of a single street between one side street and another side street." The Block model is slightly more intricate than Street, but each entry basically represents the address range of a street for each block segment.
To start, we can see that Franklin St. is divided into 32 blocks:
>>> from ebpub.streets.models import Block
>>> Block.objects.filter(street_slug='franklin-st').count()
32
It's sectioned into an east and west segment:
>>> Block.objects.filter(street_slug='franklin-st').order_by('street_pretty_name').values_list('street_pretty_name', 'predir').distinct()
[(u'Franklin St.', u'W'), (u'Franklin St.', u'E')]
And can have an address between 100 and 1899:
>>> from django.db.models import Min, Max
>>> Block.objects.filter(street_slug='franklin-st').aggregate(Min('from_num'), Max('to_num'))
{'from_num__min': 100, 'to_num__max': 1899}
So we can find the block that contains the 123 address:
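A database-free sketch of that range lookup follows, assuming each block carries `from_num` and `to_num` bounds as in the aggregate above. The `BlockRow` tuple, the `find_blocks` helper, and the sample rows are made up for illustration; they are not rows from the Orange County import.

```python
from collections import namedtuple

# Hypothetical stand-in for the Block fields used in this example.
BlockRow = namedtuple("BlockRow", "street_slug predir from_num to_num")

# Illustrative rows only, not real imported data.
blocks = [
    BlockRow("franklin-st", "E", 100, 199),
    BlockRow("franklin-st", "E", 200, 299),
    BlockRow("franklin-st", "W", 100, 199),
]


def find_blocks(rows, street_slug, number, predir=None):
    """Return blocks on the street whose address range contains number."""
    return [
        row for row in rows
        if row.street_slug == street_slug
        and row.from_num <= number <= row.to_num
        and (predir is None or row.predir == predir)
    ]


# Without a direction, both the east and west 100 blocks match.
print(find_blocks(blocks, "franklin-st", 123))
# Adding the direction narrows the result to a single block.
matches = find_blocks(blocks, "franklin-st", 123, predir="E")
print(matches)
```

With the ORM, the same idea would likely be expressed as a filter combining `from_num__lte` and `to_num__gte`.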
Also, on a side note, it's possible for some blocks to span cities:
Geocoding
Now that we have a basic understanding of how the data is stored within OpenBlock, let's do some geocoding. Most of these examples will use the SmartGeocoder class. SmartGeocoder delegates to specific geocoders (AddressGeocoder, BlockGeocoder, and IntersectionGeocoder) based on how it interprets the string with regular expressions.
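The delegation idea can be sketched as follows. The regular expressions and the `classify` helper are simplified assumptions, not the patterns OpenBlock actually uses; they only show how the shape of a string can select a geocoder.

```python
import re

# Simplified, assumed patterns; OpenBlock's own expressions differ.
INTERSECTION_RE = re.compile(r"\s(?:and|&|at)\s", re.IGNORECASE)
BLOCK_RE = re.compile(r"^\d+\s+block\s+of\s+", re.IGNORECASE)
ADDRESS_RE = re.compile(r"^\d+\s+\S+")


def classify(location):
    """Guess which kind of geocoder a location string calls for."""
    text = location.strip()
    if INTERSECTION_RE.search(text):
        return "intersection"
    if BLOCK_RE.match(text):
        return "block"
    if ADDRESS_RE.match(text):
        return "address"
    return "unknown"


for sample in ("123 East Franklin Street",
               "100 block of Franklin St",
               "Franklin and Henderson"):
    print(sample, "->", classify(sample))
```

The order of the checks matters in this sketch: a block string also starts with a number, so it has to be tested before the more general address pattern.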
Addresses
To start, let's geocode "123 East Franklin Street":
This one was pretty easy for the geocoder to parse and find. You can see that not only has it found the associated block, but it also knows the exact geographic point. However, this will fail if passed a non-existent address number (InvalidBlockButValidStreet):
In this case, the geocoder was able to extract the address, but it failed to find the associated block in the database. Non-existent streets also fail (DoesNotExist):
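To make the two failure modes concrete, here is a small in-memory sketch. The exception classes are local stand-ins that only borrow the names mentioned above, and the `STREETS` table and both functions are hypothetical; none of this is OpenBlock's implementation.

```python
class DoesNotExist(Exception):
    """Stand-in: the street is not in the data at all."""


class InvalidBlockButValidStreet(Exception):
    """Stand-in: the street exists, but no block covers the number."""


# Hypothetical data: street slug -> list of (from_num, to_num) ranges.
STREETS = {
    "franklin-st": [(100, 199), (200, 299)],
}


def geocode_address(number, street_slug):
    """Return the address range that contains number on the street."""
    if street_slug not in STREETS:
        raise DoesNotExist(street_slug)
    for from_num, to_num in STREETS[street_slug]:
        if from_num <= number <= to_num:
            return (from_num, to_num)
    raise InvalidBlockButValidStreet("%s %s" % (number, street_slug))


def describe(number, street_slug):
    """Run the lookup and report which outcome occurred."""
    try:
        return "found block %s-%s" % geocode_address(number, street_slug)
    except InvalidBlockButValidStreet:
        return "valid street, no matching block"
    except DoesNotExist:
        return "unknown street"


print(describe(123, "franklin-st"))
print(describe(9999, "franklin-st"))
print(describe(123, "nonexistent-st"))
```

Keeping the two errors distinct lets a caller tell a mistyped house number apart from a street that was never imported, for example because it lies outside the extent.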
Intersections
The geocoder can locate intersections too:
Notice how the intersection field is populated, rather than block. This will raise a DoesNotExist exception when an intersection is not found:
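An in-memory sketch of an intersection lookup follows. The `INTERSECTIONS` set, the `geocode_intersection` helper, and the local `DoesNotExist` class are assumptions made for illustration. Storing each pair as a frozenset makes the lookup independent of the order in which the two streets are named.

```python
import re


class DoesNotExist(Exception):
    """Local stand-in for the error raised on a failed lookup."""


# Hypothetical data: each intersection is an unordered pair of streets.
INTERSECTIONS = {
    frozenset(["FRANKLIN", "HENDERSON"]),
}


def geocode_intersection(text):
    """Split 'A and B' (or 'A & B') and look the pair up."""
    parts = re.split(r"\s+(?:and|&)\s+", text.strip(), flags=re.IGNORECASE)
    if len(parts) != 2:
        raise ValueError("not an intersection: %r" % text)
    pair = frozenset(part.upper() for part in parts)
    if pair not in INTERSECTIONS:
        raise DoesNotExist(text)
    return pair


# The order of the two street names does not matter.
print(sorted(geocode_intersection("Henderson & Franklin")))
```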
Street Misspellings
OpenBlock provides a model, StreetMisspelling, to define street aliases. This allows you to map a bad street name to a good street name that exists in the database:
Now geocoding "Glen Haven" will find "Glenhaven".
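The aliasing step amounts to a lookup performed before the street search. A minimal sketch, assuming a plain dictionary in place of the StreetMisspelling table and a hypothetical `normalize` helper (the upper-casing and whitespace handling are also assumptions):

```python
# Hypothetical alias table: incorrect spelling -> correct spelling.
MISSPELLINGS = {
    "GLEN HAVEN": "GLENHAVEN",
}


def normalize(street_name):
    """Upper-case the name and swap in the known-good spelling, if any."""
    key = " ".join(street_name.upper().split())
    return MISSPELLINGS.get(key, key)


print(normalize("Glen Haven"))
print(normalize("Franklin"))
```

Names without an alias pass through unchanged, so the table only needs entries for the spellings that actually show up in your source data.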
Multiple Cities
By default, OpenBlock is configured to work with a single city, which is defined in METRO_LIST:
# Metros. You almost certainly only want one dictionary in this list.
# See the configuration docs for more info.
METRO_LIST = (
    {
        # Extent of the region, as a longitude/latitude bounding box.
        'extent': (-79.165922, 35.829095, -78.978468, 36.02426),
        # The major city in the region.
        'city_name': 'Chapel Hill',
    },
)
The geocoder will fail if it locates a street that's associated with a city unknown to OpenBlock. For example, 100 Pine Street is in Carrboro and not Chapel Hill:
This street exists in the database because our extent covers most of Orange County. Since we've set up OpenBlock to encompass an entire county, rather than a single city, we need to define additional cities. This can be accomplished in one of two ways:
- Add additional dictionaries to METRO_LIST for each city
- Enable multiple_cities and set city_location_type to a location type that represents cities
We imported Orange County city boundary data above, so we'll use the latter:
METRO_LIST = (
    {
        # Extent of the region, as a longitude/latitude bounding box.
        'extent': (-79.165922, 35.829095, -78.978468, 36.02426),
        # Set this to True if the region has multiple cities.
        # You will also need to set 'city_location_type'.
        'multiple_cities': True,
        # The major city in the region.
        'city_name': 'Chapel Hill',
        # Slug of an ebpub.db.LocationType that represents cities.
        # Only needed if multiple_cities = True.
        'city_location_type': 'cities',
    },
)
Here we enabled multiple_cities and told OpenBlock that the slug of the location type representing cities is cities. Now 100 Pine Street will geocode properly:
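The effect of the setting can be sketched without a database. In this illustration the `known_cities` helper and the `LOCATIONS` mapping are hypothetical stand-ins for locations grouped by type slug, and the check itself is an assumption about the behavior described above, not OpenBlock's code.

```python
# Hypothetical stand-in for locations grouped by location type slug.
LOCATIONS = {
    "cities": ["CARRBORO", "CHAPEL HILL", "HILLSBOROUGH"],
}


def known_cities(metro):
    """Return the set of city names the geocoder would accept."""
    if metro.get("multiple_cities"):
        return set(LOCATIONS[metro["city_location_type"]])
    return {metro["city_name"].upper()}


single = {"city_name": "Chapel Hill"}
multi = {
    "city_name": "Chapel Hill",
    "multiple_cities": True,
    "city_location_type": "cities",
}

# A Carrboro street is rejected in the single-city setup...
print("CARRBORO" in known_cities(single))
# ...and accepted once cities come from the location type.
print("CARRBORO" in known_cities(multi))
```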
What's Next
Now that we've had an overview of the geocoder, we'll jump into OpenBlock's place, location, and address parser. Stay tuned!
Update: Read more in OpenBlock Geocoder, Part 2: Text Parsing and Entity Extraction.