January 10 2012 by Colin Copeland
We're pretty avid testers here at Caktus, and when one of our Django projects required upgrading to Python 2.7, we also needed to upgrade our Jenkins build environment. Luckily, Jenkins supports distributed builds, which allow a master install to delegate tasks to slave instances. This way we can continue to run our primary build system on Ubuntu 10.04, which defaults to Python 2.6, and delegate tasks to an Ubuntu 11.04 environment running Python 2.7. The setup is fairly easy, but since I didn't find much out there already, I figured I'd write up a quick post outlining what we did.
To start, we'll need a new machine. I set up an Ubuntu 11.04 instance on Linode. Then SSH in, upgrade the packages, and install a Java Runtime Environment:
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install default-jre
That's the only package Jenkins needs by default. Next we'll set up a user for Jenkins to SSH as. To do this, we'll add a new user to the system and copy the master's SSH public key into that user's authorized keys file:
$ sudo useradd -m jenkins
$ sudo -u jenkins mkdir /home/jenkins/.ssh
$ sudo -u jenkins vim /home/jenkins/.ssh/authorized_keys2
Now the master Jenkins client can ssh to the slave without a password. Next we need to configure the Jenkins master to connect to the slave. Head over to the Master environment and navigate to "Manage Jenkins" and then "Manage Nodes". Click "New Node" in the sidebar and add a Dumb Slave. On the following page, fill in the following fields:
- # of executors: 2 (controls the number of concurrent builds)
- Remote FS root: /home/jenkins
- Labels: python27 natty
- Usage: Leave this machine for tied jobs only
- Launch method: Launch slave agents on Unix machines via SSH. Also fill in the Host field with the address of your slave machine.
Hit save and your Jenkins master should open a connection to your slave machine. To use the new slave machine, update an existing Jenkins job and set the "Restrict where this project can be run" Label Expression to "python27". You'll need to install any project dependencies on the slave for it to build properly, but that's basically it!
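Since the whole point of the new node is the interpreter version, it can help to fail fast if a job ever lands on the wrong machine. Here is a minimal sketch, assuming you run a small guard script as the first build step; the function name meets_minimum is made up for this illustration and is not part of Jenkins:

```python
import sys


def meets_minimum(version_info, minimum=(2, 7)):
    """Return True if the interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # Exit non-zero so Jenkins marks the build as failed on an old interpreter.
    if not meets_minimum(sys.version_info):
        sys.exit("Python 2.7+ required, found %d.%d" % sys.version_info[:2])
```

A job tied to the python27 label should never trip this check, so a failure here points at the label expression rather than at the project itself.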
December 29 2011 by Dan Poirier
Django class-based views
Introduction
Django 1.3 added class-based views, but neglected to provide
documentation to explain what they were or how to use them. So here's
a basic introduction.
Example of a very basic class-based view
Let's start with an example of a very basic class-based view.
urls.py:
...
url(r'^$', MyViewClass.as_view(), name='myview'),
...
views.py:
from django.views.generic.base import TemplateView

class MyViewClass(TemplateView):
    template_name = "index.html"

    def get(self, request, *args, **kwargs):
        context = {}  # compute what you want to pass to the template
        return self.render_to_response(context)
This will render your template index.html with the context
you computed and return it as the content of an HttpResponse.
Introduction to class-based views
Now that we've seen the obligatory example, how about some instructions?
To create a class-based view, start by creating a class that inherits
from django.views.generic.View or one of its subclasses.
In your URLconf, specify the view method as the name of the new
class, plus .as_view():
url(r'urlpattern', MyViewClass.as_view(), ...)
In your class, write a get method that takes as arguments self
(as always), request (the HttpRequest), and any other arguments
from the request as specified in your URLconf.
In your get method, use the same logic you'd have used in an old
view, except that you can assume the request method is GET. Return an
HttpResponse as usual.
If you need to handle POST, write a post method, just like your get
method except that you can assume the request method is POST.
Any request method that you don't write a handler method for will
automatically get back a "method not allowed" response; you don't have
to do anything special.
Example:
from django.views.generic import View
from django.shortcuts import render

class MyViewClass(View):
    # arg1 and keyword stand in for whatever your URLconf captures
    def get(self, request, arg1, keyword=None):
        return do_something()

    def post(self, request, arg1, keyword=None):
        return do_something_else()
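To make the routing by request method concrete, here is a toy sketch of the idea in plain Python. It is not Django's code: MiniView, its dispatch signature (which takes the method name directly instead of a request object), and the tuple responses are all simplifications invented for this example.

```python
class MiniView(object):
    """Toy stand-in for Django's View; not the real implementation."""

    http_method_names = ["get", "post", "put", "delete", "head", "options"]

    def dispatch(self, method, *args, **kwargs):
        # Look up a handler named after the lowercased HTTP method.
        name = method.lower()
        handler = None
        if name in self.http_method_names:
            handler = getattr(self, name, None)
        if handler is None:
            return self.http_method_not_allowed()
        return handler(*args, **kwargs)

    def http_method_not_allowed(self):
        return (405, "Method Not Allowed")


class Hello(MiniView):
    def get(self, name):
        return (200, "Hello, %s" % name)
```

Calling Hello().dispatch("GET", "world") reaches the get method, while Hello().dispatch("POST", "world") falls through to the "method not allowed" response because no post method was written.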
Handy subclasses of View
Django comes with a number of useful subclasses of View that provide
some of the function that often ends up as boilerplate in views, just
by inheriting from them. You saw TemplateView being used already.
You'll probably want to base your views on TemplateView almost
anytime you're generating the content for a response.
Another useful one is RedirectView. This can be used to redirect
all requests. Example:
from django.core.urlresolvers import reverse
from django.views.generic import RedirectView

class MyRedirectView(RedirectView):
    url = reverse(...)
That is a complete view, and will return a redirect to url on any
GET, POST, or HEAD request.
You can optionally set permanent = False to return a temporary
redirect instead of the default permanent redirect, and query_string
= True to include any query string from the incoming request on the
redirect URL:
from django.core.urlresolvers import reverse
from django.views.generic import RedirectView

class MyRedirectView(RedirectView):
    url = reverse(...)
    permanent = False
    query_string = True
Decorators
Unfortunately, using decorators with class-based views isn't quite as
simple as using them with the old method-based views.
Maybe you're used to doing this:
from django.contrib.auth.decorators import login_required
from django.shortcuts import render

@login_required
def myview(request):
    context = ...
    return render(request, 'index.html', context)
With class-based views, you have to decorate the .dispatch() method of
the class view, which means you have to override it just to decorate
it. And you need to decorate the decorator, because the decorators
provided by Django expect to be decorating method-based views, not
class-based ones:
from django.contrib.auth.decorators import login_required
from django.shortcuts import render
from django.utils.decorators import method_decorator
from django.views.generic.base import View

class MyViewClass(View):
    def get(self, request, **kwargs):
        context = ...
        return render(request, 'index.html', context)

    @method_decorator(login_required)
    def dispatch(self, *args, **kwargs):
        return super(MyViewClass, self).dispatch(*args, **kwargs)
This is an area of class-based views that could use some improvement.
You could apply the decorator in urls.py without needing so much
extra code:
urls.py:
from django.contrib.auth.decorators import login_required
...
url(r'^$', login_required(MyViewClass.as_view()), name='myview'),
...
but that moves the policy from the view code to the URLconf, which is
not where people will be expecting to have to look for it, so I
wouldn't recommend it.
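If "decorating the decorator" sounds odd, the underlying issue is just that a view decorator expects the request as its first argument, while a method receives self first. The sketch below shows the idea in plain Python. It is a simplification, not Django's method_decorator: require_user, adapt_for_method, and the dict used as a request are all hypothetical names invented for this example.

```python
import functools


def require_user(view_func):
    """Hypothetical view decorator: expects `request` as first argument."""
    @functools.wraps(view_func)
    def wrapper(request, *args, **kwargs):
        if not request.get("user"):
            return "redirect to login"
        return view_func(request, *args, **kwargs)
    return wrapper


def adapt_for_method(decorator):
    """Simplified adapter: hide `self` so the decorator sees (request, ...)."""
    def method_wrapper(method):
        @functools.wraps(method)
        def inner(self, *args, **kwargs):
            # Bind self first, then hand the plain callable to the decorator.
            def bound(*a, **kw):
                return method(self, *a, **kw)
            return decorator(bound)(*args, **kwargs)
        return inner
    return method_wrapper


class Page(object):
    @adapt_for_method(require_user)
    def dispatch(self, request):
        return "page for %s" % request["user"]
```

Without the adapter, require_user would treat the Page instance as the request and fail when it tried to inspect it.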
Passing arguments to the view
The method signature for get(), post(), etc. in a view class is:
def get(self, request, *args, **kwargs)
Any unnamed values captured in the URLconf regular expression are
passed in args, and any named values are passed in kwargs, just
like before.
You can pass extra arguments to your view using the third element
of your URLconf, the same as before, or using a new technique -- passing
them to the .as_view() call in your url settings. E.g.
...
url(r'^$', MyViewClass.as_view(extra_arg=3), name='myview'),
...
One warning - don't accidentally write MyViewClass(extra_arg=3).as_view().
That'll still appear to work, but that extra_arg is just thrown away.
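For the .as_view(extra_arg=3) form, the rough idea is that the keyword arguments are remembered and applied to a fresh instance for each request. A toy sketch of that mechanism follows; MiniView is invented for this example and leaves out everything else the real as_view() does.

```python
class MiniView(object):
    """Toy model of as_view(); the real Django code does more."""

    extra_arg = None

    def __init__(self, **kwargs):
        # Each keyword argument becomes an attribute on the instance.
        for key, value in kwargs.items():
            setattr(self, key, value)

    @classmethod
    def as_view(cls, **initkwargs):
        def view(request):
            # A fresh instance is built per request from initkwargs.
            self = cls(**initkwargs)
            return self.get(request)
        return view

    def get(self, request):
        return self.extra_arg


view = MiniView.as_view(extra_arg=3)
```

Here view("some request") returns 3, because the handler reads the value back from self.extra_arg.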
Where's the beef?
So far, all we've done is the same behavior, written using a different syntax. But
class-based views enable a whole new level of function.
Suppose you've got a view that displays some data on a web page, and you write it
as a class-based view. Maybe something like this:
from django.views.generic.base import TemplateView

class MyViewClass(TemplateView):
    template_name = 'index.html'

    def get(self, request, **kwargs):
        # Lots of complex logic in here to compute 'context'
        return self.render_to_response(context)
Now you're asked to provide an HTTP API that returns the same data in JSON.
Start by refactoring your existing class slightly, moving your business
logic out of the get() method:
from django.views.generic.base import TemplateView

class MyViewClass(TemplateView):
    template_name = 'index.html'

    def compute_context(self, request, **kwargs):
        # Lots of complex logic in here to compute 'context'
        return context

    def get(self, request, **kwargs):
        return self.render_to_response(self.compute_context(request, **kwargs))
Now, write a new class that subclasses your original class, uses the
same method to compute the data, but overrides get() with different
rendering code:
import json

from django.http import HttpResponse

class MyJsonViewClass(MyViewClass):
    def get(self, request, **kwargs):
        data = self.compute_context(request, **kwargs)
        # Very naive way to put your data into json, but a good starting place
        content = json.dumps(data)
        return HttpResponse(content, content_type='application/json')
Add a new URL to urls.py pointing to your new class-based view, and you're done. All
the logic you worked out earlier is still in use, and the power of subclassing let you
provide the data in a new format almost effortlessly.
Class-based views for common policy
The previous example was still something you could have done almost as
easily with method-based views, by refactoring your code into separate
methods and calling them from all your views.
A more powerful use of the new class-based views is to provide common
function for many views. If you have a site with many views, and they
all inherit from a common view, then you have the potential to change
behavior across the site by changing that one view.
Previously, you would probably have used middleware for this kind of thing.
The problem with middleware is that it's completely hidden from the view
code. When working on your view, you won't even know middleware is affecting
things unless you go look at the settings and track down each piece of
middleware configured there.
Furthermore, middleware affects every request, not just the views you
really wanted it for.
With a common class-based view, every view affected is declared to
inherit from that view, making it obvious that we're inheriting
behavior from elsewhere. With a good IDE, you can even jump straight
to that superclass to inspect it. Any view that doesn't need the
common behavior doesn't have to inherit it.
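As a rough illustration of that last point, here is a framework-free sketch. The class names, the dict used as a response, and the header being set are all made up for this example; in a real project the base class would inherit from one of Django's views.

```python
class BaseSiteView(object):
    """Hypothetical site-wide base view; names are illustrative only."""

    def dispatch(self, request):
        response = self.get(request)
        # Policy applied to every view that inherits from this class.
        response["headers"]["X-Frame-Options"] = "DENY"
        return response


class HomeView(BaseSiteView):
    def get(self, request):
        return {"body": "home", "headers": {}}


class HealthCheckView(object):
    """Does not inherit, so it opts out of the common behavior."""

    def dispatch(self, request):
        return {"body": "ok", "headers": {}}
```

Changing the policy in BaseSiteView changes it for HomeView and every other subclass, and the class statement of each view tells you whether it is affected.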
References
The only documentation page that really discussed class-based views
in Django 1.3 is this one:
https://docs.djangoproject.com/en/1.3/topics/class-based-views/
Some of the rationale for the current design of class-based views,
and pros and cons of some alternatives that were considered, are
documented here:
https://code.djangoproject.com/wiki/ClassBasedViews
Beyond that, the best advice I can give is to go read the code. The
code for the base View is surprisingly small, and can be found at
django/views/generic/base.py.
December 28 2011 by Colin Copeland
The OpenBlock geocoder is powerful and robust. It uses PostGIS for spatial queries, can extract addresses from bodies of text, and can understand block and intersection notation. We've run into a few issues with it, however, including a low geocoding success rate. This is a tough problem to solve and depends on a lot of factors (the extent of street and block data in OpenBlock, format of the street addresses, etc.), so your mileage may vary. Below I constructed a simple test using Google's Geocoding API to have as an alternative.
Disclaimer: This is the third post in our OpenRural series reviewing OpenBlock and its geocoder. You may wish to read Part 1: Data Model and Geocoding and Part 2: Text Parsing and Entity Extraction before proceeding.
Adding news with OpenBlock's geocoder
The Schema and NewsItem models provide OpenBlock with a generic data model to associate news with geographic locations. You can find a fairly extensive introduction in the official documentation, so we won't go into too much detail here.
Since a NewsItem requires a geographic point, let's use the OpenBlock geocoder to find 123 East Franklin Street:
>>> from ebpub.geocoder import SmartGeocoder
>>> geocoder = SmartGeocoder()
>>> location_name = '123 East Franklin Street'
>>> point = geocoder.geocode(location_name)['point']
>>> point.wkt
'POINT (-79.0553588124999891 35.9133110937499964)'
You'll notice that point has a wkt attribute. WKT, or well-known text, is a text markup language for representing geometry objects. Here we have a POINT, but the language can represent many geometries, including LineStrings and Polygons.
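If you ever need the raw coordinates outside of GeoDjango, a two-dimensional POINT is easy to pick apart by hand. This is a minimal sketch using only the standard library; parse_point_wkt is a name invented here, and it handles nothing beyond the simple 'POINT (x y)' form shown above. Note that x comes first, which for these points means longitude before latitude.

```python
import re

_POINT_RE = re.compile(r"^POINT\s*\(\s*(\S+)\s+(\S+)\s*\)$")


def parse_point_wkt(wkt):
    """Return (x, y) from a 2D 'POINT (x y)' string; x is longitude here."""
    match = _POINT_RE.match(wkt.strip())
    if match is None:
        raise ValueError("not a 2D WKT point: %r" % wkt)
    return float(match.group(1)), float(match.group(2))


lng, lat = parse_point_wkt("POINT (-79.0553588124999891 35.9133110937499964)")
```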
We'll use the "Local News" schema in this example as it is pre-loaded in OpenBlock:
>>> from ebpub.db import models as ebpub
>>> schema = ebpub.Schema.objects.get(name='Local News')
Using this schema, we'll add a new NewsItem with the point created above:
>>> import datetime
>>> news = schema.newsitem_set.create(
... title='Incident downtown',
... description='Something happened downtown today!',
... item_date=datetime.date.today(),
... location=point,
... location_name=location_name,
... )
>>> news.location.wkt
'POINT (-79.0553588124999891 35.9133110937499964)'
That was easy. Now we have a NewsItem that OpenBlock is aware of and can be plotted on a map. However, what do we do if we can't geocode the address?
Using an External Geocoder
If we already have a geographic point, then we can circumvent the geocoder entirely:
>>> from django.contrib.gis.geos import Point
>>> manual_point = Point(-79.0553588124999891, 35.9133110937499964)
>>> news = schema.newsitem_set.create(
... title='Incident downtown',
... description='Something happened downtown today!',
... item_date=datetime.date.today(),
... location=manual_point,
... location_name=location_name,
... )
>>> news.location.wkt
'POINT (-79.0553588124999891 35.9133110937499964)'
This means we can also use an external geocoder. For example, we can use Google's Geocoding API with geopy. First, you'll need a Google Maps API key, which we'll use with geopy:
>>> GOOGLE_MAPS_API_KEY = '' # your Google Maps API key
Then we can use geopy to construct a new geocoder:
>>> from geopy import geocoders
>>> g = geocoders.Google(GOOGLE_MAPS_API_KEY)
And we can geocode our address:
>>> address = '123 East Franklin Street, Chapel Hill, NC'
>>> place, (lat, lng) = g.geocode(address)
>>> point = Point(lng, lat)
>>> point.wkt
'POINT (-79.0549350000000004 35.9136495999999994)'
You can even tap into OpenBlock's internals and build a Geocoder that OpenBlock can use:
from django.conf import settings
from django.contrib.gis.geos import Point
from geopy import geocoders
from geopy.geocoders.google import GQueryError
from ebpub.geocoder import Geocoder, DoesNotExist

class GoogleGeocoder(Geocoder):
    def __init__(self, *args, **kwargs):
        kwargs['use_cache'] = False  # haven't implemented cache yet
        super(GoogleGeocoder, self).__init__(*args, **kwargs)
        self.geocoder = geocoders.Google(settings.GOOGLE_MAPS_API_KEY)

    def _do_geocode(self, location_string):
        try:
            place, (lat, lng) = self.geocoder.geocode(location_string)
        except (GQueryError, ValueError) as e:
            raise DoesNotExist(unicode(e))
        # Point takes x (longitude) first, then y (latitude)
        location = {'point': Point(lng, lat)}
        return location
This is a proof-of-concept geocoder we're using with OpenRural. You can find it on GitHub. Using this geocoder with a sample dataset from the North Carolina Secretary of State Corporation Filings, I was able to increase the geocoding success rate from about 37% to 95%. Again, your mileage will vary, but it can be useful to test out. We can't use Google's API for everything though. Normal users are limited to 2,500 requests per day. Business accounts are allotted 100,000 requests. Additionally, Google requires you to display any points geocoded with their API on a Google Map. So you'll need to evaluate your needs before deciding on using Google's API.
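One way to spend fewer requests against a daily limit is to try the local geocoder first and call the external service only for the misses. The sketch below is not how OpenRural does it; it is a plain-Python illustration in which GeocodeError, table_geocoder, and the dict-backed geocoders are all invented stand-ins for the real classes.

```python
class GeocodeError(Exception):
    """Raised by the toy geocoders below when a lookup fails."""


def table_geocoder(table):
    """Build a toy geocoder backed by a dict of known addresses."""
    def geocode(address):
        try:
            return table[address]
        except KeyError:
            raise GeocodeError(address)
    return geocode


def geocode_with_fallback(address, geocoders):
    """Try each geocoder callable in order; return the first result."""
    for geocode in geocoders:
        try:
            return geocode(address)
        except GeocodeError:
            continue
    raise GeocodeError("no geocoder could resolve %r" % address)


def success_rate(addresses, geocoders):
    """Fraction of addresses that at least one geocoder resolves."""
    hits = 0
    for address in addresses:
        try:
            geocode_with_fallback(address, geocoders)
            hits += 1
        except GeocodeError:
            pass
    return hits / float(len(addresses))


local = table_geocoder({"a": (1.0, 2.0)})
remote = table_geocoder({"a": (3.0, 4.0), "b": (5.0, 6.0)})
```

Running success_rate over a sample of your own addresses, once with only the local geocoder and once with both, gives the kind of before-and-after comparison quoted above.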
December 16 2011 by Colin Copeland
This is the second post in our OpenRural series reviewing OpenBlock and its geocoder. OpenBlock Geocoder, Part 1: Data Model and Geocoding covers the internals of the OpenBlock geocoder and its geocoding capabilities. As this post builds upon topics covered there, you may wish to read Part 1 before proceeding. In this post we step back from the internals of the geocoder and explore how to use it along with other OpenBlock tools to parse unstructured text.
I'd also like to give a shout out here to Paul Winkler who was kind enough to answer questions and point me in the right direction on the topics below. Thanks Paul!
The Problem
OpenBlock's original design is centered around providing news at a hyper-local level. That is, down to your own city block. This allows interested citizens to see events ranging from police incidents, to restaurant inspections, to local news articles all aggregated on a map of your block. OpenBlock provides scraping tools to assist downloading this data from the web, but the obvious problem here is that most data isn't packaged or tagged with geographic information. Let's look at an example article teaser from The Daily Tar Heel in Chapel Hill, NC:
No. 4 North Carolina led Evansville 63-27 with just more than 14 minutes to go in the first half when senior forward Tyler Zeller scored his 999th career point at the Smith Center on Tuesday night.
The article mentions the game at the Smith Center, which is the location we want to extract and plot on a map. This is where OpenBlock's utilities for ingesting unstructured text help.
Places
Places are simple models containing only a name and geographic point. OpenBlock implements a mechanism to find places defined in the database from a body of text. For example, say we have the following string we'd like to parse:
>>> message = 'A good movie is playing at the Varsity Theater in Chapel Hill tonight.'
OpenBlock can extract "Varsity Theater" if we define it as a Place. You can create and import places in the OpenBlock admin, but to keep things simple, we'll just create one here:
Here we created a new Point of Interest place (which is loaded by default on any OpenBlock install) geocoded to 123 East Franklin Street. Now we need a way to parse places from strings. Most of this functionality is found in ebdata, which contains a Natural Language Processing package, nlp. We can use its place_grabber to extract matching places:
We can feed this right back into the Place model to retrieve the database objects and their geographic locations:
The parser is case sensitive however, so it'll fail if it's not an exact match:
>>> grabber("VARSITY THEATER")
[]
Obviously this is a brute-force method and requires you to pre-load all places of interest into the database beforehand. It's pretty rudimentary, but does provide this functionality out-of-the-box.
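To show what "brute force" means here, the following is a stand-alone sketch of the same idea: build one pattern from every known name and scan the text for exact, case-sensitive matches. It is not OpenBlock's implementation; make_grabber is a made-up helper, and the (start, end, name) tuples simply mirror the shape the location grabber returns further down.

```python
import re


def make_grabber(names):
    """Return a function that finds exact, case-sensitive name matches."""
    # Longest names first so a long name wins over a shorter prefix of it.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))

    def grabber(text):
        return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

    return grabber


grabber = make_grabber(["Varsity Theater", "Chapel Hill"])
message = "A good movie is playing at the Varsity Theater in Chapel Hill tonight."
```

With these two names loaded, grabber(message) finds both of them, while grabber("VARSITY THEATER") returns an empty list because nothing matches exactly.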
Locations
OpenBlock can also extract locations defined in the database. We already have cities loaded, so we'll use them in this example. Just like the place grabber, the location grabber is case sensitive, so we'll define a location synonym with the proper case:
>>> from ebpub.db.models import Location, LocationSynonym
>>> ch = Location.objects.get(name='CHAPEL HILL')
>>> LocationSynonym(pretty_name='Chapel Hill', location=ch).save()
By default, the location grabber ignores types of "city" and "borough". To keep things simple, we'll just create one that includes all location types:
>>> grabber = places.location_grabber(ignore_location_types=[])
Now we can use the grabber to extract locations:
>>> grabber(message)
[(50, 61, 'Chapel Hill')]
If you plan to parse a lot of text in succession, the OpenBlock grabbers cache the locations/places on instantiation. So you won't hit the database after the initial run. Cool!
Addresses
ebdata.nlp can also parse addresses. For example, let's use a simple string:
>>> from ebdata.nlp.addresses import parse_addresses
>>> parse_addresses('The Varsity Theater is located at 123 N Franklin St')
[('123 N Franklin St', '')]
Under the hood, OpenBlock uses a large regular expression to do this, so it's not actually hitting the database or attempting to do geocoding. You'll notice that it returns a 2-item tuple. The second item is for the city:
>>> parse_addresses('The individual was seen on 123 N Franklin St in Chapel Hill')
[('123 N Franklin St', 'Chapel Hill')]
It can parse block locations too:
>>> parse_addresses('The construction is on the 100 block of Franklin St.')
[('100 block of Franklin St.', '')]
And intersections:
>>> parse_addresses('The incident occurred at the intersection of Franklin and Hillsborough')
[('Franklin and Hillsborough', '')]
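To get a feel for the regular expression approach, here is a deliberately tiny pattern that handles only the simplest numbered addresses from the examples above. It is a sketch for illustration, nowhere near OpenBlock's real expression; ADDRESS_RE and find_addresses are names invented here, and blocks and intersections are not covered.

```python
import re

# Number, optional one-letter direction, one-word street name, and a short
# list of suffixes, optionally followed by "in <City Name>".
ADDRESS_RE = re.compile(
    r"\b(\d+ (?:[NSEW] )?[A-Z][a-z]+ (?:St|Ave|Rd|Blvd|Dr))\b"
    r"(?: in ([A-Z][a-z]+(?: [A-Z][a-z]+)*))?"
)


def find_addresses(text):
    """Return (address, city) tuples; city is '' when none follows."""
    return [(m.group(1), m.group(2) or "") for m in ADDRESS_RE.finditer(text)]
```

Even this small pattern shows why false positives are hard to avoid: anything that looks like a number followed by capitalized words and a suffix will match, whether or not the street exists.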
It all comes together with the geocoder:
Conclusion
As you can see, OpenBlock provides a few useful utilities to parse unstructured text. They're fairly limited and, especially with the address parser, will most likely return a lot of false positives. But I think OpenBlock has provided a great starting point. Stay tuned for more posts on the inner workings of the OpenBlock project!
December 12 2011 by Colin Copeland
As Tobias mentioned in Scraping Data and Web Standards, Caktus is collaborating with the UNC School of Journalism to help develop Open Rural (the code is on GitHub). Open Rural hopes to help rural newspapers in North Carolina leverage OpenBlock. This blog post is the first of several covering the internals of OpenBlock and, specifically, the geocoder.
OpenBlock Data Model
The OpenBlock geocoder can only geocode from the data it has. It doesn't leverage a 3rd-party API or service. It only uses what's loaded in PostgreSQL (with PostGIS and GeoDjango) and, in this example, what comes from the US Census Bureau and local city and county GIS offices.
Further, the imported data is typically filtered by a bounding box setting in METRO_LIST. The setting, extent, is a list of leftmost longitude, lower latitude, rightmost longitude, upper latitude. This defines a bounding box - the range of latitudes and longitudes that are relevant to your area. A small or restrictive box will limit imported ZIP code and block data to areas that fall within the box.
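As a simplified picture of what the extent means, here is a point-in-box check using the same ordering (leftmost longitude, lower latitude, rightmost longitude, upper latitude). Whatever the importers do internally with full geometries, the basic test looks like this; in_extent and the sample numbers are made up for the illustration.

```python
def in_extent(lng, lat, extent):
    """True if (lng, lat) falls inside (min_lng, min_lat, max_lng, max_lat)."""
    min_lng, min_lat, max_lng, max_lat = extent
    return min_lng <= lng <= max_lng and min_lat <= lat <= max_lat


# Made-up extent, for illustration only.
extent = (-80.0, 35.0, -79.0, 36.0)
```

A point at (-79.5, 35.5) is inside this box, while (-78.5, 35.5) lies east of it and would be skipped as out of bounds.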
Let's look at an example with these shapefiles:
We'll start with a restrictive extent that only consists of downtown Chapel Hill:
METRO_LIST = (
{
# Extent of the region, as a longitude/latitude bounding box.
'extent': (-79.066272, 35.91671, -79.040481, 35.910663),
# ...
},
)
This selection loaded 2 ZIP codes:
$ django-admin.py import_nc_zips
Importing zip codes...
# ...
Skipping 27511, out of bounds
Skipping 27513, out of bounds
Created ZIP Code 27514
Created ZIP Code 27516
Skipping 27517, out of bounds
Skipping 27519, out of bounds
# ...
Created 2 zipcodes.
And limited the block data as well:
$ django-admin.py import_county_streets 37135
Importing blocks, this may take several minutes ...
Created 73 blocks
Populating streets and fixing addresses, these can take several minutes...
Populating the streets table
streets: created: 28
block_intersections: created: 160
Done.
Restricting the area will limit the ability of the geocoder. In this case, for example, it can geocode the intersection of Franklin and Henderson, which is right downtown, but not Franklin and Estes (don't worry, we'll get into more geocoding details in the next section). A map helps illustrate this more clearly. Below you can see the bounding box with pins on the two intersections:
View OpenRural - Downtown Chapel Hill in a larger map
If we increase the bounding box, we'll get a lot more data:
METRO_LIST = (
{
# Extent of the region, as a longitude/latitude bounding box.
'extent': (-79.165922, 35.829095, -78.978468, 36.02426),
# ...
},
)
With an extent that encompasses all of Chapel Hill, the importer loaded 9 ZIP codes, 4302 blocks, 1699 streets, and 7189 intersections. Here's a map illustrating the larger extent:
View OpenRural - Orange County, NC in a larger map
It's up to the maintainer of an OpenBlock install to determine which extent to use as it is based on the specifics of the application. A large extent will import more ZIP codes and blocks and, therefore, will slow down geospatial queries and may include unwanted geographic areas.
Street
Now that we have NC Orange County data loaded, let's investigate this data with the OpenBlock models.
The Street model contains a catalog of all loaded streets. It's a simple model with only a few fields:
- street
- pretty_name
- street_slug
- suffix
- city
- state
In NC Orange County, we can see that the street data spans 4 cities:
>>> from ebpub.streets.models import Street
>>> Street.objects.order_by('city').values_list('city', flat=True).distinct()
[u'', u'CARRBORO', u'CHAPEL HILL', u'DURHAM', u'HILLSBOROUGH']
Some streets cross city lines and therefore contain two entries:
>>> Street.objects.filter(street_slug='rosemary-st').values_list('city', flat=True)
[u'CARRBORO', u'CHAPEL HILL']
And, for example, if we're looking for Franklin St. in Chapel Hill, NC, we can filter for it here:
Blocks
Blocks are fundamental to OpenBlock and are used by the geocoder. OpenBlock defines a block as "a segment of a single street between one side street and another side street." The Block model is slightly more intricate than Street, but each entry basically represents the address range of a street for each block segment.
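The address-range idea can be sketched without a database. In the snippet below, BlockRange is a simplified stand-in for a Block row, and the ranges are invented for illustration rather than taken from the Orange County data.

```python
from collections import namedtuple

# Simplified stand-in for a Block row; the real model has more fields.
BlockRange = namedtuple("BlockRange", "street predir from_num to_num")

BLOCKS = [
    BlockRange("FRANKLIN", "E", 100, 199),
    BlockRange("FRANKLIN", "E", 200, 299),
    BlockRange("FRANKLIN", "W", 100, 199),
]


def find_block(blocks, number, street, predir):
    """Return the first block whose address range contains `number`."""
    for block in blocks:
        same_street = (block.street, block.predir) == (street, predir)
        if same_street and block.from_num <= number <= block.to_num:
            return block
    return None
```

Looking up number 123 on East Franklin returns the 100 to 199 range, and a number outside every range returns None, which is the situation where geocoding a valid street still fails.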
To start, we can see that Franklin St. is divided into roughly 32 blocks:
>>> from ebpub.streets.models import Block
>>> Block.objects.filter(street_slug='franklin-st').count()
32
It's sectioned into an east and west segment:
>>> Block.objects.filter(street_slug='franklin-st').order_by('street_pretty_name').values_list('street_pretty_name', 'predir').distinct()
[(u'Franklin St.', u'W'), (u'Franklin St.', u'E')]
And can have an address between 100 and 1899:
>>> from django.db.models import Max, Min
>>> Block.objects.filter(street_slug='franklin-st').aggregate(Min('from_num'), Max('to_num'))
{'from_num__min': 100, 'to_num__max': 1899}
So we can find the block that contains the 123 address:
Also, on a side note, it's possible for some blocks to span cities:
Geocoding
Now that we have a basic understanding of how the data is stored within OpenBlock, let's do some geocoding. Most of these examples will use the SmartGeocoder class. SmartGeocoder delegates to specific geocoders (AddressGeocoder, BlockGeocoder, and IntersectionGeocoder) based on how it interprets the string with regular expressions.
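The delegation step can be pictured as a small classifier that looks at the shape of the string before any lookup happens. The patterns below are rough guesses written for this sketch and are not the expressions OpenBlock uses; classify and its return values are likewise invented.

```python
import re

# Rough patterns for illustration; OpenBlock's own expressions differ.
BLOCK_RE = re.compile(r"^\d+ block of ", re.IGNORECASE)
INTERSECTION_RE = re.compile(r" (?:and|&|at) ", re.IGNORECASE)
ADDRESS_RE = re.compile(r"^\d+ ")


def classify(location_string):
    """Guess which kind of geocoder should handle the string."""
    text = location_string.strip()
    if BLOCK_RE.search(text):
        return "block"
    if INTERSECTION_RE.search(text):
        return "intersection"
    if ADDRESS_RE.search(text):
        return "address"
    return "unknown"
```

Under these assumptions, "123 East Franklin Street" is treated as an address, "100 block of Franklin St." as a block, and "Franklin and Henderson" as an intersection.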
Addresses
To start, let's geocode "123 East Franklin Street":
This one was pretty easy for the geocoder to parse and find. You can see that not only has it found the associated block, but it also knows the exact geographic point. However, this will fail if passed a non-existent address number (InvalidBlockButValidStreet):
In this case, the geocoder was able to extract the address, but it failed to find the associated block in the database. Non-existent streets also fail (DoesNotExist):
Intersections
The geocoder can locate intersections too:
Notice how the intersection field is populated, rather than block. This will raise a DoesNotExist exception when an intersection is not found:
Street Misspellings
OpenBlock provides a model, StreetMisspelling, to define street aliases. This allows you to map a bad street name to a good street name that exists in the database:
Now geocoding "Glen Haven" will find "Glenhaven".
Multiple Cities
By default, OpenBlock is configured to work with a single city, which is defined in METRO_LIST:
# Metros. You almost certainly only want one dictionary in this list.
# See the configuration docs for more info.
METRO_LIST = (
{
# Extent of the region, as a longitude/latitude bounding box.
'extent': (-79.165922, 35.829095, -78.978468, 36.02426),
# The major city in the region.
'city_name': 'Chapel Hill',
},
)
The geocoder will fail if it locates a street that's associated with a city unknown to OpenBlock. For example, 100 Pine Street is in Carrboro and not Chapel Hill:
This street exists in the database due to our extent covering most of Orange County. Since we've set up OpenBlock to encompass an entire county, rather than a single city, we need to define additional cities. This can be accomplished in one of two ways:
- Add additional dictionaries to METRO_LIST for each city
- Import city locations into the database and tell OpenBlock to refer to these
We imported Orange County city boundary data above, so we'll use the latter:
METRO_LIST = (
{
# Extent of the region, as a longitude/latitude bounding box.
'extent': (-79.165922, 35.829095, -78.978468, 36.02426),
# Set this to True if the region has multiple cities.
# You will also need to set 'city_location_type'.
'multiple_cities': True,
# The major city in the region.
'city_name': 'Chapel Hill',
# Slug of an ebpub.db.LocationType that represents cities.
# Only needed if multiple_cities = True.
'city_location_type': 'cities',
},
)
Here we enabled multiple_cities and told OpenBlock that the slug of the location type representing cities is cities. Now 100 Pine Street will geocode properly:
What's Next
Now that we've had an overview of the geocoder, we'll jump into OpenBlock's place, location, and address parser. Stay tuned!
Update: Read more in OpenBlock Geocoder, Part 2: Text Parsing and Entity Extraction.
September 13 2011 by Tobias McNulty
Load testing a site with ApacheBench is fairly straightforward. Typically you'd just SSH to a machine on the same network as the one you want to test, and run a command like this:
ab -n 500 -c 50 http://my.web.server/path/to/page/
The -n argument determines the number of requests to execute, and the -c argument determines the concurrency level--or how many requests will be running simultaneously at any given time.
For Python and Django web applications, Fabric is a popular tool for deploying code to and running other commands on remote servers. It's built in Python, and its simple syntax makes it easy to use as well. For more information and a primer on Fabric, check out the post that Colin Copeland wrote back in 2010, titled Basic Django deployment with virtualenv, fabric, pip and rsync.
Running ApacheBench from Fabric is useful because you can easily do other things like customize and update your web server configuration in an automated way. For example, here's a sample template for an Apache server configuration that I upload to our web servers using Fabric:
ServerName %(www_server_name)s
WSGIDaemonProcess my_site-%(environment)s processes=%(process_count)s threads=%(thread_count)s display-name=%%{GROUP}
WSGIProcessGroup my_site-%(environment)s
WSGIScriptAlias / %(apache_root)s/%(environment)s.wsgi
ErrorLog %(log_root)s/wsgi.error.log
LogLevel info
CustomLog %(log_root)s/wsgi.access.log combined
You'll notice the %s-style Python string formatting syntax in the Apache config. These are populated by Fabric's files.upload_template method when the file is copied to the remote server, and are based on variables you pass in to the context. Here's a sample Fabric method to upload your Apache configuration to the remote server:
import os

from fabric.api import env, sudo
from fabric.contrib import files

def _join(*items):
    """
    We're deploying to Linux, so hard code that type of path join here. Using
    os.path.join would not work when deploying from Windows.
    """
    return '/'.join(items)

def apache_graceful():
    sudo('/etc/init.d/apache2 graceful')

def update_apache_conf(process_count=15, thread_count=1):
    env.process_count = process_count
    env.thread_count = thread_count
    for ext in ['conf', 'wsgi']:
        source = os.path.join(env.deployment_dir, 'templates',
                              'apache.%s' % ext)
        dest = _join(env.home, 'apache.conf.d',
                     '.'.join([env.environment, ext]))
        # env carries process_count and thread_count, which were set above
        files.upload_template(source, dest, context=env, mode=0755,
                              use_sudo=True)
    apache_graceful()
Specifying process_count and thread_count in the arguments to update_apache_conf() means that I can pass those in from the command line, like so:
fab staging update_apache_conf:10,3
This would install an Apache configuration on the server that starts up 10 mod_wsgi processes with 3 threads each.
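To see what the substitution produces for that invocation, you can apply the same %-style formatting by hand. This sketch uses one line of the template above and a plain dict in place of Fabric's env; the 'staging' value for environment is an assumption, and the counts are strings because that is how command-line arguments reach a Fabric task.

```python
TEMPLATE = (
    "WSGIDaemonProcess my_site-%(environment)s "
    "processes=%(process_count)s threads=%(thread_count)s "
    "display-name=%%{GROUP}"
)

# Stand-in for the values Fabric's env would hold after
# `fab staging update_apache_conf:10,3`.
context = {"environment": "staging", "process_count": "10", "thread_count": "3"}

rendered = TEMPLATE % context
```

The doubled %% in the template is what lets a literal %{GROUP} survive the formatting step and reach Apache untouched.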
Running ApacheBench through Fabric is also easy to do, but here's a slightly more complex example I put together that saves the results in time-stamped folders, whose names also include the number of requests, concurrency level, process count, and thread count of the test:
import datetime

from fabric.api import run

def benchmark():
    config = {
        'number': 500,
        'concurrency': 50,
        'url': 'http://my.web.server/path/to/page/',
    }
    # prime the server with a few requests before logging any results
    run('ab -n 10 -c 1 {url}'.format(**config))
    context = dict(env)
    context.update(config)
    context['now'] = datetime.datetime.now().strftime('%Y-%m-%d_%H:%M:%S')
    dir_name = '{now}_n={number},c={concurrency}'
    if 'process_count' in context and 'thread_count' in context:
        dir_name += '_p={process_count},t={thread_count}'
    dir_name = dir_name.format(**context)
    context['test_dir'] = os.path.join('test_runs', dir_name)
    run('mkdir -p {0}'.format(context['test_dir']))
    for x in range(4):
        context['test_file'] = os.path.join(context['test_dir'],
                                            'ab{0}.txt'.format(x))
        run('ab -n {number} -c {concurrency} {url} > '
            '{test_file}'.format(**context))
You can run these commands together to update the Apache configuration and run a benchmark with a single line from the shell, like so:
fab staging update_apache_conf:10,5 benchmark
This would update the Apache configuration on the remote server, run a few requests to prime the server, and then run the specified ApacheBench test 4 times and save the results in text files in a timestamped directory.
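Once the result files are saved, you will probably want to compare runs. The following is a rough sketch, not part of the fabfile above, that pulls the "Requests per second" figure out of ab's report text and averages it over several runs. The function names and the sample report excerpts are made up for illustration; in practice you would read the text from the ab0.txt, ab1.txt, ... files.

```python
import re

# Matches the throughput line that ab prints in its summary.
RPS_PATTERN = re.compile(r"^Requests per second:\s+([\d.]+)", re.MULTILINE)


def requests_per_second(ab_output):
    """Return the mean requests/second from an ab report, or None if absent."""
    match = RPS_PATTERN.search(ab_output)
    return float(match.group(1)) if match else None


def average_rps(reports):
    """Average the throughput over several reports, skipping unparsable ones."""
    values = [requests_per_second(text) for text in reports]
    values = [value for value in values if value is not None]
    return sum(values) / len(values) if values else None


# Made-up excerpts standing in for the contents of the saved result files.
sample_reports = [
    "Concurrency Level:      50\nRequests per second:    100.00 [#/sec] (mean)\n",
    "Concurrency Level:      50\nRequests per second:    120.00 [#/sec] (mean)\n",
]
print(average_rps(sample_reports))
```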
To test lots of different server configurations at once with minimal user interaction, you can further script this by wrapping the above command in a Bash for loop, like so:
for process_count in {1..76..5}; do fab staging update_apache_conf:$process_count,1 benchmark; done
This command iterates from 1 through 76, in steps of 5 (1, 6, 11, 16 ... 76), sets the Apache configuration to use that number of processes, and runs a separate benchmark for each configuration.
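The stepped brace expansion used there needs Bash 4 or later. If that isn't available, the same list of commands can be generated from Python. This is only a sketch of the equivalent loop; it prints the commands rather than running them.

```python
# Process counts matching the Bash brace expansion {1..76..5}.
process_counts = list(range(1, 77, 5))

# One fab invocation per configuration, with a single thread per process.
commands = [
    "fab staging update_apache_conf:%d,1 benchmark" % count
    for count in process_counts
]

for command in commands:
    print(command)
```

Each printed line could then be pasted into a shell or handed to subprocess.call() with shell=True.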
Anyway, that's just a little insight into how one might deploy and test a Python or Django application using Fabric and ApacheBench. Hope you find it helpful!
August 31 2011 by Dan Poirier
Eclipse with the PyDev module has a lot to offer the Python programmer these days. If you haven't looked at PyDev before, or not in a while, it's worth checking out.
Here are some of my favorite features:
- One-keystroke navigation to the definitions of variables, methods, classes
- Code completion, including automatically adding import statements
- Clean up imports
- Refactoring, including renaming across projects
- Clean up whitespace
There are many more. I recommend taking a look at the PyDev web site and blog to see what might appeal to you.
Getting Eclipse and PyDev
If you're already using Eclipse, you can add PyDev to it. If not, you also have the option to get a version of Eclipse with PyDev already included. You install PyDev into your existing Eclipse the same way you install any other Eclipse add-on: first tell Eclipse where to find the add-on, then install it.
- In Eclipse 3.6 and 3.7, select Help/Install New Software...
- On the panel that pops up, click "Add..." at the top right.
- Enter any name (e.g. "PyDev")
- Enter http://pydev.org/updates as the Location, then click OK.
- In the list of available software, select PyDev.
- Click Next, Next, accept the license, Finish.
- If Eclipse asks whether to trust the PyDev certificate, agree.
- When the install is complete, allow Eclipse to restart.
To get Eclipse with PyDev already installed, go to http://www.aptana.com/products/studio3/download and download Aptana Studio for your platform. Aptana Studio 3.0.4 is Eclipse 3.6 plus PyDev plus other add-ons.
Preferences
There are some preferences in Eclipse you probably want to change if you'll be working with Python. Open the preferences by selecting Window/Preferences, then use search to find and set these:
- Insert spaces for tabs: checked, but note that the PyDev editor ignores this and you need to make a similar setting in the PyDev settings for editing Python files.
- Show whitespace characters:
- In Eclipse 3.6, you probably want this off except when you're looking for trailing whitespace.
- In Eclipse 3.7, you can check the box and then click on "whitespace characters" and set just the trailing whitespace visible, which is unobtrusive enough to leave enabled all the time.
- Replace tabs with spaces when typing: checked. This is the one that PyDev obeys.
- Right trim lines: checked, otherwise you end up with a lot of lines with just indentation on them.
- Add newline at end of file: checked.
- Auto-Format editor contents before saving: If you check this, every time you save a file PyDev will fix it to comply with the other settings on this preferences page. That's great if you're working on your own project, but not so good if you're doing maintenance on somebody else's project and don't want to make random changes to white-space all over the place.
Explore the other PyDev settings. The "Code Analysis" section is particularly interesting, as it lets you control the kinds of things that Pydev marks as errors or warnings.
Finally, at least one Python interpreter needs to be configured. Still in Preferences, go to PyDev/Interpreter - Python. For now, just click "Auto Config" and click OK on the dialog that pops up. Then click OK to close Preferences. PyDev will take a while to analyze the python installation and libraries.
Perspective
Select Window/Open Perspective/Other and choose PyDev.
Starting to use Eclipse and PyDev with a project
I typically use Eclipse with Django projects, though I haven't tried PyDev's Django-specific features yet.
When I want to work with a project in Eclipse, first I check it out locally. Then here are the steps I follow:
- File/New/Project (not PyDev project, I don't like the PyDev new project wizard)
- Choose General/Project, click Next
- Enter a project name
- Uncheck "use default location" and set the location to the top directory of my project
- Click Finish
- Right-click on the project and select PyDev/Set as Pydev Project
- Right-click on the project and select Properties
- go to PyDev - PYTHONPATH
- In the Source Folders tab, use "Add source folder" to add folders that need to be on your python path for your project to work. Often this is either the top-level project folder or a folder immediately inside it.
Using PyDev with virtualenv
If you use virtualenv (and if not, why not?), there are a couple additional steps to take.
First, add the interpreter from your virtual environment as another Python interpreter:
- Open Preferences
- Go to PyDev/Interpreter - Python
- Click "New..."
- For the Executable, navigate to your virtual environment's bin directory and select the Python interpreter there.
- Choose another name for your interpreter if you want, probably something shorter than the default. I like to use the name of the virtual environment, with "-env" appended.
- Click OK
- Now here's the tricky part - a dialog will pop up asking which library folders to add. Keep the defaults but you also need to add your system python library directories - e.g. /usr/lib/python2.6, /usr/lib64/python2.6, and /usr/lib/python2.6/plat-linux. Otherwise PyDev won't be able to find all the libraries your python interpreter will be using.
- Click OK
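If you're not sure which system library directories to add in that dialog, one way to find them is to ask the system interpreter itself. Here is a minimal sketch, assuming Python 2.7 or later (where the standard sysconfig module exists) and assuming you run it with the system Python rather than the one inside the virtual environment. The function name is made up for illustration.

```python
import sysconfig


def system_library_dirs():
    """Return the interpreter's standard-library directories, without duplicates."""
    paths = sysconfig.get_paths()
    dirs = []
    # "stdlib" holds the pure-Python modules, "platstdlib" the platform-specific ones.
    for key in ("stdlib", "platstdlib"):
        path = paths.get(key)
        if path and path not in dirs:
            dirs.append(path)
    return dirs


for directory in system_library_dirs():
    print(directory)
```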
Then, set the new interpreter as the interpreter for your project:
- Right-click the project and select Properties
- Go to Pydev - Interpreter/Grammar
- Under Interpreter, select your new interpreter
- Click OK
Now PyDev should be able to find any libraries you have installed in the virtual environment when needed.
If you install additional libraries, you might need to go back to the interpreter definitions, click "Apply", and tell Pydev which interpreters it should scan again. Until you do that, PyDev might not notice your new libraries.
For more information, see http://pydev.blogspot.com/2010/04/pydev-and-virtualenv.html
July 18 2011 by Colin Copeland
We've been using RapidSMS, a Django-powered SMS framework, more and more frequently here at Caktus. It's evolved a lot over the past year-- from being reworked to feel more like a Django app, to merging the rapidsms-core-dev and rapidsms-contrib-apps-dev repositories into a single codebase (no more submodules!), to finally becoming installable via pypi. The "new core" is in a great state now and is much easier to work with. However, one particular aspect of RapidSMS, the route process, has always been complicated and confusing to deal with. Tobias began the conversation on this issue after returning from a 6-week long UNICEF project in Zambia. He summarized the route process like so:
- The route process as it currently stands is complicated; it includes a number of threads and the ways in which they interact is not always intuitive
- If the route process dies unexpectedly, all backends (and hence message processing) are brought offline
- Automated testing is difficult and inefficient, because the router (and all its threads) needs to be started/stopped for each test
The RapidSMS router is a globally instantiated object that routes incoming messages through each RapidSMS app and sends outgoing messages via installed backends. The run_router management command starts the router process and creates individual threads for each backend defined in the settings module. I'm not entirely certain as to why the route process was originally threaded, but I assume it was designed to more easily integrate blocking backends (like gsm) into RapidSMS. However, with the standardization of Kannel and SMS-based web services, like Twilio, both of which offload the low level communication work, I believe the threading aspect is now less important. So recently, in what started as a proof of concept, we began work on a decoupled router implementation called rapidsms-threadless-router. rapidsms-threadless-router provides a threadless_router app, which removes the threading functionality from the legacy Router class. Rather, all inbound requests are handled via the main HTTP thread. threadless_router attempts to:
- Make RapidSMS backends more Django-like. Use Django’s URL routing and views to handle inbound HTTP requests
- Remove clutter and complexity of route process and threaded backends
- Ease testing – no more threading or Queue modules slowing down tests
In comparison to the legacy route process, threadless_router handles all inbound and outbound backend communication from within the main HTTP thread. Each request creates a new router instance and no separate process or thread is created. This simplifies the Router class significantly. Additionally, threadless_router allows inbound messages to be easily passed off to an asynchronous task queue, such as Celery. Task queues allow message processing to be handled outside of the HTTP request/response cycle, which is perfect for SMS-based applications, as out of band responses are more than acceptable.
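The request-per-router idea can be illustrated without any framework at all. The sketch below is purely hypothetical: none of these class or function names come from RapidSMS or threadless_router, and the real router does considerably more. It only shows the shape of the flow, where each inbound request builds a fresh router, passes the message through the apps in order, and returns the replies with no threads or queues involved.

```python
class EchoApp(object):
    """Hypothetical app: replies to any message that starts with 'echo '."""

    def handle(self, message):
        if message["text"].startswith("echo "):
            message["responses"].append(message["text"][5:])
            return True  # handled, so stop passing the message along
        return False


class SimpleRouter(object):
    """Hypothetical stand-in for a router created fresh on every request."""

    def __init__(self, apps):
        self.apps = apps

    def incoming(self, text):
        message = {"text": text, "responses": []}
        for app in self.apps:
            if app.handle(message):
                break
        return message["responses"]


def receive_view(text):
    """Plays the role of an HTTP view: build a router, route, return replies."""
    router = SimpleRouter([EchoApp()])
    return router.incoming(text)


print(receive_view("echo hello"))
```

Because nothing outlives the call, a test can exercise the whole path by calling the view function directly, which is the testing benefit described above.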
threadless_router is not, however, a drop-in replacement for the legacy router. Legacy backends will not work, and because all routing is handled from within the HTTP thread, non-HTTP backends, such as pygsm, are not currently compatible with threadless_router. A simple wrapper around pygsm could be written to both talk to the modem and spin up a simple HTTP server to communicate with RapidSMS. This would decouple pygsm from RapidSMS and let it run as its own separate process. Integrating with supervisord would work great here too. Several contrib applications, such as httptester and scheduler, are also not compatible. We've bundled a new httptester as a replacement and celerybeat can be used to mimic the scheduler functionality. A full list of caveats can be found in the docs.
The full documentation for rapidsms-threadless-router, including installation instructions and examples, can be found on readthedocs.org. If you're already familiar with the internals of RapidSMS and would like to see examples of threadless backend implementations, I suggest reviewing the bundled http and httptester backends and our updated twilio backend.
I would like to mention that Nicolas Pottier and Eric Newcomer created rapidsms-httprouter, which also handles all messages within the main HTTP thread. The main difference between rapidsms-httprouter and rapidsms-threadless-router is that, while httprouter handles inbound messages in a Django view, it still starts up threads (for handling outgoing messages) like the current router (also from within a Django view). Make sure to check it out as well and let us know what you think!
March 25 2011 by Nicole Foster
I'm excited to announce that Caktus is launching its summer internship program. It is a 12 week paid position in our Carrboro, NC office. We're in driving distance from UNC Chapel Hill, NC State in Raleigh, and Duke in Durham, so students from all parts of the NC Research Triangle are welcome to apply.
We are looking for a web developer who enjoys working on a team and is excited to work on new and diverse projects. While working with us you will get to work on Django-powered web applications, learn about test driven development and other agile methodologies, perform front-end development in HTML, CSS and JavaScript (jQuery) and become familiar with Linux (Debian-flavor) desktop and server systems. Check out the full posting here.
If you'd like to spend your summer working with some great people on interesting projects please email us at jobs+website@caktusgroup.com with your resume and, if applicable, links to samples of code you have written. Kindly include a brief note describing why you would be a great fit for this opportunity.
March 15 2011 by Mark Lavin
We just got back from another fun and successful PyCon. While we didn't get to stay for much of the sprints, we did get to spend some time in the Django sprint on Sunday and Monday. On Monday morning I was there early and noticed a bit of confusion among the Django sprinters. While I'm not a frequent contributor, I've participated in a few sprints at previous conferences and in local sprints with Caktus. I shared my experiences with them and it seemed generally helpful, so I thought I would share them here as well.
If you've ever sprinted at a large conference like PyCon or DjangoCon you've probably heard the speech from the core developers about contributing. It's always nice to hear, but it still doesn't stop people from being unsure about what to do or where to start at the sprints. I'm here to tell you, as another non-core developer, that it really isn't that hard or scary. Django is a big project but you don't have to know everything to be able to help.
Where to start?
To start you should always read the contributing guide. Once you've got the Django source code checked out you should run the test suite just to make sure you know how. Do you have to check out from SVN? No, you don't. There are mirrors on GitHub and BitBucket and you should use the VCS that you are most comfortable with. Just remember to generate your patches from the source root directory and in a format that's compatible with SVN.
How do I find a ticket?
There are a couple different strategies for finding a good ticket to work on. The list of Trac tickets can be intimidating especially if this is your first time working on Django. One way to find a ticket to work on is to find an area that really frustrates you when using Django. Last year at DjangoCon I focused on contrib.sites, contrib.sitemaps and formsets because there were more than a couple issues that I had come across recently working with them. Thankfully we were even able to fix #11418 and #11358.
Some tickets have patches and tests but haven't been reviewed. It's fairly easy to download the patch, apply it and see if it works. You should also spend some time looking to make sure everything works the way the patch submitter claims it works. Either way you should comment on the ticket with your results. Also, some patches are fairly old and won't apply cleanly, but updating the original patch is something else that's easy to do and can be helpful. One common example of this is the recent shift from doctests to unittests. A number of patches that had tests may not apply because they were doctests. Converting those patches to use unittests is another way to help.
Another strategy is to find tickets which have patches but need tests. You can easily filter the Trac tickets by 'Has Patch' and 'Needs Tests'. Writing the tests can help you better understand the underlying code. You might also want to write some patches and if so just filter by the tickets without patches. Here you might also want to filter on areas that you are most comfortable with such as the admin, forms, or ORM. Remember that you need tests with all of your patches and documentation for new features.
Can I change the tickets?
I say don't be afraid to jump in. There is nothing you can do in Trac that can't be undone. While you shouldn't re-open tickets closed by core devs, if there are bugs you can't reproduce don't be afraid to say so. Tobias and I spent over an hour working to reproduce a bug at the last PyCon. In the end we closed the ticket as 'could not reproduce'. Once we did that at least two other people commented in IRC that they had tried and couldn't reproduce it either, but hadn't commented or closed the ticket. So, if you spend time looking at a ticket, do everyone a favor and share what you did. When commenting or closing a ticket remember to always be respectful. Someone was kind enough to take the time to put in a ticket to fix or improve Django and we are all a part of the same community.
The last thing I'll add is that contributing to Django doesn't have to start or end at the sprints. Trac is always available if you have some time to look through tickets. I hope this helps some people get the confidence to write or check patches for Django. Happy sprinting everyone!
March 12 2011 by Tobias McNulty
I'm delighted to announce that we've just published another job posting for a Linux Systems Administrator at Caktus. The position will involve maintaining existing Linux servers, designing and building highly-scalable deployments, and assistance with Django deployment and development as time permits. This is a full-time position, with benefits, and is based out of our Carrboro, NC office (a short drive from Raleigh, Durham, and Chapel Hill).
For more information, follow the link on our careers page and let us know if you or someone you know might be interested in the position!
March 09 2011 by Tobias McNulty
PyCon 2011 Atlanta is just around the corner, and I'm proud to announce that Caktus is a gold sponsor at the conference this year! We sponsored DjangoCon in both 2009 and 2010, and this year agreed to extend that support to the Python community in general.
PyCon US is the annual gathering of software developers who use the open source, Python programming language. Django, our web framework of choice, is written in Python, so we use the language every day here at Caktus to create custom web applications and dynamic, content-rich web sites. Additionally, starting last year, we've put some of that knowledge to use extending and developing applications for the RapidSMS framework - a tool for creating mobile health and data collection applications that integrate web and mobile components (via SMS).
This year, the conference is being held March 9th through the 17th, 2011 in Atlanta, Georgia. We've grown a little since last year at this time; 7 Caktus team members—Colin, Karen, Mark, Mike, Calvin, Nicole, and myself—will be attending the conference. We're thrilled to be going again this year and hope to see you there!
We're currently looking for a Django developer to join the team, so stop by and introduce yourself if you or someone you know might be interested in the position!
February 09 2011 by Tobias McNulty
I'm pleased to announce that we just released a new Careers section of our web site here at Caktus. The section has been inaugurated with a new posting for a full-time Django developer position based out of our Carrboro, NC office (not far from Raleigh, Durham, or Chapel Hill), so kindly check it out and let us know if you or someone you know might be a good fit!
December 29 2010 by Tobias McNulty
I recently returned from a 6 week trip in Malawi, where I was heavily involved in the implementation and deployment of Project Mwana, an Information and Communication Technology (ICT) project focused on Maternal and Newborn Child Health (MNCH). The project is currently running as a pilot in both Zambia and Malawi. This post is a fairly technical overview of what the project does and the way in which it was developed.
The project aims to facilitate several things, including (a) secure delivery of HIV (Dry Blood Spot, or DBS) test results from the lab to health clinics by SMS, which we’ve named “Results160”, (b) appointment reminders for newborn children, or “RemindMi” (Mi = mothers & infants), and (c) free-text “chat” for health clinic workers and Community Health Workers, to strengthen communication and patient tracing.
Source Code
The source code for the project, which is based on Django and RapidSMS, can be found on GitHub:
I updated the developer setup instructions fairly recently, so, if you’re a developer interested in this project or line of work, you should be able to get a local copy up and running without too much trouble. If you try and do have any issues, please let me know!
Team Composition
In Zambia, we have 2 on-going local developers, 1 temporary lead developer, 1 on-going local project manager, and 1 on-going project mentor. The team is similar in Malawi, except we have 1 on-going local developer, 1 temporary lead developer (myself), 1 on-going local project manager, and 1 on-going project mentor.
Development Workflow
For this phase of the project we adopted “git flow” to help guide our development workflow, and I think it was a big success (more below). See the following links for more information:
Code Organization
apps/ - we tried to separate functionality into separate Django/RapidSMS apps as much as possible.
backends/ - contains a RapidSMS backend for communicating with Kannel (an open source SMS gateway)
locale/ - translation files for Bemba and Chichewa
requirements/ - pip requirements files & corresponding tarballs (we found that checking in the tarballs was crucial for easily re-creating a development or production environment in low-bandwidth situations)
malawi/ - Malawi-specific configuration files & code
zambia/ - Zambia-specific configuration files & code
Settings Files
We took a strongly hierarchical approach to the settings files, to make it easy to share/override settings as necessary:
settings_project.py
 \-> malawi/settings_country.py
      \-> malawi/settings_staging.py
      \-> malawi/settings_production.py
      \-> localsettings.py
 \-> zambia/settings_country.py
      \-> zambia/settings_staging.py
      \-> zambia/settings_production.py
      \-> localsettings.py
Each “sub” settings file simply imports from its “parent” at the top of the file, thereby allowing you to, for example, insert or append an app, add or remove a middleware, etc.
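As a rough illustration of that import-from-parent pattern, the sketch below writes two tiny settings modules to a scratch directory and imports them. The file contents are hypothetical: the app names are borrowed from this project, but the actual settings files contain far more, and the flat module names are simplified so the example is self-contained.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical file contents standing in for the real settings files.
PARENT = (
    "INSTALLED_APPS = ['labresults', 'reminders']\n"
    "DEBUG = False\n"
)
CHILD = (
    "from settings_project import *\n"
    "INSTALLED_APPS = INSTALLED_APPS + ['broadcast']\n"
    "DEBUG = True\n"
)

# Write both modules to a scratch directory so they can be imported.
scratch = tempfile.mkdtemp()
for name, body in (("settings_project.py", PARENT),
                   ("settings_country.py", CHILD)):
    with open(os.path.join(scratch, name), "w") as handle:
        handle.write(body)

sys.path.insert(0, scratch)
parent = importlib.import_module("settings_project")
child = importlib.import_module("settings_country")

print(child.INSTALLED_APPS)   # parent's apps plus the child's addition
print(parent.INSTALLED_APPS)  # the parent is left untouched
```

Building a new list with + rather than calling append() keeps the child's change from leaking back into the parent module, which matters when several settings files share one parent.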
Apps
The project is divided up into the following main Django/RapidSMS apps: Results160 is implemented (mostly) in the “labresults” app, RemindMi in the “reminders” app, and the clinic chat in the “broadcast” app. The other apps support various pieces of functionality such as location management (“locations”), additions to the contact model (“contactsplus”), “stringcleaning,” and SMS printer integration (“tlcprinters”). I won’t pretend that they’re at all pluggable at the moment, but I’m sure that some of the more useful parts could be extracted & made independent, should the need arise.
Lessons learned
On the Malawi side, we learned a few things about building RapidSMS (or mobile health ICT projects in general) that I think are worth sharing:
(1) Regular meetings. Have quick meetings to review progress and plan next actions. Scrum type meetings really helped us review and quickly narrow down and fix things that we saw were not going too well before they became a problem.
(2) Pair Programming. Even though we were often pressed for time, pair programming proved very beneficial to the team and the project in general. Part of the mandate for the project is creating local capacity, and hands-on pair programming allowed the newer or more junior local developers both to gain a good overview of the codebase and to learn from the more seasoned Python/Django developers on the team.
(3) Feature Branches. git-flow made them easy, which was key because we had two teams working simultaneously in two different countries (between which communication was difficult at times), and we wanted to share most of what we were doing while also sheltering each other from potentially buggy commits of partially implemented features.
(4) Using a single code base for two deployments. It was difficult at times, when our two teams were changing different parts of the code and inadvertently breaking parts of it, but overall I think it was a big win & well worth the effort. We get the benefit of shared features & bug fixes, we don’t have to deal with maintaining separate forks, and it forces us to optimize our development workflow and make our code that much more configurable.
(5) Server environment. Get a public facing IP for your server and/or work location before beginning the implementation. We didn’t have one of these to start with in Zambia, while we did in Malawi. It makes life far, far easier for a number of reasons, including (a) it lets stakeholders get to the server (obviously), (b) it helps you be sure any connectivity issues are on the telco side, not yours, and (c) it lets you avoid corporate firewalls.
(6) SMS Gateway. If you have an SMS requirement, use Kannel. We started out with pygsm and our MultiTech modem would regularly get into a weird state where it was registered on the network & responded to AT commands, but no messages would come through. With other modems, the gsm backend didn’t delete messages from the SIM card, so it quickly filled up. After we switched to Kannel, we only had one case of downtime — and it was the RapidSMS route process, not Kannel, that was at fault. We also implemented the project on two different network carriers: Zain (now Airtel) via a GSM modem, and TNM (over the internet via SMPP). Kannel gave us a unified interface with which to interact with the two different backends. Kannel was also valuable as a “reference product” with which to test out the SMPP connection provided to us by TNM; as it turned out, when we couldn’t connect to it at first, the issue was on their end, not ours, and having a second opinion to back up our suspicions was key.
For more information, please see our page on Project Mwana or get in touch with a member of the Caktus team!
September 03 2010 by Tobias McNulty
I'm delighted to announce that Caktus is looking for two Python and/or Django web developers to join our team on a contract or part-time basis, with the potential for full-time work in the future.
Caktus builds custom web applications for local and remote clients using a variety of open-source technologies. We are a small team based in the Chapel Hill/Carrboro area of North Carolina (currently residing in Carrboro Creative Coworking). We believe in face-to-face contact, both with clients and amongst ourselves, and employ agile development techniques that emphasize teamwork and collaboration. We encourage you to meet the team and learn more about what we do.
We're looking for two experienced Python and/or Django web developers who enjoy working on a team and are excited to work on new projects. We have a preference for local candidates, but will consider all submissions. Your work will involve creating and integrating Django apps, working on existing Django projects, deployment, and database work.
You will be working in Linux (Debian-flavor) production environments with Apache and WSGI. Python and relational database experience is required. Django experience is a (big) plus. HTML/CSS and JavaScript experience are also a must, and jQuery is a plus.
If you're interested in one of these positions, please send us your resume, some sample Python code that you wrote, and links to any open-source projects you've contributed to. We're looking forward to meeting you!
August 26 2010 by Tobias McNulty
DjangoCon 2010 is just around the corner, and I'm proud to announce that Caktus is sponsoring the conference again this year!
DjangoCon is the annual gathering of software developers who use the open source, Python-based Django web framework. We use the framework every day here at Caktus to create custom web applications and dynamic, content-rich web sites. Additionally, starting this year, we've put some of that knowledge to use extending and developing applications for the RapidSMS framework. For more information about why we use Django and think it's so great, check out our blog post titled Why Caktus Uses Django.
This year, the conference is being held again the week of September 6th in the beautiful city of Portland, Oregon. We've grown a little since last year at this time; it looks like 6 Caktus team members—Colin, Alex, Karen, Mark, Mike, and myself—will be attending the conference. We're positively thrilled to be going again this year and we hope to see you there!
August 12 2010 by Tobias McNulty
I'm delighted to welcome Karen Tracey to our growing team of web developers here at Caktus. Karen is a core developer of the Django web framework and specializes in the development and testing of applications for the web. She is also the author of Django 1.1 Testing and Debugging, published by Packt Publishing in April, 2010.
Caktus is a seasoned team of web developers that creates interactive, content-rich sites and applications with the Django web framework. We put a strong emphasis on best practices, employ an agile method, and also actively participate in the Django development community.
For more information about Caktus and our team, check out our newly updated team page!
April 22 2010 by Colin Copeland
Deployment is usually a tedious process with lots of tinkering until everything is setup just right. We deploy quite a few Django sites on a regular basis here at Caktus and still do tinkering, but we've attempted to functionalize some of the core tasks to ease the process. I've put together a basic example that outlines local and remote environment setup. This is a simplified example and just one of many ways to deploy a Django project (I learned a lot from Jacob Kaplan-Moss' django-deployment-workshop), so I encourage you to browse around the Django community to learn more.
The entire source for this example project can be found in the caktus-deployment Bitbucket repository.
Local Development Environment
The project directory is organized like so:
caktus_website/
    __init__.py
    apache/
        staging.conf -- staging Apache conf
        staging.wsgi -- staging wsgi file
    blog/
    bootstrap.py -- bootstrap local environment
    fabfile.py -- manage remote environments with fabric
    local_settings.py
    manage.py
    media/
    requirements/
        apps.txt -- pip requirements file
    settings.py
    settings_staging.py -- staging settings file
    urls.py
To setup a local development environment, we'll create a virtual environment and run bootstrap.py, which is just a simple script that automates installing Python dependencies using pip:
import os
import subprocess
import sys

# `parser` is the script's option parser, created earlier (not shown here)
if "VIRTUAL_ENV" not in os.environ:
    sys.stderr.write("$VIRTUAL_ENV not found.\n\n")
    parser.print_usage()
    sys.exit(-1)
virtualenv = os.environ["VIRTUAL_ENV"]
file_path = os.path.dirname(__file__)
subprocess.call(["pip", "install", "-E", virtualenv, "--requirement",
                 os.path.join(file_path, "requirements/apps.txt")])
bootstrap.py uses requirements/apps.txt (a pip requirements file), so you can source anything off of PyPI as well as mercurial, git, and SVN repositories that include setup.py files. In this example, django's SVN is the only dependency in apps.txt:
-e svn+http://code.djangoproject.com/svn/django/branches/releases/1.1.X#egg=django
bootstrap.py must be run within a virtual environment, so let's create a new virtualenv (I recommend using virtualenvwrapper) and then run bootstrap.py to install the dependencies:
copelco@montgomery:~/caktus_website$ mkvirtualenv --distribute caktus
(caktus)copelco@montgomery:~/caktus_website$ ./bootstrap.py
Now that our environment is set up (and Django is on the Python path), we can run normal Django management commands:
(caktus)copelco@montgomery:~/caktus_website$ ./manage.py syncdb --settings=caktus_website.local_settings
(caktus)copelco@montgomery:~/caktus_website$ ./manage.py runserver --settings=caktus_website.local_settings
Great! That's it for our local setup. Now let's look into deploying the project to a staging server.
Deployment and Remote Management
To help provision the remote server environment (in this case Ubuntu 9.10), we'll use fabric. fabric allows you to streamline deployment by functionalizing common tasks in Python. I've created an example fabfile.py to help bootstrap and deploy the project:
(caktus)copelco@montgomery:~/caktus_website$ fab --list
Available commands:
apache_reload reload Apache on remote host
apache_restart restart Apache on remote host
bootstrap initialize remote host environment (virtualenv, dep...
configtest test Apache configuration
create_virtualenv setup virtualenv on remote host
deploy rsync code to remote host
production use production environment on remote host
staging use staging environment on remote host
symlink_django create symbolic link so Apache can serve django adm...
touch touch wsgi file to trigger reload
update_apache_conf upload apache configuration to remote host
update_requirements update external dependencies on remote host
The fabfile splits the deployment process into discrete steps: 1) virtual environment creation, 2) code transfer, and 3) updating the Python dependencies. The bootstrap command wraps everything together, including initial directory creation, so you can set up the server quickly:
def bootstrap():
    """ initialize remote host environment (virtualenv, deploy, update) """
    require('root', provided_by=('staging', 'production'))
    run('mkdir -p %(root)s' % env)
    run('mkdir -p %s' % os.path.join(env.home, 'www', 'log'))
    create_virtualenv()
    deploy()
    update_requirements()

def create_virtualenv():
    """ setup virtualenv on remote host """
    require('virtualenv_root', provided_by=('staging', 'production'))
    args = '--clear --distribute'
    run('virtualenv %s %s' % (args, env.virtualenv_root))

def deploy():
    """ rsync code to remote host """
    require('root', provided_by=('staging', 'production'))
    if env.environment == 'production':
        if not console.confirm('Are you sure you want to deploy production?',
                               default=False):
            utils.abort('Production deployment aborted.')
    extra_opts = '--omit-dir-times'
    rsync_project(
        env.root,
        exclude=RSYNC_EXCLUDE,
        delete=True,
        extra_opts=extra_opts,
    )
    touch()

def update_requirements():
    """ update external dependencies on remote host """
    require('code_root', provided_by=('staging', 'production'))
    requirements = os.path.join(env.code_root, 'requirements')
    with cd(requirements):
        cmd = ['pip install']
        cmd += ['-E %(virtualenv_root)s' % env]
        cmd += ['--requirement %s' % os.path.join(requirements, 'apps.txt')]
        run(' '.join(cmd))
To bootstrap the staging environment, run:
(caktus)copelco@montgomery:~/caktus_website$ fab staging bootstrap
This will run a few commands over SSH and rsync the project directory to a specific location on the staging server. Using rsync is just one of many ways to transfer code to the server; pulling code from a remote repository is another. The "deploy" task in the fabfile can be modified to perform almost any transfer task. Once the bootstrap process is complete, the directory structure will look like so:
home/
    caktus/
        www/
            staging/
                env/ -- virtual environment
                    bin/
                    include/
                    lib/ -- contains site-packages
                    source/ -- contains django src
                caktus_website/
                    ...
                    apache/
                    manage.py
                    requirements/
                    ...
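To make the layout above concrete, here is a small sketch of how the paths the fabfile tasks rely on could be derived for one environment. The key names root, code_root, and virtualenv_root come from the require() calls in the fabfile; the function environment_paths and the exact values are assumptions inferred from the directory listing, not code from the example project:

```python
import posixpath


def environment_paths(home, environment):
    """Return the remote paths for one environment, e.g. 'staging'.

    Hypothetical helper: the values mirror the directory layout shown
    above and are an assumption about how the fabfile sets env.
    """
    # posixpath is used because the remote host is a Unix machine,
    # regardless of the operating system running fab locally.
    root = posixpath.join(home, 'www', environment)
    return {
        'root': root,
        'code_root': posixpath.join(root, 'caktus_website'),
        'virtualenv_root': posixpath.join(root, 'env'),
        'log_dir': posixpath.join(home, 'www', 'log'),
    }
```

A staging task could then copy these values onto fabric's env before any other task runs, which is what the provided_by=('staging', 'production') hints suggest.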
Now SSH to the server and run syncdb within the newly created virtual environment:
caktus@pike:~/www/staging/caktus_website$ source ../env/bin/activate
(env)caktus@pike:~/www/staging/caktus_website$ ./manage.py syncdb --settings=caktus_website.settings_staging
The staging settings file is set up to use sqlite3 to simplify this deployment example. In practice we use PostgreSQL in our production environments, but database setup is for another blog post! To get Apache configured using mod_wsgi, we'll point the Apache configuration to the staging.wsgi file using the WSGIScriptAlias directive. Here's an example Apache configuration to get a barebones Django environment up and running:
<VirtualHost *:80>
    WSGIScriptReloading On
    WSGIReloadMechanism Process
    WSGIDaemonProcess caktus_website-staging
    WSGIProcessGroup caktus_website-staging
    WSGIApplicationGroup caktus_website-staging
    WSGIPassAuthorization On
    WSGIScriptAlias / /home/caktus/www/staging/caktus_website/apache/staging.wsgi
    <Location "/">
        Order Allow,Deny
        Allow from all
    </Location>
    <Location "/media">
        SetHandler None
    </Location>
    Alias /media /home/caktus/www/staging/caktus_website/media
    <Location "/admin-media">
        SetHandler None
    </Location>
    Alias /admin-media /home/caktus/www/staging/caktus_website/media/admin
    ErrorLog /home/caktus/www/log/error.log
    LogLevel info
    CustomLog /home/caktus/www/log/access.log combined
</VirtualHost>
We'll use Apache to serve static media (both local and admin media) and direct everything else to the Django instance through mod_wsgi. In order for the wsgi instance to be aware of our environment and project directory, we need to add the virtual environment's site-packages directory and the project directory to the Python path, and tell Django which settings file to use by setting the DJANGO_SETTINGS_MODULE environment variable:
import os
import sys
import site
PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.dirname(__file__)))
site_packages = os.path.join(PROJECT_ROOT, 'env/lib/python2.6/site-packages')
site.addsitedir(os.path.abspath(site_packages))
sys.path.insert(0, PROJECT_ROOT)
os.environ['DJANGO_SETTINGS_MODULE'] = 'caktus_website.settings_staging'
import django.core.handlers.wsgi
application = django.core.handlers.wsgi.WSGIHandler()
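The wsgi file above hard-codes python2.6 in the site-packages path. As a hedged alternative, the directory name could be derived from the running interpreter, which only helps if the interpreter executing the wsgi script matches the Python the virtualenv was built with. The function name site_packages_dir is hypothetical:

```python
import os
import sys


def site_packages_dir(project_root, version_info=sys.version_info):
    """Return the virtualenv's site-packages path under project_root/env.

    Hypothetical helper; assumes the Unix virtualenv layout
    env/lib/pythonX.Y/site-packages shown in the directory listing.
    """
    python_dir = 'python%d.%d' % (version_info[0], version_info[1])
    return os.path.join(project_root, 'env', 'lib', python_dir, 'site-packages')
```

In the wsgi file this would replace the os.path.join() call that builds site_packages, leaving the site.addsitedir() line unchanged.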
Now just upload the staging Apache configuration and reload Apache:
(caktus)copelco@montgomery:~/caktus_website$ fab staging update_apache_conf
That's it! The site should be up and running on your server's public IP. If you run into any trouble (like a 500 Internal Server Error), just tail the Apache error.log; it'll usually point you in the right direction.
March 16 2010 by Tobias McNulty
Django is a tool we use every day to build fantastic web apps here at Caktus, and a development sprint is a concerted, focused period of time in which developers meet in the same space to get things done on a project.
We're proud to announce that Caktus is hosting another local Django development sprint in the Triangle (Raleigh, Durham, and Chapel Hill/Carrboro) area of North Carolina. The sprint will be held the weekend of March 20th and 21st at Carrboro Creative Coworking, and its purpose will be to help push out bug fixes in preparation for the upcoming Django 1.2 release.
If you're interested in attending, no previous experience contributing to Django is necessary and the sprint will be a great opportunity to start. Work on other open source Django-based projects is welcome too. For more information, check out the corresponding wiki page.
We'll be there to open the doors at 9am both days. Courtesy of our sponsors there will be free drinks, snacks, and lunch to go around. Hope to see you there!
March 11 2010 by Tobias McNulty
Like just about everyone else, we've written our own suite of tools to help with building complex content management systems in Django here at Caktus. We reviewed a number of the existing CMSes out there, but in almost every case the navigation and page structure were so tightly coupled the system broke down when it came time to add additional, non-CMS pages.
We wrote a few little apps, django-pagelets, django-treenav, and django-crumbs, each of which manages different pieces of content (little snippets of content, full CMS pages, navigation, and breadcrumbs). All of the apps are available for free under an open source license on Google Code.
Decoupling was a great move for us, and the ability to plug and play any single part of the system is a huge benefit. Sometimes, however, the completely decoupled architecture was a bit of a pain: If we didn't provide a link from the pagelets app to the treenav app, how would it be possible to edit a page's corresponding navigation item on its change form in the Django admin interface?
Enter Generic Relations. Using Django's content types framework, it's possible to create admin inlines for generic relations with just a few simple lines of code.
In this case, I'll show how we allowed users to edit a page's corresponding navigation item in django-pagelets without requiring everyone (i.e., those who don't need it) to install django-treenav. First, define the generic inline in the admin.py file of the app that contains the model you want to link to:
from django.contrib.contenttypes import generic

class GenericMenuItemInline(generic.GenericStackedInline):
    """
    Add this inline to your admin class to support editing related menu items
    from that model's admin page.
    """
    max_num = 1
    model = treenav.MenuItem
Then, inside the Admin class for the related model in question, dynamically import and add GenericMenuItemInline to the admin's list of inlines based on whether or not it's in the project's INSTALLED_APPS:
from django.conf import settings
from django.contrib import admin

class PageAdmin(admin.ModelAdmin):
    # ...
    inlines = [MyOtherInline]
    # Only offer the menu item inline when treenav is installed.
    if 'treenav' in settings.INSTALLED_APPS:
        from treenav.admin import GenericMenuItemInline
        inlines.insert(0, GenericMenuItemInline)
For more information, see the corresponding pagelets admin.py and treenav admin.py. Thanks for reading and don't hesitate to post comments if you have any questions!
February 17 2010 by Tobias McNulty
Python and Django are tools we use on a daily basis to build fantastic web apps here at Caktus. I'm pleased to announce that Caktus is sending five developers--Colin, Alex, Mike, Mark, and myself--to PyCon 2010! PyCon is an annual gathering for users and developers of the open source Python programming language. This year the US conference is being held in Atlanta, GA. We'll be driving down tomorrow (Thursday) from Chapel Hill, NC and staying for the conference weekend plus one day of the sprints.
Hope to see you there!
June 09 2009 by Tobias McNulty
As part of my work on EveryWatt, our fledgling energy monitoring web site, I needed a way to consolidate log messages from all the data loggers we have running in a single place. If you're not familiar with it, Python's logging module is good stuff and worth checking out. We already used it for logging to files locally, and the module defines an HTTPHandler that can deliver log messages to a remote server via HTTP.
To implement the Django side, I wrote a lightweight pluggable app to receive the log messages and store them in the database. To use the app, just create an HTTPHandler that points to your Django site, and add it to a logger:
import logging
import logging.handlers

logger = logging.getLogger('mylogger')
http_handler = logging.handlers.HTTPHandler(
    'django.app.hostname:port',
    '/remotelog/your_app_slug/log/',
    method='POST',
)
logger.addHandler(http_handler)
logger.info('testing remote logging')
On the Django side, navigate to /admin/remotelog/logmessage/ and you should have a nice interface (courtesy of the Django admin) to filter, search, and sort log messages as they come in. The app is called django-remotelog, and it's up on Google Code. Check it out, and feel free to comment.
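If you're curious what actually goes over the wire, here is a small sketch using only the standard library. HTTPHandler turns each log record into a dictionary with mapLogRecord() and URL-encodes it for the request. This is written for Python 3 (urllib.parse), the host and record values are placeholders, and it says nothing about how django-remotelog parses the request on the server side:

```python
import logging
import logging.handlers
from urllib.parse import urlencode

# Placeholder host and path; nothing is sent over the network here.
handler = logging.handlers.HTTPHandler(
    'localhost:8000',
    '/remotelog/your_app_slug/log/',
    method='POST',
)

# Build a record by hand instead of going through a logger.
record = logging.LogRecord(
    'mylogger', logging.INFO, 'example.py', 1,
    'testing remote logging', (), None,
)

# mapLogRecord() returns the record's attribute dictionary by default.
payload = handler.mapLogRecord(record)
# The form-encoded body that a POST would carry.
body = urlencode(payload)
```

Fields such as name, levelname, and msg end up as ordinary form fields, so a receiving view can read them from the POST data.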
May 26 2009 by Colin Copeland
By default, Django doesn't do explicit table locking. This is OK for most read-heavy scenarios, but sometimes you need guaranteed, exclusive access to the data. Caktus uses PostgreSQL in most of our production environments, so we can use the various lock modes it provides to control concurrent access to the data. Once we obtain a lock in PostgreSQL, it is held for the remainder of the current transaction. Django provides transaction management, so all we need to do is execute a SQL LOCK statement within a transaction, and Django and PostgreSQL will handle the rest.
Below is an example decorator we came up with to provide easy table-locking access in Django:
from django.db import transaction

LOCK_MODES = (
    'ACCESS SHARE',
    'ROW SHARE',
    'ROW EXCLUSIVE',
    'SHARE UPDATE EXCLUSIVE',
    'SHARE',
    'SHARE ROW EXCLUSIVE',
    'EXCLUSIVE',
    'ACCESS EXCLUSIVE',
)

def require_lock(model, lock):
    """
    Decorator for PostgreSQL's table-level lock functionality

    Example:
        @transaction.commit_on_success
        @require_lock(MyModel, 'ACCESS EXCLUSIVE')
        def myview(request):
            ...

    PostgreSQL's LOCK Documentation:
    http://www.postgresql.org/docs/8.3/interactive/sql-lock.html
    """
    def require_lock_decorator(view_func):
        def wrapper(*args, **kwargs):
            # Reject anything that is not a known PostgreSQL lock mode.
            if lock not in LOCK_MODES:
                raise ValueError('%s is not a PostgreSQL supported lock mode.' % lock)
            from django.db import connection
            cursor = connection.cursor()
            # The lock is held until the surrounding transaction ends.
            cursor.execute(
                'LOCK TABLE %s IN %s MODE' % (model._meta.db_table, lock)
            )
            return view_func(*args, **kwargs)
        return wrapper
    return require_lock_decorator
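One way to make the decorator above easier to test is to pull the validation and SQL-building into a plain function that needs no database connection. This is only a sketch: lock_statement is a hypothetical name, and quoting the table name is an extra precaution of mine rather than something the decorator does:

```python
LOCK_MODES = (
    'ACCESS SHARE',
    'ROW SHARE',
    'ROW EXCLUSIVE',
    'SHARE UPDATE EXCLUSIVE',
    'SHARE',
    'SHARE ROW EXCLUSIVE',
    'EXCLUSIVE',
    'ACCESS EXCLUSIVE',
)


def lock_statement(table_name, mode):
    """Return a PostgreSQL LOCK TABLE statement for table_name.

    Hypothetical helper; raises ValueError for an unknown lock mode.
    """
    if mode not in LOCK_MODES:
        raise ValueError('%s is not a PostgreSQL supported lock mode.' % mode)
    # Double any embedded quotes so the identifier stays a single token.
    quoted = '"%s"' % table_name.replace('"', '""')
    return 'LOCK TABLE %s IN %s MODE' % (quoted, mode)
```

The wrapper would then only need to call cursor.execute(lock_statement(model._meta.db_table, lock)) inside the transaction.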
This is, by no means, a perfect solution. Feel free to comment below.
May 25 2009 by Tobias McNulty
In preparation for migrating the EveryWatt database from one machine to another, I wrote this little WSGI script to easily disable the site while I copy the data. Since it doesn't depend on Django or really anything else (other than a functioning WSGI server), you can use it for other upgrades, too.
This is useful for preventing updates to the database while you, for example, dump the database on one machine and load it on another. With everything else already in place on either side, the user should only see the "Upgrade in progress" message for a few minutes.
Since EveryWatt includes a number of data logger clients that upload utility meter readings to the site through its Open API, I wanted to make sure any POST attempts received a temporary failure message (the data logger will store the data and retry the POST every minute)--hence the 405 Method Not Allowed for all non-GET requests.
Here's the script:
import os
import sys

UPGRADING = False

# Calculate the project path based on the location of the WSGI script.
project_dir = os.path.dirname(__file__)
sys.path.append(project_dir)

def upgrade_in_progress(environ, start_response):
    upgrade_file = os.path.join(project_dir, 'media', 'html', 'upgrade.html')
    if os.path.exists(upgrade_file):
        response_headers = [('Content-type','text/html')]
        response = open(upgrade_file).read()
    else:
        response_headers = [('Content-type','text/plain')]
        response = 'Application upgrade in progress...please check back soon.'
    if environ['REQUEST_METHOD'] == 'GET':
        status = '503 Service Unavailable'
    else:
        status = '405 Method Not Allowed'
    start_response(status, response_headers)
    return [response]

if UPGRADING:
    application = upgrade_in_progress
else:
    os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'
    import django.core.handlers.wsgi
    application = django.core.handlers.wsgi.WSGIHandler()
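Before flipping UPGRADING to True on a live site, it can be reassuring to exercise the callable without a web server. Below is a minimal sketch in Python 3 (where a WSGI body must be bytes) with a trimmed-down copy of the maintenance app and a tiny hypothetical call_app helper; it skips the upgrade.html lookup to stay self-contained:

```python
def upgrade_in_progress(environ, start_response):
    """Simplified copy of the maintenance app: plain-text body only."""
    body = b'Application upgrade in progress...please check back soon.'
    if environ['REQUEST_METHOD'] == 'GET':
        status = '503 Service Unavailable'
    else:
        status = '405 Method Not Allowed'
    headers = [('Content-Type', 'text/plain'),
               ('Content-Length', str(len(body)))]
    start_response(status, headers)
    return [body]


def call_app(app, method):
    """Call a WSGI app with a minimal fake environ; return (status, body)."""
    captured = {}

    def start_response(status, headers):
        # Record what the app reports instead of writing to a socket.
        captured['status'] = status
        captured['headers'] = headers

    body = b''.join(app({'REQUEST_METHOD': method}, start_response))
    return captured['status'], body
```

Calling call_app(upgrade_in_progress, 'GET') and call_app(upgrade_in_progress, 'POST') shows the two status lines a browser and a data logger would receive, respectively.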
And in case you need it, here's one way to dump a PostgreSQL database on one machine while you load it on another, to be run on the new host, as the database superuser:
Good luck and please post your questions/comments.
May 21 2009 by Tobias McNulty
I finally got around to updating my Eclipse, PyDev, and Subclipse environment today, which I use for Django development.
Formerly I was using the SvnKit (pure-Java) libraries. SvnKit "felt" slow to me, compared to my command line SVN client, so this time I tried to get the JavaHL (JNI) libraries working.
For the record, I'm using Ubuntu (jaunty) with Eclipse 3.4 (Ganymede). This version of Ubuntu comes with Subversion 1.5, so I needed to install Subclipse 1.4. See:
http://subclipse.tigris.org/servlets/ProjectProcess?pageID=p4wYuA
I installed everything through the Eclipse update manager (minus SvnKit), but JavaHL didn't show up under Preferences -> Team -> SVN. The error message was: JavaHL (JNI) not available.
I had installed Eclipse manually (not through apt-get), so the solution was to install the JavaHL libraries:
apt-get install libsvn-java
and add the following line to my eclipse.ini (usually in the top level eclipse directory):
-Djava.library.path=/usr/lib/jni
Restart Eclipse, and you should be good to go!
January 13 2009 by Tobias McNulty
Here at Caktus, we use the popular Django web framework for a lot of our custom web application development. We don't use Django simply because it's popular, easy to learn, or happened to be the first thing we found. We've written web apps in PHP, Java, and Ruby on Rails--all before we discovered Django--but were never quite satisfied. Following are just a few of the reasons that we both enjoy working with Django and believe it gives you (the client) the best end-product.
Django is Business-Friendly
Django is open source, free, and published under a "do anything you like" license, so it can be used to create all kinds of products, including proprietary business web apps. In addition to a flexible license, Django has a truly thriving user community and is being constantly improved by web developers like ourselves across the globe.
Built-in Admin Interface
Web application development often starts with the "data model." A data model defines the ways in which all the different pieces of information--such as customer names and addresses or product descriptions--are organized and related to each other in the database. Finding the right data model takes time and it's important to get it right, because a lot of development decisions will be based on the way your information is organized and accessible. When you're building a web application from the ground up--something we do every day at Caktus--you want the flexibility to experiment with your data model and "see" what all the different options look like.
This is where Django's built-in admin interface comes in. From the beginning, Django has included an automatically generated interface that lets you see and edit what's in your database. It knows the structure of your data and puts together a set of search and listing pages and custom web forms for creating, modifying, and otherwise managing your data. It lets you evaluate your data model up front before making a big investment in other parts of your web app. For some sites, the admin interface even makes up a big part of the final product (e.g., for sites that primarily publish content, such as news organizations). And, we've found, the automatically generated admin interface is a powerful tool for showing potential clients what a web app can do.
I Trust Django With My Data
At Caktus we put a strong emphasis on "data integrity." What is data integrity? Kevin already wrote a great post about what it is and why you should care about data integrity. In a nutshell, the "integrity" of data refers to its "completeness" or validity as a whole. For example, you probably want to limit the products that people can order on your web site to those that you actually stock in the warehouse.
Modern "relational database management systems" provide integrity "checks" for your data that verify its appropriateness--based on the conditions you supply--for a given table in the database. When you build a data model in Django, you specify the nature or "type" of each column in your database and can even specify "constraints" on the data that--if your database server supports it--will be enforced at the database level in addition to the application. While this is always a good thing, it's even more important if other programs or users will be connecting to your database in addition to your web app. While Django does this out of the box, another popular web framework requires some under the hood "hacking" to achieve the same peace of mind about your data.
On a side note, in addition to preferring Django for web app development, Caktus also prefers PostgreSQL for data storage. Our friends over at Summersault have already written a good summary describing why PostgreSQL is often the best choice for web app development, so I won't repeat the reasons here. We trust the Django + PostgreSQL combination so much that we even wrote our own CRM and bookkeeping package to keep track of our clients, projects, and all the related financial transactions.
Django is Written in Python
Python is a great language with no shortage of facilities and a huge (and growing) user base. A lot of Google's infrastructure is written in Python, and it is the only language supported by the initial release of their App Engine service. According to python.org:
[Python] offers strong support for integration with other languages and tools, comes with extensive standard libraries, and can be learned in a few days. Many Python programmers report substantial productivity gains and feel the language encourages the development of higher quality, more maintainable code.
Based on Caktus' experience writing Django web apps over the past 1.5+ years, this couldn't be more true.
Separation of Application Components
Django uses a variation of the Model View Controller (MVC) architecture that ensures all the different pieces of your application end up in the right place and, for larger projects, lets people with different skills work on the things they do best without getting in each other's way. Moreover, Django implements its own very simple "template language" for generating web pages. While some may view its simplicity as a curse, it is actually a blessing in disguise: by allowing only very simple constructs in the template, Django forces you to keep your business logic in the controller (what Django calls a "view") where it belongs.
At Caktus, we're not just web developers. We're web engineers with a passion for web apps that not only work, feel, and look great, but also have the capacity to grow, improve, and continue to perform long into the future without breaking the bank. That said, we're truly thrilled about the Python/Django + PostgreSQL combination.