Using OpenStreetMap Data for Routing

Introduction

[Note: I started writing this in March of 2014 but got side-tracked by other things. Since I’ve started working on byCycle again recently and made some updates to how it processes OSM data, I’m finally getting around to publishing this.]

I’m in the process of updating a side project to use OpenStreetMap (OSM) as its primary data source. When I started, I didn’t know anything about the OSM data schema or how to use it for routing. I found some info in the OSM wiki and various other sources, but it all came together after I downloaded some data and started playing with it.

Since I couldn’t find a concise tutorial on this topic, I figured I’d write up what I learned so that it might help others who are interested in creating their own routing engine based on OSM data (mostly for fun, probably not for profit). The focus here is less on the technologies I used (Python 3 & PostgreSQL w/ PostGIS) and more on the data and how to process it for routing.

The basic steps are:

  1. Get familiar with OSM terminology
  2. Get some data (in JSON format)
  3. Extract the relevant parts of the data, massage it a bit, and load it into a database
  4. Create a graph that can be used for routing from the data loaded into the database

In the end, it’s pretty straightforward, but there are a few tricky bits, and that’s essentially what this post is about.

Terminology

  • Nodes – These are used to represent street intersections, points of interest, and also the line geometry of ways. Nodes are composed of an ID, a latitude & longitude, and maybe some tags.
  • Ways – These are used to represent roads, paths, buildings, and anything else that has a shape. Ways are composed of an ID, a sequence of node references, and some tags. Ways representing streets often span multiple real-world street segments. Creating a topologically correct graph that can be used for routing requires splitting such ways where they intersect with other ways. There are other tags/hints in the OSM data that can be used for this too, but that’s not covered here.
  • Tags – These are key/value pairs that are attached to nodes, ways, and relations. As an example, many ways have a ‘highway’ tag, which indicates that they are streets (or some other kind of path). Technically, tags can be anything and anyone can add new tags willy-nilly. In practice, the common tags are fairly standardized, but the fact that they aren’t totally standardized adds some complexity when using OSM data for routing.
  • Relations – These can be used to specify a relationship between a set of nodes and/or ways (e.g., a bus route). I’m going to ignore relations entirely in the rest of this article.

Note: some details elided.

Get Some Data

In the beginning, I was manually exporting data from the main OSM site. There are a few problems with this approach:

  • You can only download a small amount of data this way.
  • You can’t filter the data. You probably don’t want *all* the data in a given region, since it’s not all useful for routing.
  • It’s discouraged in the OSM documentation.
  • It’s manual.

Eventually, I stumbled across XAPI, which is an extended version of the main OSM API that lets you download filtered XML data. Free XAPI Web services were provided by MapQuest and others.

Fast forward a few years and XAPI is deprecated and the Overpass API/QL is now recommended. The Overpass API can output JSON instead of XML, which means smaller/faster downloads and simpler code for processing the OSM data.

For example, the following query will download data for a small area around downtown Portland, Oregon. It will include only ways that have a ‘highway’ tag along with the nodes associated with those ways:

http://overpass-api.de/api/interpreter?data=[out:json];(way[highway](45.52036,-122.67279,45.52542,-122.66281);>;);out;

This requests ways with a highway tag that are inside a bounding box specified as south, west, north, and east coordinates. The ‘highway’ tag is used to indicate streets, highways (AKA motorways), cycle tracks, sidewalks, “foot paths”, stairs, etc. My understanding is that any way that can be walked, biked, roller-bladed, or driven on will have a ‘highway’ tag.

The > after the coordinates tells Overpass to recurse downward, which is how you get it to return all the associated ways and nodes in the bounding box. This bit with recursion to get “completed ways” was the trickiest part of understanding the Overpass API/QL. I highly recommend using the Overpass Turbo tool to test your queries within a small bounding box to make sure you’re getting all the data you expect.

Note: It’s possible to do much more complex Overpass QL queries, but the relatively simple query above works for my purposes.

Code for fetching JSON data from the Overpass API
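
If you just want the gist of it, here’s a minimal sketch of the fetch using only the standard library (the function name and bounding box handling are illustrative, not the actual code linked above):

import json
import urllib.parse
import urllib.request

OVERPASS_URL = 'http://overpass-api.de/api/interpreter'


def fetch_highways(bbox):
    """Fetch ways tagged `highway` (and their nodes) inside a bounding box.

    ``bbox`` is (south, west, north, east), matching the query shown above.
    """
    query = '[out:json];(way[highway]({},{},{},{});>;);out;'.format(*bbox)
    url = '{}?{}'.format(OVERPASS_URL, urllib.parse.urlencode({'data': query}))
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))


# Example: the downtown Portland bounding box from the URL above
# data = fetch_highways((45.52036, -122.67279, 45.52542, -122.66281))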

Load the Data

I found osm2pgsql early on and thought this part was going to be super easy, but osm2pgsql is oriented toward cartography and doesn’t insert any node data into the database [note: I originally wrote this in 2014, so that may or may not still be true]. That makes it somewhat more difficult to use for routing (you could use spatial analysis to figure out the topology, but that seems like a lot of extra work when the topology is already well defined in the OSM data).

Instead, I wrote my own loader. Reading an OSM JSON file downloaded from the Overpass API is straightforward since the structure of the data is pretty simple:

{
  "version": 0.6,
  "generator": "Overpass API 0.7.54.12 054bb0bb",
  "osm3s": {
    "timestamp_osm_base": "2017-12-13T05:00:00Z",
    "copyright": "The data included in this document is from www.openstreetmap.org. The data is made available under ODbL."
  },
  "elements": [
    {
      "type": "node",
      "id": 5079801671,
      "lat": 45.5350519,
      "lon": -122.6565833,
      "tags": {
        "highway": "traffic_signals"
      }
    },
    {
      "type": "node",
      "id": 40624981,
      "lat": 45.5459387,
      "lon": -122.6565138
    },

    ... more nodes ...

    {
      "type": "way",
      "id": 48268202,
      "nodes": [
        5079801671,
        1396604495,
        4673994633,
        4673994632,
        40601347,
        3850595129,
        40624967,
        40624969,
        40624973,
        40616052,
        4922848332,
        40624976,
        4922848331,
        40624979,
        40606102,
        40624981
      ],
      "tags": {
        "bicycle": "designated",
        "highway": "residential",
        "maxspeed": "25 mph",
        "name": "Northeast 9th Avenue",
        "sidewalk": "both"
      }
    },

    ... more ways ...
  ]
}


Here we can see that way 48268202 has several associated nodes and some tags. The first and last nodes are intersections. The other nodes define the way’s line geometry and might also be intersections (see below). The tags indicate that it’s a residential street that has sidewalks on both sides and is designated as part of the bicycle network.

Notably, this example way spans several blocks, so it has to be split into its component blocks before it can be used for routing (technically it doesn’t have to be, but doing so enables better routing).

The loader first goes through all the nodes in the JSON data and determines if they are intersections. Nodes at the start and end of a way are always considered intersections (although thinking about it now, that might not actually be correct). In addition, nodes that are shared by two or more ways are considered intersections. All of the nodes are inserted into a temporary database table (so they can be used later for street geometry). The nodes that correspond to intersections are inserted into a separate, persistent table.
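
Here’s a rough sketch of that intersection-finding step (illustrative only; the real loader also writes the nodes to the database as it goes):

import json
from collections import Counter


def find_intersections(osm_json_path):
    """Return the set of node IDs that should be treated as intersections."""
    with open(osm_json_path) as fp:
        elements = json.load(fp)['elements']

    ways = [el for el in elements if el['type'] == 'way']

    # Count how many ways reference each node
    ref_count = Counter()
    for way in ways:
        ref_count.update(way['nodes'])

    intersections = set()
    for way in ways:
        node_ids = way['nodes']
        # Start and end nodes are always treated as intersections
        intersections.add(node_ids[0])
        intersections.add(node_ids[-1])
        # Interior nodes shared by two or more ways are intersections too
        intersections.update(
            node_id for node_id in node_ids[1:-1] if ref_count[node_id] > 1)

    return intersections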

After the nodes are processed, the loader goes through all the ways in the JSON data, extracts specific tags, and inserts “street” records into the database. The geometry for each way/street is created by looking up the corresponding nodes in the temporary node table, putting them into the correct order, and creating a line geometry object that can be inserted into a PostGIS geometry column.
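
The geometry step boils down to string formatting once you have the node coordinates in hand. Here’s one way to do it, using EWKT that PostGIS can ingest directly (a sketch; the actual loader may build the geometry differently):

def way_linestring_ewkt(way, node_coords):
    """Build an EWKT LINESTRING for a way.

    ``node_coords`` maps node ID -> (latitude, longitude).
    """
    points = (node_coords[node_id] for node_id in way['nodes'])
    coords = ', '.join('{} {}'.format(lon, lat) for lat, lon in points)
    # EWKT like this can be inserted directly into a PostGIS geometry column
    return 'SRID=4326;LINESTRING({})'.format(coords)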

Code for loading OSM data into a Postgres/PostGIS database

Create a Graph and Do Some Routing

Technically, there’s an implicit graph structure in the database, but for now I’m not using it directly. The eventual goal is to migrate to a graph database such as Neo4j or maybe use pgRouting. For now, I’m using a simple library named Dijkstar to build a graph and perform shortest path queries.

The graph creation script reads the street table and adds an edge in each direction for two-way streets and a single edge for one-way streets. The edges are simply tuples of street attributes such as length that can be used by a cost function that’s passed into the path-finding function. The graph is essentially just a dictionary, and it’s saved using marshal from the Python standard library.
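
Here’s the general shape of that, assuming Dijkstar’s Graph/find_path interface. The street fields (start node, end node, length, one-way flag) are illustrative, not the actual schema:

from dijkstar import Graph, find_path


def build_graph(streets):
    """Build a routing graph from street records.

    ``streets`` is an iterable of (start_node_id, end_node_id, length, oneway)
    tuples (illustrative fields, not the actual schema).
    """
    graph = Graph()
    for start, end, length, oneway in streets:
        edge = (length,)
        graph.add_edge(start, end, edge)
        if not oneway:
            # Two-way streets get an edge in each direction
            graph.add_edge(end, start, edge)
    return graph


def length_cost(u, v, edge, prev_edge):
    # Simplest possible cost function: just the edge length
    return edge[0]


# path_info = find_path(graph, start_node_id, end_node_id, cost_func=length_cost)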

At this point, the graph can be loaded and queried for a shortest path by specifying a pair of node IDs, but that isn’t super user-friendly. Check out the route service in byCycle to see an example of how it can be used in “real” (i.e., side-project, just-for-fun) code.

Code for creating a graph from data loaded into the database

Code

All of the code used to do the processing described above is free/open source and can be viewed on GitHub.

Finding the Oddity in a Sequence

After reading Ned Batchelder’s Iter-tools for puzzles: oddity post yesterday, my first thought was to use itertools.groupby(). That version is the fastest I could come up with (by quite a bit actually, especially for longer sequences), but it requires sorting the iterable first, which uses additional space and won’t work with infinite sequences.

My next thought was to use a set to keep track of seen elements, but that requires the keys to be hashable, so I scrapped that idea.

I figured using a list to keep track of seen elements wouldn’t be too bad if the seen list was never allowed to grow beyond two elements. After playing around with this version for a while, I finally came up with something on par with Ned’s version performance-wise while meeting the following objectives:

  • Don’t require keys to be hashable
  • Don’t store the elements read from the iterable
  • Support infinite sequences

Some interesting things:

  • Rearranging the branches to use in instead of not in sped things up more than I would have thought
  • Initially, I was checking the lengths of the seen and common lists on every iteration, which slowed things down noticeably (which isn’t all that surprising really)
  • Setting the key function to the identity function lambda v: v is noticeably slower than checking to see if it’s set on every iteration (also not surprising)
  • Using sets/dicts for seen, common, and uncommon didn’t make a noticeable difference (I tried adding a hashable option and setting things up accordingly, but it wasn’t worth the additional complexity)

Here’s the code (the gist includes a docstring, tests, and a simplistic benchmark):

def oddity(iterable, key=None, first=False):
    seen = []
    common = []
    uncommon = []

    for item in iter(iterable):
        item_key = key(item) if key else item

        if item_key in seen:
            if item_key in common:
                if first and uncommon:
                    return item_key, uncommon[1]
            else:
                common.append(item_key)
                if len(common) == 2:
                    raise TooManyCommonValuesError

                if item_key in uncommon:
                    i = uncommon.index(item_key)
                    j = i + 2
                    uncommon[i:j] = []
        else:
            seen.append(item_key)
            if len(seen) == 3:
                raise TooManyDistinctValuesError
            uncommon.extend((item_key, item))

    if len(seen) == 0:
        raise EmptyError

    if len(common) == 0:
        raise NoCommonValueError

    if len(uncommon) == 0:
        uncommon_value = None
    else:
        uncommon_value = uncommon[1]

    return common[0], uncommon_value

Release Early, Release Often

It’s an old saw, but I was wondering today why some projects don’t cut releases more often. The repo for a project may contain the bug fix you need, but it’s just sitting there on GitHub. I think it often just comes down to the fact that making releases is tedious.

You have to update the version number (perhaps in multiple places), update the change log (hopefully), merge your development branch into your release/master branch, create a tag, clean your dev environment, build a distributable package, upload that package, maybe upload some docs, push some commits, etc.

Doing all that manually isn’t much fun, so…

Write a script to do it for you.

Write it in Python or Bash or as a make target or whatever floats your boat. It’s a one-time cost that pays off big.

You can’t quite automate everything–like writing a (good) change log–but you can automate most of the process.
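
To make that concrete, here’s a bare-bones sketch of the shape such a script can take (the commands and version handling are placeholders, not my actual release script):

#!/usr/bin/env python
"""Bare-bones release sketch: tag, build, upload, push."""
import glob
import subprocess
import sys


def run(*cmd):
    print('+', ' '.join(cmd))
    subprocess.check_call(cmd)


def release(version):
    run('git', 'tag', '-a', version, '-m', 'Release {}'.format(version))
    run('python', 'setup.py', 'sdist', 'bdist_wheel')
    run('twine', 'upload', *glob.glob('dist/*{}*'.format(version)))
    run('git', 'push', '--follow-tags')


if __name__ == '__main__':
    release(sys.argv[1])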

As an example, I wrote this release script for a project I started a month and a half ago. I’ve already made 13 15 26 alpha releases because it’s so easy to do. Putting in an hour or two up front was well worth it.

If you’re feeling lazy, you can use something like zest.releaser (for Python projects). I’ve used it in the past and it’s been the inspiration for all the release scripts I’ve written since.

PEP 420 Namespace Packages – Somewhat Surprising

Update 2014/03/16: I submitted a better different patch, which was partially merged, but there’s still an ongoing discussion about the best way to make find_packages() support PEP 420.

Earlier today I submitted a patch to setuptools that adds support for PEP 420 namespace packages (“NS packages”) to find_packages(). In the process, I learned a few things about NS packages that I found somewhat surprising.

My initial “intuitive” understanding was that only directories that either contain one or more .py files or contain nothing but other directories would be considered NS packages. I also thought that NS packages couldn’t be nested under regular packages (i.e., those with an __init__.py). I’m not exactly sure how I arrived at this understanding, but it’s what seemed to make sense before I dug into this.

The first thing I found surprising is that NS packages can be nested under regular packages. I couldn’t figure out what the use cases for this would be (which, of course, doesn’t mean there aren’t any). One potential problem with this is that if you have a directory with package data that’s not a Python package, it can be imported as a package, whereas in the pre-420 days you’d get an ImportError.

The second surprising thing, which is a generalization of the first, is that any subdirectory of a directory that’s on sys.path can be imported as an NS package. This is probably useful in certain scenarios, but it can also cause issues in other scenarios.

For example, if find_packages() emulated the above behavior, by default *every* subdirectory in a source tree would be considered a package and included in the package list and therefore in a distribution, which is often (maybe usually) undesirable.

If you’re using vanilla distutils, this isn’t an issue, since you have to explicitly list all of the packages in a distribution. But that can be super tedious, it’s easy to forget to add new packages to setup(), and a lot of packages on PyPI already use find_packages() anyway.

One thought I had was that an explicit __not_a_namespace__ marker file could be added to directories that shouldn’t be considered NS packages (or maybe __package_data__ would be a better name). This is almost certainly a non-starter though, because it could lead to a lot of empty files cluttering up your source tree (plus, you might forget to add the marker, so it doesn’t really help with that aspect).

My patch for find_packages() adds the ability to explicitly specify the packages you want to include using wild card patterns. In the following example, the mypkg directory and all of its subdirectories will be included as packages (when running Python 3.3+):

project/
    docs/
        index.rst
    mypkg/
        mysubpkg/
            __init__.py
        xxx/
            some-file
        mymod.py
    setup.py

# setup.py
from setuptools import setup, find_packages

setup(
    packages=find_packages(include=['mypkg*']),
    ...
)

This goes part of the way toward making sure only the appropriate directories are included as packages in a distribution. In simple cases, it will be sufficient by itself. In other cases, it might be necessary to exclude certain directories:

setup(
    packages=find_packages(
        include=['mypkg*'],
        exclude=['mypkg.xxx']),
    ...
)

This is a bit more complex than the way things used to be–where you could almost always simply say packages=find_packages() without thinking about it–but I guess that’s the price of new features and functionality.

Update: I thought of a little hack for explicitly marking non-package directories–name them with an extension (e.g., some.data). They will then become unimportable, and find_packages() already skips directories with dots in their names.

Travis CI, Python 3.3, Buildout

I had the hardest time configuring Travis CI for several Python 3.3 packages. I think it may have been because they’re all PEP 420 namespace packages. The strange thing is that the tests would pass for some of the packages but not others.

I ended up creating a package (which I lovingly named travisty) so my repos wouldn’t be cluttered with hundreds of superfluous commits recording my failed attempts to get things working. I got pretty frustrated and almost gave up on Travis entirely.

There was a post on Planet Python recently about problems with Python 3 and virtualenv, and I’m guessing the issue is something related to that and not Travis specifically, but I never did actually figure it out.

Instead, I tried installing my packages using Buildout (which I prefer over virtualenv for development anyway), and that magically worked. It kind of bugs me that I don’t know why it works and the bare virtualenv doesn’t, but at this point I’m just happy that it does.

Here’s the .travis.yml I ended up with:

language: python
python:
  - "3.3"
install:
  - pip install zc.buildout
  - git clone git://github.com/TangledWeb/tangled ../tangled
  - buildout
script:
  - ./bin/tangled test

The tangled test script is just a tiny wrapper around the built-in unittest discovery. It’s equivalent to ./bin/python -m unittest discover my/package.

What I Hate About Python

During an interview a while back, I was asked to name some things I hate about Python. For some reason, I choked and couldn’t think of a good answer (I kind of wanted to blame the interview process, but that’s a rant for another time).

Maybe I’ve just been programming in Python for too long, and that’s why I couldn’t think of something (or maybe I’m just a massive Python fanboy). On the other hand, I’ve been programming in JavaScript (which I generally like) for about as long, and I can think of at least a few things right off the top of my head (mostly related to weak typing).

I did a search to see what other people don’t like about Python to get some inspiration, but I didn’t come across anything I truly hate.

Things That Don’t Bother Me

  • Significant whitespace. I love it.
  • Explicit self. I guess it would be “convenient” if I didn’t have to add self to every method signature, but I really don’t spend much time on that, and it takes about 1ns to type (in fact, my IDE fills it in for me). There are technical and stylistic considerations here, but the upshot for me is that it just doesn’t matter, and I actually like that all instance attribute access requires the self. prefix.
  • “Crippled” lambda. There are rare occasions where I want to define more complex anonymous functions, but there’s no loss of expressiveness from having to use a “regular” named function instead. Maybe multi-line anonymous functions that allow statements would lead to different/better ways of thinking about programs, but I’m not particularly convinced of that. (Aside: one thing I do hate relating to this is the conflation of lambdas and closures–normal functions are closures in the same way that lambdas are.)
  • Packaging. I don’t know why, but I’ve never had any problems with setuptools. There are some issues with the installation of eggs when using easy_install, but I think pip fixes them. I am glad that setuptools is now being actively developed again and the distribute fork is no longer necessary.
  • Performance, GIL, etc. I’ve used Python for some pretty serious data crunching (hello, multiprocessing) as well as for Web stuff. There are cases where something else might have been faster, but Python has almost never been too slow (caveat: for my use cases). Of course, Python isn’t suitable for some things, but for most of the things I need to do, it’s plenty fast enough.
  • len(), et al. I don’t have anything to say about this other than it’s a complete non-issue for me. Commentary about how this means Python isn’t purely object-oriented makes me a little cranky.

Things That Bug Me a Little Bit

  • The way super works in Python 2 is kind of annoying (being required to pass the class and self in the super() call). This is fixed in Python 3, where you can just say super().method() in the common case.
  • Unicode vs bytes in Python 2. This is also fixed in Python 3 (some people have argued that it’s not, but I haven’t run into any issues with it yet (maybe it’s because I’m working on a Python 3 only project?)).
  • The implicit namespace package support added in Python 3.3 causes some trouble for my IDE (PyCharm), but I’m assuming this is a temporary problem. I also had some trouble using nose and py.test with namespace packages. Again, I assume (hope) this is only temporary.

Things That Bother Me a Little More

  • The Python 2/3 gap is a bit troublesome. Sometimes I think the perception that there’s a problem may be more of a problem, but I don’t maintain any major open source projects, so I’m not qualified to say much about this. Personally, I’ve really been enjoying Python 3, and I do think it offers some worthwhile advantages over Python 2.

Conclusion

There isn’t one. I left some things out intentionally (various quirks). I probably forgot some things too.

I’m Working on a Python 3 (only) Web Framework

I’ve been working on a Python 3 (3.3+) only Web Framework for the past few weeks. My initial motivation was to try to build something “RESTful” that doesn’t use the concepts (or terms) “view” or “controller” at all.

That’s because I don’t even know what a “view” is (or is supposed to be). I’ve tried to think of views as “representations” (i.e., in the sense of representing some data as a particular content type like HTML or JSON), but in the real world view functions are often used to implement business logic, and the term “view” seems misleading (“action” seems more appropriate in this case). The term view is also used by some frameworks to refer to templates (e.g., Rails), and this usage actually makes a bit more sense to me.

If you’re just throwing some code together, maybe this type of thing doesn’t matter too much, but when things get complex (as they tend to) and you’re looking for ways to keep things simple (in terms of concepts, code organization, etc), I think it can make a big difference. Your core abstractions inform (and possibly limit) your thinking; they can also limit, to an extent, what’s possible or at least easily doable.

A different way to say this is that “views” and “controllers” don’t really fit my brain, so I’m experimenting with something different. I actually tried to implement something like this on top of Pyramid at one point (pyramid_restler), but I got hung up trying to build on top of views (although it’s still useful, IMO).

So, I set out to create a new framework that uses resources and representations as its main abstractions. Resources are things that are mounted at a particular path, respond to HTTP methods, and return data that gets represented according to the requested content type (i.e., as specified by the Accept header (note to self: should look at Accept-* headers as well)).

I’ve also put some effort into making the framework “correct” in terms of HTTP status codes and other RESTful concepts. But I’m finding that it’s not so easy to create a general purpose framework that handles every scenario right out of the box. For example, it’s okay (according to the RFC) to return various status codes for a POST or a PUT depending on the circumstances, so you can’t just say, “Oh, it’s a POST, so we’ll automatically return a 201 Created.”

I’m not sure how to completely generalize HATEOAS either. For certain scenarios you can say things like, “This container resource has these items, so generate a set of links to the items and include that in the representation.” But I’m not sure how you could cover every link scenario in a generic manner. If certain constraints are enforced, maybe… (maybe I just haven’t thought hard enough about this…)

That said, I think it’s possible to create a framework that at least nudges people in the “right” direction (or at least a particular direction).

So that’s the gist of it. As it matures, I plan to post more about it (including links to the source, which will be MIT licensed). For now, I’m mostly avoiding the term RESTful and just calling it resource-oriented.

Python 3

This is the first project that I’ve really used Python 3 for, and it’s been fun playing with all the shiny new stuff (hello, built in namespace packages). I was planning to say more about the choice of Python 3 (3.3+ in particular) in this post, but I think it’s gotten a bit long already. I’ll be writing another post about that soon, including the issues I’ve run into (or lack thereof).

Generating Random Tokens (in Python)

Update 1/13: After reading the comments and thinking about it some more, I think binascii.hexlify(os.urandom(n)) is the easiest way to generate random tokens, and random = random.SystemRandom(); ''.join(random.choice(alphabet) for _ in range(n)) is better when you need a string that contains only characters from a specific alphabet. Pyramid uses the former approach; Django uses the latter.
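
In code, those two approaches look something like this (a quick sketch; the token lengths and alphabet are arbitrary):

import binascii
import os
import random
import string


def hex_token(nbytes=20):
    """Random token encoded as hex (the os.urandom + hexlify approach)."""
    return binascii.hexlify(os.urandom(nbytes)).decode('ascii')


def alphabet_token(length=40, alphabet=string.ascii_letters + string.digits):
    """Random token drawn from a specific alphabet, using the OS's CSPRNG."""
    sys_random = random.SystemRandom()
    return ''.join(sys_random.choice(alphabet) for _ in range(length))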

I’m working on a web site where I need to generate random CSRF tokens. After digging around a bit, I found os.urandom(n), which returns “a string of n random bytes suitable for cryptographic use.” Okay, that sounds good… except that it can include bytes that aren’t “web safe”.

So I needed a way to encode the output of urandom. I poked around some more and saw binascii.hexlify(data) being used for this purpose (in Pyramid). For some reason, though, I thought it would be “clever” to hash the output from urandom like so: hashlib.sha1(os.urandom(128)).hexdigest().

What I like about this is that no matter how many bytes you request from urandom (assuming more bytes means more entropy), you always end up with a 40 character string that’s safely encoded.

I’m not sure if this provides any real benefit though (in terms of increased security). Are there better ways to generate random tokens?

Another thought I had was to use bcrypt.gensalt() and use its output as is–it uses urandom to generate the initial salt, which is then hashed, and also returns a fixed number of bytes (29).

On a slightly related note, I recently needed to generate a new PIN. My first thought was to reuse a PIN I use elsewhere, but of course that’s a bad idea. My second thought was to use KeePassX to generate one. I happened to have a calculator sitting next to me (one with big buttons); I closed my eyes and banged on it a bit to generate the PIN.

Wildwood Trail from Germantown Road to Newberry Road

Date: July 17, 2012
Elevation: N/A
Distance: 11 miles
Time to top: 1 hour 30 minutes
Time back: 1 hour 20 minutes
Weather: Partly cloudy, not too hot
Gear: Running shoes, small Camelbak pack with 3 liters of water

This was a nice, rolling hike through the northern part of Forest Park. This section is all forested except for two (?) clearings that take about thirty seconds to traverse. There’s no significant elevation change to speak of, though there are a couple of moderately steep sections.

Wildwood Trail, Newberry Road trailhead