Introduction

[Note: I started writing this in March of 2014 but got side-tracked by other things. Since I’ve started working on byCycle again recently and made some updates to how it processes OSM data, I’m finally getting around to publishing this.]

I’m in the process of updating a side project to use OpenStreetMap (OSM) as its primary data source. When I started, I didn’t know anything about the OSM data schema or how to use it for routing. I found some info in the OSM wiki and various other sources, but it all came together after I downloaded some data and started playing with it.

Since I couldn’t find a concise tutorial on this topic, I figured I’d write up what I learned so that it might help others who are interested in creating their own routing engine based on OSM data (mostly for fun, probably not for profit). The focus here is less on the technologies I used (Python 3 & PostgreSQL w/ PostGIS) and more on the data and how to process it for routing.

The basic steps are:

  1. Get familiar with OSM terminology
  2. Get some data (in JSON format)
  3. Extract the relevant parts of the data, massage it a bit, and load it into a database
  4. Create a graph that can be used for routing from the data loaded into the database

In the end, it’s pretty straightforward, but there are few tricky bits, and that’s essentially what this post is about.

Terminology {#terminology}

  • Nodes — Nodes are used to represent street intersections, points of interest, and also the line geometry of ways. Nodes have an ID, a latitude & longitude, and maybe some tags.
  • Ways — Ways are used to represent roads, paths, buildings, and anything else that has a shape. Ways have an ID, a sequence of node references, and tags. Ways representing streets often span multiple real-world street segments. Creating a topologically correct graph that can be used for routing requires splitting such ways where they intersect with other ways. There are other tags/hints in the OSM data that can be used for this too, but that’s not covered here.
  • Tags — Tags are key/value pairs that are attached to nodes, ways, and relations. As an example, many ways have a “highway” tag, which indicates that they are streets or some other kind of path. Tags can be anything and anyone can add new tags willy-nilly, but they seem to be fairly standardized.
  • Relations — These can be used to specify a relationship between a set of nodes and/or ways (e.g., a bus route). I’m going to ignore relations entirely in the rest of this post.

Get Some Data {#get-some-data}

In the beginning, I was manually exporting data from the main OSM site. There are a few problems with this approach:

  • You can only download a small amount of data this way.
  • You can’t filter the data. You probably don’t want all the data in a given region, since it’s not all useful for routing.
  • It’s discouraged in the OSM documentation.
  • It’s manual.

Eventually, I stumbled across XAPI, which is an extended version of the main OSM API that lets you download filtered XML data. Free XAPI Web services are were provided by MapQuest and others.

Fast forward a few years and the Overpass API/QL is now recommended while XAPI is deprecated. The Overpass API can output JSON instead of XML, which means smaller/faster downloads and simpler code for processing the OSM data.

For example, the following query will download data for a small area around downtown Portland, Oregon. It will include only ways that have a “highway” tag along with the nodes associated with those ways:

http://overpass-api.de/api/interpreter?
    data=[out:json];
    (way[highway](45.52036,-122.67279,45.52542,-122.66281);>;);
    out;

[Note: Query split over multiple lines for clarity.]

This requests ways with a highway tag that are inside a bounding box specified as south, west, north, and east coordinates. The highway tag is used to indicate streets, highways (AKA motorways), cycle tracks, sidewalks, foot paths, stairs, etc. My understanding is that any way that can be walked, biked, roller-bladed, or driven on will have a highway tag.

The > after the coordinates tells Overpass to recurse downward, which is how you get it to return all the associated ways and nodes in the bounding box. This bit with recursion to get “completed ways” was the trickiest part of understanding the Overpass API/QL. I highly recommend using the Overpass Turbo tool to test your queries within a small bounding box to make sure you’re getting all the data you expect.

Code for fetching JSON data from the Overpass API

Load the Data {#load-the-data}

I found osm2pgsql early on and thought this part was going to be super easy, but osm2pgsql is oriented toward cartography and doesn’t insert any node data into the database [note: I originally wrote this in 2014, so that may or may not still be true]. That makes it somewhat more difficult to use for routing (you could use spatial analysis to figure out the topology, but that seems like a lot of extra work when the topology is already well defined in the OSM data).

Instead, I wrote my own loader. Reading an OSM JSON file downloaded from the Overpass API is straightforward since the structure of the data is pretty simple:

{
  "version": 0.6,
  "generator": "Overpass API 0.7.54.12 054bb0bb",
  "osm3s": {
    "timestamp_osm_base": "2017-12-13T05:00:00Z",
    "copyright": "The data included in this document is from www.openstreetmap.org. The data is made available under ODbL."
  },
  "elements": [
    {
      "type": "node",
      "id": 5079801671,
      "lat": 45.5350519,
      "lon": -122.6565833,
      "tags": {
        "highway": "traffic_signals"
      }
    },
    {
      "type": "node",
      "id": 40624981,
      "lat": 45.5459387,
      "lon": -122.6565138
    },

    ... more nodes ...

    {
      "type": "way",
      "id": 48268202,
      "nodes": [
        5079801671,
        1396604495,
        4673994633,
        4673994632,
        40601347,
        3850595129,
        40624967,
        40624969,
        40624973,
        40616052,
        4922848332,
        40624976,
        4922848331,
        40624979,
        40606102,
        40624981
      ],
      "tags": {
        "bicycle": "designated",
        "highway": "residential",
        "maxspeed": "25 mph",
        "name": "Northeast 9th Avenue",
        "sidewalk": "both"
      }
    },

    ... more ways ...
}

Way 48268202 has several associated nodes and some tags. The first and last nodes are intersections. The other nodes define they way’s line geometry and might also be intersections (see below). The tags indicate that it’s a residential street that has sidewalks on both sides and is designated as part of the bicycle network.

This example way spans several blocks, so it has to be split into its component blocks before it can be used for routing.

The loader first goes through all the nodes in the JSON data and determines if they are intersections. Nodes at the start and end of a way are always considered intersections (although thinking about it now, that might not actually be correct 🤔). In addition, nodes that are shared by two or more ways are considered intersections. All of the nodes are inserted into a temporary database table (so they can be used later for street geometry). The nodes that correspond to intersections are inserted into a separate, persistent table.

After the nodes are processed, the loader goes through all the ways in the JSON data, extracts specific tags, and inserts “street” records into the database. The geometry for each way/street is created by looking up the corresponding nodes in the temporary node table, putting them into the correct order, and creating a line geometry object that can be inserted into a PostGIS geometry column.

Code for loading OSM data into a Postgres/PostGIS database

Create a Graph and Do Some Routing {#graph-and-routing}

Technically, there’s an implicit graph structure in the database, but for now I’m not using it directly. The eventual goal is to migrate to a graph database such as Neo4j or maybe use pgRouting. For now, I’m using a simple library named Dijkstar to build a graph and perform shortest path queries.

The graph creation script reads the street table and adds an edge in each direction for two-way streets and a single edge for one-way streets. The edges are simply tuples of street attributes such as length that can be used by a cost function that’s passed into the path-finding function. The graph is essentially just a dictionary, and it’s saved using marshal from the Python standard library.

At this point, the graph can be loaded and queried for a shortest path by specifying a pair of node IDs, but that isn’t super user-friendly. Check out the route service in byCycle to see an example.

Code for creating a graph from data loaded into the database

Code

All of the code used to do the processing described above is free/open source and can viewed on GitHub.