Setting up a high performance Geocoder

Midhun Krishna

August 21, 2018

One of our applications uses geocoding extensively. When we started the project, we included the excellent Geocoder gem, and set Google as the geocoding backend. As the application scaled, its geocoding requirements grew and soon we were looking at geocoding bills worth thousands of dollars.

An alternative Geocoder

Our search for an alternative geocoder led us to Nominatim. Written in C, with a PHP web interface, Nominatim was performant enough for our requirements. Once set up, it needed 8GB of RAM to run, and that figure includes the memory used by PostgreSQL (with PostGIS).

The rest of this post covers how to set up Nominatim, the tips and tricks we learned along the way, and how it compares with the geocoding solution offered by Google.

Setting up Nominatim

We started off by looking for an Amazon Machine Image with Nominatim already set up. We could find only one, hosted by OpenStreetMap, and its magnet link was dead.

Next, we went through the official installation document. We decided to give Docker a shot and found that there are many Nominatim Docker builds. We used https://github.com/merlinnot/nominatim-docker since it seemed to follow all the steps mentioned in the official installation guide.

Issues faced during Setup

Out of Memory Errors

The official documentation recommends 32GB of RAM for the initial import, but we had to double that to 64GB to make it work.

Failed builds caused a second problem. Each run generates a large amount of data, and because Docker caches layers across builds, every failed docker build left that data behind. Subsequent builds then ran out of disk space.

Merging Multiple Regions

We wanted to geocode locations in the USA, Mexico, Canada and Sri Lanka. The USA, Mexico and Canada are all included in the North America data extract, but we had to merge the Sri Lanka extract into it to get a single file in the format required for the initial import.

The following snippet pre-processes the map data for North America and Sri Lanka into a single data.osm.pbf file that can be used directly by the Nominatim installer.

RUN curl -L 'http://download.geofabrik.de/north-america-latest.osm.pbf' \
    --create-dirs -o /srv/nominatim/src/north-america-latest.osm.pbf
RUN curl -L 'http://download.geofabrik.de/asia/sri-lanka-latest.osm.pbf' \
    --create-dirs -o /srv/nominatim/src/sri-lanka-latest.osm.pbf

RUN osmconvert /srv/nominatim/src/north-america-latest.osm.pbf \
    -o=/srv/nominatim/src/north-america-latest.o5m
RUN osmconvert /srv/nominatim/src/sri-lanka-latest.osm.pbf \
    -o=/srv/nominatim/src/sri-lanka-latest.o5m

RUN osmconvert /srv/nominatim/src/north-america-latest.o5m \
    /srv/nominatim/src/sri-lanka-latest.o5m \
    -o=/srv/nominatim/src/data.o5m

RUN osmconvert /srv/nominatim/src/data.o5m \
    -o=/srv/nominatim/src/data.osm.pbf

Slow Search times

Once the installation was done, we tried running simple location searches, but they timed out. Nominatim's web interface can print detailed debugging information for a search when &debug=true is appended to the query:

# from
https://nominatim.openstreetmap.org/search.php?q=New+York&polygon_geojson=1&viewbox=
# to
https://nominatim.openstreetmap.org/search.php?q=New+York&polygon_geojson=1&viewbox=&debug=true
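
When the debug flag has to be added to many search URLs, it can be done programmatically. The snippet below is a minimal sketch using only Ruby's standard library; the helper name with_debug is our own invention and not part of Nominatim or the Geocoder gem.

```ruby
require 'uri'

# Hypothetical helper: returns the given Nominatim search URL
# with debug=true set exactly once in its query string.
def with_debug(url)
  uri = URI.parse(url)
  params = URI.decode_www_form(uri.query || '')
  params.reject! { |key, _value| key == 'debug' } # avoid a duplicate flag
  params << ['debug', 'true']
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

puts with_debug('https://nominatim.openstreetmap.org/search.php?q=New+York&polygon_geojson=1')
```

Opening the resulting URL in a browser shows the debug output for that search.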

We created an issue in the Nominatim repository and got very prompt replies from the Nominatim maintainers, especially from Sarah Hoffmann. The fix was to run ANALYZE on the database:

# runs analyze on the entire nominatim database
psql -d nominatim -c 'ANALYZE VERBOSE'

The PostgreSQL query planner relies on table statistics to choose an efficient plan, and those statistics are gathered by ANALYZE (or, over time, by autovacuum). Ours was a fresh installation, so no statistics had been collected yet. Without them the planner could not pick good plans, and queries took an enormous amount of time.

Comparing Nominatim and Google Geocoder

We ran 2500 addresses through both geocoders. Google geocoded 99% of them, while Nominatim could geocode only 47%.

This meant we would still need to geocode ~50% of the addresses using the Google geocoder. We found, however, that we could improve Nominatim's hit rate by normalizing the addresses we had.
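
One way to keep Google only for the addresses Nominatim cannot resolve is a simple fallback chain. The following is a rough sketch, not the setup we ran in production: geocode_with_fallback is a hypothetical helper, and the two lambdas are stand-ins for real lookups so that the example runs without network access.

```ruby
# Tries each geocoder in order and returns the first usable result.
# `lookups` is a hypothetical ordered hash of name => callable; each
# callable takes an address string and returns [lat, lng] or nil.
def geocode_with_fallback(address, lookups)
  lookups.each do |name, lookup|
    coordinates = lookup.call(address)
    if coordinates.is_a?(Array) && !coordinates.empty?
      return { source: name, coordinates: coordinates }
    end
  end
  nil # no geocoder could resolve the address
end

# Stand-in lookups (assumptions for illustration only).
nominatim = ->(address) { address.include?('Main St') ? [40.0, -74.0] : nil }
google    = ->(_address) { [40.1, -74.1] }

lookups = { nominatim: nominatim, google: google }
p geocode_with_fallback('1 Main St', lookups)
p geocode_with_fallback('Unknown Rd', lookups)
```

With the Geocoder gem, each lambda could presumably wrap a call such as Geocoder.coordinates(address, lookup: :nominatim), assuming both lookups are configured. Because the cheaper geocoder comes first, the paid one is only hit for the remainder.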

Address Normalization using libpostal

Libpostal is an address normalizer that uses statistical natural-language processing to normalize addresses. It also has Ruby bindings, which made it quite easy to use for our tests.

Once libpostal and its Ruby bindings were installed (installation is straightforward and the steps are available on ruby-postal's GitHub page), we gave libpostal + Nominatim a go.

require 'geocoder'
require 'ruby_postal/expand'
require 'ruby_postal/parser'
require 'active_support/core_ext/string/inflections' # provides String#titleize

Geocoder.configure(lookup: :nominatim, nominatim: { host: "nominatim_host:port" })

full_address = '... address for normalization ...'
# expand_address returns several normalized variants of the input
expanded_addresses = Postal::Expand.expand_address(full_address)
parsed_addresses = expanded_addresses.map do |address|
  Postal::Parser.parse_address(address)
end

parsed_addresses.each do |components|
  # components is of the format
  # [{label: :postcode, value: '12345'}, {label: :city, value: 'ny'}, ...]
  parsed_address = [:house_number, :road, :city, :state, :postcode, :country].inject([]) do |acc, key|
    component = components.detect { |c| c[:label] == key }
    acc << component[:value].to_s.titleize if component
    acc
  end

  # stop at the first variant that Nominatim can geocode
  coordinates = Geocoder.coordinates(parsed_address.join(", "))
  if coordinates.is_a?(Array) && !coordinates.empty?
    puts "By Libpostal #{coordinates} => #{parsed_address.join(", ")}"
    break
  end
end

With this, we were able to improve our geocoding hit rate by about 12 percentage points: the Nominatim + Libpostal combination could geocode ~59% of the addresses, up from 47% with Nominatim alone.
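
Percentages like these can be measured by running a sample of addresses through each lookup and counting the hits. The sketch below shows the idea; hit_rate is a hypothetical helper, and the lambda is a stand-in for a real geocoder call.

```ruby
# Returns the percentage of addresses for which `lookup` yields coordinates.
# `lookup` is any callable that takes an address and returns [lat, lng] or nil.
def hit_rate(addresses, lookup)
  return 0.0 if addresses.empty?

  hits = addresses.count do |address|
    coordinates = lookup.call(address)
    coordinates.is_a?(Array) && !coordinates.empty?
  end
  (100.0 * hits / addresses.size).round(1)
end

# Stand-in lookup (assumption): resolves only addresses containing "St".
stub_lookup = ->(address) { address.include?('St') ? [0.0, 0.0] : nil }
sample = ['1 Main St', '2 Oak St', '3 Elm Rd', 'somewhere']

puts "#{hit_rate(sample, stub_lookup)}% of the sample geocoded"
```

Running the same sample through each lookup gives numbers that can be compared directly.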
