One of our applications uses geocoding extensively. When we started the project, we included the excellent Geocoder gem, and set Google as the geocoding backend. As the application scaled, its geocoding requirements grew and soon we were looking at geocoding bills worth thousands of dollars.
An alternative Geocoder
Our search for an alternative geocoder landed us on Nominatim. Written in C, with a PHP web interface, Nominatim was performant enough for our requirements. Once set up, Nominatim needed 8GB of RAM to run, and this included the memory used by PostgreSQL (with PostGIS).
The rest of this post covers how to set up Nominatim, the tips and tricks we learned along the way, and how it compares with the geocoding solution offered by Google.
Setting up Nominatim
We started off by looking for Amazon Machine Images with Nominatim already set up, but could find only one, hosted by OpenStreetMap, and its magnet link was dead.
Next, we went through the official installation document. We decided to give Docker a shot and found that there are many Nominatim Docker builds. We used https://github.com/merlinnot/nominatim-docker since it seemed to follow all the steps mentioned in the official installation guide.
Issues faced during Setup
Out of Memory Errors
The official documentation recommends using 32GB of RAM for initial import but we needed to double the memory to 64GB to make it work.
Disk space was a problem too. Each docker build generates a large amount of data, and Docker caches layers across builds, so whenever a build failed, the subsequent builds ran out of disk space.
Merging Multiple Regions
We wanted to geocode locations from the USA, Mexico, Canada and Sri Lanka. The USA, Mexico and Canada are all included in the North America data extract, but we had to merge the Sri Lanka extract with it to get a single file in the format required for the initial import.
The following snippet pre-processes the map data for North America and Sri Lanka into a single data.osm.pbf file that can be used directly by the Nominatim installer.
    # Download the two regional extracts from Geofabrik
    RUN curl -L 'http://download.geofabrik.de/north-america-latest.osm.pbf' \
        --create-dirs -o /srv/nominatim/src/north-america-latest.osm.pbf
    RUN curl -L 'http://download.geofabrik.de/asia/sri-lanka-latest.osm.pbf' \
        --create-dirs -o /srv/nominatim/src/sri-lanka-latest.osm.pbf

    # Convert each extract to .o5m before merging
    RUN osmconvert /srv/nominatim/src/north-america-latest.osm.pbf \
        -o=/srv/nominatim/src/north-america-latest.o5m
    RUN osmconvert /srv/nominatim/src/sri-lanka-latest.osm.pbf \
        -o=/srv/nominatim/src/sri-lanka-latest.o5m

    # Merge both regions into a single .o5m file
    RUN osmconvert /srv/nominatim/src/north-america-latest.o5m \
        /srv/nominatim/src/sri-lanka-latest.o5m \
        -o=/srv/nominatim/src/data.o5m

    # Convert the merged file back to .osm.pbf for the Nominatim import
    RUN osmconvert /srv/nominatim/src/data.o5m \
        -o=/srv/nominatim/src/data.osm.pbf
Slow Search times
Once the installation was done, we tried running simple location searches like the New York search below, but they timed out. Nominatim's web interface can return a lot of diagnostic information if you append &debug=true to the search query.
    # from
    https://nominatim.openstreetmap.org/search.php?q=New+York&polygon_geojson=1&viewbox=
    # to
    https://nominatim.openstreetmap.org/search.php?q=New+York&polygon_geojson=1&viewbox=&debug=true
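When debugging from a script rather than a browser, the same URL can be built programmatically. The following is a minimal sketch using only Ruby's standard library; the helper name nominatim_search_url and its keyword arguments are hypothetical, and the default base URL is the public instance shown above, which you would presumably replace with your own host.

```ruby
require 'uri'

# Hypothetical helper: builds a Nominatim search URL and, when asked,
# appends debug=true so the web interface returns diagnostic output.
# base_url is an assumption; point it at your own Nominatim host.
def nominatim_search_url(query,
                         base_url: 'https://nominatim.openstreetmap.org/search.php',
                         debug: false)
  params = { 'q' => query, 'polygon_geojson' => 1 }
  params['debug'] = 'true' if debug

  uri = URI(base_url)
  # encode_www_form escapes the values (spaces become "+")
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

puts nominatim_search_url('New York', debug: true)
```

Keeping the debug flag as an option makes it easy to switch the extra output on only while investigating a slow or failing search.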
We created an issue in the Nominatim repository and got very prompt replies from the Nominatim maintainers, especially from Sarah Hoffmann. The fix was to run ANALYZE on the Nominatim database:
    # runs ANALYZE on the entire nominatim database
    psql -d nominatim -c 'ANALYZE VERBOSE'
The PostgreSQL query planner relies on statistics about table contents, which ANALYZE collects, to choose efficient query plans. Since ours was a fresh installation, no statistics had been collected yet, and queries took an enormous amount of time.
Comparing Nominatim and Google Geocoder
We compared the two geocoders on 2500 addresses. Google geocoded 99% of them, while Nominatim could geocode only 47%.
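A comparison like this boils down to counting how many addresses each geocoder resolves. Here is a minimal sketch of that measurement; the hit_rate helper is hypothetical, and the geocoder is a stand-in lambda with made-up data so the snippet runs without a network. In a real run it would presumably wrap a call such as Geocoder.coordinates configured for Google or Nominatim.

```ruby
# Hypothetical helper: fraction of addresses for which a geocoder returns
# coordinates. `geocoder` is any callable that takes an address string and
# returns [lat, lng] on success or nil on failure.
def hit_rate(addresses, geocoder)
  return 0.0 if addresses.empty?

  hits = addresses.count { |address| !geocoder.call(address).nil? }
  hits.to_f / addresses.size
end

# Stand-in geocoder with made-up data, for illustration only.
known = { '1 Main St, Springfield' => [39.8, -89.6] }
fake_geocoder = ->(address) { known[address] }

sample = ['1 Main St, Springfield', 'Unknown Place 42']
puts format('%.0f%% geocoded', hit_rate(sample, fake_geocoder) * 100)
```

Running the same address list through one callable per backend gives directly comparable percentages.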
This meant we would still need to geocode ~50% of the addresses using the Google geocoder. We found that we could increase Nominatim's hit rate by normalizing the addresses we had.
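Using Google only for the addresses Nominatim cannot resolve amounts to a simple fallback chain. The sketch below illustrates the idea under stated assumptions: geocode_with_fallback is a hypothetical helper, and both geocoders are stand-in lambdas rather than real API calls.

```ruby
# Hypothetical helper: tries each geocoder in order and returns the first hit.
# Each entry is a [name, callable] pair; callables return [lat, lng] or nil.
def geocode_with_fallback(address, geocoders)
  geocoders.each do |name, geocoder|
    coordinates = geocoder.call(address)
    return { source: name, coordinates: coordinates } if coordinates
  end
  nil # no geocoder could resolve the address
end

# Stand-ins for illustration only; real ones would call the self-hosted
# Nominatim instance first and the paid Google API second.
nominatim = ->(address) { address == 'known to both' ? [1.0, 2.0] : nil }
google    = ->(_address) { [3.0, 4.0] }
chain     = [[:nominatim, nominatim], [:google, google]]

p geocode_with_fallback('known to both', chain)
p geocode_with_fallback('only google knows', chain)
```

Ordering the chain with the free backend first means the paid one is only billed for the misses.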
Address Normalization using libpostal
Libpostal is an address normalizer that uses statistical natural-language processing to normalize addresses. Libpostal also has Ruby bindings, which made it quite easy to use for our test purposes.
Once libpostal and its Ruby bindings were installed (installation is straightforward and the steps are available on ruby-postal's GitHub page), we gave libpostal + Nominatim a go.
    require 'geocoder'
    require 'ruby_postal/expand'
    require 'ruby_postal/parser'
    # String#titleize comes from ActiveSupport (already loaded inside a Rails app)
    require 'active_support'
    require 'active_support/core_ext/string/inflections'

    Geocoder.configure(lookup: :nominatim, nominatim: { host: 'nominatim_host:port' })

    full_address = '...' # address for normalization
    expanded_addresses = Postal::Expand.expand_address(full_address)
    parsed_addresses = expanded_addresses.map do |address|
      Postal::Parser.parse_address(address)
    end

    # Order in which the address components are joined into a query
    ORDERED_LABELS = %i[house_number road city state postcode country].freeze

    parsed_addresses.each do |components|
      # components is of the format
      # [{label: 'postcode', value: 12345}, {label: 'city', value: 'NY'} .. ]
      parts = ORDERED_LABELS.map do |label|
        # to_sym so the comparison works whether labels are strings or symbols
        component = components.detect { |c| c[:label].to_sym == label }
        component[:value].to_s.titleize if component
      end.compact
      query = parts.join(', ')

      coordinates = Geocoder.coordinates(query)
      if coordinates.is_a?(Array) && !coordinates.empty?
        puts "By Libpostal #{coordinates} => #{query}"
        break # stop at the first expansion that geocodes
      end
    end
With this, the Nominatim + Libpostal combination could geocode ~59% of the addresses, an improvement of roughly 12 percentage points over the 47% that Nominatim managed on its own.