[OSM-dev] Nominatim and/or a fault-tolerant geocoder
lonvia at denofr.de
Tue Nov 29 08:11:49 UTC 2016
On Tue, Nov 29, 2016 at 12:03:35AM +0100, Tom wrote:
> I’m looking for a geocoder for OSM that tolerates misspelled search terms.
> The company I work for does projects for customers in the logistics field. From every customer we receive several hundred thousand address records, which we have to geocode in order to do various calculations. I started to use Nominatim for that (on our own installation), but Nominatim seems to have little tolerance for misspelled street and city names. On our last project in Russia in particular, it turned out that street and city names are often abbreviated in different ways (like „street“, „str.“, „s“, …). Since we receive the address information from our customers, we have little influence on the quality of the data. So there are not just these valid abbreviations, but also real spelling errors. Nevertheless, we have to geocode as many of these addresses as possible.
> Right now, Nominatim finds nothing for around 40% of the addresses, although the address is in OSM and could be found (just spelled slightly differently). What I would expect is that a geocoder gives me some kind of answer for every query, be it an exact match on the city or on the street, or only a „similar“ match. If there was no 100% match, it should tell me that several records were found that match my street or my city at, e.g., 80% to 50%. Then I can decide later which records I consider a match and which not. In any case, the first row returned should be the best match available.
> So I have a couple of questions here:
> Does anybody know of a geocoder for OSM-data that does this already?
> Besides Nominatim, I found several other geocoders, but I cannot test them all. Maybe some already work this way.
As a rule of thumb, the Elasticsearch-based geocoders do a bit better
for misspelled terms, but they are still not ideal because Elasticsearch
is optimised for free text, which has a different distribution of words
> There is a PostgreSQL module that seems to do just what I want: pg_trgm. Nominatim does not seem to use it right now.
> Is there anybody already working on implementing this (or anything similar)?
Trigrams only work with misspellings of a letter or two. They fail
completely when trying to match up abbreviations.
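
To illustrate the point about trigrams, here is a minimal Python sketch
of trigram similarity. It only approximates what pg_trgm does (lowercase,
pad each word, compare trigram sets); the function names are made up for
this example and it is not the module's actual implementation.

```python
import re


def trigrams(text: str) -> set:
    """Return the set of padded character trigrams for every word in text.

    Each word is lowercased and padded with two leading spaces and one
    trailing space, which is roughly how pg_trgm prepares its input.
    """
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# A one-letter typo keeps most trigrams in common ...
typo_score = similarity("street", "stret")
# ... while a short abbreviation shares almost none.
abbreviation_score = similarity("street", "s")

print(f"typo: {typo_score:.3f}  abbreviation: {abbreviation_score:.3f}")
```

With a typical similarity cut-off, the typo would still be accepted while
the one-letter abbreviation would be rejected.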
> If not, I would be willing to invest further time and effort into this, but I need some help with the internals of Nominatim, which I’m not familiar with.
> Where would be the right place to integrate this into Nominatim?
> Does it make sense to try to put this into Nominatim?
> Or would it be easier to just use osm2pgsql and build a new query interface on top of that?
One of the most promising new approaches might be libpostal:
It's not a geocoder but a library for normalising addresses.
So you would use it to preprocess your address and then geocode
the results with a conventional geocoder. There is a PHP
library for it, so it would be easy to extend the Nominatim
query interface. Although I would probably rather try Photon
as the geocoding backend as it will likely catch a few more
In any case, I'd be very interested in the results if you
experiment with libpostal and would be happy to take a
pull request for Nominatim.
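
The normalise-then-geocode idea could be sketched roughly as follows.
This is only an illustration in Python: toy_normalise, toy_geocode, the
abbreviation table and the coordinates are all made-up stand-ins, not
libpostal's or any geocoder's real API. In practice the normaliser would
be libpostal and the geocoder a Nominatim or Photon instance.

```python
from typing import Callable, List, Optional, Tuple

Coordinates = Tuple[float, float]

# Hypothetical abbreviation table standing in for a real address normaliser.
ABBREVIATIONS = {"str.": "street", "str": "street", "s": "street"}

# Hypothetical index standing in for a real geocoder; coordinates are placeholders.
TOY_INDEX = {"lenin street moscow": (1.0, 2.0)}


def toy_normalise(address: str) -> List[str]:
    """Return candidate spellings, expanded form first, then the plain input."""
    tokens = address.lower().replace(",", " ").split()
    original = " ".join(tokens)
    expanded = " ".join(ABBREVIATIONS.get(token, token) for token in tokens)
    return [original] if expanded == original else [expanded, original]


def toy_geocode(query: str) -> Optional[Coordinates]:
    """Exact-match lookup, mimicking a geocoder with no fuzzy matching."""
    return TOY_INDEX.get(query)


def geocode_normalised(
    address: str,
    normalise: Callable[[str], List[str]],
    geocode: Callable[[str], Optional[Coordinates]],
) -> Optional[Tuple[str, Coordinates]]:
    """Try each normalised candidate in order and return the first hit."""
    for candidate in normalise(address):
        result = geocode(candidate)
        if result is not None:
            return candidate, result
    return None


print(geocode_normalised("Lenin str., Moscow", toy_normalise, toy_geocode))
```

Passing the normaliser and the geocoder in as functions keeps the two
steps independent, so either one can be swapped without touching the other.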