[Geocoding] [OSM-dev] Nominatim and/or a fault-tolerant geocoder

Tue Nov 29 22:39:35 UTC 2016

Hi,

[removed dev@ from CC to avoid further cross posting]

On Tue, Nov 29, 2016 at 12:51:11PM +0100, Tom wrote:
> But right now I’m doing some tests with pg_trgm. And Sarah, so far I cannot confirm your comment
> 
> "Trigrams only work with misspellings of a letter or two, they fail
> completely when trying to match up abbreviations."
> 
> To me the opposite seems true, as you can see in the following examples. Let’s take this address, shown both as I want to look it up and as OSM has it stored and spelled.
> 
>            (asked address)    (OSM address)
> street:    Верещагина ул      улица Верещагина
> town:      Ханская ст-ца      Ханская
> city:      Майкоп г           городской округ Майкоп
> region:    Адыгея Респ        Адыгея
> 
> The Nominatim standard query is basically this (for the street):
> 
> select word_id, word_token, word
> from word
> where word_token = make_standard_name('Ханская ст-ца')
> 
> …and does not return anything.

Nominatim's query matching is actually a bit more complex. For each
place name in the database it saves the full name as well as the
partial terms (space-separated words) that make up the name. For
example, for 'улица Верещагина' it will have the full term
'улица Верещагина' and the partials 'улица' and 'Верещагина'.
Furthermore, 'улица' is abbreviated to 'ул', so that it will match
against the full word and the abbreviation later.

When searching, Nominatim does a similar thing and matches first
against full words and then against partial terms. So, while
you won't find 'Верещагина ул' in the word table, Nominatim
will still match it correctly because it finds 'Верещагина' and
'ул' and a database entry (in search_name) that contains both
words as partials.
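
To illustrate the matching described above, here is a minimal
sketch in Python. It is a toy model, not Nominatim's code: the
one-entry abbreviation table and the function names are assumptions
made up for this example, and the real lookup goes through the
word and search_name tables.

```python
# Toy model of full-name and partial-term matching. Not Nominatim's code:
# the abbreviation table and all function names are assumptions.
ABBREVIATIONS = {"улица": "ул"}  # hypothetical single-entry table


def normalize(word):
    """Lower-case a word and replace it with its abbreviation, if any."""
    word = word.lower()
    return ABBREVIATIONS.get(word, word)


def index_name(name):
    """Return (full term, set of partial terms) for a stored place name."""
    words = [normalize(w) for w in name.split()]
    return " ".join(words), set(words)


def matches(query, name):
    """True if the query equals the full term or uses only known partials."""
    full, partials = index_name(name)
    terms = [normalize(w) for w in query.split()]
    return " ".join(terms) == full or all(t in partials for t in terms)


print(matches("Верещагина ул", "улица Верещагина"))
print(matches("Ханская ст-ца", "Ханская"))
```

In this sketch the first call matches, because both query words are
partials of the stored name. The second does not, because 'ст-ца'
is not a partial of 'Ханская' and no query term is ever dropped.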

The real problems start with 'Ханская ст-ца'. Nominatim only
has 'Ханская' as a name but no partial for 'ст-ца' or 'станица'.
And as the search algorithm never drops terms from the search
query(*), it won't return any result. It's true that trigram
search can still return a result. The problem is that the
similarity is already lower than that of many false positives
where the spelling is similar. Similarity is simply not a good
indicator to distinguish between superfluous words and spelling
differences.

(*) Not completely true. It may drop house numbers, but only those.
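
To make the similarity argument concrete, here is a small Python
sketch that approximates how pg_trgm scores two strings. It assumes
pg_trgm's documented behaviour: lower-casing, splitting into words
at non-alphanumeric characters, padding each word with two leading
blanks and one trailing blank, and dividing the number of shared
trigrams by the number of distinct trigrams in either string. It is
an approximation for illustration, not the extension's actual code.

```python
import re


def trigrams(text):
    """Set of trigrams per word, each word padded pg_trgm-style."""
    grams = set()
    # Words are runs of alphanumeric characters; everything else separates.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


asked = "virishchaghina ul"
for token in ("ul virishchaghina", "virishchaghina", "ul virishchaghino"):
    print(token, round(similarity(asked, token), 6))
```

For these three tokens the sketch gives 1.0, 0.833333 and 0.8, the
same values as in the result table quoted below. The score for the
right name with the abbreviation dropped (virishchaghina) is only
marginally above the score for a different name (ul virishchaghino),
so the number alone does not say which kind of difference it saw.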

That's where libpostal comes in. It is supposed to normalize
your address to something that is compatible with the names used
in OpenStreetMap. That includes removing the odd prefix or suffix
(like ст-ца), normalizing numbers etc. The interesting question
is how well that works when the search terms in Nominatim have
not been normalized with the same algorithm.
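
As a rough picture of that preprocessing step, the following
Python sketch strips place-type designators from a query before it
is handed to the geocoder. This is a deliberately naive stand-in and
not libpostal: the designator list is hypothetical and only covers
the example address from this thread, whereas libpostal relies on
its own dictionaries and models.

```python
# Naive stand-in for address normalisation; NOT libpostal.
# Hypothetical designator list, taken from the example address only.
DESIGNATORS = {"ул", "ст-ца", "г", "респ"}


def strip_designators(text):
    """Drop words that are known place-type designators."""
    kept = [w for w in text.split()
            if w.lower().rstrip(".") not in DESIGNATORS]
    return " ".join(kept)


for asked in ("Верещагина ул", "Ханская ст-ца", "Майкоп г", "Адыгея Респ"):
    print(asked, "->", strip_designators(asked))
```

With the designators removed, each remaining word is a name that
also occurs in the corresponding OSM name of the example.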

Sarah

> 
> Now I enabled the extension (CREATE EXTENSION pg_trgm;) and created an index (CREATE INDEX word_token_trgm_idx ON word USING GIST (word_token gist_trgm_ops);) and modified the select slightly:
> 
> select word_id, word_token, word, gettokenstring(transliteration('Верещагина ул')) as asked, 
> 	similarity(word_token, gettokenstring(transliteration('Верещагина ул'))) as sml
> from word
> where word_token % make_standard_name('Верещагина ул')
> order by sml desc
> limit 20
> 
> …and this is the result (I hope the formatting gets through…):
> 
> word_id | word_token            | word                | asked             | sml
> integer | text                  | text                | text              | real
> --------+-----------------------+---------------------+-------------------+---------
>   19098 | ul virishchaghina     | улица Верещагина    | virishchaghina ul | 1
>   19099 | ul virishchaghina     |                     | virishchaghina ul | 1
>   19100 | virishchaghina        |                     | virishchaghina ul | 0.833333
> 1525904 | virishchaghina        | Верещагина          | virishchaghina ul | 0.833333
>  115343 | ul virishchaghino     |                     | virishchaghina ul | 0.8
>  115342 | ul virishchaghino     | улица Верещагино    | virishchaghina ul | 0.8
>  568775 | n virishchaghina      | На Верещагина       | virishchaghina ul | 0.75
>  568776 | n virishchaghina      |                     | virishchaghina ul | 0.75
> 1256480 | pl virishchaghina     | площадь Верещагина  | virishchaghina ul | 0.714286
> 1256481 | pl virishchaghina     |                     | virishchaghina ul | 0.714286
>  351652 | virishchaghin         | Верещагин           | virishchaghina ul | 0.684211
>  351653 | virishchaghin         |                     | virishchaghina ul | 0.684211
>  217731 | virishchaghinskaia ul | Верещагинская улица | virishchaghina ul | 0.666667
>  217732 | virishchaghinskaia ul |                     | virishchaghina ul | 0.666667
>  115344 | virishchaghino        |                     | virishchaghina ul | 0.65
>  824366 | v v virishchaghin     | В.В.Верещагин       | virishchaghina ul | 0.65
>  824367 | v v virishchaghin     |                     | virishchaghina ul | 0.65
>  855756 | virishchaghino        | Верещагино          | virishchaghina ul | 0.65
>  721916 | ur virishchaghino     |                     | virishchaghina ul | 0.636364
>  721915 | ur virishchaghino     | ур. Верещагино      | virishchaghina ul | 0.636364
> 
> So the first two answers, with a similarity of 1 (=100%), are exactly the street I asked for!
> 
> The same happens with the town („Ханская ст-ца“ <-> „Ханская“) and with the region („Адыгея Респ“ <-> „Адыгея“). Of course the similarity is not always 1, but this doesn’t matter, as long as the best match is still my address. Furthermore, it tells me how certain the answer is, so I can deal with the information.
> 
> What Sarah mentions might apply to the city („Майкоп г“ <-> „городской округ Майкоп“), where the real answer only appears as the 23rd result with a similarity of 40%, after the „best“ (but wrong) match of 70%.
> 
> Maybe libpostal could help here, or either the OSM data or the name I asked for is wrong. Anyway, this would be acceptable because of the huge difference in spelling. It could even be remedied with a clever combination of region, city, town and street.
> 
> So, in conclusion, to me pg_trgm looks really promising! And the query doesn’t change a lot. Sure, Nominatim would have to deal with the similarity in the response, but this doesn’t seem like a huge thing, does it?
> 
> Kind regards,
> 
> Tom
> 
> 
> 
> 
> On 29.11.2016 at 09:11, Sarah Hoffmann <lonvia at denofr.de> wrote:
> 
> Hi,
> 
> On Tue, Nov 29, 2016 at 12:03:35AM +0100, Tom wrote:
> > I’m on a quest for a geocoder for OSM that is fault-tolerant with regard to misspelled search terms.
> > 
> > The company I’m working for does various projects for customers in the logistics field. From every customer we receive several hundred thousand address records, which we have to geocode in order to do different calculations. I started to use Nominatim for that (on our own installation), but it seems that Nominatim has little tolerance for misspelled street and city names. Especially on our last project in Russia it turned out that street and city names often include abbreviations in different forms (like „street“, „str.“, „s“, …). Since we receive the address information from our customers, we have little influence on the quality of the data. So there are not just these valid abbreviations, but also real spelling errors. Nevertheless we have to geocode as many of these addresses as possible. 
> > 
> > But right now, Nominatim rejects around 40% of the addresses, not finding anything, although the address is in OSM and could be found (it is just spelled slightly differently). What I would expect is that a geocoder gives me back some kind of answer for every question I ask, be it an exact match on the city or on the street, or only a „similar“ match. If there was no 100% match, it should tell me that several records were found that match my street or my city at, e.g., 80% to 50%. Then I can decide later on which records I consider a match and which not. In any case the first row returned should be the best match available.
> > 
> > So I have a couple of questions here: 
> > 
> > Does anybody know of a geocoder for OSM-data that does this already? 
> > I found besides Nominatim there are several other geocoders. But I cannot test them all. Maybe some work already this way.
> 
> As a rule of thumb, the elastic-search-based geocoders do a bit better
> for misspelled terms but they are still not ideal because elastic search
> is optimised for free text, which has a different distribution of words
> than addresses.
> 
> > There is a Postgresql-module that seems to do just what I want: pg_trgm. It does not seem like Nominatim uses that right now.
> > Is there anybody already working on implementing this (or anything similar)?
> 
> Trigrams only work with misspellings of a letter or two, they fail
> completely when trying to match up abbreviations.
> 
> > If not, I would be willing to invest further time and effort into this, but I need some help with the internals of Nominatim, which I’m not familiar with. 
> > Where would be the right place to integrate this into Nominatim? 
> > Does it make sense to try to put this into Nominatim?
> > Or would it be easier to just use osm2pgsql and build a new query interface on top of that?
> 
> One of the most promising new approaches might be libpostal:
> https://github.com/openvenues/libpostal
> 
> It's not a geocoder but a library for normalising addresses.
> So you would use it to preprocess your address and then geocode
> the results with a conventional geocoder. There is a php
> library for it, so it would be easy to extend the Nominatim
> query interface. Although I would probably rather try photon
> as the geocoding backend as it will likely catch a few more
> spelling errors.
> 
> In any case, I'd be very interested in the results if you
> experiment with libpostal and would be happy to take a
> pull request for Nominatim.
> 
> Kind regards
> 
> Sarah