[Geocoding] Found a potential fix for Issue #967

Sarah Hoffmann lonvia at denofr.de
Mon Mar 23 20:29:18 UTC 2020


Hi,

On Mon, Mar 23, 2020 at 05:21:42AM +0530, K Rahul Reddy wrote:
> Hi!
> 
> I have been going through various files to find where the tab space is being
> dropped. I found that the normalization works as expected and converts tab
> space to single space. But the final query phrase contained tab spaces. The
> reason is:
> 
> Geocode.php:532
> 
>     $sQuery = $this->sQuery;
> 
> 
> When it is replaced with
> 
>    $sQuery = $sNormQuery;
> 
> all tabs and other whitespace characters are replaced with a single
> space.
> 
> 
> Is there any reason why the initial line was used? Or is it safe to replace?

The normalization done for $sNormQuery is different from the one
done later in line 630, when make_standard_name() is called on the phrase.
It serves its purpose for rechecking, but it cannot be used for looking up
search terms.

Why? As I said before, normalization is done twice: once on the input
data (the names of the places you actually want to search for) and again
on the query input (the one in line 630). It is very important that the
two normalizations produce exactly the same result. Imagine that the data
import normalizes away all capital letters, so 'London' becomes 'london'
before it is saved in the database. Now somebody wants to search for
exactly the same term 'London', but the normalization for the search query
keeps the capital letters. Then our original data would never be found
because 'London' != 'london'.
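
To make the 'London' example concrete, here is a minimal sketch in Python.
It is not Nominatim code: the two normalization functions, the index and
the place id are made up for illustration. It only shows that a lookup
succeeds when import and query use the same normalization and fails when
they differ.

```python
def normalize_import(name):
    """Hypothetical import-time normalization: lower-case, collapse whitespace."""
    return " ".join(name.lower().split())


def normalize_query_mismatched(query):
    """Hypothetical query-time normalization that keeps capital letters."""
    return " ".join(query.split())


# Index as it would be built at import time: normalized name -> place id.
# The place id is invented for this example.
index = {normalize_import("London"): 1}


def lookup(query, normalizer):
    """Normalize the query with the given function and look it up."""
    return index.get(normalizer(query))


# Same normalization on both sides: the place is found.
found_same = lookup("London", normalize_import)

# Different normalization on the query side: 'London' != 'london'.
found_diff = lookup("London", normalize_query_mismatched)
```

With these definitions, found_same holds the place id and found_diff is
None, even though the user typed exactly the name that was imported.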

Capitalization isn't usually the problem, but the two forms of normalization
handle diacritics differently, which leads to exactly the problem described
above: use $sNormQuery instead of $sQuery and half the places won't be found
anymore because the normalized names no longer match.
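
As an illustration of how two diacritic rules can diverge, the following
Python sketch compares two common approaches. Neither is claimed to be
what Nominatim's two normalizations actually do; the function names and
the transliteration table are assumptions chosen for the example.

```python
import unicodedata


def strip_marks(text):
    """Lower-case, decompose, and drop combining marks (u-umlaut -> 'u')."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical transliteration table (German-style expansion).
# \u00e4 = a-umlaut, \u00f6 = o-umlaut, \u00fc = u-umlaut, \u00df = sharp s
EXPANSIONS = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"}


def transliterate(text):
    """Lower-case and expand selected letters (u-umlaut -> 'ue')."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in text.lower())


name = "Z\u00fcrich"          # 'Zurich' written with u-umlaut
stored = strip_marks(name)    # form that would be saved at import
queried = transliterate(name) # form that would be produced for the query
```

Here stored is 'zurich' and queried is 'zuerich', so a search term
normalized one way never matches a name normalized the other way.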

Unfortunately, this also means I have to reject your PR for the moment.
Not because it is wrong, but because we have no good way of changing the
normalization without breaking existing databases. There are quite a
lot of Nominatim installations out there which would be expensive to
reinstall. So one of our policies is that changes must either not break
existing databases or come with an upgrade path for them. In the
case of normalization this would mean that we either have a mechanism
where an existing database keeps its original normalization algorithm
even when upgrading to a newer Nominatim version, or that there is a
script to 'convert' a database to the new normalization scheme. It is
definitely something we want to have in the future, but at the moment
we have neither. So that's why we can't accept normalization changes.
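
The first option, a database that keeps its original normalization across
upgrades, could look roughly like the Python sketch below. This is purely
hypothetical and does not describe an existing Nominatim feature: the
version numbers, the NORMALIZERS registry and the Database class are all
invented for illustration.

```python
def normalize_v1(text):
    """Hypothetical original scheme: lower-case only, whitespace untouched."""
    return text.lower()


def normalize_v2(text):
    """Hypothetical newer scheme: also collapse tabs and repeated spaces."""
    return " ".join(text.lower().split())


# Registry of all schemes the software still ships (assumed design).
NORMALIZERS = {1: normalize_v1, 2: normalize_v2}


class Database:
    """Toy database that records which scheme it was imported with."""

    def __init__(self, version):
        self.version = version
        self.normalize = NORMALIZERS[version]
        self.index = {}

    def add(self, name, place_id):
        self.index[self.normalize(name)] = place_id

    def search(self, query):
        # The query always uses the scheme stored with this database,
        # so import and search stay consistent after a software upgrade.
        return self.index.get(self.normalize(query))


old_db = Database(version=1)
old_db.add("New York", 42)

new_db = Database(version=2)
new_db.add("New York", 42)
```

In this sketch old_db.search("New York") keeps working after the upgrade,
while only new_db.search("new\tyork") benefits from the tab handling.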

Kind regards

Sarah


