[Geocoding] Regarding issue #967

Thu Apr 2 19:51:08 UTC 2020

Hi,

On Wed, Apr 01, 2020 at 11:19:52PM +0530, K Rahul Reddy wrote:
> I have written test cases in test/bdd, but I found something else while
> doing so. The setNearPointFromQuery function, which detects LatLon pairs,
> is processed separately. This causes the last two examples in the
> following Scenario to fail.
> 
>     Scenario Outline: Search with white space characters
>         When sending json search query "<data>"
>         Then exactly 1 result is returned
> 
>     Examples:
>       | data |
>       | amerlugalpe, N 47.15739° E 9.61264° |
>       | amerlugalpe,    N 47.15739° E 9.61264° |
>       |     amerlugalpe    ,     N 47.15739° E 9.61264° |
>       | amerlugalpe, N 47.15739°         E 9.61264° |
>       | amerlugalpe
, N 47.15739° E    9.61264° |
> 
> 
> This could be fixed by using preg_replace in the setNearPointFromQuery
> function in SearchContext.php, or by applying preg_replace to $sQuery. The
> former will fix LatLon, but the main query string will still have those
> characters.

Looks like the regexes in parseLatLong() are rather picky there and
only accept real spaces. That could be replaced with the more generic '\s'.
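
A rough sketch of the idea, not the actual Nominatim code: the two patterns
below are simplified, hypothetical stand-ins for one of the parseLatLong()
regexes, and matchesLatLon() is a helper made up for this illustration. The
only difference between the patterns is a literal space versus '\s+'.

```php
<?php
// Hypothetical, simplified patterns. The real regexes in parseLatLong()
// cover more notations; only the whitespace handling is of interest here.
$sStrict  = '/\b([NS]) ([0-9.]+)° ([EW]) ([0-9.]+)°/u';
$sLenient = '/\b([NS])\s+([0-9.]+)°\s+([EW])\s+([0-9.]+)°/u';

// Illustrative helper (not part of Nominatim): true if the pattern matches.
function matchesLatLon(string $sPattern, string $sQuery): bool
{
    return preg_match($sPattern, $sQuery) === 1;
}

$aQueries = array(
    "amerlugalpe, N 47.15739° E 9.61264°",     // single spaces
    "amerlugalpe, N 47.15739°\t\tE 9.61264°",  // tabs between the two parts
    "amerlugalpe, N 47.15739° E\n9.61264°",    // newline inside the pair
);

foreach ($aQueries as $sQuery) {
    printf(
        "strict=%d lenient=%d\n",
        matchesLatLon($sStrict, $sQuery),
        matchesLatLon($sLenient, $sQuery)
    );
}
```

With these stand-in patterns, the strict version matches only the first
query, while the '\s' version matches all three.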

Cheers

Sarah

> 
> Which approach should I follow? Or should I ignore this, since this is part
> of LatLon, which would not generally contain other white space characters?
> 
> Regards,
> 
> Rahul
> 
> On 01/04/20 11:42 am, Sarah Hoffmann wrote:
> > Hi Rahul,
> > 
> > On Wed, Apr 01, 2020 at 05:36:00AM +0530, K Rahul Reddy wrote:
> > > For issue #967 <https://github.com/osm-search/Nominatim/issues/967>, these
> > > are some points I found so far:
> > > 
> > >      In Geocode.php lookup(),
> > > 
> > > 1) The sNormQuery is made by using PHP's Transliterator.
> > > 
> > > 2) The normalization method make_standard_name is used on phrases in line
> > > 630. This is an sql function which returns
> > > trim(public.gettokenstring(public.transliteration(name))).
> > > 
> > >      We need to replace %09-%0d characters in phrases. This can be done
> > > simply by adding
> > > 
> > >                  $sPhrase = preg_replace('/[\x09-\x0d]/', ' ', $sPhrase);
> > > 
> > >      before normalization function is called.
> > > 
> > > 3) Another solution would be to change the normalization (breaks the DB).
> > > transliteration() uses utfasciitable.h.
> > > 
> > >      Changing UTFASCIILOOKUP by replacing the elements at positions 9-13
> > > with '2' does the job.
> > > 
> > > 
> > > I have tested both the ways, and both seem to work as expected. What should
> > > I do now?
> > Go for solution 3). It is true that it breaks the DB but only for places
> > that have characters %09-%0d in their name. That's basically data that is
> > broken in the OSM database already and should be fixed. Therefore it is
> > okay to make an exception to the rule not to change the normalization.
> > 
> > Cheers
> > 
> > Sarah