[Photon] Next steps

Thu Jun 19 22:53:02 UTC 2014

Today, in two words: we are moving forward, and fuzzy is starting to give
good results :)
Locally:
= 17 failed, 60 passed, 1 skipped, 29 deselected in 2.38 seconds =
Requesting photon.komoot.de:
= 50 failed, 27 passed, 1 skipped, 29 deselected in 59.63 seconds =

This is only for the iledefrance tests in both cases. Note that the data
is not the same in the two dbs, so the numbers should be taken with a
grain of salt ;)

We are still fighting with small words: we were wrong in our
understanding of "prefix_length" in conjunction with fuzziness. It means
that the first "prefix_length" characters are never fuzzied, even if the
word is longer, so "prais" will never match "paris" with a
"prefix_length" of 3 ("pra" vs "par").
Fuzzy is really the most exciting part of the challenge, but also imho
the least controllable part of ES: no way to boost/unboost fuzzied
matches, no way to decide which terms will be kept when using
max_expansions, etc.
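
To make the "prefix_length" behaviour concrete, here is a small pure
Python sketch. It is not what ES/Lucene runs internally, only a
simulation under two assumptions: transpositions count as a single edit,
and the first "prefix_length" characters must match exactly. The function
names are made up for the illustration.

```python
def osa_distance(a, b):
    """Edit distance where insert, delete, substitute and adjacent
    transposition each cost 1 (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ra" <-> "ar".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def fuzzy_matches(query, indexed, fuzziness=1, prefix_length=0):
    """Hypothetical helper: True if `indexed` would be accepted for `query`
    under the simplified rules described above."""
    # The prefix is compared exactly: no fuzziness is applied to it.
    if query[:prefix_length] != indexed[:prefix_length]:
        return False
    return osa_distance(query, indexed) <= fuzziness


print(fuzzy_matches("prais", "paris", fuzziness=1, prefix_length=3))
print(fuzzy_matches("prais", "paris", fuzziness=1, prefix_length=1))
```

With a "prefix_length" of 3 the typo sits inside the protected prefix, so
the first call rejects the pair; with a "prefix_length" of 1 the same pair
is one transposition away and is accepted.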

On my side, the next step is to set up a France db, and then add many
more tests (I have a 1.5 GB csv to process :p).

Then I would like to test the "split at search time + match prefix"
scenario suggested by Peter. I wonder if this can solve the noise we
have when mixing fuzzy + edge n-grams for small words.
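
For discussion, one possible reading of that scenario, as a rough sketch
only: split the user input on whitespace, run the complete words as fuzzy
"match" clauses and the last (possibly unfinished) word as a "prefix"
clause. The function name, the "name" field and the default values are
assumptions for the example, not what Peter proposed in detail nor what
Photon does today.

```python
import json


def build_split_query(text, field="name", fuzziness=1, prefix_length=1):
    """Hypothetical builder: returns an ES query body as a plain dict."""
    tokens = text.lower().split()
    if not tokens:
        return {"match_all": {}}
    *complete, last = tokens
    # Words the user has finished typing: fuzzy match, protected prefix.
    must = [
        {
            "match": {
                field: {
                    "query": token,
                    "fuzziness": fuzziness,
                    "prefix_length": prefix_length,
                }
            }
        }
        for token in complete
    ]
    # Word still being typed: plain prefix match, no fuzziness.
    must.append({"prefix": {field: last}})
    return {"bool": {"must": must}}


print(json.dumps(build_split_query("rue de pari"), indent=2))
```

The idea would be that only the last token is treated as a prefix, so
the small complete words no longer go through fuzzy and edge n-grams at
the same time.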

Also, I'm thinking about splitting the search test module into a separate
repository: as we are now working with many branches, switching a lot,
and adding tests here and there, merging becomes more and more tricky.
It's also better to have all search test cases available in any branch.
Plus I would like to start working on adapters for Nominatim, Photon
Solr, and also Pelias soon, to be able to compare results.
What are your thoughts on this?

On 06/17/2014 07:00 PM, Christoph Lingg wrote:
> I gave it a quick try; I did not get fuzzy to work.
>
>> But then cross_fields is not the silver bullet, because you weight *per field* (which is nice), but you also want to weight differently if the query was fuzzy or not, which you can't do with a multimatch
> as I see it, cross_fields is just a more elegant (field-specific boost) and more efficient (no redundancy) way of doing what the collector does. If fuzzy worked, for instance, I guess it would fit perfectly for the should branch here https://github.com/komoot/photon/blob/positivescoring/website/photon/app.py#L40
>
> now you only boost raw hits on name and the rest, this could be done more specifically.
>
>> If you want to work on search logic, you should work on the positive scoring branch.
> that’s why i asked you in the first mail if you had some time tomorrow (maybe 15 min) to get coordinated, also thinking about how to contribute on thursday/friday.
>
> Cheers,
>
>
>> On 06/17/2014 06:23 PM, Peter K wrote:
>>> There is no official answer so I'm not sure if this is correct. E.g. in
>>> the code it does not look like there is a limit
>>>
>>> https://github.com/elasticsearch/elasticsearch/blob/9ed34b5a9e9769b1264bf04d9b9a674794515bc6/src/main/java/org/elasticsearch/index/query/MultiMatchQueryBuilder.java
>>>
>>> Also in this issue they talk about fuzziness (and link to another issue
>>> which is now closed):
>>> https://github.com/elasticsearch/elasticsearch/pull/5005
>>>
>>> Peter.
>>>
>>>> On 06/17/2014 06:05 PM, Christoph Lingg wrote:
>>>>> downside is apparently that fuzzy is not working, haven’t tried it
>>>>> yet. Are you sure about this yohan?
>>>>
>>>> Yes, that's why we eliminated this option quickly :/
>>>> See for example
>>>> https://groups.google.com/forum/#!topic/elasticsearch/mmdnRDsvvVA
>>>>
>>>>
>>>
>>>