[Photon] Next steps

Fri Jun 20 10:50:04 UTC 2014

Hey Christoph, hey Yohan,

> another aspect is performance: things are working well as long as the entire index fits into the disk cache. 
> currently the entire world needs about 30 GB. but on an ordinary disk (no ssd) requests might take seconds. 
> @Peter: do you think using ssd would solve the problem even though the index does not fit in ram?

I have a rather old version installed and the index is about 65 GB, but I
assign only 12 GB of memory to it and the speed is okay with an ordinary
disk or with an SSD. Or did you mean regarding fuzzy search? (I've not
tested this.) I've also disabled swapping for my server.

> Fuzzy is really the most exciting part of the challenge, but also imho the least controllable part of ES: 
> no way to boost/unboost fuzzied matching, no way to decide which terms will be kept when using max_expansions, etc.

If fuzzy causes a lot of trouble, maybe it would make sense to avoid it
and use (a lot less powerful) alternatives like phonetic analyzers and
synonyms. I can imagine that one can train the index quite well via the
data and offer a kind of 'did you mean' feature. Of course this only
makes sense if you use it in an end user application and not e.g. in an API.
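
A rough sketch of what such an index configuration could look like,
written as a Python dict that would be sent as the settings body when
creating the index. It assumes the phonetic analysis plugin is installed.
The filter and analyzer names and the synonym entry are made up for
illustration; this is not Photon's actual configuration.

```python
import json

# Hypothetical index settings: names such as "place_synonyms" and
# "place_phonetic" are illustrative, not taken from Photon.
settings = {
    "analysis": {
        "filter": {
            "place_synonyms": {
                "type": "synonym",
                # Treat the German spelling as equivalent to the English one.
                "synonyms": ["kopenhagen, copenhagen"],
            },
            "place_phonetic": {
                "type": "phonetic",  # needs the phonetic analysis plugin
                "encoder": "double_metaphone",
                "replace": False,  # keep the original token next to its code
            },
        },
        "analyzer": {
            "place_name": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "place_synonyms", "place_phonetic"],
            }
        },
    }
}

# Body for an index-creation request.
body = json.dumps({"settings": settings}, indent=2)
print(body)
```

Unlike fuzzy, both filters do their work at analysis time, so which
variants match is decided by the configuration and not by term expansion
at query time.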

> felix told me that fuzzy comes with the price of high cpu consumption …?

I think so too but I have no real world experience. I would at least
disable it for autosuggestions at the beginning. Maybe fuzzy does not
even make sense for autosuggestion, as one wants to have a more or less
strict filter and fuzzy sometimes includes too many unexpected results.
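
To make the "prefix_length" behaviour from the quoted thread below
concrete, here is a small pure-Python sketch. It is not how Lucene
implements fuzzy matching; it only mimics the rule that the first
prefix_length characters must match exactly while the rest may differ by
a bounded edit distance. The function names, the default of 2 edits and
the plain Levenshtein distance (no transpositions) are my assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_match(query: str, term: str, prefix_length: int, max_edits: int = 2) -> bool:
    """Mimic a fuzzy match with a fixed, non-fuzzied prefix."""
    # The leading prefix_length characters are never fuzzied.
    if query[:prefix_length] != term[:prefix_length]:
        return False
    # Only the remainder is compared by edit distance.
    return levenshtein(query[prefix_length:], term[prefix_length:]) <= max_edits


print(fuzzy_match("prais", "paris", prefix_length=3))
print(fuzzy_match("prais", "paris", prefix_length=1))
```

With a prefix length of 3 the prefixes "pra" and "par" already differ, so
the first call returns False; with a prefix length of 1 the remainders
"rais" and "aris" are two edits apart and the second call returns True.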

> That’s why I was testing with a global data set primarily. at the price of slower development cycles, 
> but it is more realistic in return and you don’t have surprises as the data is growing.

That is a valid argument; still, I would prefer to simulate the issues we
get in the global instance with some fast-to-execute unit tests. That's
what I do for GraphHopper, and it is often very hard work to create such
tests (sometimes even impossible), but it pays off, a lot.

> My plans: I’m working on releasing a new version of photon on komoot at the beginning of next week.

Nice!

> And i am learning about how to deploy elasticsearch, things i can contribute to photon then.

If you have issues, feel free to ping me. BTW: I'm now subscribed to the
list, no need for CC ;)

You guys are doing a great job :) !

Regards,
Peter.

> hi guys!
>
>> Today, in two words: we are going forward, and fuzzy is starting to give good results :)
>> In local:
>> = 17 failed, 60 passed, 1 skipped, 29 deselected in 2.38 seconds =
>> Requesting photon.komoot.de:
>> = 50 failed, 27 passed, 1 skipped, 29 deselected in 59.63 seconds =
>>
>> This is only for iledefrance tests in both cases. Indeed, the data is not the same on both db, so the numbers should be taken with a grain of salt ;)
> is it only the data that is different?
> an experience we had when we expanded our data from europe to world: things get more complicated as the pool of possible items increases. for instance there is a place called pärsi in finland that might conflict with a fuzzy search for paris.
> https://www.komoot.de/plan?p[0][name]=P%C3%A4rsi&p[0][loc]=58.150644%2C25.549609&p[1]=
> another example is kopenhagen (german for copenhagen) which returns search results in south america rather than the european capital.
>
> another aspect is performance: things are working well as long as the entire index fits into the disk cache. currently the entire world needs about 30 GB. but on an ordinary disk (no ssd) requests might take seconds. @Peter: do you think using ssd would solve the problem even though the index does not fit in ram?
>
> That’s why I was testing with a global data set primarily. at the price of slower development cycles, but it is more realistic in return and you don’t have surprises as the data is growing.
>
>> We still are fighting with small words: we were wrong in our understanding of "prefix_length" in conjunction with fuzziness. It means that nothing is fuzzied under this threshold, even if the word is longer, so "prais" will never match "paris" with a "prefix_length" of 3.
>> Fuzzy is really the most exciting part of the challenge, but also imho the least controllable part of ES: no way to boost/unboost fuzzied matching, no way to decide which terms will be kept when using max_expansions, etc.
> I am very interested in the outcome of your efforts as typo tolerance is a great feature.
>
>> On my side, the next step is to set up a France db, and then add many more tests (I have a 1.5 GB csv to process :p).
> btw, have you tested the json dump of photon or were you using the python script? I would ask felix for a nominatim export to json, so we will finally have an up to date and bug free global data set we can publish on photon.komoot.de and people can use for a photon setup without installing nominatim.
>
> My plans: I’m working on releasing a new version of photon on komoot at the beginning of next week. i am using the fork i created that does not use fuzzy search but is very solid in return. It is also unclear to me how much fuzzy increases the load on the system. does anybody have experience with that? felix told me that fuzzy comes with the price of high cpu consumption …? Nevertheless this release will resolve a lot of bugs in the current komoot setup and is already a big step as i have to change the infrastructure from solr to es. And i am learning about how to deploy elasticsearch, things i can contribute to photon then.
>
>> Then I would like to test the "split at search time + match prefix" scenario suggested by Peter. I wonder if this can solve the noise we have when mixing fuzzy + edge n-grams for small words.
> cool!
>
>> Also, I'm thinking about splitting the search test module into a separate repository: as we are working now with many branches, and switching a lot, and adding tests here and there, merging becomes more and more tricky.
>> Also it's better to have all search test cases available in any branch.
> I understand your point, it wasn’t a great pain for me though. For me both solutions are fine.
>
>> Plus I would like to start working on adapters for Nominatim, Photon Solr, and also Pelias soon, to be able to compare results.
>> What are your thoughts on this?
> Why are you interested in it?
>
> Cheers!

-- 
GraphHopper.com - Fast & Flexible Road Routing