[Photon] Next steps

Christoph Lingg christoph at lingg.eu
Fri Jun 20 08:09:53 UTC 2014


hi guys!

> Today, in two words: we are going forward, and fuzzy is starting to give good results :)
> In local:
> = 17 failed, 60 passed, 1 skipped, 29 deselected in 2.38 seconds =
> Requesting photon.komoot.de:
> = 50 failed, 27 passed, 1 skipped, 29 deselected in 59.63 seconds =
> 
> This is only for the iledefrance tests in both cases. Indeed, the data is not the same in both DBs, so the numbers should be taken with distance ;)
Is it only the data that is different?
Some experiences we had when we expanded our data from Europe to the whole world: things get more complicated as the pool of possible items grows. For instance, there is a place called Pärsi in Finland that might conflict with a fuzzy search for Paris.
https://www.komoot.de/plan?p[0][name]=P%C3%A4rsi&p[0][loc]=58.150644%2C25.549609&p[1]=
Another example is Kopenhagen (German for Copenhagen), which returns search results in South America rather than the European capital.
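To make the Pärsi/Paris point more concrete: if I remember correctly, ES fuzziness is based on Damerau-Levenshtein distance (an adjacent transposition counts as a single edit), and under that measure the two names are only two edits apart. A tiny standalone sketch, no ES involved:

def edit_distance(a, b):
    # Optimal string alignment distance: insertions, deletions, substitutions
    # and adjacent transpositions each cost 1 (as far as I know, the model
    # Lucene fuzzy queries use).
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(edit_distance("paris", "pärsi"))  # 2 -> within reach of fuzziness 2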

Another aspect is performance: things work well as long as the entire index fits into the disk cache. Currently the entire world needs about 30 GB, but on an ordinary disk (no SSD) requests might take seconds. @Peter: do you think using an SSD would solve the problem even though the index does not fit into RAM?

That’s why I was primarily testing with a global data set: at the price of slower development cycles, but it is more realistic in return and you don’t get surprises as the data grows.

> We still are fighting with small words: we were wrong in our understanding of "prefix_length" in conjunction with fuzziness. It means that nothing is fuzzied under this threshold, even if the word is longer, so "prais" will never match "paris" with a "prefix_length" of 3.
> Fuzzy is really the most exciting part of the challenge, but also imho the least controllable part of ES: no way to boost/unboost fuzzied matching, no way to decide which terms will be kept when using max_expansion, etc.
I am very interested in the outcome of your efforts as typo tolerance is a great feature.
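Just to make sure we read prefix_length the same way: the first N characters of a term are taken literally and only the rest may be fuzzied. So a query like the following (index and field names are only placeholders, not photon's real mapping) can never turn "prais" into "paris", since the two already differ within the first three characters:

import requests

query = {
    "query": {
        "match": {
            "name": {                  # placeholder field
                "query": "prais",
                "fuzziness": 2,
                "prefix_length": 3     # first 3 chars must match exactly before fuzzy expansion
            }
        }
    }
}
# "prais" keeps the literal prefix "pra", so the term "paris" ("par...") is never
# generated as an expansion, no matter how small the edit distance is.
print(requests.post("http://localhost:9200/places/_search", json=query).json())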

> On my side, the next step is to set up a France db, and then add many more tests (I have a 1.5 GB CSV to process :p).
Btw, have you tested the JSON dump of photon, or were you using the Python script? I would ask Felix for a Nominatim export to JSON, so we finally have an up-to-date and bug-free global data set that we can publish on photon.komoot.de and that people can use for a photon setup without installing Nominatim.

My plans: I’m working on releasing a new version of photon on komoot at the beginning of next week. I am using the fork I created, which does not use fuzzy search but is very solid in return. It is also unclear to me how much fuzzy search increases the load on the system. Has anybody had experience with that? Felix told me that fuzzy comes at the price of high CPU consumption …? Nevertheless, this release will resolve a lot of bugs in the current komoot setup and is already a big step, as I have to change the infrastructure from Solr to ES. And I am learning how to deploy Elasticsearch, things I can then contribute back to photon.

> Then I would like to test the "split at search time + match prefix" scenario suggested by Peter. I wonder if this can solve the noise we have when mixing fuzzy + edgengrams for small words.
cool!
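In case it helps, this is roughly how I picture the query being built at search time: split the input into tokens, match the complete tokens normally (optionally with a little fuzziness), and only treat the last, possibly unfinished token as a prefix. The field name is again just a placeholder:

def build_query(text):
    # Sketch of "split at search time + match prefix": only the token the user
    # is still typing is prefix-matched, the rest are matched as whole terms.
    tokens = text.lower().split()
    if not tokens:
        return {"query": {"match_all": {}}}
    *complete, last = tokens
    must = [{"match": {"name": {"query": t, "fuzziness": 1}}} for t in complete]
    must.append({"match_phrase_prefix": {"name": last}})
    return {"query": {"bool": {"must": must}}}

print(build_query("berlin alexanderpl"))

That should keep the prefix/edge-ngram noise confined to one token instead of spreading it over every short word in the query.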

> Also, I'm thinking about splitting the search test module into a separate repository: as we are working now with many branches, and switching a lot, and adding tests here and there, merging becomes more and more tricky.
> Also it's better to have all search test cases available in any branches.
I understand your point; it wasn’t a great pain for me though. For me, both solutions are fine.

> Plus I would like to start working on adapter for Nominatim, Photon Solr, and also Pelias soon, to be able to compare results.
> What are your thoughts on this?
Why are you interested in it?

Cheers!

