[OSM-talk] Pictures of opening hours signs for machine learning purposes

Toggenburger Lukas Lukas.Toggenburger at fhgr.ch
Sun Apr 11 10:58:52 UTC 2021


> I didn’t do any additional work on deduplicating the images. I’m not sure why you think this is important if you’re going to use it for ML training.

@Bryce: It's important if one wants to split off a test dataset to get a good estimate of how well the recognition works: if almost the same image appears in both the training dataset and the test dataset, the estimated real-world performance will look better than it really is. (An alternative approach would be to make sure that all pictures of the same opening hours sign end up either in the training dataset or in the test dataset, but not both. But having such a set of pictures of the same sign in the test dataset makes it hard to calculate a success rate.)
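
To illustrate the alternative approach mentioned in the parentheses above, here is a minimal sketch in Python. It assumes that somebody has already labelled which pictures show the same sign; the `sign_id` values, the file names and the 20 % test fraction are made up for illustration and do not come from the actual dataset.

```python
import hashlib

def assign_split(sign_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a sign ID to 'train' or 'test'.

    Hashing the ID (rather than the file name) means every picture
    of the same sign ends up in the same split.
    """
    digest = hashlib.sha256(sign_id.encode("utf-8")).digest()
    # Turn the first 8 bytes into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"

def split_by_sign(images, test_fraction: float = 0.2) -> dict:
    """images: iterable of (filename, sign_id) pairs (hypothetical labels)."""
    splits = {"train": [], "test": []}
    for filename, sign_id in images:
        splits[assign_split(sign_id, test_fraction)].append(filename)
    return splits

if __name__ == "__main__":
    # Hypothetical example data: two pictures of sign "a", one each of "b" and "c".
    images = [("a1.jpg", "sign-a"), ("a2.jpg", "sign-a"),
              ("b1.jpg", "sign-b"), ("c1.jpg", "sign-c")]
    print(split_by_sign(images))
```

With only a handful of signs the actual test share can deviate a lot from the requested fraction; it only evens out for larger numbers of signs.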

> Keep in mind I’m not doing any ML training, so having a larger sample size doesn’t benefit me.
> I wanted a large number of test images in order to measure the expected accuracy of the OCR
> and algorithm in real-world settings. My plan now is to build a stand-alone app for testing
> during surveying, improve the recognition by building better spatial models of how the text is
> laid out, and then finally integrate it into Go Map!!

@Bryce: OK, in that case you only need one set of images. But you would still benefit from
a) a large (test) dataset
b) annotations of the expected results (opening hours) to compare different implementations against each other
right?
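
To make point b) a bit more concrete, here is a rough sketch in Python of how such annotations could be used to compare implementations. The file names and opening_hours values are invented, and requiring an exact string match (after collapsing whitespace) is just my simplifying assumption: two differently written but equivalent values would count as a miss.

```python
def normalize(value: str) -> str:
    """Collapse runs of whitespace so trivial spacing differences don't count."""
    return " ".join(value.split())

def success_rate(expected: dict, predicted: dict) -> float:
    """Fraction of annotated images whose predicted value matches the annotation.

    expected:  image name -> annotated opening_hours string
    predicted: image name -> string produced by one implementation
    Images without a prediction count as failures.
    """
    if not expected:
        raise ValueError("no annotations given")
    hits = sum(
        1
        for name, hours in expected.items()
        if normalize(predicted.get(name, "")) == normalize(hours)
    )
    return hits / len(expected)

if __name__ == "__main__":
    # Hypothetical annotations and the output of one implementation.
    expected = {"a.jpg": "Mo-Fr 08:00-18:00", "b.jpg": "Sa 09:00-12:00"}
    predicted = {"a.jpg": "Mo-Fr  08:00-18:00", "b.jpg": "Sa 09:00-13:00"}
    print(success_rate(expected, predicted))
```

Running the same function over the output of each implementation would give numbers that can be compared directly, as long as all of them are evaluated on the same annotated images.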

> I’m working on this at https://github.com/bryceco/OpeningHoursPhoto

@Bryce: Cool! Thanks for sharing!

> The data is not big but not tiny either (about 23 MB). I've put it into a
> GitHub repo:
> https://github.com/iboates/osm-opening-hours-signs-rekognition-results

@Isaac: Thanks for sharing!

> It will probably then have trouble detecting the "ß" character if it comes
> up (it will probably often show up as "B"), but from what I understand it
> doesn't appear in Swiss German.

Yes, "ß" is practically never used in Switzerland (except maybe by German shop owners in our case...); "ss" is used instead.

Is it possible to set up the language that should be detected in Amazon Rekognition?

Best regards

Lukas



________________________________________
From: Isaac Boates <iboates at gmail.com>
Sent: Saturday, 10 April 2021 21:40
To: Bryce Cogswell
Cc: Toggenburger Lukas; talk at openstreetmap.org
Subject: Re: [OSM-talk] Pictures of opening hours signs for machine learning purposes

I took the "deduplicated" images from Bryce's download link and ran them through Amazon Rekognition for text extraction. So no actual training or modelling was done by me, but it's a pretty cheap service (1.20 USD per 1000 images on the Frankfurt server). It gives back a JSON document for every image with precise details about what it found, where it is in the image, etc. I saved the result for each image as a JSON file with the same name as the image, just with ".json" as the file extension.
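
For anyone who wants to dig through the results, here is a small Python sketch for pulling the text lines out of one result file. It assumes each file holds the unmodified response of Rekognition's DetectText call, i.e. a "TextDetections" list whose entries carry "DetectedText", "Type" ("LINE" or "WORD") and "Confidence". The 80 % threshold and the helper names are arbitrary choices for illustration.

```python
import json

def detected_lines(response: dict, min_confidence: float = 80.0) -> list:
    """Return the text of all LINE detections above a confidence threshold.

    WORD entries are skipped because they repeat the content of the lines.
    """
    return [
        d["DetectedText"]
        for d in response.get("TextDetections", [])
        if d.get("Type") == "LINE" and d.get("Confidence", 0.0) >= min_confidence
    ]

def lines_for_file(json_path: str, min_confidence: float = 80.0) -> list:
    """Load one saved result file and return its detected lines."""
    with open(json_path, encoding="utf-8") as f:
        return detected_lines(json.load(f), min_confidence)

if __name__ == "__main__":
    # Hand-written sample in the shape described above, not real output.
    sample = {
        "TextDetections": [
            {"DetectedText": "Offnungszeiten", "Type": "LINE", "Confidence": 99.1},
            {"DetectedText": "Offnungszeiten", "Type": "WORD", "Confidence": 99.1},
            {"DetectedText": "Mo-Fr 8-18", "Type": "LINE", "Confidence": 42.0},
        ]
    }
    print(detected_lines(sample))
```

Each detection should also carry a "Geometry" entry with its position in the image, which is not used in this sketch.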

I haven't gone through the results in detail yet; I wanted to share them first so anyone can dig through them to see what's in there and potentially get ideas from them. One thing I did notice, however, is that it does not detect accented characters, so "Öffnungszeiten" becomes "Offnungszeiten". It will probably then have trouble detecting the "ß" character if it comes up (it will probably often show up as "B"), but from what I understand it doesn't appear in Swiss German.
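
One way to live with the missing accents when searching the results for keywords would be to strip accents from both sides before comparing. Here is a minimal Python sketch of that idea; the function names are made up, and it is only a workaround on the matching side, not a fix for the recognition itself.

```python
import unicodedata

def fold(text: str) -> str:
    """Lower-case the text and strip accents for tolerant comparison.

    casefold() also turns "ß" into "ss"; NFKD splits "ö" into "o" plus a
    combining diaeresis, which is then dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def contains_keyword(ocr_text: str, keyword: str) -> bool:
    """True if the keyword occurs in the OCR text, ignoring case and accents."""
    return fold(keyword) in fold(ocr_text)

if __name__ == "__main__":
    print(fold("Öffnungszeiten"))
    print(contains_keyword("OFFNUNGSZEITEN Mo-Fr", "Öffnungszeiten"))
```

This does not help with a "ß" that was misread as "B"; that case would need separate handling.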

The data is not big but not tiny either (about 23 MB). I've put it into a GitHub repo: https://github.com/iboates/osm-opening-hours-signs-rekognition-results

Also, just as a disclaimer: I am not affiliated with Amazon in any way. I just had some experience with this specific product of theirs and thought it would be good to run the data through a "state of the art" pre-built ML solution.

Cheers
Isaac


On Sat, Apr 10, 2021 at 4:04 PM Isaac Boates <iboates at gmail.com> wrote:
@Lukas: I was having a bit of trouble getting the guest account permissions set up on my AWS, but then Bryce went ahead and posted a direct link. Thanks for that!

Isaac

On Sat, Apr 10, 2021 at 5:52 AM Bryce Cogswell via talk <talk at openstreetmap.org> wrote:

> @Bryce: Did you already make significant efforts regarding deduplicating / sorting or otherwise processing the images? If yes, maybe you could share this altered dataset with Isaac and other interested parties?

I didn’t do any additional work on deduplicating the images. I’m not sure why you think this is important if you’re going to use it for ML training.

> @Bryce: Congratulations! I already saw some correctly recognized specimens! That is certainly encouraging, isn't it? Do you already know if/how you would proceed further? If you would be okay with publishing what you already have, maybe others could build upon that.
>
> I remember one idea we had: If users of such a recognition feature were willing to share their pictures (automatically, with little/no effort) to increase the pool of pictures, you could create a virtuous cycle, especially if you can motivate them to either mark detections as correct or fix them as needed.

Keep in mind I’m not doing any ML training, so having a larger sample size doesn’t benefit me. I wanted a large number of test images in order to measure the expected accuracy of the OCR and algorithm in real-world settings. My plan now is to build a stand-alone app for testing during surveying, improve the recognition by building better spatial models of how the text is laid out, and then finally integrate it into Go Map!!

I’m working on this at https://github.com/bryceco/OpeningHoursPhoto but the code is super rough at this point.
The image set is at https://gomaposm.com/opening_hours/opening_hours.zip (12.5GB download)

Bryce

_______________________________________________
talk mailing list
talk at openstreetmap.org
https://lists.openstreetmap.org/listinfo/talk


