[Gsoc-orga] GSoC updates for August/31th
Sarah Hoffmann
lonvia at denofr.de
Sun Sep 4 08:33:25 UTC 2022
Let me summarize this once more.
What is required to get a pass on GSoC from us:
1. PR with the latest version of the code changes you have so far.
2. A report describing the experiments, findings and road blocks
of the project.
What is _not_ required to get a pass on GSoC from us:
1. Merging of the PR.
2. Results as described in the initial project proposal,
if (and only if) the report describes why they were not reached.
There is plenty of time to bring this project to a conclusion.
Please sit down now and write up what you have done. This must be
your _only_ task to focus on in the next week. Do not get
tempted to do more testing or experiments. Just write the report.
Sarah
On Sun, Sep 04, 2022 at 01:39:42AM +0200, marc tobias wrote:
> Hi Tareq,
>
> The research and discussion of data issues you're doing is valuable. In
> a short amount of time you became knowledgeable about these files, the
> import and the importance algorithm.
>
> It's fine if the project ends without a successful import (or a merged
> PR) as long as the final project report remains usable. In the report
> you can explain the original plan, what was attempted, when and how
> issues were discovered and what the current status is. Finally you
> can write down your ideas on how one could overcome the issues or what
> further research is needed. Imagine another student wanted to continue
> your work in 3 months and has again 3 months time to do so. Or another
> full-time engineer with unlimited time, patience and budget.
>
> (Realistically it'll be either Sarah or me attempting to continue in
> a couple of months or we try to make it a student project next GSoC.
> https://github.com/osm-search/wikipedia-wikidata was a project that
> generated a data file once. It took me 18 months to get it fully
> working again but I'm super grateful to the student's work. She did
> everything right, finished the project, documented everything but like
> you ran out of time in the final weeks).
>
> Like https://www.openstreetmap.org/user/tareqpi/diary/399655 we'd like
> to present to the other OpenStreetMap mentors, in fact to as many
> OpenStreetMap people as possible, what the project idea was and what
> the challenges were, and give them hope that someday the tile view data
> can improve the importance score. It's something nobody has done yet.
> I've worked with geocoding systems for almost 20 years now and I'm not
> aware of anybody using such data.
>
> Zoom 15 vs zoom 18: Add your pro and contra arguments into the final
> report. There's no perfect answer but we're interested in your expert
> opinion. I say expert because nobody has spent as much time thinking
> about the trade-offs as you have.
>
> We're limited by Google's rules. We can't move the deadline for
> submission, we can't give a pass if the report is bad (e.g. if you try
> to submit just 1 page or it's not readable) and we're sadly not even
> allowed to say if you passed or not until weeks later (the most stupid
> rule of them all in my opinion).
>
> It's certainly several days of work. First draft of the report probably
> a day, then feedback, then preparing the PR, then polishing the report.
> I don't know how much you've prepared on your computer already.
>
> We know it's a stressful last week. Every student, and we hear the same
> from other mentors, struggles with the final report. In the first week
> we already tell students to reserve a full week because it's so easy
> to underestimate the task.
>
> I can only finish by saying: don't give up yet. We accept the technical
> difficulties, it's part of software engineering and won't reflect too
> badly on the project. The more important part now is documenting the
> research and learnings and bringing the work into a presentable state.
>
> We're looking forward to more updates from you and definitely let us
> know if we can help.
>
> Would it help you to move the next meeting to Tuesday or Thursday?
>
> All the best
>
> marc tobias
>
>
>
>
> On 03.09.22 19:17, Tareq Al-Ahdal wrote:
> > Hi,
> >
> > After I understood in the last meeting that the original
> > correlation graph that I have created (the one I shared previously in
> > this email thread after the meeting) was correct and that the problem is
> > with the data itself, I was trying to understand and fix the problem.
> > The first possible solution that I thought of was the suggestion that I
> > made in the reply to the original email message of this thread which is
> > to crop a city/country that has no unusual spikes to have a clean sample
> > so that we feed it into the mechanism to verify the results, but that
> > wouldn't work for the reason that will be later mentioned here. I then
> > did the reprojection hack with the centroids of placex to double check
> > if the issue is with how we reprojected the raster, but it isn't the
> > case after I have checked the correlation graph when doing so.
> > Furthermore, I have tried to implement different alternative
> > normalization techniques, min-max normalization and the z-score
> > normalization, to see if the issue is with how I normalized the data,
> > but nothing is wrong there either. Because each zoom level is
> > essentially a different set of tiles in the tile server logs, my
> > concern is that the data stored in the GeoTIFF file that we are using
> > is the view data of zoom level 18 tiles only. That means that
> > the views number inside the GeoTIFF will hugely favor the smaller
> > places, hence why the museum that we viewed using QGIS during the last
> > meeting had a high view count. This means the data inside the GeoTIFF
> > file does not reflect the true importance of each place on the map since
> > the view data of each zoom level will favor places of certain size
> > corresponding to that zoom level. For example, users won't need to zoom
> > in that much to view a big place like a city, so zoom 18 tiles won't be
> > served by the server and the view count for zoom 18 tiles will not
> > increase. Yesterday, I used a zoom 15 GeoTIFF that Sarah created
> > from a PNG for further testing. However, each zoom level is biased
> > towards places of a certain size, and the graph for the zoom 15
> > GeoTIFF confirms this: it does not show any correlation.
> > Moreover, the original GeoTIFF does not aggregate the view data of
> > all zoom levels from the server: the attached screenshot shows that
> > some places in Latvia have more views in the zoom 15 GeoTIFF than
> > the same places have in the original GeoTIFF file. The zoom level
> > unidimensionality of the original GeoTIFF file (zoom 18) and the one
> > that was created with the PNG (zoom 15) means that the project is
> > infeasible from the start, and that sooner or later we would have found
> > out about this. I am very sad about this fact and will most likely
> > withdraw since the project is at a dead end.
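
For readers following the normalization point above: a minimal sketch of the two techniques named in this message, min-max and z-score normalization. The function names and the sample view counts are hypothetical and not taken from the project code. Both transforms are linear rescalings, so they keep the ordering of places and leave a spike just as dominant, which fits the observation that changing the normalization did not help.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly into the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Centre values on the mean and divide by the population std deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical view counts; the last entry plays the role of a spike.
views = [3, 10, 25, 40, 100000]
scaled = min_max_normalize(views)
standardized = z_score_normalize(views)
```

With these sample numbers the four ordinary places end up squeezed close to 0 after min-max scaling, while the spike sits at 1.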
> >
> > Tareq
> >
> >
> > On Sat, Sep 3, 2022 at 6:19 AM marc tobias <mtmail at gmx.net> wrote:
> >
> > Hi Tareq,
> >
> > Just checking if you made progress on the goals for Friday. The
> > document is empty
> > https://docs.google.com/document/d/14l--OZyxxaEOLfSprAxmSHuSlvDLxPV50KndKLyY3CQ/edit
> > and there haven't been changes to
> > https://github.com/osm-search/Nominatim/pull/2779
> > yet.
> >
> > All the best
> >
> > marc tobias
> >
> >
> > On 01.09.22 03:29, marc tobias wrote:
> > > Hi,
> > >
> > > That was our longest meeting so far. Please reply with any
> > > additions, corrections, questions you might have.
> > >
> > > Attached two screenshots from the meeting.
> > >
> > >
> > > Discussed today:
> > > ======================================================
> > > - Complaint that Tareq didn't send an update email before the meeting
> > > (again). Emails from Sarah about reprojection and Marc about creating
> > > a GeoTIFF extract were also not acknowledged. We didn't know whether
> > > any work had been done before the meeting, or what.
> > >
> > > - Tareq showed the log() algorithm that converts view numbers into
> > > importance (0..1) to Sarah. He showed the same in the previous
> > > meeting to Marc.
> > >
> > > - Tareq showed two graphs on importance scores. The first showed a
> > > correlation of views to importance. (I'm no longer sure what the
> > > second graph showed).
> > >
> > > - Sarah points out that we asked for a correlation between the
> > > existing importance of places (based on wikipedia data) and the
> > > new importance (based on tile views (35%) and wikipedia data (65%)).
> > > It was listed as a goal for the next meeting twice already and we
> > > don't understand why the task was ignored.
> > >
> > > - Reprojecting the geotiff from 3857->4326 makes the import much
> > > faster: 20 minutes instead of 3 hours. The database table
> > > ('osm_views') is also much smaller: 77 megabytes instead of several
> > > gigabytes.
> > >
> > > - The view counts in the 'osm_views' table are now floating point;
> > > they were integers in the past. The reason is that the reprojection
> > > needs to combine multiple cells. We think we could round the numbers
> > > without much impact on the accuracy.
> > >
> > > - Database named 'mini_nominatim' currently contains the country
> > > Latvia. Database named 'nominatim' contains the whole world. For
> > > the world the 'osm_views' table is up to date, but the 'place_views'
> > > table isn't yet.
> > >
> > > - Tareq says the place_views table was created using this SQL:
> > >
> > > CREATE TABLE place_views AS (
> > > SELECT placex.place_id,
> > > ST_Value(osm_views.rast, placex.centroid) AS views
> > > FROM placex, osm_views
> > > WHERE ST_Intersects(osm_views.rast, placex.centroid));
> > >
> > > - Sarah points out that ST_Intersects in the SQL should be called
> > > with the raster's convex hull (ST_ConvexHull). Then the index will
> > > get used and the query will be much faster.
> > >
> > > - Sarah's query for listing top German cities by importance:
> > >
> > > SELECT v.place_id, name->'name', views
> > > FROM place_views v, placex p
> > > WHERE v.place_id = p.place_id AND country_code = 'de'
> > > AND rank_address = 16 AND type = 'city'
> > > ORDER BY views DESC;
> > >
> > > This produces unexpected/surprising output. In the past it printed
> > > München / Munich
> > > Berlin
> > > Frankfurt am Main
> > > Hannover / Hanover
> > >
> > > Those are some of the biggest cities (by population). A lot of
> > > OSM tile views and thus high importance makes sense.
> > >
> > > https://en.wikipedia.org/wiki/List_of_cities_in_Germany_by_population
> > >
> > > Today the output was
> > > Bielefeld
> > > Offenbach am Main
> > > Berlin
> > > Kassel
> > > Leverkusen
> > >
> > > With the exception of Berlin those are much smaller cities.
> > >
> > > We looked at Frankfurt am Main. There is a local spike in the number
> > > of tile views nearby at a museum (https://www.feldbahn-ffm.de/anfahrt/
> > > loads an OSM map on their website):
> > > https://www.openstreetmap.org/node/392801012
> > > Such spikes cause the importance of the nearby Frankfurt am Main
> > > to be underreported.
> > >
> > > We determined that the new process of importing the reprojected
> > > geotiff file produces worse output. For the GSoC project overall
> > > that means while we have a somewhat working import and importance
> > > calculation, the input file isn't usable, or at least not usable
> > > yet. Fixing the input file would require a lot of extra work (Sarah
> > > estimates 2 weeks full-time work) which would not fit into the GSoC
> > > timeline.
> > >
> > > It's a disappointing outcome. Tareq now needs to document well the
> > > steps he has taken during the GSoC project and list how another
> > > software developer can continue the project.
> > >
> > > We think the raw data from the OSM tile servers need to be
> > > normalized better (smoothing) to filter out outliers/spikes.
> > >
> > > - Short discussion on the overall GSoC project: time is running
> > > out and the final report writing is expected to need at least
> > > one week. At this point in time we can only wrap up and well
> > > document what was done. The resulting PR won't be able to be
> > > merged until another software engineer continues work.
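
To make the scoring discussed in the notes above easier to follow: a minimal sketch of a log() scaling of view counts into 0..1, combined with the existing wikipedia importance at the 35%/65% weights mentioned in the notes. The exact formula used in the PR is not given in this thread, so the log1p ratio, the function names and the max_views parameter are assumptions for illustration only.

```python
import math

# Weights taken from the meeting notes: 35% tile views, 65% wikipedia data.
VIEWS_WEIGHT = 0.35
WIKIPEDIA_WEIGHT = 0.65


def views_to_importance(views, max_views):
    """Map a raw view count to 0..1 on a logarithmic scale.

    Assumption: log1p(views) / log1p(max_views). The PR may use a
    different formula.
    """
    if max_views <= 0:
        return 0.0
    clipped = min(max(views, 0.0), max_views)
    return math.log1p(clipped) / math.log1p(max_views)


def combined_importance(views, max_views, wikipedia_importance):
    """Blend view-based and wikipedia-based importance (both in 0..1)."""
    return (VIEWS_WEIGHT * views_to_importance(views, max_views)
            + WIKIPEDIA_WEIGHT * wikipedia_importance)
```

Because the view-based score is relative to max_views in this sketch, a single very large spike elsewhere in the raster lowers the score of every other place.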
> > >
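
The correlation Sarah asked for (existing importance against new importance) could be checked along these lines. This is only a sketch: it computes a plain Pearson coefficient, and the sample scores are made up. In practice the two lists would come from the database, one pair of values per place.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one side is constant.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical importance values for four places.
existing_importance = [0.2, 0.4, 0.6, 0.8]
new_importance = [0.25, 0.35, 0.65, 0.75]
r = pearson(existing_importance, new_importance)
```

A value of r near 1 would mean the new score mostly agrees with the existing one; a value near 0 would match the "no correlation" graphs described in this thread.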
> > >
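
One possible form of the smoothing mentioned in the notes is a 3x3 median filter over the grid of view counts. This is an assumption chosen for illustration, not something decided in the meeting, and the grid values are invented. The filter replaces each cell with the median of its neighbourhood, so an isolated spike such as a single embedded map on a website disappears.

```python
from statistics import median


def median_smooth(grid):
    """Apply a 3x3 median filter to a 2D list of view counts.

    Cells at the border use the part of the window that lies inside the grid.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    smoothed = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            out_row.append(median(window))
        smoothed.append(out_row)
    return smoothed


# Hypothetical 3x3 cut-out of view counts with one spike in the centre.
tiles = [
    [10, 12, 11],
    [9, 5000, 13],
    [11, 10, 12],
]
smoothed_tiles = median_smooth(tiles)
```

The trade-off is that a median filter also flattens genuine small hotspots, so the window size would need testing against real data.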
> > > Goals for Friday Sept/2nd
> > > ======================================================
> > >
> > > - Tareq to merge the work of importance scoring on the PR (2779?)
> > > for data import.
> > >
> > > - Work on any PR feedback Sarah is providing. For example the
> > > database name 'nominatim' is currently hard-coded.
> > >
> > > - Create first draft of the final report. We agreed on Google
> > > Doc
> > > (https://docs.google.com/document/d/14l--OZyxxaEOLfSprAxmSHuSlvDLxPV50KndKLyY3CQ/edit).
> > >
> > >
> > > - Sarah and Marc will be able to provide feedback over the
> > > weekend.
> > >
> > >
> > > Goals for next meeting (Wednesday Sept/7th)
> > > ======================================================
> > >
> > > - Tareq to send an update before the meeting.
> > >
> > > - Final report expected to be almost finished.
> > >
> > > - PR expected to be finished.
> > >
> > > - We will discuss final diary entry, where to put the
> > > final document (most likely a PDF file) and any open
> > > tasks.
> > >
> > >
> > > All the best
> > >
> > > marc tobias
> >