[Talk-GB] TfL cycle data published - proposed conflation process

Sun Oct 13 16:38:50 UTC 2019

(1)

I would also suggest generating one
big OSM file from this data (without conflation,
just what would be imported into an unmapped area)
and running the JOSM validator on it.
It may find bugs in the data, in the proposed
conversion, and in JOSM itself.
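
A minimal sketch of such a generator in Python, for point assets only.
The input layout (dicts with "lat", "lon" and "tags") and the generator
name are my assumptions for illustration, not the actual CID schema or
the conversion script. New objects get negative IDs, which is how a
JOSM-readable .osm file marks objects that do not exist in OSM yet.

```python
import xml.etree.ElementTree as ET


def build_osm(points):
    """Build an <osm> element from dicts with 'lat', 'lon' and 'tags' keys.

    The dict layout is a hypothetical stand-in for converted CID records.
    """
    root = ET.Element("osm", version="0.6", generator="cid-sketch")
    for index, point in enumerate(points, start=1):
        # Negative IDs mark objects that are new (not yet uploaded).
        node = ET.SubElement(
            root,
            "node",
            id=str(-index),
            lat=f"{point['lat']:.7f}",
            lon=f"{point['lon']:.7f}",
        )
        for key, value in sorted(point["tags"].items()):
            ET.SubElement(node, "tag", k=key, v=str(value))
    return root


def write_osm(root, path):
    """Write the element tree to disk as a UTF-8 .osm file."""
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


# Made-up sample data, not taken from the CID.
sample = [
    {"lat": 51.5007, "lon": -0.1246,
     "tags": {"amenity": "bicycle_parking", "capacity": 10}},
    {"lat": 51.5033, "lon": -0.1195,
     "tags": {"amenity": "bicycle_parking", "capacity": 4}},
]
osm_root = build_osm(sample)
```

Calling write_osm(osm_root, "cid_unconflated.osm") would then give a file
that can be opened in JOSM and checked with the validator.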

--------
(2)
I would also advise consulting the
OSM community after steps 4, 5 and 6.

Just post on talk-gb and process the feedback.

------
(3)
What is also missing:
- posting to the imports mailing list
- obtaining permission from the OSM community for the import
(assuming the process continues to go as well
as it has so far, this should not be a problem)
- documenting the newly proposed tags on the OSM Wiki
and getting feedback
(a full proposal process is not necessary,
though it may be considered, but at least
post about the new tags on the tagging mailing list;
things like this often benefit from additional review)

In any case, mappers should be able to check
exactly what will be changed.

Agreeing in principle that the data may be useful
does not mean that any import is OK.

---------
(4)

Have you considered importing
some topics separately?

For example, in the first run import just
bicycle parking.
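
Splitting by topic could be as simple as filtering the converted
features before writing them out. A sketch in Python, assuming each
feature carries a hypothetical "asset_type" field (the real CID
attribute names and values may differ):

```python
from collections import Counter


def count_by_topic(features):
    """Count features per asset type, to see how large each topic is."""
    return Counter(feature["asset_type"] for feature in features)


def select_topic(features, asset_type):
    """Keep only the features belonging to one topic."""
    return [f for f in features if f["asset_type"] == asset_type]


# Made-up sample data; "asset_type" values are illustrative only.
features = [
    {"asset_type": "cycle_parking", "lat": 51.5007, "lon": -0.1246},
    {"asset_type": "cycle_lane", "lat": 51.5010, "lon": -0.1250},
    {"asset_type": "cycle_parking", "lat": 51.5033, "lon": -0.1195},
    {"asset_type": "signal", "lat": 51.5040, "lon": -0.1201},
]

topic_counts = count_by_topic(features)
first_run = select_topic(features, "cycle_parking")
```

Here first_run would hold only the parking features, which could be
reviewed and imported on their own before the more complex topics.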

---------

 13 Oct 2019, 15:17 by list-osm-talk-gb at cyclestreets.net:

>
>
> I've been looking at the various tools available, e.g. JOSM Conflation, Hootenanny, OpenStreetMap Live Conflation, etc.
>
> Whatever tool is best, a process is needed. May I seek comments on this proposal which would be put to TfL in my report to them:
>
>
> ----
>
>
> A proposed process for conflation would be:
>
>
> 1. Seek OSM community agreement in principle that the CID data is useful for OSM (done).
>
>
> 2. Confirm licensing compatibility (done).
>
>
> 3. Consult on proposed technical translation of data (as per discussion in this talk-gb thread, ongoing).
> https://bikedata.cyclestreets.net/tflcid/conversion/
>
>
> 4. Write a script based on this technical translation, which converts the CID data (using the version containing OSM IDs) to .osm format. The fundamental aim of this is to get the CID data to be as compatible with OSM norms as possible, so that the amount of effort needed in the eventual conflation tool will be as low as possible. This converted dataset is referred to below as the “External OSM-compatible format dataset”. This will require expert programming work undertaken by a programmer fully conversant with the OSM data model. Estimate: 10 days.
>
>
> 5. ALPHA STAGE: small-scale merge of data into OSM. This stage aims to prove that the data is capable of being converted, and to demonstrate to the OSM community that it can be undertaken sensitively and accurately. It does not seek to produce a tool selection recommendation. This work should be undertaken by someone with experience of JOSM and the JOSM Conflation plugin. Estimate: 3-5 days.
>
> i. Identify a suitable extract of the CID data covering only an area of 10-20 smaller streets. This should be an outer London area, and avoid main roads, so that in the event that problems materialise, the effect on real users of OSM data is low. It should include both point-based and line-based assets, giving a good overview. It should aim to have a good variety of CID assets rather than the same type of asset dominating.
>
> ii. Install the JOSM editor and the JOSM Conflation plugin, which provides a toolset for this alpha project. JOSM Conflation is the most sensible option, as this is the most widely used conflation tool in the OSM community. Although it requires manual inspection, it is workable for an alpha project at this smaller scale.
>
> iii. Attempt a merge of the External OSM-compatible format dataset using this tool.
>
> iv. Carefully and thoroughly observe the correctness of the data, iterating the script output and repeating these alpha steps until correctness is achieved.
>
> v. Save the merged import data into the live OSM dataset and request community feedback.
>
> vi. Manually fix up any identified problems arising from this feedback so that there is correctness, and fix the underlying problem in the script.
>
> vii. At this point, feasibility of conversion has been established, and community confidence will be much stronger.
>
>
> 6. BETA STAGE: larger-scale merge of data for one area. This stage aims to identify the best merging tool for a fuller conversion with a view to creating a fully-optimised workflow. Estimate: 4-8 weeks.
>
> i. Identify a suitable extract of the CID data to undertake a pilot conversion project. One of the 25 CID data packages would be an ideal size for such an evaluation, and each package is likely to contain sufficient variety.
>
> ii. Identify 2-3 most likely merging tools, e.g. JOSM Conflation and Hootenanny (see below).
>
> iii. Install each such merging tool and learn and practice its use. The time required for such installation and evaluation should not be underestimated. These systems involve widely different technologies (even requiring different operating systems to be installed using a Virtual Machine), so this step could easily take 5 days. Test data will need to be prepared, trial runs created, questions are likely to need to be asked on mailing lists, etc.
>
> iv. Identify the pros and cons of each tool and move towards a recommended solution based on trialing with the data and the amount of manual fixing up required.
>
> v. Determine and iterate the workflow required for the tool.
>
> vi. Adapt the now near-final script to perform conversion of this larger dataset for the selected tool. It is likely that the bulk of the conversion script will be unchanged, but that the final output format (e.g. .osm/Shapefile/GeoJSON) would need to be different based on the tool’s expectations.
>
> vii. Substantial iteration of the conversion script and/or tool workflow is then likely to be required. For instance, merging will involve conflating data from a cycle lane in the CID data to the cycle lane present in OSM nearby. This scenario is likely to throw up several potential issues. For instance, the OSM ID may in fact now have changed; it might now be represented by multiple separate OSM IDs; there might be multiple cycle lanes nearby which need to be disambiguated, etc. Another example would be the inconsistent tagging of cycle lane/track-related data in OSM, which is acknowledged to be one of the most complex areas of OSM. The script will need to be adapted to deal with various edge-cases like these, so that the geometries and metadata are matched together correctly and that existing OSM data that should be retained is not overwritten.
>
> viii. Inspect the conflated data and determine where manual inspection will be unavoidable vs. where fixes can be automated.
>
> ix. Identify whether any upstream improvements to the conflation tool being used could be made, with a view to facilitating further automation of the workflow and reducing the need for avoidable, repetitive manual inspection. Liaise with the tool authors to determine feasibility and likely time requirements for such development work.
>
> x. Iterate the script and workflow to minimise as far as possible the need for these manual changes during an inspection stage.
>
> xi. Document a key checklist of conversion types to check.
>
> xii. Carefully and thoroughly observe the correctness of the data, iterating the script output and repeating these beta steps until correctness is achieved. Undertake manual changes that cannot be automated. The time required for this should not be underestimated – there will be around 10,000 assets within the data package, and all the various combinations of data should be checked.
>
> xiii. Report to the OSM community at this stage, seeking their consent for merging in the data.
>
> xiv. Save the merged import data into the live OSM dataset and request community feedback.
>
> xv. Manually fix up any identified problems arising from this feedback so that there is correctness, and fix the underlying problem in the script.
>
> xvi. At this point, feasibility and timescale for conversion have been established, and community confidence will be much stronger. The script will be in a near-final state for a mass import, and a set of instructions for manual inspection will be established. One of the 25 areas will be in OSM and this data will be picked up by routing and cartography systems entering real-world use within days/weeks.
>
> xvii. Relay back to TfL the findings, in the form of a short document. This will:
>
> a. Confirm what data within the CID has and has not been imported.
>
> b. Include an estimate of the time requirement for the remaining 24 areas, based on an extrapolation of applying the finalised script and manual procedures.
>
> c. Include a recommendation for whether this activity should be undertaken on a paid professional basis or whether crowdsourcing is realistic given the time, complexity and data volume.
>
> d. Include any proposals for making improvements to tools and the likely cost, which TfL may wish to consider funding.
>
>
> 7. FINAL STAGE: full merger. This step involves re-running the finalised script/workflow and manual procedures for each of the 24 remaining data package areas. Estimate: as defined in beta report.
>
> i. Run the script to convert the data for the 24 data package areas.
>
> ii. Conduct the workflow for each of the 24 data package areas.
>
> iii. Seek community input as this work proceeds.
>
> iv. Import the data and fix up issues arising from feedback.
>
> v. Report back to the OSM community.
>
> vi. Produce a final report for TfL confirming completion of the activity.
>
>
>
>
> Martin,                     **  CycleStreets - For Cyclists, By Cyclists
> Developer, CycleStreets     **  https://www.cyclestreets.net/
>
