[Talk-GB] TfL cycle data published - proposed conflation process
Martin - CycleStreets
list-osm-talk-gb at cyclestreets.net
Sun Oct 13 13:17:05 UTC 2019
I've been looking at the various tools available, e.g. JOSM Conflation,
Hootenanny, OpenStreetMap Live Conflation, etc.
Whatever tool is best, a process is needed. May I seek comments on this
proposal, which would be put to TfL in my report to them:
----
A proposed process for conflation would be:
1. Seek OSM community agreement in principle that the CID data is useful
for OSM (done).
2. Confirm licensing compatibility (done).
3. Consult on proposed technical translation of data (as per discussion in
this talk-gb thread, ongoing).
https://bikedata.cyclestreets.net/tflcid/conversion/
4. Write a programming script based on this technical translation, which
converts the CID data (using the version containing OSM IDs) to .osm
format. The fundamental aim of this is to make the CID data as
compatible with OSM norms as possible, so that the amount of effort needed in
the eventual conflation tool will be as low as possible. This converted
dataset is referred to below as the “External OSM-compatible format
dataset”. This will require expert programming work undertaken by a
programmer fully conversant with the OSM data model. Estimate: 10 days.
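
As a rough illustration of step 4, the sketch below converts GeoJSON-style point and line features into .osm XML. It is a minimal sketch only: the `asset_type` property, the `TAG_MAP` entries and the function name are hypothetical placeholders, not the translation under consultation at the URL in step 3, and it assumes the CID data has already been read into GeoJSON-style dictionaries. New objects are given negative IDs, which is the convention for objects that do not yet exist in OSM.

```python
import xml.etree.ElementTree as ET

# Illustrative mapping only (hypothetical asset_type values); the real
# translation is the one being consulted on in step 3.
TAG_MAP = {
    "cycle_parking": {"amenity": "bicycle_parking"},
    "cycle_track": {"highway": "cycleway"},
}


def features_to_osm(features):
    """Convert GeoJSON-style Point/LineString features to an OSM XML string."""
    root = ET.Element("osm", version="0.6", generator="cid-conversion-sketch")
    nodes, ways = [], []
    next_id = -1  # negative IDs mark objects that do not yet exist in OSM

    for feature in features:
        geometry = feature["geometry"]
        tags = TAG_MAP.get(feature["properties"].get("asset_type"))
        if tags is None:
            continue  # unmapped asset type: skip rather than guess
        if geometry["type"] == "Point":
            coords = [geometry["coordinates"]]
        elif geometry["type"] == "LineString":
            coords = geometry["coordinates"]
        else:
            continue  # other geometry types are out of scope for this sketch

        refs = []
        for lon, lat in coords:  # GeoJSON order is lon, lat
            nodes.append(
                ET.Element("node", id=str(next_id), lat=str(lat), lon=str(lon))
            )
            refs.append(next_id)
            next_id -= 1

        if geometry["type"] == "Point":
            target = nodes[-1]  # tags go on the node itself
        else:
            target = ET.Element("way", id=str(next_id))
            next_id -= 1
            for ref in refs:
                ET.SubElement(target, "nd", ref=str(ref))
            ways.append(target)
        for key, value in tags.items():
            ET.SubElement(target, "tag", k=key, v=value)

    # Nodes are written before the ways that reference them.
    root.extend(nodes)
    root.extend(ways)
    return ET.tostring(root, encoding="unicode")
```

Skipping unmapped asset types, rather than emitting untagged objects, keeps the output limited to what the agreed translation covers.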
5. ALPHA STAGE: small-scale merge of data into OSM. This stage aims to
prove that the data is capable of being converted, and to demonstrate to
the OSM community that it can be undertaken sensitively and accurately. It
does not seek to produce a tool selection recommendation. This work should
be undertaken by someone with experience of JOSM and the JOSM Conflation
plugin. Estimate: 3-5 days.
i. Identify a suitable extract of the CID data covering only an area of
10-20 smaller streets. This should be an outer London area, and avoid main
roads, so that in the event that problems materialise, the effect on real
users of OSM data is low. It should include both point-based and line-based
assets, giving a good overview. It should aim to have a good variety of CID
assets rather than the same type of asset dominating.
ii. Install the JOSM editor and the JOSM Conflation plugin, which provides
a toolset for this alpha project. JOSM Conflation is the most sensible
option, as this is the most widely used conflation tool in the OSM community.
Although it requires manual inspection, it is workable for an alpha project
at this smaller scale.
iii. Attempt a merge of the External OSM-compatible format dataset using
this tool.
iv. Carefully and thoroughly observe the correctness of the data, iterating
the script output and repeating these alpha steps until correctness is
achieved.
v. Save the merged import data into the live OSM dataset and request
community feedback.
vi. Manually fix up any identified problems arising from this feedback so
that there is correctness, and fix the underlying problem in the script.
vii. At this point, feasibility of conversion has been established, and
community confidence will be much stronger.
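
The extract selection in step 5.i could be checked with something like the following. This is a sketch under stated assumptions: features are GeoJSON-style dictionaries with a hypothetical `asset_type` property, the bounding box is chosen by hand, and the `max_share` threshold of 0.5 is an arbitrary illustration of "no single asset type dominating", not an agreed figure.

```python
from collections import Counter


def select_extract(features, bbox, max_share=0.5):
    """Return the features inside bbox, their type counts, and a variety flag.

    bbox is (min_lon, min_lat, max_lon, max_lat). The flag is True only if
    both point and line assets are present and no single asset type makes
    up more than max_share of the extract.
    """
    min_lon, min_lat, max_lon, max_lat = bbox

    def inside(feature):
        geometry = feature["geometry"]
        coords = geometry["coordinates"]
        points = [coords] if geometry["type"] == "Point" else coords
        return all(
            min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
            for lon, lat in points
        )

    selected = [f for f in features if inside(f)]
    counts = Counter(f["properties"]["asset_type"] for f in selected)
    geometry_types = {f["geometry"]["type"] for f in selected}
    dominant_share = max(counts.values()) / len(selected) if selected else 0.0
    varied = (
        {"Point", "LineString"} <= geometry_types
        and dominant_share <= max_share
    )
    return selected, counts, varied
```

Lines are kept only if every vertex falls inside the box, so the extract does not spill onto roads outside the chosen area.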
6. BETA STAGE: larger-scale merge of data for one area. This stage aims to
identify the best merging tool for a fuller conversion with a view to
creating a fully-optimised workflow. Estimate: 4-8 weeks.
i. Identify a suitable extract of the CID data to undertake a pilot
conversion project. One of the 25 CID data packages would be an ideal size
for such an evaluation, and each package is likely to contain sufficient
variety.
ii. Identify the 2-3 most likely merging tools, e.g. JOSM Conflation and
Hootenanny (see below).
iii. Install each such merging tool and learn and practice its use. The
time required for such installation and evaluation should not be
underestimated. These systems involve widely different technologies (even
requiring different operating systems to be installed using a Virtual
Machine), so this step could easily take 5 days. Test data will need to be
prepared and trial runs created, and questions are likely to need to be
asked on mailing lists, etc.
iv. Identify the pros and cons of each tool and move towards a recommended
solution based on trialling with the data and the amount of manual fixing up
required.
v. Determine and iterate the workflow required for the tool.
vi. Adapt the now near-final script to perform conversion of this larger
dataset for the selected tool. It is likely that the bulk of the conversion
script will be unchanged, but that the final output format (e.g.
.osm/Shapefile/GeoJSON) would need to be different based on the tool’s
expectations.
vii. Substantial iteration of the conversion script and/or tool workflow is
then likely to be required. For instance, merging will involve conflating
data from a cycle lane in the CID data to the cycle lane present nearby
in OSM. This scenario is likely to throw up several potential issues. For
instance, the OSM ID may in fact now have changed; it might now be
represented by multiple separate OSM IDs; there might be multiple cycle
lanes nearby which need to be disambiguated, etc. Another example would be
the inconsistent tagging of cycle lane/track-related data in OSM, which is
acknowledged to be one of the most complex areas of OSM. The script will
need to be adapted to deal with various edge-cases like these, so that the
geometries and metadata are matched together correctly and that existing
OSM data that should be retained is not overwritten.
viii. Inspect the conflated data and determine where manual inspection will
be unavoidable vs. where fixes can be automated.
ix. Identify whether any upstream improvements to the conflation tool being
used could be made, with a view to facilitating further automation of the
workflow and reducing the need for repetitive manual inspection that is
avoidable. Liaise with the tool authors to determine feasibility and likely
time requirements for such development work.
x. Iterate the script and workflow to minimise as far as possible the need
for these manual changes during an inspection stage.
xi. Document a key checklist of conversion types to check.
xii. Carefully and thoroughly observe the correctness of the data,
iterating the script output and repeating these beta steps until
correctness is achieved. Undertake manual changes that cannot be automated.
The time required for this should not be underestimated – there will be
around 10,000 assets within the data package, and all the various
combinations of data should be checked.
xiii. Report to the OSM community at this stage, seeking their consent for
merging in the data.
xiv. Save the merged import data into the live OSM dataset and request
community feedback.
xv. Manually fix up any identified problems arising from this feedback so
that there is correctness, and fix the underlying problem in the script.
xvi. At this point, feasibility and timescale for conversion have been
established, and community confidence will be much stronger. The script
will be in a near-final state for a mass import, and a set of instructions
for manual inspection will be established. One of the 25 areas will be in
OSM and this data will be picked up by routing and cartography systems,
entering real-world use within days/weeks.
xvii. Relay back to TfL the findings, in the form of a short document. This
will:
a. Confirm what data within the CID has and has not been imported.
b. Include an estimate of the time requirement for the remaining 24 areas,
based on an extrapolation of applying the finalised script and manual
procedures.
c. Recommend whether this activity should be undertaken on a
paid professional basis or whether crowdsourcing is realistic given the
time, complexity and data volume.
d. Include any proposals for making improvements to tools and the likely
cost, which TfL may wish to consider funding.
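
The edge cases described in step 6.vii (an OSM ID that has since changed, or several cycle lanes nearby) might be triaged along the following lines before any manual inspection. This is a sketch, not a proposal for the actual matching logic: the `osm_id` and `coordinates` keys, the 20 metre radius and the status names are all hypothetical, and comparing the mean of each line's vertices is a deliberately crude stand-in for proper geometry matching.

```python
import math


def haversine_m(a, b):
    """Great-circle distance in metres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371000 * math.asin(math.sqrt(h))


def midpoint(coords):
    """Crude representative point: the mean of a line's vertices."""
    lons, lats = zip(*coords)
    return (sum(lons) / len(lons), sum(lats) / len(lats))


def triage(cid_feature, osm_ways, radius_m=20.0):
    """Classify one CID line against the current OSM ways.

    osm_ways maps current OSM way ID -> list of (lon, lat) vertices.
    Returns (status, candidate_ids) where status is one of
    'id_ok', 'rematched', 'ambiguous' or 'unmatched'.
    """
    recorded_id = cid_feature.get("osm_id")
    if recorded_id in osm_ways:
        return "id_ok", [recorded_id]

    # The recorded ID no longer exists: look for ways near the CID line.
    centre = midpoint(cid_feature["coordinates"])
    nearby = sorted(
        way_id
        for way_id, coords in osm_ways.items()
        if haversine_m(centre, midpoint(coords)) <= radius_m
    )
    if len(nearby) == 1:
        return "rematched", nearby
    if nearby:
        return "ambiguous", nearby  # needs disambiguation by a person
    return "unmatched", []
```

Anything other than `id_ok` would still go to a person; the point of the triage is only to sort the manual inspection queue.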
7. FINAL STAGE: full merger. This step involves re-running the finalised
script/workflow and manual procedures for each of the 24 remaining data
package areas. Estimate: as defined in beta report.
i. Run the script to convert the data for the 24 data package areas.
ii. Conduct the workflow for each of the 24 data package areas.
iii. Seek community input as this work proceeds.
iv. Import the data and fix up issues arising from feedback.
v. Report back to the OSM community.
vi. Produce a final report for TfL confirming completion of the activity.
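
The time estimate for the 24 remaining areas (step 6.xvii.b, used in step 7) could be extrapolated from the beta figures roughly as follows. This is a sketch that assumes manual effort scales linearly with the number of assets, which the beta stage would need to confirm; the function and parameter names are hypothetical, and any figures passed in are whatever the beta stage actually measures.

```python
def estimate_remaining_days(
    beta_manual_days, beta_assets, remaining_assets, fixed_days_per_area=0.0
):
    """Extrapolate the effort for the remaining data package areas.

    beta_manual_days: manual inspection days spent on the beta area.
    beta_assets: number of assets in the beta area.
    remaining_assets: asset counts, one per remaining area.
    fixed_days_per_area: per-area overhead (running the script, reporting).
    Assumes manual effort is proportional to asset count.
    """
    days_per_asset = beta_manual_days / beta_assets
    return sum(
        count * days_per_asset + fixed_days_per_area
        for count in remaining_assets
    )
```

Passing one asset count per area, rather than a single average, lets the estimate reflect packages that are larger or smaller than the beta area.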
Martin, Developer, CycleStreets
CycleStreets - For Cyclists, By Cyclists
https://www.cyclestreets.net/