[OSM-talk] SVG to PDF

Tue Dec 19 23:39:09 GMT 2006

> Can that possibly be true? I've generated PDFs without ever having to
> consider that the decoder might be reading from the end.

Yes, it is true. How have you managed to write a PDF without reading the
spec? :-)

(I worked for many years on a PDF and PostScript interpreter for high
resolution printing - the high-end equivalent of Ghostscript).

I quote from the Adobe PDF spec:

'The trailer of a PDF file enables an application reading the file to
quickly find the cross-reference table and certain special objects.
Applications should read a PDF file from its end. The last line[*] of the
file contains only the end-of-file marker, %%EOF. ... The two preceding
lines contain the keyword startxref and the byte offset from the beginning
of the file to the beginning of the xref keyword in the last
cross-reference section. ...'

[*] 'line' is carefully defined to still have meaning in the presence of
binary data.
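
To make the quoted trailer layout concrete, here is a minimal Python sketch
(an illustration, not taken from the spec) of how a reader might find the
cross-reference section by looking at the end of the file. The function name
find_startxref and the 1024-byte tail window are assumptions chosen for this
example; a real interpreter does far more validation.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword.

    Hypothetical helper: only the final 1024 bytes are searched, on the
    assumption that the trailer sits at the very end of the file.
    """
    tail = data[-1024:]
    # Expect: startxref <EOL> <decimal offset> <EOL> %%EOF
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref/%%EOF pair found near end of data")
    # Take the last match, i.e. the trailer closest to the end of the file.
    return int(matches[-1])
```

The returned number is an offset from the beginning of the file, so the
bytes at that position should be the xref keyword.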

> How does a PDF viewer start displaying a partially downloaded file if
> that's true?

A PDF structured as a 'linearised' PDF has a defined ordering of some
elements: everything in a page must be grouped together, and in particular
there is an offset stored in a dictionary near the beginning of each page
which lets you find the cross-reference table (the list of offsets, just for
that page). If the file doesn't conform to those conventions, the whole file
has to download before any of it can be rendered. (This is the case if a
file has been modified in the way I described.) However, the file must still
be structured as a true PDF, so it still ends with the starting offset. If
the source is a file on a direct-access medium, it will normally be read
from the end in the usual way.
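
As a rough illustration of how a reader might decide which path to take,
the sketch below guesses whether a file claims to be linearised. It relies
on my understanding that such files carry a parameter dictionary with a
/Linearized key within the first 1024 bytes; treat that detail, and the
function name looks_linearised, as assumptions of the example. It is a
heuristic only and does not check that the file really obeys the ordering
rules.

```python
def looks_linearised(data: bytes) -> bool:
    """Heuristic: does the start of the file mention /Linearized?

    Hypothetical helper. It only inspects the first 1024 bytes and does
    not parse the dictionary or verify any offsets.
    """
    return b"/Linearized" in data[:1024]
```

A viewer that gets False here would fall back to waiting for the whole file
and reading from the end.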

So there are ways of taking short cuts if a PDF is carefully structured and
ordered in a way that isn't required by the spec. And there is some
redundancy which is what allows the repair process I mentioned.

Also Adobe has some (patented) methods of displaying text content before any
images, so you can start reading while the viewer renders the more
complicated images. When there are some big images, this can make rendering
look faster than it would if objects were drawn in the correct stacking
order, at the expense that text overlaying an image has to be re-rendered at
the end. This is probably a bit unnecessary now, as machines are so much
faster and you'd barely notice, unless you are looking at a pre-press PDF
with very large images in it, in which case you still will.

Also from the spec, just to reiterate what I said in my own words:
'Note: If a PDF file contains binary data, as most do (see Section 3.1,
Lexical Conventions), it is recommended that the header line be immediately
followed by a comment line containing at least four binary characters - that
is, characters whose codes are 128 or greater. This will ensure proper
behavior of file transfer applications that inspect data near the beginning
of a file to determine whether to treat the file's contents as text or as
binary.'
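
For anyone generating PDFs by hand, here is a small Python sketch of a check
for that recommendation. The function name has_binary_marker and the choice
to look only at the first 1024 bytes are assumptions made for the example,
not anything the spec prescribes.

```python
def has_binary_marker(data: bytes) -> bool:
    """Check that the header line is followed by a binary comment line.

    Hypothetical helper: returns True when the first line starts with
    %PDF- and the second line is a comment (starts with %) containing at
    least four bytes whose values are 128 or greater.
    """
    # bytes.splitlines() splits on \n, \r and \r\n.
    lines = data[:1024].splitlines()
    if len(lines) < 2 or not lines[0].startswith(b"%PDF-"):
        return False
    second = lines[1]
    if not second.startswith(b"%"):
        return False
    # Iterating over bytes yields integers, so compare directly with 128.
    high_bytes = sum(1 for byte in second[1:] if byte >= 128)
    return high_bytes >= 4
```

A writer can satisfy the check by emitting a line such as a percent sign
followed by four arbitrary high-bit bytes straight after the header.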

David.