[Dcmlib] [Fwd: pdf parser for generating XML like document]
Mathieu Malaterre
mathieu.malaterre at kitware.com
Tue Oct 25 21:04:58 CEST 2005
No, PDF has no concept of tables, as such. It's just commands to select
fonts and draw text, and some other commands to draw horizontal lines,
etc.
I don't know of any easy way to convert PDF to XML for the sort of
application you're working on, sorry.
- Derek
-------- Original Message --------
Subject: pdf parser for generating XML like document
Date: Sun, 23 Oct 2005 17:31:56 -0400
Hello,
I searched for a mailing list on the following web site:
http://www.foolabs.com/xpdf/
and since I could not find one, I am writing to you directly.
I have the following problem. DICOM is a file format that is specified
by NEMA at:
http://medical.nema.org/dicom/2004.html
In particular, look at this document (1):
http://medical.nema.org/dicom/2004/04_06PU.PDF
The spec is huge, so I am using pdftotext plus a Python script to
generate a custom output. You can find everything here:
The Python script (basically, it takes as input the output of
`pdftotext -raw -nopgbrk`):
http://cvs.creatis.insa-lyon.fr/viewcvs/viewcvs.cgi/gdcm/Dicts/ParseDict.py
And here is the cleaned-up output (Python script + hand editing):
http://cvs.creatis.insa-lyon.fr/viewcvs/viewcvs.cgi/gdcm/Dicts/dicomV3.dic
This is very difficult to maintain, as a new spec is released every year.
Therefore I was wondering if you could give me some advice on how to
parse the PDF document (1). Is there some table start/end marker in the
PDF file that I can use? Is there any API in the PDF library that would
allow me to generate an 'XML'-like description of the PDF in a neutral way?
Thanks so much for your time,
Mathieu
PS: If such a mailing list exists, forgive me and please give me the
reference so that I can ask this question there.
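
For illustration, here is a minimal sketch of the pdftotext + Python approach
described above. It is not the ParseDict.py script linked in the message. It
assumes that each data element comes out of `pdftotext -raw -nopgbrk` as a
single line of the form `(gggg,eeee) Name VR VM`, which may not hold for every
row of the spec. The `dict`/`entry` element names and the function names are
made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumed row layout (not guaranteed by pdftotext):
#   "(gggg,eeee) Name VR VM", e.g. "(0008,0005) Specific Character Set CS 1"
ROW = re.compile(
    r"^\((?P<group>[0-9A-Fa-fx]{4}),(?P<element>[0-9A-Fa-fx]{4})\)\s+"
    r"(?P<name>.+?)\s+"
    r"(?P<vr>[A-Z]{2}(?: or [A-Z]{2})*)\s+"
    r"(?P<vm>\d+(?:-\d*n?)?)$"
)


def parse_rows(lines):
    """Keep only the lines that look like data element rows."""
    entries = []
    for line in lines:
        match = ROW.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def to_xml(entries):
    """Serialize the parsed rows as a small XML document (hypothetical schema)."""
    root = ET.Element("dict")
    for entry in entries:
        ET.SubElement(root, "entry", attrib=entry)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    import sys

    # Usage: pdftotext -raw -nopgbrk 04_06PU.PDF - | python this_script.py
    print(to_xml(parse_rows(sys.stdin)))
```

Lines that do not match the assumed layout (headings, page footers, rows
wrapped over several lines) are silently skipped, so the output would still
need to be checked by hand against the spec.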