Header extractions from PDFs (and others)

Grant Ingersoll-2
Hi,

I'm working on a project extracting data out of user manuals (mainly PDFs) for indexing and searching.  I want to be able to mark the header information in the search engine.  The Tika extraction (coming from PDFBox) using an OOTB setup strips out all identifying header information, AFAICT.  I know the main reason is that PDF primarily provides layout information and likely gives no indication that something is semantically a header, but I wanted to check here to see if anyone knows of a way to do this.

For example, the document might look something like this (we'll see how well this comes through as HTML):

<snip>
THIS IS A HEADER

This is normal text.
</snip>

Tika's XML representation then comes back with something like:
<p>THIS IS A HEADER</p>
<p/>
<p>This is normal text</p>

Ideally, it would come back with something like:
<h1>THIS IS A HEADER</h1>
<p/>
<p>This is normal text</p>

Thanks,
Grant

Re: Header extractions from PDFs (and others)

Tim Allison
Grant,

  You might want to try the Grobid parser, which is built to process
academic papers. [0]

  Generally, though, as you can imagine, building a cross-language,
cross-genre header (and footer) extractor is going to require
heuristics and/or ML on a tagged set...with fingers crossed that
there's enough signal from which to learn the classification. There
are some research papers on this topic [1], and I suspect the
commercial extractors might do a reasonable job, but this is,
unfortunately, beyond the scope of what Tika currently offers.

  One thing we could do on the Tika end is a better job of including
font size/location/boldedness [2], etc., in the XHTML output.  Then
consumers could write their own heuristics for their specific document
sets.
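
  To make the "write your own heuristics" idea concrete, here is a
minimal sketch of a consumer-side pass over the extracted paragraphs.
Everything in it is an assumption for illustration: the function names
are hypothetical, the rule (a short, all-uppercase line is a header)
comes only from the example in the original question, and the input is
a simplified fragment without the XHTML namespace that real Tika output
carries, so the tag matching would need adjusting for real output.  It
uses no font or position data, since Tika does not expose those today.

```python
import xml.etree.ElementTree as ET


def looks_like_header(text, max_words=10):
    """Hypothetical rule: a short line whose letters are all uppercase."""
    stripped = text.strip()
    if not stripped:
        return False
    # Require at least one letter so "1.2" or "---" is not promoted.
    has_letters = any(ch.isalpha() for ch in stripped)
    is_upper = stripped == stripped.upper()
    is_short = len(stripped.split()) <= max_words
    return has_letters and is_upper and is_short


def promote_headers(xhtml_fragment):
    """Rename <p> elements that look like headers to <h1>."""
    root = ET.fromstring(xhtml_fragment)
    # Materialize the list first so renaming tags cannot disturb iteration.
    for elem in list(root.iter("p")):
        if looks_like_header(elem.text or ""):
            elem.tag = "h1"
    return ET.tostring(root, encoding="unicode")


# Simplified stand-in for the extracted output shown in the question.
sample = "<body><p>THIS IS A HEADER</p><p/><p>This is normal text</p></body>"
result = promote_headers(sample)
print(result)
```

  A rule like this will misfire on acronyms, warnings printed in caps,
and table cells, which is why it would have to be tuned per document
set.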

  As you know, as NLP continues to move into production, it will only
become more important for open source tools to be able to reconstruct
the logical components of PDFs and image-based files.

     Cheers,

                  Tim

[0] https://wiki.apache.org/tika/GrobidJournalParser or just straight
Grobid: https://grobid.readthedocs.io/en/latest/Introduction/ ... see
also: https://www.crossref.org/labs/pdfextract/
[1] e.g. https://www.researchgate.net/publication/221253782_Header_and_Footer_Extraction_by_Page-Association
[2] Hand-waving... need to look into size/boldedness... location is
fairly straightforward.

On Mon, Jan 7, 2019 at 8:50 AM Grant Ingersoll <[hidden email]> wrote:

> I'm working on a project extracting data out of user manuals (mainly PDFs) for indexing and searching. [...]

Re: Header extractions from PDFs (and others)

Grant Ingersoll-2
Thanks, Tim, will look into that parser.  Totally agree on the need for more formatting info as attributes.

On Mon, Jan 7, 2019 at 11:30 AM Tim Allison <[hidden email]> wrote:
> You might want to try the Grobid parser, which is built to process academic papers. [...]