OCR and Raw text

OCR and Raw text

David Pilato
Heya


When OCR is available, what should happen when I have a document containing both text and images with text?

For example, I have a PDF with the text "hello world" and an image containing "foo bar".
When I run Tika with Tesseract to extract text, I can see that only the text part, "hello world", is extracted.

If I run the same configuration on a PDF that contains only an image with "foo bar", then "foo bar" is extracted.

Is that expected? 
If so, does this mean that as soon as some text is extracted from a document we don't run OCR at all?

Thanks for your insights.


David

--
David Pilato, elastic.co
Developer | Evangelist,
Re: OCR and Raw text

David Pilato
Does anyone know?
If no one does, I guess I need to look at the code or turn on debug logging. :)



David

--
David Pilato, elastic.co
Developer | Evangelist,
On 18 Dec 2018 at 21:43 +0100, David Pilato <[hidden email]> wrote:
> [...]
Re: OCR and Raw text

Tim Allison
Hi David,
  I'm sorry for my slow response!

  That behavior isn't expected.  How have you configured Tika to run
OCR on PDFs?
1) extractInlineImages
2) render the page and then run OCR
    a) no_ocr
    b) ocr_only
    c) ocr_and_text

Is there any chance that "foo bar" is in the title of the PDF for the
image-only PDF?  We do write title info into the body.
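
To make the options listed above concrete, here is a toy model in Java of what each page-rendering strategy would be expected to return for David's example. It is a sketch only and does not call Tika: the class OcrStrategySketch, its extract method and the enum names are hypothetical, chosen to mirror the option names in this message. It assumes that OCR on a rendered page sees everything drawn on the page, including the ordinary text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names are Tika API.
public class OcrStrategySketch {

    // Mirrors the three options named in the message above.
    enum Strategy { NO_OCR, OCR_ONLY, OCR_AND_TEXT }

    // textLayer:     text stored in the PDF as real text
    // renderedOcr:   what OCR would read from the rendered page (assumed input)
    static List<String> extract(String textLayer, String renderedOcr, Strategy strategy) {
        List<String> out = new ArrayList<>();
        // The text layer is used unless we were asked for OCR only.
        if (strategy != Strategy.OCR_ONLY && !textLayer.isEmpty()) {
            out.add(textLayer);
        }
        // The OCR result is used unless OCR is switched off.
        if (strategy != Strategy.NO_OCR && !renderedOcr.isEmpty()) {
            out.add(renderedOcr);
        }
        return out;
    }

    public static void main(String[] args) {
        String textLayer = "hello world";
        // Assumption: the rendered page shows both the text and the image.
        String renderedOcr = "hello world foo bar";
        for (Strategy s : Strategy.values()) {
            System.out.println(s + " -> " + extract(textLayer, renderedOcr, s));
        }
    }
}
```

Under this model, only the setting that skips OCR returns "hello world" alone, which is why the configuration actually in use matters for the behavior David describes.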





On Fri, Dec 21, 2018 at 8:04 AM David Pilato <[hidden email]> wrote:
> [...]