Tika Server - don't extract embedded images?

Hanjan, Harinder

Hello!

 

We are using Tika Server to extract text from rich file formats such as HTML, PDF, DOCX, and XLS. By default, Tika extracts the alt text of images in HTML files and returns it as [image: this is the alt text of the image], which becomes part of the document’s extracted text. That text ends up in Solr and shows up in the results when we generate document summaries at query time (via Solr’s highlight functionality). You can see this at https://imgur.com/a/zTc9X6m
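As a stopgap on the indexing side, the markers described above could be stripped from the extracted text before it is sent to Solr. The following is only a sketch of that post-processing idea, not a Tika feature: the function name `strip_image_markers` is hypothetical, and the pattern assumes the marker always has the shape `[image: ...]` with no `]` inside the alt text.

```python
import re

# Matches markers such as "[image: some alt text]" and the empty "[image: ]".
# Assumption: the alt text itself never contains a closing bracket.
IMAGE_MARKER = re.compile(r"\[image:[^\]]*\]")


def strip_image_markers(text: str) -> str:
    """Remove [image: ...] markers and tidy the spaces left behind."""
    without_markers = IMAGE_MARKER.sub("", text)
    # Collapse runs of spaces/tabs created by the removal; newlines are kept.
    return re.sub(r"[ \t]{2,}", " ", without_markers).strip()


if __name__ == "__main__":
    print(strip_image_markers("Welcome [image: city logo] to the page"))
```

This also removes any literal text in a document that happens to look like such a marker, so it is a workaround rather than a fix for the extraction itself.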


Based on the docs, I have tried the following Tika config, but I continue to see [image: ] tags in the extracted text.

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.image.ImageParser"/>
      <parser-exclude class="org.apache.tika.parser.jpeg.JpegParser"/>
    </parser>
  </parsers>
</properties>

 

> java -Dtika.config=tikaconfig.xml -jar tika-server-1.17.jar

 

What am I doing wrong? How can I tell Tika Server to ignore embedded images?

 

Thanks!

Harinder



NOTICE -
This communication is intended ONLY for the use of the person or entity named above and may contain information that is confidential or legally privileged. If you are not the intended recipient named above or a person responsible for delivering messages or communications to the intended recipient, YOU ARE HEREBY NOTIFIED that any use, distribution, or copying of this communication or any of the information contained in it is strictly prohibited. If you have received this communication in error, please notify us immediately by telephone and then destroy or delete this communication, or return it to us by mail if requested by us. The City of Calgary thanks you for your attention and co-operation.
Re: Tika Server - don't extract embedded images?

Tim Allison
Please open an issue on our Jira with a short example file. We can
look into parameterizing this behavior, maybe?
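
For reference, a short reproduction of the kind requested could be sketched as below. It assumes Tika Server is running locally on its default port 9998 and that text is requested from the `/tika` endpoint with `Accept: text/plain`. The sample HTML and the helper names (`build_extract_request`, `extract_text`) are made up for illustration, and decoding the response as UTF-8 is also an assumption.

```python
import urllib.request

# Assumption: Tika Server started locally on its default port.
TIKA_URL = "http://localhost:9998/tika"

# Hypothetical minimal example file: one paragraph with one image that has alt text.
SAMPLE_HTML = (
    b'<html><body><p>Before <img src="logo.png" alt="city logo"> after</p>'
    b"</body></html>"
)


def build_extract_request(html: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika Server for plain-text extraction."""
    return urllib.request.Request(
        url,
        data=html,
        method="PUT",
        headers={"Content-Type": "text/html", "Accept": "text/plain"},
    )


def extract_text(html: bytes, url: str = TIKA_URL) -> str:
    """Send the document to a running Tika Server and return the extracted text."""
    with urllib.request.urlopen(build_extract_request(html, url)) as response:
        return response.read().decode("utf-8")
```

With the server running, `print(extract_text(SAMPLE_HTML))` shows whether the alt text marker appears in the output for a given config.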
On Thu, Oct 11, 2018 at 6:00 PM Hanjan, Harinder
<[hidden email]> wrote:

RE: [EXT] Re: Tika Server - don't extract embedded images?

Hanjan, Harinder
Thanks Tim. I have created TIKA-2755.

Cheers!
Harinder

-----Original Message-----
From: Tim Allison <[hidden email]>
Sent: Friday, October 12, 2018 10:59 AM
To: [hidden email]
Subject: [EXT] Re: Tika Server - don't extract embedded images?

Please open an issue on our jira with a short example file.  We can look into parameterizing this behavior, maybe?