TikaServer - extract only a specific part of HTML page


Hanjan, Harinder

Hello!

I was wondering if there is a way to instruct Tika Server to extract only the content within a specific div tag.

I am extracting a SharePoint site and do not want to see text from the header, footer, etc. The important text is always inside a particular content div, and I only want the text from inside that div.

Previously, I switched to using the /tika/main endpoint. While this has definitely given us some improvement, there are still many cases where text from the header is also extracted.

Thanks!

Harinder



NOTICE -
This communication is intended ONLY for the use of the person or entity named above and may contain information that is confidential or legally privileged. If you are not the intended recipient named above or a person responsible for delivering messages or communications to the intended recipient, YOU ARE HEREBY NOTIFIED that any use, distribution, or copying of this communication or any of the information contained in it is strictly prohibited. If you have received this communication in error, please notify us immediately by telephone and then destroy or delete this communication, or return it to us by mail if requested by us. The City of Calgary thanks you for your attention and co-operation.

RE: TikaServer - extract only a specific part of HTML page

Markus Jelsma
Hello Harinder,

You could try Boilerpipe, which is integrated in Tika and tries to solve this problem automatically. If that doesn't work for you, you can create a custom ContentHandler and collect text only inside the div that has the ID you want.

We do something similar to Boilerpipe; both approaches extend ContentHandler. In the overridden startElement() method you can check for the div element and its ID attribute value and, if the conditions are right, collect text in the characters() method.
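A minimal sketch of the handler described above, using only the JDK's SAX classes so it can run on its own. The class name DivTextHandler, the extract() helper and the sample page are hypothetical and not taken from Tika or from this thread. The demo feeds well-formed XHTML through the JDK SAX parser; with Tika you would instead pass the handler to a parser's parse() call. As far as I know, Tika's default HTML mapping may drop div elements and id attributes unless an IdentityHtmlMapper is set in the ParseContext, so treat that part as an assumption to verify against your Tika version.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data only inside <div id="targetId">.
public class DivTextHandler extends DefaultHandler {
    private final String targetId;
    private final StringBuilder text = new StringBuilder();
    // 0 while outside the target div; otherwise the element nesting depth inside it.
    private int depth = 0;

    public DivTextHandler(String targetId) {
        this.targetId = targetId;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Non-namespace-aware parsers leave localName empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (depth > 0) {
            depth++; // nested element inside the target div
        } else if ("div".equalsIgnoreCase(name) && targetId.equals(atts.getValue("id"))) {
            depth = 1; // entered the target div
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) {
            depth--; // reaches 0 when the target div itself closes
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length);
        }
    }

    public String getText() {
        return text.toString().trim();
    }

    // Demo helper (assumption: input is well-formed XHTML, parsed with the JDK SAX parser).
    public static String extract(String xhtml, String id) throws Exception {
        DivTextHandler handler = new DivTextHandler(id);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id=\"header\">Site menu</div>"
                + "<div id=\"content\"><p>Important <b>text</b></p></div>"
                + "<div id=\"footer\">Copyright</div></body></html>";
        System.out.println(extract(page, "content")); // prints "Important text"
    }
}
```

The depth counter, rather than a simple boolean flag, keeps collection going through nested elements (including nested divs) and stops it exactly when the target div closes.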

Regards,
Markus

 
 
-----Original message-----
> From:Hanjan, Harinder <[hidden email]>
> Sent: Wednesday 9th January 2019 22:06
> To: '[hidden email]' <[hidden email]>
> Subject: TikaServer - extract only a specific part of HTML page

RE: [EXT] RE: TikaServer - extract only a specific part of HTML page

Hanjan, Harinder
Thanks Markus.

I have actually decided to use Jsoup instead for extracting content from HTML pages.
Jsoup is a lightweight library, and I only need three lines of code to get the content from within a div.
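For readers who want to see what such a Jsoup extraction could look like, here is a rough sketch. It is not the actual code used here; the class name JsoupDivExtract, the div id "content" and the sample HTML are assumptions for illustration, and the jsoup library must be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example: pull the text of a single div, selected by id, out of an HTML page.
public class JsoupDivExtract {

    public static String extract(String html, String divId) {
        Document doc = Jsoup.parse(html);            // tolerant HTML parsing
        Element div = doc.getElementById(divId);     // null if no such element exists
        return div == null ? "" : div.text();        // text of the div and its descendants
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"header\">Site menu</div>"
                + "<div id=\"content\"><p>Important <b>text</b></p></div>"
                + "<div id=\"footer\">Copyright</div></body></html>";
        System.out.println(extract(page, "content")); // prints "Important text"
    }
}
```

Because Jsoup parses real-world HTML leniently, this kind of lookup does not require the page to be well-formed XHTML.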

Cheers!
Harinder

-----Original Message-----
From: Markus Jelsma [mailto:[hidden email]]
Sent: Wednesday, January 09, 2019 4:08 PM
To: [hidden email]; '[hidden email]' <[hidden email]>
Subject: [EXT] RE: TikaServer - extract only a specific part of HTML page
