Encoding issues when upgrading Tika 1.17 to 1.19.1

Encoding issues when upgrading Tika 1.17 to 1.19.1

Markus Jelsma
Hello,

I started to upgrade the Tika dependency of our SAX parser from 1.17 to 1.19.1, ran all 995 unit tests and observed three failures: two encoding issues and one other oddity. The tests use real HTML.

Where we previously extracted text such as 'Spokane, Wash. [— The solar', we now get 'Spokane, Wash. [â€" The solar' in one test. In the other, we previously extracted 'could take ["weeks, or' but now get 'could take [“weeks, or'. Our tests pass with 1.17 but fail with 1.18 and 1.19.1.
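For readers following along: the 'â€' pattern above is what typically appears when UTF-8 bytes are decoded as windows-1252. The thread does not confirm that this is what happens inside Tika, so the following is only a minimal sketch of that assumed failure mode, using the JDK alone. The class and method names (MojibakeDemo, garble, repair) are hypothetical and are not part of Tika.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; assumes the garbling is UTF-8 read as windows-1252.
class MojibakeDemo {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Encodes text as UTF-8, then decodes those bytes as windows-1252. */
    static String garble(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, CP1252);
    }

    /** Reverses garble(), provided every garbled character maps back to a windows-1252 byte. */
    static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(CP1252);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String dash = "\u2014"; // the em dash from the 'Spokane, Wash.' test
        String garbled = garble(dash);
        // One character became three: U+00E2, U+20AC, U+201D.
        System.out.println(garbled.length());             // prints 3
        System.out.println(repair(garbled).equals(dash)); // prints true
    }
}
```

If the round trip through repair() restores the expected text for a failing test, that would suggest the page is UTF-8 but was decoded with a single-byte Windows charset.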

The third test fails because we suddenly extract a chunk of JavaScript as text content, when it is actually the content of a script tag with inline base64. The inline code is decoded and reported to the characters() method of our custom ContentHandler, so it ends up in the extracted text, but the script start tag itself never seems to be reported to startElement(). The JavaScript is reported to characters() after we have left the head and entered the body.
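To illustrate why a missing startElement() call matters, here is a minimal sketch of a handler that skips script content by tracking element depth. It is not the handler from our code base; the class name ScriptSkippingHandler and its feed() and text() helpers are hypothetical, and it uses only the SAX classes that ship with the JDK. A handler of this kind can only suppress script text if the start tag is reported first.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text, ignoring anything inside <script>.
class ScriptSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int scriptDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isScript(localName, qName)) {
            scriptDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isScript(localName, qName) && scriptDepth > 0) {
            scriptDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text is kept only while we are outside every script element.
        if (scriptDepth == 0) {
            text.append(ch, start, length);
        }
    }

    private static boolean isScript(String localName, String qName) {
        return "script".equalsIgnoreCase(localName) || "script".equalsIgnoreCase(qName);
    }

    /** Convenience helper that forwards a string to characters(). */
    void feed(String s) {
        char[] chars = s.toCharArray();
        characters(chars, 0, chars.length);
    }

    String text() {
        return text.toString();
    }

    public static void main(String[] args) {
        ScriptSkippingHandler handler = new ScriptSkippingHandler();
        Attributes none = new AttributesImpl();
        handler.startElement("", "script", "script", none);
        handler.feed("var hidden = 1;");
        handler.endElement("", "script", "script");
        handler.feed("visible text");
        System.out.println(handler.text()); // prints "visible text"
    }
}
```

If characters() is called with script content but no preceding startElement() for the script tag, scriptDepth stays at zero and the script text is appended like any other text, which matches the behaviour described above.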

Any idea how to fix this encoding issue and the odd inline base64 JavaScript? Are there any Tika options that I am unaware of? Are these bugs?

Of course, I can share the HTML files if needed.

Many thanks,
Markus
Re: Encoding issues when upgrading Tika 1.17 to 1.19.1

Tim Allison
Hi Markus,

  On the scripts...we added an "extractScripts" option, but the
default is false, and the idea is that the scripts should be extracted
as embedded documents, which with xhtml, would be inlined.  But, with
the default as false, you shouldn't be seeing anything from scripts.

  On charset detection, that was likely caused by our "upgrade" to a
more recent copy of icu4j's charset detector.

  Thank you for letting us know about these.  Please do open issues
and share files.

   Cheers,

              Tim
On Wed, Oct 17, 2018 at 10:24 AM Markus Jelsma
<[hidden email]> wrote:

RE: Encoding issues when upgrading Tika 1.17 to 1.19.1

Markus Jelsma
Hello Tim,

Opened two issues to track the problems:
https://issues.apache.org/jira/browse/TIKA-2758
https://issues.apache.org/jira/browse/TIKA-2759

Many thanks,
Markus
 
-----Original message-----

> From:Tim Allison <[hidden email]>
> Sent: Wednesday 17th October 2018 16:53
> To: [hidden email]
> Subject: Re: Encoding issues when upgrading Tika 1.17 to 1.19.1