Attributes of HTML element not reported in ContentHandler

Markus Jelsma
Hello,

We parse HTML using a ContentHandler. Tika uses TagSoup, which does not support modern HTML, but we work around the problem by fiddling with its HTMLSchema. We now have access to HTML5 elements and other curiosities, such as allowing META anywhere in the body.

What we never managed to get working is reading the attributes of the HTML element in the ContentHandler. Any ideas on how to get those attributes reported every time?
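To illustrate what we are after, here is a minimal sketch of a handler that collects the attributes of the HTML element in startElement. It is only an illustration under assumptions: it uses the JDK's built-in SAX parser on well-formed XHTML rather than Tika or TagSoup, and the class name HtmlAttributeHandler and its parse helper are made up for this example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of Tika or TagSoup.
public class HtmlAttributeHandler extends DefaultHandler {

    // Attributes seen on the <html> element, keyed by qualified name.
    private final Map<String, String> htmlAttributes = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Without namespace awareness localName may be empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("html".equalsIgnoreCase(name)) {
            for (int i = 0; i < atts.getLength(); i++) {
                htmlAttributes.put(atts.getQName(i), atts.getValue(i));
            }
        }
    }

    public Map<String, String> getHtmlAttributes() {
        return htmlAttributes;
    }

    // Assumption: the input is well-formed XHTML, since the JDK parser is strict XML.
    public static Map<String, String> parse(String xhtml) throws Exception {
        HtmlAttributeHandler handler = new HtmlAttributeHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getHtmlAttributes();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html lang=\"en\" dir=\"ltr\"><head><title>t</title></head><body/></html>";
        System.out.println(parse(doc));
    }
}
```

With a plain XML parser the map ends up holding lang and dir. With our Tika and TagSoup setup, the same kind of handler does not get those attributes reported.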

Many thanks,
Markus