Sample Rate / Audio Sample Rate not included in XML output

Nick Sincaglia
I am using the Tika 1.19 GUI to extract metadata from an .mp3 file. The sample rate is available and I am able to access it, but only as a string or as part of a JSON document. I am working in XML and would like to use an XML content handler. But when the output is returned as ‘structured text’ (XML), the sample rate is missing. I have also tried Tika 1.19 in a Maven project and experimented with different contentHandlers, and the same issue occurs: I cannot get the sample rate into the XML document, although I can read it from the metadata object itself. I am wondering what I am doing wrong or misunderstanding. Is this perhaps an issue with the parser or the contentHandler that is used?

 

Tika 1.19 ‘Metadata’ view (sample rate is available):

 

Author: Glee Cast
Content-Length: 8251946
Content-Type: audio/mpeg
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.mp3.Mp3Parser
X-TIKA:digest:MD5: e0bdf3a0e171fca838604f9baad46612
X-TIKA:digest:SHA256: ea1e4aa998f2c6e80139fa100c62fc1ee17652cf702cd484532b90183e7c5cc0
channels: 2
creator: Glee Cast
dc:creator: Glee Cast
dc:title: Rehab (Glee Cast Version)
meta:author: Glee Cast
resourceName: USQX90900223_A4_T7.mp3
samplerate: 44100
title: Rehab (Glee Cast Version)
version: MPEG 3 Layer III Version 1
xmpDM:album: Glee: The Music, The Complete Season One
xmpDM:artist: Glee Cast
xmpDM:audioChannelType: Stereo
xmpDM:audioCompressor: MP3
xmpDM:audioSampleRate: 44100
xmpDM:duration: 206301.296875
xmpDM:genre: 
xmpDM:logComment: XXX - 
(P) 2009 Twentieth Century Fox Television - USQX90900223
xmpDM:releaseDate: 
xmpDM:trackNumber: 4

 

 

Tika 1.19 ‘Structured Text’ view (no sample rate):

 

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="xmpDM:genre" content=""/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.mp3.Mp3Parser"/>
<meta name="creator" content="Glee Cast"/>
<meta name="xmpDM:album" content="Glee: The Music, The Complete Season One"/>
<meta name="xmpDM:releaseDate" content=""/>
<meta name="meta:author" content="Glee Cast"/>
<meta name="xmpDM:artist" content="Glee Cast"/>
<meta name="X-TIKA:digest:SHA256" content="ea1e4aa998f2c6e80139fa100c62fc1ee17652cf702cd484532b90183e7c5cc0"/>
<meta name="dc:creator" content="Glee Cast"/>
<meta name="xmpDM:audioCompressor" content="MP3"/>
<meta name="resourceName" content="USQX90900223_A4_T7.mp3"/>
<meta name="xmpDM:logComment" content="XXX - &#10;(P) 2009 Twentieth Century Fox Television - USQX90900223"/>
<meta name="dc:title" content="Rehab (Glee Cast Version)"/>
<meta name="Author" content="Glee Cast"/>
<meta name="Content-Length" content="8251946"/>
<meta name="X-TIKA:digest:MD5" content="e0bdf3a0e171fca838604f9baad46612"/>
<meta name="Content-Type" content="audio/mpeg"/>
<title>Rehab (Glee Cast Version)</title>
</head>
<body><h1>Rehab (Glee Cast Version)</h1>
<p>Glee Cast</p>
<p>Glee: The Music, The Complete Season One, track 4</p>
<p>206301.3</p>
<p>XXX -  (P) 2009 Twentieth Century Fox Television - USQX90900223</p>
</body></html>

 

Tika 1.19 Recursive JSON view (the sample rate is there):

 

[
  {
    "Author": "Glee Cast",
    "Content-Type": "audio/mpeg",
    "X-Parsed-By": [
      "org.apache.tika.parser.DefaultParser",
      "org.apache.tika.parser.mp3.Mp3Parser"
    ],
    "X-TIKA:content": "Rehab (Glee Cast Version)\nGlee Cast\nGlee: The Music, The Complete Season One, track 4\n206301.3\nXXX - \n(P) 2009 Twentieth Century Fox Television - USQX90900223\n",
    "X-TIKA:digest:MD5": "e0bdf3a0e171fca838604f9baad46612",
    "X-TIKA:digest:SHA256": "ea1e4aa998f2c6e80139fa100c62fc1ee17652cf702cd484532b90183e7c5cc0",
    "X-TIKA:parse_time_millis": "86",
    "channels": "2",
    "creator": "Glee Cast",
    "dc:creator": "Glee Cast",
    "dc:title": "Rehab (Glee Cast Version)",
    "meta:author": "Glee Cast",
    "samplerate": "44100",
    "title": "Rehab (Glee Cast Version)",
    "version": "MPEG 3 Layer III Version 1",
    "xmpDM:album": "Glee: The Music, The Complete Season One",
    "xmpDM:artist": "Glee Cast",
    "xmpDM:audioChannelType": "Stereo",
    "xmpDM:audioCompressor": "MP3",
    "xmpDM:audioSampleRate": "44100",
    "xmpDM:duration": "206301.296875",
    "xmpDM:genre": "",
    "xmpDM:logComment": "XXX - \n(P) 2009 Twentieth Century Fox Television - USQX90900223",
    "xmpDM:releaseDate": "",
    "xmpDM:trackNumber": "4"
  }
]

Re: Sample Rate / Audio Sample Rate not included in XML output

Nick Sincaglia
I was wondering if anyone might have some insight into why the XML output does not contain some of the technical file information that the JSON and text versions do. Is this something that can be fixed? Could someone suggest a way to identify the root cause and fix it?

Thanks,

Nick

On Oct 8, 2018, at 9:31 PM, Nick Sincaglia <[hidden email]> wrote:

[...]

Re: Sample Rate / Audio Sample Rate not included in XML output

Tim Allison
I will take a look over the next few days... I’m sorry for not having a ready answer.

On Sun, Oct 14, 2018 at 11:08 PM Nick Sincaglia <[hidden email]> wrote:
[...]

Re: Sample Rate / Audio Sample Rate not included in XML output

Tim Allison
In reply to this post by Nick Sincaglia
Nick,
  I'm sorry for my delay.  The XHTMLContentHandler writes everything
that is in the Metadata object at the moment the parser writes the first
"content" element.  In the Mp3Parser, this is the <h1> element,
which is written before the sample rate is added to the Metadata
object.  Any metadata that is added afterwards does not show up in the
XHTML, but is still retrievable from the Metadata object.
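
To make the ordering issue concrete, here is a minimal, self-contained sketch in plain Java. It does not use Tika at all: `StreamingXhtmlWriter` and `parseLikeTheThread` are hypothetical names invented for illustration, and the real XHTMLContentHandler is certainly more involved. It only models the behaviour described above, where the head is written once, from whatever metadata exists when the first content element arrives.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a streaming XHTML writer; NOT Tika's actual implementation. */
public class LazyHeadDemo {

    /** Hypothetical stand-in for a handler that flushes <head> on the first content element. */
    static class StreamingXhtmlWriter {
        private final Map<String, String> metadata;
        private final StringBuilder out = new StringBuilder();
        private boolean headWritten = false;

        StreamingXhtmlWriter(Map<String, String> metadata) {
            this.metadata = metadata;
        }

        /** Writes <head> once, using whatever metadata exists at that moment. */
        private void writeHeadIfNeeded() {
            if (headWritten) {
                return;
            }
            headWritten = true;
            out.append("<head>");
            // No XML escaping here, to keep the sketch short.
            for (Map.Entry<String, String> e : metadata.entrySet()) {
                out.append("<meta name=\"").append(e.getKey())
                   .append("\" content=\"").append(e.getValue()).append("\"/>");
            }
            out.append("</head><body>");
        }

        /** Writes one content element, flushing the head first if necessary. */
        void element(String name, String text) {
            writeHeadIfNeeded();
            out.append('<').append(name).append('>').append(text)
               .append("</").append(name).append('>');
        }

        String finish() {
            writeHeadIfNeeded();
            return out.append("</body>").toString();
        }
    }

    /** Mimics the ordering from the thread: first content element, then late metadata. */
    static String parseLikeTheThread(Map<String, String> metadata) {
        StreamingXhtmlWriter xhtml = new StreamingXhtmlWriter(metadata);
        metadata.put("dc:title", "Rehab (Glee Cast Version)");
        xhtml.element("h1", "Rehab (Glee Cast Version)"); // head is flushed here
        metadata.put("xmpDM:audioSampleRate", "44100");   // too late for <head>
        return xhtml.finish();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        System.out.println(parseLikeTheThread(metadata));
        System.out.println("In metadata map: " + metadata.get("xmpDM:audioSampleRate"));
    }
}
```

In this toy model the printed head contains dc:title but not xmpDM:audioSampleRate, while the map itself still holds the sample rate, which matches the symptom in the original report.
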
  This is one of the limitations of a streaming write.  As I look at
the code of the Mp3Parser, I _think_ it would be trivial to write the
metadata before writing any content.  That wouldn't get in the way of
a streaming parse, because the parser reads the whole file and caches
the content as it goes, only writing once it has finished reading
the file.
  Please open a ticket on our JIRA, and I'll take care of it.

          Best,

                 Tim
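
Until the parser is changed, one possible workaround is to build the XML from the Metadata object after the parse has finished, since it holds every value by then. The sketch below is only that, a sketch: it assumes the names and values have already been copied into a plain map (for example via Metadata.names() and Metadata.getValues(name)), `MetadataToXml` and `toXml` are hypothetical names, and the output is a bare <head> fragment rather than Tika's full XHTML.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Sketch: serialize already-collected metadata as <meta> elements. */
public class MetadataToXml {

    /**
     * Writes one <meta name="..." content="..."/> element per value.
     * The map is assumed to be a copy of the metadata taken AFTER parsing.
     */
    static String toXml(Map<String, List<String>> metadata) {
        StringWriter buffer = new StringWriter();
        try {
            XMLStreamWriter xml =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(buffer);
            xml.writeStartElement("head");
            for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
                // Multi-valued keys (e.g. X-Parsed-By) become repeated elements.
                for (String value : entry.getValue()) {
                    xml.writeEmptyElement("meta");
                    xml.writeAttribute("name", entry.getKey());
                    xml.writeAttribute("content", value);
                }
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not serialize metadata", e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("X-Parsed-By", List.of(
                "org.apache.tika.parser.DefaultParser",
                "org.apache.tika.parser.mp3.Mp3Parser"));
        metadata.put("xmpDM:audioSampleRate", List.of("44100"));
        metadata.put("channels", List.of("2"));
        System.out.println(toXml(metadata));
    }
}
```

Using XMLStreamWriter rather than string concatenation leaves attribute escaping to the JDK, which matters for values such as the multi-line xmpDM:logComment.
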

Re: Sample Rate / Audio Sample Rate not included in XML output

Nick Burch
On Wed, 17 Oct 2018, Tim Allison wrote:
> This is one of the limitations of a streaming write.  As I look at
> the code of the MP3Parser, I _think_ it would be trivial to write the
> metadata before writing any content, and it wouldn't get in the way of
> a streaming parse because the parser reads the whole file and caches
> the content as it goes -- only writing once it has finished reading
> the file.

IIRC some of the metadata is only known once all parsing is finished, e.g.
the audio duration, which may be why it's currently done as it is.

Nick

Re: Sample Rate / Audio Sample Rate not included in XML output

Tim Allison
> IIRC some of the metadata is only known once all parsing is finished, eg
> the audio duration, which may be why it's currently done as it is

Yes, I completely agree, but I don't think anything is written during
the parsing.  I _think_ all info is stored in memory during the parse,
and then we write to the Metadata object and the ContentHandler after
reading through the file.  If I understand the code correctly, then it
would be trivial to finish writing to the Metadata object before
writing anything to the handler.
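
The reordering being proposed could look roughly like the sketch below. It is a plain-Java toy model under stated assumptions, not the actual Mp3Parser code: `ParsedAudio`, `render` and `emit` are hypothetical names, and `render` stands in for a handler that builds the head from the metadata as it stands when the first content is written. The sample values are taken from the output earlier in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of "metadata first, content second"; NOT the actual Mp3Parser code. */
public class MetadataFirstDemo {

    /** Hypothetical holder for values gathered in memory while reading the whole file. */
    static class ParsedAudio {
        String title = "Rehab (Glee Cast Version)";
        String sampleRate = "44100";
        String durationMillis = "206301.296875"; // only known after the full read
    }

    /** Builds the head from the metadata as it is when content is first written. */
    static String render(Map<String, String> metadata, String title) {
        StringBuilder out = new StringBuilder("<head>");
        // No XML escaping here, to keep the sketch short.
        metadata.forEach((name, value) ->
                out.append("<meta name=\"").append(name)
                   .append("\" content=\"").append(value).append("\"/>"));
        return out.append("</head><body><h1>").append(title)
                  .append("</h1></body>").toString();
    }

    static String emit(ParsedAudio audio, Map<String, String> metadata) {
        // Step 1: copy every collected value into the metadata first.
        metadata.put("dc:title", audio.title);
        metadata.put("xmpDM:audioSampleRate", audio.sampleRate);
        metadata.put("xmpDM:duration", audio.durationMillis);
        // Step 2: only now write content, so the head sees the complete metadata.
        return render(metadata, audio.title);
    }

    public static void main(String[] args) {
        System.out.println(emit(new ParsedAudio(), new LinkedHashMap<>()));
    }
}
```

Because everything is already in memory before any output happens, values that are only known at the end of the read (such as the duration) can still reach the head in this model.
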
On Wed, Oct 17, 2018 at 9:48 AM Nick Burch <[hidden email]> wrote:

> [...]

Re: Sample Rate / Audio Sample Rate not included in XML output

Nick Sincaglia
In reply to this post by Tim Allison
Thanks for looking into this, Tim!

I just created a ticket in JIRA.
https://issues.apache.org/jira/browse/TIKA-2761

Nick

> On Oct 17, 2018, at 8:21 AM, Tim Allison <[hidden email]> wrote:
> [...]