Google Takeout GChat messages

Tucker Barbour
I've exported a GMail archive in MBOX format using takeout.google.com. The MBOX archive also includes GChat messages. However, the GChat messages do not include a Date header. Instead, the date sent is included in what appears to be a non-conforming RFC822 header, which the Tika mbox parser does not recognize. I'm wondering if anyone has any experience extracting metadata from Gmail exports, specifically gchat messages. Any help or guidance would be appreciated.

Attachment: gchat.eml (207 bytes)

Re: Google Takeout GChat messages

Nick Burch
On Tue, 4 Sep 2018, Tucker Barbour wrote:
> I've exported a GMail archive in MBOX format using takeout.google.com. The
> MBOX archive also includes GChat messages. However, the GChat messages do not
> include a Date header. Instead the date sent is included in what appears to
> be a non-conforming RFC822 header which the tika mbox parser does not
> recognize.

As a user of Tika, were you expecting these to show up as additional
emails in the mbox, or something else?

(The underlying library may not give us a choice, I haven't dug in enough
recently to remember, but in case it does, user expectations are of
interest!)

> I'm wondering if anyone has any experience extracting metadata from
> Gmail exports, specifically gchat messages. Any help or guidance would
> be appreciated.

Any chance you could share / produce a small mbox file, with a handful of
both real emails and these gchat messages in, so we can take a look? If
you could open a bug in jira, and attach the small mbox file, that'd be
great.

Nick

Re: Google Takeout GChat messages

Tucker Barbour
* Nick Burch <[hidden email]> [2018-09-05 07:36:46 +0100]:

>On Tue, 4 Sep 2018, Tucker Barbour wrote:
>>I've exported a GMail archive in MBOX format using
>>takeout.google.com. The MBOX archive also includes GChat messages.
>>However, the GChat messages do not include a Date header. Instead
>>the date sent is included in what appears to be a non-conforming
>>RFC822 header which the tika mbox parser does not recognize.
>
>As a user of Tika, were you expecting these to show up as additional
>emails in the mbox, or something else?

For my use case I care about the metadata and body content. Ultimately, the metadata and body content end up in a search engine, so whether they are actually treated as emails or not doesn't really matter that much to me. Ideally, I should be able to tell the difference between a gchat message and an email. Maybe by the presence of the X-GM-THRID header? In the case of the exported gchat messages, the metadata that's relevant to my use case is the thread id, From, and Date headers. Tika gets most of the metadata I care about except for the sent time. The additional From header seems to be the issue: "From 1558692903658457318@xxx Tue Feb 07 16:36:29 +0000 2017". Body content is properly sent to the ContentHandler.
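
To illustrate, the sent time could be recovered from a separator line like the one quoted above by parsing it directly. This is only a minimal Python sketch, not what Tika's mbox parser does; the field layout is assumed from that single example line, and `parse_from_line` is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout, inferred from the one example line in this thread:
#   From <sender> <weekday> <month> <day> <HH:MM:SS> <+ZZZZ> <year>
FROM_LINE = re.compile(
    r"^From (?P<sender>\S+) "
    r"[A-Za-z]{3} (?P<mon>[A-Za-z]{3}) (?P<day>\d{1,2}) "
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}) "
    r"(?P<sign>[+-])(?P<oh>\d{2})(?P<om>\d{2}) (?P<year>\d{4})\s*$"
)

# English month abbreviations, mapped by hand to avoid locale-dependent parsing.
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def parse_from_line(line):
    """Return (sender, timezone-aware datetime), or None if the line does not fit."""
    match = FROM_LINE.match(line)
    if match is None or match["mon"] not in MONTHS:
        return None
    offset = timedelta(hours=int(match["oh"]), minutes=int(match["om"]))
    if match["sign"] == "-":
        offset = -offset
    try:
        sent = datetime(
            int(match["year"]), MONTHS[match["mon"]], int(match["day"]),
            int(match["h"]), int(match["m"]), int(match["s"]),
            tzinfo=timezone(offset),
        )
    except ValueError:
        # Out-of-range day, hour, or offset: treat the line as unparseable.
        return None
    return match["sender"], sent


if __name__ == "__main__":
    print(parse_from_line(
        "From 1558692903658457318@xxx Tue Feb 07 16:36:29 +0000 2017"
    ))
```

The weekday is matched but not checked against the date, and lines in any other layout simply return None, so real exports may need a looser pattern.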

>
>(The underlying library may not give us a choice, I haven't dug in
>enough recently to remember, but in case it does, user expectations
>are of interest!)
>
>>I'm wondering if anyone has any experience extracting metadata from
>>Gmail exports, specifically gchat messages. Any help or guidance
>>would be appreciated.
>
>Any chance you could share / produce a small mbox file, with a handful
>of both real emails and these gchat messages in, so we can take a
>look? If you could open a bug in jira, and attach the small mbox file,
>that'd be great
>

I can spend some time cleaning up a data set for testing and will submit a JIRA ticket. In the meantime, I might explore an additional parser and a custom-mimetypes.xml.

>Nick
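
While preparing a test data set, it may help to see which messages in an export lack a Date header. The following is a rough sketch using Python's standard `mailbox` module rather than Tika; `summarize_mbox` and the dictionary keys are hypothetical names, and whether X-GM-THRID really distinguishes gchat messages from emails is only the guess raised earlier in this thread, so the sketch just reports it.

```python
import mailbox


def summarize_mbox(path):
    """Collect the metadata discussed in this thread for every message in an mbox."""
    rows = []
    # create=False: fail instead of silently creating an empty file.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            rows.append({
                "thread_id": message["X-GM-THRID"],  # None if the header is absent
                "from": message["From"],
                "date": message["Date"],             # None for messages with no Date header
                # The mbox separator line, without its leading "From ".
                "separator": message.get_from(),
            })
    finally:
        box.close()
    return rows


def messages_without_date(path):
    """Return only the rows whose Date header is missing."""
    return [row for row in summarize_mbox(path) if row["date"] is None]
```

For rows returned by `messages_without_date`, the `separator` value is the only place the sent time appears, so it would still need to be parsed separately.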