Fwd: Memory Leak in 7.3 to 7.4


Fwd: Memory Leak in 7.3 to 7.4

Tim Allison
Thomas,
   Thank you for raising this on the Solr list. Please let us know if we can help you help us figure out what’s going on...or if you’ve already figured it out!
    Thank you!

    Best,
       Tim

---------- Forwarded message ---------
From: Thomas Scheffler <[hidden email]>
Date: Thu, Aug 2, 2018 at 6:06 AM
Subject: Memory Leak in 7.3 to 7.4
To: [hidden email] <[hidden email]>


Hi,

We noticed a memory leak in a rather small setup: 40,000 metadata documents plus nearly as many files, each indexed with "literal.*" fields. While 7.2.1 had some Tika issues (due to a beta version), the real problems started with version 7.3.0 and are still unresolved in 7.4.0. Memory consumption has gone through the roof: where a 512 MB heap used to be enough, 6 GB is now not enough to index all files.
I have now tracked this down to the libraries in solr-7.4.0/contrib/extraction/lib/: if I replace them all with the libraries shipped with 7.2.1, the problem disappears. As most files are PDF documents, I tried updating PDFBox to 2.0.11 and Tika to 1.18, with no solution to the problem. Next I will try downgrading just these two libraries back to 2.0.6 and 1.16 to see whether they are the source of the memory leak.
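
For reference, the indexing requests in question go through Solr's extracting request handler. A minimal sketch of such a request (core name, field names and file path are just placeholders here) looks roughly like this:

  # hypothetical example: post one file and set two literal.* fields on the resulting document
  curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc-001&literal.title=Some+title&commit=true" \
       -F "file=@/path/to/document.pdf"

Each literal.<field> parameter sets a literal field value on the document that Solr builds from the extracted content.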

In the meantime, I would like to know whether anybody else has experienced the same problem.

kind regards,

Thomas


Re: Fwd: Memory Leak in 7.3 to 7.4

David Pilato
That's interesting. Someone ran some tests on a project I'm working on and likewise reported very high memory usage (even for plain .txt files).
I haven't dug into the issue yet, so I don't know whether it is related, but I thought I'd share it here: https://github.com/dadoonet/fscrawler/issues/566


Re: Fwd: Memory Leak in 7.3 to 7.4

Tim Allison
Thank you, David! It would be helpful to know whether downgrading to 1.16 solves the problem with .txt files, as it (apparently) does with PDFs.

Re: Memory Leak in 7.3 to 7.4

Robert Neal Clayton
I have a remarkably similar setup to David's: I'm currently running through about 50,000 PDF files with OCR tools. I have Tika 1.18 running standalone, with a shell script that, for each file, extracts metadata before the OCR step by POSTing the PDF to the /meta URL via curl.
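
To give an idea of what each call looks like: it is essentially the stock tika-server metadata request. A rough sketch (port and file path are placeholders; the standard examples use PUT via curl -T rather than a plain POST):

  # hypothetical example: ask the running tika-server for metadata only, as JSON
  curl -T /path/to/document.pdf http://localhost:9998/meta -H "Accept: application/json"

That returns just the document's metadata as JSON, without the extracted body text.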

After 10 hours of uptime, Tika is using about 5.6 GB of memory. After restarting the Tika server, it appears to claim roughly the same amount when it starts fresh.

So whatever the issue is, it doesn't appear to be in anything that falls under /meta; that part is working great for me.
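
For what it's worth, if someone wants to see what is actually holding that memory, a heap histogram or heap dump of the running tika-server process is a quick check (this assumes a standard JDK on the machine; <pid> is the tika-server process id):

  # class histogram of live objects (top 30 entries)
  jmap -histo:live <pid> | head -n 30
  # full heap dump for offline analysis, e.g. in Eclipse MAT
  jmap -dump:live,format=b,file=tika-heap.hprof <pid>

Comparing a histogram taken right after startup with one taken a few hours later should show whether objects are really accumulating on the heap or whether the footprint is just large but stable.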


Re: Memory Leak in 7.3 to 7.4

David Pilato
My bad. The issue I mentioned was reported against Tika 1.16, so it is not related to the current thread. Most likely a problem in my own code :)

