Logging and filename

Olivier Tavard
Hi,

I have a question about logging in Tika, and in Tika server specifically.
We use Tika server to index millions of files on a Windows file share. More precisely, Apache ManifoldCF crawls the files and Tika server 1.19 does the text extraction.
The spawnChild option is active. With very large files we get some OOM errors, and the Tika server parent kills and restarts the child process as it should. This works great; I just wanted to know whether the Tika server child log could include the name of the file that caused the OOM. So far, the Tika log shows the error and its date but not the filename. I switched the log level to debug, but the filename did not appear either.

At the moment, to find this information I first have to look up the date and time of the child restart in the Tika server log. Then I open the Apache ManifoldCF log and search around that date and time to finally find the problematic file that was sent to Tika.
Did I miss something, and can the filename already be found in the Tika log? If Tika could write the filename to its own log, it would be very helpful for us.

Thanks,
Best regards,
Olivier 
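The manual correlation described above (take the restart time from the Tika log, then look for crawler entries just before it) could be scripted roughly as follows. This is only a sketch: the timestamp layout at the start of each line, the sample log lines, and the names `parse_ts` and `lines_before` are assumptions for illustration, not the actual Tika or ManifoldCF log formats.

```python
from datetime import datetime, timedelta

# Assumed layout of the timestamp that starts each log line.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:19], TS_FORMAT)
    except ValueError:
        return None


def lines_before(restart_ts, crawler_lines, window_seconds=60):
    """Return crawler log lines stamped within window_seconds before restart_ts."""
    start = restart_ts - timedelta(seconds=window_seconds)
    hits = []
    for line in crawler_lines:
        ts = parse_ts(line)
        if ts is not None and start <= ts <= restart_ts:
            hits.append(line)
    return hits
```

Calling `lines_before` with the restart time read from the Tika log and the lines of the crawler log narrows the search to the files submitted shortly before the child was restarted. The window size is a guess and would need tuning to how long large files take to fail.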

Re: Logging and filename

Tim Allison
Doh. Sorry.  I just added that in bf75e39.  Please let us know what
else you find!

Aside from the unit tests, I haven't had a chance to try to break the
-spawnChild option with our regression corpus.
On Thu, Oct 11, 2018 at 9:59 AM Olivier Tavard
<[hidden email]> wrote:

> [...]

Re: Logging and filename

Olivier Tavard
Hi,

Thanks for the quick fix!
The value of the "path" parameter at the place where you made the commit (the parse method in the TikaResource class) is always "unpack/all" when I run the indexing on the file share. Shouldn't it normally be the file path? I do not understand why it has this value.

Thanks,
Best regards,

Olivier 
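One plausible reading of the observation above is that the "path" value is the URL path of the server endpoint being called (`unpack/all`), which is the same for every file, while the client-side file name would have to travel in a request header. The sketch below builds such a request without sending it. The helper name `build_request`, the host and port, and the assumption that the server reads the name from a `Content-Disposition` header are illustrative and should be checked against the Tika server version in use.

```python
import urllib.request


def build_request(server_url, endpoint, file_name, payload):
    """Build (but do not send) a PUT request that names the file in a header.

    The URL path only selects the endpoint; the file name is carried separately.
    Only plain ASCII names without quotes are handled in this sketch.
    """
    headers = {
        # Assumption: the server picks up the client-side name from this header.
        "Content-Disposition": 'attachment; filename="%s"' % file_name,
    }
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/" + endpoint,
        data=payload,
        headers=headers,
        method="PUT",
    )
```

For example, `build_request("http://localhost:9998", "unpack/all", "report.docx", data)` yields a request whose URL path is `/unpack/all` regardless of which file is sent, which matches the constant value seen in the log.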


On Oct 11, 2018, at 19:46, Tim Allison <[hidden email]> wrote:

[...]


Re: Logging and filename

Tim Allison
Except that it didn't fix anything!  I _think_ I got it right this
time: https://issues.apache.org/jira/browse/TIKA-2754  Let me know
what you find.

Thank you, again.

Cheers,

         Tim
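For readers wondering why the file name needs to be logged at a particular point: if the name is written only after a successful parse, a parse that dies with an out-of-memory error never records it. The sketch below illustrates the general log-before-work idea in Python. It is not the code from TIKA-2754, and `parse_with_name_logged` and its parameters are hypothetical names.

```python
import logging


def parse_with_name_logged(file_name, payload, parse, logger):
    """Log the file name before parsing so it is recorded even if parsing fails."""
    # Written first: this line exists in the log even when parse() never returns.
    logger.info("Starting parse of %s (%d bytes)", file_name, len(payload))
    result = parse(payload)
    logger.info("Finished parse of %s", file_name)
    return result
```

With this ordering, a "Starting parse" entry without a matching "Finished parse" entry points at the file that was being processed when the failure happened, provided the log handler flushes the first line before the process goes down.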
On Fri, Oct 12, 2018 at 5:44 AM Olivier Tavard
<[hidden email]> wrote:

> [...]

Re: Logging and filename

Olivier Tavard
Hello,

Thanks for the fix, it works well!

Best regards,

Olivier 


On Oct 12, 2018, at 18:41, Tim Allison <[hidden email]> wrote:

[...]