PDF Open Errors Causing Log File to Fill and arssockd to Stop


JeanineJ

I'm running CMOD 10.5.0.6 on a RHEL7 server with multiple instances of ODWEK interrogating the server for PDF documents stored using Page Piece Dictionary.
Last night we started experiencing an abnormally large volume of DocGets for those PDFs. The system log showed multiple error types, shown below:

09/20/2024 09:41:29   ODUSER1     383561602   Error   No      116   Unable to stat file >/app/arstmp/ARS.28565.00007FCEB51E8700.CONVERT.TMP<.  The error number is 2  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:41:25   ODUSER1     383559016   Error   No      116   Unable to stat file >/app/arstmp/ARS.28565.00007FCEB19CC700.CONVERT.TMP<.  The error number is 2  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:41:25   ODUSER1     383560897   Error   No      116   Unable to stat file >/app/arstmp/ARS.28565.00007FCEB59EC700.CONVERT.TMP<.  The error number is 2  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:41:24   ODUSER1     383562667   Error   No      119   Unable to write to file >/app/arstmp/ARS.28565.00007FCEA958A700.CONVERT<.  The error number is 28  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:41:24   ODUSER1     383563774   Error   No      114   Unable to open file >/app/arstmp/ARS.28565.00007FCE57AF1700.CONVERT<.  The error number is 28  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:41:24   ODUSER1     383563509   Error   No      114   Unable to open file >/app/arstmp/ARS.28565.00007FCE576EF700.CONVERT<.  The error number is 28  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:41:24   ODUSER1     383559662   Error   No      116   Unable to stat file >/app/arstmp/ARS.28565.00007FCED59EC700.CONVERT.TMP<.  The error number is 2  Srvr-> full name removed to protect the innocent non-SSL<-
09/20/2024 09:41:24   ODUSER1     383563570   Error   No      119   Unable to write to file >/app/arstmp/ARS.28565.00007FCEA3359700.RESCONVERT<.  The error number is 28  Srvr-> full name removed to protect the innocent non-SSL<-
09/20/2024 09:41:24   ODUSER1     383563291   Error   No      114   Unable to open file >/app/arstmp/ARS.28565.00007FCE578F0700.CONVERT<.  The error number is 28  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:41:24   ODUSER1     383563323   Error   No      114   Unable to open file >/app/arstmp/ARS.28565.00007FCE57CF2700.CONVERT<.  The error number is 28  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:41:24   ODUSER1     383560450   Error   No      116   Unable to stat file >/app/arstmp/ARS.28565.00007FCEB45E2700.CONVERT.TMP<.  The error number is 2  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:41:24   ODUSER1     383560203   Error   No      116   Unable to stat file >/app/arstmp/ARS.28565.00007FCEDE7CB700.CONVERT.TMP<.  The error number is 2  Srvr->non-SSL<-
09/20/2024 09:41:24   ODUSER1     383560411   Error   No      116   Unable to stat file >/app/arstmp/ARS.28565.00007FCEB4FE7700.CONVERT.TMP<.  The error number is 2  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:41:23   ODUSER1     383563374   Error   No      114   Unable to open file >/app/arstmp/ARS.28565.00007FCE921F0700.CONVERT<.  The error number is 28  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:40:28   ODUSER1     383563231   Error   No      119   Unable to write to file >/app/arstmp/ARS.28565.00007FCEAF9BC700.RESCONVERT<.  The error number is 28  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:40:25   ODUSER1     383561583   Error   No      116   Unable to stat file >/app/arstmp/ARS.28565.00007FCEA6974700.CONVERT.TMP<.  The error number is 2  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:40:24   ODUSER1     383563267   Error   No      114   Unable to open file >/app/arstmp/ARS.28565.00007FCEAD1A8700.CONVERT<.  The error number is 28  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:40:24   ODUSER1     383561931   Error   No      114   Unable to open file >/app/arstmp/ARS.28565.00007FCEA6371700.CONVERT<.  The error number is 28  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:35:26   ODUSER1     383546145   Error   No       20   SM Error: Unexpected end of data encountered6, Return Code=6, Reason=arscssm.c, File=584, Line=  Srvr->full name removed to protect the innocent non-SSL<-
09/20/2024 09:35:15   ODUSER1     383546145   Error   Yes      257   PDF document/resource conversion error.  See stored document for more information.

There were so many that the file system where the temporary files are written filled up, and arssockd and arsload (running as a daemon) stopped. I restarted CMOD and the rest of the night everything was fine. Then it happened again this morning and we had to restart CMOD 3 times because I couldn't keep up with deleting the files out of the filesystem.
Interrogation of arssockd also showed an abnormally high volume of activity: upwards of 150 current activities displayed with ps -ef | grep arssockd

One of the 257 errors showed ARS4923E PJoin error code 1073938445 The file may be read-only, or another user may have it open. Please save the document with a different name or in a different folder.

Has anybody run into this? Do you have a way of identifying the document or documents that could be involved? How can we prevent users from retrying retrievals of documents they have already tried to retrieve (if the above error isn't a red herring)?


Justin Derrick

So, basic questions first...

Is the temp directory on your fastest tier of storage (SSD)? It sounds like you're having problems processing the sheer quantity of requests.  Faster disks mean faster throughput and less opportunity for conflicts.  Local SSD/M.2/PCIe storage is best, remote storage over a fast HBA (8Gbit Fibre Channel?) is good, NFS storage is bad.

Second, are you filling up this disk?  You may need to expand the filesystem so that you have room to process all of the retrieval requests.

Third, I've seen poor PDF indexing result in terrible retrieval performance -- especially when higher volumes are experienced.  If this is just one App Group/Application pair, please share your indexing parameters so I can check for obvious problems...
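
On the disk-full question: assuming the "error number" in those messages is the plain Linux errno (I believe it is, but treat that as an assumption), 28 is ENOSPC (no space left on device) and 2 is ENOENT (no such file). A rough Python sketch for tallying them from pasted system log lines could look like this. The function names are made up for illustration and the regex only matches the message text shown above.

```python
import errno
import re
from collections import Counter

# Matches the "The error number is N" text seen in messages 114, 116 and 119.
ERRNO_RE = re.compile(r"The error number is (\d+)")

def tally_errnos(lines):
    """Count how often each OS error number appears in system log lines."""
    counts = Counter()
    for line in lines:
        match = ERRNO_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

def describe(num):
    """Return the symbolic errno name for num, or 'unknown' if there is none."""
    return errno.errorcode.get(num, "unknown")

# Shortened copies of lines from the log in the first post.
sample = [
    "116   Unable to stat file >/app/arstmp/ARS.28565.00007FCEB51E8700.CONVERT.TMP<.  The error number is 2",
    "119   Unable to write to file >/app/arstmp/ARS.28565.00007FCEA958A700.CONVERT<.  The error number is 28",
    "114   Unable to open file >/app/arstmp/ARS.28565.00007FCE57AF1700.CONVERT<.  The error number is 28",
    "257   PDF document/resource conversion error.  See stored document for more information.",
]

for num, count in sorted(tally_errnos(sample).items()):
    print(num, describe(num), count)
```

If nearly everything is 28, the ENOENT errors are probably just fallout from files that could never be written in the first place.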

-JD.
Call:  +1-866-533-7742  or  eMail:  jd@justinderrick.com
IBM CMOD Wiki:  https://CMOD.wiki/
FREE IBM CMOD Webinars:  https://CMOD.Training/
IBM CMOD Professional Services: https://CMOD.cloud

Interests: #AIX #Linux #Multiplatforms #DB2 #TSM #SP #Performance #Security #Audits #Customizing #Availability #HA #DR

JeanineJ

This is CMOD 10.5.0.5 running on a RHEL7 server with the temporary files going to a filesystem (not /tmp) in the root directory (forgive me if I use the wrong terms because I'm a mainframer with enough server knowledge to be dangerous).
The documents in question use Page Piece Dictionary; attached is the output of the App Group.
The documents load once a month and we can load between 400k and 500k documents. I scanned the system log for the previous month and there was no overwhelming volume of errors or excessive activity for arssockd.
It might have been another team that changed their code that interrogates CM8 and CMOD. I have no visibility into what they do.
There are 4 applications in this group using the same indexing.
It's been MANY years since something like this last stopped arssockd.

Justin Derrick

Hey Jeanine...

I've seen a lot of issues related to PDF indexing over the past few years...  Trouble with the indexing parameters can make for successful loads, but cause a lot of problems at retrieve time.

At one customer site, retrievals were taking several seconds to process because their RESTYPE parameter was set incorrectly and the resources were gigantic.  At another customer site, the PDFs from the upstream system were produced by taking thousands of individual PDFs and simply appending all the pages into one gigantic PDF, which loaded with duplicate resources.

If you can have one of the Linux SysAdmins answer the questions about the filesystems, and respond with the indexing parameters, I'll try and give you a better answer.

JeanineJ

We're still having issues with opening PDFs.
The documents use Page Piece Dictionary indexing.

INDEXSTARTBY=1
RESTYPE=ALL
INDEXMODE=INTERNAL
Data Compression is OD77

Compressed Object Size is 100k
The AppGroup has 17 DB2 fields: 4 are of type Index and the remaining 13 are Filter. 6 of the Filter fields are 250 bytes long to allow a comma-delimited array of claim numbers to be populated.
Documents are indexed monthly and stored in CACHE for 60 days. We run maintenance once a week.

The indexed PDF documents can be anywhere from 14 to 200 logical pages. A normal load totals between 400k and 500k documents, of which over 100k are for consumers who are 'paperless'.
We run a notification process to notify our consumers that their documents are ready to view.
The consumers access the documents via ICI & ODWEK (I have no details as that is owned by our CM8 team).
After the notifications go out, the web portal is inundated with document get requests: between 100 and 150 within minutes.
The PDFs are uncompressed using our temporary file system but the sheer volume of the requests overwhelms CMOD. I had at one point on Wednesday over 20GB of 200 RESCONVERT, CONVERT and TMP.err files being retrieved from CACHE.
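
To put numbers on what is piling up, something like the following Python sketch could be run against the temp directory (/app/arstmp in the log above). It only reports counts and sizes grouped by the last part of the file name and deletes nothing. The function names are hypothetical, and it assumes the files sit directly in that directory.

```python
import os
import shutil
from collections import defaultdict

def summarize_temp(path):
    """Return {suffix: (file_count, total_bytes)} for files directly under path."""
    totals = defaultdict(lambda: [0, 0])
    with os.scandir(path) as entries:
        for entry in entries:
            try:
                if not entry.is_file(follow_symlinks=False):
                    continue
                size = entry.stat(follow_symlinks=False).st_size
            except FileNotFoundError:
                continue  # file was removed while we were scanning
            # ARS.28565.00007F....CONVERT.TMP -> "TMP", ...CONVERT -> "CONVERT"
            suffix = entry.name.rsplit(".", 1)[-1] if "." in entry.name else "(none)"
            totals[suffix][0] += 1
            totals[suffix][1] += size
    return {suffix: tuple(pair) for suffix, pair in totals.items()}

def free_fraction(path):
    """Return the fraction of the file system holding path that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

# Example use (path taken from the log messages):
#   for suffix, (count, size) in sorted(summarize_temp("/app/arstmp").items()):
#       print(suffix, count, size)
#   print("free:", round(free_fraction("/app/arstmp") * 100, 1), "%")
```

Running it every minute or so during the next notification wave would show whether the space is going to many small files or a few huge ones.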

CMOD is being blamed for the latency on the ODWEK front end. I say it's them or a robot process causing the high traffic. I'm going to request that our temp file space be moved to a mount point after the first of the year (year end/holiday changes are basically frozen until January) to see if that helps.

Justin Derrick

Odd question.... How much RAM does this system have? 

And just as a recommendation - ask if the notifications can be "smoothed out" by only sending a few emails per second, or schedule them to be sent in the middle of the night, so people don't all click on them as soon as they're sent.
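
To illustrate what I mean by smoothing, here is a minimal Python sketch of a throttled send loop. I have no idea how your notification process is actually built, so `send` stands in for whatever call delivers one notification, and `send_throttled` and `per_second` are names I made up.

```python
import time

def send_throttled(recipients, send, per_second=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Call send(recipient) for each recipient, no faster than per_second.

    sleep and clock are injectable so the pacing can be tested without waiting.
    Returns the number of notifications sent.
    """
    if per_second <= 0:
        raise ValueError("per_second must be positive")
    interval = 1.0 / per_second
    next_slot = clock()
    sent = 0
    for recipient in recipients:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)  # wait until this recipient's time slot opens
        send(recipient)
        sent += 1
        next_slot += interval
    return sent

# Example: 3 notifications per second, printing instead of emailing.
#   send_throttled(["a@example.com", "b@example.com"], print, per_second=3)
```

At a few sends per second, 100k notifications take hours rather than minutes, which should spread the resulting DocGets out as well.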

-JD.