@bertsky, thanks for your input and opinion so far. I had to remove some parts (that are in bold in the pad) from the initial post here because they were more fitting to be separate comments here.
Thus, we have a redundancy here: both the Processing Server (via delivery confirmations) and the process block of the Nextflow run (by listening to the result queue) get to see the resulting job status. If PS should act on failure, then NF would need to know that (and vice versa)...
It is still not clear how to handle the correct workspace state. How can a workspace state be verified to be stable or broken?
After having some insightful discussions with @bertsky regarding the error handling in #974, I have decided to transfer the information to this pad. However, it turned out to be a bad idea to document there since more discussions continued on the pad and it was hard to track. Hence, I am creating this post and transferring the information here. The most up-to-date information will be at the top and edited as the discussions continue.
### Introduction
Steps of a workflow job execution inside the OCR-D Network architecture:

1. The user posts the workflow script – receives an ID for the workflow.
2. The user posts the workspace – receives an ID for the workspace.
3. The user posts a workflow job request – receives an ID for the workflow job.
4. The Workflow Server creates a workflow job entry, which it associates with the workflow (ID) in the DB. Further, see Step 14.
5. The Workflow Server starts a Nextflow run of the workflow script, and each process block inside that script sequentially (one after another) creates a processing request (`PYJobInput`) to the Processing Server. Further, see Step 13.
6. The Processing Server creates a processing job entry (`DBProcessorJob` model) in the DB, creates an `OcrdProcessingMessage` and pushes it to the processor-specific queue, starts a Processing Worker for it (not supported yet), unless already running, and replies with a `PYJobOutput` which has a job ID field among others.
7. The NF process block receives the `PYJobOutput` and parses the job ID from it, then listens on the result queue provided in the `PYJobInput`, consuming from the result queue and checking whether the consumed message has the expected job ID (see the sketch after this list).
   - [Note: we may need to know when the processing message was consumed for the timer to activate!]
   - [Note2: this is not a DB, but a Queue! It's not possible just to peek at the message and decide whether to consume it or not. This means the consumed message that has a different job ID has to be requeued back to the specific result queue so that message is not lost. Another listener to the same result queue that expects that specific job ID should still potentially receive it. This may produce inefficiency when scaling... For simplicity reasons, currently, it's assumed that there is only a single listener for each result queue, so each consumed message has the expected job ID.]
8. The Processing Worker consumes an `OcrdProcessingMessage` from its queue, sets the job state to `running` in the DB (`DBProcessorJob`), instantiates the processor with the `parameter` set in the message, unless already cached, and resolves the `workspace`, `pages`, and `input/output_file_grp` in the message.
9. The Processor Instance processes the `pages` of the `workspace` (possibly in parallel, not supported yet on the processor level). For page-level failures (not supported yet), it can `skip/fallback/raise` (depending on configuration) [probably also `repeat` if just OOM or I/O timeout etc.] (not supported yet on processor level).
10. The job fails on `raise` or too many `skip`s / `fallback`s (depending on configuration). The Processing Worker then updates the job state in the DB.
11. The Processing Worker pushes an `OcrdResultMessage` to some result queue (string) that was provided in the `PYJobInput` request. The Processing Worker creates the result queue if not existing before pushing the result there.
12. The Processing Worker posts the `OcrdResultMessage` in json format to some callback URL that was provided in the `PYJobInput` request. The job state (`success/failed`) is part of the message.
13. The NF process block reacts to the result (not supported yet): if `errorStrategy=retry` and `maxTries` are not reached, repeat from step 5. [How about notification of any workflow listeners for that ID?]
14. The Workflow Server updates the workflow job state in the DB once the Nextflow run has finished.

Note: Check the WebAPI spec for more details.
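To make steps 5–7 and 11–13 more concrete, here is a minimal, hedged sketch of what a single process block effectively does against the Processing Server and the result queue. This is not the actual ocrd_network client code; the endpoint path, the payload field names, and the queue name are assumptions for illustration only.

```python
import json

import pika
import requests

PROCESSING_SERVER = "http://localhost:8080"  # assumed address
RESULT_QUEUE = "result-queue-demo"           # assumed result queue name

# Step 5: create a processing request (PYJobInput-like payload) for one step.
job_input = {
    "path_to_mets": "/data/ws1/mets.xml",
    "input_file_grps": ["OCR-D-IMG"],
    "output_file_grps": ["OCR-D-BIN"],
    "parameters": {},
    "result_queue_name": RESULT_QUEUE,
}
# The endpoint path is an assumption; see the Processing Server API for the real one.
response = requests.post(f"{PROCESSING_SERVER}/processor/ocrd-dummy", json=job_input)
job_id = response.json()["job_id"]  # step 7: parse the job ID from the PYJobOutput

# Step 7 (continued): consume from the result queue until our job ID shows up.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=RESULT_QUEUE, durable=True)
for method, properties, body in channel.consume(RESULT_QUEUE, inactivity_timeout=300):
    if method is None:
        break  # no OcrdResultMessage arrived within the timeout
    result = json.loads(body)  # OcrdResultMessage pushed by the Processing Worker
    if result.get("job_id") == job_id:
        channel.basic_ack(method.delivery_tag)
        print("job state:", result.get("state"))  # success/failed, see step 13
        break
    # single-listener assumption from Note2: requeue messages with foreign job IDs
    channel.basic_nack(method.delivery_tag, requeue=True)
channel.cancel()
connection.close()
```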
### Different levels of error handling
Five different levels (six with the user/admin level) of error handling were identified across ocrd_network and ocrd.
The levels are listed below from 1 to 5 (from low to high, i.e., in reverse invocation order).
#### 1 Processor Instance level
The Processor Instance gets called with `process()` or `process_workspace()`.

Error handling options:
Check here for ideas regarding processor-level error handling options.
In short, processors should handle errors on the page level.
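Since page-level error handling is not yet implemented on the processor level, the following is only a generic sketch of what skip/fallback/raise semantics could look like inside a processor's page loop. It is not the OCR-D API; names like `PageErrorPolicy` and `process_page` are made up for illustration.

```python
import logging
from enum import Enum

class PageErrorPolicy(Enum):
    SKIP = "skip"          # omit the failed page from the output file group
    FALLBACK = "fallback"  # pass the input page through as a dummy result
    RAISE = "raise"        # abort the whole document-level job

def process_pages(pages, process_page, policy=PageErrorPolicy.SKIP, max_failures=3):
    """Apply process_page to every page, handling failures per the configured policy."""
    log = logging.getLogger("page-error-demo")
    results, failures = {}, 0
    for page_id, page in pages.items():
        try:
            results[page_id] = process_page(page)
        except Exception as err:  # could also retry here on OOM / I/O timeouts
            failures += 1
            log.error("page %s failed: %s", page_id, err)
            if policy is PageErrorPolicy.RAISE or failures > max_failures:
                raise  # raise, or too many skips/fallbacks: fail the whole job
            if policy is PageErrorPolicy.FALLBACK:
                results[page_id] = page  # fallback: keep the input unchanged
            # SKIP: nothing to do, the page is simply left out
    return results
```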
Fail reasons:
Resolutions:
#### 2 Processing Worker level
Error handling options:
A failed `OcrdProcessingMessage` due to a (raised) Processor Instance failure should be requeued back to the message queue (unless reproducible).
[How does the Processing Worker decide if an error is reproducible? Does it try to start the processor instance a few times? Does it also do some error message parsing - for smarter decisions?]
If the same message fails several times (3-4, configurable), or is non-repeatable (already reproducible), then the failed message will either be … (not supported yet).

The `redelivered` flag/property (set by the message broker) of the processing message hints that the message was requeued, so either of these happened: …

However, simply by checking that property, we could not say whether: …
Note: It's still possible to NACK a message and not requeue it back.
Note2: When processing a redelivered message, the workspace should not be in a broken state, and potentially the ocr-d processor argument `--overwrite` must be set.

The `redelivered` (or any other) property does not hint at how many times a message was requeued. To track that (see the consumer sketch at the end of this level):

- A new field of the `OcrdProcessingMessage` must be introduced to track the number of times a message was redelivered – a hop counter. Unfortunately, there isn't a convenient way to do that with a hop limit (as in the IP stack) for a specific message. So this should be preferable over the other option below.
- Quorum queues provide an `x-delivery-count` variable in the header of a redelivered message. However, quorum queues come with some limitations in comparison to regular queues.

Currently:
The Processing Worker sends ACKs/NACKs back to the message broker (RabbitMQ server). Based on whether the publisher (Processing Server) enabled delivery confirmations, the status may also be sent back to the publisher by the message broker. A separate handler can be implemented on the publisher side for deciding what to do on message ACKs/NACKs. For more details regarding acknowledgments and confirmations check here. Currently, the Processing Server does not do anything based on ACKs or NACKs.
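As a rough illustration of what such a publisher-side handler could look like, here is a minimal pika sketch (not the Processing Server's actual code; the queue name and payload are placeholders):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.confirm_delivery()  # enable publisher confirms (delivery confirmations)

try:
    channel.basic_publish(
        exchange="",
        routing_key="ocrd-dummy",                    # placeholder queue name
        body=b'{"job_id": "placeholder"}',           # placeholder OcrdProcessingMessage
        properties=pika.BasicProperties(delivery_mode=2),  # persistent message
        mandatory=True,  # have the broker return the message if it cannot be routed
    )
except pika.exceptions.UnroutableError:
    # no queue for this routing key, e.g. the worker/queue was never deployed
    print("message could not be routed")
except pika.exceptions.NackError:
    # the broker NACKed the message; decide whether to retry, raise or ignore
    print("message was rejected by the broker")

connection.close()
```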
It is worth noting that the Processing Worker is a publisher itself when writing the `OcrdResultMessage` back to the specific result queue. Hence, a separate handler for ACKs/NACKs can be implemented on the Processing Worker side to make sure the result message was successfully consumed and processed by the Nextflow run.

Fail reasons:
Resolutions:
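As referenced above, here is a minimal consumer-side sketch of the redelivery/hop-counter idea, assuming a quorum queue that sets `x-delivery-count` and a hypothetical `run_processor()` standing in for the actual Processor Instance call:

```python
import json

import pika

MAX_DELIVERIES = 3  # assumed configurable limit for requeued messages

def run_processor(message):
    """Hypothetical stand-in for invoking the Processor Instance."""
    raise NotImplementedError

def on_processing_message(channel, method, properties, body):
    message = json.loads(body)  # OcrdProcessingMessage payload
    headers = properties.headers or {}
    # quorum queues report how often the broker has redelivered this message
    delivery_count = headers.get("x-delivery-count", 0)
    try:
        run_processor(message)
        channel.basic_ack(method.delivery_tag)
    except Exception:
        if method.redelivered and delivery_count >= MAX_DELIVERIES:
            # give up: NACK without requeueing (drop or dead-letter the message)
            channel.basic_nack(method.delivery_tag, requeue=False)
        else:
            # assume a transient failure and push the message back to the queue
            channel.basic_nack(method.delivery_tag, requeue=True)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="ocrd-dummy", durable=True,
                      arguments={"x-queue-type": "quorum"})
channel.basic_consume(queue="ocrd-dummy", on_message_callback=on_processing_message)
channel.start_consuming()
```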
#### 3 Processing Server level
Error handling options:
Failed Processing Workers should be identified and restarted again by the Deployer agent of the Processing Server (not supported yet).

Ideally, this error-handling strategy should be flexible and controllable through a CLI option, e.g. `--worker-failure [raise|ignore|restart]`, where `restart` will restart the processing worker, `ignore` will do nothing, and `raise` will stop the Processing Server.

For job failures:
Probably it's better to have a mechanism to track whether the queues have active consumers or not. Setting a timeout for each processing job may be hard to follow and easy to lose track of.
Option: set a TTL for messages and queues. However, it's hard to decide on the optimal TTL value without data from actual runs. Moreover, how are messages dropped by the message broker (RabbitMQ) tracked efficiently so they can be requeued?
Another option may be to set the `auto-delete` flag on queues (see the sketch at the end of this level).

The Processing Server then: …

Fail reasons:
Resolutions:
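As referenced above, here is a minimal sketch of the discussed options using pika directly. The queue names and the TTL value are placeholders, not recommendations.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Option 1: per-queue message TTL (in ms). The broker silently drops (or
# dead-letters) messages that stay unconsumed longer than this, which is why
# tracking and requeueing dropped messages is the open problem noted above.
channel.queue_declare(
    queue="ocrd-dummy",
    durable=True,
    arguments={"x-message-ttl": 30 * 60 * 1000},
)

# Option 2: auto-delete result queues vanish once their last consumer
# (the Nextflow run waiting for the OcrdResultMessage) disconnects.
channel.queue_declare(queue="result-queue-demo", auto_delete=True)

# Tracking active consumers: a passive declare returns the current counts
# without modifying the queue, so the Processing Server could poll this.
status = channel.queue_declare(queue="ocrd-dummy", passive=True)
print("consumers:", status.method.consumer_count,
      "messages:", status.method.message_count)

connection.close()
```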
#### 4 Nextflow runner level
Error handling options:
Fail reasons:
Resolutions:
#### 5 Workflow server level
Error handling options:
[What are potential errors? Parsing error messages to find out what went wrong?]
Fail reasons:
Resolutions:
### Open Questions:
### TO DOs: