Error handling in ocr-d network and ocrd processor (page-wise) #1015

MehmedGIT · 2023-03-20T11:23:18Z

MehmedGIT
Mar 20, 2023
Maintainer

After having some insightful discussions with @bertsky regarding the error handling in #974, I have decided to transfer the information to this pad. However, it turned out to be a bad idea to document there since more discussions continued on the pad and it was hard to track. Hence, I am creating this post and transferring the information here. The most up-to-date information will be at the top and edited as the discussions continue.

Introduction

Steps of a workflow job execution inside the OCR-D Network architecture:

The user/admin uploads a workflow (Nextflow) script through the Workflow Server
– receives an ID for the workflow.
The user/admin uploads a workspace (METS URL or OCRD-ZIP) through the Workspace Server
– receives an ID for the workspace.
The user/admin triggers a workflow (ID) on some workspace (ID) through the Workflow Server
– receives an ID for the workflow job.
The Workflow Server triggers a Nextflow (manager) run,
which it associates with the workflow (ID) in the DB.
Further, see Step 14
The Nextflow run executes the workflow's Nextflow script,
and each process block inside that script sequentially (one after another)
creates a processing request (PYJobInput) to the Processing Server.
Further, see Step 13
For each request, the Processing Server…
- creates a corresponding job entry (DBProcessorJob model) in the DB,
- forwards it to the appropriate message queue as an OcrdProcessingMessage,
- instantiates a temporary METS server for the given workspace (not supported yet), unless already running,
- responds with PYJobOutput which has a job ID field among others.
The active process block on the Nextflow run…
- receives the PYJobOutput and parses the job ID from it,
- starts listening to the result queue corresponding to that processor that was provided inside the PYJobInput, consuming from the result queue and checking whether the consumed message has the expected job ID
- blocks until exactly one such message was received, or a timeout has been reached. Further, see Step 13
  [Note: we may need to know when the processing message was consumed for the timer to activate!]
  [Note2: this is not a DB, but a Queue! It's not possible just to peek at the message and decide whether to consume it or not. This means the consumed message that has a different job ID has to be requeued back to the specific result queue so that message is not lost. Another listener to the same result queue that expects that specific job ID, should still potentially receive it. This may produce inefficiency when scaling... For simplicity reasons, currently, it's assumed that there is only a single listener for each result queue so each consumed message has the expected job ID.]
A Processing Worker responsible for that specific message queue…
- consumes and parses the OcrdProcessingMessage,
- updates its job status to running in the DB (DBProcessorJob),
- creates a Processor Instance for the required parameter set in the message, unless already cached,
- triggers execution for workspace, pages, and input/output_file_grp in the message.
The Processor Instance loops over the given pages of the workspace
(possibly in parallel, not supported yet on the processor level)…
- reading input file(s), processing, writing output file(s),
- scheduling corresponding updates on the METS server of that workspace (not supported yet),
- handling errors by skip/fallback/raise (depending on configuration)
  [probably also repeat if just OOM or I/O timeout etc.] (not supported yet on processor level),
- returns with success or failure (from raise or too many skips / fallbacks,
  depending on configuration).
The Processing Worker waits for the Processor Instance to finish execution,
then updates the job state in the DB.
(Optional steps)

The Processing Worker pushes an OcrdResultMessage to some result queue (string) that was provided in the PYJobInput request. The Processing Worker creates the result queue if not existing before pushing the result there.
The Processing Worker posts an OcrdResultMessage in json format to some callback URL that was provided in the PYJobInput request.

The active process block of the Nextflow run…
- consumes the result message,
- exits the block with job status (success/failed) in the message.
The Nextflow run…
- if not successful (not supported yet),
  - if errorStrategy=retry and maxTries are not reached, repeat from step 5,
  - otherwise aborts the overall workflow with failure,
- else if not finished, proceeds to the next process block,
- otherwise completes the workflow.
The Workflow Server writes the workflow status to the workflow entry in the DB.
[How about notification of any workflow listeners for that ID?]

Note: Check the WebAPI spec for more details.

Different levels of error handling

5 different levels (6 with the user/admin level) of error handling were identified across ocrd_network, ocrd.

The levels are listed below from 1 to 5 (from low to high, i.e., in the reverse invocation order).

1 Processor Instance level

The Processor Instance gets called with process() or process_workspace().

Error handling options:

Check here for ideas regarding processor-level error handling options.

In short, processors should handle errors on the page level.

Fail reasons:

crash due to error in the code (100% reproducible for one concrete input page)
crash due to error in the code (fuzzy/non-localized, perhaps bad memory management, etc)
crash due to race conditions?
crash due to OOM (CPU or GPU), stack/heap limit, #FH limit
crash due to I/O or network failure/timeout
per-page processing timeout
...

Resolutions:

(configurable) strategy:
- repeat
- skip page
- fallback (input-as-output, or empty-as-output)
- raise (let all pages fail collectively)
- skip/fallback may need to become raised if too many
reproducible crashes: skip/fallback/raise
transient crashes: repeat (perhaps with more resources or lower multiscalarity) up to maxTries, then skip/fallback/raise
timeout: kill, then skip/fallback/raise
ideally when raising: discern reproducible/transient, too

2 Processing Worker level

Error handling options:

Failed OcrdProcessingMessage due to (raised) Processor Instance failure should be requeued back to the message queue (unless reproducible).

[How does the Processing Worker decide if an error is reproducible?
Does it try to start the processor instance a few times?
Does it also do some error message parsing - for smarter decisions?]

If the same message fails several times (3-4, configurable), or is non-repeatable (already reproducible), then the failed message will either be

dropped or
requeued to dead-letter queue for logging purposes (not supported yet).

The redelivered flag/property (set by the message broker) of the processing message hints that the message was requeued so either of these happened:

The message was NACKed due to consumer timeout (worker has failed or still processing)
The message was NACKed manually (the processor has failed or the worker was terminated and caught the termination signal to NACK the message before termination).

However, simply by checking that property, we could not say whether:

The message was already processed - don't process it again/twice
The message was not (fully) processed - (re)process it.

Note: It's still possible to NACK a message and not requeue it back.
Note2: When processing a redelivered message the workspace should not be in a broken state and potentially the ocr-d processor argument --overwrite must be set.

The redelivered or other property does not hint at how many times a message was requeued. To track that:

A separate field inside OcrdProcessingMessage must be introduced to track the number of times a message was redelivered – a hop counter. Unfortunately, there isn't a convenient way to do that with a hop limit (as in IP stack) for a specific message. So this should be preferable over the other option below.
Another option is to consider using quorum queues that support poison message handling. Then there is an x-delivery-count variable in the header of a redelivered message. However, quorum queues come with some limitations in comparison to regular queues.

Currently:

each processing message triggers the specific OCR-D processor once
a failed message (due to bad message format or processor fail) is not requeued back to anywhere when a NACK (negative acknowledgment) is sent.

The Processing Worker sends ACKs/NACKs back to the message broker (RabbitMQ server). Based on whether the publisher (Processing Server) enabled delivery confirmations, the status may also be sent back to the publisher by the message broker. A separate handler can be implemented on the publisher side for deciding what to do on message ACKs/NACKs. For more details regarding acknowledgments and confirmations check here. Currently, the Processing Server does not do anything based on ACKs or NACKs.

It is worth noting that the Processing Worker is a publisher itself when writing the OcrdResultMessage back to the specific result queue. Hence, separate handler for ACKs/NACKs can be implemented on the Processing Worker side to make sure the result message was successfully consumed and processed by the Nextflow run.

Fail reasons:

Processor Instance for this parameter set cannot even be started (unknown model, invalid params, etc)
Network error or corrupted transport
Timeout??

Resolutions:

2.1 ...
2.2 ...

3 Processing Server level

Error handling options:

Failed Processing Workers should be identified and restarted again by the Deployer agent of the Processing Server (not supported yet).

Ideally, this error-handling strategy should be flexible and controllable through a CLI option,
e.g. --worker-failure [raise|ignore|restart], where

restart will restart the processing worker,
ignore will do nothing, and
raise will stop the Processing Server.

For job failures:

A proper timeout should be configured for the RabbitMQ Server for failed Processing Workers or workers that take too long to process a message (due to a race condition or other unknown error)
Also, on the Processing Server itself, a timeout must back each processing job: There might not be any consumer available (improper configuration or all workers of that type died).

Probably it's better to have a mechanism to track whether the queues have active consumers or not. Setting a timeout for each processing job may be hard to follow and lose track of.

Option: To set TTL for messages and queues. However, it's still hard to decide on the optimal value without having some data from actual runs to find the optimal TTL value. Moreover, how are the dropped messages by the message broker (RabbitMQ) tracked efficiently to be requeued back?
Another option may be to set the auto-delete flag on queues.The Processing Server then:

knows a queue was deleted by a raised exception when pushing the next message
knows to re-instantiate the specific worker/s for that queue.
However, this solution is problematic:
- not viable to track the cases when there is more than a single consumer in a queue
- the previous messages in the queue will be lost

Fail reasons:

3.1 ...
3.2 ...

Resolutions:

3.1 ...
3.2 ...

4 Nextflow runner level

Error handling options:

Error handling on each process level through useful directives such as errorStrategy, maxErrors, and maxRetries.
Error handling on workflow level (i.e. handlers)
Incorporate flops (not failures) into the scheme (check 1. in the TO DOs section)

Fail reasons:

4.1 ...
4.2 ...

Resolutions:

4.1 ...
4.2 ...

5 Workflow server level

Error handling options:

Failed Nextflow (manager) processes should be restarted if the fail reason is something that could potentially be fixed on the Workflow Server level.

[What are potential errors? Parsing error messages to find out what went wrong?]

Fail reasons:

5.1 ...
5.2 ...

Resolutions:

5.1 ...
5.2 ...

Open Questions:

TO DOs:

incorporate flops (not failures) into the scheme, i.e. when evaluators (not processors) deem the interim result unworthy of continuation → branching on the workflow script, or cancellation with the user in the loop.

MehmedGIT · 2023-03-20T11:25:50Z

MehmedGIT
Mar 20, 2023
Maintainer Author

@bertsky, thanks for your input and opinion so far. I had to remove some parts (that are in bold in the pad) from the initial post here because they were more fitting to be separate comments here.

2 replies

bertsky Mar 20, 2023
Collaborator

I added it as a new thread below. (Perhaps we can do the same with the Q&A.)

MehmedGIT Mar 20, 2023
Maintainer Author

I was not sure but yes makes sense

bertsky · 2023-03-20T11:37:15Z

bertsky
Mar 20, 2023
Collaborator

The Processing Worker sends ACKs/NACKs back to the message broker (RabbitMQ server). Based on whether the publisher (Processing Server) enabled delivery confirmations, the status may also be sent back to the publisher by the message broker. A separate handler can be implemented on the publisher side for deciding what to do on message ACKs/NACKs. For more details regarding acknowledgments and confirmations check here. Currently, the Processing Server does not do anything based on ACKs or NACKs.

Thus, we have a redundancy here: both the Processing Server (via delivery confirmations) and the process block of the Nextflow run (by listening to the result queue) get to see the resulting job status. If PS should act on failure, then NF would need to know that (and vice versa)...

4 replies

bertsky Mar 20, 2023
Collaborator

[@MehmedGIT wrote:]
This is not the same thing. In the case of the Processing Server (publisher) - it gets informed directly by the message broker (RabbitMQ server), in the case of Nextflow (consumer) it has to consume the result from somewhere, i.e., the result queue - to know how to proceed with the Workflow execution.
Thus, it may or may not be redundant depending on what we want to achieve. The implementation is based on the architecture in the spec. In case the architecture can be simplified - of course this is better in general.

bertsky Mar 20, 2023
Collaborator

of course it's not the same thing, but we must be careful not to react to failures inconsistently – either the PS request thread is fully responsible or the NF run block is; the other may only be "passive"

bertsky Mar 20, 2023
Collaborator

[@MehmedGIT wrote:]
The Processing Server should be doing the error-handling for levels (in call order) below itself. Nextflow is a level above the Processing Server, so in that case the Nextflow consumer is passive when it comes to error handling on the lower levels.

bertsky Mar 20, 2023
Collaborator

But for a proper separation of concerns/levels, it must not be allowed to skip/shortcut levels. The NF runner consuming the Worker's result queue message – and acting on failure – would "bypass" the Processing Server level, is what I mean. So this goes back to the question how there can be callbacks for long-running API requests. The current Web API Spec's model of having NF listen to the lower-level queue directly instead of letting the Processing Server "answer" makes it difficult to fathom what the PS's role actually is. See discussion on spec here → here → here.

@tdoan2010 ?

MehmedGIT · 2023-03-20T11:55:59Z

MehmedGIT
Mar 20, 2023
Maintainer Author

It is still not clear how to handle the correct workspace state. How can a workspace state be verified to be stable or broken?

1 reply

MehmedGIT Mar 20, 2023
Maintainer Author

[@bertsky wrote:]
There will be no need for --overwrite with proper error handling on the workflow level. It used to be necessary because the de-facto workflow engines – ocrd process and CLI-based NF without --resume – were not re-entrant. But that should have been overcome with the Web API architecture (and even for NF on pure CLI level, one can use --resume). So as long as there are no crashes during METS synchronization, a situation like a “broken workspace” should not occur.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error handling in ocr-d network and ocrd processor (page-wise) #1015

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 7 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Error handling in ocr-d network and ocrd processor (page-wise) #1015

MehmedGIT Mar 20, 2023 Maintainer

Introduction

Different levels of error handling

1 Processor Instance level

Error handling options:

Fail reasons:

Resolutions:

2 Processing Worker level

Error handling options:

Fail reasons:

Resolutions:

3 Processing Server level

Error handling options:

Fail reasons:

Resolutions:

4 Nextflow runner level

Error handling options:

Fail reasons:

Resolutions:

5 Workflow server level

Error handling options:

Fail reasons:

Resolutions:

Open Questions:

TO DOs:

Replies: 3 comments · 7 replies

MehmedGIT Mar 20, 2023 Maintainer Author

bertsky Mar 20, 2023 Collaborator

MehmedGIT Mar 20, 2023 Maintainer Author

bertsky Mar 20, 2023 Collaborator

bertsky Mar 20, 2023 Collaborator

bertsky Mar 20, 2023 Collaborator

bertsky Mar 20, 2023 Collaborator

bertsky Mar 20, 2023 Collaborator

MehmedGIT Mar 20, 2023 Maintainer Author

MehmedGIT Mar 20, 2023 Maintainer Author

MehmedGIT
Mar 20, 2023
Maintainer

Replies: 3 comments 7 replies

MehmedGIT
Mar 20, 2023
Maintainer Author

bertsky Mar 20, 2023
Collaborator

MehmedGIT Mar 20, 2023
Maintainer Author

bertsky
Mar 20, 2023
Collaborator

bertsky Mar 20, 2023
Collaborator

bertsky Mar 20, 2023
Collaborator

bertsky Mar 20, 2023
Collaborator

bertsky Mar 20, 2023
Collaborator

MehmedGIT
Mar 20, 2023
Maintainer Author

MehmedGIT Mar 20, 2023
Maintainer Author