Ordering Annotators
Since Baleen 2.4, Baleen tries by default to automatically optimise the order of your pipeline. However, you may want to specify the order manually, either to achieve certain effects or because Baleen is unable to order your pipeline automatically.
Baleen uses a Pipeline Orderer to attempt to optimise the order of annotators and consumers in a pipeline. By default, it uses the DependencyGraph orderer, which orders the pipeline based on the declared inputs and outputs of each component.
You can change the orderer by adding the following to your pipeline configuration. Below, we specify the NoOpOrderer, which will run the pipeline in the order specified in the YAML.
orderer: uk.gov.dstl.baleen.core.pipelines.orderers.NoOpOrderer
Note that Baleen will separately order the annotators and consumers, with annotators being run before the consumers. The collection reader will always run first.
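To show where the orderer setting might sit relative to the rest of a pipeline, here is a minimal sketch. Only the orderer line and the two annotator names come from this page; the top-level collectionreader, annotators and consumers keys, the FolderReader name and the ExampleConsumer entry are assumptions made for illustration, so check them against your own configuration.

```yaml
# Sketch only: every key and name except `orderer` and the two annotators
# is an assumption about the pipeline file layout, not taken from this page.
orderer: uk.gov.dstl.baleen.core.pipelines.orderers.NoOpOrderer

collectionreader:            # assumed key; the reader always runs first
  class: FolderReader        # assumed reader name, for illustration only

annotators:                  # with NoOpOrderer, run in exactly this order
  - language.OpenNLP
  - cleaners.RemoveNestedEntities

consumers:                   # assumed key; run after all annotators
  - ExampleConsumer          # hypothetical placeholder name
```

With the default orderer, the annotators and consumers lists would each be reordered independently, so the order written in the file matters only when an orderer such as NoOpOrderer is selected.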
The majority of annotators can run anywhere in the pipeline without any issues - this includes the majority of Regular Expression (RegEx) annotators. However, some annotators are dependent on the outputs of other annotators, and therefore must follow these annotators in the processing chain. Below is some general guidance to help you decide on an appropriate order for your purposes.
- The Language OpenNLP annotator (language.OpenNLP) should generally come first in the pipeline, as a number of other annotators are dependent on it.
- Cleaners should generally come at the end of the pipeline, as there's no point cleaning up until we've done everything else!
- In general, the more generic a cleaner or annotator, the later it should come in the pipeline so that specific cases have already been dealt with. For example, cleaners.RemoveNestedEntities should come after cleaners.RemoveDateTimes and cleaners.RemoveLocations.
- Grammatical annotators are usually best placed after most of the other annotators, as they often rely on existing annotations as well as the grammatical information.
- Coreference cleaners are usually best placed after most of the other cleaners, so that they have the 'best' annotations to work with when trying to find coreferences.
The advice above is just general guidance, and there will be exceptions to most rules. To properly 'optimise' your pipeline, you should read the Javadoc for the annotators you are using to understand how they work and therefore the potential dependencies between annotators.
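As an illustration of the general guidance, a manually ordered annotator list might be laid out as in the sketch below. It uses only the annotator names mentioned on this page and leaves the other groups as comments, since the right choices depend on your pipeline. The annotators key, and the placement of cleaners within it, are assumptions.

```yaml
# Sketch only: a manual ordering that follows the general guidance.
# The `annotators` key is an assumed part of the pipeline file layout.
orderer: uk.gov.dstl.baleen.core.pipelines.orderers.NoOpOrderer

annotators:
  # 1. Language annotator first, as other annotators depend on it
  - language.OpenNLP
  # 2. Specific annotators (for example, RegEx annotators) go here
  # 3. Grammatical annotators go here, after most other annotators
  # 4. Cleaners near the end, specific ones before generic ones
  - cleaners.RemoveDateTimes
  - cleaners.RemoveLocations
  - cleaners.RemoveNestedEntities
  # 5. Coreference cleaners go here, after most other cleaners
```

The NoOpOrderer is selected here so that the order written in the file is the order actually used.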