James Baker edited this page Feb 9, 2016 · 1 revision

Baleen includes a number of gazetteer annotators, which extract entities from text by matching it against a list of entities you are interested in. Baleen uses two gazetteer formats; both are described below so that you can create your own.

File Based Gazetteers

File-based gazetteers are the simplest format to create: they are plain text files. Each line is treated as a separate entry in the gazetteer, and aliases for the same term (e.g. USA and United States of America) are comma-separated on the same line. The separator can be changed to a different character in the annotator configuration.

For example:

usa, united states, united states of america
uk, united kingdom, united kingdom of great britain and northern ireland
france
germany
spain
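
To make the format concrete, the following Python sketch reads gazetteer text the way the description above suggests: one entry per line, aliases split on a configurable separator. It is an illustration only, not Baleen's own loader; the function name `parse_gazetteer` is hypothetical, and trimming whitespace around aliases and skipping blank lines are assumptions rather than documented Baleen behaviour.

```python
def parse_gazetteer(text, separator=","):
    """Return a list of entries, each entry being a list of aliases.

    Hypothetical helper: illustrates the file format only.
    """
    entries = []
    for line in text.splitlines():
        # Assumption: whitespace around each alias is not significant.
        aliases = [term.strip() for term in line.split(separator)]
        # Assumption: empty terms and blank lines are ignored.
        aliases = [term for term in aliases if term]
        if aliases:
            entries.append(aliases)
    return entries


sample = """usa, united states, united states of america
uk, united kingdom, united kingdom of great britain and northern ireland
france
"""

entries = parse_gazetteer(sample)
for aliases in entries:
    print(aliases)
```

Passing a different `separator` (for example `"|"`) mirrors changing the separator character in the annotator configuration.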

Mongo Based Gazetteers

Mongo-based gazetteers are more powerful, but harder to configure. Each document within Mongo is treated as a separate entry in the gazetteer; to specify aliases, use an array instead of a single value. By default, the field containing the value is called value, although this can be configured in the annotator.

If there are additional fields in the Mongo document that have the same name as a property on the type you are creating (e.g. geoJson on the Location type), then this information will be added to the annotation in the relevant property.

For example:

{
	"value": [
		"london",
		"londres"
	],
	"geoJson": {
		"type": "Point",
		"coordinates": [
			-0.117,
			51.500
		]
	}
}
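
The Python sketch below shows one way such a document could be interpreted: the value field is read as either a single string or an array of aliases, and any other field whose name matches a property of the target type is carried over. It works on a plain dictionary rather than a live Mongo connection and is not Baleen's implementation; `read_entry` and the `type_properties` argument (standing in for the properties of a type such as Location) are hypothetical names.

```python
def read_entry(document, value_field="value", type_properties=("geoJson",)):
    """Split a gazetteer document into aliases and matching properties.

    Hypothetical helper: type_properties stands in for the property
    names defined on the annotation type being created.
    """
    raw = document[value_field]
    # A single value is treated as a one-element list of aliases.
    aliases = [raw] if isinstance(raw, str) else list(raw)
    # Keep only the extra fields that match a property on the type.
    properties = {
        key: value
        for key, value in document.items()
        if key != value_field and key in type_properties
    }
    return aliases, properties


london = {
    "value": ["london", "londres"],
    "geoJson": {"type": "Point", "coordinates": [-0.117, 51.500]},
    "note": "not a property of the type, so it is dropped",
}

aliases, properties = read_entry(london)
print(aliases)
print(properties)
```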

There are a number of different gazetteer annotators that use a Mongo database. For more information, refer to the Javadoc included within Baleen.