Knowledge Representation
A number of new consumers have been developed for Baleen 2.6. The analysis.elasticsearch and analysis.mongo consumers have been created to allow simplified analytical queries over Baleen output in the respective databases. These consumers are also designed for use with the Jonah visualisation tool.
These databases can be run with Docker using the following commands:
docker run -d -p 27017:27017 mongo:3
docker run -d -p 9200:9200 -p 9300:9300 -e "http.host=0.0.0.0" -e "transport.host=0.0.0.0" -e "xpack.security.enabled=false" -e "discovery.type=single-node" -e "ES_JAVA_OPTS=-Xms750m -Xmx750m" docker.elastic.co/elasticsearch/elasticsearch:5.6.8
In addition to the existing Elasticsearch consumers, new consumers have been created for temporal and geographic searching, namely LocationElasticsearch and TemporalElasticsearch.
The LocationElasticsearch consumer creates an index from documentId to the Elasticsearch types geo_point and geo_shape. This allows for quick and scalable geo queries and aggregations such as geohash heatmaps.
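As a rough illustration of the kind of query this index supports, the Python sketch below builds an Elasticsearch search body that filters by a bounding box and buckets the hits into geohash cells for a heatmap. The field name `location` and the function name are assumptions made for the example, not names confirmed by Baleen; check the index mapping for the actual geo_point field.

```python
import json


def geohash_heatmap_query(field, top_left, bottom_right, precision=5):
    """Build a search body: bounding-box filter plus geohash grid buckets.

    `field` must be a geo_point field; `top_left` and `bottom_right`
    are (lat, lon) tuples.
    """
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "filter": {
                    "geo_bounding_box": {
                        field: {
                            "top_left": {"lat": top_left[0], "lon": top_left[1]},
                            "bottom_right": {"lat": bottom_right[0], "lon": bottom_right[1]},
                        }
                    }
                }
            }
        },
        "aggs": {
            "heatmap": {"geohash_grid": {"field": field, "precision": precision}}
        },
    }


# "location" is a hypothetical field name used only for illustration.
body = geohash_heatmap_query("location", (52.0, -1.0), (51.0, 1.0))
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against the index written by the consumer.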
The TemporalElasticsearch consumer creates an index from documentId to all Temporal mentions. This facilitates quick responses to queries about time and date ranges, as well as aggregations for timelines and date histograms. It uses the date type for ‘single’ precision mentions and the date_range type for ‘range’ precision mentions. Relative time mentions are not included, as they do not have a fixed point in time to reference.
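A timeline over the ‘single’ precision mentions could be requested along the following lines. This is only a sketch: the field name `date` and the function name are assumptions, and `interval` is the date histogram parameter name used by Elasticsearch 5.x (later releases use different parameter names).

```python
import json


def timeline_query(field, start, end, interval="month"):
    """Build a search body: date range filter plus a date histogram."""
    return {
        "size": 0,  # return buckets only
        "query": {"range": {field: {"gte": start, "lte": end}}},
        "aggs": {
            "timeline": {"date_histogram": {"field": field, "interval": interval}}
        },
    }


# "date" is a hypothetical field name used only for illustration.
print(json.dumps(timeline_query("date", "2017-01-01", "2017-12-31"), indent=2))
```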
Note that the meta-time (the time a document is authored or published) is included in the main Elasticsearch document as part of the metadata.
A significant new development for Baleen 2.6 is the ability to output two different graph structures for entities that are linked by coreference, relationships or events. The first, Document Graph, represents the annotations in the document and faithfully stores all the relevant information from the Content Information level. From this graph, any reasonable graph representation can be derived through filtering, aggregation or path short-cutting. The second, Entity Graph, is a derived graph representing the higher level entity and relation information. In this case the reference target nodes from the document are mapped to entity nodes using a configurable mapping of attributes from the associated mentions.
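To make the relationship between the two graphs concrete, the Python sketch below collapses mention nodes into one entity node per reference target, using a configurable mapping from attribute name to a function that combines the values found on the associated mentions. The data layout (`id`, `target`, `type`, `value`), the combining functions and all names are hypothetical; this is not Baleen's implementation, only an illustration of the idea.

```python
from collections import defaultdict


def first(values):
    """Keep the first value seen."""
    return values[0]


def longest(values):
    """Keep the longest string, a simple stand-in for 'most descriptive'."""
    return max(values, key=len)


def derive_entity_graph(mentions, relations, attribute_mapping):
    """Collapse mentions into one entity node per reference target."""
    grouped = defaultdict(list)
    for mention in mentions:
        grouped[mention["target"]].append(mention)

    entities = {}
    for target, group in grouped.items():
        node = {}
        for attr, combine in attribute_mapping.items():
            values = [m[attr] for m in group if attr in m]
            if values:
                node[attr] = combine(values)
        entities[target] = node

    # Re-point mention-level relations at the entity nodes, dropping duplicates.
    target_of = {m["id"]: m["target"] for m in mentions}
    edges = sorted(
        {(target_of[r["source"]], r["type"], target_of[r["target"]]) for r in relations}
    )
    return entities, edges


mentions = [
    {"id": "m1", "target": "e1", "type": "Person", "value": "John Smith"},
    {"id": "m2", "target": "e1", "type": "Person", "value": "he"},
    {"id": "m3", "target": "e2", "type": "Location", "value": "London"},
]
relations = [{"source": "m2", "type": "visited", "target": "m3"}]
mapping = {"type": first, "value": longest}

entities, edges = derive_entity_graph(mentions, relations, mapping)
print(entities)
print(edges)
```

Here the two coreferent mentions "John Smith" and "he" become a single entity node, and the relation between mentions becomes an edge between entities.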
The following graph-based consumers have been implemented:
- print.documentGraph and print.entityGraph - to log the graph output as GraphML or JSON
- file.documentGraph and file.entityGraph - to write to a file in GraphML, JSON or the Kryo binary format
- Neo4JDocumentGraphConsumer and Neo4JEntityGraphConsumer - to write to Neo4j using the Bolt protocol (https://boltprotocol.org/)
- DocumentGraphConsumer and EntityGraphConsumer - to write to TinkerPop supported graph databases. See https://github.com/mohataher/awesome-tinkerpop for a list of supported graph databases. The required graph driver may need to be added to the classpath.
See Baleen graph and RDF examples for a more detailed description and examples.
Baleen's output can be written using a simple OWL schema based on the Document and Entity graph structures defined above, using the file.Rdf or file.RdfEntityGraph consumers as follows.
consumers:
  - class: file.Rdf
    outputDirectory: ./output_rdf
    format: RDF_XML
  - class: file.RdfEntityGraph
    outputDirectory: ./output_entity_rdf
    format: RDF_XML
Where the supported formats are:
- RDF_XML - Standard RDF XML serialisation
- TURTLE - Terse RDF Triple Language. Output is similar in form to SPARQL
- RDF_XML_ABBREV - Abbreviated RDF XML serialisation
- N_TRIPLES - Each line is a triple in the form "Subject Predicate Object ."
- RDF_JSON - A JSON representation of the RDF, see https://jena.apache.org/documentation/io/rdf-json.html
- JSONLD - JSON for Linked Data, see https://json-ld.org/
- N3 - Notation3, a human-readable triple format.
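To show what the "Subject Predicate Object ." form of N_TRIPLES looks like in practice, the short Python sketch below formats a single triple with a string literal as its object. The example.org IRIs are placeholders and do not come from Baleen's schema; the helper names are likewise hypothetical.

```python
def escape_literal(text):
    """Escape backslash, double quote, newline and carriage return."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriple(subject_iri, predicate_iri, literal):
    """Format one triple whose object is a plain string literal."""
    return f'<{subject_iri}> <{predicate_iri}> "{escape_literal(literal)}" .'


# Placeholder IRIs, used only for illustration.
line = to_ntriple(
    "http://example.org/entity/1",
    "http://example.org/schema#value",
    'John "Jack" Smith',
)
print(line)
```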
Alternatively, Baleen can output to external triple stores that support the SPARQL Graph Store protocol. For example, Baleen can output to Fuseki (running locally with a collection named 'baleen') using:
consumers:
  - class: rdf.RdfDocumentGraphConsumer
    query: http://localhost:3030/baleen/query
    update: http://localhost:3030/baleen/update
    store: http://localhost:3030/baleen/data
See Baleen graph and RDF examples for a more detailed example of output to Fuseki.
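Once Baleen has written to the triple store, the output can be inspected through the query endpoint from the Fuseki configuration above. The following Python sketch posts a SPARQL SELECT query using only the standard library. It assumes Fuseki is running locally with the 'baleen' collection; the function names and the example query are illustrative rather than part of Baleen.

```python
import json
import urllib.parse
import urllib.request

# Query endpoint taken from the consumer configuration above.
QUERY_ENDPOINT = "http://localhost:3030/baleen/query"


def build_sparql_request(endpoint, sparql):
    """Build a form-encoded POST request asking for JSON results."""
    data = urllib.parse.urlencode({"query": sparql}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
        method="POST",
    )


def run_select(endpoint, sparql):
    """Send the query and return the list of result bindings."""
    with urllib.request.urlopen(build_sparql_request(endpoint, sparql)) as resp:
        return json.load(resp)["results"]["bindings"]


# Example usage (requires a running Fuseki instance):
# for row in run_select(QUERY_ENDPOINT, "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"):
#     print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```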