- Ingest and index streamcorpus Chunk files from S3. The chunk files might be part of a public data set, e.g. the TREC KBA StreamCorpora, or loaded by a private application of streamcorpus-pipeline.
- Enable easy discovery of which web domains contain particular keywords and language content. For example, show me the web domains that contain the most documents written in the Chinese language and containing the word "Hétóng," which means "deal." This will enable us to manually select which domains to prioritize when crawling more data.
- Enable easy browsing of documents matching keyword and language constraints. For example, let me read each document that mentions "Hétóng" and see which web domains contain the most documents matching the query (see the query sketch after this list).
- Is operated and configured with salt, so that all of your private config info is stored in your own private git repo holding all the salt states.
- Documented best practices for forking the repo, customizing it, and pulling updates from the public upstream repo.
- Creates an ElasticSearch cluster using EC2 APIs.
- Configures it for ingesting StreamItems and indexing the appropriate fields (see Use Cases below).
- Configures the ElasticSearch user interface to enable filtering on StreamItem metadata (details in the Use Cases below).
- Display a list of results with excerpted snippets showing the query terms.
- Display facets of metadata indexed with the documents to support the two use cases below.
- View a document's clean_visible text inside of HTML <pre> tags, so that the whitespace between words is visible.
- Public pypi package (pushed by a buildbot that Diffeo operates).
- py.test unit tests with >80% coverage as measured by the coverage tool.
- View a document's clean_html in an iframe.
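To make the discovery and browsing use cases above concrete, a single query can combine a full-text match on clean_visible, a filter on the document language, a terms aggregation over the domain field, and highlighting for snippets. The sketch below uses the elasticsearch-py client with ElasticSearch 1.x query syntax; the index name streamitems and the flattened field names (clean_visible, language_name, domains) are assumptions to be settled during implementation, not a fixed schema.

```python
# -*- coding: utf-8 -*-
# Sketch: "Chinese documents mentioning Hétóng, grouped by web domain."
# Index name and field names are illustrative assumptions, not a final schema.
from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])

query = {
    'query': {
        'filtered': {
            'query': {'match': {'clean_visible': u'Hétóng'}},
            'filter': {'term': {'language_name': 'Chinese'}},
        }
    },
    # facet: which domains contain the most matching documents
    'aggs': {'top_domains': {'terms': {'field': 'domains', 'size': 20}}},
    # excerpted snippets showing the query terms
    'highlight': {'fields': {'clean_visible': {}}},
    'size': 10,
}

response = es.search(index='streamitems', body=query)
for bucket in response['aggregations']['top_domains']['buckets']:
    print('%s\t%d' % (bucket['key'], bucket['doc_count']))
```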
This is a rough cut at a possible design. Nothing in this is sacred; it should all be challenged and re-evaluated during implementation.
- salt config pieces for configuring elasticsearch to consume StreamItems starting from a bare ubuntu cloud image
- python module that provides a streamcorpus-pipeline writer stage for pushing into elasticsearch with all the metadata fields constructed (see the sketch just after this list).
- tests can spin up a new instance and run a small selection of specific chunk files from the TREC KBA 2014 Serif-only corpus through it, and then run a battery of tests against it. This corpus contains all of the metadata described below, so tests can cover the full list of indexed field types.
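The writer stage might look roughly like the following, assuming it can be expressed as a callable that receives the StreamItems of a chunk and bulk-indexes them; the exact stage and configuration interface must follow streamcorpus-pipeline's writer conventions, and the names ElasticsearchWriter and to_index_doc are hypothetical.

```python
# Hypothetical writer stage that pushes StreamItems into ElasticSearch.
# The stage interface and helper names are illustrative; the real stage must
# follow streamcorpus-pipeline's writer conventions.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


def to_index_doc(si):
    '''Flatten the StreamItem fields that we want to index.'''
    return {
        'stream_id': si.stream_id,
        'doc_id': si.doc_id,
        'abs_url': si.abs_url,
        'epoch_ticks': si.stream_time.epoch_ticks,
        'zulu_timestamp': si.stream_time.zulu_timestamp,
        'language_name': si.body.language.name if si.body.language else None,
        'clean_visible': (si.body.clean_visible.decode('utf8')
                          if si.body.clean_visible else None),
    }


class ElasticsearchWriter(object):
    '''Bulk-index a batch of StreamItems into one ElasticSearch index.'''

    def __init__(self, hosts, index_name='streamitems'):
        self.es = Elasticsearch(hosts)
        self.index_name = index_name

    def __call__(self, stream_items):
        actions = [{'_index': self.index_name,
                    '_type': 'streamitem',
                    '_id': si.stream_id,
                    '_source': to_index_doc(si)}
                   for si in stream_items]
        bulk(self.es, actions)
```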
Fields needed for base requirements:
- full-text search on clean_visible
- faceted search on si.body.language.name
- nested faceted queries on DNS domains, e.g. [boggle.doggy.com, doggy.com, com] (derived at index time; see the helper sketch at the end of this section)
- fielded exact match queries on stream_id, doc_id, abs_url
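One possible mapping for those base fields, created here via elasticsearch-py using the ElasticSearch 1.x mapping syntax; the index name, type name, and flattened field names are assumptions, and the analyzer choice for clean_visible should be revisited for CJK text.

```python
# Sketch of an index mapping covering the base-requirement fields.
# Field names flatten the StreamItem attributes and are illustrative only.
from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])

mapping = {
    'mappings': {
        'streamitem': {
            'properties': {
                # full-text search with snippet highlighting
                'clean_visible': {'type': 'string'},
                # facet fields must not be analyzed into tokens
                'language_name': {'type': 'string', 'index': 'not_analyzed'},
                # domain-suffix chain, e.g. ["boggle.doggy.com", "doggy.com", "com"]
                'domains': {'type': 'string', 'index': 'not_analyzed'},
                # fielded exact-match lookups
                'stream_id': {'type': 'string', 'index': 'not_analyzed'},
                'doc_id': {'type': 'string', 'index': 'not_analyzed'},
                'abs_url': {'type': 'string', 'index': 'not_analyzed'},
            }
        }
    }
}

es.indices.create(index='streamitems', body=mapping)
```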
Other fields to index for future phases:
- range queries on epoch_ticks
- nested faceted queries on datetime buckets from zulu_timestamp prefixes: YEAR, YEAR-MONTH, YEAR-MONTH-DAY, YEAR-MONTH-DAY-HOUR (see the helper sketch at the end of this section)
- nested faceted search on tagger_id-->entity_type-->mention tokens (from boNAME and boNOM)
- range queries on len(clean_visible)
- Can this effectively utilize spot instances for "elastic" scaling when query load bursts?
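The domain-suffix chain and the datetime bucket values above are not stored on the StreamItem directly, so the writer stage would derive them at index time from abs_url and zulu_timestamp. A minimal sketch of that derivation (Python 2 import shown; the function names are hypothetical):

```python
# Hypothetical helpers for the derived facet fields.
from urlparse import urlparse  # urllib.parse on Python 3


def domain_chain(abs_url):
    '''["boggle.doggy.com", "doggy.com", "com"] for a URL on boggle.doggy.com.'''
    hostname = urlparse(abs_url).hostname or ''
    parts = hostname.split('.')
    return ['.'.join(parts[i:]) for i in range(len(parts)) if parts[i]]


def datetime_buckets(zulu_timestamp):
    '''Prefixes of e.g. "2014-03-07T12:34:56.000000Z" for nested facets.'''
    return [zulu_timestamp[:4],    # YEAR
            zulu_timestamp[:7],    # YEAR-MONTH
            zulu_timestamp[:10],   # YEAR-MONTH-DAY
            zulu_timestamp[:13]]   # YEAR-MONTH-DAY-HOUR
```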