Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement basic full-text-search #1359

Merged
merged 6 commits into from
Oct 15, 2022
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -290,13 +290,12 @@ export class FirestoreRepository implements Repository {
}

async writeCustomElements(
packageVersionMetadata: Version,
{name: packageName, version, description}: Version,
customElements: CustomElementInfo[],
distTags: string[],
author: string
): Promise<void> {
// Store custom elements data in subcollection
const {name: packageName, version, description} = packageVersionMetadata;
const versionRef = this.getPackageVersionRef(packageName, version);
const customElementsRef = versionRef.collection('customElements');
const isLatest = distTags.includes('latest');
Expand Down Expand Up @@ -326,6 +325,10 @@ export class FirestoreRepository implements Repository {
...descriptionStems,
...summaryStems,
...tagNameParts,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an intuition, no action needed. Feels like we might want to weight tag name parts a little more highly than description terms?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think weighting / relevance might be one reason to push us to use a search service rather quickly. I don't think we have a natural way of getting Firestore to consider some kind of weighting in our query, we just get a result or not for the array-contains-any operator. Sorting by array fields isn't very useful. We can get Firestore to sort by simple fields like a "ranking" field.

To do a relevance ordering we'd have to talk all the results, iterate and apply something like tf-id, and sort on that. That means our Firestore results size will have to be bigger than our requested query size because we might have high-relevance results later in the Firestore result set.

It's plausible we could continue down this path and keep it simple, but it also seems like it might get complicated real quick.

Copy link
Contributor

@aomarks aomarks Oct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, it didn't occur to me that we'd need the ranking to happen on the database side, and presumably you couldn't write a query like that (pretty sure you could with something like postgres, but firebase doubtful). I'm pretty convinced we'll need a standalone search service, then, based on lucene or similar (a service we run, or a third-party hosted one).

tagName,
// TODO (justinfagnani): tokenizing the package name is temporary
// until we don't tokenize the *entire* query
...natural.PorterStemmer.tokenizeAndStem(packageName),
]),
];

Expand Down Expand Up @@ -508,6 +511,7 @@ export class FirestoreRepository implements Repository {
if (queryTerms.length > 10) {
queryTerms.length = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why limit at 10?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the Firestore limit: https://cloud.google.com/firestore/docs/query-data/queries#array-contains-any

I should make a TODO/issue - If we keep this approach, we want to sort the query by inverse frequency so we keep the most important terms. Then 10 seems like enough. For longer queries I think you get better results with n-gram searches, but that's stretching my knowledge and probably another reason to use a service :)

}
console.log('queryTerms', queryTerms);
justinfagnani marked this conversation as resolved.
Show resolved Hide resolved
dbQuery = dbQuery.where('searchTerms', 'array-contains-any', queryTerms);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed offline, but just for the record here -- it sounds like the searchTerms field is automatically indexed, and firestore supports indexed array-contains-any lookups.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, and there are no errors in the emulator.

}

justinfagnani marked this conversation as resolved.
Show resolved Hide resolved
Expand Down
1 change: 0 additions & 1 deletion packages/catalog-server/src/lib/graphql.ts
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,6 @@ export const makeExecutableCatalogSchema = async (catalog: Catalog) => {
}
},
async elements(_parent, {query, limit}): Promise<Array<CustomElement>> {
console.log('query elements', {query, limit});
return catalog.queryElements({
query: query ?? undefined,
limit: limit ?? 25,
Expand Down
40 changes: 27 additions & 13 deletions packages/catalog-server/src/test/lib/catalog_test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,6 @@ const TEST_SEQUENCE_ONE = 'test-data-1';
// Other tests than can run independently
const TEST_SEQUENCE_TWO = 'test-data-2';


test('Imports a package with no problems', async () => {
const packageName = 'test-1';
const version = '0.0.0';
Expand Down Expand Up @@ -106,20 +105,35 @@ test('Full text search', async () => {
const catalog = new Catalog({files, repository});

// Use a term in the package description - it should match all elements
const resultOne = await catalog.queryElements({
query: 'cool',
limit: 10
});
assert.ok(resultOne);
assert.equal(resultOne.length, 2);
let result = await catalog.queryElements({query: 'cool'});
assert.equal(result.length, 2);

// Use a term in an element description
const resultTwo = await catalog.queryElements({
query: 'incredible',
limit: 10
});
assert.ok(resultTwo);
assert.equal(resultTwo.length, 1);
result = await catalog.queryElements({query: 'incredible'});
assert.equal(result.length, 1);

// Use a term not found
result = await catalog.queryElements({query: 'jandgslwijd'});
assert.equal(result.length, 0);

// Use an element name
result = await catalog.queryElements({query: '"foo-element"'});
// TODO (justinfagnani): this isn't what we want. We really just want
// The element <foo-element> to be returned, but the tokenizer we're
// using is splitting "foo-element" into ["foo", "element"] and "element"
// is matching against bar-element's search terms.
// If we keep our own search index, we'll want to use or write a tokenizer
// that preserves quoted sections for exact matches:
// http://naturalnode.github.io/natural/Tokenizers.html
assert.equal(result.length, 2);

// Use part of an element name
result = await catalog.queryElements({query: 'element'});
assert.equal(result.length, 2);

// Use a package name
result = await catalog.queryElements({query: 'test-1'});
assert.equal(result.length, 2);
});

test('Gets package version data from imported package', async () => {
Expand Down