Implement Fuzzy Search Using PostgreSQL #137
base: develop
Conversation
Walkthrough

The recent changes introduce several new SQL scripts, configuration files, and TypeScript updates to enhance the database operations for this project. These updates include creating views and indexes for efficient searches, adding dependencies, configuring data loading from SQLite to PostgreSQL, and defining database-related constants. Additionally, new SQL queries for data retrieval, deletion, insertion, and updates are added, along with TypeScript functions and types for handling Postgres database operations.
Sequence Diagram(s) (Beta)

```mermaid
sequenceDiagram
    participant User
    participant API
    participant Database
    User->>API: Request to insert data
    API->>Database: Execute upsertNewDataItem SQL query
    Database-->>API: Data inserted/updated
    API-->>User: Response with success/failure
```

```mermaid
sequenceDiagram
    participant User
    participant API
    participant Database
    User->>API: Request to fetch data attributes
    API->>Database: Execute selectDataAttributes SQL query
    Database-->>API: Return data attributes
    API-->>User: Response with data attributes
```
Actionable comments posted: 10
Outside diff range and nitpick comments (4)
src/database/pgql/bundles/cleanup.sql (1)

1-13: The DELETE queries are secure, using parameterized inputs. Consider reviewing performance if the `new_data_items` and `new_data_item_tags` tables are large, as these operations might be intensive.

src/database/sql/core/accessors.sql (1)

Line range hint 1-10: The SELECT queries use subqueries and UNIONs efficiently. Ensure that the tables involved are properly indexed to optimize these queries.

src/workers/bundle-repair-worker.ts (1)

Line range hint 23-92: Consider replacing `forEach` with `for...of` for better performance when handling large arrays, as suggested by the static analysis tool.

```diff
- this.intervalIds.forEach((intervalId) => clearInterval(intervalId));
+ for (const intervalId of this.intervalIds) {
+   clearInterval(intervalId);
+ }
```

src/workers/block-importer.ts (1)

Line range hint 160-162: Replace `forEach` with `for...of` for better performance.

```diff
- txs.forEach((tx) => {
+ for (const tx of txs) {
    this.eventEmitter.emit(events.BLOCK_TX_FETCHED, tx);
- });
+ }
```

Biome static analysis suggests avoiding `forEach` for performance reasons, especially with large arrays.

Also applies to: 167-169, 173-175
```ts
      parentId: tx.parent_id ? toB64Url(tx.parent_id) : null,
    })));

    return [];
```
Remove unreachable code.

```diff
- return [];
```

Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Tools

Biome

[error] 1325-1325: This code is unreachable (lint/correctness/noUnreachable), because this statement will return from the function beforehand.
```ts
    break;
  case tagsMatch.FUZZY_OR:
    queryWheres.push(`tn.name = '${tag.name}'`);
    let queryOr: string[] = [];
```
Encapsulate the variable declaration within a block to restrict its scope to the switch case.

```diff
  case tagsMatch.FUZZY_OR:
+   {
    let queryOr: string[] = [];
    tag.values.forEach(value => {
      queryOr.push(`similarity(convert_from(tv.value,'UTF8'),'${value.replace('*', '')}') > 0.4 `);
    });
    queryWheres.push(('(' + queryOr.join(' OR ') + ')'));
+   }
```
Tools

Biome

[error] 1029-1029: Other switch clauses can erroneously access this declaration. Wrap the declaration in a block to restrict its access to the switch clause. (lint/correctness/noSwitchDeclarations)

The declaration is defined in this switch clause. Unsafe fix: Wrap the declaration in a block.
```ts
switch (tag.match) {
  case tagsMatch.WILDCARD:
    queryWheres.push(`tn.name = \'${tag.name}\'`);
    const wildcardValue: string = tag.values[0].replace('*', '%');
```
Encapsulate the variable declaration within a block to restrict its scope to the switch case.

```diff
  case tagsMatch.WILDCARD:
+   {
    const wildcardValue: string = tag.values[0].replace('*', '%');
    queryWheres.push(`tn.name = \'${tag.name}\'`);
    queryWheres.push(`tv.value LIKE \'${wildcardValue}\'`);
+   }
```
Tools

Biome

[error] 1018-1018: Other switch clauses can erroneously access this declaration. Wrap the declaration in a block to restrict its access to the switch clause. (lint/correctness/noSwitchDeclarations)

The declaration is defined in this switch clause. Unsafe fix: Wrap the declaration in a block.
```sql
-- selectNewDataItemTags
SELECT name, value
FROM new_data_item_tags
JOIN tag_names ON tag_name_hash = tag_names.hash
JOIN tag_values ON tag_value_hash = tag_values.hash
WHERE data_item_id = @id
ORDER BY data_item_tag_index ASC
```

```sql
-- selectStableDataItemTags
SELECT name, value
FROM stable_data_item_tags
JOIN tag_names ON tag_name_hash = tag_names.hash
JOIN tag_values ON tag_value_hash = tag_values.hash
WHERE data_item_id = @id
ORDER BY data_item_tag_index ASC
```
Tip: Codebase Verification

Indexes for `tag_names.hash` and `tag_values.hash` are missing, which could lead to performance issues in the queries.

- Ensure that indexes are created for `tag_names.hash` and `tag_values.hash`.

Analysis chain

The SELECT queries are well-structured, using joins and parameterized inputs efficiently. Ensure that the necessary indexes are in place for `tag_names.hash` and `tag_values.hash` to optimize performance.

Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Description: Verify that indexes on `tag_names.hash` and `tag_values.hash` are properly set up.
# Test: Search for index creation statements in the migration scripts.
rg --type sql $'CREATE INDEX'
```

Length of output: 13127
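If verification confirms the indexes are absent, a minimal sketch of adding them might look like the following. The index names here are assumptions, not part of the PR, and the indexes are only needed if `hash` is not already each table's primary key:

```sql
-- Hypothetical index names; align with the project's migration conventions.
CREATE INDEX IF NOT EXISTS tag_names_hash_idx ON tag_names (hash);
CREATE INDEX IF NOT EXISTS tag_values_hash_idx ON tag_values (hash);
```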
Will fix this later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@foxoffire33, understood! If you need any assistance when you get to it, feel free to reach out. Good luck with the rest of your updates! 🐇
```diff
@@ -119,7 +119,7 @@ export class BlockImporter {
       previousDbBlockHash,
     },
   );
-  this.chainIndex.resetToHeight(previousHeight - 1);
+  await this.chainIndex.resetToHeight(previousHeight - 1);
```
Refactor repeated code for resetting the chain index.

```diff
- await this.chainIndex.resetToHeight(previousHeight - 1);
+ await this.resetChainIndex(previousHeight);
```

Consider extracting the repeated `resetToHeight` calls into a separate method to reduce duplication and improve maintainability.

Also applies to: 141-141

Committable suggestion was skipped due to low confidence.
Hey @foxoffire33, cool work! I'm happy to talk through issues with this. I'm not sure yet whether we'll want to integrate it. I'll follow up later today or tomorrow with more thoughts about the approach and other potential fuzzy search options. Thanks again for all the work you put into this!
@foxoffire33 I've been thinking about this more today. What I'm initially curious about is what kind of fuzzy search you want to provide and which specific Postgres features you intend to use for it (apologies if this is visible in the diffs; I probably won't be able to review those in detail till next week). I'm also curious whether you considered using the webhook support to populate some external index (e.g., OpenSearch or similar), and if not, why. Regarding potential integration: how much effort are you willing to put into maintaining this? We would like to add Postgres support.
I tried to do it with Meilisearch first, but it doesn't fit the case perfectly. Implementing the functionality I need would be similar in cost to changing the gateway itself. Since this is for a project I'm working on, I'm happy to maintain a Postgres solution for fuzzy search as long as we need it. I'm also interested in the database design you already have in place, and I might be able to contribute some code for that.
Gotcha, just wanted to make sure you were aware of the webhook option.

Great! What's the current status of the PR? Is it working for your use case now, or are you still sorting through issues?

Also great! I don't think I'll get to it till next week at this point, but I'll write up some details soon.
I'm still working on it; I don't have a lot of time. There is an issue with sorting in queries. I will provide an update when I know more.
Long overdue update from our side: we're working on a version of the schema I'd recommend (on an internal legacy database in a different code base). It's similar to the SQLite schema but uses a single height-partitioned table in Postgres, as opposed to separate 'new' and 'stable' tables. We represent …
Is it possible to share the SQL schema?
Sure. This should give you the general idea, but keep in mind this is from a different DB with some legacy decisions (strings instead of binary fields). It's also under active development, so some details might still change.

```sql
CREATE TABLE data_items (
    height integer,
    height_partition_key integer,
    id character(43) NOT NULL,
    raw_offset bigint,
    raw_size bigint,
    signature character varying, -- only here to support ANS102
    signature_type bigint,
    signature_offset bigint,
    signature_size bigint,
    owner_address character(43),
    owner_offset bigint,
    owner_size bigint,
    data_offset bigint,
    data_size bigint,
    target character(43),
    quantity character varying,
    reward character varying,
    anchor character varying,
    content_type character varying,
    format smallint,
    created_at timestamp without time zone DEFAULT now() NOT NULL,
    updated_at timestamp without time zone DEFAULT now() NOT NULL,
    parent character(43),
    parent_root_tx_offset bigint, -- offset of the parent relative to the root TX
    root_tx_id character(43), -- same as parent when it's not a nested data-item
    PRIMARY KEY (height_partition_key, id)
) PARTITION BY RANGE (height_partition_key);

CREATE INDEX data_item_id_idx ON data_items (id);
CREATE INDEX data_item_height_part_key_desc_id_idx ON data_items (height_partition_key DESC, id);
CREATE INDEX data_item_owner_address_height_part_key_id_idx ON data_items (owner_address, height_partition_key, id);
CREATE INDEX data_item_owner_address_height_part_key_desc_id_idx ON data_items (owner_address, height_partition_key DESC, id);
CREATE INDEX data_item_target_height_part_key_id_idx ON data_items (target, height_partition_key, id);
CREATE INDEX data_item_target_height_part_key_desc_id_idx ON data_items (target, height_partition_key DESC, id);

CREATE TABLE data_item_tags (
    height integer,
    height_partition_key integer,
    tx_id character(43) NOT NULL,
    index integer NOT NULL,
    name_hash bytea NOT NULL,
    value_hash bytea NOT NULL,
    created_at timestamp without time zone DEFAULT now() NOT NULL,
    updated_at timestamp without time zone DEFAULT now() NOT NULL,
    PRIMARY KEY (height_partition_key, tx_id, index)
) PARTITION BY RANGE (height_partition_key);

CREATE INDEX data_item_tags_name_value_height_part_tx_id_idx ON data_item_tags (name_hash, value_hash, height_partition_key, tx_id);
CREATE INDEX data_item_tags_name_value_height_part_key_desc_tx_id_idx ON data_item_tags (name_hash, value_hash, height_partition_key DESC, tx_id);

CREATE TABLE data_item_tag_names (
    name_hash bytea NOT NULL PRIMARY KEY,
    name character varying NOT NULL
);

CREATE INDEX data_item_tag_names_name_idx ON data_item_tag_names (name);

CREATE TABLE data_item_tag_values (
    value_hash bytea NOT NULL PRIMARY KEY,
    value character varying NOT NULL
);

CREATE INDEX data_item_tag_values_value_idx ON data_item_tag_values (value);
```
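Since both large tables above are declared `PARTITION BY RANGE (height_partition_key)`, partitions have to exist before rows can be inserted. A minimal sketch of creating the first partition follows; the bucket size of 100000 heights and the partition names are assumptions, not part of the shared schema:

```sql
-- Hypothetical partitions covering heights 0-99999; range size is an assumption.
CREATE TABLE data_items_p0 PARTITION OF data_items
    FOR VALUES FROM (0) TO (100000);

CREATE TABLE data_item_tags_p0 PARTITION OF data_item_tags
    FOR VALUES FROM (0) TO (100000);
```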
As part of the project I’m working on, I need to implement fuzzy search functionality. For example, if a user searches for an artist and makes a typo, I still want to return the correct result. To achieve this, I plan to use a feature of the PostgreSQL database. Currently, I am testing this on PostgreSQL.
Additionally, I have noticed that sorting is currently broken, and I might need some help resolving this issue. The database structure will follow in a future update.
I would appreciate any feedback you might have on this approach.
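The typo-tolerant matching described above is what Postgres's `pg_trgm` extension provides, and the `similarity()` calls visible in the diffs suggest that is the feature in play. A minimal sketch of the idea, using a hypothetical `artists` table that is not part of this PR:

```sql
-- Trigram-based fuzzy matching; requires the pg_trgm extension.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical table for illustration only.
CREATE TABLE artists (name text);

-- A GIN trigram index lets similarity searches avoid full scans.
CREATE INDEX artists_name_trgm_idx ON artists USING gin (name gin_trgm_ops);

-- A misspelled query still matches; similarity() returns a score in [0, 1].
SELECT name, similarity(name, 'Mettallica') AS score
FROM artists
WHERE similarity(name, 'Mettallica') > 0.4
ORDER BY score DESC;
```

The 0.4 threshold mirrors the value used in the PR's query-builder code; tuning it trades recall against false positives.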