Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support remote/httpfs URLs in the from field #567

Open
alexkreidler opened this issue Oct 27, 2024 · 4 comments
Open

Support remote/httpfs URLs in the from field #567

alexkreidler opened this issue Oct 27, 2024 · 4 comments

Comments

@alexkreidler
Copy link

DuckDB's HTTPFS feature, which can read parquet, csv, json, and other files on HTTP servers or cloud object storage, is an incredibly powerful tool that allows the query engine to use range reads to push down queries on parquet (and use its builtin statistics) to limit the amount of data transferred over the network. This helps DuckDB run queries really quickly even over files that might be too large to load into DuckDB WASM's memory.

When I tried this spec in Mosaic Playground:

{
    "plot": [
        {
            "mark": "lineY",
            "data": {
                "from": "read_parquet('https://f005.backblazeb2.com/file/alk-data/courtlistener/2024-10-27/opinion-clusters-2024-09-30.parquet')"
            },
            "x": "file",
            "y": "Close"
        }
    ],
    "width": 680,
    "height": 200
}

Mosaic created this query:

DESCRIBE SELECT "Date" AS "col0", "Close" AS "col1" FROM "read_parquet('https://f005/"."backblazeb2"."com/file/alk-data/courtlistener/2024-10-27/opinion-clusters-2024-09-30"."parquet')" AS "source"

And when I changed it to remove the read_parquet function I got

DESCRIBE SELECT "Date" AS "col0", "Close" AS "col1" FROM "https://f005/"."backblazeb2"."com/file/alk-data/courtlistener/2024-10-27/opinion-clusters-2024-09-30"."parquet" AS "source"

It would be great to add some logic to detect https:// and http:// strings (and maybe s3:// and hf:// which are also supported by the httpfs extension) in the from field, and output them directly into the output SQL.

from(...expr) {
const { query } = this;
if (expr.length === 0) {
// @ts-ignore
return query.from;
} else {
const list = [];
expr.flat().forEach(e => {
if (e == null) {
// do nothing
} else if (typeof e === 'string') {
list.push({ as: e, from: asRelation(e) });
} else if (e instanceof Ref) {
list.push({ as: e.table, from: e });
} else if (isQuery(e) || isSQLExpression(e)) {
list.push({ from: e });
} else if (Array.isArray(e)) {
list.push({ as: unquote(e[0]), from: asRelation(e[1]) });
} else {
for (const as in e) {
list.push({ as: unquote(as), from: asRelation(e[as]) });
}
}
});
query.from = query.from.concat(list);
return this;
}
}

And to add docs/examples for mosaic-sql, vgplot, and mosaic-spec.

@jheer
Copy link
Member

jheer commented Oct 27, 2024

Right now Mosaic assumes you first create a named table or view (as an exec call), which helps with reuse and ensures it is only loaded once. HTTPFS is of course supported for this.

@domoritz
Copy link
Member

domoritz commented Oct 28, 2024

It still seems like we are unnecessarily and incorrectly splitting the string at the . here because we assume the user meant to access a table in a schema. The workaround is to create a view but we can probably be a bit more robust here so it's not required. IIRC duckdb does cache the file reads.

@jheer
Copy link
Member

jheer commented Oct 29, 2024

It still seems like we are unnecessarily and incorrectly splitting the string at the . here because we assume the user meant to access a table in a schema. The workaround is to create a view but we can probably be a bit more robust here so it's not required. IIRC duckdb does cache the file reads.

It's not "incorrect", in the sense that the method's proper argument is a relation reference (possibly including database and schema names). It's not intended to take an arbitrary expression. However, this is certainly something we can revisit. When my schedule allows I'm hoping to make a major revision of the mosaic-sql helper library, at which point I should be able to reconsider table references more generally.

@alexkreidler
Copy link
Author

@jheer @domoritz I'll test out the view workaround, thank you!

@jheer I'm excited for those mosaic-sql updates, but I'm wondering if there's a way to make a smaller change for HTTPFS.

Right now Mosaic assumes you first create a named table or view (as an exec call), which helps with reuse and ensures it is only loaded once.

Would it break any other parts of Mosaic if mosaic-sql started generating queries that referenced HTTPFS URLs?

If not, would you be open to a PR to update Query.from to detect https:// and http:// strings? It could put them into the list and then query.from arrays so those strings are outputted directly into SQL. This would introduce a slight dependence on DuckDB making #399 more difficult.

Another option could be to change mosaic-spec PlotData type so it supports SQL expressions similar to those in encoding channels using the sql field (Expression.ts), so my read_parquet example would get handled in the isQuery(e) || isSQLExpression(e) branch of Query.from.

export interface PlotFrom {
/** The name of the backing data table. */
from: string;
/** A selection that filters the mark data. */
filterBy?: ParamRef;
/**
* A flag (default `true`) to enable any mark-specific query optimizations.
* If `false`, optimizations are disabled to aid testing and debugging.
*/
optimize?: boolean;
}

But the additional complexity might not be worth adding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants