Implement schema_modes config to infer schema fields from document #6
Comments
It's a little difficult to get up to speed on this issue without reading through and parsing a fairly lengthy conversation on #5. Can we get a |
Good idea! I can summarize the key points from the other thread to keep the context going. |
@MadLittleMods I noticed in your response that the definitions of the two modes are switched from what I wrote. My understanding was that |
@dmosorast Same page, I just assumed from the naming that |
Thanks for this, everyone. Hopefully someone can act on this now. :) |
The Issue
MongoDB doesn't have a strict data model, but extracted data should conform to a schema of some kind so that targets and other downstream processes can load it with appropriate data types.
The Proposed Solution
Two modes: `schemaless` and `strict`.
These modes are to be specified in the tap's config and associated with collections present in the Mongo database, in this form:
schemaless
This mode will emit a `SCHEMA` message with an empty `object` JSON schema, `{"type": "object"}`, as the tap currently does. It is looser about which fields should be loaded by the target, and it offloads the data typing logic to the target.
strict
This mode will sample the most recent document in the collection and use the fields it finds on that document to create the `SCHEMA` to emit for that table's records. In this case, the target can rely on the `SCHEMA` message to type the data based on the JSON schema types found in that message.
This isn't guaranteed to cover all cases, so any method of making the data typing more accurate (without having to sample every single document) can be applied in the final implementation. As such, I think the design of `strict` is open to further refinement.
As proposed by @dmosorast, #5 (comment)
Implement `schema_modes` config:
`strict`: Infer schema fields from latest document in the collection
`schemaless`: No change (backwards compatible)
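To make the two modes concrete, here is a minimal Python sketch of how a tap might build the `SCHEMA` message for each one. It is an illustration under stated assumptions, not the tap's actual implementation: the `SCHEMA_MODES` mapping (collection name to mode), the helper names `json_type`, `build_schema`, and `schema_message`, the type mapping, and the `_id` key property are all hypothetical. Fetching the latest document is left out; with PyMongo it could be something like `collection.find_one(sort=[("_id", -1)])`, assuming `_id` order is an acceptable stand-in for "most recent".

```python
from datetime import datetime

# Hypothetical config shape: collection name -> schema mode.
SCHEMA_MODES = {"users": "strict", "events": "schemaless"}


def json_type(value):
    """Map a Python value from a sampled document to a JSON schema fragment."""
    if value is None:
        return {"type": "null"}
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, datetime):
        return {"type": "string", "format": "date-time"}
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {key: json_type(val) for key, val in value.items()},
        }
    if isinstance(value, list):
        return {"type": "array"}
    # Fallback (assumption): strings and anything else, such as ObjectId,
    # are emitted as strings.
    return {"type": "string"}


def build_schema(mode, latest_document=None):
    """Return the JSON schema for a collection under the given mode."""
    if mode == "strict" and latest_document:
        return json_type(latest_document)
    # schemaless, or nothing to sample: the empty object schema.
    return {"type": "object"}


def schema_message(stream, latest_document=None):
    """Build a Singer-style SCHEMA message for one collection."""
    mode = SCHEMA_MODES.get(stream, "schemaless")  # default stays backwards compatible
    return {
        "type": "SCHEMA",
        "stream": stream,
        "schema": build_schema(mode, latest_document),
        "key_properties": ["_id"],  # assumption for illustration
    }
```

Because only one document is sampled, a field that is missing or `null` in that document ends up absent or typed as `null`, which is the accuracy gap the `strict` description leaves open to refinement.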