Implement schema_modes config to infer schema fields from document #6

MadLittleMods · 2018-11-09T18:25:08Z

MongoDB doesn't have a strict data model, but extracting data should conform to a schema of some kind to help the targets and other downstream processes load the data with appropriate data types.

The Proposed Solution

Two modes: schemaless and strict

Each mode is specified in the tap's config and associated with a collection in the MongoDB database, in this form:

config.json

{
   "schema_modes": {
       "users": "strict",
       "events": "schemaless",
       ...
   },
   ...
}
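To make the config shape concrete, here is a minimal sketch of how a tap might look up the mode for a collection. The function name `resolve_schema_mode` and the fallback to `schemaless` for unlisted collections are assumptions for illustration (the fallback follows from schemaless being the backwards-compatible mode); neither comes from the tap's code.

```python
import json

VALID_MODES = {"strict", "schemaless"}
# Assumption: collections not listed in schema_modes keep today's behavior.
DEFAULT_MODE = "schemaless"


def resolve_schema_mode(config, collection_name):
    """Return the schema mode configured for a collection (hypothetical helper)."""
    mode = config.get("schema_modes", {}).get(collection_name, DEFAULT_MODE)
    if mode not in VALID_MODES:
        raise ValueError(
            "Unknown schema mode {!r} for collection {!r}".format(mode, collection_name)
        )
    return mode


# Same shape as the config.json example, with the elided entries left out.
config = json.loads('{"schema_modes": {"users": "strict", "events": "schemaless"}}')

print(resolve_schema_mode(config, "users"))   # strict
print(resolve_schema_mode(config, "orders"))  # schemaless (not listed, falls back)
```

Rejecting unknown mode names up front would surface typos in the config before any data is extracted.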

schemaless

This mode will emit a SCHEMA message with an empty object JSON schema, {"type": "object"}, as the tap currently does. It is looser about which fields the target should load, and it offloads the data typing logic to the target.
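As a rough illustration, a schemaless SCHEMA message could be built as below. The message layout follows the Singer spec (`type`, `stream`, `schema`, `key_properties`); the helper name and the choice of `_id` as the key property are assumptions, not taken from the tap.

```python
import json


def schemaless_schema_message(stream_name):
    """Build a SCHEMA message that places no constraints on record fields."""
    return {
        "type": "SCHEMA",
        "stream": stream_name,
        "schema": {"type": "object"},  # empty object schema: no properties declared
        "key_properties": ["_id"],     # assumption: MongoDB's _id is the primary key
    }


print(json.dumps(schemaless_schema_message("events")))
```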

strict

This mode will sample the most recent document in the collection and use the fields it finds on that document to create the SCHEMA to emit for that collection's records. In this case, the target can rely on the SCHEMA message to type the data based on the JSON schema types it contains.

This isn't guaranteed to cover all cases, so any method of making the data typing more accurate (without having to sample every single document) can be applied in the final implementation. As such, I think the design of strict is open to further refinement.
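One way strict inference could work is sketched below, under stated assumptions: the sampled document is already in hand as a plain Python dict, and the type mapping is illustrative rather than the tap's actual logic. With pymongo, the latest document might be fetched with something like `collection.find_one(sort=[("_id", pymongo.DESCENDING)])`, which treats the highest `_id` as "most recent"; that holds for default ObjectIds but not necessarily for custom keys.

```python
import datetime


def infer_type(value):
    """Map one Python value to a JSON schema fragment (illustrative mapping)."""
    if value is None:
        return {"type": "null"}
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, datetime.datetime):
        return {"type": "string", "format": "date-time"}
    if isinstance(value, dict):
        return infer_schema(value)
    if isinstance(value, list):
        return {"type": "array"}
    # Assumption: anything else (str, ObjectId, ...) is emitted as a string.
    return {"type": "string"}


def infer_schema(document):
    """Build an object schema from the fields found on a single document."""
    return {
        "type": "object",
        "properties": {key: infer_type(value) for key, value in document.items()},
    }


sample = {"name": "Ada", "age": 36, "active": True, "address": {"city": "London"}}
print(infer_schema(sample))
```

Because only one document is sampled, fields missing from that document never reach the schema, and a field that is null in the sample is typed as null only. That is the gap the refinement mentioned above would need to close.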

As proposed by @dmosorast, #5 (comment)

Implement schema_modes config:

{
    "schema_modes": {
        "users": "strict",
        "events": "schemaless",
        ...
    },
    ...
}

strict: Infer schema fields from latest document in the collection
schemaless: No change (backwards compatible)


timvisher · 2018-11-12T19:11:27Z

It's a little difficult to get up to speed on this issue without reading through and parsing a fairly lengthy conversation on #5.

Can we get a TL;DR here summarizing the issue being addressed, the proposed solution, and finally the details (as you already have)? Thanks!

dmosorast · 2018-11-14T00:44:39Z

Good idea! I can summarize the key points from the other thread to keep the context going.

Summary

The Issue

MongoDB doesn't have a strict data model, but extracting data should conform to a schema of some kind to help the targets and other downstream processes load the data with appropriate data types.

The Proposed Solution

Two modes: schemaless and strict

Each mode is specified in the tap's config and associated with a collection in the MongoDB database, in this form:

config.json

{
   "schema_modes": {
       "users": "strict",
       "events": "schemaless",
       ...
   },
   ...
}

schemaless

This mode will emit a SCHEMA message with an empty object JSON schema, {"type": "object"}, as the tap currently does. It is looser about which fields the target should load, and it offloads the data typing logic to the target.

strict

This mode will sample the most recent document in the collection and use the fields it finds on that document to create the SCHEMA to emit for that collection's records. In this case, the target can rely on the SCHEMA message to type the data based on the JSON schema types it contains.

This isn't guaranteed to cover all cases, so any method of making the data typing more accurate (without having to sample every single document) can be applied in the final implementation. As such, I think the design of strict is open to further refinement.

dmosorast · 2018-11-14T00:46:23Z

@MadLittleMods I noticed in your response that the definitions of the two modes are switched from what I wrote. My understanding was that schemaless is the backwards compatible mode, and strict would sample the latest document. Is that not accurate? I'd like to confirm to make sure we're on the same page.

MadLittleMods · 2018-11-14T01:15:35Z

@dmosorast Same page, I just assumed from the naming that schemaless meant it was fluid and could adapt to the documents (ambiguous). But your way of defining them is good with me (feel free to update description).

timvisher · 2018-11-14T14:49:31Z

Thanks for this, everyone. Hopefully someone can act on this now. :)

timvisher added the enhancement and question labels Nov 12, 2018

MadLittleMods mentioned this issue Nov 14, 2018

What targets/databases has this been used with? - Assume schema properties from documents #5

Closed

timvisher removed the question label Nov 14, 2018
