
Transformer fields set does not match our data #25

Closed
rolanddb opened this issue Mar 8, 2017 · 11 comments

@rolanddb commented Mar 8, 2017

Hi,
I'm trying to load the atomic.events data from Snowplow into Spark.
I'd like to do this using the EventTransformer.transform() method.

We see a mismatch between the fields in our data and the fields the transform method expects, so every event is marked as a failure (unable to parse).

This is the mismatch:
Fields in our data: 128
Fields in SDK transformer: 131
In transformer but not in data: Set(derived_contexts, unstruct_event, contexts, refr_device_tstamp)
In data but not in transformer: Set(refr_dvce_tstamp)

  • Was refr_dvce_tstamp renamed?
  • Why are contexts and unstruct_event missing from our data?
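The kind of mismatch reported above can be reproduced with a quick set comparison between the SDK's field list and the columns in the data. A minimal sketch (the field names below are an illustrative subset; the real lists have 131 and 128 entries):

```scala
object FieldDiff {
  def main(args: Array[String]): Unit = {
    // Illustrative subsets of the two field lists
    val sdkFields  = Set("app_id", "contexts", "unstruct_event",
                         "derived_contexts", "refr_device_tstamp")
    val dataFields = Set("app_id", "refr_dvce_tstamp")

    // In transformer but not in data
    println(sdkFields.diff(dataFields))
    // In data but not in transformer
    println(dataFields.diff(sdkFields))
  }
}
```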

I can make it work by forking the SDK and modifying the transform method, but ideally I'd keep using the main branch.

Thanks!

@alexanderdean (Member) commented Mar 8, 2017

Hi @rolanddb - hopefully all the answers you need are in #5. It would be great if you could open a PR adding support for your version of the enriched events!

@alexanderdean (Member)

Ah @rolanddb - I read this slightly too fast. Are you trying to parse an extract from Redshift using this SDK? That's not supported - you should be pointing this at your enriched:good:archive instead.
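A minimal sketch of pointing the SDK at the enriched archive from Spark (hedged: the S3 path is a placeholder, `sc`/`sqlContext` are the Spark shell values, and the scalaz `Validation` return type of `transform` is an assumption about this SDK version):

```scala
import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer

// Read enriched (not shredded) TSV events from the enriched:good archive.
// The bucket below is a placeholder.
val lines = sc.textFile("s3://my-archive-bucket/enriched/good/")

// transform takes one enriched TSV line and is assumed to return a
// scalaz Validation: a JSON string on success, error messages on failure
val jsons = lines
  .map(line => EventTransformer.transform(line))
  .filter(_.isSuccess)
  .flatMap(_.toOption)

val events = sqlContext.read.json(jsons)
```

Events that fail validation can be inspected via the failure side of the Validation instead of being silently dropped.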

@rolanddb (Author) commented Mar 8, 2017

Ah! So it wasn't me :)
I can make a PR tomorrow, it's a trivial change.
Thanks.

@alexanderdean (Member)

See follow-up comment! I misunderstood your situation I think...

@rolanddb (Author) commented Mar 8, 2017

@alexanderdean I'm loading the events from S3, main/shredded/good.
Should I be using 'enriched' instead?

@alexanderdean (Member)

Exactly, yes.

@rolanddb (Author) commented Mar 8, 2017

Ok, I'll give that a try tomorrow.

@alexanderdean (Member)

Re-opening as @rolanddb seems to be having an ongoing issue here.

@chuwy (Contributor) commented May 15, 2017

Hello @rolanddb,

We're about to publish Scala SDK 0.2.0, but we still haven't managed to identify any case where the transformer fails to match enriched TSV data.

I see only one possible cause: you may have tried to load events enriched with a pre-R73 Snowplow (R73 was released in December 2015), which produced more columns than it produces now. If I'm wrong here, would it be possible to provide some details (error message, sample TSV line) of what breaks the transformer?

@rolanddb (Author)

Hi @chuwy,
There was some confusion about the difference between enriched and shredded data; that explains most of the mismatch between our dataset and the fields defined in the transformer.

@alexanderdean (Member)

Okay great, closing and descheduling...

@alexanderdean alexanderdean removed this from the Version 0.2.0 milestone May 16, 2017
3 participants