NLP support #56
2. Introduce Kafka to avoid broadcasting huge training data
Thank you for contributing to the project @allwefantasy!
The CI fails because there is no Kafka library in the environment. Is there something I can do to fix this?
@allwefantasy thank you very much for the contribution. I will have more comments on the estimator, so would you mind splitting your PR into a transformer part and an estimator part? Also, I see that the transformer embeds Word2Vec. Have you considered chaining them in a pipeline instead? Regarding Kafka, you should be able to add it in this file:
…o support integrating TFoS in future
@thunterdb TFTextTransformer is a tool like StringIndexer in MLlib: we can use it to transform the DataFrame and feed the new DataFrame to TFTextFileEstimator, so there seems to be no need to split this into two PRs. Word2Vec is used only to compute a map from each word to its vector; we do not need the Word2Vec model's transform function.
Codecov Report
@@            Coverage Diff             @@
##           master      #56      +/-   ##
==========================================
- Coverage   82.82%   79.29%   -3.54%
==========================================
  Files          23       25       +2
  Lines        1217     1473     +256
  Branches        5        5
==========================================
+ Hits         1008     1168     +160
- Misses        209      305      +96
What changes are proposed in this pull request?
Creating a Spark MLlib Estimator API that can be integrated with TensorFlow code, with a reference implementation.
Creating a Spark MLlib Transformer that converts a text column into 2-D vectors which can be fed to a CNN/LSTM directly.
This gives a taste of how to process text in a DataFrame and use it to train an NLP model written in TensorFlow.
Also fixes issue #53.
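To make the text-to-2-D-vector step concrete, here is a minimal sketch in plain Python. It is not the code in this PR: the function `text_to_matrix`, its parameters `sequence_length` and `embedding_size`, and the toy `word_vectors` map are all hypothetical names chosen for illustration. It assumes a word-to-vector map is already available (in Spark, such a map could be collected from a fitted Word2Vec model), that unknown words map to a zero vector, and that short texts are zero-padded.

```python
from typing import Dict, List


def text_to_matrix(tokens: List[str],
                   word_vectors: Dict[str, List[float]],
                   sequence_length: int,
                   embedding_size: int) -> List[List[float]]:
    """Turn a token list into a sequence_length x embedding_size matrix.

    Assumptions (not taken from the PR): unknown words become zero
    vectors, long texts are truncated, short texts are zero-padded.
    """
    zero = [0.0] * embedding_size
    # Look up each token, keeping at most sequence_length rows.
    rows = [list(word_vectors.get(tok, zero)) for tok in tokens[:sequence_length]]
    # Pad with zero rows so every text yields the same shape.
    rows += [list(zero) for _ in range(sequence_length - len(rows))]
    return rows


# Toy word-to-vector map standing in for the one computed with Word2Vec.
word_vectors = {
    "spark": [0.1, 0.2],
    "deep": [0.3, 0.4],
    "learning": [0.5, 0.6],
}

matrix = text_to_matrix("spark deep unknown".split(), word_vectors,
                        sequence_length=4, embedding_size=2)
```

Every row of `matrix` has `embedding_size` entries and there are always `sequence_length` rows, which is the fixed shape a CNN or LSTM input layer typically expects.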
The changes consist of these components.
TFTextFileEstimator/TFTextTransformer
New shared params
How is this patch tested?