NLP support #56

Open
wants to merge 21 commits into
base: master

allwefantasy

What changes are proposed in this pull request?

Creates a Spark MLlib Estimator API that can be integrated with TensorFlow code, with a reference implementation.
Creates a Spark MLlib Transformer that converts a text column to a 2-D vector that can be fed to a CNN/LSTM directly.
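
As an illustration of what "a text column to a 2-D vector" can mean, here is a minimal pure-Python sketch. It is not the code in this PR: the function name `text_to_matrix`, the whitespace tokenization, the zero-vector padding, and the fixed `seq_len` are all assumptions made for the example.

```python
from typing import Dict, List


def text_to_matrix(text: str,
                   word_vectors: Dict[str, List[float]],
                   seq_len: int,
                   dim: int) -> List[List[float]]:
    """Turn one text value into a seq_len x dim matrix (hypothetical helper).

    Each token is replaced by its embedding; unknown tokens and padding
    positions become zero vectors, and long texts are truncated.
    """
    zero = [0.0] * dim
    tokens = text.lower().split()[:seq_len]           # naive tokenizer, truncate
    rows = [list(word_vectors.get(tok, zero)) for tok in tokens]
    rows.extend(list(zero) for _ in range(seq_len - len(rows)))  # pad to seq_len
    return rows


# Toy embeddings, invented for the example.
toy_vectors = {"spark": [1.0, 2.0], "nlp": [3.0, 4.0]}
matrix = text_to_matrix("Spark NLP rocks", toy_vectors, seq_len=4, dim=2)
```

Every row of the DataFrame would then carry a matrix of the same shape, which is the layout a CNN or LSTM input layer usually expects.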

It gives a taste of how to process text in a DataFrame and use it to train an NLP model developed with TensorFlow.
Also fixes issue #53.

The changes consist of these components:

TFTextFileEstimator/TFTextTransformer
New shared params

How is this patch tested?

  • Unit tests
  • Manual tests

@phi-dbq
Contributor

phi-dbq commented Oct 14, 2017

Thank you for contributing to the project @allwefantasy!

@allwefantasy
Author

The CI fails because there is no Kafka library in the environment. Is there something I can do to fix this?

@thunterdb
Contributor

@allwefantasy thank you very much for the contribution. I will have more comments for the estimator, so would you mind splitting your PR into a transformer part and an estimator part?

Also, I see that the transformer is embedding Word2Vec. Have you considered chaining them in a pipeline instead?
https://spark.apache.org/docs/2.1.1/ml-pipeline.html

Regarding Kafka, you should be able to add it to this file:
https://github.com/databricks/spark-deep-learning/blob/master/python/requirements.txt

allwefantasy mentioned this pull request Oct 18, 2017
@allwefantasy
Author

@thunterdb TFTextTransformer is a tool like StringIndexer in MLlib: we can use it to transform a DataFrame and feed the resulting DataFrame to TFTextFileEstimator. There seems to be no need to split this into two PRs.

Word2Vec is used only to compute a map from each word to its vector. We do not need the Word2Vec model's transform function.
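
To make the word-to-vector map concrete, here is a small sketch of how such a map could be built from `(word, vector)` pairs. It assumes the pairs look like the rows of PySpark's `Word2VecModel.getVectors()` (columns `word` and `vector`); the helper name `build_word_map` and the sample rows are hypothetical, and this is not necessarily how the PR does it.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def build_word_map(
    rows: Iterable[Tuple[str, Sequence[float]]]
) -> Dict[str, List[float]]:
    """Collect (word, vector) pairs into a plain dict (hypothetical helper).

    With PySpark, `rows` could be `model.getVectors().collect()`; here it is
    any iterable of pairs, so the sketch runs without Spark.
    """
    return {word: [float(x) for x in vector] for word, vector in rows}


# Stand-in for collected rows; the values are invented for the example.
sample_rows = [("spark", (0.1, 0.2)), ("nlp", (0.3, 0.4))]
word_map = build_word_map(sample_rows)
```

Once the map exists, looking up a word is a dictionary access, so the Word2Vec model's own transform step is not needed.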

@codecov-io

codecov-io commented Oct 18, 2017

Codecov Report

Merging #56 into master will decrease coverage by 3.53%.
The diff coverage is 62.79%.


@@            Coverage Diff             @@
##           master      #56      +/-   ##
==========================================
- Coverage   82.82%   79.29%   -3.54%     
==========================================
  Files          23       25       +2     
  Lines        1217     1473     +256     
  Branches        5        5              
==========================================
+ Hits         1008     1168     +160     
- Misses        209      305      +96
Impacted Files Coverage Δ
python/sparkdl/transformers/keras_applications.py 93.93% <100%> (+2.1%) ⬆️
python/sparkdl/transformers/utils.py 100% <100%> (ø) ⬆️
python/sparkdl/transformers/named_image.py 93.51% <100%> (ø) ⬆️
...ython/sparkdl/estimators/tf_text_file_estimator.py 48.02% <48.02%> (ø)
python/sparkdl/transformers/tf_text.py 78.26% <78.26%> (ø)
python/sparkdl/param/shared_params.py 80.88% <82.5%> (+0.67%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3f668d9...99d2b30.

4 participants