NLP support #56

allwefantasy · 2017-10-13T10:38:23Z

What changes are proposed in this pull request?

Creating a Spark MLlib Estimator API that can be integrated with TensorFlow code, with a reference implementation.
Creating a Spark MLlib Transformer that converts a text column into 2-D vectors that can be fed to a CNN/LSTM directly.
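
To make the second item concrete, here is a minimal pure-Python sketch of turning one text value into a fixed-size 2-D array (one row per token) of the kind a CNN or LSTM expects. It is not the code in this PR: `text_to_matrix`, `word_vectors`, `seq_len`, `dim`, and the zero-vector policy for padding and unknown words are all assumptions made for illustration.

```python
def text_to_matrix(text, word_vectors, seq_len, dim):
    """Map a whitespace-tokenized string to a seq_len x dim list of lists.

    Unknown words and padding positions become zero vectors (an assumption
    of this sketch, not something stated in the PR).
    """
    zero = [0.0] * dim
    tokens = text.lower().split()[:seq_len]            # truncate long texts
    rows = [list(word_vectors.get(tok, zero)) for tok in tokens]
    rows += [list(zero) for _ in range(seq_len - len(rows))]  # pad short texts
    return rows


# Toy embedding table; in practice this would come from a trained model.
word_vectors = {"spark": [0.1, 0.2], "nlp": [0.3, 0.4]}
matrix = text_to_matrix("Spark NLP rocks", word_vectors, seq_len=4, dim=2)
```

Every text ends up with the same `seq_len x dim` shape, which is what allows rows of a DataFrame to be batched for a network.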

It provides a taste of how to process text in a DataFrame and use it to train an NLP model developed in TensorFlow.
Also fixes issue #53.

The changes consist of these components.

TFTextFileEstimator/TFTextTransformer
New shared params

How is this patch tested?

Unit tests
Manual tests

phi-dbq · 2017-10-14T01:13:25Z

Thank you for contributing to the project @allwefantasy!

allwefantasy · 2017-10-14T02:38:12Z

The CI fails because there is no Kafka library in the environment. Is there something I can do to fix this?

thunterdb · 2017-10-16T22:42:15Z

@allwefantasy thank you very much for the contribution. I will have more comments for the estimator, so would you mind splitting your PR into the transformer part and into the estimator?

Also, I see that the transformer is embedding Word2Vec. Have you considered chaining them in a pipeline instead?
https://spark.apache.org/docs/2.1.1/ml-pipeline.html
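
For readers unfamiliar with the suggestion, chaining could look roughly like the sketch below, which uses the standard `pyspark.ml` `Pipeline`, `Tokenizer`, and `Word2Vec` classes. The helper name `build_text_pipeline`, the column names, and the parameter values are assumptions for illustration and are not taken from this PR.

```python
def build_text_pipeline(input_col="text", vector_size=100):
    """Return an unfitted Pipeline: Tokenizer -> Word2Vec.

    Must be called with an active SparkSession, because the stages are
    backed by JVM objects.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, Word2Vec

    tokenizer = Tokenizer(inputCol=input_col, outputCol="words")
    word2vec = Word2Vec(vectorSize=vector_size, minCount=1,
                        inputCol="words", outputCol="features")
    return Pipeline(stages=[tokenizer, word2vec])


# Usage, assuming a DataFrame `df` with a string column "text":
#   model = build_text_pipeline().fit(df)
#   model.transform(df).select("features").show()
```

One thing to keep in mind with this design: `Word2VecModel.transform` averages the word vectors of a document into a single vector, so the `features` column holds one vector per row rather than a per-token sequence.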

Regarding kafka, you should be able to add it in this file:
https://github.com/databricks/spark-deep-learning/blob/master/python/requirements.txt

allwefantasy · 2017-10-18T07:17:08Z

@thunterdb TFTextTransformer is a tool like StringIndexer in MLlib: we use it to transform the DataFrame and then feed the new DataFrame to TFTextFileEstimator. There seems to be no need to split this into two PRs.

Word2Vec is used only to compute a Map from each word to its vector. We do not need the Word2Vec model's transform function.
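
A minimal sketch of building such a word-to-vector map is shown below. The helper `vectors_to_map` is hypothetical and is not the code in this PR; it only assumes rows that expose `word` and `vector` fields, which is the shape PySpark's `Word2VecModel.getVectors()` returns.

```python
def vectors_to_map(rows):
    """Build {word: [floats]} from rows exposing "word" and "vector" fields."""
    return {row["word"]: [float(x) for x in row["vector"]] for row in rows}


# With PySpark this might be called as (an assumption, not taken from the PR):
#   word_map = vectors_to_map(w2v_model.getVectors().collect())
# Plain dicts stand in for Spark Row objects here.
rows = [
    {"word": "spark", "vector": [0.1, 0.2]},
    {"word": "nlp", "vector": [0.3, 0.4]},
]
word_map = vectors_to_map(rows)
```

A plain lookup table like this keeps one vector per token, so the per-token structure of a sentence is preserved.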

codecov-io · 2017-10-18T08:07:40Z

Codecov Report

Merging #56 into master will decrease coverage by 3.53%.
The diff coverage is 62.79%.

@@            Coverage Diff             @@
##           master      #56      +/-   ##
==========================================
- Coverage   82.82%   79.29%   -3.54%     
==========================================
  Files          23       25       +2     
  Lines        1217     1473     +256     
  Branches        5        5              
==========================================
+ Hits         1008     1168     +160     
- Misses        209      305      +96

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/sparkdl/transformers/keras_applications.py | `93.93% <100%> (+2.1%)` | ⬆️ |
| python/sparkdl/transformers/utils.py | `100% <100%> (ø)` | ⬆️ |
| python/sparkdl/transformers/named_image.py | `93.51% <100%> (ø)` | ⬆️ |
| ...ython/sparkdl/estimators/tf_text_file_estimator.py | `48.02% <48.02%> (ø)` | |
| python/sparkdl/transformers/tf_text.py | `78.26% <78.26%> (ø)` | |
| python/sparkdl/param/shared_params.py | `80.88% <82.5%> (+0.67%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f668d9...99d2b30.

allwefantasy added 3 commits October 13, 2017 17:22

1. Support NLP non-distributed training

bf6f994

2. Introduce Kafka to avoid broadcasting huge training data

set test_mode to True to avoid the Kafka dependency

3c3fd2d

clean some file

e0cdad2

allwefantasy mentioned this pull request Oct 18, 2017

General DL #61

Open

allwefantasy and others added 10 commits October 18, 2017 10:25

move tensorflow map_fun to tf_text_test.py and modify the signature to support integrating TFoS in future

4e8b11e

1. Support NLP non-distributed training

15a0c40

2. Introduce Kafka to avoid broadcasting huge training data

set test_mode to True to avoid the Kafka dependency

08e61f3

clean some file

e51c508

[databricks#55] fix TFImageTransformer example in docs (databricks#58)

65a4694

move tensorflow map_fun to tf_text_test.py and modify the signature to support integrating TFoS in future

b812764

fix code style in TFTextTransformer

e277b24

make sure TFTextTransformer will pass the ./python/run-tests.sh

edd359c

fix conflict

6dc76e2

fix conflict

b2550c3

fix pickle in python 3

ddc1b7b

allwefantasy added 7 commits October 18, 2017 17:20

import sys

eeb462b

rm /tmp/mock_kafka before run test

d32381d

kafka temp directory using tempfile.mkdtemp

99ab371

fix kafka tmp file

67f5f30

remove cpickle when using python 3

d854379

pickle write file with wb mode

b0aac36

read pickle file with rb

99d2b30
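
The last few commits deal with three portability points: `cPickle` does not exist on Python 3, pickle files must be opened in binary mode, and a fixed `/tmp` path is replaced by `tempfile.mkdtemp`. The sketch below shows those three points together using only the standard library. The helper names `write_records` and `read_records`, the file name, and the sample record are hypothetical and are not taken from the PR's code.

```python
import os
import shutil
import tempfile

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle             # Python 3 has no cPickle module


def write_records(path, records):
    # pickle produces bytes, so the file must be opened in binary mode
    with open(path, "wb") as f:
        pickle.dump(records, f)


def read_records(path):
    with open(path, "rb") as f:
        return pickle.load(f)


tmp_dir = tempfile.mkdtemp()   # unique directory instead of a fixed /tmp path
try:
    path = os.path.join(tmp_dir, "records.pkl")
    write_records(path, [{"text": "hello", "label": 1}])
    loaded = read_records(path)
finally:
    shutil.rmtree(tmp_dir)     # clean up so repeated test runs start fresh
```

Using a fresh temporary directory per run also removes the need to delete leftover files before each test.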
