
[WIP - do not merge!] Move sparkdl utilities for conversion between numpy arrays and image schema to ImageSchema #90

Open
tomasatdatabricks wants to merge 3 commits into master from tomas/ImageSchemaUpdate2

Conversation

tomasatdatabricks
Contributor

[WIP] Preparation for moving stuff to Spark.

Moved the utilities for image schema <=> numpy array conversion into the ImageSchema code (copy-pasted from Spark 2.3).

  1. Extended the ImageSchema Scala code with support/information for all OpenCV modes.
  2. Extended the Python toNDArray and toImage utilities to work with all supported data types (a round-trip sketch follows below).
  3. [minor] The sparkdl toImage function included batch-size stripping; that now has to be a separate call.
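
For context, a minimal round-trip sketch of the two utilities as they exist in Spark 2.3's ImageSchema (before this change only uint8/BGR data is supported; the array contents below are arbitrary, and a JVM-backed SparkSession is required because ImageSchema resolves the OpenCV type constants on the JVM side):

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.image import ImageSchema  # Spark 2.3+

# ImageSchema needs an active Spark JVM to look up the OpenCV type constants.
spark = SparkSession.builder.master("local[1]").appName("image-roundtrip").getOrCreate()

# A small (height, width, nChannels) uint8 array standing in for a BGR image.
arr = np.arange(4 * 3 * 3, dtype=np.uint8).reshape(4, 3, 3)

row = ImageSchema.toImage(arr)      # numpy array -> image schema Row
back = ImageSchema.toNDArray(row)   # image schema Row -> numpy array

assert back.shape == arr.shape and back.dtype == arr.dtype
```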

@tomasatdatabricks tomasatdatabricks force-pushed the tomas/ImageSchemaUpdate2 branch 2 times, most recently from a959b18 to b533e69 Compare December 29, 2017 18:02
@MrBago MrBago (Contributor) left a comment

This looks good to me, just a few minor comments.

     buffer=image.data,
-    strides=(width * nChannels, nChannels, 1))
+    strides=(width * nChannels * itemSz, nChannels * itemSz, itemSz))
Contributor

Will numpy figure out the right strides if we don't pass it explicitly?

Contributor Author

Hmm, yeah, I would think so. The original code from the MS folks was like this, and I did not want to make more changes than necessary.
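
For reference (not part of the diff), a small check showing that numpy computes C-contiguous strides from the shape and itemsize when strides is omitted, which is why the explicit strides argument here is mostly redundant:

```python
import numpy as np

height, width, nChannels = 4, 3, 3
itemSz = np.dtype(np.uint16).itemsize
data = bytearray(height * width * nChannels * itemSz)  # e.g. 16-bit pixel data

# Strides spelled out explicitly, as in the diff above.
explicit = np.ndarray(
    shape=(height, width, nChannels),
    dtype=np.uint16,
    buffer=data,
    strides=(width * nChannels * itemSz, nChannels * itemSz, itemSz))

# Same construction with strides omitted: numpy defaults to C-contiguous strides.
implicit = np.ndarray(shape=(height, width, nChannels), dtype=np.uint16, buffer=data)

assert explicit.strides == implicit.strides
```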

@@ -152,29 +192,29 @@ def toImage(self, array, origin=""):
             "array argument should be numpy.ndarray; however, it got [%s]." % type(array))

         if array.ndim != 3:
-            raise ValueError("Invalid array shape")
+            raise ValueError("Invalid array shape %s" % str(array.shape))
Contributor

Do we want to reshape 2d arrays to be shape + (1,)?

Contributor Author

I agree with their approach. I think it's better to make the caller pass the arguments in the expected format rather than trying to auto-convert, unless the conversion is completely unambiguous.

So in this case we say images are always 3-dimensional arrays, and it's up to the user to make sure they conform to that. Otherwise they might be passing something other than what they think they are passing, and we would mask their bug until later.
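
As an illustration of the caller-side conversion being asked of users here (a sketch, not part of the PR), a 2-D grayscale array can be given an explicit single-channel axis before calling toImage:

```python
import numpy as np

gray = np.zeros((480, 640), dtype=np.uint8)   # 2-D grayscale image

# Both forms add an explicit channel axis, producing the (height, width, 1)
# layout that the 3-D check above expects.
img_a = gray[:, :, np.newaxis]
img_b = gray.reshape(gray.shape + (1,))

assert img_a.shape == img_b.shape == (480, 640, 1)
```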

"Unexpected/unsupported array data type '%s', currently only supported formats are %s" %
(str(
array.dtype), str(
self._numpyToOcvMap.keys())))
Contributor

Can we get this on fewer lines or use some variables? It looks odd.

Contributor Author

Yeah it does; I think it's autopep8 being weird here. I'll reformat that.
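
One way the reformatted raise could look, hoisting the operands into variables as suggested (a hypothetical sketch with stand-in values, not the final diff):

```python
import numpy as np

# Hypothetical stand-ins for the names quoted in the diff above.
_numpyToOcvMap = {np.dtype("uint8"): "CV_8U", np.dtype("uint16"): "CV_16U"}
array = np.zeros((2, 2, 3), dtype=np.float16)

if array.dtype not in _numpyToOcvMap:
    # Hoisting the supported formats into a variable keeps the raise readable.
    supported = list(_numpyToOcvMap.keys())
    raise ValueError(
        "Unexpected/unsupported array data type '%s', "
        "currently only supported formats are %s" % (array.dtype, supported))
```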

@@ -128,4 +128,6 @@ class DeepImageFeaturizerSuite extends FunSuite with TestSparkContext with Defau
.setOutputCol("myOutput")
testDefaultReadWrite(featurizer)
}


Contributor

Extra white space.

dataType=x.dataType(),
nptype=self._ocvToNumpyMap[x.dataType()])
for x in ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()]
return [x for x in self._ocvTypes]
Contributor

What does this do? Isn't self._ocvTypes already a list?

Contributor Author

The purpose was to return a copy of the list so that the private member cannot be modified.

@MrBago MrBago (Contributor) Dec 29, 2017

oic, I usually see myList[:] or list(myList) to make a (shallow) copy.

Contributor Author

Thanks, that's nicer :)
The members of the list are tuples, so a shallow copy suffices here.
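
A quick illustration of why a shallow copy is enough when the elements are immutable tuples (the entries below are hypothetical stand-ins):

```python
# Hypothetical private list of (name, OpenCV mode, nChannels, numpy dtype) tuples.
_ocvTypes = [("CV_8UC1", 0, 1, "uint8"), ("CV_8UC3", 16, 3, "uint8")]

def ocvTypes():
    # list(...) returns a new list object, so callers can mutate the copy
    # without touching the private member; the tuples themselves are immutable.
    return list(_ocvTypes)

types = ocvTypes()
types.append(("CV_16UC1", 2, 1, "uint16"))
assert len(_ocvTypes) == 2   # the private list is unchanged
```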

@codecov-io

codecov-io commented Dec 29, 2017

Codecov Report

Merging #90 into master will increase coverage by 1.43%.
The diff coverage is 77.46%.


@@            Coverage Diff             @@
##           master      #90      +/-   ##
==========================================
+ Coverage   82.49%   83.92%   +1.43%     
==========================================
  Files          33       33              
  Lines        1879     1866      -13     
  Branches       35       39       +4     
==========================================
+ Hits         1550     1566      +16     
+ Misses        329      300      -29
Impacted Files Coverage Δ
python/sparkdl/udf/keras_image_model.py 75.6% <0%> (+1.8%) ⬆️
...main/scala/com/databricks/sparkdl/ImageUtils.scala 90.9% <100%> (ø) ⬆️
...n/sparkdl/estimators/keras_image_file_estimator.py 74.35% <100%> (ø) ⬆️
python/sparkdl/transformers/tf_image.py 94.06% <33.33%> (-0.05%) ⬇️
python/sparkdl/param/image_params.py 81.81% <50%> (+6.14%) ⬆️
.../scala/org/apache/spark/ml/image/ImageSchema.scala 77.94% <75%> (-1.1%) ⬇️
python/sparkdl/image/imageIO.py 73.33% <81.25%> (-4.77%) ⬇️
python/sparkdl/image/image.py 78.82% <82.35%> (+40.58%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aeff9c9...ce17c43.

@sueann sueann (Collaborator) left a comment

Are we planning to merge this? It'll be difficult to maintain and resolve the differences in the copied files between DL Pipelines and Spark. It'd be much easier cognitively to merge these changes into Spark and then remove the corresponding files in sparkdl. If we need to keep these files in sparkdl until Spark 2.4 is out, it'd be safer to first get the changes merged into Spark and then copy the exact changes here; if we merge this first, it could easily get out of sync with whatever revisions get made in Spark.

@tomasatdatabricks
Contributor Author

@sueann Yes, I agree. I would merge the Spark version first and merge this one only after Spark 2.4 is released. I made the PR here mostly because this is what we need the changes for, so it can be reviewed in context, and also to run the tests.

I'll mark it WIP.

@tomasatdatabricks tomasatdatabricks changed the title Move sparkdl utilities for conversion between numpy arrays and image schema to ImageSchema [WIP - do not merge!] Move sparkdl utilities for conversion between numpy arrays and image schema to ImageSchema Jan 9, 2018
@sueann
Collaborator

sueann commented Jan 9, 2018

ah ok got it. thanks!
