Unexpected null equality #269
Also, frameless's current behavior is incompatible with Datasets:

```scala
import spark.implicits._

val a: Option[Int] = None
val b: Option[Int] = None
val data = X2(a, b) :: X2(a, a) :: Nil
val dataset = TypedDataset.create(data)
val untyped = spark.createDataset(data)
val A = dataset.col('a)
val B = dataset.col('b)

data.filter(x => x.a == x.b).toSeq            // List(X2(None,None), X2(None,None))
// the two below should be equal
dataset.filter(A === B).collect().run().toSeq // WrappedArray(X2(None,None), X2(None,None))
untyped.filter($"a" === $"b").collect().toSeq // WrappedArray()
```

Strangely, @kmate's PR #267 seems to resolve this problem within case classes. |
Sorry, not within case classes, but when the option is parameterized over a case class. |
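For comparison, vanilla Spark does expose a null-safe equality operator, `<=>` (`eqNullSafe`), whose semantics match Scala's `==` here. A sketch, assuming the same `X2` case class and `untyped` Dataset from the snippet above:

```scala
// Null-safe equality: NULL <=> NULL evaluates to true, unlike NULL = NULL,
// so both all-None rows survive the filter, matching the Scala `==` result.
untyped.filter($"a" <=> $"b").collect().toSeq
// WrappedArray(X2(None,None), X2(None,None))
```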
Hi @dszakallas, thank you for bringing this up. I remember working on this maybe a year ago. We made some changes so you can compare a column with an optional literal. For example, these would work in Frameless land:

```scala
scala> dataset.filter(A === None).collect().run
res1: Seq[X2] = WrappedArray(X2(None,None), X2(None,None))

scala> dataset.filter(A =!= None).collect().run
res2: Seq[X2] = WrappedArray()

scala> dataset.filter(A =!= Some(2)).collect().run
res3: Seq[X2] = WrappedArray(X2(None,None), X2(None,None))
```

But they would not work for Dataframes:

```scala
scala> untyped.filter($"a" =!= None).show()
java.lang.RuntimeException: Unsupported literal type class scala.None$ None
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
  at scala.util.Try.getOrElse(Try.scala:79)
  at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162)
  at org.apache.spark.sql.functions$.typedLit(functions.scala:112)
  at org.apache.spark.sql.functions$.lit(functions.scala:95)
  at org.apache.spark.sql.Column.$eq$bang$eq(Column.scala:302)
  ... 42 elided

scala> untyped.filter($"a" === null).collect()
res31: Array[X2] = Array()
```
|
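In Dataframe land, the idiomatic way to express these comparisons is `isNull`/`isNotNull` on the `Column` rather than comparing against a literal `None` or `null`. A sketch against the same `untyped` Dataset:

```scala
// Dataframe equivalent of the Frameless A === None above:
untyped.filter($"a".isNull).collect()
// Array(X2(None,None), X2(None,None))

// Dataframe equivalent of A =!= None:
untyped.filter($"a".isNotNull).collect()
// Array()
```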
I see now how this might contradict the SQL standard, but I think most Frameless users would expect |
From the user perspective I think the most natural mapping from |
If we want to retain the current, non-SQL compatible behavior, I think we should at least explicitly state in the documentation that |
Again, thank you for bringing this up. It's great that you looked into this in such detail. I personally believe that in the case where A and B are Option[Int] I am all in favor of adding this to the documentation: "If you want to achieve what vanilla Spark gives you for |
What's the behavior in joins? It would seem like this would be the most problematic case. If I am going to do an inner join on two columns with null values, then my result will contain the cross product of the rows with null. |
It breaks as well of course. This is no exception. |
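The join concern can be sketched in vanilla Spark (a hypothetical `K` case class, not part of the issue's code): under SQL semantics, rows whose key is `NULL` never match in an inner equi-join, while Scala-style `None == None` semantics would pair them up.

```scala
case class K(k: Option[Int], v: String)

val left  = spark.createDataset(Seq(K(None, "l1"), K(None, "l2"), K(Some(1), "l3")))
val right = spark.createDataset(Seq(K(None, "r1"), K(Some(1), "r2")))

// SQL semantics: NULL keys never satisfy the join predicate,
// so only the Some(1) rows match (one result row).
left.join(right, left("k") === right("k"))

// If None == None counted as a match, the two None rows on the left
// would also each pair with the None row on the right (three result rows),
// and duplicate None keys multiply out, approaching a cross product.
```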
ok, so this starts looking more like a problem. I didn't think about the implications of this for joins. We can start treating this as a bug. |
@dszakallas Were you interested in looking into fixing this? I'm poking around for an issue to work on, and this one seems pretty tractable for someone new to the project. So if you are not going to take it up, I might give it a try. @imarios I'm impressed with your open-mindedness on this one. |
@sullivan- feel free to pick up this issue. |
Thank you both for looking at this in such depth. @sullivan- feel free to send a PR for this, that would be awesome. |
thanks @imarios. I've started looking into it, but got pulled off by other commitments. I'll get back to it soon. |
In SQL, comparing `NULL` with `NULL` results in `NULL`, which is a falsy value when used in `WHERE`, so e.g. a query with `WHERE NULL = NULL` returns an empty table. However, in Scala both `null == null` and `None == None` are `true`. Consequently these two lines should not pass.
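The three-valued semantics described above can be sketched in plain Scala with an `Option[Boolean]`-valued equality (a hypothetical helper for illustration, not part of frameless or Spark):

```scala
// SQL-style equality: if either side is NULL (here: None), the result is
// NULL (None), not false. Otherwise compare the unwrapped values.
def sqlEq[A](a: Option[A], b: Option[A]): Option[Boolean] =
  for { x <- a; y <- b } yield x == y

// WHERE treats NULL as falsy: a row is kept only for Some(true).
def keepsRow(r: Option[Boolean]): Boolean = r.contains(true)

sqlEq(Option.empty[Int], Option.empty[Int])            // None: row filtered out
keepsRow(sqlEq(Option.empty[Int], Option.empty[Int]))  // false
(None: Option[Int]) == (None: Option[Int])             // true in Scala
```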