Some light copy editing of the docs (#244)
* Some light copy editing of the docs

* Consistently capitalize "Frameless" in the docs
rossabaker authored and imarios committed Feb 7, 2018
1 parent fc3a832 commit 576eb67
Showing 8 changed files with 38 additions and 38 deletions.
24 changes: 12 additions & 12 deletions dataset/src/main/scala/frameless/TypedDataset.scala
@@ -127,7 +127,7 @@ class TypedDataset[T] protected[frameless](val dataset: Dataset[T])(implicit val
* plan may grow exponentially. It will be saved to files inside the checkpoint
* directory set with `SparkContext#setCheckpointDir`.
*
- * Differs from `Dataset#checkpoint` by wrapping it's result into an effect-suspending `F[_]`.
+ * Differs from `Dataset#checkpoint` by wrapping its result into an effect-suspending `F[_]`.
*
* apache/spark
*/
@@ -196,12 +196,12 @@ class TypedDataset[T] protected[frameless](val dataset: Dataset[T])(implicit val

/** Returns the number of elements in the [[TypedDataset]].
*
- * Differs from `Dataset#count` by wrapping it's result into an effect-suspending `F[_]`.
+ * Differs from `Dataset#count` by wrapping its result into an effect-suspending `F[_]`.
*/
def count[F[_]]()(implicit F: SparkDelay[F]): F[Long] =
F.delay(dataset.count)

- /** Returns `TypedColumn` of type `A` given it's name.
+ /** Returns `TypedColumn` of type `A` given its name.
*
* {{{
* tf('id)
@@ -215,7 +215,7 @@ class TypedDataset[T] protected[frameless](val dataset: Dataset[T])(implicit val
i1: TypedEncoder[A]
): TypedColumn[T, A] = col(column)

- /** Returns `TypedColumn` of type `A` given it's name.
+ /** Returns `TypedColumn` of type `A` given its name.
*
* {{{
* tf.col('id)
@@ -262,14 +262,14 @@ class TypedDataset[T] protected[frameless](val dataset: Dataset[T])(implicit val
* Running this operation requires moving all the data into the application's driver process, and
* doing so on a very large [[TypedDataset]] can crash the driver process with OutOfMemoryError.
*
- * Differs from `Dataset#collect` by wrapping it's result into an effect-suspending `F[_]`.
+ * Differs from `Dataset#collect` by wrapping its result into an effect-suspending `F[_]`.
*/
def collect[F[_]]()(implicit F: SparkDelay[F]): F[Seq[T]] =
F.delay(dataset.collect())

/** Optionally returns the first element in this [[TypedDataset]].
*
- * Differs from `Dataset#first` by wrapping it's result into an `Option` and an effect-suspending `F[_]`.
+ * Differs from `Dataset#first` by wrapping its result into an `Option` and an effect-suspending `F[_]`.
*/
def firstOption[F[_]]()(implicit F: SparkDelay[F]): F[Option[T]] =
F.delay {
@@ -285,7 +285,7 @@ class TypedDataset[T] protected[frameless](val dataset: Dataset[T])(implicit val
* Running take requires moving data into the application's driver process, and doing so with
* a very large `num` can crash the driver process with OutOfMemoryError.
*
- * Differs from `Dataset#take` by wrapping it's result into an effect-suspending `F[_]`.
+ * Differs from `Dataset#take` by wrapping its result into an effect-suspending `F[_]`.
*
* apache/spark
*/
@@ -301,7 +301,7 @@ class TypedDataset[T] protected[frameless](val dataset: Dataset[T])(implicit val
* of a wide transformation (e.g. join with different partitioners), to avoid
* recomputing the input [[TypedDataset]] should be cached first.
*
- * Differs from `Dataset#toLocalIterator()` by wrapping it's result into an effect-suspending `F[_]`.
+ * Differs from `Dataset#toLocalIterator()` by wrapping its result into an effect-suspending `F[_]`.
*
* apache/spark
*/
@@ -342,7 +342,7 @@ class TypedDataset[T] protected[frameless](val dataset: Dataset[T])(implicit val
* @param truncate Whether truncate long strings. If true, strings more than 20 characters will
* be truncated and all cells will be aligned right
*
- * Differs from `Dataset#show` by wrapping it's result into an effect-suspending `F[_]`.
+ * Differs from `Dataset#show` by wrapping its result into an effect-suspending `F[_]`.
*
* apache/spark
*/
@@ -365,14 +365,14 @@ class TypedDataset[T] protected[frameless](val dataset: Dataset[T])(implicit val

/** Runs `func` on each element of this [[TypedDataset]].
*
- * Differs from `Dataset#foreach` by wrapping it's result into an effect-suspending `F[_]`.
+ * Differs from `Dataset#foreach` by wrapping its result into an effect-suspending `F[_]`.
*/
def foreach[F[_]](func: T => Unit)(implicit F: SparkDelay[F]): F[Unit] =
F.delay(dataset.foreach(func))

/** Runs `func` on each partition of this [[TypedDataset]].
*
- * Differs from `Dataset#foreachPartition` by wrapping it's result into an effect-suspending `F[_]`.
+ * Differs from `Dataset#foreachPartition` by wrapping its result into an effect-suspending `F[_]`.
*/
def foreachPartition[F[_]](func: Iterator[T] => Unit)(implicit F: SparkDelay[F]): F[Unit] =
F.delay(dataset.foreachPartition(func))
@@ -517,7 +517,7 @@ class TypedDataset[T] protected[frameless](val dataset: Dataset[T])(implicit val

TypedEncoder[A].catalystRepr match {
case StructType(_) =>
- // if column is struct, we use all it's fields
+ // if column is struct, we use all its fields
val df = tuple1
.dataset
.selectExpr("_1.*")
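A recurring phrase in the hunks above is "wrapping its result into an effect-suspending `F[_]`": actions such as `count`, `collect` and `show` do not run a Spark job themselves, they return a suspended computation that only executes when run. Below is a minimal sketch of that behaviour, assuming an implicit `SparkSession` is in scope and that Frameless's default `Job` effect is the `F[_]` that gets resolved; the `Person` class and data are illustrative only.

```scala
import frameless.TypedDataset

case class Person(name: String, age: Int)

// Assumes `implicit val spark: SparkSession` is already defined, as in the tutorials.
val people = TypedDataset.create(Seq(Person("Ada", 36), Person("Grace", 45)))

// Nothing runs here: both values merely describe Spark jobs.
val total = people.count()
val rows  = people.collect()

// The jobs execute only when the suspended computations are run.
println(total.run())
rows.run().foreach(println)
```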
4 changes: 2 additions & 2 deletions dataset/src/main/scala/frameless/TypedDatasetForwarded.scala
@@ -354,7 +354,7 @@ trait TypedDatasetForwarded[T] { self: TypedDataset[T] =>
/** Optionally reduces the elements of this [[TypedDataset]] using the specified binary function. The given
* `func` must be commutative and associative or the result may be non-deterministic.
*
- * Differs from `Dataset#reduce` by wrapping it's result into an `Option` and an effect-suspending `F`.
+ * Differs from `Dataset#reduce` by wrapping its result into an `Option` and an effect-suspending `F`.
*/
def reduceOption[F[_]](func: (T, T) => T)(implicit F: SparkDelay[F]): F[Option[T]] =
F.delay {
@@ -365,4 +365,4 @@ trait TypedDatasetForwarded[T] { self: TypedDataset[T] =>
}
}(self.dataset.sparkSession)
}
- }
+ }
2 changes: 1 addition & 1 deletion docs/src/main/tut/Cats.md
@@ -33,7 +33,7 @@ implicit val sync: Sync[ReaderT[IO, SparkSession, ?]] = new Sync[ReaderT[IO, Spa
}
```

- There are two main parts to the `cats` integration offered by frameless:
+ There are two main parts to the `cats` integration offered by Frameless:
- effect suspension in `TypedDataset` using `cats-effect` and `cats-mtl`
- `RDD` enhancements using algebraic typeclasses in `cats-kernel`

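The second bullet above (the `cats-kernel` enhancements for `RDD`) can be illustrated with the element-wise aggregation syntax that frameless-cats adds to RDDs whose element type carries the right algebraic instances. A sketch, assuming a `SparkContext` named `sc` is in scope and that `frameless.cats.implicits._` is the import providing the syntax:

```scala
import cats.implicits._            // Monoid / Order instances for Int and tuples
import frameless.cats.implicits._  // csum / cmax / cmin syntax for RDDs
import org.apache.spark.rdd.RDD

// `sc` is assumed to be an existing SparkContext.
val scores: RDD[(Int, Int)] = sc.makeRDD(Seq((1, 2), (1, 5), (2, 3)))

val total = scores.csum   // element-wise sum via the tuple Monoid: (4, 10)
val best  = scores.cmax   // maximum via the tuple Order: (2, 3)
```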
10 changes: 5 additions & 5 deletions docs/src/main/tut/FeatureOverview.md
@@ -9,7 +9,7 @@ import org.apache.spark.sql.SparkSession
import frameless.functions.aggregate._
import frameless.TypedDataset
- val conf = new SparkConf().setMaster("local[*]").setAppName("frameless repl").set("spark.ui.enabled", "false")
+ val conf = new SparkConf().setMaster("local[*]").setAppName("Frameless repl").set("spark.ui.enabled", "false")
implicit val spark = SparkSession.builder().config(conf).appName("REPL").getOrCreate()
spark.sparkContext.setLogLevel("WARN")
@@ -71,7 +71,7 @@ This is completely type-safe, for instance suppose we misspell `city` as `citi`:
aptTypedDs.select(aptTypedDs('citi))
```

- This gets raised at compile-time, whereas with the standard `Dataset` API the error appears at run-time (enjoy the stack trace):
+ This gets raised at compile time, whereas with the standard `Dataset` API the error appears at runtime (enjoy the stack trace):

```tut:book:fail
aptDs.select('citi)
@@ -83,7 +83,7 @@ aptDs.select('citi)
aptTypedDs.select(aptTypedDs('surface) * 10, aptTypedDs('surface) + 2).show().run()
```

- Note that unlike the standard Spark API where some operations are lazy and some are not, **TypedDatasets have all operations to be lazy.**
+ Note that unlike the standard Spark API, where some operations are lazy and some are not, **all TypedDatasets operations are lazy.**
In the above example, `show()` is lazy. It requires to apply `run()` for the `show` job to materialize.
A more detailed explanation of `Job` is given [here](Job.md).

@@ -286,7 +286,7 @@ bedroomStats.collect().run().foreach(println)
### Entire TypedDataset Aggregation

We often want to aggregate the entire `TypedDataset` and skip the `groupBy()` clause.
- In `Frameless` you can do this using the `agg()` operator directly on the `TypedDataset`.
+ In Frameless you can do this using the `agg()` operator directly on the `TypedDataset`.
In the following example, we compute the average price, the average surface,
the minimum surface, and the set of cities for the entire dataset.

@@ -319,7 +319,7 @@ val cityInfo = Seq(
val citiInfoTypedDS = TypedDataset.create(cityInfo)
```

- Here is how to join the population information to the apartment's dataset.
+ Here is how to join the population information to the apartment's dataset:

```tut:book
val withCityInfo = aptTypedDs.join(citiInfoTypedDS, aptTypedDs('city), citiInfoTypedDS('name))
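The whole-dataset aggregation described a couple of hunks above can be condensed into a short sketch using the aggregate functions the tutorial imports; `Apartment` and `aptTypedDs` stand in for the tutorial's dataset, and an implicit `SparkSession` is assumed.

```scala
import frameless.TypedDataset
import frameless.functions.aggregate._

case class Apartment(city: String, surface: Int, price: Double)

val aptTypedDs = TypedDataset.create(Seq(
  Apartment("Paris", 50, 300000.0),
  Apartment("Paris", 100, 450000.0),
  Apartment("Lyon",  45, 133000.0)
))

// One aggregation over the entire dataset: no groupBy() needed.
val stats = aptTypedDs.agg(
  avg(aptTypedDs('price)),
  min(aptTypedDs('surface)),
  collectSet(aptTypedDs('city))
)

// show() is lazy; run() materializes the job.
stats.show().run()
```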
20 changes: 10 additions & 10 deletions docs/src/main/tut/TypedDataFrame.md
@@ -1,8 +1,8 @@
# Proof of Concept: TypedDataFrame

- `TypedDataFrame` is the API developed in the early stages of frameless to manipulate Spark `DataFrame`s in a type-safe manner. With the introduction of `Dataset` in Spark 1.6, `DataFrame` seems deprecated and won't be the focus of future developments of frameless. However, the design is interesting enough for being documented.
+ `TypedDataFrame` is the API developed in the early stages of Frameless to manipulate Spark `DataFrame`s in a type-safe manner. With the introduction of `Dataset` in Spark 1.6, `DataFrame` seems deprecated and won't be the focus of future development of Frameless. However, the design is interesting enough to document.

- To safely manipulate `DataFrame`s we use a technique called *shadow type*, which consists in storing additional information about a value in a "dummy" type. Mirroring value-level computation at the type-level lets us leverage the type system to catch common mistakes at compile time.
+ To safely manipulate `DataFrame`s we use a technique called a *shadow type*, which consists in storing additional information about a value in a "dummy" type. Mirroring value-level computation at the type-level lets us leverage the type system to catch common mistakes at compile time.

### Diving in

@@ -28,7 +28,7 @@ As you can see, instead of the `def filter(conditionExpr: String): DataFrame` de

### Type-level column referencing

- For Spark's `DataFrame`s, column referencing is done directly by `String`s or using the `Column` type which provides no additional type safety. `TypedDataFrame` improves on that by catching column referencing at compile type. When everything goes well, frameless select is very similar to vanilla select, except that it keeps track of the selected column types:
+ For Spark's `DataFrame`s, column referencing is done directly by `String`s or using the `Column` type which provides no additional type safety. `TypedDataFrame` improves on that by catching invalid column references at compile time. When everything goes well, Frameless select is very similar to vanilla select, except that it keeps track of the selected column types:

```scala
import frameless.TypedDataFrame
@@ -39,7 +39,7 @@ def selectIntString(tf: TypedDataFrame[Foo]): TypedDataFrame[(Int, String)] =
tf.select('i, 's)
```

- However, in case of typo, it gets coughs right away:
+ However, in case of typo, it gets caught right away:

```scala
def selectIntStringTypo(tf: TypedDataFrame[Foo]): TypedDataFrame[(Int, String)] =
@@ -48,7 +48,7 @@ def selectIntStringTypo(tf: TypedDataFrame[Foo]): TypedDataFrame[(Int, String)]

### Type-level joins

- Joins can available with two different syntaxes, the first lets you reference different columns on each `TypedDataFrame`, and ensures that their all exists and have compatible types:
+ Joins are available with two different syntaxes. The first lets you reference different columns on each `TypedDataFrame`, and ensures that they all exist and have compatible types:

```scala
case class Bar(i: Int, j: String, b: Boolean)
@@ -58,7 +58,7 @@ def join1(tf1: TypedDataFrame[Foo], tf2: TypedDataFrame[Bar])
tf1.innerJoin(tf2).on('s).and('j)
```

- The second syntax bring some convenience when the joining columns have identical names in both tables:
+ The second syntax brings some convenience when the joining columns have identical names in both tables:

```scala
def join2(tf1: TypedDataFrame[Foo], tf2: TypedDataFrame[Bar])
@@ -70,7 +70,7 @@ Further example are available in the [TypedDataFrame join tests.](../dataframe/s

### Complete example

- We now consider a complete example to see how the type system can frameless can improve not only correctness but also the readability of Spark jobs. Consider the following domain of phonebooks, city map and neighborhood:
+ We now consider a complete example to see how the Frameless types can improve not only correctness but also the readability of Spark jobs. Consider the following domain of phonebooks, city maps and neighborhoods:

```tut:silent
type Neighborhood = String
@@ -98,7 +98,7 @@ object NLPLib {
}
```

- Suppose we manage to obtain a `TypedDataFrame[PhoneBookEntry]` and a `TypedDataFrame[CityMapEntry]` public data, here is what our Spark job could look like with frameless:
+ Suppose we manage to obtain public data for a `TypedDataFrame[PhoneBookEntry]` and `TypedDataFrame[CityMapEntry]`. Here is what our Spark job could look like with Frameless:

```scala
import org.apache.spark.sql.SQLContext
@@ -130,10 +130,10 @@ def bestNeighborhood
}
```

- If you compare this version from Spark vanilla where every line is a `DataFrame`, you see how much types can improve readability. An executable version of this example is available in the [BestNeighborhood test](../dataframe/src/test/scala/BestNeighborhood.scala).
+ If you compare this version to vanilla Spark where every line is a `DataFrame`, you see how much types can improve readability. An executable version of this example is available in the [BestNeighborhood test](../dataframe/src/test/scala/BestNeighborhood.scala).

### Limitations

The main limitation of this approach comes from Scala 2.10, which limits the arity of case classes to 22. Because of the way `DataFrame` models joins, joining two tables with more than 11 fields results in a `DataFrame` which is not representable with `Schema` of type `Product`.

- In the `Dataset` API introduced in Spark 1.6, the way join are handled was rethought to return a pair of both schemas instead of a flat table, which moderates the trouble caused by case class limitations. Alternatively, since Scala 2.11, it is possible to define Tuple23 and onward. Sadly, due to the way Spark is commonly packaged in various systems, the amount Spark users having to Scala 2.11 and *not* to Spark 1.6 is essentially zero. For this reasons, further development in frameless will target Spark 1.6+, deprecating the early work on`TypedDataFrame`.
+ In the `Dataset` API introduced in Spark 1.6, the way joins are handled was rethought to return a pair of both schemas instead of a flat table, which moderates the trouble caused by case class limitations. Alternatively, since Scala 2.11, it is possible to define Tuple23 and onward. Sadly, due to the way Spark is commonly packaged in various systems, the number of Spark users having access to Scala 2.11 and *not* to Spark 1.6 is essentially zero. For this reason, further development in Frameless will target Spark 1.6+, deprecating the early work on `TypedDataFrame`.
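The observation above, that `Dataset` joins return a pair of the two schemas instead of one flat row, is what lifts the pressure on the 22-field case class limit. Here is a small sketch in plain Spark of what such a join looks like; `Foo`, `Bar` and the column names are illustrative only.

```scala
import org.apache.spark.sql.Dataset

case class Foo(i: Int, j: String)
case class Bar(j: String, b: Boolean)

// joinWith keeps the two schemas nested as a pair rather than flattening them,
// so neither side has to fit into a single 22-field case class.
def pairJoin(foos: Dataset[Foo], bars: Dataset[Bar]): Dataset[(Foo, Bar)] =
  foos.joinWith(bars, foos("j") === bars("j"))
```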
6 changes: 3 additions & 3 deletions docs/src/main/tut/TypedDatasetVsSparkDataset.md
@@ -17,8 +17,8 @@ org.apache.commons.io.FileUtils.deleteDirectory(new java.io.File("/tmp/foo/"))
```

**Goal:**
- This tutorial compares the standard Spark Datasets api with the one provided by
- frameless' TypedDataset. It shows how TypedDatsets allows for an expressive and
+ This tutorial compares the standard Spark Datasets API with the one provided by
+ Frameless' `TypedDataset`. It shows how `TypedDataset`s allow for an expressive and
type-safe api with no compromises on performance.

For this tutorial we first create a simple dataset and save it on disk as a parquet file.
@@ -109,7 +109,7 @@ Intuitively, Spark currently doesn't have a way to look inside the code we pass
closures. It only knows that they both take one argument of type `Foo`, but it has no way of knowing if
we use just one or all of `Foo`'s fields.

- The TypedDataset in frameless solves this problem. It allows for a simple and type-safe syntax
+ The `TypedDataset` in Frameless solves this problem. It allows for a simple and type-safe syntax
with a fully optimized query plan.

```tut:book
8 changes: 4 additions & 4 deletions docs/src/main/tut/TypedEncoder.md
@@ -9,7 +9,7 @@ implicit val spark = SparkSession.builder().config(conf).appName("REPL").getOrCr
System.setProperty("spark.cleaner.ttl", "300")
```

- Spark uses Reflection to derive it's `Encoder`s, which is why they can fail at run time. For example, because Spark does not supports `java.util.Date`, the following leads to an error:
+ Spark uses Reflection to derive its `Encoder`s, which is why they can fail at run time. For example, because Spark does not support `java.util.Date`, the following leads to an error:

```tut:silent
import org.apache.spark.sql.Dataset
@@ -22,9 +22,9 @@ case class DateRange(s: java.util.Date, e: java.util.Date)
val ds: Dataset[DateRange] = sqlContext.createDataset(Seq(DateRange(new java.util.Date, new java.util.Date)))
```

- As shown by the stack trace, this runtime error goes thought [ScalaReflection](https://github.com/apache/spark/blob/19cf208063f035d793d2306295a251a9af7e32f6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala) to try to derive an `Encoder` for `Dataset` schema. Beside the annoyance of not detecting this error at compile time, a more important limitation of the reflection based approach is it's inability to be extended for custom types. See this Stack Overflow question for a summary of the current situation (as of 2.0) in vanilla Spark: [How to store custom objects in a Dataset?](http://stackoverflow.com/a/39442829/2311362).
+ As shown by the stack trace, this runtime error goes through [ScalaReflection](https://github.com/apache/spark/blob/19cf208063f035d793d2306295a251a9af7e32f6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala) to try to derive an `Encoder` for `Dataset` schema. Beside the annoyance of not detecting this error at compile time, a more important limitation of the reflection-based approach is its inability to be extended for custom types. See this Stack Overflow question for a summary of the current situation (as of 2.0) in vanilla Spark: [How to store custom objects in a Dataset?](http://stackoverflow.com/a/39442829/2311362).

- Frameless introduces a new type class called `TypeEncoder` to solve these issues. `TypeEncoder`s are passed around as implicit parameters to every frameless method to ensure that the data being manipulated is `Encoder`. It uses a standard implicit resolution coupled with shapeless type class derivation mechanism to ensure every that compiling code manipulates encodable data. For example, the code `java.util.Date` example won't compile with frameless:
+ Frameless introduces a new type class called `TypeEncoder` to solve these issues. `TypeEncoder`s are passed around as implicit parameters to every Frameless method to ensure that the data being manipulated is `Encoder`. It uses a standard implicit resolution coupled with shapeless' type class derivation mechanism to ensure that all compiling code manipulates encodable data. For example, the `java.util.Date` example won't compile with Frameless:

```tut:silent
import frameless.TypedDataset
@@ -55,7 +55,7 @@ case class FooDate(i: Int, b: BarDate)
val ds: TypedDataset[FooDate] = TypedDataset.create(Seq(FooDate(1, BarDate(1.1, "s", new java.util.Date))))
```

- It should be noted that once derived, reflection based `Encoder`s and implicitly derived `TypeEncoder`s have identical performances. The derivation mechanism is different, but the objects generated to encode and decode JVM object in the Spark internal representation behave the same at run-time.
+ It should be noted that once derived, reflection-based `Encoder`s and implicitly derived `TypeEncoder`s have identical performance. The derivation mechanism is different, but the objects generated to encode and decode JVM objects in Spark's internal representation behave the same at runtime.

```tut:invisible
spark.stop()
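The extensibility gap described above, reflection-based `Encoder`s not being extensible for custom types, is usually closed in Frameless with an `Injection` into a type that already has a `TypedEncoder`. A sketch for the `java.util.Date` example, assuming the `Injection(to, from)` factory and an implicit `SparkSession` in scope:

```scala
import frameless.{Injection, TypedDataset, TypedEncoder}

// Teach Frameless to encode java.util.Date by round-tripping through Long.
implicit val dateInjection: Injection[java.util.Date, Long] =
  Injection(_.getTime, millis => new java.util.Date(millis))

case class DateRange(s: java.util.Date, e: java.util.Date)

// With the injection in scope the encoder derives, so these now compile:
val enc = TypedEncoder[DateRange]
val ds  = TypedDataset.create(Seq(DateRange(new java.util.Date, new java.util.Date)))
```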