Scalding 0.13.1, the most convenient scalding we’ve ever released!
Scala 2.11 Support is here!
We’re now publishing scalding for scala 2.11! Get it while it’s hot!
Easier aggregation via the latest Algebird
Algebird now comes with some very powerful aggregators that make it easy to compose aggregations and apply them in a single pass.
For example, to find each customer's order with the max quantity, as well as the order with the min price, in a single pass:
val maxOp = maxBy[Order, Long](_.orderQuantity).andThenPresent(_.orderQuantity)
val minOp = minBy[Order, Long](_.orderPrice).andThenPresent(_.orderPrice)
TypedPipe.from(orders)
.groupBy(_.customerName)
.aggregate(maxOp.join(minOp))
For more examples and documentation see: Aggregation using Algebird Aggregators
And for a hands on walkthrough in the REPL, see Alice In Aggregator Land
Read-Eval-Print-Love
We’ve made some improvements that make day to day use of the REPL more convenient:
Easily switch between local and hdfs mode
#1113 Makes it easy to switch between local and hdfs mode in the REPL, without losing your session.
So you can iterate locally on some small data, and once that’s working, run a hadoop job on your real data, all from within the same REPL session. You can also sample some data down to fit into memory, then switch to local mode where you can really quickly get the answers you’re looking for.
For example:
$ ./sbt assembly
$ ./scripts/scald.rb --repl --hdfs --host <host to ssh to and launch jobs from>
scalding> useLocalMode()
scalding> def helper(x: Int) = (x * x) / 2
helper: (x: Int)Int
scalding> val dummyData = TypedPipe.from(Seq(10, 11, 12))
scalding> dummyData.map(helper).dump
50
60
72
scalding> useHdfsMode()
scalding> val realData = TypedPipe.from(MySource(“/logs/some/real/data”)
scalding> realData.map(helper).dump
Easily save TypedPipes of case classes to disk
#1129 Lets you save any TypedPipe to disk from the REPL, regardless of format, so you can load it back up again later from another session. This is useful for saving an intermediate TypedPipe[MyCaseClass] without figuring out how to map it to a TSV or some other format. This works by serializing the objects to json behind the scenes.
For example:
$ ./scripts/scald.rb --json --repl --local
scalding> import com.twitter.scalding.TypedJson
import com.twitter.scalding.TypedJson
scalding> case class Bio(text: String, language: String)
defined class Bio
scalding> case class User(id: Long, bio: Bio)
defined class User
// in a real use case, getUsers might load a few sources, do some projections + joins, and then return
// a TypedPipe[User]
scalding> def getUsers() = TypedPipe.from(Seq( User(7, Bio("hello", "en")), User(8, Bio("hola", "es")) ))
getUsers: ()com.twitter.scalding.typed.TypedPipe[User]
scalding> getUsers().filter(_.bio.language == "en").save(TypedJson("/tmp/en-users"))
res0: com.twitter.scalding.TypedPipe[User] = com.twitter.scalding.typed.TypedPipeFactory@7cccf31c
scalding> exit
$ cat /tmp/en-users
{"id":7,"bio":{"text":"hello","language":"en"}}
$ ./scripts/scald.rb --json --repl --local
scalding> import com.twitter.scalding.TypedJson
import com.twitter.scalding.TypedJson
scalding> case class Bio(text: String, language: String)
defined class Bio
scalding> case class User(id: Long, bio: Bio)
defined class User
scalding> val filteredUsers = TypedPipe.from(TypedJson[User]("/tmp/en-users"))
filteredUsers: com.twitter.scalding.typed.TypedPipe[User] = com.twitter.scalding.typed.TypedPipeFactory@44bb1922
scalding> filteredUsers.dump
User(7,Bio(hello,en))
ValuePipe.dump
#1157 Adds dump to ValuePipe, so now you can not only print the contents of TypedPipes but on ValuePipes as well (see above for examples of using dump in the REPL).
Execution Improvements
The scaladoc for Execution is complete, but some additional exposition was added to the wiki: Calling Scalding from inside your application. We added two helper methods to object Execution: Execution.failed
creates an Execution
from a Throwable
(like Future.failed
), and Execution.unit
which creates a successful Execution[Unit]
, which is handy in some branching loops.
Bugfixes
The final bugs were finally removed from scalding*. Including #1190, a bug that effected the hashCode for Args instances and issue #1184 that made Stats unreliable for some users.
*some humor is used in scalding notes.
See CHANGES.md for a full change log.
Thanks to @avibryant, @danielhfrank, @DanielleSucher, @miguno, and the rest of the algebird contributors for the new aggregations, as well as all the scalding contributors