Alice in Wonderland Tutorial

First, let’s import some stuff.

import scala.io.Source
import com.twitter.scalding._
import com.twitter.scalding.ReplImplicits._
import com.twitter.scalding.ReplImplicitContext._
scala> val alice = Source.fromURL("https://raw.githubusercontent.com/mihi-tr/reading-alice/master/pg28885.txt").getLines
alice: Iterator[String] = non-empty iterator

Add the line numbers, which we might want later.

scala> val aliceLineNum = alice.zipWithIndex.toList
aliceLineNum: List[(String, Int)] = List((Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll,0), ("",1), (This eBook is for the use of anyone anywhere at no cost and with,2), (almost no restrictions whatsoever.  You may copy it, give it away or,3), (re-use it under the terms of the Project Gutenberg License included,4), (with this eBook or online at www.gutenberg.net,5), ("",6), ("",7), (Title: Alice's Adventures in Wonderland,8), ("       Illustrated by Arthur Rackham. With a Proem by Austin Dobson",9), ("",10), (Author: Lewis Carroll,11), ("",12), (Illustrator: Arthur Rackham,13), ("",14), (Release Date: May 19, 2009 [EBook #28885],15), ("",16), (Language: English,17), ("",18), ("",19), (*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND **...

Now for Scalding. TypedPipe is the main Scalding object representing your data.

scala> val alicePipe = TypedPipe.from(aliceLineNum)
alicePipe: com.twitter.scalding.typed.TypedPipe[(String, Int)] = IterablePipe(List((Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll,0), (,1), (This eBook is for the use of anyone anywhere at no cost and with,2), (almost no restrictions whatsoever.  You may copy it, give it away or,3), (re-use it under the terms of the Project Gutenberg License included,4), (with this eBook or online at www.gutenberg.net,5), (,6), (,7), (Title: Alice's Adventures in Wonderland,8), (       Illustrated by Arthur Rackham. With a Proem by Austin Dobson,9), (,10), (Author: Lewis Carroll,11), (,12), (Illustrator: Arthur Rackham,13), (,14), (Release Date: May 19, 2009 [EBook #28885],15), (,16), (Language: English,17), (,18), (,19), (*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVEN...

scala> val aliceWordList = alicePipe.map { line => line._1.split("\\s+") }
aliceWordList: com.twitter.scalding.typed.TypedPipe[Array[String]] = com.twitter.scalding.typed.TypedPipeFactory@40a6104f

Three things to notice here: map, the function passed to it, and tuple access. But line._1 is ugly, so we can use tuple pattern matching to be clearer:

scala> val aliceWordList = alicePipe.map { case (text, lineNum) =>
     |   text.split("\\s+").toList
     | }
aliceWordList: com.twitter.scalding.typed.TypedPipe[List[String]] = com.twitter.scalding.typed.TypedPipeFactory@7fa0e6fc
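
As a small illustration of the two styles, here is a sketch on a plain Scala List rather than a TypedPipe, so it runs without Scalding. The sampleLines data and the names byPosition and byPattern are made up for this example; the assumption is only that TypedPipe.map accepts the same kind of function, as the REPL session above shows.

```scala
// Hypothetical sample data standing in for aliceLineNum: (text, lineNum) pairs.
val sampleLines: List[(String, Int)] =
  List(("Down the rabbit hole", 0), ("Alice was beginning", 1))

// Positional access: works, but _1 says nothing about what the field holds.
val byPosition: List[List[String]] =
  sampleLines.map { line => line._1.split("\\s+").toList }

// Pattern matching: names the text field; _ ignores the unused line number.
val byPattern: List[List[String]] =
  sampleLines.map { case (text, _) => text.split("\\s+").toList }

// Both forms compute the same result.
assert(byPosition == byPattern)
```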

But we want words, not lists of words. We need to flatten!

scala> val aliceWords = aliceWordList.flatten
aliceWords: com.twitter.scalding.typed.TypedPipe[String] = com.twitter.scalding.typed.TypedPipeFactory@47c17842

Scala has a common function for this: map + flatten == flatMap.

scala> val aliceWords = alicePipe.flatMap { case (text, _) => text.split("\\s+").toList }
aliceWords: com.twitter.scalding.typed.TypedPipe[String] = com.twitter.scalding.typed.TypedPipeFactory@6d6f3ec6
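
The map + flatten == flatMap identity can be checked on plain Scala collections. This is a sketch with made-up sample text (sampleText, viaMapFlatten and viaFlatMap are illustrative names), not Scalding code:

```scala
// Hypothetical sample lines; any list of strings would do.
val sampleText: List[String] = List("the Queen of Hearts", "she made some tarts")

// Two steps: map each line to a list of words, then flatten the nesting.
val viaMapFlatten: List[String] = sampleText.map(_.split("\\s+").toList).flatten

// One step: flatMap does the map and the flatten together.
val viaFlatMap: List[String] = sampleText.flatMap(_.split("\\s+").toList)

assert(viaMapFlatten == viaFlatMap)
```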

Now let's add a count for each word:

scala> val aliceWithCount = aliceWords.map { word => (word, 1L) }
aliceWithCount: com.twitter.scalding.typed.TypedPipe[(String, Long)] = com.twitter.scalding.typed.TypedPipeFactory@8267443

Let's sum them for each word:

scala> val wordCount = aliceWithCount.group.sum
wordCount: com.twitter.scalding.typed.UnsortedGrouped[String,Long] = IteratorMappedReduce(scala.math.Ordering$String$@2c20bf29,com.twitter.scalding.typed.TypedPipeFactory@2880d49b,<function2>,None,List(.<init>(<console>:23)))

(We could have also used .sumByKey, which is equivalent to .group.sum.)
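
To see what grouping and summing does to the (word, 1L) pairs, here is the same idea sketched with plain Scala collections. It is only an analogy: groupBy on a List builds an in-memory Map, whereas Scalding's .group.sum runs as a distributed reduce. The sample words and the names withCount and counts are invented for the example.

```scala
// Hypothetical sample words standing in for aliceWords.
val sampleWords: List[String] = List("the", "cat", "the", "hatter", "cat", "the")

// Pair every word with a count of one, as in aliceWithCount.
val withCount: List[(String, Long)] = sampleWords.map(word => (word, 1L))

// Group the pairs by word, then sum the counts within each group.
val counts: Map[String, Long] =
  withCount
    .groupBy { case (word, _) => word }
    .map { case (word, pairs) => word -> pairs.map(_._2).sum }
```

Here counts maps "the" to 3, "cat" to 2 and "hatter" to 1.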

Let's pull the first 100 into the REPL as an iterator (REPL only):

scala> wordCount.toIterator.take(100)
res0: Iterator[(String, Long)] = non-empty iterator

Let's print just the ones with more than 100 appearances:

scala> wordCount.filter { case (word, count) => count > 100 }.dump
(,1399)
("I,120)
(Alice,224)
(I,248)
(a,678)
(all,171)
(and,793)
(as,246)
(at,209)
(be,158)
(but,105)
(for,143)
(had,177)
(her,207)
(in,405)
(it,365)
(little,122)
(not,123)
(of,604)
(on,147)
(or,140)
(out,103)
(said,422)
(she,489)
(so,107)
(that,230)
(the,1694)
(they,111)
(this,127)
(to,794)
(very,128)
(was,332)
(with,225)
(you,306)
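
The first entry in this output, (,1399), is the empty string. It most likely comes from blank lines and from lines with leading whitespace, because of how String.split treats them. The sketch below shows that behaviour on two sample strings and one possible way to drop the empty tokens; the names blank, indented and cleaned are illustrative. The same filter(_.nonEmpty) step could presumably be applied to aliceWords before counting, though the tutorial does not do so.

```scala
// A blank line splits into a single empty token.
val blank: List[String] = "".split("\\s+").toList

// Leading whitespace produces an empty token before the first word.
val indented: List[String] = "   Illustrated by".split("\\s+").toList

// Dropping empty tokens leaves only the real words.
val cleaned: List[String] = indented.filter(_.nonEmpty)

assert(blank == List(""))
assert(indented == List("", "Illustrated", "by"))
assert(cleaned == List("Illustrated", "by"))
```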

But which word is the most frequent?

Hint: In the Scala REPL, you can turn on :paste mode to make it easier to paste multi-line expressions.
scala> val top10 = { wordCount
     |       .groupAll
     |       .sortBy { case (word, count) => -count }
     |       .take(10) }
top10: com.twitter.scalding.typed.SortedGrouped[Unit,(String, Long)] = ValueSortedReduce(com.twitter.scalding.typed.TypedPipe$$anon$2@956f27e,com.twitter.scalding.typed.TypedPipeFactory@2b4a32fc,scala.math.Ordering$$anon$9@14595498,<function2>,Some(1),List(.<init>(<console>:25)))

scala> top10.dump
((),(the,1694))
((),(,1399))
((),(to,794))
((),(and,793))
((),(a,678))
((),(of,604))
((),(she,489))
((),(said,422))
((),(in,405))
((),(it,365))

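Where is Alice? And what is with the ()? The () is the single Unit key under which groupAll puts every record. Let's take the top 20 instead and keep only the values: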

scala> val top20 = { wordCount
     |       .groupAll
     |       .sortBy { case (word, count) => -count }
     |       .take(20)
     |       .values } // ignore the ()-all key
top20: com.twitter.scalding.typed.TypedPipe[(String, Long)] = com.twitter.scalding.typed.TypedPipeFactory@3a2c2b1a

scala> top20.dump
(the,1694)
(,1399)
(to,794)
(and,793)
(a,678)
(of,604)
(she,489)
(said,422)
(in,405)
(it,365)
(was,332)
(you,306)
(I,248)
(as,246)
(that,230)
(with,225)
(Alice,224)
(at,209)
(her,207)
(had,177)

There she is!
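
The sort-then-take pattern looks the same on a plain Scala List, which needs no groupAll because the whole list is already in one place. In Scalding, groupAll sends every record to a single group so that one sort can see all of them. A minimal sketch, using four counts copied from the output above (sampleCounts and top2 are illustrative names):

```scala
// A few (word, count) pairs taken from the output above.
val sampleCounts: List[(String, Long)] =
  List(("Alice", 224L), ("the", 1694L), ("to", 794L), ("said", 422L))

// Sort by descending count (negating the count reverses the order), keep the first two.
val top2: List[(String, Long)] =
  sampleCounts.sortBy { case (_, count) => -count }.take(2)

assert(top2 == List(("the", 1694L), ("to", 794L)))
```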

Now, suppose we want to know the last line on which each word appears.

How do we solve this? First, we flatMap each line into a list of (word, lineNum) pairs, one pair per word on that line.

scala> val wordLine = alicePipe.flatMap { case (text, line) =>
     |    text.split("\\s+").toList.map { word => (word, line) }
     |  }
wordLine: com.twitter.scalding.typed.TypedPipe[(String, Int)] = com.twitter.scalding.typed.TypedPipeFactory@6d08b42d

Next, we group the pairs on the word, and take the max line number for each group.

See all the functions on grouped things here: http://twitter.github.io/scalding/#com.twitter.scalding.typed.Grouped
scala> val lastLine = wordLine.group.max
lastLine: com.twitter.scalding.typed.UnsortedGrouped[String,Int] = IteratorMappedReduce(scala.math.Ordering$String$@2c20bf29,com.twitter.scalding.typed.TypedPipeFactory@641aa1ae,<function2>,None,List(.<init>(<console>:22)))
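
Grouping by word and taking the max line number can be sketched with plain Scala collections as well. As before, this is an in-memory analogy for .group.max, with invented sample pairs and illustrative names (sampleWordLine, lastLineByWord):

```scala
// Hypothetical (word, lineNum) pairs standing in for wordLine.
val sampleWordLine: List[(String, Int)] =
  List(("Alice", 3), ("rabbit", 5), ("Alice", 9), ("rabbit", 2))

// For each word, keep the largest line number it appears on.
val lastLineByWord: Map[String, Int] =
  sampleWordLine
    .groupBy { case (word, _) => word }
    .map { case (word, pairs) => word -> pairs.map(_._2).max }

assert(lastLineByWord == Map("Alice" -> 9, "rabbit" -> 5))
```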

Finally, we look up the full text of each word's last line by joining on the line number:

By the way: lastLine.swap is equivalent to lastLine.map { case (word, lastLine) => (lastLine, word) }
scala> val words = {
     |   lastLine.map { case (word, lastLine) => (lastLine, word) }
     |           .group
     |           .join(alicePipe.swap.group)
     | }
words: com.twitter.scalding.typed.CoGrouped[Int,(String, String)] = com.twitter.scalding.typed.CoGroupable$$anon$3@5d67c779

scala> println(words.toIterator.take(30).mkString("\n"))
(0,(Project,Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll))
(8,(Title:,Title: Alice's Adventures in Wonderland))
(9,(Austin,       Illustrated by Arthur Rackham. With a Proem by Austin Dobson))
(9,(Dobson,       Illustrated by Arthur Rackham. With a Proem by Austin Dobson))
(9,(Illustrated,       Illustrated by Arthur Rackham. With a Proem by Austin Dobson))
(9,(Proem,       Illustrated by Arthur Rackham. With a Proem by Austin Dobson))
(9,(Rackham.,       Illustrated by Arthur Rackham. With a Proem by Austin Dobson))
(11,(Author:,Author: Lewis Carroll))
(13,(Arthur,Illustrator: Arthur Rackham))
(13,(Illustrator:,Illustrator: Arthur Rackham))
(13,(Rackham,Illustrator: Arthur Rackham))
(15,(#28885],Release Date: May 19, 2009 [EBook #28885]))
(15,(19,,Release Date: May 19, 2009 [EBook #28885]))
(15,(2009,Release Date: May 19, 2009 [EBook #28885]))
(15,(Date:,Release Date: May 19, 2009 [EBook #28885]))
(15,(May,Release Date: May 19, 2009 [EBook #28885]))
(15,(Release,Release Date: May 19, 2009 [EBook #28885]))
(15,([EBook,Release Date: May 19, 2009 [EBook #28885]))
(17,(Language:,Language: English))
(20,(START,*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***))
(42,("Alice"],[Illustration: "Alice"]))
(46,(ALICE'S·ADVENTURES,          ALICE'S·ADVENTURES))
(47,(IN·WONDERLAND,          IN·WONDERLAND))
(48,(BY·LEWIS·CARROLL,          BY·LEWIS·CARROLL))
(49,(ILLUSTRATED·BY,          ILLUSTRATED·BY))
(50,(ARTHUR·RACKHAM,          ARTHUR·RACKHAM))
(52,(AUSTIN,          WITH A PROEM BY AUSTIN DOBSON))
(52,(DOBSON,          WITH A PROEM BY AUSTIN DOBSON))
(52,(PROEM,          WITH A PROEM BY AUSTIN DOBSON))
(54,(LONDON·WILLIAM·HEINEMANN,          LONDON·WILLIAM·HEINEMANN))
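
The join step can be pictured as a lookup by line number. The sketch below imitates it with a plain Scala Map, using a few rows copied from the output above; lastSeen, lineText, textByLine and joined are illustrative names. Like Scalding's join, it is an inner join: a line number missing from the lookup table produces no output row.

```scala
// (word, lastLineNum) pairs, as produced by the max step.
val lastSeen: List[(String, Int)] = List(("Title:", 8), ("Author:", 11))

// (text, lineNum) pairs, shaped like alicePipe.
val lineText: List[(String, Int)] = List(
  ("Title: Alice's Adventures in Wonderland", 8),
  ("Author: Lewis Carroll", 11),
  ("Language: English", 17)
)

// Swap to (lineNum, text) so the line number becomes the key.
val textByLine: Map[Int, String] = lineText.map(_.swap).toMap

// Inner join on line number: keep only words whose line is in the table.
val joined: List[(Int, (String, String))] =
  lastSeen.flatMap { case (word, n) =>
    textByLine.get(n).map(text => (n, (word, text)))
  }
```

Line 17 has no matching word in lastSeen, so it does not appear in joined.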

That's it. You have learned the basics: TypedPipe; map, flatMap and filter; and groups, which do reduces and joins: max, sum, join, take, sortBy.