# Alice in Wonderland Tutorial

First, let's import some stuff.

```scala
import scala.io.Source
import com.twitter.scalding._
import com.twitter.scalding.ReplImplicits._
import com.twitter.scalding.ReplImplicitContext._
```

```scala
scala> val alice = Source.fromURL("https://raw.githubusercontent.com/mihi-tr/reading-alice/master/pg28885.txt").getLines
alice: Iterator[String] = non-empty iterator
```

Add the line numbers, which we might want later:

```scala
scala> val aliceLineNum = alice.zipWithIndex.toList
aliceLineNum: List[(String, Int)] = List((Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll,0), ("",1), (This eBook is for the use of anyone anywhere at no cost and with,2), (almost no restrictions whatsoever. You may copy it, give it away or,3), (re-use it under the terms of the Project Gutenberg License included,4), (with this eBook or online at www.gutenberg.net,5), ("",6), ("",7), (Title: Alice's Adventures in Wonderland,8), (" Illustrated by Arthur Rackham. With a Proem by Austin Dobson",9), ("",10), (Author: Lewis Carroll,11), ("",12), (Illustrator: Arthur Rackham,13), ("",14), (Release Date: May 19, 2009 [EBook #28885],15), ("",16), (Language: English,17), ("",18), ("",19), (*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND **...
```

Now for Scalding: `TypedPipe` is the main Scalding object representing your data.

```scala
scala> val alicePipe = TypedPipe.from(aliceLineNum)
alicePipe: com.twitter.scalding.typed.TypedPipe[(String, Int)] = IterablePipe(List((Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll,0), (,1), (This eBook is for the use of anyone anywhere at no cost and with,2), (almost no restrictions whatsoever. You may copy it, give it away or,3), (re-use it under the terms of the Project Gutenberg License included,4), (with this eBook or online at www.gutenberg.net,5), (,6), (,7), (Title: Alice's Adventures in Wonderland,8), ( Illustrated by Arthur Rackham.
With a Proem by Austin Dobson,9), (,10), (Author: Lewis Carroll,11), (,12), (Illustrator: Arthur Rackham,13), (,14), (Release Date: May 19, 2009 [EBook #28885],15), (,16), (Language: English,17), (,18), (,19), (*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVEN...

scala> val aliceWordList = alicePipe.map { line => line._1.split("\\s+") }
aliceWordList: com.twitter.scalding.typed.TypedPipe[Array[String]] = com.twitter.scalding.typed.TypedPipeFactory@40a6104f
```

Three things are at work here: `map`, a function, and tuples. But `line._1` is ugly, so we can use tuple matching to be clearer:

```scala
scala> val aliceWordList = alicePipe.map { case (text, lineNum) =>
     |   text.split("\\s+").toList
     | }
aliceWordList: com.twitter.scalding.typed.TypedPipe[List[String]] = com.twitter.scalding.typed.TypedPipeFactory@7fa0e6fc
```

But we want words, not lists of words. We need to flatten!

```scala
scala> val aliceWords = aliceWordList.flatten
aliceWords: com.twitter.scalding.typed.TypedPipe[String] = com.twitter.scalding.typed.TypedPipeFactory@47c17842
```

Scala has a common function for this: map + flatten == `flatMap`.

```scala
scala> val aliceWords = alicePipe.flatMap { case (text, _) => text.split("\\s+").toList }
aliceWords: com.twitter.scalding.typed.TypedPipe[String] = com.twitter.scalding.typed.TypedPipeFactory@6d6f3ec6
```

Now let's add a count for each word:

```scala
scala> val aliceWithCount = aliceWords.map { word => (word, 1L) }
aliceWithCount: com.twitter.scalding.typed.TypedPipe[(String, Long)] = com.twitter.scalding.typed.TypedPipeFactory@8267443
```

Let's sum them for each word:

```scala
scala> val wordCount = aliceWithCount.group.sum
wordCount: com.twitter.scalding.typed.UnsortedGrouped[String,Long] = IteratorMappedReduce(scala.math.Ordering$String$@2c20bf29,com.twitter.scalding.typed.TypedPipeFactory@2880d49b,,None,List(.(:23)))
```

(We could have also used `.sumByKey`, which is equivalent to `.group.sum`.)
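To check intuition for these steps without the Scalding REPL, the same pipeline can be sketched on plain Scala collections. This is only an in-memory analogue, not Scalding code: `WordCountSketch`, `words`, and `wordCount` are hypothetical names, and `groupBy` on a `List` stands in for `.group.sum` here.

```scala
// In-memory analogue of the word count above (assumption: plain Scala
// collections, no Scalding). All names here are hypothetical.
object WordCountSketch {
  // Split each line on whitespace and flatten the results into one list of words.
  // map + flatten gives the same result as this single flatMap.
  def words(lines: List[String]): List[String] =
    lines.flatMap(_.split("\\s+").toList)

  // Pair each word with 1L, group the pairs by word, and sum the counts per group.
  def wordCount(lines: List[String]): Map[String, Long] =
    words(lines)
      .map(word => (word, 1L))
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}
```

One detail this makes easy to see: splitting an empty line with `"\\s+"` yields a single empty string, so blank lines contribute an empty "word" to the counts.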
Let's bring the first 100 word counts back into the REPL as an iterator (REPL only):

```scala
scala> wordCount.toIterator.take(100)
res0: Iterator[(String, Long)] = non-empty iterator
```

Let's print just the ones with more than 100 appearances:

```scala
scala> wordCount.filter { case (word, count) => count > 100 }.dump
(,1399)
("I,120)
(Alice,224)
(I,248)
(a,678)
(all,171)
(and,793)
(as,246)
(at,209)
(be,158)
(but,105)
(for,143)
(had,177)
(her,207)
(in,405)
(it,365)
(little,122)
(not,123)
(of,604)
(on,147)
(or,140)
(out,103)
(said,422)
(she,489)
(so,107)
(that,230)
(the,1694)
(they,111)
(this,127)
(to,794)
(very,128)
(was,332)
(with,225)
(you,306)
```

But which word is the most frequent?

> Hint: In the Scala REPL, you can turn on `:paste` mode to make it easier to paste multi-line expressions.

```scala
scala> val top10 = { wordCount
     |   .groupAll
     |   .sortBy { case (word, count) => -count }
     |   .take(10) }
top10: com.twitter.scalding.typed.SortedGrouped[Unit,(String, Long)] = ValueSortedReduce(com.twitter.scalding.typed.TypedPipe$$anon$2@956f27e,com.twitter.scalding.typed.TypedPipeFactory@2b4a32fc,scala.math.Ordering$$anon$9@14595498,,Some(1),List(.(:25)))

scala> top10.dump
((),(the,1694))
((),(,1399))
((),(to,794))
((),(and,793))
((),(a,678))
((),(of,604))
((),(she,489))
((),(said,422))
((),(in,405))
((),(it,365))
```

Where is Alice? And what is with the `()`? It is the single `Unit` key that `groupAll` puts every record under, so we can drop it with `.values`:

```scala
scala> val top20 = { wordCount
     |   .groupAll
     |   .sortBy { case (word, count) => -count }
     |   .take(20)
     |   .values } // ignore the ()-all key
top20: com.twitter.scalding.typed.TypedPipe[(String, Long)] = com.twitter.scalding.typed.TypedPipeFactory@3a2c2b1a

scala> top20.dump
(the,1694)
(,1399)
(to,794)
(and,793)
(a,678)
(of,604)
(she,489)
(said,422)
(in,405)
(it,365)
(was,332)
(you,306)
(I,248)
(as,246)
(that,230)
(with,225)
(Alice,224)
(at,209)
(her,207)
(had,177)
```

There she is!

Now, suppose we want to know the last line on which each word appears. How do we solve this?
First, we generate `(word, lineNum)` pairs by flat-mapping each line of text to a list of `(word, lineNum)` pairs.

```scala
scala> val wordLine = alicePipe.flatMap { case (text, line) =>
     |   text.split("\\s+").toList.map { word => (word, line) }
     | }
wordLine: com.twitter.scalding.typed.TypedPipe[(String, Int)] = com.twitter.scalding.typed.TypedPipeFactory@6d08b42d
```

Next, we group the pairs by word and take the max line number for each group.

> See all the functions on grouped things here:
> [http://twitter.github.io/scalding/#com.twitter.scalding.typed.Grouped](http://twitter.github.io/scalding/#com.twitter.scalding.typed.Grouped)

```scala
scala> val lastLine = wordLine.group.max
lastLine: com.twitter.scalding.typed.UnsortedGrouped[String,Int] = IteratorMappedReduce(scala.math.Ordering$String$@2c20bf29,com.twitter.scalding.typed.TypedPipeFactory@641aa1ae,,None,List(.(:22)))
```

Finally, we look up the text of each word's last line by joining on the line number against the original lines:

> By the way: `lastLine.swap` is equivalent to `lastLine.map { case (word, lastLine) => (lastLine, word) }`

```scala
scala> val words = {
     |   lastLine.map { case (word, lastLine) => (lastLine, word) }
     |     .group
     |     .join(alicePipe.swap.group)
     | }
words: com.twitter.scalding.typed.CoGrouped[Int,(String, String)] = com.twitter.scalding.typed.CoGroupable$$anon$3@5d67c779

scala> println(words.toIterator.take(30).mkString("\n"))
(0,(Project,Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll))
(8,(Title:,Title: Alice's Adventures in Wonderland))
(9,(Austin, Illustrated by Arthur Rackham. With a Proem by Austin Dobson))
(9,(Dobson, Illustrated by Arthur Rackham. With a Proem by Austin Dobson))
(9,(Illustrated, Illustrated by Arthur Rackham. With a Proem by Austin Dobson))
(9,(Proem, Illustrated by Arthur Rackham. With a Proem by Austin Dobson))
(9,(Rackham., Illustrated by Arthur Rackham.
With a Proem by Austin Dobson))
(11,(Author:,Author: Lewis Carroll))
(13,(Arthur,Illustrator: Arthur Rackham))
(13,(Illustrator:,Illustrator: Arthur Rackham))
(13,(Rackham,Illustrator: Arthur Rackham))
(15,(#28885],Release Date: May 19, 2009 [EBook #28885]))
(15,(19,,Release Date: May 19, 2009 [EBook #28885]))
(15,(2009,Release Date: May 19, 2009 [EBook #28885]))
(15,(Date:,Release Date: May 19, 2009 [EBook #28885]))
(15,(May,Release Date: May 19, 2009 [EBook #28885]))
(15,(Release,Release Date: May 19, 2009 [EBook #28885]))
(15,([EBook,Release Date: May 19, 2009 [EBook #28885]))
(17,(Language:,Language: English))
(20,(START,*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***))
(42,("Alice"],[Illustration: "Alice"]))
(46,(ALICE'S·ADVENTURES, ALICE'S·ADVENTURES))
(47,(IN·WONDERLAND, IN·WONDERLAND))
(48,(BY·LEWIS·CARROLL, BY·LEWIS·CARROLL))
(49,(ILLUSTRATED·BY, ILLUSTRATED·BY))
(50,(ARTHUR·RACKHAM, ARTHUR·RACKHAM))
(52,(AUSTIN, WITH A PROEM BY AUSTIN DOBSON))
(52,(DOBSON, WITH A PROEM BY AUSTIN DOBSON))
(52,(PROEM, WITH A PROEM BY AUSTIN DOBSON))
(54,(LONDON·WILLIAM·HEINEMANN, LONDON·WILLIAM·HEINEMANN))
```

That's it. You have learned the basics:

- `TypedPipe`
- `map`/`flatMap`/`filter`
- groups, which do reduces and joins: `max`, `sum`, `join`, `take`, `sortBy`
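As a recap, the grouped operations from this tutorial (`max`, `join`, `sortBy`, `take`) can be sketched on plain Scala collections. This is an in-memory analogue under stated assumptions, not Scalding code: `LastLineSketch` and its method names are hypothetical, a `Map` lookup stands in for the join, and the empty-string filter in `topN` is an addition that the tutorial itself does not apply.

```scala
// In-memory analogue of the last-line and top-N steps (assumption: plain
// Scala collections, no Scalding). All names here are hypothetical.
object LastLineSketch {
  // One (word, lineNum) pair per word occurrence.
  def wordLine(lines: List[(String, Int)]): List[(String, Int)] =
    lines.flatMap { case (text, lineNum) =>
      text.split("\\s+").toList.map(word => (word, lineNum))
    }

  // Highest line number per word: the in-memory counterpart of `.group.max`.
  def lastLine(lines: List[(String, Int)]): Map[String, Int] =
    wordLine(lines)
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).max) }

  // Text of each word's last line; a Map lookup by line number stands in for the join.
  def lastLineText(lines: List[(String, Int)]): Map[String, String] = {
    val textByLine: Map[Int, String] = lines.map(_.swap).toMap
    lastLine(lines).map { case (word, lineNum) => (word, textByLine(lineNum)) }
  }

  // The n most frequent words, highest count first.
  // Addition not in the tutorial: the empty "word" from blank lines is skipped.
  def topN(counts: Map[String, Long], n: Int): List[(String, Long)] =
    counts.toList
      .filter { case (word, _) => word.nonEmpty }
      .sortBy { case (_, count) => -count }
      .take(n)
}
```

Unlike the Scalding version, everything here sits in one JVM's memory, so it is only suitable for checking logic on small inputs.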