Scalding Sources
Scalding sources are how you get data into and out of your scalding jobs. There are several useful sources baked into the project and a few more in the scalding-commons repository. Here are a few basic ones to get you started:
- To read a text file line-by-line, use `TextLine(filename)`. For every line in `filename`, this source creates a tuple with two fields:
  - `line` contains the text of the given line
  - `offset` contains the byte offset of the given line within `filename`
- To read or write a tab- or comma-separated values file, use `Tsv` or `Csv`.
  - When reading a `Tsv` or `Csv`, Scalding will choose field names based on the input file’s headers.
  - When writing a `Tsv` or `Csv`, Scalding will write out headers with the field names.
- To create a pipe from data in a Scala `Iterable`, use the `IterableSource`. For example, `IterableSource(List(4,8,15,16,23,42), 'foo)` will create a pipe with a field `'foo`. `IterableSource` is especially useful for unit testing.
- A `NullSource` is useful if you wish to create a pipe for only its side effects (e.g., printing out some debugging information). For example, although defining a pipe as `Csv("foo.csv").debug` without a sink will throw a `java.util.NoSuchElementException`, adding a write to a `NullSource` will work fine: `Csv("foo.csv").debug.write(NullSource)`.
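
The sources above can be combined in ordinary jobs. What follows is a minimal sketch, assuming the fields-based API and a Scalding dependency on the classpath. The class names `WordCountJob` and `DebugNumbersJob`, the `--input` and `--output` argument names, and the `'word`, `'count`, and `'doubled` field names are hypothetical and chosen only for illustration.

```scala
import com.twitter.scalding._

// Hypothetical job: reads a text file line-by-line and writes word counts as TSV.
// The "input" and "output" argument names are assumptions made for this sketch.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                        // each tuple has 'offset and 'line
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size('count) }           // 'count holds occurrences per word
    .write(Tsv(args("output")))
}

// Hypothetical job: builds a pipe from an in-memory list, prints every tuple
// for debugging, and discards the result by writing to NullSource.
class DebugNumbersJob(args: Args) extends Job(args) {
  IterableSource(List(4, 8, 15, 16, 23, 42), 'foo)
    .map('foo -> 'doubled) { x: Int => x * 2 }
    .debug
    .write(NullSource)
}
```

The second job has no real sink, so the write to `NullSource` is what allows the pipe to be planned and run at all; the only visible effect is the output of `debug`.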