Scalding Sources
Scalding sources are how you get data into and out of your Scalding jobs. Several useful sources are built into the project, and a few more live in the scalding-commons repository. Here are some basic ones to get you started:
- To read a text file line-by-line, use `TextLine(filename)`. For every line in `filename`, this source creates a tuple with two fields:
  - `line` contains the text in the given line
  - `offset` contains the byte offset of the given line within `filename`
- To read or write a tab- or comma-separated values file, use `Tsv` or `Csv`.
  - When reading a `Tsv` or `Csv`, Scalding will choose field names based on the input file’s headers.
  - When writing a `Tsv` or `Csv`, Scalding will write out headers with the field names.
- To create a pipe from data in a Scala `Iterable`, use `IterableSource`. For example, `IterableSource(List(4,8,15,16,23,42), 'foo)` will create a pipe with a single field `'foo`. `IterableSource` is especially useful for unit testing.
- A `NullSource` is useful if you wish to create a pipe only for its side effects (e.g., printing out some debugging information). For example, although defining a pipe as `Csv("foo.csv").debug` without a sink will throw a `java.util.NoSuchElementException`, adding a write to a `NullSource` will work fine: `Csv("foo.csv").debug.write(NullSource)`.
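
The sources described above can be combined in a single job. The following is a minimal sketch, assuming the fields-based API from `com.twitter.scalding._`; the class name `SourcesExampleJob` and the argument names `input` and `output` are hypothetical and chosen only for illustration.

```scala
import com.twitter.scalding._

// Hypothetical job: the class name and the "input"/"output" argument
// names are assumptions made for this example.
class SourcesExampleJob(args: Args) extends Job(args) {

  // TextLine produces the fields 'offset and 'line for each input line.
  // Here only 'line is used: it is split into words, which are counted.
  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // IterableSource turns an in-memory collection into a pipe with field 'foo.
  // The pipe exists only for its debug output, so it is written to NullSource.
  IterableSource(List(4, 8, 15, 16, 23, 42), 'foo)
    .read
    .map('foo -> 'doubled) { x: Int => x * 2 }
    .debug
    .write(NullSource)
}
```

Writing the second pipe to `NullSource` gives it a sink, so the debugging pipe can run without producing an output file.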