cascalog.cascading.operations documentation

->Existence

(->Existence gen available-fields out-field)
Positional factory function for class cascalog.cascading.operations.Existence.

->Inner

(->Inner gen available-fields)
Positional factory function for class cascalog.cascading.operations.Inner.

->Outer

(->Outer gen available-fields)
Positional factory function for class cascalog.cascading.operations.Outer.

IAggregateBy

IAggregator

IBuffer

REDUCER-KEY

add-aggregator

(add-aggregator _ pipe)

add-buffer

(add-buffer _ pipe)

add-op

(add-op flow fn)
Accepts a generator and a function from pipe to pipe and applies
the operation to the active head pipe.

agg

(agg agg-fn in-fields out-fields)
Returns an instance of IAggregator that adds a reduce-side-only
aggregation to its supplied pipe.

aggregate-by

(aggregate-by _)

aggregate-mode

(aggregate-mode aggregators force-reduce?)
Accepts a sequence of aggregators and a boolean force-reduce? flag
and returns a keyword representing the aggregation type.

aggregator?

(aggregator? x)

assembly

macro

(assembly args & ops)

buffer

(buffer buffer-fn in-fields out-fields)

buffer?

(buffer? x)

bufferiter

(bufferiter buffer-fn in-fields out-fields)

build-join-group

macro

(build-join-group group-op pipes group-fields decl-fields join)

build-triplet

(build-triplet gen join-fields)

cascalog-join

(cascalog-join gen-seq join-fields)

co-group*

(co-group* flows group-fields & {:keys [decl-fields aggs reducers join], :or {join :inner}})

constant-substitutions

(constant-substitutions vars)
Returns a 2-vector of the form

[new variables, {map of newvars to values to substitute}]
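
The return shape can be illustrated with a small stand-alone sketch. It does not call Cascalog: `variable?` and `sketch-constant-substitutions` are hypothetical names, and the `!const…` naming scheme is an assumption made for illustration (the library generates its own variable names).

```clojure
(defn variable?
  "Hypothetical check: a string starting with ? or ! counts as a variable."
  [x]
  (and (string? x) (some? (re-find #"^[?!]" x))))

(defn sketch-constant-substitutions
  "Replaces each constant in vars with a fresh variable name and returns
  [new-vars {fresh-var constant}]."
  [vars]
  (reduce (fn [[new-vars subs] v]
            (if (variable? v)
              ;; variables pass through untouched
              [(conj new-vars v) subs]
              ;; constants get a fresh name, recorded in the substitution map
              (let [fresh (str "!const" (inc (count subs)))]
                [(conj new-vars fresh) (assoc subs fresh v)])))
          [[] {}]
          vars))

(sketch-constant-substitutions ["?a" 10 "!b" :k])
;; => [["?a" "!const1" "!b" "!const2"] {"!const1" 10, "!const2" :k}]
```

The second element of the result is what a caller would need in order to insert the constant values back into the flow under the new names.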

debug*

(debug* flow)
Prints every tuple that passes through the flow to stdout.

declared-fields

(declared-fields join-fields renames infields)
Accepts a sequence of join fields and a sequence of
field-seqs (each containing the join-fields, presumably) and returns
a full vector of unique field names, suitable for the return value
of a co-group.

defop

macro

(defop f-name & tail)
Defines a flow operation.

discard*

(discard* flow drop-fields)
Discard the supplied fields.

each

(each flow f from-fields to-fields)
Accepts a flow, a function from result fields => Cascading
Function, input fields and output fields and returns a new flow.

ensure-project

(ensure-project gen-seq)
Makes sure that the declared fields are in the proper order.

fields-to-keep

(fields-to-keep gen-seq)
We want to keep the out-field of Existence nodes and all available
fields of the Inner and Outer nodes.

filter*

(filter* flow op-var in-fields)

filter-nullable-vars

(filter-nullable-vars flow fields)
If there are any nullable variables present in the output, filter
nulls out now.

generate-join-fields

(generate-join-fields numfields numpipes)

group-by*

(group-by* flow group-fields aggs & {:keys [reducers spill-threshold sort-fields reverse? reduce-only], :or {spill-threshold 10000}})
Applies a grouping operation to the supplied generator.

hash-join*

(hash-join* flows join-fields & {:keys [join decl-fields], :or {join :inner}})
Performs a map-side join of flows on join-fields. By default
does an inner join, but callers can specify a join type using
:join keyword argument, which can be :inner, :outer, or a
Cascading Joiner implementation.

Note: full or right outer joins have odd behavior in hash joins.
      See Cascading documentation for details.

hash-join-many

(hash-join-many flow-joins decl-fields)
Takes a sequence of [pipe, join-fields, join-type] triplets along
with other hash-join arguments and performs a mixed join. Allowed
join types are :inner, :outer, and :exists. The first entry must
be of join type :inner.
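
The constraints on the triplets can be expressed as a small stand-alone check. This is only a sketch of the rules stated above, not the library's own validation; `allowed-join-types` and `valid-hash-join-triplets?` are hypothetical names, and keywords stand in for real pipes.

```clojure
(def allowed-join-types #{:inner :outer :exists})

(defn valid-hash-join-triplets?
  "Hypothetical check: every entry is a [pipe join-fields join-type] triplet
  with an allowed join type, and the first entry is an :inner join."
  [flow-joins]
  (boolean
   (and (seq flow-joins)
        (every? (fn [triplet]
                  (and (= 3 (count triplet))
                       (contains? allowed-join-types (nth triplet 2))))
                flow-joins)
        (= :inner (nth (first flow-joins) 2)))))

;; Placeholder keywords are used where real pipes would go.
(valid-hash-join-triplets? [[:pipe-a ["?id"] :inner]
                            [:pipe-b ["?id"] :outer]])
;; => true
(valid-hash-join-triplets? [[:pipe-a ["?id"] :outer]
                            [:pipe-b ["?id"] :inner]])
;; => false
```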

hash-join-with-tiny

(hash-join-with-tiny larger-flow fields1 tiny-flow fields2)

identity*

(identity* flow input output)
Mirrors the supplied set of input fields into the output fields.

in-branch

(in-branch flow f)
(in-branch flow name f)
Accepts a temporary name and a function from flow => flow and
performs the operation within a renamed branch.

insert*

(insert* flow & field-v-pairs)
Accepts a flow and alternating field/value pairs and inserts these
items into the flow.

insert-subs

(insert-subs flow sub-m)

join->joiner

(join->joiner join)
Converts the supplied joiner instance or keyword to a Cascading
Joiner.

join-fields-selector

(join-fields-selector num-fields)
Returns a selector used to pull out the groups from the join that
aren't all nil.

join-many

(join-many flow-joins decl-fields & opts)
Takes a sequence of [pipe, join-fields, join-type] triplets along
with other co-group arguments and performs a mixed join. Allowed
join types are :inner, :outer, and :exists.

join-with-larger

(join-with-larger smaller-flow fields1 larger-flow fields2 group-fields aggs & opts)

join-with-smaller

(join-with-smaller larger-flow fields1 smaller-flow fields2 & opts)

lazy-generator

(lazy-generator tmp-path [tuple :as l-seq])
Returns a cascalog generator on the supplied sequence of
tuples. `lazy-generator` serializes each item in the lazy sequence
into a sequencefile located at the supplied temporary directory and returns
a tap for the data in that directory.

It's recommended to wrap queries that use this tap with
`cascalog.cascading.io/with-fs-tmp`; for example,

  (with-fs-tmp [_ tmp-dir]
    (let [lazy-tap (lazy-generator tmp-dir lazy-seq)]
      (?<- (stdout)
           [?field1 ?field2 ... etc]
           (lazy-tap ?field1 ?field2)
           ...)))

left-hash-join-with-tiny

(left-hash-join-with-tiny larger-flow fields1 tiny-flow fields2)

left-join-with-larger

(left-join-with-larger smaller-flow fields1 larger-flow fields2 aggs & {:as opts})

left-join-with-smaller

(left-join-with-smaller larger-flow fields1 smaller-flow fields2 aggs & opts)

lift-pipes

(lift-pipes flows)

logically

(logically gen in-fields out-fields f)
Accepts a flow, input fields, output fields and a function that
accepts the same things and allows for the following features:

Any variables not prefixed with !, !! or ? are treated as constants
in the flow. This allows for (map* flow + 10 ["?a"] ["?b"]) to
work properly and clean up its fields without hassle.

Tuples with a null in any non-nullable output variable (prefixed
with ?) are removed from the flow.

Duplicate input fields are allowed. It is currently NOT allowed to
output one of the input variables. In Cascalog, this triggers an
implicit filter; this needs to be supplied at another layer.
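
The prefix rules above can be sketched as a stand-alone classifier. It does not call Cascalog, and `field-kind` is a hypothetical name. Reading `!` as nullable and `!!` as ungrounding follows Cascalog's usual variable conventions and is an assumption here; the text above only states that `?` marks non-nullable variables.

```clojure
(require '[clojure.string :as str])

(defn field-kind
  "Hypothetical classifier for an entry in a field vector."
  [field]
  (cond
    (not (string? field))         :constant      ; e.g. the 10 in (map* flow + 10 ...)
    (str/starts-with? field "!!") :ungrounding   ; must be tested before "!"
    (str/starts-with? field "!")  :nullable
    (str/starts-with? field "?")  :non-nullable
    :else                         :constant))    ; unprefixed strings are constants

(mapv field-kind ["?a" "!b" "!!c" 10 "plain"])
;; => [:non-nullable :nullable :ungrounding :constant :constant]
```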

map*

(map* flow op-var in-fields out-fields)

map->Existence

(map->Existence m__5818__auto__)
Factory function for class cascalog.cascading.operations.Existence, taking a map of keywords to field values.

map->Inner

(map->Inner m__5818__auto__)
Factory function for class cascalog.cascading.operations.Inner, taking a map of keywords to field values.

map->Outer

(map->Outer m__5818__auto__)
Factory function for class cascalog.cascading.operations.Outer, taking a map of keywords to field values.

mapcat*

(mapcat* flow op-var in-fields out-fields)

multigroup

(multigroup pairs declared-group-vars op out-fields)
Takes a sequence of [pipe, join-fields] pairs.

name-flow

(name-flow gen name)
Assigns a new name to the clojure flow.

new-pipe-name

(new-pipe-name joined-seq)

no-overlap?

(no-overlap? large small)

not-nil?

(not-nil? & xs)

parallel-agg

(parallel-agg agg-fn in-fields out-fields & {:keys [init-var present-var]})
Creates a parallel aggregation operation.

parallel-agg?

(parallel-agg? x)

rename*

(rename* flow new-fields)
(rename* flow old-fields new-fields)
Renames old-fields to new-fields.

rename-pipe

(rename-pipe gen)
(rename-pipe gen name)

replace-join-fields

(replace-join-fields join-fields join-renames fields)

sample*

(sample* flow percent)
(sample* flow percent seed)
Samples some percentage of the elements within this pipe. percent
should be between 0.00 (0%) and 1.00 (100%). You can provide a seed
to get reproducible results.
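
The effect of the seed can be illustrated on an ordinary Clojure collection. This is a sketch of seeded sampling in general, not the code path `sample*` uses; `sketch-sample` is a hypothetical name, and keeping an element when a random double falls below percent is an assumed strategy.

```clojure
(defn sketch-sample
  "Keeps each element with probability percent (0.0 to 1.0), using a seeded
  java.util.Random so the same seed gives the same selection."
  [coll percent seed]
  (let [rng (java.util.Random. (long seed))]
    ;; filterv is eager, so the random draws happen in element order
    (filterv (fn [_] (< (.nextDouble rng) percent)) coll)))

;; Same seed, same input: identical samples.
(= (sketch-sample (range 1000) 0.1 42)
   (sketch-sample (range 1000) 0.1 42))
;; => true
```

With percent 0.0 nothing is kept and with 1.0 everything is kept, since `nextDouble` returns values in [0.0, 1.0).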

select*

(select* flow keep-fields)
Remove all but the supplied fields from the given flow.

set-reducers

(set-reducers pipe reducers)
Set the number of reducers for this step in the pipe.

substitute-if

(substitute-if pred subfn aseq)
Returns [newseq {map of newvals to oldvals}]

trap*

(trap* flow trap)
Applies a trap to the current branch of the supplied flow.

union*

(union* & flows)
Merges the supplied flows and ensures uniqueness of the resulting
tuples.

unique

(unique flow)
(unique flow unique-fields & options)
Performs a unique on the input pipe by the supplied fields.

unique-aggregator

(unique-aggregator)

with-constants

(with-constants gen in-fields f)
Allows constant substitution on inputs.

with-duplicate-inputs

(with-duplicate-inputs flow from-fields f)
Accepts a flow, some fields, and a function from (flow,
unique-fields, new-fields) => flow and appropriately handles
duplicate entries inside of the fields.

The fields passed to the supplied function are guaranteed to be
unique. The new fields are passed as the third argument to the
supplied function, which may decide to call (discard* delta) if the
fields are still around.
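
The field bookkeeping described here can be sketched without Cascalog. `sketch-dedupe-fields` is a hypothetical helper and the `__dup` suffix is an invented naming scheme; the sketch shows only how a field vector with repeats could be split into unique fields and the newly introduced ones, not how values are copied in the flow.

```clojure
(defn sketch-dedupe-fields
  "Returns [unique-fields new-fields]: repeated names in from-fields are
  replaced by fresh names, which are also collected in new-fields."
  [from-fields]
  (let [[unique delta _]
        (reduce (fn [[unique delta seen] field]
                  (if (contains? seen field)
                    ;; repeat: swap in a fresh name and remember it
                    (let [fresh (str field "__dup" (inc (count delta)))]
                      [(conj unique fresh) (conj delta fresh) seen])
                    ;; first occurrence: keep as-is
                    [(conj unique field) delta (conj seen field)]))
                [[] [] #{}]
                from-fields)]
    [unique delta]))

(sketch-dedupe-fields ["?a" "?b" "?a" "?a"])
;; => [["?a" "?b" "?a__dup1" "?a__dup2"] ["?a__dup1" "?a__dup2"]]
```

In this picture the second vector plays the role of the delta that the supplied function may later discard.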

with-trap*

(with-trap* flow trap f)
Applies a trap to everything that occurs within the supplied
function of flow => flow.

write*

(write* flow sink)