Spark Cheat Sheet, Cheat Sheet of Distributed Programming and Computing

Spark RDD, Transformations on one RDD with a shuffle, Combining several RDDs and more in cheat sheet

Typology: Cheat Sheet

2020/2021

Uploaded on 04/26/2021

snehaaaa
Spark Cheat Sheet
Spark RDD
Spark operators are either lazy transformations, which turn RDDs into other RDDs, or actions,
which trigger the computation.
Import/Export
myRDD = sc.textFile(f) Read the file f into an RDD (one row per line).
myRDD.saveAsTextFile(f) Store the RDD into the file f.
myRDD = sc.parallelize(l) Transform the list l into an RDD.
Transformations on one RDD without shuffle
These functions transform RDDs into other RDDs. They are all lazy and do not
imply a shuffle.
myRDD.filter(f) Keep the rows r where f(r) is True.
myRDD.map(f) Transform each row r into the row f(r).
myRDD.flatMap(f) Transform each row r into the set of rows f(r).
myRDD.mapValues(f) Transform each row (k, v) into the row (k, f(v)).
Expects rows to be pairs!
myRDD.flatMapValues(f) Transform each row (k, v) into the set of rows
(k, v_0), ..., (k, v_n) where v_0, ..., v_n = f(v).
myRDD.keyBy(f) Transform each row r into the row (f(r), r).
myRDD.sample(replacement, fraction, seed)
Return an RDD which is a random sample of the RDD. The parameter
replacement controls whether an element can be sampled
more than once, fraction controls the expected number of
times an element appears (you should have 0 ≤ fraction ≤ 1),
and seed is optional and controls the random generator.
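
The row-level behaviour of these transformations can be sketched in plain Python. This is only a model: a list stands in for an RDD, and the helper names (map_values, flat_map, flat_map_values, key_by) are hypothetical stand-ins, not Spark calls.

```python
# Plain-Python model of the narrow transformations; a list stands in for an RDD.
rows = [("a", "x y"), ("b", "z")]

def map_values(rows, f):
    # (k, v) -> (k, f(v)); the key is left untouched
    return [(k, f(v)) for k, v in rows]

def flat_map(rows, f):
    # each row r is replaced by all the rows in f(r)
    return [out for r in rows for out in f(r)]

def flat_map_values(rows, f):
    # (k, v) -> (k, v_0), ..., (k, v_n) where v_0, ..., v_n = f(v)
    return [(k, out) for k, v in rows for out in f(v)]

def key_by(rows, f):
    # r -> (f(r), r)
    return [(f(r), r) for r in rows]

upper = map_values(rows, str.upper)
words = flat_map_values(rows, str.split)
tokens = flat_map(rows, lambda r: r[1].split())
keyed = key_by(["apple", "kiwi"], len)
```

Each helper handles one row at a time and never needs rows from another partition, which is why none of these transformations implies a shuffle.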
Transformations on one RDD with a shuffle
These functions transform RDDs into other RDDs with a shuffle. All of them except distinct
expect RDDs of pairs.
myRDD.distinct() Return an RDD with the same values but without duplicates.
myRDD.groupByKey() Group all the values associated with a key. The result is an
RDD containing pairs of a key and an iterable of all the values
associated with this key. All the data is shuffled.
myRDD.reduceByKey(f) Merge all the values associated with a key. As long as there
are two rows (k, v1) and (k, v2) in the RDD, they are replaced with
(k, f(v1, v2)), until each key is unique in the RDD. Values are first
merged inside each partition, so only one value per key and per
partition is shuffled.
myRDD.foldByKey(zero, f) Merge all the values associated with a key, starting from zero.
Let v_0, ..., v_k be the values associated with k in a partition; the function
computes f(...f(f(zero, v_0), v_1)..., v_k). Then, letting p_0, ..., p_m be the
results obtained for k in the various partitions, it computes
f(...f(p_0, p_1)..., p_m). Only one value per key and per partition is shuffled.
myRDD.aggregateByKey(zero, f, r)
Merge all the values associated with a key. Let v_0, ..., v_h be the
values associated with k in a partition; the function computes
f(...f(f(zero, v_0), v_1)..., v_h). Then, letting p_0, ..., p_m be the
results obtained for k in the various partitions, it computes
r(...r(p_0, p_1)..., p_m). When m = 0 it returns p_0. Only one
value per key and per partition is shuffled.
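
The two-phase behaviour described for aggregateByKey can be modelled in plain Python. This is a sketch under assumptions: the two partitions and the helper aggregate_by_key are hypothetical, lists stand in for partitions, and the count of "shuffled" rows only illustrates why pre-aggregation moves less data than groupByKey.

```python
# Two hypothetical partitions of an RDD of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]

def aggregate_by_key(partitions, zero, f, r):
    """Model: fold with f inside each partition, then merge the results with r."""
    shuffled = []  # rows that would cross the network
    for part in partitions:
        local = {}
        for k, v in part:
            # zero is the starting accumulator, once per key and per partition
            local[k] = f(local.get(k, zero), v)
        shuffled.extend(local.items())
    merged = {}
    for k, p in shuffled:
        # per-partition results for the same key are merged with r
        merged[k] = r(merged[k], p) if k in merged else p
    return merged, len(shuffled)

# (sum, count) per key, from which an average can be derived.
sum_count, shuffled_rows = aggregate_by_key(
    partitions,
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
total_rows = sum(len(part) for part in partitions)
```

In this model 4 rows are shuffled (one per key and per partition), whereas groupByKey would move all 5 rows. The gap grows with the number of repeated keys inside each partition.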
Actions on one RDD (without a shuffle)
Actions transform RDDs into values on the driver. They trigger the computation.
myRDD.collect() Return a list with all the values.
myRDD.count() Count the number of elements in the RDD.
myRDD.reduce(f) Combine the RDD into a single value. As long as there are two
values v1 and v2 in the RDD, they are replaced with f(v1, v2).
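myRDD.fold(zero, f) Combine the RDD into a single value. Let v_0, ..., v_k
be the values within a partition; the function computes
f(...f(f(zero, v_0), v_1)..., v_k). Then, letting p_0, ..., p_m be the
values computed in the various partitions, it returns
f(...f(f(zero, p_0), p_1)..., p_m). Since zero is used once per partition
and once more in the final merge, it should be neutral for f.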
myRDD.aggregate(zero, f, r) Combine the RDD into a single value. Let v_0, ..., v_h be the
values in a partition; the function computes f(...f(f(zero, v_0), v_1)..., v_h).
Then, letting p_0, ..., p_m be the values computed in the various
partitions, it returns r(...r(r(zero, p_0), p_1)..., p_m). When m = 0
and zero is neutral for r, this is simply p_0.
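
A plain-Python model makes the role of zero in fold and aggregate concrete. This is a sketch, not Spark code: the three partitions and the helpers fold and aggregate are hypothetical, and the model assumes the partition results are merged starting from zero once more, as described above.

```python
from functools import reduce

def fold(partitions, zero, f):
    """Model of fold: fold each partition from zero, then fold the results from zero."""
    per_partition = [reduce(f, part, zero) for part in partitions]
    return reduce(f, per_partition, zero)

def aggregate(partitions, zero, f, r):
    """Model of aggregate: f folds values inside a partition, r merges partition results."""
    per_partition = [reduce(f, part, zero) for part in partitions]
    return reduce(r, per_partition, zero)

partitions = [[1, 2], [3, 4], [5]]  # hypothetical split into 3 partitions

def add(a, b):
    return a + b

total = fold(partitions, 0, add)   # neutral zero: plain sum
skewed = fold(partitions, 1, add)  # zero counted once per partition, plus once at the end

# (sum, count) in a single pass, then the mean on the driver side.
s, n = aggregate(
    partitions,
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
mean = s / n
```

With a non-neutral zero the result depends on the number of partitions (here 15 + 3 + 1 = 19), which is why zero is normally chosen as the identity of f and r.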
Combining several RDDs
These functions combine several RDDs into one. Among these operations, only union does not
trigger a shuffle.
myRDD1.union(myRDD2) Create the RDD containing all the values present in one RDD or
the other. Duplicates are kept: a value appears as many times as its
occurrences in the two RDDs combined. In particular, if a value is
present once in each, it will be present twice.
myRDD1.join(myRDD2) Expect two RDDs of pairs and return the join on the first column
(the key) of each RDD: one row (k, (v, w)) for each (k, v) in myRDD1
and (k, w) in myRDD2. All the data is shuffled.
myRDD1.intersection(myRDD2) Return an RDD containing only the values that appear in both
myRDD1 and myRDD2, without duplicates. All the data is shuffled.
myRDD1.subtract(myRDD2) Return an RDD which is a subset of myRDD1 without all the
data also appearing in myRDD2. All the data is shuffled.
myRDD1.subtractByKey(myRDD2) Expect two RDDs of pairs. Return an RDD which is a subset
of myRDD1 keeping only the rows (k, v) such that k does not
appear in the first column of myRDD2. All the data is shuffled.
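
The differences between these operations can be checked on a small plain-Python model. This is only a sketch: lists stand in for RDDs, the sample rows are hypothetical, and the row order shown here is not something Spark guarantees.

```python
# Plain-Python model of the set-like operations; lists stand in for RDDs of pairs.
left = [("a", 1), ("a", 1), ("b", 2), ("c", 9)]
right = [("b", 2), ("c", 3)]

# union: plain concatenation, duplicates are kept
union = left + right

# intersection: rows present on both sides, without duplicates
intersection = sorted(set(left) & set(right))

# subtract: rows of left that do not appear, as whole rows, in right
right_rows = set(right)
subtract = [row for row in left if row not in right_rows]

# subtractByKey: rows of left whose key does not appear in right
right_keys = {k for k, _ in right}
subtract_by_key = [(k, v) for k, v in left if k not in right_keys]

# join: one row (k, (v, w)) for each pair of rows sharing the key k
join = [(k, (v, w)) for k, v in left for k2, w in right if k == k2]
```

Note how ("c", 9) survives subtract, because the full row differs from ("c", 3), but is dropped by subtractByKey, because the key "c" is present in the second RDD.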
Miscellaneous
myRDD.sortBy(lambda r: r) Sort the elements in the RDD.
myRDD.sortBy(f) Sort the RDD according to the value returned by f.
myRDD.persist() Ask Spark to keep the RDD cached in RAM once it has been computed.
myRDD.persist(p) Ensure that the RDD is cached according to the policy p. The
available policies declare preference regarding whether to store
into RAM or disk, or both and whether to serialize data or
not when keeping data in RAM.
myRDD.unpersist() Ask Spark to free the memory of the given RDD.
Quick SQL recall
SELECT col1, ..., colK, sum(colD), min(colE)
FROM table1 t1, ..., tableK tK
WHERE condition GROUP BY colA, colB
Condition can be:
conditionA AND conditionB
conditionA OR conditionB
NOT condition
EXISTS (SELECT * FROM ...)
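
The query shape above can be tried with Python's built-in sqlite3 module. This is a minimal sketch: the tables orders and vip, their columns and their rows are made up for illustration, and the ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, amount INTEGER);
CREATE TABLE vip (customer TEXT);
INSERT INTO orders VALUES ('ann', 10), ('ann', 30), ('bob', 5), ('cid', 7);
INSERT INTO vip VALUES ('ann'), ('cid');
""")

# SELECT ... FROM ... WHERE conditionA AND EXISTS (...) GROUP BY ...
rows = conn.execute("""
SELECT o.customer, sum(o.amount), min(o.amount)
FROM orders o
WHERE o.amount > 6
  AND EXISTS (SELECT * FROM vip v WHERE v.customer = o.customer)
GROUP BY o.customer
ORDER BY o.customer
""").fetchall()
conn.close()
```

The WHERE clause filters rows before grouping, so bob's order is discarded and the aggregates sum and min are computed once per remaining customer.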
