Apache Spark- Basic RDD Transformation

March 28, 2017

RDD transformation is to create a new dataset from an existing one. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. Let’s look at more examples.

Reference Book

Map Function

We will first create a standalone program. More info can be found here to create a standalone program.

Purpose: Apply a function to each element in the RDD and return an RDD of the result.

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]").set("spark.executor.memory", "1g");
    val sc = new SparkContext(conf)

    val input: RDD[Int] = sc.parallelize(List(1, 2, 3, 3))
    val rdd: RDD[Int] = input.map(x => x + 1)
    print("the result is : " + rdd.collect().mkString(","))

    sc.stop()
  }
}

the result is : 2,3,4,4

FlapMap Function

This time, the fragment of code will be shown from now on.

Purpose: Apply a function to each element in the RDD and return an RDD of the contents of the iterators returned. Often used to extract words.

    val input: RDD[List[Int]] = sc.parallelize(List(List(1, 2), List(6, 7)))
    val rdd: RDD[Int] = input.flatMap(sublist => sublist.map(x => x + 1))
    print("the result is : " + rdd.collect().mkString(","))

the result is : 2,3,7,8

Filter Function

Purpose: Return an RDD consisting of only elements that pass the condition passed to filter()

    val input: RDD[Int] = sc.parallelize(List(1, 2, 3, 3))
    val rdd: RDD[Int] = input.filter(x => x != 3)
    print("the result is : " + rdd.collect().mkString(","))

the result is : 1,2

Distinct Function

Purpose: Remove duplicates.

    val input: RDD[Int] = sc.parallelize(List(1, 2, 3, 3))
    val rdd: RDD[Int] = input.distinct()
    print("the result is : " + rdd.collect().mkString(","))

the result is : 2,1,3

Sample Function

Purpose: Sample a fraction of the data, with or without replacement, using a given random number generator seed.

    val input: RDD[Int] = sc.parallelize(1 to 10)
    val rdd: RDD[Int] = input.sample(withReplacement = false, 0.5)
    print("the result is : " + rdd.collect().mkString(","))