Skip to content

Apache Spark Scala Interview Questions- Shyam Mallesh -

import org.apache.spark.sql.types._ val schema = StructType(Seq( StructField("name", StringType), StructField("age", IntegerType), StructField("address", StructType(Seq( StructField("city", StringType), StructField("zip", LongType) ))) ))

⚠️ coalesce(1) avoids shuffle but may cause data skew. Only safe if current partitions are small. With schema inference (slow but automatic): Apache Spark Scala Interview Questions- Shyam Mallesh

val rdd = sc.parallelize(Seq(("a",2),("a",4),("b",1),("b",3))) val avg = rdd.mapValues((_,1)) .reduceByKey((x,y) => (x._1 + y._1, x._2 + y._2)) .mapValuescase (sum, count) => sum.toDouble / count import org