Spark 2 | Workbook Answers
1. **Ingestion** – `spark.read.json` or `textFile`.
2. **Parsing** – `withColumn` + `from_unixtime`, `regexp_extract`.
3. **Cleaning** – filter out malformed rows, `na.drop`.
4. **Enrichment** – join with a static lookup table (broadcast).
5. **Aggregation** – `groupBy(date, status).agg(count("*").as("cnt"))`.
6. **Output** – write to Parquet partitioned by `date`, **or** stream to console for debugging.
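The aggregation step's semantics can be sanity-checked without a cluster. As a rough sketch, here is what `groupBy(date, status).agg(count("*").as("cnt"))` computes, expressed in plain Python over a few hypothetical parsed log rows (the field names and values are illustrative, not from the workbook):

```python
from collections import Counter

# Hypothetical parsed log rows; field names mirror the pipeline steps above.
rows = [
    {"date": "2024-01-01", "status": 200},
    {"date": "2024-01-01", "status": 200},
    {"date": "2024-01-01", "status": 404},
    {"date": "2024-01-02", "status": 200},
]

# Plain-Python equivalent of groupBy(date, status).agg(count("*").as("cnt")):
# count occurrences of each (date, status) pair.
cnt = Counter((r["date"], r["status"]) for r in rows)

print(cnt[("2024-01-01", 200)])  # 2
```

In Spark the same grouping triggers a shuffle so that all rows sharing a key land on one executor; the `Counter` here plays the role of the per-key count after that shuffle.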
### 🎯 Your Next Step
If the workbook includes a **mini‑project** (e.g., “process a log dataset and produce a daily report”), you can outline the full pipeline:
---
```scala
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/employees.csv")
```
## 7. Putting It All Together – A Mini‑Project Blueprint
- [ ] All code compiles/runs on Spark 2.x (no 3.x‑only APIs).
- [ ] Comments are present for every non‑obvious line.
- [ ] You’ve referenced at least **one** Spark concept (lazy eval, shuffle, broadcast, etc.).
- [ ] Edge cases are discussed.
- [ ] The answer is written **in your own words** (no copy‑pasting from the internet).
```python
words = lines.flatMap(lambda line: line.split())
# optional cleaning
cleaned = words.map(lambda w: w.lower().strip('.,!?"\''))
distinct_words = cleaned.distinct()
count = distinct_words.count()
```
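Because each RDD transformation above has a direct list/set counterpart, the chain can be checked locally with no SparkContext. A plain-Python equivalent on a small hypothetical input (the two lines are made up for illustration):

```python
# Hypothetical input lines standing in for the `lines` RDD.
lines = ["Hello, world!", "hello Spark"]

# flatMap(split): one flat list of tokens from all lines.
words = [w for line in lines for w in line.split()]

# map(lower + strip punctuation): same cleaning as the RDD version.
cleaned = [w.lower().strip('.,!?"\'') for w in words]

# distinct() then count(): a set gives unique words for free.
distinct_words = set(cleaned)
count = len(distinct_words)

print(count)  # 3  ("hello", "world", "spark")
```

Note that in Spark nothing executes until `count()` is called (lazy evaluation), whereas the list comprehensions here run eagerly; the results are the same, only the execution model differs.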
```scala
val result = df
  .groupBy($"department")
  .agg(count("*").as("emp_cnt"), avg($"salary").as("avg_salary"))
  .filter($"emp_cnt" > 5)
```
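To see what the Scala snippet computes, here is a plain-Python sketch of the same groupBy/agg/filter over hypothetical `(department, salary)` rows (the departments and salaries are invented for illustration):

```python
from collections import defaultdict

# Hypothetical rows standing in for the employees CSV.
employees = [
    ("eng", 100), ("eng", 110), ("eng", 90),
    ("eng", 120), ("eng", 95), ("eng", 105),
    ("hr", 80), ("hr", 85),
]

# groupBy(department): collect salaries per department.
salaries = defaultdict(list)
for dept, salary in employees:
    salaries[dept].append(salary)

# agg(count, avg) then filter(emp_cnt > 5): keep only large departments.
result = {
    dept: {"emp_cnt": len(s), "avg_salary": sum(s) / len(s)}
    for dept, s in salaries.items()
    if len(s) > 5
}

print(result)  # only "eng" survives the emp_cnt > 5 filter
```

As with any `groupBy` in Spark, the real version shuffles rows by department before aggregating; the filter on the aggregated count (`emp_cnt > 5`) runs after that shuffle, which is why it appears last in the chain.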