Databricks Spark 3.0 certification

Preparation for the Databricks Spark 3 certification, using the Learning Spark (2nd edition) book as a resource.
Sample datasets have been taken from Databricks Community Edition storage.

The repo covers practical aspects of the DataFrame API, including the following:

  • Subsetting DataFrames (select, filter, etc.)
  • Column manipulation (casting, creating columns, manipulating existing columns, complex column types)
  • String manipulation (splitting strings, regex)
  • Reading/writing DataFrames (schemas; formats: Parquet, Avro, JSON, etc.)
  • Rows, columns, and expressions
  • Common DataFrame operations (filter, select, where, distinct, sort, limit)
  • Wide and narrow transformations
  • Working with dates (extraction, formatting, etc.)
  • Aggregations (groupBy, orderBy, count)
  • Statistical methods (avg, sum, max, min, describe, correlation, sampleBy)
  • UDFs
  • Combining datasets (joins, unions, broadcasting)
  • Optimising and tuning (caching and persistence, repartitioning, shuffles, Catalyst optimiser: logical and optimised plans)

In addition, there is an example of an MLflow pipeline, although this is not part of the certification.
