Databricks Spark 3.0 certification
Preparation material for the Databricks Spark 3.0 certification, using the Learning Spark v2 book as a resource.
Sample datasets are taken from the Databricks Community Edition storage.
The repo covers the practical aspects of the DataFrame API, including the following:
- Subsetting DataFrames (select, filter, etc.)
- Column manipulation (casting, creating columns, manipulating existing columns, complex column types)
- String manipulation (splitting strings, regex)
- Reading/writing DataFrames (schemas; formats such as Parquet, Avro, JSON, etc.)
- Rows, Columns and Expressions
- Common DataFrame operations (filter, select, where, distinct, sort, limit)
- Wide and Narrow Transformations
- Working with dates (extraction, formatting, etc.)
- Aggregations (groupBy, orderBy, count)
- Statistical methods (avg, sum, max, min, describe, correlation, sampleBy)
- UDFs
- Combining datasets (joins, unions, broadcasting)
- Optimising and tuning (caching and persistence, repartitioning, shuffle, Catalyst optimiser: logical and optimised plans)
In addition, there is an example of an MLflow pipeline, although it is not part of the certification.