Exercises

  1. Work through the examples at https://sparkbyexamples.com/ in notebooks, using your Databricks Community Edition account.

  2. Go through this article to learn more about the Spark architecture.

  3. Learn the difference between a data lake and a data warehouse here.

  4. Write a non-SQL version (for example, using the DataFrame API) of the code shown to obtain the top 10 movies by average rating.

  5. (hard) Use pandas UDFs to build a new dataframe of users with a derived attribute that captures whether each user's average rating has increased or decreased over time. This can be done on a single machine in various ways, but pandas UDFs let us group by user and run the same computation on each group in a distributed way. Within each group, we could fit a linear model using the scikit-learn library and read the trend from its slope.

  6. Run the models available at http://surpriselib.com/ in a Spark environment and see if you can take advantage of distributed computation with any of them.

  7. Use user-defined functions instead of the built-in aggregate functions (such as mean and sum) to create new columns in a PySpark dataframe.
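
As a starting point for exercise 5, the per-user computation might look like the sketch below. It is only one possible approach and rests on several assumptions: the column names userId, timestamp and rating are hypothetical (MovieLens-style), and the slope comes from NumPy's polyfit rather than a scikit-learn model, purely to keep the example small. The function works on a plain pandas dataframe, so it can be checked locally before being handed to Spark through groupBy(...).applyInPandas(...), which is available in Spark 3.0 and later.

```python
import numpy as np
import pandas as pd


def rating_trend(group: pd.DataFrame) -> pd.DataFrame:
    """Return one row for a single user: the slope of rating against time."""
    group = group.sort_values("timestamp")
    if group["timestamp"].nunique() < 2:
        # Not enough distinct time points to estimate a trend.
        slope = 0.0
    else:
        t = group["timestamp"].to_numpy(dtype=float)
        t = t - t[0]  # shift so the fit is numerically better behaved
        y = group["rating"].to_numpy(dtype=float)
        # Degree-1 least-squares fit; the first coefficient is the slope.
        slope = float(np.polyfit(t, y, 1)[0])
    if slope > 0:
        trend = "increasing"
    elif slope < 0:
        trend = "decreasing"
    else:
        trend = "flat"
    return pd.DataFrame(
        {"userId": [group["userId"].iloc[0]], "slope": [slope], "trend": [trend]}
    )


def add_trend(ratings):
    """Apply rating_trend per user on a Spark dataframe (assumed column names)."""
    schema = "userId long, slope double, trend string"
    return ratings.groupBy("userId").applyInPandas(rating_trend, schema=schema)


# Local check with made-up data, no Spark session needed.
sample = pd.DataFrame(
    {
        "userId": [1, 1, 1, 2, 2, 2],
        "timestamp": [10, 20, 30, 10, 20, 30],
        "rating": [2.0, 3.0, 4.0, 5.0, 4.0, 3.0],
    }
)
local = pd.concat([rating_trend(g) for _, g in sample.groupby("userId")])
print(local)
```

Swapping the polyfit call for a scikit-learn LinearRegression fit, and deciding how to treat users with very few ratings, are left as part of the exercise.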