An end-to-end GoodReads data pipeline for building a data lake, a data warehouse, and an analytics platform.
The pipeline consists of several modules:
EMR - I used a 3-node cluster with the following instance types:
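As a sketch of how such a cluster could be launched (the cluster name, instance type, release label, and roles below are assumptions for illustration, not taken from the source), the AWS CLI supports creating a 3-node EMR cluster with Spark installed:

```shell
# Hypothetical example: launch a 3-node EMR cluster with Spark.
# The instance type (m5.xlarge) and release label are assumptions;
# substitute the values used in your own setup.
aws emr create-cluster \
    --name "goodreads-pipeline" \
    --release-label emr-5.33.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles
```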
Finally, PySpark defaults to Python 2 on EMR. To switch to Python 3, set the environment variable…
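One common way to do this (a sketch; `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` are standard PySpark settings, but the exact interpreter path may differ per EMR image) is to export the interpreter variables before launching PySpark:

```shell
# Point PySpark's driver and executors at the Python 3 interpreter.
# The /usr/bin/python3 path is an assumption; adjust for your EMR image.
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
```

These can also be set cluster-wide via an EMR configuration classification at cluster creation time rather than per-session exports.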