- Movie datasets
- Create a Google Service Account credential (JSON)
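If you prefer the command line over the Cloud Console, a sketch with the gcloud CLI looks like the following; `your-project-id` and the `logstash-bq-writer` account name are placeholders, and the BigQuery role shown is one reasonable choice, not the only one:
```sh
# Create a dedicated service account (name is a placeholder)
gcloud iam service-accounts create logstash-bq-writer

# Grant it write access to BigQuery (role choice is an assumption)
gcloud projects add-iam-policy-binding your-project-id \
  --member="serviceAccount:logstash-bq-writer@your-project-id.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

# Download the JSON key used by Logstash and Spark below
gcloud iam service-accounts keys create credential.json \
  --iam-account=logstash-bq-writer@your-project-id.iam.gserviceaccount.com
```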
- Logstash
- Logstash download & installation
- Jdbc input plugin
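A minimal sketch of the jdbc input block, assuming a MySQL source; the driver path, connection details, and query are placeholders for your own environment:
```conf
input {
  jdbc {
    jdbc_driver_library => "/path/to/mysql-connector-java-8.0.28.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/movies_db"
    jdbc_user => "your-db-user"
    jdbc_password => "your-db-password"
    statement => "SELECT * FROM movies"
    schedule => "* * * * *"  # cron syntax: poll the table every minute
  }
}
```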
- BigQuery output plugin. This plugin is not bundled with Logstash, so install it first; see the link for the installation guide (the install command and a minimal output block follow below).
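The install command, run from the Logstash installation folder:
```sh
bin/logstash-plugin install logstash-output-google_bigquery
```
And a minimal output block, assuming the service-account key created earlier; the project and dataset names are placeholders, and option names should be checked against the docs for your installed plugin version:
```conf
output {
  google_bigquery {
    project_id => "your-project-id"
    dataset => "movies_dataset"
    json_key_file => "/path/to/credential.json"
    error_directory => "/tmp/bigquery_errors"  # failed rows land here
  }
}
```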
- Command to run Logstash: go to the Logstash installation folder in a terminal, then run
```
bin\logstash -f [path-to-pipeline-configuration-file]
```
(use `bin/logstash` on Linux/macOS)
- List of Logstash input plugins
- Data Warehouse SQL (BigQuery)
Create `vendor_invoices` cube:
```sql
SELECT
  FORMAT_DATE('%Y-%m', invoice_received_date) AS invoice_received_month,
  vendor_name,
  item_payment_status,
  COUNT(DISTINCT po_number) AS po_count
FROM `bigquery-project-id.finance_dataset.purchase_invoices_2021`
GROUP BY invoice_received_month, vendor_name, item_payment_status
ORDER BY 1, 2, 3;
```
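To materialize the cube as a table rather than just run the SELECT, one option is BigQuery's `CREATE TABLE ... AS` DDL; the `vendor_invoices` target name is an assumption, and the ORDER BY is dropped because stored tables have no guaranteed row order:
```sql
CREATE OR REPLACE TABLE `bigquery-project-id.finance_dataset.vendor_invoices` AS
SELECT
  FORMAT_DATE('%Y-%m', invoice_received_date) AS invoice_received_month,
  vendor_name,
  item_payment_status,
  COUNT(DISTINCT po_number) AS po_count
FROM `bigquery-project-id.finance_dataset.purchase_invoices_2021`
GROUP BY invoice_received_month, vendor_name, item_payment_status;
```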
- Download and install Java (for Spark): Amazon Corretto. Spark currently supports Java 8 or 11; see the compatibility notes in the Spark documentation.
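After installing, a quick check that Spark will pick up a supported runtime:
```sh
java -version  # should report a Corretto build of Java 8 or 11
```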
- Some places to look for training datasets:
- Setting up Google authentication: details here
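Locally this usually means pointing the standard Application Default Credentials environment variable at the JSON key created earlier:
```sh
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credential.json"
```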
- Dataproc job submit scripts (a minimal sketch of one such ETL script follows the commands):
```sh
gcloud dataproc jobs submit pyspark spark-etl-movies-movie-details.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
gcloud dataproc jobs submit pyspark spark-etl-movies-movies.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
gcloud dataproc jobs submit pyspark spark-etl-movies-names.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
gcloud dataproc jobs submit pyspark spark-etl-movies-ratings.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
gcloud dataproc jobs submit pyspark spark-etl-movies-reviews.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
gcloud dataproc jobs submit pyspark spark-etl-movies-title-principals.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
gcloud dataproc jobs submit pyspark spark-etl-weathers.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
```
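For reference, a minimal sketch of what one of these spark-etl-*.py jobs can look like with the spark-bigquery connector passed via `--jars`; the bucket, file, column, and table names are placeholders, and `temporaryGcsBucket` must be a bucket you own:
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-etl-movies").getOrCreate()

# Extract: read the raw movie dataset from Cloud Storage (path is a placeholder)
movies = (spark.read
          .option("header", True)
          .csv("gs://your-bucket/datasets/movies.csv"))

# Transform: light cleanup as an example
movies_clean = (movies
                .dropDuplicates(["movie_id"])
                .withColumn("load_ts", F.current_timestamp()))

# Load: write to BigQuery through the connector supplied with --jars
(movies_clean.write
 .format("bigquery")
 .option("table", "your-project-id.movies_dataset.movies")
 .option("temporaryGcsBucket", "your-temp-bucket")
 .mode("overwrite")
 .save())
```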