- Movie datasets
- Create a Google Service Account credential (JSON)
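If you prefer the command line over the Cloud Console, a sketch with the gcloud CLI looks like the following; `your-project-id` and the `logstash-bq-writer` account name are placeholders, and the BigQuery role shown is one reasonable choice, not the only one:
```sh
# Create a dedicated service account (name is a placeholder)
gcloud iam service-accounts create logstash-bq-writer

# Grant it write access to BigQuery (role choice is an assumption)
gcloud projects add-iam-policy-binding your-project-id \
  --member="serviceAccount:logstash-bq-writer@your-project-id.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

# Download the JSON key used by Logstash and Spark below
gcloud iam service-accounts keys create credential.json \
  --iam-account=logstash-bq-writer@your-project-id.iam.gserviceaccount.com
```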
- Logstash
- Logstash download & installation
- Jdbc input plugin
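A minimal sketch of the jdbc input block, assuming a MySQL source; the driver path, connection details, and query are placeholders for your own environment:
```conf
input {
  jdbc {
    jdbc_driver_library => "/path/to/mysql-connector-java-8.0.28.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/movies_db"
    jdbc_user => "your-db-user"
    jdbc_password => "your-db-password"
    statement => "SELECT * FROM movies"
    schedule => "* * * * *"  # cron syntax: poll the table every minute
  }
}
```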
- BigQuery output plugin. This plugin is not bundled with Logstash, so install it first; see the link for the installation guide (the install command and a minimal output block follow below).
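The install command, run from the Logstash installation folder:
```sh
bin/logstash-plugin install logstash-output-google_bigquery
```
And a minimal output block, assuming the service-account key created earlier; the project and dataset names are placeholders, and option names should be checked against the docs for your installed plugin version:
```conf
output {
  google_bigquery {
    project_id => "your-project-id"
    dataset => "movies_dataset"
    json_key_file => "/path/to/credential.json"
    error_directory => "/tmp/bigquery_errors"  # failed rows land here
  }
}
```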
- Command to run Logstash: go to the Logstash installation folder in a terminal, then run
```
bin\logstash -f [path-to-pipeline-configuration-file]
```
(use `bin/logstash` on Linux/macOS)
- List of Logstash input plugins
- Data Warehouse SQL (BigQuery)
Create `vendor_invoices` cube:
```sql
SELECT
  FORMAT_DATE('%Y-%m', invoice_received_date) AS invoice_received_month,
  vendor_name,
  item_payment_status,
  COUNT(DISTINCT po_number) AS po_count
FROM `bigquery-project-id.finance_dataset.purchase_invoices_2021`
GROUP BY invoice_received_month, vendor_name, item_payment_status
ORDER BY 1, 2, 3;
```
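To materialize the cube as a table rather than just run the SELECT, one option is BigQuery's `CREATE TABLE ... AS` DDL; the `vendor_invoices` target name is an assumption, and the ORDER BY is dropped because stored tables have no guaranteed row order:
```sql
CREATE OR REPLACE TABLE `bigquery-project-id.finance_dataset.vendor_invoices` AS
SELECT
  FORMAT_DATE('%Y-%m', invoice_received_date) AS invoice_received_month,
  vendor_name,
  item_payment_status,
  COUNT(DISTINCT po_number) AS po_count
FROM `bigquery-project-id.finance_dataset.purchase_invoices_2021`
GROUP BY invoice_received_month, vendor_name, item_payment_status;
```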
- Download and install Java (for Spark): Amazon Corretto. Spark currently supports Java 8 or 11; see the compatibility notes in the Spark documentation.
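After installing, a quick check that Spark will pick up a supported runtime:
```sh
java -version  # should report a Corretto build of Java 8 or 11
```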
- Some places to look for training datasets:
- Setting up Google authentication: details here
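Locally this usually means pointing the standard Application Default Credentials environment variable at the JSON key created earlier:
```sh
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credential.json"
```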
- Dataproc job submit scripts (a minimal sketch of one such ETL script follows the commands):
```sh
gcloud dataproc jobs submit pyspark spark-etl-movies-movie-details.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
gcloud dataproc jobs submit pyspark spark-etl-movies-movies.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
gcloud dataproc jobs submit pyspark spark-etl-movies-names.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
gcloud dataproc jobs submit pyspark spark-etl-movies-ratings.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
gcloud dataproc jobs submit pyspark spark-etl-movies-reviews.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
gcloud dataproc jobs submit pyspark spark-etl-movies-title-principals.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
gcloud dataproc jobs submit pyspark spark-etl-weathers.py --cluster=your-cluster-name --region=your-cluster-gcp-region --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
```
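For reference, a minimal sketch of what one of these spark-etl-*.py jobs can look like with the spark-bigquery connector passed via `--jars`; the bucket, file, column, and table names are placeholders, and `temporaryGcsBucket` must be a bucket you own:
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-etl-movies").getOrCreate()

# Extract: read the raw movie dataset from Cloud Storage (path is a placeholder)
movies = (spark.read
          .option("header", True)
          .csv("gs://your-bucket/datasets/movies.csv"))

# Transform: light cleanup as an example
movies_clean = (movies
                .dropDuplicates(["movie_id"])
                .withColumn("load_ts", F.current_timestamp()))

# Load: write to BigQuery through the connector supplied with --jars
(movies_clean.write
 .format("bigquery")
 .option("table", "your-project-id.movies_dataset.movies")
 .option("temporaryGcsBucket", "your-temp-bucket")
 .mode("overwrite")
 .save())
```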