BigQuery

Introduction

This is a usage and design summary of the pulsar-io-bigquery sink.

Parameters

This is the current list of parameters.

param name	description
credentials_file_path	Path to the BigQuery JSON key file
project_id	BigQuery project ID
topic_data_set	BigQuery target topic/dataset map
	e.g. "topic1:dataset1,topic2:dataset2"
topic_table_set	BigQuery target topic/table map
	e.g. "topic1:table_tag1,topic2:table_tag2"
add_insert_timestamp	Adds a timestamp column to each inserted row
time_stamp_column_name	Name of the timestamp column; default is "sink_timestamp"
useMessageTimeDatePartitioning	Use Time Date Partitioning
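
As a rough illustration of how these parameters might be read, the sketch below pulls them out of a plain `Map<String, Object>`, which is the shape in which a sink typically receives its `configs` section. The class `BigQuerySinkSettings` and its field names are hypothetical and are not taken from the sink's source. Only the parameter names and the `sink_timestamp` default come from the list above. Treating `project_id` as mandatory and `add_insert_timestamp` as off when absent are assumptions.

```java
import java.util.Map;

// Hypothetical settings holder for illustration; not the sink's actual class.
final class BigQuerySinkSettings {
    final String credentialsFilePath;
    final String projectId;
    final boolean addInsertTimestamp;
    final String timestampColumnName;

    private BigQuerySinkSettings(String credentialsFilePath, String projectId,
                                 boolean addInsertTimestamp, String timestampColumnName) {
        this.credentialsFilePath = credentialsFilePath;
        this.projectId = projectId;
        this.addInsertTimestamp = addInsertTimestamp;
        this.timestampColumnName = timestampColumnName;
    }

    static BigQuerySinkSettings fromMap(Map<String, Object> config) {
        String credentials = required(config, "credentials_file_path");
        String project = required(config, "project_id"); // assumed mandatory
        // The sample YAML quotes the flag ("true"), so go through String.valueOf.
        boolean addTimestamp = Boolean.parseBoolean(
                String.valueOf(config.getOrDefault("add_insert_timestamp", "false")));
        // Default column name as documented in the parameter list.
        String column = String.valueOf(
                config.getOrDefault("time_stamp_column_name", "sink_timestamp"));
        return new BigQuerySinkSettings(credentials, project, addTimestamp, column);
    }

    // Fails fast when a parameter is absent or blank.
    private static String required(Map<String, Object> config, String key) {
        Object value = config.get(key);
        if (value == null || value.toString().trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required parameter: " + key);
        }
        return value.toString();
    }
}
```

Failing fast on a missing credentials path keeps a misconfigured sink from starting and then rejecting every message later.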

Design

The sink currently requires a GCP JSON credentials file to initialize. It can also route messages to different tables based on the topic maps.
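
The routing described here can be sketched as two lookups keyed by topic name. The code below is a minimal sketch, not the sink's implementation: the class `TopicRouter` and its methods are hypothetical. It assumes each map entry is split on its last `:` (so a topic name may itself contain colons) and that a topic missing from either map has no destination.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical routing helper; names are illustrative only.
final class TopicRouter {
    private final Map<String, String> datasets;
    private final Map<String, String> tables;

    TopicRouter(String topicDataSet, String topicTableSet) {
        this.datasets = parseMap(topicDataSet);
        this.tables = parseMap(topicTableSet);
    }

    // Parses "topic1:value1,topic2:value2" into a topic -> value map.
    static Map<String, String> parseMap(String spec) {
        Map<String, String> result = new HashMap<>();
        if (spec == null || spec.trim().isEmpty()) {
            return result;
        }
        for (String entry : spec.split(",")) {
            String trimmed = entry.trim();
            // Assumption: split on the last ':' so the topic part may contain colons.
            int sep = trimmed.lastIndexOf(':');
            if (sep <= 0 || sep == trimmed.length() - 1) {
                throw new IllegalArgumentException("Expected topic:value, got: " + entry);
            }
            result.put(trimmed.substring(0, sep).trim(), trimmed.substring(sep + 1).trim());
        }
        return result;
    }

    // Returns "dataset.table" for a topic, or null when the topic is unmapped.
    String destinationFor(String topic) {
        String dataset = datasets.get(topic);
        String table = tables.get(topic);
        return (dataset == null || table == null) ? null : dataset + "." + table;
    }
}
```

With the sample configuration values `bigquery-data:test1` and `bigquery-data:test_table1`, `destinationFor("bigquery-data")` yields `test1.test_table1`.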

Sample local run command

sink localrun \
--archive ./pulsar-google-nar-0.0.1.nar \
--tenant public \
--namespace default \
--name bigquery-sink \
--inputs bigquery-data \
--sinkConfigFile ~/bigquery-sink.yaml

Sample config yaml

configs:
  credentials_file_path: "/tmp/kubernetes-34c5c20a8e3e.json"
  project_id: "sample-project-170720"
  topic_data_set: "bigquery-data:test1"
  topic_table_set: "bigquery-data:test_table1"
  add_insert_timestamp: "true"
  time_stamp_column_name: "inserted_timestamp"

Schema

No schema validation is currently performed, and there is no integration with the Pulsar or BigQuery schema registry at this time.

If add_insert_timestamp is enabled, the sink adds one extra column to each row, named by time_stamp_column_name. The column holds a UTC timestamp generated in Java before the insertion request is made.
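
The timestamp step could look roughly like the following. This is a sketch under stated assumptions rather than the sink's code: the class `TimestampColumn` is hypothetical, the row is modeled as a plain `Map<String, Object>`, and writing the value as an ISO-8601 string is an assumption, since the format the sink actually uses is not documented here. A `Clock` is passed in only to make the behavior testable.

```java
import java.time.Clock;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper showing the timestamp column being added before insertion.
final class TimestampColumn {
    // Returns a copy of the row with a UTC timestamp stored under columnName.
    static Map<String, Object> withTimestamp(Map<String, Object> row,
                                             String columnName, Clock clock) {
        Map<String, Object> copy = new LinkedHashMap<>(row);
        // Instant is always UTC; toString() produces an ISO-8601 value.
        copy.put(columnName, Instant.now(clock).toString());
        return copy;
    }
}
```

In production use the caller would pass `Clock.systemUTC()` and the configured column name, for example `TimestampColumn.withTimestamp(row, "sink_timestamp", Clock.systemUTC())`.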

Error Management

TODO

aahmed-se/Pulsar-IO-Bigquery.md