@iklobato
Last active August 30, 2024 12:54
This script moves data from BigQuery between different Google Cloud accounts
#!/bin/bash
# Enable bash command tracing when a -v flag is passed on the command line
if [[ "$*" == *-v* ]]; then
  set -x
fi

source_project_id="${SOURCE_PROJECT_ID}"
destination_project_id="${DESTINATION_PROJECT_ID}"
dataset_name="${DATASET_NAME}"
# Export the table schema (fields only) from the source project to <table_name>.json
function export_schema_to_json() {
  local table_name=$1
  bq show --format=json "${source_project_id}:${dataset_name}.${table_name}" | jq .schema.fields >"${table_name}.json"
  echo "Schema exported to ${table_name}.json"
}
# Load the gzipped CSV exports from the -ax staging bucket into the destination table,
# using the schema file produced by export_schema_to_json
function load_data_from_csv() {
  local table_name=$1
  local bucket_name="${table_name}-bkp-ax"
  bq load --source_format=CSV --skip_leading_rows=1 "${destination_project_id}:${dataset_name}.${table_name}" "gs://${bucket_name}/*" "./${table_name}.json"
  echo "Data loaded to ${destination_project_id}:${dataset_name}.${table_name}"
}
# Copy the contents of a bucket into its -ax counterpart using parallel, multi-threaded transfers
function move_between_buckets() {
  local bucket_name=$1
  gsutil -o "GSUtil:parallel_composite_upload_threshold=150M" \
    -o "GSUtil:parallel_thread_count=10" \
    -o "GSUtil:check_hashes=if_fast_else_skip" \
    -h "Content-Encoding:gzip" \
    -o "GSUtil:sliced_object_download=true" \
    -o "GSUtil:sliced_object_upload=true" \
    -m cp -r "gs://${bucket_name}" "gs://${bucket_name}-ax"
}
# Create a bucket in the given project if it does not already exist
function check_and_create_bucket() {
  local project_id=$1
  local bucket_name=$2
  if gsutil ls -b "gs://${bucket_name}" &>/dev/null; then
    echo "Bucket gs://${bucket_name} already exists."
  else
    gsutil mb -p "${project_id}" "gs://${bucket_name}"
  fi
}
# Export a table from the source project as gzipped CSV shards into a staging bucket,
# then copy that bucket into its -ax counterpart
function export_table_to_bucket() {
  local table_name=$1
  local bucket_name="${table_name}-bkp"
  # check_and_create_bucket "${source_project_id}" "${bucket_name}"
  check_and_create_bucket "${destination_project_id}" "${bucket_name}"
  check_and_create_bucket "${destination_project_id}" "${bucket_name}-ax"
  bq extract --compression GZIP "${source_project_id}:${dataset_name}.${table_name}" "gs://${bucket_name}/x-*"
  move_between_buckets "${bucket_name}"
  echo "Data migrated from ${source_project_id}:${dataset_name}.${table_name} to gs://${bucket_name}-ax"
}
file_path=$1
if [ ! -f "$file_path" ]; then
  echo "Error: File '$file_path' not found."
  exit 1
fi

# Migrate every table listed (one name per line) in the input file
while IFS= read -r table_name; do
  (
    echo "Table: $table_name"
    echo "Source: ${source_project_id}:${dataset_name}.${table_name}"
    echo "Destination: ${destination_project_id}:${dataset_name}.${table_name}"
    echo ""
  )
  export_table_to_bucket "$table_name"
  export_schema_to_json "$table_name"
  load_data_from_csv "$table_name"
done <"$file_path"
iklobato commented Jul 27, 2023

Data Migration Script

This Bash script migrates data from a Google BigQuery dataset in one project to the same dataset in a different project, using GCS buckets as a staging area. It extracts the data from each BigQuery table, compresses it using GZIP, and copies it to a staging bucket in the destination project. The script also exports the schema of each table to a separate JSON file.
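
For reference, the schema file written for each table is just the fields array returned by bq show. For a table with, say, an integer id column and a string username column, the resulting <table_name>.json would look roughly like this (an illustrative sample, not output from a real table):

[
  {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "username", "type": "STRING", "mode": "NULLABLE"}
]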

After exporting the data to GCS, the script loads the generated gzipped CSV files into the corresponding BigQuery tables in the destination project. Because each load uses the exported JSON schema file, the data is transferred from one project to another while the schema is maintained.
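
For a single table called usernames, the per-table steps are roughly equivalent to running the following commands by hand (the script additionally creates the two staging buckets in the destination project before extracting and copying):

bq show --format=json "${SOURCE_PROJECT_ID}:${DATASET_NAME}.usernames" | jq .schema.fields > usernames.json
bq extract --compression GZIP "${SOURCE_PROJECT_ID}:${DATASET_NAME}.usernames" "gs://usernames-bkp/x-*"
gsutil -m cp -r gs://usernames-bkp gs://usernames-bkp-ax
bq load --source_format=CSV --skip_leading_rows=1 "${DESTINATION_PROJECT_ID}:${DATASET_NAME}.usernames" "gs://usernames-bkp-ax/*" ./usernames.json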

How to Use

  1. Ensure that you have the necessary permissions to access the source and destination BigQuery datasets and GCS buckets.

  2. Before running the script, set the environment variables SOURCE_PROJECT_ID, DESTINATION_PROJECT_ID, and DATASET_NAME to the appropriate values, either in your shell or in a wrapper script that runs before this one. For example:

export SOURCE_PROJECT_ID=project-get-data-from-id
export DESTINATION_PROJECT_ID=project-send-data-to-id
export DATASET_NAME=my-big-query-dataset-id
  3. Create a file listing the names of the BigQuery tables you want to migrate, with each table name on a separate line.
<tables.txt>
usernames
profiles
profiles_parsed
mySpecialTable
  4. Execute the script using the file as an argument:
./data_migration_script.sh tables.txt
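
The script also turns on bash command tracing (set -x) when a -v flag is passed as an extra argument, which is useful for debugging a failed run:

./data_migration_script.sh tables.txt -v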

The script will process each table listed in the file and perform the following steps for each table:

  • Check whether the staging bucket exists in the destination project, and create it if it doesn't.
  • Extract the data from the source BigQuery table, compress it using GZIP, and store it in the staging bucket.
  • Copy the staging bucket to a second bucket with the -ax suffix.
  • Export the table's schema to a JSON file in the current working directory.
  • Load the gzipped CSV files from the -ax bucket into the corresponding destination BigQuery table.
  • Print informative logs for each table migrated.
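
After a table has been migrated, a quick sanity check is to compare row counts between the two projects, for example (using the usernames table and standard SQL):

bq --project_id="${SOURCE_PROJECT_ID}" query --use_legacy_sql=false "SELECT COUNT(*) FROM \`${DATASET_NAME}.usernames\`"
bq --project_id="${DESTINATION_PROJECT_ID}" query --use_legacy_sql=false "SELECT COUNT(*) FROM \`${DATASET_NAME}.usernames\`"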

Please make sure to review and adjust the environment variables and file paths before running the migration. Note that the script assumes the table names in the input file match the actual table names in the BigQuery dataset.
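
If you want to migrate every table in the dataset, one way to generate tables.txt is to list the source dataset with bq and jq (an untested sketch; raise -n if the dataset holds more than 1000 tables):

bq --project_id="${SOURCE_PROJECT_ID}" ls -n 1000 --format=json "${DATASET_NAME}" | jq -r '.[].tableReference.tableId' > tables.txt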
