@tomz
Last active November 24, 2015 17:01

Loading SQL Server tables to Spark on EC2 and output to S3

0. Set up SQL Server on Azure

Follow the instructions at https://azure.microsoft.com/en-us/documentation/articles/sql-database-get-started/ to set up a SQL Server instance on Azure.
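Before moving on, it can help to confirm the database is reachable. A hedged sketch follows (the hostname, credentials, and table name are placeholders, not values from this guide); note that Azure SQL rejects outside connections until your client IP is allowed in the server's firewall rules:

```shell
# Azure SQL blocks external connections by default; add your client IP
# to the server firewall in the Azure portal before connecting.
#
# Placeholder connectivity check using sqlcmd (all values are examples):
#
#   sqlcmd -S yourserver.database.windows.net,1433 \
#          -U youruser -P 'yourpassword' -d yourdb \
#          -Q "SELECT TOP 1 * FROM yourtable"
#
NOTE="Allow your client IP in the Azure SQL firewall before connecting."
echo "$NOTE"
```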

1. Launch spark cluster using spark-ec2 script

Export AWS credentials

export AWS_ACCESS_KEY_ID=your_aws_key
export AWS_SECRET_ACCESS_KEY=your_aws_secret
export AWS_PRIVATE_KEY=your_aws_private_key

Clone the spark repo and cd to the ec2 sub-directory

git clone https://github.com/apache/spark.git

cd spark/ec2

./spark-ec2 --key-pair=$AWS_PRIVATE_KEY --identity-file=$AWS_PRIVATE_KEY.pem --region=us-east-1 --zone=us-east-1a  --spark-version=1.5.1 launch my-spark-cluster

SSH into the cluster

./spark-ec2 -k $AWS_PRIVATE_KEY -i $AWS_PRIVATE_KEY.pem login my-spark-cluster

2. Set up the master instance for development

Update aws-cli and set up Java 8 and Maven

yum install -y aws-cli java-1.8.0-openjdk java-1.8.0-openjdk-devel
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
export PATH=$JAVA_HOME/bin:$PATH

wget http://apache.mirrors.hoobly.com/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
tar xvfz apache-maven-3.3.3-bin.tar.gz
sudo mv apache-maven-3.3.3 /opt/maven
sudo ln -s /opt/maven/bin/mvn /usr/bin/mvn

Export AWS and Azure credentials

export SQLSERVER_USERNAME=<your SQL Server user name>
export SQLSERVER_PWD=<your SQL Server password>
export SQLSERVER_HOST_PORT=<your SQL Server hostname and port, e.g. (mine is) ngc1379w4v.database.windows.net:1433>
export SQLSERVER_DB=<your SQL Server database>
export SQLSERVER_TABLE=<your SQL Server table>
export AWS_ACCESS_KEY_ID=<your AWS access key>
export AWS_SECRET_ACCESS_KEY=<your AWS secret>
export AWS_S3_BUCKET=<s3 bucket to store output, e.g. tomzeng/rdd-output>
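As a sanity check, the exported variables can be assembled into the JDBC connection URL the app will use. A sketch with placeholder values, using the standard Microsoft JDBC driver URL format:

```shell
# Placeholder values standing in for the exported credentials above.
SQLSERVER_USERNAME=myuser
SQLSERVER_PWD=mypassword
SQLSERVER_HOST_PORT=myserver.database.windows.net:1433
SQLSERVER_DB=mydb

# Standard SQL Server JDBC URL format:
#   jdbc:sqlserver://host:port;database=...;user=...;password=...
JDBC_URL="jdbc:sqlserver://${SQLSERVER_HOST_PORT};database=${SQLSERVER_DB};user=${SQLSERVER_USERNAME};password=${SQLSERVER_PWD}"
echo "$JDBC_URL"
```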

Clone the sample app repo

git clone https://github.com/tomz/SparkApps.git

Change into the sample app directory

cd SparkApps

Copy the SQL Server JDBC driver

aws s3 cp s3://tomzeng/jdbc-drivers/sqljdbc4.jar .

3. Build and run the app

Build the app

mvn install

Run the app

~/spark/bin/spark-submit --master local[*] --class com.sparkexpert.MainMS --jars sqljdbc4.jar --packages com.databricks:spark-csv_2.11:1.2.0 spark-load-from-db/target/spark-load-from-db-1.0-SNAPSHOT.jar

4. Check your S3 bucket for the output

aws s3 ls s3://$AWS_S3_BUCKET/
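Spark writes the result as a directory of part files rather than a single file. A hedged sketch of fetching them locally (the bucket/prefix is a placeholder matching the earlier example; the actual download requires valid AWS credentials, so it is shown commented out):

```shell
# Placeholder bucket/prefix matching the AWS_S3_BUCKET example above.
AWS_S3_BUCKET=tomzeng/rdd-output
OUTPUT_URI="s3://${AWS_S3_BUCKET}/"
echo "$OUTPUT_URI"

# To download all part files (requires AWS credentials):
#   aws s3 cp "$OUTPUT_URI" ./rdd-output --recursive
```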