Fetching data via JDBC using pyspark
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Data used - academia.stackexchange.com.7z
from https://archive.org/details/stackexchange
Data from multiple tables are there in academia.stackexchange.com
folder. I used PostHistory.xml.
MySQL commands:
CREATE Database stackexchange;
use stackexchange
CREATE TABLE post_history (
Id INT NOT NULL PRIMARY KEY,
PostHistoryTypeId SMALLINT NOT NULL,
PostId INT NOT NULL,
RevisionGUID VARCHAR(36),
CreationDate DATETIME,
UserId INT,
Text TEXT
);
I created a new user for the test
CREATE USER 'devender'@'localhost' IDENTIFIED BY 'dev_mysql';
GRANT ALL ON stackexchange.* TO 'devender'@'localhost';
FLUSH PRIVILEGES;
Then I tried to load PostHistory.xml data into table post_history
using
load xml infile '<path to xml>/PostHistory.xml' into table post_history rows identified by '<row>';
I got an error
ERROR 1290 (HY000): The MySQL server is running with the --secure-file-priv option so it cannot execute this statement
So I moved xml file to /var/lib/mysql-files directory and tried
load xml infile '/var/lib/mysql-files/PostHistory.xml' into table post_history rows identified by '<row>';
It worked 👌
check JDBC code without partitioning and with partitioning.
spark-submit --jars <path-to-connector>/mysql-connector-java-5.1.36.jar --driver-class-path <path-to-connector>/mysql-connector-java-5.1.36.jar spark-jdbc-partition.py
fetchsize
an optional parameter can be used to improve speed of ingestion.