devender-yadav/sqoop.md

Last active December 21, 2018 05:33

Star () You must be signed in to star a gist
Fork () You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/devender-yadav/ae4896c0baf48d2fc83b21dfa0b0adaa.js"></script>
Save devender-yadav/ae4896c0baf48d2fc83b21dfa0b0adaa to your computer and use it in GitHub Desktop.

Download ZIP

To make sqoop job efficient

Raw

sqoop.md

How to choose split by column?

Idea is to use a uniformly distributed numeric column. So, we should prefer primary key and then any numeric column and we should avoid using text column for splitting.

How to determine the number of Mappers?

We should consider:

Number of rows
Number of tasks that can be run in parallel in Hadoop
Number of concurrent connection in RDBMS
Memory assigned to the mapper

Direct mode

Direct mode should be preferred while imported data for faster migration because it uses the underlying utility (e.g. mysqldump for Mysql) to migration data instead of firing range queries via JDBC. Direct mode is not supported for all supported RDBMS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment