@devender-yadav
Last active December 21, 2018 05:33
Making Sqoop jobs efficient

How to choose the split-by column?

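The idea is to split on a uniformly distributed numeric column. Sqoop divides the range between the column's minimum and maximum values evenly among the mappers, so a skewed column leaves a few mappers with most of the rows. Prefer the primary key, then any other numeric column, and avoid splitting on a text column.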
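To see why uniformity matters, here is a minimal Python sketch of range-based splitting. It is a simplified illustration, not Sqoop's actual implementation; the function names `split_ranges` and `rows_per_split` and the sample values are made up for this example.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the closed integer range [lo, hi] into contiguous sub-ranges,
    one per mapper, as evenly as possible (simplified illustration)."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if hi < lo:
        raise ValueError("hi must not be smaller than lo")
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = lo
    for i in range(num_mappers):
        # The first `extra` ranges take one additional value each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer distinct values than mappers
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def rows_per_split(values, ranges):
    """Count how many column values fall into each range."""
    return [sum(1 for v in values if lo <= v <= hi) for lo, hi in ranges]


# Uniform column: ids 1..100, four mappers, equal work for each.
uniform = list(range(1, 101))
uniform_ranges = split_ranges(min(uniform), max(uniform), 4)
uniform_counts = rows_per_split(uniform, uniform_ranges)

# Skewed column: most ids are small, a few outliers stretch the range.
skewed = list(range(1, 98)) + [1000, 2000, 4000]
skewed_ranges = split_ranges(min(skewed), max(skewed), 4)
skewed_counts = rows_per_split(skewed, skewed_ranges)
```

With the uniform column every mapper gets the same number of rows. With the skewed column the first mapper gets almost all of them while another gets none, so the job runs roughly as slowly as a single-mapper import.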

How to determine the number of Mappers?

We should consider:

  • Number of rows
  • Number of tasks that can run in parallel in Hadoop
  • Number of concurrent connections the RDBMS allows
  • Memory assigned to each mapper
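One way to combine these factors is to take the smallest of the limits they imply. The sketch below is a rule of thumb rather than anything Sqoop computes itself; `choose_num_mappers` and all of its parameters are hypothetical, and `rows_per_mapper` stands in for how many rows one mapper can comfortably handle with the memory it has been given.

```python
import math


def choose_num_mappers(row_count, rows_per_mapper,
                       parallel_task_slots, db_connection_budget):
    """Heuristic mapper count (an assumption, not Sqoop behaviour).

    row_count            -- rows in the table being imported
    rows_per_mapper      -- rows one mapper should handle, given its memory
    parallel_task_slots  -- tasks Hadoop can run at the same time
    db_connection_budget -- concurrent connections the RDBMS can spare
    """
    if min(rows_per_mapper, parallel_task_slots, db_connection_budget) < 1:
        raise ValueError("all limits must be at least 1")
    # Mappers needed so that none exceeds its row budget (at least one).
    wanted = max(1, math.ceil(row_count / rows_per_mapper))
    # Never ask for more than the cluster or the database can serve.
    return min(wanted, parallel_task_slots, db_connection_budget)


# 10 million rows would like 10 mappers, but the database allows only 8.
large_table = choose_num_mappers(10_000_000, 1_000_000, 20, 8)

# A small table does not benefit from extra mappers.
small_table = choose_num_mappers(50_000, 1_000_000, 20, 8)
```

The result would then be passed to Sqoop through `--num-mappers` (or `-m`).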

Direct mode

Direct mode should be preferred when importing data because it is faster: it uses the database's native utility (e.g. mysqldump for MySQL) to move the data instead of issuing range queries over JDBC. Direct mode is not available for every RDBMS that Sqoop supports.
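Putting the three choices together, the following sketch assembles a `sqoop import` command line with a split column, a mapper count, and optional direct mode. The helper `build_import_command`, the JDBC URL, the table name, and the target directory are placeholders invented for this example; credentials are left out on purpose and would need to be supplied in a real run.

```python
import shlex


def build_import_command(connect_url, table, target_dir,
                         split_by, num_mappers, direct=False):
    """Return a `sqoop import` invocation as an argument list."""
    cmd = [
        "sqoop", "import",
        "--connect", connect_url,
        "--table", table,
        "--target-dir", target_dir,
        "--split-by", split_by,            # uniformly distributed numeric column
        "--num-mappers", str(num_mappers),
    ]
    if direct:
        # Use the database's native dump utility instead of JDBC range queries.
        cmd.append("--direct")
    return cmd


# Hypothetical MySQL source; every value here is a placeholder.
cmd = build_import_command(
    connect_url="jdbc:mysql://db.example.com/shop",
    table="orders",
    target_dir="/data/orders",
    split_by="id",
    num_mappers=8,
    direct=True,
)

# Quote each argument so the line can be pasted into a shell.
command_line = " ".join(shlex.quote(arg) for arg in cmd)
print(command_line)
```

Building the command as a list keeps each argument separate, so it can also be handed straight to `subprocess.run(cmd)` without shell quoting issues.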
