The idea is to split on a uniformly distributed numeric column, so that each parallel task receives roughly the same number of rows. Prefer the primary key, then any other numeric column, and avoid splitting on a text column.
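To see why uniformity matters, here is a minimal sketch of range-based splitting. It is a simplified model, not the import tool's actual splitter: the names `split_ranges` and `rows_per_split` and the sample data are made up for illustration. It cuts the interval between the column's minimum and maximum into equal-width ranges and counts how many rows land in each.

```python
def split_ranges(lo, hi, num_splits):
    """Cut the inclusive interval [lo, hi] into equal-width half-open ranges."""
    size = hi - lo + 1
    step = -(-size // num_splits)  # ceiling division
    ranges = []
    start = lo
    while start <= hi:
        end = min(start + step, hi + 1)
        ranges.append((start, end))  # covers start <= value < end
        start = end
    return ranges


def rows_per_split(values, ranges):
    """Count how many values fall into each range."""
    counts = [0] * len(ranges)
    for v in values:
        for i, (start, end) in enumerate(ranges):
            if start <= v < end:
                counts[i] += 1
                break
    return counts


# Hypothetical data: a dense key versus a key with a few large outliers.
uniform_ids = list(range(1, 1001))
skewed_ids = list(range(1, 991)) + [10_000 * i for i in range(1, 11)]

for name, ids in (("uniform", uniform_ids), ("skewed", skewed_ids)):
    ranges = split_ranges(min(ids), max(ids), 4)
    print(name, rows_per_split(ids, ranges))
```

With the dense key every range holds the same number of rows; with the skewed key almost all rows fall into the first range, so one task does nearly all the work while the others sit idle.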
When deciding how many parallel tasks (mappers) to use, we should consider:
- Number of rows
- Number of tasks that can be run in parallel in Hadoop
- Number of concurrent connections the RDBMS can handle
- Memory assigned to the mapper
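The factors above can be combined into a rough upper bound. The sketch below is only a heuristic under stated assumptions: the function name, its parameters, and the `min_rows_per_mapper` threshold are all hypothetical, and memory is modelled simply as "how many mappers of this size fit in the cluster at once".

```python
def suggest_mappers(row_count, parallel_task_slots, db_connection_budget,
                    cluster_memory_mb, mapper_memory_mb,
                    min_rows_per_mapper=100_000):
    """Return a mapper count that respects every limit (hypothetical heuristic)."""
    # Too few rows per mapper means startup overhead dominates the work.
    by_rows = max(1, row_count // min_rows_per_mapper)
    # Only this many mappers of the given size fit in memory at the same time.
    by_memory = max(1, cluster_memory_mb // mapper_memory_mb)
    # The tightest limit wins; never go below one mapper.
    return max(1, min(by_rows, parallel_task_slots,
                      db_connection_budget, by_memory))


# Example: 5 million rows, 40 task slots, 8 spare database connections,
# 64 GB of cluster memory and 2 GB per mapper.
print(suggest_mappers(5_000_000, 40, 8, 65_536, 2_048))
```

In this example the database connection budget is the tightest constraint, which is often the case in practice: each mapper opens its own connection, so the source database is usually the first thing to saturate.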
Direct mode should be preferred when importing data, because it is faster: it uses the database's native utility (e.g. mysqldump for MySQL) to move the data instead of firing range queries via JDBC. Note that direct mode is not available for every supported RDBMS.
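As an illustration, assuming the tool in question is Apache Sqoop, the choices above map onto the `--split-by`, `--num-mappers` and `--direct` options of `sqoop import`. The sketch below only assembles and prints the command; the function name, connection string, table, column and target directory are placeholders, not values from a real system.

```python
import shlex


def build_import_command(jdbc_url, table, split_column, num_mappers,
                         target_dir, direct=False):
    """Assemble a `sqoop import` argument list (placeholder values only)."""
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--split-by", split_column,        # uniformly distributed numeric column
        "--num-mappers", str(num_mappers),
        "--target-dir", target_dir,
    ]
    if direct:
        # Only add this when the connector for the source database supports it.
        cmd.append("--direct")
    return cmd


cmd = build_import_command(
    jdbc_url="jdbc:mysql://db.example.com/sales",  # hypothetical host and schema
    table="orders",                                # hypothetical table
    split_column="order_id",                       # hypothetical primary key
    num_mappers=8,
    target_dir="/data/raw/orders",                 # hypothetical HDFS path
    direct=True,
)
print(shlex.join(cmd))
```

Keeping `--direct` behind a flag makes it easy to fall back to the JDBC path for databases where direct mode is not supported.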