Here's an efficient way to load a dataset into Vertica by splitting it up into multiple pieces and then parallelizing the load process.
Note that this only makes sense if your Vertica cluster is a single node. If it has more nodes, there are more efficient ways of doing this, since Vertica can distribute a load across the cluster itself.
For this example, the large CSV file will be called large_file.csv. If your file is under 1GB, it probably makes sense to load it with a single COPY command instead.
Next, split up your CSV file into multiple parts, like this:
$ mkdir -p /tmp/parts/
$ split -C 500m large_file.csv /tmp/parts/large_file.csv-
This splits the CSV file into pieces of at most 500MB each; the -C flag keeps whole lines intact, which matters since each piece will be loaded as a standalone CSV file.
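If `split` isn't available, the same line-preserving split can be sketched in Python. This is a minimal illustration, not what `split` does byte-for-byte; the 200-byte piece size and the demo file are made up so the example runs on its own:

```python
import os
import tempfile

def split_csv(path, out_dir, max_bytes):
    """Split a file into pieces of at most max_bytes, rotating only on
    line boundaries (roughly what `split -C` does)."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(path)
    part, size, out = 0, 0, None
    with open(path, "rb") as src:
        for line in src:
            # Start a new piece if this line would push us past the limit.
            if out is None or size + len(line) > max_bytes:
                if out:
                    out.close()
                out = open(os.path.join(out_dir, f"{base}-{part:03d}"), "wb")
                part, size = part + 1, 0
            out.write(line)
            size += len(line)
    if out:
        out.close()
    return part  # number of pieces written

# Demo: 100 lines of 6 bytes each, split into pieces of at most 200 bytes
# (33 whole lines per piece), so 100 lines need 4 pieces.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "demo.csv")
with open(src, "wb") as f:
    f.write(b"a,b,c\n" * 100)
n = split_csv(src, os.path.join(tmp, "parts"), 200)
print(n)  # → 4
```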
Install Uber's vertica-python bindings (everybody loves Uber except when they do surge pricing):
$ pip install vertica-python
Create a file called loader.py
with these contents:
import os
import queue
import threading

from vertica_python import connect

conn_dict = {
    'host': '127.0.0.1',
    'port': 5433,
    'user': '<< username >>',
    'password': '<< password >>',
    'database': '<< dbname >>',
}

class ThreadLoader(threading.Thread):
    def __init__(self, q):
        threading.Thread.__init__(self)
        self.daemon = True  # let the program exit once the queue is drained
        self.queue = q

    def run(self):
        # Each thread gets its own connection and loads files off the
        # shared queue until the program exits.
        conn = connect(**conn_dict)
        cursor = conn.cursor()
        while True:
            filename = self.queue.get()
            print("Loading %s..." % filename)
            cursor.execute(
                "COPY << table_name >> FROM '%s' DELIMITER ',' DIRECT" % filename)
            print(cursor.fetchall())
            self.queue.task_done()

if __name__ == '__main__':
    # Change this to the directory where the split files reside.
    dirname = '/tmp/parts/'
    q = queue.Queue()
    # You can increase or decrease the number of loader threads by
    # changing the range below.
    for i in range(8):
        t = ThreadLoader(q)
        t.start()
    # Load the queue with every file in the parts directory.
    for filepath in os.listdir(dirname):
        print(filepath)
        q.put(dirname + filepath)
    q.join()
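The worker/queue pattern the script relies on can be exercised without a database. In this sketch the COPY statement is replaced with a stand-in transformation, and the filenames and thread count are arbitrary:

```python
import queue
import threading

results = []
lock = threading.Lock()

def worker(q):
    # Each worker pulls filenames off the queue, "loads" them
    # (here: just uppercases the name), and marks the task done.
    while True:
        item = q.get()
        with lock:
            results.append(item.upper())  # stand-in for the COPY statement
        q.task_done()

q = queue.Queue()
for _ in range(4):
    threading.Thread(target=worker, args=(q,), daemon=True).start()

for name in ["large_file.csv-aa", "large_file.csv-ab", "large_file.csv-ac"]:
    q.put(name)

q.join()  # blocks until every queued file has been processed
print(sorted(results))  # → ['LARGE_FILE.CSV-AA', 'LARGE_FILE.CSV-AB', 'LARGE_FILE.CSV-AC']
```

`q.join()` is what lets the main thread know all pieces are loaded, since the workers themselves never return.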
You'll need to replace the items in << double angle brackets >> with your own values. Then run the loader:
$ python loader.py
That's all it took for me - leave a comment if you need a hand getting this to work and I'll be happy to help.