Created
June 6, 2013 18:10
Read flat Hadoop output files (part-*) into Python. Uses a generator, so only one line is held in memory at a time.
import glob
import os


def filelist(path, _filter="part-*"):
    # Expand "~" and relative paths, then list the files matching the filter.
    basepath = os.path.abspath(os.path.expanduser(path))
    return glob.glob(os.path.join(basepath, _filter))


def hfile(path, _filter="part-*"):
    # Yield lines one at a time from each matching file (trailing newline kept).
    for filename in filelist(path, _filter):
        with open(filename) as f:
            for line in f:
                yield line
''' Usage example:
from open_hadoop import hfile
for line in hfile('./input/path/'):
    print(line.rstrip('\n'))
'''
''' Installation:
- Place open_hadoop.py in your site-packages directory.
- Your site-packages directory can be located by running:
python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])"
'''
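
To illustrate the memory-efficiency point, here is a minimal sketch that builds a throwaway directory of part files and reads only the first two lines from it. The `hfile` below repeats the gist's logic so the block runs on its own; `make_sample_output` and `demo` are hypothetical helpers, and the file contents are made up for the example.

```python
import glob
import itertools
import os
import tempfile


def hfile(path, _filter="part-*"):
    # Same logic as the gist's hfile, repeated so this sketch runs standalone.
    basepath = os.path.abspath(os.path.expanduser(path))
    for filename in glob.glob(os.path.join(basepath, _filter)):
        with open(filename) as f:
            for line in f:
                yield line


def make_sample_output(directory):
    # Hypothetical stand-in for a Hadoop job's output directory:
    # three part files with four tab-separated lines each.
    for i in range(3):
        with open(os.path.join(directory, "part-%05d" % i), "w") as f:
            for j in range(4):
                f.write("key%d\t%d\n" % (i, j))
    # A marker file that does not match "part-*" and is therefore skipped.
    with open(os.path.join(directory, "_SUCCESS"), "w"):
        pass


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        make_sample_output(tmp)
        # islice pulls only two lines; the remaining files are never opened.
        gen = hfile(tmp)
        first_two = list(itertools.islice(gen, 2))
        gen.close()  # release the file the generator still holds open
        # Counting consumes the stream without building a list of lines.
        total = sum(1 for _ in hfile(tmp))
    return first_two, total


if __name__ == "__main__":
    first_two, total = demo()
    print(first_two)
    print(total)
```

Because `glob.glob` does not guarantee an ordering, which part file supplies the first lines may vary between systems; wrap the file list in `sorted()` if the order matters.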