Very Bad Things

Originally, I named this document "very bad things" because it does some stuff that proper folks should not do, like mixing bash and python and leaving filehandles unclosed. It's about a function I wrote to get sh*t done.

I've been doing bioinformatics for about 10 years now. I used to joke with a friend of mine that most of our work was converting between file formats. We don't joke about that anymore.

I was repeating over and over in my scripts, things like (slight exaggeration):

    for toks in (line.rstrip("\r\n").split("\t") for line in open(file_name):
        if toks[0] == "chr1" and int(toks[1]) >= 20000 \
                             and 25000 <= int(toks[2]):
            print "\t".join(toks)

of course, you could use csv python module or awk from the terminal, for something like that, but bash scripts get out of hand. And it's usually just some simple kind of filtering and altering of the columns in a text file.

so I wrote a 3 or 4 line function named reader(), that simplifies this. And after I wrote it, it grew to have a few more lines (but not many) and do a lot more. It's really too simple to write up, but from the python code I've seen out there, it could benefit a few people. Simplest usage is:

    for toks in reader(file_name, header=False):
        if toks[0] != "chr1": continue
        ...

or if it has a header:

    for d in reader(file_name, header=True):
        if d['chrom'] != "chr1": continue

or it doesn't have a header, but you want to acces by key anyway:

    for d in reader(file_name, header="chrom start end name".split()):
        if d['chrom'] != "chr1": continue

or if you want to get an object for each line:

    class Bed(object):
        def __init__(self, toks):
            self.chrom = toks[0]
            self.start, self.end = map(int, toks[1:3])

    for bed in reader(file_name, header=Bed):
        if bed.chrom == "chr1" and bed.start < 20000:
            print bed

or if it's split by comma (default is tab) add sep=",". It could be a remote file, a gzipped file a bzip2'ed file or an xls file. it's all the same:

    for d in reader('http://some.where.net/file.txt'):
        # do stuff
    for d in reader('some.file.xls'):
        # do other stuff

it can be tricked, because it just looks for names starting with https?:// or ending with .gz. It can also do streaming download of remote gzip files. It can also handle a file like "~/work/somestuff" or "$HOME/somestuff". Or stdin in using the filename "-". Again, simple things for convenience.

reader can also take an iterable, so reader(some_list) will work.

It can also handle processes -- calls to the shell by prefixing the command with "|".

    for toks in reader("|awk '(NR > 1){ print $1,$3-100,$3 }'", header=False):

...

which means it can handle process substitution. e.g., below, i want to intersect 2 bed files (chromosome locations), but only where the 7th column in the second file is less than some value.

    for toks in reader('|bedtools intersect -a query.bed -b <(awk "$7 < 0.05" regions.bed)', header=False):
        # do something.

and if having a dictionary is easier than indexing a list (usually the case), you can get a dict like:

    for toks in reader('|bedtools intersect -header -a query.bed -b <(awk "$7 < 0.05" regions.bed)'):
        # do something

(removing header=False and telling bedtools to print the header.). or like this, by specifying the header explicitly.

    for d in reader('|bedtools intersect -a query.bed -b <(awk "$7 < 0.05" regions.bed)',
         header="chrom start end name other".split()):
         print d['chrom']

which means you can mix python idioms with bash:

    from itertools import groupby
    from operator import itemgetter

    fniter = reader('|bedtools intersect -a query.bed \
                                         -b <(awk "$7 < 0.05" regions.bed)',
                         header="chrom start end name other".split()):

    for name, bed_list in groupby(fniter, key=itemgetter("name")):
        for bed in bed_list:
            # doo stuff

if you want to mix a lot of bash into your python, you can do:

    for toks in reader('|bash somescript.sh'):
        # do something

if you just want to force a process to run, you can do:

    list(reader('|bash script.sh > output.txt'))

This is, admittedly unremarkable, but turns out to be quite useful. Even for things like looking at a binary format making something else turn it into a text format:

    for sam in reader("|samtools view -F4 some.bam", header=False):
        # do stuff

this cuts down on some of the code that I'd otherwise write. It also gives helpful error messages on fail, something that would require discipline to get everytime from an external process.

reader is available in toolshed from pypi for python2.6 or 2.7 by doing:

easy_intall toolshed

or preferably

pip install toolshed

or on github: https://github.com/brentp/toolshed/ it could use improvements, pull requests will be welcomed.

brentp/very-bad-things.md

Very Bad Things