Skip to content

Instantly share code, notes, and snippets.

@IdoBar
Forked from sujaikumar/UniRef90.md
Created May 29, 2017 23:01
Show Gist options
  • Save IdoBar/65219c2645975325ec9f598d5c7d66ad to your computer and use it in GitHub Desktop.
Save IdoBar/65219c2645975325ec9f598d5c7d66ad to your computer and use it in GitHub Desktop.
UniRef90 protein blast database with taxon IDs

Goal

  • To create UniRef90 protein databases for NCBI blast and Diamond Blast
  • To create a tab delimited taxid mapping file with two columns : sequenceID\tNCBITaxonID

Steps:

Download the uniref90 xml file first (warning - this is ~15 GB, will take a while)

wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref/uniref90/uniref90.xml.gz

Make the protein fasta file and the taxlist out of this xml file. Out of 43,405,259 entries in uniref90.xml.gz (dated 2016-07-06), 851,213 have no taxon id, so there's no need to put them in the fasta file (as the purpose of this blast db is to identify the likely taxa of the query sequences).

To make the fasta file and the tab delimited taxid mapping file:

zcat uniref90.xml.gz  | perl -e '
open FA, ">uniref90.fasta";
open TX, ">uniref90.taxlist";
while ($line = <>) {
    if ($line =~ /<entry id="(\S+)"/) {
        $id = $1;
        $tx = 0;
        $sq = "";
        <>;<>;<>; $line = <>;
        if ($line =~ /"common taxon ID" value="(\d+)"/) {
            $tx = $1;
        }
    }
    if ($line =~ /<sequence/) {
        while ($line = <>) {
            chomp $line;
            last if $line =~ /<\/sequence/;
            $sq .= $line;
        }
        if ($tx and $sq) {
            print FA ">$id\n$sq\n";
            print TX "$id\t$tx\n";
        }
    }
}'

Now you can make the NCBI Blast db like this:

makeblastdb -dbtype prot -in uniref90.fasta

And the Diamond blast db like this:

diamond makedb --in uniref90.fasta --db uniref90

Using the taxon ID list

If in future you want to append the NCBI taxid to a blast tabular output file with uniref90 as the database (whether from ncbi-blast or diamond-blast) you can use this:

# assuming the UniRef90 id is in the second column (typical for ncbi and diamond blast tabular output)
 
diamond view -a blastoutput.daa \
| perl -lne '
  BEGIN{open UT, "<uniref90.taxlist" or die $!; while (<UT>) { $ut{$1}=$2 if /^(\S+)\t(\S+)$/ } }
  {print "$_\t$ut{$1}" if /^\S+\t(\S+)/ and exists $ut{$1}}' \
> blastoutput.daa.taxid
 
# or
 
perl -lne '
  BEGIN{open UT, "<uniref90.taxlist" or die $!; while (<UT>) { $ut{$1}=$2 if /^(\S+)\t(\S+)$/ } }
  {print "$_\t$ut{$1}" if /^\S+\t(\S+)/ and exists $ut{$1}}' \
< ncbiblast.out.txt \
> ncbiblast.out.txt.taxid
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment