Skip to content

Instantly share code, notes, and snippets.

@TomConlin
Last active June 17, 2021 18:17
Show Gist options
  • Save TomConlin/6d38f8795add0230566456498164a63b to your computer and use it in GitHub Desktop.
Save TomConlin/6d38f8795add0230566456498164a63b to your computer and use it in GitHub Desktop.
FriendsOfEntropy - enrich decidability for Variant Call Format(VCF) data across multiple cultivar/strains for Machine Learning
gvcf is not just "gzipped" vcf it is some "genomic" extension to the format
# test
vcftools --gzvcf ./data/gvcf/IDTX_PCRfree_PI562985_ATTACTCG-ATAGAGGC_Sorghum_I549_L1.bwa_pcr_free.raw.snps.indels.g.vcf.gz \
--non-ref-ac-any 1 \
--recode \
--recode-INFO-all \
--stdout > test_vcftools_filtered
ls ./data/gvcf/I*raw.snps.indels.g.vcf.gz |
sed 's|.*[_.]\?\(PI[_]\?[0-9]\{4,7\}\)[_.]\?.*|\1|'|tr -d '_' | sort -u | wc -l
364 cultivars? lines? strains?
this "rio" breaks the name pastern of "PInnnnnnn" ...(it is a [new?] reference strain)
data/gvcf/IEDF_IEDE_merged_PCRfree_rio.bwa_pcr_free.raw.snps.indels.g.vcf.gz
cultivars starts as 1.7T gzipped in ~364 files
time \
for gvcf in $(ls ./data/gvcf/I*raw.snps.indels.g.vcf.gz); do
cultivar=$(sed 's|.*[_.]\?\(PI[_]\?[0-9]\{4,7\}\)[_.]\?.*|\1|' <(echo -n "$gvcf")|tr -d '_') ;
vcftools --gzvcf "$gvcf" --non-ref-ac-any 1 --recode --recode-INFO-all --stdout >\
"./data/Cultivar/$cultivar.vcf";
done
~ 280G uncompressed & 47G compressed
compressing ... should have used a parallel compress
tar -cf - Cultivar | pigz -p $(($(nproc)/2)) > cultivar.tar.gz
the filtering step will have removed every call
where the cultivar agrees with the reference.
leaving
1,370,111,696 calls is 3.764M call per cultivar on avg
more precisely stats for calls per cultivar are
cut -f2 cultivar_call_count.txt | ~/bin/sumstat.r
V1
Min. : 43 only one close to this small likely needs prune/redo
1st Qu.: 3207119
Median : 3755217
Mean : 3774412
3rd Qu.: 4245765
Max. :11101842
biggest question;
Are all 364 cultivar lines relevant? -- yes
--------------------------------------------------------------
for a variation.
is it within a gene (& which one(s))
is it within an exon?
is it a silent substitution
is it in a regulatory region up/down stream what size
can regions be generically characterized? (diminishing periodic? histome sized?)
is it in a known binding site (which)
does it change the gc content (in which direction)
need to do a massive crunches
want maybe 30-40 features total
so metrics that can be done per cultivar vcf and wind up as some concise value
1) gc delta of genome
2) total calls
3) calls within genes
4) calls upstream 2k
5) calls downstream 1k
6) 0/1, 1/0, 1/1
7) ...
-------------------------------------------------------------------------
Is the gene symbol|id w/coordinates I got from OMA the correct assembly?
Is there a preferred one?
have a permission request for a cultivar mapping file on Goog. no response yet
https://drive.google.com/file/d/18DI3BqVv7NCJdQM2tu9S69dZww8CXdCB/view
can Ensembl Plant 49 references be used? e.g.
##gff-version 3
##sequence-region 10 1 61233695
#!genome-build Joint Genome Institute Sorghum_bicolor_NCBIv3
#!genome-version Sorghum_bicolor_NCBIv3
#!genome-date 2017-04
#!genome-build-accession GCA_000003195.3
#!genebuild-last-updated 2017-06
Is ncbi's different/better?
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Sorghum_bicolor_NCBIv3
#!genome-build-accession NCBI_Assembly:GCF_000003195.3
#!annotation-source NCBI Sorghum bicolor Annotation Release 101
##sequence-region NC_012870.2 1 80884392
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4558
should we limit to particular cohort of "suspected" genes?
-------------------------------------------------------------------------------
vcf call col
1 2 3 4 5 6 7 8 9 10
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT IDYE_PI92270
grep -v "^#" data/Cultivar/PI92270.vcf | head -1 | tr '\t' '\n' | grep -n .
1:Chr01
2:104
3:.
4:C
5:A,<NON_REF>
6:475.77
7:.
8:BaseQRankSum=0.046;ClippingRankSum=0.000;DP=21;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=3.829;RAW_MQ=54154.00;ReadPosRankSum=-1.038
9:GT:AD:DP:GQ:PGT:PID:PL:SB
10:0/1:8,11,0:19:99:0|1:104_C_A:504,0,169,527,208,735:8,0,5,6
-----------------------------------------------------------------------------
cut -f 1,2,5 data/Cultivar/PI[0-9]*.vcf | grep -v "^#" | more
with straight cultivars ... say we want snp that occur in apx half the cultivars.
with phylo trees making "blurred-lines" cluster of cultivars.... clustivars
want snps that dominate within a clustivar
and occur in apx. half the clustivars maybe the middle half or middle third?
# get a list of all the plain snps (no indels) ... expect about a billion
# chr location cultivar and non-ref base
# trying the bash history substitution trick to quote a string
# awk -F'[/:,\t]' 'BEGIN{OSF="\t"}$7~/^A$|^C$|^T$|^G$/{print $4,$5,$7,substr($3,1,length($3)-4)}'
!:q
CMD="awk -F'[/:,\t]' 'BEGIN{OSF=\"\t\"}$7~/^A$|^C$|^T$|^G$/{print $4,$5,$7,substr($3,1,length($3)-4)}'"
time \
grep -v "^#" data/Cultivar/PI[0-9]*.vcf |
parallel -j16 "$CMD {} > chr_loc_snp_cul.tab"
time \
grep -v "^#" data/Cultivar/PI[0-9]*.vcf |
parallel -j24 awk -F'[/:,\t]' 'BEGIN{OSF="\t"}$7~/^A$|^C$|^T$|^G$/{print $4,$5,$7,substr($3,1,length($3)-4)}' {} > chr_loc_snp_cul.tab
...gaaaahhh too slow
time \
grep -v "^#" data/Cultivar/PI[0-9]*.vcf |
awk -F '[/:,\t]' 'BEGIN{OSF="\t"}$7 ~ /^A$|^C$|^T$|^G$/{print $4,$5,$7,substr($3,1,length($3)-4)}' |
sort --parallel=6 -k1,2n,3,4 > chr_loc_snp_cul.tab
real 42m49.090s
wc -l < chr_loc_snp_cul.tab
1,277,633,097
next pass, accumulate genomic locations with how many hits (from cultivars) it has
to reject the too-sparse and the ubiquitous
time cut -f 1,2,3 -d ' ' foo | uniq -c | sort --parallel=4 -nr > count_snp.txt
real 4m35.117s
look at the distribution are there obvious cut offs?
cut -c1-8 ~/fuse/monarch4/Projects/GenoPhenoEnvo/count_snp.txt| sumstat.r
V1
Min. : 1.00
1st Qu.: 1.00
Median : 2.00
Mean : 22.11
3rd Qu.: 13.00
Max. :362.0
ouch.
nothing anywhere near half. not shocking but hope does spring eternal
# anything in the distribution of counts?
cut -c1-8 count_snp.txt | uniq -c > count_of_counts.txt
cut -c1-8 ~/fuse/monarch4/Projects/GenoPhenoEnvo/count_of_counts.txt| sumstat.r
V1
Min. : 2,962
1st Qu.: 10,092
Median : 17,198
Mean : 159,661
3rd Qu.: 47,859
Max. :23709912
counts w/ between 10k & 50k dominate, skewed to the low side not immediately helpful.
clearly we can cut off all the single instance snps
24M exist only once and ~5k exist everywhere which is zero info in either case
maybe a different approach that looks for distance from max information
for *this* collection of ~360 cultivars how far is a snp from being included in half
# lop the the top & bottom now as we will later anyway
awk '$1<330 && $1>30 {print $1<180?180-$1:$1-180,$2,$3,$4}' count_snp.txt |
sort -n > middling-snp_count.txt
# anything in the distribution of middling counts?
cut -f1 -d ' ' middling-snp_count.txt | uniq -c > count_of_middling-counts.txt
cut -c1-8 ~/fuse/monarch4/Projects/GenoPhenoEnvo/count_of_middling-counts.txt|sumstat.r
Min. : 16457
1st Qu.: 37663
Median : 46814
Mean : 63162
3rd Qu.: 71216
Max. :165330
that is a smoother distribution
cut -f1 -d ' ' ~/fuse/monarch4/Projects/GenoPhenoEnvo/middling-snp_count.txt |
uniq -c |
~/bin/otsu.awk
32
that is a warm fuzzy number
# how many snps are kept?
awk '$1<=32{print}' middling-snp_count.txt | wc -l
1,185,000
# for how many chromosome or parts there of
awk '$1<=32{print $2}' middling-snp_count.txt | sort -u | wc -l
482
So keeping *snps* which exist within 150-to-210 of the 360 lines
results in about 1.1M (distinct) out of 1.4B snps flagged for future consideration.
(to be specific I am not considering indels at this point)
fgrep -f <(cut -c3- middling-snp_count.txt) foo > middling-snp-all.txt
# all individual discriminating snps
wc -l < middling-snp-all.txt
59,451,572 ~60 M individual snps
166,666 snps per cultivar is uniformly distributed
# Distribution of individual snps over cultivars.
cut -f4 -d ' ' middling-snp-all.txt | sort | uniq -c | sort -nr >
ohhh looks very nice 100k to 250k per cultivar
cut -f4 -d ' ' ~/fuse/monarch4/Projects/GenoPhenoEnvo/middling-snp-all.txt |
sort | uniq -c | cut -c1-8 | sumstat.r
cut -f4 -d ' ' ~/fuse/monarch4/Projects/GenoPhenoEnvo/middling-snp-all.txt | sort | uniq -c | cut -c1-8 | sumstat.r
V1
Min. : 13
1st Qu.:142376
Median :161990
Mean :163778
3rd Qu.:187212
Max. :248036
not perfectly uniform, but close enough.
make a "--regions-file" to filter the per cultivar vcf
to provide a filter for [bv]cftools
cut -f 1,2 -d ' ' middling-snp-all.txt | tr ' ' '\t' | sort -u |
sort -k1,1 -k2,2n > middling-region.tsv
wc -l < middling-region.tsv
330,228
Note: these 330k spots may have a base other than one that is "middling"
or have more than one viable base ... (any but ref)
for c in ./PI[0-9]*.vcf; do bgzip $c ; done
for f in ./PI[0-9]*.vcf.gz ; do tabix -p vcf $f; done
# ended up filtering with tablix instead of [bv]cftools here ; seems even slower
# bcftools --gvcf data/Cultivar/PI[0-9]*.vcf.gz --out data/Middling --regions-file middling-region.tsv
test case
tabix -R middling-region.tsv -p vcf data/Cultivar/PI92270.vcf.gz
results in 176,664 out of 3,884,467 calls kept (removed 95%)
### well there goes a whole day
takes it down to 60M calls that is 12G (uncompressed) 2.1G gzipped
# [--cache|--offline|--database]
~/GitHub/Ensembl/ensembl-vep/vep --offline --species sorghum_bicolor -in Middling/PI303658.vcf
tried snagging Kent's .vep/ cache but it is complaining
so retrying with
cd $HOME/.vep
curl -O ftp://ftp.ensemblgenomes.org/pub/plants/current/variation/vep/sorghum_bicolor_vep_49_Sorghum_bicolor_NCBIv3.tar.gz
tar xzf sorghum_bicolor_vep_49_Sorghum_bicolor_NCBIv3.tar.gz
needs the EnsembleGenome version number (49) as it
is different than the Ensembl build number (102)
~/GitHub/Ensembl/ensembl-vep/vep --offline --cache_version 49 --species sorghum_bicolor -i Middling/PI303658.vcf
is having problems because chr0N does not have a leading capital 'C'
for i in $(seq 1 9) ; do echo -e "$i\tChr0$i"; done \
>> ~/.vep/sorghum_bicolor/49_Sorghum_bicolor_NCBIv3/chr_synonyms.txt
echo -e "10\tChr10" >> ~/.vep/sorghum_bicolor/49_Sorghum_bicolor_NCBIv3/chr_synonyms.txt
lots of complaints of "super_nnn" not found in annotation sources or synonyms
but spot checks they do exist as values in
~/.vep/sorghum_bicolor/49_Sorghum_bicolor_NCBIv3/chr_synonyms.txt
I wonder if I should put the inverse in the dict... don't know.
awk -F '\t' '$2 ~ /super_[0-9]+/{print $2 "\t" $1}' \
~/.vep/sorghum_bicolor/49_Sorghum_bicolor_NCBIv3/chr_synonyms.txt >> \
~/.vep/sorghum_bicolor/49_Sorghum_bicolor_NCBIv3/chr_synonyms.txt
guess I will try it out ...
still see the complaints so that was not the solution to all the:
...
...
WARNING: Chromosome super_85 not found in annotation sources or synonyms on line 131356
...
...
so maybe it is reporting the other way around;
that the contig is not im my sample.
-----------------------------------------------------------
~/GitHub/Ensembl/ensembl-vep/vep \
--offline \
--cache_version 49 \
--species sorghum_bicolor \
--force_overwrite \
-i Middling/PI303658.vcf
For this generic particular test case there are:
~130k calls and vep finds ~16.5k hit genes
(it is pretty fast, about a minute. so maybe 6 hours compute for all of them)
the cached annotation file may not be rich enough to capture things like
regulatory regions / other features
I do see up/down stream and don't know if those are count as gene hits or not
but they might work as an approximate proxy for regulation
SIFT and PolyPhen filters are not available (don't know enough to say why)
running with `--everything` takes quite a bit longer ~ 300 seconds
bloats the output from 30M to 50M
some human genome metadata there ... not seeing it as a win
time \
for f in ./data/Middling/PI[0-9]*.vcf ; do
g=${f##*/}
# tabix -R middling-region.tsv -p vcf "$f"
~/GitHub/Ensembl/ensembl-vep/vep \
--offline \
--cache_version 49 \
--species sorghum_bicolor \
--force_overwrite \
-i "$f" \
-o "./data/MidVep/$g"
done
# real 369m23.404s 6-hrs & 10-min
du -sh ./data/MidVep/
14G (added 2G of "stuff")
What is in it? (my same generic particular sample in this case)
grep -v "^#" ./data/MidVep/PI92270.vcf | cut -f7 | sort | uniq -c | sort -nr
93733 intergenic_variant
76547 upstream_gene_variant
69342 downstream_gene_variant
18395 intron_variant
9639 coding_sequence_variant
6163 3_prime_UTR_variant
5399 5_prime_UTR_variant
24 start_lost,coding_sequence_variant
11 non_coding_transcript_exon_variant
1 start_lost,coding_sequence_variant,5_prime_UTR_variant
don't see any here but there could be premature-stops and read-through
maybe that is part of what the missing SIFT / PolyPhen filters would provide
variants in coding sequences across all cultivars
fgrep -c "coding_sequence_variant" ./data/MidVep/PI[0-9]*.vcf | cut -f2 -d ':' | sumstat.r
V1
Min. : 0
1st Qu.: 8316
Median : 9142
Mean : 9116
3rd Qu.:10126
Max. :13256
very consistent, a little less than 10k coding snps per cultivar
Note:
```
perl INSTALL.pl -a p -g list
WARNING: DBD::mysql module not found. VEP can only run in offline (--offline) mode without DBD::mysql installed
http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#requirements
WARNING: The following plugins have not been found: list
Available plugins:
AncestralAllele,Blosum62,CADD,CSN,Carol,Condel,Conservation,
DisGeNET,Downstream,Draw,ExAC,ExACpLI,FATHMM,FATHMM_MKL,FlagLRG,FunMotifs,
G2P,GO,GeneSplicer,Gwava,HGVSIntronOffset,LD,LOVD,LoF,LoFtool,LocalID,MPC,MTR,
Mastermind,MaxEntScan,NearestExonJB,NearestGene,PON_P2,Phenotypes,PostGAP,
ProteinSeqs,REVEL,ReferenceQuality,SameCodon,SingleLetterAA,SpliceAI,SpliceRegion,
StructuralVariantOverlap,SubsetVCF,TSSDistance,dbNSFP,dbscSNV,gnomADc,miRNA,
neXtProt,satMutMPRA
```
SIFT is not immediately among them
time \
> ag -v "^#" ./data/MidVep/PI[0-9]*.vcf | cut -f7 | sort | uniq -c | sort -nr
31,292,150 intergenic_variant
25,978,416 upstream_gene_variant
23,811,383 downstream_gene_variant
8,458,950 intron_variant
3,298,203 coding_sequence_variant
2,205,214 3_prime_UTR_variant
1,877,106 5_prime_UTR_variant
8,639 non_coding_transcript_exon_variant
8,638 start_lost,coding_sequence_variant
786 coding_sequence_variant,intron_variant
749 coding_sequence_variant,3_prime_UTR_variant
392 coding_sequence_variant,3_prime_UTR_variant,intron_variant
207 coding_sequence_variant,5_prime_UTR_variant
135 start_lost,coding_sequence_variant,5_prime_UTR_variant
2 coding_sequence_variant,5_prime_UTR_variant,intron_variant
real 6m38.872s
user 7m34.265s
---------------------------------------------------------------
those 3.3M snps in coding regions look like the hottest properties
how many genes?
ag "coding_sequence_variant" ./data/MidVep/PI[0-9]*.vcf |
cut -f4 |
sort |
uniq -c |
sort -nr > middling_genes.counts
wc -l < middling_genes.counts
4455
there are ~40k genes in the genome so ~10% of the genes
there are about 21 genes which must lack sufficient representation across our set
as there fewer calls than than half our count of cultivars [1-60 calls]
and there is one
119,773 SORBI_3001G353900
with over twice the average that may want a closer look.
other than that the counts of calls per gene within a line look like:
cut -c1-8 middling_genes.counts| sumstat.r
V1
Min. : 1.0
1st Qu.: 181.0
Median : 350.0
Mean : 742.8
3rd Qu.: 690.0
Max. :119773.0
the main story is the majority *could* be genes with a small handful of snps
We could maybe take the step of learning over the 4k genes as well as the 300 cultivars
it has the advantage of allowing more columns to consider whild maintaining a 10:1 shape.
want to know about snps per gene and genes per line
time\
ag "coding_sequence_variant" ./data/MidVep/PI[0-9]*.vcf |
cut -f1,4 |
sed 's|data/MidVep/\(PI[0-9]*\).vcf.*\t\(.*\)|\1\t\2|' |
sort| uniq -c |
awk '{print $2 "\t" $3 "\t" $1}' > line_gene_calls.tab
# calls per gene
cut -f3 ./line_gene_calls.tab | sumstat.r
V1
Min. : 1.000
1st Qu.: 1.000
Median : 2.000
Mean : 3.911
3rd Qu.: 3.000
Max. :645.000
# gene per line
cut -f1 ./line_gene_calls.tab |sort | uniq -c | cut -c1-8 | sumstat.r
V1
Min. : 23
1st Qu.:2105
Median :2364
Mean :2337
3rd Qu.:2611
Max. :3433
So we get a couple of thousand genes per line ...
does not contradict the previous finding of ~4.5k genes in play which contained
snps specifically chosen to split the set of lines roughly in half.
to recap
everything (1.7TB)
-> variants (w/o ref ~50GB)
-> discriminating SNPs
-> genes-w/hi-entropy-snps (discriminating)
- cultivar clusters based on gene perturbations (snp count within gene)
I wonder if I can recover sets of genes-w/ds which discriminate these lines as well
so for each of the genes, how many lines use them
then look for the ones jolly in the middle
# lines per gene
cut -f2 ./line_gene_calls.tab |sort | uniq -c | cut -c1-8 | sumstat.r
V1
Min. : 1.0
1st Qu.:177.0
Median :183.0
Mean :189.9
3rd Qu.:188.0
Max. :351.0
kinda surprised it tracks so perfectly
still not out of the woods though
the lines themselves will not be in any (biologically) meaningful sense "independent"
so the sets discriminating genes/snps will also be ... clumpy
we can cluster the lines/genes... jaccard distance for a first pass perhaps
... bi-clustering seems like it might be a legit approach here (maybe).
- omit gene SORBI_3001G353900 with its 120k calls across the board
- omit the 21 genes which contain less than 60 call across the board
awk '$1 > 100 && $1 < 100000{print $2}' middling_genes.counts > selected_genes.txt
wc -l < selected_genes.txt
4431
the only lines that look a bit weak are
cut -f1 ./line_gene_calls.tab |sort | uniq -c | sort -nr | tail -2
706 PI606706
23 PI564163
but I can just leave them in now and they can fall out on their own later.
cut -f1 ./line_gene_calls.tab |sort | uniq -c | sort -nr |cut -c9- > selected_cultivar.txt
wc -l < selected_cultivar.txt
362
# made a quick sparse representation to full dataframe converter
./scripts/sp2df.awk -F'\t' line_gene_calls.tab > gene_cultivar_call.dataframe
moved to R-lang/R-README
Ishita was not liking the R formatted data (save file)
there is still maybe another basic question something like
how many lines include a particular gene ...
the stats above preclude finding a smoking gun there...
still looking for something that shows if the collection
must be unhelpfully lopsided in some way
sqlite3 gpe
sqlite> .mode tabs
sqlite> create table lgc(line text not null, gene text not null, cnt integer not null);
sqlite> .import ./line_gene_calls.tab lgc
sqlite> select count(*) from lgc;
846,079
sqlite> create index lgc_line_idx on lgc(line);
sqlite> create index lgc_gene_idx on lgc(gene);
sqlite> select line, sum(cnt) snps from lgc group by line order by snps;
PI564163|53
PI606706|3221
PI561072|4647
PI329435|4694
...
PI455307|12919
PI329466|12941
PI569422|13053
PI562897|13256
I wonder if the cultivar PI564163 is a (nigh) reference genome.
----------------------------------------------------------------------------------------
Tried the venerable BiCat
https://academic.oup.com/bioinformatics/article/22/10/1282/237365
.. sub optimal
3 of the 7 methods did not fail with Java exceptions
- one method produced (ISA) produced a huge skew (one cluster won too much)
- Bicluster (OPSM) worked ... lots (48) overlapping clusters... maybe?
- K-means seems to work (not biclustering)
to make a list of cultivars and a cluster id
# awk -F' ' '/^P/{for(i=1;i<=NF;i++)print $i"\t" FNR}' K30-means.out > cultivar_clustK30.tab
# ... this cultivar_clustK30.tab is not as expected.
# the k-means keeps a sub set of genes but _every_ cultivar
awk -F' ' '/^S/{for(i=1;i<=NF;i++)print $i"\t" FNR}' K30-means.out > gene_clustK30.tab
awk -F' ' '/^P/{for(i=1;i<=NF;i++)print $i"\t" FNR}' OPSM_bicluster.out > cultivar_clustOPSM.tab
### back in sql land
.mode tabs
drop table if exists K30;
create table K30(gene text not null, clust integer not null);
.import ./BiCat/gene_clustK30.tab K30
create table OPSM(cultivar text not null, clust integer not null);
.import ./BiCat/cultivar_clustOPSM.tab OPSM
.mode tabs
.output cultivar_k30cluster_score.tab
select line as cultivar,clust, sum(cnt) score
from lgc join K30 on lgc.gene = k30.gene
group by 1,2
--limit 30
;
.output
select a.clust, lgc.gene, sum(lgc.cnt) num,
(select count(b.clust) from K30 b where b.cultivar = lgc.line) denom
from lgc join K30 a on lgc.line = a.cultivar
group by 1,2
;
----------------------------------------------------------------------------------------
reformat cultivar_k30cluster_score.tab to be more dataframe-ish
awk '{a[$1,$2]=$3;line[$1]++;c[$2]++}\
END{n=asorti(c);for(i=1;i<=n;i++)r=r "\t" c[i];print r;\
for(l in line){r=l;for(i=1;i<=n;i++)r=r "\t" a[l,c[i]];print r}}' \
cultivar_k30cluster_score.tab > cultivar_k30cluster_score.dataframe
########################################################################################
Function
from docs at
https://github.com/genophenoenvo/documentation
in the Phenotype section we find:
Genes relevant for the phenotypes of interest were found using Gramene and are here.
https://docs.google.com/spreadsheets/d/1ugMisjghvSfa0W_TPhA-0_6C8A0X-gwOqPZbzqjJOrg/edit#gid=1246181510
export & transform that to something useful
hmmm ~300 gene-trait pairs is most likely insufficient.
Especially it they do not happen to fall into the 10% if the genome
selected as decidable.
import as 'KG_gene_query-data.tsv' (11k)
grep "Sorghum bicolor" KG_gene_query-data.tsv | wc -l
64
ahh ... most genes are not for our species
grep "Sorghum bicolor" KG_gene_query-data.tsv | cut -f 1 | sort | uniq -c | sort -nr
32 days to flowering
17 canopy and biomass
15 days to flag leaf emergence
60 genes have three phenotypes
fgrep -n -f <(grep "Sorghum bicolor" KG_gene_query-data.tsv | cut -f 4 |sort -u) ../GitHub/genomic_data/FriendsOfEntropy/selected_genes.txt
472:SORBI_3004G207000
596:SORBI_3008G136601
975:SORBI_3001G177100
1041:SORBI_3010G191800
2652:SORBI_3003G046800
2949:SORBI_3003G028300
3267:SORBI_3009G228700
Seven of the genes with function are in the set of 4.4k selected
not getting warm fuzzies here
##.###########################################################################
2021 Feb:
Asked to build a phylogenetic tree of the cultivars based on my F.O.E. approach
a tree could not really be ancestor based anymore as genetics
have been primaraly ignored throughout the filtering process.
So it is better to just think of it as heirachiacal clustering.
try generating (brute force) jaccard distance with several cross joins in sqlite.
blows out remaing disk space on laptop.
cp /dev/null cultivar_gene_onehot.txt
for col in $(seq 2 363) ; do cut -f "$col" gene_cultivar_call.dataframe | tr '\n' ' '|
sed -n 's| [2-9][0-9]*| 1|gp' >> "cultivar_gene_onehot.txt"; echo >> "cultivar_gene_onehot.txt"; done
pairwise reduce the onehot genes to their Jaccard distance
scripts/onehot_jaccard.awk -F ' ' cultivar_gene_onehot.txt > cultivars_jdistance.txt
wc -l <cultivars_jdistance.txt
65,341
should be something like ((362^2)-362)/2 which is ... 65,341
tail cultivars_jdistance.txt
358 359 PI656051 PI656056 1110 2248 0.493772
358 360 PI656051 PI656065 1004 3034 0.330916
358 361 PI656051 PI673843 1107 2892 0.38278
358 362 PI656051 PI92270 923 3323 0.277761
359 360 PI656056 PI656065 934 2926 0.319207
359 361 PI656056 PI673843 1015 2806 0.361725
359 362 PI656056 PI92270 857 3211 0.266895
360 361 PI656065 PI673843 1386 3115 0.444944
360 362 PI656065 PI92270 1288 3460 0.372254
361 362 PI673843 PI92270 1225 3484 0.351607
are there any pairs of cultivars with _no_ genes in common?
fgrep -c "NAN" cultivars_jdistance.txt
0
no there are not.
most similar?
sort -k7,7nr cultivars_jdistance.txt | head
92 93 PI329300 PI329301 2678 2738 0.978086
205 353 PI540518 PI656023 1792 1877 0.954715
91 92 PI329299 PI329300 2607 2738 0.952155
203 204 PI535795 PI535796 1957 2056 0.951848
91 93 PI329299 PI329301 2609 2742 0.951495
315 316 PI620072 PI620157 2230 2350 0.948936
182 262 PI524475 PI569459 2357 2503 0.94167
257 277 PI569453 PI570087 2252 2411 0.934052
185 186 PI526905 PI527045 1860 1998 0.930931
43 48 PI156463 PI157033 1858 1998 0.92993
least similar?
sort -k7,7n cultivars_jdistance.txt | head
137 236 PI329710 PI564163 6 3165 0.00189573
236 310 PI564163 PI595699 8 2519 0.00317586
195 236 PI534120 PI564163 8 2426 0.00329761
57 236 PI181080 PI564163 8 2387 0.00335149
126 236 PI329585 PI564163 9 2676 0.00336323
208 236 PI552851 PI564163 8 2329 0.00343495
110 236 PI329506 PI564163 9 2615 0.00344168
179 236 PI521280 PI564163 8 2234 0.00358102
145 236 PI330168 PI564163 10 2726 0.00366838
79 236 PI266927 PI564163 9 2441 0.00368701
summary stats on jaccard distances
cut -f7 cultivars_jdistance.txt | sumstat.r
V1
Min. :0.001896
1st Qu.:0.315296
Median :0.356688
Mean :0.367557
3rd Qu.:0.402389
Max. :0.978086
summary stats on size of paiwise union of gene sets per cultivar
cut -f6 cultivars_jdistance.txt | sumstat.r
V1
Min. : 716
1st Qu.:3265
Median :3456
Mean :3417
3rd Qu.:3632
Max. :4239
summary stats on size of paiwise intersection of gene sets per cultivar
cut -f5 cultivars_jdistance.txt | sumstat.r
V1
Min. : 6
1st Qu.:1071
Median :1235
Mean :1258
3rd Qu.:1405
Max. :2811
# Record an indexing of the cultivars for distance matrix and h-trees
cut -f1 -d ' ' cultivar_gene_onehot.txt| grep -n . | tr ':' '\t' > index_cultivar.tsv
cut -f1,2,7 cultivars_jdistance.txt > jaccard-dist-mat-sparse.tsv
######################################################################################
old: (... everything old is new again)
- consider cohorts that share snps and how well they correlate with the phylogenitic trees
- tag each cultivar with a id of the clustivar phylogenitic cluster it belongs to
use Kents tassle distance matrices to cluster?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment