$ cat contigs.fasta
>NODE_1_length_869844_cov_1135.34
ACTGNacgtn
>NODE_2_length_576386_cov_975.882
acgtn
- Converting FASTA to tabular format using SeqKit (http://bioinf.shenwei.me/seqkit/)
Note that seqkit fx2tab converts FASTA to a 3-column tabular format (name, sequence, quality).
For FASTA input the quality column is empty, so any column added later becomes the 4th column.
$ seqkit fx2tab contigs.fasta
NODE_1_length_869844_cov_1135.34 ACTGNacgtn
NODE_2_length_576386_cov_975.882 acgtn
- Retrieving coverage as a new column using csvtk (http://bioinf.shenwei.me/csvtk/)
$ seqkit fx2tab contigs.fasta | csvtk mutate -H -t -f 1 -p "cov_(.+)"
NODE_1_length_869844_cov_1135.34 ACTGNacgtn 1135.34
NODE_2_length_576386_cov_975.882 acgtn 975.882
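To see what the pattern "cov_(.+)" captures without running the whole pipeline, here is a minimal sketch using sed on one header. It is only an illustration, not what csvtk does internally; the variable names are mine, and the header is copied from the example input.

```shell
# Header copied from the example contigs.fasta above.
header='NODE_1_length_869844_cov_1135.34'

# One capture group: keep only what follows "cov_" up to the end of the line.
cov=$(printf '%s\n' "$header" | sed -E 's/.*cov_(.+)$/\1/')

echo "$cov"
```

The captured text is what ends up in the new column.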
- Filtering by coverage (4th column) using csvtk or awk
# seqkit fx2tab contigs.fasta | csvtk mutate -H -t -f 1 -p "cov_(.+)" | awk -F "\t" '$4>=1000'
$ seqkit fx2tab contigs.fasta | csvtk mutate -H -t -f 1 -p "cov_(.+)" | csvtk filter2 -H -t -f '$4>=1000'
NODE_1_length_869844_cov_1135.34 ACTGNacgtn 1135.34
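One thing to watch with the filter expression is shell quoting: inside double quotes the shell expands $4 itself, so the tool never sees the column reference. A small sketch of the difference, assuming a shell with no positional parameters set (the variable names are just for illustration):

```shell
# Simulate a shell with no positional parameters.
set --

double="$4>=1000"    # the shell expands $4 (empty here) before the tool sees it
single='$4>=1000'    # passed through literally

printf 'double-quoted: %s\n' "$double"
printf 'single-quoted: %s\n' "$single"
```

So the expression given to awk or csvtk filter2 should be in single quotes.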
- Converting tabular format back to FASTA format
$ seqkit fx2tab contigs.fasta | csvtk mutate -H -t -f 1 -p "cov_(.+)" | awk -F "\t" '$4>=1000' | seqkit tab2fx
>NODE_1_length_869844_cov_1135.34
ACTGNacgtn
Thanks for posting this. It was quite helpful. Based on your code, I reached this solution:
seqkit fx2tab contigs.fasta | csvtk mutate -H -t -f 1 -p "cov_(.+)" | csvtk mutate -H -t -f 1 -p "length_([0-9]+)" | awk -F "\t" '$4>=10 && $5>=500' | seqkit tab2fx > filtered_contigs.fasta
for filtering assembled contigs by both coverage (4th column, at least 10) and length (5th column, at least 500).
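If seqkit and csvtk are not available, roughly the same filter can be sketched in plain awk. This is an alternative approach, not the pipeline above: filter_contigs is a hypothetical helper name, and it assumes headers of the form NODE_<n>_length_<len>_cov_<cov> as in the example input. Unlike the tabular route, it prints sequence lines unchanged, so multi-line records keep their wrapping.

```shell
# filter_contigs MIN_COV MIN_LEN FASTA   (hypothetical helper, awk only)
filter_contigs() {
    awk -v min_cov="$1" -v min_len="$2" '
        /^>/ {
            # Reset per record, then read "length_<int>" and "cov_<float>"
            # from the underscore-separated header.
            len = 0; cov = 0
            n = split(substr($0, 2), part, "_")
            for (i = 1; i < n; i++) {
                if (part[i] == "length") len = part[i + 1]
                if (part[i] == "cov")    cov = part[i + 1]
            }
            # Force numeric comparison with "+ 0".
            keep = (cov + 0 >= min_cov + 0 && len + 0 >= min_len + 0)
        }
        keep    # print header and sequence lines of contigs that pass
    ' "$3"
}
```

Usage would look like `filter_contigs 10 500 contigs.fasta > filtered_contigs.fasta`, with the thresholds as the first two arguments.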