@IsmailM
Forked from ilevantis/bedtools_cheatsheet.md
Created May 2, 2019 23:05
Bedtools cheatsheet

General:

Tools Description
flank Create new intervals from the flanks of existing intervals.
slop Adjust the size of intervals.
shift Adjust the position of intervals.
subtract Remove intervals based on overlaps b/w two files.
complement Extract intervals not represented by an interval file.
closest Find the closest, potentially non-overlapping interval.
intersect Find overlapping intervals in various ways.
window Find overlapping intervals within a window around an interval.
cluster Cluster (but don't merge) overlapping/nearby intervals.
merge Combine overlapping/nearby intervals into a single interval.
map Apply a function to a column for each overlapping interval.
groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy")

Formatting:

Notes: BED file format, GFF vs BED indexing

Tools Description
getfasta Use intervals to extract sequences from a FASTA file.
maskfasta Use intervals to mask sequences from a FASTA file.
sort Order the intervals in a file.
bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.
bamtofastq Convert BAM records to FASTQ records.
bamtobed Convert BAM alignments to BED (& other) formats.
bedpetobam Convert BEDPE intervals to BAM records.
bedtobam Convert intervals to BAM records.

Statistics:

Tools Description
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
random Generate random intervals in a genome.
reldist Calculate the distribution of relative distances b/w two files.
shuffle Randomly redistribute intervals in a genome.
makewindows Makes adjacent or sliding windows across a genome or BED file.
nuc Profile the nucleotide content of intervals in a FASTA file.

Coverage:

Tools Description
annotate Annotate coverage of features from multiple files.
coverage Compute the coverage over defined intervals.
genomecov Compute the coverage over an entire genome.
multicov Counts coverage from multiple BAMs at specific intervals.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.

Common flags:

  • -s, -S : Require same strandedness or opposite strandedness, respectively.
  • -f, -F : Minimum overlap required as a fraction of A or a fraction of B respectively.
  • -r, -e : Require that the minimum overlap be satisfied for A AND B, or A OR B respectively.
  • -split : Treat "split" BAM or BED12 entries as distinct BED intervals.
  • -abam : A is a BAM file.
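
The overlap flags are easier to reason about with numbers. The following is a minimal sketch in plain awk (not bedtools itself) of the arithmetic behind -f, -F and -r for one hypothetical pair of intervals, A = [100, 200) and B = [150, 400). The 0.5 threshold and the variable names are illustrative assumptions.

```shell
# Hypothetical intervals (0-based, half-open, as in BED): A=[100,200), B=[150,400)
overlap_report=$(awk -v as=100 -v ae=200 -v bs=150 -v be=400 -v min=0.5 'BEGIN {
    s = (as > bs) ? as : bs          # overlap start = max of the starts
    e = (ae < be) ? ae : be          # overlap end   = min of the ends
    ov = (e > s) ? e - s : 0         # overlapping bases (0 if disjoint)
    fa = ov / (ae - as)              # fraction of A covered (-f tests this)
    fb = ov / (be - bs)              # fraction of B covered (-F tests this)
    printf "overlap=%d fracA=%.2f fracB=%.2f\n", ov, fa, fb
    printf "-f %.1f: %s\n", min, ((fa >= min) ? "pass" : "fail")
    printf "-F %.1f: %s\n", min, ((fb >= min) ? "pass" : "fail")
    # -r makes the -f threshold reciprocal: it must hold for A and for B
    printf "-f %.1f -r: %s\n", min, ((fa >= min && fb >= min) ? "pass" : "fail")
}')
printf '%s\n' "$overlap_report"
```

Here the 50 bp overlap covers half of A but only a fifth of B, so the pair would satisfy -f 0.5 alone but not -F 0.5 or -f 0.5 -r.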

General

flank, slop

Create new intervals from the flanks of existing intervals. (flank Docs)

Adjust the size of intervals. (slop Docs)

IN           ▓▓▓▓▓       ▓▓▓
Flank      ██     ██   ██   ██
Slop       █████████   ███████

$ bedtools flank [OPTIONS] -i <BED/GFF/VCF> -g <GENOME> [-b or (-l and -r)]

$ bedtools slop [OPTIONS] -i <BED/GFF/VCF> -g <GENOME> [-b or (-l and -r)]

OPTIONS .
-b, -l, -r Flank/extend regions by x bp on both sides, on the left, or on the right respectively.
-s Define -l and -r based on strand.
-pct Define -l and -r as a fraction of the feature's length.

shift

Adjust the position of intervals, while respecting chromosome edges. (Docs).

IN      ██   ██      ████
OUT        ██   ██      ████

$ bedtools shift [OPTIONS] -i <BED/GFF/VCF> -g <GENOME> [-s or (-m and -p)]

OPTIONS .
-s Number of BPs to shift the features.
-m, -p Number of BPs to shift the features on the - strand or + strand, respectively.
-pct Define -s, -m and -p as a fraction of the feature's length.

subtract

Remove intervals based on overlaps b/w two files. (Docs)

A        ▓▓▓▓▓▓▓▓▓▓   ▓▓▓     ▓▓▓▓▓▓
B          ▓▓▓▓           ▓▓▓▓▓▓▓  
A sub B  ██    ████   ███        ███

$ bedtools subtract [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>

OPTIONS .
-A Remove entire feature if any overlap.
common strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e

complement

Extract intervals not represented by an interval file. (Docs)

IN           ▓▓▓▓▓     ▓▓▓     ▓▓▓▓▓▓
          ▓▓▓▓            ▓▓▓  
OUT  █████        █████      ██

$ bedtools complement -i <BED/GFF/VCF> -g <GENOME>

closest

Find the closest, potentially non-overlapping interval. (Docs)

A            █████   ✓
B   ████            ███

$ bedtools closest [OPTIONS] -a <FILE> -b <FILE1, FILE2, ..., FILEN>

OPTIONS .
-d Also report distance from A to the closest feature.
-k Report the k closest hits. Default: 1.
-io Ignore features in B that overlap A.
-iu, -id Ignore features in B that are upstream or downstream, respectively, of features in A. Both require -D.
common strandedness: -s, -S

intersect

Find overlapping intervals in various ways. (Docs)

A           ██████████
B         ▓▓▓▓    ▓▓        ▓▓▓  
A int B     ▓▓    ▓▓

$ bedtools intersect [OPTIONS] -a <BAM/BED/GFF/VCF> -b <FILE1, FILE2, ..., FILEN>

OPTIONS .
-wa, -wb Write the original entry in A/original entry in B, respectively, for each overlap.
-loj For each feature in A report each overlap with B. Report a NULL feature for B if no overlap.
-wao Report A and B features and no. of bp overlap between them.
-u Only report each overlapping A feature once.
-c For each entry in A, report count of overlapping B features.
-v Only report features in A not overlapping B.
common strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e; bam/bed12: -abam, -split

window

Find overlapping intervals within a window around an interval. (Docs)

A           ┌────█████────┐
B         ▓▓▓▓    ▓▓▓        ▓▓▓  
A win B   ▓▓▓▓    ▓▓▓

$ bedtools window [OPTIONS] [-a|-abam] -b <BED/GFF/VCF>

OPTIONS .
-w, -l, -r Window size in bp added to each A feature: in both directions (-w), upstream only (-l), or downstream only (-r). Default: 1000 bp.
-sw Define -l and -r based on strand.
-u Only report each overlapping A feature once.
-c For each entry in A, report count of overlapping B features.
-v Only report features in A not overlapping B.
common strandedness: -sm, -Sm; bam: -abam

cluster

Cluster (but don't merge) overlapping/nearby intervals. (Docs)

BED        ████     █████  ███
clustID   └─#1─┘   └────#2────┘

$ bedtools cluster [OPTIONS] -i <BED/GFF/VCF>

OPTIONS .
-d Max distance between features in cluster.
common strandedness: -s, -S

Aggregation Tools

For merge, groupby, and map the following* aggregation functions (specified by -o) can be applied to a column/columns specified by -c: sum, count, count_distinct, min, max, mean, median, mode, antimode, stdev, sstdev, collapse, distinct, first, last

*Other functions are available.

merge

Combine overlapping/nearby intervals into a single interval. (Docs)

IN       ▓▓▓      ▓        ▓▓··d··▓▓▓
      ▓▓▓▓         ▓▓        
OUT   ██████      ███      ██████████

$ bedtools merge [OPTIONS] -i <BED/GFF/VCF/BAM>

OPTIONS .
-s Require same strandedness.
-S Force merge for one specific strand only. Options: <+/->.
-d Maximum distance between features to be merged.
common aggregation: -o, -c
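
To make the merge rule concrete, here is a rough awk sketch of the same idea: sort by chromosome and start, then extend the current run while the next feature starts within d bp of its end. This is an illustration under stated assumptions (three-column input, d=0 so overlapping and book-ended features are joined), not bedtools' own implementation. The input rows are hypothetical.

```shell
# Hypothetical unsorted input; columns are chrom, start, end (tab-separated).
merged=$(printf 'chr1\t30\t40\nchr1\t1\t5\nchr1\t3\t8\nchr1\t8\t12\nchr2\t2\t6\n' |
    sort -k1,1 -k2,2n |
    awk -v d=0 'BEGIN { OFS = "\t" }
        NR == 1 { c = $1; s = $2 + 0; e = $3 + 0; next }
        # Same chromosome and within d bp of the current run: extend it.
        $1 == c && $2 + 0 <= e + d { if ($3 + 0 > e) e = $3 + 0; next }
        # Otherwise emit the finished run and start a new one.
        { print c, s, e; c = $1; s = $2 + 0; e = $3 + 0 }
        END { if (NR > 0) print c, s, e }')
printf '%s\n' "$merged"
```

Like this sketch, bedtools merge expects its input sorted by chromosome and then by start position.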

map

Apply a function to a column for each overlapping interval. (Docs)

        score = 3  1     5                 4      6
B              ▓▓▓ ▓   ▓▓▓▓▓             ▓▓▓▓▓▓ ▓▓▓▓
A               ██████████                 ███████
B map(mean) A   ██████████ mean(3,1,5)=3   ███████ mean(4,6)=5

$ bedtools map [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>

OPTIONS .
common aggregation: -o, -c; strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e; bed12: -split

groupby

Group by common cols & summarize other cols (~ SQL "groupBy"). (Docs)

$ bedtools groupby [OPTIONS] -i <BED> -g <groupby columns> -c <op. column> -o <operation>

OPTIONS .
common aggregation: -o, -c
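
As a rough picture of what a grouped summary produces, the awk sketch below sums column 2 for each value of column 1. It only mimics the result of a `-g 1 -c 2 -o sum` call on hypothetical data; the file name in the closing comment is likewise an assumption.

```shell
# Hypothetical two-column input (chrom, score), already grouped by column 1.
grouped=$(printf 'chr1\t3\nchr1\t1\nchr1\t5\nchr2\t4\nchr2\t6\n' |
    awk 'BEGIN { FS = OFS = "\t" }
    {
        if (!($1 in sum)) order[++n] = $1   # remember first-seen group order
        sum[$1] += $2                       # running sum per group
    }
    END { for (i = 1; i <= n; i++) print order[i], sum[order[i]] }')
printf '%s\n' "$grouped"

# The comparable bedtools call on a file with the same rows would be:
#   bedtools groupby -i scores.txt -g 1 -c 2 -o sum
```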

Formatting

BED file format

Column e.g. Definition
chrom Sc112.1 <STR> name of chromosome/scaffold
start 2134 <INT> start position of feature
end 2565 <INT> end position of feature
name gene123 <STR> name of feature
score 544 <NUM> score for the feature e.g. bit score
strand + <+/-/.> strand on which feature is located
thickStart 2235
thickEnd 2489
itemRgb 255,0,0
blockCount 2
blockSizes 150,80
blockStarts 0,351
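
The block columns of a BED12 record can be unpacked into absolute coordinates, which is roughly what bed12tobed6 reports. The awk sketch below builds a hypothetical record from the example values in the table, with blockStarts set to 0,351. That value is an assumption of this sketch, chosen because block starts are offsets from the feature start and the last block has to end at the feature end (2134 + 351 + 80 = 2565).

```shell
# Hypothetical BED12 record: blockSizes 150,80 and blockStarts 0,351
# (offsets relative to the start in column 2).
blocks=$(printf 'Sc112.1\t2134\t2565\tgene123\t544\t+\t2235\t2489\t255,0,0\t2\t150,80\t0,351\n' |
    awk 'BEGIN { FS = OFS = "\t" }
    {
        split($11, size, ",")      # blockSizes
        split($12, offset, ",")    # blockStarts
        # One BED6 line per block: absolute start, absolute end, name, score, strand
        for (i = 1; i <= $10; i++)
            print $1, $2 + offset[i], $2 + offset[i] + size[i], $4, $5, $6
    }')
printf '%s\n' "$blocks"
```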

GFF vs BED indexing

GFF    ┌─1   2   3─┐ 4   ...
         G---A---T   C   ...
BED    └─0   1   2 └─3   ...
              gff -> bed       bed -> gff
new_start =   gff_start - 1    bed_start + 1
new_end   =   gff_end          bed_end
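
The conversion can be sketched with awk. bedtools reads GFF directly, so this matters mainly when files are converted or compared by hand. The GFF line below is hypothetical, and reusing the attributes column as the BED name is a shortcut taken for brevity.

```shell
# Hypothetical GFF line: seqid, source, type, start, end, score, strand, phase, attributes
gff_line=$(printf 'Sc112.1\tsrc\tgene\t2135\t2565\t.\t+\t.\tID=gene123')

# GFF (1-based, closed) -> BED6 (0-based, half-open): start - 1, end unchanged.
bed_line=$(printf '%s\n' "$gff_line" |
    awk 'BEGIN { FS = OFS = "\t" } { print $1, $4 - 1, $5, $9, $6, $7 }')
printf '%s\n' "$bed_line"

# BED -> GFF coordinates: start + 1, end unchanged.
gff_coords=$(printf '%s\n' "$bed_line" |
    awk 'BEGIN { FS = OFS = "\t" } { print $2 + 1, $3 }')
printf '%s\n' "$gff_coords"
```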

getfasta

Use intervals to extract sequences from a FASTA file. (Docs)

FASTA   ACTGATCATGATACATGATACCATTAGGATACAATA
BED         ████       █████      ████
OUTFA       ATCA       TGATA      GGAT

$ bedtools getfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF>

OPTIONS .
-name Use “name” column in BED file for FASTA headers in the output.
-s Reverse complement features on "-" strand. Default: strand information ignored.
-split Given BED12 input, concatenate the sequences from BED blocks (e.g., exons).

maskfasta

Use intervals to mask sequences from a FASTA file. (Docs)

FASTA   ACTGATCATGATACATGATACCATTAGGATACAATA
BED           ████       █████      ████
FASTA'  ACTGATNNNNATACATGNNNNNATTAGGNNNNAATA

$ bedtools maskfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF> -fo <output FASTA>

OPTIONS .
-soft Soft-mask (convert to lower-case bases) instead of masking with "N".
-mc Specify masking character.

sort

Order the intervals in a file. (Docs)

$ bedtools sort [OPTIONS] -i <BED/GFF/VCF>

OPTIONS .
-sizeA Sort by feature size (asc).
-sizeD Sort by feature size (desc).
-chrThenSizeA Sort by chromosome (asc), then by feature size (asc).
-chrThenSizeD Sort by chromosome (asc), then by feature size (desc).
-chrThenScoreA Sort by chromosome (asc), then by score (asc).
-chrThenScoreD Sort by chromosome (asc), then by score (desc).
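
For plain chromosome-then-start ordering, which is what tools requiring sorted input generally expect, the standard Unix sort is a common alternative to bedtools sort. A small sketch with hypothetical file names and rows:

```shell
# Hypothetical unsorted BED lines.
printf 'chr2\t50\t60\nchr1\t200\t300\nchr1\t5\t10\n' > unsorted.bed

# Chromosome (lexical), then start (numeric).
LC_ALL=C sort -k1,1 -k2,2n unsorted.bed > sorted.bed
cat sorted.bed
```

Setting LC_ALL=C here is a choice made to keep the chromosome ordering byte-wise and reproducible across machines.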

Statistics

jaccard

Calculate the Jaccard statistic b/w two sets of intervals. (Docs)

A                 ███████████  15bp
B               ▓▓▓▓ 10bp ▓▓ 4bp       ▓▓▓ 8bp
A int B           ▓▓ 6bp  ▓▓ 4bp
Jaccard(A,B)     (6+4)/((15+10+4+8)-(6+4)) =  0.37     

$ bedtools jaccard [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>

OPTIONS .
common strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e; bed12: -split
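
The arithmetic in the diagram can be checked with a short awk sketch. The numbers come from the diagram; the variable names are illustrative.

```shell
# Numbers taken from the diagram: intersection = 6 + 4 bp,
# total length of A and B = 15 + 10 + 4 + 8 bp.
jaccard=$(awk 'BEGIN {
    inter = 6 + 4
    uni = (15 + 10 + 4 + 8) - inter   # union = sum of lengths - intersection
    printf "%.2f", inter / uni
}')
printf 'jaccard = %s\n' "$jaccard"
```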

random

Generate random intervals in a genome. (Docs)

$ bedtools random [OPTIONS] -g <GENOME>

OPTIONS .
-l The length of the intervals to generate. Default: 100
-n The number of intervals to generate. Default: 1,000,000
-seed Supply an integer seed for the shuffling.

reldist

Calculate the distribution of relative distances b/w two files. (Docs)

                ───────r──────
A            ▓▓▓▓▓▓         ▓▓▓▓
B                      ███
                ───d1─── ──d2──
reldist = min(d1,d2)/r

$ bedtools reldist [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF>

OPTIONS .
-detail Instead of a summary, report relative distance for each region in A.
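
A worked instance of the formula in the diagram, using made-up distances (d1 = 40 bp, d2 = 10 bp, r = 50 bp). It only illustrates the ratio and does not reproduce how bedtools measures the distances internally.

```shell
# Hypothetical distances in bp, following the diagram: d1 and d2 are the gaps
# from a B feature to its two flanking A features, r is the span between them.
reldist=$(awk -v d1=40 -v d2=10 -v r=50 'BEGIN {
    m = (d1 < d2) ? d1 : d2      # distance to the nearer A feature
    printf "%.2f", m / r
}')
printf 'reldist = %s\n' "$reldist"
```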

shuffle

Randomly redistribute intervals in a genome. (Docs)

$ bedtools shuffle [OPTIONS] -i <BED/GFF/VCF> -g <GENOME>

OPTIONS .
-excl BED file with regions into which features won't be shuffled.
-incl BED file with regions into which features will be shuffled.
-chrom Keep features on the same chromosome.
-chromFirst Distribute features ~uniformly across chroms, not across total sequence.
-noOverlapping Don't allow shuffled intervals to overlap.

Coverage

annotate

Annotate coverage of features from multiple files. (Docs)

$ bedtools annotate -i variants.bed -files genes.bed conserve.bed known_var.bed
chr1  100 200 nasty 1 - 0.500000  1.000000  0.300000
chr2  500 1000  ugly  2 + 0.000000  0.600000  1.000000

$ bedtools annotate [OPTIONS] -i <BED/GFF/VCF> -files FILE1 FILE2 FILE3 ... FILEn

OPTIONS .
-counts Report count of features that overlap -i in each file. Default: report fraction of -i covered by each file.
-both Report counts & fractions for each file.
common strandedness: -s, -S.

coverage

Compute the coverage over defined intervals. (Docs)

BED FILE A  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓     ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓     ▓▓▓▓▓▓  
BED File B  ████ ████              ██             █████████
              ████████                                      
Result      [  N=3, 10/15 ]     [  N=1, 2/15  ]    [N=1,6/6]

$ bedtools coverage [OPTIONS] -a <BAM/BED/GFF/VCF> -b <FILE1, FILE2, ..., FILEN>

OPTIONS .
-d Report the depth at each position in each A feature.
common strandedness: -s, -S; overlap: -f, -F; overlap mode: -r, -e; bam/bed12: -split, -abam