These examples of regular expressions are taken largely from our book Practical computing for Biologists. More information is available at http://practicalcomputing.org.
This document can be accessed via the delightful github url shortener at https://git.io/dine
Given the following names:
Agalma elegans
Frillagalma vitiazi
Cordagalma tottoni
Shortia galacifolia
Mus musculus
Challenge:
Shorten to A. elegans
format
Can't just search and replace galma
with .
###############################################################
Introduce \w
wildcard to delete compass directions
+40 46'N +014 15'E
+21 17'N -157 52'W
Try \w
first
then try '\w
replaced by '
###############################################################
5th
3rd
2nd
4th
Introduce capture ()
Reduce to just numbers:
(\w)\w\w
\1
############################################################### Revisit original challenge:
Agalma elegans
Frillagalma vitiazi
Cordagalma tottoni
Shortia galacifolia
Mus musculus
Challenge:
Shorten to A. elegans
format
Introduce +
quantifier
(\w)\w+ (\w+) \1. \2
###############################################################
Introduce .
escape \
Exercise 1
>CAA58790.1= green fluorescent protein [Aequorea victoria]
MSKGEELFTGVVPILVELDGDVNGQKFSVRGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFLKSAMPEGYVQERTIFYKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKMEYNYNSHNVYIMGDKPKNGIKVNFKIRHNIKDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSQDPHGKRDHMVLLEFVTSAGITHGMDELYK
>AAZ67342.1= GFP-like red fluorescent protein [Corynactis californica]
MSLSKQVLPRDVKMRYHMDGCVNGHQFIIEGEGTGKPYEGKKILELRVTKGGPLPFAFDILSSVFTYGNRCFCEYPEDMPDYFKQSLPEGHSWERTLMFEDGGCGTASAHISLDKNCFVHKSTFHGVNFPANGPVMQKKTLNWEPSSELITAGDGILKGDVTMFLMLEGGHRLKCQFTTSYKAKKAVKMPPNHIIEHRLVRKEVADAVQIQEHAVAKHFIV
>ACX47247.1= green fluorescent protein [Haeckelia beehleri]
MEFEPEFFNKPVPLEMTLRGCVNGKEFMIFGKGEGDASKGNIKGKWILSHSEDGKCPMSWAVLAPTFAYGFKVFAKYPKDFAHFWQDCMPVGYSERRITRFGRLSGNDDIEQEGIMNTYHEVQMRERMVGDEITWIVESRVKLDATINENSPILMNDGLSEYRPNLERTVSFEDGLKNYSQFFYPIKDCETKDYIIANQMTHERPLSKCNKPGRLPPSHFKRTDLEQWKDSKEDKDHIVQEEITAFLLQAQDKDLQSLGIGM
>ABC68474.1= red fluorescent protein [Discosoma sp. RC-2004]
MRSSKNVIKEFMRFKVRMEGTVNGHEFEIEGEGEGRPYEGHNTVKLKVTKGGPLPFAWDILSPQFQYGSKVYVKHPADIPDYKKLSFPEGFKWERVMNFEDGGVVTVTQDPSLQDGCFIYKVKFIGVNFPSDGPVMQKKTMGWEASTERLYPRDGVLKGEIHKALKLKDGGHYLVEFKTIYMAKKPVQLPGYYYVDSKLDITSHNKDYTIVEQYERTEGRHHLFLKAELGSNVGER
>AAQ01183.1= green fluorescent protein 1 [Pontellina plumata]
MPAMKIECRISGTLNGVVFELVGGGEGIPEQGRMTNKMKSTKGALTFSPYLLSHVMGYGFYHFGTYPSGYENPFLHAANNGGYTNTRIEKYEDGGVLHVSFSYRYEAGRVIGDFKVVGTGFPEDSVIFTDKIIRSNATVEHLHPMGDNVLVGSFARTFSLRDGGYYSFVVDSHMHFKSAIHPSILQNGGSMFAFRRVEELHSNTELGIVEYQHAFKTPTAFA
Challenge:
Convert the headers from the format: >CAA58790.1= GFP [Aequorea victoria]
To: >CAA58790_Aequorea
(>\w+).+\[(\w+) \w+\]
\1_\2