Ok, so, you might know about JavaScript regular expressions. Well, here is a tutorial about them, but written by a 13 year old, so it isn't actually any good!
Regular expressions go between `/` characters. Here is an example, `/hi/`.
Ok, now then. Let's learn how to match the string `abc`. Well, that's quite simple: `/abc/`. Yey! So putting letters next to each other makes them match one after the other.
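If you want to try this yourself, here is a small sketch. It uses the `test` method, which just says whether the pattern is found anywhere in a string; the sample strings are made up for illustration.

```javascript
// /abc/ needs an "a", then a "b", then a "c", right next to each other.
const pattern = /abc/;

const exact = pattern.test("abc");     // the letters in the right order
const shuffled = pattern.test("acb");  // same letters, wrong order
const inside = pattern.test("xabcx");  // abc hiding inside a longer string

console.log(exact);    // true
console.log(shuffled); // false
console.log(inside);   // true
```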
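Ok, now, after the second `/` we can put a `g` to make it match globally, that is, it keeps going after the first match and finds every `abc` in the string instead of just the first one. (Even without the `g`, `/abc/` can already find the `abc` in `xyzabcghi`.) With the flag it looks like `/abc/g`.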
Great! Eh?
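Here is a rough sketch of what the `g` changes, using the string method `match`. The test string is made up (the earlier example with an extra `abc` stuck on the end) so that there is more than one thing to find.

```javascript
const text = "xyzabcghiabc";

// Without g, match() stops at the first match and tells us where it was.
const first = text.match(/abc/);
console.log(first[0]);    // "abc"
console.log(first.index); // 3

// With g, match() returns every match in the string.
const all = text.match(/abc/g);
console.log(all); // ["abc", "abc"]
```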
What if we want to match either an `a`, a `b`, or a `c`, but not all three one after the other? Well, if you put them into these square brackets (`[]`) then you create what's called a character class, which is a group of characters, where any one of them could be matched! Great, right?
`/[abc]/g`, now that matches `a`, `b`, or `c`, one character at a time. Great, right?
They are great, aren't they?
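A quick sketch of a character class in action. The sample string `"a cab"` is just an illustration.

```javascript
const sample = "a cab";

// [abc] matches one character at a time: any a, any b, or any c.
const found = sample.match(/[abc]/g);
console.log(found); // ["a", "c", "a", "b"]

// Without the brackets, the letters would have to appear together, in order.
const together = /abc/.test(sample);
console.log(together); // false
```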
Well, I want to match a single digit! This should be easy, we already know how to make a character class.
`/[0123456789]/g`. Done, right?
Well, yes, it works. But it's a bit long, isn't it?
I wish there was a way of saying "a number between 0 and 9". Well, it turns out there is! Yey!
`/[0-9]/g`. Wow, that's much shorter. What if I want to match a digit, or a decimal point? Well, we can do that! `/[0-9.]/g`.
Huh, that looks a bit weird. What happened? Well, remember that `[0-9]` means `[0123456789]`, so `[0-9.]` means `[0123456789.]`.
That makes sense.
Can I do that, but without using all the numbers? Let's say I have a regular expression, `[34567]`. How can we shorten that?
Well, `[3-7]` is the answer! Yey!
What about letters, can we do the alphabet? Yes! `[a-z]`. WOW!
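To see the ranges doing their thing, here is a little sketch. All of the sample strings are invented for the example.

```javascript
// [0-9] is any single digit.
const digits = "Room 42, floor 7".match(/[0-9]/g);
console.log(digits); // ["4", "2", "7"]

// [0-9.] is any single digit or a dot.
const price = "cost: 3.50".match(/[0-9.]/g);
console.log(price); // ["3", ".", "5", "0"]

// [3-7] is only the digits 3, 4, 5, 6 and 7.
const middle = "0123456789".match(/[3-7]/g);
console.log(middle); // ["3", "4", "5", "6", "7"]

// [a-z] is lowercase letters only, so the capital H is skipped.
const letters = "Hi there".match(/[a-z]/g);
console.log(letters); // ["i", "t", "h", "e", "r", "e"]
```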
So, now that we're going for letters, we might want to be able to not care about whether a letter is uppercase or lowercase.
The way we do that is by putting an `i` after the second `/`. So, let's say we have `/abc/g`, which matches `abc` ONLY. If we do `/abc/gi` (or `/abc/ig`, it doesn't matter), then we can match
- `abc` (still)
- `abC`
- `aBc`
- `aBC`
- `Abc`
- `AbC`
- `ABc`
- `ABC`
That's way more possibilities!
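Here is a small sketch that checks all eight spellings with and without the `i` flag. The variable names are only for this example.

```javascript
const variants = ["abc", "abC", "aBc", "aBC", "Abc", "AbC", "ABc", "ABC"];

// With the i flag, every spelling matches.
const insensitive = variants.filter((s) => /abc/i.test(s));
console.log(insensitive.length); // 8

// Without it, only the all-lowercase one does.
const sensitive = variants.filter((s) => /abc/.test(s));
console.log(sensitive); // ["abc"]
```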
Never, ever, underestimate the backslash. What it does is give a special meaning to characters that don't have one, and take away the special meaning from characters that do.
Let's do a quick example! `/abc\[/g` matches "abc[". Usually, `[` means the beginning of a character class, but not if you put a `\` before it!
And `\` itself has a special meaning, so if you want to match the string "abc\[" then you need to escape both the `\` and the `[`.
So, we get `/abc\\\[/g`: `abc` for the abc, `\\` for the `\`, and `\[` for the `[`.
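A sketch of both escapes. One thing to watch out for (this part is about JavaScript strings, not regular expressions): inside a string literal you also have to write `\\` to get one backslash, so `"abc\\["` below is the five-character string abc\[.

```javascript
// Five characters: a, b, c, one backslash, and [
const target = "abc\\[";
console.log(target.length); // 5

// \[ is a literal [, so this matches "abc[".
const plainBracket = /abc\[/.test("abc[");
console.log(plainBracket); // true

// \\ is a literal backslash and \[ is a literal [.
const withBackslash = /abc\\\[/.test(target);
console.log(withBackslash); // true

// The second pattern insists on the backslash being there.
const missingBackslash = /abc\\\[/.test("abc[");
console.log(missingBackslash); // false
```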
So, we have already shortened our digit-matching code to `[0-9]`. Can we get shorter? As it turns out, if you put `\d` then the `d` gets some special meaning! It means "digit".
Let's try it out: `/abc\d/` is the same as `/abc[0-9]/`. Isn't this great? I, personally, think this is.
Even cooler, if you make the `d` a capital letter, then it negates its meaning. So, for example, `\d` means digit, `\D` means NOT a digit.
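Here is a tiny sketch comparing `\d`, `[0-9]` and `\D`. The test strings are made up.

```javascript
// \d and [0-9] agree with each other.
const shortForm = /abc\d/.test("abc7");
const longForm = /abc[0-9]/.test("abc7");
console.log(shortForm, longForm); // true true

// A letter after abc is not a digit, so this fails.
const notDigit = /abc\d/.test("abcd");
console.log(notDigit); // false

// \d picks out the digits, \D picks out everything else.
const onlyDigits = "a1b2".match(/\d/g);
const nonDigits = "a1b2".match(/\D/g);
console.log(onlyDigits); // ["1", "2"]
console.log(nonDigits);  // ["a", "b"]
```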
- `\b`, a word boundary, that is, the point between a word-character and something that isn't one (like a space or punctuation), or the start or end of the string if there is a word-character right there (see `\w` about word-characters). **Important:** word boundaries are points of length zero where the change between word-characters and non-word-characters occurs, and they don't match characters!
- `\B`, any position that isn't a word boundary
- `\c<capital letter>`, a control character. It's complicated, and I wouldn't worry about it 😄. Note that this doesn't have a negative, and also that there are two characters after the backslash, which is unusual.
- `\d`, a digit; `[0-9]` is the long way of writing it
- `\D`, anything that isn't a digit
- `\f`, form feed. This is a character. It doesn't have a negative.
- `\n` is a newline character. It's what separates lines on most operating systems. Doesn't have a negative.
- `\r` is a carriage return, it's a bit like the `\n` character.
- `\s` is a space-character, and it includes the tab character, the space character, the newline character, the carriage return character, and many more.
- `\S` is everything that isn't a space character
- `\t` is the tab character, you know, the one that takes up about 4 spaces' worth of gap.
- `\v`, something called a "vertical tab". I know, right?
- `\w`, a word-character! This is the same as `/[a-z0-9_]/i` (or, `/[a-zA-Z0-9_]/`).
- `\W`, everything that isn't a word character
- `\<number goes here>`, we'll cover these later!
- `\0` is a `NUL` character, which you shouldn't need to worry about.
- there are a couple more, but we will cover those later.
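A few of the escapes from this list, sketched out. The sample strings are invented, and `replace`, `split` and `join` are ordinary JavaScript string and array methods, not part of the regular expression itself.

```javascript
// \b only matches at the edge of a word, so "cat" inside "concatenate" doesn't count.
const wholeWord = /\bcat\b/.test("the cat sat");
const buried = /\bcat\b/.test("concatenate");
console.log(wholeWord, buried); // true false

// Word boundaries have length zero: marking them adds characters but removes none.
const marked = "cat".replace(/\b/g, "|");
console.log(marked); // "|cat|"

// \w keeps letters, digits and underscores; the space and the ! are dropped.
const wordChars = "hi_there you!".match(/\w/g).join("");
console.log(wordChars); // "hi_thereyou"

// \s matches a space, a tab and a newline alike.
const pieces = "a b\tc\nd".split(/\s/);
console.log(pieces); // ["a", "b", "c", "d"]
```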
Ok, so, let's say you want to match everything except for `a`, `b` and `c`. Well, if you put a `^` as the first character in a character class, it negates it. `[^abc]` is what we want!
So `\D` is `[^0-9]` since `\d` is `[0-9]`.
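A short sketch of negated character classes, with made-up strings. The second half uses `replace` to delete whatever the pattern matches, which shows that `[^0-9]` and `\D` do the same job.

```javascript
// Everything except a, b and c.
const others = "abcdef".match(/[^abc]/g);
console.log(others); // ["d", "e", "f"]

// Removing every non-digit leaves only the digits, whichever way we write it.
const viaClass = "a1b2".replace(/[^0-9]/g, "");
const viaEscape = "a1b2".replace(/\D/g, "");
console.log(viaClass, viaEscape); // "12" "12"
```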
Sometimes, we want to repeat the same pattern over and over again. And programmers are lazy, they don't want to write `\d\d\d\d\d\d\d\d\d\d\d`.
So, let's learn something new! If you add `{X,Y}`, where `X` and `Y` are numbers, after something that can be matched, then it tries to find between `X` and `Y` of that thing (inclusive). Note that `{X, Y}` with a space usually won't be treated as a quantifier, so avoid any spaces in there.
If we want between 4 and 6 `a` characters, we can do `a{4,6}`.
What about exactly 7 digits? Let's do this: `\d{7}`.
What about a letter, 3 or more times? Well, we can do that like so: `[a-z]{3,}` (note the extra comma).
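Here is a sketch of the three forms of `{}`. The strings are invented, and `"a".repeat(8)` is just a quick way of writing eight `a` characters.

```javascript
// Eight a's in a row: a{4,6} grabs as many as it is allowed, which is 6.
const upToSix = "a".repeat(8).match(/a{4,6}/)[0];
console.log(upToSix); // "aaaaaa"

// Three a's is not enough for a{4,6}, so there is no match at all.
const tooFew = "aaa".match(/a{4,6}/);
console.log(tooFew); // null

// Exactly 7 digits.
const seven = "phone 5551234".match(/\d{7}/)[0];
console.log(seven); // "5551234"

// 3 or more letters: "ab" is too short, so the first match is "abcd".
const longWord = "ab abcd".match(/[a-z]{3,}/)[0];
console.log(longWord); // "abcd"
```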
Ok, now, again, we want some shortcuts. `{0,1}` can be shortened to `?`, `{1,}` can be shortened to `+`, and `{0,}` can be shortened to `*`.
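The three shortcuts, sketched with made-up strings:

```javascript
// ? means 0 or 1: the "u" is optional.
const optional = "color colour".match(/colou?r/g);
console.log(optional); // ["color", "colour"]

// + means 1 or more.
const oneOrMore = "a aa aaa".match(/a+/g);
console.log(oneOrMore); // ["a", "aa", "aaa"]

// * means 0 or more: the b can have any number of a's in front, even none.
const zeroOrMore = "b ab aab".match(/a*b/g);
console.log(zeroOrMore); // ["b", "ab", "aab"]
```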
If we have `a+`, then will `aa` match the first a, the second a, and then both of them together? No. Usually, it will act greedy. It will try and match as many characters as possible.
If you want a quantifier to act lazy, that is, match as few characters as possible, then you should put `?` after it.
However, putting `a??` at the very end of a pattern is pointless, because a non-greedy version of "0 or 1" is "0", so it always matches nothing while still making the RegExp engine do a bit of extra work. Likewise with `a*?`. (In the middle of a pattern they can still matter, because the engine will take more characters if that is the only way for the rest of the pattern to match.)
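A sketch of greedy versus lazy, using the made-up strings `"aaa"` and `"aab"`:

```javascript
// Greedy: a+ takes as many a's as it can.
const greedy = "aaa".match(/a+/)[0];
console.log(greedy); // "aaa"

// Lazy: a+? takes as few as it can, which is one.
const lazy = "aaa".match(/a+?/)[0];
console.log(lazy); // "a"

// On their own, a?? and a*? are happy with zero characters.
const lazyOptional = "aaa".match(/a??/)[0];
const lazyStar = "aaa".match(/a*?/)[0];
console.log(lazyOptional === "", lazyStar === ""); // true true

// With something after it, a*? has to take more a's so that the b can match.
const forced = "aab".match(/a*?b/)[0];
console.log(forced); // "aab"
```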
(To be continued...)