Regular Expressions and Japanese

In our modules, we have covered how to read and write regular expressions that use numbers, special characters, and letters from the Roman alphabet (A-Z). But what about characters from other languages? This is also possible, and there are different ways to do it!

Most character sets and alphabets have their own ranges in Unicode, so you can simply take the first and last code points of a range and use them, as long as your system supports it. However, Japanese uses three different "alphabets," one of which is derived from Chinese. Additionally, Japanese users will sometimes use English words when they are talking about something specific, quoting something written in English, or when there isn't a good translation. So how can we account for all of this? Regex is flexible enough to cover it all!
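As a small illustration of the "first and last code point" idea, the sketch below builds a range from the Cyrillic block (U+0400 to U+04FF). The name `cyrillicOnly` and the sample words are made up for this example, and the block boundaries come from the Unicode charts rather than from this tutorial.

```javascript
// Match strings made up only of characters in the Cyrillic block.
// \u0400 and \u04FF are the first and last code points of that block.
const cyrillicOnly = /^[\u0400-\u04FF]+$/;

console.log(cyrillicOnly.test("привет")); // true
console.log(cyrillicOnly.test("hello"));  // false
```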

A Quick Breakdown of Japanese Characters

You don't need to know Japanese to understand this tutorial, but you'll benefit from having some small idea of the different types of Japanese characters, so I'll do my best to quickly explain. There are three sets of characters primarily used in Japanese: hiragana, katakana, and kanji.

Hiragana typically makes up the bulk of a Japanese sentence. These characters are fairly simple, and have a curvy, handwritten look. The equivalents of the Roman vowels (a, e, i, o, u) are あ、え、い、お、う.
Katakana represents the same sounds as hiragana, but is used for foreign words or for emphasis, similar to bold or italics. Katakana is much straighter and sharper-looking than hiragana. The vowels are ア、エ、イ、オ、ウ.
Kanji are Chinese characters that have been incorporated into Japanese. With some exceptions, these typically have more lines than the other two, and represent words or concepts. Some examples are 今 (now), 見 (to see), 高 (high, expensive).

Our Regular Expression

If we only wanted to check that a string contains nothing but Japanese characters, we could use a single set that combines ranges for all three character types with some special characters and punctuation that are typically only used by Japanese speakers:

/^[ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」]*$/

Adding in the Roman alphabet or numbers is as simple as adding new ranges:

/^[a-zA-Z0-9ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」]*$/
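A minimal sketch of how these two expressions behave in JavaScript. The variable names and sample strings are made up for illustration. Note that because both expressions end in `*`, they also accept an empty string.

```javascript
// The two expressions from above, stored under illustrative names.
const japaneseOnly = /^[ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」]*$/;
const japaneseOrRoman = /^[a-zA-Z0-9ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」]*$/;

console.log(japaneseOnly.test("こんにちは"));      // true
console.log(japaneseOnly.test("Tokyoタワー"));     // false: Roman letters are not in the set
console.log(japaneseOrRoman.test("Tokyoタワー"));  // true
console.log(japaneseOnly.test(""));                // true: * allows zero characters
```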

However, for the sake of explaining how regex works, I will be using a more complex expression:

/^(私|僕|俺){1}(?:たち)?(は|が){1}[ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」\w]+(です|だ)。+$/gm

This expression checks for "I am" and "we are" statements, line by line. I'll explain the different parts in detail as we go along.
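To see "line by line" in action, here is a small sketch that runs the expression over a three-line string. The names `sentence`, `text`, and `matches` are assumptions made for this example, as is the English line in the middle.

```javascript
const sentence = /^(私|僕|俺){1}(?:たち)?(は|が){1}[ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」\w]+(です|だ)。+$/gm;

// Three lines; the second one is English and should be skipped.
const text = ["僕がカナダ人だ。", "I am Canadian.", "私たちはみんな友達です。"].join("\n");

// With the g flag, String.prototype.match returns every full match as an array.
const matches = text.match(sentence);
console.log(matches.length); // 2
console.log(matches[0]);     // 僕がカナダ人だ。
console.log(matches[1]);     // 私たちはみんな友達です。
```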

Table of Contents

Anchors
Quantifiers
OR Operator
Character Classes
Flags
Grouping and Capturing
Greedy and Lazy Match
Conclusion
Author

Regex Components

Anchors

Anchors tell our expression where to begin and end. Use ^ to indicate the beginning of the string, and $ to indicate the end.

In our expression:

/^(私|僕|俺){1}(?:たち)?(は|が){1}[ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」\w]+(です|だ)。+$/gm

The anchors wrap the whole expression, so a match has to cover an entire line from start to end (a line rather than the whole string, because of the m flag covered under Flags). For another example, if we wanted to make sure the string began with 私は (watashi wa, roughly "As for me..." or "About me..."), we could use the following expression:

/^私は/

Keep in mind that ^ is also used to negate sets: placed first inside a bracketed set, it makes the set match any character that is not listed. If we wanted to make sure our string does NOT contain 私 or は, we would repeat the negated set across the whole string:

/^[^私は]*$/

If we wanted to check whether the string ends with です。 (desu., polite form of "to be" or "is"), we could use:

/です。$/

Also note the use of a Japanese full-stop or period (。) to indicate the end of the sentence.
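The anchor examples can be tried out as follows. This is only a sketch: the variable names are made up, and the sample sentences use 学生 (student), which does not appear elsewhere in this tutorial. The negated-set check is written as `/^[^私は]*$/`, with ASCII anchors and a `*` so that it covers a whole string.

```javascript
const startsWithWatashiWa = /^私は/;
const endsWithDesu = /です。$/;
const noWatashiOrWa = /^[^私は]*$/;

console.log(startsWithWatashiWa.test("私は学生です。")); // true
console.log(startsWithWatashiWa.test("僕は学生です。")); // false: starts with 僕
console.log(endsWithDesu.test("僕は学生です。"));        // true
console.log(endsWithDesu.test("僕は学生だ。"));          // false: ends with だ。
console.log(noWatashiOrWa.test("僕が学生だ。"));         // true: neither 私 nor は appears
console.log(noWatashiOrWa.test("僕は学生だ。"));         // false: は appears
```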

Quantifiers

Quantifiers tell us how many times a character, set, or group must be present for a match to occur.

Asterisk (*) | 0 or more times
Plus sign (+) | 1 or more times
Question mark (?) | 0 or 1 time (and no more)
{x} | exactly x times
{x,} | at least x times
{x,y} | from x to y times, inclusive

These go after the character, set, or group that they apply to.

In our expression:

/^(私|僕|俺){1}(?:たち)?(は|が){1}[ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」\w]+(です|だ)。+$/gm

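We are looking for exactly 1 instance of the first group, 0 or 1 of the second group, exactly 1 of the third group, 1 or more characters from the set in square brackets, exactly 1 of the final group, and 1 or more full stops. Our 0 or 1 group is a modifier that turns "I" into "we," so it makes sense that it's optional here. The {1} quantifiers are written out for illustration only; a group with no quantifier already matches exactly once.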
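The quantifiers in the table can each be checked with a one-character pattern. The sketch below uses あ as the repeated character and anchors every pattern so the quantifier has to account for the whole string. The names `checks` and `subject` are made up for this example.

```javascript
const checks = {
  starOnEmpty: /^あ*$/.test(""),             // true: * allows zero occurrences
  plusOnEmpty: /^あ+$/.test(""),             // false: + needs at least one
  optionalOnTwo: /^あ?$/.test("ああ"),       // false: ? allows at most one
  exactlyTwo: /^あ{2}$/.test("ああ"),        // true
  atLeastTwo: /^あ{2,}$/.test("あああ"),     // true
  twoToThree: /^あ{2,3}$/.test("ああああ"),  // false: four is outside 2 to 3
};
console.log(checks);

// The optional group from our expression: たち may appear once or not at all.
const subject = /^私(?:たち)?は/;
console.log(subject.test("私は"));         // true
console.log(subject.test("私たちは"));     // true
console.log(subject.test("私たちたちは")); // false: たち appears twice
```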

OR Operator

The OR operator is represented by a vertical bar (|). Use it within a parenthetical group to list alternatives: any one of them will match, but only one is matched at a time.

In our expression:

/^(私|僕|俺){1}(?:たち)?(は|が){1}[ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」\w]+(です|だ)。+$/gm

We are checking for...

私 OR 僕 OR 俺 (three different forms of "I")
は OR が (particle)
です OR だ (forms of "to be", normally used at the end of a sentence)
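A short sketch of alternation, and of why the parentheses matter. The variable names are illustrative. Without a group, `|` splits the entire pattern, so `/^です|だ$/` means "starts with です" OR "ends with だ" rather than "is exactly です or だ".

```javascript
// Exactly one of the three forms of "I".
const pronoun = /^(私|僕|俺)$/;
console.log(pronoun.test("僕"));   // true
console.log(pronoun.test("私僕")); // false: only one alternative may be used
console.log(pronoun.test("君"));   // false: 君 is not one of the alternatives

// Grouped versus ungrouped alternation.
const grouped = /^(です|だ)$/;
const ungrouped = /^です|だ$/;
console.log(grouped.test("ですよ"));   // false: the whole string must be です or だ
console.log(ungrouped.test("ですよ")); // true: the "^です" branch matches on its own
```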

Character Classes

Character classes, or character sets, match any single character from a set. Ranges of characters can be included by using hyphens. Custom character classes are written as a bracket expression, such as [ぁ-ゔ]. Additionally, there are several built-in classes, most of which are written as a backslash (\) followed by a letter. The most common are \d, \s, and \w, which stand for digit, space, and word, respectively. Our expression uses \w, which matches any letter of the Roman alphabet, any digit, or an underscore. This class can be considered shorthand for [a-zA-Z0-9_].

To match any character that is NOT covered by \w, we simply make the w capital, as in \W. This negates the set, as in [^a-zA-Z0-9_].
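The built-in classes are worth testing against Japanese text, since their coverage differs. In the sketch below (illustrative names and sample strings), `\w` and `\d` only cover ASCII letters and digits, while JavaScript's `\s` also matches the full-width ideographic space, written here as the escape `\u3000`.

```javascript
const wordOnly = /^\w+$/;
const nonWordOnly = /^\W+$/;

console.log(wordOnly.test("David_25")); // true
console.log(wordOnly.test("私"));       // false: \w does not include kanji or kana
console.log(nonWordOnly.test("私"));    // true: so 私 counts as a "non-word" character

const digitsOnly = /^\d+$/;
console.log(digitsOnly.test("25"));     // true
console.log(digitsOnly.test("２５"));   // false: full-width digits are not in \d

const singleSpace = /^\s$/;
console.log(singleSpace.test("\u3000")); // true: the ideographic space counts as whitespace
```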

Flags

Flags are optional parameters that go at the end of a regular expression, after the closing slash. Each is a single lowercase letter. The six flags covered here are listed below; newer versions of JavaScript also define d (match indices) and v (an extended Unicode mode), which this tutorial does not use.

i - Ignore Casing - makes the expression case-insensitive
g - Global - searches for all instances of the expression
s - Dot All - makes the wildcard dot character (.) match newline characters (\n)
u - Unicode - enables Unicode support
y - Sticky - matches only from the index indicated in the lastIndex property. Does not check later indexes. Supersedes the global flag.
m - Multiline - anchors ^ and $ now match at the start and end of every line, instead of only at the start and end of the whole string.

Our expression uses the global and multiline flags, so it is searching for all matches and checking every line.
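The effect of the m, g, and i flags can be seen by running the same pattern with and without them. This is a sketch with made-up names and a made-up two-line sample. The last part shows a common surprise: a g-flagged expression remembers its position in `lastIndex` between calls to `test()`.

```javascript
const text = "私は学生です。\n僕は先生だ。";

// Without m, ^ only matches at the very start of the string.
const withoutM = /^僕/.test(text);  // false
const withM = /^僕/m.test(text);    // true

// Without g, match() reports only the first match; with g, it returns them all.
const first = text.match(/(です|だ)。$/m)[0]; // です。
const all = text.match(/(です|だ)。$/gm);     // two matches: です。 and だ。

// i makes Roman letters case-insensitive.
const caseInsensitive = /david/i.test("俺はDavidだ。"); // true

// A g-flagged regex is stateful when used with test().
const stateful = /だ/g;
const firstCall = stateful.test("だ");  // true, lastIndex is now 1
const secondCall = stateful.test("だ"); // false, the search resumed at index 1

console.log(withoutM, withM, first, all, caseInsensitive, firstCall, secondCall);
```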

Grouping and Capturing

When we want to apply regex operators to several items at once, we group them. There are several ways to do this:

Putting characters or character ranges (indicated by a hyphen) in square brackets creates a character class. Putting a ^ within the brackets at the start creates a negated class.
Parentheses indicate a capturing group. A set in a capturing group will be matched, and the match will be remembered. Useful if the matched substring needs to be recalled or stored.
(?:x) - A non-capturing group. In this example, x will be matched, but the match will not be remembered and cannot be recalled.

In our expression:

/^(私|僕|俺){1}(?:たち)?(は|が){1}[ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」\w]+(です|だ)。+$/gm

We have three capturing groups, one non-capturing group, and one bracketed character class containing three ranges, several individual characters, and the alphanumeric class.
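To show what "remembered" means, the sketch below runs the expression on a single sentence and reads back the captured pieces. The flags are dropped here as an assumption for the example, since only one string is checked; the variable names are illustrative.

```javascript
const sentence = /^(私|僕|俺){1}(?:たち)?(は|が){1}[ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」\w]+(です|だ)。+$/;

// Without the g flag, match() returns the full match followed by each capturing group.
const result = "私たちはみんな友達です。".match(sentence);

console.log(result[0]);     // 私たちはみんな友達です。 (the full match)
console.log(result[1]);     // 私 (first capturing group)
console.log(result[2]);     // は (second capturing group)
console.log(result[3]);     // です (third capturing group)
console.log(result.length); // 4: たち was matched by (?:たち) but not captured
```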

Greedy and Lazy Match

A greedy quantifier matches as many characters as it can while still allowing the rest of the expression to match, while a lazy quantifier matches as few as it can. For example, the quantifier + is greedy by default and keeps consuming characters after the first one has been found. To make a quantifier lazy, add a question mark after it (+?). Our expression only uses greedy quantifiers because each line is matched as a whole, and we have no need to pick out shorter pieces within it.
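The difference is easiest to see with two quoted phrases in one string. In this sketch (sample text and names made up for illustration), the greedy version runs to the last closing bracket, while the lazy version stops at the first one.

```javascript
const text = "「はい」と「いいえ」";

const greedy = text.match(/「.+」/)[0];
const lazy = text.match(/「.+?」/)[0];

console.log(greedy); // 「はい」と「いいえ」
console.log(lazy);   // 「はい」
```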

Conclusion

Let's sum up this expression:

/^(私|僕|俺){1}(?:たち)?(は|が){1}[ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」\w]+(です|だ)。+$/gm

Per line, this expression checks for:

One of three forms of "I"
(optional) plural modifier that turns "I" into "we"
Connecting particle
At least one of the following: a Japanese character, punctuation mark, or alphanumeric character
One of two forms of "to be" or "is"
One or more full stops

This expression will match with the following examples of simple Japanese sentences:

僕がカナダ人だ。(I am Canadian.)
私たちはみんな友達です。(We are all friends.)
私が25歳です。(I am 25 years old.)
俺はDavidだ。(I am David.)
俺たちは気にしないです。(We don't care.)
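A quick way to confirm this is to test a few of the sentences directly, alongside some that should fail. In this sketch the g and m flags are left off so that `test()` checks each string independently; the counterexamples and variable names are assumptions added for illustration.

```javascript
const sentence = /^(私|僕|俺){1}(?:たち)?(は|が){1}[ぁ-ゔァ-・一-龯ゞヽヾ゛゜ー々〆〤¥「」\w]+(です|だ)。+$/;

const examples = [
  "僕がカナダ人だ。",
  "私たちはみんな友達です。",
  "私が25歳です。",
  "俺はDavidだ。",
];

const nonMatches = [
  "僕がカナダ人だ",   // no full stop at the end
  "彼はカナダ人だ。", // 彼 (he) is not one of the three forms of "I"
  "私は、学生です。", // the comma 、 is not in the character class
];

for (const s of examples) console.log(s, sentence.test(s));   // each prints true
for (const s of nonMatches) console.log(s, sentence.test(s)); // each prints false
```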

Author

This was written by Brenham Pozzi, a student at UTSA Coding Bootcamp. You can find my GitHub profile here.

brenhamp/regex_japanese_tutorial.MD
