@Snarp
Last active February 18, 2024 00:40
Unicode superscripts and subscripts normalized via NFKC
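A mapping of this kind can be reproduced with Ruby's built-in `String#unicode_normalize`. The following is a minimal sketch over a small hand-picked sample of subscripts. The helper names `nfkc_table` and `run_regexp` are hypothetical and not part of this gist.

```ruby
# Map each character to its NFKC-normalized form.
def nfkc_table(chars)
  chars.to_h { |ch| [ch, ch.unicode_normalize(:nfkc)] }
end

# Build a regexp matching runs of one or more of the given characters.
def run_regexp(chars)
  Regexp.new("[#{chars.map { |ch| Regexp.escape(ch) }.join}]+")
end

sample_subscripts = %w[₀ ₁ ₂ ₊ ₋ ₐ ₓ] # illustrative subset only
table = nfkc_table(sample_subscripts)
rx = run_regexp(table.keys)

table.each { |from, to| puts "#{from}: #{to}" }
puts "H₂O"[rx]
```

Note that the subscript minus `₋` normalizes to U+2212 MINUS SIGN (`−`), not to the ASCII hyphen-minus.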
---
:subscripts:
  "₀": '0'
  "₁": '1'
  "₂": '2'
  "₃": '3'
  "₄": '4'
  "₅": '5'
  "₆": '6'
  "₇": '7'
  "₈": '8'
  "₉": '9'
  "₍": "("
  "₎": ")"
  "₊": "+"
  "₌": "="
  "₋": "−"
  ₐ: a
  ₑ: e
  ₕ: h
  ᵢ: i
  ⱼ: j
  ₖ: k
  ₗ: l
  ₘ: m
  ₙ: n
  ₒ: o
  ₚ: p
  ᵣ: r
  ₛ: s
  ₜ: t
  ᵤ: u
  ᵥ: v
  ₓ: x
  ₔ: ə
  ᵦ: β
  ᵧ: γ
  ᵨ: ρ
  ᵩ: φ
  ᵪ: χ
:superscripts:
  "⁰": '0'
  "¹": '1'
  "²": '2'
  "³": '3'
  "⁴": '4'
  "⁵": '5'
  "⁶": '6'
  "⁷": '7'
  "⁸": '8'
  "⁹": '9'
  "⁽": "("
  "⁾": ")"
  "⁺": "+"
  "⁼": "="
  ᴬ: A
  ᴮ: B
  ᴰ: D
  ᴱ: E
  ᴳ: G
  ᴴ: H
  ᴵ: I
  ᴶ: J
  ᴷ: K
  ᴸ: L
  ᴹ: M
  ᴺ: N
  ᴼ: O
  ᴾ: P
  ᴿ: R
  ᵀ: T
  ᵁ: U
  ⱽ: V
  ᵂ: W
  ª: a
  ᵃ: a
  ᵇ: b
  ᶜ: c
  ᵈ: d
  ᵉ: e
  ᶠ: f
  ᵍ: g
  ʰ: h
  ⁱ: i
  ʲ: j
  ᵏ: k
  ˡ: l
  ᵐ: m
  ⁿ: n
  º: o
  ᵒ: o
  ᵖ: p
  ʳ: r
  ˢ: s
  ᵗ: t
  ᵘ: u
  ᵛ: v
  ʷ: w
  ˣ: x
  ʸ: y
  ᶻ: z
  ᴭ: Æ
  ᶞ: ð
  ꟸ: Ħ
  ᵑ: ŋ
  ꟹ: œ
  ᴲ: Ǝ
  ᶵ: ƫ
  ᴽ: Ȣ
  ᵄ: ɐ
  ᵅ: ɑ
  ᶛ: ɒ
  ᵓ: ɔ
  ᶝ: ɕ
  ᵊ: ə
  ᵋ: ɛ
  ᵌ: ɜ
  ᶟ: ɜ
  ᶡ: ɟ
  ᶢ: ɡ
  ˠ: ɣ
  ᶣ: ɥ
  ʱ: ɦ
  ᶤ: ɨ
  ᶥ: ɩ
  ᶦ: ɪ
  ꭞ: ɫ
  ᶩ: ɭ
  ᵚ: ɯ
  ᶭ: ɰ
  ᶬ: ɱ
  ᶮ: ɲ
  ᶯ: ɳ
  ᶰ: ɴ
  ᶱ: ɵ
  ᶲ: ɸ
  ʴ: ɹ
  ʵ: ɻ
  ʶ: ʁ
  ᶳ: ʂ
  ᶴ: ʃ
  ᶶ: ʉ
  ᶷ: ʊ
  ᶹ: ʋ
  ᶺ: ʌ
  ᶼ: ʐ
  ᶽ: ʑ
  ᶾ: ʒ
  ˤ: ʕ
  ᶨ: ʝ
  ᶫ: ʟ
  ᵝ: β
  ᵞ: γ
  ᵟ: δ
  ᶿ: θ
  ᵠ: φ
  ᵡ: χ
  ᵸ: н
  ꚜ: ъ
  ꚝ: ь
  ჼ: ნ
  ᵆ: ᴂ
  ᵔ: ᴖ
  ᵕ: ᴗ
  ᶸ: ᴜ
  ᵙ: ᴝ
  ᵜ: ᴥ
  ᶧ: ᵻ
  ᶪ: ᶅ
  "⁻": "−"
  ⵯ: ⵡ
  "㆒": 一
  "㆜": 丁
  "㆔": 三
  "㆖": 上
  "㆘": 下
  "㆛": 丙
  "㆗": 中
  "㆚": 乙
  "㆓": 二
  "㆟": 人
  "㆕": 四
  "㆞": 地
  "㆝": 天
  "㆙": 甲
  ꭜ: ꜧ
  ꝰ: ꝯ
  ꭝ: ꬷ
  ꭟ: ꭒ
--- # As Ruby regular expressions:
:subscripts: !ruby/regexp /[₀₁₂₃₄₅₆₇₈₉₍₎₊₌₋ₐₑₕᵢⱼₖₗₘₙₒₚᵣₛₜᵤᵥₓₔᵦᵧᵨᵩᵪ]+/
:superscripts: !ruby/regexp /[⁰¹²³⁴⁵⁶⁷⁸⁹⁽⁾⁺⁼ᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁⱽᵂªᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿºᵒᵖʳˢᵗᵘᵛʷˣʸᶻᴭᶞꟸᵑꟹᴲᶵᴽᵄᵅᶛᵓᶝᵊᵋᵌᶟᶡᶢˠᶣʱᶤᶥᶦꭞᶩᵚᶭᶬᶮᶯᶰᶱᶲʴʵʶᶳᶴᶶᶷᶹᶺᶼᶽᶾˤᶨᶫᵝᵞᵟᶿᵠᵡᵸꚜꚝჼᵆᵔᵕᶸᵙᵜᶧᶪ⁻ⵯ㆒㆜㆔㆖㆘㆛㆗㆚㆓㆟㆕㆞㆝㆙ꭜꝰꭝꭟ]+/
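Because these values carry the `!ruby/regexp` tag and symbol keys, a safe YAML load needs both classes permitted explicitly. A minimal sketch, assuming Psych 3.1 or later and using shortened, digit-only patterns in place of the full ones above:

```ruby
require "yaml"

# Shortened stand-in for the document above (digits only, for illustration).
yaml = <<~YAML
  ---
  :subscripts: !ruby/regexp /[₀₁₂₃₄₅₆₇₈₉]+/
  :superscripts: !ruby/regexp /[⁰¹²³⁴⁵⁶⁷⁸⁹]+/
YAML

# Regexp and Symbol are not in safe_load's default allow-list.
regexps = YAML.safe_load(yaml, permitted_classes: [Regexp, Symbol])

puts "H₂O"[regexps[:subscripts]]
puts "x²"[regexps[:superscripts]]
```

Only load YAML with permitted Ruby classes from sources you trust.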
# Method for processing HTML to replace Unicode subscript and superscript
# characters with normalized characters wrapped in `<sub>` and `<sup>` tags.
# Uses regular expressions to identify sequences of sub- and superscripts.
# Examples:
#
#   "1ˢᵗ 2ⁿᵈ 3ʳᵈ" => "1<sup>st</sup> 2<sup>nd</sup> 3<sup>rd</sup>"
#   "PO₄³⁻ ion"   => "PO<sub>4</sub><sup>3−</sup> ion"

REGEXPS = {
  # Less-thorough regular expressions:
  sub: /[₀₁₂₃₄₅₆₇₈₉₍₎₊₌₋ₐₑₕᵢⱼₖₗₘₙₒₚᵣₛₜᵤᵥₓ]+/,
  sup: /[⁰¹²³⁴⁵⁶⁷⁸⁹⁽⁾⁺⁼⁻ᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁⱽᵂªᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿºᵒᵖʳˢᵗᵘᵛʷˣʸᶻ]+/,
  # # More-thorough versions:
  # sub: /[₀₁₂₃₄₅₆₇₈₉₍₎₊₌₋ₐₑₕᵢⱼₖₗₘₙₒₚᵣₛₜᵤᵥₓₔᵦᵧᵨᵩᵪ]+/,
  # sup: /[⁰¹²³⁴⁵⁶⁷⁸⁹⁽⁾⁺⁼ᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁⱽᵂªᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿºᵒᵖʳˢᵗᵘᵛʷˣʸᶻᴭᶞꟸᵑꟹᴲᶵᴽᵄᵅᶛᵓᶝᵊᵋᵌᶟᶡᶢˠᶣʱᶤᶥᶦꭞᶩᵚᶭᶬᶮᶯᶰᶱᶲʴʵʶᶳᶴᶶᶷᶹᶺᶼᶽᶾˤᶨᶫᵝᵞᵟᶿᵠᵡᵸꚜꚝჼᵆᵔᵕᶸᵙᵜᶧᶪ⁻ⵯ㆒㆜㆔㆖㆘㆛㆗㆚㆓㆟㆕㆞㆝㆙ꭜꝰꭝꭟ]+/,
}

# "Dihydrogen monoxide, or H₂O, can be very dangerous."
# => "Dihydrogen monoxide, or H<sub>2</sub>O, can be very dangerous."
def subscripts_to_sub_tags(html)
  capture_normalize_and_wrap(html,
                             REGEXPS[:sub], # /[₀₁₂₃₄₅ ...
                             'sub')
end

# "He wore a white <i>gat</i>¹."
# => "He wore a white <i>gat</i><sup>1</sup>."
def superscripts_to_sup_tags(html)
  capture_normalize_and_wrap(html,
                             REGEXPS[:sup], # /[⁰¹²³⁴⁵ ...
                             'sup')
end

def capture_normalize_and_wrap(html, rx, tag, log: true)
  scanned, to_scan = "", html
  while m = rx.match(to_scan)
    str = "<#{tag}>#{m[0].unicode_normalize(:nfkc)}</#{tag}>"
    puts str if log
    scanned += (m.pre_match + str)
    to_scan = m.post_match
  end
  return scanned + to_scan
end

# "PO₄³⁻" => "PO<sub>4</sub><sup>3−</sup>"
def process_html(html="PO₄³⁻")
  REGEXPS.each do |tag, rx|
    html = capture_normalize_and_wrap(html, rx, tag)
  end
  return html
end

def process_html_file(in_fname='sample_in.html', out_fname='sample_out.html')
  html = process_html File.read(in_fname)
  File.write(out_fname, html)
  return html
end
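The match-and-append loop in `capture_normalize_and_wrap` can also be expressed as a single `String#gsub` call with a block. The sketch below is an alternative, not the gist's own method. `wrap_runs` is a hypothetical name, logging is omitted, and the two patterns are shortened to digits and signs for brevity.

```ruby
# Hypothetical helper: wrap every run matched by rx in <tag>...</tag>,
# replacing the run with its NFKC-normalized form.
def wrap_runs(text, rx, tag)
  text.gsub(rx) { |run| "<#{tag}>#{run.unicode_normalize(:nfkc)}</#{tag}>" }
end

# Shortened patterns (digits and signs only), for illustration.
sub_rx = /[₀₁₂₃₄₅₆₇₈₉₍₎₊₌₋]+/
sup_rx = /[⁰¹²³⁴⁵⁶⁷⁸⁹⁽⁾⁺⁼⁻]+/

result = wrap_runs(wrap_runs("PO₄³⁻ ion", sub_rx, "sub"), sup_rx, "sup")
puts result
```

Like the original, this treats the input as plain text, so matching characters inside tag names or attribute values would be wrapped as well.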