@hydrargyrum
Forked from dpk/gist:8325992
Created February 20, 2024 10:50
PyICU cheat sheet

Because you can't get the docs.

Transliteration

Create a transliterator:

import icu

greek2latin = icu.Transliterator.createInstance('Greek-Latin')

Transliterate:

greek2latin.transliterate('Ψάπφω') # => 'Psápphō'

Inverse transformation:

latin2greek = icu.Transliterator.createInstance('Greek-Latin', icu.UTransDirection.REVERSE)
latin2greek.transliterate('Psápphō') # => 'Ψάπφω'

or

latin2greek = greek2latin.createInverse()
latin2greek.transliterate('Psápphō') # => 'Ψάπφω'

See http://demo.icu-project.org/icu-bin/translit and http://userguide.icu-project.org/transforms/general for an idea of what kind of transliteration is built in.

Locales

Create a locale object:

britain = icu.Locale('en-GB')
french_ca = icu.Locale('fr_CA')
# etc.

There are also a few shortcuts:

icu.Locale.getFrance()
icu.Locale.getDefault()
# etc.

Each locale object can report a few things about itself, such as its display name:

britain.getDisplayName() # => 'English (United Kingdom)'
french_ca.getDisplayLanguage() # => 'French'
# etc.

Collation

See the bit above on Locales first. You'll need to understand locales in order to work the collator.

Create a collator for a particular Locale:

collator = icu.Collator.createInstance(icu.Locale('en_GB'))

Sort a list of strings, e.g.:

sorted(['sandwiches', 'angel delight', 'custard', 'éclairs', 'glühwein'], key=collator.getSortKey) #=> ['angel delight', 'custard', 'éclairs', 'glühwein', 'sandwiches']
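Since `getSortKey` is passed as an ordinary `key=` callable above, it should compose like any other key function, for instance when sorting records by one field. The following is only a sketch under stated assumptions: `sort_records` and the sample data are hypothetical names, and the key function is taken as a parameter so the helper itself does not need ICU. `str.casefold` is a stand-in here; with PyICU you would pass `collator.getSortKey` instead.

```python
def sort_records(records, field, sort_key):
    """Return records ordered by record[field], compared via sort_key.

    sort_key is any one-argument callable mapping a string to something
    comparable, e.g. collator.getSortKey from a PyICU collator.
    """
    return sorted(records, key=lambda record: sort_key(record[field]))


# Hypothetical sample data. str.casefold is only a stand-in key function
# so that this snippet runs without PyICU installed.
puddings = [
    {"name": "sandwiches", "price": 3},
    {"name": "Custard", "price": 1},
    {"name": "angel delight", "price": 2},
]
by_name = sort_records(puddings, "name", str.casefold)
print([p["name"] for p in by_name])
```

Keeping the key function as a parameter also makes it easy to swap collators per locale without touching the sorting code.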

Rule-based collation (tailoring)

The following makes (or should make — tailoring is a bit of a black art) thorn (Þþ) sort in Old English order (see Michael Everson's article, Sorting the letter ÞORN):

collator = icu.RuleBasedCollator('[normalization on]\n&t<þ<u\n&T<Þ<U\n&Þ=þ')
sorted(['þinking', 'tweet', 'uppity', 'Typography', 'Þeology', 'Urology'], key=collator.getSortKey) # => ['tweet', 'Typography', 'Þeology', 'þinking', 'uppity', 'Urology']

Tweaking a locale-based collation with extra rules

Ignore word breaks (spaces and punctuation) when sorting Welsh:

rules = icu.Collator.createInstance(icu.Locale('cy')).getRules()
rules = '[alternate shifted]' + rules
collator = icu.RuleBasedCollator(rules)

Date format

Date-time:

from datetime import datetime

formatter = icu.DateFormat.createDateTimeInstance(icu.DateFormat.LONG, icu.DateFormat.kDefault, icu.Locale('de_DE'))
formatter.format(datetime.now()) #=> e.g. '26. Juli 2014 14:57:22'

For date only or time only, replace the formatter line with one of:

formatter = icu.DateFormat.createDateInstance(icu.DateFormat.LONG, icu.Locale('de_DE'))
formatter = icu.DateFormat.createTimeInstance(icu.DateFormat.LONG, icu.Locale('de_DE'))

Break Iteration

Unfortunately this is even more of a pain than you’d hope.

de_words = icu.BreakIterator.createWordInstance(icu.Locale('de_DE'))
de_words.setText('Bist du in der U-Bahn geboren?')
de_words.nextBoundary() #=> 4
de_words.nextBoundary() #=> 5
# etc.

The following function might be useful:

def iterate_breaks(text, break_iterator):
    break_iterator.setText(text)
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        if next_boundary == -1:  # -1 means there are no more boundaries
            return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary

Usage:

de_words = icu.BreakIterator.createWordInstance(icu.Locale('de_DE'))
list(iterate_breaks('Bist du in der U-Bahn geboren?', de_words))
#=> ['Bist', ' ', 'du', ' ', 'in', ' ', 'der', ' ', 'U', '-', 'Bahn', ' ', 'geboren', '?']
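One caveat with slicing by boundary: as far as I know, ICU counts positions in UTF-16 code units, while Python strings are indexed by code point, so the two can drift apart once the text contains characters outside the Basic Multilingual Plane (emoji, for example). The sketch below is one possible workaround, not an official PyICU recipe. `iterate_breaks_utf16` and `FakeBreakIterator` are hypothetical names, and the fake iterator only replays assumed boundaries so the snippet can run without PyICU installed.

```python
def iterate_breaks_utf16(text, break_iterator):
    """Yield the segments of text between successive boundaries.

    Assumes the boundaries are UTF-16 code-unit offsets, so the text is
    sliced as UTF-16 rather than by Python code points.
    """
    break_iterator.setText(text)
    encoded = text.encode("utf-16-le")  # 2 bytes per UTF-16 code unit
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        if next_boundary == -1:  # no more boundaries
            return
        yield encoded[2 * lastpos:2 * next_boundary].decode("utf-16-le")
        lastpos = next_boundary


class FakeBreakIterator:
    """Hypothetical stand-in for an ICU break iterator (demo only).

    It replays a fixed list of boundaries instead of analysing the text.
    """

    def __init__(self, boundaries):
        self._boundaries = list(boundaries)
        self._pending = iter(())

    def setText(self, text):
        self._pending = iter(self._boundaries)

    def nextBoundary(self):
        return next(self._pending, -1)


# '😀' is one Python character but two UTF-16 code units. The boundaries
# 2, 3 and 5 are assumed for illustration (emoji, space, word).
segments = list(iterate_breaks_utf16("😀 ok", FakeBreakIterator([2, 3, 5])))
print(segments)
```

With a real iterator you would pass something like `de_words` from above in place of the fake one. For text that stays within the Basic Multilingual Plane, both functions should give the same result.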