# -*- coding: utf-8 -*-
""" Normalise (normalize) unicode data in Python to remove umlauts, accents etc. """
import unicodedata

data = u'naïve café'
# Decompose accented characters, drop everything that is not ASCII,
# then decode back to text so the result prints the same on Python 2 and 3.
normal = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore').decode('ASCII')
print(normal)
# prints "naive cafe"
thanks it's working
Does it work with a list instead of a single string?
@renatofmartins use a list comprehension: [unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore') for x in my_list]
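For Python 3, where encode() returns bytes, the list version could be sketched like this. The helper name to_ascii and the sample list are made up for illustration; they are not part of the gist.

```python
import unicodedata

def to_ascii(text):
    """Strip accents from text and return a plain ASCII str (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFKD', text)
    # encode() drops the combining marks; decode() turns the bytes back into str
    return decomposed.encode('ascii', 'ignore').decode('ascii')

my_list = ['naïve', 'café', 'über']  # sample data, assumed for the example
print([to_ascii(x) for x in my_list])
# prints ['naive', 'cafe', 'uber']
```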
How can I keep the letter Ñ?
@frangeris, that's a great question. I ended up with the function below. It works well for Spanish: anything combined with the tilde (~, 'COMBINING TILDE') is left untouched. To be exact, it ONLY strips the acute (´) and grave (`) accents and nothing else:
def strip_accents_spain(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
The docs don't say much about which combining marks normalize() deals with, but you can get the general idea here: ftp://ftp.unicode.org/Public/9.0.0/ucd/NormalizationTest.txt (search for "COMBINING" near the bottom of the document to see all the options).
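If you'd rather not dig through that file, one way to see which combining marks a given string actually contains is to decompose it and ask unicodedata for the names. This is only a sketch; combining_marks is a made-up helper name.

```python
import unicodedata

def combining_marks(text):
    """Return the names of the combining marks in text after NFD decomposition."""
    decomposed = unicodedata.normalize('NFD', text)
    # unicodedata.combining() returns a non-zero class for combining characters
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]

print(combining_marks('Ñandú è'))
# prints ['COMBINING TILDE', 'COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT']
```

The names it prints are the ones you can pass in an accents tuple to a function like strip_accents_spain.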
@frangeris a quick and probably non-pythonic solution is as follows:
line = "EL NIÑO"
line = line.replace('Ñ','-&-')
line = unicodedata.normalize('NFKD', line).encode('ascii', 'ignore').decode('ascii')
line = line.replace('-&-','Ñ')
Replace -&- with some other character combination that doesn't appear in your text.
This is also case-sensitive and character-specific, so 'ñ' needs its own pair of replace calls. You can always add more replace calls (not ideal).
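A variation that avoids the placeholder altogether is to go character by character and skip the ones you want to keep. A rough sketch, assuming Python 3; the name strip_accents_keep and the default keep set are my own choices, not from the gist:

```python
import unicodedata

def strip_accents_keep(text, keep='Ññ'):
    """Convert text to ASCII, except for the characters listed in keep."""
    out = []
    # NFC first, so that e.g. N + COMBINING TILDE is seen as the single char Ñ
    for ch in unicodedata.normalize('NFC', text):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize('NFKD', ch)
            out.append(decomposed.encode('ascii', 'ignore').decode('ascii'))
    return ''.join(out)

print(strip_accents_keep('EL NIÑO está aquí'))
# prints EL NIÑO esta aqui
```

Like the original snippet, this silently drops any other non-ASCII character that has no ASCII decomposition.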
Nifty, but note it doesn't change Unicode punctuation such as left and right quotation marks and en, em, figure, and horizontal dashes (‘ ’, “ ”, – — ‒ ―) to their ASCII equivalents; it just strips them. I tried fiddling with the unicodedata.normalize options without success. FWIW these punctuation characters are missing from the table in @erm3nda's link to ftp://ftp.unicode.org/Public/9.0.0/ucd/NormalizationTest.txt.
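Since these characters have no decomposition that ends in ASCII, one workaround is to map them by hand with str.translate before normalizing. A minimal sketch, assuming Python 3; the table below only covers the characters mentioned above, and the names PUNCTUATION and to_ascii_punct are hypothetical:

```python
import unicodedata

# Hand-written mapping (an assumption, extend as needed): code point -> ASCII
PUNCTUATION = {
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',  # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',  # RIGHT DOUBLE QUOTATION MARK
    0x2012: '-',  # FIGURE DASH
    0x2013: '-',  # EN DASH
    0x2014: '-',  # EM DASH
    0x2015: '-',  # HORIZONTAL BAR
}

def to_ascii_punct(text):
    """Replace common Unicode punctuation, then strip accents as in the gist."""
    text = text.translate(PUNCTUATION)
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(to_ascii_punct('\u201cna\u00efve\u201d caf\u00e9 \u2013 ok'))
# prints "naive" cafe - ok
```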
LATIN SMALL LETTER O WITH STROKE (ø) becomes the empty string instead of LATIN SMALL LETTER O.
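That happens because ø has no decomposition in the Unicode database, so NFKD leaves it as a single non-ASCII character and encode(..., 'ignore') drops it. A possible workaround is a small fallback table applied before normalizing. This is a sketch for Python 3; EXTRA and to_ascii_with_fallback are made-up names, and the table only covers ø and Ø.

```python
import unicodedata

# é decomposes into e + COMBINING ACUTE ACCENT, ø does not decompose at all
print(repr(unicodedata.decomposition('\u00e9')))  # prints '0065 0301'
print(repr(unicodedata.decomposition('\u00f8')))  # prints ''

# Fallback mapping for letters without a decomposition (assumed, extend as needed)
EXTRA = {0x00F8: 'o', 0x00D8: 'O'}

def to_ascii_with_fallback(text, table=EXTRA):
    """Apply the fallback table, then strip accents as in the gist."""
    text = text.translate(table)
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(to_ascii_with_fallback('S\u00f8ren \u00d8rsted'))
# prints Soren Orsted
```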