- Python's Standard Library, especially str methods and the string module, is powerful for text processing. Start there.
- regex - Extends Python's Standard Library re module while being backwards-compatible.
- chardet - Finds character encoding.
- ftfy - Takes in bad Unicode and outputs good Unicode. Seriously automagical.
- polyglot - Helpful for multilingual preprocessing.
- fuzzywuzzy - Fuzzy string matching like a boss.
- enchant - Spell checking.
- inflect - Convert numbers to words, switch between singular/plural, and generate ordinals.
- nltk - Hard pass. Too academic, too slow.
- scikit-learn - Handles basic text processing and modeling. Easy to combine text-based features with other features.
- TextBlob - A great package for common NLP tasks. Consistent OOP-style API.
- spaCy - Industrial-strength NLP, including very good transformer support and named entity recognition (NER).
- textacy - Higher level NLP built on top of spaCy.
- Hugging Face - Collections of datasets and pretrained models.
- gensim - A nice API for all kinds of topic modeling and word2vec.
- pattern - Text mining at its finest. Handles normalizing numbers, comparatives, and superlatives.
- jellyfish - Approximate & phonetic string matching.
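
As a rough illustration of the "start with the Standard Library" advice at the top of this list, here is a minimal sketch that uses only str methods, the string module, and re. The helper names `normalize` and `tokenize` are hypothetical, chosen for this example, and the punctuation handling assumes ASCII text.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    # str.maketrans with a third argument builds a table that deletes
    # every character listed in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    cleaned = text.lower().translate(table)
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with a single space collapses repeated spaces.
    return " ".join(cleaned.split())


def tokenize(text: str) -> list[str]:
    """Return runs of word characters using the built-in re module."""
    return re.findall(r"\w+", text.lower())


print(normalize("  Hello,   World!  "))
print(tokenize("Fuzzy string-matching, like a boss."))
```

For plain cleanup like this, the built-ins are often enough. The third-party packages above are worth reaching for when the task involves broken Unicode, unknown encodings, or linguistic analysis.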