- This document is written for developers who want to change the behavior of Term Suite at the source-code level.
- If you only want to use Term Suite, please see the Term Suite User Guide.
The outline of this document follows Term Suite's overall process, which can be expressed as a sequence of UIMA analysis engines, focusing in particular on the local, language-dependent changes made for the Japanese components.
JapaneseTagger -> JapaneseNormaliser -> JapaneseTermSpotter -> SpotterTSVWriter -> JapaneseFilter -> Contextualizer -> Writer
Instead of TreeTagger, we use the Japanese morphological analyzer Igo, which handles morpheme segmentation, lemmatization, and part-of-speech (POS) tagging. We developed a simple wrapper for Igo (uima-igo) to adapt it to the UIMA framework.
- uima-igo (JapaneseMorphologicalAnalyzer)
To adapt the output type of uima-igo (net.hitsujiwool.uima.igo.types.MechabMorpheme) to the type commonly used in Term Suite (eu.project.ttc.types.WordAnnotation), we use two general-purpose components.
The former maps the basic information (lemma, surface form, POS) obtained from the uima-igo output, and the latter "zips" the annotated features using the rules defined in japanese-pos-sub-category-zipping.xml. See Readme.md and the test cases in each repository for further details.
The three primitive analysis engines listed above make up the aggregate analysis engine JapaneseTagger, which is in turn a part of JapaneseSpotter.
There is no additional change in this process, but some normalizations that are unnecessary for Japanese (gender, mood, number, and so on) are omitted.
In addition to the stopword-based uima-filter commonly used by all language components, we also adopt uima-regex-filter, which filters out extracted terms by regular-expression matching. This is because Igo tends to identify sequences of non-Japanese characters as "noun", and such sequences cannot be filtered out by a simple stopword list. Both single-word terms (SWT) and multi-word terms (MWT) that consist only of symbols, parentheses, and brackets are filtered out in this process. Note that quite a few Japanese multi-word terms include Latin letters as components (e.g. "iPS細胞 [iPS cell]", "C型肝炎 [Hepatitis C]"), so an annotated term is kept as long as it contains at least one Japanese character.
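The filtering rule described above can be sketched in plain Java as follows. This is only an illustration and not the actual uima-regex-filter code or configuration: the class name JapaneseTermFilterSketch and its methods are hypothetical. It also assumes that "Japanese character" means any Hiragana, Katakana, or Han (kanji) character, and that a term is kept if and only if it contains at least one of them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of uima-regex-filter.
class JapaneseTermFilterSketch {

    // Matches a single Hiragana, Katakana, or Han (kanji) character.
    // Assumption: these three Unicode scripts define "Japanese characters".
    private static final Pattern JAPANESE_CHAR =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]");

    // Assumed rule: keep a term only if it contains at least one Japanese character.
    static boolean keep(String term) {
        return JAPANESE_CHAR.matcher(term).find();
    }

    // Returns the candidate terms that survive the filter, in their original order.
    static List<String> filter(List<String> terms) {
        return terms.stream()
                .filter(JapaneseTermFilterSketch::keep)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> candidates =
                Arrays.asList("iPS細胞", "C型肝炎", "()", "[*]", "---");
        // Prints the candidates that contain at least one Japanese character.
        System.out.println(filter(candidates));
    }
}
```

Under these assumptions, mixed terms such as "iPS細胞" and "C型肝炎" are kept, while candidates made up only of symbols, parentheses, or brackets are dropped. A purely Latin candidate such as "DNA" would also be dropped by this sketch; whether the real filter behaves that way depends on the regular expressions it is configured with.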
The other components (such as JapaneseTermSpotter, SpotterTSVWriter, and Contextualizer) have no local changes for Japanese.
No major change for Japanese.
No major change for Japanese.