# Tokenizers
A list of the Japanese tokenizers and subword tokenization models supported by toiro.
## Morphological Analysis Tokenizers
### Janome
- Type: Morphological analyzer
- Dictionary: MeCab IPADIC
- Features: Pure Python implementation, no external dependencies
- Default: Included in toiro by default
### nagisa
- Type: RNN-based
- Features: Supports POS tagging and named entity extraction
### mecab-python3
- Type: Morphological analyzer
- Dictionary: MeCab IPADIC
- Features: Python binding for MeCab
### SudachiPy
- Type: Morphological analyzer
- Dictionary: Sudachi dictionary
- Features: Multiple split modes (A/B/C), synonym expansion
### spaCy
- Type: Statistical model-based
- Features: Multi-functional including NER, dependency parsing
### GiNZA
- Type: Japanese model for spaCy
- Features: Universal Dependencies compliant, NER
### KyTea
- Type: Pointwise prediction-based
- Features: Word segmentation, POS tagging, and pronunciation estimation
- Note: Requires system-level installation
### Juman++
- Type: Morphological analyzer
- Dictionary: JUMAN dictionary
- Features: RNN-based re-ranking
- Note: Requires system-level installation (used via pyknp)
### fugashi
- Type: Cython wrapper for MeCab
- Dictionary: IPADIC or UniDic
- Features: Fast MeCab Python binding
### TinySegmenter
- Type: Compact tokenizer
- Features: Lightweight, dictionary-free
## Subword Tokenizers
### SentencePiece
- Type: BPE / Unigram
- Features: Language-independent, designed for neural machine translation
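The core BPE idea — repeatedly merging the most frequent adjacent symbol pair — can be shown in a short pure-Python sketch. This is a toy illustration of the merge loop, not SentencePiece's actual implementation:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a toy corpus of words.

    Each word starts as a tuple of characters; at every step the most
    frequent adjacent symbol pair is merged into a single symbol.
    """
    vocab = Counter()
    for word in corpus:
        vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Apply the winning merge to every word in the vocabulary
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "newest", "newest"], num_merges=5)
print(merges)
```

Real subword models add a learned vocabulary size, special tokens, and (for SentencePiece) whitespace-aware preprocessing on top of this loop.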
### tiktoken
- Type: BPE
- Models: GPT-4o / GPT-5
- Features: Tokenizer for OpenAI models
## Choosing a Tokenizer
| Use Case | Recommendation |
|---|---|
| Getting started quickly | Janome (no external dependencies) |
| High-speed processing | MeCab, fugashi, SudachiPy |
| Need NER | GiNZA, spaCy |
| Neural machine translation | SentencePiece |
| Integration with OpenAI models | tiktoken |
## System-level installation required
KyTea and Juman++ require system-level installation before installing the Python package. Please refer to their official documentation for details.