# Tokenizers
A list of the Japanese tokenizers and subword tokenization models supported by toiro.
## Morphological Analysis Tokenizers
### Janome
- Type: Morphological analyzer
- Dictionary: MeCab IPADIC
- Features: Pure Python implementation, no external dependencies
- Default: Included in toiro by default
### nagisa
- Type: RNN-based
- Features: Supports POS tagging and named entity extraction
### mecab-python3
- Type: Morphological analyzer
- Dictionary: MeCab IPADIC
- Features: Python binding for MeCab
### SudachiPy
- Type: Morphological analyzer
- Dictionary: Sudachi dictionary
- Features: Multiple split modes (A/B/C), synonym expansion
### spaCy
- Type: Statistical model-based
- Features: Multi-functional including NER, dependency parsing
### GiNZA
- Type: Japanese model for spaCy
- Features: Universal Dependencies compliant, NER
### KyTea
- Type: Pointwise prediction-based
- Features: Word segmentation, POS tagging, and pronunciation estimation
- Note: Requires system-level installation
### Juman++
- Type: Morphological analyzer
- Dictionary: JUMAN dictionary
- Features: RNN-based re-ranking
- Note: Requires system-level installation (used via pyknp)
### fugashi
- Type: Cython wrapper for MeCab
- Dictionary: IPADIC or UniDic
- Features: Fast MeCab Python binding
### TinySegmenter
- Type: Compact tokenizer
- Features: Lightweight, dictionary-free
## Subword Tokenizers
### SentencePiece
- Type: BPE / Unigram
- Features: Language-independent, designed for neural machine translation
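The core BPE idea — repeatedly merging the most frequent adjacent symbol pair — can be shown in a short pure-Python sketch. This is a toy illustration of the merge loop, not SentencePiece's actual implementation:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a toy corpus of words.

    Each word starts as a tuple of characters; at every step the most
    frequent adjacent symbol pair is merged into a single symbol.
    """
    vocab = Counter()
    for word in corpus:
        vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Apply the winning merge to every word in the vocabulary
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "newest", "newest"], num_merges=5)
print(merges)
```

Real subword models add a learned vocabulary size, special tokens, and (for SentencePiece) whitespace-aware preprocessing on top of this loop.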
### tiktoken
- Type: BPE
- Models: GPT-4o / GPT-5
- Features: Tokenizer for OpenAI models
## Choosing a Tokenizer
| Use Case | Recommendation |
|---|---|
| Getting started quickly | Janome (no external dependencies) |
| High-speed processing | MeCab, fugashi, SudachiPy |
| Need NER | GiNZA, spaCy |
| Neural machine translation | SentencePiece |
| Integration with OpenAI models | tiktoken |
## System-level installation required
KyTea and Juman++ require system-level installation before installing the Python package. Please refer to their official documentation for details.