
Tokenizers

List of the Japanese morphological analyzers and subword tokenizers supported by toiro.

Morphological Analysis Tokenizers

Janome

  • Type: Morphological analyzer
  • Dictionary: MeCab IPADIC
  • Features: Pure Python implementation, no external dependencies
  • Default: Included in toiro by default
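A minimal sketch of calling Janome directly; the sample sentence is an arbitrary example.

    from janome.tokenizer import Tokenizer

    tokenizer = Tokenizer()
    text = "すもももももももものうち"

    # Surface forms only (wakati-gaki segmentation)
    print(list(tokenizer.tokenize(text, wakati=True)))

    # Full morphological information (surface form + part of speech)
    for token in tokenizer.tokenize(text):
        print(token.surface, token.part_of_speech)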

nagisa

  • Type: RNN-based
  • Features: Supports POS tagging and named entity extraction
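A short sketch of nagisa's tagging API; the sentence and the noun filter ("名詞") are illustrative choices.

    import nagisa

    text = "Pythonで簡単に使えるツールです"

    # Word segmentation and POS tagging in one call
    words = nagisa.tagging(text)
    print(words.words)    # segmented words
    print(words.postags)  # corresponding POS tags

    # Keep only tokens with the selected POS tags (here: nouns)
    nouns = nagisa.extract(text, extract_postags=["名詞"])
    print(nouns.words)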

mecab-python3

  • Type: Morphological analyzer
  • Dictionary: MeCab IPADIC
  • Features: Python binding for MeCab
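A sketch of mecab-python3 usage, assuming a dictionary package such as unidic-lite (or a system IPADIC) is installed alongside it.

    import MeCab

    text = "形態素解析を行う"

    # -Owakati prints space-separated surface forms
    wakati = MeCab.Tagger("-Owakati")
    print(wakati.parse(text).split())

    # The default output format includes per-morpheme feature strings
    tagger = MeCab.Tagger()
    print(tagger.parse(text))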

SudachiPy

  • Type: Morphological analyzer
  • Dictionary: Sudachi dictionary
  • Features: Multiple split modes (A/B/C), synonym expansion
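A sketch of SudachiPy with an explicit split mode, assuming the sudachidict_core dictionary package is installed.

    from sudachipy import dictionary, tokenizer

    tokenizer_obj = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C  # A (short), B (middle), C (long units)

    text = "国家公務員"
    for m in tokenizer_obj.tokenize(text, mode):
        print(m.surface(), m.part_of_speech(), m.normalized_form())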

spaCy

  • Type: Statistical model-based
  • Features: Multi-functional including NER, dependency parsing
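A sketch using spaCy's Japanese pipeline; it assumes ja_core_news_sm has been downloaded (python -m spacy download ja_core_news_sm).

    import spacy

    nlp = spacy.load("ja_core_news_sm")
    doc = nlp("東京でラーメンを食べた。")

    # Tokens with POS tags and dependency labels
    for token in doc:
        print(token.text, token.pos_, token.dep_)

    # Named entities recognized by the pipeline
    for ent in doc.ents:
        print(ent.text, ent.label_)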

GiNZA

  • Type: Japanese model for spaCy
  • Features: Universal Dependencies compliant, NER
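A sketch of loading GiNZA through spaCy, assuming the ginza and ja_ginza packages are installed.

    import spacy

    nlp = spacy.load("ja_ginza")
    doc = nlp("銀座でランチをご一緒しましょう。")

    # Universal Dependencies style annotation
    for token in doc:
        print(token.orth_, token.lemma_, token.pos_, token.dep_, token.head.i)

    # Named entities
    for ent in doc.ents:
        print(ent.text, ent.label_)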

KyTea

  • Type: Pointwise prediction-based
  • Features: Word segmentation, POS tagging, and pronunciation estimation
  • Note: Requires system-level installation
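A sketch using the Mykytea-python binding; the model path is a placeholder assumption and depends on where KyTea and its model were installed.

    import Mykytea

    # Adjust the model path for your installation (placeholder)
    mk = Mykytea.Mykytea("-model /usr/local/share/kytea/model.bin")

    text = "今日はいい天気です。"

    # Word segmentation by pointwise prediction
    for word in mk.getWS(text):
        print(word)

    # Words with POS and pronunciation tags as a single string
    print(mk.getTagsToString(text))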

Juman++

  • Type: Morphological analyzer
  • Dictionary: JUMAN dictionary
  • Features: RNN-based re-ranking
  • Note: Requires system-level installation (used via pyknp)
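A sketch using pyknp, which calls the jumanpp binary installed on the system.

    from pyknp import Juman

    jumanpp = Juman()
    result = jumanpp.analysis("すもももももももものうち")

    # Each morpheme exposes surface form, reading, and POS
    for mrph in result.mrph_list():
        print(mrph.midasi, mrph.yomi, mrph.hinsi)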

fugashi

  • Type: Cython wrapper for MeCab
  • Dictionary: IPADIC or UniDic
  • Features: Fast MeCab Python binding
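A sketch of fugashi; with unidic-lite installed, Tagger() needs no arguments, and the available feature fields depend on the dictionary in use.

    from fugashi import Tagger

    tagger = Tagger()
    text = "麩菓子は麩を主材料とした日本の菓子。"

    # Iterate over morphemes; feature fields follow the UniDic schema here
    for word in tagger(text):
        print(word.surface, word.feature.pos1, word.feature.lemma)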

TinySegmenter

  • Type: Compact tokenizer
  • Features: Lightweight, dictionary-free
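A sketch assuming the common Python port of TinySegmenter, which exposes a TinySegmenter class with a tokenize() method; no dictionary or model download is needed.

    import tinysegmenter

    segmenter = tinysegmenter.TinySegmenter()
    print(segmenter.tokenize("私の名前は中野です"))
    # Segmentation is heuristic, so boundaries are approximate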

Subword Tokenizers

SentencePiece

  • Type: BPE / Unigram
  • Features: Language-independent, designed for neural machine translation
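A sketch of training and applying a SentencePiece model; corpus.txt, the model prefix, and the vocabulary size are placeholder assumptions.

    import sentencepiece as spm

    # Train a small unigram model on a raw-text corpus (one sentence per line)
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="ja_sp", vocab_size=8000, model_type="unigram"
    )

    # Load the trained model and encode into subwords or ids
    sp = spm.SentencePieceProcessor(model_file="ja_sp.model")
    print(sp.encode("これはテストです。", out_type=str))
    print(sp.encode("これはテストです。"))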

tiktoken

  • Type: BPE
  • Models: GPT-4o / GPT-5
  • Features: Tokenizer for OpenAI models
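A sketch of counting tokens with tiktoken; the "gpt-4o" lookup assumes a tiktoken version recent enough to know that model's encoding.

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")

    text = "日本語のトークン数を数える"
    token_ids = enc.encode(text)
    print(len(token_ids))         # number of tokens the model would see
    print(enc.decode(token_ids))  # round-trip back to the original string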

Choosing a Tokenizer

Use Case                        Recommendation
------------------------------  ------------------------------
Easy to start                   Janome (no dependencies)
High-speed processing           MeCab, fugashi, SudachiPy
Need NER                        GiNZA, spaCy
Neural machine translation      SentencePiece
Integration with OpenAI models  tiktoken

System-level installation required

KyTea and Juman++ require system-level installation before installing the Python package. Please refer to their official documentation for details.