toiro Documentation

toiro is a Python package for comparing Japanese tokenizers. You can:

  • Compare processing speed across tokenizers
  • Compare tokenization outputs side by side
  • Evaluate downstream task performance (e.g., text classification)
  • Use helper utilities for Japanese NLP (corpus download/preprocessing, simple classifiers, etc.)

Key Features

Supported Tokenizers

13 Japanese tokenizers and BPE models:

  • janome (included by default)
  • nagisa
  • mecab-python3
  • sudachipy
  • spacy
  • ginza
  • kytea
  • jumanpp
  • sentencepiece
  • fugashi (with the ipadic or unidic dictionary; counted once per dictionary)
  • tinysegmenter
  • tiktoken (BPE for GPT-4o / GPT-5)
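The default install bundles only janome; per the project README, an extras group enables the remaining tokenizers (a sketch, and the extras name `all_tokenizers` is taken from the README and may change between releases):

```shell
# Base install (includes the janome tokenizer)
pip install toiro

# Optionally pull in the other supported tokenizers
pip install toiro[all_tokenizers]
```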

  • Project: https://github.com/taishi-i/toiro
  • Demo (Hugging Face Spaces): https://huggingface.co/spaces/taishi-i/Japanese-Tokenizer-Comparison
  • PyPI: https://pypi.org/project/toiro/

Supported Python Versions

Python 3.10 or later is recommended.

License

toiro is released under the Apache License 2.0.

This documentation site is generated with Material for MkDocs.