toiro Documentation

toiro is a Python package for comparing Japanese tokenizers. You can:
- Compare processing speed across tokenizers
- Compare tokenization outputs side by side
- Evaluate downstream task performance (e.g., text classification)
- Use helper utilities for Japanese NLP (corpus download/preprocessing, simple classifiers, etc.)

Key Features
Supported Tokenizers
13 Japanese tokenizers and BPE models:
- janome (included by default)
- nagisa
- mecab-python3
- sudachipy
- spacy
- ginza
- kytea
- jumanpp
- sentencepiece
- fugashi (ipadic/unidic)
- tinysegmenter
- tiktoken (BPE for GPT-4o / GPT-5)
Links
👉 Project: https://github.com/taishi-i/toiro 👉 Demo (Hugging Face Spaces): https://huggingface.co/spaces/taishi-i/Japanese-Tokenizer-Comparison 👉 PyPI: https://pypi.org/project/toiro/
Supported Python Versions
Python 3.10 or later is recommended.
License
toiro is released under the Apache License 2.0.
This documentation site is generated with MkDocs Material.