Corpora & Data

The toiro.datadownloader module downloads Japanese text classification corpora and loads them as pandas DataFrames.

Available Corpora

livedoor News Corpus

Task: Text classification (news category)
Number of categories: 9
Categories: Topic News, Sports Watch, IT Life Hack, Kaden Channel, MOVIE ENTER, Dokujo Tsushin, S-MAX, livedoor HOMME, Peachy
Source: livedoor News Corpus

Yahoo! Movie Reviews

Task: Sentiment analysis (positive/negative)
Domain: Movie reviews
Format: Review text and rating scores

Amazon Reviews

Task: Sentiment analysis (positive/negative)
Domain: Product reviews
Format: Review text and rating scores

Basic Usage

List available corpora

from toiro import datadownloader

corpora = datadownloader.available_corpus()
print(corpora)
# => ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews']

Download a corpus

datadownloader.download_corpus("livedoor_news_corpus")

Corpora are downloaded to the ~/.toiro/ directory.

Load a corpus

train_df, dev_df, test_df = datadownloader.load_corpus("livedoor_news_corpus")

# Check data
print(f"Train: {len(train_df)} samples")
print(f"Dev: {len(dev_df)} samples")
print(f"Test: {len(test_df)} samples")

# Data structure (pandas DataFrame)
# Column 0: label
# Column 1: text
texts = train_df[1].tolist()
labels = train_df[0].tolist()
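
Once the labels are in a plain list, it can be useful to check how they are distributed before training. The following is a minimal sketch using only the standard library. `label_distribution` is a hypothetical helper, not part of toiro, and the sample labels are made up for illustration.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, fraction)}, most frequent label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    # The comprehension is empty for an empty input, so no division by zero occurs.
    return {label: (n, n / total) for label, n in counts.most_common()}


# Illustrative labels only (assumption); in practice pass train_df[0].tolist().
sample_labels = ["label_a", "label_b", "label_a", "label_c"]
for label, (n, frac) in label_distribution(sample_labels).items():
    print(f"{label}: {n} ({frac:.0%})")
```

Running the same helper on the dev and test labels gives a quick check that the three splits have similar class balance.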

Data Preprocessing

Downloaded corpora are provided as pandas DataFrames, pre-split into train/dev/test sets.

# Example: Extract texts only for tokenizer evaluation
texts = train_df[1].tolist()

# Example: Extract labels and texts for classifier training
X_train = train_df[1].tolist()
y_train = train_df[0].tolist()
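
Many classifiers expect integer class ids rather than raw label values. The sketch below shows one way to build that mapping from the training labels. `build_label_index` and `encode_labels` are hypothetical helpers, not toiro functions, and the label strings are placeholders.

```python
def build_label_index(labels):
    """Map each distinct label to an integer id, sorted for reproducibility."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def encode_labels(labels, label_to_id):
    """Convert labels to ids; raises KeyError for a label not seen in training."""
    return [label_to_id[label] for label in labels]


# Placeholder labels (assumption); in practice use y_train = train_df[0].tolist().
y_train = ["label_b", "label_c", "label_b", "label_a"]
label_to_id = build_label_index(y_train)
y_train_ids = encode_labels(y_train, label_to_id)
print(label_to_id)
print(y_train_ids)
```

Building the index from the training split only, then reusing it for dev and test, keeps the ids consistent across splits and surfaces any label that never appears in the training data.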

Data storage location

Downloaded corpora are saved in the ~/.toiro/ directory. Once downloaded, you can load them directly with load_corpus().
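
To see what is currently stored there, the directory can be inspected with the standard library. This is a minimal sketch: `list_cached_entries` is a hypothetical helper, not part of toiro, and the layout of files inside ~/.toiro/ is not documented here, so the function only lists top-level entry names.

```python
from pathlib import Path


def list_cached_entries(cache_dir=None):
    """Return sorted entry names in the cache directory, or [] if it is missing."""
    # Defaults to ~/.toiro, the download location described above.
    cache_dir = Path(cache_dir) if cache_dir is not None else Path.home() / ".toiro"
    if not cache_dir.is_dir():
        return []
    return sorted(p.name for p in cache_dir.iterdir())


if __name__ == "__main__":
    print(list_cached_entries())
```

An empty list means either that nothing has been downloaded yet or that the directory does not exist.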