JGLUE: Japanese General Language Understanding Evaluation

JGLUE (Japanese General Language Understanding Evaluation) is a benchmark for measuring general natural language understanding (NLU) ability in Japanese. It was constructed from scratch, without translation. We hope that JGLUE will facilitate NLU research in Japanese.

JGLUE was constructed in a joint research project between Yahoo Japan Corporation and the Kawahara Lab at Waseda University.


JGLUE covers three kinds of tasks: text classification, sentence pair classification, and QA. Each task consists of multiple datasets, which can be found under the datasets directory. Only the train and dev sets are available now; the test sets will be released after the leaderboard is made public. Yahoo! Crowdsourcing was used for all crowdsourcing tasks in constructing the datasets.

Task                          Dataset         Train    Dev    Test
Text Classification           MARC-ja         187,528  5,654  5,639
Sentence Pair Classification  JSTS            12,451   1,457  1,589
                              JNLI            20,073   2,434  2,508
QA                            JSQuAD          62,859   4,442  4,420
                              JCommonsenseQA  8,939    1,119  1,118

†JCoLA will be added soon.

Dataset Description


MARC-ja is a text classification dataset. It is based on the Japanese portion of the Multilingual Amazon Reviews Corpus (MARC) (Keung+, 2020).

We performed the following modifications to the original dataset:

  1. To make it easy for both humans and computers to judge a class label, we cast the text classification task as a binary classification task: 1- and 2-star ratings are converted to negative, and 4- and 5-star ratings are converted to positive. We do not use reviews with a 3-star rating.
  2. In some instances the rating diverges from the review text. To improve the quality of the dev/test instances, we crowdsourced a positive/negative judgment task, kept only the reviews on which seven or more out of 10 workers gave the same vote, and assigned the majority label to these reviews.
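The two rules above can be sketched in a few lines of Python. This is only an illustration of the labeling logic as described here, not the code in marc-ja.py; the function names `star_to_label` and `adopt_by_votes` are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def star_to_label(stars: int) -> Optional[str]:
    """Map a 1-5 star rating to a binary label; 3-star reviews are dropped."""
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    return None  # 3-star (or out-of-range) ratings are not used


def adopt_by_votes(votes: Sequence[str], min_agree: int = 7) -> Optional[str]:
    """Return the majority label if at least `min_agree` workers chose it.

    Returns None when agreement is too low, meaning the review is discarded.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


print(star_to_label(5))                                      # positive
print(adopt_by_votes(["positive"] * 8 + ["negative"] * 2))   # positive
print(adopt_by_votes(["positive"] * 6 + ["negative"] * 4))   # None
```

With 10 workers and a threshold of seven, a tie can never pass the filter, so the majority label is always unambiguous for the reviews that are kept.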

We don’t distribute the dataset itself. Please download the original dataset, and run a conversion script as follows:

  1. Download https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
  2. Run the following commands:

$ pip install -r preprocess/requirements.txt
$ cd preprocess/marc-ja/scripts
$ gzip -dc /somewhere/amazon_reviews_multilingual_JP_v1_00.tsv.gz | \
  python marc-ja.py \
         --positive-negative \
         --output-dir ../../../datasets/marc_ja-v1.0 \
         --max-char-length 500 \
         --filter-review-id-list-valid ../data/filter_review_id_list/valid.txt \
         --label-conv-review-id-list-valid ../data/label_conv_review_id_list/valid.txt

The train and valid sets will be generated under the datasets/marc_ja-v1.0 directory.

When you use this dataset, please follow the license of Multilingual Amazon Reviews Corpus (MARC).


JSTS is a Japanese version of the STS (Semantic Textual Similarity) dataset. STS is a task to estimate the semantic similarity of a sentence pair. The sentences in JSTS and JNLI (described below) are extracted from the Japanese version of the MS COCO Caption Dataset, the YJ Captions Dataset (Miyazaki and Shimizu, 2016).

{"sentence_pair_id": "691",
 "yjcaptions_id": "127202-129817-129818",
 "sentence1": "街中の道路を大きなバスが走っています。 (A big bus is running on the road in the city.)", 
 "sentence2": "道路を大きなバスが走っています。 (There is a big bus running on the road.)", 
 "label": "4.4"}

(Note that English translations are added in this example for those who do not understand Japanese, and are not included in the dataset.)

Name              Description
sentence_pair_id  id
yjcaptions_id     sentence ids in yjcaptions (explained below)
sentence1         first sentence
sentence2         second sentence
label             sentence similarity: 5 (equivalent meaning) to 0 (completely different meaning)

Explanation for yjcaptions_id

There are the following two cases:

  1. sentence pairs in one image: (image id)-(sentence1 id)-(sentence2 id)
    • e.g., 723-844-847
    • a sentence id starting with “g” means a sentence generated by a crowdworker (e.g., 69501-75698-g103): only for JNLI
  2. sentence pairs in two images: (image id of sentence1)_(image id of sentence2)-(sentence1 id)-(sentence2 id)
    • e.g., 91337_217583-96105-91680
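As a rough illustration of the two id layouts above, the following sketch splits a yjcaptions_id into its parts. The class and function names are hypothetical and are not part of the JGLUE repository.

```python
from typing import NamedTuple, Tuple


class YJCaptionsId(NamedTuple):
    image_ids: Tuple[str, ...]  # one id (same image) or two ids (two images)
    sentence1_id: str
    sentence2_id: str

    @property
    def crowd_generated(self) -> bool:
        """True if either sentence id starts with "g" (JNLI only)."""
        return any(s.startswith("g") for s in (self.sentence1_id, self.sentence2_id))


def parse_yjcaptions_id(value: str) -> YJCaptionsId:
    """Split "(image part)-(sentence1 id)-(sentence2 id)" into fields.

    The image part is either a single image id or two ids joined by "_".
    """
    parts = value.split("-")
    if len(parts) != 3:
        raise ValueError(f"unexpected yjcaptions_id format: {value!r}")
    image_part, sentence1_id, sentence2_id = parts
    return YJCaptionsId(tuple(image_part.split("_")), sentence1_id, sentence2_id)


for example in ("723-844-847", "69501-75698-g103", "91337_217583-96105-91680"):
    parsed = parse_yjcaptions_id(example)
    print(parsed, parsed.crowd_generated)
```

The number of entries in `image_ids` distinguishes the two cases: one entry for a pair taken from a single image, two entries for a pair taken from two images.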


JNLI is a Japanese version of the NLI (Natural Language Inference) dataset. NLI is a task to recognize the inference relation that a premise sentence has to a hypothesis sentence. The inference relations are entailment, contradiction, and neutral.

{"sentence_pair_id": "1157",
 "yjcaptions_id": "127202-129817-129818",
 "sentence1": "街中の道路を大きなバスが走っています。 (A big bus is running on the road in the city.)", 
 "sentence2": "道路を大きなバスが走っています。 (There is a big bus running on the road.)", 
 "label": "entailment"}

Name              Description
sentence_pair_id  id
yjcaptions_id     sentence ids in yjcaptions
sentence1         premise sentence
sentence2         hypothesis sentence
label             inference relation
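A minimal sketch of reading records with these fields is shown below. It assumes the file stores one JSON object per line, which is not stated above, and the mapping from label strings to integer ids is an arbitrary choice for illustration. The function name `read_jnli` is hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator

# Assumption: this label order (and therefore the integer ids) is arbitrary.
NLI_LABELS = ("entailment", "contradiction", "neutral")
REQUIRED_FIELDS = ("sentence_pair_id", "yjcaptions_id", "sentence1", "sentence2", "label")


def read_jnli(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Parse JNLI-style records, one JSON object per line (assumed layout)."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        if record["label"] not in NLI_LABELS:
            raise ValueError(f"line {line_no}: unknown label {record['label']!r}")
        record["label_id"] = NLI_LABELS.index(record["label"])
        yield record


# The example record from this section, without the English glosses.
sample = json.dumps({
    "sentence_pair_id": "1157",
    "yjcaptions_id": "127202-129817-129818",
    "sentence1": "街中の道路を大きなバスが走っています。",
    "sentence2": "道路を大きなバスが走っています。",
    "label": "entailment",
}, ensure_ascii=False)

for record in read_jnli([sample]):
    print(record["sentence_pair_id"], record["label"], record["label_id"])
```

In practice the iterable would be an open file handle, for example `read_jnli(open(path, encoding="utf-8"))`, where `path` points at a train or dev file under the datasets directory.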


JSQuAD is a Japanese version of SQuAD (Rajpurkar+, 2016), a reading comprehension dataset. Each instance consists of a question about a given context (a Wikipedia article) and its answer. JSQuAD is based on SQuAD 1.1, so there are no unanswerable questions. We used the Japanese Wikipedia dump as of 20211101.

The JSON format is the same as the original SQuAD.

{
  "title": "東海道新幹線 (Tokaido Shinkansen)",
  "paragraphs": [
    {
      "qas": [
        {
          "question": "2020年(令和2年)3月現在、東京駅 - 新大阪駅間の最高速度はどのくらいか。 (What is the maximum speed between Tokyo Station and Shin-Osaka Station as of March 2020?)",
          "id": "a1531320p0q0",
          "answers": [
            {
              "text": "285 km/h",
              "answer_start": 182
            }
          ],
          "is_impossible": false
        }
      ],
      "context": "東海道新幹線 [SEP] 1987年(昭和62年)4月1日の国鉄分割民営化により、JR東海が運営を継承した。西日本旅客鉄道(JR西日本)が継承した山陽新幹線とは相互乗り入れが行われており、東海道新幹線区間のみで運転される列車にもJR西日本所有の車両が使用されることがある。2020年(令和2年)3月現在、東京駅 - 新大阪駅間の所要時間は最速2時間21分、最高速度285 km/hで運行されている。"
    }
  ]
}

Name           Description
title          title of a Wikipedia article
paragraphs     a set of paragraphs
qas            a set of pairs of a question and its answer
question       question
id             id of a question
answers        a set of answers
text           answer text
answer_start   start position (character index)
is_impossible  all the values are false
context        a concatenation of the title and paragraph
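Because answer_start is a character index into context, a quick consistency check is to slice the context and compare the result with the answer text. The sketch below does this for one article in the structure described above. The function name and the toy article are hypothetical and are not taken from the dataset.

```python
from typing import Iterator, Tuple


def iter_misaligned_answers(article: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (question id, answer text, text found at answer_start)
    for every answer whose span does not match the context."""
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                start = answer["answer_start"]
                text = answer["text"]
                # Python str indices count Unicode code points, i.e. characters.
                found = context[start:start + len(text)]
                if found != text:
                    yield qa["id"], text, found


# Toy article for illustration only; "[SEP]" joins title and paragraph.
toy_context = "富士山 [SEP] 富士山の標高は3776 mである。"
good_start = toy_context.index("3776 m")
toy_article = {
    "title": "富士山",
    "paragraphs": [{
        "qas": [
            {"question": "富士山の標高は?", "id": "toy-q0",
             "answers": [{"text": "3776 m", "answer_start": good_start}],
             "is_impossible": False},
            {"question": "富士山の標高は?", "id": "toy-q1",
             "answers": [{"text": "3776 m", "answer_start": good_start + 1}],
             "is_impossible": False},
        ],
        "context": toy_context,
    }],
}

for question_id, expected, found in iter_misaligned_answers(toy_article):
    print(question_id, repr(expected), repr(found))
```

Only the second toy question is reported, since its answer_start is deliberately shifted by one character.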


JCommonsenseQA is a Japanese version of CommonsenseQA (Talmor+, 2019), which is a multiple-choice question answering dataset that requires commonsense reasoning ability. It is built using crowdsourcing with seeds extracted from the knowledge base ConceptNet.

{"q_id": 3016,
 "question": "会社の最高責任者を何というか? (What do you call the chief executive officer of a company?)",
 "choice0": "社長 (president)",
 "choice1": "教師 (teacher)",
 "choice2": "部長 (manager)",
 "choice3": "バイト (part-time worker)",
 "choice4": "部下 (subordinate)",
 "label": 0}

Name          Description
q_id          id
question      question
choice{0..4}  choice
label         correct choice id
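Since label is the index of the correct choice, the answer text can be looked up by building the field name from it. A minimal sketch, with hypothetical helper names:

```python
from typing import List

NUM_CHOICES = 5  # choice0 .. choice4


def list_choices(record: dict) -> List[str]:
    """Return the five choices in order."""
    return [record[f"choice{i}"] for i in range(NUM_CHOICES)]


def correct_choice(record: dict) -> str:
    """Return the text of the choice that `label` points at."""
    label = record["label"]
    if not 0 <= label < NUM_CHOICES:
        raise ValueError(f"label out of range: {label}")
    return record[f"choice{label}"]


# The example record from this section, without the English glosses.
record = {
    "q_id": 3016,
    "question": "会社の最高責任者を何というか?",
    "choice0": "社長",
    "choice1": "教師",
    "choice2": "部長",
    "choice3": "バイト",
    "choice4": "部下",
    "label": 0,
}

print(correct_choice(record))  # 社長
```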

Baseline Scores

The following foundation models are used for the evaluation.

Model                    Basic Unit                      Pretraining Texts
Tohoku BERT base         subword (MeCab + BPE)           Japanese Wikipedia
Tohoku BERT base (char)  character                       Japanese Wikipedia
NICT BERT base           subword (MeCab + BPE)           Japanese Wikipedia
Waseda RoBERTa base      subword (Juman++ + Unigram LM)  Japanese Wikipedia + CC
XLM RoBERTa base         subword (Unigram LM)            multi-lingual CC

The large-sized counterparts of Tohoku BERT base, Waseda RoBERTa base, and XLM RoBERTa base are also evaluated.

When you use the NICT BERT base or Waseda RoBERTa base models, the dataset text must be segmented into words in advance with the corresponding morphological analyzer:

  • NICT BERT base: MeCab (0.996) with JUMAN dictionary
  • Waseda RoBERTa base: Juman++ (2.0.0-rc3)

Please refer to preprocess/morphological-analysis/README.md.

The fine-tuning was performed using the transformers library provided by Hugging Face. See fine-tuning/README.md for details.

The performance along with human scores on the JGLUE dev set is shown below.

Model                    MARC-ja  JSTS              JNLI   JSQuAD       JCommonsenseQA
                         acc      Pearson/Spearman  acc    EM/F1        acc
Human                    0.989    0.899/0.861       0.925  0.871/0.944  0.986
Tohoku BERT base         0.958    0.899/0.859       0.899  0.871/0.941  0.808
Tohoku BERT base (char)  0.956    0.882/0.841       0.892  0.864/0.937  0.718
Tohoku BERT large        0.955    0.908/0.870       0.900  0.880/0.946  0.816
NICT BERT base           0.958    0.903/0.867       0.902  0.897/0.947  0.823
Waseda RoBERTa base      0.962    0.901/0.865       0.895  0.864/0.927  0.840
Waseda RoBERTa large     0.954    0.923/0.891       0.924  0.884/0.940  0.901
XLM RoBERTa base         0.961    0.870/0.825       0.893  -/-†         0.687
XLM RoBERTa large        0.964    0.915/0.882       0.919  -/-†         0.840

†XLM RoBERTa base/large models use the unigram language model as a tokenizer and they are excluded from the JSQuAD evaluation because the token delimitation and the start/end of the answer span often do not match, resulting in poor performance.
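For reference, the accuracy and Pearson/Spearman metrics named in the table can be sketched in plain Python as follows. This is a textbook formulation for illustration, not the evaluation code used for the baselines, which may differ in details such as tie handling. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def accuracy(gold: Sequence[object], predicted: Sequence[object]) -> float:
    """Fraction of positions where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


gold = [4.4, 0.0, 2.5, 3.0]       # toy similarity scores, not dataset values
predicted = [4.0, 0.5, 2.0, 3.5]  # toy model outputs
print(pearson(gold, predicted), spearman(gold, predicted))
print(accuracy(["entailment", "neutral"], ["entailment", "contradiction"]))
```

Spearman depends only on the ordering of the scores, so a model that ranks sentence pairs correctly but is poorly calibrated can score higher on Spearman than on Pearson.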


A leaderboard will be made public soon. The test set will be released at that time.


  title = "JGLUE: Japanese General Language Understanding Evaluation",
  author = "Kentaro Kurihara and 
      Daisuke Kawahara and
      Tomohide Shibata",
  booktitle = "Proceedings of the 13th Language Resources and Evaluation Conference",
  year = "2022",
  publisher = "European Language Resources Association (ELRA)",
  note = "to appear"

  author = "栗原健太郎 and 河原大輔 and 柴田知秀",
  title = "JGLUE: 日本語言語理解ベンチマーク",
  booktitle = "言語処理学会第28回年次大会",
  year = "2022",
  url = "https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E8-4.pdf",
  note = "in Japanese"


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Contributor License Agreement

This project requires contributors to accept the terms in the Contributor License Agreement (CLA).

Please note that contributors to the JGLUE repository on GitHub (https://github.com/yahoojapan/JGLUE) shall be deemed to have accepted the CLA without individual written agreements.

