Assessing Dialogue Systems with Distribution Distances

We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.

To appear in Findings of ACL 2021.

Note that this is not an officially supported Tencent product.

1. Configuration

This repository requires the following packages:

2. Usage

To evaluate the system-level correlation between a metric and human judgments:

python eval_metric.py \
  --data_path ./datasets/convai2_annotation.json \
  --metric fbd \
  --sample_num 10 \
  --model_type roberta-base \
  --batch_size 32
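For reference, below is a minimal sketch of how sentence-level features can be extracted with the roberta-base encoder named above, using the Hugging Face transformers library. The function name and the mean-pooling strategy are illustrative assumptions, not necessarily what eval_metric.py does internally:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
model.eval()

def extract_features(sentences, batch_size=32):
    # Returns an (n_sentences, hidden_dim) numpy array of pooled features.
    feats = []
    with torch.no_grad():
        for i in range(0, len(sentences), batch_size):
            enc = tokenizer(sentences[i:i + batch_size], padding=True,
                            truncation=True, return_tensors="pt")
            hidden = model(**enc).last_hidden_state
            # Mean-pool over non-padding tokens (assumption: the repo may
            # instead use the <s> token or another pooling strategy).
            mask = enc["attention_mask"].unsqueeze(-1).float()
            feats.append((hidden * mask).sum(1) / mask.sum(1))
    return torch.cat(feats).numpy()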

Currently, our repo supports the common metrics used in the text generation field, including bleu, meteor, rouge, greedy, average, extrema, bert_score, fbd, and prd (any of these can be passed to --metric).
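As background on the fbd metric: it fits one Gaussian to encoder features of real conversations and another to features of generated conversations, then computes the Fréchet distance between the two. Here is a minimal sketch of that computation using the standard FID-style closed form; frechet_distance is a hypothetical name for illustration, not the repo's API:

import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    # feats_real, feats_gen: (n_samples, dim) arrays of sentence features,
    # e.g. pooled RoBERTa hidden states for real and generated dialogues.
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)

    diff = mu1 - mu2
    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which we discard.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

With features from the extraction sketch above, frechet_distance(extract_features(real), extract_features(generated)) yields a single system-level score; lower values mean the generated distribution is closer to the real one.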

Here are some details of the six corpora compared in the main paper:

| File Name | Dataset Name | Num. of Samples | Reference |
| --- | --- | --- | --- |
| personam_annotation.json | Persona(M) | 60 | Shikib/usr |
| dailyh_annotation.json | Daily(H) | 150 | li3cmz/GRADE |
| convai2_annotation.json | Convai2 | 150 | li3cmz/GRADE |
| empathetic_annotation.json | Empathetic | 150 | li3cmz/GRADE |
| dailyz_annotation.json | Daily(Z) | 100 | ZHAOTING/dialog-processing |
| personaz_annotation.json | Persona(Z) | 150 | ZHAOTING/dialog-processing |

Citation

If you use this research, codebase, or dataset, please cite our paper:

@article{xiang2021assessing,
  title={Assessing Dialogue Systems with Distribution Distances},
  author={Xiang, Jiannan and Liu, Yahui and Cai, Deng and Li, Huayang and Lian, Defu and Liu, Lemao},
  journal={arXiv preprint arXiv:2105.02573},
  year={2021}
}

GitHub

https://github.com/yhlleo/frechet-bert-distance