Medical Natural Language Processing for AI Chat (MedNLP-CHAT), one of the core tasks in NTCIR-18, aims to evaluate medical chatbots from multiple viewpoints. Medical chatbot services are a promising solution to human resource problems in medicine and healthcare. However, the risks of chatbots are not well understood. We create a testbed of potential chatbot responses evaluated from various aspects: medical validity, legal viewpoints, ethical issues, etc.


Task Overview

    • A pair of a patient’s question and a chatbot answer
    • Objective evaluation by a specialist: Binary class (TRUE/FALSE)
      • medicalRisk
      • ethicalRisk
      • legalRisk
    • Subjective evaluation by the general public: A probability distribution of evaluations on a 5-point scale from -2 to 2
      • fluency
      • helpfulness
      • harmlessness
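
As a rough illustration of the label structure above, the following Python sketch models one question-answer pair with its six labels. The class name `LabeledAnswer`, the helper `rating_distribution`, and the example text are hypothetical and not part of the official data format. The sketch also assumes that each subjective distribution is the normalized count of individual ratings, which the task description does not state.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

# 5-point scale from the task description.
SCALE = (-2, -1, 0, 1, 2)


def rating_distribution(ratings: List[int]) -> Dict[int, float]:
    """Turn raw ratings into a probability distribution over SCALE.

    Assumption: the distribution is the share of raters who chose each score.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in SCALE for r in ratings):
        raise ValueError(f"ratings must be one of {SCALE}")
    counts = Counter(ratings)
    return {score: counts[score] / len(ratings) for score in SCALE}


@dataclass
class LabeledAnswer:
    """One question-answer pair with its six labels (hypothetical layout)."""
    question: str
    answer: str
    # Objective labels, assigned by a specialist (TRUE/FALSE).
    medicalRisk: bool
    ethicalRisk: bool
    legalRisk: bool
    # Subjective labels, each a distribution over the 5-point scale.
    fluency: Dict[int, float]
    helpfulness: Dict[int, float]
    harmlessness: Dict[int, float]


# Made-up example values, for illustration only.
example = LabeledAnswer(
    question="(patient question)",
    answer="(chatbot answer)",
    medicalRisk=False,
    ethicalRisk=False,
    legalRisk=True,
    fluency=rating_distribution([2, 2, 1, 0]),
    helpfulness=rating_distribution([1, 0, 0, -1]),
    harmlessness=rating_distribution([-2, -1, 0, 0]),
)
print(example.fluency)
```

Each distribution sums to 1, so a system's predicted distribution can be compared with it directly, whatever comparison measure is chosen.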


Japanese domain dataset

  • The data consists of a question, an answer, and a set of labels for the answer: objective labels (risks) based on Japanese laws and medical guidelines, and subjective labels (fluency, helpfulness, and harmlessness) assigned by Japanese people [README].
  • Data size: We are preparing 200 sets of {Question, Answer, Answer labels}.
    • Both the questions and the answers are written by humans with reference to responses from a chatbot.
    • Answer labels represent the evaluation of each answer and are the targets to be estimated in this task. There are six labels: three objective labels (medicalRisk, ethicalRisk, and legalRisk) assigned by experts based on Japanese laws and medical guidelines, and three subjective labels (fluency, helpfulness, and harmlessness) assigned by Japanese crowdworkers.
  • Languages: Japanese (JA), English (EN), German (DE), and French (FR).
    • Step 1: Japanese data is created.
    • Step 2: It is translated into the other languages.
    • Note: Chinese (ZH) and Arabic (AR) might be included.
Example of Japanese domain data with English translations

German domain dataset (TBA)

  • The data consists of a question, an answer, and a set of labels for the answer
  • Data size: TBA
  • Languages: German (DE), TBD




Eiji Aramaki, Ph.D. (NAIST, Japan)
Shoko Wakamiya, Ph.D. (NAIST, Japan)
Shuntaro Yada, Ph.D. (NAIST, Japan)
Tomohiro Nishiyama (NAIST, Japan)
Peitao Han (NAIST, Japan)
Lisa Raithel, Ph.D. (DFKI, Germany; TU Berlin, Germany)
Roland Roller, Ph.D. (DFKI, Germany)
Philippe Thomas, Ph.D. (DFKI, Germany)
Hui-Syuan Yeh (Université Paris-Saclay, CNRS, LISN, France)
Pierre Zweigenbaum, Ph.D. (Université Paris-Saclay, CNRS, LISN, France)


Ryuma Shineha, Ph.D. (Osaka University, Japan)