MedNLP-CHAT

News

About MedNLP-CHAT

A medical chatbot service is a promising solution to human resource problems in medicine and healthcare. However, the potential risks of using such chatbots are not well understood. Medical Natural Language Processing for AI Chat (MedNLP-CHAT), one of the core tasks in NTCIR-18, aims to evaluate medical chatbots from multiple viewpoints, namely medical, legal, and ethical aspects. In this shared task, participants analyze a given medical question together with the corresponding chatbot response and make a binary judgment on whether the response creates a possible medical, legal, or ethical risk.

Task Overview

This task is to determine whether a chatbot’s answer to a medical question is appropriate. Judgments are made from multiple perspectives. 

  • INPUT
    • A pair of a patient’s question and a chatbot answer
  • OUTPUT
    • Objective evaluation by a specialist: Binary class (TRUE/FALSE)
      • medical risk
      • ethical risk
      • legal risk
    • Subjective evaluation by the general public (Japanese dataset only): A probability distribution of evaluations on a 5-point scale from -2 to 2
      • fluency
      • helpfulness
      • harmlessness

Dataset

The data consists of a question, an answer, and a set of labels for the answer: objective labels (medical, ethical, and legal risks) and subjective labels (fluency, helpfulness, and harmlessness). Subjective labels are provided only in the Japanese dataset.
Experts judge each objective label (risk) for an answer as either TRUE (risk = inappropriate) or FALSE (no risk = appropriate). In the case of TRUE, the reason is given in a note (Japanese dataset only).

The subjective labels are rated on a 5-point scale. Because we consider the variability of non-expert responses to be important as well, the data includes the full distribution of ratings rather than a single score. For example, fluency ranges over very non-fluent (-2), non-fluent (-1), normal (0), fluent (+1), and very fluent (+2), and the number of crowdsourced responses for each rating is stored.
The task for the subjective labels is to estimate this distribution; it is defined only for the Japanese dataset.
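
To make the relationship between the stored response counts and the distribution to be estimated concrete, here is a minimal sketch. It assumes the counts are given as a list ordered from -2 to +2; that ordering, the function name `counts_to_distribution`, and the example counts are assumptions made for illustration, not details taken from the README.

```python
from typing import List, Sequence


def counts_to_distribution(counts: Sequence[int]) -> List[float]:
    """Normalize crowdsourced rating counts into a probability distribution.

    `counts` is assumed to hold the number of responses for the ratings
    -2, -1, 0, +1, +2, in that order.
    """
    if len(counts) != 5:
        raise ValueError("expected one count per rating on the 5-point scale")
    total = sum(counts)
    if total == 0:
        raise ValueError("no responses to normalize")
    return [c / total for c in counts]


# Made-up example: 10 crowd workers rated the fluency of one answer.
fluency_counts = [0, 1, 2, 5, 2]
fluency_dist = counts_to_distribution(fluency_counts)
print(fluency_dist)  # [0.0, 0.1, 0.2, 0.5, 0.2]
```

A system's estimated distribution for the same answer could then be compared against `fluency_dist`; the comparison metric is not specified here.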

For detailed data specifications, please see the README and overview papers that will be released in the future.

Medical and legal systems:
The scope of medical care that can be provided as the standard of care and the legal and ethical risks vary from country to country depending on their medical and legal systems.
Therefore, we have prepared two sets of data: the Japanese dataset judged based on the Japanese system, and the German dataset judged based on the German system.

Languages:
Both the Japanese and German Q&A pairs are translated into multiple languages. The Japanese dataset is translated into English, French, and German, and the German dataset is translated into English and French.
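
The translation setup can be summarized as a small lookup table. The sketch below uses two-letter language codes ("ja", "de", "en", "fr") and the names `TRANSLATIONS` and `available_languages`; all of these are illustrative assumptions and do not reflect the naming used in the released files.

```python
from typing import Dict, Tuple

# Source-language dataset -> languages it is translated into.
TRANSLATIONS: Dict[str, Tuple[str, ...]] = {
    "ja": ("en", "fr", "de"),  # Japanese dataset: English, French, German
    "de": ("en", "fr"),        # German dataset: English, French
}


def available_languages(source: str) -> Tuple[str, ...]:
    """Return the source language followed by its translation targets."""
    return (source,) + TRANSLATIONS[source]


print(available_languages("ja"))  # ('ja', 'en', 'fr', 'de')
print(available_languages("de"))  # ('de', 'en', 'fr')
```

Note that the labels stay tied to the source dataset: a German translation of a Japanese question-answer pair is still judged against the Japanese system.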

Data details

  • The data consists of a question, an answer, and labels for the answer.
    The labels for each answer are objective labels (‘medical_risk’, ‘ethical_risk’, and ‘legal_risk’) judged by experts considering Japanese/German laws and medical guidelines. In addition, the Japanese data also includes subjective labels (‘fluency’, ‘helpfulness’, and ‘harmlessness’) [README].

  • Data size: Each language comprises approximately 200 entries, each consisting of a question, an answer, and the answer labels. Of these, 100 entries per language are released as the training set.
    • Questions and answers are created by humans, referencing responses from a chatbot.
    • Answer labels represent the evaluation of the answers and are what participants estimate in this task. There are six labels: three objective labels (medical risk, ethical risk, and legal risk), assigned by experts based on Japanese/German laws and medical guidelines, and three subjective labels (fluency, helpfulness, and harmlessness), assigned through crowdsourcing to the Japanese source data only.
  • Languages: 
    • Step 1: The data is created in the source language (Japanese or German).
    • Step 2: It is translated into the other languages. Note that DeepL was used only to translate the sample data; the training and test data will be translated manually by professional translators. Japanese source data is translated into English, German, and French, and German source data is translated into English and French.

Sample data

Japanese source data

Examples of Japanese source data with English translations

German source data

Registration

Schedule

Organizer

Eiji Aramaki, Ph.D. (NAIST, Japan)
Shoko Wakamiya, Ph.D. (NAIST, Japan)
Shuntaro Yada, Ph.D. (NAIST, Japan)
Shohei Hisada (NAIST, Japan)
Tomohiro Nishiyama (NAIST, Japan)
Lisa Raithel, Ph.D. (DFKI, Germany; TU Berlin, Germany)
Roland Roller, Ph.D. (DFKI, Germany)
Philippe Thomas, Ph.D. (DFKI, Germany)
Hui-Syuan Yeh (Université Paris-Saclay, CNRS, LISN, France)
Pierre Zweigenbaum, Ph.D. (Université Paris-Saclay, CNRS, LISN, France)

Adviser

Akiko Aizawa, Ph.D. (NII, Japan)
Ryuma Shineha, Ph.D. (Osaka University, Japan)