MedNLP-CHAT

News

About MedNLP-CHAT

Medical Natural Language Processing for AI Chat (MedNLP-CHAT), one of the core tasks in NTCIR-18, aims to evaluate medical chatbots from multiple viewpoints. Medical chatbot services are a promising solution to human resource problems in medicine and healthcare. However, the risks of such chatbots are not well understood. We create a testbed of potential chatbot responses evaluated from various aspects: medical validity, legal viewpoints, ethical issues, etc.

 

Task Overview

This task is to determine whether a chatbot’s answer to a medical question is appropriate. Judgments are made from multiple perspectives.

  • INPUT
    • A pair of a patient’s question and a chatbot answer
  • OUTPUT
    • Objective evaluation by a specialist: Binary class (TRUE/FALSE)
      • medical risk
      • ethical risk
      • legal risk
    • Subjective evaluation by the general public: A probability distribution of evaluations on a 5-point scale from -2 to 2 (Japan domain only)
      • fluency
      • helpfulness
      • harmlessness
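
The input and output listed above can be sketched as Python data classes. This is only an illustrative sketch: the class names (`QAPair`, `ObjectiveLabels`, `SubjectiveLabels`, `Prediction`) and the dictionary representation of a distribution are assumptions made here, not the official data format, which is specified in the README.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# 5-point scale used for the subjective labels.
SCALE = (-2, -1, 0, 1, 2)


@dataclass
class QAPair:
    """Task input: a patient's question and a chatbot answer."""
    question: str
    answer: str


@dataclass
class ObjectiveLabels:
    """Specialist judgments; True means a risk is present."""
    medical_risk: bool
    ethical_risk: bool
    legal_risk: bool


@dataclass
class SubjectiveLabels:
    """General-public judgments (Japan domain only).

    Each field maps a scale point (-2..2) to its probability.
    """
    fluency: Dict[int, float]
    helpfulness: Dict[int, float]
    harmlessness: Dict[int, float]


@dataclass
class Prediction:
    """Hypothetical container for a system's output on one QAPair."""
    objective: ObjectiveLabels
    # Left as None where no subjective labels exist (German domain).
    subjective: Optional[SubjectiveLabels] = None
```

Keeping the subjective part optional reflects that only the Japan domain carries subjective labels.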

Dataset

The data consists of a question, an answer, and a set of labels for the answer: objective labels (medical, ethical, and legal risks) and subjective labels (fluency, helpfulness, and harmlessness).
Experts judge each objective label (risk) for each answer as either TRUE (risk = inappropriate) or FALSE (no risk = appropriate). When a label is TRUE, the reason is given in a note (Japan domain only).

The subjective labels are rated on a 5-point scale. Because we consider the variability of non-expert responses to be important as well, the data includes the distribution of ratings over the 5-point scale.
For example, fluency ranges from very fluent (+2) through fluent (+1), normal (0), and non-fluent (-1) to very non-fluent (-2), and the number of crowdsourced responses for each rating is stored.
The task for the subjective labels is to estimate this distribution; it applies only to the Japan domain.
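
As a minimal sketch of how stored response counts relate to the distribution to be estimated, the counts can be normalized by the total number of responses. The function name `counts_to_distribution`, the dictionary layout, and the example counts are assumptions made for illustration; the actual storage format is described in the README.

```python
from typing import Dict

# 5-point scale used for the subjective labels.
SCALE = (-2, -1, 0, 1, 2)


def counts_to_distribution(counts: Dict[int, int]) -> Dict[int, float]:
    """Normalize per-rating response counts into a probability distribution."""
    total = sum(counts.get(score, 0) for score in SCALE)
    if total == 0:
        raise ValueError("no responses to normalize")
    return {score: counts.get(score, 0) / total for score in SCALE}


# Made-up counts for one answer's fluency label (not real data).
fluency_counts = {2: 4, 1: 3, 0: 2, -1: 1, -2: 0}
fluency_dist = counts_to_distribution(fluency_counts)
```

With these made-up counts, ten responses are spread over the scale, so the rating +2 receives probability 4/10 and the rating -2 receives probability 0.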

For detailed data specifications, please see the README and overview papers that will be released in the future.

Medical domain
The scope of medical care that can be provided as the standard of care and the legal and ethical risks vary from country to country depending on their medical systems.
Therefore, we have prepared two sets of data: the Japan domain dataset judged based on the Japanese medical system, and the German domain dataset judged based on the German medical system.

Languages:
Both the Japan domain and the German domain are translated into multiple languages. The Japan domain is translated into English, French, and German. The German domain is translated into English and French.
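
The domain and language coverage described above can be summarized as a small configuration table. The dictionary keys and the helper `available_languages` are hypothetical names chosen for this sketch; the language codes follow those used in the dataset descriptions.

```python
from typing import Dict, List

# Hypothetical summary of the two domains as described in this page.
DOMAINS: Dict[str, dict] = {
    "japan": {
        "source_language": "JA",
        "translations": ["EN", "FR", "DE"],
        "has_subjective_labels": True,
    },
    "german": {
        "source_language": "DE",
        "translations": ["EN", "FR"],
        "has_subjective_labels": False,
    },
}


def available_languages(domain: str) -> List[str]:
    """Return the source language followed by its translation languages."""
    cfg = DOMAINS[domain]
    return [cfg["source_language"], *cfg["translations"]]
```

For example, `available_languages("german")` lists German first, followed by English and French.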

Japan domain dataset

  • The data consists of a question, an answer, and labels for the answer.
    The labels for each answer are objective labels (‘medical_risk’, ‘ethical_risk’, and ‘legal_risk’) judged by experts considering Japanese laws and medical guidelines and subjective labels (‘fluency’, ‘helpfulness’, and ‘harmlessness’) [README].

  • Data size: We are preparing to construct 200 pairs of {Question, Answer, Answer labels}. Out of the 200 pairs, 100 pairs are released to task participants as a training set. 
    • Both the questions and answers are created by humans, referencing responses from a chatbot.
    • Answer labels represent the evaluation of the answers, which will be estimated in this task. There are six labels: three objective labels (medical risk, ethical risk, and legal risk) assigned by experts based on Japanese laws and medical guidelines, and three subjective labels (fluency, helpfulness, and harmlessness) assigned by Japanese respondents through crowdsourcing.
  • Languages: Japanese (JA), English (EN), French (FR), and German (DE).
    • Step 1: Japanese data is created.
    • Step 2: It is translated into the other languages. Note that the sample data was only machine-translated with DeepL, whereas the training and test data will be translated manually.
Example of Japanese domain data with English translations

German domain dataset

  • The data consists of a question, an answer, and a set of labels for the answer: objective labels (‘medical_risk’, ‘ethical_risk’, and ‘legal_risk’) based on German laws and medical guidelines.
  • Data size: We are preparing to construct 200 pairs of {Question, Answer, Answer labels}. Out of the 200 pairs, 100 pairs will be released to task participants as a training set in September 2024. 
    • Both the questions and answers are created by humans. 
    • Answer labels represent the evaluation of the answers, which will be estimated in this task. There are three objective labels (medical risk, ethical risk, and legal risk) assigned by experts based on German laws and medical guidelines.
  • Languages: German (DE), English (EN), and French (FR).
    • Step 1: German data is created.
    • Step 2: It is translated into the other languages. Note that the sample data was only machine-translated with DeepL, whereas the training and test data will be translated manually.

Registration

Schedule

Organizer

Eiji Aramaki, Ph.D. (NAIST, Japan)
Shoko Wakamiya, Ph.D. (NAIST, Japan)
Shuntaro Yada, Ph.D. (NAIST, Japan)
Shohei Hisada (NAIST, Japan)
Tomohiro Nishiyama (NAIST, Japan)
Lisa Raithel, Ph.D. (DFKI, Germany; TU Berlin, Germany)
Roland Roller, Ph.D. (DFKI, Germany)
Philippe Thomas, Ph.D. (DFKI, Germany)
Hui-Syuan Yeh (Université Paris-Saclay, CNRS, LISN, France)
Pierre Zweigenbaum, Ph.D. (Université Paris-Saclay, CNRS, LISN, France)

Adviser

Ryuma Shineha, Ph.D. (Osaka University, Japan)