MedNLP-CHAT

News

June 16, 2025: The dataset used in the MedNLP-CHAT shared task has been released.
June 12, 2025: MedNLP-CHAT task session has successfully concluded. Thank you for your participation!
June 10, 2025: Proceedings of the NTCIR-18 Conference have been released.
May 28, 2025: The NTCIR-18 conference program including the task session details is available. MedNLP-CHAT session will be held on June 12 (DAY-3).
February 1, 2025: The draft of our task overview paper was released.
February 1, 2025: The evaluation results for the formal run were returned.
January 17, 2025: The test data has been released to task participants, and the formal run has started.
December 2, 2024: The complete version of the training data has been released (available to task participants only).
December 2, 2024: Updated sample data, consisting of ten pairs from the training data, has been provided.
December 1, 2024: Descriptions in ‘Dataset’ have been updated.
November 30, 2024: The training data for the German dataset has been released (limited to task participants).
November 8, 2024: Descriptions in ‘About MedNLP-CHAT,’ ‘Task Overview,’ and ‘Dataset’ have been updated.
November 7, 2024: The schedule has been updated.
September 9, 2024: The training data for the Japanese dataset has been released (limited to task participants).
September 9, 2024: Descriptions in ‘Task Overview’ and ‘Dataset’ have been updated.
August 15, 2024: Descriptions in ‘German dataset’ have been updated, and the German sample data is available.
August 15, 2024: Japanese sample data has been updated (notes for the objective labels have been added).
July 16, 2024: The Japanese sample data has been updated.
July 12, 2024: Descriptions in ‘Task Overview’ and ‘Dataset’ have been updated.
July 9, 2024: Japanese domain sample data is available.

About MedNLP-CHAT

A medical chatbot service is a promising solution for medical/healthcare human resource problems. However, the potential risks caused by the use of chatbots are not well-known. Medical Natural Language Processing for AI Chat (MedNLP-CHAT), which is one of the core tasks in NTCIR-18, aims to evaluate medical chatbots from multiple viewpoints, namely medical, legal, and ethical aspects. In this shared task, participants must analyze a given medical question with a corresponding chatbot response and determine whether this response creates a possible medical, legal, or ethical risk (binary).

Task Overview

This task is to determine whether a chatbot’s answer to a medical question is appropriate. Judgments are made from multiple perspectives.

INPUT
- A pair of a patient’s question and a chatbot answer

OUTPUT
- Objective evaluation by a specialist: Binary class (TRUE/FALSE)
  - medical risk
  - ethical risk
  - legal risk
- Subjective evaluation by the general public (Japanese dataset only): A probability distribution of evaluations on a 5-point scale from -2 to 2
  - fluency
  - helpfulness
  - harmlessness

Dataset

The data consists of a question, an answer, and a set of labels for the answer: objective labels (medicalRisk, ethicalRisk, and legalRisk) and subjective labels (fluency, helplessness, and harmlessness). Subjective labels are provided only in the Japanese dataset.
Experts judge the objective label (risks) for each answer as either TRUE (risk = inappropriate) or FALSE (no risk = appropriate). In the case of TRUE, the reason is given in the note (Japanese dataset only).

The subjective labels are rated on a 5-point scale, and since we considered the variability of non-expert responses to be also important, we have included the distribution of the 5-point scale. For example, fluency ranges from very non-fluent (-2), non-fluent (-1), normal (0), fluent (+1), to very fluent (+2), and the number of responses obtained through crowdsourcing is stored.
The task for the subjective labels is to estimate this distribution; it is only defined for the Japanese dataset.

For detailed data specifications, please see the README and overview papers that will be released in the future.

Medical and legal systems:
The scope of medical care that can be provided as the standard of care and the legal and ethical risks vary from country to country depending on their medical and legal systems.
Therefore, we have prepared two sets of data: the Japanese dataset judged based on the Japanese system, and the German dataset judged based on the German system.

Languages:
Both the Japanese and German Q&A pairs are translated into English and French, respectively.

Data details

The data consists of a question, an answer, and labels for the answer.
The labels for each answer are objective labels (‘medicalRisk’, ‘ethicalRisk’, and ‘legalRisk’) judged by experts considering Japanese/German laws and medical guidelines. In addition, the Japanese data also includes subjective labels (‘fluency’, ‘helpfulness’, and ‘harmlessness’) [README].
Data size: Each language comprises approximately 200 pairs of Question, Answer, and Answer labels. Out of the 200 language pairs, 100 (each) are released as a training set.
- Questions and answers are created by humans, referencing responses from a chatbot.
- Answer labels represent the evaluation of the answers, which will be estimated in this task. There are six labels comprising three objective labels (medicalRisk, ethicalRisk, and legalRisk) assigned by experts based on Japanese/German laws and medical guidelines. The subjective labels (fluency, helpfulness, and harmlessness) are assigned only to Japanese source data through crowdsourcing.
Languages:
- Step 1: Data is created.
- Step 2: It is translated into the other languages. The training and test data will be translated to English and French manually by professional translators.