News
- December 2, 2024: Released the complete version of the training data (limited to task participants)
- December 2, 2024: Updated sample data, which consists of ten pairs from the training data
- December 1, 2024: Updated descriptions in Dataset
- November 30, 2024: Released the training data of the German dataset (limited to task participants)
- November 8, 2024: Updated descriptions in About MedNLP-CHAT, Task Overview, and Dataset
- November 7, 2024: Updated the schedule
- September 9, 2024: Released the training data of the Japanese dataset (limited to task participants)
- September 9, 2024: Updated descriptions in Task Overview and Dataset
- August 15, 2024: Updated descriptions of the German dataset; German sample data available
- August 15, 2024: Updated Japanese sample data (Notes for the objective labels are added)
- July 16, 2024: Updated Japanese sample data
- July 12, 2024: Updated descriptions in Task Overview and Dataset
- July 9, 2024: Japanese domain sample data available
About MedNLP-CHAT
A medical chatbot service is a promising solution for medical/healthcare human resource problems. However, the potential risks caused by the use of chatbots are not well understood. Medical Natural Language Processing for AI Chat (MedNLP-CHAT), which is one of the core tasks in NTCIR-18, aims to evaluate medical chatbots from multiple viewpoints, namely medical, legal, and ethical aspects. In this shared task, participants must analyze a given medical question with a corresponding chatbot response and make a binary judgment on whether this response creates a possible medical, legal, or ethical risk.
Task Overview
This task is to determine whether a chatbot’s answer to a medical question is appropriate. Judgments are made from multiple perspectives.
- INPUT
  - A pair of a patient’s question and a chatbot answer
- OUTPUT
  - Objective evaluation by a specialist: Binary class (TRUE/FALSE)
    - medical risk
    - ethical risk
    - legal risk
  - Subjective evaluation by the general public (Japanese dataset only): A probability distribution of evaluations on a 5-point scale from -2 to 2
    - fluency
    - helpfulness
    - harmlessness
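The input and output listed above can be sketched as two small Python containers. This is only an illustration under stated assumptions: the class names `QAPair` and `Prediction`, the snake_case field names, and the `validate` helper are hypothetical and do not reflect the official data or submission format, which is defined in the README.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# The 5-point scale used for the subjective labels, from -2 to 2.
SCALE = (-2, -1, 0, 1, 2)


@dataclass
class QAPair:
    """Task input: a patient's question and the chatbot's answer."""
    question: str
    answer: str


@dataclass
class Prediction:
    """Task output for one QAPair (hypothetical container, not an official format)."""
    # Objective labels: True means a risk was found (inappropriate answer).
    medical_risk: bool
    ethical_risk: bool
    legal_risk: bool
    # Subjective labels exist for the Japanese dataset only, so they are optional.
    # Each maps a scale point (-2..2) to a probability.
    fluency: Optional[Dict[int, float]] = None
    helpfulness: Optional[Dict[int, float]] = None
    harmlessness: Optional[Dict[int, float]] = None

    def validate(self) -> None:
        """Check that every provided distribution is a proper one over SCALE."""
        for name in ("fluency", "helpfulness", "harmlessness"):
            dist = getattr(self, name)
            if dist is None:
                continue
            if set(dist) != set(SCALE):
                raise ValueError(f"{name} must cover exactly the points {SCALE}")
            if any(p < 0 for p in dist.values()):
                raise ValueError(f"{name} contains a negative probability")
            if abs(sum(dist.values()) - 1.0) > 1e-6:
                raise ValueError(f"{name} must sum to 1")
```

For the German dataset, a prediction under this sketch would leave the three subjective fields as `None`.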
Dataset
The data consists of a question, an answer, and a set of labels for the answer: objective labels (medicalRisk, ethicalRisk, and legalRisk) and subjective labels (fluency, helpfulness, and harmlessness). Subjective labels are provided only in the Japanese dataset.
Experts judge the objective label (risks) for each answer as either TRUE (risk = inappropriate) or FALSE (no risk = appropriate). In the case of TRUE, the reason is given in the note (Japanese dataset only).
The subjective labels are rated on a 5-point scale. Because we consider the variability of non-expert responses to be important as well, the data include the distribution of responses over the 5-point scale. For example, fluency ranges from very non-fluent (-2), non-fluent (-1), normal (0), fluent (+1), to very fluent (+2), and the number of crowdsourced responses at each point is stored.
The task for the subjective labels is to estimate this distribution; it is only defined for the Japanese dataset.
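As a rough illustration of what estimating this distribution involves, the sketch below turns stored response counts into a probability distribution over the 5-point scale. The function names and the example counts are hypothetical and are not taken from the dataset; the actual storage format is described in the README.

```python
from typing import Dict

# The 5-point scale: very negative (-2) to very positive (+2).
SCALE = (-2, -1, 0, 1, 2)


def counts_to_distribution(counts: Dict[int, int]) -> Dict[int, float]:
    """Normalize per-point response counts into probabilities over SCALE.

    Scale points missing from `counts` are treated as zero responses.
    """
    total = sum(counts.get(point, 0) for point in SCALE)
    if total == 0:
        raise ValueError("cannot normalize: no responses were recorded")
    return {point: counts.get(point, 0) / total for point in SCALE}


def mean_score(dist: Dict[int, float]) -> float:
    """Expected rating under the distribution (a single-number summary)."""
    return sum(point * prob for point, prob in dist.items())


# Made-up counts for one answer's fluency label (10 responses in total).
example_counts = {-2: 0, -1: 1, 0: 2, 1: 5, 2: 2}
example_dist = counts_to_distribution(example_counts)
```

The mean score discards the spread of the responses, which is why the distribution itself, rather than an average, is the target for the subjective labels.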
For detailed data specifications, please see the README and overview papers that will be released in the future.
Medical and legal systems:
The scope of medical care that can be provided as the standard of care and the legal and ethical risks vary from country to country depending on their medical and legal systems.
Therefore, we have prepared two sets of data: the Japanese dataset judged based on the Japanese system, and the German dataset judged based on the German system.
Languages:
Both the Japanese and German Q&A pairs are translated into English and French.
Data details
The data consists of a question, an answer, and labels for the answer.
The labels for each answer are objective labels (‘medicalRisk’, ‘ethicalRisk’, and ‘legalRisk’) judged by experts considering Japanese/German laws and medical guidelines. In addition, the Japanese data includes subjective labels (‘fluency’, ‘helpfulness’, and ‘harmlessness’) [README].
- Data size: Each language comprises approximately 200 pairs of Question and Answer with Answer labels. Of these, 100 per language are released as a training set.
- Questions and answers are created by humans, referencing responses from a chatbot.
- Answer labels represent the evaluation of the answers, which will be estimated in this task. There are six labels: three objective labels (medicalRisk, ethicalRisk, and legalRisk) assigned by experts based on Japanese/German laws and medical guidelines, and three subjective labels (fluency, helpfulness, and harmlessness) assigned only to the Japanese source data through crowdsourcing.
- Languages:
  - Step 1: Data is created.
  - Step 2: It is translated into the other languages. The training and test data will be translated to English and French manually by professional translators.
Sample data
Registration
Schedule
- Mar 29, 2024: Kickoff event
- May -> July 2024: Sample dataset release
- Aug -> Sept 2024: Training dataset release (ja)
- Sept -> Nov 2024: Training dataset release (de)
- [NEW] Dec 1, 2024: Training set release (complete version)
- [NEW] Jan 17, 2025: Registration deadline
- [NEW] Jan 17, 2025: Test set release
- Nov 2024-Jan 2025 -> Jan 17-24, 2025: Formal run
- Feb 1, 2025: Evaluation results return
- Feb 1, 2025: Task overview release (draft)
- Mar 1, 2025: Submission due of participant papers (draft)
- May 1, 2025: Camera-ready participant paper due
- Jun 10-13 2025: NTCIR-18 Conference @ NII, Tokyo, Japan