The corpus consists of 600 PubMed abstracts of randomized controlled trials (RCTs) related to breast cancer. For each abstract, the annotator marked text snippets that identify the Participants, Intervention, Control, and Outcome (PICO elements).

  • Participants: identify text snippets that describe the characteristics of the participants. These include the number of participants (total participants, participants in the intervention group, or participants in the control group), average age, ethnicity, location of the study, and eligibility. Although breast cancer is the main condition, we are also interested in identifying the condition or symptom of breast cancer that is being treated (such as hair loss, bone loss, and vomiting).
  • Intervention and Control: identify the specific intervention and control used in the study.
  • Outcome: identify what is being measured in the study. These include the outcomes that were measured, the number of events in the intervention group, the number of events in the control group, and the outcome measure.


The dataset can be downloaded from our GitHub repository. Each filename corresponds to the PubMed identification number (PMID) of the abstract. The data is provided in two formats:

  1. Brat files: The abstracts were annotated using the brat annotation tool. 
  2. XML files: The brat files converted to XML format.
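
As a rough illustration of how the brat files might be read, the sketch below parses the text-bound annotation lines (those whose ID starts with "T") of a brat standoff .ann file. It assumes the standard brat layout of ID, then type and character offsets, then the annotated text, separated by tabs. The entity labels and the sample content are made up for illustration and may not match the labels used in this corpus.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    ann_id: str  # annotation ID, e.g. "T1"
    label: str   # entity type exactly as written in the .ann file
    start: int   # start character offset into the matching .txt file
    end: int     # end character offset (exclusive)
    text: str    # the annotated snippet


def parse_ann(ann_text: str) -> List[Entity]:
    """Collect text-bound annotations from the contents of a brat .ann file."""
    entities = []
    for line in ann_text.splitlines():
        # Skip relations, attributes, notes, and blank lines.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        ann_id, type_and_span, snippet = parts[0], parts[1], parts[2]
        label, span = type_and_span.split(" ", 1)
        # Discontinuous spans look like "0 5;10 15"; keep the outer bounds.
        fragments = [frag.split() for frag in span.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, snippet))
    return entities


# Made-up sample; the labels here are assumptions, not the corpus's own.
sample = "T1\tParticipants 0 12\t120 patients\nT2\tOutcome 30 38\tsurvival\n"
entities = parse_ann(sample)
for entity in entities:
    print(entity.label, entity.start, entity.end, entity.text)
```

To apply this to a downloaded file, read the .ann file named after the PMID with UTF-8 encoding and pass its contents to parse_ann; the offsets then index into the matching .txt file.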



Please check out our NER demo.

Check the visualization system here.


If you find this dataset useful, please cite our paper:
Faith Wavinya Mutinda, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki: AUTOMETA: Automatic Meta-Analysis System Employing Natural Language Processing, MedInfo 2021 (2021/10/2-4, Online)