- All OCR Extracted Text Documents (3,148 documents) ← non-text content caused by OCR errors removed
- Frequency Balanced Subset (224 documents) ← word-level OCR errors fixed, NER annotations included
If you would like to obtain this dataset, please contact our laboratory (email address below).
socialcomputing-office [at] is.naist.jp
The details of each dataset are summarized below.
All OCR Extracted Text Documents
- Non-text removal: Visually detect and remove strings that cannot be detected as Japanese due to OCR errors
- Title, author name, header, footer, page numbers, references, figures, tables, captions, and English abstracts are also removed
- Sentence Formatting: Format the text to one sentence per line.
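The sentence formatting step can be pictured with a minimal sketch. It assumes that a sentence ends at the Japanese full stop "。"; the actual rule used to build the dataset is not described here, and the function name `split_sentences` and the sample text are hypothetical.

```python
import re


def split_sentences(text):
    """Split text into sentences, assuming each sentence ends with '。'."""
    # Split right after every full stop so the mark stays with its sentence.
    parts = re.split(r"(?<=。)", text)
    # Drop empty pieces and surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


# Hypothetical two-sentence sample.
sample = "患者は60歳男性。胸部X線で異常陰影を認めた。"
lines = split_sentences(sample)
print("\n".join(lines))  # one sentence per line
```

Real case reports also contain periods inside numbers and abbreviations, so a production splitter would likely need more rules than this.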
Format for the Data
A case report is filed in XML format with the following structure.
For case reports where an attribute value such as SEX or AGE is unknown, the value is set to “-1”.
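A minimal sketch of how the “-1” placeholder might be handled when reading a case report. Only the attribute names SEX and AGE and the “-1” convention come from the description above; the element name `case`, the sample values, and the helper `read_attribute` are hypothetical, and the real file layout may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real element names and layout may differ.
SAMPLE = '<case SEX="-1" AGE="45">sample text</case>'

UNKNOWN = "-1"  # placeholder used for unknown attribute values


def read_attribute(element, name):
    """Return the attribute value, or None when it is missing or unknown."""
    value = element.get(name)
    if value is None or value == UNKNOWN:
        return None
    return value


root = ET.fromstring(SAMPLE)
sex = read_attribute(root, "SEX")  # unknown in this sample
age = read_attribute(root, "AGE")  # kept as a string
print(sex, age)
```

Mapping the placeholder to `None` early keeps an unknown age from being mistaken for a numeric value in later statistics.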
Frequency Balanced Dataset
- The number of documents has been balanced according to the actual frequency of occurrence of the common disease names.
- Other criteria:
- Maximum of 1,500 words in the text
- Reported cases should be relatively recent (published in or after 2010).
- Documents with many word/phrase level OCR errors are excluded.
- Word- and phrase-level OCR errors were corrected and recovered by viewing the PDF of the original article.
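The length and recency criteria above can be sketched as a simple filter. This is an illustration only: the `Document` class, its field names, and the sample records are hypothetical, the word count is assumed to be computed beforehand, and the frequency balancing and OCR-error screening are not covered.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str      # hypothetical identifier
    word_count: int  # number of words in the text, counted beforehand
    year: int        # publication year


MAX_WORDS = 1500  # maximum of 1,500 words in the text
MIN_YEAR = 2010   # published in or after 2010


def meets_criteria(doc):
    """Check the length and recency criteria for one document."""
    return doc.word_count <= MAX_WORDS and doc.year >= MIN_YEAR


# Hypothetical records for illustration.
docs = [
    Document("a", 1200, 2015),
    Document("b", 1800, 2018),  # too long
    Document("c", 900, 2008),   # too old
    Document("d", 1500, 2010),  # exactly on both limits
]
selected = [doc.doc_id for doc in docs if meets_criteria(doc)]
print(selected)
```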
The following medical expression entities were assigned as XML tags (tag names in parentheses).
- Disease (d)
- Anatomical part (a)
- Feature (f)
- Change (c)
- Test: [TestTest (t-test), TestKey (t-key), TestVal (t-val)]
- Medicine: [MedicineKey (m-key), MedicineVal (m-val)]
- Remedy (r)
- ClinicalContext (cc)
- Pending (p)
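As a rough illustration of how these tags might be read, the sketch below collects the tagged spans from one annotated sentence. The tag names are the ones listed above; the wrapping `<s>` element, the sample sentence, and the function name `extract_entities` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tag names listed above, mapped to their entity types.
TAG_TO_ENTITY = {
    "d": "Disease",
    "a": "Anatomical part",
    "f": "Feature",
    "c": "Change",
    "t-test": "TestTest",
    "t-key": "TestKey",
    "t-val": "TestVal",
    "m-key": "MedicineKey",
    "m-val": "MedicineVal",
    "r": "Remedy",
    "cc": "ClinicalContext",
    "p": "Pending",
}

# Hypothetical annotated sentence wrapped in a made-up <s> root element.
SAMPLE = (
    "<s><a>右肺</a>に<d>肺癌</d>を認め、"
    "<m-key>シスプラチン</m-key>を投与した。</s>"
)


def extract_entities(xml_text):
    """Return (entity type, surface text) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [
        (TAG_TO_ENTITY[element.tag], "".join(element.itertext()))
        for element in root.iter()
        if element.tag in TAG_TO_ENTITY
    ]


entities = extract_entities(SAMPLE)
for entity_type, text in entities:
    print(entity_type, text)
```

Using `itertext()` keeps the full surface string even if one tag is nested inside another.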
We used the guidelines discussed in the following paper for the annotation.
Yada, S., Joh, A., Tanaka, R., Cheng, F., Aramaki, E., & Kurohashi, S. (2020). Towards a Versatile Medical-Annotation Guideline Feasible Without Heavy Medical Knowledge: Starting From Critical Lung Diseases. Proceedings of The 12th Language Resources and Evaluation Conference, 4567–4574.
(The following table reflects an earlier count based on 227 documents. It will be updated soon.)