Scientific knowledge is now published digitally in many forms and sources: encyclopedias, scientific papers, and regulatory documents, as well as structured knowledge sources such as ontologies and knowledge bases. News articles, blog posts, forums, and social media can also contain relevant information or serve as research material. All of this is published every day in a large number of languages. In some domains, however, digital content is now produced in such volume and at such speed that humans can no longer keep up with it and maintain an up-to-date view of current scientific evidence. MEDLINE, for instance, adds close to one million new articles every year.

The KEEPHA (Knowledge-Enhanced information Extraction across languages for PHArmacovigilance) project aims to design Artificial Intelligence (AI) methods that automatically digest these different types of text sources and jointly extract such knowledge and observations in order to populate existing knowledge bases. The project showcases these methods in the domain of pharmacovigilance, which endeavors to maintain up-to-date knowledge on adverse drug reactions (ADRs) for the benefit of public health. In this domain, authoritative sources include scientific journals and drug labels, while elementary observations are reported in patient records and social media.

Current mainstream information extraction methods use self-supervised extraction of word representations from large text corpora and tend to neglect existing knowledge on the target domain. In contrast, the present project aims to integrate existing knowledge into the word representation acquisition and information extraction processes to improve the extraction of new information and knowledge. This is all the more necessary for less formal, and hence more challenging, sources such as social media. Additionally, the project will take advantage of similar information published in multiple languages to pool knowledge across countries.

Language barriers hamper the free flow of knowledge and thought across languages. Relevant findings need to be communicated across these barriers, and collecting and translating them into the respective languages takes time and effort. In the not too distant future, tools will assist researchers and other citizens in finding and linking information distributed across sources and languages. In this project, we will help to improve such technologies and will demonstrate them for adverse drug reactions.

The consortium is composed of three internationally recognized teams specialized in natural language processing. RIKEN, NII and NAIST (JP) have created the de facto standard natural language processing tools for Japanese, and produced a number of document and text analysis tools for extracting knowledge from scholarly documents. DFKI (DE) has a strong background in corpus generation, general information extraction and biomedical text processing. LIMSI (FR) has long-standing, strong experience in corpus annotation, hybrid information extraction and biomedical language processing, including for pharmacovigilance from patient forums.

Japanese Research Team

RIKEN AIP – Institute of Physical and Chemical Research – Center for Advanced Intelligence Project

松本 裕治 / Prof. Yuji Matsumoto - Team Leader
西田 典起 / Noriki Nishida - Postdoctoral researcher
寺西 裕紀 / Hiroki Teranishi - Postdoctoral researcher
徳永 なるみ / Narumi Tokunaga - Technical staff

NII – National Institute of Informatics

相澤 彰子 / Prof. Akiko Aizawa
An Tuan Dao
壹岐 太一 / Taichi Iki
杉本 海人 / Kaito Sugimoto

NAIST – Nara Institute of Science and Technology

荒牧 英治 / Prof. Eiji Aramaki
矢田 竣太郎 / Assoc. Prof. Shuntaro Yada
西山 智弘 / Tomohiro Nishiyama
Faith Wavinya Mutinda
Gabriel Herman Bernardim Andrade


DFKI – German Research Center for Artificial Intelligence, Speech and Language Technology Lab
LIMSI – Laboratoire d’Informatique pour la Mécanique et les Sciences de l’Ingénieur