Database Contributor Review

CARMEN-I: A resource of anonymized electronic health records in Spanish and Catalan for training and testing NLP tools

Eulalia Farre Maduell Salvador Lima-Lopez Santiago Andres Frid Artur Conesa Elisa Asensio Antonio Lopez-Rueda Helena Arino Elena Calvo Maria Jesús Bertran Maria Angeles Marcos Montserrat Nofre Maiz Laura Tañá Velasco Antonia Marti Ricardo Farreres Xavier Pastor Xavier Borrat Frigola Martin Krallinger

Published: Nov. 2, 2023. Version: 1.0


When using this resource, please cite:
Farre Maduell, E., Lima-Lopez, S., Frid, S. A., Conesa, A., Asensio, E., Lopez-Rueda, A., Arino, H., Calvo, E., Bertran, M. J., Marcos, M. A., Nofre Maiz, M., Tañá Velasco, L., Marti, A., Farreres, R., Pastor, X., Borrat Frigola, X., & Krallinger, M. (2023). CARMEN-I: A resource of anonymized electronic health records in Spanish and Catalan for training and testing NLP tools (version 1.0). PhysioNet. https://doi.org/10.13026/bxrx-y344.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

The CARMEN-I corpus comprises 2,000 clinical records, encompassing discharge letters, referrals, and radiology reports from Hospital Clínic of Barcelona between March 2020 and March 2022. These reports, primarily in Spanish with some Catalan sections, cover COVID-19 patients with diverse comorbidities like kidney failure, cardiovascular diseases, malignancies, and immunosuppression. The corpus underwent thorough anonymization, validation, and expert annotation, replacing sensitive data with synthetic equivalents. A subset of the corpus features annotations of medical concepts by specialists, encompassing symptoms, diseases, procedures, medications, species, and humans (including family members). CARMEN-I serves as a valuable resource for training and assessing clinical NLP techniques and language models, aiding tasks like de-identification, concept detection, linguistic modifier extraction, document classification, and more. It also facilitates training researchers in clinical NLP and is a collaborative effort involving Barcelona Supercomputing Center's NLP4BIA team, Hospital Clínic, and Universitat de Barcelona's CLiC group.


Background

There is a pressing need to enable access to annotated electronic health records (EHRs) for the development and evaluation of clinical NLP resources, with the aim of implementing de-identification tools and detecting medical variables of interest. This is particularly true for non-English EHRs, for which only a limited number of resources have been published. Given the large volume of hospital data generated in Spanish-speaking countries and the potential for adapting NLP technologies originally developed for Spanish content to other Romance languages (with more than 900 million native speakers), the release of clinical records in Spanish is now essential.

The COVID-19 pandemic demonstrated the urgent need for systems capable of processing and analyzing high volumes of unstructured data locked in large collections of clinical narratives in order to identify patterns, trends, and actionable clinical insights.

This corpus was created to address the demand for these data and to foster the development of clinical NLP tools able to cope with the particularities of real-world clinical language (complex medical jargon, use of abbreviated expressions, typos and spelling errors or ungrammatical sentences).

CARMEN-I is a collaboration between clinical experts from the University Hospital Clínic of Barcelona (HCB) and researchers in AI and NLP from the NLP4BIA team at the Barcelona Supercomputing Center (BSC). It consists of a publicly released set of real clinical records. It is the first corpus of complete, publicly released de-identified EHRs in Spanish covering not only COVID-19, but also a range of comorbidities including cancer and cardiovascular diseases. Sensitive data items have been identified, masked and replaced following the approach of previous efforts in Spanish such as HitzalMed [1].


Methods

CARMEN-I consists of 2,000 documents selected from the EHRs of 6,811 patients with COVID-19. Documents were written at the Hospital Clínic of Barcelona, one of the main tertiary hospitals in Spain, and sampled over a two-year period (March 2020 to March 2022).

The anonymization of document dates involves a multi-step process. Initially, a rule-based system and gazetteers are employed to anonymize the dates, parsing them into days, months, years, and separators like slashes or dots. Written dates are transformed into numerical forms using Spanish and Catalan month names. Dates are then categorized into specific types (e.g., year-only, month and year) to facilitate the replacement process. Subsequently, modifications are applied to days, months, and years in each document using randomly chosen values to ensure temporal coherence and document-specific anonymization. The amounts of modification may be positive or negative, shifting the date into the future or past. Lastly, Python's datetime library is used to calculate the modified dates while preventing illogical dates (e.g., day 32 or month 13) from being generated.
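
The date-shifting step can be illustrated with a minimal sketch. This is not the code used to build CARMEN-I: the function names, the regular expression, and the maximum offset are hypothetical, and the sketch simplifies the procedure by applying a single per-document offset in days rather than separate day, month, and year modifications. It only handles numeric day-first dates; written month names and partial dates (year-only, month and year) are left out.

```python
import random
import re
from datetime import date, timedelta

# Numeric day-first dates such as 03/02/2022, 3.2.2022 or 03-02-2022.
# The same separator must be used on both sides (back-reference \2).
DATE_RE = re.compile(r"\b(\d{1,2})([/.-])(\d{1,2})\2(\d{4})\b")


def random_offset(rng: random.Random, max_days: int = 3650) -> int:
    """Pick a non-zero offset in days; it may be positive or negative."""
    offset = 0
    while offset == 0:
        offset = rng.randint(-max_days, max_days)
    return offset


def shift_dates(text: str, offset_days: int) -> str:
    """Shift every recognised date in one document by the same offset."""

    def _replace(match: re.Match) -> str:
        day, sep, month, year = match.group(1, 2, 3, 4)
        try:
            original = date(int(year), int(month), int(day))
        except ValueError:
            # Not a real calendar date (e.g. day 32): leave it untouched.
            return match.group(0)
        # datetime arithmetic guarantees that the result is a valid date.
        shifted = original + timedelta(days=offset_days)
        return f"{shifted.day:02d}{sep}{shifted.month:02d}{sep}{shifted.year}"

    return DATE_RE.sub(_replace, text)


# Usage: one offset per document keeps intervals between dates intact.
offset = random_offset(random.Random())
print(shift_dates("Ingreso 30/01/2021, alta 02.02.2021.", offset))
```

Because every date in a document moves by the same amount, the interval between admission and discharge in the example is preserved, which is the temporal coherence the procedure aims for. Note that this sketch always writes two-digit days and months, so it does not preserve the original zero-padding.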

The rest of the dataset concepts have been carefully anonymised following a protocol created with the cooperation of clinicians, linguists, and AI academics. Based on the annotation guidelines of the MEDDOCAN anonymisation corpus [2], the reports were reviewed by linguists, who verified the annotation criteria and amended suggestions provided by automatic anonymisation models. With the support of computational linguists from the NLP4BIA team, clinicians next verified that all sensitive information in the annotated documents and the new resynthesized version was masked or hidden from plain sight. Here, masking consisted of replacing the annotation with its semantic class (for instance, the annotation 3/2/2022 was replaced by the label “DATE”).
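
As an illustration of the masking step, the sketch below replaces annotated character spans with their semantic class. It is an assumption-laden example rather than the project's implementation: the function name `mask_spans` is hypothetical, spans are assumed to be non-overlapping `(start, end, label)` tuples with an exclusive end offset, and the label "AGE" in the usage example is invented for illustration.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start offset, exclusive end offset, label)


def mask_spans(text: str, spans: Iterable[Span]) -> str:
    """Replace each annotated span with its label.

    Spans are applied from the end of the text backwards so that the
    offsets of spans that have not been processed yet remain valid.
    Overlapping spans are not supported in this sketch.
    """
    masked = text
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        masked = masked[:start] + label + masked[end:]
    return masked


# Usage with the example from the text above.
report = "Ingreso el 3/2/2022 en planta."
start = report.index("3/2/2022")
print(mask_spans(report, [(start, start + len("3/2/2022"), "DATE")]))
```

Processing spans in reverse order is the design choice that matters here: replacing a span changes the length of the text, which would invalidate the offsets of any span located after it.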

After this process, clinicians at the HCB reviewed each masked report and further assessed each document before including it in the corpus. To this end, they first validated that the annotation of sensitive items was correct, and then checked whether the report met extra criteria. These criteria take into account indirect sensitive data, especially from a clinical point of view (e.g., an uncommon combination of comorbidities or extremely rare diseases). This entire process, including the criteria followed, can be found in a protocol document published as part of CARMEN-I.

In addition, a second version of the data was created in which synthetic equivalents replaced the original sensitive items. These replacements were generated using a complex system of rules specific to each type of sensitive data and custom-created gazetteers (lists of terms). Special attention was paid to creating credible replacements. For instance, all dates within the same document are moved by the same amount to maintain consistent temporality.

Finally, a subset of 500 documents was selected and annotated with relevant clinical concept types (diseases, symptoms/findings, procedures, drugs, species and humans) for the development and benchmarking of information extraction systems. The annotation strategy followed for these entity types was based on publicly released corpora created by the NLP4BIA, such as DisTEMIST [3] for diseases, MedProcNER/ProcTEMIST [4] for clinical procedures or LivingNER [5] for species. An annotation guideline summarizing the rules for all six entities will also be released as part of the project. The annotation took place with the collaboration of HCB clinicians, who contributed their knowledge of the hospital setting.

This project was approved by the HCB Ethics Research Committee. Individual patient consent was waived because the project did not impact clinical care and all protected health information was anonymized.


Data Description

CARMEN-I is a collection of 2,000 anonymized clinical records written in a university hospital (HCB). Specifically, the texts were written between March 1, 2020 and March 1, 2022.

From a clinical point of view, the corpus includes patients with different presentations of COVID-19, mostly in severe form. In addition, since many of the patients had comorbidities and underlying conditions, the corpus also covers conditions that cause immunosuppression (patients undergoing treatment for cancer and organ transplant, treatment with corticosteroids, patients infected with HIV), respiratory diseases (asthma, COPD), cardiovascular diseases, geriatric complexity, and others.

From a linguistic point of view, the clinical records present typical characteristics of electronic health records, namely a large number of ad-hoc acronyms and abbreviations, typos, repetitions, incomplete sentences, and others. Additionally, since the documents come from a Catalan hospital, some are written in both Spanish and Catalan, often mixed in the same document and sentence. It has been estimated that around 15% of the documents include Catalan to some extent. The language of each document has been classified and is made available in an accompanying .tsv file.

CARMEN-I includes five types of medical document: discharge reports (in Spanish, “informe de alta”, or IA); referral letters (“informe de traslado”, or IT); death reports (“informe de exitus”, or IE); progress notes (“curso clínico”, or CC); and imaging reports (“informe de radiología”, or IR). Most documents are discharge, referral and imaging reports, with a few progress notes and death reports included due to their interest for the annotation process.

Discharge and referral letters are composed of the following sections: Medical History (Antecedentes); Progress Notes (Evolución); Physical Exam (Exploración Clínica); Medical Tests (Exploración Complementaria); Surgery (Intervención Quirúrgica); Treatment Plan (Plan Terapéutico); Current Problem (Proceso Actual); Imaging description (Radiografía); and Follow-up (Seguimiento).

Corpus Versions

The corpus is presented in two different versions and both versions include annotations for sensitive and clinical entities.

In total, the corpus includes 2,000 documents, classified as follows: 1,201 imaging reports; 617 discharge reports; 172 referral letters; 5 death reports; and 5 progress notes. Discharge reports and referral letters are not presented in full. Instead, they were divided into sections as stated above due to their excessive length, resulting in: 189 medical histories; 154 progress notes; 72 physical exams; 61 medical tests; 25 surgery reports; 31 treatment plans; 176 current problems; 40 imaging; and 41 follow-up sections.

As for the sensitive data annotation, the corpus includes 18 different labels with a total of 8,228 annotated items. The most common label is date, with 5,384 annotations, followed by patient age with 815 and patient gender with 458. The least common label is website, with only one annotation.

500 documents in the collection include clinical concept annotations for the development of named entity recognition systems. There are a total of 26,144 annotations for 6 different concept types: 5,335 diseases; 7,785 findings; 6,509 procedures; 3,546 drugs; 1,592 species; and 1,377 humans.
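
The figures reported in this section are internally consistent, which can be checked with a few lines of arithmetic. The snippet below is only a sanity check over the counts quoted above; the variable names are illustrative and nothing is read from the dataset itself.

```python
# Counts copied from the corpus description above.
document_types = {
    "imaging reports": 1201,
    "discharge reports": 617,
    "referral letters": 172,
    "death reports": 5,
    "progress notes": 5,
}
sections = {
    "medical histories": 189,
    "progress notes": 154,
    "physical exams": 72,
    "medical tests": 61,
    "surgery reports": 25,
    "treatment plans": 31,
    "current problems": 176,
    "imaging": 40,
    "follow-up": 41,
}
clinical_concepts = {
    "diseases": 5335,
    "findings": 7785,
    "procedures": 6509,
    "drugs": 3546,
    "species": 1592,
    "humans": 1377,
}

total_documents = sum(document_types.values())
total_sections = sum(sections.values())
total_concepts = sum(clinical_concepts.values())

# The document types add up to the full corpus size.
assert total_documents == 2000
# Each discharge or referral document corresponds to exactly one section.
assert total_sections == (
    document_types["discharge reports"] + document_types["referral letters"]
)
# The per-type concept counts add up to the reported total.
assert total_concepts == 26144
```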

Files format and structure

The reports are offered as .txt files, with stand-off annotations (i.e. separate files) available in multiple formats (.ann and .tsv). The files are organized in three folders:

  • txt/: the CARMEN-I text files in two versions: with masked sensitive data (e.g. '01/01/2020' becomes 'FECHAS'; `masked/` folder) and with replaced sensitive data (e.g. '01/01/2020' becomes '03/07/2013'; `replaced/` folder).
  • ann/: the CARMEN-I entity annotations in the stand-off .ann format of the brat annotation tool [6]. Again, there is a different folder for each anonymized version. In addition, sensitive item annotations (anon/ folder) and medical named entity annotations (ner/ folder) are given separately. The brat configuration files (annotation.conf and visual.conf) are also provided. For more information about the `.ann` format, please visit brat's website [7].
  • tsv/: the CARMEN-I entity annotations in .tsv format. Again, there is a different folder for each anonymized version (replaced and masked), and sensitive item annotations and medical named entity annotations are given separately.
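
For readers unfamiliar with brat's stand-off format, the following sketch parses text-bound annotations (lines whose identifier starts with "T"), which have the form `ID<TAB>LABEL START END<TAB>TEXT`, with discontinuous fragments separated by ";". The names `Entity` and `parse_ann` are hypothetical helpers, and the label "ENFERMEDAD" in the sample is invented for illustration; the labels actually used in CARMEN-I are defined in the provided annotation.conf.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ident: str   # brat identifier, e.g. "T1"
    label: str   # annotation label
    start: int   # start offset of the first fragment
    end: int     # end offset of the last fragment
    text: str    # annotated surface text


def parse_ann(content: str) -> List[Entity]:
    """Extract text-bound annotations from the content of a .ann file."""
    entities = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue  # skip notes, relations, attributes and blank lines
        ident, label_and_span, surface = line.split("\t", 2)
        label, span = label_and_span.split(" ", 1)
        # Discontinuous annotations list several "start end" fragments.
        fragments = [fragment.split() for fragment in span.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        entities.append(Entity(ident, label, start, end, surface))
    return entities


sample = (
    "T1\tFECHAS 11 19\t3/2/2022\n"
    "#1\tAnnotatorNotes T1\tfree-text note\n"
    "T2\tENFERMEDAD 25 34;40 45\tneumonía grave\n"
)
for entity in parse_ann(sample):
    print(entity)
```

Collapsing a discontinuous annotation into a single start and end is a simplification; code that needs exact fragments should keep the full fragment list instead.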

Each “.tsv” file contains the following columns: name (associated filename), tag (annotation label), span (start and end character position in text), text (annotation content).

All .txt and .ann files follow the same naming convention: CARMEN-I_{report_type}_{section_type}_{number}.{extension}. For instance: CARMEN-I_IA_ANTECEDENTES_2.txt.

Possible report types are: CC (curso clínico, or clinical notes), IA (informe de alta, or discharge report), IT (informe de traslado, or transfer report), IE (informe de exitus, or death report) and IR (informe de radiología, or radiology report).

Discharge and transfer reports are divided into sections. Possible section types are antecedentes (medical history), evolución (progress notes), exploración clínica (physical examination), exploración complementaria (medical tests), intervención quirúrgica (surgery), plan terapéutico (treatment plan), proceso actual (current problem), radiografía (imaging), and seguimiento (follow-up).
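
The naming convention described above can be parsed with a regular expression, as in the sketch below. Several details are assumptions rather than documented facts: that reports without sections simply omit the section part, that section names may contain further underscores, and that only the .txt and .ann extensions follow this convention. The helper names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Report type codes listed in the description above.
REPORT_TYPES = {
    "CC": "clinical notes",
    "IA": "discharge report",
    "IT": "transfer report",
    "IE": "death report",
    "IR": "radiology report",
}

# CARMEN-I_{report_type}_{section_type}_{number}.{extension}
# The section part is optional here (an assumption, see the lead-in).
FILENAME_RE = re.compile(
    r"CARMEN-I_(?P<report>[A-Z]{2})"
    r"(?:_(?P<section>.+))?"
    r"_(?P<number>\d+)\.(?P<ext>txt|ann)"
)


class ParsedName(NamedTuple):
    report_type: str
    section_type: Optional[str]
    number: int
    extension: str


def parse_carmen_filename(filename: str) -> ParsedName:
    """Split a CARMEN-I file name into its components."""
    match = FILENAME_RE.fullmatch(filename)
    if match is None or match["report"] not in REPORT_TYPES:
        raise ValueError(f"Unexpected file name: {filename!r}")
    return ParsedName(
        match["report"], match["section"], int(match["number"]), match["ext"]
    )


print(parse_carmen_filename("CARMEN-I_IA_ANTECEDENTES_2.txt"))
```

Parsing names this way makes it straightforward to group files by report or section type before training or evaluation.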

Finally, the dataset includes a file called “CARMEN1_mappings.tsv”, in which every file is classified in two aspects: its language (“es” for Spanish, “cat” for Catalan, “bi” for bilingual texts that include a mix of both languages) and whether it has clinical concept recognition annotations (either “True” or “False”).

The txt/, ann/ and tsv/ folders each contain the two anonymized versions of the corpus:

  • Masked text: sensitive data items have been replaced by the concept of the sensitive data type (i.e. a date such as "01/03/2020" in text becomes [**FECHAS**]). The resulting text is visibly anonymized as it introduces special tokens, which might limit its applications in certain types of text. Masked text is especially useful for health centers that cannot create a replaced version of their anonymized data.
  • Replaced text: sensitive data items in the text are replaced with similar, synthetic replacements created using the methodology described in the previous section. Despite some minor inconsistencies, the output is much closer to a real clinical document.

Usage Notes

CARMEN-I is intended for use as a gold standard to train and test NLP tools under development. It is not suitable for clinical research because data related to persons, dates, ages, locations, centers, etc. have been completely substituted by other values. CARMEN-I is available under the Creative Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA) [8]. Users must register and provide information about their intended use of the resource before accessing it. The purpose is to better understand how the resource is used and to inform patients about the current use of shared data. Users must acknowledge the access conditions, including the license, permissions, restrictions, obligations, Data Protection Agreement (DPA), and disclaimer. Users must also agree to use the resource only for its intended purpose and to maintain the anonymization of the data. The license allows users to share, adapt, and build upon the resource for any purpose, including commercial uses, as long as appropriate credit is given and modifications are indicated. If a user detects any expression that might allow a person to be identified, they are obliged to notify the CARMEN-I authors immediately at [9].


Ethics

This project was approved by the HCB Ethics Research Committee. Individual patient consent was waived because the project did not impact clinical care and all protected health information was anonymized.


Acknowledgements

We thank the following people for their participation in the creation of CARMEN-I: firstly, all the clinical specialists at the HCB who kindly contributed their expertise during the height of the COVID-19 pandemic: Helena Ariño, Elisa Asensio, Elena Calvo, and Maria Ángeles Marcos; also, the members of the Universitat de Barcelona’s CLiC group for their valuable contribution: Montse Nofre, Laura Tañá, and Maria Antònia Martí. Finally, we thank Ricard Farreres from Words for Knowledge IT for his assistance in preprocessing the text documents.

We would also like to acknowledge the funding of the Spanish Government’s Encargo del PlanTL to the BSC.


Conflicts of Interest

The authors declare no conflict of interest.


References

  1. Salvador Lima Lopez, Naiara Perez, Laura García-Sardiña, and Montse Cuadros. 2020. HitzalMed: Anonymisation of Clinical Text in Spanish. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 7038–7043, Marseille, France. European Language Resources Association.
  2. Montserrat Marimon, Aitor Gonzalez-Agirre, Ander Intxaurrondo, Heidy Rodríguez, Jose Antonio Lopez Martin, Marta Villegas, and Martin Krallinger. Automatic De-Identification of Medical Texts in Spanish: the MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019).
  3. Antonio Miranda Escalada, Luis Gascó, Salvador Lima-López, Eulàlia Farré-Maduell, Darryl Estrada, Anastasios Nentidis, Anastasia Krithara, Georgios Katsimpras, Georgios Paliouras, and Martin Krallinger. "Overview of DisTEMIST at BioASQ: Automatic detection and normalization of diseases from clinical texts: results, methods, evaluation and multilingual resources." In Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings. 2022.
  4. Salvador Lima-López, Eulàlia Farré-Maduell, Luis Gascó, Anastasios Nentidis, Anastasia Krithara, Georgios Katsimpras, Georgios Paliouras, and Martin Krallinger. "Overview of MedProcNER task on medical procedure detection and entity linking at BioASQ 2023." In Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum. 2023.
  5. Antonio Miranda-Escalada, Eulàlia Farré-Maduell, Salvador Lima-López, Darryl Estrada, Luis Gascó, and Martin Krallinger. "Mention detection, normalization & classification of species, pathogens, humans and food in clinical documents: Overview of LivingNER shared task and resources." Procesamiento del Lenguaje Natural (2022).
  6. Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou and Jun'ichi Tsujii (2012). brat: a Web-based Tool for NLP-Assisted Text Annotation. In Proceedings of the Demonstrations Session at EACL 2012.
  7. brat standoff format. [Online]. Available on: https://brat.nlplab.org/standoff.html. [Last accessed: 19-Jul-2023]
  8. Creative Commons. (n.d.). Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). Retrieved June 25, 2023, from https://creativecommons.org/licenses/by-sa/4.0/
  9. Hospital Clínic. (n.d.). Email communication: Notification of personal data finding in the corpus [Email to infosic@clinic.cat].

Access

Access Policy:
Only credentialed users who sign the DUA can access the files. In addition, users must have individual studies reviewed by the contributor.

License (for files):
PhysioNet Contributor Review Health Data License 1.5.0

Data Use Agreement:
PhysioNet Contributor Review Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

Versions
  • 1.0 - Nov. 2, 2023
  • 1.0.1 - April 20, 2024
