Database Credentialed Access

Annotated MIMIC-IV discharge summaries for a study on deidentification of names

Shulammite Lim, Yuxin Xiao, Alistair Johnson, Dana Moukheiber, Lama Moukheiber, Mira Moukheiber, Marzyeh Ghassemi, Tom Pollard

Published: July 5, 2023. Version: 1.0


When using this resource, please cite:
Lim, S., Xiao, Y., Johnson, A., Moukheiber, D., Moukheiber, L., Moukheiber, M., Ghassemi, M., & Pollard, T. (2023). Annotated MIMIC-IV discharge summaries for a study on deidentification of names (version 1.0). PhysioNet. https://doi.org/10.13026/63ab-qf77.

Additionally, please cite the original publication:

Xiao Y, Lim S, Pollard TJ, Ghassemi M. In the Name of Fairness: Assessing the Bias in Clinical Record De-identification. FAccT ’23, June 12–15, 2023, Chicago, IL, USA. https://doi.org/10.1145/3593013.3593982

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

We introduce a dataset of people’s names and clinical note templates to support research on the demographic bias of de-identification systems. The dataset contains 16 name sets that vary along four demographic dimensions: gender, race, name popularity, and the decade of popularity. It also includes 100 clinical templates manually curated from MIMIC-IV discharge summaries. These templates are then populated with names generated from the 16 name sets to investigate the demographic bias in nine public and private de-identification methods. The dataset provides a valuable resource for researchers who are interested in understanding and mitigating the unfairness of de-identification systems.


Background

Data sharing is crucial for open science and reproducible research, but the legal sharing of clinical data requires the removal of protected health information from electronic health records. This process, known as de-identification, is often achieved through the use of machine learning algorithms by many commercial and open-source systems. While these systems have shown compelling results on average, the variation in their performance across different demographic groups has not been thoroughly examined. It is therefore imperative to address the bias in existing methods so that downstream stakeholders can build high-quality systems that serve all demographic groups fairly.

As part of a study to explore the demographic bias of de-identification models, we created a collection of 100 discharge summaries sourced from a pre-release version of the MIMIC-IV database [1-3]. At the time the study was carried out, the discharge summaries were not yet publicly available, ensuring that they had not been seen by the de-identification models prior to evaluation.


Methods

We first randomly extracted 100 discharge summaries (50 male patients, 50 female patients) from a pre-release version of the MIMIC-IV database. We annotated protected health information within the notes so that they could be used as "templates", allowing entities to be replaced with alternatives from a lookup table. Our research focus was bias in the de-identification of patient names, so we then created 16 name sets that varied along four demographic dimensions: gender, race, name popularity, and the decade of popularity.

  • Names for the general analysis:
    • In this dataset, we compute the popularity of first names for each gender based on the U.S. Social Security dataset [4] across the entire population, rather than for each racial group. We then select names that are primarily associated with a self-identified racial group with a margin of over 10%, based on the mortgage application dataset in [5]. We note that this is different from picking the most popular names for each racial group independently.
    • In the U.S. setting, all names of top popularity, as measured by absolute frequency ranking, are associated with the White racial group. For this reason, we consider names associated with the Black, Asian, or Hispanic groups that are of medium popularity. First names of medium popularity for each race and gender (i.e., Name Sets 3, 4, 7, 8, 9, 10, 11, and 12) are randomly sampled from those with a frequency ranking between 400 and 8,000 in the entire population in the 2000s. First names of bottom popularity for the White group (i.e., Name Sets 5 and 6) are randomly sampled from those occurring exactly five times in the 2000s. We set each name set to 20 names because, following the procedure described above, there are only 20 names of medium popularity in the 2000s that are primarily used by Black males. We also ensure that first names of top popularity within each gender and decade are mutually exclusive (i.e., no shared first names across Name Sets 1, 2, 13, 14, 15, and 16).
    • We prepare last names in a similar fashion based on the 2000 Census dataset alone [6], because we assume that the last name popularity is relatively fixed. Specifically, this means that the most popular last names for the White racial group in the 1970s and 1940s are assigned to be the same as those in the 2000s.
  • Templates for the general analysis:
    • We manually curate 100 clinical note templates based on hospital discharge records from Beth Israel Lahey Health between 2017 and 2019. We follow the HIPAA Safe Harbor provisions by marking the occurrence of names in the templates and replacing other PHI classes with realistic, synthetic values.
    • We use **NAME-{number}{gender}** to mark the occurrence of names in the templates. {number} starts from 1 and indicates the ID of a unique name appearing in a template. {gender} can be one of A, M, or F; M or F indicates that the gender of the marked name can easily be inferred from the local context, while A indicates that it cannot. A minimal sketch of how these markers are populated with sampled names is given after this list.
    • We use **AGE**, **CONTACT**, **DATE**, **HOSPITAL**, **ID**, **LANGUAGE**, **LOCATION**, **PROFESSION**, and **OTHER** to mark the other respective PHI classes.
  • Context and names for the fine-tuning analysis:
    • We prepare the fine-tuning de-identification datasets by considering two types of context and two types of names. We treat the longitudinal clinical narratives in the 2014 i2b2 de-identification challenge [7] as the clinical context and the Wikipedia articles in the DocRED dataset [8] as the general context. We generate 160 diverse names by randomly sampling ten names from each of the 16 name sets created above, and 160 popular names by taking the most popular names over the three chosen decades that do not appear in the 16 name sets. For each type of context, we randomly sample 1,000 templates for training and 100 for validation. These templates are then populated with the names of each type (i.e., diverse names and popular names) separately. In this way, we create four fine-tuning setups in total by pairing the two types of context with the two types of names.
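
A minimal sketch of the template-population step described above is shown below. The regular expression follows the **NAME-{number}{gender}** convention used in the templates, but the CSV column names ("set_id", "name") and the example paths are illustrative assumptions only; the actual pipeline is in Preparation/note.ipynb in our GitHub repository [9].

```python
import csv
import random
import re

# Sketch only: populate a marked template with names drawn from one of the 16
# name sets. The CSV columns ("set_id", "name") are assumptions for
# illustration; see Preparation/note.ipynb for the actual pipeline.

def load_name_set(path, set_id):
    """Return the names belonging to one name set."""
    with open(path, newline="") as f:
        return [row["name"] for row in csv.DictReader(f) if row["set_id"] == set_id]

def populate(template, first_names, last_names):
    """Replace each **NAME-{number}{gender}** marker with a sampled full name.

    The same marker (e.g., **NAME-1F**) always maps to the same sampled name,
    so repeated mentions of one person stay consistent within a note.
    """
    assigned = {}

    def substitute(match):
        marker = match.group(0)
        if marker not in assigned:
            assigned[marker] = f"{random.choice(first_names)} {random.choice(last_names)}"
        return assigned[marker]

    return re.sub(r"\*\*NAME-\d+[AMF]\*\*", substitute, template)

first_names = load_name_set("general/input/names-first.csv", set_id="3")
last_names = load_name_set("general/input/names-last.csv", set_id="3")
print(populate("Pt **NAME-1F** was seen by Dr. **NAME-2A** on admission.", first_names, last_names))
```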

Data Description

The project contains three subfolders, as outlined below. Additional information can be found in our GitHub repository and accompanying paper [9, 10].

  • The general folder: the data used for our general analysis.
    • input/names-first.csv and input/names-last.csv store the first and last names in the 16 name sets we created. Please refer to Section 3.2 in our paper and Preparation/name.ipynb in our GitHub repository for full details.
    • input/notes-base.csv stores the 100 manually curated clinical templates where we marked the occurrence of names and other categories of protected health information (PHI).
    • input/notes-input.jsonl and input/notes-label.jsonl store the 16,000 notes and the corresponding labels, respectively, for evaluating the performance on de-identifying names. Please refer to Section 3.4 in our paper and Preparation/note.ipynb in our GitHub repository for full details. A minimal loading sketch is given after this list.
    • output/notes-{method}.jsonl stores the prediction output of each of the nine evaluated de-identification baseline methods.
  • The polysemy folder: the data used for assessing how polysemous names affect model performance.
    • input/polysemies-input.jsonl and input/polysemies-label.jsonl store the notes and the corresponding labels, respectively, for evaluating the performance on de-identifying polysemous names. Please refer to Section 5.1 in our paper and Preparation/polysemy.ipynb in our GitHub repository for full details.
    • output/polysemies-{method}.jsonl stores the prediction output of each of the nine evaluated de-identification baseline methods on polysemous names.
  • The finetune folder: the data used for fine-tuning spaCy and NeuroNER. Here, we consider two types of context (i.e., general and clinical) and two types of names (i.e., popular and diverse). Please refer to Section 6.1 in our paper and Preparation/finetune.ipynb in our GitHub repository for full details.
    • input/context-general.jsonl and input/context-clinical.jsonl contain the two types of context.
    • input/names-popular.jsonl and input/names-diverse.jsonl contain the two types of names. input/names-test.jsonl contains the names used for populating the test notes.
    • input/inputs-{context}+{name}.jsonl and input/labels-{context}+{name}.jsonl contain the notes and the corresponding labels, respectively, for fine-tuning.
    • input/inputs-test.jsonl and input/labels-test.jsonl contain the notes and the corresponding labels, respectively, for evaluating fine-tuned methods.
    • output/finetunes-{context}+{name}-{method}-{seed}.jsonl contains the prediction output of each of the two fine-tuned methods.
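
To make the file layout above concrete, the following is a minimal loading sketch. The per-line JSON structure is not documented here, so the sketch only checks that the input, label, and output files align line by line; the method name in the output path is a placeholder. Please consult the paper and the GitHub repository [9, 10] for the actual schema of each .jsonl file.

```python
import json

# Sketch only: pair each note with its gold labels and one method's predictions.
# The method name in the output path is an illustrative placeholder, and no
# particular per-record schema is assumed.

def read_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

notes = read_jsonl("general/input/notes-input.jsonl")
labels = read_jsonl("general/input/notes-label.jsonl")
preds = read_jsonl("general/output/notes-method.jsonl")  # hypothetical method name

assert len(notes) == len(labels) == len(preds)
for note, label, pred in zip(notes[:3], labels[:3], preds[:3]):
    print(str(note)[:80], str(label)[:80], str(pred)[:80])
```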

Usage Notes

We used this dataset to explore the bias of de-identification systems with respect to names in clinical notes via a large-scale empirical analysis. Our study found statistically significant performance gaps along demographic dimensions in most of our examined methods. The study illustrates that de-identification quality is affected by polysemy in names, gender context, and clinical note characteristics.

We have made the relevant preparation and analysis code for this study available in a GitHub repository under an open source license [9, 10]. Researchers are encouraged to use the code and dataset to reproduce the results presented in our paper, to audit other existing de-identification baselines, and to develop future de-identification methods. 
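
As one example of such an audit, recall on name mentions could be computed per name set roughly as below. The record fields used here ("name_set", "gold_names", "predicted_names") are illustrative assumptions and may not match the released schema; the actual evaluation code is in the GitHub repository [9].

```python
import json
from collections import defaultdict

# Sketch only: compute recall of gold name mentions per name set from paired
# label and prediction .jsonl files. The field names ("name_set", "gold_names",
# "predicted_names") are assumptions for illustration.

def recall_by_name_set(label_path, pred_path):
    hits, totals = defaultdict(int), defaultdict(int)
    with open(label_path) as lf, open(pred_path) as pf:
        for label_line, pred_line in zip(lf, pf):
            label, pred = json.loads(label_line), json.loads(pred_line)
            predicted = set(pred.get("predicted_names", []))
            for name in label.get("gold_names", []):
                totals[label["name_set"]] += 1
                hits[label["name_set"]] += name in predicted
    return {s: hits[s] / totals[s] for s in totals}

scores = recall_by_name_set("general/input/notes-label.jsonl",
                            "general/output/notes-method.jsonl")  # hypothetical method name
for name_set, recall in sorted(scores.items()):
    print(f"name set {name_set}: recall {recall:.3f}")
```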

Researchers should not perform any demographic inference based on our name sets as part of a classification system or training set. We do not believe that the categories in the name sets should be viewed as scientific truth, and we recognize the larger critical interrogation surrounding whether gender and ethnicity should be discerned from names in such systems [11]. We recommend that researchers use these categories in the spirit in which they were created by the U.S. Office of Management and Budget: to “monitor and redress social inequality” [12].


Release Notes

v1.0: Initial release.


Ethics

Limitations of Standardized Demographic Categories:

  • We acknowledge the limitation of using standardized self-reported racial categorization and binary gender groups when composing the name sets. More fine-grained racial categorizations are possible in future work, and there could be variety in the linguistic norms and naming traditions even within each racial group we consider. Transgender and non-binary gender groups are also important to consider in future work, as these groups may use gender-neutral names or have variations in name usage between records.

Limitations of the Dataset:

  • We acknowledge that our datasets are limited to the U.S., and therefore our findings need to be reproduced in other contexts with distinct name distributions. Furthermore, our use of the mortgage application dataset for self-reported racial matching is limited to those who have the financial security to apply for a loan. As we do not have access to other sources of names and self-reported races, we use the available data to demonstrate that, even in this presumably more privileged subset of the population, there are de-identification gaps.

Acknowledgements

This project is supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under NIH grant number R01EB030362.


Conflicts of Interest

The authors have no conflicts of interest to report.


References

  1. Johnson, A., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet. https://doi.org/10.13026/1n74-ne17.
  2. Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV (version 2.2). PhysioNet. https://doi.org/10.13026/6mm1-ek67.
  3. Johnson, A.E.W., Bulgarelli, L., Shen, L. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 10, 1 (2023). https://doi.org/10.1038/s41597-022-01899-x
  4. Social Security. Popular Baby Names. (n.d.). https://www.ssa.gov/oact/babynames/limits.html. [Accessed 30-June-2022]
  5. Tzioumis, K. Demographic aspects of first names. Sci Data 5, 180025 (2018). https://doi.org/10.1038/sdata.2018.25
  6. U.S. Census Bureau. (2021, October 8). Decennial Census Surname Files (2010, 2000). Census.gov. https://www.census.gov/data/developers/data-sets/surnames.html. [Accessed 30-June-2022]
  7. Stubbs, A., & Uzuner, Ö. (2015). Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of biomedical informatics, 58, S20-S29.
  8. Yao, Y., Ye, D., Li, P., Han, X., Lin, Y., Liu, Z., ... & Sun, M. (2019). DocRED: A large-scale document-level relation extraction dataset. arXiv preprint arXiv:1906.06127.
  9. Github repository for the bias in deidentification project: https://github.com/xiaoyuxin1002/bias_in_deid [Accessed: 20 May 2023]
  10. Xiao Y, Lim S, Pollard TJ, Ghassemi M. In the Name of Fairness: Assessing the Bias in Clinical Record De-identification. FAccT ’23, June 12–15, 2023, Chicago, IL, USA
  11. Lockhart, J. W., King, M. M., & Munsch, C. (2023). Name-based demographic inference and the unequal distribution of misrecognition. Nature Human Behaviour, 1-12.
  12. Bliss, C. (2012). Race decoded: The genomic fight for social justice. Stanford University Press.

Parent Projects
Annotated MIMIC-IV discharge summaries for a study on deidentification of names was derived from the MIMIC-IV database [1-3]. Please cite these resources when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
