In electronic health record (EHR) foundation models, ICD diagnosis codes are typically treated as flat tokens, which discards the clinically meaningful hierarchy of disease families, subcategories, and fine-grained diagnostic detail. As a result, existing EHR representation learning methods do not explicitly exploit the structure built into the coding system. This work investigates the ICD-10-CM hierarchy as a general inductive bias for clinical representation learning.
We explore two complementary mechanisms for incorporating hierarchy: first, by augmenting diagnosis sequences in a BERT-style transformer with tokens corresponding to different levels of the ICD hierarchy; second, by injecting hierarchy into graph-based code representations through hierarchy-aware edges combined with diagnosis co-occurrence structure.
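To make the two mechanisms concrete, here is a minimal Python sketch; it is an illustration, not the paper's implementation. It assumes hierarchy levels can be approximated by character prefixes of the undotted ICD-10-CM code (3 characters for the category, 4 for the subcategory). The function names, the level names, and the `level:code` token format are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def hierarchy_levels(code: str) -> dict[str, str]:
    """Map an ICD-10-CM code to prefix-based hierarchy levels.

    Assumption: category = first 3 characters, subcategory = first 4,
    both taken from the code with its dot removed.
    """
    flat = code.replace(".", "").upper()
    return {"category": flat[:3], "subcategory": flat[:4], "full": flat}


def augment_visit(codes: list[str], levels=("category", "subcategory", "full")) -> list[str]:
    """Mechanism 1: expand each diagnosis into one token per chosen level."""
    tokens = []
    for code in codes:
        lv = hierarchy_levels(code)
        tokens.extend(f"{name}:{lv[name]}" for name in levels)
    return tokens


def build_edges(visits: list[list[str]]):
    """Mechanism 2: hierarchy (child, parent) edges plus co-occurrence counts."""
    hierarchy_edges = set()
    cooccurrence = Counter()
    for codes in visits:
        fulls = sorted({hierarchy_levels(c)["full"] for c in codes})
        for full in fulls:
            lv = hierarchy_levels(full)
            chain = [lv["full"], lv["subcategory"], lv["category"]]
            for child, parent in zip(chain, chain[1:]):
                if child != parent:  # short codes collapse onto their parent
                    hierarchy_edges.add((child, parent))
        # Undirected co-occurrence between codes recorded in the same visit.
        for a, b in combinations(fulls, 2):
            cooccurrence[(a, b)] += 1
    return hierarchy_edges, cooccurrence
```

Restricting `levels` in `augment_visit` to, say, `("category", "full")` is one simple way to test which hierarchy levels help a given task. The hierarchy edges and co-occurrence counts from `build_edges` could then be fed to any graph embedding method.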
We evaluate whether explicit hierarchy improves downstream prediction, which levels of the hierarchy are most useful, whether hierarchy encoding enhances transfer across datasets, and how hierarchy reshapes embedding similarity structure. Experiments are conducted on two large-scale real-world clinical datasets: MIMIC-IV for pretraining and in-domain evaluation, and eICU for assessing cross-dataset transfer via frozen encoder probing.
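Frozen encoder probing can be sketched as follows. This is a toy illustration, not the paper's setup: the pretrained encoder is replaced by a fixed embedding table with mean pooling, and the probe is a plain logistic regression trained with stochastic gradient descent. The function names, the table, and the data are hypothetical.

```python
import math


def encode(tokens: list[str], table: dict[str, list[float]], dim: int) -> list[float]:
    """Stand-in for a frozen encoder: mean-pool fixed token vectors.

    The table is only read, never updated, which is what "frozen" means here.
    """
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def predict(w: list[float], b: float, x: list[float]) -> float:
    """Probability of the positive class under the linear probe."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe; only w and b are trained."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            grad = predict(w, b, x) - y  # gradient of log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b
```

In a cross-dataset setting, the features would come from an encoder pretrained on one dataset and applied unchanged to patients from the other. Only the lightweight probe sees the target labels.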
Our findings indicate that explicitly encoding the ICD hierarchy improves over flat code representations in both in-domain and cross-dataset settings, and that the most useful level of the hierarchy depends on both the task and the modeling approach. More broadly, we show that the benefits of hierarchy-aware EHR representation learning generalize across modeling settings and hierarchy levels.
Blogger's Review: This paper examines how the hierarchical structure of ICD codes can inform EHR representation learning, injecting it through two routes: hierarchy tokens in a BERT-style transformer and hierarchy-aware edges in graph-based code representations. It shows the potential of hierarchy encoding for clinical data analysis and offers a fresh perspective for practical applications, making it a noteworthy contribution to the field.