[CS.AI] LLM Features May Harm GNN Performance: Concatenation Interference on Homophilous Graphs

Adding LLM-generated node features to graph neural networks (GNNs) is often reported to enhance accuracy on standard benchmarks. However, we document a contrasting observation: introducing LLM features via pure input concatenation can systematically degrade accuracy on homophilous benchmarks where end-to-end LLM pipelines succeed.

Using an MLP backbone with bag-of-words original features, we found that concatenating SBERT-encoded GPT-4o-mini TAPE features reduced PubMed test accuracy by 17.0 +/- 0.3 pp and Cora test accuracy by 4.3 +/- 0.6 pp (CiteSeer: -0.6 +/- 0.8 pp, within seed noise). The accuracy drop attenuates as we relax these conditions (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses into a gain on medium-homophily datasets like WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp).
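To make "pure input concatenation" concrete, here is a minimal sketch in NumPy. It is not the authors' code: the function names `concat_features` and `delta_pp`, the toy shapes, and the random inputs are all assumptions made for illustration. The sketch stacks the original and LLM-derived feature matrices along the feature axis, and shows one plausible way to summarize a per-seed accuracy change as mean +/- standard deviation in percentage points.

```python
import numpy as np


def concat_features(x_orig: np.ndarray, x_llm: np.ndarray) -> np.ndarray:
    """Pure input concatenation: join both feature sets along the feature axis."""
    if x_orig.shape[0] != x_llm.shape[0]:
        raise ValueError("both feature matrices need one row per node")
    return np.concatenate([x_orig, x_llm], axis=1)


def delta_pp(acc_base, acc_concat):
    """Mean and sample std (over seeds) of the accuracy change, in percentage points.

    Inputs are per-seed test accuracies in [0, 1], paired by seed.
    A negative mean indicates that concatenation hurt accuracy.
    """
    base = np.asarray(acc_base, dtype=float)
    concat = np.asarray(acc_concat, dtype=float)
    diff = 100.0 * (concat - base)
    return float(diff.mean()), float(diff.std(ddof=1))


# Toy shapes only (hypothetical): 5 nodes, 8 bag-of-words dims, 4 embedding dims.
rng = np.random.default_rng(0)
x_bow = rng.integers(0, 2, size=(5, 8)).astype(np.float32)
x_llm = rng.normal(size=(5, 4)).astype(np.float32)
x_cat = concat_features(x_bow, x_llm)
print(x_cat.shape)  # (5, 12)
```

The concatenated matrix would then be fed to the backbone (an MLP in the headline setting) in place of the original features. Nothing else in the pipeline changes, which is what distinguishes this setup from end-to-end LLM pipelines.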

To predict when concatenation helps versus hurts, we propose a simple measure of LLM-alone discriminability, $\Delta_{\mathrm{sig}}$. Across 9 datasets, $\Delta_{\mathrm{sig}}$ correlates more strongly with concatenation cost than homophily does ($r^2 = 0.38$ vs. $0.06$; $N=9$), although the bootstrap CIs overlap. The bootstrap-best change-point is $\tau = 13.8$ pp, and the rule is that concatenation helps when $\Delta_{\mathrm{sig}} > \tau$.
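The threshold rule and the $r^2$ comparison can be sketched as follows. This is an illustrative sketch rather than the paper's implementation: the text above does not say how the discriminability measure itself is computed, so it is treated here as a precomputed input in percentage points, and the helper names `concat_expected_to_help` and `r_squared` are hypothetical.

```python
import numpy as np

TAU_PP = 13.8  # change-point reported in the text, in percentage points


def concat_expected_to_help(delta_sig_pp: float, tau_pp: float = TAU_PP) -> bool:
    """Decision rule from the text: concatenation helps when Delta_sig > tau."""
    return delta_sig_pp > tau_pp


def r_squared(x, y) -> float:
    """Squared Pearson correlation between a predictor and the concatenation cost."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    return float(r ** 2)
```

In use, one would call `r_squared` once with per-dataset discriminability values and once with per-dataset homophily values, each against the same vector of concatenation costs, and compare the two results. With only 9 datasets and overlapping bootstrap CIs, the rule reads better as a heuristic than as a firm guarantee.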

Blogger's Review: This study highlights potential issues with LLM features in GNNs, underscoring the importance of input handling methods on model performance. Choosing the right feature fusion strategy is crucial in practical applications, especially on homophilous graph datasets. The findings provide a fresh perspective and direction for future feature design.
