The OOD Illusion: Revisiting Out-of-Distribution Evaluation in Physics-Informed Neural Networks
We present a systematic analysis of out-of-distribution (OOD) evaluation practices in physics-informed machine learning, revealing that standard benchmarks primarily test interpolation rather than extrapolation. Through representation-space analysis of learned features, we demonstrate that 96-97% of samples labeled as "far out-of-distribution" in physics learning benchmarks fall within the 99th percentile of training-set distances. We observe performance degradation factors of 3,000-55,000x between published results and controlled reproductions, suggesting that reported successes often reflect comprehensive training coverage rather than genuine extrapolation ability. When tested on genuinely out-of-distribution scenarios involving time-varying physical parameters, all evaluated models degrade substantially, with mean squared error (MSE) increasing by up to 8.9 million. Our findings suggest that many published results claiming successful physics extrapolation warrant reinterpretation, and they highlight the need for more rigorous evaluation protocols that distinguish interpolation from extrapolation in learned representations.
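As a rough illustration of the distance-based check described above, the sketch below measures how many test representations fall inside the training set's 99th-percentile nearest-neighbor distance envelope. This is a minimal sketch under our own assumptions: the function names, the Euclidean metric, and the use of leave-one-out training distances are illustrative choices, not the paper's exact protocol.

```python
# Minimal sketch: fraction of "OOD" test samples whose representation-space distance to the
# training set falls below the 99th percentile of training-set nearest-neighbor distances.
# All names and the Euclidean nearest-neighbor metric are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import cdist


def fraction_within_training_envelope(train_feats: np.ndarray,
                                      test_feats: np.ndarray,
                                      percentile: float = 99.0) -> float:
    """Fraction of test representations whose nearest-neighbor distance to the
    training representations lies below the given percentile of the training
    set's own leave-one-out nearest-neighbor distances."""
    # Leave-one-out nearest-neighbor distances within the training set.
    train_d = cdist(train_feats, train_feats)
    np.fill_diagonal(train_d, np.inf)
    threshold = np.percentile(train_d.min(axis=1), percentile)

    # Nearest-neighbor distance from each test sample to the training set.
    test_nn = cdist(test_feats, train_feats).min(axis=1)
    return float((test_nn <= threshold).mean())


if __name__ == "__main__":
    # Random vectors stand in for learned representations extracted from a trained model.
    rng = np.random.default_rng(0)
    train = rng.normal(size=(1000, 64))
    test = rng.normal(size=(200, 64))
    print(f"Fraction inside the 99th-percentile envelope: "
          f"{fraction_within_training_envelope(train, test):.1%}")
```

A high fraction under a check of this kind would indicate that the nominally "far OOD" test samples are effectively interpolation targets in representation space, which is the interpretation the abstract draws for the benchmarks studied.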