Large language models (LLMs) are increasingly part of everyday clinical decision-making, yet they are trained on datasets that were frozen long before the models were released. The time gap between training and use, a consequence of the model's knowledge cutoff, has proven to be a subtle but crucial failure mode: a model may be capable and aligned, yet offer out-of-date medical advice with complete fluency. We investigate how much difference data freshness alone makes to a model's clinical performance. To isolate the effect of the cutoff, we compared two model families with differing release patterns, the closed-weight GPT models and the open-weight LLaMA models, using two dated versions of the IDSA COVID-19 Treatment and Management Guidelines (v5.0.0, 8/25/21; v11.0.0, 6/26/23). We compared the two versions at the recommendation level and generated 363 multiple-choice questions that represent true changes in treatment advice. All models were queried on the same questions with identical prompts and deterministic settings. Accuracy rises sharply only when a model's assumed training cutoff falls after the date of the newer guideline. GPT-3.5-Turbo and LLaMA-2-13B, with their older cutoffs, fail to match the accuracy of models with cutoffs later than 6/26/23, whereas the more recently trained GPT-4o, GPT-5, and LLaMA-3.3-70B all achieve over 90% accuracy and perform similarly to one another. These results indicate that fresh training data by itself improves applied medical reasoning, and that the gain is not explained by parameter count alone. Users of these systems may place outsized trust in their fluency even when the information conveyed is out of date, especially in sensitive or stressful situations.
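The comparison described above can be illustrated with a minimal sketch: score each model's multiple-choice answers against a shared answer key, then label the model by whether its assumed cutoff falls after the newer guideline date (6/26/23). The function names, the data layout, the model names, their cutoff dates, and the four-question answer key below are all hypothetical and chosen only for illustration; they are not the study's code, models, or results.

```python
from datetime import date

# Release date of the newer guideline version (v11.0.0), taken from the text.
NEWER_GUIDELINE_DATE = date(2023, 6, 26)


def accuracy(predictions, answer_key):
    """Fraction of questions where the predicted option matches the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have equal length")
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)


def cutoff_group(assumed_cutoff, guideline_date=NEWER_GUIDELINE_DATE):
    """Label a model by whether its assumed cutoff falls strictly after the guideline."""
    return "post-guideline" if assumed_cutoff > guideline_date else "pre-guideline"


def summarize(models, answer_key):
    """Return {model_name: (cutoff_group, accuracy)} for every model."""
    return {
        name: (cutoff_group(info["cutoff"]),
               accuracy(info["predictions"], answer_key))
        for name, info in models.items()
    }


# Toy, made-up data (hypothetical models, cutoffs, and answers).
answer_key = ["A", "C", "B", "D"]
models = {
    "older_model": {"cutoff": date(2021, 9, 1),
                    "predictions": ["A", "B", "B", "A"]},
    "newer_model": {"cutoff": date(2023, 12, 1),
                    "predictions": ["A", "C", "B", "D"]},
}

summary = summarize(models, answer_key)
for name, (group, acc) in summary.items():
    print(f"{name}: {group}, accuracy={acc:.2f}")
```

In this sketch a cutoff that coincides exactly with the guideline date is treated as pre-guideline, which is one possible reading of "falls after"; a real analysis would also need to account for the uncertainty in each model's assumed cutoff.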
Publication Date: 2026-06-19