Wearable devices and smartphones generate rich behavioral time series that can support proactive health interventions, yet systematic comparisons of modern forecasting architectures are lacking. This study benchmarks six deep learning architectures, two zero-shot foundation models (FMs), and statistical baselines on three public datasets covering more than 800 participants, reporting per-feature metrics for step counts, screen time, and sleep duration at forecast horizons of 1 to 8 days. We further conduct a per-feature personalization study and assess FM transferability across dataset sizes and temporal granularities.
Key findings include:
- No single architecture dominates; PatchTST leads among trained models while the three runners-up (TCN, MLP, Transformer) show no meaningful performance difference;
- The foundation model TimesFM, used zero-shot, matches or exceeds the trained models, especially in low-data regimes;
- Participant-level fine-tuning reduces per-feature RMSE by 16-60%, with sleep benefiting most and step counts least.
These results provide practical guidance on architecture selection, FM applicability, and personalization strategies for mobile health forecasting. To the best of our knowledge, this is the first study to jointly evaluate modern deep learning, FMs, and personalization for multi-horizon behavioral forecasting from wearables.
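To make the per-feature RMSE reductions quoted above concrete, here is a minimal sketch of how such a number could be computed. It is not the authors' evaluation code: the array layout `(samples, horizon, features)`, the function names, and the toy data are all assumptions made for illustration, and the synthetic error scales are chosen arbitrarily rather than taken from the study.

```python
import numpy as np

# Hypothetical feature order; the summary names these three signals.
FEATURES = ["steps", "screen_time", "sleep"]


def per_feature_rmse(y_true, y_pred):
    """RMSE per feature for arrays shaped (samples, horizon, features)."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Average squared error over samples and horizon steps, keep the feature axis.
    return np.sqrt(np.mean(err ** 2, axis=(0, 1)))


def relative_reduction(rmse_before, rmse_after):
    """Percentage by which RMSE drops from `rmse_before` to `rmse_after`."""
    return 100.0 * (rmse_before - rmse_after) / rmse_before


# Synthetic toy data (NOT from the study): 50 windows, 8-day horizon, 3 features.
rng = np.random.default_rng(0)
y_true = rng.normal(size=(50, 8, 3))
noise = rng.normal(size=y_true.shape)

# Pretend a global model has unit-scale errors and a personalized model
# shrinks them by an arbitrary, made-up factor per feature.
shrink = np.array([0.8, 0.6, 0.4])
pred_global = y_true + noise
pred_personal = y_true + noise * shrink

rmse_global = per_feature_rmse(y_true, pred_global)
rmse_personal = per_feature_rmse(y_true, pred_personal)
reduction = relative_reduction(rmse_global, rmse_personal)

for name, pct in zip(FEATURES, reduction):
    print(f"{name}: RMSE reduced by {pct:.1f}%")
```

Reporting the reduction per feature, rather than as one pooled number, is what lets a statement like "sleep benefits most and step counts least" be made at all; pooling across features with different scales would let the largest-magnitude signal dominate.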
Blogger's Review: This research fills a critical gap in forecasting from wearable-device data. It compares a range of deep learning architectures and foundation models side by side and shows how much participant-level fine-tuning can help. Its systematic assessment of where each model applies gives researchers and developers working on health interventions clear guidance.