[CS.AI] STREAM: Multi-Tier LLM Inference Middleware with Dual-Channel HPC Token Streaming

Published at: 2026-06-16 22:00
Last updated: 2026-06-17 01:38
#AI #Machine Learning #Open Source

Researchers and practitioners working with large language models face a fragmented landscape. Local models are free and private, but hardware limits their model size and context window. Institutional HPC centers offer powerful GPU resources at no marginal cost, but they sit behind firewalls and are designed for batch jobs rather than interactive use. Commercial cloud APIs provide frontier-model quality on demand, but they impose significant costs and data retention policies unsuitable for sensitive research data. No existing system unifies all three. STREAM (Smart Tiered Routing Engine for AI Models) addresses this gap with four contributions:

  1. Three-tier routing architecture: Combines local, HPC, and cloud inference with a local LLM-based complexity judge.
  2. Dual-channel HPC streaming architecture: Separates the Globus Compute control plane (authentication and job dispatch) from a WebSocket relay data plane (token delivery). This enables sub-second time to first token (TTFT) through institutional firewalls without VPN or firewall rule changes: 0.54 s median, a 21.1x improvement over batch mode's 11.40 s. End-to-end AES-256-GCM encryption ensures the relay operator cannot read token payloads.
  3. Tier-aware context summarization: Prevents long conversations from forcing simple queries onto expensive tiers.
  4. HPC-as-API proxy mode: Exposes HPC inference as an OpenAI-compatible endpoint callable from any standard client with no HPC expertise, a deployment pattern made practical only by the sub-second TTFT of contribution (2).
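
To make contributions (1) and (3) concrete, here is a minimal routing sketch. It is not STREAM's implementation: the `Tier` names mirror the three tiers described above, but the score thresholds, the context limits, and the `route` function are all hypothetical, and the local LLM judge is stood in for by a plain score between 0 and 1. The idea it illustrates is that the tier is picked from query complexity alone, and an oversized conversation history is flagged for summarization instead of pushing the query to a more expensive tier.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOCAL = "local"
    HPC = "hpc"
    CLOUD = "cloud"


# Hypothetical per-tier context limits in tokens (assumed, not from the paper).
CONTEXT_LIMIT = {Tier.LOCAL: 4_000, Tier.HPC: 32_000, Tier.CLOUD: 128_000}


@dataclass
class Decision:
    tier: Tier
    summarize: bool  # True if the history must be compressed to fit the tier


def tier_for_complexity(score: float) -> Tier:
    """Map a judge score in [0, 1] to a tier; the thresholds are assumed."""
    if score < 0.4:
        return Tier.LOCAL
    if score < 0.8:
        return Tier.HPC
    return Tier.CLOUD


def route(score: float, context_tokens: int) -> Decision:
    """Choose the tier from complexity only, then flag summarization when
    the conversation history would not fit that tier's context window."""
    tier = tier_for_complexity(score)
    return Decision(tier, summarize=context_tokens > CONTEXT_LIMIT[tier])


if __name__ == "__main__":
    # A simple query at the end of a long conversation stays local.
    print(route(score=0.1, context_tokens=50_000))
```

In this sketch, a simple query with a 50,000-token history is still routed to the local tier, with `summarize` set so the history can be compressed first.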

Llama 3.2 3B achieves 85.1% free-tier retention on a 1,200-query benchmark spanning ten domains. Measured TTFT: 0.26 s local, 0.54 s HPC (relay), 1.68 s cloud.
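
The reported figures can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted above; the variable names are illustrative, and the query count assumes that "free-tier retention" means the share of the 1,200 benchmark queries that stay on the no-cost tiers.

```python
# TTFT values in seconds, as quoted in the summary.
batch_ttft_s = 11.40
relay_ttft_s = 0.54

# Speedup of relay streaming over batch mode.
speedup = batch_ttft_s / relay_ttft_s
print(f"HPC relay speedup over batch mode: {speedup:.1f}x")

# Approximate number of benchmark queries retained on free tiers
# (assumes retention is a simple fraction of the 1,200 queries).
retained = round(0.851 * 1200)
print(f"Queries retained on free tiers: about {retained}")
```

The ratio works out to roughly 21.1, matching the stated improvement, and 85.1% of 1,200 queries is about 1,021.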

Blogger's Review: The multi-tier architecture and innovative dual-channel streaming design of STREAM significantly enhance the inference efficiency of large language models, especially in complex scenarios that span local and cloud environments. This flexibility not only helps researchers better utilize resources but also effectively protects sensitive data, showcasing the potential of modern computing architectures in the AI domain.

Original Source: https://arxiv.org/abs/2606.13968
