[CS.DS] Space-Efficient Language Generation Theory

We initiate a resource-aware theory of language generation in the limit, focusing on the minimal constraint of space efficiency. In our framework, a learner observes an adversarial positive stream from a target language $K$ and must eventually output a hallucination-free hypothesis language $L \neq K$, while omitting at most $\triangle$ strings of $K$. We focus on $\text{C}_{s,k}$, the collection of languages recognized by DFAs with at most $s$ states over an alphabet of size $k$, as the natural hypothesis class for memory-bounded learners.

In the exponential-space regime, we prove that a learner can exactly identify the target $K$. Under a stricter memory budget, we characterize the strongest possible generation guarantees. In particular, we present a streaming algorithm using $\text{poly}(s,k)$ space that converges to a hypothesis with generation gap $\triangle = O(k^{2s-2})$. Moreover, the learned hypothesis captures every string in $K$ of length at least $2s-1$. We complement this result with a near-matching lower bound through a reduction from a standard communication complexity problem. Specifically, achieving generation gap $\triangle \leq k^{(1-\text{ε})s}$ requires $k^{\text{Ω(εs)}}$ memory. Together, these results reveal a sharp transition between polynomial-space generation and exponential-space exact identification.

Blogger's Review: This study provides a novel perspective on language generation, especially under memory constraints, showcasing how learners can effectively identify target languages in resource-limited environments. It not only offers new theoretical foundations for computer science but also insights for algorithm design in practical applications.