Property graphs are increasingly adopted as database frameworks for representing heterogeneous data sources. To enable precise access to the information contained within them, we need conversational interfaces based on Text-To-Cypher (Text2Cypher) parsers. This paper presents an automatic synthetic data generation method that can be leveraged to fine-tune small LLMs for this task. We conduct experiments on all major Text-To-Cypher benchmarks, demonstrating that our synthetic data generation approach can significantly enhance the performance of small LLMs, allowing them to compete with much larger proprietary models. This means that in settings where models must be deployed locally, we can ensure data sovereignty without sacrificing accuracy or incurring costly annotation campaigns.
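To make the task concrete, here is a minimal sketch of what a synthetic Text2Cypher training pair could look like. It is not the paper's generation method, which the abstract does not describe. It fills a single hand-written template from a toy graph schema. The schema, the `Text2CypherExample` type, and the `make_examples` helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text2CypherExample:
    """One (natural-language question, Cypher query) training pair."""
    question: str
    cypher: str


# Hypothetical toy schema, assumed for illustration only:
# (source label, relationship type, target label, target property)
TOY_SCHEMA = [
    ("Person", "ACTED_IN", "Movie", "title"),
    ("Person", "DIRECTED", "Movie", "title"),
]


def make_examples(schema, values):
    """Fill one fixed template per schema edge and per known property value.

    `values` maps a target label to sample property values. Values are
    inserted verbatim, so this sketch assumes they contain no quotes.
    """
    examples = []
    for src, rel, dst, prop in schema:
        for value in values.get(dst, []):
            verb = rel.replace("_", " ").lower()
            question = (
                f"Which {src.lower()} nodes are linked by '{verb}' "
                f"to the {dst.lower()} whose {prop} is '{value}'?"
            )
            cypher = (
                f"MATCH (s:{src})-[:{rel}]->(t:{dst} {{{prop}: '{value}'}}) "
                "RETURN s"
            )
            examples.append(Text2CypherExample(question, cypher))
    return examples


if __name__ == "__main__":
    for ex in make_examples(TOY_SCHEMA, {"Movie": ["The Matrix"]}):
        print(ex.question)
        print(ex.cypher)
```

Pairs of this shape are what a small LLM would be fine-tuned on: the question is the input and the Cypher query is the target. A realistic pipeline would need far more varied questions and query patterns than a single template provides.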
Blogger's Review: This paper shows how synthetic training data can lift small LLMs on Text2Cypher parsing, which matters most where data cannot leave the premises. By replacing costly annotation campaigns with automatically generated examples, the proposed method lets small, locally deployed models compete with much larger proprietary ones, making it a practical option for teams that need data sovereignty.