Synthetic but Not Infinite: How Much LLM-Generated Data to Use in Market Research
Large Language Models (LLMs) have transformed data generation by enabling scalable synthetic data augmentation. This paper studies a fundamental question in market research: how much synthetic data should be used? In the existing literature, the proportion of LLM-generated synthetic data is usually treated as a hyperparameter, tuned by ad hoc experimentation. In addition, a more important yet unexplored question is how the optimal amount of synthetic data should vary across different population segments, e.g., between younger and older respondents.
In this work, we address these challenges by deriving a closed-form expression for the heterogeneous hyperparameter vector that governs the proportion of synthetic data. This prescriptive characterization arises from our analysis of a logit choice model that integrates real and synthetic data, under which we establish a finite-sample complexity bound on the estimation error of the resulting hybrid estimator. Specifically, our closed-form expression gives the heterogeneous proportion of synthetic data that minimizes this upper bound. We further demonstrate the strong empirical performance of the closed-form expression using both synthetic data and a real-world vaccine preference dataset. Overall, our results suggest that the closed-form expression serves as a practical and reliable rule of thumb for determining how much LLM-generated synthetic data to incorporate into market research studies.
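To make the setup concrete, here is a minimal sketch, not the paper's estimator or its closed-form expression, of how a hybrid logit fit could weight real and LLM-generated synthetic choice data with a segment-specific proportion vector. The function name hybrid_logit_fit, the binary-logit simplification, and the particular weighting scheme are illustrative assumptions rather than the paper's formulation.

```python
# Illustrative sketch only: a hybrid binary-logit fit that mixes real and
# LLM-generated synthetic choice data via a per-segment weight vector alpha.
# The weighting scheme and names are assumptions made for illustration;
# they are not the paper's estimator or its closed-form expression.
import numpy as np
from scipy.optimize import minimize

def hybrid_logit_fit(X_real, y_real, seg_real, X_syn, y_syn, seg_syn, alpha):
    """Fit logit coefficients on pooled real + synthetic data.

    alpha[s] in [0, 1] is the weight placed on synthetic observations from
    population segment s (e.g., younger vs. older respondents); real
    observations from segment s receive the complementary weight 1 - alpha[s].
    """
    w_real = 1.0 - alpha[seg_real]
    w_syn = alpha[seg_syn]

    def neg_log_lik(beta):
        # Weighted binary-logit negative log-likelihood over both datasets:
        # log(1 + exp(x'beta)) - y * x'beta, summed with observation weights.
        nll_r = w_real * (np.logaddexp(0.0, X_real @ beta) - y_real * (X_real @ beta))
        nll_s = w_syn * (np.logaddexp(0.0, X_syn @ beta) - y_syn * (X_syn @ beta))
        return nll_r.sum() + nll_s.sum()

    beta0 = np.zeros(X_real.shape[1])
    return minimize(neg_log_lik, beta0, method="BFGS").x

# Toy usage with placeholder data: two segments, three covariates.
rng = np.random.default_rng(0)
X_r, X_s = rng.normal(size=(200, 3)), rng.normal(size=(800, 3))
y_r, y_s = rng.integers(0, 2, 200), rng.integers(0, 2, 800)
seg_r, seg_s = rng.integers(0, 2, 200), rng.integers(0, 2, 800)
alpha = np.array([0.3, 0.6])  # hypothetical segment-specific proportions
beta_hat = hybrid_logit_fit(X_r, y_r, seg_r, X_s, y_s, seg_s, alpha)
```

In this sketch, alpha plays the role of the heterogeneous hyperparameter vector: a larger alpha[s] leans more heavily on synthetic responses for segment s. The paper's contribution is a closed-form prescription for choosing this vector, rather than tuning it by ad hoc experimentation.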
This talk is based on a working paper with Qichuan (Ethan) Yin, a second-year Ph.D. student in statistics at the University of Chicago.
Bio: Linwei Xin joined the School of Operations Research and Information Engineering at Cornell as an associate professor in July 2025. Prior to Cornell, he was an associate professor of operations management at the University of Chicago Booth School of Business. He specializes in inventory and supply chain management, designing cutting-edge models and algorithms that help organizations balance supply and demand effectively under uncertainty in a variety of contexts.
Xin's research applying asymptotic analysis to stochastic inventory theory has been recognized with several prestigious INFORMS paper competition awards, including first place in the George E. Nicholson Student Paper Competition in 2015 and the Applied Probability Society Best Publication Award in 2019.

His recent interests focus on AI for supply chains, driven by labor shortages, reshoring trends, global supply chain disruptions, and e-commerce growth. He leverages tools such as neural networks, VC theory, applied probability, online optimization and learning, and random graph theory to address emerging challenges arising from AI-driven automation. His work targets problems in inventory management, robotics and automation in modern warehousing, dual sourcing, real-time order fulfillment, omnichannel, and transportation network design. His research on implementing state-of-the-art multi-agent deep reinforcement learning techniques in Alibaba's inventory replenishment system was selected as a finalist for the INFORMS 2022 Daniel H. Wagner Prize, achieving an algorithm-adoption rate of over 65% within Alibaba's supermarket brand Tmall Mart. His research on designing dispatching algorithms for robots in JD.com's intelligent warehouses was recognized as a finalist for the INFORMS 2021 Franz Edelman Award, with estimated annual savings in the hundreds of millions of dollars.
Xin currently serves as an associate editor for Operations Research, Management Science, Manufacturing & Service Operations Management, and Naval Research Logistics.