Can we use large language models as a mechanism for quantitative knowledge retrieval to aid data analysis tasks? A guest post by Kai Spriestersbach.
In data science, researchers often face the challenge of working with incomplete data sets. Many established algorithms simply cannot process incomplete data. Traditionally, data scientists have turned to domain experts to fill in the gaps, a process that is time-consuming and not always practical.
But what if a machine could take over this expert role?
Our research group has focused on this question and investigated whether large language models (LLMs) can act as digital experts. These models, trained on a huge amount of text, potentially have a deep understanding of diverse topics, from medical data to social science issues.
By comparing the LLMs' answers with real data and established statistical methods for dealing with data gaps, we have gained exciting insights. Our results show that in many cases, LLMs can provide similarly accurate estimates as traditional methods without relying on human experts.
Two methods in data analysis
When analyzing data, whether in medicine, economics, or environmental research, one often encounters the problem of incomplete information. Two key techniques are used: prior elicitation (the determination of prior knowledge) and data imputation (the supplementation of missing data).
Prior elicitation refers to the systematic collection of existing expert knowledge to make assumptions about certain parameters in our models.
Data imputation, on the other hand, comes into play when information is missing from our data sets. Rather than discarding valuable data sets because of a few gaps, scientists use statistical methods to fill those gaps with plausible values.
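To make the second idea concrete, here is a minimal sketch of the simplest classical approach, mean and mode imputation with scikit-learn. The small data frame and its columns are invented purely for illustration and are not taken from the study.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with gaps: one continuous and one categorical feature (invented example)
df = pd.DataFrame({
    "age": [34, np.nan, 51, 28, np.nan],
    "smoker": ["no", "yes", np.nan, "no", "yes"],
})

# Mean imputation for the continuous feature, mode (most frequent) for the categorical one
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()
df["smoker"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["smoker"]]).ravel()

print(df)  # gaps filled with the column mean / the most frequent category
```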
Data imputation with LLMs
In the first part of the research project, we asked whether large language models (LLMs) can replace human experts in practice, and how the information from LLMs compares to traditional data imputation methods.
Our study focused on the widest possible range of data from the OpenML-CC18 Curated Classification Benchmark, which includes 72 classification datasets from domains ranging from credit rating to medicine and marketing. This diversity ensured that our experiments covered a wide range of real-world scenarios and provided relevant insights into the performance of LLMs in different contexts.
A key step in our methodology was to artificially generate missing values in the datasets to simulate a situation where data points are incomplete and would normally be consulted with experts. We generated this missing data using the Missing at Random (MAR) pattern from complete entries to allow comparison to ground truth.
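The exact masking procedure is described in the paper; purely as an illustration of the MAR idea, a sketch like the following makes the probability that a value in one column is missing depend on the observed values of another column. The column names and missingness rate are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def make_mar(df: pd.DataFrame, target: str, driver: str, max_rate: float = 0.4) -> pd.DataFrame:
    """Mask `target` with a probability that grows with the observed `driver` column (MAR)."""
    out = df.copy()
    ranks = out[driver].rank(pct=True)   # percentile ranks in (0, 1], based on observed values
    p_missing = max_rate * ranks         # higher driver value -> more likely to be masked
    mask = rng.random(len(out)) < p_missing
    out.loc[mask, target] = np.nan
    return out

# Example: blood pressure is more often missing for older patients (hypothetical columns)
complete = pd.DataFrame({"age": rng.integers(20, 80, 200),
                         "blood_pressure": rng.normal(120, 15, 200)})
with_gaps = make_mar(complete, target="blood_pressure", driver="age")
```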
We first generated an appropriate expert role for each dataset from the OpenML description, which we then used to initialize the LLM so that it could be queried for missing values.
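The paper contains the exact prompts; the following is only a hypothetical sketch of the two-step idea, with made-up template wording: first derive an expert persona from the dataset description, then ask that persona for a missing value.

```python
# Hypothetical illustration of the two-step prompting idea (not the paper's exact prompts)

ROLE_TEMPLATE = (
    "The following is the description of a dataset:\n{description}\n"
    "In one sentence, describe the human expert best suited to estimate missing values here."
)

IMPUTE_TEMPLATE = (
    "You are {expert_role}.\n"
    "Given this record with a missing field, give your best single estimate for '{column}'. "
    "Record: {record}\nAnswer with the value only."
)

def build_impute_prompt(expert_role: str, column: str, record: dict) -> str:
    """Fill the imputation template for one incomplete record."""
    return IMPUTE_TEMPLATE.format(expert_role=expert_role, column=column, record=record)
```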
We used several LLMs for imputation, including LLaMA 2 13B Chat, LLaMA 2 70B Chat, Mistral 7B Instruct, and Mixtral 8x7B Instruct, each of which was evaluated separately. These models were compared with three empirical approaches commonly used in such analyses: mean and mode imputation for continuous and categorical features respectively, k-Nearest Neighbors (k-NN) imputation, and Random Forest imputation. Imputation quality was assessed with the Normalized Root Mean Square Error (NRMSE) for continuous features and the F1 score for categorical features.
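As a rough sketch of these baselines and the continuous-feature metric: the scikit-learn estimators below are standard choices for the three approaches, and the NRMSE here is normalized by the range of the true values, which is one common convention rather than necessarily the paper's exact definition.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# The three empirical baselines (continuous case shown; mode imputation handles categoricals)
mean_imputer = SimpleImputer(strategy="mean")
knn_imputer = KNNImputer(n_neighbors=5)
rf_imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=100, random_state=0))

def nrmse(y_true: np.ndarray, y_imputed: np.ndarray) -> float:
    """RMSE normalized by the range of the true values (one common convention)."""
    rmse = np.sqrt(np.mean((y_true - y_imputed) ** 2))
    return rmse / (y_true.max() - y_true.min())

# Categorical features are scored with the F1 score instead, e.g.:
# sklearn.metrics.f1_score(true_labels, imputed_labels, average="macro")
```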
This methodological approach allowed us not only to investigate the ability of LLMs to act as experts in data imputation but also to compare their performance with traditional methods. This innovative methodology opens new perspectives in dealing with incomplete datasets and highlights the potential of LLMs in data science.
Comparison with traditional methods: Insights from LLM-based data imputation
Contrary to expectations, our analysis showed that the imputation quality of LLMs does not generally exceed that of the three empirical methods. Nevertheless, our results indicate that LLM-based imputation can be useful for certain datasets, especially in the engineering and computer vision domains. Some datasets, such as 'pc1', 'pc3', and 'satimage' in these domains, showed imputation quality with an NRMSE of around 0.1, and similar results were observed in the biology and NLP domains.
Interestingly, the downstream performance of LLM-based imputation varied by domain. While some domains such as social sciences and psychology performed worse, medicine, economics, business, and biology performed better. Notably, LLM-based imputation performed best in the business domain.
Our results suggest that, at least in some domains, LLMs can provide accurate and relevant estimates based on their rich training data that can match real-world data.
These nuanced results illustrate that the use of LLMs for data imputation is promising, but that it requires careful consideration of the domain and the specific use case. The results of our research thus contribute to a better understanding of the potential and limitations of LLMs in data science, and point to the need to use this technology in a targeted manner with a deep understanding of its strengths and weaknesses.
Prior elicitation with LLMs
In the second part of the project, we investigated prior elicitation with large language models. Our experiment aimed to evaluate whether LLMs can provide information about the distribution of features and what implications this has for data collection and subsequent data analysis. In particular, we wanted to understand the influence and effectiveness of prior distributions obtained by LLMs and to compare how well they perform with traditional approaches and models.
We compared the estimates of LLMs with those from an experiment by Stefan et al. (2022) in which six psychology researchers were asked about typical small to medium effect sizes and Pearson correlations in their respective fields.
Using similar questions, the LLMs were prompted to simulate a single expert, a group of experts, or a non-expert, and were then queried for prior distributions. This was done both with and without reference to the interview protocol used in the comparison experiment.
To do this, we first had to develop a specific methodology for eliciting expert knowledge from the models in areas where direct quantitative statements are limited by built-in safety precautions. Typical instruction-tuned or chat models often refuse to provide quantitative information on sensitive topics such as health conditions because of these safeguards.
To circumvent these limitations, we applied a novel prompting strategy in which we asked the models to provide expert-informed prior distributions for Bayesian data analysis. Instead of asking for specific means or standard deviations, we asked the models to formulate their responses as Stan-style pseudocode sampling statements, such as y ∼ normal(120, 10), to indicate, for example, a distribution for the typical systolic blood pressure of a randomly selected individual.
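A hypothetical sketch of such a prompt, with invented wording rather than the paper's exact text, might look like this; asking for a distribution instead of a point estimate sidesteps many refusals.

```python
# Hypothetical sketch of the Stan-pseudocode prompting idea (not the paper's exact wording)

ELICIT_TEMPLATE = (
    "You are an experienced {domain} researcher acting as an expert for a Bayesian analysis.\n"
    "Express your prior belief about {quantity} as a single Stan sampling statement, "
    "for example `y ~ normal(mu, sigma);`. Reply with the statement only."
)

prompt = ELICIT_TEMPLATE.format(
    domain="clinical",
    quantity="the systolic blood pressure (mmHg) of a randomly selected adult",
)
# An answer such as `y ~ normal(120, 10);` can then be parsed into a prior for the model.
```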
In doing so, ChatGPT 3.5 demonstrated its familiarity with academic elicitation frameworks, such as the Sheffield Elicitation Framework combined with the histogram method, which we used to generate a prior distribution for the typical daily temperature and precipitation in 25 small and large cities around the world for December.
ChatGPT used its knowledge gained from the training data to conduct a simulated expert discussion and construct a parametric probability distribution.
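As an illustration of the histogram idea only (the bin edges and probabilities below are invented, not elicited values from the study), one can turn histogram-style elicited probabilities into a parametric prior by fitting a distribution to the bin masses:

```python
import numpy as np
from scipy import stats, optimize

# Rough sketch: fit a normal prior to histogram-style elicited probabilities
bin_edges = np.array([-5, 0, 5, 10, 15])        # December daily mean temperature, degrees C
bin_probs = np.array([0.10, 0.30, 0.40, 0.20])  # elicited probability mass per bin (invented)

def loss(params):
    mu, log_sigma = params
    cdf = stats.norm.cdf(bin_edges, loc=mu, scale=np.exp(log_sigma))
    fitted = np.diff(cdf)                        # probability mass the candidate prior puts in each bin
    return np.sum((fitted - bin_probs) ** 2)

mu, log_sigma = optimize.minimize(loss, x0=[5.0, np.log(5.0)]).x
print(f"fitted prior: normal({mu:.1f}, {np.exp(log_sigma):.1f})")
```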
In analyzing the experiment, we checked how "concentrated" or "broad" these AI-generated distributions are compared to real data. We wanted to find out how many real data points we would need to confirm or refute the AI's predictions, which helped us understand how reliable AI-based priors are compared to those obtained with traditional methods.
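One simple way to quantify this, shown here only as a sketch with invented numbers and under the assumption of a normal-normal model with known observation noise, is to translate the width of an elicited prior into an equivalent number of real observations:

```python
# For a normal-normal model with known observation noise sigma_obs, a prior with
# standard deviation sigma_prior carries as much weight as
# n_eff = sigma_obs**2 / sigma_prior**2 real observations (all values invented).

sigma_obs = 4.0  # assumed day-to-day spread of December temperatures (degrees C)

priors = {
    "broad LLM prior":        {"mu": 5.0, "sigma": 6.0},
    "concentrated LLM prior": {"mu": 5.0, "sigma": 1.0},
}

for name, p in priors.items():
    n_eff = sigma_obs ** 2 / p["sigma"] ** 2
    print(f"{name}: worth about {n_eff:.1f} real observations")
# A very confident (narrow) prior needs many more real data points to be overruled.
```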
Results of the prior elicitation experiments
To our surprise, we found that the assigned expert role in different subdomains had no noticeable effect on the priors generated by the LLMs. In our experiments, their judgments remained quite similar no matter what role they took: most of the artificial experts tended to make cautious predictions, suggesting small effects - except for one, GPT-4, which was bolder and suggested moderately strong effects.
When it came to the relationship between two things - for example, how much the weather affects our shopping behavior - the digital assistants had their own, unexpected views that differed from those of real people. Some showed us a kind of "bathtub" curve that was low in the middle and high at the edges, while GPT-4 showed us a smoother, bell-shaped curve.
We then looked at how confident these digital experts were in their predictions. Some were quite cautious and offered conservative estimates, except for Mistral 7B Instruct, which was extremely confident in the quality of its estimates.
Interestingly, the beta priors for Pearson correlations provided by the LLMs had little in common with those of real experts. GPT-4 provided a symmetric unimodal distribution, while other models provided a right-skewed "bathtub" distribution.
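For illustration only (the shape parameters below are invented and are not the elicited values), the two shapes can be compared by rescaling beta distributions to the correlation scale:

```python
import numpy as np
from scipy import stats

# A beta prior on [0, 1] rescaled to the correlation scale [-1, 1]. Shape parameters
# below 1 give the "bathtub" form; larger, equal parameters give a symmetric bell shape.
grid = np.linspace(-0.99, 0.99, 5)
u = (grid + 1) / 2  # map correlation r in [-1, 1] to u in [0, 1]

bathtub = stats.beta(0.5, 0.5).pdf(u) / 2  # /2 accounts for the change of variables
bell = stats.beta(5.0, 5.0).pdf(u) / 2

for r, d1, d2 in zip(grid, bathtub, bell):
    print(f"r={r:+.2f}  bathtub density={d1:.2f}  bell density={d2:.2f}")
```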
In our meteorological task, we measured how many real weather observations would be needed to make more accurate predictions than the artificial experts. This helped us understand whether it is better to rely on our digital assistants or traditional weather models to predict tomorrow's weather.
In summary, these results also show that LLMs are quite capable of generating priors that are competitive with human expert judgments in some aspects, but significantly different in others. The ability of LLMs to substitute for human experts in determining prior distributions varies depending on the specific task and the model chosen.
Conclusion
The ability of LLMs to synthesize knowledge from a variety of sources and apply it in specific application contexts opens up new horizons for data analysis. Especially in scenarios where experts are hard to find or their time is precious, LLMs could be a valuable resource.
Our research suggests that in fields such as medicine, economics, and biology, LLMs can already provide valuable insights comparable to those from traditional data imputation methods. Similarly, the value of prior knowledge provided by large language models may be high in certain scenarios compared to traditional methods, taking into account cost and accuracy. The use of LLMs for prior elicitation may therefore be a cost-effective alternative in some cases.
In conclusion, our research represents an important step towards the integration of large language models in data science. The prospects are promising, and with further advances in technology and methodology, we may be on the cusp of a new era of data analysis in which LLMs play a central role.
Kai Spriestersbach is a researcher at DFKI and co-author of the paper "Quantitative knowledge retrieval from large language models". The Data Science and its Applications (DSA) group, led by Prof. Sebastian Vollmer, is a research department founded in 2021 at the German Research Center for Artificial Intelligence (DFKI). It is dedicated to problems and questions from data science and recognized early on the potential that recent breakthroughs in large language models (LLMs) offer for data analysis and interpretation.