Along with context length constraints, attending to prompt quality is the other vital consideration for getting good outputs from an LLM. Dr Mark Carrigan, in a 2023 blog article, highlights an issue facing not just academics new to using LLMs but people generally:
“I have noticed a tendency for critical scholars to share examples of how their prompts elicited an underwhelming or superficial reaction from ChatGPT. The uniform feature of these examples was that little thought had gone into the prompt: they were extremely brief, failed to define the context of the request, provided no sense of the result they were expecting and certainly did not provide examples.”
There’s an interesting contradiction here: those who are disappointed with the quality of outputs implicitly assume the model is vastly more powerful than it is, since they expect outstanding answers from a lazily constructed prompt. Every single word in a prompt affects the response. The length, style, tone and diction you choose all have a significant effect. All the versatility and nuance of the English language is at the user's disposal to improve output quality, and it is worth taking full advantage of it. Since clear, precise writing is an essential academic skill, researchers are in principle well placed to reap the benefits of generative AI. Detailed, clear and explicit prompts, together with experimentation and refinement, pay substantial dividends.
Here are the better-established tips on effective prompting (update September 2024: the new OpenAI o1 models, which incorporate chain-of-thought reasoning out of the box, need detailed 'prompt engineering' less, but most principles still apply, particularly providing context and being explicit):
Regarding the final example, here's a diagram showing how results improve simply from adding urgency to the prompt, from:
Zero-Shot, One-Shot, and Few-Shot Prompting
LLMs learn from examples, and as a result they typically produce better outputs when given examples of the desired output alongside the instructions than when given instructions alone. A prompt containing instructions only (the most common way people use LLMs) is known as 'zero-shot' prompting, and relies entirely on correlations in the model's training data.
One-shot prompting involves the user providing a single example of the desired output format or content alongside the instructions, e.g.:
"You are a concise summariser of sociological theories for a textbook glossary. I will be giving you examples of theories and you should summarise them faithfully in no more than a single sentence. For example, if I were to ask for a summary of functionalism, you would response with "A theory that views society as a complex system of interconnected parts working together to maintain stability". Now please provide summaries for the following: 1. Conflict Theory; 2. etc."
Few-shot prompting extends this by providing multiple examples, which may be helpful for comparative or nuanced language tasks, for example to help with thematic coding of qualitative data:
"Quote 1: 'I feel anxious when I have to speak in public.' Code: Social anxiety;
Quote 2: 'I prefer working alone because I can focus better.' Code: Introversion
Quote 3: 'I often find myself worrying about things that might go wrong in the future'. Code: Generalised anxiety."
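If you are working through an API rather than a chat interface, the same few-shot idea can be expressed by supplying the example pairs as prior conversational turns. Below is a minimal sketch, assuming the OpenAI Python SDK, an API key in the environment and an illustrative model name; the new quote to be coded is purely hypothetical.

```python
# A minimal few-shot sketch using the OpenAI Python SDK (model name illustrative).
# Each example quote/code pair is supplied as a prior user/assistant turn, so the
# model sees the desired output format before coding the new quote.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

examples = [
    ("I feel anxious when I have to speak in public.", "Social anxiety"),
    ("I prefer working alone because I can focus better.", "Introversion"),
    ("I often find myself worrying about things that might go wrong in the future.", "Generalised anxiety"),
]

messages = [{"role": "system",
             "content": "You assign a single thematic code to each interview quote."}]
for quote, code in examples:
    messages.append({"role": "user", "content": f"Quote: '{quote}'"})
    messages.append({"role": "assistant", "content": f"Code: {code}"})

# The new, uncoded quote goes last (hypothetical example).
messages.append({"role": "user",
                 "content": "Quote: 'I double-check everything several times before submitting it.'"})

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)  # e.g. "Code: Perfectionism"
```

Presenting the examples as separate user and assistant turns, rather than packing them into one long instruction, tends to make the expected output format unambiguous.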
The best advice, however, is to experiment, note the patterns where outputs fail to meet expectations, and apply those lessons to future prompts. Building personal prompt templates for common tasks is also useful. For example, below is a template that incorporates reflective thinking through multiple personas; it is designed to produce a higher quality result each time because the model is given the opportunity to reflect, critique and adapt, and thereby 'think before it speaks'.
Example of a personal template prompt designed for higher quality outputs
"Agent 1 Prompt: You are an expert in academic research. Based on the requested output in [prompt_detail], provide an accurate, grounded response based firmly in the text provided in [provided_text], to identify practical research project ideas emerging from the paper. Important: You must provide direct quotes from the paper to support your suggestions, so it's easy to verify.
After Agent 1's response comes in:
Agent 2 Prompt: You are an expert in academic research who is obsessed with accuracy and evidence for any claims made. Your task is to evaluate the accuracy of the output provided by another LLM agent from the provided academic text. Assess how well the output aligns with the paper's content and whether the direct quotes provided accurately support the claims.
Initial Prompt: [prompt1_text]
Paper Text: [provided_text]
First LLM Response: [agent1_text]
Evaluate the accuracy of the research project ideas and the relevance of the direct quotes provided. Highlight any discrepancies or areas where the response could be improved.
After Agent 2's response comes in:
Agent 3 Prompt: You are an expert in academic research. Your task is to provide detailed suggestions on how to improve the accuracy and overall quality of the output from another LLM to a user's prompt. Review the initial prompt, the paper text, the first LLM response, and the accuracy evaluation from the second LLM response. Provide constructive feedback on how to enhance the accuracy, clarity, and value of the response.
Initial Prompt: [prompt1_text]
Paper Text: [provided_text]
First LLM Response: [agent1_text]
Accuracy Evaluation: [agent2_text]
Provide detailed suggestions on improving both the accuracy and overall quality of the output.
After Agent 3's response comes in:
Agent 4 Prompt: You are an expert in academic research. Your task is to re-answer the initial prompt, taking into account all previous responses and evaluations. Use the initial prompt, the provided paper text, the first LLM response, the accuracy evaluation, and the quality improvement suggestions from the other LLM responses, in order to create the best quality and most well-informed answer possible.
Initial Prompt: [prompt1_text]
Paper Text: [provided_text]
First LLM Response: [agent1_text]
Accuracy Evaluation: [agent2_text]
Quality Improvement Suggestions: [agent3_text]
Based on the provided information, re-answer the initial prompt to extract practical research project ideas from the academic paper, ensuring the highest level of accuracy, clarity, and value."
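For those comfortable scripting, a template like this can be run as a simple pipeline, with each agent's response fed into the next prompt automatically. The sketch below assumes the OpenAI Python SDK; the model name, file name and abbreviated prompt wording are all illustrative stand-ins for the full template above.

```python
# A sketch of the four-agent template as a simple pipeline, assuming the OpenAI
# Python SDK; the model name and file name are illustrative, and the prompt
# wording is abbreviated from the full template above.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send one prompt and return the model's text response."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

provided_text = open("paper.txt").read()  # the academic paper
prompt1_text = ("Identify practical research project ideas emerging from this paper, "
                "supporting each with direct quotes.\n\nPaper text:\n" + provided_text)

agent1_text = ask(prompt1_text)  # Agent 1: initial grounded answer
agent2_text = ask("Evaluate the accuracy of the response below against the paper, "
                  "checking that the quotes support the claims.\n\n"
                  f"Paper text:\n{provided_text}\n\nFirst response:\n{agent1_text}")
agent3_text = ask("Suggest concrete improvements to the accuracy, clarity and value "
                  "of the response below.\n\n"
                  f"Paper text:\n{provided_text}\n\nFirst response:\n{agent1_text}\n\n"
                  f"Accuracy evaluation:\n{agent2_text}")
final_answer = ask("Re-answer the initial prompt using all of the material below.\n\n"
                   f"Initial prompt:\n{prompt1_text}\n\nFirst response:\n{agent1_text}\n\n"
                   f"Accuracy evaluation:\n{agent2_text}\n\nSuggestions:\n{agent3_text}")
print(final_answer)
```

Note that each stage re-sends the paper text, so this trades extra context length (and cost) for the chance to reflect and self-correct.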
Hallucinations
As of summer 2024, generative AI tools, even when connected to browsers, search engines and databases, are not reliable enough to depend on, and everything needs to be verified independently. In many cases this is fine if you just need a starting point for your own investigation. In others, the verification takes so long that you are better off doing the task yourself. Searching for, identifying and evaluating information is a vast, complex, iterative and rigorous process that current LLMs cannot yet do well, certainly not in any reasonable timeframe or without a lot of steering.
Why do LLMs ‘hallucinate’? People's surprise that LLMs output errors stems from not understanding what an LLM actually does, combined with the fact that it is so often correct, which understandably leads novices to assume it is somehow ‘looking up information’. Pure LLMs are not machines that produce deterministic outputs based on data. In the simplest terms, they do nothing more than predict words (tokens, really) based on probabilities of word associations in the trillions of words of training data, and by default they mix things up so that the highest-probability word is not always chosen next (this is the ‘temperature’ setting available in the API, sometimes described as how ‘creative’ outputs are). Novel artificially generated content cannot exist without the possibility of hallucinations; the key thing to remember is that LLMs should not be thought of as fact-finding oracles. They produce plausible-looking text, so tasks that need to be accurate require additional verification. That (mostly) word-association probabilities produce correct information so often is astounding in itself, but also dangerous if people assume the model is actually finding information in some sense. This assumption is common among students and newcomers to LLMs generally, who initially perceive a tool like ChatGPT as a search engine on steroids.
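For readers using the API directly, the ‘temperature’ setting mentioned above is just a request parameter. A minimal sketch, again assuming the OpenAI Python SDK and an illustrative model name:

```python
# A minimal sketch of the 'temperature' parameter via the OpenAI Python SDK
# (model name illustrative). Lower values keep sampling close to the single most
# probable continuation; higher values let less probable tokens through.
from openai import OpenAI

client = OpenAI()

for temperature in (0.0, 1.0):
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=temperature,
        messages=[{"role": "user",
                   "content": "Suggest a title for a paper on social media and loneliness."}],
    )
    print(f"temperature={temperature}: {response.choices[0].message.content}")
```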
Another important factor is that the top LLMs have been trained on public internet texts. Consider the landscape of Q&A style websites like Quora, Twitter, Reddit, Facebook, any specific discussion forum etc. How often do you see people putting in the effort to post “I don’t know”? Almost never. So the training data vastly overrepresents text from humans who are confident enough to answer publicly, but neither their confidence nor the content of their answer is a reliable predictor of accuracy. LLMs do extremely well in simulating that kind of confident answer, but you can never assume it’s correct without additional independent verification. A further concern is that, as expected for anything trained on human-produced data, underlying biases may come through any AI-generated ‘information’, so critical engagement is always needed.
Making verification simpler
Whenever it’s important to have the right information, it needs to be verified, because there is always the possibility of hallucination. But you can adapt the prompt to make the verification process simpler, or at least far less painful. Ultimately this amounts to instructing the LLM to ‘show its working’; the multi-agent example above is a good illustration, since the core of Agent 1's prompt was to provide direct quotes to back up its claims.
Here's an example where GPT-4 had to find matches or near matches to a set of keywords in any given text. Given the propensity to hallucinate, the instructions require two additional columns in the output beyond the matches themselves. Firstly, a direct quote is mandatory, which means the user can simply Ctrl-F the original text to check whether the classification was correct; it would be much harder to verify without one. Secondly, a reflection column is required, in which the LLM has to rationalise its decision to identify the match. This lets the user see its ‘reasoning’ at a glance, which, combined with the direct quote, makes it easy to check the original source and verify, amend or remove each match (a small sketch of automating the quote check follows the prompt below).
“Please read the following text and identify any words, phrases or sentences that match or relate fairly closely to the above core keywords. Please use your intelligent judgement to identify matches - close enough matches are totally fine if you find any examples of text that are more or less similar in meaning to any of the original keywords.
The outputs needed are:
1) a direct quote (sentence or phrase) from the text that surrounds the match or near match,
2) the matching keyword from the original key word list, and
3) Reflective notes – these should always be included, so if you find a perfect match, add a brief note indicating as such.
If you find a borderline or ambiguous ‘close enough’ match, please add a suitable note indicating the uncertainty. If you find nothing at all relating to any of the keywords, please also clarify this in the reflective note. It is preferable to include false positives, because I can easily delete irrelevant ones later after seeing your clear direct quote and notes, whereas I don’t want to have to go through every text to double-check in case you missed a borderline match. The goal is to identify every possible keyword match, even if only vaguely related, and to show your working with quotes and brief reflective notes, so it is as easy as possible for me to look up and verify your outputs."
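Once the output is in this quote-plus-notes format, the Ctrl-F step itself can be scripted. The sketch below is a minimal illustration in plain Python: it assumes the model's table has already been parsed into (quote, keyword, note) rows, and the function name and transcript file are hypothetical.

```python
# A minimal sketch of automating the Ctrl-F check on the 'direct quote' column.
# Assumes the model's output has already been parsed into (quote, keyword, note)
# rows; the function name and file name are hypothetical.
def verify_quotes(rows: list[tuple[str, str, str]], source_text: str) -> None:
    """Flag any row whose quoted text cannot be found verbatim in the source."""
    normalised_source = " ".join(source_text.split()).lower()
    for quote, keyword, note in rows:
        normalised_quote = " ".join(quote.split()).lower()
        status = "FOUND" if normalised_quote in normalised_source else "NOT FOUND - check manually"
        print(f"[{status}] {keyword}: {quote[:60]!r} ({note})")

# Hypothetical usage with one parsed row:
rows = [("participants reported feeling isolated when working remotely",
         "isolation", "close match")]
verify_quotes(rows, open("interview_transcript.txt").read())
```

Anything flagged as not found is either a paraphrase or a hallucinated quote, and those are exactly the rows worth checking by hand first.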
While these steps mean more effort and use up more of the context length, if the task is one where accuracy matters and verification is therefore inevitable, they are worth it: verifying without this additional help usually takes more time and causes more frustration.