Prompt Engineering

Communicating with LLMs

Jef Packer
Apr 18, 2023 · 9 min read

The Rise of Prompt Engineering

In this post, we will explore current leading ideas and innovative approaches to prompt engineering that may help tackle some of the difficult problems of hallucination and alignment.

Note: I use GPT, LLM and model interchangeably when talking about language models such as GPT-4.

This post is intended for readers who already have some knowledge of how LLMs work and of prompt engineering in general. I won’t go into how LLMs are evaluated; instead, I’ll focus on prompting strategies for complex problems.

Importance of Prompt Engineering

Prompt engineering is the art of crafting input queries (or chains of queries) to guide large language models (LLMs) toward producing relevant and accurate responses.

In my view, prompt engineers serve as the liaison between raw AI models and consumers. They bridge the gap between foundation models and consumers by building useful tools on top of those models. As prompting improves, consumers will have access to better and easier-to-use tools.

An example from a parallel domain is AI image generation, which includes apps such as Lensa and Midjourney built on foundation models like Stable Diffusion. The successful apps are the ones that give consumers simple inputs for generating good images: “Upload 10 images and select from a dropdown.” Giving customers raw access to Stable Diffusion or a DreamBooth model in a Google Colab notebook can be overwhelming, and crafting your first prompt is difficult. The apps make the foundation models more approachable and easier to use.

Current leading ideas and papers in prompting

In this section, I’ll discuss the evolution of prompting by reviewing important publications that build upon one another.

Chain of Thought Prompting

One of the first steps toward better prompting was chain of thought prompting.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Figure 1

By thinking through the steps, the model can refer back to its intermediate reasoning when producing the final answer. This more than doubles the number of correct answers on some question-answering datasets!

Because GPT models generate one token at a time, inference can be broken at any point and resumed with exactly the same result (given greedy decoding with no temperature or randomness). Since we can break it anywhere, it’s natural to break it into a series of “thoughts”.

Therefore, using the same example as in the paper, the following two schemes of calling the LLM produce the same result. First, as a series of separate inferences:

Inference 1: A: The cafeteria had 23 apples originally.

Inference 2: They used 20 to make lunch.

Inference 3: So they had 23–20 = 3.

Inference 4: They bought 6 more apples,

Inference 5: so they have 3 + 6 = 9.

Inference 6: The answer is 9.

As a single inference:

Inference 1: A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23–20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9.
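To make that concrete, here’s a minimal sketch of the equivalence, assuming the pre-1.0 openai Python client, a completions-style model, and greedy decoding (temperature=0); the model name and token budgets are just illustrative choices, and the paper’s few-shot exemplars are omitted for brevity.

import openai

# The apples question from the paper (few-shot chain-of-thought exemplars omitted).
PROMPT = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?\n"
    "A:"
)

def complete(prompt: str, max_tokens: int) -> str:
    # One greedy completion; temperature=0 makes the continuation deterministic.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0,
        max_tokens=max_tokens,
    )
    return response["choices"][0]["text"]

# Single inference: generate the whole chain of thought in one long call.
single = complete(PROMPT, max_tokens=96)

# Split inference: several short calls, each continuing from the text generated so far.
chained = ""
for _ in range(6):
    chained += complete(PROMPT + chained, max_tokens=16)

# With greedy decoding, the two strategies should produce the same chain of thought.
overlap = min(len(single), len(chained))
print(single[:overlap] == chained[:overlap])

Either way, each intermediate “thought” ends up in the context that the model attends to when producing the final answer.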

Chain-of-thought prompting can be drawn as shown below, which becomes a building block for complex approaches.

Chain-of-Thought Flowchart

Toolformer

Toolformer introduced “API calls” to LLMs. By training the model on examples of API calls, it learns when and how to call external tools, and it can pull in new information to better inform its answers.

Toolformer: Language Models Can Teach Themselves to Use Tools — Figure 3

These tools include question answering, Wikipedia search, calendar, machine translation, and calculator.

Tool Use Flowchart

This was hugely important because now LLMs could access new information at inference time. There’s no longer a need to bake all of that specific information directly into the model’s weights.
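The paper’s inline format looks like [Calculator(400 / 1400) → 0.29]. As a rough sketch of the mechanics (my own toy parsing code, not the paper’s implementation), a wrapper can detect an API call in the generated text, run the tool, and splice the result back in so it becomes context for the rest of the generation.

import re

def run_calculator(expression: str) -> str:
    # Toy calculator tool: only digits, arithmetic operators, and parentheses allowed.
    if not re.fullmatch(r"[0-9+\-*/ .()]+", expression):
        raise ValueError("unsupported expression")
    return str(eval(expression))

def fill_tool_calls(text: str) -> str:
    # Find Toolformer-style calls like [Calculator(23 - 20)] and splice in the result.
    pattern = re.compile(r"\[Calculator\(([^)]+)\)\]")
    return pattern.sub(
        lambda m: f"[Calculator({m.group(1)}) -> {run_calculator(m.group(1))}]", text
    )

generation = "They had [Calculator(23 - 20)] apples, then [Calculator(3 + 6)] after buying more."
print(fill_tool_calls(generation))
# They had [Calculator(23 - 20) -> 3] apples, then [Calculator(3 + 6) -> 9] after buying more.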

ReAct

ReAct combines the previous two approaches. It was also one of the first iterative approaches to problem-solving (what are now called agents). It uses tools and chain-of-thought prompting in a loop in order to answer questions.

ReAct: Synergizing Reasoning and Acting in Language Models — Figure 1

Being able to think through the steps, select an action, and then determine whether more information is needed to answer the question allows much more flexibility.

This overcomes problems such as a tool call not yielding the desired result, or needing to search for several things before answering.
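Here’s a minimal sketch of that loop, assuming a hypothetical call_llm wrapper (pre-1.0 openai client) and a stand-in search tool; the paper itself uses a Wikipedia API and more careful few-shot prompting.

import openai

def call_llm(prompt: str) -> str:
    # Hypothetical wrapper around a chat model (pre-1.0 openai client assumed).
    out = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return out["choices"][0]["message"]["content"]

def search(query: str) -> str:
    # Stand-in for a real search tool (the paper uses a Wikipedia API).
    return f"(search results for: {query})"

TOOLS = {"Search": search}

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Ask for the next Thought plus an Action: Search[query] or Finish[answer].
        step = call_llm(
            "Answer the question by interleaving Thought and Action lines.\n"
            "Valid actions: Search[query] or Finish[answer]. Continue:\n" + transcript
        )
        transcript += step + "\n"
        if "Finish[" in step:
            return step.split("Finish[", 1)[1].split("]", 1)[0]
        for name, tool in TOOLS.items():
            if f"{name}[" in step:
                argument = step.split(f"{name}[", 1)[1].split("]", 1)[0]
                transcript += f"Observation: {tool(argument)}\n"
    return "no answer within the step budget"

print(react("Which country hosted the 2016 Summer Olympics?"))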

ReAct Flowchart

Reflexion

If we take a step back and treat ReAct like a single round of optimization, Reflexion acts as a hyperparameter optimizer. After answering a question, the model gets a chance to reflect on how things could’ve gone better. The model can then use this reflection to get even better results.

Reflexion: an autonomous agent with dynamic memory and self-reflection — Figure 1

Reflexion brings in the ability to store reflections (the “hyperparameters” in this analogy) about any question-answering process (such as ReAct) and use them to do better in the future.

Reflexion Flowchart

These reflections can be stored for later use and potentially transferred to other prompts, as long as they concern the same task. The authors also note that this strategy can be applied on top of any question-answering scheme, and they already see its potential at a larger scale.
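A rough sketch of that loop might look like the following, with a hypothetical call_llm wrapper and a plain Python list standing in for the paper’s reflection memory.

import openai

def call_llm(prompt: str) -> str:
    # Hypothetical wrapper around a chat model (pre-1.0 openai client assumed).
    out = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return out["choices"][0]["message"]["content"]

reflections: list[str] = []  # long-lived memory, carried across attempts

def attempt(task: str) -> str:
    # Prior reflections are prepended so the agent can avoid repeating mistakes.
    memory = "\n".join(f"Lesson: {lesson}" for lesson in reflections)
    return call_llm(f"{memory}\nTask: {task}\nAnswer:")

def reflect(task: str, transcript: str, feedback: str) -> None:
    # After an unsuccessful attempt, ask the model what it should do differently.
    lesson = call_llm(
        f"Task: {task}\nAttempt: {transcript}\nFeedback: {feedback}\n"
        "In one sentence, what should you do differently next time?"
    )
    reflections.append(lesson)

task = "Write a one-line Python function named reverse() that reverses a string."
first = attempt(task)
reflect(task, first, "The function name did not match the requested name.")
second = attempt(task)  # now informed by the stored reflection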

Auto-GPT

Auto-GPT takes the ReAct paradigm to the next level.

Auto-GPT uses the same iterative approach, scales up the number of tools, and adds tricks to handle memory and the limited context window.

New tools include writing to output files, running Python, spinning up a secondary agent to do research for the primary agent, scraping and summarizing websites (using a summarization GPT call), git commands, Google search, and many more. Further tools can be added to an ever-expanding list, which continues to increase the system’s capabilities. It’s important to note that these tools mostly use LLMs themselves to do the work. For instance, the tool that writes unit tests calls GPT with a function definition and a description of the tests to write.

Here’s an example of how an LLM can generate code. The prompt specifies the function signature, arguments, and a description, and the LLM fills in the behavior. This example is adapted from the AI-Functions GitHub repository.

import ai_functions

# The function is described by its signature, its arguments, and a short description;
# the LLM fills in the behavior and returns only the result.
function_string = "def fake_people(n: int) -> list[dict]:"
args = ["4"]
description_string = """Generates n examples of fake data representing people, each with a name and an age."""
model = "gpt-4"  # any chat-capable model name accepted by the library

result = ai_functions.ai_function(function_string, args, description_string, model)

""" Output: [
{"name": "John Doe", "age": 35},
{"name": "Jane Smith", "age": 28},
{"name": "Alice Johnson", "age": 42},
{"name": "Bob Brown", "age": 23}
]"""

Memory is also an interesting problem for long-term agents. As they continue to run, they aren’t able to fit everything into the limited context window. To overcome this, results are stored in an external memory, such as a vector database. That memory can be accessed by both the primary agent and any secondary agents that have been created to do tasks such as searching the internet or writing code.
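As a rough sketch of the idea (not Auto-GPT’s actual implementation; the embedding model and the in-memory store here are my own assumptions), results can be embedded and then retrieved by similarity whenever they are needed again.

import numpy as np
import openai

def embed(text: str) -> np.ndarray:
    # Assumed embedding call (pre-1.0 openai client); Auto-GPT supports several backends.
    out = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
    return np.array(out["data"][0]["embedding"])

memory: list[tuple[str, np.ndarray]] = []

def remember(text: str) -> None:
    # Store a result (e.g. a summarized web page) outside the context window.
    memory.append((text, embed(text)))

def recall(query: str, k: int = 3) -> list[str]:
    # Retrieve the k stored results most similar to the query (cosine similarity).
    q = embed(query)
    scored = sorted(
        memory,
        key=lambda item: float(np.dot(item[1], q) / (np.linalg.norm(item[1]) * np.linalg.norm(q))),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]

remember("Flight BA117 to New York costs $540 on May 3.")
print(recall("cheapest flight to New York"))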

Auto-GPT Flowchart

This design extends GPT’s capability to do iterative tasks. For instance, writing, testing, iterating on, and committing code to GitHub is now possible. It can also search the web to accomplish tasks like booking flights by comparing websites.

New ideas and frontiers

Now that there’s context for what current prompting looks like, I’ll discuss some of my own ideas that build on these methods.

I’m mainly considering the problem of how to better collaborate with the agent and how to ensure interests are aligned between these agents and their human operators.

Considering the Human

Right now the “human” can be used as a tool, but is typically only consulted when the model gets stuck. Instead, the prompt and the human-tool description could be rewritten to treat the human as a collaborator. That way, when the model makes big decisions (such as buying something or double-checking that a document outline looks good), the human can be included.

Furthermore, human preferences could be learned. Just as Reflexion asks the model to reflect on past interactions, the model could also reflect on ways to better collaborate with specific humans.

Human Preferences Flowchart

Perhaps one user never gives feedback, and another always likes content written in a certain way. The model can then anticipate each user’s requests and pre-emptively accommodate them. It’s possible to have task-specific and user-specific reflections that can be combined for very flexible and knowledgeable pre-prompting, as sketched below.
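A minimal sketch of this kind of pre-prompting, with hypothetical reflection stores keyed by task and by user:

# Hypothetical pre-prompting sketch: reflections are keyed by task and by user,
# then combined into a prefix that is placed before the actual request.
task_reflections = {
    "email": ["Keep a one-sentence summary at the top."],
}
user_reflections = {
    "alice": ["Prefers bullet points over paragraphs.", "Never wants exclamation marks."],
}

def build_preprompt(task: str, user: str) -> str:
    lessons = task_reflections.get(task, []) + user_reflections.get(user, [])
    return "".join(f"Remember: {lesson}\n" for lesson in lessons)

prompt = build_preprompt("email", "alice") + "Draft an email announcing the Q3 roadmap."
print(prompt)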

Agents as Markov Decision Processes

Looking closer at these agents, it can be seen that they resemble Markov decision processes (MDPs). The (huge) state space is the chain of prior thoughts and actions, and the action space is the set of possible tool uses. When comparing agents to a traditional MDP from reinforcement learning, the resemblance is clear.

The missing piece is the reward from the environment. Creating a reward is tricky; in fact, RL has a well-known reward-shaping problem in which ill-defined rewards can lead to the model taking undesirable shortcuts.

Instead, would it be possible to ask the model to rate itself? We could theoretically pick a few dimensions and see what the model says about the action taken and whether it achieves the desired result. These self-ratings could then be placed into a reflection module so that the model could learn how to better interact with its tools.

Reward Reflection Flowchart

Reward dimensions could include utility/benefit for the next thought/action, alignment with the user, and safety. Saving these as reward reflections could ultimately help guide the agent in generating better content and using tools more productively.
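As a sketch of what self-rating could look like (the prompt, the dimensions, and the JSON format are all assumptions on my part), the agent rates each thought/action step and stores the scores as reward reflections.

import json

import openai

def call_llm(prompt: str) -> str:
    # Hypothetical wrapper around a chat model (pre-1.0 openai client assumed).
    out = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return out["choices"][0]["message"]["content"]

def self_rate(thought: str, action: str, observation: str) -> dict:
    # Ask the model to score its own step on a few reward dimensions (0-10 each).
    # Assumes the model complies with the JSON-only instruction.
    raw = call_llm(
        "Rate the step below on utility, alignment with the user, and safety.\n"
        f"Thought: {thought}\nAction: {action}\nObservation: {observation}\n"
        'Reply with JSON only, e.g. {"utility": 7, "alignment": 9, "safety": 10}.'
    )
    return json.loads(raw)

reward_reflections = []
reward_reflections.append(self_rate(
    "I should check the user's calendar before booking anything.",
    "Calendar[next week]",
    "The user is free on Tuesday and Thursday.",
))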

Branching Ideas

When humans make decisions, we like to compare options and proceed with the best one overall. Using the MDP structure for agents, it is now possible to run ideation as a tree search!

Tree Search Flowchart

The tree search would have the benefit of considering the multi-step effects of taking actions. By comparing the rewards summed from the root to each leaf, the agent could select the best overall action, as in the sketch below.
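A minimal sketch of such a search, with stand-in propose and score functions so the control flow is easy to follow (a real agent would back both with model calls, for example using the self-rating scheme above):

def propose(state: str) -> list[str]:
    # Stand-in: return a few candidate next thoughts/actions for this state.
    return [f"{state} -> option {i}" for i in range(2)]

def score(state: str) -> float:
    # Stand-in reward; a real agent might use model-generated self-ratings here.
    return float(len(state) % 7)

def best_branch(state: str, depth: int) -> tuple[float, list[str]]:
    # Return the highest summed reward reachable from `state`, and the path taken.
    if depth == 0:
        return 0.0, []
    best_total, best_path = float("-inf"), []
    for candidate in propose(state):
        reward = score(candidate)
        sub_total, sub_path = best_branch(candidate, depth - 1)
        if reward + sub_total > best_total:
            best_total, best_path = reward + sub_total, [candidate] + sub_path
    return best_total, best_path

total, path = best_branch("root question", depth=3)
print(total, path[0])  # the agent executes only the first step, then searches again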

One problem is that tree searches involve heavy branching, and calling real tools at each node would be very expensive. It would be helpful to have a lightweight approximation function for the tool calls: something that gives an estimate of what the agent could expect the tool to return. This would save a great deal of compute and time when selecting actions.

An example configuration for the chat setting would be the following:

Thoughts: Possible things to say.
Tool: Reply as if you are the user.
Reward: Is the user happy, and do they benefit from the interaction?

With this framework, the agent would consider the user’s potential reactions, whether they benefit, and their emotions. This kind of consideration would help align agents with humans. The agent would have context on whether its actions (chat replies in this case) are harmful when deciding what to say. It could potentially avoid topics or responses that lead to dangerous follow-ups. This search could also predict follow-up questions and pre-emptively answer them.

An example of an agent that can code:

Thoughts: Action list of things to do.
Tools: Write code, look up things online, run code, etc.
Reward: Is your output harmless, and does it accomplish the goal?

This example is interesting. I see benefits in making sure all the generated code is useful downstream, and that sections of the code are written in a way that can be tested. This approach can also help avoid problems on the host machine; I’m specifically thinking of setting up a container before installing dependencies when starting a new project.

With this strategy, a fast approximation tool function should be used.

Tool Approximation with Search Flowchart

This will reduce compute and time when searching over actions. It will also prevent the real tool from making mistakes (or causing side effects) while searching; the search is used entirely for “thinking.”
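A small sketch of that idea, with a hypothetical approximation that asks a cheaper model to predict a tool’s output instead of executing anything during the search:

import subprocess
import sys

import openai

def call_llm(prompt: str) -> str:
    # Hypothetical wrapper around a cheaper chat model (pre-1.0 openai client assumed).
    out = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return out["choices"][0]["message"]["content"]

def run_code(source: str) -> str:
    # The real tool: actually executes the code (slow, and has side effects).
    proc = subprocess.run([sys.executable, "-c", source],
                          capture_output=True, text=True, timeout=10)
    return proc.stdout

def approx_run_code(source: str) -> str:
    # Cheap approximation used while searching: predict the output, execute nothing.
    return call_llm(f"Predict the stdout of this Python program. Output only:\n{source}")

def code_tool(searching: bool):
    # During search, use the approximation; once an action is chosen, use the real tool.
    return approx_run_code if searching else run_code

print(code_tool(searching=True)("print(2 + 2)"))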

Summary

In this post, I summarized findings from important prompt engineering papers and discussed new ideas in the field. If you enjoyed it or want to chat please DM me through my website! Thanks for reading!

References

Relevant Papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023)
ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022)
Reflexion: an autonomous agent with dynamic memory and self-reflection (Shinn et al., 2023)
