What's next for AI agentic workflows featuring Andrew Ng of AI Fund
Recently, Andrew Ng gave a talk on AI agents at Sequoia’s AI Ascent event.
Each individual AI product today has its limitations, but Andrew shows that orchestrating multiple agents in a well-designed agentic workflow can get performance out of existing models comparable to what we may see in GPT-5/Claude 4/Gemini 2.0.
He cited 8 papers in his presentation. I couldn't find his slides online, but I tracked down the 8 papers and pulled out key points from each below.
Hopefully this is a helpful starting point for exploring any of the papers in more depth. I would also be interested to see Andrew give this talk again a quarter or two from now, to see how much progress we've made.
I’m personally excited about multi-agent collaboration, because after all, we humans were able to achieve our modern civilization through collaboration.
Reflection
Self-Refine: Iterative Refinement with Self-Feedback, Madaan et al. (2023)
The main idea is to generate an initial output using an LLM; then, the same LLM provides feedback for its output and uses it to refine itself, iteratively.
Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test-time using our simple, standalone approach.
SELF-REFINE operates within a single LLM, requiring neither additional training data nor reinforcement learning.
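To make the loop concrete, here is a minimal sketch of the generate → critique → refine cycle running inside a single model. Everything here is an illustrative assumption rather than the paper's code: the `llm()` helper wraps the OpenAI Python client, and the prompts and stopping rule are simplified.

```python
from openai import OpenAI  # assumes the official `openai` Python client (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm(prompt: str) -> str:
    """Single-turn call to a chat model; the model name is an assumption."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def self_refine(task: str, max_iters: int = 3) -> str:
    # Initial generation.
    answer = llm(f"Task: {task}\nWrite your best answer.")
    for _ in range(max_iters):
        # The same model critiques its own output...
        feedback = llm(
            f"Task: {task}\nAnswer:\n{answer}\n\n"
            "Give concrete feedback, or reply DONE if no changes are needed."
        )
        if "DONE" in feedback:
            break
        # ...and refines the answer using that feedback.
        answer = llm(
            f"Task: {task}\nDraft:\n{answer}\nFeedback:\n{feedback}\n\n"
            "Rewrite the draft so it addresses every point of feedback."
        )
    return answer
```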
Reflexion: Language Agents with Verbal Reinforcement Learning, Shinn et al. (2023)
We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback.
Reflexion converts binary or scalar feedback from the environment into verbal feedback in the form of a textual summary, which is then added as additional context for the LLM agent in the next episode.
This is akin to how humans iteratively learn to accomplish complex tasks in a few-shot manner – by reflecting on their previous failures in order to form an improved plan of attack for the next attempt.
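A rough sketch of the Reflexion loop, reusing the hypothetical `llm()` helper from the Self-Refine sketch above. The `evaluate()` function is a stand-in for whatever environment signal the task provides (unit tests, a game score, etc.); the point is the verbal memory carried across episodes.

```python
def evaluate(attempt: str) -> tuple[bool, str]:
    """Stand-in for the task environment (unit tests, game score, ...).
    Returns (success, feedback); wire this up to your actual task."""
    return False, "no evaluator wired up"  # placeholder result

def reflexion(task: str, max_trials: int = 4) -> str:
    reflections: list[str] = []  # verbal memory carried across episodes
    attempt = ""
    for _ in range(max_trials):
        # Attempt the task with all prior self-reflections in context.
        lessons = "\n".join(reflections) or "(none yet)"
        attempt = llm(
            f"Task: {task}\nLessons from past attempts:\n{lessons}\nSolve the task."
        )
        success, env_feedback = evaluate(attempt)
        if success:
            break
        # Convert the sparse binary/scalar signal into verbal feedback
        # that the next episode can condition on.
        reflections.append(llm(
            f"Task: {task}\nFailed attempt:\n{attempt}\nResult: {env_feedback}\n"
            "In two or three sentences, explain what went wrong and how to do better."
        ))
    return attempt
```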
Tool use
Gorilla: Large Language Model Connected with Massive APIs, Patil et al. (2023)
We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls.
By empowering LLMs to use tools [33], we can grant access to vastly larger and changing knowledge bases and accomplish complex computational tasks.
We construct APIBench, a large corpus of APIs with complex and often overlapping functionality, by scraping ML APIs (models) from public model hubs.
The finetuned model’s performance surpasses prompting the state-of-the-art LLM (GPT-4) in three massive datasets we collected.
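Gorilla itself is a finetuned model, but the retriever-aware inference pattern it studies can be sketched with any chat model: ground the generation in retrieved, current API documentation. `retrieve_api_docs()` is a hypothetical stand-in, and `llm()` is the helper from the Self-Refine sketch.

```python
def retrieve_api_docs(query: str, k: int = 3) -> list[str]:
    """Hypothetical retriever over an APIBench-style corpus of API docs."""
    return []  # plug in BM25, embeddings, or similar retrieval here

def generate_api_call(instruction: str) -> str:
    docs = "\n---\n".join(retrieve_api_docs(instruction))
    # Grounding the generation in retrieved, up-to-date documentation is
    # what lets the call track a changing API surface rather than stale
    # training data.
    return llm(
        f"API documentation:\n{docs}\n\nUser request: {instruction}\n"
        "Respond with exactly one API call that fulfills the request."
    )
```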
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, Yang et al. (2023)
We present MM-REACT, a system paradigm that composes numerous vision experts with ChatGPT for multimodal reasoning and action.
MM-REACT presents a simple and flexible way to empower LLMs with a pool of vision experts.
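A toy version of the dispatch loop, again reusing the hypothetical `llm()` helper: the LLM either answers or names a vision expert, whose output is folded back into the context. The expert registry and the `FINAL:`/`ACTION:` markers are simplified assumptions, not MM-REACT's actual prompt protocol.

```python
# Hypothetical pool of vision experts keyed by name; in MM-REACT these are
# services such as OCR, captioning, and object detection.
VISION_EXPERTS = {
    "ocr": lambda image: "<OCR text of the image>",      # placeholder expert
    "caption": lambda image: "<caption for the image>",  # placeholder expert
}

def mm_react(question: str, image_path: str, max_steps: int = 5) -> str:
    history = f"Image: {image_path}\nQuestion: {question}"
    step = ""
    for _ in range(max_steps):
        step = llm(
            f"{history}\nAnswer as 'FINAL: <answer>', or request help as "
            f"'ACTION: <expert>' where <expert> is one of {list(VISION_EXPERTS)}."
        )
        if step.startswith("FINAL:"):
            return step.removeprefix("FINAL:").strip()
        expert = step.removeprefix("ACTION:").strip()
        # Run the requested expert and fold its output back into the context.
        history += f"\nObservation from {expert}: {VISION_EXPERTS[expert](image_path)}"
    return step
```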
Planning
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al. (2022)
However, scaling up model size alone has not proved sufficient for achieving high performance on challenging tasks such as arithmetic, commonsense, and symbolic reasoning (Rae et al., 2021).
We present empirical evaluations on arithmetic, commonsense, and symbolic reasoning benchmarks, showing that chain-of-thought prompting outperforms standard prompting, sometimes to a striking degree.
For many reasoning tasks where standard prompting has a flat scaling curve, chain-of-thought prompting leads to dramatically increasing scaling curves.
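Chain-of-thought is purely a prompting change, so the "code" is just the prompt. The worked example below follows the paper's own tennis-ball exemplar, with `llm()` as the hypothetical helper from the Self-Refine sketch.

```python
# A few-shot chain-of-thought prompt in the style of the paper's exemplars:
# the example answer spells out its reasoning steps, which elicits
# step-by-step reasoning (and usually a better answer) for the new question.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought
6 more, how many apples do they have?
A:"""

print(llm(COT_PROMPT))  # the model is expected to reason before answering
```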
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, Shen et al. (2023)
By leveraging the strong language capability of ChatGPT and the abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains, achieving impressive results in language, vision, speech, and other challenging tasks. This paves a new way towards the realization of artificial general intelligence.
The mechanism of HuggingGPT allows it to address tasks in any modality or any domain by organizing cooperation among models through the LLM.
By providing only the model descriptions, HuggingGPT can continuously and conveniently integrate diverse expert models from AI communities, without altering any structure or prompt settings.
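HuggingGPT's pipeline has four stages: task planning, model selection, task execution, and response generation. A skeletal sketch under the same `llm()` assumption as above; `select_model()` and `run_model()` are hypothetical stand-ins for matching against Hugging Face model descriptions and calling inference endpoints.

```python
import json

def select_model(subtask: dict) -> str:
    """Hypothetical: match the subtask against Hugging Face model cards."""
    return "<model-id chosen from hub descriptions>"

def run_model(model_id: str, task_input: str) -> str:
    """Hypothetical: run the selected model via an inference endpoint."""
    return f"<output of {model_id}>"

def hugging_gpt(user_request: str) -> str:
    # Stage 1: task planning -- the LLM decomposes the request into subtasks.
    # (Assumes the model returns clean JSON; real systems validate this.)
    plan = llm(
        "Decompose this request into a JSON list of subtasks, each with "
        f'"task" and "input" fields: {user_request}'
    )
    results = []
    for subtask in json.loads(plan):
        model_id = select_model(subtask)  # Stage 2: model selection
        results.append(run_model(model_id, subtask["input"]))  # Stage 3: execution
    # Stage 4: response generation -- the LLM weaves results into one answer.
    return llm(
        f"Request: {user_request}\nSubtask results: {results}\nRespond to the user."
    )
```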
Multi-agent collaboration
Communicative Agents for Software Development, Qian et al. (2023)
The instrumental analysis of ChatDev highlights its remarkable efficacy in software generation, enabling the completion of the entire software development process in under seven minutes at a cost of less than one dollar.
At each phase, ChatDev recruits multiple "software agents" with different roles, such as programmers, reviewers, and testers.
Within the chat chain, each node represents a specific subtask, and two roles engage in context-aware, multi-turn discussions to propose and validate solutions.
As a result, this technology is best suited for open and creative software production scenarios where variations are acceptable.
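One node of such a chat chain can be sketched as two role-conditioned calls to the same model taking turns until the instructing role approves. The specific prompts and approval token below are illustrative assumptions, with `llm()` as before.

```python
def chat_node(instructor: str, assistant: str, subtask: str, max_turns: int = 4) -> str:
    """One node of the chain: two roles discuss a subtask until approval."""
    instruction = llm(f"You are the {instructor}. Give instructions for: {subtask}")
    solution = "(no solution yet)"
    for _ in range(max_turns):
        # The assistant role proposes; the instructor role validates.
        solution = llm(
            f"You are the {assistant}. Instruction:\n{instruction}\n"
            f"Current solution:\n{solution}\nPropose an improved solution."
        )
        instruction = llm(
            f"You are the {instructor}. Review this solution:\n{solution}\n"
            "Reply APPROVED, or give concrete revision instructions."
        )
        if "APPROVED" in instruction:
            break
    return solution

# A chat chain is then just these nodes in sequence, e.g. design by one
# role pair, coding by the next, then review and testing:
# design = chat_node("CTO", "programmer", "choose language and modules for: " + task)
# code   = chat_node("programmer", "reviewer", "implement the design:\n" + design)
```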
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations, Wu et al. (2023)
Given the expanding tasks that could benefit from LLMs and the growing task complexity, an intuitive approach to scale up the power of agents is to use multiple agents that cooperate.
In practice, applications of varying complexities may need distinct sets of agents with specific capabilities, and may require different conversation patterns, such as single- or multi-turn dialogs, different human involvement modes, and static vs. dynamic conversation.
When configured properly, an agent can hold multiple turns of conversations with other agents autonomously or solicit human inputs at certain rounds, enabling human agency and automation.
The adoption of AutoGen has resulted in improved performance (over state-of-the-art approaches), reduced development code, and decreased manual burden for existing applications.
As we further develop and refine AutoGen, we aim to investigate which strategies, such as agent topology and conversation patterns, lead to the most effective multi-agent conversations while optimizing the overall efficiency, among other factors.
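Since AutoGen ships as an actual library, here is a minimal two-agent setup using the `pyautogen` package (API as of the 0.2 releases contemporary with the paper; exact config fields may differ in later versions):

```python
import autogen  # pip install pyautogen

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]}  # placeholder key

# An LLM-backed worker and a proxy agent that can execute the code it writes.
assistant = autogen.AssistantAgent("assistant", llm_config=llm_config)
user_proxy = autogen.UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",  # fully autonomous; "ALWAYS" keeps a human in the loop
    code_execution_config={"work_dir": "scratch", "use_docker": False},
)

# The proxy starts a multi-turn conversation; the agents alternate turns,
# writing, running, and fixing code until the task is done.
user_proxy.initiate_chat(
    assistant, message="Write and run a script that prints the first 10 primes."
)
```

The `human_input_mode` flag is one concrete example of the "different human involvement modes" the paper describes: the same agent pair can run fully autonomously or pause for human input at chosen rounds.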