An LLM Compiler for Parallel Function Calling (text version for LLMs)
Abstract
The reasoning capabilities of recent LLMs
enable them to execute external function calls
to overcome their inherent limitations, such as
knowledge cutoffs, poor arithmetic skills, or lack
of access to private data. This development has
allowed LLMs to select and coordinate multiple
functions based on the context to tackle more
complex problems. However, current methods
for function calling often require sequential
reasoning and acting for each function which
can result in high latency, cost, and sometimes
inaccurate behavior. To address this, we introduce
LLMCompiler, which executes functions in parallel to efficiently orchestrate multiple function
calls. Drawing inspiration from the principles of
classical compilers, LLMCompiler enables parallel function calling with three components: (i) a
Function Calling Planner, formulating execution
plans for function calling; (ii) a Task Fetching
Unit, dispatching function calling tasks; and (iii)
an Executor, executing these tasks in parallel.
LLMCompiler automatically generates an optimized orchestration for the function calls and can
be used with both open-source and closed-source
models. We have benchmarked LLMCompiler
on a range of tasks with different patterns of
function calling. We observe consistent latency
speedup of up to 3.7×, cost savings of up
to 6.7×, and accuracy improvement of up to
∼9% compared to ReAct. Our code is available at
https://github.com/SqueezeAILab/LLMCompiler.
1. Introduction
Recent advances in the reasoning capability of Large Language Models (LLMs) have expanded the applicability of
LLMs beyond content generation to solving complex problems (Besta et al., 2023; Chen et al., 2023b; Gao et al., 2022; Kojima et al., 2023; Wang et al., 2023b; Wei et al., 2022;
Yang et al., 2022; Yao et al., 2023b; Zhou et al., 2023b);
and recent works have also shown how this reasoning capability can be helpful in improving accuracy for solving
complex and logical tasks. This reasoning capability has also
enabled function (i.e., tool) calling, where LLMs
can invoke provided functions and use the function outputs
to help complete their tasks. These functions range from a
simple calculator that can invoke arithmetic operations to
more complex LLM-based functions. The ability of LLMs to integrate various tools and function
calls could enable a fundamental shift in how we develop
LLM-based software. However, this brings up an important
challenge: what is the most effective approach to incorporate multiple function calls? A notable approach has been
introduced in ReAct (Yao et al., 2022), where the LLM calls
a function, analyzes the outcomes, and then reasons about
the next action, which involves a subsequent function call.
For a simple example illustrated in Fig. 1 (Left), where the
LLM is asked if Scott Derrickson and Ed Wood have the
same nationality, ReAct initially analyzes the query and decides to use a search tool to search for Scott Derrickson. The
result of this search (i.e., observation) is then concatenated
back to the original prompt for the LLM to reason about
the next action, which invokes another search tool to gather
information about Ed Wood.
ReAct has been a pioneering work in enabling function
calling, and it has been integrated into several frameworks (Langchain; Liu, 2022). However, scaling this approach for more complex applications requires considerable
optimizations. This is due to the sequential nature of ReAct, where it executes function calls and reasons about their
observations one after the other. This approach, along with
the agent systems that extend ReAct (Khot et al., 2023; Qin
et al., 2023; Ruan et al., 2023b; Sumers et al., 2023; Yao
et al., 2023b), may lead to inefficiencies in latency and cost,
due to the sequential function calling and repetitive LLM
invocations for each reasoning and action step. Furthermore,
while dynamic reasoning about the observations has benefits
in certain cases, concatenating the outcomes of intermediate function calls could disrupt the LLM’s execution flow,
potentially reducing accuracy (Xu et al., 2023). Common
failure cases include repetitive invocation of the same function, which is also highlighted in the original paper (Yao
et al., 2022), and early stopping based on the partial intermediate results, as will be further discussed in Sec. 5.1 and
Appendix A.
To address this challenge, we draw inspiration from classical compilers, where optimizing instruction executions
in traditional programming languages has been extensively
explored. A key optimization technique in compilers involves identifying instructions that can be executed in parallel and effectively managing their dependencies. Similarly,
one can envision a compiler, tailored for LLM function
calling, which can efficiently orchestrate various function
calls and their dependencies. This shares a similar philosophy with the recent studies that align LLMs with computer systems (Karpathy, 2023; Packer et al., 2023). To
this end, we introduce LLMCompiler, a novel framework
that enables parallel multi-tool execution of LLMs across
different models and workloads. To the best of our knowledge, LLMCompiler is the first framework to optimize
the orchestration of LLM function calling that can not only
improve latency and cost, but also accuracy, by minimizing
interference from the outputs of intermediate function calls.
In more detail, we make the following contributions:
• We introduce LLMCompiler, an LLM compiler that
optimizes the parallel function calling performance of
LLMs. At a high level, this is achieved by introducing
three key components: (i) a Function Calling Planner
(Sec. 3.1) that identifies an execution flow; (ii) a Task Fetching Unit (Sec. 3.2) that dispatches the function calls
in parallel; (iii) an Executor (Sec. 3.3) that executes the
dispatched tasks using the associated functions.
• We evaluate LLMCompiler on embarrassingly parallel
patterns using HotpotQA (Yang et al., 2018) and Movie
Recommendation (Srivastava et al., 2022), where we observe 1.80×/3.74× speedup and 3.37×/6.73× cost reduction compared to ReAct (Sec. 5.1).
• To test the performance on more complex patterns, we
introduce a new benchmark called ParallelQA, which includes various non-trivial function calling patterns. We
show up to 2.27× speedup, 4.65× cost reduction, and 9%
improved accuracy compared to ReAct (Sec. 5.2).
• We evaluate LLMCompiler’s capability in dynamic replanning, which is achieved through a feedback loop
from the Executor back to our Function Calling Planner.
For the Game of 24 (Yao et al., 2023b), which requires
repeated replanning based on the intermediate results,
LLMCompiler demonstrates a 2× speedup compared
to Tree-of-Thoughts (Sec. 5.3).
• We show that LLMCompiler can explore the interactive
decision-making environment effectively and efficiently.
On WebShop, LLMCompiler achieves up to 101.7×
speedup and 25.7% improved success rate compared to
the baselines (Sec. 5.4).
2. Related Work
2.1. Latency Optimization in LLMs
Various studies have focused on optimizing model design (Chen et al., 2023a; Dettmers et al., 2023; Frantar &
Alistarh, 2023; Frantar et al., 2022; Kim et al., 2023; 2024;
Kwon et al., 2022; Leviathan et al., 2023; Lin et al., 2023)
and systems (tgi; trt; Kwon et al., 2023; Yu et al., 2022) for
efficient LLM inference. Optimizations at the application
level, however, are less explored. This is critical from a practical point of view for situations involving black-box LLM
models and services where modifications to the models and
the underlying inference pipeline are highly restricted.
Skeleton-of-Thought (Ning et al., 2023) recently proposed
to reduce latency through application-level parallel decoding. This method involves a two-step process of an initial
skeleton generation phase, followed by parallel execution
of skeleton items. However, it is primarily designed for
embarrassingly parallel workloads and does not support
problems that have inherently interdependent tasks, as it
assumes no dependencies between skeleton tasks. This
limits its applicability in complex scenarios such as coding (Austin et al., 2021; Chen et al., 2021; Hendrycks et al.,
2021a; Madaan et al., 2023) or math (Hendrycks et al.,
2021b;c) problems, as also stated in the paper (Ning et al.,
2023). LLMCompiler addresses this by translating an
input query into a series of tasks with inter-dependencies,
thereby expanding the spectrum of problems it can handle.
Concurrently to our work, OpenAI has recently introduced
a parallel function calling feature in their 1106 release, enhancing user query processing through the simultaneous
generation of multiple function calls (OpenAI, 2023). Despite its potential for reducing LLM execution time, this
feature has certain limitations, as it is exclusively available for OpenAI’s proprietary models. However, there is
a growing demand for using open-source models driven
by the increasing number of open-source LLMs as well
as parameter-efficient training techniques (Houlsby et al.,
2019; Hu et al., 2022; Lester et al., 2021) for finetuning and
customization. LLMCompiler enables efficient parallel
function calling for open-source models, and also, as we
will show later in Sec. 5, it can potentially achieve better
latency and cost.
2.2. Plan and Solve Strategy
Several studies (Hao et al., 2023; Patel et al., 2022; Press
et al., 2023; Wolfson et al., 2020; Zhou et al., 2023b) have
explored prompting methods of breaking down complex
queries into various levels of detail to solve them, thereby
improving LLM’s performance in reasoning tasks. Specifically, Decomposed Prompting (Khot et al., 2023) tackles
complex tasks by decomposing them into simpler sub-tasks, each optimized through LLMs with dedicated prompts.
Step-Back Prompting (Zheng et al., 2023) enables LLMs to
abstract high-level concepts from details to enhance reasoning abilities across various tasks. Plan-and-Solve Prompting (Wang et al., 2023a) segments multi-step reasoning
tasks into subtasks to minimize errors and improve task
accuracy without manual prompting. However, these methods primarily focus on improving the accuracy of reasoning
benchmarks. In contrast, LLMCompiler uses a planner
to identify parallelizable patterns within queries, aiming to
reduce latency while maintaining accuracy.
In addition to the aforementioned works, TPTU (Ruan
et al., 2023a), HuggingGPT (Shen et al., 2023), and
ViperGPT (Surís et al., 2023) have introduced end-to-end
plan-and-solve frameworks. LLMCompiler sets itself
apart by providing a general framework that enables efficient and accurate function calling in a broader range of
problems. This stems from LLMCompiler’s capabilities
in (i) planning and replanning; (ii) parallel execution; and
(iii) addressing a wider range of problem domains, which
will be discussed in more detail in Appendix F.
Another notable work is ReWOO (Xu et al., 2023) which
employs a planner to separate the reasoning process from the
execution and observation phases to decrease token usage
and cost as compared to ReAct. Our approach is different
from ReWOO in multiple aspects. First, LLMCompiler
allows parallel function calling which can reduce latency
as well as cost. Second, LLMCompiler supports dynamic
replanning which is important for problems whose execution flow cannot be determined statically in the beginning
(Sec. 5.3).
2.3. Tool-Augmented LLMs
The enhanced reasoning capability of LLMs has enabled
them to invoke user-provided functions and use their outputs
to effectively complete tasks. Detailed exploration of this
subject is provided in Appendix C.1.
3. Methodology
To illustrate the components of LLMCompiler, we use a
simple 2-way parallel example in Fig. 2. To answer “How
much does Microsoft’s market cap need to increase to exceed Apple’s market cap?,” the LLM first needs to conduct
web searches for both companies’ market caps, followed by
a division operation. While the existing frameworks, including ReAct, perform these tasks sequentially, it is evident that
they can be executed in parallel. The key question is how to
automatically determine which tasks are parallelizable and
which are interdependent, so we can orchestrate the execution of the different tasks accordingly. LLMCompiler accomplishes this through a system that consists of the following three components: a Function Calling Planner (Sec. 3.1)
that generates a sequence of tasks and their dependencies; a
Task Fetching Unit (Sec. 3.2) that replaces arguments based
on intermediate results and fetches the tasks; and an Executor (Sec. 3.3) that executes the tasks with associated tools.
To use LLMCompiler, users are only required to provide
tool definitions, and optional in-context examples for the
Planner, as will be further discussed in Sec. 4.1.
3.1. Function Calling Planner
The Function Calling Planner is responsible for generating a
sequence of tasks to be executed along with any dependency
among them. For instance, Tasks $1 and $2 in Fig. 2 are
two independent searches that can be performed in parallel. However, Task $3 has a dependency on the outcomes
of the first and second searches. Therefore, the Planner’s
role is to automatically identify the necessary tasks, their
input arguments, as well as their inter-dependencies using
the sophisticated reasoning capability of LLMs, essentially
forming a directed acyclic graph of task dependencies. If
a task is dependent on a preceding task, it incorporates a
placeholder variable, such as $1 in Task 3 of Fig. 2, which
will later be substituted with the actual output from the
preceding task (Sec. 3.2).
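For instance, for the query in Fig. 2, the Planner might emit a plan along the following lines (an illustrative sketch in the plan format of Appendix G; the tool names search and math are assumed here):

1. search("Microsoft market cap")
2. search("Apple market cap")
3. math("how much does $1 need to increase to exceed $2?")
Thought: I can answer the question now.
4. finish()

Here, $1 and $2 in Task 3 will be substituted with the two search results before the math tool runs.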
The Planner in LLMCompiler leverages LLMs’ reasoning
capability to decompose tasks from natural language inputs.
To achieve this, the Planner LLM incorporates a pre-defined
prompt that guides it on how to create dependency graphs
and to ensure correct syntax (see Appendix H for details).
Besides this, users also need to supply tool definitions and
optional in-context examples for the Planner. These examples provide detailed demonstrations of task decomposition
specific to a problem, helping the Planner to better understand the rules. Further details on user-supplied information
for LLMCompiler are elaborated in Sec. 4.1. In Sec. 4.2,
we introduce an additional optimization for the Planner that streams tasks as soon as they are created, instead of waiting
to complete the entire planning process.
3.2. Task Fetching Unit
The Task Fetching Unit, inspired by the instruction fetching
units in modern computer architectures, fetches tasks to the
Executor as soon as they are ready for (parallel) execution
based on a greedy policy. Another key functionality is
to replace variables with the actual outputs from preceding
tasks, which were initially set as placeholders by the Planner.
For the example in Fig. 2, the variables $1 and $2 in Task $3
would be replaced with the actual market cap of Microsoft
and Apple. This can be implemented with a simple fetching
and queuing mechanism without a dedicated LLM.
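As a rough illustration of that mechanism, the sketch below (Python; the Task structure and helper names are our own, not the released implementation) extracts dependencies from the $-placeholders, selects ready tasks greedily, and substitutes placeholders with completed results:

import re

class Task:
    def __init__(self, idx, tool, args):
        self.idx, self.tool, self.args = idx, tool, args
        # Dependencies are the $k placeholders appearing in the argument string.
        self.deps = {int(m) for m in re.findall(r"\$(\d+)", args)}
        self.result = None

def ready_tasks(tasks, done):
    # Greedy policy: a task is ready once all of its dependencies have completed.
    return [t for t in tasks.values() if t.idx not in done and t.deps <= done]

def substitute(task, tasks):
    # Replace each $k placeholder with the actual output of the preceding task k.
    return re.sub(r"\$(\d+)", lambda m: str(tasks[int(m.group(1))].result), task.args)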
3.3. Executor
The Executor asynchronously executes tasks fetched from
the Task Fetching Unit. As the Task Fetching Unit guarantees that all the tasks dispatched to the Executor are independent, it can simply execute them concurrently. The
Executor is equipped with user-provided tools, and it delegates the task to the associated tool. These tools can be
simple functions like a calculator, Wikipedia search, or API
calls, or they can even be LLM agents that are tailored for a
specific task. As depicted in the Executor block of Fig. 2,
each task has dedicated memory to store its intermediate
outcomes, similar to what typical sequential frameworks
do when aggregating observations as a single prompt (Yao
et al., 2022). Upon completion of the task, the final results
are forwarded as input to the tasks dependent on them.
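Concretely, the Executor can be as simple as an event loop that runs every ready batch of independent tasks concurrently; the sketch below builds on the illustrative Task helpers above, assumes tools is a dictionary of async callables keyed by tool name, and is not the released implementation:

import asyncio

async def run_task(task, tools, tasks):
    # Delegate the task to its associated user-provided tool (an async callable).
    task.result = await tools[task.tool](substitute(task, tasks))

async def execute_all(tasks, tools):
    done = set()
    while len(done) < len(tasks):
        batch = ready_tasks(tasks, done)
        if not batch:          # guard against malformed (cyclic) plans
            break
        # All dispatched tasks are independent, so they can run concurrently.
        await asyncio.gather(*(run_task(t, tools, tasks) for t in batch))
        done |= {t.idx for t in batch}
    return tasks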
3.4. Dynamic Replanning
In various applications, the execution graph may need to
adapt based on intermediate results that are a priori unknown.
An analogy in programming is branching, where the path of execution is determined only during runtime, depending
on which branch conditions are satisfied. Such dynamic execution patterns can also appear with LLM function calling.
For simple branching (e.g., if-else statements) one could
statically compile the execution flow and choose the right branch
dynamically based on the intermediate results. However, for
more complex branching it may be better to do a recompilation or replanning based on the intermediate results.
When replanning, the intermediate results are sent back
from the Executor to the Function Calling Planner which
then generates a new set of tasks with their associated dependencies. These tasks are then sent to the Task Fetching
Unit and subsequently to the Executor. This cycle continues
until the desired final result is achieved and can be delivered
to the user. We show an example use case of this in Sec. 5.3
for solving the Game of 24 using the Tree-of-Thoughts
approach.
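Schematically, the replanning cycle can be pictured as the loop below; the planner.plan and join interfaces are assumptions made for illustration, and in LLMCompiler the decision of when to replan is tied to the Planner's finish mechanism described in Appendix H:

async def solve(query, planner, tools, max_rounds=5):
    observations = []                              # intermediate results fed back to the Planner
    answer = None
    for _ in range(max_rounds):
        tasks = planner.plan(query, observations)  # Function Calling Planner produces a new DAG
        await execute_all(tasks, tools)            # Task Fetching Unit + Executor run it in parallel
        answer, needs_replan = join(tasks)         # aggregate outputs; decide whether to replan
        if not needs_replan:
            return answer                          # desired final result reached
        observations.append(answer)                # otherwise, send results back and replan
    return answer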
4. LLMCompiler Details
4.1. User-Supplied Information
LLMCompiler requires two inputs from the user:
1. Tool Definitions: Users need to specify the tools that
LLMs can use, including their descriptions and argument
specifications. This is essentially the same requirement
as other frameworks like ReAct and OpenAI function
calling. An illustrative sketch of such a definition is given after this list.
2. In-context Examples for the Planner: Optionally,
users can provide LLMCompiler with examples of
how the Planner should behave. For instance, in the case
of Fig. 2, users may provide examples illustrating expected inter-task dependencies for certain queries. These
examples can help the Planner LLM understand how
to use various tools and generate the appropriate dependency graph for incoming inputs in the correct format.
In Appendix G, we include the examples that we used
in our evaluations.
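For illustration, a tool definition (item 1 above) amounts to little more than a callable plus a natural-language description that the Planner sees in its prompt; the structure below is an assumed sketch, not LLMCompiler's actual interface:

from dataclasses import dataclass
from typing import Awaitable, Callable

@dataclass
class Tool:
    name: str                              # name the Planner uses in plans, e.g. "search"
    description: str                       # description and argument spec shown to the Planner
    func: Callable[[str], Awaitable[str]]  # async callable invoked by the Executor

async def search(query: str) -> str:
    return f"(stub) top result for {query!r}"  # placeholder body for the sketch

search_tool = Tool(
    name="search",
    description="search(query): searches Wikipedia and returns the first paragraph of the top result.",
    func=search,
)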
4.2. Streamed Planner
The Planner may incur a non-trivial overhead for user
queries that involve many tasks, as it blocks the Task Fetching Unit and the Executor, which must wait for the Planner
output before initiating their processes. However, analogous
to instruction pipelining in modern computer systems, this
can be mitigated by enabling the Planner to asynchronously
stream the dependency graph, thereby allowing each task
to be immediately processed by the Executor as soon as its
dependencies are all resolved. In Table C.1, we present a
latency comparison of LLMCompiler with and without
the streaming mechanism across different benchmarks. The
results demonstrate consistent latency improvements with
streaming. Particularly, in the ParallelQA benchmark, the streaming feature leads to a latency gain of up to 1.3×. This
is attributed to the math tool’s longer execution time for
ParallelQA, which can effectively hide the Planner’s latency
in generating subsequent tasks, unlike the shorter execution
times of the search tool used in HotpotQA and Movie
Recommendation.
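The effect of streaming on the pipeline can be sketched as follows: rather than waiting for the complete plan, each task is scheduled the moment it arrives and starts as soon as its own dependencies finish. The async-generator interface and Task fields below are illustrative assumptions, not the released implementation:

import asyncio, re

async def execute_streamed(plan_stream, tools):
    tasks, running = {}, {}

    def substitute(args):
        # Replace $k placeholders with the outputs of the corresponding finished tasks.
        return re.sub(r"\$(\d+)", lambda m: str(tasks[int(m.group(1))].result), args)

    async def run(task):
        # A dependency always carries a smaller index, so it was scheduled earlier;
        # wait only for this task's own dependencies, then execute immediately.
        await asyncio.gather(*(running[d] for d in task.deps))
        task.result = await tools[task.tool](substitute(task.args))

    async for task in plan_stream:            # tasks arrive while the Planner is still generating
        tasks[task.idx] = task
        running[task.idx] = asyncio.create_task(run(task))

    await asyncio.gather(*running.values())   # wait for the whole graph to finish
    return tasks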
5. Results
In this section, we evaluate LLMCompiler using a variety
of models and problem types. We use both the proprietary
GPT models and the open-source LLaMA-2 model, with
the latter demonstrating LLMCompiler’s capability in enabling parallel function calling in open-source models. Furthermore, there are various types of parallel function calling
patterns that can be addressed with LLMs. This ranges
from embarrassingly parallel patterns, where all tasks can
be executed in parallel without any dependencies between
them, to more complex dependency patterns, as illustrated
in Fig. 3. Importantly, we also assess LLMCompiler
on the Game of 24 benchmark, which involves dynamic
replanning based on intermediate results, highlighting its
adaptability to dynamic dependency graphs. Finally, we
apply LLMCompiler to the WebShop benchmark to showcase its potential in decision-making tasks. Overall, we start by presenting results for simple execution patterns and then move to more complex ones.
5.1. Embarrassingly Parallel Function Calling
The simplest scenario involves an LLM using a tool repeatedly for independent tasks such as conducting parallel
searches or analyses to gather information on different topics, like the pattern depicted in Fig. 3 (a). While these tasks
are independent of each other and can be executed in parallel, ReAct, along with other LLM solutions as they stand,
would need to run sequentially. This leads to increased
latency and token consumption due to its frequent LLM invocations for each tool usage, as also illustrated in Fig. 1. In
this section, we demonstrate how LLMCompiler can identify parallelizable patterns and execute independent tasks
concurrently to resolve this issue. To do so, we use the
following two benchmarks:
• HotpotQA: A dataset that evaluates multi-hop reasoning (Yang et al., 2018). We only use the comparison dev
set. This contains 1.5k questions comparing two different
entities, thus exhibiting a 2-way embarrassingly parallel
execution pattern. An example question is shown in Fig. 1.
• Movie Recommendation: A dataset with 500 examples, each asking which of four candidate movies is most similar to another given set of four movies, exhibiting an 8-way embarrassingly parallel pattern (Srivastava et al., 2022).
Experimental Setups. As a baseline method, we compare LLMCompiler with ReAct. We follow the ReAct
setup (Yao et al., 2022) using the same Wikipedia search
tool that LLMs can use to search for information. We did
not include the lookup tool since it is not relevant to our
problem setting. We have optimized the prompt and in-context examples for both ReAct and LLMCompiler to
the best of our abilities. For all experiments across these datasets, we use gpt-3.5-turbo (1106 release). For the experiments using GPT, we additionally report the results using
OpenAI’s parallel function calling capability, which was
announced concurrently with our work. We also show how
LLMCompiler can be effectively combined with the open-source LLaMA-2 70B model to provide the model with
parallel function calling capabilities. For all experiments,
we have measured accuracy, end-to-end latency, as well as
input and output token usage. See Appendix D for details
on experimental setups.
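As a point of reference, a minimal stand-in for such a Wikipedia search tool could look like the sketch below, built on the third-party wikipedia package; this is our own illustrative stand-in, not the tool used in the paper's experiments:

import asyncio
import wikipedia  # third-party package (pip install wikipedia); an assumed stand-in

async def search(query: str) -> str:
    def lookup():
        titles = wikipedia.search(query)
        if not titles:
            return "No results found."
        # Return only the opening summary of the best-matching page.
        return wikipedia.summary(titles[0], sentences=2)
    # Run the blocking HTTP call in a thread so many searches can proceed in parallel.
    return await asyncio.to_thread(lookup)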
Accuracy and Latency. We report the accuracy, end-to-end latency, and relative speed-up of LLMCompiler compared to ReAct in Tab. 1. First, we observe that ReAct
consistently achieves lower accuracy compared to OpenAI
parallel function calling and LLMCompiler. We identify
two main failure modes in ReAct: (1) the tendency for redundant generation of prior function calls, a point also noted
in the original ReAct paper (Yao et al., 2022); and (2) premature early stopping based on the incomplete intermediate
results. In Appendix A, we offer a detailed analysis demonstrating how these two prevalent failure cases significantly hurt ReAct’s accuracy, and how they can be resolved with
LLMCompiler, leading to an accuracy enhancement of
up to 7 – 8%. Furthermore, we have conducted interventional experiments in which we incorporated ReAct-specific
prompts to avoid repetitive function calls and early stopping.
ReAct† in Tab. 1 refers to ReAct with this ReAct-specific prompt. The ReAct-specific prompt yields a general accuracy improvement with ReAct† as compared to the original ReAct. Nevertheless, LLMCompiler still demonstrates on-par or better accuracy than ReAct†, as such prompting does not serve as a perfect solution to completely avoiding the erroneous behavior of ReAct.
Additionally, when compared to ReAct†, LLMCompiler demonstrates a noticeable speedup of 1.80× and 1.40× on HotpotQA with GPT and LLaMA, respectively. Similarly, LLMCompiler demonstrates 3.74× and 2.82× speedup on Movie Recommendation with each model. Note that we benchmark the latency of LLMCompiler against that of ReAct†, since the repeating and early stopping behavior of the original ReAct discussed above makes its latency unpredictable and unsuitable for a fair comparison. LLMCompiler demonstrates a speedup of up to 35% compared to OpenAI parallel function calling, whose latency gain over ReAct is 1.61× and 2.76× on each benchmark.
Costs. Another important consideration of using LLMs
is cost, which depends on the input and output token usage. The costs for GPT experiments are provided in Tab. 2.
LLMCompiler is more cost-efficient than ReAct,
as it involves less frequent LLM invocations. Interestingly,
LLMCompiler also outperforms the recent OpenAI parallel function calling in cost efficiency. This is because
LLMCompiler’s planning phase is more prompt length
efficient than that of OpenAI parallel function calling since
our Planner’s in-context examples are rather short and only
include plans, not observations (see Appendix H).
5.2. Parallel Function Calling with Dependencies
The cases considered above are rather simple, as only one
tool is used and all tasks can be executed independently
of one another. However, similar to code execution in traditional code blocks, we may encounter function calling
scenarios that involve more complex dependencies. To systematically evaluate the capability to plan out function calling in scenarios that involve complex task dependencies, we
have designed a custom benchmark called ParallelQA. This
benchmark is designed to incorporate non-trivial function
calling patterns, including three different types of patterns in Fig. 3 (b) and (c). Inspired by the IfQA benchmark (Yu
et al., 2023), ParallelQA contains 113 examples that involve
mathematical questions on factual attributes of various entities. In particular, completing the task requires using two
tools (i.e., search and math tools), with the second tool’s argument depending on the result of the first tool’s output. We
have meticulously included questions that are answerable
only with information from Wikipedia’s first paragraph, effectively factoring out the failure cases due to unsuccessful
searches. See Appendix I for more details on ParallelQA.
Experimental Setups. Similar to Sec. 5.1, we use ReAct (Yao et al., 2022) as the main baseline. Here, both ReAct and LLMCompiler are equipped with two tools: (1)
the search tool, identical to the one mentioned in Sec. 5.1;
and (2) the math tool, which solves mathematical problems.
The math tool is inspired by the Langchain (Langchain)’s
LLMMathChain, which uses an LLM as an agent that interprets input queries and invokes the numexpr function
with the appropriate formula. This enables the math chain to
address a broad spectrum of math problems that are written
both in mathematical and verbal form. See Appendix D for
more details on experimental setups.
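A math tool in this spirit can be sketched as follows: an LLM translates the (possibly verbal) question into a single numeric expression, which is then evaluated with numexpr. The prompt wording and the llm callable below are assumptions for illustration, not LLMMathChain itself:

import numexpr  # pip install numexpr

async def math(question: str, llm) -> str:
    # Ask the LLM to translate the problem into one numexpr-compatible expression.
    prompt = (
        "Translate the following math problem into a single numexpr-compatible "
        f"arithmetic expression and return only the expression.\nProblem: {question}"
    )
    expression = (await llm(prompt)).strip()
    # numexpr evaluates plain arithmetic only, so the LLM output cannot run arbitrary code.
    value = numexpr.evaluate(expression).item()
    return str(value)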
Accuracy and Latency. As shown in the ParallelQA row
of Tab. 1, LLMCompiler arrives at the final answer with
an average speedup of 2.15× with gpt-4-turbo and 2.27×
with LLaMA-2 70B, by avoiding sequential execution of
the dependency graphs. Beyond the latency speedup, we observe higher accuracy of LLMCompiler with the LLaMA-2 model as compared to that of ReAct, due to the reasons discussed in Sec. 5.1. Particularly in the LLaMA-2 experiment, where LLMCompiler achieves around a
9% increase in accuracy, we note that ∼20% of the examples experienced repetitive function calls with ReAct,
aligning with our observations from the accuracy analysis detailed in Appendix A. Additionally, a comprehensive
analysis of LLMCompiler’s failure cases is provided in
Appendix B, where we note minimal Planner failures, highlighting LLMCompiler’s effectiveness in breaking down
problems into complex multi-task dependencies.
Cost. Similar to Sec. 5.1, LLMCompiler demonstrates
substantial cost reductions of 4.65× and 2.57× compared
to ReAct and OpenAI’s parallel function calling, respectively, as indicated in Tab. 2. This efficiency stems from
LLMCompiler’s reduced frequency of LLM invocations,
which also holds relative to OpenAI's parallel function calling, since the latter is limited to planning out immediately parallelizable tasks, not the entire dependency graph. For example, in
Fig. 3 (c), OpenAI’s method would necessitate three distinct
LLM calls for initial search tasks, following math tasks, and
the final math task. In contrast, LLMCompiler achieves
this with a single LLM call, planning all tasks concurrently.
5.3. Parallel Function Calling with Replanning
In the previous sections, we have discussed cases in which
dependency graphs can be determined statically. However,
there are cases where dependency graphs need to be constructed dynamically depending on intermediate observations. Here, we consider one such dynamic approach in the
context of the Game of 24 with the Tree-of-Thoughts (ToT)
strategy proposed in (Yao et al., 2023b). The Game of 24 is
a task to generate 24 using a set of four numbers and basic
arithmetic operations. For example, from the numbers 2, 4,
4, and 7, a solution could be 4 × (7 − 4) × 2 = 24. ToT
approaches this task through two iterative LLM processes:
(i) the thought proposer generates candidate partial solutions
by selecting two numbers and applying an operation (e.g.
2, 3, 7 from 2, 4, 4, 7 by calculating 7 - 4); (ii) the state
evaluator assesses the potential of each candidate. Only
the promising candidates are then processed in subsequent
iterations of the thought proposer and state evaluator until
24 is reached. Details about the Game of 24 benchmark and
the ToT strategy can be found in Appendix J.
While ToT achieves significant improvement at solving
the Game of 24, its sequential, breadth-first search approach through the state tree can be time-consuming.
LLMCompiler offers a faster alternative by enabling parallel execution of the thought proposer and the subsequent
feasibility evaluator, akin to a parallel beam search method.
Experimental Setups. Although LLMCompiler offers
latency advantages, solving this problem with a single
static graph is not feasible, as the Planner cannot plan
out the thought proposing stage before identifying the
selected candidates from the state evaluator of the previous iteration. Consequently, the Planner is limited to
planning only within one iteration at a time. To address
this, we resort to LLMCompiler’s replanning capability. In particular, LLMCompiler is equipped with three
tools: thought_proposer and state_evaluator, which are both LLMs adapted from the original ToT framework, and top_k_select, which chooses the top k candidates from thought_proposer based on state_evaluator's assessment. After all these tools are executed, LLMCompiler can decide to “replan” if no proposal reaches 24, triggering the Planner to devise new plans using the shortlisted states from top_k_select of the previous iteration. In this way, LLMCompiler can dynamically regenerate plans at each iteration, being able to
tackle highly complex tasks that require iterative replanning
based on the outcomes of previous plans.
To evaluate LLMCompiler’s performance on the Game of
24, we use 100 different instances of the game. For each
problem, we consider the output as successful if its operations are valid and yield 24 while also using the provided numbers exactly once each. Further details on experiment setups are outlined in Appendix D.
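This success criterion is easy to check programmatically: the proposed expression must be valid arithmetic, evaluate to 24, and use each provided number exactly once. The checker below illustrates the criterion and is not the paper's evaluation script:

import re

def is_valid_24(expression: str, numbers: list) -> bool:
    # Allow only digits, whitespace, + - * / and parentheses.
    if not re.fullmatch(r"[\d\s+\-*/()]+", expression):
        return False
    # Every provided number must be used exactly once.
    if sorted(int(n) for n in re.findall(r"\d+", expression)) != sorted(numbers):
        return False
    try:
        value = eval(expression)   # acceptable here: the expression was whitelisted above
    except Exception:
        return False
    return abs(value - 24) < 1e-6

# Example: is_valid_24("4 * (7 - 4) * 2", [2, 4, 4, 7]) returns True.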
Success Rate and Latency. In the last two rows of Tab. 1,
we explore the latency and success rate of LLMCompiler
in comparison to the baseline described in (Yao et al., 2023b)
on the Game of 24 benchmark. With the gpt-4 model,
LLMCompiler demonstrates a 2.89× enhancement in latency while slightly improving the success rate compared
to the baseline. Similarly, when applied with the LLaMA-2
model, LLMCompiler shows a 2.01× improvement in latency, again without compromising on success rate. These
results demonstrate not only a significant latency reduction
without quality degradation, but also the replanning capability of LLMCompiler for solving complex problems.
5.4. Application: LLMCompiler in Interactive Decision
Making Tasks
In this section, we demonstrate that LLMCompiler can
explore language-based interactive environments effectively
by benchmarking LLMCompiler on WebShop (Yao et al.,
2023a). As highlighted in (Shinn et al., 2023; Yao et al.,
2022; 2023a), WebShop exhibits considerable diversity,
which requires extensive exploration to purchase the most
appropriate item. While recent works feature advanced exploration strategies and show promising results (Ma et al., 2023;
Zhou et al., 2023a), their approaches are largely based on a
sequential and extensive tree search that incurs significant
latency penalties. Here, LLMCompiler showcases an exploration strategy that is both effective and efficient with the
use of parallel function calling. Our method enables broader
exploration of items in the environment, which improves
success rate compared to ReAct. At the same time, this exploration can be parallelized, yielding up to 101.7× speedup
against baselines that perform sequential exploration.
Experimental Setups. We evaluate LLMCompiler
against three baselines on this benchmark, ReAct (Yao
et al., 2022), LATS (Zhou et al., 2023a), and LASER (Ma
et al., 2023), using 500 WebShop instructions. The evaluation metrics are success rate, average score, and latency.
More details of the WebShop environment and the baseline
methods are provided in Appendix K. For this experiment,
LLMCompiler is equipped with two tools: search and
explore. The search function triggers the model to
generate and dispatch a query that returns a list of typically
ten items from the WebShop environment. The explore
function then clicks through links for each of the found
items and retrieves information about options, prices, attributes, and features that are available. Finally, based on
the gathered information, LLMCompiler decides on the
item that best matches the input instruction for purchasing.
Further details on experiments can be found in Appendix D.
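The resulting execution pattern is straightforward: one search followed by concurrent explore calls over every returned item, and a final decision. The sketch below uses assumed tool signatures purely to illustrate this parallel structure:

import asyncio

async def shop(instruction: str, search, explore, decide):
    # One search returns a list of (typically ten) candidate items.
    items = await search(instruction)
    # Explore all candidates in parallel: click through each item's page and
    # gather its available options, price, attributes, and features.
    details = await asyncio.gather(*(explore(item) for item in items))
    # Pick the item that best matches the instruction, given everything gathered.
    return await decide(instruction, list(zip(items, details)))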
Performance and Latency. Our approach significantly
outperforms all baseline models as shown in Table 3. When
using gpt-3.5-turbo, LLMCompiler achieves a 28.4%
and 6% improvement in success rate against ReAct and
LATS; with gpt-4, our method improves upon ReAct and
LASER by 20.4% and 5.6%, respectively. In terms of
latency, LLMCompiler exhibits a 101.7× and 2.69×
speedup against LATS and LASER. While we note that
LLMCompiler execution is slightly slower than ReAct on
this benchmark, mainly due to the Planner overhead, we
also highlight that the gains in success rate far outweigh the
minor latency penalty.
We further delve into why LLMCompiler attains such an
improved success rate and score compared to ReAct. Based
on our observations, we discover that the ReAct agent tends
to commit to a decision with imperfect information, a scenario that can arise when the agent has not gathered sufficient details about the features and options available for
items. This observation was also noted in (Shinn et al.,
2023) – without exploring more items in the environment,
the agent struggles to differentiate between seemingly similar choices, ultimately failing to make the correct decision.
In contrast, LLMCompiler undergoes further exploration
by visiting all ten items found by search and retrieving
relevant information about each item. We find that employing an effective search strategy is critical to decision-making
tasks such as the WebShop benchmark.
The relatively high performance of LATS can also be explained in terms of its exploration scheme. In this framework, the agent executes a brute-force search through the
state and action space of WebShop, exploring as many as
30 trajectories before making the final purchase. While this
approach provides richer information for decision-making,
the end-to-end execution becomes prohibitively slow.
We report that our method, LLMCompiler, outperforms
LASER by an average score of 1.5. When compared to
LATS, this score is within the standard deviation range of
our method. The average score for LLMCompiler, along
with its standard deviation, is 72.8 ± 4.01 for gpt-3.5-turbo.
Further note that while the performance differences are
marginal, our method exhibits significant execution speedup,
101.7× over LATS and 2.69× over LASER.
6. Conclusions
Existing methods for invoking multiple functions with
LLMs resort to sequential and dynamic reasoning. As a
result, they suffer from inefficiencies in latency, cost, and
accuracy. As a solution, we introduced LLMCompiler,
a compiler-inspired framework that enables efficient parallel function calling across various LLMs, including open-source models like LLaMA-2 and OpenAI's GPT series. By decomposing user inputs into tasks with defined
inter-dependencies and executing these tasks concurrently
through its Planner, Task Fetching Unit, and Executor components, LLMCompiler demonstrates substantial improvements in latency (up to 3.7×), cost efficiency (up to 6.7×),
and accuracy (up to ∼9%), even outperforming OpenAI’s
parallel function calling feature in latency gains. We look
forward to future work building upon LLMCompiler that
will improve both the capabilities and efficiencies of LLMs
in executing complex, large-scale tasks, thus transforming
the future development of LLM-based applications.
Impact Statement
This paper presents research towards advancing the field of
Machine Learning. While there are many potential societal
consequences of our work, we do not find any one to be
particularly noteworthy.
Acknowledgements
We appreciate the valuable feedback from Minwoo Kang.
We acknowledge gracious support from the Furiosa team. We
also appreciate the support from Microsoft through their
Accelerating Foundation Model Research, including great
support from Sean Kuno. Furthermore, we appreciate support from Google Cloud, the Google TRC team, and specifically Jonathan Caton, and Prof. David Patterson. Prof.
Keutzer’s lab is sponsored by the Intel corporation, Intel
One-API, Intel VLAB team, the Intel One-API center of
excellence, as well as funding through BDD and BAIR. We
also appreciate support from Samsung including Dongkyun
Kim, and David Thorsley. We appreciate the great support
from Ellick Chan, Saurabh Tangri, Andres Rodriguez, and
Kittur Ganesh. Sehoon Kim and Suhong Moon would like
to acknowledge the support from the Korea Foundation for
Advanced Studies. Amir Gholami was supported through
funding from Samsung SAIT. Michael W. Mahoney would
also like to acknowledge a J. P. Morgan Chase Faculty Research Award as well as the DOE, NSF, and IARPA. Our
conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.
G. User-Supplied Examples for LLMCompiler Configuration
LLMCompiler provides a simple interface that allows for tailoring the framework to different use cases by providing tool
definitions as well as optional in-context examples for the Planner. Below, we provide the Planner example prompts that are
used to set up the framework for the Movie Recommendation and Game of 24 benchmarks with only a few lines of prompts.
G.1. Movie Recommendation Example Prompts
###
Question: Find a movie similar to Mission Impossible, The Silence of the
Lambs, American Beauty, Star Wars Episode IV - A New Hope
Options:
Austin Powers International Man of Mystery
Alesha Popvich and Tugarin the Dragon
In Cold Blood
Rosetta
1. search("Mission Impossible")
2. search("The Silence of the Lambs")
3. search("American Beauty")
4. search("Star Wars Episode IV - A New Hope")
5. search("Austin Powers International Man of Mystery")
6. search("Alesha Popvich and Tugarin the Dragon")
7. search("In Cold Blood")
8. search("Rosetta")
Thought: I can answer the question now.
9. finish()
###
H. Pre-defined LLMCompiler Planner Prompts
###
The pre-defined LLMCompiler Planner prompt provides it with specific instructions on how to break down tasks and
generate dependency graphs while ensuring that the associated syntax is formatted correctly. This prompt contains specific
rules such as assigning each task to a new line, beginning each task with a numerical identifier, and using the $ sign to
denote intermediate variables.
###
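Given these formatting rules, each plan line can be parsed with a simple pattern; the regular expression below is an illustrative sketch of such a parser, not the one shipped with LLMCompiler:

import re

TASK_LINE = re.compile(r"^(\d+)\.\s*(\w+)\((.*)\)\s*$")   # e.g. 3. math("$1 / $2")

def parse_plan(plan_text: str):
    tasks = []
    for line in plan_text.splitlines():
        match = TASK_LINE.match(line.strip())
        if not match:                          # skip "Thought: ..." and blank lines
            continue
        idx, tool, args = int(match.group(1)), match.group(2), match.group(3)
        deps = sorted({int(d) for d in re.findall(r"\$(\d+)", args)})
        tasks.append({"idx": idx, "tool": tool, "args": args, "deps": deps})
    return tasks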
In addition to user-provided functions, the Planner includes a special, hard-coded finish function. The Planner uses this
function either when the plan is sufficient to address the user query or when it can no longer proceed with planning before
executing the current plan, i.e., when it deems replanning necessary. When the Planner outputs the finish function, its
plan generation stops. Refer to Appendix G for examples of the Planner’s usage of the finish function in planning. The
definition of the finish function is as below and is included as a prompt to the Planner along with the definitions of other
user-provided functions.
###
finish():
- Collects and combines results from prior actions.
- A LLM agent is called upon invoking join to either finalize the user
query or wait until the plans are executed.
- join should always be the last action in the plan, and will be called in
two scenarios:
(a) if the answer can be determined by gathering the outputs from tasks to
generate the final response.
(b) if the answer cannot be determined in the planning phase before you
execute the plans.
###