An LLM Compiler for Parallel Function Calling (text version for LLMs)
Abstract
The reasoning capabilities of recent LLMs
enable them to execute external function calls
to overcome their inherent limitations, such as
knowledge cutoffs, poor arithmetic skills, or lack
of access to private data. This development has
allowed LLMs to select and coordinate multiple
functions based on the context to tackle more
complex problems. However, current methods
for function calling often require sequential
reasoning and acting for each function which
can result in high latency, cost, and sometimes
inaccurate behavior. To address this, we introduce
LLMCompiler, which executes functions in parallel to efficiently orchestrate multiple function
calls. Drawing inspiration from the principles of
classical compilers, LLMCompiler enables parallel function calling with three components: (i) a
Function Calling Planner, formulating execution
plans for function calling; (ii) a Task Fetching
Unit, dispatching function calling tasks; and (iii)
an Executor, executing these tasks in parallel.
LLMCompiler automatically generates an optimized orchestration for the function calls and can
be used with both open-source and closed-source
models. We have benchmarked LLMCompiler
on a range of tasks with different patterns of
function calling. We observe consistent latency
speedup of up to 3.7×, cost savings of up
to 6.7×, and accuracy improvement of up to
∼9% compared to ReAct. Our code is available at
https://github.com/SqueezeAILab/LLMCompiler.
1. Introduction
Recent advances in the reasoning capability of Large Language Models (LLMs) have expanded the applicability of
LLMs beyond content generation to solving complex problems (Besta et al., 2023; Chen et al., 2023b; Gao et al., 2022; Kojima et al., 2023; Wang et al., 2023b; Wei et al., 2022;
Yang et al., 2022; Yao et al., 2023b; Zhou et al., 2023b);
and recent works have also shown how this reasoning capability can be helpful in improving accuracy for solving
complex and logical tasks. This reasoning capability has also
enabled function (i.e., tool) calling, where LLMs
can invoke provided functions and use the function outputs
to help complete their tasks. These functions range from a
simple calculator that can invoke arithmetic operations to
more complex LLM-based functions. The ability of LLMs to integrate various tools and function
calls could enable a fundamental shift in how we develop
LLM-based software. However, this brings up an important
challenge: what is the most effective approach to incorporate multiple function calls? A notable approach has been
introduced in ReAct (Yao et al., 2022), where the LLM calls
a function, analyzes the outcomes, and then reasons about
the next action, which involves a subsequent function call.
For a simple example illustrated in Fig. 1 (Left), where the
LLM is asked if Scott Derrickson and Ed Wood have the
same nationality, ReAct initially analyzes the query and decides to use a search tool to search for Scott Derrickson. The
result of this search (i.e., observation) is then concatenated
back to the original prompt for the LLM to reason about
the next action, which invokes another search tool to gather
information about Ed Wood.
ReAct has been a pioneering work in enabling function
calling, and it has been integrated into several frameworks (Langchain; Liu, 2022). However, scaling this approach for more complex applications requires considerable
optimizations. This is due to the sequential nature of ReAct, where it executes function calls and reasons about their
observations one after the other. This approach, along with
the agent systems that extend ReAct (Khot et al., 2023; Qin
et al., 2023; Ruan et al., 2023b; Sumers et al., 2023; Yao
et al., 2023b), may lead to inefficiencies in latency and cost,
due to the sequential function calling and repetitive LLM
invocations for each reasoning and action step. Furthermore,
while dynamic reasoning about the observations has benefits
in certain cases, concatenating the outcomes of intermediate function calls could disrupt the LLM’s execution flow,
potentially reducing accuracy (Xu et al., 2023). Common
failure cases include repetitive invocation of the same function, which is also highlighted in the original paper (Yao
et al., 2022), and early stopping based on the partial intermediate results, as will be further discussed in Sec. 5.1 and
Appendix A.
To address this challenge, we draw inspiration from classical compilers, where optimizing instruction executions
in traditional programming languages has been extensively
explored. A key optimization technique in compilers involves identifying instructions that can be executed in parallel and effectively managing their dependencies. Similarly,
one can envision a compiler, tailored for LLM function
calling, which can efficiently orchestrate various function
calls and their dependencies. This shares a similar philosophy with the recent studies that align LLMs with computer systems (Karpathy, 2023; Packer et al., 2023). To
this end, we introduce LLMCompiler, a novel framework
that enables parallel multi-tool execution of LLMs across
different models and workloads. To the best of our knowledge, LLMCompiler is the first framework to optimize
the orchestration of LLM function calling that can not only
improve latency and cost, but also accuracy, by minimizing
interference from the outputs of intermediate function calls.
In more detail, we make the following contributions:
• We introduce LLMCompiler, an LLM compiler that
optimizes the parallel function calling performance of
LLMs. At a high level, this is achieved by introducing
three key components: (i) a Function Calling Planner
(Sec. 3.1) that identifies an execution flow; (ii) a Task Fetching Unit (Sec. 3.2) that dispatches the function calls
in parallel; (iii) an Executor (Sec. 3.3) that executes the
dispatched tasks using the associated functions.
• We evaluate LLMCompiler on embarrassingly parallel
patterns using HotpotQA (Yang et al., 2018) and Movie
Recommendation (Srivastava et al., 2022), where we observe 1.80×/3.74× speedup and 3.37×/6.73× cost reduction compared to ReAct (Sec. 5.1).
• To test the performance on more complex patterns, we
introduce a new benchmark called ParallelQA, which includes various non-trivial function calling patterns. We
show up to 2.27× speedup, 4.65× cost reduction, and 9%
improved accuracy compared to ReAct (Sec. 5.2).
• We evaluate LLMCompiler’s capability in dynamic replanning, which is achieved through a feedback loop
from the Executor back to our Function Calling Planner.
For the Game of 24 (Yao et al., 2023b), which requires
repeated replanning based on the intermediate results,
LLMCompiler demonstrates a 2× speedup compared
to Tree-of-Thoughts (Sec. 5.3).
• We show that LLMCompiler can explore the interactive
decision-making environment effectively and efficiently.
On WebShop, LLMCompiler achieves up to 101.7×
speedup and 25.7% improved success rate compared to
the baselines (Sec. 5.4).
2. Related Work
2.1. Latency Optimization in LLMs
Various studies have focused on optimizing model design (Chen et al., 2023a; Dettmers et al., 2023; Frantar &
Alistarh, 2023; Frantar et al., 2022; Kim et al., 2023; 2024;
Kwon et al., 2022; Leviathan et al., 2023; Lin et al., 2023)
and systems (tgi; trt; Kwon et al., 2023; Yu et al., 2022) for
efficient LLM inference. Optimizations at the application
level, however, are less explored. This is critical from a practical point of view for situations involving black-box LLM
models and services where modifications to the models and
the underlying inference pipeline are highly restricted.
Skeleton-of-Thought (Ning et al., 2023) recently proposed
to reduce latency through application-level parallel decoding. This method involves a two-step process of an initial
skeleton generation phase, followed by parallel execution
of skeleton items. However, it is primarily designed for
embarrassingly parallel workloads and does not support
problems that have inherently interdependent tasks, as it
assumes no dependencies between skeleton tasks. This
limits its applicability in complex scenarios such as coding (Austin et al., 2021; Chen et al., 2021; Hendrycks et al.,
2021a; Madaan et al., 2023) or math (Hendrycks et al.,
2021b;c) problems, as also stated in the paper (Ning et al.,
2023). LLMCompiler addresses this by translating an
input query into a series of tasks with inter-dependencies,
thereby expanding the spectrum of problems it can handle.
Concurrently to our work, OpenAI has recently introduced
a parallel function calling feature in their 1106 release, enhancing user query processing through the simultaneous
generation of multiple function calls (OpenAI, 2023). Despite its potential for reducing LLM execution time, this
feature has certain limitations, as it is exclusively available for OpenAI’s proprietary models. However, there is
a growing demand for using open-source models driven
by the increasing number of open-source LLMs as well
as parameter-efficient training techniques (Houlsby et al.,
2019; Hu et al., 2022; Lester et al., 2021) for finetuning and
customization. LLMCompiler enables efficient parallel
function calling for open-source models, and also, as we
will show later in Sec. 5, it can potentially achieve better
latency and cost.
2.2. Plan and Solve Strategy
Several studies (Hao et al., 2023; Patel et al., 2022; Press
et al., 2023; Wolfson et al., 2020; Zhou et al., 2023b) have
explored prompting methods of breaking down complex
queries into various levels of detail to solve them, thereby
improving LLM’s performance in reasoning tasks. Specifically, Decomposed Prompting (Khot et al., 2023) tackles
complex tasks by decomposing them into simpler sub-tasks, each optimized through LLMs with dedicated prompts.
Step-Back Prompting (Zheng et al., 2023) enables LLMs to
abstract high-level concepts from details to enhance reasoning abilities across various tasks. Plan-and-Solve Prompting (Wang et al., 2023a) segments multi-step reasoning
tasks into subtasks to minimize errors and improve task
accuracy without manual prompting. However, these methods primarily focus on improving the accuracy of reasoning
benchmarks. In contrast, LLMCompiler uses a planner
to identify parallelizable patterns within queries, aiming to
reduce latency while maintaining accuracy.
In addition to the aforementioned works, TPTU (Ruan
et al., 2023a), HuggingGPT (Shen et al., 2023), and
ViperGPT (Surís et al., 2023) have introduced end-to-end
plan-and-solve frameworks. LLMCompiler sets itself
apart by providing a general framework that enables efficient and accurate function calling in a broader range of
problems. This stems from LLMCompiler’s capabilities
in (i) planning and replanning; (ii) parallel execution; and
(iii) addressing a wider range of problem domains, which
will be discussed in more detail in Appendix F.
Another notable work is ReWOO (Xu et al., 2023) which
employs a planner to separate the reasoning process from the
execution and observation phases to decrease token usage
and cost as compared to ReAct. Our approach is different
from ReWOO in multiple aspects. First, LLMCompiler
allows parallel function calling which can reduce latency
as well as cost. Second, LLMCompiler supports dynamic
replanning which is important for problems whose execution flow cannot be determined statically in the beginning
(Sec. 5.3).
2.3. Tool-Augmented LLMs
The enhanced reasoning capability of LLMs has enabled
them to invoke user-provided functions and use their outputs
to effectively complete tasks. Detailed exploration of this
subject is provided in Appendix C.1.
3. Methodology
To illustrate the components of LLMCompiler, we use a
simple 2-way parallel example in Fig. 2. To answer “How
much does Microsoft’s market cap need to increase to exceed Apple’s market cap?,” the LLM first needs to conduct
web searches for both companies’ market caps, followed by
a division operation. While the existing frameworks, including ReAct, perform these tasks sequentially, it is evident that
they can be executed in parallel. The key question is how to
automatically determine which tasks are parallelizable and
which are interdependent, so we can orchestrate the execution of the different tasks accordingly. LLMCompiler accomplishes this through a system that consists of the following three components: a Function Calling Planner (Sec. 3.1)
that generates a sequence of tasks and their dependencies; a
Task Fetching Unit (Sec. 3.2) that replaces arguments based
on intermediate results and fetches the tasks; and an Executor (Sec. 3.3) that executes the tasks with associated tools.
To use LLMCompiler, users are only required to provide
tool definitions, and optional in-context examples for the
Planner, as will be further discussed in Sec. 4.1.
3.1. Function Calling Planner
The Function Calling Planner is responsible for generating a
sequence of tasks to be executed along with any dependency
among them. For instance, Tasks $1 and $2 in Fig. 2 are
two independent searches that can be performed in parallel. However, Task $3 has a dependency on the outcomes
of the first and second searches. Therefore, the Planner’s
role is to automatically identify the necessary tasks, their
input arguments, as well as their inter-dependencies using
the sophisticated reasoning capability of LLMs, essentially
forming a directed acyclic graph of task dependencies. If
a task is dependent on a preceding task, it incorporates a
placeholder variable, such as $1 in Task 3 of Fig. 2, which
will later be substituted with the actual output from the
preceding task (Sec. 3.2).
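For instance, for the query in Fig. 2, the Planner might emit a plan along the following lines (an illustrative sketch in the plan format of Appendix G; the tool names search and math are assumed here):

1. search("Microsoft market cap")
2. search("Apple market cap")
3. math("how much does $1 need to increase to exceed $2?")
Thought: I can answer the question now.
4. finish()

Here, $1 and $2 in Task 3 will be substituted with the two search results before the math tool runs.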
The Planner in LLMCompiler leverages LLMs’ reasoning
capability to decompose tasks from natural language inputs.
To achieve this, the Planner LLM incorporates a pre-defined
prompt that guides it on how to create dependency graphs
and to ensure correct syntax (see Appendix H for details).
Besides this, users also need to supply tool definitions and
optional in-context examples for the Planner. These examples provide detailed demonstrations of task decomposition
specific to a problem, helping the Planner to better understand the rules. Further details on user-supplied information
for LLMCompiler are elaborated in Sec. 4.1. In Sec. 4.2,
we introduce an additional optimization for the Planner that streams tasks as soon as they are created, instead of waiting
to complete the entire planning process.
3.2. Task Fetching Unit
The Task Fetching Unit, inspired by the instruction fetching
units in modern computer architectures, fetches tasks to the
Executor as soon as they are ready for (parallel) execution
based on a greedy policy. Another key functionality is
to replace variables with the actual outputs from preceding
tasks, which were initially set as placeholders by the Planner.
For the example in Fig. 2, the variables $1 and $2 in Task $3
would be replaced with the actual market cap of Microsoft
and Apple. This can be implemented with a simple fetching
and queuing mechanism without a dedicated LLM.
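As a rough illustration of that mechanism, the sketch below (Python; the Task structure and helper names are our own, not the released implementation) extracts dependencies from the $-placeholders, selects ready tasks greedily, and substitutes placeholders with completed results:

import re

class Task:
    def __init__(self, idx, tool, args):
        self.idx, self.tool, self.args = idx, tool, args
        # Dependencies are the $k placeholders appearing in the argument string.
        self.deps = {int(m) for m in re.findall(r"\$(\d+)", args)}
        self.result = None

def ready_tasks(tasks, done):
    # Greedy policy: a task is ready once all of its dependencies have completed.
    return [t for t in tasks.values() if t.idx not in done and t.deps <= done]

def substitute(task, tasks):
    # Replace each $k placeholder with the actual output of the preceding task k.
    return re.sub(r"\$(\d+)", lambda m: str(tasks[int(m.group(1))].result), task.args)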
3.3. Executor
The Executor asynchronously executes tasks fetched from
the Task Fetching Unit. As the Task Fetching Unit guarantees that all the tasks dispatched to the Executor are independent, it can simply execute them concurrently. The
Executor is equipped with user-provided tools, and it delegates the task to the associated tool. These tools can be
simple functions like a calculator, Wikipedia search, or API
calls, or they can even be LLM agents that are tailored for a
specific task. As depicted in the Executor block of Fig. 2,
each task has dedicated memory to store its intermediate
outcomes, similar to what typical sequential frameworks
do when aggregating observations as a single prompt (Yao
et al., 2022). Upon completion of the task, the final results
are forwarded as input to the tasks dependent on them.
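Concretely, the Executor can be as simple as an event loop that runs every ready batch of independent tasks concurrently; the sketch below builds on the illustrative Task helpers above, assumes tools is a dictionary of async callables keyed by tool name, and is not the released implementation:

import asyncio

async def run_task(task, tools, tasks):
    # Delegate the task to its associated user-provided tool (an async callable).
    task.result = await tools[task.tool](substitute(task, tasks))

async def execute_all(tasks, tools):
    done = set()
    while len(done) < len(tasks):
        batch = ready_tasks(tasks, done)
        if not batch:          # guard against malformed (cyclic) plans
            break
        # All dispatched tasks are independent, so they can run concurrently.
        await asyncio.gather(*(run_task(t, tools, tasks) for t in batch))
        done |= {t.idx for t in batch}
    return tasks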
3.4. Dynamic Replanning
In various applications, the execution graph may need to
adapt based on intermediate results that are a priori unknown.
An analogy in programming is branching, where the path of execution is determined only during runtime, depending
on which branch conditions are satisfied. Such dynamic execution patterns can also appear with LLM function calling.
For simple branching (e.g., if-else statements) one could
statically compile the execution flow and choose the right branch
dynamically based on the intermediate results. However, for
more complex branching it may be better to do a recompilation or replanning based on the intermediate results.
When replanning, the intermediate results are sent back
from the Executor to the Function Calling Planner which
then generates a new set of tasks with their associated dependencies. These tasks are then sent to the Task Fetching
Unit and subsequently to the Executor. This cycle continues
until the desired final result is achieved and can be delivered
to the user. We show an example use case of this in Sec. 5.3
for solving the Game of 24 using the Tree-of-Thoughts
approach.
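Schematically, the replanning cycle can be pictured as the loop below; the planner.plan and join interfaces are assumptions made for illustration, and in LLMCompiler the decision of when to replan is tied to the Planner's finish mechanism described in Appendix H:

async def solve(query, planner, tools, max_rounds=5):
    observations = []                              # intermediate results fed back to the Planner
    answer = None
    for _ in range(max_rounds):
        tasks = planner.plan(query, observations)  # Function Calling Planner produces a new DAG
        await execute_all(tasks, tools)            # Task Fetching Unit + Executor run it in parallel
        answer, needs_replan = join(tasks)         # aggregate outputs; decide whether to replan
        if not needs_replan:
            return answer                          # desired final result reached
        observations.append(answer)                # otherwise, send results back and replan
    return answer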
4. LLMCompiler Details
4.1. User-Supplied Information
LLMCompiler requires two inputs from the user:
1. Tool Definitions: Users need to specify the tools that
LLMs can use, including their descriptions and argument
specifications. This is essentially the same requirement
as other frameworks like ReAct and OpenAI function
calling. An illustrative sketch of such a definition is given after this list.
2. In-context Examples for the Planner: Optionally,
users can provide LLMCompiler with examples of
how the Planner should behave. For instance, in the case
of Fig. 2, users may provide examples illustrating expected inter-task dependencies for certain queries. These
examples can help the Planner LLM understand how
to use various tools and generate the appropriate dependency graph for incoming inputs in the correct format.
In Appendix G, we include the examples that we used
in our evaluations.
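For illustration, a tool definition (item 1 above) amounts to little more than a callable plus a natural-language description that the Planner sees in its prompt; the structure below is an assumed sketch, not LLMCompiler's actual interface:

from dataclasses import dataclass
from typing import Awaitable, Callable

@dataclass
class Tool:
    name: str                              # name the Planner uses in plans, e.g. "search"
    description: str                       # description and argument spec shown to the Planner
    func: Callable[[str], Awaitable[str]]  # async callable invoked by the Executor

async def search(query: str) -> str:
    return f"(stub) top result for {query!r}"  # placeholder body for the sketch

search_tool = Tool(
    name="search",
    description="search(query): searches Wikipedia and returns the first paragraph of the top result.",
    func=search,
)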
4.2. Streamed Planner
The Planner may incur a non-trivial overhead for user
queries that involve many tasks, as it blocks the Task Fetching Unit and the Executor, which must wait for the Planner
output before initiating their processes. However, analogous
to instruction pipelining in modern computer systems, this
can be mitigated by enabling the Planner to asynchronously
stream the dependency graph, thereby allowing each task
to be immediately processed by the Executor as soon as its
dependencies are all resolved. In Table C.1, we present a
latency comparison of LLMCompiler with and without
the streaming mechanism across different benchmarks. The
results demonstrate consistent latency improvements with
streaming. Particularly, in the ParallelQA benchmark, the streaming feature leads to a latency gain of up to 1.3×. This
is attributed to the math tool’s longer execution time for
ParallelQA, which can effectively hide the Planner’s latency
in generating subsequent tasks, unlike the shorter execution
times of the search tool used in HotpotQA and Movie
Recommendation.
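The effect of streaming on the pipeline can be sketched as follows: rather than waiting for the complete plan, each task is scheduled the moment it arrives and starts as soon as its own dependencies finish. The async-generator interface and Task fields below are illustrative assumptions, not the released implementation:

import asyncio, re

async def execute_streamed(plan_stream, tools):
    tasks, running = {}, {}

    def substitute(args):
        # Replace $k placeholders with the outputs of the corresponding finished tasks.
        return re.sub(r"\$(\d+)", lambda m: str(tasks[int(m.group(1))].result), args)

    async def run(task):
        # A dependency always carries a smaller index, so it was scheduled earlier;
        # wait only for this task's own dependencies, then execute immediately.
        await asyncio.gather(*(running[d] for d in task.deps))
        task.result = await tools[task.tool](substitute(task.args))

    async for task in plan_stream:            # tasks arrive while the Planner is still generating
        tasks[task.idx] = task
        running[task.idx] = asyncio.create_task(run(task))

    await asyncio.gather(*running.values())   # wait for the whole graph to finish
    return tasks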
5. Results
In this section, we evaluate LLMCompiler using a variety
of models and problem types. We use both the proprietary
GPT models and the open-source LLaMA-2 model, with
the latter demonstrating LLMCompiler’s capability in enabling parallel function calling in open-source models. Furthermore, there are various types of parallel function calling
patterns that can be addressed with LLMs. This ranges
from embarrassingly parallel patterns, where all tasks can
be executed in parallel without any dependencies between
them, to more complex dependency patterns, as illustrated
in Fig. 3. Importantly, we also assess LLMCompiler
on the Game of 24 benchmark, which involves dynamic
replanning based on intermediate results, highlighting its
adaptability to dynamic dependency graphs. Finally, we
apply LLMCompiler to the WebShop benchmark to showcase its potential in decision-making tasks. Overall, we start by presenting results for simple execution patterns and then move to more complex ones.
5.1. Embarrassingly Parallel Function Calling
The simplest scenario involves an LLM using a tool repeatedly for independent tasks such as conducting parallel
searches or analyses to gather information on different topics, like the pattern depicted in Fig. 3 (a). While these tasks
are independent of each other and can be executed in parallel, ReAct, along with other LLM solutions as they stand,
would need to run sequentially. This leads to increased
latency and token consumption due to its frequent LLM invocations for each tool usage, as also illustrated in Fig. 1. In
this section, we demonstrate how LLMCompiler can identify parallelizable patterns and execute independent tasks
concurrently to resolve this issue. To do so, we use the
following two benchmarks:
• HotpotQA: A dataset that evaluates multi-hop reasoning (Yang et al., 2018). We only use the comparison dev
set. This contains 1.5k questions comparing two different
entities, thus exhibiting a 2-way embarrassingly parallel
execution pattern. An example question is shown in Fig. 1.
• Movie Recommendation: A dataset with 500 examples, each asking which of four candidate movies is most similar to another given set of four movies, exhibiting an 8-way embarrassingly parallel pattern (Srivastava et al., 2022).
Experimental Setups. As a baseline method, we compare LLMCompiler with ReAct. We follow the ReAct
setup (Yao et al., 2022) using the same Wikipedia search
tool that LLMs can use to search for information. We did
not include the lookup tool since it is not relevant to our
problem setting. We have optimized the prompt and in-context examples for both ReAct and LLMCompiler to
the best of our abilities. For all experiments across these datasets, we use gpt-3.5-turbo (1106 release). For the experiments using GPT, we additionally report the results using
OpenAI’s parallel function calling capability, which was
announced concurrently with our work. We also show how
LLMCompiler can be effectively combined with the open-source LLaMA-2 70B model to provide the model with
parallel function calling capabilities. For all experiments,
we have measured accuracy, end-to-end latency, as well as
input and output token usage. See Appendix D for details
on experimental setups.
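As a point of reference, a minimal stand-in for such a Wikipedia search tool could look like the sketch below, built on the third-party wikipedia package; this is our own illustrative stand-in, not the tool used in the paper's experiments:

import asyncio
import wikipedia  # third-party package (pip install wikipedia); an assumed stand-in

async def search(query: str) -> str:
    def lookup():
        titles = wikipedia.search(query)
        if not titles:
            return "No results found."
        # Return only the opening summary of the best-matching page.
        return wikipedia.summary(titles[0], sentences=2)
    # Run the blocking HTTP call in a thread so many searches can proceed in parallel.
    return await asyncio.to_thread(lookup)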
Accuracy and Latency. We report the accuracy, end-to-end latency, and relative speed-up of LLMCompiler compared to ReAct in Tab. 1. First, we observe that ReAct
consistently achieves lower accuracy compared to OpenAI
parallel function calling and LLMCompiler. We identify
two main failure modes in ReAct: (1) the tendency for redundant generation of prior function calls, a point also noted
in the original ReAct paper (Yao et al., 2022); and (2) premature early stopping based on the incomplete intermediate
results. In Appendix A, we offer a detailed analysis demonstrating how these two prevalent failure cases significantly hurt ReAct’s accuracy, and how they can be resolved with
LLMCompiler, leading to an accuracy enhancement of
up to 7 – 8%. Furthermore, we have conducted interventional experiments in which we incorporated ReAct-specific
prompts to avoid repetitive function calls and early stopping.
ReAct† in Tab. 1 refers to ReAct with this ReAct-specific prompt. The ReAct-specific prompt yields a general accuracy improvement with ReAct† as compared to the original ReAct. Nevertheless, LLMCompiler still demonstrates on-par or better accuracy than ReAct†, as such prompting does not serve as a perfect solution to completely avoiding the erroneous behavior of ReAct.
Additionally, when compared to ReAct†, LLMCompiler demonstrates a noticeable speedup of 1.80× and 1.40× on HotpotQA with GPT and LLaMA, respectively. Similarly, LLMCompiler demonstrates 3.74× and 2.82× speedup on Movie Recommendation with each model. Note that we benchmark the latency of LLMCompiler against that of ReAct†, since the repeating and early stopping behavior of the original ReAct discussed above makes its latency unpredictable and unsuitable for a fair comparison. LLMCompiler demonstrates a speedup of up to 35% compared to OpenAI parallel function calling, whose latency gain over ReAct is 1.61× and 2.76× on each benchmark.
Costs. Another important consideration of using LLMs
is cost, which depends on the input and output token usage. The costs for GPT experiments are provided in Tab. 2.
LLMCompiler is more cost-efficient than ReAct,
as it involves less frequent LLM invocations. Interestingly,
LLMCompiler also outperforms the recent OpenAI parallel function calling in cost efficiency. This is because
LLMCompiler’s planning phase is more prompt length
efficient than that of OpenAI parallel function calling since
our Planner’s in-context examples are rather short and only
include plans, not observations (see Appendix H).
5.2. Parallel Function Calling with Dependencies
The cases considered above are rather simple, as only one
tool is used and all tasks can be executed independently
of one another. However, similar to code execution in traditional code blocks, we may encounter function calling
scenarios that involve more complex dependencies. To systematically evaluate the capability to plan out function calling in scenarios that involve complex task dependencies, we
have designed a custom benchmark called ParallelQA. This
benchmark is designed to incorporate non-trivial function
calling patterns, including three different types of patterns in Fig. 3 (b) and (c). Inspired by the IfQA benchmark (Yu
et al., 2023), ParallelQA contains 113 examples that involve
mathematical questions on factual attributes of various entities. In particular, completing the task requires using two
tools (i.e., search and math tools), with the second tool’s argument depending on the result of the first tool’s output. We
have meticulously included questions that are answerable
only with information from Wikipedia’s first paragraph, effectively factoring out the failure cases due to unsuccessful
searches. See Appendix I for more details on ParallelQA.
Experimental Setups. Similar to Sec. 5.1, we use ReAct (Yao et al., 2022) as the main baseline. Here, both ReAct and LLMCompiler are equipped with two tools: (1)
the search tool, identical to the one mentioned in Sec. 5.1;
and (2) the math tool, which solves mathematical problems.
The math tool is inspired by the Langchain (Langchain)’s
LLMMathChain, which uses an LLM as an agent that interprets input queries and invokes the numexpr function
with the appropriate formula. This enables the math chain to
address a broad spectrum of math problems that are written
both in mathematical and verbal form. See Appendix D for
more details on experimental setups.
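A math tool in this spirit can be sketched as follows: an LLM translates the (possibly verbal) question into a single numeric expression, which is then evaluated with numexpr. The prompt wording and the llm callable below are assumptions for illustration, not LLMMathChain itself:

import numexpr  # pip install numexpr

async def math(question: str, llm) -> str:
    # Ask the LLM to translate the problem into one numexpr-compatible expression.
    prompt = (
        "Translate the following math problem into a single numexpr-compatible "
        f"arithmetic expression and return only the expression.\nProblem: {question}"
    )
    expression = (await llm(prompt)).strip()
    # numexpr evaluates plain arithmetic only, so the LLM output cannot run arbitrary code.
    value = numexpr.evaluate(expression).item()
    return str(value)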
Accuracy and Latency. As shown in the ParallelQA row
of Tab. 1, LLMCompiler arrives at the final answer with
an average speedup of 2.15× with gpt-4-turbo and 2.27×
with LLaMA-2 70B, by avoiding sequential execution of
the dependency graphs. Beyond the latency speedup, we observe higher accuracy of LLMCompiler with the LLaMA-2 model as compared to that of ReAct, due to the reasons discussed in Sec. 5.1. Particularly in the LLaMA-2 experiment, where LLMCompiler achieves around a
9% increase in accuracy, we note that ∼20% of the examples experienced repetitive function calls with ReAct,
aligning with our observations from the accuracy analysis detailed in Appendix A. Additionally, a comprehensive
analysis of LLMCompiler’s failure cases is provided in
Appendix B, where we note minimal Planner failures, highlighting LLMCompiler’s effectiveness in breaking down
problems into complex multi-task dependencies.
Cost. Similar to Sec. 5.1, LLMCompiler demonstrates
substantial cost reductions of 4.65× and 2.57× compared
to ReAct and OpenAI’s parallel function calling, respectively, as indicated in Tab. 2. This efficiency stems from
LLMCompiler’s reduced frequency of LLM invocations,
which also holds relative to OpenAI's parallel function calling, since the latter is limited to planning out immediately parallelizable tasks, not the entire dependency graph. For example, in
Fig. 3 (c), OpenAI’s method would necessitate three distinct
LLM calls for initial search tasks, following math tasks, and
the final math task. In contrast, LLMCompiler achieves
this with a single LLM call, planning all tasks concurrently.
5.3. Parallel Function Calling with Replanning
In the previous sections, we have discussed cases in which
dependency graphs can be determined statically. However,
there are cases where dependency graphs need to be constructed dynamically depending on intermediate observations. Here, we consider one such dynamic approach in the
context of the Game of 24 with the Tree-of-Thoughts (ToT)
strategy proposed in (Yao et al., 2023b). The Game of 24 is
a task to generate 24 using a set of four numbers and basic
arithmetic operations. For example, from the numbers 2, 4,
4, and 7, a solution could be 4 × (7 − 4) × 2 = 24. ToT
approaches this task through two iterative LLM processes:
(i) the thought proposer generates candidate partial solutions
by selecting two numbers and applying an operation (e.g.
2, 3, 7 from 2, 4, 4, 7 by calculating 7 - 4); (ii) the state
evaluator assesses the potential of each candidate. Only
the promising candidates are then processed in subsequent
iterations of the thought proposer and state evaluator until
24 is reached. Details about the Game of 24 benchmark and
the ToT strategy can be found in Appendix J.
While ToT achieves significant improvement at solving
the Game of 24, its sequential, breadth-first search approach through the state tree can be time-consuming.
LLMCompiler offers a faster alternative by enabling parallel execution of the thought proposer and the subsequent
feasibility evaluator, akin to a parallel beam search method.
Experimental Setups. Although LLMCompiler offers
latency advantages, solving this problem with a single
static graph is not feasible, as the Planner cannot plan
out the thought proposing stage before identifying the
selected candidates from the state evaluator of the previous iteration. Consequently, the Planner is limited to
planning only within one iteration at a time. To address
this, we resort to LLMCompiler’s replanning capability. In particular, LLMCompiler is equipped with three
tools: thought_proposer and state_evaluator, which are both LLMs adapted from the original ToT framework, and top_k_select, which chooses the top k candidates from thought_proposer based on state_evaluator's assessment. After all these tools are executed, LLMCompiler can decide to “replan” if no proposal reaches 24, triggering the Planner to devise new plans using the shortlisted states from top_k_select of the previous iteration. In this way, LLMCompiler can dynamically regenerate plans at each iteration, being able to
tackle highly complex tasks that require iterative replanning
based on the outcomes of previous plans.
To evaluate LLMCompiler’s performance on the Game of
24, we use 100 different instances of the game. For each
problem, we consider the output as successful if its operations are valid and yield 24 while also using the provided numbers exactly once each. Further details on experiment setups are outlined in Appendix D.
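This success criterion is easy to check programmatically: the proposed expression must be valid arithmetic, evaluate to 24, and use each provided number exactly once. The checker below illustrates the criterion and is not the paper's evaluation script:

import re

def is_valid_24(expression: str, numbers: list) -> bool:
    # Allow only digits, whitespace, + - * / and parentheses.
    if not re.fullmatch(r"[\d\s+\-*/()]+", expression):
        return False
    # Every provided number must be used exactly once.
    if sorted(int(n) for n in re.findall(r"\d+", expression)) != sorted(numbers):
        return False
    try:
        value = eval(expression)   # acceptable here: the expression was whitelisted above
    except Exception:
        return False
    return abs(value - 24) < 1e-6

# Example: is_valid_24("4 * (7 - 4) * 2", [2, 4, 4, 7]) returns True.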
Success Rate and Latency. In the last two rows of Tab. 1,
we explore the latency and success rate of LLMCompiler
in comparison to the baseline described in (Yao et al., 2023b)
on the Game of 24 benchmark. With the gpt-4 model,
LLMCompiler demonstrates a 2.89× enhancement in latency while slightly improving the success rate compared
to the baseline. Similarly, when applied with the LLaMA-2
model, LLMCompiler shows a 2.01× improvement in latency, again without compromising on success rate. These
results demonstrate not only a significant latency reduction
without quality degradation, but also the replanning capability of LLMCompiler for solving complex problems.
5.4. Application: LLMCompiler in Interactive Decision
Making Tasks
In this section, we demonstrate that LLMCompiler can
explore language-based interactive environments effectively
by benchmarking LLMCompiler on WebShop (Yao et al.,
2023a). As highlighted in (Shinn et al., 2023; Yao et al.,
2022; 2023a), WebShop exhibits considerable diversity,
which requires extensive exploration to purchase the most
appropriate item. While recent works feature advanced exploration strategies and show promising results (Ma et al., 2023;
Zhou et al., 2023a), their approaches are largely based on a
sequential and extensive tree search that incurs significant
latency penalties. Here, LLMCompiler showcases an exploration strategy that is both effective and efficient with the
use of parallel function calling. Our method enables broader
exploration of items in the environment, which improves
success rate compared to ReAct. At the same time, this exploration can be parallelized, yielding up to 101.7× speedup
against baselines that perform sequential exploration.
Experimental Setups. We evaluate LLMCompiler
against three baselines on this benchmark, ReAct (Yao
et al., 2022), LATS (Zhou et al., 2023a), and LASER (Ma
et al., 2023), using 500 WebShop instructions. The evaluation metrics are success rate, average score, and latency.
More details of the WebShop environment and the baseline
methods are provided in Appendix K. For this experiment,
LLMCompiler is equipped with two tools: search and
explore. The search function triggers the model to
generate and dispatch a query that returns a list of typically
ten items from the WebShop environment. The explore
function then clicks through links for each of the found
items and retrieves information about options, prices, attributes, and features that are available. Finally, based on
the gathered information, LLMCompiler decides on the
item that best matches the input instruction for purchasing.
Further details on experiments can be found in Appendix D.
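The resulting execution pattern is straightforward: one search followed by concurrent explore calls over every returned item, and a final decision. The sketch below uses assumed tool signatures purely to illustrate this parallel structure:

import asyncio

async def shop(instruction: str, search, explore, decide):
    # One search returns a list of (typically ten) candidate items.
    items = await search(instruction)
    # Explore all candidates in parallel: click through each item's page and
    # gather its available options, price, attributes, and features.
    details = await asyncio.gather(*(explore(item) for item in items))
    # Pick the item that best matches the instruction, given everything gathered.
    return await decide(instruction, list(zip(items, details)))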
Performance and Latency. Our approach significantly
outperforms all baseline models as shown in Table 3. When
using gpt-3.5-turbo, LLMCompiler achieves a 28.4%
and 6% improvement in success rate against ReAct and
LATS; with gpt-4, our method improves upon ReAct and
LASER by 20.4% and 5.6%, respectively. In terms of
latency, LLMCompiler exhibits a 101.7× and 2.69×
speedup against LATS and LASER. While we note that
LLMCompiler execution is slightly slower than ReAct on
this benchmark, mainly due to the Planner overhead, we
also highlight that the gains in success rate far outweigh the
minor latency penalty.
We further delve into why LLMCompiler attains such an
improved success rate and score compared to ReAct. Based
on our observations, we discover that the ReAct agent tends
to commit to a decision with imperfect information, a scenario that can arise when the agent has not gathered sufficient details about the features and options available for
items. This observation was also noted in (Shinn et al.,
2023) – without exploring more items in the environment,
the agent struggles to differentiate between seemingly similar choices, ultimately failing to make the correct decision.
In contrast, LLMCompiler undergoes further exploration
by visiting all ten items found by search and retrieving
relevant information about each item. We find that employing an effective search strategy is critical to decision-making
tasks such as the WebShop benchmark.
The relatively high performance of LATS can also be explained in terms of its exploration scheme. In this framework, the agent executes a brute-force search through the
state and action space of WebShop, exploring as many as
30 trajectories before making the final purchase. While this
approach provides richer information for decision-making,
the end-to-end execution becomes prohibitively slow.
We report that our method, LLMCompiler, outperforms
LASER by an average score of 1.5. When compared to
LATS, this score is within the standard deviation range of
our method. The average score for LLMCompiler, along
with its standard deviation, is 72.8 ± 4.01 for gpt-3.5-turbo.
Further note that while the performance differences are
marginal, our method exhibits significant execution speedup,
101.7× over LATS and 2.69× over LASER.
6. Conclusions
Existing methods for invoking multiple functions with
LLMs resort to sequential and dynamic reasoning. As a
result, they suffer from inefficiencies in latency, cost, and
accuracy. As a solution, we introduced LLMCompiler,
a compiler-inspired framework that enables efficient parallel function calling across various LLMs, including open-source models like LLaMA-2 and OpenAI's GPT series. By decomposing user inputs into tasks with defined
inter-dependencies and executing these tasks concurrently
through its Planner, Task Fetching Unit, and Executor components, LLMCompiler demonstrates substantial improvements in latency (up to 3.7×), cost efficiency (up to 6.7×),
and accuracy (up to ∼9%), even outperforming OpenAI’s
parallel function calling feature in latency gains. We look
forward to future work building upon LLMCompiler that
will improve both the capabilities and efficiencies of LLMs
in executing complex, large-scale tasks, thus transforming
the future development of LLM-based applications.
Impact Statement
This paper presents research towards advancing the field of
Machine Learning. While there are many potential societal
consequences of our work, we do not find any one to be
particularly noteworthy.
Acknowledgements
We appreciate the valuable feedback from Minwoo Kang.
We acknowledge gracious support from the Furiosa team. We
also appreciate the support from Microsoft through their
Accelerating Foundation Model Research, including great
support from Sean Kuno. Furthermore, we appreciate support from Google Cloud, the Google TRC team, and specifically Jonathan Caton, and Prof. David Patterson. Prof.
Keutzer’s lab is sponsored by the Intel corporation, Intel
One-API, Intel VLAB team, the Intel One-API center of
excellence, as well as funding through BDD and BAIR. We
also appreciate support from Samsung including Dongkyun
Kim, and David Thorsley. We appreciate the great support
from Ellick Chan, Saurabh Tangri, Andres Rodriguez, and
Kittur Ganesh. Sehoon Kim and Suhong Moon would like
to acknowledge the support from the Korea Foundation for
Advanced Studies. Amir Gholami was supported through
funding from Samsung SAIT. Michael W. Mahoney would
also like to acknowledge a J. P. Morgan Chase Faculty Research Award as well as the DOE, NSF, and IARPA. Our
conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.
G. User-Supplied Examples for LLMCompiler Configuration
LLMCompiler provides a simple interface that allows for tailoring the framework to different use cases by providing tool
definitions as well as optional in-context examples for the Planner. Below, we provide the Planner example prompts that are
used to set up the framework for the Movie Recommendation and Game of 24 benchmarks with only a few lines of prompts.
G.1. Movie Recommendation Example Prompts
###
Question: Find a movie similar to Mission Impossible, The Silence of the
Lambs, American Beauty, Star Wars Episode IV - A New Hope
Options:
Austin Powers International Man of Mystery
Alesha Popvich and Tugarin the Dragon
In Cold Blood
Rosetta
1. search("Mission Impossible")
2. search("The Silence of the Lambs")
3. search("American Beauty")
4. search("Star Wars Episode IV - A New Hope")
5. search("Austin Powers International Man of Mystery")
6. search("Alesha Popvich and Tugarin the Dragon")
7. search("In Cold Blood")
8. search("Rosetta")
Thought: I can answer the question now.
9. finish()
###
H. Pre-defined LLMCompiler Planner Prompts
###
The pre-defined LLMCompiler Planner prompt provides it with specific instructions on how to break down tasks and
generate dependency graphs while ensuring that the associated syntax is formatted correctly. This prompt contains specific
rules such as assigning each task to a new line, beginning each task with a numerical identifier, and using the $ sign to
denote intermediate variables.
###
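Given these formatting rules, each plan line can be parsed with a simple pattern; the regular expression below is an illustrative sketch of such a parser, not the one shipped with LLMCompiler:

import re

TASK_LINE = re.compile(r"^(\d+)\.\s*(\w+)\((.*)\)\s*$")   # e.g. 3. math("$1 / $2")

def parse_plan(plan_text: str):
    tasks = []
    for line in plan_text.splitlines():
        match = TASK_LINE.match(line.strip())
        if not match:                          # skip "Thought: ..." and blank lines
            continue
        idx, tool, args = int(match.group(1)), match.group(2), match.group(3)
        deps = sorted({int(d) for d in re.findall(r"\$(\d+)", args)})
        tasks.append({"idx": idx, "tool": tool, "args": args, "deps": deps})
    return tasks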
In addition to user-provided functions, the Planner includes a special, hard-coded finish function. The Planner uses this
function either when the plan is sufficient to address the user query or when it can no longer proceed with planning before
executing the current plan, i.e., when it deems replanning necessary. When the Planner outputs the finish function, its
plan generation stops. Refer to Appendix G for examples of the Planner’s usage of the finish function in planning. The
definition of the finish function is as below and is included as a prompt to the Planner along with the definitions of other
user-provided functions.
###
finish():
- Collects and combines results from prior actions.
- A LLM agent is called upon invoking join to either finalize the user
query or wait until the plans are executed.
- join should always be the last action in the plan, and will be called in
two scenarios:
(a) if the answer can be determined by gathering the outputs from tasks to
generate the final response.
(b) if the answer cannot be determined in the planning phase before you
execute the plans.
###