ToolTalk: Benchmarking the Future of Tool-Using AI Assistants

26 May 2024

Authors:

(1) Nicholas Farn, Microsoft Corporation {nifarn@microsoft.com};

(2) Richard Shin, Microsoft Corporation {eush@microsoft.com}.

Abstract and Intro

Dataset Design

Evaluation Methodology

Experiments and Analysis

Related Work

Conclusion, Reproducibility, and References

A. Complete list of tools

B. Scenario Prompt

C. Unrealistic Queries

D. Nuances comparing prior work

ABSTRACT

Large language models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Many recent works seek to augment LLM-based assistants with external tools so they can access private or up-to-date information and carry out actions on behalf of users. To better measure the performance of these assistants, this paper introduces ToolTalk, a benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue. ToolTalk contains 28 tools grouped into 7 plugins, and includes a complete simulated implementation of each tool, allowing for fully automated evaluation of assistants that rely on execution feedback. ToolTalk also emphasizes tools that externally affect the world rather than only tools for referencing or searching information. We evaluate GPT-3.5 and GPT-4 on ToolTalk resulting in success rates of 26% and 50% respectively. Our analysis of the errors reveals three major categories and suggests some future directions for improvement.

We release ToolTalk at https://github.com/microsoft/ToolTalk.

1 INTRODUCTION

Large language models (LLMs) can perform impressive feats in natural language understanding, generation, and other tasks involving manipulation of text. With appropriate adjustments after pretraining, they can hold fluent and natural conversations with users. However, the scope of such conversations is still limited by LLMs lacking access to knowledge outside of their training data, exhibiting limited mathematical reasoning and computational abilities, and otherwise being unable to interact with the outside world.

To overcome these limitations, various prior works have proposed integrating LLM-powered chatbots with the ability to use tools such as search engines (Nakano et al., 2022), calculators, or web APIs (Mialon et al., 2023). Making meaningful progress in tool use requires relevant benchmarks and evaluation datasets that can fully exercise these systems with realistic and challenging conversations. In this paper, we introduce ToolTalk as a step towards this goal. ToolTalk consists of 78 conversations with 178 total turns, making use of 28 unique tools grouped into 7 categories, along with an evaluation methodology tailored towards measuring accurate tool use.

Several considerations informed our design of ToolTalk in order to best simulate typical conversations that a user may wish to have with an LLM-based assistant. First, we wanted to ensure that ToolTalk is conversational and allows for multiple rounds of dialogue between the user and the assistant for a single intent, reflecting how users may not always wish to formulate their full request in one utterance and can add additional qualifiers or issue corrections after receiving some feedback from the assistant. This allows us to include user intents requiring a complex series of tool invocations without having unnaturally long utterances. Second, we include a ground-truth set of tool calls that should have been made for each user utterance, suitable for use in an automated evaluation comparing against the tool calls predicted by an assistant. Third, ToolTalk includes executable implementations of every tool included in the dataset, to facilitate the evaluation of assistants that may consider results from prior tool invocations to decide which ones to make next. Fourth, ToolTalk includes tools intended to have side effects (such as sending emails, or adding/deleting calendar events), which we refer to as “action tools”, rather than only making database queries (such as searching for emails containing a particular keyword). Such action tools are necessary if the assistant is to automate the user’s tasks.
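To make this design concrete, the sketch below shows one hypothetical way a single turn, its ground-truth tool calls, and a simulated action-tool implementation could be represented. The field names, tool names, and schema here are illustrative assumptions for exposition, not ToolTalk's actual data format.

```python
# Hypothetical representation of one conversational turn with its ground-truth
# tool calls (illustrative only; not ToolTalk's actual schema).
turn = {
    "user": "Move my 3pm meeting with Alice to Friday and let her know.",
    "ground_truth_calls": [
        {"tool": "ModifyEvent", "args": {"event_id": "ev_123",
                                         "new_start": "2023-09-15T15:00"}},
        {"tool": "SendEmail", "args": {"to": "alice@example.com",
                                       "subject": "Meeting moved",
                                       "body": "Our 3pm meeting is now on Friday."}},
    ],
}

# Each tool also has a simulated, executable implementation so that assistants
# can receive execution feedback. This toy action tool mutates a simulated
# calendar database and returns the result.
def modify_event(db: dict, event_id: str, new_start: str) -> dict:
    event = db["events"][event_id]
    event["start"] = new_start
    return {"status": "ok", "event": event}
```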

We tailor our evaluation methodology to the particulars of our dataset design, going beyond common metrics like exact-match accuracy. In particular, we separately consider invocations of action and non-action tools, since incorrect invocations of action tools, such as sending a message to the wrong person, may have particularly negative effects for the user. On the other hand, if the assistant makes both correct non-action tool invocations and some incorrect extraneous ones, the extraneous ones may still provide useful information to the user (even if it’s not what the user directly requested). As such, we use tool invocation recall and incorrect action rate as the primary metrics within a single conversational turn, and define a conversation-level notion of success.
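As a rough illustration of these metrics, the sketch below computes per-turn tool invocation recall and incorrect action rate from predicted and ground-truth tool calls, plus a conversation-level success flag. The exact-match comparison and the hard-coded set of action tools are simplifying assumptions; the actual methodology may match calls more leniently (e.g., per-argument comparison).

```python
from typing import Dict, List, Tuple

# Illustrative subset of tools treated as having side effects.
ACTION_TOOLS = {"SendEmail", "ModifyEvent", "DeleteeEvent".replace("ee", "e")}

def turn_metrics(predicted: List[Dict], ground_truth: List[Dict]) -> Tuple[float, float]:
    """Per-turn metrics: recall over ground-truth calls, and the fraction of
    predicted action-tool calls that match no ground-truth call."""
    def matches(p: Dict, g: Dict) -> bool:
        # Naive exact match on tool name and arguments.
        return p["tool"] == g["tool"] and p["args"] == g["args"]

    recalled = [g for g in ground_truth if any(matches(p, g) for p in predicted)]
    recall = len(recalled) / len(ground_truth) if ground_truth else 1.0

    predicted_actions = [p for p in predicted if p["tool"] in ACTION_TOOLS]
    incorrect = [p for p in predicted_actions
                 if not any(matches(p, g) for g in ground_truth)]
    incorrect_action_rate = (len(incorrect) / len(predicted_actions)
                             if predicted_actions else 0.0)
    return recall, incorrect_action_rate

def conversation_success(per_turn: List[Tuple[float, float]]) -> bool:
    """A conversation succeeds only if every turn recalls all ground-truth
    calls and makes no incorrect action-tool calls."""
    return all(recall == 1.0 and bad_rate == 0.0 for recall, bad_rate in per_turn)
```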

We apply ToolTalk to two assistants implemented using the function-calling support of OpenAI’s Chat Completions API with the GPT-3.5 and GPT-4 models. We find that gpt-3.5-turbo-0613 and gpt-4-0613 achieve conversation-level success rates of 26% and 50% respectively, demonstrating that tool usage in a conversational setting remains difficult even for state-of-the-art models. We then conduct further analyses to determine why GPT-3.5 and GPT-4 fail on conversations. We find that both GPT-3.5 and GPT-4 can hallucinate arguments, fail to understand documentation, and even outright claim to have accomplished a task without calling any tools.
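For reference, the sketch below shows roughly how an assistant turn can be obtained through the function-calling interface of OpenAI's Chat Completions API, as used with the 0613 models evaluated here. It assumes the openai Python SDK and an illustrative function schema; the paper's actual harness, prompts, and tool definitions may differ.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative function schema (not necessarily ToolTalk's exact definition).
functions = [{
    "name": "SendEmail",
    "description": "Send an email on the user's behalf.",
    "parameters": {
        "type": "object",
        "properties": {
            "to": {"type": "string"},
            "subject": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["to", "subject", "body"],
    },
}]

messages = [
    {"role": "system", "content": "You are an assistant with access to tools."},
    {"role": "user", "content": "Email Alice that our meeting moved to Friday."},
]

response = client.chat.completions.create(
    model="gpt-4-0613",
    messages=messages,
    functions=functions,  # legacy function-calling parameter used with the 0613 models
)

# If the model chooses to call a function, its name and JSON-encoded arguments
# appear here; an evaluation harness would execute the (simulated) tool, append
# the result to the conversation, and query the model again.
print(response.choices[0].message.function_call)
```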

Our paper makes the following contributions:

• We introduce a conversational dataset for tool-using LLM-powered assistants, containing a broad range of tools and example conversations with ground truth annotations for tool invocations that allow for an automated evaluation.

• We ensure that the dataset contains multi-turn conversations requiring use of multiple tools, including tools with side effects, to better simulate how users may interact with a tool-using assistant.

• We develop an evaluation methodology which reflects the differences between tools with side effects and tools without them.

• We evaluate assistants built using GPT-3.5 and GPT-4 using our dataset and analyze their errors, finding issues such as hallucinated arguments and misunderstood documentation.

This paper is available on arxiv under CC 4.0 license.