Unverified commit 0ba15ced, authored by Aymeric Roucher, committed by GitHub

Reboot Agents (#30387)



* Create CodeAgent and ReactAgent

* Fix formatting errors

* Update documentation for agents

* Add custom errors, improve logging

* Support variable usage in ReactAgent

* add messages

* Add message passing format

* Create React Code Agent

* Update

* Refactoring

* Fix errors

* Improve python interpreter

* Only non-tensor inputs should be sent to device

* Calculator tool slight refactor

* Improve docstrings

* Refactor

* Fix tests

* Fix more tests

* Fix even more tests

* Fix tests by replacing output and input types

* Fix operand type issue

* two small fixes

* EM TTS

* Fix agent running type errors

* Change text to speech tests to allow changed outputs

* Update doc with new agent types

* Improve code interpreter

* If max iterations reached, provide a real answer instead of an error

* Add edge case in interpreter

* Add safe imports to the interpreter

* Interpreter tweaks: tuples and listcomp

* Make style

* Make quality

* Add dictcomp to interpreter

* Rename ReactJSONAgent to ReactJsonAgent

* Misc changes

* ToolCollection

* Rename agent's logger to self.logger

* Add while loops to interpreter

* Update doc with new tools. still need to mention collections

* Add collections to the doc

* Small fixes on logs and interpreter

* Fix toolbox return type

* Docs + fixup

* Skip doctests

* Correct prompts with improved examples and formatting

* Update prompt

* Remove outdated docs

* Change agent to accept Toolbox object for tools

* Remove calculator tool

* Propagate removal of calculator in doc

* Fix 2 failing workflows

* Simplify additional argument passing

* AgentType audio

* Minor changes: function name, types

* Remove calculator tests

* Fix test

* Fix torch requirement

* Fix final answer tests

* Style fixes

* Fix tests

* Update docstrings with calculator removal

* Small type hint fixes

* Update tests/agents/test_translation.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update tests/agents/test_python_interpreter.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/agents/default_tools.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/agents/tools.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update tests/agents/test_agents.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/bert/configuration_bert.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/agents/tools.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/agents/speech_to_text.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update tests/agents/test_speech_to_text.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update tests/agents/test_tools_common.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* pygments

* Answer comments

* Cleaning up

* Simplifying init for all agents

* Improving prompts and making code nicer

* Style fixes

* Add multiple comparator test in interpreter

* Style fixes

* Improve BERT example in documentation

* Add examples to doc

* Fix python interpreter quality

* Logging improvements

* Change test flag to agents

* Quality fix

* Add example for HfEngine

* Improve conversation example for HfEngine

* typo fix

* Verify doc

* Update docs/source/en/agents.md
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/agents/agents.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/agents/prompts.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/agents/python_interpreter.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update docs/source/en/agents.md
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Fix style issues

* local s2t tool

---------
Co-authored-by: Cyril Kondratenko <kkn1993@gmail.com>
Co-authored-by: Lysandre <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
parent 3733391c
@@ -71,7 +71,7 @@ NOT_DEVICE_TESTS = {
    "ModelTester::test_pipeline_",
    "/repo_utils/",
    "/utils/",
-    "/tools/",
+    "/agents/",
}
# allow having multiple repository checkouts and not needing to remember to rerun
@@ -94,7 +94,7 @@ def pytest_configure(config):
    config.addinivalue_line("markers", "is_pipeline_test: mark test to run only when pipelines are tested")
    config.addinivalue_line("markers", "is_staging_test: mark test to run only in the staging environment")
    config.addinivalue_line("markers", "accelerate_tests: mark test that require accelerate")
-    config.addinivalue_line("markers", "tool_tests: mark the tool tests that are run on their specific schedule")
+    config.addinivalue_line("markers", "agent_tests: mark the agent tests that are run on their specific schedule")
    config.addinivalue_line("markers", "not_device_test: mark the tests always running on cpu")
...
@@ -23,7 +23,7 @@
      title: Load and train adapters with 🤗 PEFT
    - local: model_sharing
      title: Share your model
-    - local: transformers_agents
+    - local: agents
      title: Agents
    - local: llm_tutorial
      title: Generation with LLMs
@@ -133,8 +133,6 @@
      title: Notebooks with examples
    - local: community
      title: Community resources
-    - local: custom_tools
-      title: Custom Tools and Prompts
    - local: troubleshooting
      title: Troubleshoot
    - local: hf_quantizer
...
This diff is collapsed.
This diff is collapsed.
@@ -28,30 +28,27 @@ contains the API docs for the underlying classes.
## Agents
-We provide three types of agents: [`HfAgent`] uses inference endpoints for opensource models, [`LocalAgent`] uses a model of your choice locally and [`OpenAiAgent`] uses OpenAI closed models.
+We provide two types of agents, based on the main [`Agent`] class:
+- [`CodeAgent`] acts in one shot, generating code to solve the task, then executes it at once.
+- [`ReactAgent`] acts step by step, each step consisting of one thought, then one tool call and execution. It has two classes:
+  - [`ReactJsonAgent`] writes its tool calls in JSON.
+  - [`ReactCodeAgent`] writes its tool calls in Python code.
-### HfAgent
-[[autodoc]] HfAgent
-### LocalAgent
-[[autodoc]] LocalAgent
-### OpenAiAgent
-[[autodoc]] OpenAiAgent
-### AzureOpenAiAgent
-[[autodoc]] AzureOpenAiAgent
-### Agent
-[[autodoc]] Agent
-   - chat
-   - run
-   - prepare_for_new_chat
+### Agent
+[[autodoc]] Agent
+### CodeAgent
+[[autodoc]] CodeAgent
+### React agents
+[[autodoc]] ReactAgent
+[[autodoc]] ReactJsonAgent
+[[autodoc]] ReactCodeAgent
## Tools
@@ -63,18 +60,50 @@ We provide three types of agents: [`HfAgent`] uses inference endpoints for opens
[[autodoc]] Tool
-### PipelineTool
-[[autodoc]] PipelineTool
-### RemoteTool
-[[autodoc]] RemoteTool
+### Toolbox
+[[autodoc]] Toolbox
+### PipelineTool
+[[autodoc]] PipelineTool
### launch_gradio_demo
[[autodoc]] launch_gradio_demo
### ToolCollection
[[autodoc]] ToolCollection
## Engines
You're free to create and use your own engines with the Agents framework.
These engines have to follow this specification:
1. Follow the [messages format](../chat_templating.md) for their input (`List[Dict[str, str]]`) and return a string.
2. Stop generating outputs *before* the sequences passed in the `stop_sequences` argument.
A minimal sketch of an engine that satisfies both points is shown below.
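As an illustration, here is a hedged, minimal sketch of such an engine; the class below is purely illustrative and is not shipped with the library:

```python
from typing import Dict, List, Sequence


class ConstantEngine:
    """Toy engine: always returns the same canned text, truncated before any stop sequence."""

    def __init__(self, canned_reply: str = "Thought: nothing left to do.\nFinal answer: 42"):
        self.canned_reply = canned_reply

    def __call__(self, messages: List[Dict[str, str]], stop_sequences: Sequence[str] = ()) -> str:
        # 1. The input follows the chat messages format: a list of {"role": ..., "content": ...} dicts.
        for message in messages:
            assert set(message.keys()) == {"role", "content"}, "malformed chat message"
        # 2. Generation must stop *before* any of the requested stop sequences.
        reply = self.canned_reply
        for stop in stop_sequences:
            if stop in reply:
                reply = reply[: reply.index(stop)]
        return reply
```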
### HfEngine
For convenience, we have added an `HfEngine` that implements the points above and uses an inference endpoint to run the LLM.
```python
>>> from transformers import HfEngine
>>> messages = [
... {"role": "user", "content": "Hello, how are you?"},
... {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
... {"role": "user", "content": "No need to help, take it easy."},
... ]
>>> HfEngine()(messages, stop_sequences=["conversation"])
"That's very kind of you to say! It's always nice to have a relaxed "
```
[[autodoc]] HfEngine
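To drive an agent with an engine, pass the engine instance when building the agent. The snippet below is a rough sketch: `HfEngine` and `ReactCodeAgent` are part of this release, but the exact constructor arguments used here (`tools`, `llm_engine`) are assumptions and should be checked against the class documentation above.

```python
>>> from transformers import HfEngine, ReactCodeAgent

>>> llm_engine = HfEngine(model="meta-llama/Meta-Llama-3-8B-Instruct")
>>> agent = ReactCodeAgent(tools=[], llm_engine=llm_engine)  # tools=[] is purely illustrative
>>> agent.run("What is the result of 2 to the power of 3.7384?")
```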
## Agent Types
Agents can handle any type of object in-between tools; tools, being completely multimodal, can accept and return
@@ -94,12 +123,12 @@ These types have three specific purposes:
### AgentText
-[[autodoc]] transformers.tools.agent_types.AgentText
+[[autodoc]] transformers.agents.agent_types.AgentText
### AgentImage
-[[autodoc]] transformers.tools.agent_types.AgentImage
+[[autodoc]] transformers.agents.agent_types.AgentImage
### AgentAudio
-[[autodoc]] transformers.tools.agent_types.AgentAudio
+[[autodoc]] transformers.agents.agent_types.AgentAudio
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Transformers Agents
<Tip warning={true}>
Transformers Agents is an experimental API which is subject to change at any time. Results returned by the agents
can vary as the APIs or underlying models are prone to change.
</Tip>
Transformers Agents was introduced in Transformers version v4.29.0, building on the concept of *tools* and *agents*. You can play with it in
[this colab](https://colab.research.google.com/drive/1c7MHD-T1forUPGcC_jlwsIptOzpG3hSj).
In short, it provides a natural language API on top of transformers: we define a set of curated tools and design an
agent to interpret natural language and to use these tools. It is extensible by design; we curated some relevant tools,
but we'll show you how the system can be extended easily to use any tool developed by the community.
Let's start with a few examples of what can be achieved with this new API. It is particularly powerful when it comes
to multimodal tasks, so let's take it for a spin to generate images and read text out loud.
```py
agent.run("Caption the following image", image=image)
```
| **Input** | **Output** |
|-----------------------------------------------------------------------------------------------------------------------------|-----------------------------------|
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/beaver.png" width=200> | A beaver is swimming in the water |
---
```py
agent.run("Read the following text out loud", text=text)
```
| **Input** | **Output** |
|-------------------------------------------------------------------------------------------------------------------------|----------------------------------------------|
| A beaver is swimming in the water | <audio controls><source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tts_example.wav" type="audio/wav"> your browser does not support the audio element. </audio>
---
```py
agent.run(
"In the following `document`, where will the TRRF Scientific Advisory Council Meeting take place?",
document=document,
)
```
| **Input** | **Output** |
|-----------------------------------------------------------------------------------------------------------------------------|----------------|
| <img src="https://datasets-server.huggingface.co/assets/hf-internal-testing/example-documents/--/hf-internal-testing--example-documents/test/0/image/image.jpg" width=200> | ballroom foyer |
## Quickstart
Before being able to use `agent.run`, you will need to instantiate an agent, which is backed by a large language model (LLM).
We provide support for OpenAI models as well as open-source alternatives from BigCode and OpenAssistant. The OpenAI
models perform better (but require you to have an OpenAI API key, so they cannot be used for free); Hugging Face
provides free access to endpoints for the BigCode and OpenAssistant models.
To start with, please install the `agents` extras in order to install all default dependencies.
```bash
pip install transformers[agents]
```
To use OpenAI models, you instantiate an [`OpenAiAgent`] after installing the `openai` dependency:
```bash
pip install openai
```
```py
from transformers import OpenAiAgent
agent = OpenAiAgent(model="text-davinci-003", api_key="<your_api_key>")
```
To use BigCode or OpenAssistant, start by logging in to have access to the Inference API:
```py
from huggingface_hub import login
login("<YOUR_TOKEN>")
```
Then, instantiate the agent
```py
from transformers import HfAgent
# Starcoder
agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")
# StarcoderBase
# agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoderbase")
# OpenAssistant
# agent = HfAgent(url_endpoint="https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")
```
This is using the inference API that Hugging Face provides for free at the moment. If you have your own inference
endpoint for this model (or another one) you can replace the URL above with your URL endpoint.
<Tip>
StarCoder and OpenAssistant are free to use and perform admirably well on simple tasks. However, the checkpoints
don't hold up when handling more complex prompts. If you're facing such an issue, we recommend trying out the OpenAI
model which, while sadly not open-source, performs better at this point in time.
</Tip>
You're now good to go! Let's dive into the two APIs that you now have at your disposal.
### Single execution (run)
The single execution method is when using the [`~Agent.run`] method of the agent:
```py
agent.run("Draw me a picture of rivers and lakes.")
```
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rivers_and_lakes.png" width=200>
It automatically selects the tool (or tools) appropriate for the task you want to perform and runs them accordingly. It
can perform one or several tasks in the same instruction (though the more complex your instruction, the more likely
the agent is to fail).
```py
agent.run("Draw me a picture of the sea then transform the picture to add an island")
```
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/sea_and_island.png" width=200>
<br/>
Every [`~Agent.run`] operation is independent, so you can run it several times in a row with different tasks.
Note that your `agent` is just a large language model, so small variations in your prompt might yield completely
different results. It's important to explain as clearly as possible the task you want to perform. We go more in-depth
on how to write good prompts [here](custom_tools#writing-good-user-inputs).
If you'd like to keep a state across executions or to pass non-text objects to the agent, you can do so by specifying
variables that you would like the agent to use. For example, you could generate the first image of rivers and lakes,
and ask the model to update that picture to add an island by doing the following:
```python
picture = agent.run("Generate a picture of rivers and lakes.")
updated_picture = agent.run("Transform the image in `picture` to add an island to it.", picture=picture)
```
<Tip>
This can be helpful when the model is unable to understand your request and mixes tools. An example would be:
```py
agent.run("Draw me the picture of a capybara swimming in the sea")
```
Here, the model could interpret the request in two ways:
- Have the `text-to-image` tool generate a capybara swimming in the sea
- Or, have the `text-to-image` tool generate a capybara, then use the `image-transformation` tool to have it swim in the sea
In case you would like to force the first scenario, you could do so by passing the prompt as an argument:
```py
agent.run("Draw me a picture of the `prompt`", prompt="a capybara swimming in the sea")
```
</Tip>
### Chat-based execution (chat)
The agent also has a chat-based approach, using the [`~Agent.chat`] method:
```py
agent.chat("Generate a picture of rivers and lakes")
```
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rivers_and_lakes.png" width=200>
```py
agent.chat("Transform the picture so that there is a rock in there")
```
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rivers_and_lakes_and_beaver.png" width=200>
<br/>
This is an interesting approach when you want to keep state across instructions. It's better suited to experimentation,
and it tends to work much better on single instructions than on complex ones (which the [`~Agent.run`]
method handles better).
This method can also take arguments if you would like to pass non-text types or specific prompts.
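For instance, reusing the keyword-argument pattern from the `run` examples above (an illustrative sketch; the tools the agent selects may differ):

```py
picture = agent.chat("Generate a picture of rivers and lakes")
agent.chat("Transform the image in `picture` so that there is a rock in there", picture=picture)
```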
### ⚠️ Remote execution
For demonstration purposes and so that it could be used with all setups, we created remote executors for several
of the default tools the agent has access to for the release. These are created using
[inference endpoints](https://huggingface.co/inference-endpoints).
We have turned these off for now, but in order to see how to set up remote executor tools yourself,
we recommend reading the [custom tool guide](./custom_tools).
### What's happening here? What are tools, and what are agents?
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/diagram.png">
#### Agents
The "agent" here is a large language model, and we're prompting it so that it has access to a specific set of tools.
LLMs are pretty good at generating small samples of code, so this API takes advantage of that by prompting the
LLM to give a small sample of code performing a task with a set of tools. This prompt is then completed by the
task you give your agent and the description of the tools you give it. This way it gets access to the doc of the
tools you are using, especially their expected inputs and outputs, and can generate the relevant code.
#### Tools
Tools are very simple: they're a single function, with a name, and a description. We then use these tools' descriptions
to prompt the agent. Through the prompt, we show the agent how it would leverage tools to perform what was
requested in the query.
This is using brand-new tools and not pipelines, because the agent writes better code with very atomic tools.
Pipelines are more refactored and often combine several tasks in one. Tools are meant to be focused on
one very simple task only.
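To make this concrete, here is a hedged sketch of what such an atomic tool can look like. It mirrors the attribute layout of the default tools added in this release (`name`, `description`, `inputs`, `output_type`, and a `forward` method); the class itself is illustrative and not shipped with the library:

```py
from transformers import Tool


class ReverseTextTool(Tool):
    # These attributes end up in the prompt that describes the tool to the agent.
    name = "reverse_text"
    description = "Reverses the characters of the input text and returns the result."
    inputs = {"text": {"type": "text", "description": "The text to reverse"}}
    output_type = "text"

    def forward(self, text):
        return text[::-1]
```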
#### Code-execution?!
This code is then executed with our small Python interpreter on the set of inputs passed along with your tools.
We hear you screaming "Arbitrary code execution!" in the back, but let us explain why that is not the case.
The only functions that can be called are the tools you provided and the print function, so you're already
limited in what can be executed. You should be safe if it's limited to Hugging Face tools.
Then, we don't allow any attribute lookup or imports (which shouldn't be needed anyway for passing along
inputs/outputs to a small set of functions) so all the most obvious attacks (and you'd need to prompt the LLM
to output them anyway) shouldn't be an issue. If you want to be on the super safe side, you can execute the
run() method with the additional argument return_code=True, in which case the agent will just return the code
to execute and you can decide whether to do it or not.
The execution will stop at any line trying to perform an illegal operation or if there is a regular Python error
with the code generated by the agent.
### A curated set of tools
We identify a set of tools that can empower such agents. Here is an updated list of the tools we have integrated
in `transformers`:
- **Document question answering**: given a document (such as a PDF) in image format, answer a question on this document ([Donut](./model_doc/donut))
- **Text question answering**: given a long text and a question, answer the question in the text ([Flan-T5](./model_doc/flan-t5))
- **Unconditional image captioning**: Caption the image! ([BLIP](./model_doc/blip))
- **Image question answering**: given an image, answer a question on this image ([VILT](./model_doc/vilt))
- **Image segmentation**: given an image and a prompt, output the segmentation mask of that prompt ([CLIPSeg](./model_doc/clipseg))
- **Speech to text**: given an audio recording of a person talking, transcribe the speech into text ([Whisper](./model_doc/whisper))
- **Text to speech**: convert text to speech ([SpeechT5](./model_doc/speecht5))
- **Zero-shot text classification**: given a text and a list of labels, identify to which label the text corresponds the most ([BART](./model_doc/bart))
- **Text summarization**: summarize a long text in one or a few sentences ([BART](./model_doc/bart))
- **Translation**: translate the text into a given language ([NLLB](./model_doc/nllb))
These tools have an integration in transformers, and can be used manually as well, for example:
```py
from transformers import load_tool
tool = load_tool("text-to-speech")
audio = tool("This is a text to speech tool")
```
### Custom tools
While we identify a curated set of tools, we strongly believe that the main value provided by this implementation is
the ability to quickly create and share custom tools.
By pushing the code of a tool to a Hugging Face Space or a model repository, you're then able to leverage the tool
directly with the agent. We've added a few
**transformers-agnostic** tools to the [`huggingface-tools` organization](https://huggingface.co/huggingface-tools):
- **Text downloader**: to download a text from a web URL
- **Text to image**: generate an image according to a prompt, leveraging stable diffusion
- **Image transformation**: modify an image given an initial image and a prompt, leveraging instruct pix2pix stable diffusion
- **Text to video**: generate a small video according to a prompt, leveraging damo-vilab
The text-to-image tool we have been using since the beginning is a remote tool that lives in
[*huggingface-tools/text-to-image*](https://huggingface.co/spaces/huggingface-tools/text-to-image)! We will
continue releasing such tools on this and other organizations, to further supercharge this implementation.
The agents have by default access to tools that reside on [`huggingface-tools`](https://huggingface.co/huggingface-tools).
We explain how you can write and share your tools, as well as leverage any custom tool that resides on the Hub, in the [following guide](custom_tools).
### Code generation
So far we have shown how to use the agents to perform actions for you. However, the agent only generates code
that we then execute using a very restricted Python interpreter. In case you would like to use the code generated in
a different setting, the agent can be prompted to return the code, along with the tool definitions and the necessary imports.
For example, the following instruction
```python
agent.run("Draw me a picture of rivers and lakes", return_code=True)
```
returns the following code
```python
from transformers import load_tool
image_generator = load_tool("huggingface-tools/text-to-image")
image = image_generator(prompt="rivers and lakes")
```
that you can then modify and execute yourself.
This diff is collapsed.
@@ -18,88 +18,9 @@ rendered properly in your Markdown viewer.
<Tip warning={true}>
-Transformers Agents is an experimental API which is subject to change at any time. Results returned by the agents
-can vary as the API or the underlying models are prone to change.
-</Tip>
-To learn more about agents and tools, make sure to read the [introductory guide](../transformers_agents). This page
-contains the API docs for the underlying classes.
-## Agents
-We provide three types of agents: [`HfAgent`] uses inference endpoints for open-source models, [`LocalAgent`] uses a model of your choice locally, and [`OpenAiAgent`] uses OpenAI closed models.
-### HfAgent
-[[autodoc]] HfAgent
-### LocalAgent
-[[autodoc]] LocalAgent
-### OpenAiAgent
-[[autodoc]] OpenAiAgent
-### AzureOpenAiAgent
-[[autodoc]] AzureOpenAiAgent
-### Agent
-[[autodoc]] Agent
-   - chat
-   - run
-   - prepare_for_new_chat
-## Tools
-### load_tool
-[[autodoc]] load_tool
-### Tool
-[[autodoc]] Tool
-### PipelineTool
-[[autodoc]] PipelineTool
-### RemoteTool
-[[autodoc]] RemoteTool
-### launch_gradio_demo
-[[autodoc]] launch_gradio_demo
-## Agent Types
-Agents can handle any type of object in-between tools; tools, being completely multimodal, can accept and return
-types such as text, image, audio, and video. In order to increase compatibility between tools, as well as to
-correctly render these return values in ipython (jupyter, colab, ipython notebooks, ...), we implement wrapper classes
-around these types.
-The wrapped objects should continue to behave as they did initially; a text object should still behave as a string, and an image
-object should still behave as a `PIL.Image`.
-These types have three specific purposes:
-- Calling `to_raw` on the type should return the underlying object
-- Calling `to_string` on the type should return the object as a string: that can be the string in the case of an `AgentText`,
-but it will be the path of a serialized version of the object in other instances
-- Displaying it in an ipython kernel should display the object correctly
-### AgentText
-[[autodoc]] transformers.tools.agent_types.AgentText
-### AgentImage
-[[autodoc]] transformers.tools.agent_types.AgentImage
-### AgentAudio
-[[autodoc]] transformers.tools.agent_types.AgentAudio
+The Agents framework has significantly changed in version v4.41.0.
+This document has been removed as it was referencing an older API.
+We eagerly welcome new contributions for the updated API.
+</Tip>
\ No newline at end of file
This diff is collapsed.
@@ -18,84 +18,9 @@ rendered properly in your Markdown viewer.
<Tip warning={true}>
-Transformers Agents is an experimental API which may change at any time. Since the API or the underlying models are prone to change, the results returned by the agents may vary.
+The Agents framework has significantly changed in version v4.41.0.
+This document has been removed as it was referencing an older API.
+We eagerly welcome new contributions for the updated API.
</Tip>
-To learn more about agents and tools, make sure to read the [introductory guide](../transformers_agents). This page contains the API docs for the underlying classes.
-## Agents
-We provide three types of agents: [`HfAgent`] uses inference endpoints for open-source models, [`LocalAgent`] uses a model of your choice locally, and [`OpenAiAgent`] uses OpenAI closed models.
-### HfAgent
-[[autodoc]] HfAgent
-### LocalAgent
-[[autodoc]] LocalAgent
-### OpenAiAgent
-[[autodoc]] OpenAiAgent
-### AzureOpenAiAgent
-[[autodoc]] AzureOpenAiAgent
-### Agent
-[[autodoc]] Agent
-   - chat
-   - run
-   - prepare_for_new_chat
-## Tools
-### load_tool
-[[autodoc]] load_tool
-### Tool
-[[autodoc]] Tool
-### PipelineTool
-[[autodoc]] PipelineTool
-### RemoteTool
-[[autodoc]] RemoteTool
-### launch_gradio_demo
-[[autodoc]] launch_gradio_demo
-## Agent Types
-Agents can handle any type of object between tools; tools are multimodal and can accept and return text, image, audio, video, and other types. To increase compatibility between tools, and to correctly render these return values in ipython (jupyter, colab, ipython notebooks, etc.), we implement wrapper classes around these types.
-The wrapped objects should keep behaving as they did initially; a text object should still behave like a string, and an image object should still behave like a `PIL.Image`.
-These types have three specific purposes:
-- Calling `to_raw` on the type should return the underlying object
-- Calling `to_string` on the type should return the object as a string: that can be the string itself in the case of `AgentText`, but in other cases it will be the path of a serialized version of the object
-- Displaying it in an ipython kernel should display the object correctly
-### AgentText
-[[autodoc]] transformers.tools.agent_types.AgentText
-### AgentImage
-[[autodoc]] transformers.tools.agent_types.AgentImage
-### AgentAudio
-[[autodoc]] transformers.tools.agent_types.AgentAudio
@@ -54,6 +54,20 @@ logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
# Base objects, independent of any specific backend
_import_structure = {
+    "agents": [
+        "Agent",
+        "CodeAgent",
+        "HfEngine",
+        "PipelineTool",
+        "ReactAgent",
+        "ReactCodeAgent",
+        "ReactJsonAgent",
+        "Tool",
+        "Toolbox",
+        "ToolCollection",
+        "launch_gradio_demo",
+        "load_tool",
+    ],
    "audio_utils": [],
    "benchmark": [],
    "commands": [],
@@ -129,8 +143,8 @@ _import_structure = {
        "load_tf2_model_in_pytorch_model",
        "load_tf2_weights_in_pytorch_model",
    ],
-    "models": [],
    # Models
+    "models": [],
    "models.albert": ["ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "AlbertConfig"],
    "models.align": [
        "ALIGN_PRETRAINED_CONFIG_ARCHIVE_MAP",
@@ -1050,18 +1064,6 @@ _import_structure = {
        "SpecialTokensMixin",
        "TokenSpan",
    ],
-    "tools": [
-        "Agent",
-        "AzureOpenAiAgent",
-        "HfAgent",
-        "LocalAgent",
-        "OpenAiAgent",
-        "PipelineTool",
-        "RemoteTool",
-        "Tool",
-        "launch_gradio_demo",
-        "load_tool",
-    ],
    "trainer_callback": [
        "DefaultFlowCallback",
        "EarlyStoppingCallback",
@@ -5039,6 +5041,21 @@ else:
# Direct imports for type-checking
if TYPE_CHECKING:
    # Configuration
+    # Agents
+    from .agents import (
+        Agent,
+        CodeAgent,
+        HfEngine,
+        PipelineTool,
+        ReactAgent,
+        ReactCodeAgent,
+        ReactJsonAgent,
+        Tool,
+        Toolbox,
+        ToolCollection,
+        launch_gradio_demo,
+        load_tool,
+    )
    from .configuration_utils import PretrainedConfig
    # Data
@@ -6010,20 +6027,6 @@ if TYPE_CHECKING:
        TokenSpan,
    )
-    # Tools
-    from .tools import (
-        Agent,
-        AzureOpenAiAgent,
-        HfAgent,
-        LocalAgent,
-        OpenAiAgent,
-        PipelineTool,
-        RemoteTool,
-        Tool,
-        launch_gradio_demo,
-        load_tool,
-    )
    # Trainer
    from .trainer_callback import (
        DefaultFlowCallback,
...
@@ -24,8 +24,9 @@ from ..utils import (
_import_structure = {
-    "agents": ["Agent", "AzureOpenAiAgent", "HfAgent", "LocalAgent", "OpenAiAgent"],
-    "base": ["PipelineTool", "RemoteTool", "Tool", "launch_gradio_demo", "load_tool"],
+    "agents": ["Agent", "CodeAgent", "ReactAgent", "ReactCodeAgent", "ReactJsonAgent", "Toolbox"],
+    "llm_engine": ["HfEngine"],
+    "tools": ["PipelineTool", "Tool", "ToolCollection", "launch_gradio_demo", "load_tool"],
}
try:
@@ -34,20 +35,17 @@ try:
except OptionalDependencyNotAvailable:
    pass
else:
+    _import_structure["default_tools"] = ["FinalAnswerTool", "PythonInterpreterTool"]
    _import_structure["document_question_answering"] = ["DocumentQuestionAnsweringTool"]
-    _import_structure["image_captioning"] = ["ImageCaptioningTool"]
    _import_structure["image_question_answering"] = ["ImageQuestionAnsweringTool"]
-    _import_structure["image_segmentation"] = ["ImageSegmentationTool"]
    _import_structure["speech_to_text"] = ["SpeechToTextTool"]
-    _import_structure["text_classification"] = ["TextClassificationTool"]
-    _import_structure["text_question_answering"] = ["TextQuestionAnsweringTool"]
-    _import_structure["text_summarization"] = ["TextSummarizationTool"]
    _import_structure["text_to_speech"] = ["TextToSpeechTool"]
    _import_structure["translation"] = ["TranslationTool"]
if TYPE_CHECKING:
-    from .agents import Agent, AzureOpenAiAgent, HfAgent, LocalAgent, OpenAiAgent
-    from .base import PipelineTool, RemoteTool, Tool, launch_gradio_demo, load_tool
+    from .agents import Agent, CodeAgent, ReactAgent, ReactCodeAgent, ReactJsonAgent, Toolbox
+    from .llm_engine import HfEngine
+    from .tools import PipelineTool, Tool, ToolCollection, launch_gradio_demo, load_tool
    try:
        if not is_torch_available():
@@ -55,14 +53,10 @@ if TYPE_CHECKING:
    except OptionalDependencyNotAvailable:
        pass
    else:
+        from .default_tools import FinalAnswerTool, PythonInterpreterTool
        from .document_question_answering import DocumentQuestionAnsweringTool
-        from .image_captioning import ImageCaptioningTool
        from .image_question_answering import ImageQuestionAnsweringTool
-        from .image_segmentation import ImageSegmentationTool
        from .speech_to_text import SpeechToTextTool
-        from .text_classification import TextClassificationTool
-        from .text_question_answering import TextQuestionAnsweringTool
-        from .text_summarization import TextSummarizationTool
        from .text_to_speech import TextToSpeechTool
        from .translation import TranslationTool
else:
...
# coding=utf-8
-# Copyright 2023 HuggingFace Inc.
+# Copyright 2024 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -25,7 +25,6 @@ from ..utils import is_soundfile_availble, is_torch_available, is_vision_availab
logger = logging.get_logger(__name__)
if is_vision_available():
-    import PIL.Image
    from PIL import Image
    from PIL.Image import Image as ImageType
else:
@@ -33,6 +32,9 @@ else:
if is_torch_available():
    import torch
+    from torch import Tensor
+else:
+    Tensor = object
if is_soundfile_availble():
    import soundfile as sf
@@ -77,7 +79,7 @@ class AgentText(AgentType, str):
        return self._value
    def to_string(self):
-        return self._value
+        return str(self._value)
class AgentImage(AgentType, ImageType):
@@ -211,10 +213,7 @@ class AgentAudio(AgentType):
AGENT_TYPE_MAPPING = {"text": AgentText, "image": AgentImage, "audio": AgentAudio}
-INSTANCE_TYPE_MAPPING = {str: AgentText}
-if is_vision_available():
-    INSTANCE_TYPE_MAPPING[PIL.Image] = AgentImage
+INSTANCE_TYPE_MAPPING = {str: AgentText, float: AgentText, int: AgentText, Tensor: AgentAudio, ImageType: AgentImage}
def handle_agent_inputs(*args, **kwargs):
@@ -223,55 +222,14 @@ def handle_agent_inputs(*args, **kwargs):
    return args, kwargs
-def handle_agent_outputs(outputs, output_types=None):
-    if isinstance(outputs, dict):
-        decoded_outputs = {}
-        for i, (k, v) in enumerate(outputs.items()):
-            if output_types is not None:
-                # If the class has defined outputs, we can map directly according to the class definition
-                if output_types[i] in AGENT_TYPE_MAPPING:
-                    decoded_outputs[k] = AGENT_TYPE_MAPPING[output_types[i]](v)
-                else:
-                    decoded_outputs[k] = AgentType(v)
-            else:
-                # If the class does not have defined output, then we map according to the type
-                for _k, _v in INSTANCE_TYPE_MAPPING.items():
-                    if isinstance(v, _k):
-                        decoded_outputs[k] = _v(v)
-                if k not in decoded_outputs:
-                    decoded_outputs[k] = AgentType[v]
-    elif isinstance(outputs, (list, tuple)):
-        decoded_outputs = type(outputs)()
-        for i, v in enumerate(outputs):
-            if output_types is not None:
-                # If the class has defined outputs, we can map directly according to the class definition
-                if output_types[i] in AGENT_TYPE_MAPPING:
-                    decoded_outputs.append(AGENT_TYPE_MAPPING[output_types[i]](v))
-                else:
-                    decoded_outputs.append(AgentType(v))
-            else:
-                # If the class does not have defined output, then we map according to the type
-                found = False
-                for _k, _v in INSTANCE_TYPE_MAPPING.items():
-                    if isinstance(v, _k):
-                        decoded_outputs.append(_v(v))
-                        found = True
-                if not found:
-                    decoded_outputs.append(AgentType(v))
-    else:
-        if output_types[0] in AGENT_TYPE_MAPPING:
-            # If the class has defined outputs, we can map directly according to the class definition
-            decoded_outputs = AGENT_TYPE_MAPPING[output_types[0]](outputs)
-        else:
-            # If the class does not have defined output, then we map according to the type
-            for _k, _v in INSTANCE_TYPE_MAPPING.items():
-                if isinstance(outputs, _k):
-                    return _v(outputs)
-            return AgentType(outputs)
-    return decoded_outputs
+def handle_agent_outputs(output, output_type=None):
+    if output_type in AGENT_TYPE_MAPPING:
+        # If the class has defined outputs, we can map directly according to the class definition
+        decoded_outputs = AGENT_TYPE_MAPPING[output_type](output)
+        return decoded_outputs
+    else:
+        # If the class does not have defined output, then we map according to the type
+        for _k, _v in INSTANCE_TYPE_MAPPING.items():
+            if isinstance(output, _k):
+                return _v(output)
+        return AgentType(output)
This diff is collapsed.
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import importlib.util
import json
import math
from dataclasses import dataclass
from math import sqrt
from typing import Dict
from huggingface_hub import hf_hub_download, list_spaces
from ..utils import is_offline_mode
from .python_interpreter import LIST_SAFE_MODULES, evaluate_python_code
from .tools import TASK_MAPPING, TOOL_CONFIG_FILE, Tool
def custom_print(*args):
return " ".join(map(str, args))
BASE_PYTHON_TOOLS = {
"print": custom_print,
"range": range,
"float": float,
"int": int,
"bool": bool,
"str": str,
"round": round,
"ceil": math.ceil,
"floor": math.floor,
"log": math.log,
"exp": math.exp,
"sin": math.sin,
"cos": math.cos,
"tan": math.tan,
"asin": math.asin,
"acos": math.acos,
"atan": math.atan,
"atan2": math.atan2,
"degrees": math.degrees,
"radians": math.radians,
"pow": math.pow,
"sqrt": sqrt,
"len": len,
"sum": sum,
"max": max,
"min": min,
"abs": abs,
"list": list,
"dict": dict,
"tuple": tuple,
"set": set,
"enumerate": enumerate,
"zip": zip,
"reversed": reversed,
"sorted": sorted,
"all": all,
"any": any,
"map": map,
"filter": filter,
"ord": ord,
"chr": chr,
}
@dataclass
class PreTool:
name: str
inputs: Dict[str, str]
output_type: type
task: str
description: str
repo_id: str
HUGGINGFACE_DEFAULT_TOOLS_FROM_HUB = [
"image-transformation",
"text-to-image",
]
def get_remote_tools(logger, organization="huggingface-tools"):
if is_offline_mode():
logger.info("You are in offline mode, so remote tools are not available.")
return {}
spaces = list_spaces(author=organization)
tools = {}
for space_info in spaces:
repo_id = space_info.id
resolved_config_file = hf_hub_download(repo_id, TOOL_CONFIG_FILE, repo_type="space")
with open(resolved_config_file, encoding="utf-8") as reader:
config = json.load(reader)
task = repo_id.split("/")[-1]
tools[config["name"]] = PreTool(
task=task,
description=config["description"],
repo_id=repo_id,
name=task,
inputs=config["inputs"],
output_type=config["output_type"],
)
return tools
def setup_default_tools(logger):
default_tools = {}
main_module = importlib.import_module("transformers")
tools_module = main_module.agents
for task_name, tool_class_name in TASK_MAPPING.items():
tool_class = getattr(tools_module, tool_class_name)
default_tools[tool_class.name] = PreTool(
name=tool_class.name,
inputs=tool_class.inputs,
output_type=tool_class.output_type,
task=task_name,
description=tool_class.description,
repo_id=None,
)
return default_tools
class PythonInterpreterTool(Tool):
name = "python_interpreter"
description = "This is a tool that evaluates python code. It can be used to perform calculations."
inputs = {
"code": {
"type": "text",
"description": (
"The code snippet to evaluate. All variables used in this snippet must be defined in this same snippet, "
f"else you will get an error. This code can only import the following python libraries: {LIST_SAFE_MODULES}."
),
}
}
output_type = "text"
available_tools = BASE_PYTHON_TOOLS.copy()
def forward(self, code):
output = str(evaluate_python_code(code, tools=self.available_tools))
return output
class FinalAnswerTool(Tool):
name = "final_answer"
description = "Provides a final answer to the given problem"
inputs = {"answer": {"type": "text", "description": "The final answer to the problem"}}
output_type = "any"
def forward(self, answer):
return answer
@@ -16,10 +16,13 @@
# limitations under the License.
import re
+import numpy as np
+import torch
from ..models.auto import AutoProcessor
from ..models.vision_encoder_decoder import VisionEncoderDecoderModel
from ..utils import is_vision_available
-from .base import PipelineTool
+from .tools import PipelineTool
if is_vision_available():
@@ -28,17 +31,19 @@ if is_vision_available():
class DocumentQuestionAnsweringTool(PipelineTool):
    default_checkpoint = "naver-clova-ix/donut-base-finetuned-docvqa"
-    description = (
-        "This is a tool that answers a question about an document (pdf). It takes an input named `document` which "
-        "should be the document containing the information, as well as a `question` that is the question about the "
-        "document. It returns a text that contains the answer to the question."
-    )
+    description = "This is a tool that answers a question about an document (pdf). It returns a text that contains the answer to the question."
    name = "document_qa"
    pre_processor_class = AutoProcessor
    model_class = VisionEncoderDecoderModel
-    inputs = ["image", "text"]
-    outputs = ["text"]
+    inputs = {
+        "document": {
+            "type": "image",
+            "description": "The image containing the information. Can be a PIL Image or a string path to the image.",
+        },
+        "question": {"type": "text", "description": "The question in English"},
+    }
+    output_type = "text"
    def __init__(self, *args, **kwargs):
        if not is_vision_available():
@@ -52,6 +57,10 @@ class DocumentQuestionAnsweringTool(PipelineTool):
        decoder_input_ids = self.pre_processor.tokenizer(
            prompt, add_special_tokens=False, return_tensors="pt"
        ).input_ids
+        if isinstance(document, str):
+            img = Image.open(document).convert("RGB")
+            img_array = np.array(img).transpose(2, 0, 1)
+            document = torch.tensor(img_array)
        pixel_values = self.pre_processor(document, return_tensors="pt").pixel_values
        return {"decoder_input_ids": decoder_input_ids, "pixel_values": pixel_values}
...
@@ -14,32 +14,33 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-from typing import TYPE_CHECKING
import torch
+from PIL import Image
from ..models.auto import AutoModelForVisualQuestionAnswering, AutoProcessor
from ..utils import requires_backends
-from .base import PipelineTool
+from .tools import PipelineTool
-if TYPE_CHECKING:
-    from PIL import Image
class ImageQuestionAnsweringTool(PipelineTool):
    default_checkpoint = "dandelin/vilt-b32-finetuned-vqa"
    description = (
-        "This is a tool that answers a question about an image. It takes an input named `image` which should be the "
-        "image containing the information, as well as a `question` which should be the question in English. It "
+        "This is a tool that answers a question about an image. It "
        "returns a text that is the answer to the question."
    )
    name = "image_qa"
    pre_processor_class = AutoProcessor
    model_class = AutoModelForVisualQuestionAnswering
-    inputs = ["image", "text"]
-    outputs = ["text"]
+    inputs = {
+        "image": {
+            "type": "image",
+            "description": "The image containing the information. Can be a PIL Image or a string path to the image.",
+        },
+        "question": {"type": "text", "description": "The question in English"},
+    }
+    output_type = "text"
    def __init__(self, *args, **kwargs):
        requires_backends(self, ["vision"])
...
#!/usr/bin/env python
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from copy import deepcopy
from enum import Enum
from typing import Dict, List
from huggingface_hub import InferenceClient
class MessageRole(str, Enum):
USER = "user"
ASSISTANT = "assistant"
SYSTEM = "system"
TOOL_CALL = "tool-call"
TOOL_RESPONSE = "tool-response"
@classmethod
def roles(cls):
return [r.value for r in cls]
def get_clean_message_list(message_list: List[Dict[str, str]], role_conversions: Dict[str, str] = {}):
"""
    Creates a sanitized chat message list: checks that each message only has "role" and "content" keys, applies the
    requested role conversions, and concatenates subsequent messages with the same role into a single message.
    Args:
        message_list (`List[Dict[str, str]]`): List of chat messages.
        role_conversions (`Dict[str, str]`, *optional*): Mapping used to rename roles before merging.
"""
final_message_list = []
message_list = deepcopy(message_list) # Avoid modifying the original list
for message in message_list:
if not set(message.keys()) == {"role", "content"}:
raise ValueError("Message should contain only 'role' and 'content' keys!")
role = message["role"]
if role not in MessageRole.roles():
raise ValueError(f"Incorrect role {role}, only {MessageRole.roles()} are supported for now.")
if role in role_conversions:
message["role"] = role_conversions[role]
if len(final_message_list) > 0 and message["role"] == final_message_list[-1]["role"]:
final_message_list[-1]["content"] += "\n===\n" + message["content"]
else:
final_message_list.append(message)
return final_message_list
llama_role_conversions = {
MessageRole.SYSTEM: MessageRole.USER,
MessageRole.TOOL_RESPONSE: MessageRole.USER,
}
class HfEngine:
def __init__(self, model: str = "meta-llama/Meta-Llama-3-8B-Instruct"):
self.model = model
self.client = InferenceClient(model=self.model, timeout=120)
def __call__(self, messages: List[Dict[str, str]], stop_sequences=[]) -> str:
if "Meta-Llama-3" in self.model:
if "<|eot_id|>" not in stop_sequences:
stop_sequences.append("<|eot_id|>")
if "!!!!!" not in stop_sequences:
stop_sequences.append("!!!!!")
# Get clean message list
messages = get_clean_message_list(messages, role_conversions=llama_role_conversions)
# Get answer
response = self.client.chat_completion(messages, stop=stop_sequences, max_tokens=1500)
response = response.choices[0].message.content
# Remove stop sequences from the answer
for stop_seq in stop_sequences:
if response[-len(stop_seq) :] == stop_seq:
response = response[: -len(stop_seq)]
return response
This diff is collapsed.