HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
- Solving complicated AI tasks with different domains and modalities is a key step toward artificial general intelligence (AGI). While there are abundant AI models available for different domains and modalities, they cannot handle complicated AI tasks. Considering large language models (LLMs) have exhibited exceptional ability in language understanding, generation, interaction, and reasoning, we advocate that LLMs could act as a controller to manage existing AI models to solve complicated AI tasks and language could be a generic interface to empower this. Based on this philosophy, we present HuggingGPT, a system that leverages LLMs (e.g., ChatGPT) to connect various AI models in machine learning communities (e.g., HuggingFace) to solve AI tasks. Specifically, we use ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in HuggingFace, execute each subtask with the selected AI model, and summarize the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in HuggingFace, HuggingGPT is able to cover numerous sophisticated AI tasks in different modalities and domains and achieve impressive results in language, vision, speech, and other challenging tasks, which paves a new way towards AGI.
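For a sense of the shape of that pipeline, here is a minimal, hypothetical sketch of the four stages the abstract describes (task planning, model selection, task execution, response generation). The function names and the llm() stub are placeholders, not the authors' implementation.

```python
# Illustrative sketch of HuggingGPT's four stages. The llm() helper and the
# task/model structures are placeholders, not the authors' actual code.

def llm(prompt: str) -> str:
    """Stand-in for a ChatGPT call; replace with a real API request."""
    return "[LLM response for: " + prompt[:40] + "...]"

def plan_tasks(user_request: str) -> list[str]:
    # Stage 1: ask the LLM to decompose the request into subtasks.
    return llm(f"Split this request into subtasks: {user_request}").splitlines()

def select_model(subtask: str, model_cards: dict[str, str]) -> str:
    # Stage 2: pick a model whose HuggingFace description fits the subtask.
    choices = "\n".join(f"{name}: {desc}" for name, desc in model_cards.items())
    return llm(f"Choose the best model for '{subtask}' from:\n{choices}")

def execute(subtask: str, model_name: str) -> str:
    # Stage 3: run the selected model; here just a placeholder result.
    return f"result of {model_name} on '{subtask}'"

def respond(user_request: str, results: list[str]) -> str:
    # Stage 4: have the LLM summarize the execution results for the user.
    return llm(f"Summarize these results for '{user_request}': {results}")

if __name__ == "__main__":
    cards = {"facebook/detr-resnet-50": "object detection",
             "openai/whisper-base": "speech recognition"}
    request = "Describe what is happening in this audio clip of a street scene."
    subtasks = plan_tasks(request)
    results = [execute(t, select_model(t, cards)) for t in subtasks]
    print(respond(request, results))
```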
I’ve been working on creating an interactive version of my book using GPT. This has entailed splitting the book into one text file per chapter, then trying out different GPT versions to produce summaries. It has been far more interesting than I expected, and it has some implications for foundation models.
The GPT versions I’ve been using are Davinci-003, GPT-3.5-turbo, and GPT-4, and they each have distinct “personalities.” Since I’m having them summarize my book, I know the subject matter quite well, so I can get a sense of how well these models condense something like 400 words down to 100. Overall, I like Davinci-003 best for capturing the feeling of my writing, and GPT-4 for getting more of the details. GPT-3.5 falls in the middle, so that’s the one I’m using.
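As a rough illustration of the setup, here is a sketch of the per-chapter comparison using the OpenAI Python client (v1-style interface). Model names and availability change over time; Davinci-003 in particular uses the older completions endpoint and has since been retired, so treat this as a sketch rather than something to run verbatim.

```python
# Summarize one chapter with several models and compare the results side by side.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
PROMPT = "Summarize the following ~400-word passage in about 100 words:\n\n{chapter}"

def summarize_chat(model: str, chapter: str) -> str:
    # Chat models (GPT-3.5-turbo, GPT-4) use the chat completions endpoint.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(chapter=chapter)}],
    )
    return resp.choices[0].message.content

def summarize_completion(model: str, chapter: str) -> str:
    # Davinci-003 used the legacy completions endpoint.
    resp = client.completions.create(
        model=model,
        prompt=PROMPT.format(chapter=chapter),
        max_tokens=200,
    )
    return resp.choices[0].text

chapter_text = open("chapters/chapter_01.txt").read()
summaries = {
    "text-davinci-003": summarize_completion("text-davinci-003", chapter_text),
    "gpt-3.5-turbo": summarize_chat("gpt-3.5-turbo", chapter_text),
    "gpt-4": summarize_chat("gpt-4", chapter_text),
}
for model, summary in summaries.items():
    print(f"--- {model} ---\n{summary}\n")
```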
They all get some details wrong, but in aggregate they are better than any single summary. That is some nice support for the idea that an ensemble of foundation models is more resilient than any single model, and it suggests a path to building resilient foundation-model systems: keep some of the older models around and use them as an ensemble when the risks are greater.
Multiple responses also help with hallucinations. One example I like to use to show this is the prompt “23, 24, 25” and seeing what the model generates. Most often the response continues the series for a while, but then it will usually start to generate code, for example “23, 24, 25, 26, 27, 28];”, placing the closing bracket and semicolon as if the numbers were an array in a line of software. It has started to hallucinate that it is writing code.
The thing is, the only elements that all the models agree on, when the same prompt is run multiple times, are the elements most likely to be trustworthy. For a model, the “truth” is the common denominator, while hallucinations tend to be unique to a single run.
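A quick way to see this in practice is to sample the same prompt several times and keep only what every continuation agrees on. This is a rough sketch, assuming the OpenAI Python client, with whitespace-token agreement as a crude stand-in for a real comparison; the shared continuation of the series tends to survive the intersection, while run-specific artifacts like the “];” do not.

```python
# Sample the "23, 24, 25" prompt repeatedly and intersect the responses.
from openai import OpenAI

client = OpenAI()

def continuations(prompt: str, n: int = 5, model: str = "gpt-3.5-turbo") -> list[str]:
    outs = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
        )
        outs.append(resp.choices[0].message.content)
    return outs

def common_elements(outputs: list[str]) -> set[str]:
    # Tokens that appear in every sample are the "common denominator".
    token_sets = [set(o.split()) for o in outputs]
    return set.intersection(*token_sets)

samples = continuations("23, 24, 25")
print("Shared across all samples:", common_elements(samples))
```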
This approach makes systems more resilient at the cost of keeping the old systems online. It doesn’t address how a deliberate attack on a foundation model would be handled; after all, an adversary would still have exploits for the earlier models and could apply them as well.
Still…
If all the models lined up and started to do very similar things, that could be a sign that something fishy was going on, and a cue for the human operators of these systems to start looking for nefarious activity.
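As a toy version of that cue, you could track how similar the models’ answers to the same prompt are and flag a sudden jump in agreement. The baseline and threshold below are hypothetical placeholders, and SequenceMatcher stands in for whatever similarity measure a real monitoring pipeline would use.

```python
# Flag unusually high cross-model agreement as something for a human to review.
from difflib import SequenceMatcher
from itertools import combinations

def mean_pairwise_similarity(answers: list[str]) -> float:
    # Average similarity over every pair of model answers.
    pairs = list(combinations(answers, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

BASELINE = 0.55   # hypothetical historical average agreement across the ensemble
THRESHOLD = 0.90  # hypothetical alert level

answers = ["...model A answer...", "...model B answer...", "...model C answer..."]
score = mean_pairwise_similarity(answers)
if score > THRESHOLD and score > BASELINE * 1.3:
    print(f"Unusually high cross-model agreement ({score:.2f}); worth a human look.")
```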