The medium of the application (i.e. where or what is the application?)
Consider a task to consist of a problem or a set of problems, a solution process and a set of outputs.
The solution process is defined within some medium. This medium might be physical (e.g. pen and paper, a whiteboard, physical objects) or digital (e.g. a spreadsheet, an executable application, a script, a database, a chatbot).
When we work on a task with the assistance of an AI agent, we can ask the agent to solve the problem directly or have it indirectly use another medium to solve the problem. For example, when asking an agent to rename some files according to some pattern, we can either ask it to do this directly (e.g. “rename all files in directory X according to pattern Y”) or ask it to generate a script that will do this (e.g. “write a python script that renames all files in directory X according to pattern Y”). In the direct approach we expect the agent to interpret the pattern and apply it to the set of filenames within the model itself. The distinction between the two approaches is blurred by the existence of tools: the agent must at some point use some external tool to access the set of filenames and then perform the rename, since the LLM itself has no ability to perform any kind of action.
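To make the indirect approach concrete, here is a minimal sketch of the kind of script the agent might generate, assuming the pattern is expressed as a regular expression; the directory path, the example pattern and the helper name rename_by_pattern are hypothetical, chosen purely for illustration.

    import re
    from pathlib import Path

    def rename_by_pattern(directory: str, pattern: str, replacement: str) -> None:
        """Rename every file in `directory` whose name matches `pattern`,
        rewriting the name via regex substitution with `replacement`."""
        for path in Path(directory).iterdir():
            if path.is_file() and re.search(pattern, path.name):
                path.rename(path.with_name(re.sub(pattern, replacement, path.name)))

    # Hypothetical invocation: prefix numbered photos with "holiday_".
    rename_by_pattern("X", r"^(IMG_\d+\.jpg)$", r"holiday_\1")

Once generated, the pattern interpretation happens deterministically inside the script rather than inside the model.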
Broadly speaking, a tool allows the LLM to outsource some of the solution process, and is defined in the medium of code (for example, a script or a server process) or as an executable. Tools can exist at different levels of abstraction and generality, and this difference determines where the dividing line between LLM problem-solving (creative) and tool problem-solving (mechanical) lies. A more generic tool likely allows the LLM to solve the problem more creatively: there are many ways for an agent to solve the rename problem when presented with a tool that lets it run any unix command, and relatively few when given an explicit “rename by pattern” tool, as sketched below.

The blurring is exacerbated if we allow the agent to generate its own tools, even when not prompted to do so explicitly. For example, given the first prompt above (“rename all files…”), the agent may opt to generate and execute a script in recognition of the mechanical nature of the task. That script could then even be added to a tool library for future use. When we, or the agent, decide to generate a specialised, non-generic script to accomplish a task, even one that a series of existing tools could have completed in tandem, we are suppressing the agent’s urge to use a direct-creative approach in favour of an indirect-mechanical one (note how this is blurred by the agent needing to work creatively to generate the script itself!). I think a lot of the nuance in using agents effectively lies in finding the right balance between these direct and indirect approaches, and in giving the agent creativity and autonomy in the right places and with the right constraints.
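As a sketch of this spectrum of generality, consider the two hypothetical tool definitions below; these are illustrative signatures, not the API of any particular agent framework.

    import subprocess

    # Generic tool: the agent chooses how to solve the rename creatively
    # (a mv loop, the `rename` utility, a one-off python -c invocation, ...).
    def run_shell(command: str) -> str:
        """Execute an arbitrary shell command and return its stdout."""
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        return result.stdout

    # Specific tool: the solution process is fixed in advance; the agent's
    # only freedom lies in choosing the arguments (body as in the earlier sketch).
    def rename_by_pattern(directory: str, pattern: str, replacement: str) -> None:
        ...

With the generic tool the creative work stays in the model; with the specific tool it has already been moved into code.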
Where next?
Right now the capabilities of agents, particularly multi-modal ones, and the tooling around them are such that the indirect approach often yields better results, both in terms of predictability and efficiency: the mechanical and deterministic nature of a script or tool is preferred to the arbitrary and probabilistic nature of an LLM. It is also commonplace for the human operator, foreseeing the mechanical nature of the task, the limitations of the agent and the potential difficulty of posing the problem to the agent through a textual interface, to skip an agentic approach entirely and e.g. rename the files by hand through the traditional interface.
It’s interesting to muse on how this will evolve. It seems we are heading in a direction where agents become more capable and more universally integrated with our environments, such that the agentic approach is increasingly preferable. Over time, will the agent become the only user-facing solution process medium, though still with some internal recourse to tools, scripts, etc.? That is, a process still running on top of something we would recognise as an operating system. Eventually, will an entirely direct approach take over, where the model itself, barring some ability to persist memory and communicate (with the operator and other models), is the only medium?