A recent study co-authored by Apple examines whether AI agents understand the consequences of their actions, particularly in mobile user interfaces (UIs). Presented at the ACM Conference on Intelligent User Interfaces, the paper introduces a framework that not only assesses whether an AI agent can operate a UI correctly but also measures its ability to anticipate the potential impact of its actions. The researchers identified a range of risky interactions, such as sending messages or making financial transactions, and recruited participants to record actions they would find uncomfortable if an AI performed them without permission.
The framework evaluates user intent, the impact on the UI and the user, the reversibility of actions, and how frequently a task is performed. When it was used to test large language models, Google Gemini achieved 56% accuracy, while GPT-4 reached 58% when prompted with a reasoning approach. Although the study does not offer a complete solution for AI agent safety, it provides a benchmark for evaluating how well agents understand the implications of their actions, informing future AI agent development.
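The four dimensions the framework evaluates can be made concrete with a small sketch. Everything below (the `UIAction` fields, the thresholds, and the `requires_confirmation` helper) is a hypothetical illustration of those dimensions under simple assumptions, not code from the paper:

```python
from dataclasses import dataclass

@dataclass
class UIAction:
    """A UI action described along the framework's four dimensions."""
    description: str
    matches_user_intent: bool  # did the user explicitly ask for this?
    user_impact: int           # 0 = none, 1 = minor, 2 = major (e.g. money sent)
    reversible: bool           # can the action be undone afterwards?
    frequent: bool             # is this a routine, everyday task?

def requires_confirmation(action: UIAction) -> bool:
    """Flag actions an agent should not take silently.

    A conservative illustrative rule: anything the user did not intend,
    anything irreversible, or anything high-impact needs confirmation.
    """
    if not action.matches_user_intent:
        return True
    if action.user_impact >= 2 or not action.reversible:
        return True
    return False

# A routine, reversible action passes; a financial transaction does not.
scroll = UIAction("scroll a news feed", True, 0, True, True)
payment = UIAction("send $100 via a payments app", True, 2, False, False)
print(requires_confirmation(scroll))   # False
print(requires_confirmation(payment))  # True
```

The point of such a rubric is that riskiness is a property of the action itself (impact, reversibility), not just of whether the agent tapped the right button.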