This project contributes two linked artifacts: PhoneHarness for mixed-action phone-agent execution, and PhoneHarness Bench, a benchmark built on the harness to evaluate verifiable mobile workflows.
Real mobile tasks often combine app navigation, local device operations, file handling, email, calendar, web lookup, and safety-sensitive side effects. PhoneHarness keeps the phone as the stateful execution surface, but exposes multiple action surfaces so agents can choose the right path instead of forcing every task through GUI taps.
Run shell or Python commands inside the phone environment for local files, Android settings, device state, and lightweight scripting.
Delegate visual subtasks to a GUI controller while the outer agent handles planning, routing, and verification-oriented execution.
Use tools for search, email, calendar, documents, and other workflows where side effects should be auditable and verifiable.
The benchmark is organized around mock-app tasks, real-app tasks, safety tasks, and a scored 124-task split over 30 app scenarios. Each task has a natural-language prompt, execution protocol, trace logging, and one or more verifiers for checking whether the expected side effect actually happened.
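A minimal sketch of how a task with one or more verifiers could be represented. The names here (`Task`, `State`, `file_exists`, `event_created`, the file path, and the event title) are hypothetical illustrations, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical snapshot of post-run state (files, calendar entries, ...).
State = Dict[str, Any]
Verifier = Callable[[State], bool]


@dataclass
class Task:
    task_id: str
    prompt: str  # natural-language instruction given to the agent
    verifiers: List[Verifier] = field(default_factory=list)

    def verify(self, state: State) -> Dict[str, bool]:
        """Run every verifier and report each signal by function name."""
        return {v.__name__: v(state) for v in self.verifiers}

    def passed(self, state: State) -> bool:
        """A task passes only if it has verifiers and all of them succeed."""
        results = self.verify(state)
        return bool(results) and all(results.values())


# Two illustrative side-effect checks (assumed, not from the benchmark).
def file_exists(state: State) -> bool:
    return "/sdcard/Download/report.txt" in state.get("files", [])


def event_created(state: State) -> bool:
    return any(e.get("title") == "Team sync" for e in state.get("calendar", []))


task = Task(
    task_id="demo-001",
    prompt="Save the report and add a 'Team sync' calendar event.",
    verifiers=[file_exists, event_created],
)

# The file was written, but no calendar event exists, so the task fails.
state = {"files": ["/sdcard/Download/report.txt"], "calendar": []}
signals = task.verify(state)
```

Checking the resulting state rather than the agent's final message is what lets a verifier confirm that the expected side effect actually happened.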
The scored split covers device/system operations, single-app GUI tasks, tool-assisted workflows, and cross-app workflows.
PhoneHarness exposes CLI, GUI, and MCP-style tools inside one phone-agent loop, so the agent can take a deterministic route, such as a device-side command or a tool call, before falling back to GUI interaction.
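Deterministic-first routing can be sketched as a fixed preference order over action surfaces. This is an illustration under assumptions: the `Surface` enum, the `PREFERENCE` order (CLI, then MCP-style tool, then GUI), and the `route` function are hypothetical and do not describe the harness's actual routing logic.

```python
from enum import Enum
from typing import Optional, Set


class Surface(Enum):
    CLI = "cli"
    MCP_TOOL = "mcp_tool"
    GUI = "gui"


# Assumed preference order: deterministic surfaces first, GUI last.
PREFERENCE = (Surface.CLI, Surface.MCP_TOOL, Surface.GUI)


def route(capable: Set[Surface]) -> Optional[Surface]:
    """Return the most preferred surface that can handle a subtask."""
    for surface in PREFERENCE:
        if surface in capable:
            return surface
    return None  # no surface can handle the subtask


# A phone-state query that both a command and the GUI could answer
# is routed to the command line under this ordering.
choice = route({Surface.GUI, Surface.CLI})
```

In practice the set of capable surfaces would come from the planner's judgment about the subtask; here it is passed in directly to keep the sketch self-contained.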
The homepage summarizes the paper without turning individual table cells into headline claims. Detailed pass rates, model pairings, ablations, and safety-policy results are kept in the paper for the exact experimental context.
Results are grouped by device/system operations, single-app GUI tasks, tool-assisted workflows, and cross-app workflows rather than by a coarse difficulty label.
The gains concentrate on tasks where command-line operations, GUI interaction, and MCP-style tools provide complementary routes to a verifiable side effect.
Same-harness comparisons separate orchestration-model effects from delegated GUI-worker effects, with full scores reported in the paper.
The paper reports the exact task-type table, action-space breakdown, same-harness model-combination study, step/runtime analysis, and safety-policy table. Keeping the homepage qualitative avoids mixing table-specific scores with broader project claims.
Rather than reporting only final answers, PhoneHarness records executable trajectories with selected screenshots, compact non-GUI operation cards, and verifier-side success signals. The clips below are generated from real trace artifacts.
A medicine-price task is decomposed into tool lookup and GUI app exploration, then checked by composite verifier signals.
For deterministic phone-state queries, the agent routes directly through device-side commands instead of opening the GUI.
The agent runs a GUI task on a secondary display while the user's foreground screen remains unchanged.
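A trajectory of the kind shown in these clips could be serialized roughly as follows. The field names, surface labels, file name, and step contents are hypothetical placeholders, not the project's real trace format or data.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class TraceStep:
    index: int
    surface: str  # assumed labels: "cli", "gui", or "mcp_tool"
    summary: str  # compact operation-card text for the step
    screenshot: Optional[str] = None  # selected screenshot path, GUI steps only


@dataclass
class Trace:
    task_id: str
    steps: List[TraceStep]
    verifier_signals: Dict[str, bool]

    def to_json(self) -> str:
        """Serialize the whole trajectory, nested steps included."""
        return json.dumps(asdict(self), indent=2)


# Illustrative two-step trajectory: one tool call, one GUI step.
trace = Trace(
    task_id="demo-002",
    steps=[
        TraceStep(1, "mcp_tool", "web search for a product price"),
        TraceStep(2, "gui", "open the app and read the listed price",
                  screenshot="step_002.png"),
    ],
    verifier_signals={"price_reported": True},
)
payload = json.loads(trace.to_json())
```

Keeping non-GUI steps as short text summaries and attaching screenshots only where a screen was involved mirrors the mix of operation cards and selected screenshots described above.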
A compact gallery of representative PhoneHarness executions, keeping screenshots, non-GUI operation cards, and verifier outcomes visible in one static view.