Phone-Agent Harness + Benchmark

PhoneHarness: A Mixed-Action Orchestration Harness and Benchmark for Phone Agents across CLI, GUI, and MCP Tools

Name: PhoneHarness Bench
Creator: PhoneHarness

This project contributes two linked artifacts: PhoneHarness for mixed-action phone-agent execution, and PhoneHarness Bench, a benchmark built on the harness to evaluate verifiable mobile workflows.

PhoneHarness: A mixed-action execution harness that routes phone-agent work across device-side CLI, delegated GUI control, and MCP-style host tools.

PhoneHarness Bench: A benchmark constructed on top of the harness, using observable side effects and trace-backed checks to verify task completion.

1 Tencent HY Team · 2 The Chinese University of Hong Kong · 3 The Chinese University of Hong Kong, Shenzhen · 4 Tsinghua University


124 scored tasks

30 app scenarios

CLI · GUI · MCP mixed action space

Why Mixed Action Spaces?

Phone workflows are not just screen navigation.

Real mobile tasks often combine app navigation, local device operations, file handling, email, calendar, web lookup, and safety-sensitive side effects. PhoneHarness keeps the phone as the stateful execution surface, but exposes multiple action surfaces so agents can choose the right path instead of forcing every task through GUI taps.

CLI

Deterministic device actions

Run shell or Python commands inside the phone environment for local files, Android settings, device state, and lightweight scripting.

GUI

Bounded screen delegation

Delegate visual subtasks to a GUI controller while the outer agent handles planning, routing, and verification-oriented execution.

MCP / Tools

Host-side workflow tools

Use tools for search, email, calendar, documents, and other workflows where side effects should be auditable and verifiable.

PhoneHarness Bench

A benchmark built on the same harness that executes the tasks.

The benchmark is organized around mock-app tasks, real-app tasks, safety tasks, and a scored 124-task split over 30 app scenarios. Each task has a natural-language prompt, an execution protocol, trace logging, and one or more verifiers that check whether the expected side effect actually happened.
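As a rough illustration of the task structure described above, the sketch below models a task as a prompt plus a list of side-effect verifiers. The class name `Task`, the field names, the flattened `State` dictionary, and both verifier factories are hypothetical and are not taken from the PhoneHarness Bench dataset schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical: state observed after the run, flattened to lists of strings.
State = Dict[str, List[str]]
Verifier = Callable[[State], bool]


@dataclass
class Task:
    task_id: str
    prompt: str        # natural-language instruction given to the agent
    scenario: str      # app scenario the task belongs to
    verifiers: List[Verifier] = field(default_factory=list)

    def passed(self, state: State) -> bool:
        # A task without verifiers cannot be confirmed, so it does not pass.
        return bool(self.verifiers) and all(v(state) for v in self.verifiers)


def file_exists(path: str) -> Verifier:
    """Verifier factory: the expected file appears in the observed state."""
    return lambda state: path in state.get("files", [])


def event_titled(title: str) -> Verifier:
    """Verifier factory: a calendar event with this title was created."""
    return lambda state: title in state.get("calendar_titles", [])


task = Task(
    task_id="demo-001",
    prompt="Save the meeting notes to /sdcard/notes.txt and add a 'Sync' event.",
    scenario="notes+calendar",
    verifiers=[file_exists("/sdcard/notes.txt"), event_titled("Sync")],
)

done = {"files": ["/sdcard/notes.txt"], "calendar_titles": ["Sync"]}
half = {"files": ["/sdcard/notes.txt"], "calendar_titles": []}
print(task.passed(done), task.passed(half))  # True False
```

Requiring every verifier to hold means a partially completed workflow, such as the file written but the calendar event missing, is scored as a failure rather than a partial success. Whether the benchmark itself uses all-or-nothing scoring is an assumption of this sketch.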

Task distribution

The scored split covers device/system operations, single-app GUI tasks, tool-assisted workflows, and cross-app workflows.

Mixed action space over CLI, GUI, and MCP tools

Action surfaces

PhoneHarness exposes CLI, GUI, and MCP-style tools inside one phone-agent loop, enabling deterministic-first routing when appropriate.
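One way to picture deterministic-first routing is a fixed preference order over the three surfaces, with GUI as the fallback. The sketch below is a minimal illustration under that assumption. The `Surface` enum, the `CAPABILITIES` table, and the step names are hypothetical, and the actual harness may route by model decision rather than by a static table.

```python
from enum import Enum
from typing import Dict, List, Set


class Surface(Enum):
    CLI = "cli"
    MCP = "mcp"
    GUI = "gui"


# Assumed preference order: deterministic surfaces first, GUI last.
PREFERENCE: List[Surface] = [Surface.CLI, Surface.MCP, Surface.GUI]

# Hypothetical capability table: which surfaces can carry out each step kind.
CAPABILITIES: Dict[str, Set[Surface]] = {
    "read_battery": {Surface.CLI, Surface.GUI},
    "send_email": {Surface.MCP, Surface.GUI},
    "tap_in_app_menu": {Surface.GUI},
}


def route(step_kind: str) -> Surface:
    """Return the most preferred surface able to perform the step.

    Unknown step kinds are assumed to be reachable only through the GUI.
    """
    capable = CAPABILITIES.get(step_kind, {Surface.GUI})
    for surface in PREFERENCE:
        if surface in capable:
            return surface
    raise ValueError(f"no surface can handle {step_kind!r}")


for kind in ("read_battery", "send_email", "tap_in_app_menu", "unknown_step"):
    print(kind, "->", route(kind).value)
```

The point of the ordering is that a step with a scriptable or tool-based route never pays the cost of screen navigation, while steps that exist only in an app's interface still have a path.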

Experimental Findings

Mixed-action routing helps most when tasks expose verifiable alternatives.

The homepage summarizes the paper without turning individual table cells into headline claims. Detailed pass rates, model pairings, ablations, and safety-policy results are kept in the paper for the exact experimental context.

Task-type view

Typed

Results are grouped by device/system operations, single-app GUI tasks, tool-assisted workflows, and cross-app workflows rather than by a coarse difficulty label.

Action-space view

Mixed

The gains concentrate on tasks where command-line operations, GUI interaction, and MCP-style tools provide complementary routes to a verifiable side effect.

Model-combination view

Paired

Same-harness comparisons separate orchestration-model effects from delegated GUI-worker effects, with full scores reported in the paper.

Detailed numbers live in the paper.

The paper reports the exact task-type table, action-space breakdown, same-harness model-combination study, step/runtime analysis, and safety-policy table. Keeping the homepage qualitative avoids mixing table-specific scores with broader project claims.


Trace Evidence

Representative executions show GUI, CLI, and tool actions interleaving with verifier checks.

Rather than only reporting final answers, PhoneHarness records executable trajectories with selected screenshots, compact non-GUI operation cards, and verifier-side success signals. The clips below are generated from real trace artifacts.
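A trajectory of this kind could be stored as one record per step plus a closing verifier record. The following sketch assumes a JSON Lines layout. The `TraceStep` and `Trace` classes and all field names are illustrative and do not describe the actual PhoneHarness trace format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class TraceStep:
    index: int
    surface: str                      # "cli", "gui", or "mcp"
    summary: str                      # compact operation-card text
    screenshot: Optional[str] = None  # path, kept only for selected GUI steps


@dataclass
class Trace:
    task_id: str
    steps: List[TraceStep] = field(default_factory=list)
    verifier_results: Dict[str, bool] = field(default_factory=dict)

    def add(self, surface: str, summary: str,
            screenshot: Optional[str] = None) -> None:
        # Steps are numbered in the order they are recorded.
        self.steps.append(
            TraceStep(len(self.steps), surface, summary, screenshot))

    def to_jsonl(self) -> str:
        # One line per step, then one final line with verifier outcomes.
        lines = [json.dumps({"task_id": self.task_id, **asdict(s)})
                 for s in self.steps]
        lines.append(json.dumps({"task_id": self.task_id,
                                 "verifiers": self.verifier_results}))
        return "\n".join(lines)


trace = Trace("demo-001")
trace.add("mcp", "web search: medicine price")
trace.add("gui", "open pharmacy app, read listed price", "shots/step1.png")
trace.verifier_results["price_reported"] = True
print(trace.to_jsonl())
```

Keeping non-GUI steps as short text summaries and attaching screenshots only where a screen was involved mirrors the mix of operation cards and selected screenshots described above.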

MCP web search + GUI delegation + verifier

A medicine-price task is decomposed into tool lookup and GUI app exploration, then checked by composite verifier signals.

CLI-first device status probe

For deterministic phone-state queries, the agent routes directly through device-side commands instead of opening the GUI.
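To make the CLI-first idea concrete, a battery query is one example of a phone-state probe that needs no screen interaction. The sketch below parses text in the style of Android's `dumpsys battery` output. The sample text is abbreviated and illustrative, the helper names are hypothetical, and how PhoneHarness actually issues and parses device commands is not specified here.

```python
from typing import Dict

# Abbreviated, illustrative sample in the style of `dumpsys battery` output.
SAMPLE = """\
Current Battery Service state:
  AC powered: false
  USB powered: true
  status: 2
  level: 85
  scale: 100
"""


def parse_battery(text: str) -> Dict[str, str]:
    """Collect 'key: value' pairs, skipping header lines with no value."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def battery_percent(text: str) -> int:
    """Convert the reported level and scale into a rounded percentage."""
    fields = parse_battery(text)
    return round(100 * int(fields["level"]) / int(fields["scale"]))


print(battery_percent(SAMPLE))  # 85
```

A probe like this returns a structured value in one step, which is also easy for a verifier to check against the agent's final answer.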

Background GUI on a virtual display

The agent runs a GUI task on a secondary display while the user's foreground screen remains unchanged.

Representative static PhoneHarness trace gallery

Static trajectory overview

A compact gallery of representative PhoneHarness executions, keeping screenshots, non-GUI operation cards, and verifier outcomes visible in one static view.