Apple has released Ferret-UI Lite, a compact 3-billion-parameter AI model that can understand and operate application interface elements across mobile, web, and desktop environments. The research was first published on arXiv and later submitted to OpenReview. Despite its relatively small size, the system rivals significantly larger models, signaling Apple’s continued push toward efficient on-device intelligence.
Ferret-UI Lite is a multimodal large language model, meaning it interprets both the visual content and the text shown on a display. The system uses inference-time cropping: it first analyzes the complete interface, then zooms into the specific regions that contain essential icons and text. Chain-of-thought reasoning and reinforcement learning help it decide actions step by step.
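To make the two-pass cropping idea concrete, here is a minimal sketch of how such a coarse-then-fine grounding loop could be wired up. The model interfaces (predict_region, ground_element) and the padding value are hypothetical placeholders, not Apple’s actual API; only the general flow described in the paper is reflected here.

```python
# Sketch of two-pass, inference-time cropping for UI grounding.
# predict_region / ground_element are hypothetical model calls.
from dataclasses import dataclass
from typing import Callable, Tuple

from PIL import Image

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class TwoPassGrounder:
    # Pass 1: coarse localization on the full screenshot.
    predict_region: Callable[[Image.Image, str], Box]
    # Pass 2: precise grounding on the cropped, higher-resolution view.
    ground_element: Callable[[Image.Image, str], Box]
    pad: int = 32  # extra context kept around the coarse region

    def locate(self, screenshot: Image.Image, instruction: str) -> Box:
        # Look at the whole screen first to find the rough area of interest.
        l, t, r, b = self.predict_region(screenshot, instruction)
        w, h = screenshot.size
        crop_box = (max(0, l - self.pad), max(0, t - self.pad),
                    min(w, r + self.pad), min(h, b + self.pad))
        # Zoom into that crop so small icons and text stay legible.
        crop = screenshot.crop(crop_box)
        cl, ct, cr, cb = self.ground_element(crop, instruction)
        # Map the local coordinates back to the full-screen frame.
        return (crop_box[0] + cl, crop_box[1] + ct,
                crop_box[0] + cr, crop_box[1] + cb)
```

The benefit of the second pass is that tiny targets, which would be lost when a full screenshot is downscaled for a 3-billion-parameter model, are re-examined at a usable resolution.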
To address the scarcity of training data, Apple researchers built a synthetic data pipeline that uses simulated task planning and error correction. Training on these generated trajectories teaches the system to handle real interface problems, such as unexpected pop-up windows or unresponsive touch elements.
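The sketch below illustrates one way such a pipeline could generate recovery examples: roll out a planned task in a simulated UI, inject occasional failures, and record the corrective step alongside the original plan. All names and the specific failure types are illustrative assumptions, not details from the paper.

```python
# Illustrative synthetic-trajectory generator with injected failures and
# recovery steps; class and function names are hypothetical.
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    action: str                          # e.g. "tap(login_button)"
    observation: str                     # screen state after the action
    recovered_from: Optional[str] = None # failure this step corrected, if any


@dataclass
class Trajectory:
    goal: str
    steps: List[Step] = field(default_factory=list)


def simulate_task(goal: str, planned_actions: List[str],
                  failure_rate: float = 0.2) -> Trajectory:
    """Roll out a planned task, injecting simulated failures (pop-ups,
    unresponsive elements) and recording the recovery actions."""
    traj = Trajectory(goal=goal)
    for action in planned_actions:
        if random.random() < failure_rate:
            failure = random.choice(["popup_appeared", "element_unresponsive"])
            recovery = "dismiss_popup()" if failure == "popup_appeared" else f"retry {action}"
            # The recovery step becomes a training example of error correction.
            traj.steps.append(Step(action=recovery, observation="recovered",
                                   recovered_from=failure))
        traj.steps.append(Step(action=action, observation="ok"))
    return traj


if __name__ == "__main__":
    demo = simulate_task("log in to the app",
                         ["tap(username_field)", "type('alice')",
                          "tap(password_field)", "type('****')",
                          "tap(login_button)"])
    for step in demo.steps:
        print(step)
```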
On benchmarks, Ferret-UI Lite scored 91.6% on ScreenSpot-V2 and outperformed similar 3-billion-parameter agents on ScreenSpot-Pro by over 15 percentage points. While navigation success rates remain moderate, the results are notable given that competing systems can be up to 24 times larger.
The model runs entirely on-device, so sensitive screen content never has to be sent to the cloud for processing. This aligns with Apple’s privacy-focused strategy and could support deeper app-level capabilities in future Siri upgrades.
However, limitations remain: the model still struggles with complex multi-step tasks. Researcher Zhe Gan noted that the focus was on scaling down efficiently rather than building ever-larger systems. Whether Ferret-UI Lite will appear in consumer products is unclear, but it highlights Apple’s long-term vision for practical, privacy-first AI.