OmniParser and Ferret-UI: New Tools for AI Understanding of User Interfaces
30/10/2024 16:15:17
Microsoft and Apple recently released tools on Hugging Face that address UI understanding for AI systems. Both convert screenshots of user interfaces into structured data that AI models can reason over, with potential applications in automated UI testing and process automation.
Microsoft's OmniParser chains two fine-tuned models to process UI screenshots. A YOLOv8-based detector locates interactive elements such as buttons and input fields, and a BLIP-2 captioner then generates a short description of what each element does. The resulting structured output helps general-purpose vision models such as GPT-4V understand what they are looking at and how to interact with it.
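The Python sketch below illustrates this detect-then-caption pattern. It is an approximation rather than OmniParser's actual code: the detector weight path is a placeholder, and a stock BLIP-2 checkpoint stands in for Microsoft's fine-tuned captioner.

```python
from PIL import Image
from ultralytics import YOLO
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Placeholder path -- the real OmniParser release ships its own fine-tuned
# YOLOv8 detector; a stock BLIP-2 checkpoint stands in for its captioner.
detector = YOLO("weights/icon_detect.pt")
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

screenshot = Image.open("screenshot.png").convert("RGB")

# Stage 1: detect interactive regions (buttons, fields, icons).
results = detector(screenshot)[0]

elements = []
for box in results.boxes.xyxy.tolist():
    x1, y1, x2, y2 = map(int, box)
    crop = screenshot.crop((x1, y1, x2, y2))

    # Stage 2: caption each crop so a downstream model knows what it does.
    inputs = processor(images=crop, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=20)
    caption = processor.decode(out[0], skip_special_tokens=True).strip()

    elements.append({"bbox": (x1, y1, x2, y2), "description": caption})

# The structured list can be serialized and handed to a model such as GPT-4V.
for el in elements:
    print(el)
```

The key design point is that all visual grounding happens before the language model gets involved: the downstream model receives coordinates and descriptions rather than raw pixels alone.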
Apple's Ferret-UI takes a different approach. Rather than preprocessing screenshots for other AI models, it is a multimodal language model trained to understand UIs directly. Apple released two versions, one built on Gemma-2B and another on Llama 3 8B. Given a screenshot, the model can identify UI elements and explain how they work without a separate detection or captioning step.
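Because Ferret-UI is a single model, using it looks like ordinary multimodal inference rather than a pipeline. The sketch below uses the generic Hugging Face loading pattern; the repo id, prompt wording, and loading flags are assumptions for illustration, so consult the actual model card for the published usage.

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# NOTE: placeholder repo id, not a confirmed Hugging Face identifier;
# the real Ferret-UI checkpoints may ship custom loading code.
model_id = "apple/FerretUI-Gemma2b"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

screenshot = Image.open("screenshot.png").convert("RGB")
prompt = "List the interactive elements in this screenshot and what each one does."

# Single forward pass: no separate detection or captioning stage is needed.
inputs = processor(images=screenshot, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```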
These tools are particularly relevant for developers and companies working on UI automation, testing, and AI assistants. OmniParser can be integrated into existing systems that use models like GPT-4V, while Ferret-UI can be used as a standalone solution. Both are available on Hugging Face with complete code and documentation.
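As a rough illustration of the OmniParser integration path, the sketch below passes a parsed element list along with the original screenshot to a GPT-4V-class model through the OpenAI Python SDK. The element list here is a hard-coded stand-in for real parser output, and the question is an arbitrary example.

```python
import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Stand-in for real OmniParser output: bounding boxes plus captions.
elements = [
    {"bbox": [120, 340, 220, 372], "description": "blue Submit button"},
    {"bbox": [120, 280, 420, 312], "description": "email address input field"},
]

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model in the GPT-4V family
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Detected UI elements:\n" + json.dumps(elements, indent=2)
                     + "\nWhich element should be clicked to submit the form?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Supplying detected coordinates alongside the image lets the vision model answer with references to concrete, clickable regions instead of vague descriptions, which is what makes this preprocessing useful for automation.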