Human vision is a highly active process driven by gaze, which directs attention to task-relevant regions through foveation and dramatically reduces the amount of visual processing required. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance efficiency and robustness. We develop GIAVA (Gaze Integrated Active-Vision ALOHA), a robot vision system that emulates human head and neck movement as well as gaze adjustment for foveated processing. Extending the AV-ALOHA robot platform, we introduce a framework for simultaneously collecting eye-tracking, perspective-control, and robot manipulation demonstration data from a human operator. We also open-source a simulation benchmark and dataset for training robot policies that incorporate human gaze. Inspired by recent work on foveated image segmentation, and given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme. Compared to uniform patch tokenization, this significantly reduces the number of tokens and, with it, the computation. For gaze estimation, we explore two approaches: the first is a two-stage model that predicts gaze independently and uses it to guide foveation and, subsequently, action prediction; the second integrates gaze into the action space, allowing the policy to jointly estimate gaze and actions end-to-end. Our results show that our foveated robot vision approach drastically reduces computational overhead and enhances robustness to background distractors. Notably, on certain high-precision tasks, foveated vision also improves performance, as reflected in higher success rates. Together, these findings suggest that human-inspired foveated visual processing offers untapped potential and merits further consideration as a useful inductive bias in robotic vision systems.
Data Collection: We use the GIAVA platform to collect bimanual robot demonstrations together with human eye-tracking data. The robot streams stereo camera images to the VR headset for visual feedback, while the headset transmits head and hand-controller poses to control the robot, along with the operator's gaze data.
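For concreteness, one recorded demonstration timestep might bundle the fields sketched below; the field names and shapes are illustrative assumptions, not the schema of the released dataset.

```python
# Hypothetical layout of a single recorded timestep from the GIAVA teleoperation setup.
# All field names and shapes are assumptions for illustration only.
from dataclasses import dataclass
import numpy as np

@dataclass
class DemoStep:
    stereo_rgb: np.ndarray   # (2, H, W, 3) left/right images streamed to the VR headset
    gaze_uv: np.ndarray      # (2,) operator gaze point in normalized image coordinates
    head_pose: np.ndarray    # (7,) headset pose (position + quaternion) driving the camera arm
    hand_poses: np.ndarray   # (2, 7) left/right controller poses driving the manipulator arms
    qpos: np.ndarray         # (D,) robot joint positions (proprioception)
    action: np.ndarray       # (A,) commanded robot action at this step
```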
Gaze Prediction: Gaze is predicted using one of two approaches: Fov-UNet, a hierarchical two-stage model that first predicts gaze with a UNet and then uses it to guide the policy, and Fov-Act, a novel end-to-end method that treats gaze as part of the robot's action space, so that the policy predicts gaze and actions together.
Tokenization: The Fov-UNet and Fov-Act methods use foveated tokenization centered on the predicted gaze. The other methods, Fine and Coarse, do not predict gaze and use standard uniform tokenization.
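The sketch below contrasts the two inference paths at a single control step. Here `unet`, `policy`, and `foveate` are hypothetical stand-ins for the trained models, and the choice to foveate Fov-Act's input with the previously predicted gaze is our assumption for illustration.

```python
# Minimal sketch (not the paper's implementation) contrasting Fov-UNet and Fov-Act.
import torch

def argmax_2d(heatmap: torch.Tensor) -> torch.Tensor:
    """Return the (row, col) of the peak of an (H, W) heatmap."""
    idx = torch.argmax(heatmap)
    w = heatmap.shape[-1]
    return torch.stack((idx // w, idx % w))

def fov_unet_step(image, proprio, unet, policy, foveate):
    """Two-stage: a UNet predicts a gaze heatmap, its peak guides foveation,
    and the policy then predicts actions from the foveated tokens."""
    gaze = argmax_2d(unet(image))      # predicted gaze location
    tokens = foveate(image, gaze)      # foveated patch tokens around the gaze
    action = policy(tokens, proprio)
    return gaze, action

def fov_act_step(image, proprio, prev_gaze, policy, foveate):
    """End-to-end: gaze is part of the action space, so the policy outputs the
    next gaze point and the robot action jointly (foveating with the previous
    step's gaze is an assumption of this sketch)."""
    tokens = foveate(image, prev_gaze)
    gaze, action = policy(tokens, proprio)
    return gaze, action
```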
Policy Architecture: We use a transformer-based flow matching policy. Image observations $O_{\text{img}}$ are tokenized, processed by a ViT, and compressed with a Q-Former module into tokens $c_{\text{img}}$, which condition the Flow Transformer (FT) via cross-attention. Proprioception is encoded by an MLP into tokens $c_{\text{proprio}}$ and added to the FT input sequence. The flow-matching timestep $t$ is embedded and conditions the FT via AdaLN. The FT predicts the flow matching velocity $v_\theta$ from the noisy action latent $z_t$, $c_{\text{img}}$, $c_{\text{proprio}}$, and $t$, and actions are generated via Euler integration.
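As a rough illustration of the last step, the sketch below generates an action chunk by Euler-integrating the learned velocity field; the number of integration steps and the `flow_transformer` call signature are assumptions rather than the paper's exact configuration.

```python
# Minimal flow-matching sampler: integrate the predicted velocity from noise (t = 0)
# to an action sample (t = 1) with fixed-step Euler integration. Interfaces are assumed.
import torch

@torch.no_grad()
def sample_actions(flow_transformer, c_img, c_proprio, action_shape, n_steps=10):
    z = torch.randn(action_shape)                     # noisy action latent z_t at t = 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((action_shape[0],), i * dt)    # current flow timestep per batch element
        v = flow_transformer(z, t, c_img, c_proprio)  # predicted velocity v_theta
        z = z + dt * v                                # Euler update toward the data distribution
    return z                                          # generated action chunk
```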
(Left) The input image is divided into patches using either the standard uniform tokenization (Middle) or foveated tokenization (Right). Foveated tokenization mimics the human retina by assigning high-resolution patches near the gaze point and lower resolution in the periphery. This reduces the number of tokens from 324 (uniform) to just 20 (foveated), greatly lowering computational cost while preserving detail where it matters most.
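To make the tokenization concrete, here is a hypothetical gaze-centered multi-resolution tokenizer. The crop sizes, shared resampling resolution, and resulting token count are illustrative assumptions and do not reproduce the paper's exact scheme (which yields 20 tokens per image); practical implementations also typically avoid re-tokenizing the central region inside the peripheral rings.

```python
# Illustrative foveated tokenization: concentric crops around the gaze point are
# resampled to a common small resolution, so peripheral regions contribute coarser
# detail, then split into a fixed patch budget. Sizes and counts are assumptions.
import torch
import torch.nn.functional as F

def foveated_tokens(image, gaze_xy, patch=16, crop_sizes=(64, 128, 256)):
    """image: (C, H, W), assumed at least max(crop_sizes) pixels per side;
    gaze_xy: (x, y) in pixels. Returns (N, C * patch * patch) flattened tokens."""
    C, H, W = image.shape
    tokens = []
    for size in crop_sizes:
        # Square window of `size` pixels centered on the gaze point, clamped to the image.
        x0 = int(min(max(gaze_xy[0] - size // 2, 0), W - size))
        y0 = int(min(max(gaze_xy[1] - size // 2, 0), H - size))
        crop = image[:, y0:y0 + size, x0:x0 + size]
        # Resample every ring to the same resolution: larger (peripheral) windows
        # end up represented at proportionally lower detail.
        crop = F.interpolate(crop[None], size=(64, 64), mode="bilinear",
                             align_corners=False)[0]
        # Split into non-overlapping patches and flatten each patch into one token.
        p = crop.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, 4, 4, patch, patch)
        tokens.append(p.permute(1, 2, 0, 3, 4).reshape(-1, C * patch * patch))
    return torch.cat(tokens, dim=0)   # 3 rings x 16 patches = 48 tokens in this sketch
```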
@misc{chuang2025lookfocusactefficient,
title={Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers},
author={Ian Chuang and Andrew Lee and Dechen Gao and Jinyu Zou and Iman Soltani},
year={2025},
eprint={2507.15833},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2507.15833},
}