Text entry remains one of the most challenging tasks in 3D environments (VR, AR, XR). Probably the most robust method today is to type directly on a floating QWERTY keyboard with the free hands, using the index finger. This is what most XR devices currently offer (e.g., Microsoft HoloLens 2). However, moving the hand across the keyboard demands considerable physical effort over time, which makes it far from ideal. We therefore explore how the eye-tracking technology built into HMDs can be utilised to advance text entry.
During typing, users naturally lock their eyes on a key before the hand acquires it. This is a common phenomenon in eye-hand coordination: the eyes gather information for the online planning and correction of hand movements. Particularly in mid-air text entry, the user's gaze often remains on the target to confirm that the finger has reached the depth of the keyboard that triggers the selection. Thus, in many cases gaze is already part of the task, which represents an opportunity to cascade gaze with manual input to enhance typing.
In this project, we explore an entirely new idea for text entry based on combining gaze and hand motion. Users can enter text by aligning both gaze and a manual pointer at each key, as a novel alternative to existing dwell-time or explicit manual triggers. In a parallel paper, we explored the same concept, Gaze-Hand Alignment, for the selection of distant objects in menus. Here, the use case is very different: a close-range UI and repeated key selections in quick succession.
How does this work specifically? The idea is to use two pointers: the eye gaze and a hand-controlled ray. When the user brings both into alignment in their line of sight, the target that visually lies behind the alignment point is selected. For example, the index finger can be aligned with the gaze-focused key on the virtual keyboard (Figure 1a): the user (1) looks at the key of interest and (2) moves the tip of their index finger into the line of sight to the target, which immediately invokes the selection command.
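As a rough illustration of this triggering logic (a minimal sketch, not the authors' implementation), the code below casts two rays from the eye, one along the measured gaze direction and one through the fingertip, and fires a selection as soon as both rays hit the same key. The key layout, the hit radius, and all function names are assumptions made for illustration only.

```python
import numpy as np

def ray_hit_key(origin, direction, keys, key_radius=0.02):
    """Return the label of the key whose centre lies within key_radius of the ray,
    or None. `keys` maps labels to 3D centre positions in metres (hypothetical)."""
    direction = direction / np.linalg.norm(direction)
    best_label, best_dist = None, key_radius
    for label, centre in keys.items():
        v = centre - origin
        t = np.dot(v, direction)
        if t <= 0:
            continue                                  # key lies behind the ray origin
        dist = np.linalg.norm(v - t * direction)      # perpendicular distance to the ray
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def gaze_finger_select(eye_pos, gaze_dir, fingertip, keys):
    """Fire a selection when gaze and fingertip alignment agree on the same key.
    The 'hand ray' is cast from the eye through the fingertip, so moving the
    fingertip into the line of sight of the gazed key selects it immediately."""
    gazed = ray_hit_key(eye_pos, gaze_dir, keys)
    aligned = ray_hit_key(eye_pos, fingertip - eye_pos, keys)
    return gazed if gazed is not None and gazed == aligned else None
```

Called once per tracking frame, this requires neither a dwell timer nor an explicit tap gesture: the agreement of the two pointers is itself the trigger.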
To assess how users perform with the proposed selection technique, we conducted a user study of a standard text entry task. The study includes the following four conditions, as illustrated in the figure. S-Gaze&Finger and S-Gaze&Hand are the direct and indirect variants of the alignment technique. AirTap is the standard mid-air technique based on directly tapping keys, and Dwell-Typing is the standard gaze-only technique based on staring at a key for 600 ms.
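For reference, the gaze-only baseline can be sketched as a simple dwell timer: a key fires once gaze has rested on it for 600 ms. The following is a generic dwell-selection sketch under our own assumptions, not the study's actual implementation.

```python
import time

class DwellSelector:
    """Gaze-only dwell selection: a key fires after gaze rests on it for `dwell_time` s."""
    def __init__(self, dwell_time=0.6):               # 600 ms, as in the study's baseline
        self.dwell_time = dwell_time
        self.current_key = None
        self.enter_time = 0.0

    def update(self, gazed_key, now=None):
        """Call once per frame with the currently gazed key (or None); returns a fired key."""
        now = time.monotonic() if now is None else now
        if gazed_key != self.current_key:             # gaze moved to a new key: restart timer
            self.current_key = gazed_key
            self.enter_time = now
            return None
        if gazed_key is not None and now - self.enter_time >= self.dwell_time:
            self.current_key = None                   # reset so each dwell fires only once
            return gazed_key
        return None
```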

Our results indicate the advantage of the optimised S-Gaze&Finger technique. It leads to a significant reduction of effective physical motion, by more than 50%, without significant differences in speed or error rate compared to AirTap. Surprisingly, the questionnaire items on physical effort did not indicate a perceived difference, although user feedback suggested that AirTap is more taxing than the gaze-based techniques. We also found that both S-Gaze&Finger and AirTap lead to significantly higher WPM than S-Gaze&Hand and Dwell-Typing, with no differences in error rates. Dwell-Typing was rated significantly lower in physical effort, but at the cost of higher eye fatigue. Taken together, this suggests that S-Gaze&Finger is a viable alternative for mid-air text entry that substantially reduces the required physical movement without significant compromises in performance or eye fatigue.
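For context, the speed and error figures above follow standard text entry metrics. A common way to compute them (not necessarily the paper's exact procedure) is sketched below, with WPM based on the convention of five characters per word and the error rate based on the minimum string distance between the presented and transcribed phrases.

```python
def words_per_minute(transcribed: str, seconds: float) -> float:
    """Conventional text-entry WPM: (|T| - 1) characters over the trial time, 5 chars = 1 word."""
    return ((len(transcribed) - 1) / seconds) * 60.0 / 5.0

def uncorrected_error_rate(presented: str, transcribed: str) -> float:
    """Minimum string (Levenshtein) distance between the phrases, normalised by length."""
    m, n = len(presented), len(transcribed)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if presented[i - 1] == transcribed[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution
    return d[m][n] / max(m, n, 1)
```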
Abstract
With eye-tracking increasingly available in Augmented Reality, we explore how gaze can be used to assist freehand gestural text entry. Here the eyes are often coordinated with manual input across the spatial positions of the keys. Inspired by this, we investigate gaze-assisted selection-based text entry through the concept of spatial alignment of both modalities. Users can enter text by aligning both gaze and manual pointer at each key, as a novel alternative to existing dwell-time or explicit manual triggers. We present a text entry user study comparing two such alignment techniques to a gaze-only and a manual-only baseline. The results show that one alignment technique reduces physical finger movement by more than half compared to standard in-air finger typing, and is faster and exhibits less perceived eye fatigue than an eyes-only dwell-time technique. We discuss trade-offs between uni- and multimodal text entry techniques, pointing to novel ways to integrate eye movements to facilitate virtual text entry.
Mathias N. Lystbæk, Ken Pfeuffer, Jens Emil Sloth Grønbæk, and Hans Gellersen. 2022. Exploring Gaze for Assisting Freehand Selection-based Text Entry in AR. Proc. ACM Hum.-Comput. Interact. 6, ETRA, Article 141 (May 2022), 16 pages. https://doi.org/10.1145/3530882