TaF-VLA

Tactile-Force Alignment in Vision-Language-Action Models
for Force-aware Manipulation

Yuzhe Huang1,3*, Pei Lin2,3*, Wanlin Li3*, Daohan Li3, Jiajun Li3,4, Jiaming Jiang2,3, Chenxi Xiao2†, Ziyuan Jiao3†

* These authors contributed equally. † Corresponding authors.

1 Beihang University    2 ShanghaiTech University    3 BIGAI    4 The University of Hong Kong

TaF-VLA pipeline overview.

From Data to Policy: The TaF-VLA Pipeline

To address the "force-blindness" of current VLA models, we propose a paradigm shift from tactile-vision to tactile-force alignment, realized through three stages:

(a) We deploy an automated data acquisition system (TaF-Device) to construct the TaF-Dataset, a large-scale collection of synchronized visuotactile images, 6-axis force/torque readings, and matrix force maps. Using this data, we pretrain the TaF-Adapter to align tactile observations with ground-truth force signals in a shared latent space (a minimal training sketch follows this three-stage overview).

(b) We fuse the TaF-Adapter into a VLA backbone and fine-tune the policy on real-world demonstrations enriched with force-aware language instructions (the force-aware manipulation dataset).

(c) This explicit tactile-force alignment empowers TaF-VLA to master complex force-aware manipulation tasks, such as tool use and deformable object manipulation, where traditional vision-based baselines consistently fail.
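To make stage (a) concrete, the following is a minimal pretraining sketch in PyTorch: a tactile encoder is trained so that its latents regress the synchronized 6-axis force/torque label. The module names, dimensions, and plain regression loss are illustrative assumptions, not the released implementation.

# Hedged sketch of stage (a): align tactile latents with ground-truth forces.
# Architecture, dimensions, and the MSE loss are illustrative assumptions.
import torch
import torch.nn as nn

class TactileEncoder(nn.Module):
    """Encodes a short sequence of tactile images into one latent vector."""
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(               # per-frame CNN features
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # A recurrent head makes the latent history-dependent, not per-frame.
        self.temporal = nn.GRU(64, latent_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -> latent: (B, latent_dim)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        _, h = self.temporal(feats)
        return h[-1]

encoder = TactileEncoder()
force_head = nn.Linear(256, 6)   # regresses the 6-axis force/torque label
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(force_head.parameters()), lr=1e-4
)

def pretrain_step(frames: torch.Tensor, ft_label: torch.Tensor) -> float:
    """One alignment step on a batch of synchronized (tactile, F/T) pairs."""
    loss = nn.functional.mse_loss(force_head(encoder(frames)), ft_label)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()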

Abstract

Vision-Language-Action (VLA) models have recently emerged as powerful generalists for robotic manipulation. However, due to their predominant reliance on visual modalities, they fundamentally lack the physical intuition required for contact-rich tasks that demand precise force regulation and physical reasoning. Existing attempts to incorporate vision-based tactile sensing into VLA models typically treat tactile inputs as auxiliary visual textures, thereby overlooking the underlying correlation between surface deformation and interaction dynamics.

To bridge this gap, we propose a paradigm shift from tactile-vision alignment to tactile-force alignment. Here, we introduce TaF-VLA, a framework that explicitly grounds high-dimensional tactile observations in physical interaction forces.

To facilitate this, we develop an automated tactile-force data acquisition device and curate the TaF-Dataset, comprising over 10 million synchronized tactile observations, 6-axis force/torque measurements, and matrix force maps. To align sequential tactile observations with interaction forces, we introduce the central component of our approach, the Tactile-Force Adapter (TaF-Adapter): a tactile encoder that maps tactile observation sequences to discretized latent representations. This mechanism ensures that the learned representations capture history-dependent, noise-insensitive physical dynamics rather than static visual textures. Finally, we integrate this force-aligned encoder into a VLA backbone. Extensive real-world experiments demonstrate that the TaF-VLA policy significantly outperforms state-of-the-art tactile-vision-aligned and vision-only baselines on contact-rich tasks, verifying its ability to achieve robust, force-aware manipulation through cross-modal physical reasoning.
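The "discretized latent representations" above suggest a vector-quantized bottleneck between the tactile encoder and the force-aligned latent space. Below is a minimal sketch of such a quantizer with a straight-through gradient; the codebook size, dimensions, and loss weighting are our assumptions, not the paper's specification.

# Hedged sketch of a discretized latent bottleneck (vector quantization with
# a straight-through gradient), one plausible reading of "discretized latent
# representations"; codebook size and loss weights are illustrative assumptions.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (B, dim) continuous tactile latents -> nearest codebook entries
        dists = torch.cdist(z, self.codebook.weight)          # (B, num_codes)
        idx = dists.argmin(dim=1)
        z_q = self.codebook(idx)
        # Commitment + codebook losses pull the encoder and codes together.
        loss = self.beta * ((z_q.detach() - z) ** 2).mean() \
             + ((z_q - z.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()                          # straight-through
        return z_q, idx, loss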

TaF-Device: Tactile-Force-Aligned Data Acquisition System

TaF-Device is an automated data acquisition system designed to efficiently collect aligned visuotactile observations and physical interaction forces.

By tightly synchronizing visual inputs, tactile signals, and 6-axis force/torque measurements, TaF-Device enables scalable construction of force-aware manipulation datasets with precise temporal alignment.
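As one way such synchronization could be realized in software, here is a minimal nearest-timestamp matching sketch for pairing tactile frames with 6-axis F/T samples; the stream format and the 5 ms tolerance are hypothetical, not the device's actual pipeline.

# Hedged sketch of temporal alignment across streams with different rates:
# for each tactile frame, find the nearest F/T sample by timestamp.
# Stream formats and the 5 ms tolerance are hypothetical assumptions.
import bisect

def align_streams(tactile, ft, tol_s: float = 0.005):
    """tactile/ft: lists of (timestamp_seconds, payload), each sorted by time.
    Returns (tactile_payload, ft_payload) pairs within the tolerance."""
    if not ft:
        return []
    ft_times = [t for t, _ in ft]
    pairs = []
    for t_tac, tac_payload in tactile:
        i = bisect.bisect_left(ft_times, t_tac)
        # Candidate neighbors: just before and just after the tactile stamp.
        best = min(
            (c for c in (i - 1, i) if 0 <= c < len(ft)),
            key=lambda c: abs(ft_times[c] - t_tac),
        )
        if abs(ft_times[best] - t_tac) <= tol_s:
            pairs.append((tac_payload, ft[best][1]))
    return pairs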

Overview of the TaF-Device.
Modeling and kinematic design of the TaF-Device.

Real-world demonstration of the TaF-Device during data acquisition (showing device operation only).

Force-aware Manipulation Dataset

We utilize a high-fidelity master-puppet teleoperation system to collect over 10K real-world manipulation episodes across 20+ diverse tasks, ranging from fragile object handling to tool use. Each demonstration is enriched with force-aware language directives, providing the hierarchical supervision necessary for downstream force-aware policy learning.
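To make the episode format concrete, one plausible per-step record is sketched below; every field name and shape is an illustrative assumption rather than the dataset's actual schema.

# Hedged sketch of a single timestep in a force-aware demonstration episode;
# all field names and shapes are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class ForceAwareStep:
    rgb: np.ndarray           # (H, W, 3) workspace camera image
    tactile: np.ndarray       # (h, w, 3) visuotactile sensor image
    force_torque: np.ndarray  # (6,) synchronized F/T reading
    proprio: np.ndarray       # robot joint positions
    action: np.ndarray        # teleoperated target command
    instruction: str          # force-aware language directive,
                              # e.g. "gently press the sponge"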

Representative examples from the force-aware manipulation dataset.

Jelly Slicing

TaF-VLA can cut the jelly resting on a balloon without damaging the balloon.

Tweezer Weight Pick

TaF-VLA can use tweezers to extract weights one after another.

BibTeX

@article{taf_vla,
  title   = {TaF-VLA: Tactile-Force Alignment in Vision-Language-Action Models for Force-aware Manipulation},
  author  = {Yuzhe Huang and Pei Lin and Wanlin Li and Daohan Li and Jiajun Li and Jiaming Jiang and Chenxi Xiao and Ziyuan Jiao},
  journal = {arXiv preprint},
  year    = {2026}
}