Touch and Go: Learning from Human-Collected Vision and Touch

Fengyu Yang*
Chenyang Ma*
Jiacheng Zhang
Jing Zhu
Wenzhen Yuan
Andrew Owens

NeurIPS 2022 Datasets and Benchmarks

[Paper]       [Dataset]       [Code]

Quick Jump:

  1. Overview
  2. Examples from our Dataset
  3. Dataset
  4. Applications
  5. Comparison to Other Datasets


We propose a dataset for multimodal visuo-tactile learning called Touch and Go, in which human data collectors probe objects in natural environments with tactile sensors, while recording egocentric video.
We successfully apply it to a variety of tasks: 1) self-supervised visuo-tactile feature learning, 2) the novel task of tactile-driven image stylization, i.e., making an object look as though it were ''felt like'' a given tactile input, and 3) predicting future frames of a tactile signal from visuo-tactile inputs.

Human-Powered Data Collection

                      Vision                                         Touch

Untrimmed video examples from our dataset

Two people collecting data

Examples from our Dataset


[Download Link]

Labeled Video Data


RGB Video                Gelsight Video
Synthetic Fabric


RGB Video                Gelsight Video

Touch and Go Dataset       We collect a dataset of natural vision-and-touch signals. Our dataset contains multimodal data recorded by humans, who probe objects in their natural locations with a tactile sensor. To more easily train and analyze models on this dataset, we also collect material labels and identify touch onsets.
The touch_and_go directory contains a directory of raw videos, that convert raw videos to frames, and label.txt of material labels for onset frames.

Each raw video folder in the Dataset folder consists of six items:

  video.mp4: Raw RGB video recording the interaction of human probing objects.
  gelsight.mp4: Raw GelSight (tactile) video for objects.
  time1.npy: The recording time for each frame in ''video.mp4''.
  time2.npy: The recording time for each frame in ''gelsight.mp4''.
  video_frame: The folder containing all the frames in ''video.mp4''. (Generated after running
  gelsight_frame: The folder containing all the frames in ''gelsight.mp4''. (Generated after running


To evaluate the effectiveness of our dataset, we perform tasks that are designed to span a variety of application domains, including representation learning, image synthesis, and future prediction.

Tactile-driven image stylization

Both touch and sight convey material properties and geometry. A model that can successfully predict these properties from visuo-tactile data therefore ought to be able to translate between modalities. We propose the task of tactile-driven image stylization: making an image look as though it “feels like” a given touch signal. In the figure below we show results from our model. Our model successfully manipulates images to match tactile inputs, such as by making surfaces rougher or smoother, or by creating “hybrid” materials (e.g., adding grass to a surface).

Hover over each row to make the right pictures look like the left tactile input.

             Conditional tactile example

                    Input Image

                    Input Image

                    Input Image

                    Input Image

Multimodal video prediction

We use our dataset to ask whether visual data can improve our estimates of future tactile signals: i.e., what will this object feel like in a moment?
We predict multiple frames by autoregressively feeding our output images back to the original model. We evaluate our model for predicting future tactile signals. In the figure below, we compare a tactile-only model to a multimodal visuo-tactile model, and show that the latter obtains better performance. By incorporating our dataset's visual signal, the model gains a constant performance increase under different evaluation metrics, under both experimental settings. The gap becomes larger for longer time horizon, suggesting that visual information may be more helpful in this case.

Comparison to Other Datasets

To help understand the differences between our dataset and those of previous work: Object Folder 2.0, which contains virtual objects, and two robotic datasets: Feeling of Success, and VisGel. We show examples from indoor scenes, since the other datasets do not contain outdoor scenes, and with rigid materials (since the virtual scenes do not contain deformable materials). Each row illustrates objects which are composed of similar materials, along with their corresponding GelSight images.

We provide qualitative examples of visual and tactile data from other datasets (left), along with examples from similar material taken from our dataset (right).


We thank Xiaofeng Guo and Yufan Zhang for the extensive help with the GelSight sensor, and thank Daniel Geng, Yuexi Du and Zhaoying Pan for the helpful discussions. This work was supported in part by Cisco Systems. The webpage template was adopted from Colorization project and HOGgles project.

Creative Commons License
Touch and Go: Learning from Human-Collected Vision and Touch by Fengyu Yang, Chenyang Ma, Jiacheng Zhang, Jing Zhu, Wenzhen Yuan, Andrew Owens is licensed under a Creative Commons Attribution 4.0 International License