We propose a dataset for multimodal visuo-tactile learning called Touch and Go, in which human data collectors probe objects in natural environments with tactile sensors, while recording egocentric video.
We successfully apply it to a variety of tasks: 1) self-supervised visuo-tactile feature learning, 2) the novel task of tactile-driven image stylization, i.e., making an object look as though it “feels like” a given tactile input, and 3) predicting future frames of a tactile signal from visuo-tactile inputs.
Touch and Go Dataset
We collect a dataset of natural vision-and-touch signals. Our dataset contains multimodal data
recorded by humans, who probe objects in their natural locations with a tactile sensor. To more easily
train and analyze models on this dataset, we also collect material labels and identify touch onsets.
The touch_and_go directory contains a directory of raw videos, extract_frame.py, which converts the raw videos to frames, and label.txt, which lists the material labels for touch-onset frames.
Each raw video folder in the Dataset folder consists of six items:
video.mp4: Raw RGB video recording the interaction of a human probing objects.
gelsight.mp4: Raw GelSight (tactile) video of the touched objects.
time1.npy: The recording time for each frame in ''video.mp4''.
time2.npy: The recording time for each frame in ''gelsight.mp4''.
video_frame: The folder containing all the frames in ''video.mp4'' (generated after running extract_frame.py).
gelsight_frame: The folder containing all the frames in ''gelsight.mp4'' (generated after running extract_frame.py).
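To illustrate how these files fit together, here is a minimal Python sketch that pairs extracted RGB and GelSight frames by nearest timestamp. The folder layout follows the list above, but the function name and the assumption that time1.npy / time2.npy store one timestamp per extracted frame are ours, not part of the released code.

```python
# pair_frames.py -- illustrative sketch, not part of the released code.
# Assumes the per-video layout described above: video_frame/ and gelsight_frame/
# hold the extracted frames, and time1.npy / time2.npy hold one timestamp per
# frame of video.mp4 and gelsight.mp4 respectively.
import os
import numpy as np

def pair_frames(video_dir):
    """Return (rgb_path, gelsight_path) pairs matched by nearest timestamp."""
    t_rgb = np.load(os.path.join(video_dir, "time1.npy"))   # timestamps for video.mp4 frames
    t_gel = np.load(os.path.join(video_dir, "time2.npy"))   # timestamps for gelsight.mp4 frames
    rgb_frames = sorted(os.listdir(os.path.join(video_dir, "video_frame")))
    gel_frames = sorted(os.listdir(os.path.join(video_dir, "gelsight_frame")))

    pairs = []
    for i, t in enumerate(t_rgb[: len(rgb_frames)]):
        # Nearest GelSight frame in time, clamped to the available frames.
        j = int(np.argmin(np.abs(t_gel - t)))
        j = min(j, len(gel_frames) - 1)
        pairs.append((os.path.join(video_dir, "video_frame", rgb_frames[i]),
                      os.path.join(video_dir, "gelsight_frame", gel_frames[j])))
    return pairs
```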
Applications
To evaluate the effectiveness of our dataset, we perform tasks that are designed to span a variety of application domains, including representation learning, image synthesis, and future prediction.
Tactile-driven image stylization
Both touch and sight convey material properties and geometry. A model that can successfully predict these properties from visuo-tactile data therefore ought to be able to translate between modalities.
We propose the task of tactile-driven image stylization: making an image look as though it “feels like” a given touch signal. In the figure below we show results from our model. Our model successfully manipulates
images to match tactile inputs, such as by making surfaces rougher or smoother, or by creating
“hybrid” materials (e.g., adding grass to a surface).
Figure: tactile-driven image stylization results. In each row, the images on the right are stylized to match the tactile input on the left.
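As a rough sketch of what such a translation model's interface might look like (our own illustration, not the paper's architecture), the PyTorch snippet below conditions an image-to-image generator on an embedding of a GelSight image. All module and parameter choices here are placeholders.

```python
# Minimal sketch (our own, not the paper's architecture): a generator that
# restyles an RGB image conditioned on an embedding of a GelSight image.
import torch
import torch.nn as nn

class TactileConditionedStylizer(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        # Tactile encoder: GelSight image -> global embedding.
        self.tactile_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, emb_dim))
        # Image encoder/decoder with the tactile embedding injected at the bottleneck.
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(128 + emb_dim, 128, 1)
        self.img_dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, image, gelsight):
        z = self.tactile_enc(gelsight)                 # (B, emb_dim)
        feat = self.img_enc(image)                     # (B, 128, H/4, W/4)
        z_map = z[:, :, None, None].expand(-1, -1, *feat.shape[2:])
        return self.img_dec(self.fuse(torch.cat([feat, z_map], dim=1)))

# Usage: stylized = TactileConditionedStylizer()(rgb_batch, gelsight_batch)
# where both batches are (B, 3, H, W) tensors with H, W divisible by 4.
```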
Predicting future tactile signals
We use our dataset to ask whether visual data can improve our estimates of future tactile signals: i.e., what will this object
feel like in a moment?
We predict multiple frames by autoregressively feeding our output images back to the original model.
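Below is a minimal sketch of this autoregressive rollout, assuming a placeholder model that maps the current RGB frame and tactile frame to the next tactile frame; the interface is our assumption, not the exact one used in the paper.

```python
import torch

def rollout(model, rgb_frames, tactile_frame, horizon):
    """Autoregressively predict `horizon` future tactile frames.

    model: callable taking (rgb_frame, tactile_frame) -> next tactile frame
           (a placeholder interface, not the paper's exact signature).
    rgb_frames: sequence of RGB frames covering the prediction horizon.
    tactile_frame: the last observed GelSight frame.
    """
    preds = []
    with torch.no_grad():
        for t in range(horizon):
            # Feed the previous prediction back in as the new tactile input.
            tactile_frame = model(rgb_frames[t], tactile_frame)
            preds.append(tactile_frame)
    return preds
```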
We evaluate our model for predicting future tactile signals. In the figure below, we compare a tactile-only model to a multimodal visuo-tactile model, and show that the latter obtains better performance.
By incorporating our dataset's visual signal, the model obtains a consistent performance increase across evaluation metrics and under both experimental settings. The gap widens at longer time horizons, suggesting that visual information may be especially helpful for longer-range predictions.
Comparison to Other Datasets
To help understand the differences between our dataset and those of previous work, we compare it with Object Folder 2.0, which contains virtual objects, and with two robotic datasets: Feeling of Success and VisGel.
We show examples from indoor scenes (since the other datasets do not contain outdoor scenes) and with rigid materials (since the virtual scenes do not contain deformable materials). Each row illustrates objects composed of similar materials, along with their corresponding GelSight images.
We provide qualitative examples of visual and tactile data from other datasets (left), along with examples of similar materials taken from our dataset (right).
Acknowledgements
We thank Xiaofeng Guo and Yufan Zhang for their extensive help with the GelSight sensor, and Daniel Geng, Yuexi Du, and Zhaoying Pan for helpful discussions. This work was supported in part by Cisco Systems.
The webpage template was adapted from the Colorization and HOGgles projects.
Touch and Go: Learning from Human-Collected Vision and Touch by Fengyu Yang, Chenyang Ma, Jiacheng Zhang, Jing Zhu, Wenzhen Yuan, Andrew Owens is licensed under a Creative Commons Attribution 4.0 International License