Authors: Neshov, N. N., Tonchev, K., Manolova, A. H., Petkova, R. R., Bozhilov, I. B.
Title: Visual-to-Tactile Cross-Modal Generation Using a Class-Conditional GAN with Multi-Scale Discriminator and Hybrid Loss
Keywords: augmented reality, conditional Generative Adversarial Network (cGAN), cross-modal generation, haptic feedback, hybrid loss, LMT-108 dataset, multi-scale discriminator, texture-to-tactile translation

Abstract: Understanding surface textures through visual cues is crucial for applications in haptic rendering and virtual reality. However, accurately translating visual information into tactile feedback remains a challenging problem. To address this challenge, this paper presents a class-conditional Generative Adversarial Network (cGAN) for cross-modal translation from texture images to vibrotactile spectrograms, using samples from the LMT-108 dataset. The generator is adapted from pix2pix and enhanced with Conditional Batch Normalization (CBN) at the bottleneck to incorporate texture class semantics. A dedicated label predictor, based on DenseNet-201 and trained separately prior to cGAN training, provides the conditioning label. The discriminator is derived from pix2pixHD and uses a multi-scale architecture with three discriminators, each comprising three downsampling layers. A grid search over multi-scale discriminator configurations shows that this setup yields the best perceptual similarity as measured by the Learned Perceptual Image Patch Similarity (LPIPS) metric. The generator is trained with a hybrid loss that combines adversarial, L1, and feature-matching losses derived from intermediate discriminator features, while the discriminators are trained with the standard adversarial loss. Quantitative evaluation with LPIPS and the Fréchet Inception Distance (FID) confirms that the generated spectrograms are closer to real ones than those produced by competing models. Grad-CAM visualizations highlight the benefit of class conditioning. The proposed model outperforms pix2pix, pix2pixHD, Residue-Fusion GAN, and several ablated versions of itself. The generated spectrograms can be converted into vibrotactile signals using the Griffin–Lim algorithm, enabling applications in haptic feedback and virtual material simulation.
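The final step the abstract describes, inverting a generated magnitude spectrogram back to a vibrotactile waveform with the Griffin–Lim algorithm, can be sketched as follows. This is a minimal NumPy/SciPy illustration of the algorithm itself, not the authors' implementation; the STFT parameters (`n_fft`, `hop`, `n_iter`) are placeholder values and would need to match whatever settings were used to compute the spectrograms in the paper.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=256, hop=64, n_iter=32, seed=0):
    """Estimate a waveform whose STFT magnitude matches `mag`
    (Griffin & Lim, 1984): alternate inverse and forward STFTs,
    keeping the target magnitudes fixed and updating only the phase."""
    rng = np.random.default_rng(seed)
    # Start from a random phase estimate with the same shape as `mag`.
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        # Synthesise a waveform from the current magnitude/phase estimate ...
        _, x = istft(mag * phase, nperseg=n_fft, noverlap=n_fft - hop)
        # ... re-analyse it, and retain only the phase of the result.
        _, _, Z = stft(x, nperseg=n_fft, noverlap=n_fft - hop)
        phase = np.exp(1j * np.angle(Z))
    _, x = istft(mag * phase, nperseg=n_fft, noverlap=n_fft - hop)
    return x
```

Given a magnitude spectrogram `mag` computed with the same STFT settings, `griffin_lim(mag)` returns a waveform whose short-time magnitude approximates `mag`; increasing `n_iter` generally improves phase consistency at the cost of runtime.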

References

  1. Zhang D. Tron R. Khurshid R.P. Haptic feedback improves human-robot agreement and user satisfaction in shared-autonomy teleoperation Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA) Xi’an, China 30 May–5 June 2021 IEEE Piscataway, NJ, USA 2021 3306 3312
  2. Gani A. Pickering O. Ellis C. Sabri O. Pucher P. Impact of haptic feedback on surgical training outcomes: A randomised controlled trial of haptic versus non-haptic immersive virtual reality training Ann. Med. Surg. 2022 83 104734 10.1016/j.amsu.2022.104734
  3. Hiemstra E. Terveer E.M. Chmarra M.K. Dankelman J. Jansen F.W. Virtual reality in laparoscopic skills training: Is haptic feedback replaceable? Minim. Invasive Ther. Allied Technol. 2011 20 179 184 10.3109/13645706.2010.532502 21438717
  4. Gayathri R. Nam S. Enhancing User Experience in Virtual Museums: Impact of Finger Vibrotactile Feedback Appl. Sci. 2024 14 6593 10.3390/app14156593
  5. Li D. Xiong Q. Zhou X. Yeow R.C.H. A Novel Kinesthetic Haptic Feedback Device Driven by Soft Electrohydraulic Actuators arXiv 2024 10.48550/arXiv.2411.18387 2411.18387
  6. Li X. Liu H. Zhou J. Sun F. Learning cross-modal visual-tactile representation using ensembled generative adversarial networks Cogn. Comput. Syst. 2019 1 40 44 10.1049/ccs.2018.0014
  7. Li Y. Zhu J.Y. Tedrake R. Torralba A. Connecting touch and vision via cross-modal prediction Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Long Beach, CA, USA 15–20 June 2019 10609 10618
  8. Zhong S. Albini A. Jones O.P. Maiolino P. Posner I. Touching a nerf: Leveraging neural radiance fields for tactile sensory data generation Proceedings of the Conference on Robot Learning Auckland, New Zealand 14–18 December 2022 PMLR Cambridge, MA, USA 2023 1618 1628
  9. Yang F. Ma C. Zhang J. Zhu J. Yuan W. Owens A. Touch and go: Learning from human-collected vision and touch arXiv 2022 10.48550/arXiv.2211.12498 2211.12498
  10. Luo S. Yuan W. Adelson E. Cohn A.G. Fuentes R. Vitac: Feature sharing between vision and tactile sensing for cloth texture recognition Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA) Brisbane, QLD, Australia 21–25 May 2018 IEEE Piscataway, NJ, USA 2018 2722 2727
  11. Lee J.T. Bollegala D. Luo S. “Touching to see” and “seeing to feel”: Robotic cross-modal sensory data generation for visual-tactile perception Proceedings of the 2019 International Conference on Robotics and Automation (ICRA) Montreal, QC, Canada 20–24 May 2019 IEEE Piscataway, NJ, USA 2019 4276 4282
  12. Chen J. Zhou S. Vision2Touch: Imaging Estimation of Surface Tactile Physical Properties Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Rhodes Island, Greece 4–10 June 2023 IEEE Piscataway, NJ, USA 2023 1 5
  13. Liu H. Guo D. Zhang X. Zhu W. Fang B. Sun F. Toward image-to-tactile cross-modal perception for visually impaired people IEEE Trans. Autom. Sci. Eng. 2020 18 521 529 10.1109/TASE.2020.2971713
  14. Su Z. Huang B. Miao J. Wang W. Lin X. Configurable Performance-Communication Trade-Off for Quaternion-Based AUVs: A Partitioned Hybrid Event-Triggered Approach IEEE Trans. Veh. Technol. 2025 early access 10.1109/TVT.2025.3599895
  15. Huang B. Song Y. Qin H. Miao J. Zhu C. Safety-enhanced formation maneuver control for electric vehicle with edge-weighted topology and reinforcement learning strategy IEEE Trans. Aerosp. Electron. Syst. 2025 61 14716 14731 10.1109/TAES.2025.3588126
  16. Goodfellow I.J. Pouget-Abadie J. Mirza M. Xu B. Warde-Farley D. Ozair S. Courville A. Bengio Y. Generative adversarial nets Proceedings of the 28th International Conference on Neural Information Processing Systems Montreal, QC, Canada 8–13 December 2014 Volume 2 2672 2680
  17. Li Y. Zhao H. Liu H. Lu S. Hou Y. Research on visual-tactile cross-modality based on generative adversarial network Cogn. Comput. Syst. 2021 3 131 141
  18. Isola P. Zhu J.Y. Zhou T. Efros A.A. Image-to-image translation with conditional adversarial networks Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Honolulu, HI, USA 21–26 July 2017 1125 1134
  19. Wang T.C. Liu M.Y. Zhu J.Y. Tao A. Kautz J. Catanzaro B. High-resolution image synthesis and semantic manipulation with conditional gans Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Salt Lake City, UT, USA 18–23 June 2018 8798 8807
  20. Zhu J.Y. Park T. Isola P. Efros A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks Proceedings of the IEEE International Conference on Computer Vision Venice, Italy 22–29 October 2017 2223 2232
  21. Kim T. Cha M. Kim H. Lee J.K. Kim J. Learning to discover cross-domain relations with generative adversarial networks Proceedings of the International Conference on Machine Learning Sydney, Australia 6–11 August 2017 PMLR Cambridge, MA, USA 2017 1857 1865
  22. Park T. Liu M.Y. Wang T.C. Zhu J.Y. Semantic image synthesis with spatially-adaptive normalization Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Long Beach, CA, USA 15–20 June 2019 2337 2346
  23. Griffin D. Lim J. Signal estimation from modified short-time Fourier transform IEEE Trans. Acoust. Speech Signal Process. 1984 32 236 243
  24. De Vries H. Strub F. Mary J. Larochelle H. Pietquin O. Courville A.C. Modulating early visual processing by language Proceedings of the 31st International Conference on Neural Information Processing Systems Long Beach, CA, USA 4–9 December 2017 6597 6607
  25. Strese M. Schuwerk C. Iepure A. Steinbach E. Multimodal feature-based surface material classification IEEE Trans. Haptics 2016 10 226 239 10.1109/toh.2016.2625787
  26. Selvaraju R.R. Cogswell M. Das A. Vedantam R. Parikh D. Batra D. Grad-cam: Visual explanations from deep networks via gradient-based localization Proceedings of the IEEE International Conference on Computer Vision Venice, Italy 22–29 October 2017 618 626
  27. Kingma D.P. Welling M. Auto-encoding variational bayes arXiv 2013 1312.6114
  28. Vaswani A. Shazeer N. Parmar N. Uszkoreit J. Jones L. Gomez A.N. Kaiser Ł. Polosukhin I. Attention is all you need Proceedings of the 31st International Conference on Neural Information Processing Systems Long Beach, CA, USA 4–9 December 2017 6000 6010
  29. Van Den Oord A. Dieleman S. Zen H. Simonyan K. Vinyals O. Graves A. Kalchbrenner N. Senior A. Kavukcuoglu K. Wavenet: A generative model for raw audio arXiv 2016 10.48550/arXiv.1609.03499 1609.03499
  30. Verma P. Chafe C. A generative model for raw audio using transformer architectures Proceedings of the 2021 24th International Conference on Digital Audio Effects (DAFx) Vienna, Austria 8–10 September 2021 IEEE Piscataway, NJ, USA 2021 230 237
  31. Zhu H. Luo M.D. Wang R. Zheng A.H. He R. Deep audio-visual learning: A survey Int. J. Autom. Comput. 2021 18 351 376 10.1007/s11633-021-1293-0
  32. Sung-Bin K. Senocak A. Ha H. Owens A. Oh T.H. Sound to visual scene generation by audio-to-visual latent alignment Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Vancouver, BC, Canada 17–24 June 2023 6430 6440
  33. Ujitoko Y. Ban Y. Vibrotactile signal generation from texture images or attributes using generative adversarial network Proceedings of the Haptics: Science, Technology, and Applications: 11th International Conference, EuroHaptics 2018 Pisa, Italy 13–16 June 2018 Proceedings, Part II 11 Springer Berlin/Heidelberg, Germany 2018 25 36
  34. Ban Y. Ujitoko Y. TactGAN: Vibrotactile designing driven by GAN-based automatic generation Proceedings of the SIGGRAPH Asia 2018 Emerging Technologies Tokyo, Japan 4–7 December 2018 1 2 10.1145/3275476.3275484
  35. Cai S. Ban Y. Narumi T. Zhu K. FrictGAN: Frictional Signal Generation from Fabric Texture Images using Generative Adversarial Network Proceedings of ICAT-EGVE 2020 Virtual Event 2–4 December 2020 11 15
  36. Cai S. Zhao L. Ban Y. Narumi T. Liu Y. Zhu K. GAN-based image-to-friction generation for tactile simulation of fabric material Comput. Graph. 2022 102 460 473
  37. Cai S. Zhu K. Ban Y. Narumi T. Visual-tactile cross-modal data generation using residue-fusion gan with feature-matching and perceptual losses IEEE Robot. Autom. Lett. 2021 6 7525 7532
  38. Xi Q. Wang F. Tao L. Zhang H. Jiang X. Wu J. CM-AVAE: Cross-Modal Adversarial Variational Autoencoder for Visual-to-Tactile Data Generation IEEE Robot. Autom. Lett. 2024 9 5214 5221
  39. Simonyan K. Zisserman A. Very deep convolutional networks for large-scale image recognition arXiv 2014 1409.1556
  40. Agatsuma S. Kurogi J. Saga S. Vasilache S. Takahashi S. Simple Generative Adversarial Network to Generate Three-axis Time-series Data for Vibrotactile Displays Proceedings of the International Conference on Advances in Computer-Human Interactions, ACHI 2020 Valencia, Spain 21–25 November 2020 19 24
  41. Rombach R. Blattmann A. Lorenz D. Esser P. Ommer B. High-resolution image synthesis with latent diffusion models Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition New Orleans, LA, USA 18–24 June 2022 10684 10695
  42. Sohl-Dickstein J. Weiss E. Maheswaranathan N. Ganguli S. Deep unsupervised learning using nonequilibrium thermodynamics Proceedings of the International Conference on Machine Learning Lille, France 7–9 July 2015 PMLR Cambridge, MA, USA 2015 2256 2265
  43. Corvi R. Cozzolino D. Poggi G. Nagano K. Verdoliva L. Intriguing properties of synthetic images: From generative adversarial networks to diffusion models Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Vancouver, BC, Canada 17–24 June 2023 973 982
  44. Chen M. Mei S. Fan J. Wang M. Opportunities and challenges of diffusion models for generative AI Natl. Sci. Rev. 2024 11 nwae348 10.1093/nsr/nwae348
  45. Chen C. Ding H. Sisman B. Xu Y. Xie O. Yao B.Z. Tran S.D. Zeng B. Diffusion models for multi-task generative modeling Proceedings of the Twelfth International Conference on Learning Representations Vienna, Austria 7–11 May 2024
  46. Lin X. Xu W. Mao Y. Wang J. Lv M. Liu L. Luo X. Li X. Vision-based Tactile Image Generation via Contact Condition-guided Diffusion Model arXiv 2024 2412.01639
  47. Gu C. Gromov M. Unpaired Image-To-Image Translation Using Transformer-Based CycleGAN Proceedings of the International Conference on Software Testing, Machine Learning and Complex Process Analysis Tomsk, Russia 25–27 November 2021 Springer Berlin/Heidelberg, Germany 2021 75 82
  48. Dubey S.R. Singh S.K. Transformer-based generative adversarial networks in computer vision: A comprehensive survey IEEE Trans. Artif. Intell. 2024 5 4851 4867 10.1109/tai.2024.3404910
  49. Dou Y. Yang F. Liu Y. Loquercio A. Owens A. Tactile-augmented radiance fields Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Seattle, WA, USA 16–22 June 2024 26529 26539
  50. Yang F. Zhang J. Owens A. Generating visual scenes from touch Proceedings of the IEEE/CVF International Conference on Computer Vision Paris, France 1–6 October 2023 22070 22080
  51. Jiang S. Zhao S. Fan Y. Yin P. GelFusion: Enhancing Robotic Manipulation under Visual Constraints via Visuotactile Fusion arXiv 2025 10.48550/arXiv.2505.07455 2505.07455
  52. Huang G. Liu Z. Van Der Maaten L. Weinberger K.Q. Densely connected convolutional networks Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Honolulu, HI, USA 21–26 July 2017 4700 4708
  53. Zhang M. Terui S. Makino Y. Shinoda H. TexSenseGAN: A User-Guided System for Optimizing Texture-Related Vibrotactile Feedback Using Generative Adversarial Network IEEE Trans. Haptics 2025 18 325 339
  54. pytorch.org. Installation of PyTorch v1.12.1 Available online: https://pytorch.org/get-started/previous-versions/ (accessed on 1 September 2025)
  55. Kingma D.P. Adam: A method for stochastic optimization arXiv 2014 1412.6980
  56. Zheng W. Liu H. Wang B. Sun F. Cross-modal learning for material perception using deep extreme learning machine Int. J. Mach. Learn. Cybern. 2020 11 813 823
  57. Zhang R. Isola P. Efros A.A. Shechtman E. Wang O. The unreasonable effectiveness of deep features as a perceptual metric Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Salt Lake City, UT, USA 18–23 June 2018 586 595
  58. Yu Y. Zhang W. Deng Y. Frechet inception distance (fid) for evaluating gans China Univ. Min. Technol. Beijing Grad. Sch. 2021 3 1 7
  59. Krizhevsky A. Sutskever I. Hinton G.E. Imagenet classification with deep convolutional neural networks Proceedings of the 25th International Conference on Neural Information Processing Systems Lake Tahoe, NV, USA 3–6 December 2012 1097 1105

Issue

Sensors, vol. 26, 2026, Switzerland, https://doi.org/10.3390/s26020426

Type: journal article, publication in a journal with an impact factor, publication in a refereed journal, indexed in Scopus and Web of Science