Heterogeneous attention based transformer for sign language translation
Sign language translation (SLT) has attracted significant interest from both research and industry, enabling convenient communication with the deaf-mute community. While recent transformer-based models have improved sign translation performance, it remains under-explored how to design an efficient transformer-based deep network architecture that effectively extracts joint visual-text features by exploiting multi-level spatial and temporal contextual information. In this paper, we propose the heterogeneous attention based transformer (HAT), a novel SLT model that generates attention from diverse spatial and temporal contextual levels. Specifically, the proposed light dual-stream sparse attention-based module yields more effective visual-text representations than conventional transformers. Extensive experiments demonstrate that HAT achieves state-of-the-art performance on the challenging PHOENIX2014T benchmark dataset, with a BLEU-4 score of 25.33 on the test set.
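The abstract does not detail HAT's architecture, but the two ingredients it names — a dual-stream design and sparse attention — can be illustrated generically. Below is a minimal NumPy sketch, assuming a common top-k form of sparse attention and a simple concatenation fusion of a visual stream and a text stream; the function and variable names (`sparse_topk_attention`, `visual`, `text`, `fused`) are hypothetical and not from the paper.

```python
import numpy as np

def sparse_topk_attention(q, k, v, top_k=2):
    """Scaled dot-product attention that keeps only the top_k scores per query.

    Hypothetical sketch: the paper's exact sparsity pattern is not given in
    the abstract; top-k masking is one standard way to sparsify attention.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (Tq, Tk) similarities
    # Mask everything below each query's top_k-th score before the softmax.
    thresh = np.sort(scores, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # each row sums to 1
    return weights @ v                                  # (Tq, d_v)

# Dual-stream use: one stream self-attends over visual frame features, the
# other over text-token features; the pooled outputs are then fused.
rng = np.random.default_rng(0)
visual = rng.standard_normal((6, 8))   # 6 video-frame features, dim 8
text = rng.standard_normal((4, 8))     # 4 text-token features, dim 8
vis_out = sparse_topk_attention(visual, visual, visual, top_k=3)
txt_out = sparse_topk_attention(text, text, text, top_k=2)
fused = np.concatenate([vis_out.mean(axis=0), txt_out.mean(axis=0)])
print(fused.shape)
```

The sparsity keeps each query attending to only a few keys, which is one way a "light" attention module can reduce computation relative to dense attention over all frame-token pairs.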
Funding
Natural Science Foundation of Hunan Province, China [2022GK5002, 2020JJ4746]
Special Foundation for Distinguished Young Scientists of Changsha [kq2209003]
Guangxi Key Laboratory of Cryptography and Information Security [GCIS202113]
111 Project [No.D23006]
History
School
- Science
Department
- Computer Science
Published in
- Applied Soft Computing
Volume
- 144
Publisher
- Elsevier
Version
- VoR (Version of Record)
Rights holder
- © The Authors
Publisher statement
- This is an Open Access Article. It is published by Elsevier under the Creative Commons Attribution 4.0 International Licence (CC BY). Full details of this licence are available at: https://creativecommons.org/licenses/by/4.0/
Acceptance date
- 2023-06-06
Publication date
- 2023-06-14
Copyright date
- 2023
ISSN
- 1568-4946
eISSN
- 1872-9681
Publisher version
Language
- en