Advancements in Word Embeddings: A Comprehensive Survey and Analysis
DOI: https://doi.org/10.53560/PPASA(61-3)842
Keywords: Word Embeddings, Word Representations, NLP, Contextual Embeddings, BERT, ELMo, Word2Vec, Cross-Lingual Embeddings
Abstract
In recent years, the field of Natural Language Processing (NLP) has seen significant growth in the study of word representation, with word embeddings proving valuable for a wide range of NLP tasks by encapsulating prior knowledge in dense vector form. We reviewed word embedding models, their applications, cross-lingual embeddings, model analyses, and techniques for model compression, offering insight into the evolving landscape of word representations in NLP with a focus on the models and algorithms used to estimate word embeddings and the strategies used to analyse them. To that end, we examined and categorized these models and their evaluations, highlighting their principal strengths and weaknesses. We also discussed word embeddings as a prevalent method of representing text data so as to capture semantics, emphasizing how different techniques can be applied effectively to interpret text. Unlike traditional static representations, such as Word to Vector (word2vec), newer contextual embeddings, such as Bidirectional Encoder Representations from Transformers (BERT) and Embeddings from Language Models (ELMo), have pushed the boundaries by capturing how a word is used across diverse contexts and by encoding information that transfers across languages. Because these embeddings represent each word in light of its context, they have enabled innovative applications across a variety of NLP tasks.
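The contrast between static and contextual embeddings summarized above can be made concrete with a short sketch. The example below is illustrative only and is not drawn from the surveyed works: it assumes the gensim and Hugging Face transformers libraries, the public bert-base-uncased checkpoint, and a two-sentence toy corpus, and it shows that word2vec assigns the word "bank" a single vector while BERT produces a different vector for each occurrence depending on the surrounding sentence.

# A minimal sketch contrasting static and contextual embeddings.
# Assumes gensim and transformers are installed; the toy corpus and
# the choice of "bank" as the polysemous word are illustrative only.
import torch
from gensim.models import Word2Vec
from transformers import AutoTokenizer, AutoModel

# --- Static embeddings (word2vec): one vector per word type ---
corpus = [
    ["she", "deposited", "cash", "at", "the", "bank"],
    ["they", "walked", "along", "the", "river", "bank"],
]
w2v = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)
static_vec = w2v.wv["bank"]          # same vector regardless of context
print("word2vec 'bank' dimensions:", static_vec.shape)

# --- Contextual embeddings (BERT): one vector per word occurrence ---
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the hidden state of the token 'bank' in the given sentence."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]   # shape: [seq_len, 768]
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_money = bank_vector("She deposited cash at the bank.")
v_river = bank_vector("They walked along the river bank.")
sim = torch.nn.functional.cosine_similarity(v_money, v_river, dim=0)
print(f"BERT similarity of the two 'bank' occurrences: {sim.item():.3f}")

Under these assumptions, the word2vec lookup returns the same 50-dimensional vector for both occurrences of "bank", whereas the two BERT vectors differ, with a cosine similarity noticeably below 1, reflecting the financial versus riverside senses.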
References
C. Liu and K.K.H. Chung. The relationships between paired associate learning and Chinese word writing in kindergarten children. Reading and Writing 34(8): 2127-2148 (2021).
P. Aceves and J.A. Evans. Mobilizing conceptual spaces: How word embedding models can inform measurement and theory within organization science. Organization Science 35(3): 788-814 (2024).
A. Berenguer, J.-N. Mazón, and D. Tomás. Word embeddings for retrieving tabular data from research publications. Machine Learning 113(4): 2227-2248 (2024).
F. Incitti, F. Urli, and L. Snidaro. Beyond word embeddings: A survey. Information Fusion 89: 418-436 (2023).
M. Toshevska. The Ability of Word Embeddings to Capture Word Similarities. International Journal on Natural Language Computing (IJNLC) 9(3): 25-42 (2020).
M.-C. Hung, P.-H. Hung, X.-J. Kuang, and S.-K. Lin. Intelligent portfolio construction via news sentiment analysis. International Review of Economics and Finance 89: 605-617 (2024).
K. Das, F. Abid, J. Rasheed, Kamlish, T. Asuroglu, S. Alsubai, and S. Soomro. Enhancing Communication Accessibility: UrSL-CNN Approach to Urdu Sign Language Translation for Hearing-Impaired Individuals. CMES-Computer Modeling in Engineering and Sciences 141(1): 689-711 (2024).
B. Lal, R. Gravina, F. Spagnolo, and P. Corsonello. Compressed sensing approach for physiological signals: A review. IEEE Sensors Journal 23(6): 5513-5534 (2023).
A. Baloch, T.D. Memon, F. Memon, B. Lal, V. Viyas, and T. Jan. Hardware synthesis and performance analysis of intelligent transportation using canny edge detection algorithm. International Journal of Engineering and Manufacturing 11(4): 22-32 (2021).
E. Çano and M. Morisio. Word Embeddings for Sentiment Analysis: A Comprehensive Empirical Survey. Preprint ArXiv 1: 1902.00753 (2019).
F.K. Khattak, S. Jeblee, C. Pou-Prom, M. Abdalla, C. Meaney, and F. Rudzicz. A survey of word embeddings for clinical text. Journal of Biomedical Informatics 100: 100057 (2019).
A. Agarwal, B. Agarwal, and P. Harjule. Understanding the Role of Feature Engineering in Fake News Detection. In: Soft Computing: Theories and Applications: Proceedings of SoCTA 2021, Singapore pp. 769-789 (2022).
R.A. Stein, P.A. Jaques, and J.F. Valiati. An analysis of hierarchical text classification using word embeddings. Information Sciences 471: 216-232 (2017).
J.E. Font and M.R. Costa-Jussà. Equalizing Gender Biases in Neural Machine Translation with Word Embeddings Techniques. Preprint ArXiv 2: 1901.03116 (2019).
S.M. Rezaeinia, R. Rahmani, A. Ghodsi, and H. Veisi. Sentiment analysis based on improved pre-trained word embeddings. Expert Systems With Applications 117: 139–147 (2019).
Z. Yao, Y. Sun, W. Ding, N. Rao, and H. Xiong. Dynamic Word Embeddings for Evolving Semantic Discovery. Preprint ArXiv 2: 1703.00607 (2018).
J. Zhao, T. Wang, M. Yatskar, R. Cotterell, V. Ordonez, and K.-W. Chang. Gender Bias in Contextualized Word Embeddings. Preprint ArXiv 1: 1904.03310 (2019).
Q. Du, N. Li, W. Liu, D. Sun, S. Yang, and F. Yue. A Topic Recognition Method of News Text Based on Word Embedding Enhancement. Computational Intelligence and Neuroscience 2022(1): 4582480 (2022).
D. Suhartono, K. Purwandari, N.H. Jeremy, S. Philip, P. Arisaputra, and I.H. Parmonangan. Deep neural networks and weighted word embeddings for sentiment analysis of drug product reviews. Procedia Computer Science 216: 664-671 (2023).
S. Haller, A. Aldea, C. Seifert, and N. Strisciuglio. Survey on Automated Short Answer Grading with Deep Learning: from Word Embeddings to Transformers. Preprint ArXiv 1: 2204.03503 (2022).
A. Çalışkan, P.P. Ajay, T. Charlesworth, R. Wolfe, and M.R. Banaji. Gender Bias in Word Embeddings: A Comprehensive Analysis of Frequency, Syntax, and Semantics. Preprint ArXiv 1: 2206.03390 (2022).
X. Tang, Y. Zhou, and D. Bollegala. Learning Dynamic Contextualised Word Embeddings via Template-based Temporal Adaptation. Preprint ArXiv 3: 2208.10734 (2023).
K. Alnajjar, M. Hämäläinen, and J. Rueter. Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages. Preprint ArXiv 1: 2305.15380 (2023).
H. Yen and W. Jeon. Improvements to Embedding-Matching Acoustic-to-Word ASR Using Multiple-Hypothesis Pronunciation-Based Embeddings. Preprint ArXiv 2: 2210.16726 (2023).
J. Engler, S. Sikdar, M. Lutz, and M. Strohmaier. SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings. Preprint ArXiv 1: 2301.04704 (2023).
R. Schiffers, D. Kern, and D. Hienert. Evaluation of Word Embeddings for the Social Sciences. Preprint ArXiv 1: 2302.06174 (2023).
O. Zaland, M. Abulaish, and M. Fazil. A Comprehensive Empirical Evaluation of Existing Word Embedding Approaches. Preprint ArXiv 2: 2303.07196 (2024).
P.J. Worth. Word Embeddings and Semantic Spaces in Natural Language Processing. International Journal of Intelligence Science 13(1): 1-21 (2023).
K. Das and Kamlish. Enhancing Automated Text Summarization: A Survey and Novel Method with Semantic Information for Domain-Specific Summaries. Journal of Computing & Biomedical Informatics 5(2): 102-113 (2023).
Z.F. Abro, S.U. Rehman, K. Das, and A. Goswami. An Analysis of Deep Neural Network for Recommending Developers to Fix Reported Bugs. European Journal of Science and Technology (24): 375-379 (2021).
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. Preprint ArXiv 3: 1301.3781 (2013).
J. Pennington, R. Socher, and C. Manning. Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar (25-29 October, 2014) pp. 1532-1543 (2014).
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5: 135-146 (2017).
H. Öztürk, A. Özgür, P. Schwaller, T. Laino, and E. Ozkirimli. Exploring Chemical Space Using Natural Language Processing Methodologies for Drug Discovery. Preprint ArXiv 1: 2002.06053 (2020).
F. Torregrossa, R. Allesiardo, V. Claveau, N. Kooli, and G. Gravier. A survey on training and evaluation of word embeddings. International Journal of Data Science and Analytics 11(2): 85-103 (2021).
G. Curto, M.F.J. Acosta, F. Comim, and B. Garcia-Zapirain. Are AI systems biased against the poor? A machine learning analysis using Word2Vec and GloVe embeddings. AI and Society 39(2): 617-632 (2024).
P. Rakshit and A. Sarkar. A supervised deep learning-based sentiment analysis by the implementation of Word2Vec and GloVe Embedding techniques. Multimedia Tools and Applications pp. 1-34 (2024). https://doi.org/10.1007/s11042-024-19045-7
M. Greeshma and P. Simon. Bidirectional Gated Recurrent Unit with Glove Embedding and Attention Mechanism for Movie Review Classification. Procedia Computer Science 233: 528-536 (2024).
J. Howard and S. Ruder. Universal Language Model Fine-Tuning for Text Classification. Preprint ArXiv 5: 1801.06146 (2018).
A. Faruq, M. Lestandy, and A. Nugraha. Analyzing Reddit Data: Hybrid Model for Depression Sentiment using FastText Embedding. Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) 8(2): 288-297 (2024).
S.S. Alqahtani. Security bug reports classification using fasttext. International Journal of Information Security 23(2): 1343-1358 (2024).
D. Voskergian, R. Jayousi, and M. Yousef. Enhanced TextNetTopics for Text Classification Using the GSM Approach with Filtered fastText-Based LDA Topics and RF-Based Topic Scoring: fasTNT. Applied Sciences 14(19): 8914 (2024).
N.A. Nasution, E.B. Nababan, and H. Mawengkang. Comparing LSTM Algorithm with Word Embedding: FastText and Word2Vec in Bahasa Batak-English Translation. 12th International Conference on Information and Communication Technology (ICoICT), Bandung, Indonesia (7-8 August, 2024) pp. 306-313 (2024).
M.E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana (1-6 June, 2018) pp. 2227-2237 (2018).
X. Cheng, T. Mei, Y. Zi, Q. Wang, Z. Gao, and H. Yang. Algorithm Research of ELMo Word Embedding and Deep Learning Multimodal Transformer in Image Description. Preprint ArXiv: 2408.06357 (2024).
L. Rong, Y. Ding, M. Wang, A.E. Saddik, and M.S. Hossain. A Multi-Modal ELMo Model for Image Sentiment Recognition of Consumer Data. IEEE Transactions on Consumer Electronics 7(1): 3697-3708 (2024).
B. McCann, J. Bradbury, C. Xiong, and R. Socher. Learned in Translation: Contextualized Word Vectors. Advances in Neural Information Processing Systems 30: 1-12 (2017).
W. Khan, A. Daud, K. Khan, S. Muhammad, and R. Haq. Exploring the frontiers of deep learning and natural language processing: A comprehensive overview of key challenges and emerging trends. Natural Language Processing Journal 4: 100026 (2023).
T. Luong, R. Socher, and C.D. Manning. Better Word Representations with Recursive Neural Networks for Morphology. Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria (8-9 August, 2013) pp. 104–113 (2013).
Z. Yang. XLNet: Generalized Autoregressive Pretraining for Language Understanding. Preprint ArXiv 2: 1906.08237 (2019).
A.F. Adoma, N.-M. Henry, and W. Chen. Comparative analyses of Bert, Roberta, Distilbert, and Xlnet for Text-Based Emotion Recognition. 17th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), IEEE pp. 117-121 (2020).
J. Devlin, M.W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint ArXiv 2: 1810.04805 (2018).
K. Huang, J. Altosaar, and R. Ranganath. Clinicalbert: Modeling clinical notes and predicting hospital readmission. Preprint ArXiv 3: 1904.05342 (2019).
J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C.H. So, and J. Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4): 1234-1240 (2019).
I. Beltagy, K. Lo, and A. Cohan. SciBERT: A pretrained language model for scientific text. Preprint ArXiv 3: 1903.10676 (2019).
R. Sennrich, B. Haddow, and A. Birch. Neural Machine Translation of Rare Words with Subword Units. Preprint ArXiv 5: 1508.07909 (2016).
Y. Xu and J. Liu. Implicitly Incorporating Morphological Information into Word Embedding. Preprint ArXiv 3: 1701.02481 (2017).
T. Baldwin and S.N. Kim. Multiword expressions. In: Handbook of Natural Language Processing. 2nd Edition. N. Indurkhya and F.J. Damerau (Eds.). Taylor and Francis Group pp. 267-292 (2010).
A. Üstün, M. Kurfalı, and B. Can. Characters or Morphemes: How to Represent Words?. Proceedings of the 3rd Workshop on Representation Learning for NLP, Melbourne, Australia (20 July 2018) pp. 144-153 (2018).
J. Bian, B. Gao, and T.-Y. Liu. Knowledge-powered deep Learning for word embedding. In: Machine Learning and Knowledge Discovery in Databases. T. Calders, F. Esposito, E. Hüllermeier, and R. Meo (eds). Springer, Berlin, Heidelberg pp. 132–148 (2014).
K. Cao and M. Rei. A joint Model for Word Embedding and Word Morphology. Preprint ArXiv 1: 1606.02601 (2016).
Y. Kim, Y. Jernite, D. Sontag, and A.M. Rush. Character-Aware Neural Language Models. Preprint ArXiv 4: 1508.06615 (2016).
D. Smilkov, N. Thorat, C. Nicholson, E. Reif, F.B. Viegas, and M. Wattenberg. Embedding projector: Interactive visualization and interpretation of embeddings. Preprint ArXiv 1: 1611.05469 (2016).
S. Liu, P.-T. Bremer, J.J. Thiagarajan, V. Srikumar, B. Wang, Y. Livnat, and V. Pascucci. Visual Exploration of Semantic Relationships in Neural Word Embeddings. IEEE Transactions on Visualization and Computer Graphics 24(1): 553-562 (2018).
S. Bandyopadhyay, J. Xu, N. Pawar, and D. Touretzky. Interactive visualizations of word embeddings for k-12 students. Proceedings of the AAAI Conference on Artificial Intelligence 36(11): 12713-12720 (2022).
I. Robinson and E. Pierce-Hoffman. Tree-SNE: Hierarchical clustering and visualization using t-SNE. Preprint ArXiv 1: 2002.05687 (2020).
N. Oubenali, S. Messaoud, A. Filiot, A. Lamer, and P. Andrey. Visualization of medical concepts represented using word embeddings: a scoping review. BMC Medical Informatics and Decision Making 22(1): 83 (2022).
M. Gniewkowski and T. Walkowiak. Assessment of document similarity visualization methods. In: Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2019. Lecture Notes in Computer Science. Z. Vetulani, P. Paroubek, and M. Kubis (Eds.). Springer, Cham pp. 348-363 (2019).
X. Han, Z. Zhang, N. Ding, Y. Gu, et al. Pre-trained models: Past, present and future. AI Open 2: 225-250 (2021).
D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, and P. Vincent. Why Does Unsupervised Pre-training Help Deep Learning?. Journal of Machine Learning Research 11: 625-660 (2010).
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural Language Processing (almost) from Scratch. Preprint ArXiv 1: 1103.0398 (2011).
T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. Preprint ArXiv 1: 1310.4546 (2013).
Q.V. Le and T. Mikolov. Distributed Representations of Sentences and Documents. Preprint ArXiv 2: 1405.4053 (2014).
J. Pilault, J. Park, and C. Pal. On the impressive performance of randomly weighted encoders in summarization tasks. Preprint ArXiv 1: 2002.09084 (2020).
O. Melamud, J. Goldberger, and I. Dagan. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany (7-12 August 2016) pp. 51-61 (2016).
A.M. Dai and Q.V. Le. Semi-supervised Sequence Learning. Preprint ArXiv 1: 1511.01432 (2015).
P. Liu, X. Qiu, and X. Huang. Recurrent Neural Network for Text Classification with Multi-Task Learning. Preprint ArXiv 1: 1605.05101 (2016).
P. Ramachandran, P.J. Liu, and Q.V. Le. Unsupervised Pretraining for Sequence to Sequence Learning. Preprint ArXiv 2: 1611.02683 (2018).
A. Akbik, D. Blythe, and R. Vollgraf. Contextual String Embeddings for Sequence Labeling. Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA (20-26 August 2018) pp. 1638-1649 (2018).
S. Ruder, I. Vulic, and A. Søgaard. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research 65: 569-631 (2019).
S. Conia and R. Navigli. Conception: Multilingually-Enhanced, Human-Readable Concept Vector Representations. Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (online) (8-13 December 2020) pp. 3268-3284 (2020).
A. Klementiev, I. Titov, and B. Bhattarai. Inducing Crosslingual Distributed Representations of Words. Proceedings of COLING 2012, Mumbai, India (December 2012) pp. 1459-1474 (2012).
G. Lample and A. Conneau. Cross-lingual Language Model Pretraining. Preprint ArXiv 1: 1901.07291 (2019).
J. Guan, F. Huang, Z. Zhao, X. Zhu, and M. Huang. A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation. Transactions of the Association for Computational Linguistics 8: 93-108 (2020).
Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural Probabilistic language model. Journal of Machine Learning Research 3: 1137-1155 (2003).
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language Models are Unsupervised Multitask Learners. OpenAI Blog 1(8): 9 (2019).
W. Wang, B. Bi, M. Yan, C. Wu, Z. Bao, J. Xia, L. Peng, and L. Si. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding. Preprint ArXiv 3: 1908.04577 (2019).
S. Wang, W. Zhou, and C. Jiang. A survey of word embeddings based on deep learning. Computing 102: 717–740 (2020).
L. Ma and Y. Zhang. Using Word2Vec to process big text data. 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, USA (29 October-01 November 2015) pp. 2895-2897 (2015).
O. Kwon, D. Kim, S.-R. Lee, J. Choi, and S. Lee. Handling Out-of-Vocabulary Problem in Hangeul Word Embeddings. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online pp. 3213-3221 (2021).
A. Gladkova and A. Drozd. Intrinsic evaluations of word embeddings: What can we do better?. Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Berlin, Germany (12 August 2016) pp. 36-42 (2016).
E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pasca, and A. Soroa. A Study on Similarity and Relatedness Using Distributional and Wordnet-based Approaches. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado (June 2009) pp. 19-27 (2009).
E. Bruni, N.-K. Tran, and M. Baroni. Multimodal Distributional Semantics. Journal of Artificial Intelligence Research 49: 1-47 (2014).
D. Gerz, I. Vulic, F. Hill, R. Reichart, and A. Korhonen. SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity. Preprint ArXiv 4: 1608.00869 (2016).
B. Gao, J. Bian, and T.-Y. Liu. Wordrep: A Benchmark for Research on Learning Word Representations. Preprint ArXiv 1: 1407.1640 (2014).
S. Harispe, S. Ranwez, S. Janaqi, and J. Montmain. Methods and Datasets for the Evaluation of Semantic Measures. Semantic Similarity from Natural Language and Ontology Analysis: 131-157 (2015).
B. Wang, A. Wang, F. Chen, Y. Wang, and C.-C. J. Kuo. Evaluating word embedding models: Methods and experimental results. APSIPA Transactions on Signal and Information Processing 8: e19 (2019).
C. Greenberg, V. Demberg, and A. Sayeed. Verb polysemy and frequency effects in thematic fit modeling. Proceedings of the 6th Workshop on Cognitive Modeling and Computational Linguistics, Denver, Colorado (4 June 2015) pp. 48-57 (2015).
S. Padó and M. Lapata. Dependency-Based Construction of Semantic Space Models. Computational Linguistics 33(2): 161-199 (2007).
F. Nooralahzadeh, L. Øvrelid, and J.T. Lønning. Evaluation of Domain-specific Word Embeddings using Knowledge Resources. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan (7-12 May, 2018) (2018).