Berenguer, Alberto, Mazón, Jose-Norberto, Tomás, David
Word embeddings for retrieving tabular data from research publications
Machine Learning. 2024, 113: 2227-2248. https://doi.org/10.1007/s10994-023-06472-0
URI: http://hdl.handle.net/10045/138859
DOI: 10.1007/s10994-023-06472-0
ISSN: 0885-6125 (Print)
Abstract: 
Scientists face challenges when finding datasets related to their research problems due to the limitations of current dataset search engines. Existing tools for searching research datasets rely on publication content or metadata and do not consider the data contained in the publication in the form of tables. Moreover, scientists require more elaborate inputs and functionalities to retrieve different parts of an article, such as data presented in tables, based on their search purposes. Therefore, this paper proposes a novel approach to retrieve relevant tabular datasets from publications. The input of our system is a research problem stated as an abstract from a scientific paper, and the output is a set of relevant tables from publications that are related to the research problem. This approach aims to provide a better solution for scientists to find useful datasets that support them in addressing their research problems. To validate this approach, experiments were conducted using word embeddings from different language models to calculate the semantic similarity between abstracts and tables. The results showed that contextual models significantly outperformed non-contextual models, especially when pre-trained with scientific data. Furthermore, context was found to be crucial for improving the results.
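The retrieval idea described in the abstract (embed an abstract, embed each table, rank tables by semantic similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy three-dimensional vectors, the averaging of word vectors, the use of cosine similarity, and the names `TOY_VECTORS`, `embed`, `cosine`, and `rank_tables` are all assumptions made for illustration. It corresponds roughly to a non-contextual baseline; the abstract reports that contextual models performed better.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
# A real system would load vectors from a trained language model.
TOY_VECTORS = {
    "protein": [0.9, 0.1, 0.0],
    "gene": [0.8, 0.2, 0.1],
    "expression": [0.7, 0.3, 0.0],
    "rainfall": [0.0, 0.9, 0.2],
    "temperature": [0.1, 0.8, 0.3],
    "station": [0.0, 0.7, 0.4],
}


def embed(text):
    """Average the vectors of known words; return None if no word is known."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_tables(abstract, tables):
    """Return (table_id, score) pairs sorted by descending similarity."""
    query = embed(abstract)
    scored = []
    for table_id, table_text in tables.items():
        vec = embed(table_text)
        # Tables (or queries) with no known words get a score of 0.0.
        score = cosine(query, vec) if query is not None and vec is not None else 0.0
        scored.append((table_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Each table is represented here by a flat string of its header and cell words.
tables = {
    "table_climate": "station temperature rainfall",
    "table_genomics": "gene protein expression",
}
ranking = rank_tables("gene expression of a protein", tables)
```

In this sketch a table is flattened to a bag of words before embedding, which discards row and column structure; how tables are actually serialized and which models are used is described in the paper itself, not here.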
Keywords: Research tabular data, Information retrieval, Word embeddings, Text classification
Springer Nature
info:eu-repo/semantics/article