Word embeddings for retrieving tabular data from research publications

Berenguer, Alberto; Mazón, Jose-Norberto; Tomás, David

Always use this identifier to cite or link to this item: http://hdl.handle.net/10045/138859

Full record
Dublin Core field	Value	Language
dc.contributor	Web and Knowledge (WaKe)	es_ES
dc.contributor	Procesamiento del Lenguaje y Sistemas de Información (GPLSI)	es_ES
dc.contributor.author	Berenguer, Alberto	-
dc.contributor.author	Mazón, Jose-Norberto	-
dc.contributor.author	Tomás, David	-
dc.contributor.other	Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos	es_ES
dc.date.accessioned	2023-11-30T10:05:14Z	-
dc.date.available	2023-11-30T10:05:14Z	-
dc.date.issued	2023-11-29	-
dc.identifier.citation	Machine Learning. 2024, 113: 2227-2248. https://doi.org/10.1007/s10994-023-06472-0	es_ES
dc.identifier.issn	0885-6125 (Print)	-
dc.identifier.issn	1573-0565 (Online)	-
dc.identifier.uri	http://hdl.handle.net/10045/138859	-
dc.description.abstract	Scientists face challenges when finding datasets related to their research problems due to the limitations of current dataset search engines. Existing tools for searching research datasets rely on publication content or metadata and do not consider the data contained in the publication in the form of tables. Moreover, scientists require more elaborate inputs and functionalities to retrieve different parts of an article, such as data presented in tables, based on their search purposes. Therefore, this paper proposes a novel approach to retrieve relevant tabular datasets from publications. The input of our system is a research problem stated as an abstract from a scientific paper, and the output is a set of relevant tables from publications that are related to the research problem. This approach aims to provide a better solution for scientists to find useful datasets that support them in addressing their research problems. To validate this approach, experiments were conducted using word embeddings from different language models to calculate the semantic similarity between abstracts and tables. The results showed that contextual models significantly outperformed non-contextual models, especially when pre-trained with scientific data. Furthermore, context was found to be crucial for improving the results.	es_ES
dc.description.sponsorship	Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work is part of the project TED2021-130890B-C21, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. Alberto Berenguer has a contract for predoctoral training with the Generalitat Valenciana and the European Social Fund, funded by the grant ACIF/2021/507.	es_ES
dc.language	eng	es_ES
dc.publisher	Springer Nature	es_ES
dc.rights	© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.	es_ES
dc.subject	Research tabular data	es_ES
dc.subject	Information retrieval	es_ES
dc.subject	Word embeddings	es_ES
dc.subject	Text classification	es_ES
dc.title	Word embeddings for retrieving tabular data from research publications	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.peerreviewed	yes	es_ES
dc.identifier.doi	10.1007/s10994-023-06472-0	-
dc.relation.publisherversion	https://doi.org/10.1007/s10994-023-06472-0	es_ES
dc.rights.accessRights	info:eu-repo/semantics/openAccess	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/TED2021-130890B-C21	es_ES
Appears in collections:	INV - WaKe - Artículos de Revistas; INV - GPLSI - Artículos de Revistas

Files for this item:
File	Description	Size	Format
Berenguer_etal_2024_MachLearn.pdf		1.26 MB	Adobe PDF

All documents deposited in RUA are protected by copyright. Some rights reserved.