Contextual word embeddings for tabular data search and integration

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/130001
Item information
Title: Contextual word embeddings for tabular data search and integration
Author(s): Pilaluisa, José | Tomás, David | Navarro Colorado, Borja | Mazón, Jose-Norberto
Research group(s) or GITE: Procesamiento del Lenguaje y Sistemas de Información (GPLSI) | Web and Knowledge (WaKe)
Center, Department or Service: Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos
Keywords: Tabular data | Contextual word embedding | Information search | Data integration | Open data
Publication date: 30-Nov-2022
Publisher: Springer Nature
Citation: Neural Computing and Applications. 2023, 35: 9319-9333. https://doi.org/10.1007/s00521-022-08066-8
Abstract: This paper presents a new approach to retrieve and further integrate tabular datasets (collections of rows and columns) using union and join operations. In this work, both processes were carried out using a similarity measure based on contextual word embeddings, which makes it possible to find semantically similar tables and overcome the recall problem of lexical approaches based on string similarity. This work is the first attempt to use contextual word embeddings in the whole pipeline of table search and integration, including for the first time their use in the join operation. A comprehensive analysis of their performance was carried out on both retrieving and integrating tabular datasets, comparing them with context-free models. Column headings and cell values were used as contextual information and their impact on each task was evaluated. The results revealed that contextual models significantly outperform context-free models and a traditional weighting schema in ad hoc table retrieval. In the data integration task, contextual models also improved the results on the union operation compared to context-free approaches.
Sponsor(s): Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This research has been partially funded by project “Desarrollo de un ecosistema de datos abiertos para transformar el sector turístico” (GVA-COVID19/2021/103) funded by Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital de la Generalitat Valenciana (Spain); and by projects “CHAN-TWIN” (TED2021-130890B-C21), “COnscious natuRal TEXt generation (CORTEX)” (PID2021-123956OB-I00) and “Technological Resources for Intelligent VIral AnaLysis through NLP (TRIVIAL)” (PID2021-122263OB-C22), funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.
URI: http://hdl.handle.net/10045/130001
ISSN: 0941-0643 (Print) | 1433-3058 (Online)
DOI: 10.1007/s00521-022-08066-8
Language: eng
Type: info:eu-repo/semantics/article
Rights: © The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Peer reviewed: yes
Publisher's version: https://doi.org/10.1007/s00521-022-08066-8
Appears in collections: INV - GPLSI - Artículos de Revistas
INV - WaKe - Artículos de Revistas

Files in this item:
File | Size | Format
Pilaluisa_etal_2023_NeuralComputApplic.pdf | 549.63 kB | Adobe PDF


This item is licensed under a Creative Commons License.