Contextual word embeddings for tabular data search and integration

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/130001
Item information
Title: Contextual word embeddings for tabular data search and integration
Authors: Pilaluisa, José | Tomás, David | Navarro Colorado, Borja | Mazón, Jose-Norberto
Research Group/s: Procesamiento del Lenguaje y Sistemas de Información (GPLSI) | Web and Knowledge (WaKe)
Center, Department or Service: Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos
Keywords: Tabular data | Contextual word embedding | Information search | Data integration | Open data
Issue Date: 30-Nov-2022
Publisher: Springer Nature
Citation: Neural Computing and Applications. 2023, 35: 9319-9333. https://doi.org/10.1007/s00521-022-08066-8
Abstract: This paper presents a new approach to retrieve and further integrate tabular datasets (collections of rows and columns) using union and join operations. In this work, both processes were carried out using a similarity measure based on contextual word embeddings, which allows finding semantically similar tables and overcoming the recall problem of lexical approaches based on string similarity. This work is the first attempt to use contextual word embeddings in the whole pipeline of table search and integration, including for the first time their use in the join operation. A comprehensive analysis of their performance was carried out on both retrieving and integrating tabular datasets, comparing them with context-free models. Column headings and cell values were used as contextual information and their impact on each task was evaluated. The results revealed that contextual models significantly outperform context-free models and a traditional weighting schema in ad hoc table retrieval. In the data integration task, contextual models also improved the results on the union operation compared to context-free approaches.
Sponsor: Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This research has been partially funded by project “Desarrollo de un ecosistema de datos abiertos para transformar el sector turístico” (GVA-COVID19/2021/103) funded by Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital de la Generalitat Valenciana (Spain); and by projects “CHAN-TWIN” (TED2021-130890B-C21), “COnscious natuRal TEXt generation (CORTEX)” (PID2021-123956OB-I00) and “Technological Resources for Intelligent VIral AnaLysis through NLP (TRIVIAL)” (PID2021-122263OB-C22), funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.
URI: http://hdl.handle.net/10045/130001
ISSN: 0941-0643 (Print) | 1433-3058 (Online)
DOI: 10.1007/s00521-022-08066-8
Language: eng
Type: info:eu-repo/semantics/article
Rights: © The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Peer Review: yes
Publisher version: https://doi.org/10.1007/s00521-022-08066-8
Appears in Collections: INV - GPLSI - Artículos de Revistas | INV - WaKe - Artículos de Revistas

Files in This Item:
File | Description | Size | Format
Pilaluisa_etal_2023_NeuralComputApplic.pdf | | 549,63 kB | Adobe PDF


This item is licensed under a Creative Commons License.