Tabular open government data search for data spaces based on word embeddings

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/133987
Item information
Title: Tabular open government data search for data spaces based on word embeddings
Author(s): Berenguer, Alberto | Tomás, David | Mazón, Jose-Norberto
Research group(s) or GITE: Procesamiento del Lenguaje y Sistemas de Información (GPLSI) | Web and Knowledge (WaKe)
Center, Department or Service: Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos
Keywords: Open government data | Tabular data search | Data spaces | Word embeddings
Publication date: 4-Apr-2023
Publisher: CEUR
Bibliographic citation: Proceedings of the 25th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP), co-located with the 26th International Conference on Extending Database Technology and the 26th International Conference on Database Theory (EDBT/ICDT 2023), Ioannina, Greece, March 28, 2023. CEUR Workshop Proceedings, Vol-3369, 61-70
Abstract: Nowadays, data spaces are envisioned as a prominent mechanism for data sharing, boosting growth and creating value. Open government data providers should be considered important participants in data space reference infrastructures, since most governments have adopted open data portal initiatives to supply their public sector information. However, open data is mostly published in the form of tabular data such as spreadsheets or CSV files. Therefore, reusing open data in data spaces is challenging due to the friction that may occur when combining data shared in data spaces with tabular data published in open government portals. To alleviate this situation, tabular open data search engines can be a promising solution. Currently, most open data portals allow reusers to retrieve and federate tabular open data by means of a keyword-based search engine over metadata. Unfortunately, these search engines rely on metadata quality, which is often not good enough: the metadata must be complete, descriptive, and representative of the content. Moreover, keyword-based search is not always an adequate solution for retrieving open data, since it does not consider the tabular nature of the data, and search results can be useless for reusers (e.g., when they attempt to find data useful for extending the rows or columns of a given tabular dataset). To overcome these problems, this paper presents an approach that uses word embeddings for tabular open data search based on unionability and joinability. Our approach could be seamlessly integrated into a data space infrastructure. A prototype of our approach has been developed. Finally, both an intrinsic evaluation and an extrinsic evaluation with end users have been carried out.
Sponsor(s): This work is part of the project TED2021-130890B-C21, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. This work is also partially funded by the GVA-COVID19/2021/103 project from the “Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital de la Generalitat Valenciana”. Alberto Berenguer has a contract for predoctoral training with the “Generalitat Valenciana” and the European Social Fund, funded by the grant ACIF/2021/507.
URI: http://hdl.handle.net/10045/133987
ISSN: 1613-0073
Language: eng
Type: info:eu-repo/semantics/conferenceObject
Rights: © 2023 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Peer reviewed: yes
Publisher's version: https://ceur-ws.org/Vol-3369/
Appears in collections: INV - WaKe - Comunicaciones a Congresos, Conferencias, etc.
INV - GPLSI - Comunicaciones a Congresos, Conferencias, etc.

Files in this item:
File: Berenguer_etal_2023_CEUR.pdf | Size: 1.02 MB | Format: Adobe PDF


This item is licensed under a Creative Commons License