Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with Bitextor
Please use this identifier to cite or link to this item:
http://hdl.handle.net/10045/14265
Title: | Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with Bitextor |
---|---|
Author(s): | Esplà-Gomis, Miquel | Forcada, Mikel L. |
Research group(s) or GITE: | Transducens |
Center, Department or Service: | Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos |
Keywords: | Parallel corpus | Bitext | Bitextor | Tag aligner | Translation memories |
Knowledge area(s): | Lenguajes y Sistemas Informáticos |
Date created: | Dec-2009 |
Date published: | Jan-2010 |
Publisher: | Charles University in Prague. Institute of Formal and Applied Linguistics | Versita |
Bibliographic citation: | ESPLÀ GOMIS, Miquel; FORCADA, Mikel L. "Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with Bitextor". The Prague Bulletin of Mathematical Linguistics. No. 93 (Jan. 2010). ISSN 0032-6585, pp. 77-86 |
Abstract: | Nowadays, many websites on the Internet are multilingual and may be considered sources of parallel corpora. In this paper we will describe the free/open-source tool Bitextor, created to harvest aligned bitexts from these multilingual websites, which may be used to train corpus-based machine translation systems. This tool uses the work developed in previous approaches with modifications and improvements in order to obtain a tool as adaptable as possible to make it easier to process any kind of websites and work with any pairs of languages. Content-based and URL-based heuristics and algorithms applied to identify and align the parallel web pages in a website will be described and, finally, some results will be presented to show the functionality of the application and set the future work lines for this project. |
Sponsor(s): | Spanish Ministry of Science and Technology through grant TIC2003-08681-C02. Spanish Ministry of Science and Innovation through project TIN2009-14009-C02-01. |
URI: | http://hdl.handle.net/10045/14265 |
ISBN: | 978-80-904175-4-0 |
ISSN: | 0032-6585 (Print) | 1804-0462 (Online) |
DOI: | 10.2478/v10108-010-0003-9 |
Language: | eng |
Type: | info:eu-repo/semantics/article |
Peer reviewed: | yes |
Publisher's version: | http://dx.doi.org/10.2478/v10108-010-0003-9 |
Appears in collections: | INV - TRANSDUCENS - Artículos de Revistas |
Files in this item:
File | Description | Size | Format | |
---|---|---|---|---|
art-espla-gomis-forcada.pdf | | 188.67 kB | Adobe PDF | |
All documents in RUA are protected by copyright. Some rights reserved.