This task is twofold: transporting the data to the supercomputer, and processing it to generate a language model. MareNostrum began storing content some weeks ago, after a process was developed to extract textual data from the library's web archive, which has allowed content to be transferred to the BSC quickly. Transmitting this enormous quantity of data was one of the significant challenges of the initiative. So far, the supercomputer has stored 45 terabytes.
The next step will be to process this data to generate language models using natural language processing technologies. Such models are already available for English; the best known is Google's BERT, which has been a milestone in natural language processing. The model on which the BSC is working stands out from other Spanish language model initiatives because of the quantity of Spanish linguistic data it contains, which makes it more precise and more practical for use across domains.
Language models reproduce language use and allow us to determine the real meaning of words, even within whole sentences, because the data is contextualized and therefore carries more information and meaning. This makes it possible to disambiguate word senses, for instance to distinguish between the meaning of "sick" in "This is sick!" and in "I'm feeling sick". It also allows us to interpret ideological bias, and it opens the way to dealing with irony and figurative meaning. It also endows artificial intelligence systems with common sense.
Quim Moré, a researcher in the CASE department of the BSC, and David Vicente, team manager of the Operations group, are responsible for this project. According to Quim Moré: "The generation of language models is vital to artificial intelligence. The computer application of a disambiguated language model with a context founded in our world knowledge means a great advance in the generation of smarter and closer systems".
The applications of this model are diverse: from automatic translation and cybersecurity to a robot describing the content of a 15th-century painting. Nevertheless, models capable of generating this revolution require computational and data resources that only a few centres and companies, such as Google or Facebook, have.
In this regard, Quim Moré highlighted that "we are lucky that MareNostrum has the computing capacity that we need and, on the other hand, that we have the huge amount of linguistic data reviewed and provided by the National Library. We have a great opportunity to be on the same level as the great centres of artificial intelligence and also to provide a computational application of linguistic knowledge to culture".
The Spanish web archive is the collection of websites with the .es domain and others, including blogs, forums, documents, images, videos, etc., that are collected in order to preserve the Spanish documentary heritage on the Internet and to ensure access to it. December 2019 marked the 10th anniversary of the launch of the Spanish web archive project. Since then, the Spanish National Library has strengthened its infrastructure, policies and processes for preserving online heritage, just as the most important national libraries have been doing for years now.