Monograph in ‘BID’ on data webs and knowledge graphs

Miquel Centelles Velilla

University of Barcelona

ORCID: https://orcid.org/0000-0003-1739-4889

EXIT: https://www.directorioexit.info/ficha292

DOI: https://doi.org/10.1344/BID2023.51.12

It has been over twenty years since the World Wide Web Consortium (W3C) presented its vision and programme for the semantic web, with the objective of providing automatic systems with machine-processable metadata about the data and information published on the web. The prospect of automatic systems capable of semantically interpreting data and, on that basis, running processes autonomously was imagined as a definitive internet revolution. Over the past twenty years, the technologies and standards developed in the context of this programme, which together made up the mythical stack or "layer cake" of the semantic web, have experienced uneven degrees of development.

Some technologies have been consolidated and have affected multiple areas of activity and specialty. This is the case of uniform and internationalized resource identifiers (URI/IRI), which predate the semantic web; of certain serialization formats, such as JSON-LD; and of linked ontologies, such as metadata schemas or knowledge graphs. Finally, there is a group of technologies sitting at the top of the cake, such as Trust (linked to verifiable assertions of identity, content origin and related matters) and Proof of Work (which provides a flexible basis for trust). While these have so far been postponed, they may be rehabilitated in the context of Web3.

The RDF data model, the keystone of the semantic web vision, has had to compete with other models both within and outside the graph-orientated field. Attempts to win over developers, through initiatives such as EasyRDF, have not gone well, but RDF has come to share a well-defined space with its competitors among graph-orientated models, such as property graphs (Neo4j), and even with the traditional relational model. In the context of structured data, a format linked to the semantic web, JSON-LD, in alliance with the Schema.org vocabulary, is establishing itself as a vehicle for embedding metadata that describes products, people, organizations, places and events in HTML pages. Testament to this is the evolution of the total number of pay-level domains (PLDs) compiled by Bizer et al. (2023), in which JSON-LD has for years been the preferred format (ahead of Microdata and Microformat hCard) and has compensated for the gradual decline of RDFa. JSON-LD brings noteworthy benefits for SEO, and developers can also reuse the same data structures to create new front-end user interface widgets, as well as to feed search engine crawlers with metadata that describes the exact meaning of website content.
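
To make the JSON-LD and Schema.org pairing concrete, the following is a minimal sketch in Python of how a page generator might embed such metadata. The product, its values and the helper name `to_script_tag` are hypothetical and serve only as illustration; `@context`, `@type`, `Product` and `Offer` are standard JSON-LD keywords and Schema.org terms.

```python
import json

# Hypothetical product description: every value below is made up for illustration.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "description": "A placeholder product used only for this sketch.",
    "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "EUR",
    },
}


def to_script_tag(data: dict) -> str:
    """Wrap a JSON-LD document in the script element used to embed it in HTML."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so that the payload cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


script_tag = to_script_tag(product)
print(script_tag)
```

The same `product` dictionary could equally be passed to a front-end component, which is the kind of reuse of data structures mentioned above.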

At various points in its evolution, authors who have analysed the state of development of the semantic web have highlighted that its products and services remain too confined to the stronghold of the lab. Perceptions of the reasons for this confinement have evolved over time. For example, a few years after its birth, Tjoa et al. (2005, p. 1163) stated:

"[...] it should be mentioned that the Semantic Web might not promise a quick return-on-investment for those formatting their data to suit the Semantic Web".

On the other hand, more recently, Hassan and Dasmahapatra (2015, p. 14587) argued that open data were excessively subject to assumptions, which made it difficult for so-called killer applications to adopt them.

The entire panorama, however, enables us to identify multiple success stories.

The publication of linked data, which began in 2006, has had a significant impact on the ecosystem of open data in all fields of knowledge where the data comply with the five-star scheme. It consists of a (by now rather large) set of RDF graphs that are linked in the sense that many IRI identifiers in one graph also appear in other, sometimes multiple, graphs. In a sense, the collection of all these linked RDF graphs can be understood as one very big RDF graph. Data providers support querying through SPARQL endpoints. Currently, the website Linked Open Data Cloud, which compiles datasets that have been published as linked data, contains 1,314 datasets with 16,308 links connecting them.
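
The idea that linked graphs form one large graph can be illustrated with a small sketch in Python, in which each graph is modelled as a plain set of triples rather than with an RDF library. All IRIs under example.org, and the two graphs themselves, are hypothetical; the predicate IRIs are taken from the DCMI Terms, FOAF and OWL vocabularies.

```python
# Each graph is modelled as a set of (subject, predicate, object) triples.
# All example.org IRIs are hypothetical placeholders.
library_graph = {
    ("http://example.org/library/book/1",
     "http://purl.org/dc/terms/creator",
     "http://example.org/authority/person/42"),
    ("http://example.org/library/book/1",
     "http://purl.org/dc/terms/title",
     '"An Example Title"'),
}

authority_graph = {
    ("http://example.org/authority/person/42",
     "http://xmlns.com/foaf/0.1/name",
     '"Example Author"'),
    ("http://example.org/authority/person/42",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://example.org/other/Q0"),
}


def iris(graph):
    """Collect every IRI used in a graph (literals are written with a leading quote)."""
    return {term for triple in graph for term in triple if not term.startswith('"')}


# IRIs that appear in both graphs are what makes the two graphs "linked".
shared = iris(library_graph) & iris(authority_graph)

# Because triples are self-contained statements, merging graphs is a set union.
merged = library_graph | authority_graph

print(shared)
print(len(merged))
```

In this toy case the only shared IRI is the person identifier, and it is this overlap that lets a query over the merged graph move from the book to the author's name.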

A central position in the publication of linked data is occupied by DBpedia and Wikidata. The latter is the biggest and most widely used secondary and distributed database in the world. Some fundamental decisions regarding its architecture were based on the standards of the semantic web and, despite the ultimate differences, interoperability between Wikidata and RDF datasets is fully guaranteed. Currently, its dimensions are impressive: 565,000 registered editors, 359 bots, 1,440 million statements, 101 million elements, 10,800 properties and 420 million monthly views of data on the elements described. Through its SPARQL-based query service, Wikidata responds to 3.8 million queries a day, or about 44 per second. However, the dimensions of Wikidata are not the only aspect of interest for its promoters. The dataset is undergoing a process of improving the quality of its search function, which, of course, depends heavily on improving the data and the ontology that gives them meaning. It also facilitates the approach to semantic web technologies and incorporates logical components for the creation and validation of RDF graphs, such as SHACL.
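
As an illustration of what a request to such a query service looks like, the sketch below builds (but does not send) a GET request URL for the public Wikidata SPARQL endpoint. The sample query, which asks for five items that are instances of "house cat", and the helper name `build_query_url` are illustrative choices and not part of the original text; the sketch assumes that the endpoint accepts the `query` and `format` parameters and predefines the `wd:` and `wdt:` prefixes.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint (assumed here to accept GET requests).
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: five items whose "instance of" (P31) value is "house cat" (Q146).
EXAMPLE_QUERY = """
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q146 .
}
LIMIT 5
""".strip()


def build_query_url(query: str, endpoint: str = WIKIDATA_SPARQL_ENDPOINT) -> str:
    """Return a GET URL carrying the SPARQL query and asking for JSON results."""
    return f"{endpoint}?{urlencode({'query': query, 'format': 'json'})}"


url = build_query_url(EXAMPLE_QUERY)
print(url)
```

The resulting URL can be opened with any HTTP client; the network call is deliberately left out so that the sketch runs offline.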

In the context of research, the creation and publication of linked open data have been driven by the FAIR principles (findable, accessible, interoperable, reusable). In their application, most principles of interoperability and reuse can be addressed with standard, established technologies of the semantic web. This application has been extended to other components of the FAIR universe, including semi-automatic tools for verifying compliance with the principles, such as FAIR-Checker, described by Gaignard et al. (2023).

The improvement of information access systems has also been a field rather heavily involved in the technologies and standards of the semantic web.

In some contexts, such as the legal one, the successful implementation of technologies and languages has demonstrated coordinated work at an international level, as well as results that have transcended the limits of the lab and the constant beta. This is the case, for instance, of the European Legislation Identifier (ELI): an online access system for legislation in the context of the European Union, which drives the development of critical information systems for citizens of this millennium. This is an orderly process of implementation based on four pillars: identifiers, ontology and metadata, the publication of data and, since 2023, the synchronization of metadata. This last pillar offers ELI providers two channels of content syndication: ELI Sitemap and ELI Atom feed (ELI TF, 2022).

Many national libraries have also transformed their catalogue data and authority data into the RDF model to offer them as open data. Examples include ID.LOC.GOV: Linked Data Service and Datos.bne.es. The latter project was primarily driven by the EU legal framework established in 2015 regarding the reuse of public sector data. A relevant fact is that the adaptation of the data model to the standards of the semantic web responds not only to the data architecture but primarily to user interaction with the information resources. Here, a highly significant role is played by the linking of BNE data with external databases, such as Wikidata, in a symbiotic relationship of mutual growth.

Ontologies, which in the context of the semantic web, and more specifically of the OWL 2 specification, are defined as formal descriptions of a domain of interest made up of entities, expressions and axioms, have formed part of the information access projects discussed above. They have also had a particular impact in the domain of biomedicine, with examples such as SNOMED-CT and Gene Ontology. More recently, ontologies have been integrated into an ecosystem for generating and publishing semantically enriched data, such as knowledge graphs. These artefacts incorporate individual resources expressed in RDF, semantically represented as members of classes from one or more ontologies and related via properties of those same ontologies. Searches for data are possible thanks to SPARQL services and to the application of automatic reasoners that generate inferences and validate consistency. The development of knowledge graphs in institutional and corporate environments poses a challenge to founding principles such as open data, and promises improvements in the functionalities of artificial intelligence systems, recommendation systems, question answering systems and information retrieval tools. The recent report Knowledge Graph Industry Survey Report: Data and Analysis on Industry Maturity (2022) shows that the majority of those surveyed are at the experimental stage of their journey and that, once more, there is reticence on the part of senior stakeholders, who demand immediate results. Early use cases are mostly related to putting the data house in order, in the form of data integration, aggregation across a range of sources and the application of data quality rules.
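
The role of a reasoner over a knowledge graph can be sketched in a few lines of Python. The example below is a toy illustration, not a real OWL 2 reasoner: it applies only the RDFS subclass rule (if a resource has type C and C is a subclass of D, the resource also has type D). The `ex:` classes, the individuals and the function name `infer_types` are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# A tiny hypothetical knowledge graph: two ontology axioms and two facts
# about an individual resource.
triples = {
    ("ex:Novel", SUBCLASS_OF, "ex:Book"),
    ("ex:Book", SUBCLASS_OF, "ex:CreativeWork"),
    ("ex:item1", RDF_TYPE, "ex:Novel"),
    ("ex:item1", "ex:author", "ex:person1"),
}


def infer_types(graph):
    """Apply the RDFS subclass rule repeatedly until no new rdf:type triple appears."""
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == SUBCLASS_OF]
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for sub, sup in subclass_pairs:
                new_triple = (s, RDF_TYPE, sup)
                if o == sub and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


closure = infer_types(triples)
types_of_item1 = sorted(o for s, p, o in closure if s == "ex:item1" and p == RDF_TYPE)
print(types_of_item1)  # ['ex:Book', 'ex:CreativeWork', 'ex:Novel']
```

A query for all creative works would now return `ex:item1` even though that fact was never stated explicitly, which is the kind of inference the paragraph above refers to.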

Is it possible to envisage a "third life" for the technologies of the semantic web? What is currently occurring is what Seneviratne and McGuinness (2023) describe as a "synergetic convergence" between, on the one hand, semantic technologies, which have given rise to Web 3.0, and, on the other hand, blockchain technologies, which have catalysed the thriving Web3 ecosystem. The use of standardized vocabularies and ontologies allows for interoperability between blockchain nodes, promotes trust and minimizes errors in the exchange of knowledge.

Despite the forecast benefits, underlying challenges and doubts remain, and the term has been dismissed as a "buzzword" or "marketing term", even by prominent figures in the digital communication and information sector, such as Elon Musk and Jack Dorsey.

Time will tell what the future holds for the vision of the semantic web. For the moment, we can state that it has endured for two decades and that it has borne important fruit, if not always perceptibly.

References

Bizer, Christian; Meusel, Robert; Primpeli, Anna; Brinkmann, Alexander (2023, 30 April). Web Data Commons: Microdata, RDFa, JSON-LD, and Microformat Data Sets. Companion Proceedings of the ACM Web Conference 2023. <https://webdatacommons.org/structureddata/index.html>.

ELI TF (2022). ELI ‘Pillar IV’ specification: Protocol to synchronise ELI metadata [online]. Available at: <https://eur-lex.europa.eu/content/eli-register/ELI-Pillar-IV-protocol-specification-v1.0_en.pdf>.

Gaignard, Alban; Rosnet, Thomas; De Lamotte, Frédéric; Lefort, Vincent; Devignes, Marie-Dominique (2023). "FAIR-Checker: Supporting digital resource findability and reuse with Knowledge Graphs and Semantic Web standards". Journal of Biomedical Semantics, vol. 14, no. 7. DOI: <https://doi.org/10.1186/s13326-023-00289-5>.

Hassan, Bryar; Dasmahapatra, Srinandan (2015). "Towards Semantic Web: Challenges and Needs". International Journal Of Engineering And Computer Science, vol. 4, no. 10, pp. 14585-14588 [online]. Available at: <https://ijecs.in/index.php/ijecs/article/view/2953>.

Knowledge Graph Industry Survey Report: Data and Analysis on Industry Maturity (p. 30). (2022). Enterprise Knowledge Graph Foundation; Knowledge Graph Conference.

Seneviratne, Oshani; McGuinness, Deborah L. (2023). Web 3.0 Meets Web3: Exploring the Convergence of Semantic Web and Blockchain Technologies [online]. Available at: <https://ceur-ws.org/Vol-3443/ESWC_2023_TrusDeKW_paper_247.pdf>.

Tjoa, A. Min; Andjomshoaa, Amin; Shayeganfar, Ferial; Wagner, Roland (2005). "Semantic Web challenges and new requirements". 16th International Workshop on Database and Expert Systems Applications (DEXA’05), pp. 1160-1163. DOI: <https://doi.org/10.1109/DEXA.2005.177>.

Creative Commons licence (Attribution-Non-Commercial-No Derivative works). The texts may be consulted and distributed freely provided that the author and publisher are cited (in accordance with the “Recommended citation” section in each of the articles). However, no derivative works (translation, change of format, etc.) may be made without the publisher’s permission. The journal therefore meets the definition of open access from the Budapest Open Access Initiative declaration. The journal allows the author(s) to hold the copyright without restrictions and to retain publishing rights without restrictions.
