Building and querying semantic layers for web archives (extended version)

Fafalios, Pavlos; Holzmann, Helge; Kasturia, Vaibhav; Nejdl, Wolfgang

doi:10.1007/s00799-018-0251-0

Published: 05 July 2018

Volume 21, pages 149–167 (2020)

International Journal on Digital Libraries

Pavlos Fafalios (ORCID: orcid.org/0000-0003-2788-526X)¹,
Helge Holzmann¹,
Vaibhav Kasturia¹ &
Wolfgang Nejdl¹

566 Accesses
3 Citations
9 Altmetric

Abstract

Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods remains a major hurdle to turning them into a usable and useful information source. In this paper, we address this problem and propose an RDF/S model and a distributed framework for building semantic profiles (“layers”) that describe semantic information about the contents of web archives. A semantic layer allows describing metadata about the archived documents, annotating them with useful semantic information (such as entities, concepts, and events), and publishing all these data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems cannot sufficiently satisfy.
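
As a rough illustration of the idea in the abstract, the sketch below models a tiny semantic layer as a list of subject-predicate-object triples and answers a question that keyword search alone handles poorly: which documents archived in 2010 mention any politician? The property names (`dc:date`, `schema:mentions`), the document identifiers, and the helper functions are hypothetical placeholders chosen for this example; they are not the vocabulary or the implementation described in the paper, where such a query would more likely be written in SPARQL against a triple store.

```python
from datetime import date

# Hypothetical property names, loosely modelled on Dublin Core and schema.org.
# They are placeholders, not the vocabulary defined by the paper.
DATE = "dc:date"
MENTIONS = "schema:mentions"
TYPE = "rdf:type"

# A toy "semantic layer": (subject, predicate, object) triples that combine
# document metadata, entity annotations, and entity types from a knowledge base.
TRIPLES = [
    ("doc:1", DATE, date(2010, 1, 14)),
    ("doc:1", MENTIONS, "dbr:2010_Haiti_earthquake"),
    ("doc:2", DATE, date(2010, 3, 2)),
    ("doc:2", MENTIONS, "dbr:Barack_Obama"),
    ("doc:3", DATE, date(2016, 11, 9)),
    ("doc:3", MENTIONS, "dbr:Barack_Obama"),
    ("dbr:Barack_Obama", TYPE, "dbo:Politician"),
    ("dbr:2010_Haiti_earthquake", TYPE, "dbo:Event"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every non-None component."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def docs_mentioning_type(triples, entity_type, start, end):
    """Documents dated within [start, end] that mention an entity of entity_type."""
    entities = {s for s, _, _ in match(triples, p=TYPE, o=entity_type)}
    hits = set()
    for doc, _, entity in match(triples, p=MENTIONS):
        if entity not in entities:
            continue
        dates = [o for _, _, o in match(triples, s=doc, p=DATE)]
        if any(start <= d <= end for d in dates):
            hits.add(doc)
    return sorted(hits)


result = docs_mentioning_type(
    TRIPLES, "dbo:Politician", date(2010, 1, 1), date(2010, 12, 31)
)
print(result)  # ['doc:2']
```

The document text never needs to contain the word "politician": the match comes from joining the entity annotation with type information about the entity, which is the kind of integration a Linked Data representation is meant to enable.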

Notes

For simplicity, we use the term "entity" to refer to an entity (e.g., Barack Obama, New York, or Microsoft), a concept (e.g., Democracy or Abortion), or an event (e.g., 2010 Haiti earthquake or 2016 US Election).
The ALEXANDRIA project (ERC Advanced Grant, Nr. 339233, http://alexandria-project.eu/) aims to develop the models, tools, and techniques necessary to explore and analyze web archives in a meaningful way.
https://archive.org.
http://archive.pt.
http://mementoweb.org.
https://archive-it.org.
https://hbase.apache.org/.
https://spark.apache.org/.
The specification is available at: http://l3s.de/owa/.
http://dublincore.org/.
http://www.openannotation.org/spec/core/.
http://www.ics.forth.gr/isl/oae/.
http://schema.org/mentions.
http://mementoweb.org/depot/native/dbpedia/.
https://www.w3.org/TR/prov-dm/.
https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/.
https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/.
https://iptc.org/standards/nitf/.
https://virtuoso.openlinksw.com/.
https://github.com/helgeho/ArchiveSpark2Triples.
https://github.com/helgeho/FEL4ArchiveSpark.
The corresponding Jupyter Notebook is available at: https://github.com/helgeho/ArchiveSpark2Triples/blob/master/notebooks/Triples.ipynb.
http://l3s.de/owa/semanticlayers/.
https://archive-it.org/collections/2950.
http://www.openlinksw.com/schemas/twitter.
http://l3s.de/owa/semanticlayers/SemLayerEval.zip.
http://lod-cloud.net/.


Acknowledgements

The work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA (No. 339233).

Author information

Authors and Affiliations

L3S Research Center, Leibniz University of Hannover, Appelstr. 9a, 30167, Hannover, Germany
Pavlos Fafalios, Helge Holzmann, Vaibhav Kasturia & Wolfgang Nejdl

Corresponding author

Correspondence to Pavlos Fafalios.

Additional information

This is an extended version of the paper: P. Fafalios, H. Holzmann, V. Kasturia, & W. Nejdl, “Building and Querying Semantic Layers for Web Archives”, 2017 ACM/IEEE-CS Joint Conference on Digital Libraries, June 2017.

About this article

Cite this article

Fafalios, P., Holzmann, H., Kasturia, V. et al. Building and querying semantic layers for web archives (extended version). Int J Digit Libr 21, 149–167 (2020). https://doi.org/10.1007/s00799-018-0251-0

Received: 15 September 2017
Revised: 09 December 2017
Accepted: 28 June 2018
Published: 05 July 2018
Issue Date: June 2020
DOI: https://doi.org/10.1007/s00799-018-0251-0
