Journal of Web Semantics

Volume 54, January 2019, Pages 72-86

Invited paper
TISCO: Temporal scoping of facts

https://doi.org/10.1016/j.websem.2018.09.002

Abstract

Some facts in the Web of Data are only valid within a certain time interval. However, most of the knowledge bases available on the Web of Data do not provide temporal information explicitly; hence, the relationship between facts and time intervals is often lost. Few solutions have been proposed in this field, and most of them concentrate on extracting facts together with time intervals rather than on mapping existing facts to time intervals. This paper studies the problem of determining the temporal scope of facts, that is, deciding the time intervals in which a fact is valid. We propose a generic approach which addresses this problem by curating the temporal information of facts in knowledge bases. Our framework, Temporal Information Scoping (TISCO), exploits evidence collected from the Web of Data and the Web. The evidence is combined within a three-step approach comprising matching, selection and merging. This is the first work employing matching methods that consider either a single fact or a group of facts at a time. We evaluate our approach against a corpus of facts as input and different parameter settings for the underlying algorithms. Our results suggest that we can detect temporal information for facts from DBpedia with an F-measure of up to 80%.

Introduction

The Web of Data can be regarded as a dynamic environment where information can change rapidly and cannot be assumed to be static [1]. Changes in Web of Data sources should reflect changes in the real world [2], [3]; otherwise, data can soon become outdated. Some facts are time invariant and thus do not change over time, e.g. <CristianoRonaldo, bornIn, Portugal>, while others have a validity time with a start and an end, e.g. <CristianoRonaldo, playFor, ManchesterUnited> refers to a fact valid from 2003 to 2009.

Most knowledge bases store facts from a historical perspective: facts in these knowledge bases have been true at some time up until the current time. We refer to this representation of facts in knowledge bases as the historification of dynamic facts. Fig. 1 shows examples of different teams for the same entity Jennison Myrie-Williams (facts extracted from DBpedia 2015-10). Historification of dynamic facts, which is frequently found in many prominent datasets in Linked Data (LD), fails to provide details about the time interval in which the facts have been true. These knowledge bases adopt a temporal-flattening approach to representing dynamic facts. The incompleteness and inaccuracy of temporal information in LD [4] is often due to the information extraction process (which can be error prone) or to the representation model (which requires very sophisticated meta-modeling strategies to represent versioning metadata in RDF). For instance, in DBpedia it is not possible to know the time interval of the fact <Jennison Myrie-Williams, playFor, Stevenage>, since all time points are associated directly with the entity rather than with the fact (see Fig. 1, f5–f14) and the semantics of the predicate is the same for the starting and ending time points (e.g. year).

Despite the importance of the relationship between facts and time intervals, very few solutions have been proposed. Most of them concentrate on extracting facts with time intervals from text [5], [6], [7], [8] rather than on mapping existing facts to time intervals. The system CoTS [9] is similar to ours in that it also detects the validity time of facts. In contrast to our approach, CoTS relies on document metadata, such as the creation date, to assign time intervals to facts. To the best of our knowledge, this is the first work employing both local and global matching approaches, and the first system for mapping facts to time intervals.

To map these facts to the correct time intervals, we have to address two main challenges. First, the set of candidate time intervals is created as the combination of all time points extracted from the knowledge base for each entity in which the starting time point is smaller than the ending time point. As shown in the example, each of the facts f1 to f4 has 55 possible time intervals. This set needs to be reduced, since it also contains noisy intervals, such as all intervals starting with the birth year. Second, to find the correct intervals, we need to extract supporting evidence from external sources that indicates how often a fact occurs with each time point, and subsequently predict the possible time interval of each fact based on the acquired evidence.
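The combinatorial construction of the candidate interval space can be sketched as follows (a minimal sketch in Python; the concrete years are hypothetical stand-ins for the time points extracted for an entity, such as a birth year followed by career years):

```python
from itertools import combinations

def candidate_intervals(time_points):
    """All candidate intervals (start, end) with start < end, built
    from the time points extracted from the knowledge base."""
    return [(s, e) for s, e in combinations(sorted(set(time_points)), 2)]

# 11 distinct years yield C(11, 2) = 55 candidate intervals,
# matching the count in the running example.
years = [1988, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
intervals = candidate_intervals(years)
# Noisy intervals such as (1988, 2005), starting at the birth year,
# belong to this space and must be pruned in later steps.
```

Because the space grows quadratically in the number of extracted time points, the subsequent selection step is what keeps the approach tractable.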

In this work, we focus on curating the time intervals associated with facts. We introduce an approach for detecting the temporal scope of facts referred to by triples (short: the temporal scope of the triples). Given a fact (i.e., an RDF triple), our approach aims to detect the time points at which the temporal scope of the triple begins and ends. Two sources can be envisaged for gathering such information: the document Web and LD. Our approach is able to take advantage of both. The Web is exploited by extending a fact validation approach [10], which allows detecting Web documents that corroborate a triple. In contrast to typical search engines, the system does not just search for textual occurrences of parts of the statement, but tries to find webpages which contain the actual statement phrased in natural language. The second source of information for time scopes is the Web of Data itself. Here, we use DBpedia as a source of possible time scopes and devise an algorithm for combining the results extracted from Web documents with those fetched from RDF sources. The algorithm consists of three main steps: First, the evidence extracted from Web documents is matched against a set of relevant time intervals to obtain a significance score for each interval. Second, a small set of the most significant intervals is selected. Finally, the selected intervals are merged, when possible, by considering their mutual temporal relations. The set of disconnected intervals [11] returned by the algorithm defines the temporal scope of the fact. We also propose two normalization strategies that can be applied to the data extracted from Web documents before running the algorithm, to account for the significance of dates appearing in the documents corroborating the input fact.

This article makes the following contributions:

  • We present an approach for modeling a space of relevant time intervals for a fact starting from dates extracted from RDF triples.

  • We devise a three-phase algorithm for temporal scoping, i.e. for mapping facts to sets of time intervals, which integrates the previous steps via matching, selection and merging.

  • We describe two matching methods that consider facts in isolation or cluster them according to the main entity.

  • We provide TISCO, a running prototype and the first system able to provide temporally annotated facts modeled according to a relationship-centric perspective [4].

This article is an extension of the initial description of work in [12]. The main additions are as follows:

  • We describe further alternative solutions in detail, including an additional function in the matching phase of our approach and a normalization function for date occurrences, and present experimental results comparing them.

  • We develop a prototype for annotating facts with temporal information and show how all the different matching functions and their combinations can be integrated in one framework.

The rest of this paper is structured as follows: We give an overview of the state of the art in the relevant scientific areas in Section 2. In Section 3, we define the terminology and notation used in this paper and provide a general overview of our approach and its system infrastructure. In Section 4, we describe how temporal information is extracted from web pages using a temporal extension of the DeFacto algorithm [10]. Section 5 shows how this information can be mapped to a set of time intervals specifying its temporal scope. In Section 6, time intervals are selected according to given criteria and merged when possible. We then evaluate the approach in Section 8, using temporal scopes from Yago2 as a gold standard and facts from DBpedia as input. Finally, we conclude in Section 9 and give pointers to future work.

Section snippets

Related work

The work presented in this paper relies on two areas of research: the extraction of time information and fact checking.

Problem definition

In this section, we first give a technical background about the RDF data model (Section 3.1) and then we provide the problem definition of temporal scoping of facts (Section 3.2).

Temporal evidence extraction

An overview of our approach is given in Fig. 2. The temporal evidence for a given fact f is extracted from unstructured documents (see Section 4.1) and from RDF documents (see Section 4.2), from which a space of possible time intervals relevant to the fact is built. The evidence extracted from unstructured documents is matched against the space of relevant time intervals and, after selection and merging functions are applied, the final set of temporal scopes is associated with the input fact.
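This flow can be sketched end to end. The matching and selection functions below are placeholders passed in as parameters, and only the merging of selected intervals into a set of disconnected intervals is spelled out; this decomposition is our illustrative assumption, not the paper's exact API:

```python
def merge_intervals(selected):
    """Fuse overlapping or meeting intervals into disconnected ones."""
    merged = []
    for start, end in sorted(selected):
        if merged and start <= merged[-1][1]:   # overlaps or meets the last one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def temporal_scope(evidence, intervals, match, select):
    scores = match(evidence, intervals)   # significance score per interval
    chosen = select(scores)               # small set of significant intervals
    return merge_intervals(chosen)        # disconnected intervals = temporal scope
```

For example, `merge_intervals([(2003, 2009), (2005, 2010), (2012, 2014)])` fuses the first two intervals and keeps the third disconnected.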

Matching methods

The matching phase is based on the determination of a family of matching functions that we describe in the following sections. It is clear that in the general case the problem we are trying to solve requires an n : m matching, since a fact can be associated with several time intervals and vice versa. In our previous work [12] we used a local approach, which we summarize in Section 5.1. However, it is also possible to apply a global matching approach, as described in Section 5.2. All functions compute
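As a minimal illustration of a local (per-fact) matching function, the sketch below scores a candidate interval by the total evidence mass of the dates it covers. This particular scoring rule and the evidence format (a map from year to its occurrence count in corroborating Web documents) are assumptions for illustration, not one of the paper's concrete functions:

```python
def local_match(evidence, intervals):
    """evidence: {year: occurrence count extracted from Web documents};
    returns a significance score per candidate interval."""
    return {
        (s, e): sum(c for year, c in evidence.items() if s <= year <= e)
        for s, e in intervals
    }

evidence = {2003: 7, 2005: 4, 2009: 6, 2012: 1}   # hypothetical counts
scores = local_match(evidence, [(2003, 2009), (2009, 2012)])
```

A global variant would instead score all facts sharing the same subject jointly, so that intervals chosen for one fact constrain the others.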

Selection function

Once we have a set of significance matrices SM1, …, SMn, each one associated with a fact f having subject s, we then select the time intervals that might be mapped to the considered facts. We propose two basic selection functions that use the SMs; both functions can select more than one interval to associate with a fact f. The neighbor-x function selects a set of intervals whose significance score is close to the maximum significance score in the SM matrix, up to a certain threshold. In other terms, we
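The neighbor-x selection just described can be sketched as follows; interpreting the threshold x as a fraction of the maximum score, and flattening the significance matrix into a per-interval score map, are our assumptions:

```python
def neighbor_x(scores, x=0.2):
    """scores: {interval: significance score}; select every interval whose
    score is within a fraction x of the maximum, i.e. >= (1 - x) * max."""
    top = max(scores.values())
    return [iv for iv, s in scores.items() if s >= (1 - x) * top]

scores = {(2003, 2009): 17, (2009, 2012): 7, (2003, 2012): 15}  # hypothetical
selected = neighbor_x(scores, x=0.2)
```

With x = 0.2 the cutoff is 13.6, so the intervals scoring 17 and 15 are both selected; this is how the function can associate more than one interval with a fact.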

Web interface

TISCO is a prototype that supports experts in matching problems, allowing them to test their algorithms in a straightforward way. The system already implements the matching functions described in Section 5 and the selection functions described in Section 6. TISCO's features include its extensible architecture, which facilitates the integration of a variety of matching functions, its capability to evaluate and compare matching results, and

Evaluation

This section describes the evaluation of our approach. The aim of the experiment is to show (i) the correctness of our approach by comparing different configurations of normalization, matching and selection functions and (ii) the efficiency and scalability of our approach.

Summary and discussions

This paper studies the problem of determining and mapping time intervals to dynamic facts. We proposed a framework comprising several functions and configuration parameters that can efficiently provide the matching and selection of the set of time intervals that maximizes the effectiveness of our approach. In addition, we proposed a running prototype, TISCO, that supports users in exploring facts with temporal scopes and simplifies the testing of new algorithms for matching and

Acknowledgments

This research has been supported in part by the research grant number 17A209 from the University of Milano-Bicocca and by a scholarship from the University of Bonn.

References (47)

  • Suchanek, F.M. et al., YAGO: A large ontology from Wikipedia and WordNet, J. Web Sem. (2008)
  • Gerber, D. et al., DeFacto - Temporal and multilingual deep fact validation, Web Semant. (2015)
  • T. Käfer, A. Abdelrahman, J. Umbrich, P. O’Byrne, A. Hogan, Observing linked data dynamics, in: ESWC, 2013, pp. ...
  • G. Correndo, M. Salvadores, I. Millard, N. Shadbolt, Linked timelines: Temporal representation and management in linked...
  • J. Umbrich, M. Hausenblas, A. Hogan, A. Polleres, S. Decker, Towards dataset dynamics: Change frequency of linked open...
  • A. Rula, M. Palmonari, A. Harth, S. Stadtmüller, A. Maurino, On the diversity and availability of temporal information...
  • UzZaman, N. et al., TRIPS and TRIOS system for TempEval-2: Extracting temporal information from text
  • Kuzey, E. et al., Extraction of temporal facts and events from Wikipedia
  • D. Hovy, J. Fan, A. Gliozzo, S. Patwardhan, C. Welty, When did that happen?: Linking events and relations to...
  • X. Ling, D.S. Weld, Temporal information extraction, in: 25th AAAI, ...
  • P.P. Talukdar, D.T. Wijaya, T. Mitchell, Coupled temporal scoping of relational facts, in: 5th WSDM, 2012, pp. ...
  • Lehmann, J. et al., DeFacto - Deep fact validation
  • Allen, J.F., Maintaining knowledge about temporal intervals, Commun. ACM (1983)
  • A. Rula, M. Palmonari, A.N. Ngomo, D. Gerber, J. Lehmann, L. Bühmann, Hybrid acquisition of temporal scopes for RDF...
  • O. Alonso, J. Strötgen, R. Baeza-Yates, M. Gertz, Temporal information retrieval: Challenges and opportunities, in: 1st...
  • Campos, R. et al., Survey of temporal information retrieval and related applications, ACM Comput. Surv. (2014)
  • A. Rula, M. Palmonari, A. Maurino, Capturing the age of linked open data: Towards a dataset-independent framework, in:...
  • Mani, I. et al., Robust temporal processing of news
  • Derczynski, L. et al., Information retrieval for temporal bounding
  • C. Gutiérrez, C.A. Hurtado, A.A. Vaisman, Temporal RDF, in: 2nd ESWC, 2005, pp. ...
  • Y. Wang, M. Zhu, L. Qu, M. Spaniol, G. Weikum, Timely YAGO: Harvesting, querying, and visualizing temporal knowledge...
  • Koubarakis, M. et al., Modeling and querying metadata in the semantic sensor web: The model stRDF and the query language stSPARQL
  • Wang, Y. et al., PRAVDA-live: Interactive knowledge harvesting