1 Introduction

It is not often that one can look back and take stock twenty years on, especially when it is by invitation of the very journal that carried this contribution. While retracing memory lane, it has been nice to realize how much the journal itself was a source of inspiration along the road we ultimately took.

To tell how Partial Directed Coherence (PDC) was conceived and some of its natural consequences, we have split the paper into its “pre-history” (Sect. 2) and the questions raised by the techniques then available (Sect. 2.1), followed by PDC’s actual proposition and justification (Sect. 3) and development (Sects. 4 and 6), its key inferential aspects (Sect. 5), and its duality to the directed transfer function (DTF) (Kamiński and Blinowska 1991), which justifies an upgrade to how one must interpret connectivity descriptions (Sect. 7). We also briefly cover some of its variants and its extension to characterizing nonlinear interactions (Sects. 8 and 9). We end with some considerations regarding what may follow (Sect. 10). The exposition here is broad and the reader is asked to examine details in the referenced literature.

2 Prolegomena

It all began when one of us was looking for possible pacemaker candidates for theta rhythms in rat brains. In those days, correlation methods were the ‘de facto’ approach to characterizing one brain structure’s influence over another (Glaser and Ruchkin 1976). The idea is that waveforms measuring brain activity in one structure \(x_1(n)\) would be related to those of another structure \(x_2(n)\) if they looked similar, possibly allowing for some propagation delay that cross-correlation functions can capture. This has a number of shortcomings. Correlation functions are inherently pairwise structure descriptions and often lead to erroneous conclusions if used for network reconstructions, as we examined at some length in Baccalá and Sameshima (2001a).

Furthermore, signal shapes often suffer considerable change under propagation (due to dispersion), which reduces the usefulness of waveform similarity-based characterizations like correlation. Some mitigation can be attained via coherency analysis, the frequency domain counterpart of correlation analysis. In fact, one may show that correlation/coherency descriptions are optimal when one can set the input signal, say \(x_1(n)\), so that the mean-squared error \(\epsilon (n)\) in:

$$\begin{aligned} x_2(n)=h \star x_1(n)+\epsilon (n), \end{aligned}$$
(1)

is minimized for a suitable choice of the shape propagation/distortion filter h(n), where \(\star \) stands for the convolution operator. Use of models like (1) is widespread in engineering.
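As a concrete illustration, the sketch below estimates an FIR version of the filter h(n) in (1) by least squares; the filter taps, noise level and data length are illustrative assumptions, not values from any experiment.

```python
# A minimal sketch of fitting the propagation/distortion filter h(n) of Eq. (1)
# by least squares, assuming a known input x1 and an FIR h of length L.
import numpy as np

rng = np.random.default_rng(0)
n_s, L = 2000, 5
h_true = np.array([0.0, 0.8, 0.4, 0.2, 0.1])   # hypothetical dispersive filter

x1 = rng.standard_normal(n_s)                  # experimenter-set input
x2 = np.convolve(x1, h_true)[:n_s] + 0.1 * rng.standard_normal(n_s)

# Regressor matrix of delayed copies of x1 (one column per filter tap)
X = np.column_stack([np.concatenate([np.zeros(k), x1[:n_s - k]])
                     for k in range(L)])
h_hat, *_ = np.linalg.lstsq(X, x2, rcond=None)
print(np.round(h_hat, 2))                      # approximates h_true
```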

Rather than being defined by experimenters, neural signals of interest are derived from observations, so that applications of Eq. (1) call for ‘a priori’ consideration as to what the input signal is and, lacking that, require a choice based on relative overall signal delay.

We were encouraged by the appearance of an article in Biological Cybernetics  (Schnider et al. 1989) advocating a ‘systems identification’ approach resting on Granger Causality (GC)  (Granger 1969). It treated \(x_1(n)\) and \(x_2(n)\) without the need for predefining which one was the input and had the advantage of allowing feedback detection (Sims 1972).

Briefly, a time series is said to Granger cause another when information from the former can be used to improve prediction of the latter. Because \(x_1(n)\) may G-cause \(x_2(n)\) without \(x_2(n)\) necessarily having to G-cause \(x_1(n)\), one may characterize the direction of information flow. Feedback is present when information travels to and fro.

At this point it is worth bringing in some technical points. In order to make Granger’s proposal operational, one must choose how one is to predict a time series based on another’s past. The only operationally viable approach at the time was to use linear prediction by fitting autoregressive models, where a predictor that only considers its own past

$$\begin{aligned} \hat{\tilde{x}}_1(n)=\sum _{k} \tilde{a}_{11}(k) x_1(n-k), \; k>0 \end{aligned}$$
(2)

is compared to another that includes the other series’s past

$$\begin{aligned} {\hat{x}}_1(n)=\sum _{k} a_{11}(k) x_1(n-k)+\sum _{k} a_{12}(k) x_2(n-k) , \; k>0 \end{aligned}$$
(3)

where the \(\tilde{a}_{ii}(k)\), \(a_{ij}(k)\) coefficients are obtained by minimizing the respective mean-squared prediction errors \(E[|x_1(n)-\hat{\tilde{x}}_1(n)|^2]\) and \(E[|x_1(n)-{\hat{x}}_1(n)|^2]\) which are then compared to see if the addition of the second term in Eq. (3) significantly improves prediction  (Geweke 1982). One may alternatively test for the null hypothesis \(a_{12}(k)=0\) for all k in (3). Similar reasoning applies if \(x_1(n)\) and \(x_2(n)\) are swapped, so that one may efficiently test both GC directions simultaneously through a bivariate autoregressive model for \([x_1(n) \; x_2(n)]^T\):

$$\begin{aligned} \left[ \begin{array}{c} x_1(n)\\ x_2(n) \end{array} \right] =\sum _k \left[ \begin{array}{cc} a_{11}(k) &{} a_{12}(k)\\ a_{21}(k) &{} a_{22}(k) \end{array} \right] \left[ \begin{array}{c} x_1(n-k)\\ x_2(n-k) \end{array} \right] + \left[ \begin{array}{c} w_1(n)\\ w_2(n) \end{array} \right] \end{aligned}$$
(4)

where \(w_i(n)\) stand for innovations processes with covariance matrix

$$\begin{aligned} \varSigma _{\mathbf {w}}=\left[ \begin{array}{cc} \sigma _1^2 &{} \sigma _{12}\\ \sigma _{21} &{} \sigma _2^2 \end{array}\right] \end{aligned}$$
(5)

so that proper testing may proceed by testing for \(a_{ij}(k)=0\) directly.
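In code, comparing (2) and (3) reduces to two least-squares fits and an F-test on the prediction improvement. The sketch below is a minimal illustration on synthetic data in which \(x_1\) G-causes \(x_2\) by construction; the model order and coefficients are assumptions made for the example.

```python
# A minimal sketch of the linear Granger test behind Eqs. (2)-(3): compare the
# restricted predictor (own past only) with the full one via an F-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_s, p = 1000, 2
x1 = rng.standard_normal(n_s)
x2 = np.zeros(n_s)
for n in range(1, n_s):                      # x1 G-causes x2, not vice versa
    x2[n] = 0.5 * x2[n - 1] + 0.7 * x1[n - 1] + rng.standard_normal()

def granger_p_value(target, source, p):
    """p-value of H0: all cross coefficients a_{ts}(k) = 0, k = 1..p."""
    y = target[p:]
    own = np.column_stack([target[p - k:len(target) - k] for k in range(1, p + 1)])
    oth = np.column_stack([source[p - k:len(source) - k] for k in range(1, p + 1)])
    def rss(X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ b) ** 2)
    rss_r, rss_f = rss(own), rss(np.hstack([own, oth]))
    dof = len(y) - 2 * p                     # residual dof of the full model
    F = ((rss_r - rss_f) / p) / (rss_f / dof)
    return 1 - stats.f.cdf(F, p, dof)

print("x1 -> x2:", granger_p_value(x2, x1, p))   # small: H0 rejected
print("x2 -> x1:", granger_p_value(x1, x2, p))   # large: no evidence of GC
```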

Through Schnider et al. (1989) we were also first introduced to the idea of Directed Coherence (DC), aimed at representing GC in the frequency domain, as first discussed in Saito and Harashima (1981)’s hard-to-find paper. This is possible because linear models like (4) allow representing the joint spectral density matrix of the time series as:

$$\begin{aligned} \begin{aligned} {\mathbf {S}}(\nu )={\mathbf {H}}(\nu ){\varvec{\varSigma }}_{\mathbf {w}}{\mathbf {H}^{H}}(\nu ), \end{aligned} \end{aligned}$$
(6)

where \(^H\) is the Hermitian transpose and

$$\begin{aligned} {\mathbf {H}}(\nu )= \left[ \begin{array}{cc} H_{11}(\nu ) &{} H_{12}(\nu )\\ H_{21}(\nu ) &{} H_{22}(\nu ) \end{array} \right] ={\bar{\mathbf {A}}}(\nu )^{-1} \end{aligned}$$
(7)

for

$$\begin{aligned} {\bar{\mathbf {A}}}(\nu )= \left[ \begin{array}{cc} 1-\sum _k a_{11}(k) e^{-{\mathbf {j}}2\pi k \nu } &{} -\sum _k a_{12}(k) e^{-{\mathbf {j}}2\pi k \nu }\\ -\sum _k a_{21}(k) e^{-{\mathbf {j}}2\pi k \nu } &{} 1-\sum _k a_{22}(k) e^{-{\mathbf {j}}2\pi k \nu } \end{array} \right] \end{aligned}$$
(8)

with \({\mathbf {j}}=\sqrt{-1}\) for the normalized frequency \(|\nu |\le 0.5\), so that \(a_{ij}(k)=0\) for all k if and only if \(H_{ij}(\nu )=0\) for all \(\nu \).
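In code terms, (7)–(8) amount to assembling \(\bar{{\mathbf {A}}}(\nu )\) from the fitted coefficients and inverting it; the sketch below assumes the \(a_{ij}(k)\) are stored as a list of lag matrices (written for \(N\times N\) matrices to anticipate the general case).

```python
# A minimal sketch of Eqs. (7)-(8): Abar(nu) from the a_ij(k), and H(nu) by inversion.
import numpy as np

def Abar(A, nu):
    """A: list of N x N coefficient matrices A_1..A_p from a fitted model."""
    N = A[0].shape[0]
    out = np.eye(N, dtype=complex)
    for k, Ak in enumerate(A, start=1):
        out -= Ak * np.exp(-2j * np.pi * k * nu)
    return out

def H(A, nu):
    """Transfer matrix H(nu) = Abar(nu)^{-1}."""
    return np.linalg.inv(Abar(A, nu))
```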

Under this formulation, the coherency between the time series is given by

$$\begin{aligned} \begin{aligned} C_{ij}(\nu )&= \dfrac{S_{ij}(\nu )}{\sqrt{S_{ii}(\nu ) \; S_{jj}(\nu )}} \\ {}&= \dfrac{{\mathbf {h}}_i(\nu )\varSigma _{\mathbf {w}}{\mathbf {h}}^H_j(\nu )}{\sqrt{ ({\mathbf {h}}_i(\nu )\varSigma _{\mathbf {w}}{\mathbf {h}}_i^H(\nu ))({\mathbf {h}}_j(\nu )\varSigma _{\mathbf {w}}{\mathbf {h}}_j^H(\nu ))}} \end{aligned} \end{aligned}$$
(9)

where \({\mathbf {h}}_i(\nu )=[ H_{i1}(\nu ) \; H_{i2}(\nu )]\).

When (9) is fully expanded, the term that carries GC from \(x_j(n)\) to \(x_i(n)\) is represented by:

$$\begin{aligned} DC_{j\rightarrow i} (\nu )=\dfrac{\sigma _iH_{ij}(\nu )}{\sqrt{S_{ii}(\nu )}} \end{aligned}$$
(10)

as it is zero if and only if \(x_j(n)\) does not G-cause \(x_i(n)\) and where \(\sigma _i\) stands for \(w_i(n)\)’s standard deviation. This is essentially Saito & Harashima’s directed coherence.

Though GC/DC allowed pinpointing information flow direction, these descriptions remained essentially restricted to pairwise dynamical observations, which is limiting when trying to describe relationships from a network standpoint for \(N\) simultaneously observed quantities \(x_1(n), \dots , x_N(n)\).

There is no reason why one should be restricted to bivariate models. Indeed, one can generalize (4) to the multivariate time series vector \({\mathbf {x}}(n)=[x_1(n), \dots , x_N(n)]^T\) as:

$$\begin{aligned} {\mathbf {x}}(n)=\sum _k {\mathbf {A}}_k {\mathbf {x}}(n-k)+ {\mathbf {w}}(n) \end{aligned}$$
(11)

so the state of affairs was about to change with Kamiński and Blinowska (1991)’s proposal of DTF, where they used the inverse of (8)’s N-dimensional counterpart to compute the N-dimensional version of (7) whereby connectivity from \(x_j(n)\) to \(x_i(n)\) could be portrayed through:

$$\begin{aligned} DTF_{ij}(\nu )=\dfrac{H_{ij}(\nu )}{\sqrt{\sum _k |H_{ik}(\nu )|^2}}=\dfrac{H_{ij}(\nu )}{\sqrt{{\mathbf {h}}_i(\nu ){\mathbf {h}}_i^H(\nu )}} \end{aligned}$$
(12)

which differs from DC in that it completely dispenses with \({\mathbf {w}}(n)\)’s covariance matrix, even when \(N=2\). Much as in the bivariate case, to compute \(H_{ij}(\nu )\) one had to invert \(\bar{{\mathbf {A}}}(\nu )\), whose entries are given by:

$$\begin{aligned} \bar{A}_{ij}(\nu )=\left\{ \begin{array}{l} 1-\sum _k a_{ij}(k)e^{-{\mathbf {j}}2\pi \nu k},\;\text {if}\;\;i=j \\ \;\; \; -\sum _k a_{ij}(k)e^{-{\mathbf {j}} 2\pi \nu k},\;\text {otherwise} \end{array} \right. \end{aligned}$$
(13)

for \({\mathbf {h}}_i(\nu )=[ H_{i1}(\nu ), \dots , H_{iN}(\nu )]\) in the now more general case.
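As a sketch, once \({\mathbf {H}}(\nu )\) is available, the DTF of (12) is just a row-wise normalization; the routine below rebuilds \(\bar{{\mathbf {A}}}(\nu )\) from assumed VAR lag matrices and returns the whole DTF matrix at one frequency.

```python
# A minimal sketch of the DTF of Eq. (12) at a single normalized frequency nu.
import numpy as np

def dtf(A, nu):
    N = A[0].shape[0]
    Ab = np.eye(N, dtype=complex)
    for k, Ak in enumerate(A, start=1):
        Ab -= Ak * np.exp(-2j * np.pi * k * nu)
    Hm = np.linalg.inv(Ab)
    # row-wise normalization: entry (i, j) is H_ij / sqrt(sum_k |H_ik|^2)
    return Hm / np.sqrt(np.sum(np.abs(Hm) ** 2, axis=1, keepdims=True))
```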

Multivariate time domain GC representation had also progressed. Noteworthy was the work of Geweke (1984), who selected the i and j series of interest and built linear models conditioned on the other series to characterize causality. Another approach was due to Lütkepohl, who put forward a time domain null hypothesis test based on rejecting \(a_{ij}(k)=0\), as these coefficients describe \(x_j(n)\)’s past influence on \(x_i(n)\) (Lütkepohl 1993). At that time, Lütkepohl’s approach seemed better not only because it was computationally simpler but also because it possessed theoretically rigorous asymptotics.

2.1 Goals and paradoxes

Though we examined a slightly modified version of DTF:

$$\begin{aligned} \gamma _{ij}(\nu )=\dfrac{\sigma _j H_{ij}(\nu )}{\sqrt{\sum _k \sigma ^2_k |H_{ik}(\nu )|^2}}, \end{aligned}$$
(14)

which we termed Directed Coherence, as it is closer to Saito & Harashima’s definition through the use of \(w_i(n)\)’s variances (Saito and Harashima 1981), we showed its occasional inconsistencies with Lütkepohl’s approach. The paradox was that, for simulated examples of known structure, the graph theoretical representations obtained using DTF/DC contained more directed arrows than the original models, whereas Lütkepohl’s approach was always able to correctly spot the theoretically imposed connections (Baccalá et al. 1998).

The answer clearly rested on the fact that, when \(N>2\), \(H_{ij}(\nu )\) is not just a function of \(\bar{A}_{ij}(\nu )\) but depends on other entries of \(\bar{{\mathbf {A}}}(\nu )\) as well. What was needed was a reasonable descriptor based on \(\bar{{\mathbf {A}}}(\nu )\) to match Lütkepohl’s approach.

3 The benefits of looking widely

To have this intuition is one thing, to prove where it comes from quite another. Oddly, because of the similarity of the formulas in the \(N=3\) case for partial coherence (Gersch 1972), we found our general answer and method of proof in a book on differential geometry (Burke 1985), wherefrom it became clear that we could write another frequency domain quantity of interest in the \(N>3\) case as well. This quantity, partial coherency, is directly based on the \(a_{ij}(k)\) and equals:

$$\begin{aligned} \kappa _{ij}(\nu )=\dfrac{\bar{{\mathbf {a}}}^H_i(\nu ){\varSigma }_{\mathbf {w}}^{-1}\bar{{\mathbf {a}}}_j(\nu )}{\sqrt{(\bar{{\mathbf {a}}}^H_i(\nu ){\varSigma }_{\mathbf {w}}^{-1}\bar{{\mathbf {a}}}_i(\nu ))(\bar{{\mathbf {a}}}^H_j(\nu ){\varSigma }_{\mathbf {w}}^{-1}\bar{{\mathbf {a}}}_j(\nu ))}}, \end{aligned}$$
(15)

as shown in  Baccalá (2001), where \(\bar{{\mathbf {a}}}_i(\nu )\) stands for the i-th column of \(\bar{{\mathbf {A}}}(\nu )\).

At this point, partial coherency/coherence deserves a brief recap. It amounts to the ordinary coherency between residuals, i.e. the part of a time series that is left after the effect of the remaining series is subtracted:

$$\begin{aligned} \eta _i(n)=x_i(n)-{\mathfrak {P}}_{i}(n), \end{aligned}$$
(16)

where \({\mathfrak {P}}_{i}(n)\) is the optimum linear predictor of \(x_i(n)\) using the other series, so that \(\kappa _{ij}(\nu )\) is just the coherency between the resulting \(\eta _i(n)\) and \(\eta _j(n)\). It is common to see partial coherence computed as a ratio of suitably chosen minors from a normalized spectral density matrix that describes the pairwise coherencies when \(N>3\). Minor determinants, however, are neither needed nor computationally efficient if a multivariate autoregressive model is available, for the model already automatically encodes (16) under its hood. Furthermore, this also avoids numerical error propagation.
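Reflecting this, here is a minimal sketch of (15) computed straight from the model, with no minors involved; the lag matrices and innovations covariance are assumed to come from a fitted multivariate autoregressive model.

```python
# A minimal sketch of the partial coherency of Eq. (15) at one frequency nu.
import numpy as np

def partial_coherency(A, Sigma_w, nu):
    N = A[0].shape[0]
    Ab = np.eye(N, dtype=complex)
    for k, Ak in enumerate(A, start=1):
        Ab -= Ak * np.exp(-2j * np.pi * k * nu)
    # Gram matrix of Abar's columns: G_ij = abar_i^H Sigma_w^{-1} abar_j
    G = Ab.conj().T @ np.linalg.inv(Sigma_w) @ Ab
    d = np.sqrt(np.real(np.diag(G)))
    return G / np.outer(d, d)
```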

Close examination of (15) reveals its formal similarity to the second equality in (9) leading naturally to partial directed coherence’s definition in Baccalá and Sameshima (2001b) as:

$$\begin{aligned} \pi _{ij}(\nu )=\dfrac{\bar{A}_{ij}(\nu )}{\sqrt{\sum _{k=1}^N|\bar{A}_{kj} (\nu )|^2}} =\dfrac{\bar{A}_{ij}(\nu )}{\sqrt{\bar{{\mathbf {a}}}_j^H(\nu )\bar{{\mathbf {a}}}_j(\nu )}} \end{aligned}$$
(17)

obtained by ditching terms involving \(\sigma _i\)’s as had been done for DTF’s introduction, where \(\bar{{\mathbf {a}}}_j(\nu )\) is the j-th column of \(\bar{{\mathbf {A}}}(\nu )\). Note that, when \(N=2\), DTF and PDC are numerically equal when \(i \ne j\).
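For reference, a minimal sketch of (17); as before, the VAR lag matrices are assumed given, and the whole PDC matrix is returned for one frequency.

```python
# A minimal sketch of the PDC of Eq. (17): column-wise normalization of Abar(nu).
import numpy as np

def pdc(A, nu):
    N = A[0].shape[0]
    Ab = np.eye(N, dtype=complex)
    for k, Ak in enumerate(A, start=1):
        Ab -= Ak * np.exp(-2j * np.pi * k * nu)
    # column-wise normalization: entry (i, j) is Abar_ij / sqrt(abar_j^H abar_j)
    return Ab / np.sqrt(np.sum(np.abs(Ab) ** 2, axis=0, keepdims=True))
```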

4 Sunspots in your life

By examining the relationship between solar activity and melanomas (Baccalá et al. 2007) we realized that DTF/PDC failed at exposing the correct causal interaction, in contrast to Lütkepohl’s approach. At the time, we were applying the ‘ad hoc’ criterion that values of \(|\pi _{ij}(\nu )|^2>0.1\) were not to be disregarded. As it turned out, the problem was mainly elsewhere, as we realized that DC (14) actually worked. The main trouble lay in DTF/PDC’s lack of scale invariance. This led to the introduction of generalized partial directed coherence (gPDC):

$$\begin{aligned} _g\pi _{ij}(\nu )=\dfrac{\bar{A}_{ij}(\nu )/\sigma _i}{\sqrt{\sum _{k=1}^N\dfrac{1}{\sigma ^2_k}|\bar{A}_{kj} (\nu )|^2}} \end{aligned}$$
(18)
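A minimal sketch of (18) follows; it differs from the PDC sketch above only by scaling each row of \(\bar{{\mathbf {A}}}(\nu )\) by the corresponding innovations standard deviation before normalizing.

```python
# A minimal sketch of the gPDC of Eq. (18) at one frequency nu.
import numpy as np

def gpdc(A, sigma2, nu):
    """sigma2: length-N vector of innovations variances sigma_i^2."""
    N = A[0].shape[0]
    Ab = np.eye(N, dtype=complex)
    for k, Ak in enumerate(A, start=1):
        Ab -= Ak * np.exp(-2j * np.pi * k * nu)
    W = Ab / np.sqrt(np.asarray(sigma2))[:, None]   # W_ij = Abar_ij / sigma_i
    return W / np.sqrt(np.sum(np.abs(W) ** 2, axis=0, keepdims=True))
```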

A possible quick fix for the original DTF/PDC was to normalize the data by its variance prior to analysis, but this meant larger estimation variability (Baccalá et al. 2007). It is interesting to note that, when Baccalá and Sameshima (2001b) was submitted to Biological Cybernetics, we decided to simplify PDC to (17); a rejected previous version already contained (18) as a possibility. There, it was called PDC with a ‘statistical metric’. This was a simplification of \(\bar{{\mathbf {a}}}^H_j(\nu ){\varSigma }_{\mathbf {w}}^{-1}\bar{{\mathbf {a}}}_j(\nu )\), which looks suspiciously like a metric for the \(\bar{{\mathbf {a}}}_j(\nu )\) vectors. PDC’s (17) metric was referred to as ‘Euclidean’. This nomenclature still survives in our software package.

The use of the full \({\varSigma }_{\mathbf {w}}^{-1}\) weighting term (information metric) led to the information PDC (iPDC) discussed in Sect. 6.

5 Some serious stats

It is perhaps reassuring that, when we replaced the ‘ad hoc’ \(|\pi _{ij}(\nu )|^2>0.1\) criterion to reject the null hypothesis

$$\begin{aligned} H_0: |\pi _{ij}(\nu )|^2=0 \end{aligned}$$
(19)

by a rigorously obtained threshold, the apparently wrong Sunspot-Melanoma causality of the original PDC was mended (Sameshima et al. 2014a).

Historically, we were working on \(H_0\) inference when Schelter et al. (2006) provided a workable solution. This spurred us to further contribute PDC asymptotic confidence intervals in Takahashi et al. (2007).

The generality of the approach in  Takahashi et al. (2007) allowed including gPDC and iPDC within a single mathematical framework  (Baccalá et al. 2013).

To achieve this we noticed that all PDC forms can be written as a ratio:

$$\begin{aligned} |\pi _{ij}(\nu )|^2=\pi (\varvec{\theta },\nu )=\frac{\pi _n(\varvec{\theta },\nu )}{\pi _d(\varvec{\theta },\nu )}, \end{aligned}$$
(20)

where \(\varvec{\theta }\) refers to a vector of all the \(a_{ij}(k)\) and \(\sigma _{ij}\) parameters according to the PDC of choice. Thus

1.

    The ratio (20) is asymptotically normal:

    $$\begin{aligned} \sqrt{n_s} (\left| {\widehat{\pi }}_{ij}(\nu )\right| ^2-\left| \pi _{ij}(\nu )\right| ^2)\rightarrow {\mathcal {N}}(0,\gamma ^2(\nu )), \end{aligned}$$
    (21)

    where \(\gamma ^2(\nu )\) is a frequency dependent variance whose details depend on the PDC brand of interest  (Baccalá et al. 2013) and where \(n_s\) is the number of data points used.

2.

    Under hypothesis (19), (20) is distributed as

    $$\begin{aligned} n_s\; \pi _d(\varvec{\theta },\nu )\; (\left| {\widehat{\pi }}_{ij}(\nu )\right| ^2-\left| \pi _{ij}(\nu )\right| ^2) {\mathop {\rightarrow }\limits ^{d}} \sum _{k=1}^{q} l_k(\nu ) \chi ^2_1, \end{aligned}$$
    (22)

    whose \(l_k(\nu )\) weights depend only on the numerator \(\pi _n(\varvec{\theta },\nu )\), while denominator dependence enters only through the left-hand side of (22).

These results allow computing objective frequency dependent decision threshold values and confidence intervals that formerly were only possible through computer intensive resampling approaches  (Baccalá et al. 2006).
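For illustration, given estimated weights \(l_k(\nu )\) at one frequency, the null threshold implied by (22) can be approximated by plain Monte Carlo sampling of the chi-square mixture; the weight values below are placeholders, and the details of estimating them are in the cited papers.

```python
# A sketch of a decision threshold from Eq. (22): the (1 - alpha) quantile of
# sum_k l_k * chi2_1, approximated by Monte Carlo (weights l_k assumed given).
import numpy as np

def mixture_threshold(l_k, alpha=0.05, n_draws=100_000, seed=0):
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(1, size=(n_draws, len(l_k))) @ np.asarray(l_k)
    return np.quantile(draws, 1 - alpha)

# hypothetical weights at some frequency; thresholds n_s * pi_d * |pdc_hat|^2
print(mixture_threshold([0.7, 0.3]))
```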

In the Sunspot-Melanoma case, as few as \(n_s=37\) data points suffice for adequate inference. In practice, inference performance is not so simple to characterize as it varies with N and also is a function of the spectra of the time series involved.

Much credit for these developments is due to our former student Dr. Daniel Y. Takahashi  (Takahashi 2009). Application of the same approach also allowed us to deduce the allied results  (Baccalá et al. 2016) for DTF/DC and its information extension iDTF  (Takahashi et al. 2010) (see Sect. 6). They have the same general form as those for PDC.

The use of (22) and (21) respectively addresses two distinct problems: (a) Connectivity Detection and (b) Connectivity Characterization.

6 Information interpretation

Now back to the information metric. We were faced with the need to consider whether

$$\begin{aligned} _\iota \pi _{ij}(\nu ) =\dfrac{\bar{A}_{ij}(\nu )/\sigma _i}{\sqrt{\bar{{\mathbf {a}}}^H_j(\nu ){\varSigma }_{\mathbf {w}}^{-1}\bar{{\mathbf {a}}}_j(\nu )}} \end{aligned}$$
(23)

had any sensible meaning at all.

To see it we must recap what (11) means. It represents \({\mathbf {x}}(n)\), its relationships, characteristics and evolution, as drawn from all conceivable possibilities provided by the purely random innovations process \({\mathbf {w}}(n)\), whose white, amorphous nature is shaped by the filter represented by the \({\mathbf {A}}_k\) matrices in (11).

However, as we have seen, without reference to how \({\mathbf {x}}(n)\) may have been constructed, we can always obtain each \(\eta _i(n)\) through (16), leading to an equivalent representation via the process \(\varvec{\eta }(n)=[\eta _1(n),\dots , \eta _N(n)]^T\), which has the same intrinsic information, only that this time it is unveiled differently through the partial coherence (15) that relates the \((x_i(n),x_j(n))\) pair after deducting the effect of the other series. In signal space terms, the \({\mathbf {x}}(n)\) time series representation is reciprocal to \(\varvec{\eta }(n)\) (and vice-versa). They describe the same phenomenon.

With these ideas in mind, also in Biological Cybernetics (Takahashi et al. 2010), we showed that (23) was a form of coherency, only this time linking \(\eta _j(n)\) to \(w_i(n)\). Consequently, under suitable hypotheses, this translates into the mutual information rate between \(\eta _j(n)\) and \(w_i(n)\) via:

$$\begin{aligned} {\text{ MIR }}(w_i,\eta _j) =-\frac{1}{2}\int ^{0.5}_{-0.5}\log (1-|{}_\iota \pi _{ij}(\nu )|^2)\;d\nu . \end{aligned}$$
(24)

The process of obtaining \(\varvec{\eta }(n)\) from \({\mathbf {x}}(n)\) is called partialization. It can be applied to \({\mathbf {w}}(n)\) as well, and leads to its reciprocal (partialized) \(\varvec{\zeta }(n)\) process so that

$$\begin{aligned} _\iota \gamma _{ij}(\nu )=\dfrac{\sigma _j H_{ij}(\nu )}{\sqrt{{\mathbf {h}}_i(\nu )\varSigma _{\mathbf {w}}{\mathbf {h}}_i^H(\nu )}}, \end{aligned}$$
(25)

that we named information DTF (iDTF) is the coherency between \(\zeta _j(n)\) and \(x_i(n)\) which translates naturally to a measure of their mutual information rate:

$$\begin{aligned} {\text{ MIR }}(x_i,\zeta _j) =-\frac{1}{2}\int ^{0.5}_{-0.5}\log (1-|{}_\iota \gamma _{ij}(\nu )|^2)\;d\nu . \end{aligned}$$
(26)

Here the trick is that iDTF/iPDC describe relationships between partialized and original processes with different indices. The characterization of strict information flow directionality is possible because swapping i for j involves different base quantities, i.e. we go from relating \(\eta _j\) to \(w_i\) to relating \(\eta _i\) to \(w_j\) in the iPDC case (and x and \(\zeta \) for iDTF).
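Numerically, (24) and (26) are the same computation; here is a sketch, assuming the squared iPDC (or iDTF) magnitude has already been evaluated on a grid of normalized frequencies.

```python
# A minimal sketch of the mutual information rate of Eqs. (24)/(26) by
# numerical (trapezoidal) integration over the normalized frequency axis.
import numpy as np

def mir(pi2, nus):
    """pi2: |iPDC|^2 (or |iDTF|^2) sampled at the frequencies in nus."""
    integrand = -0.5 * np.log(1.0 - np.asarray(pi2))
    return np.trapz(integrand, np.asarray(nus))

# usage sketch: mir(values_on_grid, np.linspace(-0.5, 0.5, 512))
```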

These developments were an integral part of Dr. Takahashi’s Doctoral Thesis  (Takahashi 2009). A simplified exposition of these ideas can be found in  (Takahashi et al. 2014).

7 Duality and the need for new concepts

At this point a couple of numerical illustrations may be worthwhile.

Example 1

Consider data at time n to be generated by

$$\begin{aligned} \left\{ \begin{aligned} x_1(n) =&-0.95\;x_2(n-1)+w_1(n)\\ x_2(n) =&\ \ 0.95\;x_1(n-1)+x_3(n-1)+w_2(n)\\ x_3(n) =&-0.9025\;x_3(n-2)+w_3(n) \end{aligned} \right. \end{aligned}$$
(27)

where \(w_i(n)\) are zero mean independent Gaussian innovations of unit variance. iPDC and iDTF computations are shown respectively in Fig. 1a and b, displayed in the ‘standard’ position from ‘Source’ (column) to ‘Target’ (row). A graphical summary is contained in Fig. 2, which contrasts the theoretical structure of (27) with that derived from Fig. 1, whereby it is clear that iPDC mirrors the theoretical structure while iDTF has an additional arrow whose very existence was the drive for looking for PDC in the first place. This extra arrow’s role is to state that there is a signal pathway for \(x_3(n)\) to reach \(x_1(n)\) even though they are not immediately attached: \(x_3(n)\)’s action onto \(x_1(n)\) takes place as its signal must first travel through \(x_2(n)\).
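For readers wishing to reproduce this example, a minimal simulation sketch of (27) follows; feeding the generated data to a VAR estimator and then to PDC/DTF routines like those sketched earlier yields pictures analogous to Fig. 1 (the seed and burn-in length are arbitrary choices).

```python
# A minimal sketch simulating the toy model of Eq. (27).
import numpy as np

rng = np.random.default_rng(2)
n_s, burn = 2000, 200
x = np.zeros((3, n_s + burn))
w = rng.standard_normal((3, n_s + burn))
for n in range(2, n_s + burn):
    x[0, n] = -0.95 * x[1, n - 1] + w[0, n]
    x[1, n] = 0.95 * x[0, n - 1] + x[2, n - 1] + w[1, n]
    x[2, n] = -0.9025 * x[2, n - 2] + w[2, n]
x = x[:, burn:]           # drop the transient; x is ready for VAR fitting
```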

Fig. 1

iPDC (a) and iDTF (b) for (27) in the ‘standard’ position, where the array plot’s main diagonal contains the spectral density estimates of \(x_i(n)\). The graphs also display the \(\alpha =5\%\) significance threshold and the allied confidence intervals for significant values. \(n_s=2000\) data points were used in the computations

Example 2

Next consider the low pass source signal \(x_1(n)\) as directly attached to \(x_3(n)\) through a bandpass filter represented by u(n). There is another pathway from \(x_1(n)\) towards \(x_3(n)\) that goes through the intermediate step of \(x_2(n)\), whose dynamics is specially crafted to mirror the delay pattern imposed by the u(n) filter. This scenario is simulated via:

$$\begin{aligned} \left\{ \begin{aligned} x_1(n) =&\ 0.75\;x_1(n-1)+w_1(n)\\ x_2(n) =&-0.64\;x_2(n-2)+x_1(n-1)+w_2(n)\\ x_3(n) =&-0.5\;x_3(n-1)+x_2(n-1)+a u(n-1) \\&+w_3(n)\\ u(n) =&-0.64\;u(n-2)+x_1(n-1) \end{aligned} \right. \end{aligned}$$
(28)

with \(w_i(n)\) as in Example 1. When \(a=-1\), even though \(x_1(n)\) is ‘physically’ attached to \(x_3(n)\), it has no ultimate ‘influence’ on \(x_3(n)\). Is this too artificial? This actually happens in the real world, e.g. in multipath cell phone transmission propagation that induces communication loss (Stallings 2005).
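The cancellation can also be checked analytically: both pathways from \(x_1(n)\) to \(x_3(n)\) share the transfer function \(z^{-1}/(1+0.64z^{-2})\), so their summed contribution to \(x_3(n)\)'s equation vanishes identically when \(a=-1\). A small numerical sketch of this bookkeeping:

```python
# A sketch verifying Example 2's cancellation: with a = -1 the two pathways
# from x1 to x3 in (28) carry identical transfer functions of opposite sign.
import numpy as np

def net_transfer(a, nu):
    z1 = np.exp(-2j * np.pi * nu)             # z^{-1} on the unit circle
    G = z1 / (1 + 0.64 * z1 ** 2)             # common x1 -> {x2, u} dynamics
    return (1 + a) * G * z1 / (1 + 0.5 * z1)  # both paths summed into x3

nus = np.linspace(-0.5, 0.5, 7)
print(np.round(np.abs(net_transfer(-1.0, nus)), 12))  # identically zero
print(np.round(np.abs(net_transfer(+1.0, nus)), 3))   # non-zero otherwise
```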

As before, the structure of (28) (Fig. 3a) is confirmed by the iPDC results (Fig. 3b), summarized by the diagram in Fig. 3c, whereas \(x_1(n)\)’s influence cancellation over \(x_3(n)\) is evident in Fig. 3d and its allied diagram (Fig. 3e).

These examples show that, when \(N>2\), PDC and DTF highlight distinct complementary aspects of dynamic connectivity. These differences led us to introduce the concepts of Granger Connectivity and Granger Influentiability to refer respectively to what PDC and DTF portray (Baccalá and Sameshima 2014).

In conclusion, PDC centers on describing the direct attaching connections, whereas DTF centers on the overall effect. Both descriptions are informative and valuable in providing a full picture of ‘connectivity’. Alone, neither is complete. We often refer to PDC and DTF as dual representations.

Fig. 2

Comparison between the theoretical structure behind the signal generating model (27) (a) and the schematic representations derived from the computed iPDC (b) and iDTF (c) values. The extra iDTF arrow is readily apparent

8 Theme and variations

One could say that PDC’s main theme is represented by Equation (13). All else, the original PDC itself, gPDC and iPDC included, are mainly variations, even though iPDC’s interpretation is deeper. This claim as to variations is certainly true in at least two ways: (a) alternative methods to characterize what we called Granger Connectivity, which include those of Schelter et al. (2009) and Plomp et al. (2014), obtained by weighting \(\bar{A}_{ij}(\nu )\) differently, and (b) its use as a weighting factor in other estimators (Korzeniewska et al. 2003).

Ideally each such alternative deserves consideration. However, to avoid too shallow a comparison and thereby risk unfairness, we think it is best to say we are happy that the introduction of PDC has led to many characterization alternatives whose common property is that their nullity is synonymous with lack of Granger Connectivity as defined above. Methods, however, differ in both computational and statistical performance.

The core reason for this is that \(\bar{A}_{ij}(\nu )\) is the transfer function linking \(\eta _j(n)\), the j-th partialized process, to \(w_i(n)\), the i-th innovations process accounting for the underlying observed stochastic variability. This result is contained in Takahashi et al. (2010)’s Appendix.

A similar role is played by \(H_{ij}(\nu )\) in producing \(\bar{A}_{ij}(\nu )\)’s dual to characterize what we called Granger Influentiability, whose many weighting variations lead to different interpretative details (Korzeniewska et al. 2003).

Here one might also ask about PDC’s place among the host of other connectivity estimation methods. If we exclude correlation-based ones and consider only methods that expose connection directionality, it is related to Geweke’s approach (Geweke 1984), which has lately been refined by Barrett and Seth (2010). They represent different ways of subtracting the effect of other observed time series upon the relationship between i and j. PDC’s procedure is, however, very natural and computationally efficient due to its link to partial coherence. Whereas both methods are applicable to rejecting Granger Connectivity, our numerical experiments under controlled conditions suggest a good theoretical fit for PDC asymptotics in detecting connectivity when contrasted to other approaches (Sameshima et al. 2014b).

‘Saameeshiima! What is the (connection) force?’ was a question by Prof. György Böhm, a former boss. He was not alone in asking. We called it the Connection ‘Characterization’ problem. The definitive answer is not so easy, because there are many connection characterization methods, linear and nonlinear (Sameshima and Baccalá 2014), that differ in how they account for the effect of other series, each in their own specific way.

PDC, through its frequency domain description of connection details, qualifies as a contender to represent the ‘force’ of Granger Connectivity. Its frequency domain character highlights spectral bands where signal transmission is absent even when overall Granger Connectivity may be present due to other frequencies. A good example of this is discussed in Example 7.3 from Sameshima et al. (2014a), whose notch frequency character comes out clearly. This stresses PDC’s advantage over Lütkepohl’s approach, which had been proposed as a time domain test for Granger Causality, thereby addressing only the ‘Detection’ problem without any frequency domain qualification.

In comparison to other approaches, what one can safely say is that different methods behave much like different units to gauge a given physical quantity. If you measure temperature and start with Celsius, you’d better stick to it and not use Fahrenheit, even though paying homage to your wife for her warmth may sound good in the name of domestic accord. The Cartesian precision behind PDC and DTF and their rigorous asymptotic variance results (Baccalá et al. 2013, 2016) allows easy and reliable comparison of significant changes under different experimental conditions.

Fig. 3

Example 2’s theoretical diagram (a) with its allied iPDC results (b) and resulting link diagram (c). iDTF results are shown in (d) with the allied link diagram in (e). Gray shades represent the \(95\%\) confidence intervals associated with significant values for \(\alpha =5\%\), i.e. when estimated values are above the frequency dependent decision thresholds for \(n_s=400\) observed data points

9 What about nonlinearity?

Linear modelling through (11) lies at the core of PDC’s and DTF’s proposals. It is this very linearity that ensures their frequency domain interpretation. Many authors have considered methods to grapple with nonlinear systems. Here the reader should be forewarned, though: many papers containing proposals for dealing with nonlinearity suffer from the problem of providing simulated examples composed of nonlinear subsystems that interact linearly, whose connectivity is easily and correctly represented by PDC. Even bilinear interactions can be detected by PDC, though again with no assurance of preset \(\alpha \) confidence levels.

One type of interaction that linear methods are entirely unable to capture is that involving even interaction functions: an \(x^2_1(n-1)\) term affecting \(x_2(n)\)’s equation is the simplest such example (Massaroppe et al. 2011). But even in this case, through Reproducing Kernel transformations, one can easily deal with the connectivity detection problem. Moreover, thankfully, because of the analogous mathematics, one may even use the very same estimation framework to define appropriate Kernel GC, PDCs and DTFs, for which the same general asymptotic forms hold (Massaroppe and Baccalá 2015, 2019). This is attractive because the required number \(n_s\) of data points for inference is much smaller than that of other competing nonlinear methods. In fact, Granger Connectivity Detection becomes feasible even without having to solve the challenging pre-image problem (Massaroppe and Baccalá 2019) required for dynamic prediction.
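As a toy illustration of this point (not the Reproducing Kernel machinery itself, whose details are in the cited papers), the sketch below shows a squared interaction that a purely linear regressor set misses but a nonlinearly augmented one exposes; all coefficients are assumptions made for the example.

```python
# A toy example: an even interaction x1^2(n-1) -> x2(n) is invisible to a
# linear regression on past values but is exposed by an augmented regressor.
import numpy as np

rng = np.random.default_rng(3)
n_s = 5000
x1 = rng.standard_normal(n_s)
x2 = np.zeros(n_s)
for n in range(1, n_s):
    x2[n] = 0.5 * x2[n - 1] + 0.8 * x1[n - 1] ** 2 + 0.5 * rng.standard_normal()

y = x2[1:]
lin = np.column_stack([x2[:-1], x1[:-1]])               # linear past only
aug = np.column_stack([x2[:-1], x1[:-1], x1[:-1] ** 2]) # add the even term

def rss(X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

print("linear RSS   :", round(rss(lin), 1))  # x1's linear past barely helps
print("augmented RSS:", round(rss(aug), 1))  # the squared term explains x2
```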

10 What now? A sum up and a wishlist

Any set of \(x_i(n)\) signals is amenable to PDC/DTF description. This makes it a universal decomposition of the set’s intrinsic relationships. Technically, the spectral factorization behind (11) may be more or less difficult to carry out, but there are options (Wilson 1972). Further systematic investigation is under way.

In fact, the lack of proficiency of practical researchers with multivariate time series modelling constitutes a challenge to the translation of these methods to clinical practice, something lying at the top of our wishlist. We would also like to see PDC used beyond the neuroscience that inspired it, and this is slowly beginning (Raphaldini et al. 2021).

We hope this retrospective helps overcome the educational roadblock to PDC/DTF employment by retelling their history and our motivation. Naturally, further education about the ever evolving world of time series techniques is desirable. Things like employing state space estimation methods and time varying approaches are examples of interesting practical developments (Astolfi et al. 2008; Sommerlade et al. 2014). Further analysis of all the latter approaches would take us too far afield here.

For now we are happy that, at only twenty years of age, PDC has attracted more attention than one could wish. Surely people still feel more comfortable with correlation-based ideas, which have been around for at least 100 years now, and are only becoming enthused with Granger causality (Granger 1969), at roughly half that age, though GC still induces heated debate given the conceptually loaded nature of the word ‘causality’, which (sells but) evokes a number of different and sometimes conflicting ideas (Baccalá and Sameshima 2014; Pearl 2000; Spirtes et al. 2001; Valdés-Sosa et al. 2011). No wonder that here, too, there is still a long way to go.

Another set of challenges comes from the care needed to preserve the information of interest in the signals. Things like how power line interference should be removed  (Baccalá and Sameshima 2015) or the choice of sampling frequency can impact estimate reliability and inference via models like (11). The same is true of artifact removal algorithms. Currently, there is no satisfactory inventory of those challenges.

From our initial goal of looking into a data driven tool to help reverse engineer brain dynamical relationships, we realized that reliable inferential results were central, as all too often researchers judge methods based on confirmatory bias, i.e. their preconceptions as to how the brain works. This is what moved us to stress statistically robust gauging methods, which are at their most interesting when they provide evidence that challenges preconceived ideas, and that is what we hope these developments can furnish.

Ideally, brain dynamical reverse engineering requires the availability of neural signals in their purest form. Scalp EEG, because it reflects mixtures (at the sensor space) of deeper brain signal sources (the source space), can be misleading (the so-called volume conduction problem), calling for methods to address it. Whereas PDC/DTF decompositions are still valid at the sensor space, one can expect difficulties recovering source space connectivity. Field inversion methods have been successfully applied in addressing this via PDC applied to source space reconstructed data (Amaral et al. 2017). A modified PDC version has also been proposed and evaluated (Omidvarnia et al. 2014).

Another clear challenge stems from the need for dimension reduction. Adequate dynamical characterization requires that as many representative signals as possible be included, yet there are practical limits to how many simultaneous time series models like (11) can handle for a given number \(n_s\) of time observations. To cope with this, sparse time series modelling (Valdés-Sosa et al. 2005) and spatial wavelet preprocessing (in fMRI) have been used (Vieira et al. 2014, 2015).

In conclusion, there is still a lot to be done. Long live PDC, its kin and offspring!