The Shortest Duration Constrained Hidden Markov Model: Data denoise and forecast optimization on the country-product matrix for the Fitness-Complexity Algorithm

Pengcheng Song; Xiangyu Zong; Ximing Chen; Qin Zhao; Lubingzhi Guo

doi:10.1371/journal.pone.0253845

Abstract

The Economic Fitness Index describes industrial completeness and comprehensively reflects product diversification with competitiveness and product complexity in production globalization. The Fitness-Complexity Algorithm offers a scientific approach to predicting GDP and obtains fruitful results. As a recursion algorithm, the non-linear iteration processes give novel insights into product complexity and country fitness without noise data. However, the Country-Product Matrix and Revealed Comparative Advantage data have abnormal noises which contradict the relative stability of product diversity and the transformation of global production. The data noise entering the iteration algorithm, combined with positively related Fitness and Complexity, will be amplified in each recursion step. We introduce the Shortest Duration Constrained Hidden Markov Model (SDC-HMM) to denoise the Country-Product Matrix for the first time. After the country-product matrix test, the country case test, the noise estimation test and the panel regression test of national economic fitness indicators to predict GDP growth, we show that the SDC-HMM could reduce abnormal noise by about 25% and identify change points. This article provides intra-sample predictions that theoretically confirm that the SDC-HMM can improve the effectiveness of economic fitness indicators in interpreting economic growth.

Citation: Song P, Zong X, Chen X, Zhao Q, Guo L (2021) The Shortest Duration Constrained Hidden Markov Model: Data denoise and forecast optimization on the country-product matrix for the Fitness-Complexity Algorithm. PLoS ONE 16(7): e0253845. https://doi.org/10.1371/journal.pone.0253845

Editor: Jie Zhang, Newcastle University, UNITED KINGDOM

Received: August 26, 2020; Accepted: June 14, 2021; Published: July 26, 2021

Copyright: © 2021 Song et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The export data are from the UN COMTRADE database (https://comtrade.un.org/), which provides an API for downloading the data. In addition, all relevant data are within the paper and its Supporting Information files.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Economic data are fundamental to economic research. Data values determine the accuracy of economic analysis in two key ways: first, the simpler the data composition an analysis requires, the more its results depend on the quality of those data; second, the fewer the value categories the data can take, the more crucial denoising becomes for data quality. This work selects economic fitness and complexity analysis, whose sole data input is the country-product matrix, a binary matrix whose entries take only the values 0 and 1. The country-product matrix is derived from the Revealed Comparative Advantage (RCA), whose raw data are countries’ exports of specific products.

Economic Fitness (EF) measures both the industrial diversification of a country from the perspective of quantity, and the competitiveness of its products from the perspective of quality, which is reflected in where the product is in the Global Value Chains (GVCs). Gereffi [1] initially carried out research on global value chains based on the East Asian garment industry, which now refers to the processes of globally distributing the entire production. Products in the relatively high-end segment will have stronger competitiveness, meaning the country can obtain higher added value [2]. The status of a country’s products in the production chains is relatively stable, and its evolution process also requires a certain period to perform. Excessive fluctuations of the RCA and country-product matrix are contrary to the objective laws of production globalization and cannot reflect changes in the essential competitiveness of products and countries.

The Economic Fitness method was developed by Tacchella et al. (2013) [3] for quantitative economic complexity analysis. It measures the degree of product diversification together with competitiveness and product complexity. Higher economic fitness indicates a more varied product structure and shows that the country’s exports are more risk-resistant. The Fitness-Complexity Algorithm is a non-linear iteration method (see Eq 3) which can produce different results for different countries even though they start from the same initial value. The RCA values and the Country-Product Matrix, once denoised through the SDC-HMM, provide more stable and more objective inputs to the recursive equations (Eq 3). The shortest duration constraint requires the hidden state time series to remain in a state for at least a preset number of periods. In this paper, the shortest duration constraint t = 2 brings the model more in line with the actual economic situation: the impact of data noise is reduced while the original data are retained as far as possible. The SDC-HMM matters in two respects. First, setting a minimum duration longer than one period makes full use of the time dimension of the data and effectively removes abnormal values. Second, setting the value of d to be greater than or equal to 2, rather than to a large value, retains the time-dimension information of the data and allows changes in the data trend to be identified promptly. Denoising yields a less noisy product complexity, which in turn reduces the noise carried through the recursion. Consequently, the resulting economic fitness is less noisy and a more accurate prediction of GDP growth can be obtained.

The latest research also finds that the economic complexity of a country is the reason for its long-term economic growth [4]. Ricardo’s traditional comparative advantage theory [5] holds that a country should merely export a few products that adhere to its own comparative advantage and fulfil international specialization according to its comparative advantage. Nevertheless, the national export figures suggest that internationally accredited developed countries’ export structure has remained fully diversified. The internal economic fitness of a country constituted by the diversity and competitiveness of its products corresponds to the hidden state of the model, while changes in the external export data of countries only correspond to fluctuations in the apparent state. It is indeed appropriate to apply the Hidden Markov Model to process country-product matrix data [6].

However, when the export data fluctuate sharply over a short period of time due to factors such as tariffs, exchange rates and other countries’ trade policies, it does not necessarily result in a drastic change in the internal capacity of the country’s economic fitness. Furthermore, the economic fitness index at time t is used to predict GDP at (t+1) and other future times. The export RCA data and the unstable state of the country-product matrix will not be able to meet the objective laws of relative stability and smoothness in global production, which in turn will reduce the predictive effect of a country’s economic fitness on its economic growth (GDP). In view of this, this paper imposes a shortest duration constraint on the Hidden Markov Model to achieve the corresponding requirements.

The main innovations of this paper are as follows: First, this paper innovatively uses the shortest duration constraint to achieve noise reduction and removal of ultra-short-term abnormal economic fluctuations, while ensuring the identification and confirmation of medium- and long-term economic trends. In the end, the time dimension information of the time series data can be fully utilized to effectively remove abnormal data noise; and the data time dimension information can also be fully retained, and the data trend change point can be identified in time. Without the shortest duration constraint, macroeconomic data, especially the higher frequency monthly data, will show greater volatility. Most of the fluctuations are caused by the disturbance of certain unexpected events, which cannot characterize the intrinsic economic trends.

Secondly, this article chooses the binary Country-Product Matrix and the Economic Fitness indicators as its test case, so that the effect of SDC-HMM noise reduction on macroeconomic time series, and ultimately on the results of economic analysis, is as visible as possible. As a macroeconomic analysis model, country Economic Fitness analysis requires only one data input, the country-product matrix, and this matrix is binary, so any noise or fluctuation in the data is transmitted directly to the analysis results. The object of study is therefore chosen to expose the effect of SDC-HMM noise reduction and optimization as fully as possible.

The possible contributions of this paper are mainly reflected in the following four aspects. First, the shortest duration constraint is added to the hidden Markov model, making full use of the time dimension information and providing a practical way to address change points in time series data. Second, national economic fitness analysis has two distinguishing characteristics: it requires a single data input, and the values of that input are extremely simple (binary). Using it, we build a natural experiment to verify the denoising effect of the SDC-HMM. Third, through the shortest duration constraint optimization, the prediction accuracy of economic fitness is improved. Fourth, the paper enriches the application of hidden Markov models in economics and further expands their role in data cleaning.

The Hidden Markov Model

The Hidden Markov Model (HMM) includes two sets of time series random variables. One is the hidden state time series, whose states cannot be observed; the other is the explicit state time series, which is observable and is generated by emissions from the hidden Markov chain. Accordingly, the Hidden Markov Model is specified by two state sets and three probability distributions, described as follows [7]:

hidden state time series set H
H = {h₁, h₂,…, h_N}, which involves N states. The hidden state in time t can be any one in the hidden state set, denoted as q_t, and satisfies the condition q_t ∈ {h₁, h₂,…, h_N}.
explicit state time series set X
X = {x₁, x₂,…, x_M}, which involves M states. The explicit state in time t can be any one in the explicit state set, denoted as p_t, and satisfies the condition p_t ∈ {x₁, x₂,…, x_M}.
hidden state transition probability distribution A
A = (a_ij), and a_ij = P(q_t+1 = h_j|q_t = h_i), where 1 ≤ i ≤ N and 1 ≤ j ≤ N. a_ij is the conditional probability of the state transition to the state h_j at time (t+1) when the hidden state at t is h_i (the transition probability of time t).
initial state probability distribution π
π = {π₁, π₂,…, π_N}, and π_i = P(q₁ = h_i), 1 ≤ i ≤ N; π_i is the probability that the initial state of the hidden state time series is h_i.
hidden state emission probability distribution E
E = {e_i(x_l)}, and e_i(x_l) = P(p_t = x_l|q_t = h_i), where 1 ≤ i ≤ N, 1 ≤ l ≤ M. e_i(x_l) is the conditional probability of the state emission to the explicit state x_l when the hidden state at t is h_i (the emission probability of time t).
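
The five components above can be made concrete with a small sketch. The numbers below are illustrative assumptions (N = 2 hidden states, M = 2 explicit states), not values estimated in this paper; the function uses the standard forward algorithm to compute the probability of an observed explicit sequence under (π, A, E).

```python
# Minimal HMM sketch with N = 2 hidden states and M = 2 explicit symbols.
# All probabilities are illustrative assumptions, not estimates from the paper.

pi = [0.6, 0.4]        # pi[i]   = P(q_1 = h_i)
A = [[0.9, 0.1],       # A[i][j] = P(q_{t+1} = h_j | q_t = h_i)
     [0.2, 0.8]]
E = [[0.8, 0.2],       # E[i][l] = P(p_t = x_l | q_t = h_i)
     [0.3, 0.7]]


def sequence_likelihood(obs, pi, A, E):
    """Forward algorithm: probability of the observed sequence under (pi, A, E)."""
    n = len(pi)
    # alpha[i] = P(observations so far, current hidden state = h_i)
    alpha = [pi[i] * E[i][obs[0]] for i in range(n)]
    for x in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * E[j][x]
                 for j in range(n)]
    return sum(alpha)


likelihood = sequence_likelihood([0, 0, 1, 1], pi, A, E)
```

Each row of A and E sums to 1, as does π, so the likelihoods of all explicit sequences of a given length also sum to 1.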

The Shortest Duration Constrained Hidden Markov Model

the shortest duration constraint
The shortest duration constraint requires the hidden state time series to remain in a state, once entered, for at least a preset number of periods dz, (1) where t is the shortest duration. Suppose that the hidden state is q_t at time t and that the sequence changes value at time t. The hidden state time series must then satisfy the following equation. (2)
the constrained Hidden Markov Model
1. the constrained hidden state time series set H’
  H’ = {h₁₁,h₁₂,…,h_1z,…,h_Nz}, which involves (z × N) states. The hidden state in time t can be any one in the hidden state set, denoted as q_t, and satisfies the condition q_t ∈{h₁₁,h₁₂,…,h_Nz}.
2. the constrained explicit state time series set X’
  X’ = {x₁,x₂,…,x_M}, which involves M states. The explicit state in time t can be any one in the explicit state set, denoted as p_t, and satisfies the condition p_t ∈ {x₁,x₂,…,x_M}.
  Compared with the HMM, the constrained HMM has the same explicit state time series set while the hidden state time series set is extended from the original N states to (z × N) states, where the set {h₁₁,h₁₂,…,h_1z} is consistent with h₁.
3. constrained hidden state transition probability distribution A’
  A’ = (a_ij,kl), and a_ij,kl = P(q_t+1 = h_kl | q_t = h_ij), where 1 ≤ i ≤ N, 1 ≤ k ≤ N, and 1 ≤ j ≤ z, 1 ≤ l ≤ z. a_ij,kl is the conditional probability of the state transition to the state h_kl at time (t+1) when the hidden state at t is h_ij (the transition probability of time t).
4. constrained initial state probability distribution π’
  π’ = {π₁₁,π₁₂,…,π_1z,…,π_Nz}, and, as in the HMM, π_ik is the probability attached to the initial state h_i·, with π_i = P(q₁ = h_i·), 1 ≤ i ≤ N.
5. constrained hidden state emission probability distribution E’
  E’ = {e_ij(x_l)}, and e_ij(x_l) = P(p_t = x_l|q_t = h_ij), where 1 ≤ i ≤ N, 1 ≤ l ≤ M, 1 ≤ j ≤ z. e_ij(x_l) is the conditional probability of the state emission to the explicit state x_l when the hidden state at t is h_ij (the emission probability of time t).
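
One common way to realize an expansion from N to (z × N) hidden states is to let the second index count how long the chain has been in the current state. The sketch below is an assumption about how such a construction can be written, not the authors' code: the helper name `expand_hmm`, the choice to place all initial mass on the first sub-state, and the rule that sub-states before the last must advance with probability 1 are all illustrative.

```python
def expand_hmm(pi, A, E, z):
    """Build (pi', A', E') on z * N states so that every run lasts >= z periods.

    Expanded state (i, j): base state h_i, with j periods already spent in the
    current run (j is capped at z - 1). Hypothetical construction for illustration.
    """
    n = len(pi)
    size = n * z

    def idx(i, j):
        return i * z + j

    pi_x = [0.0] * size
    A_x = [[0.0] * size for _ in range(size)]
    E_x = [None] * size
    for i in range(n):
        pi_x[idx(i, 0)] = pi[i]            # assumption: a run starts at sub-state 0
        for j in range(z):
            E_x[idx(i, j)] = list(E[i])    # sub-states emit exactly like h_i
            if j < z - 1:
                # Run still shorter than z: the only allowed move is to age by one.
                A_x[idx(i, j)][idx(i, j + 1)] = 1.0
            else:
                # Run has reached length z: stay, or start a new run in h_k.
                for k in range(n):
                    target = idx(i, j) if k == i else idx(k, 0)
                    A_x[idx(i, j)][target] = A[i][k]
    return pi_x, A_x, E_x


# Illustrative two-state parameters (assumed values) with z = 2.
pi_x, A_x, E_x = expand_hmm([0.5, 0.5],
                            [[0.7, 0.3], [0.3, 0.7]],
                            [[0.95, 0.05], [0.05, 0.95]],
                            z=2)
```

With z = 1 the construction reduces to the ordinary HMM, which is a useful sanity check on the indexing.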

It is important to carry out change point detection and subsequent noise reduction optimization of economic time series data. Change point detection was initially proposed by Page [8] in quality control research on continuous sampling inspection, and was subsequently extended to a range of fields such as economics, big data, biology and finance. Change point detection refers to the analysis of historical data sequences to detect the presence of meaningless data change points such as abnormal values. Time series models used for data noise reduction and forecast optimization include the Auto-Regressive and Moving Average Model (ARMA) and the Hidden Markov Model [7]. Within this family of time domain analysis methods [9], Luong [10] proposed imposing conditional restrictions on the hidden state sequence of the Hidden Markov Model (HMM). The HMM was established in 1957 as a time series analysis model. Its specific application was first proposed by Rabiner & Juang in 1986 to resolve problems in the field of speech recognition, where the HMM is employed to address the influence of noise in the acoustic model. Subsequently, the HMM was widely used in the natural sciences and engineering for biological sequence alignment, image processing and facial recognition [11–13]. In recent years, it has gradually been applied in the humanities and social sciences to economic forecasting, the analysis of online public opinion, and financial transactions. In finance and economics, Rossi & Gallo [14] employed the HMM to analyse the volatility of financial asset returns, and Huang Xiaobin et al. [15] used it to model and analyse the unpredictable stock information state. Tacchella et al. [6] apply the HMM to optimize noise reduction of country-product matrix data, performing preliminary processing by setting a certain probability that each RCA quartile value will be generated by each stage of development, but this method still cannot effectively eliminate data noise. When a single year with an extremely high value falls within a low-value period confirmed by the HMM, the development stage assigned to that year deviates considerably from the country’s actual economic fitness.

To clean the data, this paper attempts to realize noise reduction and forecast optimization of country-product matrix data by imposing a shortest duration constraint on the Hidden Markov Model. The constraint requires each run of the HMM hidden state time series to last at least a set minimum number of periods, so that each data point stays consistent with the prevailing trend, while a trend shift must persist for a certain duration before it is confirmed.

The Fitness-Complexity Algorithm

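The Fitness-Complexity Algorithm is an iterative process built on two principal indicators, Economic Fitness and Product Complexity. The calculation is defined as follows.

$$
\begin{cases}
\tilde{F}_c^{(n)} = \sum_p M_{cp}\, Q_p^{(n-1)} \\
\tilde{Q}_p^{(n)} = \dfrac{1}{\sum_c M_{cp}\, \dfrac{1}{F_c^{(n-1)}}}
\end{cases}
\quad\rightarrow\quad
\begin{cases}
F_c^{(n)} = \dfrac{\tilde{F}_c^{(n)}}{\langle \tilde{F}_c^{(n)} \rangle_c} \\
Q_p^{(n)} = \dfrac{\tilde{Q}_p^{(n)}}{\langle \tilde{Q}_p^{(n)} \rangle_p}
\end{cases}
\tag{3}
$$

The iteration process is divided into two major steps. First, the Fitness of a country is the sum of its exported products weighted by their Complexity, and the Complexity of a product is inversely related to the number of countries which can export the same product. Second, the intermediate variables $\tilde{F}_c^{(n)}$ and $\tilde{Q}_p^{(n)}$ are normalized into $F_c^{(n)}$ and $Q_p^{(n)}$ by dividing by their means over countries and products, respectively. The initial condition is $Q_p^{(0)} = 1$ for any product and $F_c^{(0)} = 1$ for every country. M_cp is the binary Country-Product Matrix, which is derived from the Revealed Comparative Advantage (see Eqs 4 and 5).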

As shown in the recursive iteration, the Complexity of a product in step n is inversely proportional to the sum of the reciprocals of the Fitness of its exporting countries in step (n − 1), so as Economic Fitness rises, Product Complexity also rises in the iteration process. In addition, Economic Fitness in step n is directly proportional to the Product Complexity of the previous step. In this way, the iteration serves as an anchor for uncovering economic potential, but also as an amplifier when the RCA or Country-Product Matrix input contains noisy data.
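
The recursion can be sketched in a few lines, assuming Eq 3 takes the standard form of the Fitness-Complexity iteration (Fitness as the Complexity-weighted sum of exported products, Complexity as the inverse of the summed reciprocal Fitness of exporters, both divided by their mean at every step). The function name, the iteration count and the toy matrix are illustrative assumptions.

```python
def fitness_complexity(M, n_iter=20):
    """Iterate Fitness F_c and Complexity Q_p on a binary matrix M[c][p].

    Assumes every country exports at least one product and every product has
    at least one exporter; otherwise a division by zero occurs.
    """
    n_c, n_p = len(M), len(M[0])
    F = [1.0] * n_c   # initial condition: F_c = 1 for every country
    Q = [1.0] * n_p   # initial condition: Q_p = 1 for every product
    for _ in range(n_iter):
        # Intermediate (un-normalized) variables, both built from step n - 1.
        F_tilde = [sum(M[c][p] * Q[p] for p in range(n_p)) for c in range(n_c)]
        Q_tilde = [1.0 / sum(M[c][p] / F[c] for c in range(n_c))
                   for p in range(n_p)]
        # Normalize by the mean over countries and over products.
        F_mean = sum(F_tilde) / n_c
        Q_mean = sum(Q_tilde) / n_p
        F = [f / F_mean for f in F_tilde]
        Q = [q / Q_mean for q in Q_tilde]
    return F, Q


# Toy triangular matrix (assumed data): country 0 exports everything,
# country 2 exports only the most ubiquitous product.
M_example = [[1, 1, 1],
             [1, 1, 0],
             [1, 0, 0]]
F, Q = fitness_complexity(M_example)
```

In this toy case the most diversified country obtains the highest Fitness and the product exported only by it obtains the highest Complexity. Flipping a single entry of `M_example` and rerunning is a quick way to see how one noisy bit propagates through both indicators.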

Revealed comparative advantage (RCA) is an indicator used to measure a country’s product competitiveness and national comparative advantage [3, 16, 17]. The Heckscher-Ohlin model not only has difficulties when dealing with three-factor models, such as labor, capital and raw material inputs, but also draws conclusions that are inconsistent with the facts when endowments are compared across more than two countries [18]. To overcome these shortcomings, revealed comparative advantage (RCA) measures the competitiveness of a product through a country’s export data for that product, which is expressed as follows.

$$
RCA_{c,p} = \frac{E_{cp} \big/ \sum_{p} E_{cp}}{\sum_{c} E_{cp} \big/ \sum_{c,p} E_{cp}}
\tag{4}
$$

where RCA_c,p is the revealed comparative advantage of country c in product p, E_cp is the export of product p by country c within a certain time period, Σ_p E_cp is the aggregate export of country c during the period, Σ_c E_cp is the aggregate export of product p from all countries, and Σ_c,p E_cp is the total world export. When the RCA value is greater than or equal to 1, product p of country c has a revealed comparative advantage; otherwise it does not. By binarizing the value of revealed comparative advantage, we obtain the country-product matrix M_cp, defined as follows.

$$
M_{cp} =
\begin{cases}
1, & RCA_{c,p} \ge 1 \\
0, & RCA_{c,p} < 1
\end{cases}
\tag{5}
$$

The M_cp matrix is a two-valued matrix, whose value is 1 when the RCA value is greater than or equal to 1 and 0 when the RCA value is below 1.
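
The RCA ratio and its binarization at the threshold 1 can be illustrated with a two-country, two-product example. The export figures are made up for the illustration, and the function names are hypothetical.

```python
def rca_matrix(exports):
    """RCA[c][p] = (E_cp / sum_p E_cp) / (sum_c E_cp / sum_cp E_cp).

    Assumes every country and every product has a positive export total.
    """
    n_c, n_p = len(exports), len(exports[0])
    country_total = [sum(row) for row in exports]
    product_total = [sum(exports[c][p] for c in range(n_c)) for p in range(n_p)]
    world_total = sum(country_total)
    return [[(exports[c][p] / country_total[c]) / (product_total[p] / world_total)
             for p in range(n_p)]
            for c in range(n_c)]


def binarize(rca, threshold=1.0):
    """Country-product matrix: 1 where RCA >= threshold, else 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in rca]


# Made-up export values (rows: countries, columns: products).
exports = [[80.0, 20.0],
           [10.0, 90.0]]
rca = rca_matrix(exports)   # approx. [[1.78, 0.36], [0.22, 1.64]]
M = binarize(rca)           # [[1, 0], [0, 1]]
```

An RCA value that hovers near 1 from year to year flips the corresponding entry of M back and forth, which is the kind of noise the shortest duration constraint targets.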

A country’s economic competitiveness and development level are by and large positively related to its economic diversification, that is, economic complexity. Diversification is the dominant factor in the globalized economic market [6]. In studying how to quantify the level of a country’s economic complexity, Hidalgo and Hausmann [2] first proposed the “Method of Reflections”. Starting from the binarized country-product matrix of export data, they apply iterative methods that establish a linear relationship between the competitiveness of a country and product complexity to quantify the level of economic complexity. In order to overcome the conceptual and mathematical defects of the linear relationship, Tacchella et al. [3] proposed the “Fitness-Complexity Algorithm”, which characterizes a country through the binarized country-product matrix. The nonlinear relationship between economic fitness and product complexity further optimizes the quantitative analysis of economic complexity.

It is particularly crucial to effectively reduce noise for low-fitness countries. They have fewer bits that are not “0” in the country-product matrix, so as one or more noisy pieces of data appear, it will be particularly difficult to correctly judge their fitness [6]. Furthermore, there is still a considerable degree of noise in the data obtained through the binarization process, and especially when the RCA value fluctuates around the “1” threshold, the accuracy of the country-product matrix is difficult to guarantee.

Whether a change in the underlying trend is genuine also needs to be confirmed by the minimum duration constraint. Under this constraint, a change point whose duration is shorter than the preset minimum is not recognized; such data are smoothed and restored through the HMM, and high-quality country-product matrix time series are finally obtained by decoding the hidden state time series, which optimizes the quantitative analysis of complexity. Such constraints can also be found in Zhuang Yu and He Zhenfeng [19], where the constrained HMM is applied to change point detection and economic cycle prediction is performed by analysing simulation data and GNP data.

Results

Model training and decoding algorithm

In this paper, the country-product matrix constructed by RCA binarization is used as the explicit state. The explicit state time series is divided into low and high states, which correspond to the binary values “0” and “1” respectively. A separate SDC-HMM is trained on the data for each product in each country.

In this paper, the shortest duration constraint t = 2 is set to bring the model more in line with the actual economic situation. While the original data are retained as far as possible, the impact of data noise is further reduced. The economic meaning of this setting is reflected in how the economic cycle is confirmed: in the macroeconomic cycle, negative growth needs to last at least two time periods before an economic recession can be identified [20].

The value of d chosen in this paper is assigned to be greater than or equal to 2, which has strong economic characteristics and implications. The choice of this value aims to improve the economic forecasting effect through two aspects. First, setting a minimum duration greater than one period is conducive to making full use of data time dimension information and effectively removing abnormal data value noise. As mentioned above, in the case of the export data used for the Economic Fitness analysis in this paper, for example, export data will fluctuate drastically in a short period due to factors such as tariffs, exchange rates, and trade policies of other countries, but such changes in data that fluctuate from period to period do not necessarily lead to drastic changes in the intrinsic capacity of the Economic Fitness. Economic analysis aims to characterize intrinsic economic development and predict future trends through data. The unstable state of export RCA data and Country-Product Matrix will not satisfy the objective law of relative stability and smoothing of global value chains, which in turn reduces the predictive effect of a country’s Economic Fitness on its economic growth. By choosing the value of d to be greater than or equal to 2, the noise reduction of such anomalous noise data will be achieved.

Second, setting the value of d to be greater than or equal to 2 fully retains the time-dimension information of the data and allows changes in the data trend to be identified promptly. If the minimum duration constraint is set too long, the change points inherent in the economic data cannot be identified in time: the larger the value, the longer the lag with which changes in the macroeconomic trend are identified. This identification lag affects the timeliness and accuracy of macroeconomic forecasting to a certain extent. The export data selected for the Economic Fitness analysis in this paper change relatively frequently, and the relative values of factor endowments and comparative advantages of different countries change quickly. Given this, we choose to set the value of d to be greater than or equal to 2, to maximize the retention of time-dimension information and the timely identification of changes in economic trends. The final research results also support this setting.

Furthermore, the Purchasing Managers’ Index (PMI), a leading indicator used by countries to predict economic trends and known as the “barometer” of industry and even of the macro economy, also illustrates the economic significance of the shortest duration constraint. The PMI is compiled from monthly questionnaire surveys of purchasing managers in various sectors, whose answers are collected and summarized into a composite index. Taking China’s PMI as an example, the 50% value is the economic “boom-bust line”. A PMI value below 50% for the month means that economic expectations are sluggish, and a value above 50% indicates that the economy is expected to expand. The value and its economic significance are broadly comparable to the meaning of the country-product matrix in this article with respect to the competitiveness of countries and products. For China’s PMI, a change of the index in the same direction for three consecutive months can reflect a trend switch in the country’s macro economy. More precisely, if the cycle is in a low-value period and only a single month’s value rises above the 50% line, it cannot be concluded that the economy has fully recovered. Confirming a trend improvement requires values in the same direction lasting for 3 months.

The training algorithm is the Baum-Welch algorithm and the decoding algorithm is the Viterbi algorithm. The specific steps are given in Fig 1.

Fig 1. The model training and decoding process of the SDC-HMM.

To demonstrate the steps of training and decoding we use the Baum-Welch algorithm and Viterbi algorithm respectively. Firstly, the explicit state time series are trained into the hidden state transition probability distribution, the hidden state emission probability distribution, and the hidden state initial state probability distribution. Then we perform the Viterbi algorithm in decoding these probability distributions into the constrained hidden time series.

https://doi.org/10.1371/journal.pone.0253845.g001

Fig 2 is the schematic diagram of the HMM and constrained HMM. First, the transition state time series data are obtained from the RCA raw data. Subsequently, the time series data of the development stage are obtained based on the calculation of the emission probability distribution. In this process, the constrained HMM requires that each state data node be consistent with its previous (or next) state data node to meet the minimum duration constraint t = 2.

Fig 2. The schematic diagram of the HMM and SDC-HMM.

The transition state time series data are obtained from the RCA raw data. Subsequently, the time series data of the development stage are obtained based on the calculation of the emission probability distribution.

https://doi.org/10.1371/journal.pone.0253845.g002
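
The decoding step can be illustrated with a small Viterbi decoder over duration-counting sub-states. This is a sketch under stated assumptions rather than the authors' implementation: the parameters are fixed by hand instead of being trained with Baum-Welch, the function name `sdc_viterbi` is hypothetical, a run is allowed to be cut short only by the end of the series, and sub-states that have not yet reached the minimum duration advance with probability 1.

```python
import math

NEG_INF = float("-inf")


def safe_log(v):
    return math.log(v) if v > 0 else NEG_INF


def sdc_viterbi(obs, pi, A, E, z=2):
    """Most likely hidden path in which each run, once entered, lasts >= z periods."""
    n = len(pi)
    # Expanded state (i, j): base state i, j periods already spent (capped at z - 1).
    states = [(i, j) for i in range(n) for j in range(z)]

    def trans(s, t):
        (i, j), (k, l) = s, t
        if j < z - 1:                        # run too short: must stay and age
            return 1.0 if (k == i and l == j + 1) else 0.0
        if k == i:                           # run long enough: may stay ...
            return A[i][i] if l == z - 1 else 0.0
        return A[i][k] if l == 0 else 0.0    # ... or start a new run

    score = {}
    for (i, j) in states:
        start = pi[i] if j == 0 else 0.0     # assumption: series starts a fresh run
        score[(i, j)] = safe_log(start) + safe_log(E[i][obs[0]])
    back = []
    for x in obs[1:]:
        new_score, pointers = {}, {}
        for t in states:
            prev = max(states, key=lambda s: score[s] + safe_log(trans(s, t)))
            new_score[t] = (score[prev] + safe_log(trans(prev, t))
                            + safe_log(E[t[0]][x]))
            pointers[t] = prev
        back.append(pointers)
        score = new_score
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):          # backtrack from the best final state
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [i for i, _ in path]


# Assumed parameters and a toy binarized series with a one-period spike at index 3.
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.3, 0.7]]
E = [[0.95, 0.05], [0.05, 0.95]]
series = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1]

unconstrained = sdc_viterbi(series, pi, A, E, z=1)  # [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1]
constrained = sdc_viterbi(series, pi, A, E, z=2)    # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

With these assumed parameters the unconstrained decoder reproduces the isolated spike, while the constrained decoder with a minimum duration of 2 removes it and still keeps the sustained switch at index 7.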

Test 1: Country-product matrix test

As shown in Fig 3, data analysis is performed by comparing the RCA raw data (RCA) and RCA binary noise reduction data (RCA Binarization). The dotted line is the original RCA data, and the solid line is the RCA binary noise reduction data. As can be clearly seen, when the original RCA data fluctuate at the “1” threshold, there is a lot of noise in the RCA binarized data. In Fig 4, the dotted line represents the original RCA data, and the solid line the constrained HMM Binarization data. Noise reduction analysis is performed by comparing the RCA original data and the constrained HMM binarized noise reduction data.

Fig 3. RCA raw data (RCA) and RCA binarization data (RCA Binarization).

The data analysis is performed by comparing the RCA raw data (RCA) and RCA binary noise reduction data (RCA Binarization), in which the dotted line is the original RCA data and the solid line is the RCA binary noise reduction data.

https://doi.org/10.1371/journal.pone.0253845.g003

Fig 4. RCA raw data (RCA) and SDC-HMM data (constrained HMM Binarization).

The data analysis is performed by comparing the RCA raw data (RCA) and the constrained HMM Binarization data (constrained HMM Binarization), in which the dotted line is the original RCA data and the solid line the constrained HMM Binarization data.

https://doi.org/10.1371/journal.pone.0253845.g004

Comparative analysis of the two graphs indicates that the Hidden Markov Model with the shortest duration constraint can greatly reduce the fluctuation noise of the RCA raw data at the “1” threshold. Moreover, the constrained HMM can still effectively identify changes in the RCA’s inherent trends, because the changing points can be effectively identified.

Test 2: Country case test

In the noise reduction test of the constrained Hidden Markov Model in specific country cases, we find that the SDC-HMM not only presents the significant advantage of removing the noise of outlier data, but can also effectively identify the trends change point compared to the corresponding HMM.

As shown in Fig 5, we choose Anguilla, Algeria, Paraguay and the Republic of Moldova for national case analysis and testing. Four series are compared for each case: the original RCA data, the country-product matrix binarized noise reduction data, the unconstrained Hidden Markov Model noise reduction data and the SDC-HMM noise reduction data. The tests of data noise reduction and change point detection are then performed; the denoising improves the predictability of the economic data and the effectiveness of subsequent empirical prediction.

Many country case studies demonstrate that the SDC-HMM can erase abnormally high value (low value) noise in the stationary phase, abnormally low value (high value) noise in the rising phase and abnormally high value (low value) noise in the descending phase. The data are also denoised markedly better than with country-product matrix binarization or the unconstrained HMM. The SDC-HMM not only handles a single type of data noise, it also removes multiple types of noise within the same time series. As shown in the group diagram for the Republic of Moldova in Fig 5, the SDC-HMM can effectively remove the abnormal low value noise in the rising period and the abnormal low value noise in the descending period. Through the constraint condition “t = 2”, the SDC-HMM thus reduces two different types of noise and improves data quality.

Fig 5.

Country case test (A), including (1) original RCA data, (2) country-product matrix binarized noise reduction data, (3) unconstrained Hidden Markov Model noise reduction data and (4) the Shortest Duration Constrained Hidden Markov Model reduction data. These show the abnormal data noise removal of the SDC-HMM. The horizontal axis is the time axis (“0” represents the starting year 1995, “20” represents 2015), and the vertical axis is the corresponding value of the subgraph. The original RCA data graph shows the original values, and the rest of the subgraphs show binary values.

https://doi.org/10.1371/journal.pone.0253845.g005

Furthermore, in the country case test, we find that the SDC-HMM can retain the advantages of change point recognition. As shown in Fig 6, the Republic of Armenia and the United Kingdom of Great Britain and Northern Ireland are typical illustrations.

Fig 6.

Country case test (B), including (1) original RCA data, (2) country-product matrix binarized noise reduction data, (3) unconstrained Hidden Markov Model noise reduction data and (4) the Shortest Duration Constrained Hidden Markov Model reduction data. These show the long-term change point identification of SDC-HMM.

https://doi.org/10.1371/journal.pone.0253845.g006

Test 3: Noise estimation test

To quantify the degree of noise reduction achieved by the SDC-HMM, we conduct noise estimation tests on the country-product matrix and on the SDC-HMM output. The noise estimation approach was originally used by Battiston et al. [21] to compare the noise reduction effect of the HH index method [2] and their proposed non-linear index method (Non-Linear Metrics). The detailed steps are as follows:

They calculate the corresponding Spearman’s correlation coefficient (ρ_s) from real data for every year (Eq 6). ρ_s is a non-parametric indicator used to measure the dependence between two statistics. The noise estimation test results are shown in Fig 7.
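
Equation 6 is not reproduced in this text, so the following is only a minimal sketch of the standard definition of Spearman’s ρ_s: the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and the sample values are illustrative assumptions, not the authors’ code or data.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical rankings of five countries in two data sets.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(rho, 3))
```

Without ties this agrees with the closed form 1 − 6Σd²/(n(n² − 1)), where d is the difference between the two ranks of each item.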

Fig 7. Noise Estimation Test.

The blue line is the noise estimation of RCA binarization and the orange one is the noise estimation of SDC-HMM. The average data noise percentage of the RCA binarized data obtained by noise estimation is higher than the average value of data noise after SDC-HMM optimization.

https://doi.org/10.1371/journal.pone.0253845.g007

In Fig 7, over the 20 years from 1995 to 2015, the average data noise percentage of the RCA binarized data obtained by noise estimation is about 42%, while the average data noise after SDC-HMM optimization is about 32%. Relative to the binarized data, SDC-HMM optimization therefore reduces data noise by about 25%. The SDC-HMM has clear advantages in noise reduction not only on average but also at the extremes: after the shortest duration constraint is imposed on the HMM, the maximum data noise is 33.09%, which is well below the minimum data noise of the RCA binarized data, 41.67%. The noise estimation test confirms that the SDC-HMM has a significant effect on data noise reduction.
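
The relative reduction quoted above can be checked from the rounded averages. The short sketch below uses only the figures given in the text (42%, 32%, 33.09% and 41.67%); the function name is an illustrative assumption, and the exact result depends on the unrounded averages, which are not reported here.

```python
def relative_reduction(before, after):
    """Share of the original noise level that was removed."""
    return (before - after) / before


# Rounded yearly averages reported for Fig 7 (percent of noisy entries).
avg_binarized, avg_sdc_hmm = 42.0, 32.0
print(f"{relative_reduction(avg_binarized, avg_sdc_hmm):.1%}")

# Extremes reported in the text: the worst SDC-HMM year is still
# below the best year of the RCA binarized data.
max_sdc_hmm, min_binarized = 33.09, 41.67
print(max_sdc_hmm < min_binarized)
```

With the rounded inputs the reduction comes out at roughly 24%, in line with the “about 25%” stated in the text.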

Test 4: Fitness-GDP panel regression test

To study how far SDC-HMM noise reduction improves the GDP prediction of Economic Fitness, we use a panel data model for econometric analysis.

Based on the correlation study of GDP growth factors, the fixed effects model (FEM) constructed in this paper is given in Eq 7, where GDP_pc, lnFitness_it, lnPop_it and lnLF_it represent GDP per capita, the economic fitness index, national population, and the labor force over 15 years old, respectively.
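
Equation 7 is not reproduced in this text. As a rough illustration of what a fixed effects estimate does, the sketch below computes the within estimator for a single regressor: each country’s own mean is subtracted from both variables before the slope is fitted, which removes any constant country-specific effect. The single-regressor form, the function name and the synthetic numbers are assumptions for illustration only; the paper’s models also include population and labor force controls.

```python
def within_slope(entities, x, y):
    """Fixed-effects (within) slope of y on x with one regressor."""
    groups = {}
    for entity, xi, yi in zip(entities, x, y):
        groups.setdefault(entity, []).append((xi, yi))
    num = den = 0.0
    for rows in groups.values():
        mean_x = sum(r[0] for r in rows) / len(rows)
        mean_y = sum(r[1] for r in rows) / len(rows)
        for xi, yi in rows:
            # Demeaning by entity removes the entity-specific intercept.
            num += (xi - mean_x) * (yi - mean_y)
            den += (xi - mean_x) ** 2
    return num / den


# Synthetic panel: two hypothetical countries with different intercepts
# (1 and 5) but the same slope of 2 between ln fitness and ln GDP per capita.
country = ["A", "A", "A", "B", "B", "B"]
ln_fitness = [0.0, 1.0, 2.0, 1.0, 2.0, 4.0]
ln_gdp_pc = [1.0, 3.0, 5.0, 7.0, 9.0, 13.0]
print(within_slope(country, ln_fitness, ln_gdp_pc))
```

In a log-log specification of this kind the slope reads as an elasticity, which is how the regression results are interpreted in this section (a 1% change in fitness corresponding to a given percentage change in GDP).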

In Table 1, Models 1 to 3 are mixed regression models and Models 4 to 6 are fixed effect models after SDC-HMM noise reduction, while Table 2 reports the same models after HMM noise reduction. In the Hausman test, the p values for Models 5 and 6 are statistically significant at the 0.01 level, so the fixed effect model should be used instead of the random effect model. Model 4 rejects the null hypothesis at a significance level of 10%, which also confirms the applicability of the fixed effect model. The same reasoning applies to Table 2.

Table 1. Regression results of the mixed regression model and fixed effect model of economic fitness for GDP prediction after SDC-HMM noise reduction.

https://doi.org/10.1371/journal.pone.0253845.t001

Table 2. Regression results of the mixed regression model and fixed effect model of economic fitness for GDP prediction after HMM noise reduction.

https://doi.org/10.1371/journal.pone.0253845.t002

To test the robustness of the regression results, we split the sample into two periods, 1995–2007 and 2008–2017, and run the regression on each of them (Appendix A1-A4 in S1 Appendix). Comparing the time-divided regressions with the initial model shows that the original results are robust.

To analyze the prediction optimization of the SDC-HMM thoroughly, we compare Tables 1 and 2. In the mixed regression models, the results of Models 1–3 all show that the SDC-HMM has greater explanatory strength than the HMM. In Model 3, the positive effect of the economic fitness index on GDP is significant at the 1% level with the SDC-HMM, compared with the 10% level with the HMM. In the fixed-effect models, the standard errors and p-values of the economic fitness coefficients in all models show that economic fitness has strong explanatory power for economic growth and is significant at the 1% level. In Model 4 with the SDC-HMM, a 1% increase in the economic fitness index corresponds to a 2.33% increase in GDP, significant at the 1% level, whereas in Model 4 with the HMM a 1% increase in the index explains only 1.84% growth in GDP. Similarly, in Models 5 and 6 the GDP growth explained by a 1% increase in economic fitness rises from 1.23% with the HMM to 1.69% with the SDC-HMM, and from 1.24% to 1.70%, respectively.

Fitness-GDP Forecast Optimization

To evaluate the forecast optimization of the SDC-HMM, we use a strictly out-of-sample forecast to verify whether the HMM with the shortest duration constraint performs better than the initial HMM without the constraint. We compute the Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and Mean Squared Error (MSE) between the predicted GDP and the real GDP data.
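
For reference, the three error measures can be written as a few lines of code. This is a minimal sketch using their standard definitions; the function names and the sample numbers are illustrative and are not taken from Table 3.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average size of the forecast errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average squared forecast error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the MSE, in the data's units."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical real and predicted values for three countries.
real_gdp = [1.0, 2.0, 3.0]
predicted_gdp = [1.0, 2.0, 5.0]
print(mae(real_gdp, predicted_gdp), mse(real_gdp, predicted_gdp), rmse(real_gdp, predicted_gdp))
```

MSE and RMSE penalize large individual errors more heavily than MAE, so reporting all three gives a view of both typical and extreme forecast errors.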

As shown in Table 3, the MAE, RMSE and MSE between the GDP predicted with the SDC-HMM and the real GDP in that year are all lower than with the HMM: MSE decreased by 0.89%, MAE by 0.62%, and RMSE by 0.44%. After the shortest duration constraint is imposed, the GDP forecast of the V-SPS (Velocity Selective Predictability Scheme) method [6] becomes more accurate. We conclude that the SDC-HMM, compared with the unconstrained HMM, better reveals the development potential of a country and thereby improves the accuracy of GDP forecasting.
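
Because RMSE is the square root of MSE, the two reported decreases can be cross-checked against each other. The sketch below assumes both metrics were computed on the same set of forecasts and uses only the rounded percentages quoted in the text.

```python
import math

# Reported relative decrease in MSE (0.89%), taken from the text.
mse_drop = 0.0089

# If MSE falls by a factor (1 - mse_drop), RMSE falls by its square root.
implied_rmse_drop = 1.0 - math.sqrt(1.0 - mse_drop)
print(f"{implied_rmse_drop:.2%}")
```

The implied RMSE decrease is a little under 0.45%, which is consistent with the reported 0.44% once rounding of the published figures is taken into account.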

Table 3. The results of the SDC-HMM on Fitness-GDP forecast optimization.

https://doi.org/10.1371/journal.pone.0253845.t003

Discussion

From an economic perspective, a complex and complete industrial system and competitive export products are an inexhaustible driving force for a country’s economic growth. Occupying a higher position in the global value chains leads to more product added value. The migration of a country’s product along the chains and the acquisition or loss of the product’s revealed comparative advantage are essentially the same process, and one that objectively takes time to complete. When a product of a country appears to lose its competitive advantage at one time node while the adjacent nodes show high RCA values, the data point contradicts economic development and the reality of global production. Such data should therefore be regarded as noise that is meaningless for predictive analysis. The basis of economic analysis is high-quality economic data that can be utilized for prediction [22]. Furthermore, the simpler the numerical composition of the economic data, the more denoising can improve data quality, and the more essential data quality improvement becomes for the analysis. The only data required for economic fitness analysis are the “country-product matrix” data, which are binary and therefore extremely simple in numerical composition.

We select economic fitness analysis and “country-product matrix” data to test the SDC-HMM for data sanitation and to optimize the GDP prediction accuracy of the V-SPS method, which is built on economic fitness. Economic Fitness measures the degree of competitive product diversification and product complexity in global value chains. The Fitness-Complexity Algorithm is a non-linear iterative method that can produce different results for different countries even when they start from the same initial value. Each iteration step adds more information about the complexity of the products and eventually separates the countries that share the same initial value. According to the non-linear recursive structure (see Eq 3), the fitness iteration implies the following: when a country’s competitive products become more diversified, the complexity of the products it produces in the next stage will be greater, and increased product complexity will further diversify the country’s competitive products. This is a benign recursion cycle which captures the economic essence. By the same logic, if the original data are noisy, the calculated Fitness will amplify the noise further at each recursion step.
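
To make the recursion concrete, the sketch below implements the Fitness-Complexity iteration in the form commonly given in the literature [3]: a country’s fitness is the sum of the complexities of its products, a product’s complexity is bounded by the least fit countries that export it, and both vectors are rescaled to mean 1 after each step. Eq 3 is not reproduced in this text, so the notation may differ from the paper; the function name, the iteration count and the toy matrix are illustrative assumptions.

```python
def fitness_complexity(m, iterations=20):
    """Iterate fitness and complexity on a binary country-product matrix m.

    m[c][p] is 1 if country c has a comparative advantage in product p.
    Every country and every product is assumed to have at least one entry.
    """
    n_countries, n_products = len(m), len(m[0])
    fitness = [1.0] * n_countries      # identical initial values
    complexity = [1.0] * n_products
    for _ in range(iterations):
        # Fitness: sum of the complexities of the products a country exports.
        raw_f = [sum(m[c][p] * complexity[p] for p in range(n_products))
                 for c in range(n_countries)]
        # Complexity: penalized when low-fitness countries export the product.
        raw_q = [1.0 / sum(m[c][p] / fitness[c] for c in range(n_countries))
                 for p in range(n_products)]
        mean_f = sum(raw_f) / n_countries
        mean_q = sum(raw_q) / n_products
        fitness = [f / mean_f for f in raw_f]
        complexity = [q / mean_q for q in raw_q]
    return fitness, complexity


# Toy nested matrix: country 0 exports all three products, country 2 only one.
toy_matrix = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
]
fitness, complexity = fitness_complexity(toy_matrix)
print([round(f, 3) for f in fitness])
print([round(q, 3) for q in complexity])
```

All three countries start from the same value of 1, yet the iteration ranks the most diversified country highest and the product exported only by that country as the most complex. A single flipped entry in `toy_matrix` feeds into every later step, which is the amplification of noise described above.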

We construct a constrained Hidden Markov Model with shortest duration t = 2. In the country-product matrix test, the SDC-HMM greatly diminishes the fluctuation noise of the RCA raw data around the threshold of 1, while retaining effective identification and confirmation of trend change points. The national case tests confirm that the SDC-HMM excels at removing abnormally high (or low) values in the stationary period, abnormally low (or high) values in the rising period and abnormally high (or low) values in the descending period, and at identifying long-term change points in low-to-high and high-to-low trends, achieving obvious noise reduction. In the noise estimation test, the SDC-HMM reduces total data noise by about 25%, which is significantly better than the unconstrained HMM. In the empirical panel regression test using national economic fitness indicators to predict GDP growth, the SDC-HMM is shown to improve the effectiveness of economic fitness indicators in interpreting economic growth. The Mean Squared Error (MSE) between the predicted GDP and the real GDP data decreased by 0.89%.

The SDC-HMM matters for two main reasons. Firstly, setting a minimum duration greater than one period makes full use of the time-dimension information in the data and effectively removes abnormal values. The export data used for the Economic Fitness analysis often fluctuate drastically over short periods because of factors such as tariffs, exchange rates, and the trade policies of other countries, but such fluctuations do not reflect a commensurate change in Economic Fitness. For characterizing inherent economic development and predicting future trends, unstable export RCA data and an unstable Country-Product Matrix do not fit the stable global value chains (GVCs). As a result, the predictive effect of a country’s Economic Fitness on its economic growth diminishes as well.

Secondly, setting the minimum duration d to 2, rather than to a larger value, fully retains the time-dimension information and identifies changes in the data trend promptly. A longer minimum duration constraint would cause real change points to be missed. Furthermore, the resulting time lag would reduce the accuracy of macroeconomic forecasting to a certain extent.
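
The effect of a shortest duration of 2 on decoding can be illustrated with a small sketch. It is not the authors’ implementation: it assumes binary observations, a symmetric flip probability `p_flip`, a stay probability `p_stay`, and it enforces the constraint by splitting each hidden state into a “just entered” and a “continuing” copy, which is one generic way to impose a minimum duration in Viterbi decoding. All names and parameter values are hypothetical.

```python
import math


def decode_min_duration_2(obs, p_stay=0.9, p_flip=0.2):
    """Viterbi decoding of a 0/1 series in which every run lasts >= 2 steps."""
    if len(obs) < 2:
        raise ValueError("need at least two observations")
    # Expanded states: (s, 0) = first step of a run in state s,
    #                  (s, 1) = second or later step of that run.
    states = [(s, age) for s in (0, 1) for age in (0, 1)]

    def log_emit(s, o):
        return math.log(1 - p_flip if s == o else p_flip)

    def log_trans(prev, cur):
        (ps, p_age), (cs, c_age) = prev, cur
        if p_age == 0:
            # A run that has just started is forced to continue.
            return 0.0 if (cs == ps and c_age == 1) else -math.inf
        if cs == ps and c_age == 1:
            return math.log(p_stay)
        if cs != ps and c_age == 0:
            return math.log(1 - p_stay)
        return -math.inf

    # The series may only begin with a freshly started run.
    score = {st: (math.log(0.5) + log_emit(st[0], obs[0]) if st[1] == 0
                  else -math.inf) for st in states}
    back = []
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for cur in states:
            best = max(states, key=lambda prev: score[prev] + log_trans(prev, cur))
            new_score[cur] = score[best] + log_trans(best, cur) + log_emit(cur[0], o)
            pointers[cur] = best
        score = new_score
        back.append(pointers)
    # The final run must also have lasted at least two steps.
    last = max((st for st in states if st[1] == 1), key=lambda st: score[st])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return [s for s, _ in reversed(path)]


# Toy binarized series with one-period spikes in both phases.
noisy = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1]
print(decode_min_duration_2(noisy))
```

In this toy series the isolated 1 in the low phase and the isolated 0 in the high phase are both smoothed away, while the single lasting low-to-high change is kept. A larger minimum duration would need more copies of each state and would delay the detection of genuine changes, which matches the trade-off described above.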

In this paper, the shortest duration constraint is fixed at the constant 2 on the grounds of economic significance. In other economic fields this fixed duration could be a limitation, since more stable data require a longer duration constraint. Subsequent research could set the shortest duration constraint as a function of the specific data and the purpose of the economic research. To improve the accuracy of economic research and prediction, the specific value or functional form could be chosen by comparing the degree of noise reduction through noise detection.

Methods

The original export data used for the RCA calculation and the Gross Domestic Product data both come from the UN Comtrade Database [23], which was created by the United Nations Statistics Division and consists of annual trade data provided by officials from more than 200 countries and regions. For preparatory cleaning of the raw country export data, this paper uses the BACI database [24] provided by the French Prospective Research and International Information Center (CEPII) and chooses the HS92 version of product-level international trade data. The labor force data used for the regression test come from the World Bank databank [25]. The population data are obtained from the Penn World Table 9.0 (PWT) produced by the University of Groningen and the University of Pennsylvania [26]. The model used for noise estimation draws on the method in [21].

The computational complexity of the SDC-HMM in this paper is not high. Compared with the HMM used in Economic Fitness (EF), the SDC-HMM adds very little computation: most of the computation is spent on the parameters of the HMM, and the shortest duration constraint (SDC) itself consumes only a small amount. Compared with the calculation method of Tacchella et al. (2018), we provide a better noise reduction method without increasing the amount of calculation. Running the SDC-HMM noise reduction task in this paper takes about 10–20 minutes on a PC.

Supporting information

S1 Data.

https://doi.org/10.1371/journal.pone.0253845.s001

(RAR)

S1 Appendix.

https://doi.org/10.1371/journal.pone.0253845.s002

(DOCX)

References

1. Gereffi, G. The Organization of Buyer-Driven Global Commodity Chains: How U.S. Retailers Shape Overseas Production Networks. Commodity Chains and Global Capitalism (eds. Gary Gereffi and Miguel Korzeniewicz), 95–122 (Westport, 1994).
2. Dedrick J, Kraemer K L, Linden G. Who profits from innovation in global value chains?: a study of the iPod and notebook PCs. Industrial and corporate change 19, 81–116 (2010).
3. Tacchella A., Cristelli M., Caldarelli G., Gabrielli A. & Pietronero L. Economic complexity: conceptual grounding of a new metrics for global competitiveness. Journal of Economic Dynamics and Control, 1683–1691 (2013).
4. Hidalgo C. A., & Hausmann R. The building blocks of economic complexity. Cid Working Papers 106,10570–10575 (2009). pmid:19549871
5. Ricardo, D. On the Principles of Political Economy and Taxation. (John Murray, 1891).
6. Tacchella A., Mazzilli D., Pietronero L. A dynamical systems approach to gross domestic product forecasting. Nature Physics 14, 861–865 (2018).
7. Rabiner L. & Juang B. An introduction to hidden Markov models. ieee assp magazine 3, 4–16 (1986).
8. Page E. S. Continuous inspection schemes, Biometrika 41(1/2), 100–115 (1954).
9. Yang Tianyu & Huang Shufen. Estimating China’s Output Gap Based on Wavelet Denoising and Quarterly Data. Economic Research Journal 1, 115–126 (2010).
10. Luong TM., Rozenholc Y. and Nuel G. Fast estimation of posterior probabilities in change-point analysis through a constrained hidden Markov model, Computational Statistics & Data Analysis, 68, 129–140 (2013).
11. Bilmes J. A. What HMMs can do. IEICE TRANSACTIONS on Information and Systems 89 (3), 1–24(2006).
12. Eddy S. R. Profile hidden Markov models. Bioinformatics (Oxford, England) 14 (9), 755–763 (1998). pmid:9918945
13. Yamato J, Ohya J, Ishii K. Recognizing human action in time-sequential images using hidden markov model. CVPR 92, 379–385 (1992).
14. Rossi A., & Gallo G. M. Volatility estimation via hidden Markov models. Journal of Empirical Finance 13(2), 203–230(2006).
15. Huang Xiaobin, Wang Chunfeng, Zhenming Fang et al. Detecting Chinese Stock Information based on hidden Markov Model. Systems Engineering Theory Practice 32, 713–720 (2012).
16. Balassa B. Trade Liberalisation and “Revealed” Comparative Advantage. The Manchester School 33 (2), 99–123 (1965).
17. Smith, A. The wealth of nations. London. (UK: W.Strahan and T.Cadell, 1776)
18. Arrow K. J., Chenery H. B., Minhas B. S., & Solow R. M. Capital-Labor Substitution and Economic Efficiency. The Review of Economics and Statistics 43(3), 225–250(1961).
19. Zhuang Yu & He Zhenfeng. Change Point Detection Based on Constrained Hidden Markov Model. Computer Systems Applications 26 (5), 133–137(2017).
20. Stock J. H. & Watson M. W. Has the business cycle changed and why? NBER Macroeconomics Annual 17 (1), 159–218(2002).
21. Battiston F., Cristelli M., Tacchella A. & Pietronero L. How metrics for economic complexity are affected by noise. Complexity Economics 3, 1–22 (2014).
22. Zaccaria A., Cristelli M., Kupers R., Tacchella A., & Pietronero L. A case study for a new metrics for economic complexity: The Netherlands, Journal of Economic Interaction and Coordination 11, 151–169 (2016).
23. UN. Commodity trade statistics database, https://comtrade.un.org (2020).
24. Gaulier, G. & Zignago, S. Baci: International trade database at the product-level. the 1994–2007 version. Working Papers 2010–23, CEPII, http://www.cepii.fr/CEPII/fr/publications/wp/abstract.asp?NoDoc=2726 (2010).
25. WB. World Bank Open Data, https://data.worldbank.org/ (2020).
26. The Penn World Table 9.0 (PWT), https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt9.0 (2015).
