Insights from machine learning for evaluating production function estimators on manufacturing survey data

Abstract

National statistical organizations often rely on non-exhaustive surveys to estimate industry-level production functions in years in which a full census is not conducted. When analyzing data from non-census years, we propose selecting an estimator based on a weighting of its in-sample and predictive performance. We compare the Cobb–Douglas functional assumption to existing nonparametric shape-constrained estimators and a newly proposed estimator. For simulated data, we find that our proposed estimator has the lowest weighted errors. Using the 2010 Chilean Annual National Industrial Survey, a Cobb–Douglas specification describes at least 90% as much variance as the best alternative estimator in practically all cases considered, providing two insights: the benefits of using application data for selecting an estimator, and the benefits of structure in noisy data. Finally, for the five largest manufacturing industries, we find that a 30% sample, on average, achieves 60% of the R-squared value that would have been achieved with a full census; however, the variance across industries and samples is large.

Notes

  1. We will primarily focus on establishment censuses and surveys performed by census bureaus, where an establishment is defined as a single physical location where business is conducted or where services or industrial operations are performed. An example is the U.S. Annual Survey of Manufactures. This survey is conducted annually, except for years ending in 2 and 7, in which a census is performed (Foster et al. 2008). It includes approximately 50,000 establishments selected from the census universe of 346,000, or approximately a 15% sample (Fort et al. 2013).

  2. The Chilean National Institute of Statistics uses the terms survey and census, whereas in other parts of economics the terms sample and population are commonly used for the same concepts.

  3. A body of economic literature that is less computational and more aggregate in nature is the growth accounting literature (Solow 1957; Barro and Sala-i-Martin 2003). These methods rely on price information, a cost minimization assumption, and parametric forms to deterministically compute the coefficients of a first-order approximation of a production function using observed input cost shares (see for example Syverson 2011). This literature foregoes any model adequacy check of the production model.

  4. Because of our focus on a conditional-mean estimator, we do not model inefficiency and therefore do not report inefficiency levels.

  5. For example, the Cobb–Douglas form is widely used in Monte Carlo simulations (Andor and Hesse 2014); however, the production economics literature does not provide any support for this standard assumption. The Trans-log functional form is more flexible than the Cobb–Douglas functional form; however, its parameters and domain have to be selected carefully to assure the Trans-log function satisfies concavity. For these reasons, we sacrifice flexibility and select the Cobb–Douglas functional form so that estimates consistent with concavity are more likely to result.

  6. A partitioned region is referred to as a basis region, Hannah and Dunson (2013).

  7. We use lowercase letters as indices and capital letters to indicate the number of components.

  8. The tolerance is set to 1% and CNLS is used as the large model in all of our examples. Initially, we do not use the Generalized Cross Validation (GCV) score approximation used by Hannah and Dunson (2013), because they assert that GCV’s predictive results are only comparable with full cross validation strategies for problems with \(n\ge 5000\), which are larger than the datasets we consider in this paper.

  9. The advantage of the machine learning based comparison methods over standard Monte Carlo simulation methods is that we do not need to specify a functional form to generate data. However, we are only able to make statements about the estimators for the particular application dataset under consideration. Monte Carlo simulation methods might still be preferred to explore a larger set of potential data generation processes and their associated production functions.

  10. In-sample error is the deviation between the estimated function and the true function measured at the input vectors that appear in the observed sample.

  11. Predictive error is the deviation between the estimated function and the true function measured at input vectors that are unobserved in the sample.

  12. The distinction between comparing to the true function and comparing to the observed output is that in Monte Carlo simulations we know the true function and can compare our estimators to it. When using application data, we can only compare to the observed output in the sample.

  13. Note that the estimator “hat” character is over \(E(Er{r}_{ISf}^{nL})\) rather than \(Er{r}_{ISf}^{nL}\), the in-sample error for the particular learning set fitted with the production function.

  14. The concept described here is referred to as sample-specificity, or the estimator’s performance in a specific sample. If we use the same set of data (the training set) for estimating the model and for evaluating the model, there are significant benefits to overfitting. Even if we use the same DGP to generate more observations for the input vectors used in fitting the model, we would expect a worse fit because this new output vector would have a different draw of random error terms. It is this difference between evaluating fit on the training set and evaluating fit on a new draw of the output vector for the same input vectors that motivates our use of the algorithm proposed in Hastie et al. (2009).

  15. Computing the in-sample error provides a more detailed measure of the quality of the production function on a full set than the learning error \({\overline{MSE}}_{yIS}^{nL}={\sum }_{v=1}^{V}{\sum }_{i=1}^{{n}_{L}}{({\hat{f}}_{vi}-{Y}_{vi})}^{2}/({n}_{L}V)\), because it averages performance across many possible \({\epsilon }_{i}\) residual values for the learning set input vectors.

  16. Specifically, if we intend to use the learning set’s residual sum of squares, calculation of an estimate of \({\sigma }^{2}\) would require knowledge of the functional estimator’s effective number of parameters. However, the effective number of parameters can be difficult to calculate for both nonparametric and sieve estimators.

  17. As a reviewer pointed out, the parametric Cobb–Douglas in multiplicative form with an additive error term is challenging not only in cross-sectional data, but also in a panel setting due to the incidental parameters issue. Specifically, analyzing the asymptotic performance, the variance of the noise term is a function of both \(T\) and \(N\), and unless both of these values approach infinity, the estimate of the noise variance will not converge to the true value.

  18. We experimented with a variety of starting values including linear and Cobb–Douglas estimates, zero values (the default values in Matlab), and the true values. While the starting values of zero perform quite poorly, all other starting values result in very similar estimates.

  19. Note that the definition of \({R}_{FS}^{2}\) implies that if the evaluated estimator fails to explain more variability than simply taking the mean of the output variable over the full sample, we will instead use the mean as our estimator.

  20. We reemphasize that unlike cross validation procedures in which the goal is to estimate \(E(Er{r}_{y}^{n})\), our goal is to estimate \(E(Er{r}_{y}^{nL})\).

  21. Again, expectations and averages of the error and optimism metrics discussed are taken over all possible learning sets of a given size.

  22. The maximum attainable noise-to-total variance levels of our real datasets (i.e., using the full set as the learning set) are very similar to those of our low noise settings. Compare \(1-{\overline{MSE}}_{FSy}^{nL}/var({Y}_{FS})\) in our low noise settings against the \({R}_{FS}^{2}\) results of the \(100\)% survey real datasets.

  23. Recall that the Cobb–Douglas function with an additive error term is used to maintain consistency of the error structure across estimators.

  24. While nonparametric estimators avoid potential functional form misspecification, nonparametric estimates of functional derivatives tend to have higher variances than parametric estimators, Hansen (2018).

  25. Within the class of monotonically increasing and concave production functions, the additional assumption of homotheticity can be imposed. A homothetic production function \(f({\boldsymbol{X}})\) can be written as \(g(h({\boldsymbol{X}}))\) where \(g\) is a monotonically increasing function and \(h\) is homogeneous of degree 1. Practically, nonhomotheticity implies that while input isoquants are convex for any output level, the shape of the input isoquant can vary across output levels.

  26. We calculate the MRTS as the ratio of the marginal product of capital to the marginal product of labor. Thus, we impose a positive sign convention.

  27. As mentioned, increased testing set variance as the learning set size increases does not seem to be large enough to affect the variance of our estimates across the different learning and testing set sizes considered.

  28. In Appendix 6.6, we further explore the sensitivity of the results shown in Fig. 3 to our assumption of a finite population of firms and discuss the consequences of considering an infinite amount of unobserved firms when assessing the predictive capabilities of our estimators, thus only evaluating estimator performance in terms of predictive error.

  29. These are hyperplanes which have zero coefficients on some input dimensions, implying it is possible to obtain output without the zero-coefficient inputs.

  30. Due to the increased computational burden of using larger datasets, we present results for a single replicate of the DGP for each sample size and only include learning set results. For this section we report RMSE results rather than MSE results because the latter are small and the differences are indistinguishable across settings.

References

  • Afriat SN (1967) The construction of utility functions from expenditure data. Int Econ Rev 8(1):67–77

  • Afriat SN (1972) Efficiency estimation of production functions. Int Econ Rev 13(3):568–598

  • Aigner D, Lovell CK, Schmidt P (1977) Formulation and estimation of stochastic frontier production function models. J Economet 6(1):21–37

  • Allen DM (1974) The relationship between variable selection and data agumentation and a method for prediction. Technometrics 16(1):125–127

  • Andor M, Hesse F (2014) The stoned age: the departure into a new era of efficiency analysis? A Monte Carlo comparison of stoned and the “oldies” (sfa and dea). J Product Anal 41(1):85–109

  • Andor M, Parmeter C (2017) Pseudolikelihood estimation of the stochastic frontier model. Appl Econ 49(55):5651–5661

  • Balázs G, György A, Szepesvári C (2015) Near-optimal max-affine estimators for convex regression. Proceedings of the eighteenth international conference on artificial intelligence and statistics, PMLR 38:56–64

  • Barro R, Sala-i-Martin X (2003) Economic growth. The MIT Press, Cambridge, MA

  • Breiman L (1992) The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. J Am Stat Assoc 87(419):738–754

  • Breiman L (1993) Hinging hyperplanes for regression, classification, and function approximation. IEEE Trans Inf Theory 39(3):999–1013

  • Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

  • Burman P (1989) A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika 76(3):503–514

  • Chambers RG (1988) Applied production analysis: a dual approach. Cambridge University Press, Cambridge, UK

  • Chen X (2007) Large sample sieve estimation of semi-nonparametric models. Handb Economet 6:5549–5632

  • De Loecker J, Goldberg PK, Khandelwal AK, Pavcnik N (2016) Prices, markups, and trade reform. Econometrica 84(2):445–510

  • Efron B (2004) The estimation of prediction error: covariance penalties and cross-validation. J Am Stat Assoc 99(467):619–632

  • Fare R, Grosskopf S, Lovell CK (1994) Production frontiers. Cambridge University Press, Cambridge, UK

  • Fort TC, Haltiwanger J, Jarmin RS, Miranda J (2013) How firms respond to business cycles: the role of firm age and firm size. IMF Econ Rev 61(3):520–559

  • Foster L, Haltiwanger J, Syverson C (2008) Reallocation, firm turnover, and efficiency: selection on productivity or profitability? Am Econ Rev 98(1):394–425

  • Frisch R (1964) Theory of production. Springer Science & Business Media. Berlin, Germany

  • Geisser S (1975) The predictive sample reuse method with applications. J Am Stat Assoc 70(350):320–328

  • Hannah L, Dunson D (2012) Ensemble methods for convex regression with applications to geometric programming based circuit design. ICML'12 Proceedings of the 29th International Conference on International Conference on Machine Learning, 147–154

  • Hannah L, Dunson D (2013) Multivariate convex regression with adaptive partitioning. J Mach Learn Res 14(1):3261–3294

  • Hansen BE (2018) Econometrics. Department of Economics, University of Wisconsin, Madison, WI

  • Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Springer series in statistics. Springer, New York

  • Huang R, Szepesvári C (2014) A finite-sample generalization bound for semiparametric regression: partially linear models. In Proceedings of the artificial intelligence and statistics, pp 402–410

  • Hwangbo H, Johnson AL, Ding Y (2015) Power curve estimation: functional estimation imposing the regular ultra passum law. Working Paper, SSRN https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2621033

  • Kuosmanen T (2008) Representation theorem for convex nonparametric least squares. Economet J 11(2):308–325

  • Kuosmanen T, Saastamoinen A, Sipiläinen T (2013) What is the best practice for benchmark regulation of electricity distribution? comparison of dea, sfa and stoned methods. Energy Policy 61:740–750

  • Kuosmanen T, Kortelainen M (2012) Stochastic non-smooth envelopment of data: semi-parametric frontier estimation subject to shape constraints. J Product Anal 38(1):11–28

  • Lee C-Y, Johnson AL, Moreno-Centeno E, Kuosmanen T (2013) A more efficient algorithm for convex nonparametric least squares. Eur J Oper Res 227(2):391–400

  • Mazumder R, Choudhury A, Iyengar G, Sen B (2019) A computational framework for multivariate convex regression and its variants. J Am Stat Assoc 114:318–331

  • Olesen OB, Ruggiero J (2014) Maintaining the regular ultra passum law in data envelopment analysis. Eur J Oper Res 235(3):798–809

  • Olesen OB, Ruggiero J (2018) An improved afriat-diewert-parkan nonparametric production function estimator. Eur J Oper Res 264(3):1172–1188

  • Shephard R (1970) Theory of cost and production functions. Princeton University Press, Princeton, New Jersey

  • Simar L, Zelenyuk V (2011) Stochastic fdh/dea estimators for frontier analysis. J Product Anal 36(1):1–20

  • Simar L, Van Keilegom I, Zelenyuk V (2017) Nonparametric least squares methods for stochastic frontier models. J Product Anal 47(3):189–204

  • Solow RM (1957) Technical change and the aggregate production function. Rev Econ Stat 39(3):312–320

  • Stone M (1974) Cross-validatory choice and assessment of statistical predictions. J R Stat Soc Ser B Methodol 36(2):111–147

  • Syverson C (2011) What determines productivity? J Econ Lit 49(2):326–365

  • White H, Wooldridge J (1991) Some results on sieve estimation with dependent observations. In: Nonparametric and semiparametric methods in econometrics and statistics: proceedings of the fifth international symposium in economic theory and econometrics, Chapter 18, Cambridge University Press, pp 459–493

  • Yagi D, Chen Y, Johnson AL, Kuosmanen T (2018) Shape constrained kernel-weighted least squares: Application to production function estimation for chilean manufacturing industries. J Bus Econ Stat, published online: https://doi.org/10.1080/07350015.2018.1431128

  • Yagi D, Chen Y, Johnson AL, Morita H (2019) An axiomatic nonparametric production function estimator: Modeling production in Japan’s cardboard industry. arXiv, Preprint at: https://arxiv.org/abs/1906.08359

Corresponding author

Correspondence to Andrew L. Johnson.

Appendix: Insights from machine learning for evaluating production function estimators on manufacturing survey data

This appendix includes:

  • Results for the case with 3 and 4 regressors; more extensive simulated dataset results (Appendix 6.1).

  • CAPNLS: Formulation and scalability to larger datasets (Appendix 6.2).

  • Parametric Bootstrap algorithm to calculate expected optimism (Appendix 6.3).

  • Comparison between different error structures for the parametric Cobb–Douglas function (Appendix 6.4).

  • Cobb–Douglas results with multiplicative residual assumption for Chilean manufacturing data (Appendix 6.5).

  • Application Results for Infinite Populations (Appendix 6.6).

  • Comprehensive Application Results: MPSS, MP, and MRTS (Appendix 6.7).

1.1 More extensive simulated dataset results

In this section, we provide additional Monte Carlo simulation results for the bivariate case and present results for the trivariate and tetravariate cases. The performance of all estimators deteriorates significantly with dimension due to the curse of dimensionality. The results reported are for the performance metric full set error, \(\widehat{E(Er{r}_{FS}^{nL})}\), which is a weighting of the expected in-sample error and the expected predictive error. While the parametric estimator should have the best performance in terms of expected predictive error, this is not the case for expected in-sample error. The parametric estimator is generally not able to fit the observed data with noise as well as the nonparametric estimators because of the parametric estimator’s lack of flexibility. Stated differently, the nonparametric estimators benefit from being able to overfit the observed sample. The effect of the parametric estimator’s lack of flexibility relative to the nonparametric estimators is pronounced in the higher dimensional, low noise cases.
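
To make the weighting concrete, the following minimal sketch (in Python, not the MATLAB used for the reported experiments) computes a full set error from in-sample and predictive squared errors against the true function, assuming the weights equal the shares of observed and unobserved input vectors in the finite full set.

    import numpy as np

    def full_set_error(f_hat_fs, f_true_fs, is_learning):
        """Weighted full set error: a convex combination of in-sample and
        predictive squared errors against the true function, weighted by the
        shares of observed and unobserved input vectors in the full set."""
        is_learning = np.asarray(is_learning, dtype=bool)
        w = is_learning.mean()                                # n_L / n_FS
        err_is = np.mean((f_hat_fs[is_learning] - f_true_fs[is_learning]) ** 2)
        err_pred = np.mean((f_hat_fs[~is_learning] - f_true_fs[~is_learning]) ** 2)
        return w * err_is + (1.0 - w) * err_pred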

1.1.1 Bi-variate input Cobb–Douglas DGP

In this section we reproduce Table 1 of the text, but also include computational times. We run our experiments on a computer with Intel Core2 Quad CPU 3.00 GHz and 8GB RAM using MATLAB R2016b and the solvers quadprog and lsqnonlin.

Table 10 Number of hyperplanes and runtimes for bivariate input Cobb–Douglas DGP

1.1.2 Tri-variate input Cobb–Douglas DGP

We consider the data generation process (DGP) \({Y}_{i}={X}_{i1}^{0.4}{X}_{i2}^{0.3}{X}_{i3}^{0.2}+{\epsilon }_{i}\), where \({\epsilon }_{i} \sim N(0,{\sigma }^{2})\), \(\sigma =\{0.01,0.05,0.1,0.2,0.3,0.4\}\) for our six noise settings, with the same small and large noise split as the previous example, and \({X}_{ij} \sim Unif(0.1,1)\) for \(j=1,2,3,\ i=1,\ldots ,{n}_{L}\). Again, CNLS’s expected full set errors exceed the displayed range for the learning-to-full set scenarios regardless of noise level, due to the poor predictive error values, which are partly linked to the higher proportion of ill-defined hyperplanes (Footnote 29) that CNLS fits. Compared to the bi-variate case, the parametric estimator deteriorates relative to the nonparametric estimators, i.e., the higher errors for the parametric estimator exceed the small scale of most panels in Fig. 4.
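
For concreteness, a minimal sketch of drawing one replicate from this DGP (Python; the sample size and noise level shown are illustrative choices):

    import numpy as np

    def draw_trivariate_sample(n_l, sigma, seed=0):
        """One replicate of the tri-variate DGP above:
        Y_i = X_i1^0.4 * X_i2^0.3 * X_i3^0.2 + eps_i, eps_i ~ N(0, sigma^2),
        with X_ij ~ Unif(0.1, 1)."""
        rng = np.random.default_rng(seed)
        X = rng.uniform(0.1, 1.0, size=(n_l, 3))
        f_true = X[:, 0] ** 0.4 * X[:, 1] ** 0.3 * X[:, 2] ** 0.2
        Y = f_true + rng.normal(0.0, sigma, size=n_l)
        return X, Y, f_true

    X, Y, f_true = draw_trivariate_sample(n_l=100, sigma=0.4)  # largest noise setting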

Fig. 4 Trivariate input Cobb–Douglas DGP results for small noise settings

Figure 5, however, shows that the errors given by the parametric estimator are lower than the errors for CNLS in learning-to-full settings. Further, CAP’s expected performance deteriorates relative to the bi-variate example. As in the bi-variate example, the expected full set error gap between CAPNLS and the correctly specified parametric estimator either favors CAPNLS at every learning set size or becomes more favorable for CAPNLS as the learning set size increases. Finally, Fig. 5 shows that CAPNLS can accurately recover a production function even when noise composes nearly 85% of the variance, as is the case when \(\sigma =0.4\).

Fig. 5 Trivariate input Cobb–Douglas DGP results for large noise settings

In Table 11, we observe that all methods fit slightly more hyperplanes than for the bi-variate example. The increase in the number of hyperplanes with increased dimensionality is moderate for both CAPNLS and CAP at all settings. For CNLS, while the number of hyperplanes does not significantly increase for \(n=100\), it increases significantly for the two larger datasets. The runtimes for all methods are also higher than in the previous example, i.e., CAPNLS’s times nearly double, although they stay below one minute for all scenarios.

Table 11 Number of hyperplanes and runtimes for trivariate input Cobb–Douglas DGP

1.1.3 Tetravariate input Cobb–Douglas DGP

We consider the DGP \({Y}_{i}={X}_{i1}^{0.3}{X}_{i2}^{0.25}{X}_{i3}^{0.25}{X}_{i4}^{0.1}+{\epsilon }_{i}\), where \({\epsilon }_{i} \sim N(0,{\sigma }^{2})\) and \({X}_{ij} \sim Unif(0.1,1)\) for \(j=1,2,3,4;\ i=1,\ldots ,{n}_{L}\) and \(\sigma =\{0.01,0.05,0.1,0.2,0.3,0.4\}\) for our six noise settings. The performance of all estimators further deteriorates due to the curse of dimensionality. Figures 6 and 7 show that for this higher dimensional example, the parameters of the parametric estimator are increasingly difficult to learn, and thus the parametric estimator can only predict the true function up to a certain accuracy, namely \({\bar{MSE}}_{FSf}^{nL}=0.015\), and tends to plateau at this error level even as the learning set size increases. Moreover, the benefits of CAPNLS over the other nonparametric methods are similar to the tri-variate example for the small noise settings, but significantly larger for the large noise settings. Finally, the gap between CAPNLS and all the other functional estimators, parametric or nonparametric, favors CAPNLS for all noise settings and learning set sizes.

Fig. 6 Tetra-variate input Cobb–Douglas DGP results for small noise settings

Fig. 7 Tetra-variate input Cobb–Douglas DGP results for large noise settings

Table 12 shows that the number of hyperplanes needed to fit the tetra-variate input production function does not significantly increase from the trivariate-input case for any of the methods. CAPNLS has 40–60% longer runtimes compared to the trivariate-input case. The runtimes increase with dimensionality; however, this is not a severe concern, because the input information used to fit a production function (or the output information in the case of a cost function) rarely exceeds four variables. The maximum recorded runtime for CAPNLS is still below two minutes, i.e., it is not large in absolute terms.

Table 12 Number of hyperplanes and runtimes for tetra-variate input Cobb–Douglas DGP

1.1.4 More extensive simulated dataset results

Tables 13–30 provide more detailed results. This information was shown in Figs. 1 and 2 in the main text and Figs. 4–7 in the Appendix.

Table 13 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=2\), \(\sigma =0.2\)
Table 14 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=2\), \(\sigma =0.3\)
Table 15 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=2\), \(\sigma =0.4\)
Table 16 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=3\), \(\sigma =0.2\)
Table 17 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=3\), \(\sigma =0.3\)
Table 18 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=3\), \(\sigma =0.4\)
Table 19 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=4\), \(\sigma =0.2\)
Table 20 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=4\), \(\sigma =0.3\)
Table 21 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=4\), \(\sigma =0.4\)
Table 22 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=2\), \(\sigma =0.01\)
Table 23 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=2\), \(\sigma =0.05\)
Table 24 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=2\), \(\sigma =0.1\)
Table 25 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=3\), \(\sigma =0.01\)
Table 26 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=3\), \(\sigma =0.05\)
Table 27 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=3\), \(\sigma =0.1\)
Table 28 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=4\), \(\sigma =0.01\)
Table 29 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=4\), \(\sigma =0.05\)
Table 30 \(MS{E}_{f}\), \(MS{E}_{ISf}\), \(MS{E}_{FSf}\), \(MS{E}_{y}\), \(MS{E}_{ISy}\), \(MS{E}_{FSy}\), time, and \(K\) results for \(d=4\), \(\sigma =0.1\)

1.2 CAPNLS: formulation and scalability to larger datasets

1.2.1 CAPNLS as a series of quadratic programs

Taking advantage of the linearly-constrained quadratic programming structure of CAPNLS is essential to achieve computational feasibility. Therefore, we write Problem (5) in the standard form

$$\begin{array}{l}\mathop{\rm{min}}\limits_{{\boldsymbol{\beta}}}\,\frac{1}{2}{{\boldsymbol{\beta }}}^{T}H{\boldsymbol{\beta }}+{{\boldsymbol{\beta }}}^{T}g\\ s.t.\ A{\boldsymbol{\beta }}\le 0,\ {\boldsymbol{\beta }}\ge {\boldsymbol{l}}\end{array}$$
(6.2.1)

Starting with the objective function from (4), we let \(\tilde{{\boldsymbol{X}}}=({\boldsymbol{1}},{\boldsymbol{X}})\) and write

$$\begin{array}{ll}\mathop{\rm{min}}\limits_{{({\beta }_{0k},\,{{\boldsymbol{\beta }}}_{-0k})}_{k=1}^{K}}\sum _{i=1}^{N}{({\beta }_{0[i]}+{{\boldsymbol{\beta }}}_{-0[i]}^{T}{{\boldsymbol{X}}}_{i}-{Y}_{i})}^{2}\\=\mathop{\rm{min}}\limits_{{({\boldsymbol{\beta }})}_{k=1}^{K}}\sum _{i=1}^{N}{({{\boldsymbol{\beta }}}_{[i]}^{T}{\tilde{{\boldsymbol{X}}}}_{i}-{Y}_{i})}^{2}\\=\cdots =\mathop{\rm{min} }\limits_{{({\boldsymbol{\beta }})}_{k=1}^{K}}\frac{1}{2}\sum _{i=1}^{N}{({{\boldsymbol{\beta }}}_{[i]}^{T}{\tilde{{\boldsymbol{X}}}}_{i})}^{2}-\sum _{i=1}^{N}({{\boldsymbol{\beta }}}_{[i]}^{T}{\tilde{{\boldsymbol{X}}}}_{i}{Y}_{i})\end{array}$$
(6.2.2)

where we have dropped the constant \({\sum }_{i=1}^{N}{Y}_{i}^{2}\) and multiplied by one half. To write the last expression in Eq. (6.2.2) in standard form, we first write \({\sum }_{i=1}^{N}{({{\boldsymbol{\beta }}}_{[i]}^{T}{\tilde{{\boldsymbol{X}}}}_{i})}^{2}\) using matrix operations. We define the observation-to-hyperplane \(n(d+1)\times K(d+1)\)-dimensional mapping matrix \(P\), with elements \(P((i-1)(d+1)+j,([i]-1)(d+1)+j)={\tilde{X}}_{ij}\), \(i=1,\cdots ,n,\ j=1,\cdots ,d+1\), and all other elements equal to zero. Similarly, we define the \(n\times n(d+1)\)-dimensional observation-specific vector product matrix \(S\), with elements \(S(i,(i-1)(d+1)+l)=1\) for \(l=1,\cdots ,d+1,\ i=1,\cdots ,n\). Then, we concatenate the vectors \({({{\boldsymbol{\beta }}}_{k})}_{k=1}^{K}\) into the \(K(d+1)\times 1\)-dimensional vector \({\boldsymbol{\beta }}\). It follows that

$$\sum _{i=1}^{N}{({{\boldsymbol{\beta }}}_{[i]}^{T}{\tilde{{\boldsymbol{X}}}}_{i})}^{2}={\boldsymbol{\beta}}^{T}{P}^{T}({S}^{T}S)P{\boldsymbol{\beta }}\,\text{and}\,{\sum _{i=1}^{N}}({\boldsymbol{\beta}}_{[i]}^{T}{\tilde{\boldsymbol{X}}}_{i}{Y}_{i})={\boldsymbol{\beta}}^{T}{P}^{T}{S}^{T}{\boldsymbol{Y}},$$
(6.2.3)

from which we have that \(H={P}^{T}({S}^{T}S)P\) and \(g=-{P}^{T}{S}^{T}{\boldsymbol{Y}}\).

To write the Afriat inequality constraints as the \(nK\times K(d+1)\)-dimensional matrix \(A\), we let elements \(A(K(i-1)+k,\ j+(d+1)([i]-1))={\tilde{X}}_{ij}\) and \(A(K(i-1)+k,\ j+(d+1)(k-1))=-{\tilde{X}}_{ij}\), for \(i=1,\cdots ,n,\ j=1,\cdots ,d+1,\ k=1,\cdots ,K\), and let all other elements equal zero, so that each row of \(A{\boldsymbol{\beta }}\le 0\) requires the assigned hyperplane to take a value no greater than hyperplane \(k\) at observation \(i\). Finally, we define the \(K(d+1)\)-dimensional vector \({\boldsymbol{l}}\) to have elements \({\boldsymbol{l}}((k-1)(d+1)+1)=0,\ k=1,\cdots ,K\), and all other elements equal to negative infinity.
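
The same quadratic program can also be posed directly in a modeling layer instead of assembling \(H\), \(g\), \(A\), and \({\boldsymbol{l}}\) by hand. The following is a minimal Python/cvxpy sketch for a fixed observation-to-hyperplane assignment; the assignment vector assign, the number of hyperplanes K, and the use of cvxpy are illustrative assumptions, not the MATLAB quadprog implementation used in the experiments.

    import cvxpy as cp
    import numpy as np

    def capnls_qp(X, Y, assign, K):
        """Least squares over assigned hyperplanes subject to the Afriat
        (concavity) inequalities and nonnegative intercepts (the beta >= l
        bound above). X is n-by-d, Y has length n, assign[i] in {0,...,K-1}."""
        n, d = X.shape
        b0 = cp.Variable(K)          # intercepts beta_{0k}
        B = cp.Variable((K, d))      # slopes beta_{-0k}
        fitted = cp.hstack([b0[assign[i]] + B[assign[i], :] @ X[i] for i in range(n)])
        cons = [b0 >= 0]             # the lower-bound vector l
        for i in range(n):
            for k in range(K):       # Afriat inequalities: assigned plane attains the minimum
                cons.append(b0[assign[i]] + B[assign[i], :] @ X[i]
                            <= b0[k] + B[k, :] @ X[i])
        prob = cp.Problem(cp.Minimize(cp.sum_squares(fitted - Y)), cons)
        prob.solve()
        return b0.value, B.value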

1.2.2 Scalability of CAPNLS to larger datasets

To demonstrate the performance of CAPNLS on larger datasets, we revisit the DGP used in the tri-variate case, specifically, \({Y}_{i}={X}_{i1}^{0.4}{X}_{i2}^{0.3}{X}_{i3}^{0.2}+{\epsilon }_{i}\), where \({\epsilon }_{i} \sim N(0,{\sigma }^{2})\), \(\sigma =0.1\) and \({X}_{ij} \sim Unif(0.1,1)\) for \(j=1,2,3,\ i=1,\cdots ,n\), and \(n=500,1000,2000,3000\), and \(5000\). Table 31 reports the estimator’s performance (Footnote 30).

Table 31 Number of hyperplanes and runtimes for trivariate input Cobb–Douglas DGP on larger datasets

We first conduct standard CAPNLS analysis and report learning errors, the number of fitted hyperplanes, and runtime results. Runtimes for datasets up to \(2000\) observations are well below the one-hour threshold, but there are significant scalability challenges for datasets larger than \(2000\) observations. Thus, we apply the Fast CAP stopping criterion in Hannah and Dunson (2013), which measures the GCV score improvement from the addition of one more hyperplane and stops the algorithm if no improvement has been achieved in two consecutive additions. Unlike Fast CAP, however, we apply it directly to the learning error computed against the observations. We denote the results for those runs with the CAPNLSF superscript and observe that the differences are minimal compared to following our standard partitioning strategy. This alternative stopping rule results in a highly scalable algorithm which can fit datasets of up to \(5000\) observations in \(\sim\!\! 40\) minutes.
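
A minimal sketch of this stopping rule (Python; add_hyperplane and learning_error are placeholder callables standing in for the partitioning step and the learning-error evaluation, not part of the actual implementation):

    def fit_with_stalling_rule(add_hyperplane, learning_error, k_max, patience=2):
        """Add hyperplanes greedily and stop once the learning error has not
        improved in `patience` consecutive additions."""
        model, best_err, stalls = None, float("inf"), 0
        for _ in range(k_max):
            model = add_hyperplane(model)   # refit with one additional hyperplane
            err = learning_error(model)
            if err < best_err:
                best_err, stalls = err, 0
            else:
                stalls += 1
                if stalls >= patience:
                    break
        return model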

1.3 Parametric Bootstrap algorithm to calculate expected optimism

We apply the following algorithm from Efron (2004) to compute in-sample optimism. First, we assume a Gaussian density \(p({\boldsymbol{Y}})=N(\hat{{\boldsymbol{Y}}},{\hat{\sigma }}^{2}{\boldsymbol{I}})\), where \(\hat{{\boldsymbol{Y}}}\) is the vector of estimated output values of the estimator for which we are assessing the in-sample optimism. We use \({\boldsymbol{I}}\) to indicate an identity matrix. We obtain \({\hat{\sigma }}^{2}\) from the residuals of a “big” model presumed to have negligible bias. Given CNLS’s high flexibility and complex description (many hyperplanes), we choose it as our “big” model. Although obtaining an unbiased estimate of \({\sigma }^{2}\) from CNLS’s residuals is complicated, i.e., there are no formal results regarding the effective number of parameters CNLS uses, using \(MS{E}_{yLearn}^{CNLS}\) as \({\hat{\sigma }}^{2}\) results in a downward biased estimator of \({\sigma }^{2}\). This downward bias in fact results in improved efficiency for the parametric bootstrap algorithm and is an example of a “little” bootstrap, see Breiman (1992). Thus, we let \({\hat{\sigma }}^{2}=MS{E}_{yLearn}^{CNLS}\). Efron (2004) then suggests drawing a large number \(B\) of simulated output vectors \({{\boldsymbol{Y}}}^{* }\) from \(p({\boldsymbol{Y}})\), fitting them to obtain estimates \({\hat{{\boldsymbol{Y}}}}^{* }\), and estimating \(co{v}_{i}=cov({\hat{Y}}_{i},{Y}_{i})\) by computing

$${\hat{cov}}_{i}=\sum _{b=1}^{B}{\hat{Y}}_{i}^{* b}({Y}_{i}^{* b}-{\bar{Y}}_{i}^{* })/(B-1),\quad {\bar{Y}}_{i}^{* }=\sum _{b=1}^{B}{Y}_{i}^{* b}/B.$$
(6.3.1)

We select \(B=500\) for all our experiments based on observed convergence of the \({\sum }_{i=1}^{N}{\hat{cov}}_{i}\) quantity. Further, we note that if the researcher is not comfortable with the assumption made about the size of \(MS{E}_{yLearn}^{CNLS}\) relative to \({\sigma }^{2}\), sensitivity analysis (by adding a multiplier \(c\,>\,1\), such that \(p({\boldsymbol{Y}})=N(\hat{{\boldsymbol{Y}}},c{\hat{\sigma }}^{2}{\boldsymbol{I}})\)) can be performed. Finally, we also note that non-Gaussian distributions can be used to draw the bootstrapped \({{\boldsymbol{Y}}}^{* }\) vectors. This is especially useful when considering inefficiency, because it can imply a skewed distribution.
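
A minimal sketch of this parametric bootstrap (Python; refit is a placeholder callable that refits the estimator under assessment to a simulated output vector, and is not part of the actual implementation):

    import numpy as np

    def bootstrap_cov_penalty(y_hat, sigma2, refit, B=500, seed=0):
        """Estimate cov_i = cov(Yhat_i, Y_i) as in Eq. (6.3.1): draw
        Y* ~ N(y_hat, sigma2 * I), refit each draw, and compute the empirical
        covariance of each fitted value with its simulated output."""
        rng = np.random.default_rng(seed)
        n = len(y_hat)
        Y_star = y_hat + rng.normal(0.0, np.sqrt(sigma2), size=(B, n))
        Yhat_star = np.vstack([refit(Y_star[b]) for b in range(B)])
        cov_i = np.sum(Yhat_star * (Y_star - Y_star.mean(axis=0)), axis=0) / (B - 1)
        return cov_i  # expected optimism for squared-error loss: 2/n * cov_i.sum()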

1.4 Comparison between different error structures for the parametric Cobb–Douglas function

This Appendix compares the performance of the standard log-linear Cobb–Douglas with an additive error estimator and the multiplicative Cobb–Douglas with an additive error estimator. Figures 8–13 consider the bivariate, trivariate, and tetravariate cases with both a low and a high noise setting. We find that both parametric estimators perform very similarly, with the correctly specified additive error estimator performing slightly better.

Fig. 8 A comparison of Cobb–Douglas multiplicative function with an additive error term (CDA) vs. Cobb–Douglas log-linear function with an additive error term (CDM) when the noise level is low and the model includes two inputs

Fig. 9 A comparison of Cobb–Douglas multiplicative function with an additive error term (CDA) vs. Cobb–Douglas log-linear function with an additive error term (CDM) when the noise level is high and the model includes two inputs

Fig. 10 A comparison of Cobb–Douglas multiplicative function with an additive error term (CDA) vs. Cobb–Douglas log-linear function with an additive error term (CDM) when the noise level is low and the model includes three inputs

Fig. 11 A comparison of Cobb–Douglas multiplicative function with an additive error term (CDA) vs. Cobb–Douglas log-linear function with an additive error term (CDM) when the noise level is high and the model includes three inputs

Fig. 12 A comparison of Cobb–Douglas multiplicative function with an additive error term (CDA) vs. Cobb–Douglas log-linear function with an additive error term (CDM) when the noise level is low and the model includes four inputs

Fig. 13 A comparison of Cobb–Douglas multiplicative function with an additive error term (CDA) vs. Cobb–Douglas log-linear function with an additive error term (CDM) when the noise level is high and the model includes four inputs

1.5 Further Cobb–Douglas results with multiplicative residual assumption for Chilean manufacturing data

This section presents the parameter estimates for the additive Cobb–Douglas specification. Estimates are presented for the five largest industries in the Chilean manufacturing data. Table 32 shows that returns to scale (RTS) are decreasing (DRS) for Other Metal Products (2899), Wood (2010) and Structural Use Metal (2811). However, RTS are increasing (IRS) for Plastics (2520) and Bakeries (1541). All RTS estimates are within an additive factor of \(0.2\) of constant returns to scale.
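
For reference, RTS for a Cobb–Douglas technology is the sum of the estimated input elasticities; a minimal sketch (the elasticities shown are hypothetical values, not the Table 32 estimates):

    def cobb_douglas_rts(elasticities, tol=1e-9):
        """Classify returns to scale from Cobb-Douglas input elasticities:
        a sum below/at/above one implies DRS/CRS/IRS."""
        rts = sum(elasticities)
        if abs(rts - 1.0) <= tol:
            return rts, "CRS"
        return rts, "IRS" if rts > 1.0 else "DRS"

    # Hypothetical capital, labor, energy, services elasticities; sum is 1.10 -> IRS
    print(cobb_douglas_rts([0.3, 0.45, 0.1, 0.25]))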

Table 32 Cobb–Douglas parameter estimates for five industries

1.5.1 Cobb–Douglas results with multiplicative residual

In this subsection, we reconstruct Table 3 from the main text, but consider the alternative parametric estimator, a log-linear Cobb–Douglas function with a multiplicative error term (CDM).

Table 33 Ratio of CDM to Best Model performance

1.6 Application results for infinite populations

In this Appendix we explore the sensitivity of our application insights to the case in which predictive ability at any point of the production function is equally important. This represents a key departure from the assumptions made in the main body of the paper, as it translates into not weighting in-sample and predictive errors. Rather, since we are interested in the descriptive ability of the fitted production function on an infinite number of unobserved input vectors, we only consider the predictive error. To illustrate the consequences of this alternative assumption in detail, we present our results in Fig. 14, which is analogous to Fig. 3. In Fig. 14 we show replicate-specific as well as averaged \({R}^{2}\) values. In this case, rather than using \({R}_{FS}^{2}\) as our predictive power indicator, we use \({R}_{Pred}^{2}=\max (1-E(\widehat{Er{r}_{y}})/Var({Y}_{FS}),0)\).
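
A minimal sketch of the two indicators (Python; err_fs_hat and err_pred_hat are placeholders for the estimated full set and predictive errors, respectively):

    import numpy as np

    def r2_indicator(err_hat, y_fs):
        """R^2-style indicator: one minus an error estimate over the full-set
        output variance, floored at zero (the mean is the fallback predictor)."""
        return max(1.0 - err_hat / np.var(y_fs), 0.0)

    # R2_FS uses the weighted full set error estimate; R2_Pred uses only the
    # predictive error estimate, as in this appendix:
    # r2_fs = r2_indicator(err_fs_hat, y_fs); r2_pred = r2_indicator(err_pred_hat, y_fs)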

Fig. 14 Best Method’s \({R}_{Pred}^{2}\) as a function of relative subset size for selected industries. CAPNLS was chosen as the Best Method for industry codes 2899, 2010, 2811 and 2520, while CDA was chosen for industry code 1541

In Fig. 14, we observe that for the majority of the studied industries, weighting the in-sample error with the predictive error, the direct consequence of our finite full population assumption, does not affect the assessment of the mean predictive power of our production function models. Using our notation, this means that for most industries the expected \({R}_{Pred}^{2}\) for each given subsample size did not differ greatly from \({R}_{FS}^{2}\). However, the \({R}_{Pred}^{2}\) figures have significantly higher variance than their \({R}_{FS}^{2}\) counterparts for each subsample size. For all industries except industry code 2811, we obtain at least one replicate with negligible predictive power. For some industries, such as Industry Codes 2520 and 1541, this causes the predictive power bounds to be very wide (although the upper bound and mean values increase monotonically in the subsample size).

1.7 Comprehensive application results: MPSS, MP, and MRTS

Tables 34–39 provide more detailed results regarding MPSS, MP for both capital and labor, and MRTS for different percentiles of the four inputs (capital, labor, energy, and services). This information was summarized in Tables 5–9 in the main text.

Table 34 Most productive scale size (\(y\))
Table 35 Marginal product of capital
Table 36 Marginal product of capital
Table 37 Marginal product of labor
Table 38 Marginal product of labor
Table 39 Marginal rate of technical substitution (\(K/L\))

Cite this article

Preciado Arreola, J.L., Yagi, D. & Johnson, A.L. Insights from machine learning for evaluating production function estimators on manufacturing survey data. J Prod Anal 53, 181–225 (2020). https://doi.org/10.1007/s11123-019-00570-9
