Data science, big data and statistics

Galeano, Pedro; Peña, Daniel

doi:10.1007/s11749-019-00651-9

Invited Paper
Published: 08 April 2019

TEST, volume 28, pages 289–329 (2019)

Abstract

This article analyzes how Big Data is changing the way we learn from observations. We describe the changes in statistical methods in seven areas that have been shaped by the Big Data-rich environment: the emergence of new sources of information; visualization in high dimensions; multiple testing problems; analysis of heterogeneity; automatic model selection; estimation methods for sparse models; and merging network information with statistical models. Next, we compare the statistical approach with those in computer science and machine learning and argue that the convergence of different methodologies for data analysis will be the core of the new field of data science. Then, we present two examples of Big Data analysis in which several of the new tools discussed previously are applied, such as using network information or combining different sources of data. The article concludes with some final remarks.

Acknowledgements

The invitation to write this article came from the editor Jesús López-Fidalgo, and we are very grateful to him for his encouragement. The applications presented in this paper were carried out with Federico Liberatore, Lara Quijano-Sánchez and Carlo Sguera, post-docs at the UC3M-BS Institute of Financial Big Data. Iván Blanco and Jose Luis Torrecilla, also post-docs at the Institute, contributed useful discussions. The ideas in this article have been clarified by the comments of Andrés Alonso, Anibal Figueiras, Rosa Lillo, Juan Romo and Rubén Zamar. To all of them, our gratitude.

Author information

Authors and Affiliations

Departamento de Estadística and Institute of Financial Big Data, Universidad Carlos III de Madrid, 28903, Getafe, Madrid, Spain
Pedro Galeano & Daniel Peña

Corresponding author

Correspondence to Pedro Galeano.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research has been supported by Grant ECO2015-66593-P of MINECO/FEDER/UE.

This invited paper is discussed in comments available at: https://doi.org/10.1007/s11749-019-00639-5, https://doi.org/10.1007/s11749-019-00640-y, https://doi.org/10.1007/s11749-019-00641-x, https://doi.org/10.1007/s11749-019-00642-w, https://doi.org/10.1007/s11749-019-00643-9, https://doi.org/10.1007/s11749-019-00644-8, https://doi.org/10.1007/s11749-019-00646-6, https://doi.org/10.1007/s11749-019-00647-5, and https://doi.org/10.1007/s11749-019-00648-4.

About this article

Cite this article

Galeano, P., Peña, D. Data science, big data and statistics. TEST 28, 289–329 (2019). https://doi.org/10.1007/s11749-019-00651-9

Published: 08 April 2019
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s11749-019-00651-9