-
Teaching Visual Accessibility in Introductory Data Science Classes with Multi-Modal Data Representations arXiv.stat.OT Pub Date : 2022-08-04 JooYoung Seo, Mine Dogucu
Although there are various ways to represent data patterns and models, visualization has been primarily taught in many data science courses for its efficiency. Such vision-dependent output may cause critical barriers against those who are blind and visually impaired and people with learning disabilities. We argue that instructors need to teach multiple data representation methods so that all students
-
Customs Import Declaration Datasets arXiv.stat.OT Pub Date : 2022-08-04 Chaeyoon Jeong, Sundong Kim, Jaewoo Park, Yeonsoo Choi
Given the huge volume of cross-border flows, effective and efficient control of trades becomes more crucial in protecting people and society from illicit trades while facilitating legitimate trades. However, limited accessibility of the transaction-level trade datasets hinders the progress of open research, and lots of customs administrations have not benefited from the recent progress in data-based
-
InsightiGen: a versatile tool to generate insight for an academic systematic literature review arXiv.stat.OT Pub Date : 2022-08-02 Ardeshir Shojaeinasab, Masoud Jalayer, Homayoun Najjaran
A comprehensive literature review has always been an essential first step of every meaningful research. In recent years, however, the availability of a vast amount of information in both open-access and subscription-based literature in every field has made it difficult, if not impossible, to be certain about the comprehensiveness of one's survey. This subsequently can lead to reviewers' questioning
-
Model-Free, Monotone Invariant and Computationally Efficient Feature Screening with Data-adaptive Threshold arXiv.stat.OT Pub Date : 2022-07-27 Linsui Deng, Yilin Zhang
Feature screening for ultrahigh-dimension, in general, proceeds with two essential steps. The first step is measuring and ranking the marginal dependence between response and covariates, and the second is determining the threshold. We develop a new screening procedure, called SIT-BY procedure, that possesses appealing statistical properties in both steps. By employing sliced independence estimates
-
Alternative approaches for analysing repeated measures data that are missing not at random arXiv.stat.OT Pub Date : 2022-07-23 Oliver Dukes, David Richardson, Eric Tchetgen Tchetgen
We consider studies where multiple measures on an outcome variable are collected over time, but some subjects drop out before the end of follow up. Analyses of such data often proceed under either a 'last observation carried forward' or 'missing at random' assumption. We consider two alternative strategies for identification; the first is closely related to the difference-in-differences methodology
-
Sharp hypotheses and organic fiducial inference arXiv.stat.OT Pub Date : 2022-07-18 Russell J. Bowater
A fundamental class of inferential problems are those characterised by there having been a substantial degree of pre-data (or prior) belief that the value of a model parameter $\theta_j$ was equal or lay close to a specified value $\theta^{*}_j$, which may, for example, be the value that indicates the absence of a treatment effect or the lack of correlation between two variables. This paper puts forward
-
The Inference Framework arXiv.stat.OT Pub Date : 2022-07-10 Nicholas Carrara
The following three sections and appendices are taken from my thesis "The Foundations of Inference and its Application to Fundamental Physics" from 2021, in which I construct a theory of entropic inference from first principles. The majority of these chapters are not original, but are a collection of various sources through the history of the subject. The first section deals with deductive reasoning
-
On General Weighted Extropy of Ranked Set Sampling arXiv.stat.OT Pub Date : 2022-07-05 Nitin Gupta, Santosh Kumar Chaudhary
In the past six years, considerable attention has been given to the extropy measure proposed by Lad et al. (2015). Weighted Extropy of Ranked Set Sampling was studied and compared with simple random sampling by Qiu et al. (2022). The general weighted extropy and some results related to it are introduced in this paper. We provide general weighted extropy of ranked set sampling. We also studied characterization
-
Cost-Efficient Fixed-Width Confidence Intervals for the Difference of Two Bernoulli Proportions arXiv.stat.OT Pub Date : 2022-07-04 Ignacio Erazo, David Goldsman, Yajun Mei
We study properties of confidence intervals (CIs) for the difference of two Bernoulli distributions' success parameters, $p_x - p_y$, in the case where the goal is to obtain a CI of a given half-width while minimizing sampling costs when the observation costs may be different between the two distributions. Assuming that we are provided with preliminary estimates of the success parameters, we propose
-
Modified entropies as the origin of generalized uncertainty principles arXiv.stat.OT Pub Date : 2022-06-28 Nana Cabo Bizet, Octavio Obregón, Wilfredo Yupanqui
In a profound way, entropy represents a concept and a measure of uncertainty. It has been shown that the uncertainty in quantum mechanics, namely the Heisenberg principle, is a consequence of the entropic uncertainty principle. Additionally, in several theoretical frameworks, the Heisenberg uncertainty principle needs to be extended to a Generalized Uncertainty Principle (GUP) to describe the quantum
-
Run and Frequency quotas for q-binary trials arXiv.stat.OT Pub Date : 2022-06-27 Jungtaek Oh
We study the distributions of waiting times in variations of the $q$-sooner and later waiting time problem. One variation imposes length and frequency quotas on the runs of successes and failures. Another case considers binary trials for which the probability of ones is geometrically varying. We also study the distributions of Longest run under the same variations. The main theorems are sooner and
-
Analysis of Hydrogen Production Costs across the United States and over the next 30 years arXiv.stat.OT Pub Date : 2022-06-21 Yuanchen Wang, Mahmoud M. Ramadan, Pragya Tooteja
Hydrogen can play an important role in decarbonization. While hydrogen is usually produced through steam methane reforming (SMR), it can also be produced through water electrolysis, which is cleaner. The relative cost and carbon intensity of hydrogen production through SMR and electrolysis vary throughout the United States because of differences in the grid. While many hydrogen cost models exist, no regional hydrogen study
-
Approximate Bayesian Inference for the Interaction Types 1, 2, 3 and 4 with Application in Disease Mapping arXiv.stat.OT Pub Date : 2022-06-18 Esmail Abdul Fattah, Haavard Rue
We address in this paper a new approach for fitting spatiotemporal models with application in disease mapping using the interaction types 1,2,3, and 4. When we account for the spatiotemporal interactions in disease-mapping models, inference becomes more useful in revealing unknown patterns in the data. However, when the number of locations and/or the number of time points is large, the inference gets
-
Process, Population, and Sample: the Researcher's Interest arXiv.stat.OT Pub Date : 2022-06-16 Charles W. Champ, Andrew V. Sills
A case is made that researchers are interested in studying processes. Often the inferences they are interested in making are about the process and its associated population. On other occasions, a researcher may be interested in making an inference about the collection of individuals the process has generated. We will call the statistical methods employed by the researcher to make such inferences about
-
On the probability of invalidating a causal inference due to limited external validity arXiv.stat.OT Pub Date : 2022-06-17 Tenglong Li
External validity is often questionable in empirical research, especially in randomized experiments due to the trade-off between internal validity and external validity. To quantify the robustness of external validity, one must first conceptualize the gap between a sample that is fully representative of the target population (i.e., the ideal sample) and the observed sample. Drawing on Frank & Min (2007)
-
Current state and prospects of R-packages for the design of experiments arXiv.stat.OT Pub Date : 2022-06-15 Emi Tanaka, Dewi Amaliah
Re-running an experiment is generally costly and in some cases impossible due to limited resources, so the design of an experiment plays a critical role in increasing the quality of experimental data. In this paper we describe the current state of the R-packages for the design of experiments through an exploratory data analysis of package downloads, package metadata, and the comparison of characteristics
-
Fast Computation of Highly G-optimal Exact Designs via Particle Swarm Optimization arXiv.stat.OT Pub Date : 2022-06-13 Stephen J. Walsh, John J. Borkowski
Computing proposed exact $G$-optimal designs for response surface models is a difficult computation that has received incremental improvements via algorithm development in the last two decades. These optimal designs have not been considered widely in applications in part due to the difficulty and cost involved with computing them. Three primary algorithms for constructing exact $G$-optimal designs
-
Making Sense of Dependence: Efficient Black-box Explanations Using Dependence Measure arXiv.stat.OT Pub Date : 2022-06-13 Paul Novello, Thomas Fel, David Vigouroux
This paper presents a new efficient black-box attribution method based on Hilbert-Schmidt Independence Criterion (HSIC), a dependence measure based on Reproducing Kernel Hilbert Spaces (RKHS). HSIC measures the dependence between regions of an input image and the output of a model based on kernel embeddings of distributions. It thus provides explanations enriched by RKHS representation capabilities
-
Interactive Exploration of Large Dendrograms with Prototypes arXiv.stat.OT Pub Date : 2022-06-03 Andee Kaplan, Jacob Bien
Hierarchical clustering is one of the standard methods taught for identifying and exploring the underlying structures that may be present within a data set. Students are shown examples in which the dendrogram, a visual representation of the hierarchical clustering, reveals a clear clustering structure. However, in practice, data analysts today frequently encounter data sets whose large scale undermines
-
A validation of the short-form classroom community scale for undergraduate mathematics and statistics students arXiv.stat.OT Pub Date : 2022-06-01 Maria Tackett, Shira Viel, Kim Manturuk
This study examines Cho and Demmans Epp's short-form adaptation of Rovai's well-known Classroom Community Scale (CCS-SF) as a measure of classroom community among introductory undergraduate math and statistics students. A series of statistical analyses were conducted to investigate the validity of the CCS-SF for this new population. Data were collected from 351 students enrolled in 21 online classes
-
A discrete analogue of Terrell's characterization of rectangular distributions arXiv.stat.OT Pub Date : 2022-05-28 Nickos Papadatos
George R. Terrell (1983, Ann. Probab., vol. 11(3), pp. 823-826) showed that the Pearson coefficient of correlation of an ordered pair from a random sample of size two is at most one-half, and the equality is attained only for rectangular (uniform over some interval) distributions. In the present note it is proved that the same is true for the discrete case, in the sense that the correlation coefficient
-
The paradoxical nature of easily improvable evidence arXiv.stat.OT Pub Date : 2022-05-27 Maria Chikina, Wesley Pegden
Established frameworks to understand problems with reproducibility in science begin with the relationship between our understanding of the prior probability of a claim and the statistical certainty that should be demanded of it, and explore the ways in which independent investigations, biases in study design and publication bias interact with these considerations. We propose a complementary perspective;
-
Visualising Multilevel Regression and Poststratification: Alternatives to the Current Practice arXiv.stat.OT Pub Date : 2022-05-25 Dewi Amaliah
Surveys provide important evidence for policymaking, decision-making, and understanding of society. However, conducting the large surveys required to provide subpopulation level estimates is expensive and time-consuming. Multilevel Regression and Poststratification (MRP) is a promising method to provide reliable estimates for subpopulations from surveys without the amount of data needed for reliable
-
Three principles for modernizing an undergraduate regression analysis course arXiv.stat.OT Pub Date : 2022-05-23 Maria Tackett
As data has become more prevalent in academia, industry, and daily life, it is imperative that undergraduate students are equipped with the skills needed to analyze data in the modern environment. In recent years there has been a lot of work innovating introductory statistics courses and developing introductory data science courses; however, there has been less work beyond the first course. This
-
BayesMix: Bayesian Mixture Models in C++ arXiv.stat.OT Pub Date : 2022-05-17 Mario Beraha, Bruno Guindani, Matteo Gianella, Alessandra Guglielmi
We describe BayesMix, a C++ library for MCMC posterior simulation for general Bayesian mixture models. The goal of BayesMix is to provide a self-contained ecosystem to perform inference for mixture models to computer scientists, statisticians and practitioners. The key idea of this library is extensibility, as we wish the users to easily adapt our software to their specific Bayesian mixture models
-
A Journey from Wild to Textbook Data to Reproducibly Refresh the Wages Data from the National Longitudinal Survey of Youth Database arXiv.stat.OT Pub Date : 2022-05-13 Dewi Amaliah, Dianne Cook, Emi Tanaka, Kate Hyde, Nicholas Tierney
Textbook data is essential for teaching statistics and data science methods because they are clean, allowing the instructor to focus on methodology. Ideally textbook data sets are refreshed regularly, especially when they are subsets taken from an on-going data collection. It is also important to use contemporary data for teaching, to imbue the sense that the methodology is relevant today. This paper
-
On the use of a local R-hat to improve MCMC convergence diagnostic arXiv.stat.OT Pub Date : 2022-05-13 Théo Moins, Julyan Arbel, Anne Dutfoy, Stéphane Girard
Diagnosing convergence of Markov chain Monte Carlo is crucial and remains an essentially unsolved problem. Among the most popular methods, the potential scale reduction factor, commonly named $\hat{R}$, is an indicator that monitors the convergence of output chains to a target distribution, based on a comparison of the between- and within-variances. Several improvements have been suggested since its
-
Examining the role of context in statistical literacy outcomes using an isomorphic assessment instrument arXiv.stat.OT Pub Date : 2022-05-11 Sayali Phadke, Matthew D Beckman, Kari Lock Morgan
The central role of statistical literacy has been discussed extensively, emphasizing its importance as a learning outcome and in promoting a citizenry capable of interacting with the world in an informed and critical manner. Our work contributes to the growing literature on assessing and improving people's statistical literacy vis-a-vis contexts important in their professional and personal lives. We
-
Pearson's r Adjusted arXiv.stat.OT Pub Date : 2022-05-09 Xinbo Ai
Correlation is the workhorse in data analysis among virtually all disciplines of science and technology, and Pearson's r has been the de facto standard for correlation analysis for over a century. It is taken for granted that Pearson's r can only capture linear dependence, as stated in statistics textbooks. We find that Pearson's r has potential to measure arbitrary monotonic relationships; it just undervalues
-
Far from Asymptopia arXiv.stat.OT Pub Date : 2022-05-06 Michael C. Abbott, Benjamin B. Machta
Inference from limited data requires a notion of measure on parameter space, which is most explicit in the Bayesian framework as a prior distribution. Jeffreys prior is the best-known uninformative choice, the invariant volume element from information geometry, but we demonstrate here that this leads to enormous bias in typical high-dimensional models. This is because models found in science typically
-
Foundations for NLP-assisted formative assessment feedback for short-answer tasks in large-enrollment classes arXiv.stat.OT Pub Date : 2022-05-05 Susan Lloyd, Matthew Beckman, Dennis Pearl, Rebecca Passonneau, Zhaohui Li, Zekun Wang
Research suggests "write-to-learn" tasks improve learning outcomes, yet constructed-response methods of formative assessment become unwieldy with large class sizes. This study evaluates natural language processing algorithms to assist this aim. Six short-answer tasks completed by 1,935 students were scored by several human raters, using a detailed rubric, and an algorithm. Results indicate substantial
-
Complementary Goodness of Fit Procedure for Crash Frequency Models arXiv.stat.OT Pub Date : 2022-05-03 Mohammadreza Hashemi, Adrian Ricardo Archilla
This paper presents a new procedure for evaluating the goodness of fit of Generalized Linear Models (GLM) estimated with Roadway Departure (RwD) crash frequency data for the State of Hawaii on two-lane two-way (TLTW) state roads. The procedure is analyzed using ten years of RwD crash data (including all severity levels) and roadway characteristics (e.g., traffic, geometry, and inventory databases)
-
Controlling for Latent Confounding with Triple Proxies arXiv.stat.OT Pub Date : 2022-04-28 Ben Deaner
We apply results in Hu and Schennach (2008) to achieve nonparametric identification of causal effects using noisy proxies for unobserved confounders. We call this the 'triple proxy' approach because it requires three proxies that are jointly independent conditional on unobservables. We consider three different choices for the third proxy: it may be an outcome, a vector of treatments, or a collection
-
Bayesian estimation of in-game home team win probability for college basketball arXiv.stat.OT Pub Date : 2022-04-25 Jason Maddox, Ryan Sides, Jane Harvill
Two new Bayesian methods for estimating and predicting in-game home team win probabilities are proposed. The first method has a prior that adjusts as a function of lead differential and time elapsed. The second is an adjusted version of the first, where the adjustment is a linear combination of the Bayesian estimator with a time-weighted pre-game win probability. The proposed methods are compared to
-
Research on spatial information transmission efficiency and capability of safe evacuation signs arXiv.stat.OT Pub Date : 2022-04-22 Ruiwen Fan, Zhangyin Dai, Shixiang Tian, Ting Xia, Hui Zhou, Congbao Huang
As an indispensable spatial direction information indicator for emergency evacuation, the spatial relationship between safety evacuation signs and evacuees will affect the response time of evacuees and the evacuation efficiency. This paper takes 2 kinds of common safety evacuation signs, hangtag-type and embedded, as the research object and designs space direction information transmission efficiency
-
Physical, subjective and analogical probability arXiv.stat.OT Pub Date : 2022-04-20 Russell J. Bowater
The aim of this paper is to show that the concept of probability is best understood by dividing this concept into two different types of probability, namely physical probability and analogical probability. Loosely speaking, a physical probability is a probability that applies to the outcomes of an experiment that have been judged as being equally likely on the basis of physical symmetry. Physical probabilities
-
Wrapped Distributions on homogeneous Riemannian manifolds arXiv.stat.OT Pub Date : 2022-04-20 Fernando Galaz-Garcia, Marios Papamichalis, Kathryn Turnbull, Simon Lunagomez, Edoardo Airoldi
We provide a general framework for constructing probability distributions on Riemannian manifolds, taking advantage of area-preserving maps and isometries. Control over distributions' properties, such as parameters, symmetry and modality, yields a family of flexible distributions that are straightforward to sample from, suitable for use within Monte Carlo algorithms and latent variable models, such as
-
A Simulation-Optimization Framework To Improve The Organ Transplantation Offering System arXiv.stat.OT Pub Date : 2022-04-22 Ignacio Erazo, David Goldsman, Pinar Keskinocak
We propose a simulation-optimization-based methodology to improve the way that organ transplant offers are made to potential recipients. Our policy can be applied to all types of organs, is implemented starting at the local level, is flexible with respect to simultaneous offers of an organ to multiple patients, and takes into account the quality of the organs under consideration. We describe in detail
-
Minimizing Fleet Size and Improving Bike Allocation of Bike Sharing under Future Uncertainty arXiv.stat.OT Pub Date : 2022-04-19 Mingzhuang Hua, Xuewu Chen, Jingxu Chen, Yu Jiang
As a rapidly expanding service, bike sharing is facing severe problems of bike over-supply and demand fluctuation in many Chinese cities. This study develops a large-scale method to determine the minimum fleet size under uncertainty, based on the bike sharing data of millions of trips in Nanjing. It is found that the algorithm of minimizing fleet size under the incomplete-information scenario is effective
-
Comment on "The statistics wars and intellectual conflicts of interest" by D. Mayo arXiv.stat.OT Pub Date : 2022-04-17 Philip B. Stark
While P-values are widely abused, they are a useful tool for many purposes; banning them is analogous to banning scalpels because most people do not know how to perform surgery. Many reported P-values are not genuine P-values, for a variety of reasons. Perhaps the most widespread and pernicious problem is the Type III error of testing a statistical hypothesis that has little or no connection to the
-
High-Frequency-Based Volatility Model with Network Structure arXiv.stat.OT Pub Date : 2022-04-14 Huiling Yuan, Guodong Li, Junhui Wang
This paper introduces one new multivariate volatility model that can accommodate an appropriately defined network structure based on low-frequency and high-frequency data. The model reduces the number of unknown parameters and the computational complexity substantially. The model parameterization and iterative multistep-ahead forecasts are discussed and the targeting reparameterization is also presented
-
Computational Statistics and Data Science in the Twenty-first Century arXiv.stat.OT Pub Date : 2022-04-12 Andrew J. Holbrook, Akihiko Nishimura, Xiang Ji, Marc A. Suchard
Data science has arrived, and computational statistics is its engine. As the scale and complexity of scientific and industrial data grow, the discipline of computational statistics assumes an increasingly central role among the statistical sciences. An explosion in the range of real-world applications means the development of more and more specialized computational methods, but five Core Challenges
-
Six Statistical Senses arXiv.stat.OT Pub Date : 2022-04-11 Radu V. Craiu, Ruobin Gong, Xiao-Li Meng
This article proposes a set of categories, each one representing a particular distillation of important statistical ideas. Each category is labeled a "sense" because we think of them as essential in helping every statistical mind connect in constructive and insightful ways with statistical theory, methodologies and computation. The illustration of each sense with statistical principles and methods
-
The loss value of multilinear regression arXiv.stat.OT Pub Date : 2022-04-06 Helmut Kahl
A formula for the Euclidean distance between a point and a linear subspace is presented. As a consequence, a formula for determinants of positive semidefinite Hermitian matrices is derived, along with a formula for the loss value of multilinear regression.
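The link between subspace distance and regression loss can be illustrated with a minimal sketch. This is not the paper's formula: it assumes real-valued vectors and uses plain Gram-Schmidt orthogonalization, and the helper names `dot` and `distance_to_subspace` are hypothetical. The least-squares loss is then the squared distance from the response vector to the span of the design-matrix columns.

```python
import math

def dot(u, v):
    """Standard real inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def distance_to_subspace(point, spanning_vectors, tol=1e-12):
    """Euclidean distance from `point` to span(spanning_vectors).

    Builds an orthonormal basis with modified Gram-Schmidt, then removes
    the component of `point` along each basis vector; the norm of what
    remains is the distance. (Hypothetical helper, real vectors only.)
    """
    basis = []
    for v in spanning_vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors that are (numerically) dependent
            basis.append([wi / norm for wi in w])
    residual = list(point)
    for q in basis:
        c = dot(residual, q)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return math.sqrt(dot(residual, residual))

# Illustrative data (made up): design-matrix columns (intercept, x) and response y.
ones = [1.0, 1.0, 1.0]
x = [0.0, 1.0, 2.0]
y = [1.0, 2.0, 4.0]

# Sum of squared residuals of the least-squares fit of y on (1, x).
loss = distance_to_subspace(y, [ones, x]) ** 2
print(loss)
```

Fitting the same line by the usual normal equations gives residuals 1/6, -1/3, 1/6, so the sum of squared residuals is 1/6, which matches the squared distance computed here up to floating-point error.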
-
Teaching for large-scale Reproducibility Verification arXiv.stat.OT Pub Date : 2022-03-31 Lars Vilhuber, Hyuk Harry Son, Meredith Welch, David N. Wasser, Michael Darisse
We describe a unique environment in which undergraduate students from various STEM and social science disciplines are trained in data provenance and reproducible methods, and then apply that knowledge to real, conditionally accepted manuscripts and associated replication packages. We describe in detail the recruitment, training, and regular activities. While the activity is not part of a regular curriculum
-
Length L-function for Network-Constrained Point Data arXiv.stat.OT Pub Date : 2022-03-30 Zidong Fang, Ci Song, Hua Shu, Jie Chen, Tianyu Liu, Xi Wang, Xiao Chen, Tao Pei
Network constrained points are referred to as points restricted to road networks, such as taxi pick up and drop off locations. A significant pattern of network constrained points is referred to as an aggregation; e.g., the aggregation of pick up points may indicate a high taxi demand in a particular area. Although the network K function using the shortest path network distance has been proposed to
-
Non-iterative Gaussianization arXiv.stat.OT Pub Date : 2022-03-28 Rongxiang Rui, Maozai Tian
In this work, we propose a non-iterative Gaussian transformation strategy based on copula function, which doesn't require some commonly seen restrictive assumptions in the previous studies such as the elliptically symmetric distribution assumption and the linear independent component analysis assumption. Theoretical properties guarantee the proposed strategy can exactly transfer any random variable
-
Predicting Cricket Outcomes using Bayesian Priors arXiv.stat.OT Pub Date : 2022-03-21 Mohammed Quazi, Joshua Clifford, Pavan Datta
This research has developed a statistical modeling procedure to predict outcomes of future cricket tournaments. The proposed model provides insight into the application of stratified survey sampling to the team selection pattern by incorporating individual players' performance history coupled with Bayesian priors not only against a particular opposition but also against any cricket playing nation -
-
Assessing and mitigating systematic errors in forest attribute maps utilizing harvester and airborne laser scanning data arXiv.stat.OT Pub Date : 2022-03-09 Janne Räty, Marius Hauglin, Rasmus Astrup, Johannes Breidenbach
Cut-to-length harvesters collect useful information for modeling relationships between forest attributes and airborne laser scanning (ALS) data. However, harvesters operate in mature forests, which may introduce selection biases and systematic errors in harvester data-based forest attribute maps. We fitted regression models (harvester models) for volume (V), height (HL), stem frequency (N), above-ground
-
Effect of congestion avoidance due to congestion information provision on optimizing agent dynamics on an endogenous star network topology arXiv.stat.OT Pub Date : 2022-03-02 Satori Tsuzuki, Daichi Yanagisawa, Katsuhiro Nishinari
The importance of fundamental research on network topologies is widely acknowledged. This study aims to elucidate the effect of congestion avoidance of agents given congestion information on optimizing traffic in a network topology. We investigated stochastic traffic networks in a star topology with a central node connected to isolated secondary nodes with different preferences. Each agent at the central
-
Revisiting the secondary climate attributes for transportation infrastructure management: A Redux and Update for 2020 arXiv.stat.OT Pub Date : 2022-02-25 Tao Liao, Paul Kepley, Indraneel Kumar, Samuel Labi
Environmental conditions in various regions can have a severely negative impact on the longevity and durability of the civil engineering infrastructures. In 2018, a published paper used 1971 to 2010 NOAA data from the contiguous United States to examine the temporal changes in secondary climate attributes (freeze-thaw cycles and freeze index) using the climate normals from two time windows, 1971-2000
-
An interaction-based contagion model over temporal networks demonstrates that reducing temporal network density reduces total infection rate arXiv.stat.OT Pub Date : 2022-02-23 Alex Abbey, Yanir Marmor, Yuval Shahar, Osnat Mokryn
Contacts' temporal ordering and dynamics, such as their order and timing, are crucial for understanding the transmission of infectious diseases. Using path-preserving temporal networks, we evaluate the effect of spatial pods (social distancing pods) and temporal pods (meetings' rate reduction) on the spread of the disease. We use our interaction-driven contagion model, instantiated for COVID-19, over
-
Tools and Recommendations for Reproducible Teaching arXiv.stat.OT Pub Date : 2022-02-19 Mine Dogucu, Mine Cetinkaya-Rundel
It is recommended that teacher-scholars of data science adopt reproducible workflows in their research as scholars and teach reproducible workflows to their students. In this paper, we propose a third dimension to reproducibility practices and recommend that regardless of whether they teach reproducibility in their courses or not, data science instructors adopt reproducible workflows for their own
-
Edge coherence in multiplex networks arXiv.stat.OT Pub Date : 2022-02-18 Swati Chandna, Svante Janson, Sofia C. Olhede
This paper introduces a nonparametric framework for the setting where multiple networks are observed on the same set of nodes, also known as multiplex networks. Our objective is to provide a simple parameterization which explicitly captures linear dependence between the different layers of networks. For non-Euclidean observations, such as shapes and graphs, the notion of "linear" must be defined appropriately
-
Spam four ways: Making sense of text data arXiv.stat.OT Pub Date : 2022-02-11 Nicholas J. Horton, Jie Chao, William Finzer, Phebe Palmer
The world is full of text data, yet text analytics has not traditionally played a large part in statistics education. We consider four different ways to provide students with opportunities to explore whether email messages are unwanted correspondence (spam). Text from subject lines is used to identify features that can be used in classification. The approaches include use of a Model Eliciting Activity