A Journey from Wild to Textbook Data to Reproducibly Refresh the Wages Data from the National Longitudinal Survey of Youth Database

arXiv - STAT - Other Statistics. Pub Date: 2022-05-13, DOI: arxiv-2205.06417
Dewi Amaliah, Dianne Cook, Emi Tanaka, Kate Hyde, Nicholas Tierney

Textbook data are essential for teaching statistics and data science methods because they are clean, allowing the instructor to focus on methodology. Ideally, textbook data sets are refreshed regularly, especially when they are subsets taken from an ongoing data collection. It is also important to use contemporary data for teaching, to convey that the methodology is relevant today. This paper describes the trials and tribulations of refreshing a textbook data set on wages, extracted from the National Longitudinal Survey of Youth (NLSY79) in the early 1990s. The data are useful for teaching modeling and exploratory analysis of longitudinal data. Subsets of NLSY79, including the wages data, can be found in supplementary files from numerous textbooks and research articles. The NLSY79 database has been continuously updated through to 2018, so new records are available. Here we describe our journey to refresh the wages data, and document the process so that the data can be regularly updated in the future. Our journey was difficult because the steps and decisions taken to get from the raw data to the wages textbook subset have not been clearly articulated. We have taken care to provide a reproducible workflow for others to follow, which we hope will inspire more attempts at refreshing data for teaching. Three new data sets and the code to produce them are provided in the open source R package `yowie`.

Updated: 2022-05-16
