Comparison of supervised learning statistical methods for classifying commercial beers and identifying patterns, Journal of Chemometrics


Journal of Chemometrics (IF 1.9), Pub Date: 2020-04-01, DOI: 10.1002/cem.3216
Dániel Koren ₁ , Laura Lőrincz ₂ , Sándor Kovács ₃ , Gabriella Kun‐Farkas ₁ , Beáta Vecseriné Hegyes ₁ , László Sipos ₄

In this study, 13 properties (alcohol, real extract, flavonoid, anthocyanin, glucose, fructose, maltose, and sucrose content; EBC [European Brewery Convention] and L*a*b* color; and bitterness) of 21 beers (alcohol-free pale lagers, alcohol-free beer-based mixed drinks, beer-based mixed drinks, international lagers, wheat beers, stouts, and fruit beers) were determined. In the first step, multiple factor analysis (MFA) was performed on the whole data set and five clusters (target classes) were determined; bootstrapping was then applied to create a balanced data set in which every cluster contains 100 samples, for a total sample size of 500. In the second step, 12 supervised learning algorithms (random trees [RND], Quinlan's C4.5 decision tree algorithm [C4.5], Iterative Dichotomiser 3 algorithm [ID3], cost-sensitive decision tree algorithm [CSMC4], cost-sensitive classification tree [CSCRT], k-nearest neighbors algorithm [KNN], radial basis function [RBF], multilayer perceptron neural network [MLP], prototype nearest neighbor [PNN], linear discriminant analysis [LDA], naïve Bayes with continuous variables [NBC], and partial least squares discriminant analysis [PLS-DA]) were applied to classify each brand into the target classes. Furthermore, several error rates were calculated: re-substitution error rate (RER), cross-validated error rate (CV), bootstrap error (BOOT), leave-one-out error (LOO), and train-test error rate (TRAIN). The MFA could discriminate five groups, which can be characterized by some analytical parameters, and the other multivariate methods performed similarly. The methods are best distinguished by the BOOT, CV, and LOO error rates. The best estimation methods were C4.5, CSMC4, and CSCRT; these performed best with respect to flavonoid content and EBC color. NBC was identified as the method most sensitive to the properties. Classification ability fluctuated greatly for three properties (glucose, maltose, sucrose). With the NBC method, a remarkable fluctuation was observed for the L*a*b* color parameters, flavonoid content, EBC color, and bitterness.
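The balancing step and two of the error rates named in the abstract (RER and LOO) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy two-feature data, the cluster sizes, the function names, and the use of a 1-nearest-neighbor classifier are all assumptions made for illustration, and only the figures of five classes and 100 samples per class come from the abstract.

```python
import random
from collections import defaultdict

def balanced_bootstrap(samples, labels, per_class=100, seed=0):
    """Resample with replacement so every class ends up with `per_class` rows."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    out_x, out_y = [], []
    for y in sorted(by_class):
        out_x.extend(rng.choices(by_class[y], k=per_class))
        out_y.extend([y] * per_class)
    return out_x, out_y

def nearest_label(x, train_x, train_y):
    """1-nearest-neighbor prediction using squared Euclidean distance."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_x[i])),
    )
    return train_y[best]

def resubstitution_error(xs, ys):
    """Error rate when the classifier is tested on its own training data."""
    wrong = sum(nearest_label(x, xs, ys) != y for x, y in zip(xs, ys))
    return wrong / len(xs)

def leave_one_out_error(xs, ys):
    """Error rate when each row is predicted from all the other rows."""
    wrong = 0
    for i, (x, y) in enumerate(zip(xs, ys)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        wrong += nearest_label(x, rest_x, rest_y) != y
    return wrong / len(xs)

# Hypothetical toy data: 21 rows in 5 well-separated clusters of unequal size.
cluster_sizes = [3, 4, 5, 4, 5]
toy_x, toy_y = [], []
for c, size in enumerate(cluster_sizes):
    for j in range(size):
        toy_x.append((10.0 * c + j, 10.0 * c - j))
        toy_y.append(c)

bal_x, bal_y = balanced_bootstrap(toy_x, toy_y, per_class=100)
rer = resubstitution_error(bal_x, bal_y)
loo = leave_one_out_error(bal_x, bal_y)
print(len(bal_x), rer, loo)
```

In this sketch the resampled rows are exact copies of the original rows, so a held-out row often has an identical copy left in the training set. The leave-one-out estimate on the resampled data may therefore be optimistic for a nearest-neighbor classifier.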


Updated: 2020-04-01
