Current journal: Journal of Chemometrics
  • Fast method for GA‐PLS with simultaneous feature selection and identification of optimal preprocessing technique for datasets with many observations
    J. Chemometr. (IF 1.847) Pub Date : 2020-01-24
    Petter Stefansson; Kristian H. Liland; Thomas Thiis; Ingunn Burud

    A fast and memory‐efficient new method for performing genetic algorithm partial least squares (GA‐PLS) on spectroscopic data preprocessed in multiple different ways is presented. The method, which is primarily intended for datasets containing many observations, involves preprocessing a spectral dataset with several different techniques and concatenating the different versions of the data horizontally into a design matrix X which is both tall and wide. The large matrix is then condensed into a substantially smaller covariance matrix XᵀX whose resulting size is unrelated to the number of observations in the dataset, i.e. the height of X. It is demonstrated that the smaller covariance matrix can be used to efficiently calibrate partial least squares (PLS) models containing feature selections from any of the involved preprocessing techniques. The method is incorporated into GA‐PLS and used to evolve variable selections for a set of different preprocessing techniques concurrently within a single algorithm. This allows a single instance of GA‐PLS to determine which preprocessing technique, within the set of considered methods, is best suited for the spectroscopic dataset. Additionally, the method allows feature selections to be evolved containing variables from a mixture of different preprocessing techniques. The benefits of the introduced GA‐PLS technique can be summarized as threefold: (1) for datasets with many observations, the proposed method is substantially faster than conventional GA‐PLS implementations based on NIPALS, SIMPLS, etc.; (2) using a single GA‐PLS automatically reveals which of the considered preprocessing techniques results in the lowest model error; (3) it allows the exploration of highly complex solutions composed of features preprocessed using various techniques.
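
    The idea of calibrating PLS models for arbitrary feature subsets from a precomputed XᵀX can be sketched as follows. This is only an illustration, not the authors' implementation: it assumes a single response, uses a kernel-style PLS1 recursion that works on XᵀX and Xᵀy (in the spirit of the Dayal and MacGregor kernel algorithms), and all function and variable names are hypothetical.

```python
import numpy as np

def pls1_from_cov(XtX, Xty, n_components):
    """PLS1 regression coefficients computed only from X'X and X'y.

    Illustrative kernel-style recursion; the tall matrix X is never touched.
    """
    p = XtX.shape[0]
    R = np.zeros((p, n_components))   # weights expressed for the undeflated X
    P = np.zeros((p, n_components))   # X loadings
    q = np.zeros(n_components)        # y loadings
    s = Xty.astype(float).copy()      # deflated cross-product X'y
    for a in range(n_components):
        w = s / np.linalg.norm(s)
        r = w.copy()
        for j in range(a):
            r -= (P[:, j] @ w) * R[:, j]
        tt = r @ XtX @ r              # squared norm of the score vector t = X r
        P[:, a] = (XtX @ r) / tt
        q[a] = (r @ s) / tt
        s = s - P[:, a] * q[a] * tt   # deflate X'y instead of X
        R[:, a] = r
    return R @ q

# Synthetic stand-in for a tall dataset (many observations, few variables).
rng = np.random.default_rng(0)
n, p = 5000, 12
X = rng.normal(size=(n, p))
y = X[:, [1, 4, 7]] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
X = X - X.mean(axis=0)
y = y - y.mean()

# Computed once; their sizes do not depend on the number of observations n.
XtX = X.T @ X
Xty = X.T @ y

# A candidate feature selection only requires indexing the small matrices.
subset = np.array([1, 4, 7])
beta_cov = pls1_from_cov(XtX[np.ix_(subset, subset)], Xty[subset], n_components=3)

# Sanity check: with as many components as selected features, PLS generically
# coincides with ordinary least squares on the same columns.
beta_ols = np.linalg.lstsq(X[:, subset], y, rcond=None)[0]
```

    In a genetic algorithm, each candidate chromosome would correspond to a different `subset`, so the cost of evaluating a candidate depends on the number of selected variables rather than on the number of observations.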

    Updated: 2020-01-24
  • Issue Information
    J. Chemometr. (IF 1.847) Pub Date : 2020-01-24

    No abstract is available for this article.

    Updated: 2020-01-24
  • The O‐PLS methodology for orthogonal signal correction—is it correcting or confusing?
    J. Chemometr. (IF 1.847) Pub Date : 2017-04-11
    Ulf G. Indahl

    The separation of predictive and nonpredictive (or orthogonal) information in linear regression problems is considered to be an important issue in chemometrics. Approaches including net analyte preprocessing methods and various orthogonal signal correction (OSC) methods have been studied in a considerable number of publications. In the present paper, we focus on the simplest single response versions of some of the early OSC approaches including Fearn's OSC, the orthogonal projections to latent structures, the target projection (TP), and the projections to latent structures (PLS) postprocessing by similarity transformation. These methods are claimed to yield improved model building and interpretation alternatives compared with ordinary PLS, by filtering “off” the response‐orthogonal parts of the samples in a dataset. We point out some fundamental misconceptions that were made in the justification of the PLS‐related OSC algorithms and explain the key properties of the resulting modelling.

    Updated: 2020-01-24
  • Multivariate patent analysis—Using chemometrics to analyze collections of chemical and pharmaceutical patents
    J. Chemometr. (IF 1.847) Pub Date : 2018-05-10
    Rickard Sjögren; Kjell Stridh; Tomas Skotare; Johan Trygg

    Patents are an important source of technological knowledge, but the number of existing patents is vast and quickly growing. This makes development of tools and methodologies for quickly revealing patterns in patent collections important. In this paper, we describe how structured chemometric principles of multivariate data analysis can be applied in the context of text analysis in a novel combination with common machine learning preprocessing methodologies. We demonstrate our methodology in 2 case studies. Using principal component analysis (PCA) on a collection of 12338 patent abstracts from 25 companies in big pharma revealed sub‐fields which the companies are active in. Using PCA on a smaller collection of patents retrieved by searching for a specific term proved useful to quickly understand how patent classifications relate to the search term. By using orthogonal projections to latent structures (O‐PLS) on patent classification schemes, we were able to separate patents on a more detailed level than using PCA. Lastly, we performed multi‐block modeling using OnPLS on bag‐of‐words representations of abstracts, claims, and detailed descriptions, respectively, showing that semantic variation relating to patent classification is consistent across multiple text blocks, represented as globally joint variation. We conclude that using machine learning to transform unstructured data into structured data provides a good preprocessing tool for subsequent chemometric multivariate data analysis and provides an easily interpretable and novel workflow to understand large collections of patents. We demonstrate this on collections of chemical and pharmaceutical patents.

    Updated: 2020-01-24
  • Visualization of descriptive multiblock analysis
    J. Chemometr. (IF 1.847) Pub Date : 2018-07-31
    Tomas Skotare; Rickard Sjögren; Izabella Surowiec; David Nilsson; Johan Trygg

    Understanding and making the most of complex data collected from multiple sources is a challenging task. Data integration is the procedure of describing the main features in multiple data blocks, and several methods for multiblock analysis have been previously developed, including OnPLS and JIVE. One of the main challenges is how to visualize and interpret the results of multiblock analyses because of the increased model complexity and sheer size of data. In this paper, we present novel visualization tools that simplify interpretation and overview of multiblock analysis. We introduce a correlation matrix plot that provides an overview of the relationships between blocks found by multiblock models. We also present a multiblock scatter plot, a metadata correlation plot, and a variation distribution plot, that simplify the interpretation of multiblock models. We demonstrate our visualizations on an industrial case study in vibration spectroscopy (NIR, UV, and Raman datasets) as well as a multiomics integration study (transcript, metabolite, and protein datasets). We conclude that our visualizations provide useful tools to harness the complexity of multiblock analysis and enable better understanding of the investigated system.

    Updated: 2020-01-24
  • Exploring the latent variable space of PLS2 by post‐transformation of the score matrix (ptLV)
    J. Chemometr. (IF 1.847) Pub Date : 2018-09-19
    Matteo Stocchero

    Projection to Latent Structures (PLS) regression is largely applied in chemometrics. The most used algorithm for performing PLS is probably PLS2. PLS2 solves the problem of redundancy and collinearity in complex data sets and produces a small set of latent variables that can be used to investigate complex phenomena. However, the presence of specific cluster structures or trends in the data can drive PLS2 towards wrong directions, and a redundant number of latent variables is generated. To overcome this unexpected behaviour, OSC‐based methods were developed. The main idea was to use the concept of orthogonality to identify two different types of sources of structured variation, which are modeled into two different subspaces: the non‐predictive subspace described by latent variables orthogonal to the Y‐response and the predictive subspace related to the Y‐response. OSC‐based methods work on the variable space, producing suitable weight vectors to project the data. In this study, a new post‐transformation method, called post‐transformation of the Latent Variable space (ptLV), is introduced. The method generates a latent space isomorphic to that discovered by PLS2 where the non‐predictive data variation is separated from the predictive one. It works on the score space and can also be applied to kernel‐PLS2 (KPLS2). The relationships with post‐transformation of PLS2 (ptPLS2) are investigated, and one real and two simulated data sets are used to illustrate how ptLV works in practice.

    Updated: 2020-01-24
  • Cellwise outlier detection and biomarker identification in metabolomics based on pairwise log ratios
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-02
    Jan Walach; Peter Filzmoser; Štěpán Kouřil; David Friedecký; Tomáš Adam

    Data outliers can carry very valuable information and might be most informative for the interpretation. Nevertheless, they are often neglected. An algorithm called cellwise outlier diagnostics using robust pairwise log ratios (cell‐rPLR) for the identification of outliers in single cells of a data matrix is proposed. The algorithm is designed for metabolomic data, where due to the size effect, the measured values are not directly comparable. Pairwise log ratios between the variable values form the elemental information for the algorithm, and the aggregation of appropriate outlyingness values results in outlyingness information. A further feature of cell‐rPLR is that it is useful for biomarker identification, particularly in the presence of cellwise outliers. Real data examples and simulation studies underline the good performance of this algorithm in comparison with alternative methods.
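
    Why pairwise log ratios sidestep the size effect can be shown in a few lines. The sketch below is not the cell‐rPLR algorithm itself (its robust outlyingness and aggregation steps are not reproduced here); it only checks that log ratios between variables are unchanged when each sample is multiplied by its own, hypothetical, size factor.

```python
import numpy as np

def pairwise_log_ratios(X):
    """Return an (n, p, p) array with entry [i, j, k] = log(X[i, j] / X[i, k])."""
    logX = np.log(X)
    return logX[:, :, None] - logX[:, None, :]

rng = np.random.default_rng(1)
X = rng.lognormal(mean=2.0, sigma=0.5, size=(6, 4))   # positive "intensities"
size = rng.uniform(0.5, 3.0, size=(6, 1))             # hypothetical per-sample size factor

plr_raw = pairwise_log_ratios(X)
plr_scaled = pairwise_log_ratios(X * size)            # same samples, different "size"
```

    Because the size factor cancels in every ratio, `plr_raw` and `plr_scaled` agree, which is what makes the log ratios comparable across samples.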

    Updated: 2020-01-24
  • 1H NMR spectroscopy coupled with multivariate analysis was applied to investigate Italian cherry tomatoes metabolic profile
    J. Chemometr. (IF 1.847) Pub Date : 2020-01-06
    Olimpia Masetti; Luigi Nisini; Alessandra Ciampa; Maria Teresa Dell'Abate

    Nuclear magnetic resonance (NMR) spectroscopy, in combination with different chemometric methods, has been widely used for metabolomic profiling in the geographical determination of food origin. In the present study, spectral data of cherry tomatoes, collected from Pachino (Sicily) and Sabaudia (Latium), were analyzed by principal component analysis (PCA), k nearest neighbors (kNN), and partial least‐squares discriminant analysis (PLS‐DA) in order to discriminate the samples according to their geographical provenance.

    Updated: 2020-01-24
  • Study of chemical compound spatial distribution in biodegradable active films using NIR hyperspectral imaging and multivariate curve resolution
    J. Chemometr. (IF 1.847) Pub Date : 2019-11-26
    Larissa R. Terra; Jussara V. Roque; Cícero C. Pola; Igor M. Gonçalves; Nilda de Fátima F. Soares; Reinaldo F. Teófilo

    A study of spatial distribution of the four different plasticizers and sorbic acid incorporated in cellulose acetate biodegradable films using near‐infrared hyperspectral imaging (NIR‐HSI) and multivariate curve resolution—alternating least squares (MCR‐ALS) is presented. A NIR‐HSI was acquired for each film. MCR‐ALS was applied to generate pure component distribution maps. A repeatability study was performed. The proposed method was able to recover the pure spectra of each film component accurately. The relative concentration vectors obtained by the MCR‐ALS were rebuilt in matrices, and it was possible to analyze the homogeneity of the film constituents based on macropixel analysis and homogeneity index. The NIR‐HSI imaging showed excellent repeatability. For the first time, a study detailing the distribution of chemical compounds incorporated into entire biodegradable films was possible by using NIR hyperspectral imaging combined with the MCR‐ALS method.

    Updated: 2020-01-24
  • Separating common (global and local) and distinct variation in multiple mixed types data sets
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-10
    Yipeng Song; Johan A. Westerhuis; Age K. Smilde

    Multiple sets of measurements on the same objects obtained from different platforms may reflect partially complementary information of the studied system. The integrative analysis of such data sets not only provides us with the opportunity of a deeper understanding of the studied system but also introduces some new statistical challenges. First, the separation of information that is common across all or some of the data sets and the information that is specific to each data set is problematic. Furthermore, these data sets are often a mix of quantitative and discrete (binary or categorical) data types, while commonly used data fusion methods require all data sets to be quantitative. In this paper, we propose an exponential family simultaneous component analysis (ESCA) model to tackle the potential mixed data types problem of multiple data sets. In addition, a structured sparse pattern of the loading matrix is induced through a nearly unbiased group concave penalty to disentangle the global, local common, and distinct information of the multiple data sets. A Majorization‐Minimization–based algorithm is derived to fit the proposed model. Analytic solutions are derived for updating all the parameters of the model in each iteration, and the algorithm will decrease the objective function in each iteration monotonically. For model selection, a missing value–based cross validation procedure is implemented. The advantages of the proposed method in comparison with other approaches are assessed using comprehensive simulations as well as the analysis of real data from a chronic lymphocytic leukaemia (CLL) study.

    Updated: 2020-01-24
  • Diagnostics via partial residual plots in inverse Gaussian regression
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-25
    Muhammad Imran; Atif Akbar

    Regression diagnostics are a basic requirement for applying regression analysis and reaching reliable conclusions. Generalized linear models also require diagnostics for their implementation. The construction of partial residuals using response residuals for the inverse Gaussian regression model is carried out to explore the structure and usefulness for visualizing diagnostics and curvature as a function of selected predictors. The current study established the performance of partial residual plots over conventional diagnostic methods. The comparison has been made using aerial biomass data and with the help of a simulation study. It has been observed that partial residual plots provide much better diagnosis than do conventional methods. Moreover, multiple diagnostics in a single display provide better insight into lack of fit, specification, and data anomalies.

    Updated: 2020-01-24
  • Robustness control in bilinear modeling based on maximum correntropy
    J. Chemometr. (IF 1.847) Pub Date : 2020-01-22
    Valeria Fonseca Diaz; Bart De Ketelaere; Ben Aernouts; Wouter Saeys

    We present the development of a bilinear regression model for multivariate calibration on the basis of maximum correntropy criteria (MCC) whose robustness can be easily controlled. MCC regression methods can be more effective when the assumption of normality does not hold or when data are contaminated with outliers. These methods are competitive when the degree of robustness against outliers should be controlled. By controlling the robustness, information from candidate outliers can be partially retained rather than completely included or discarded during calibration. Within the context of bilinear regression models, an MCC approach using statistically inspired modification of the partial least squares (SIMPLS) is proposed, which is named maximum correntropy‐weighted partial least squares (MCW‐PLS). Thanks to the controllable robustness of MCC models, observations are upweighted or downweighted during the calibration process, rendering robust models with soft discrimination of samples. Such a weighting represents an important advantage, especially for cases when samples are not drawn from a normal distribution. Applications to three real case studies are presented. These applications uncovered three main features of MCW‐PLS: robustness control between SIMPLS and robust SIMPLS (RSIMPLS), improvements in prediction performance of bilinear calibration models, and the possibility to detect the most informative samples in a calibration set.
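
    The soft up- and downweighting of observations described above can be illustrated with a Gaussian-kernel (correntropy-type) weight on the residuals. The sketch below is not MCW‐PLS: it swaps the SIMPLS-based bilinear model for ordinary weighted least squares, and the kernel width `sigma`, the iteration count, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def correntropy_weights(residuals, sigma):
    """Gaussian-kernel weights: close to 1 for small residuals, close to 0 for large ones."""
    return np.exp(-residuals**2 / (2.0 * sigma**2))

def mcc_linear_fit(X, y, sigma, n_iter=20):
    """Iteratively reweighted least squares with correntropy-type weights (illustrative)."""
    w = np.ones(len(y))
    for _ in range(n_iter):
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        w = correntropy_weights(y - X @ beta, sigma)
    return beta, w

# Straight line with two gross outliers.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.05 * rng.normal(size=50)
y[[5, 30]] += 3.0

beta_mcc, w = mcc_linear_fit(X, y, sigma=0.5)
```

    A larger `sigma` pushes all weights towards 1 and the fit towards ordinary least squares, while a smaller `sigma` discounts deviating samples more strongly; this is the sense in which the degree of robustness is a tunable quantity.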

    Updated: 2020-01-23
  • Comparison of sensory evaluation techniques for Hungarian wines
    J. Chemometr. (IF 1.847) Pub Date : 2020-01-19
    Zsuzsanna Guld; Diána Nyitrainé Sárdy; Attila Gere; Anita Rácz

    The aim of this study was to compare different Hungarian Kadarka, Kékfrankos, and Cabernet franc wines produced and aged by the same methods and to compare two types of sensory analysis methods as well: the 100‐point OIV system and quantitative descriptive analysis (QDA). Both tests were conducted by 12 assessors of the University of Pécs, Institute for Regional Development, Faculty of Horticulture and Oenology. This study provides conclusions about the use of sensory analysis methods, highlighting the advantages and disadvantages of QDA and the OIV system. Principal component analysis, analysis of variance (ANOVA), multiple factor analysis, and partial least squares discriminant analysis were used for the evaluation of the data. Our results showed that the sensory panel was able to discriminate the samples by both sensory methods; however, the information provided by them was significantly different. ANOVA clearly showed that the two methods have different sensitivity when comparing wines (commercial and produced wine samples), and QDA proved to be the more sensitive, as well as more detailed, method. Partial least squares discriminant analysis augmented the findings in the classification part of the different types of wine samples. In general, OIV is able to show the general quality of the wines, while QDA coupled with proper chemometric methods is able to describe why the given samples received good or bad OIV scores.

    Updated: 2020-01-21
  • Comparison of locally weighted PLS strategies for regression and discrimination on agronomic NIR data
    J. Chemometr. (IF 1.847) Pub Date : 2020-01-16
    Matthieu Lesnoff; Maxime Metz; Jean‐Michel Roger

    In multivariate calibrations, locally weighted partial least squares regression (LWPLSR) is an efficient prediction method when heterogeneity of data generates nonlinear relations (curvatures and clustering) between the response and the explicative variables. This is frequent in agronomic data sets that gather materials of different natures or origins. LWPLSR is a particular case of weighted PLSR (WPLSR; ie, a statistical weight different from the standard 1/n is given to each of the n calibration observations for calculating the PLS scores/loadings and the predictions). In LWPLSR, the weights depend on the dissimilarity (which has to be defined and calculated) to the new observation to predict. This article compares two strategies of LWPLSR: (a) “LW”: the usual strategy where, for each new observation to predict, a WPLSR is applied to the n calibration observations (ie, entire calibration set) vs (b) “KNN‐LW”: the k nearest neighbors of the observation to predict are first selected in the training set and WPLSR is applied only to this selected KNN set. On three illustrative agronomic data sets (quantitative and discrimination predictions), both strategies outperformed the standard PLSR. LW and KNN‐LW had close prediction performances, but KNN‐LW was much faster in computation time. The KNN‐LW strategy is therefore recommended for large data sets. The article also presents a new algorithm for WPLSR, on the basis of the “improved kernel #1” algorithm, which is competitive with, and in general faster than, the already published weighted PLS nonlinear iterative partial least squares (NIPALS).
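
    The KNN‐LW strategy can be sketched as "select neighbours, weight them by dissimilarity, fit a weighted model". The example below is a simplified stand-in rather than the article's method: it uses Euclidean distance as the dissimilarity, an exponential weight function with a hypothetical sharpness parameter `h`, and weighted linear regression in place of WPLSR.

```python
import numpy as np

def knn_lw_predict(X_cal, y_cal, x_new, k, h=2.0):
    """Predict one observation from a weighted local model on its k nearest neighbours."""
    d = np.linalg.norm(X_cal - x_new, axis=1)      # dissimilarity to the new observation
    idx = np.argsort(d)[:k]                        # KNN selection in the calibration set
    d_k = d[idx]
    w = np.exp(-d_k / (h * (np.median(d_k) + 1e-12)))  # closer neighbours weigh more
    A = np.column_stack([np.ones(k), X_cal[idx]])  # local linear model with intercept
    sw = np.sqrt(w)
    coef = np.linalg.lstsq(A * sw[:, None], y_cal[idx] * sw, rcond=None)[0]
    return coef[0] + x_new @ coef[1:]

# Strongly nonlinear relation between predictor and response.
x = np.linspace(-2.0, 2.0, 401)
X_cal = x[:, None]
y_cal = np.sin(3.0 * x)
x_new = np.array([0.5])

y_local = knn_lw_predict(X_cal, y_cal, x_new, k=20)

# Global linear model on the whole calibration set, for comparison.
A_all = np.column_stack([np.ones(len(x)), X_cal])
coef_all = np.linalg.lstsq(A_all, y_cal, rcond=None)[0]
y_global = coef_all[0] + x_new @ coef_all[1:]

y_true = np.sin(3.0 * 0.5)
```

    Only `k` rows enter the local fit, which is the source of the computational advantage over weighting the entire calibration set for every new observation.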

    Updated: 2020-01-17
  • Comparison of supervised learning statistical methods for classifying commercial beers and identifying patterns
    J. Chemometr. (IF 1.847) Pub Date : 2020-01-14
    Dániel Koren; Laura Lőrincz; Sándor Kovács; Gabriella Kun‐Farkas; Beáta Vecseriné Hegyes; László Sipos

    In this study, 13 properties (alcohol‐, real extract‐, flavonoid‐, anthocyanin, glucose, fructose, maltose, sucrose content, EBC [European Brewery Convention] and L*a*b* color, bitterness) of 21 beers (alcohol‐free pale lagers, alcohol‐free beer‐based mixed drinks, beer‐based mixed drinks, international lagers, wheat beers, stouts, fruit beers) were determined. In the first step, multiple factor analysis (MFA) was performed for the whole data and five clusters (target classes) were determined; then, bootstrapping was applied to establish a balanced data set so that every cluster contains 100 samples and the total sample size is 500. In the second step, 12 supervised learning algorithms (random trees [RND], Quinlan's C4.5 decision tree algorithm [C4.5], Iterative Dichotomiser 3 algorithm [ID3], cost‐sensitive decision tree algorithm [CSMC4], cost‐sensitive classification tree [CSCRT], k‐nearest neighbors algorithm [KNN], radial basis function [RBF], multilayer perceptron neural network [MLP], prototype nearest neighbor [PNN], linear discriminant analysis [LDA], naïve Bayes with continuous variables [NBC], partial least squares discriminant analysis [PLS‐DA]) were applied to classify each brand into the target classes. Furthermore, several error rates were calculated: re‐substitution error rate (RER), cross‐validated error rate (CV), bootstrap error (BOOT), leave‐one‐out (LOO), and train‐test error rate (TRAIN). The MFA could discriminate five groups, which can be characterized by some analytical parameters, and the other multivariate methods performed similarly. The methods can be discriminated best based on the BOOT, CV, and LOO. The best estimation methods are the C4.5, CSMC4, and CSCRT; these performed best along the flavonoid content and EBC color. The method most sensitive to the properties was found to be NBC. The classification ability fluctuated greatly in the case of three properties (glucose, maltose, sucrose). A remarkable fluctuation was observed in the case of L*a*b* color parameters, flavonoid content, EBC color, and bitterness with the NBC method.

    Updated: 2020-01-15
  • Variable importance: Comparison of selectivity ratio and significance multivariate correlation for interpretation of latent‐variable regression models
    J. Chemometr. (IF 1.847) Pub Date : 2020-01-13
    Olav M. Kvalheim

    This work examines the performance of significance multivariate correlation (sMC) and selectivity ratio (SR) for ranking variables according to their importance in latent‐variable regression (LVR) models. Both indices are based on target projection (TP) of a validated LVR model obtained by partial least squares (PLS). The matrix of explanatory x‐variables is projected on the normalized regression vector to obtain a score vector that is proportional to the vector of predicted values for the response variable y. sMC for each x‐variable is calculated by dividing the squared variance explained by the decomposition obtained from these two vectors by the squared residuals. This is similar to how SR is calculated except that for SR, the regression vector is replaced by the loading matrix obtained by projecting the data matrix of x‐variables onto the score matrix obtained by TP. The two indices for variable importance are compared for three different applications with data representing instrumental profiles from liquid chromatography, infrared spectroscopy, and proton nuclear magnetic spectroscopy. Results show that SR outperforms sMC for interpretation and biomarker selection. The main drawback of sMC appears to be the mixing of predictive and orthogonal variation resulting from the direct use of the normalized regression vector in the calculation. SR uses a loading vector that is proportional to the covariances between x‐variables and the predicted response variable.
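
    The target projection and selectivity ratio described above reduce to a few matrix operations. The sketch below follows that description for SR only; it assumes column-centred data and, purely for illustration, takes the regression vector from ordinary least squares instead of a validated PLS model. Function and variable names are hypothetical.

```python
import numpy as np

def selectivity_ratio(X, b):
    """Selectivity ratio per x-variable from a target projection onto regression vector b."""
    w = b / np.linalg.norm(b)          # normalized regression vector
    t = X @ w                          # TP scores, proportional to the predicted response
    p = X.T @ t / (t @ t)              # TP loadings: projection of X onto the scores
    X_tp = np.outer(t, p)              # part of X explained by the target component
    E = X - X_tp                       # residual part
    return (X_tp**2).sum(axis=0) / (E**2).sum(axis=0)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)   # only the first variable carries signal
y = y - y.mean()

b = np.linalg.lstsq(X, y, rcond=None)[0]         # stand-in for a validated LVR model
sr = selectivity_ratio(X, b)
```

    In this toy setting the first variable receives by far the largest ratio, because almost all of its variance lies along the target component.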

    Updated: 2020-01-14
  • Process capability indices when two sources of variability present, a tolerance interval approach
    J. Chemometr. (IF 1.847) Pub Date : 2020-01-09
    Éva Pusztai; Sándor Kemény

    The sound tolerance interval–based method and two Pp–based approximations are compared on the proportion of nonconforming parts. As output distribution of the process, one possible model is examined here: two sources of variations are in a one‐way structure. It was found that the uncertainty of variance components estimates plays the major role in the goodness of the three calculation methods.

    Updated: 2020-01-10
  • Large‐scale dynamic process monitoring based on performance‐driven distributed canonical variate analysis
    J. Chemometr. (IF 1.847) Pub Date : 2020-01-08
    Jun Liu; Chunyue Song; Jun Zhao; Peng Ji

    As a typical process monitoring method for the large‐scale industrial process, the distributed principal components analysis (DPCA) needs to be improved because of its rough selection for the variables in each subblock. Moreover, for DPCA, the process dynamic property is ignored and invalid fault diagnosis may occur. Therefore, a performance‐driven distributed canonical variate analysis (DCVA) is proposed. Firstly, with historical fault information, the genetic algorithm is utilized to select appropriate variables for each subblock; secondly, canonical variate analysis is introduced to capture the dynamic information for performance improvement; finally, a novel fault diagnosis method is developed for the DCVA model. Case studies on a numerical example and the Tennessee Eastman benchmark process demonstrate the effectiveness of the proposed model.

    Updated: 2020-01-09
  • Estimation of gasoline properties by 1H NMR spectroscopy with repeated double cross‐validated partial least squares models
    J. Chemometr. (IF 1.847) Pub Date : 2020-01-05
    Ana L. Leal; Ricardo M. Albuquerque; Artur M.S. Silva; Jorge C. Ribeiro; Fernando G. Martins

    Commercial gasoline must satisfy several product specifications before trading. In the present work, repeated double cross validation using partial least squares regression was applied to create reliable prediction models for 13 physicochemical parameters (eg, density, vapour pressure, evaporate at 70°C, evaporate at 100°C, evaporate at 150°C, final boiling point, research octane number, motor octane number, aromatic content, olefinic content, benzene content, oxygen content, and methyl tert‐butyl ether content) of gasoline produced in Matosinhos' refinery. The input variables for the regression are the 1H NMR spectral intensities of a total of 448 samples, which were recorded using a picoSpin NMR spectrometer operating at 80 MHz. The output variables are the corresponding property values, which were also measured according to ISO standard methods. A spectral feature elimination before multivariate analysis was done to remove noise and speed up the chemometric analysis. The optimum complexity of each model was achieved by repeated double cross‐validation strategy, consisting of 100 repetitions of two nested cross‐validation loops. Quantitative partial least squares yielded accurate predictions of 11 of 13 properties within the reproducibility of ISO standards. The methodology presented in this work has been proven effective in property estimation and enables a significant reduction in the total time of gasoline quality control.
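
    The structure of repeated double cross‐validation, with an inner loop that picks the model complexity and an outer loop that estimates prediction error on data never used for that choice, can be sketched as below. This is a simplified illustration under stated assumptions: principal component regression stands in for PLS, the data are synthetic, the fold counts are arbitrary, and only 5 repetitions are run instead of the 100 used in the study.

```python
import numpy as np

def pcr_fit_predict(X_tr, y_tr, X_te, a):
    """Principal component regression with a components (stand-in for PLS)."""
    mx, my = X_tr.mean(axis=0), y_tr.mean()
    U, s, Vt = np.linalg.svd(X_tr - mx, full_matrices=False)
    beta = Vt[:a].T @ ((U[:, :a].T @ (y_tr - my)) / s[:a])
    return my + (X_te - mx) @ beta

def kfold_indices(n, k, rng):
    """Random partition of range(n) into k folds."""
    return np.array_split(rng.permutation(n), k)

def repeated_double_cv(X, y, max_a, n_rep=5, k_outer=4, k_inner=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    chosen, sq_err = [], []
    for _ in range(n_rep):                                   # repetitions
        for test in kfold_indices(n, k_outer, rng):          # outer loop
            train = np.setdiff1d(np.arange(n), test)
            press = np.zeros(max_a)
            for val in kfold_indices(len(train), k_inner, rng):  # inner loop
                fit = np.setdiff1d(np.arange(len(train)), val)
                for a in range(1, max_a + 1):
                    pred = pcr_fit_predict(X[train[fit]], y[train[fit]],
                                           X[train[val]], a)
                    press[a - 1] += np.sum((y[train[val]] - pred) ** 2)
            a_opt = int(np.argmin(press)) + 1                # complexity from inner loop only
            chosen.append(a_opt)
            pred = pcr_fit_predict(X[train], y[train], X[test], a_opt)
            sq_err.extend((y[test] - pred) ** 2)             # error on untouched test fold
    return float(np.sqrt(np.mean(sq_err))), chosen

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 6))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0, 2.0]) + 0.1 * rng.normal(size=80)

rmse, chosen = repeated_double_cv(X, y, max_a=6)
```

    The list `chosen` records the complexity selected in every outer split, so its spread across repetitions gives a rough picture of how stable the choice of model complexity is.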

    Updated: 2020-01-06
  • High‐dimensional spectral data classification with nonparametric feature screening
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-26
    Chuan‐Quan Li; Qing‐Song Xu

    Two nonparametric feature screening methods, namely, the Kolmogorov filter and model‐free screening, marginally measure the relationship between a categorical response and predictor variables without parametric assumptions, and they can select important variables in high‐dimensional classification data. Random forest, as a classical nonparametric method, can solve various classification problems. In this paper, we combine the two nonparametric feature screening methods with random forest to handle spectral data classification. Other conventional classification methods are then compared with ours on three spectral datasets. The comparison results illustrate that our methods perform better in classification and variable selection than the other methods.

    Updated: 2019-12-27
  • Digital bandstop filtering in the quantitative analysis of glucose from near‐infrared and midinfrared spectra
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-25
    Osamah Alrezj; Mohammed Benaissa; Saleh A. Alshebeili

    This work proposes the use of bandstop filtering (BSF) as a pretreatment method in the quantitative analysis of glucose from both near‐infrared (NIR) and midinfrared (MIR) spectra. The proposed method is investigated and evaluated against the traditional bandpass filtering (BPF) and implemented with the linear calibration models principal component regression (PCR) and partial least squares regression (PLSR) to predict the glucose from an aqueous mixture consisting of glucose and human serum albumin dissolved in a phosphate buffer solution. The results obtained show that BSF pretreatment achieves better prediction performance than BPF in both the NIR and MIR spectral regions. For detailed analysis, the BPF and BSF were implemented under both the Butterworth and Chebyshev filter configurations in both bands; in the NIR region, the Butterworth BSF combined with the PLSR model provides the best glucose prediction by reducing the root mean square error of prediction (RMSEP) from 100 mg/dL without filtering to 34 mg/dL with a coefficient of determination R² of 0.982. In the MIR region, the Chebyshev BSF combined with either PLSR or PCR improves the glucose prediction by reducing the RMSEP by 54% compared with 45% when using BPF and with R² of 0.995.

    Updated: 2019-12-27
  • Machine learning methods to predict solubilities of rock samples
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-25
    Pál Péter Hanzelik; Szilveszter Gergely; Csaba Gáspár; László Győry

    Interest in the use of chemometric and data science methods for laboratory techniques has grown rapidly over the last 10 years, because they are cheaper and faster than traditional analytical methods of material testing. This study uses 888 rock samples collected from the exploration and production (E&P) sector of the oil industry. Based on the Fourier‐transform infrared (FT‐IR) spectra of these rock samples, their solubility predictions have been developed and investigated with nine methods including both linear and non‐linear ones. Two of these methods, Partial Least Squares Regression (PLSR) and Support Vector Regression (SVR), are available in a commercial software package, and the other seven methods, Extreme Gradient Boosting (XGBoost), Ridge Regression (RR), k‐nearest neighbours (k‐NN), Decision Tree (DT), Multilayer Perceptron (MLP), Support Vector Regression (SVR), Artificial Neural Network (ANN) with TensorFlow (TF), were coded by the authors based either on commercial applications or open source libraries. The investigation starts with spectral data pre‐processing carried out by standard normal variate (SNV), baseline correction and feature selection methods, creating the feature set for all machine learning (ML) applications. The accuracy of predictions has been evaluated with mean squared error as a performance metric for each investigated method. The comparisons of predicted values to real data of test samples have shown that mineral solubility in acids can be well predicted in the range of the uncertainties of real laboratory measurements; therefore, it can be used to improve the response time of these investigations and reduce the risk in industrial applications. In those cases where the unknown samples have some out‐of‐range features, the limitations in the accuracy of predictions have become clear. We have also identified the limitations in the methodology and planned steps to further improve the prediction capabilities. The identified constraint of samples' multitude further emphasizes the need for database building efforts, so that the real potential in big data and machine learning can be realized.

    Updated: 2019-12-27
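    The SNV pre-processing and the mean squared error metric named in the abstract above can be illustrated with a minimal NumPy sketch. The function names and the toy spectra are assumptions made for illustration; they are not taken from the paper, and the baseline correction and feature selection steps are not shown.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy data (hypothetical): the same underlying spectrum recorded with
# different multiplicative and additive effects.
rng = np.random.default_rng(0)
base = rng.random(50)
spectra = np.vstack([2.0 * base + 0.5, 0.7 * base - 0.1])

# After SNV the two rows coincide, because the offset and gain are removed.
corrected = snv(spectra)
```

    SNV works row by row, so it needs no information from other samples; that makes it easy to apply identically to calibration and test spectra.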
  • A perspective on modeling evolution
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-25
    Anna de Juan; Sílvia Mas; Marcel Maeder; Romà Tauler

    Data modeling is a broad, long‐established concept that encompasses all possible ways to interpret the information associated with a process, analytical measurement or set of related parameters that presents a systematic variation. Data modeling can follow the path of knowledge and be based on first principles or can focus on measurements and empirical models. These different approaches are known as hard‐ and soft‐modeling, respectively.

    Updated: 2019-12-27
  • Issue Information
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-26

    No abstract is available for this article.

    Updated: 2019-12-27
  • A two‐layer ensemble learning framework for data‐driven soft sensor of the diesel attributes in an industrial hydrocracking process
    J. Chemometr. (IF 1.847) Pub Date : 2019-11-12
    Yalin Wang; Dongzhe Wu; Xiaofeng Yuan

    In the hydrocracking process, timely measurement of product attributes is of great significance for real‐time process control and optimization. However, these attributes are often very difficult to measure online due to technical and economic limitations. To this end, soft sensors are introduced to predict product attributes from easy‐to‐measure process variables, with the advantages of low cost, fast response, and ease of maintenance. In this paper, a two‐layer ensemble learning framework is developed for soft sensing of three diesel attributes in an industrial hydrocracking process. In this modeling framework, the process variables are first divided into subspace blocks according to process topological structure to capture the local behaviors of different production cells. Then, to overcome the weak generalization ability of a single calibration model with a specific hypothesis, different regression learners are constructed on each variable subblock to increase the model diversity. Finally, the individual models are fused to improve the prediction performance and generalization ability of the soft sensor models. The effectiveness and flexibility of the proposed ensemble learning method are validated on a real industrial hydrocracking process.

    Updated: 2019-12-27
  • PARAMO: Enhanced Data Pre‐processing in Batch Multivariate Statistical Process Control
    J. Chemometr. (IF 1.847) Pub Date : 2019-11-26
    Marta Fuentes‐García; José María González‐Martínez; Gabriel Maciá‐Fernández; José Camacho

    Since the pioneering works by Nomikos and MacGregor, the Batch Multivariate Statistical Process Control (BMSPC) methodology has been extensively revised, and a large number of alternative monitoring approaches have been suggested. The different approaches vary in the batch data alignment, the pre‐processing approach, the data arrangement, and/or the type of model used, from two‐way to three‐way and from linear to nonlinear. One of the most accepted pre‐processing schemes, referred to as the trajectory centering and scaling (TCS), is based on the normalization to zero mean and unit variance around the average trajectory. However, the main drawback of TCS is the inherent increase of the level of uncertainty in the estimation of model parameters. In this work, we illustrate how to improve parameter estimation while maintaining the good properties of this pre‐processing approach. This enhancement is achieved with the new pre‐processing approach we call PARAMO, which uses more observations than TCS to estimate the pre‐processing parameters. We show that this improvement favorably impacts the performance of the monitoring system. The results of this research work affect a large number of the monitoring approaches proposed to date, and we advocate that the pre‐processing procedure proposed here should be generally applied in BMSPC.

    Updated: 2019-12-27
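    The trajectory centering and scaling (TCS) step described in the abstract above can be sketched as follows. The array layout (batches, time points, variables), the function name and the simulated data are assumptions; PARAMO itself is not implemented here because the abstract does not specify how it pools observations.

```python
import numpy as np

def trajectory_center_scale(batches):
    """Center and scale batch data around the average trajectory.

    `batches` is assumed to have shape (n_batches, n_times, n_variables).
    A mean and a standard deviation are estimated across batches for
    every (time, variable) pair.
    """
    batches = np.asarray(batches, dtype=float)
    mean_traj = batches.mean(axis=0)
    std_traj = batches.std(axis=0, ddof=1)
    scaled = (batches - mean_traj) / std_traj
    return scaled, mean_traj, std_traj

# Hypothetical data: 12 batches, 50 time points, 3 process variables
# following a common trend plus batch-to-batch noise.
rng = np.random.default_rng(0)
n_batches, n_times, n_variables = 12, 50, 3
trend = np.linspace(0.0, 5.0, n_times)[None, :, None]
batches = trend + rng.normal(scale=0.3, size=(n_batches, n_times, n_variables))

scaled, mean_traj, std_traj = trajectory_center_scale(batches)
```

    In this layout each mean and standard deviation is estimated from only `n_batches` values, which is one way to read the abstract's remark about parameter uncertainty in TCS.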
  • QSAR and docking studies on Triazole Benzene Sulfonamides with human Carbonic anhydrase IX inhibitory activity
    J. Chemometr. (IF 1.847) Pub Date : 2019-11-11
    P. Gopinath; M.K. Kathiravan

    Cancer is the second leading cause of death worldwide, and breast cancer accounts for 2.09 million cases in the year 2018. Hypoxia‐related human carbonic anhydrase IX enzyme was found to play a key role in metastasis also. In this view, quantitative structure activity relationship (QSAR) studies were carried out by QSARINS on triazole benzene sulfonamide derivatives for carbonic anhydrase IX inhibitory activity targeting breast cancer. A new scope to explore 3D‐MoRSE descriptors in carbonic anhydrase inhibition has been initiated by this study. The best model 3 generated includes five variables MoRSEV22, MoRSEC17, MoRSEV1, MoRSEC4, and MoRSEE2 with statistical values R2 = 0.7852, CCCtr = 0.8797, Q2LOO = 0.7237, Q2LMO = 0.7071, CCCcv = 0.8472, R2ext = 0.7894, and CCCext = 0.8784. The developed QSAR model suggests that the atomic volume, atomic charges, and Sanderson's electronegativity play key roles and were extremely helpful in designing and optimizing the lead. Molecular docking studies were performed using Autodock v 4.2.6 and the residues of active site region involving both hydrophilic and hydrophobic parts interacted with best predicted active compounds 1d, 3e, 6f and 9f. The study leads to the development of new inhibitors targeting breast cancer.

    Updated: 2019-12-27
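    For readers unfamiliar with the statistics quoted in the abstract above, the sketch below computes a fitted R2 and a leave-one-out cross-validated Q2 for an ordinary least squares model. It assumes the common definition Q2 = 1 - PRESS / TSS with the total sum of squares taken around the overall mean; QSARINS may use a different convention, and the descriptors and activities here are simulated, not the paper's data.

```python
import numpy as np

def r2_and_q2_loo(X, y):
    """Return (R2, Q2_LOO) for a linear model with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum((y - A @ coef) ** 2) / ss_tot

    # Leave one compound out, refit, and predict the held-out compound.
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        c, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ c) ** 2
    q2 = 1.0 - press / ss_tot
    return float(r2), float(q2)

# Simulated example: 30 compounds, 5 descriptors (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 0.8]) + rng.normal(scale=0.5, size=30)
r2, q2 = r2_and_q2_loo(X, y)
```

    For least squares models Q2 from leave-one-out never exceeds the fitted R2, which is why a small gap between the two is usually read as a sign of limited overfitting.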
  • Comparison of two augmented classical least squares algorithms and PLS for determining nifuroxazide and its genotoxic impurities using UV spectroscopy
    J. Chemometr. (IF 1.847) Pub Date : 2019-11-25
    Maha A. Hegazy; Nada S. Abdelwahab; Nouruddin W. Ali; Marco M. Z. Sharkawi; Mohamed M. Abdelkawy; Mohammed T. El‐Saadi

    Concentration residual augmented classical least squares (CRACLS) and spectral residual augmented classical least squares (SRACLS) reflect the resolving power of spectrophotometric augmented mathematical techniques. The applied multivariate calibration methods were able to simultaneously determine the quinary mixture of nifuroxazide (NIF) and its four carcinogenic impurities, and a comparative study between the proposed models and the conventional partial least squares (PLS) algorithm was also carried out. Regression models were built covering the range of 10.00 to 50.00 μg mL−1 for NIF, 0.05 to 0.45 μg mL−1 (for each of impurities A and B), and 0.10 to 0.90 μg mL−1 (for impurities C and D). NIF and its four genotoxic impurities were successfully determined in the prepared mixtures and dosage forms. The efficiency of the applied algorithms for resolution and quantitation of the overlapped UV signals was compared, and the advantages over the PLS model were emphasized. The obtained results were statistically compared with each other and with those of the reported method.

    Updated: 2019-12-27
  • Fault monitoring based on locally weighted probabilistic kernel partial least square for nonlinear time‐varying processes
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-02
    Ying Xie

    In this paper, novel data‐driven fault detection and diagnosis approaches are proposed on the basis of a new locally weighted probabilistic kernel partial least squares (LWPKPLS). (a) LWPKPLS can construct an accurate model for a time‐varying process by updating itself using the newly coming samples, thus LWPKPLS can be used to monitor time‐varying processes. (b) By the integration of local weighted regression and kernel tricks, the LWPKPLS can be applied to construct models for processes with much stronger nonlinear data characteristics. (c) Meanwhile, as a probabilistic regression model, LWPKPLS can process data with random noises and missing values. (d) A set of process monitoring approaches including fault detection and fault diagnosis are developed on the basis of LWPKPLS. At last, the experiment results from a numerical example and an ion‐exchange membrane electrolysis process (IEMEP) demonstrate that the proposed process monitoring approaches have satisfactory monitoring performance.

    Updated: 2019-12-27
  • Using chemometrics tools to gain detailed molecular information on chemical processes
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-23
    Graeme Puxty; Marcel Maeder

    Paul Gemperline has been active in many areas of data analysis in chemistry. In this contribution, we build on his developments in advanced process analysis. The most prominent aspect is data fusion, the combined analysis of measurements taken on different instruments, using different measurement techniques, and different chemical conditions. We have successfully applied these principles to the investigation of the chemistry of CO2 in aqueous amine solutions.

    Updated: 2019-12-23
  • A procedure for calibration transfer of DOSY NMR measurements: An example of molecular weight of heparin preparations
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-15
    Yulia B. Monakhova; Bernd W.K. Diehl

    Calibration transfer is commonly used for spectra obtained on different spectrometers or under different conditions. This paper reports a multivariate transfer approach for 2D diffusion‐ordered spectroscopy (DOSY) measurements among high‐resolution nuclear magnetic resonance (NMR) spectrometers on the basis of partial least squares (PLS) regression. The test system was a previously published quantitative model for predicting the molecular weight of heparin and low molecular weight heparin (LMWH).

    Updated: 2019-12-17
  • Modified PCA and PLS: Towards a better classification in Raman spectroscopy–based biological applications
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-10
    Shuxia Guo, Petra Rösch, Jürgen Popp, Thomas Bocklitz

    Raman spectra of biological samples often exhibit variations originating from changes of spectrometers, measurement conditions, and cultivation conditions. Such unwanted variations make a classification extremely challenging, especially if they are more significant compared with the differences between groups to be separated. A classifier is prone to such unwanted variations (ie, intragroup variations) and can fail to learn the patterns that can help separate different groups (ie, intergroup differences). This often leads to a poor generalization performance and a degraded transferability of the trained model. A natural solution is to separate the intragroup variations from the intergroup differences and build the classifier based on merely the latter information, for example, by a well‐designed feature extraction. This forms the idea of this contribution. Herein, we modified two commonly applied feature extraction approaches, principal component analysis (PCA) and partial least squares (PLS), in order to extract merely the features representing the intergroup differences. Both of the methods were verified with two Raman spectral datasets measured from bacterial cultures and colon tissues of mice, respectively. In comparison to ordinary PCA and PLS, the modified PCA was able to improve the prediction on the testing data that bears significant difference to the training data, while the modified PLS could help avoid overfitting and lead to a more stable classification.

    Updated: 2019-12-11
  • Novel LIBS method for micro‐spatial chemical analysis of inorganic gunshot residues
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-09
    Korina Menking‐Hoggatt, Luis Arroyo, James Curran, Tatiana Trejos

    This study developed a reliable laser‐induced breakdown spectroscopy (LIBS) screening approach capable of detecting GSR in just a few minutes with minimal damage to the sample, high specificity, and sensitivity. Moreover, a novel micro‐sampling method was developed to gather three‐dimensional data of the simultaneous occurrence of IGSR markers from a discrete space. The method is capable of micro‐spatial chemical analysis from just two laser shots fired at an area of 100‐μm diameter. The performance of the micro‐spot method is compared with our previously published bulk‐line method. Superior accuracy, spatial information of IGSR distribution in the sample, and less invasive sampling are some of the advantages of the newly proposed method. A further benefit of this approach is that it uses the universal hand collection method currently used by practitioners, while leaving over 99% of the stub unaltered for further analysis.

    Updated: 2019-12-11
  • Detection of magnetic audio tape degradation with neural networks and Lasso
    J. Chemometr. (IF 1.847) Pub Date : 2019-12-02
    Nilmini H. Ratnasena, Dayla C. Rich, Alyssa M. Abraham, Larissa L. Cunha, Stephen L. Morgan

    Audio magnetic tapes manufactured using polyester urethane are known to become nonplayable over time due to the degradation of the magnetic layer. Attempting to play degraded tapes to digitize them can cause extensive damage to the tape as well as to the playback device. For this reason, most of the magnetic tapes in cultural heritage institutions are in a critical state. The purpose of our study is to preserve historical recordings on magnetic tapes by developing a nondestructive technique to determine degradation status. Our approach is to combine attenuated total reflectance Fourier transform infrared spectroscopy (ATR FT‐IR) with chemometric techniques, especially neural networks and least absolute shrinkage and selection operator (Lasso). The neural network model classified tapes as playable or nonplayable with 97% to 98% accuracy when similar tape brands/models were in the training and the test set. With different brands/models in the test set, the neural network model performed poorly. However, Lasso showed 95.5% accuracy for similar brands/models and 80.5% accuracy for different tape brands/models. This suggests that Lasso is the better technique to determine whether a tape is degraded.

    Updated: 2019-12-03
  • Issue Information
    J. Chemometr. (IF 1.847) Pub Date : 2019-11-14

    No abstract is available for this article.

    Updated: 2019-11-15
  • Orders of magnitude speed increase in partial least squares feature selection with new simple indexing technique for very tall data sets
    J. Chemometr. (IF 1.847) Pub Date : 2019-10-27
    Petter Stefansson, Ulf G. Indahl, Kristian H. Liland, Ingunn Burud

    Feature selection is a challenging combinatorial optimization problem that tends to require a large number of candidate feature subsets to be evaluated before a satisfying solution is obtained. Because of the computational cost associated with estimating the regression coefficients for each subset, feature selection can be an immensely time‐consuming process and is often left inadequately explored. Here, we propose a simple modification to the conventional sequence of calculations involved when fitting a number of feature subsets to the same response data with partial least squares (PLS) model fitting. The modification consists in establishing the covariance matrix for the full set of features by an initial calculation and then deriving the covariance of all subsequent feature subsets solely by indexing into the original covariance matrix. By choosing this approach, which is primarily suitable for tall design matrices with significantly more rows than columns, we avoid redundant (identical) recalculations in the evaluation of different feature subsets. By benchmarking the time required to solve regression problems of various sizes, we demonstrate that the introduced technique outperforms traditional approaches by several orders of magnitude when used in conjunction with PLS modeling. In the supplementary material, we provide code for implementing the concept with kernel PLS regression.

    Updated: 2019-11-15
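    The indexing idea in the abstract above can be sketched in a few lines of NumPy. The cross-products of the full feature set are computed once, and the cross-products of any feature subset are then obtained by indexing rather than by recomputation. The data, the subset and the function name are illustrative assumptions; the kernel PLS routine that would consume these matrices is not shown and is in the paper's supplementary material.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_features = 10_000, 40          # a "tall" design matrix
X = rng.normal(size=(n_obs, n_features))
y = rng.normal(size=n_obs)

# Center once so that X.T @ X is proportional to the covariance matrix.
X = X - X.mean(axis=0)
y = y - y.mean()

# One-off cost: these products depend on n_obs ...
XtX = X.T @ X            # shape (n_features, n_features)
Xty = X.T @ y            # shape (n_features,)

def subset_cross_products(XtX, Xty, idx):
    """Cross-products for a feature subset, obtained purely by indexing."""
    idx = np.asarray(idx)
    return XtX[np.ix_(idx, idx)], Xty[idx]

# ... whereas evaluating a candidate subset no longer touches X at all.
idx = [3, 7, 21]
sub_XtX, sub_Xty = subset_cross_products(XtX, Xty, idx)
```

    The saving comes from the fact that every candidate subset reuses the same two products, so the per-subset cost is independent of the number of observations.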
  • Chemometric approach to estimate kinetic properties of paclitaxel prodrugs and their substructures for solubility prediction through molecular modelling and simulation studies
    J. Chemometr. (IF 1.847) Pub Date : 2019-08-16
    Nupur S. Munjal, Rohit Shukla, Tiratha Raj Singh

    Paclitaxel drug is administered in the treatment of ovarian and breast cancer and also in Kaposi sarcoma. In spite of being nanomolar active, use of this drug is confined because of its low aqueous solubility, hence many prodrugs for increasing paclitaxel's solubility were formed, but the formation process was not rational. In the current study, quantitative structure property relationship (QSPR) models were formed for the solubility prediction of paclitaxel prodrugs. Structures of all molecules were optimized at the parameterization method 6 (PM6) and Austin Model 1 (AM1) levels, after which Dragon‐based 5250 descriptors and quasi‐mixture descriptors were calculated. Independent descriptors were selected in multiple steps, and QSPR models having 12 and 10 descriptors with R2 and Q2 values of 0.78 and 0.60 and 0.80 and 0.69 for AM1‐ and PM6‐optimized geometry datasets, respectively, were formed. Also, for substituent group dataset, QSPR models with 8 and 9 descriptors having R2 and Q2 values of 0.82 and 0.76 and 0.93 and 0.83 were determined for AM1‐ and PM6‐optimized geometry datasets, respectively. Quasi‐mixture descriptors, which were calculated for substituent group datasets, gave the QSPR model with R2 and Q2 values of 0.70 and 0.58 and 0.69 and 0.52 respectively for AM1‐ and PM6‐optimized geometries. After the models' development, the substituent group dataset was employed for the formation of docking and molecular dynamics simulation–based models for the metabolic study with CYP1A2 enzyme. It is anticipated that the proposed QSPR models will serve as a base for the designing of new paclitaxel prodrugs with improved aqueous solubility.

    Updated: 2019-11-15
  • Properly handling negative values in the calculation of binding constants by physicochemical modeling of spectroscopic titration data
    J. Chemometr. (IF 1.847) Pub Date : 2019-08-14
    Nathanael P. Kazmierczak, Douglas A. Vander Griend

    To implement equilibrium hard‐modeling of spectroscopic titration data, the analyst must make a variety of crucial data processing choices that address negative absorbance and molar absorptivity values. The efficacy of three such methodological options is evaluated via high‐throughput Monte Carlo simulations, root‐mean‐square error surface mapping, and two mathematical theorems. Accuracy of the calculated binding constant values constitutes the key figure of merit used to compare different data analysis approaches. First, using singular value decomposition to filter the raw absorbance data prior to modeling often reduces the number of negative values involved but has little effect on the calculated binding constant despite its ability to address spectrometer noise. Second, both truncation of negative molar absorptivity values and the fast nonnegative least squares algorithms are superior to unconstrained regression because they avoid local minima; however, they introduce bias into the calculated binding constants in the presence of negative baseline offsets. Finally, we establish two theorems showing that negative values are best addressed when all the chemical solutions leading to the raw absorbance data are the result of mixing exactly two distinct stock solutions. This allows the raw absorbance data to be shifted up, eliminating negative baseline offsets, without affecting the concentration matrix, residual matrix, or calculated binding constants. Otherwise, the data cannot be safely upshifted. A comprehensive protocol for analyzing experimental absorbance datasets with is included.

    Updated: 2019-11-15
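    The singular value decomposition filtering mentioned in the abstract above amounts to a truncated SVD reconstruction of the absorbance matrix. The sketch below is a generic version under stated assumptions: the number of retained components is chosen by hand, and the two-species absorbance data are simulated rather than taken from the paper.

```python
import numpy as np

def svd_filter(absorbance, n_components):
    """Rebuild the data from its leading singular components only."""
    U, s, Vt = np.linalg.svd(absorbance, full_matrices=False)
    k = n_components
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Simulated titration: 30 solutions, 100 wavelengths, 2 absorbing species.
rng = np.random.default_rng(0)
conc = rng.random((30, 2))        # hypothetical concentration profiles
spectra = rng.random((2, 100))    # hypothetical molar absorptivities
clean = conc @ spectra
noisy = clean + rng.normal(scale=0.01, size=clean.shape)

# Keep as many components as there are absorbing species.
filtered = svd_filter(noisy, n_components=2)
```

    Filtering of this kind removes the part of the noise that lies outside the retained subspace, which is consistent with the abstract's observation that it addresses spectrometer noise without necessarily changing the fitted binding constant.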
  • A practical convolutional neural network model for discriminating Raman spectra of human and animal blood
    J. Chemometr. (IF 1.847) Pub Date : 2019-08-22
    Jialin Dong, Mingjian Hong, Yi Xu, Xiangquan Zheng

    A practical convolutional neural network (CNN) model is proposed to discriminate the Raman spectra of human and animal blood. The proposed network, which discards the pooling layers to avoid loss of data, consists of preprocessing and fully connected classifier layers. Two preprocessing layers, namely a denoising layer and a baseline correction layer, are each designed with only one kernel to explicitly suppress the noise and subtract the varying background of the spectra. The network combines the preprocessing and discrimination to form a whole processing unit and learns parameters adaptively by training on 217 of 326 Raman spectra of human, dog, and rabbit blood samples. The trained network is evaluated on the remaining 109 samples and shows better classification accuracy than PLSDA and SVM.

    Updated: 2019-11-15
  • Model‐based description of indicator displacement assay sensor arrays for quantitation of mixtures
    J. Chemometr. (IF 1.847) Pub Date : 2019-09-09
    Akram Rostami, Somaiyeh Khodadadi Karimvand, Hamid Abdollahi

    This paper propounds an approach for the simultaneous analysis of colorless analytes using optical sensor arrays. On the basis of the equilibrium law, the principles of indicator displacement assay (IDA)–based sensor arrays that can be utilized for the efficient quantification of analytes in the mixture solutions through multivariate hard modeling approach are discussed in detail. According to these principles, two different sensor array systems are designed for the simultaneous analysis of the analytes with nonselective signal in the mixtures. Each sensor element of the designed sensor arrays consists of a single indicator and receptor but has different equilibrium environmental conditions.

    Updated: 2019-11-15
  • Chemometric study of kosmotropic and chaotropic ion properties related to Hofmeister effects
    J. Chemometr. (IF 1.847) Pub Date : 2019-09-09
    Anna Jakubowska, Tomasz Kozik

    Ion specificity is ubiquitous in biological and colloidal phenomena. In this study, the Hofmeister effects have been observed in binding of univalent cations with the surfactant dimers. Chemometric analysis has also revealed that the ion specificity observed is mainly related to the effect of ions on the water structure, ie, to their kosmotropic and chaotropic properties. These ion properties, generally characteristic of aqueous solutions, are in accordance with the results of the principal component analysis (PCA) and cluster analysis (CA) used for the processing of mass spectrometric data obtained in vacuum, ie, in a medium drastically different from water. Such an application of the chemometric methods has not been presented so far.

    Updated: 2019-11-15
  • Comparative Chemometric Analysis for Classification of Acids and Bases via a Colorimetric Sensor Array.
    J. Chemometr. (IF 1.847) Pub Date : 2018-05-26
    Michael J Kangas, Raychelle M Burks, Jordyn Atwater, Rachel M Lukowicz, Billy Garver, Andrea E Holmes

    With the increasing availability of digital imaging devices, colorimetric sensor arrays are rapidly becoming a simple, yet effective tool for the identification and quantification of various analytes. Colorimetric arrays utilize colorimetric data from many colorimetric sensors, with the multidimensional nature of the resulting data necessitating the use of chemometric analysis. Herein, an 8‐sensor colorimetric array was used to analyze select acid and base samples (0.5 - 10 M) to determine which chemometric methods are best suited for classification and quantification of analytes within clusters. PCA, HCA, and LDA were used to visualize the data set. All three methods showed well-separated clusters for each of the acid or base analytes and moderate separation between analyte concentrations, indicating that the sensor array can be used to identify and quantify samples. Furthermore, PCA could be used to determine which sensors showed the most effective analyte identification. LDA, KNN, and HQI were used for identification of analyte and concentration. HQI and KNN correctly identified the analytes in all cases, while LDA correctly identified 95 of 96 analytes. Additional studies demonstrated that controlling for solvent and image effects was unnecessary for all chemometric methods utilized in this study.

    Updated: 2019-11-01
  • Artificial neural networks as supervised techniques for FT-IR microspectroscopic imaging.
    J. Chemometr. (IF 1.847) Pub Date : 2007-03-28
    Peter Lasch, Max Diem, Wolfgang Hänsch, Dieter Naumann

    In this report the applicability of an improved method of image segmentation of infrared microspectroscopic data from histological specimens is demonstrated. Fourier transform infrared (FT-IR) microspectroscopy was used to record hyperspectral data sets from human colorectal adenocarcinomas and to build up a database of spatially resolved tissue spectra. This database of colon microspectra comprised 4120 high-quality FT-IR point spectra from 28 patient samples and 12 different histological structures. The spectral information contained in the database was employed to teach and validate multilayer perceptron artificial neural network (MLP-ANN) models. These classification models were then employed for database analysis and utilised to produce false colour images from complete tissue maps of FT-IR microspectra. An important aspect of this study was also to demonstrate how the diagnostic sensitivity and specificity can be specifically optimised. An example is given which shows that changes of the number of teaching patterns per class can be used to modify these two interrelated test parameters. The definition of ANN topology turned out to be crucial to achieve a high degree of correspondence between the gold standard of histopathology and IR spectroscopy. Particularly, a hierarchical scheme of ANN classification proved to be superior for the reliable classification of tissue spectra. It was found that unsupervised methods of clustering, specifically agglomerative hierarchical clustering (AHC), were helpful in the initial phases of model generation. Optimal classification results could be achieved if the class definitions for the ANNs were carried out by considering the classification information provided by cluster analysis.

    Updated: 2019-11-01
  • Adaptive penalties for generalized Tikhonov regularization in statistical regression models with application to spectroscopy data.
    J. Chemometr. (IF 1.847) Pub Date : 2017-04-01
    Timothy W Randolph, Jimin Ding, Madan G Kundu, Jaroslaw Harezlak

    Tikhonov regularization was proposed for multivariate calibration by Andries and Kalivas [1]. We use this framework for modeling the statistical association between spectroscopy data and a scalar outcome. In both the calibration and regression settings this regularization process has advantages over methods of spectral pre-processing and dimension-reduction approaches such as feature extraction or principal component regression. We propose an extension of this penalized regression framework by adaptively refining the penalty term to optimally focus the regularization process. We illustrate the approach using simulated spectra and compare it with other penalized regression models and with a two-step method that first pre-processes the spectra then fits a dimension-reduced model using the processed data. The methods are also applied to magnetic resonance spectroscopy data to identify brain metabolites that are associated with cognitive function.

    Updated: 2019-11-01
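    The generalized Tikhonov framework referred to in the abstract above minimizes ||y - X b||^2 + lam * ||L b||^2, whose minimizer solves (X'X + lam * L'L) b = X'y. The sketch below implements only this textbook closed form with a fixed first-difference penalty; the adaptive refinement of the penalty proposed by the authors is not reproduced, and the data, the penalty operator and the value of `lam` are assumptions.

```python
import numpy as np

def first_difference_operator(p):
    """(p-1) x p matrix whose rows compute b[j+1] - b[j]."""
    return np.diff(np.eye(p), axis=0)

def tikhonov_fit(X, y, L, lam):
    """Solve the penalized normal equations (X'X + lam L'L) b = X'y."""
    return np.linalg.solve(X.T @ X + lam * (L.T @ L), X.T @ y)

# Simulated spectroscopy-like setting: more channels than samples.
rng = np.random.default_rng(0)
n, p = 40, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Penalizing differences of neighboring coefficients favors smooth
# coefficient curves; L = np.eye(p) would give ordinary ridge regression.
L = first_difference_operator(p)
beta = tikhonov_fit(X, y, L, lam=10.0)
```

    The choice of `L` is where prior knowledge about the spectra enters, which is the part of the model the abstract proposes to adapt.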
  • Quionolone carboxylic acid derivatives as HIV-1 integrase inhibitors: Docking-based HQSAR and topomer CoMFA analyses.
    J. Chemometr. (IF 1.847) Pub Date : 2018-04-03
    Jianbo Tong, Pei Zhan, Xiang Simon Wang, Yingji Wu

    Quionolone carboxylic acid derivatives as inhibitors of HIV-1 integrase were investigated as a potential class of drugs for the treatment of acquired immunodeficiency syndrome (AIDS). Hologram quantitative structure-activity relationships (HQSAR) and translocation comparative molecular field vector analysis (topomer CoMFA) were applied to a series of 48 quionolone carboxylic acid derivatives. The most effective HQSAR model was obtained using atoms and bonds as fragment distinctions: cross-validation q2 = 0.796, standard error of prediction SDCV = 0.36, the non-cross-validated r2 = 0.967, non-cross validated standard error SD = 0.17, the correlation coefficient of external validation Qext2 = 0.955, and the best hologram length HL = 180. Topomer CoMFA models were built based on different fragment cutting models, with the most effective model of q2 = 0.775, SDCV = 0.37, r2 = 0.967, SD = 0.15, Qext2 = 0.915, and F = 163.255. These results show that the models generated from HQSAR and topomer CoMFA were able to effectively predict the inhibitory potency of this class of compounds. The molecular docking method was also used to study the interactions of these drugs by docking the ligands into the HIV-1 integrase active site, which revealed the likely bioactive conformations. This study showed that there are extensive interactions between the quionolone carboxylic acid derivatives and THR80, VAL82, GLY27, ASP29, and ARG8 residues in the active site of HIV-1 integrase. These results provide useful insights for the design of potent new inhibitors of HIV-1 integrase.

    Updated: 2019-11-01
  • Discovery of False Identification Using Similarity Difference in GC-MS based Metabolomics.
    J. Chemometr. (IF 1.847) Pub Date : 2015-05-06
    Seongho Kim, Xiang Zhang

    Compound identification is a critical process in metabolomics. The widely used approach for compound identification in gas chromatography-mass spectrometry (GC-MS) based metabolomics is spectrum matching, in which the mass spectral similarity between an experimental mass spectrum and each mass spectrum in a reference library is calculated. While various similarity measures have been developed to improve the overall accuracy of compound identification, little attention has been paid to reducing the false discovery rate. We therefore develop an approach for controlling the false identification rate using the distribution of the difference between the first and the second highest spectral similarity scores. We further propose a model-based approach to achieving a desired true positive rate. The developed method is applied to the NIST mass spectral library and its performance is compared with the conventional approach that uses only the maximum spectral similarity score. The results show that the developed method achieves a significantly higher F1 score and positive predictive value than the conventional approach.

    Updated: 2019-11-01
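    The quantity at the center of the abstract above, the difference between the highest and second highest similarity scores, can be computed as in the sketch below. Cosine similarity is used as an assumed similarity measure, and the four-channel "library" is a toy example; the paper's statistical model for turning this difference into a controlled false identification rate is not reproduced.

```python
import numpy as np

def cosine_similarity(query, library):
    """Cosine similarity between one query spectrum and each library row."""
    query = np.asarray(query, dtype=float)
    library = np.asarray(library, dtype=float)
    norms = np.linalg.norm(library, axis=1) * np.linalg.norm(query)
    return library @ query / norms

def top_match_and_gap(query, library):
    """Return (index of best match, best score, best minus second best)."""
    scores = cosine_similarity(query, library)
    order = np.argsort(scores)[::-1]          # indices by descending score
    best, second = order[0], order[1]
    return int(best), float(scores[best]), float(scores[best] - scores[second])

# Toy reference library of three spectra over four m/z channels.
library = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
])
query = np.array([0.0, 2.0, 2.0, 0.0])  # a scaled copy of library entry 1

best, score, gap = top_match_and_gap(query, library)
```

    A large gap means the best hit stands clearly apart from its nearest competitor, while a small gap flags an identification that the maximum score alone would not reveal as ambiguous.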
  • Multivariate curve resolution methods and the design of experiments
    J. Chemometr. (IF 1.847) Pub Date : 2019-08-22
    Mathias Sawall, Christoph Kubis, Henning Schröder, Denise Meinhardt, Detlef Selent, Robert Franke, Alexander Brächer, Armin Börner, Klaus Neymeyr

    A major problem of multivariate curve resolution methods is the underlying non‐uniqueness of the pure component decompositions. This raises the question how a chemical experiment should be designed so that the solution ambiguity is as small as possible. Changes of the reaction conditions belong to the possible variations whereas for a fixed chemical reaction system, the pure component spectra appear to be unchangeable.

    Updated: 2019-10-25
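    The non-uniqueness mentioned in the abstract above can be made concrete: if D = C S^T, then for any invertible T the factors C T and T^-1 S^T reproduce D equally well. The sketch below builds a two-component example in which both factorizations are also nonnegative, so nonnegativity alone does not single out one solution. All profiles and the matrix T are invented for illustration.

```python
import numpy as np

# Hypothetical concentration profiles of a reaction A -> B over time.
t = np.linspace(0.0, 1.0, 20)
C = np.column_stack([np.exp(-3.0 * t), 1.0 - np.exp(-3.0 * t)])

# Hypothetical pure component spectra (rows) over a wavelength axis.
wl = np.linspace(0.0, 1.0, 30)
S_t = np.vstack([
    0.5 + np.exp(-((wl - 0.3) / 0.1) ** 2),
    np.exp(-((wl - 0.7) / 0.1) ** 2),
])

D = C @ S_t                      # measured data matrix

# An invertible transformation that mixes the two components.
T = np.array([[1.0, 0.2],
              [0.0, 1.0]])
C_alt = C @ T
S_alt_t = np.linalg.inv(T) @ S_t

# Both pairs of factors give the same D and are nonnegative.
same_data = np.allclose(D, C_alt @ S_alt_t)
both_nonnegative = bool((C_alt >= 0).all() and (S_alt_t >= 0).all())
```

    Additional constraints or, as the abstract suggests, a suitable experimental design are needed to shrink the set of admissible T.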
  • Incorporating brand variability into classification of edible oils by Raman spectroscopy
    J. Chemometr. (IF 1.847) Pub Date : 2019-08-01
    Francis Kwofie, Barry K. Lavine, Joshua Ottaway, Karl Booksh

    Two‐hundred and fifteen Raman spectra of 15 edible oils or blends of edible oils from 53 samples spanning multiple brands purchased over 3 years were investigated using a genetic algorithm for spectral pattern recognition. Using a hierarchical approach to classification, the 15 edible oils could be divided into two groups based on their degree of unsaturation. While edible oils from any particular batch within a class are well clustered and can be differentiated from other varieties of edible oils that are also from a single source, incorporating uncontrolled variability from sources (by purchasing edible oils under different brand names) and seasons (by purchasing edible oils over a 3‐year period) presented a far more challenging classification problem for edible oils within the same group. The between‐source and yearly variability within one class of edible oils is often comparable to differences between the average spectra of the different varieties of edible oils, thereby preventing either a reliable classification of the edible oils or the detection of adulterants in an edible oil if a single model, spanning all sources and years of oils, is to be constructed. The novelty of this study arises from the incorporation of edible oils gathered systematically over the span of 3 years, introducing a heretofore unseen variance to the chemical compositions of the edible oils that are being classified. This is the first time that many different edible oils and commercially available brands thereof have been classified simultaneously.

    Updated: 2019-10-25
  • VSN: Variable sorting for normalization
    J. Chemometr. (IF 1.847) Pub Date : 2019-07-29
    Gilles Rabatel, Federico Marini, Beata Walczak, Jean‐Michel Roger

    Spectrometric and analytical techniques in general collect multivariate signals from chemical or biological materials by means of specific measurement instrumentation, usually in order to characterize or classify them through the estimation of one or several compounds of interest. However, measurement conditions might induce various additive (baseline) or multiplicative effects on the collected signals, which may jeopardize the accuracy and generalizability of estimation models. A common way of dealing with such issues is signal normalization and in particular, when the baseline is constant, the standard normal variate (SNV) transform. Despite its efficiency, SNV has important drawbacks, in terms of physical interpretation and robustness of estimation models, because all the variables are considered equally, independently of what their actual relationship with the response(s) of interest is. In the present study, a novel algorithm is proposed, named variable sorting for normalization (VSN). This algorithm automatically produces, for a given set of multivariate signals, a weighting function favoring signal variables that are only impacted by additive and multiplicative effects, and not by the response(s) of interest. When introduced in SNV preprocessing, this weighting function significantly improves signal shape and model interpretation. Moreover, VSN can be successfully used not only with constant but also with more complex baselines, such as polynomial ones. Together with the description of the theory behind VSN, its application to various synthetic multivariate data, as well as to real SWIR spectral data, is presented and discussed.

    Updated: 2019-10-25
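As a point of reference for the VSN abstract above, the plain SNV transform it builds on can be sketched in a few lines. This is only the classical, unweighted SNV (each spectrum is centered on its own mean and scaled by its own standard deviation); the VSN weighting function itself is not reproduced here. The function name `snv` and the use of the sample standard deviation (n - 1 denominator) are assumptions made for illustration.

```python
import math


def snv(spectrum):
    """Classical standard normal variate transform of one spectrum.

    Illustrative sketch only: subtracts the spectrum's own mean and divides
    by its own sample standard deviation (n - 1 denominator, an assumed
    convention). The VSN weighting described in the abstract is NOT applied.
    """
    n = len(spectrum)
    if n < 2:
        raise ValueError("need at least two variables")
    mean = sum(spectrum) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in spectrum) / (n - 1))
    if std == 0.0:
        raise ValueError("constant spectrum cannot be scaled")
    return [(x - mean) / std for x in spectrum]


# A spectrum and a copy distorted by a constant baseline (+5) and a
# multiplicative effect (x3) give the same SNV-corrected signal.
raw = [1.0, 2.0, 4.0, 7.0]
distorted = [3.0 * x + 5.0 for x in raw]
corrected_raw = snv(raw)
corrected_distorted = snv(distorted)
```

Because every variable contributes equally to the mean and standard deviation, a strong analyte peak also shifts the correction, which is the drawback the abstract attributes to SNV.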
  • Trilinear self‐modeling curve resolution using Borgen‐Rajkó plot
    J. Chemometr. (IF 1.847) Pub Date : 2019-07-22
    Nematollah Omidikia, Hamid Abdollahi, Mohsen Kompany‐Zareh, Róbert Rajkó

    Modern analytical instruments provide measurement data arrays full of hidden and redundant information. Multivariate curve resolution (MCR) techniques decompose a data set into physicochemically meaningful abstract profiles. On the other hand, for such data matrices, Borgen‐Rajkó self‐modeling curve resolution (SMCR) techniques reveal all possible solutions analytically under the minimal assumption. Although Lawton‐Sylvestre (LS) and Borgen methods have been proposed for the non‐negative curve resolution of two‐component and three‐component systems, there is still a great deal of interest in including further restrictions in the Borgen‐Rajkó SMCR. As modern hyphenated analytical instruments produce multiway (eg, three‐way) arrays, multiway analysis (eg, trilinear decomposition) has gained much more popularity among chemists.

    Updated: 2019-10-25
  • Simulation of 1/f^α noise for analytical measurements
    J. Chemometr. (IF 1.847) Pub Date : 2019-06-20
    Stephen Driscoll, Michael Dowd, Peter D. Wentzell

    A simple procedure is described that can be used to generate 1/f^α noise, also known as power law noise, in simulated analytical measurement vectors. Certain types of power law noise, such as pink noise (α = 1), dominate many types of analytical signals, so its simulation is important in optimizing data processing strategies. In this work, simulated 1/f^α error sequences are created directly from white noise via the theoretical measurement error covariance matrix (ECM) by rotation and scaling. The 1/f^α ECM is obtained from the coefficients of a finite impulse response filter and is easily adapted to generate multiplicative 1/f^α noise that is probably more common for analytical systems exhibiting proportional noise characteristics. Simulating 1/f^α noise directly from the ECM offers two main advantages. First, 1/f^α noise can be easily simulated in the presence of other common analytical measurement errors by additive combination of the ECMs. Second, the theoretical ECM can be used to model real experimental measurement noise. It is shown that the power spectral density function of measurement error sequences generated by the proposed method closely approximates the theoretical behaviour of 1/f^α noise. To demonstrate the utility of this method in evaluating data processing methods, simulated data exhibiting 1/f (pink) noise is analyzed by maximum likelihood principal component analysis (MLPCA) that takes measurement error structure into account, and baseline noise is simulated using brown noise to test baseline fitting by asymmetric least squares (AsLS).

    Updated: 2019-10-25
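The link between a finite impulse response (FIR) filter, the error covariance matrix, and a simulated 1/f^α sequence described in the abstract above can be illustrated with a rough sketch. It assumes a commonly used recurrence for the filter coefficients (h_0 = 1, h_k = h_(k-1) · (k - 1 + α/2) / k), which may differ from the coefficients used in the paper. It also generates the noise by filtering white noise directly with the lower-triangular filter matrix H, rather than by the rotation and scaling of the ECM that the abstract describes; both routes give sequences with covariance H·Hᵀ. All function names are hypothetical.

```python
import random


def fir_coefficients(alpha, n):
    """First n FIR coefficients for 1/f^alpha noise (assumed recurrence)."""
    h = [1.0]
    for k in range(1, n):
        h.append(h[-1] * (k - 1 + alpha / 2.0) / k)
    return h


def error_covariance(alpha, n):
    """Theoretical ECM for unit-variance white noise passed through the
    filter: Sigma = H H^T, with H the lower-triangular Toeplitz matrix of h."""
    h = fir_coefficients(alpha, n)
    return [
        [sum(h[i - k] * h[j - k] for k in range(min(i, j) + 1)) for j in range(n)]
        for i in range(n)
    ]


def power_law_noise(alpha, n, rng):
    """One simulated error sequence e = H w, with w standard normal white noise."""
    h = fir_coefficients(alpha, n)
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(h[i - k] * white[k] for k in range(i + 1)) for i in range(n)]


pink = power_law_noise(1.0, 256, random.Random(0))   # alpha = 1, pink noise
brown_ecm = error_covariance(2.0, 3)                 # alpha = 2, random walk
```

Two limiting cases make the sketch easy to check: α = 0 leaves the white noise unchanged, and α = 2 turns the filter into a running sum, whose covariance is that of a random walk. Since ECMs of independent error sources add, a combined noise model would simply sum the corresponding matrices, as the abstract notes.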
  • Four‐dimensional quantitative structure‐activity analysis of 1,4‐naphthoquinone derivatives tested against HL‐60 human promyelocytic leukemia cells
    J. Chemometr. (IF 1.847) Pub Date : 2019-04-23
    Maria Cristina A. Costa, Pedro O. Mariz Carvalho, Márcia M. C. Ferreira

    Four‐dimensional quantitative structure‐activity relationship (4D‐QSAR) models were developed to predict the biological activity of 1,4‐naphthoquinone derivatives tested against human HL‐60 leukemic cells, in order to better investigate the mode of action of these compounds. Quinones can generate reactive oxygen species (ROS) through activation by the cytochrome P450 and P450 reductase enzymes, acting as anticancer agents. Molecular dynamics (MD) simulations were performed on the 3D optimized geometries, and the field descriptors were calculated. The partial least squares (PLS) regression method was applied to build the QSAR model, which presented the following statistics, with two factors and explaining 51.11% of total variance: R² = 0.83; SEC = 0.28; Q² = 0.77; SEV = 0.31. For external validation, the results were R²pred = 0.76 and SEP = 0.30. Among the nine Coulomb (C) and Lennard‐Jones (LJ) descriptors selected by the model, one of them, C13838, is located close to the quinone oxygens involved in the production of radical anions (O2•−) and to the hydroxyl in position 5 that may stabilize catalytically important water molecules. The negative LJ descriptors around the R1 and R2 substituents might indicate that apolar substituents in these regions are unfavorable to the activity. Coulomb descriptors located in the vicinity of the substituent R2 gave information about the bioactive conformation.

    Updated: 2019-10-25
  • Spatial‐spectral analysis method using texture features combined with PCA for information extraction in hyperspectral images
    J. Chemometr. (IF 1.847) Pub Date : 2019-04-23
    Jun‐Li Xu, Aoife A. Gowen

    This work proposes a new method to treat spatial and spectral information interactively. The method extracts spatial features, ie, variogram, gray‐level co‐occurrence matrix (GLCM), histograms of oriented gradients (HOG), and local binary pattern (LBP) features, from each wavelength image of the hypercube, and principal component analysis (PCA) is applied to this spatial feature matrix to identify wavelength‐dependent variation in spatial patterns. The resultant image is obtained by projecting the score values onto the original data. Three datasets, including a synthetic hyperspectral image (Dataset 1), a set of real hyperspectral images of salmon fillets (Dataset 2), and remote‐sensing images (Dataset 3), were utilized to evaluate the performance of the proposed method. Results from Dataset 1 showed that the spatial‐spectral methods had the potential to reduce baseline offset noise. Dataset 2 revealed that spatial‐spectral methods can alleviate noisy pixels with strong signal and reduce shadow effects. In addition, substantial improvements were obtained in the classification between white stripe and red muscle pixels by using the HOG‐based approach, with a correct classification rate (CCR) of 0.97 compared with the models directly built from raw and standard normal variate (SNV) preprocessed spectra (CCR = 0.94). The Samson image of Dataset 3 demonstrated the flexibility and effectiveness of the proposed method, with the CCR improving from 0.96 using conventional PCA on SNV pretreated spectra to 0.98 using the GLCM‐based approach on SNV preprocessed spectra. Overall, experimental results demonstrated that the spatial‐spectral methods can improve the results found by using the spectral information alone because of the spatial information provided.

    Updated: 2019-10-25
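To make one of the spatial features named in the abstract above concrete, the following sketch computes a gray-level co-occurrence matrix (GLCM) and its contrast for a single small image. It covers only the feature-extraction step for one wavelength image; in the described method such features would be gathered for every wavelength image and then passed to PCA, which is not shown. The function names, the horizontal neighbor offset, and the choice of contrast as the summary statistic are assumptions for illustration.

```python
def glcm(image, levels, dr=0, dc=1):
    """Count co-occurrences of gray levels (i, j) for pixel pairs separated
    by the offset (dr, dc). The default offset is the right-hand neighbor."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts


def glcm_contrast(counts):
    """Contrast of the normalized GLCM: sum of p(i, j) * (i - j)^2."""
    total = sum(sum(row) for row in counts)
    size = len(counts)
    weighted = sum(
        counts[i][j] * (i - j) ** 2 for i in range(size) for j in range(size)
    )
    return weighted / total


# A 4 x 4 image quantized to four gray levels (0..3).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
counts = glcm(image, levels=4)
contrast = glcm_contrast(counts)
```

A smooth image concentrates counts on the diagonal of the matrix and yields low contrast, so tracking such a statistic across wavelengths gives one row of the spatial feature matrix per band.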
  • Second‐order multivariate calibration with the extended bilinear model: Effect of initialization, constraints, and composition of the calibration set on the extent of rotational ambiguity
    J. Chemometr. (IF 1.847) Pub Date : 2019-04-15
    Alejandro C. Olivieri

    Extended bilinear modeling is popular in second‐order multivariate calibration, particularly when the matrix data for each sample are of chromatographic origin. Since elution time profiles vary across samples, in both shape and peak position, it is not possible to process these data in a three‐way trilinear format. In these cases, the most successful model for quantitating analytes in the presence of interferents is multivariate curve resolution‐alternating least squares (MCR‐ALS) in its extended version, ie, processing an augmented data matrix built with the matrices for a test sample and the calibration samples, appended in the direction of the elution time mode. MCR‐ALS starts with certain initial profiles and applies a set of natural constraints during the ALS phase, whose purpose is to reduce the range of feasible solutions or to lead to unique bilinear solutions if possible. In this report, a simulated second‐order three‐component system (two calibrated analytes and one uncalibrated interferent in test samples) is studied regarding the presence of rotational ambiguity in the bilinear solutions, using (a) a grid search methodology to compute the feasible solutions and (b) MCR‐ALS on a large set of test samples to estimate the average prediction errors. Various initialization schemes and constraints are probed, and the results are compared in terms of the extent of rotational ambiguity and global uncertainty in predicted concentrations.

    Updated: 2019-10-25
  • Online discrimination of chemical substances using standoff laser‐induced fluorescence signals
    J. Chemometr. (IF 1.847) Pub Date : 2019-04-09
    Marian Kraus, Florian Gebert, Arne Walter, Carsten Pargmann, Frank Duschek

    Chemical contamination of objects and surfaces, caused by accident or on purpose, is a common security issue. Immediate countermeasures depend on the class of risk and consequently on the characteristics of the substances. Laser‐based standoff detection techniques can help to provide information about the threat without direct human contact with the hazardous materials. This article explains a data acquisition and classification procedure for laser‐induced fluorescence spectra of several chemical agents. The substances are excited from a distance of 3.5 m by laser pulses of two UV wavelengths (266 and 355 nm) with less than 0.1 mJ per laser pulse and a repetition rate of 100 Hz. Each pair of simultaneously emitted laser pulses is separated using an optical delay line. Every measurement consists of a dataset of 100 spectra per wavelength containing the signal intensities in the spectral range from 250 to 680 nm, recorded by a 32‐channel photomultiplier tube array. Based on this dataset, three classification algorithms are trained which can distinguish the samples by their single spectra with an accuracy of over 98%. These predictive models, generated with decision trees, support vector machines, and neural networks, can identify all agents (eg, benzaldehyde, isoproturon, and piperine) within the current set of substances.

    Updated: 2019-10-25
  • SO‐CovSel: A novel method for variable selection in a multiblock framework
    J. Chemometr. (IF 1.847) Pub Date : 2019-03-25
    Alessandra Biancolillo, Federico Marini, Jean‐Michel Roger

    With the development of technology and the relatively higher availability of new instrumentation, having multiblock data sets (eg, a set of samples analyzed by different analytical techniques) is becoming more and more common and, as a consequence, how to handle this kind of data is a widely discussed topic. In such a context, where the number of involved variables is relatively high, selecting the most significant features is obviously relevant. For this reason, the possibility of joining a multiblock regression method, the sequential and orthogonalized partial least‐squares (SO‐PLS), with a variable selection approach called covariance selection (CovSel), has been investigated. The resulting method, sequential and orthogonalized covariance selection (SO‐CovSel), is similar to SO‐PLS, but the feature reduction provided by PLS is performed by CovSel. Finally, predictions are made by applying multiple linear regression on the subset of selected variables. The novel approach has been tested on different multiblock data sets both in regression and in classification (by combination with LDA), and it has been compared with another state‐of‐the‐art multiblock method. SO‐CovSel has proved to be suitable for its purpose: It has provided good predictions (both in regression and in classification) and, from the interpretation point of view, it has led to a meaningful selection of the original variables.

    Updated: 2019-10-25
  • Localized and adaptive soft sensor based on an extreme learning machine with automated self‐correction strategies
    J. Chemometr. (IF 1.847) Pub Date : 2018-10-24
    Dominic V. Poerio, Steven D. Brown

    A novel, nonlinear soft sensor based on a localized, adaptive single‐layer feedforward neural network with random hidden layer weights, also called an extreme learning machine, combined with the recursive partial least squares algorithm to update the linear output layer weights, is explored. The soft sensor is highly adaptive with minimal operator input, and automated mechanisms are included to self‐correct numerous aspects of the underlying model. For instance, mechanisms are put in place to automatically select an optimized local model region describing the current process dynamics from the historical data when the current prediction error reaches an adaptively computed threshold. Additionally, the new soft sensor simultaneously employs an ensemble of models with diverse recursive partial least squares forgetting factors with automated and adaptive reweighting of the models in the ensemble, thus enabling real‐time model memory adjustment. The validity of the method is shown by comparison with numerous other soft sensor methods for the prediction of the activity of a polymerization catalyst.

    Updated: 2019-10-25
  • Issue Information
    J. Chemometr. (IF 1.847) Pub Date : 2019-10-20

    No abstract is available for this article.

    Updated: 2019-10-25
  • Iterative deflation algorithm, eigenvalue equations, and PLS2
    J. Chemometr. (IF 1.847) Pub Date : 2019-08-27
    Matteo Stocchero

    PLS2 is probably the most used algorithm to perform projection to latent structures regression in the case of multivariate response. However, several criticisms pointed to the theoretical limits of its original formulation, highlighting the need for a more robust foundation within the theory of regression analysis. The iterative deflation algorithm is here introduced as a starting point to obtain a family of regression methods, which includes PLS2, principal component regression (PCR), and elastic component regression (ECR), where different eigenvalue equations are used to calculate the weight vectors. Within this framework, an original portrait of PLS2 is drawn. The main mathematical properties useful to understand what PLS2 is and how PLS2 behaves are derived. A new regression method called iterative deflation algorithm‐based regression (IDAR) is introduced to describe the limit behaviour of PLS2, PCR, and ECR. The post‐transformation method is presented as a general property of the iterative deflation algorithm. Two data sets, one simulated and the other experimental, are investigated to illustrate the main properties of PLS2.

    Updated: 2019-10-25
  • Support vector machine regression on selected wavelength regions for quantitative analysis of caffeine in tea leaves by near infrared spectroscopy
    J. Chemometr. (IF 1.847) Pub Date : 2019-07-29
    Somdeb Chanda, Ajanto Kumar Hazarika, Navnil Choudhury, Sk Anarul Islam, Rishabh Manna, Santanu Sabhapondit, Bipan Tudu, Rajib Bandyopadhyay

    Caffeine is an important component that determines the quality of tea, and its rapid estimation is very much needed by the industry. In this pursuit, a near‐infrared (NIR) spectroscopy‐based technique for the estimation of caffeine is developed and presented in this paper. On the basis of the responses of the different bonds present in caffeine, four specific wavelength windows—(a) 1075 to 1239.5 nm (C―H stretch second overtone); (b) 1339.25 to 1440.75 nm (C―H stretch and C―H deformation); (c) 1640.25 to 1700 nm (C―H stretch first overtone, ═CH & ―CH3 asymmetric); and (d) 900 to 1700 nm (whole range of the spectrometer)—were analyzed in detail for model development and to obtain the effective wavelength (EW). Five different preprocessing techniques followed by two regression techniques—(a) partial least‐squares (PLS) and (b) support vector regression (SVR)—were implemented on raw data for analysis. Comparing all the models, the wavelength bands of 1075 to 1239.5 nm and 1339.25 to 1440.75 nm were found to produce satisfactory results. The best discrimination result was obtained using the combination of standard normal variate (SNV) preprocessing with SVR in the 1075 to 1239.5 nm wavelength region. The SVR regression with 105 samples in the training set and 15 samples in the testing set gave the performance parameters RMSECV = 0.134, RMSEP = 0.069, r²cv = 0.869, r²p = 0.65, and RPD = 5.626 at 1075 to 1239.5 nm, whereas the best PLS model gave RMSECV = 0.287, RMSEP = 0.077, r²cv = 0.637, r²p = 0.675, and RPD = 5.218 in the 1339.25 to 1440.75‐nm wavelength band.

    Updated: 2019-10-25
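The figures of merit quoted in the caffeine abstract above (RMSEP, RPD) can be illustrated with a short sketch. It assumes the common definitions RMSEP = sqrt(mean((y - ŷ)²)) over the test set and RPD = SD(y) / RMSEP with the sample standard deviation of the reference values; the paper may use slightly different conventions. The function names and the numbers below are invented for illustration and are not data from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def rpd(y_true, y_pred):
    """Ratio of performance to deviation: SD of the reference values
    (n - 1 denominator, an assumed convention) divided by the RMSE."""
    n = len(y_true)
    mean = sum(y_true) / n
    sd = math.sqrt(sum((t - mean) ** 2 for t in y_true) / (n - 1))
    return sd / rmse(y_true, y_pred)


# Made-up reference and predicted values, for illustration only.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
rmsep = rmse(reference, predicted)
ratio = rpd(reference, predicted)
```

Because RPD scales the prediction error by the spread of the reference values, it allows models built on data sets with different concentration ranges to be compared more fairly than RMSEP alone.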
Contents have been reproduced by permission of the publishers.