Introduction

Glycosaminoglycans (GAGs) are linear, polydisperse carbohydrates that are ubiquitous among living cells, and are responsible for a multitude of biological interactions including cell signaling, energy generation, protein binding conformation changes, and molecular recognition [1,2,3,4]. Structurally, GAGs are composed of a repeating linear disaccharide backbone of a uronic sugar and amino sugar residue. Structural differentiation occurs based on three primary forms of modifications: O-sulfation, N-deacetylation/sulfation, and uronic sugar epimerization. Recent studies suggest that patterns of sulfation have a profound effect on protein binding [5, 6]. Moreover, analysis of GAGs released from proteoglycans bikunin and decorin suggests that naturally occurring glycans may contain a conserved sulfation motif that is independent of chain length [7, 8]. Modifications can be localized to specific glycan residues and further designated to specific positions with the respective use of glycosidic and cross-ring fragmentation that arise from ion activation using collisional or electron-based activation or photodissociation [9,10,11,12,13,14,15,16,17,18,19].

Sulfation and other structural modifications critically influence the biological activity of GAGs, and considerable effort has been made to develop new approaches for sequencing this class of molecules. The assignment of structures for GAGs requires a de novo approach in which the mass spectra must contain sufficient information to localize all the relevant features [20]. Such a method would determine GAG structure using a combination of assumed information (repeating disaccharide copolymer backbone, linkage pattern between sugars) and diagnostic and information-rich MS and MS2 fragments that can localize the position of SO3 and N-acetyl modifications. There are a number of methods that can be used to produce information-rich tandem mass spectra, using collisions, photodissociation, or electron-based activation methods. The density of fragment ions in these tandem mass spectra makes it difficult to interpret and assign structures in a high-throughput manner without assistance from specialized software. Manual interpretation of tandem MS of GAGs is possible but requires considerable time and expertise. While the interpretation of known structures can be significantly improved with mass calculation tools such as GlycoWorkbench [21], the feasibility of interpreting unknown GAG structures diminishes greatly with an increasing degree of polymerization (dp), as the number of permutations scales as n-choose-k, where n is number of possible modification sites (up to 4 per disaccharide) and k is the number of modifications.

Previous work from our laboratory proposes software for a database-independent method of automated GAG tandem MS interpretation using a genetic algorithm [20]. The approach is an effort to replicate the evolutionary cycle and determination of the most optimum structure using a survival-of-the-fittest mechanism. Theoretical structures are represented as genes and tested against the MS2—the closeness of the match between theoretical fragments and the experimental data dictates the fitness of the gene. The best genes (two or more) that best survive (i.e., structures that would have theoretical fragments that most closely match the data) breed, passing certain structural characteristics onto a new generation. Each generation has its own set of winners, those most accurately matching the data, and continues this process until the best candidates no longer show signs of change.

Our earlier publication provided proof of principle for this approach using full-length bikunin GAGs (up to dp 43 in length) but only examined glycosidic fragments [7], as enzymatic analysis has determined that there was only one type of modified disaccharide present in these samples. Most glycosaminoglycans are far more structurally diverse. Modification heterogeneity and the potential for multiple sulfo-modifications to occur on the same sugar residue dictate that we not only incorporate cross-ring fragmentation into our scoring model but also find and utilize structurally diagnostic fragments or fragment sets that can yield unambiguous assignments of modification position. The approach for automated assignment here begins at the MS1 stage, using a set of known masses for composition assignment: amino sugar, uronic sugar, and delta-uronic sugar (maximum of 1) masses serve as fundamental building blocks. A network of potential masses is generated by propagation of the repeating uronic and amino sugar copolymer with variable amounts of SO3 and acetyl modifications being added. Assignment of composition is based on accurate mass, similar to an approach presented by Hogan et al. [22]. Our structural characterization technique is based on an optimization framework that generates fragments from randomly generated candidate structures in silico, and then optimizes structural possibilities with a genetic algorithm. This relies on a robust scoring algorithm to support the survival of the fittest mechanism, in which candidate structures are fragmented in silico, and the products are compared to the experimental mass spectrum. Here, we explain in detail the development of the scoring method that enables this approach. This technique is considerably different from the previous work of Hu et al. [23], who have proposed a divide-and-conquer strategy that determines GAG structures by partitioning the GAG sequence into smaller parts. The approach combines and builds upon these smaller pieces of information to determine overall sequence. Our genetic algorithm approach examines a theoretical GAG structure in its entirety instead of building a structure from the ground up. The labile SO3 modifications of GAGs are prone to being lost during ion activation, thus depleting some of the fragment precursor, and can confound structure-building algorithms that attempt to assemble sequence-based pieces from the whole. Also, as shown previously [20], this technique can also highlight the possibility of mixtures and/or additional GAG components in the GAGs of proteoglycans based on conflicting fragment data. Hence, our focus is more concerned with (1) does the tandem mass spectrum unambiguously explain characteristics of a particular structure on the whole (instead of its parts) and (2) is there a possible structure of appropriate composition that can more accurately match the spectrum?

Experimental

Mass Spectrometry

Electron detachment dissociation (EDD) experiments were performed using a 9.4-T Bruker Apex Ultra QeFTMS (Billerica, MA), with an indirectly heated hollow cathode (HeatWave, Watsonville, CA) for generating EDD electrons. The solutions were ionized using flow nanoelectrospray (pulled fused silica tip FS360-75-15-N-20). The solutions were made at a concentration of 0.2 mg/mL in 50:50 methanol:H2O. Hexasaccharide solutions were injected at a rate of 25 μL/h. The solutions were run in negative ion mode. Precursor ions were selected in the external quadrupole accumulated for 4 s in an RF-only hexapole before injection into the Fourier transform ion cyclotron resonance mass spectrometer (FT-ICR MS). Precursor isolation was refined by using in-cell isolation with coherent harmonic excitation frequency (CHEF). The isolation power of the CHEF event was 20%. For irradiation of electrons, the cathode heater was set to 1.6 A and the bias was set to − 19 V. The extraction lens was set to − 19.2 ± 0.2 V. Ions were irradiated for 1 s. Sixty-four acquisitions were signal averaged per mass spectrum. Internal calibration produced a mass accuracy of 5 ppm.

Software

MS1 analysis of parent ion mass is performed using a composition assignment software module written in the MATLAB coding environment. Monoisotopic peaks and charge states are acquired from Bruker DataAnalysis (using the FTMS peak-picking method) and deconvoluted to a neutral mass. A composition is derived from one or more neutral masses by searching an automatically generated data matrix of possible chain lengths, degrees of sulfation, deacetylation, and sodium/hydrogen exchange. The user input also includes the possibility of reducing end modifications, and non-reducing ends that can terminate in unsaturated uronic acids, as is common in enzymatically produced GAG oligomers. Theoretical neutral masses from the automated data matrix are compared against user-specified masses with a user-defined mass tolerance. The sequences that match are then used for performing the MS2 analysis.

For MS2 analysis, the software uses a genetic algorithm search heuristic alongside binary vector representation of glycan structures where on-bits denote an occupied site of SO3 and N-acetylation modifications. The first step generates two glycan structures at random that fit the expected composition (initialization step) and then proceeds to breed these structures into a new generation of candidates (crossover step). The primary three steps (crossover, mutation, and fitness) are iterated until the maximum fitness value does not change after numerous cycles. The number of iterations required before termination of the algorithm can be defined by the user but is defaulted at a value of 3. The structures containing the highest scores are then examined using additional data interpretation tools that assign fragment peak masses alongside their charge, intensity, and mass error (in ppm).

Software is compatible for standard desktop computers, with a minimum requirement of MATLAB R2014 coding environment or newer, 2.4 GHz processor and 4 GB RAM. Customization specific to GAG family [chondroitin sulfate and dermatan sulfate (CS/DS), heparin and heparan sulfate (Hp/HS), and Arixtra’s synthetic backbone and connectivity] is available using specific functions for determination of fragment masses. Additional parameters are adjustable from the command line if desired (maximum charge state of fragments, ppm mass error tolerance, maximum possible Na-H exchange, neutral loss considerations for H2O and CO2). MATLAB source code and raw mass spectral data used for testing and validation in this manuscript are available upon request.

Results and Discussion

Software Architecture

Previous work from our laboratory presented an automated and high-throughput method for characterizing GAG structure of unknown samples from tandem MS using a database-independent method that generates theoretical fragments in silico and then proceeds to optimize potential structures using a genetic algorithm [20]. The workflow is shown in Figure 1. The user provides tandem MS data in the form of a two-column comma-separated value file with masses (including isotopes) sorted ascending in column 1 and intensities (either relative or arbitrary units) in column 2 as well as additional inputs of the parent ion mass and charge of the precursor ion. An independent module calculates the composition, including the degree of polymerization, the number of SO3 groups, and the number of acetyl groups, within a user-defined ppm tolerance window. Once completed, the software automatically employs a genetic algorithm optimization model to determine the most likely structure(s) based on the data provided.

Figure 1
figure 1

The standard workflow for our current GAG identification software. The user provides the information in the green box. Blue boxes are connected module and fully automated and require no user supervision

The optimization procedure generates a list of candidate structures of appropriate composition, representing them as a binary number, as shown in Figure 2. The automate procedure then determines the quality of the match of the candidates’ predicted fragmentation products with the experimental data. An essential requirement for this process is a robust scoring algorithm that can discriminate isomeric structures by their fit to the experimental data. Each iteration of the genetic algorithm attempts to improve upon the list of candidate structures while eliminating low scoring structures from consideration. This paper focuses on the development of the scoring algorithm that can be applied to any GAG family (heparin, heparan sulfate, chondroitin sulfate, dermatan sulfate) and to any chain length. The focus is on the methodology used to compare computer-generated candidate structures against experimental data and to show the statistical merit of the scoring paradigm.

Figure 2
figure 2

Bit-wise matrix representation of disaccharide CS/DS and Hp/HS GAG families used in software for rapid analysis via a genetic algorithm. Multiple disaccharide units can be combined to fit the appropriate chain length and composition. In the case of CS/DS, bit 5 represents a rare and unlikely circumstance where a free amine could exist. The default case is that bit 5 is removed, and the occupancy of bit 4 dictates either N-SO3 (when bit 4 = 1) or N-acetylation (when bit 4 = 0)

Scoring Algorithm

A fundamental feature that allows this software to operate independent of a database is that it considers all possible isomers, constrained by GAG family and by composition (dp, the degree of sulfation, and the number of acetylated amine groups). In order to allow all isomers to be considered, without having to score every possibility, the genetic algorithm optimizes the match of structures to the experimental data. Using this optimization tool allows scoring on a small subset (1% or less) of all possible permutations of a given composition while finding the correct structure. Within this optimization step is a series of comparisons between candidate structures and the experimental data—each iteration of the genetic algorithm attempts to find a more closely matched candidate structure based on its interpretation of experimental data; the fitness score of a candidate structure is the value used as a measure of closeness. Hence, the paradigm for how the software scores candidate structures must be discussed.

A 2-step system is utilized in which an initial score is given to structures based on glycosidic fragments alone and then a refined score is assigned based on the number of cross-ring fragments that can be used to unambiguously define a specific modification location. The first step determines the number and type of modifications that are located on a particular residue. For CS/DS GAG families, we consider the possibility of 2-O-sulfation on the hexuronic residue and 4-O- and 6-O-sulfation on the N-acetyl galactosamine. Optionally, more uncommon CS modifications such as N-sulfation or a free amine group can be considered, in which case a new bit is introduced to the binary representation of the GAG candidate oligomers. As this modification is exceedingly rare for naturally occurring CS and would decrease overall software performance calculations, it is by default not used unless specified by the user. For Hp/HS GAG families, we consider the possibility of 2-O-sulfation on hexuronic residues and 3-O-, 6-O-, and N-sulfation on glucosamine residues. Additionally, de-acetylated glucosamines are also an optional feature.

Glycosidic fragments provide the composition of each residue in the GAG oligomer, greatly reducing the search space for complete structural assignment. Each sugar residue is assigned a score of 0 to 4 (called GlycScore or SG), ranked based on the number of glycosidic fragments that surround the sugar residue as well as what information these fragments provide. Figure 3 provides the criterion for assigning a score to a residue: a GlycScore of 4 indicates that glycosidic cleavages are observed to be pointing directly to the specific residue and come from both the reducing and non-reducing ends. On the other hand, a low-value GlycScore indicates only a few or no fragments that are supportive of a particular residue. Assuming high-quality tandem MS data with information-rich spectra, one can expect that the majority of glycosidic fragments are present. When using a genetic algorithm optimization method, candidate structures that lack a large fraction (> 50%) of glycosidic fragment matches are evaluated as unlikely matches and other structures with similar characteristics are unlikely to be examined.

Figure 3
figure 3

A simple paradigm for assigning GlycScore (SG) to a residue. Each residue’s SG is concerned with the set of non-reducing and reducing end fragments that point to it. For residue n, Bn, Cn, Ymax-n + 1, and Zmax-n + 1 (shown with black arrows) are the fragments considered to be part of SGn. Fragments in blue contribute to the SG of the residue adjacent from the non-reducing end. Likewise, red fragments contribute to the SG of the residue adjacent from the reducing end

The ranking of structures and their respective step 1 score (S1) is reliant not solely on the GlycScore (SG) but also on the intensities of fragment matches as shown in Eq. (1), where In is the intensity of the 4 fragments (Bn, Cn, Ymax-n + 1, Zmax-n + 1) that point directly to the residue, n, in question

$$ S{1}_n={SG}_n\times \sum \limits_{m=1}^4{I}_m $$
(1)

The GlycScore by itself will always yield an integer value and is prone to problematic artifacts if used as the sole measurement for ranking structures as multiple structures can have the exactly the same GlycScore if the number of glycosidic fragment matches is equal. The commonly occurring neutral loss of SO3 in the tandem mass spectrum exacerbates the problem, as the resulting fragments match a loss-free glycosidic fragment when the genetic algorithm presents a candidate structure that is under-sulfated in a specific region compared to the true structure. The solution is to consider the intensity of matched glycosidic fragments (Im) as a scaling factor to SG. Previous works on GAG tandem MS for multiple ion activation techniques including CID, EDD, and negative electron transfer dissociation (NETD) show that the intensity of SO3 loss peaks is typically lower in abundance than the non-loss peak [11, 12, 16,17,18]. Figure 4 shows an example of the assignment of SG and S1 for a hexasaccharide GAG with 6 sulfo-modifications.

Figure 4
figure 4

(a) Annotated structure of synthetic hexasaccharide sample with 6 SO3 groups; fragments were produced by electron detachment dissociation (EDD). (b) Annotated EDD spectrum of structure from (a). (c) Calculated SG, Im, and S1 values for each residue n. Note that in the table, the sum of all intensities for relevant fragment peaks is shown

One challenge in assigning glycosidic fragments occurs when there are isobaric possibilities for product ions from symmetrically modified oligomers. This happens for c and z ions from a precursor with a delta-uronic acid at the non-reducing end and no derivatization of the reducing end. Isobaric fragment ions are difficult to assign either by manual interpretation or by automated glycan analysis—our software treats isobaric possibilities as both types of fragment ions. This problem is mitigated for GAG samples where the reducing end has been modified in some capacity to create an unambiguous mass difference. For example, in the case of full-length glycans isolated from proteoglycans, a linker region is present on the reducing end, and this breaks the symmetry of the structure and eliminates the possibility of isobaric c/z ions [7, 8]. Some depolymerization procedures produce a non-reducing end that is not a delta-uronic acid, and these also avoid isobaric c/z fragments.

The second step of our algorithm uses a heuristic based on a priori knowledge of GAG chemical structure (Figure 4) to determine the position of SO3 modifications on sugar residues for the top scoring structures from step 1 (by default, the top 3 structures). At this point in the assignment process, the structure search space is greatly constrained from the glycosidic fragment matches. At this point, it is possible to examine every glycan residue individually as there is a narrow selection of permutations based on the known biosynthetic pathways for modification. For hexuronic acids, the presence of a sulfo-modification defaults to being at the 2-O position, as no other positions are known to be modified in these residues. For amino sugars, the possible sites of modification are N, 3-O, and 6-O for heparin or heparan sulfate GAGs, or 4-O or 6-O for chondroitin or dermatan sulfate. The code looks for diagnostic cross-ring fragments or fragment combinations that would be able to unambiguously assign sulfation positions. If no cross-ring fragments are available on a residue, the code will leave the positions on that residue as ambiguous. As with all de novo sequencing methods, the mass spectra must be information-rich that provide enough fragmentation data to assign all the features in order to make a definitive assignment. On the other hand, a situation can arise where multiple structures are produced from conflicting cross-ring assignments. In these cases, we rank structures based on the intensities of the diagnostic cross-ring fragments.

Intensity ratios between sets of fragments have been used to differentiate the presence of glucuronic versus iduronic acids in synthetic HS tetrasaccharides [24]. Furthermore, multivariate statistics have been used to identify diagnostic fragments for certain ion activation methods [11, 14, 25,26,27]. At the present level of knowledge, there is not sufficient knowledge to use these ratios as a definitive guide to determining all C-5 stereochemistry. The software is fully capable of applying an additional layer of analysis to determine the epimeric center once there is a fuller understanding of product ion intensities as a function of uronic acid stereochemistry. Much like the method proposed in step 2, a residue-by-residue analysis of the uronic sugars could be applied as a 3rd step without a significant increase in analysis time, and this capability is envisioned in a future release.

Validation of Algorithm

In order to create a fully automated GAG interpretation software program, the scoring algorithm must be (1) unsupervised and user-independent and (2) assign the correct glycan structure. Much of the glycan analysis currently published in the literature uses software packages such as GlycoWorkbench [21] or similar GAG fragment calculation tools in combination with user intuition or experience to interpret the mass spectrum. Careful examination of peaks in the mass spectrum with manual supervision allows an expert to determine the likelihood of false positive while reaffirming structural features. This in-depth yet subjective form of glycan interpretation that relies on user expertise is difficult to automate and impractical for high-throughput analysis.

Assigning structure from tandem MS fragment peaks in an automated fashion requires some degree of assumption. For our software, we assume glycosidic and cross-ring fragments increase the validity of certain structural features and that enough of them will give the highest score to the most valid structure. However, the quality of the scoring system is judged not on its theoretical foundation but purely by its ability to assign the correct sequence and differentiate it from incorrect ones. We examine the quality of our scoring system with a well-accepted statistical methodology, namely by determining the likelihood that a score is given due to random chance, that is the expectation value, a statistic that has been widely applied in bioinformatics [28,29,30]. If x is the score of a particular spectrum, S, a survival function, s(x), for a discrete score probability distribution, p(x), can be defined [31]

$$ {s}_{j(x)}=\Pr \left(X>x\right)={\sum}_{i=j(x)}^{\infty }{p}_i $$
(2)

where Pr(X > x) is defined as the probability that the spectrum’s score will be higher than score x due to random matching within a defined database, D. For GAGs, the defined database is all possible permutations of the composition of the sample of spectrum, S. The expectation value e(x) can be interpreted as the number of GAG structures that would be expected to have scores of at least x.

$$ {e}_{j(x)}=n\times {s}_{j(x)} $$
(3)

where n is the number of sequences scored. The expectation value can be interpreted as follows: if a score x of expectation value e(x) = y, one would have a score of at least that value for the y number of times for every replicate experiment. A lower expectation value is therefore more ideal. For example, an e(x) value of 0.001 suggests that an experiment must be replicated 1000 times before a score of x could be obtained by random chance.

This technique has been used previously for analysis of the scoring system of peptide MS [32] while using a database search engine [33]; we apply the same fundamental principles and calculations for our scoring system. To examine our GAG scoring system with this method, p(x) is determined by constructing a frequency histogram of all GAG structure scores. We take the tandem MS of a pure, single-component GAG for which we know the structure (Figure 6a) and score structures of appropriate composition against the experimental data. The structures being scored against are stochastically selected and not optimized with the genetic algorithm heuristic to prevent introduction of selection bias. Among these structures, we know that only one is considered “valid” while all others are termed “stochastic.” The probability p(x) can be determined by normalizing the discrete frequency f(x) of a structure by the number of sequences scored (N)

$$ {p}_i={f}_i/N $$
(4)

Figure 6 shows the frequency histogram of 20,000 structures scored using a Monte Carlo sampling method for a synthetically produced heparan sulfate hexasaccharide containing 8 SO3 groups. The green asterisk in the histogram represents the score associated with the valid GAG structure. Visual inspection of the histogram shows that the right-most point of the histogram contains the structure of the highest score, x*, and is the valid structure for this dataset. The confidence of this score increases with respect to the distance between x* and all other scores (the scores of stochastic structures). Hence, a gap between x* and the bulk majority of other scores is highly desired and, moreover, the difference between x* and the next highest scoring structure is important for evaluating the algorithm’s ability to discriminate similar structures. It should also be noted that the frequency of the highest score x* is the lowest in the set, implying that x* is likely a unique value observed only when matched with the valid structure.

Given how our scoring system is partially dependent on intensity while also working under some assumptions regarding the expected fragmentation results from ion activation of GAGs, the numerical difference between two scores is not an easily interpretable measurement of the degree of difference between two structures of those respective scores. Moreover, the individual score of any structure (valid or stochastic) has little interpretability and does not serve as a good measure of fitness. A much more sophisticated estimate for confidence can be determined from the survival function (Figure 5). Score x* for our hexasaccharide has a value of s(x*) = 5.59E−16; application of Eq. (3) yields an e(x*) = 1.96E 12. This expectation value indicates that an experiment would have to be repeated approximately 1012 times before a score of x* would be matched to a structure due to random chance—a figure of merit that reflects positively on our GAG scoring algorithm.

Figure 5
figure 5

Once the number of SO3 on a residue is confined to a specific residue, the software searches for cross-ring fragments to determine position. In situations where a mixture might be present, structures are ranked based on the cumulative intensity of diagnostic fragments. Blue arrows indicate positions that can be assigned with a specific cross-ring fragment or a combination of 2 cross-ring fragments with the assumption that they come from the same end (reducing or non-reducing). A red arrow indicates assignments that can be made if the two indicated cross-ring fragments are from different ends (i.e., an A fragment and an X fragment)

Additional histograms and survival function diagrams for other GAG compositions of both experimentally and synthetically generated datasets for CS/DS and Hp/HS GAG families were also calculated and can be found in the supplemental information. The expectation values for the scores of valid structures (x*) using our scoring algorithm across various chain lengths, GAG families, and different degrees of modification suggest that our method can be applied to a wide variety of GAG tandem MS.

High-Throughput Applications

Previously published work from our laboratory proposed a high-throughput model for determining GAG structure using a genetic algorithm optimization technique [20]. While Monte Carlo simulations are performed at random and with no additional supervision, integration of our scoring method within our high-throughput platform inherently creates a bias during the optimization cycles. A genetic algorithm is modeled around a survival-of-the-fittest mechanism that mimics the characteristics of evolution. Thus, it retains information through each iteration of the algorithm—the life cycles of potential structural candidates are heavily influenced by the crossover and mutation steps that occur and create a skewed, top-heavy distribution when plotted on a histogram.

Figure 7 shows a histogram of the mean score per iteration for 8325 optimization iterations in a genetic algorithm. Compared to the histogram in Figure 6, the genetic algorithm spends most of its time refining structures that are within the upper 40th percentile of all possible structures. The magenta-shaded bars in Figure 7 reflect scores of structures being observed in the final 20% of the genetic algorithm, showing that the final iterations of the genetic algorithm tend to heavily bias high-scoring structural features, averaging in the 70th percentile or greater. Likewise, the cyan histogram shows the scoring range for the final 10% of the genetic algorithm. Iterations in which lower-scoring structures are observed are a by-product of heavily mutated candidates with numerous incorrect features with frequency of observation diminishing as the algorithm progresses.

Figure 6
figure 6

(a) The hexasaccharide structure used for calculating e(x) and s(x). Tandem MS was performed using EDD. (b) A histogram of 20,000 structures scored using the Monte Carlo method. Asterisk represents the score that is associated with the structure shown in (a). Note that the scoring algorithm assigns the valid structure the highest score. (c) The survival function of the scoring algorithm plotted versus the score. An e(x) value of 1.96E−12 is calculated for the valid structure

Figure 7
figure 7

(a) The distribution of mean scores per iteration for 8235 iterations within a genetic algorithm for synthesized heparan sulfate hexasaccharide containing 8 SO3 modifications. Blue bars indicate that the majority of the genetic algorithm is optimizing structures above the 40th percentile of all possible structures. The magenta bars show the final 20% of a genetic algorithm life cycle and is even more top score-focused. Likewise, cyan bars show the final 10%. (b) The percent with which compositions converge upon the correct answer using our scoring system and optimized with the genetic algorithm search heuristic. Decasaccharide and dodecasaccharide data is synthetically produced in silico and yields perfect accuracy due to a lack of experimental artifacts, contaminants, and noise

These measurements are shown for 110 independent optimization cycles—that is, the genetic algorithm was applied 110 times to the same dataset for the hexasaccharide structure shown in Figure 6. Among these 110 cycles, 107 iterations converged upon the correct structure, yielding a 97.3% convergence rate for this hexasaccharide (DP6). For partially stochastic methods such as the genetic algorithm (the initiation step is stochastic), it is typically best to run multiple iterations to assure convergence upon a correct answer. Time benchmarks for performance of this algorithm have been reported [20], and it is computation and temporally inexpensive to iterate multiple time to secure convergence upon the correct answer. Figure 7b shows a breakdown of percentage convergence upon the correct structure for multiple hexasaccharides as well as synthetically generated data for decasaccharide (DP10) and dodecasaccharide (DP12) structures with intensity values randomly normalized between 0.1 and 0.9. It should be noted that synthetically generated data has no fail rate compared to experimental data, which can be attributed to the lack of any experimental noise peaks or other potential contaminants or artifacts that could influence the structural matching elements implemented in software. This speaks to the necessity of high-quality, informationally dense mass spectra for accurate structural assignment. As noted previously [20], the genetic algorithm is, by default, assigned to run in multiple cycles (default of 5, adjustable by the user) with new, randomly generate starting candidates at the beginning of each cycle.

Structural ambiguity is likely to arise when structurally informative peaks within the mass spectrum is absent. When applied to a genetic algorithm or other optimization techniques that involve scoring, the absence of structurally significant fragments leads to a lower maximum possible score for the experimental data. As a result, the highest-ranking structure is no longer one, unambiguous item but multiple possible elements. Figure 8 shows the increase in ambiguity when the score decreases as a result of missing pieces of structurally informative information. Reexamination of the histogram derived from Monte Carlo simulations shows that as we move lower in score, more possible structural options arise. Within our software under such circumstances, single structural convergence is no longer an option and, instead, multiple structural outputs will be the final result. Tandem MS using electron activation methods such as EDD [15,16,17,18, 24,25,26, 34,35,36,37] and NETD [13, 19] produces an abundant series of cross-ring fragments that allow for complete or near-complete assignment of structure. Works with collision-based threshold methods [12, 26, 27, 38] are also a valid option but do not always yield all necessary cross-ring fragments for unambiguous identification, especially in longer-chain proteoglycan GAGs [7, 8]. Thus, for highly pure components where complete structural characterization is desired, the data must contain the appropriate number of structurally diagnostic peaks.

Figure 8
figure 8

An increase in the number of possible structures for a mass spectrum occurs when the highest possible score is lower due to a lack of structurally diagnostic peaks. As the score decreases, the results reported by the software are no longer a single structure but multiple, equally valid structures

Conclusions

The Glycosaminoglycan-Unambiguous Identification Technology software package consists of multiple modules structured in a manner that (1) identifies GAG composition based on an accurate mass measurement within the MS1, (2) uses an in silico fragmentation calculation tool specific to a desired GAG family (HS/Hp or CS/DS) to consider potential glycosidic and cross-ring fragments, (3) applies the multilayered scoring system specific to GAGs as discussed in this manuscript, and (4) determines the optimal structure using a genetic algorithm optimization technique using multiple iterations to minimize the likelihood of converging upon local maxima. Any module or combination of modules is available to use upon request from the author. Conversion of this software platform from paid programming environments (MATLAB) to free alternatives is currently being processed.

This approach is successful for both moderately sulfated GAGs, such as those examined here, or more highly sulfated GAGs such as Arixtra. The success of this automated approach requires MS/MS datasets that contain the necessary amount and type of fragments needed for structural characterization. Much like any form of automated spectral interpretation, the approach is highly reliant on the quality of the data: glycosidic fragments and structurally meaningful cross-ring fragments are necessary for complete structural characterization. The acquisition of such data can be a limiting factor in high-throughput structure analysis. Fortunately, developments in activation methods for GAGs have led to a number of approaches that provide information-rich datasets (reference EDD and NETD). A lack of any necessary pieces of information both increases the possibility of structural ambiguity and simultaneously increases processing times. With this software, glycosidic fragments are particularly important as they greatly reduce search space and facilitate the assignment of cross-ring fragment peaks.

Data preprocessing methods have not been discussed in this manuscript as there are various specialized considerations that must be made for GAGs. Deconvolution of GAGs using commercialized options such as Bruker DataAnalysis or Thermo Xcalibur fails to capture all relevant monoisotopic peaks. This is because isotope distributions of GAG fragments deviate a great deal from the typical proteomic averagine model and, more importantly, do not necessarily fall under a consistent chemical formula but instead change with respect to the number of sulfo-modifications. Isotope deconvolution thus needs its own set of specialized rules, a point we will discuss with more detail in a separate manuscript.