Introduction

Monogenic inherited diseases usually involve multiple disciplines and complex clinical symptoms. Because of their underlying molecular mechanisms, they are difficult to diagnose precisely with conventional clinical tests, and most of them are fatal, disabling, or teratogenic1. Traditional testing techniques carry a greater risk of false-negative results and misdiagnosis; as a result, clinicians may miss the critical window for treating patients. In comparison, genetic testing enables early detection, early intervention, and early treatment of single-gene diseases. Large-scale discovery of novel genes and validation of monogenic diseases can be implemented quickly and applied widely in the clinic. People with a family history of genetic disorders can undergo pre-marital, pre-pregnancy, and prenatal genetic screening2,3 to help avoid birth defects. Therefore, genetic testing is important for clinical diagnosis and the prevention of birth defects.

Next-generation sequencing technology has been widely used to detect genetic diseases. The major sequencing approaches are targeted region sequencing, whole exome sequencing, whole genome sequencing, and mitochondrial DNA sequencing. However, whole genome and whole exome sequencing are not only costly and time-consuming, but also make it challenging to screen for specific disease-causing variants across a large span of genomic regions4. Combining regional capture with high-throughput sequencing can effectively capture disease-associated regions and quickly locate disease-causing variants. With its high throughput, low cost5, high speed, and high accuracy, high-throughput sequencing is widely used in clinical practice6,7 for genetic disease detection and carrier screening8. However, most currently available genetic testing products detect a limited range of diseases and have a compromised detection rate9. Moreover, besides monogenic variants, recent studies have found that chromosomal microdeletions and microduplications are important causes of developmental delay and intellectual disability10. We therefore urgently need a highly efficient and sensitive screening method that can detect all of these variant types, so that a variety of monogenic diseases and common chromosomal abnormalities can be detected in one step.

This study therefore used the BGISEQ-500 sequencing platform to develop a chip that focuses on coding regions with known associations with genetic diseases. Variants that affect gene function are detected more cost-effectively than with whole genome or whole exome sequencing. Currently, 4013 known single-gene diseases can be detected (Table 1). In addition, 148 common chromosomal abnormalities can be detected by targeting specific regions (Table 2). Compared with traditional gene detection methods, the combined strategy integrates known single-gene diseases with common chromosomal abnormalities and therefore achieves a “one-step” solution for detecting genetic variants. The improved detection rate, together with high throughput, high accuracy, fast speed, and low cost, suggests that this combined strategy is a powerful tool for clinical diagnosis and the prenatal prevention of birth defects.

Table 1 List of 4013 diseases that can be detected by the designed chip.
Table 2 List of 148 chromosomal abnormalities that can be detected by the designed chip.

Materials and methods

Sample information

A total of 100 samples were gathered for this study. Because the chip was designed to capture almost all disease-causing genes (Table 1), samples were collected from hospital patients who were willing to participate in this study and are essentially unbiased. To assess the stability of the chip, we selected two samples, S77 and S78, for inter-batch and intra-batch stability evaluation. In addition, samples S79, S80, S81, and S82 were selected to evaluate the coverage and depth of the target region on the BGISEQ-500 platform. Eighty-six patients were selected from clinical cases; 52 of them received a diagnosis, whereas 34 did not. To evaluate the ability of the chip to detect chromosomal abnormalities, 12 samples that had previously been tested by CNVseq were also selected; their results were consistent with known positive samples in the disease regions shown in Table 2. Written informed consent was obtained from all adult participants and from the parents of minors enrolled in the study. The project and research programs involving human tissues were approved by the BGI Ethics Committee (BGI-IRB 16098).

Chip design

In this study, a chip was designed to detect not only SNPs, indels, and large intragenic deletions, but also 148 chromosomal abnormalities from the DECIPHER and OMIM databases, by adding capture fragments in specific regions. The capture region was designed as follows. (I) Capture region for single-gene diseases: because a gene usually corresponds to multiple transcripts, we first selected the most common or the longest transcript of each gene to represent that gene. We then selected all coding sequence (CDS) regions, extending each CDS region by 10 base pairs (bps) on both sides to detect splicing variants7. The untranslated region (UTR) is large and most of it cannot be annotated, so the chip does not capture the UTR. Transcripts selected according to these principles may not include all functional regions of the other transcripts of the gene, so some functional regions may not be captured. (II) Capture region for chromosomal abnormality detection: first, the variant region of each chromosomal abnormality and all genes in that region were determined from the databases, and the most common or the longest transcript was then selected for each gene. All functional regions of the major pathogenic genes in the variant region were intended to be included on the chip. The capture regions of the other, non-major pathogenic genes followed these principles: (i) for variant regions containing 15 genes or fewer, we randomly selected 100 bps on the CDS of each gene for capture; (ii) for variant regions containing more than 15 and fewer than 40 genes, we randomly selected 100 bps on the CDS of each gene from two out of every three genes; (iii) for variant regions containing 40 genes or more, we randomly selected 100 bps on the CDS of every other gene.
The probes cover about 10 million bps, so an estimated 0.33% of the human reference genome is captured per sequencing run.
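
The gene-sampling rules (i) to (iii) for non-major pathogenic genes can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the text does not say which two of every three genes (or which alternate genes) are picked, so a simple positional rule is assumed here, and the function names, the single CDS interval per gene, and the half-open coordinates are all hypothetical.

```python
import random


def genes_to_sample(genes):
    """Choose which non-major genes of a variant region receive a capture fragment.

    Assumption: genes are taken in list order; the source does not specify
    how the "two out of every three" or "every other" genes are chosen.
    """
    n = len(genes)
    if n <= 15:
        # Rule (i): 15 genes or fewer, every gene is sampled.
        return list(genes)
    if n < 40:
        # Rule (ii): more than 15 and fewer than 40 genes,
        # keep two out of every three (drop every third gene).
        return [g for i, g in enumerate(genes) if i % 3 != 2]
    # Rule (iii): 40 genes or more, keep every other gene.
    return [g for i, g in enumerate(genes) if i % 2 == 0]


def random_fragment(cds_start, cds_end, rng, length=100):
    """Pick a random window of `length` bp inside a CDS given as [start, end)."""
    if cds_end - cds_start <= length:
        # CDS shorter than the fragment: capture the whole CDS.
        return cds_start, cds_end
    start = rng.randrange(cds_start, cds_end - length + 1)
    return start, start + length


def capture_fragments(region_genes, rng):
    """region_genes: list of (gene_name, cds_start, cds_end) tuples (hypothetical layout)."""
    chosen = genes_to_sample(region_genes)
    return {name: random_fragment(start, end, rng) for name, start, end in chosen}
```

Passing a seeded `random.Random` instance as `rng` makes the randomly placed fragments reproducible, which would matter if a probe design had to be regenerated.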

Experiments and sequencing

In this experiment, genomic DNA was first extracted from whole blood, and qualified DNA was subjected to library preparation6. The library was prepared by fragmenting 1 μg of genomic DNA into small fragments of 200–300 bps. The size-selected product was quantified using Qubit. The input amount of DNA was adjusted to 50 ng according to the measured concentration, and TE buffer was added to bring the total volume to 40 μL. End repair was then performed and an “A” base was added at the 3′ end so that the DNA fragments could be ligated to barcodes. The pre-PCR library was used to enrich the target region with the probes designed in this study. Pooling and mixing were performed at a sample amount of 1 µg per chip. Hybridization and elution were then performed according to the manual (Roche NimbleGen, USA), followed by PCR amplification. After purification with AMPure XP Beads (Beckman Coulter, USA), 330 ng of the DNA library was subjected to circularization, and DNA nanoballs were then synthesized. After purification, Qubit (Thermo Fisher Scientific, USA) was used to quantify the purified PCR product. A final yield of single-stranded circles ranging from 33 to 132 ng was considered qualified. Sequencing was performed on the BGISEQ-500 platform, and data analysis and interpretation of the results were based on the sequencing data11,12. Library preparation was performed separately for each replicate of each sample. For the stability evaluation of each sample, the inter-batch assessment used three different chips to capture three technical replicates, whereas the intra-batch assessment used the same chip to capture three technical replicates with specific barcodes simultaneously. The data that support the findings of this study have been deposited in the CNSA (https://db.cngb.org/cnsa/) of CNGBdb with accession code CNP0000378.

Bioinformatics analysis and variant identification

The bioinformatics analysis comprises data filtering, alignment, variant detection, and result annotation. The raw data were first evaluated for quality to remove low-quality and adapter-contaminated reads. The valid data were then mapped to the human reference genome (hg19) using the Burrows-Wheeler Aligner (BWA)13. PCR duplicates were removed using Picard. SNVs and indels were called using the HaplotypeCaller module of the Genome Analysis Toolkit (GATK)14,15. Intragenic deletions and duplications were identified by comparing the average depth between samples in the same batch. The variants were then annotated using databases including dblocal (a database of variant frequencies for 100 normal human samples)6, dbSNP (http://www.ncbi.nlm.nih.gov/SNP/), HapMap (http://hapmap.ncbi.nlm.nih.gov/), dbNSFP (http://varianttools.sourceforge.net/Annotation/DbNSFP), and 1000 Genomes (http://www.1000genomes.org/). The criteria for selecting variants in this study were: (1) select known high-frequency pathogenic variants; (2) filter variants by population frequency, usually <0.01; (3) refer to databases such as HGMD and ClinVar to screen for loci with reported pathogenicity; (4) clarify the pathogenic mechanism of the gene using the ClinGen database and identify deleterious mutations; (5) select sites with high pathogenicity scores based on the results of SIFT, PolyPhen, varSEAK, and other prediction software; (6) combine the patient’s phenotype with the inheritance pattern of the disease to find loci that could ultimately explain the clinical symptoms of the patient. In addition, we used CNVkit to detect chromosomal abnormalities16. In this study, CNVkit was mainly used to detect deletions and duplications of large fragments. We used the default parameters, with both cases and controls supplied to the pipeline. Finally, suspicious variants were screened, interpreted, and validated to generate the final data.
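
As a rough illustration of how criteria (1) to (3) and (5) above might combine, the following Python sketch filters variants by reported pathogenicity, population frequency, and prediction result. It is a deliberately simplified reading, not the pipeline used in the study: the `Variant` fields, the function name, and the rule that a variant absent from frequency databases counts as rare are all assumptions, and criteria (4) and (6) require expert review that is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    name: str
    population_freq: Optional[float]  # None when absent from frequency databases
    reported_pathogenic: bool         # e.g. listed as pathogenic in HGMD or ClinVar
    predicted_deleterious: bool       # e.g. SIFT or PolyPhen prediction


def is_candidate(variant, max_freq=0.01):
    """Return True if the variant should be kept for interpretation."""
    if variant.reported_pathogenic:
        # Criteria (1) and (3): known pathogenic variants are kept
        # even when they are frequent in the population.
        return True
    # Assumption: a variant missing from all frequency databases is treated as rare.
    freq = 0.0 if variant.population_freq is None else variant.population_freq
    # Criteria (2) and (5): rare and predicted deleterious.
    return freq < max_freq and variant.predicted_deleterious
```

Candidates retained by such a filter would still need to be matched against the phenotype and inheritance pattern before being reported.
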
The bioinformatic analysis of CNVseq followed a previous report17. The clinical evaluation of the CNVseq results was based on guidelines prepared by the American College of Medical Genetics (ACMG)18. Variants were named according to the International System for Human Cytogenomic Nomenclature (ISCN) standard. The stability within and between batches was evaluated as follows: we sequenced each sample as three technical replicates in separate batches and, separately, as three technical replicates within the same batch, and ensured that the sample concentration was consistent before chip capture, after hybridization and elution, and in the final sequencing experiment, thereby reducing fluctuations in sequencing depth. We applied the same procedure to every sample, namely data filtering, alignment, duplicate marking, local realignment, indel realignment, base quality score recalibration, and variant detection, based on the GATK best practices pipeline (https://www.broadinstitute.org/gatk/guide/best-practices.php). For variant detection, we used the parameter -out_mode EMIT_ALL_SITES to output the information for all sites in our capture region and removed the sites that were not covered in the three sequencing runs of a single sample, which is why the total number of sites is slightly smaller than the size of the capture region. We then counted, for each site in each sample, whether the genotypes were identical in all three replicates, identical in two replicates, or completely different.
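
The per-site genotype comparison across three technical replicates described above can be sketched in Python as follows. This is a minimal sketch, assuming each replicate is available as a dictionary mapping a site key to a genotype string; that data layout and the function names are illustrative and not taken from the study's pipeline.

```python
from collections import Counter


def classify_site(genotypes):
    """Classify one site from its three replicate genotypes."""
    distinct = len(set(genotypes))
    return {1: "all_same", 2: "two_same", 3: "all_different"}[distinct]


def concordance_summary(rep1, rep2, rep3):
    """Count concordance classes over the sites present in all three replicates."""
    shared_sites = rep1.keys() & rep2.keys() & rep3.keys()
    return Counter(
        classify_site((rep1[site], rep2[site], rep3[site])) for site in shared_sites
    )
```

Sites missing from any replicate are excluded from the summary here; they would be handled separately when counting how many sites all three replicates share.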

Homology model construction and protein stability prediction

To investigate the effects of disease-associated variants on protein structure and function, we performed protein modeling analysis. We used the rapid modeling module of YASARA (version 17.1.28) to automatically perform multi-template search and alignment and complete hybrid modeling. The protein model was repaired using the FoldX plug-in, and the mutant protein model was then constructed to obtain a high-confidence structural model. The FoldX plug-in was also used to calculate the difference between the free energy of the mutant protein and that of the wild-type protein (ΔΔG = ΔGmut − ΔGwt); a ΔΔG greater than 1.6 kcal/mol was considered to indicate a significant effect on protein stability19. All structural analysis and image rendering were performed using PyMOL (version 2.2.0).

Results

Clinical sample depth determination

Four samples, S79, S80, S81, and S82, were selected for this analysis. The original depths of the samples were 281.46×, 413.23×, 569.57×, and 714.32×, respectively, and the average sequencing depths after duplicate removal were 157×, 231×, 277×, and 380×, respectively. All CDS regions were characterized by their sequencing depth and coverage. For CDS regions with sequencing depths between 0 and 30×, the proportion of regions and their coverage were 2.29% and 14.33% for S79, 1% and 15.33% for S80, 0.76% and 17.45% for S81, and 0.48% and 12.59% for S82, respectively (Fig. 1). When the sequencing depth reached 30× or more, the coverage improved greatly. Between 30× and 100×, the coverage of the CDS regions of S79, S80, S81, and S82 was 93.84%, 92.29%, 92.75%, and 91.81%, respectively. When the depth was greater than 100×, the coverage of all samples exceeded 99%. A total of 45,527 CDS regions were analyzed, of which 43,192 reached 100% coverage; the remaining 2335 did not reach 100% coverage, although their coverage increased as the depth increased. Among the 45,527 captured regions, 154 CDSs had zero coverage regardless of the read depth. CDS1, the first coding sequence of a gene, accounts for 87% of these 154 CDSs. The GC content of human genes is known to decrease gradually from the 5′ untranslated region to the 3′ untranslated region20. Because CDS1 lies next to the 5′ UTR, the higher GC content of the 5′ UTR may have affected the capture of CDS1. Based on these results, we recommend that, on the BGISEQ-500 sequencing platform, samples analyzed with the customized chip of this study reach an average sequencing depth of 100× or more after duplicate removal.
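
A summary of the kind plotted in Fig. 1, the proportion of CDS regions in each depth range together with their average coverage, could be computed along the following lines. This Python sketch assumes each CDS region is represented by a (mean depth, covered fraction) pair and that the bin boundaries at 30× and 100× are half-open; both the input layout and the boundary handling are assumptions rather than details given in the text.

```python
def summarize_by_depth(regions, edges=(30, 100)):
    """Summarize CDS regions by depth bin.

    regions: iterable of (mean_depth, covered_fraction) pairs, one per CDS region.
    Returns, per bin, the proportion of regions and their mean covered fraction.
    """
    labels = ["0-30x", "30-100x", ">=100x"]
    counts = {label: 0 for label in labels}
    coverage_sums = {label: 0.0 for label in labels}

    total = 0
    for depth, covered in regions:
        if depth < edges[0]:
            label = labels[0]
        elif depth < edges[1]:
            label = labels[1]
        else:
            label = labels[2]
        counts[label] += 1
        coverage_sums[label] += covered
        total += 1

    summary = {}
    for label in labels:
        n = counts[label]
        summary[label] = {
            "proportion": n / total if total else 0.0,
            # None marks an empty bin, where a mean is undefined.
            "mean_coverage": coverage_sums[label] / n if n else None,
        }
    return summary
```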

Fig. 1: Relationship between sequencing depth and coverage in CDS regions.

The columns indicate the proportional distributions of CDS regions with different sequencing depths for sample S79 (157× average), sample S80 (231× average), sample S81 (277× average), and sample S82 (380× average), respectively (refer to the left coordinate). The solid dots (circles) represent the average coverage in CDS regions with different sequencing depths (refer to the right coordinate).

Inter-batch and intra-batch stability assessment

In this project, samples S77 and S78 were each sequenced in three batches to evaluate stability between batches, and each sample was also sequenced three times within one batch to evaluate stability within a batch. We used the parameter -out_mode EMIT_ALL_SITES to output detection information for all loci in the capture region, and then analyzed the genotypic consistency of loci across different batches of the same sample and across repeats of the same sample within one batch. In this experiment, we defined stability as the proportion of sites identified in all three technical replicates. For inter-batch stability, the total number of loci for sample S77 was 9,903,792 and the intersection of the three batches was 9,881,645, giving a stability of 99.78% (Fig. 2a); for sample S78, the total number of loci was 9,874,160 and the intersection of the three batches was 9,852,762, giving a stability of 99.78% (Fig. 2b). For intra-batch stability, the total number of loci for sample S77 was 9,904,450 and the intersection of the three repeats in the same batch was 9,882,238, giving a stability of 99.78% (Fig. 2c); for sample S78, the total number of loci was 9,877,841 and the intersection of the three repeats in the same batch was 9,857,175, giving a stability of 99.79% (Fig. 2d). These data confirm that the customized chip is highly stable both between and within batches on the BGISEQ-500 sequencing platform. To evaluate the accuracy of this technique, we compared the SNPs called in YH cell line samples by targeted NGS with the genotyping results obtained using Illumina’s Human Zhonghua-8 BeadChips (SNP array). We selected the loci shared by the SNP array and the chip designed in this study for accuracy analysis. A total of 3664 SNPs were detected in the YH cell line, and 99.54% (3647/3664) of the genotypes at the selected loci were consistent with the SNP array results, demonstrating the high accuracy of this method.
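
The stability and accuracy percentages reported above follow directly from the stated counts. The short Python sketch below recomputes them, using stability as defined in the text (sites found in all three replicates divided by the total number of sites); the dictionary labels and function name are only illustrative.

```python
def stability(shared_sites, total_sites):
    """Fraction of sites identified in all three technical replicates."""
    return shared_sites / total_sites


# (shared sites, total sites) as reported in the text.
reported_counts = {
    "S77 inter-batch": (9_881_645, 9_903_792),
    "S78 inter-batch": (9_852_762, 9_874_160),
    "S77 intra-batch": (9_882_238, 9_904_450),
    "S78 intra-batch": (9_857_175, 9_877_841),
}

for label, (shared, total) in reported_counts.items():
    print(f"{label}: {stability(shared, total):.2%}")

# Genotype concordance with the SNP array in the YH cell line.
snp_array_concordance = 3647 / 3664
print(f"SNP array concordance: {snp_array_concordance:.2%}")
```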

Fig. 2: Evaluation of the stability of our method.

Venn diagram of S77 (a) and S78 (b) sequenced three times in the same batch. Venn diagram of S77 (c) and S78 (d) sequenced three times in three batches.

Variant information in clinical samples

Using targeted next-generation sequencing (NGS), we obtained high-quality sequences for 86 samples. Variant-related information was obtained after reference sequence alignment and variant detection. In this study, 67 disease-related variants were identified in 52 patients, including 49 missense variants, 8 frameshift variants, 5 splicing variants, 3 intragenic deletions and duplications, and 2 whole-gene deletions. Of the 67 variants, 36 have been reported previously and 31 are reported here for the first time. Table S1 summarizes the disease-related variant information for the 52 samples.

Table 3 Details of CNV detection results at different sequencing depths.

Chromosome abnormality detection

This study used CNVkit to detect chromosomal abnormalities. Because the software detects CNVs with a read-depth method, we randomly extracted 10,000,000 reads and 20,000,000 reads in addition to using the original data, simulating different sequencing depths for detecting CNV copy number and breakpoint position. At the original depth of 613×, one region remained undetected: chr7:69,783,279–69,952,448, a segment of 169.17 kb. This region was not detected at any of the three depths, namely 613× (original depth), 140× (20,000,000 reads), and 70× (10,000,000 reads). We therefore speculate that the resolution of the customized chip is insufficient to detect a deletion or duplication of about 200 kb. In addition, the recommended detection resolution of CNVkit is 1 Mb, and all deletions and duplications larger than 1 Mb were detected. The results also confirmed that the number of missed regions increases as the depth decreases, so sufficient sequencing depth is recommended to reduce the rate of missed detection. Table 3 shows the detailed CNV results for the samples.
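
Randomly extracting a fixed number of reads to simulate a lower sequencing depth can be done in a single pass with reservoir sampling. The Python sketch below shows one way to do this; it is not the tool or command used in the study, which the text does not name, and the function name and seed parameter are assumptions.

```python
import random


def subsample_reads(reads, k, seed=0):
    """Uniformly sample k reads from an iterable of unknown length (reservoir sampling).

    If the input holds fewer than k reads, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, read in enumerate(reads):
        if index < k:
            # Fill the reservoir with the first k reads.
            reservoir.append(read)
        else:
            # Replace an existing entry with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = read
    return reservoir
```

Because the reads are streamed once and only `k` of them are held in memory, the same routine would work for 10,000,000 or 20,000,000 reads drawn from a much larger file. For paired-end data, read pairs rather than single reads would need to be sampled together.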

Table 4 Structural analysis of four mutant proteins.

Protein structure prediction and stability results

We performed protein modeling analysis on all genes carrying variants of uncertain significance, of which only six were modeled completely with the mutated amino acid included in the modeled sequence (Table 4). The six genes were ANLN, CNGB1, UMOD, DSTYK, UNC45B, and COL4A3. In the structure of ANLN, Asp1021 is located at the carboxy terminus of the Anillin protein and belongs to the PH (Pleckstrin homology) domain, which is necessary for all targeting events21. The PH domain is a protein module of about 120 amino acids that is thought to interact with lipids to mediate protein recruitment to the plasma membrane, and studies have shown that the PH domain is electrostatically polarized22. To examine how the p.D1021V variant would affect protein structure, we compared the structures of the wild-type and mutant proteins and found that the conformation was essentially unchanged. In addition, the Gibbs free energy change calculated by FoldX also indicates that the variant does not affect the stability of the protein.

For the CNGB1 variant p.M974R, the UMOD variant p.V550I, and the COL4A3 variant p.A1555V, FoldX calculated changes in Gibbs free energy (ΔΔG) of 4.07063 kcal/mol, 4.01864 kcal/mol, and 2.46126 kcal/mol, respectively. These values exceed the 1.6 kcal/mol threshold, indicating that these variants affect the stability of the protein.
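
The stability criterion from the Methods (ΔΔG = ΔGmut − ΔGwt, with values above 1.6 kcal/mol considered significant) can be applied to the three reported values as in the Python sketch below. The ΔΔG values are those given in the text; the function and variable names are illustrative.

```python
DDG_THRESHOLD_KCAL_PER_MOL = 1.6  # threshold stated in the Methods


def ddg(dg_mutant, dg_wild_type):
    """Change in folding free energy: ddG = dG_mut - dG_wt (kcal/mol)."""
    return dg_mutant - dg_wild_type


def is_destabilizing(ddg_value, threshold=DDG_THRESHOLD_KCAL_PER_MOL):
    """True when ddG is strictly greater than the threshold."""
    return ddg_value > threshold


# ddG values (kcal/mol) reported for the three variants.
reported_ddg = {
    "CNGB1 p.M974R": 4.07063,
    "UMOD p.V550I": 4.01864,
    "COL4A3 p.A1555V": 2.46126,
}

for variant, value in reported_ddg.items():
    verdict = "destabilizing" if is_destabilizing(value) else "not significant"
    print(f"{variant}: ddG = {value:.2f} kcal/mol ({verdict})")
```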

Discussion

The study of monogenic hereditary diseases is a typical field of precision medicine. The complex clinical symptoms of monogenic diseases make diagnosis difficult, and most of the pathogenic mechanisms are not clear. Because effective treatments are lacking, these diseases are often fatal, disabling, or teratogenic. Conditions such as intellectual disability and growth retardation are often caused by chromosomal abnormalities as well as by the single-gene variants that are responsible for monogenic diseases. Therefore, we urgently need an effective method that can detect both monogenic variants and chromosomal aberrations to facilitate clinical diagnosis and the prevention of birth defects. This study designed a chip that can detect up to 4013 single-gene diseases. Compared with previous panel designs6,7, we included more genes related to Mendelian diseases when designing the chip to improve the diagnosis rate. In addition, this study also covers 148 common chromosomal disorders by targeting the key genes as well as randomly selected non-critical genes in the abnormal chromosomal regions. We used the MGIEasy Exome Capture V5 Probe as the WES reference to gauge the cost gap between the panel and WES. At an average depth of 200×, the panel costs approximately 1700 RMB, whereas WES costs approximately 2300 RMB. The primary reason for this difference is the expense of sequencing. Because the panel generates a modest amount of data, the time and personnel costs of bioinformatics processing and interpretation further widen the cost difference between the two tests, although we did not quantify them in this study. Because the panel generates less data, the cost of data storage is also reduced, which matters because storage can become rather costly once the sample size reaches a certain threshold.
This project combines the BGISEQ-500 sequencing platform with the customized chip. Given its low cost, the evaluation results indicate that this combination has potential for clinical testing and carrier screening applications.

Sequencing analysis is effective for the diagnosis of rare genetic diseases, but the balance between effectiveness and cost for comprehensive analyses such as whole genome sequencing and whole exome sequencing remains controversial. Target capture analysis enriches genes or regions of interest and is an analytical method that balances cost and effectiveness. The chip designed in this study encompasses the majority of currently known disease-causing genes and can be considered a clinical-grade whole exome. The panel can target disease-related regions of the human genome more effectively and, more importantly, achieve higher sequencing coverage when targeting a group of genes associated with a particular disease phenotype. In this study, CDS coverage reached 99.66% when the sequencing depth exceeded 100×, and coverage increased as sequencing depth increased.

Nevertheless, a high-resolution assessment of various WES datasets reveals unequal coverage along the length of exons23. Studies reveal that regions with inadequate WES coverage account for around 10% of all CDS regions24. We also analyzed the coverage of the genes recommended by the American College of Medical Genetics and Genomics (ACMG) for pathogenic variant detection and clinical reporting25. Among the 59 genes analyzed, APOB CDS1, DSC2 CDS1, PRKAG2 CDS5, RET CDS1, and TGFBR1 CDS1 had no coverage regardless of how much the sequencing depth was increased (Table S1). According to additional research26, six genes, KCNH2, KCNQ1, SDHD, TNNI3, VHL, and WT1, contain low-coverage regions in one or more samples. These results imply that low-coverage regions inside functionally significant genes may influence variant detection and subsequent clinical diagnosis.

Moreover, with the same amount of sequencing data, the chip achieves a higher sequencing depth than WES, which is advantageous for detecting exon-level structural variation; certain diseases, particularly neurological diseases such as DMD, can be caused by structural variation at the exon level. The clinical application of WGS is still limited for two reasons: first, the interpretation of non-coding regions is extremely limited and relies on scientific research; second, the cost is prohibitive for the subject. Considering cost-effectiveness, this chip, which contains nearly all genes with clearly defined molecular mechanisms, remains an excellent option. Diseases such as McCune-Albright syndrome are caused by variants in early embryonic somatic cells, and conventional WES analysis, particularly in the clinical setting, may not detect such somatic variants. However, this chip has some remaining limitations. In the era of clinical genomics, where reverse phenotyping has become commonplace27, WES can provide early diagnosis and guide treatment options, and it is often selected to expedite potential diagnoses and reduce the costs associated with multiple tests. Overall, the panel lacks the advantages offered by WES, namely a larger number of candidate genes and the ability to reevaluate data on a regular basis.

For the 86 clinical cases, we first identified candidate pathogenic genes in the list of 4013 diseases based on the clinical diagnosis and used targeted NGS to find pathogenic variants in the candidate genes. If a variant was indeterminate based on the bioinformatic analysis and database annotations, we plotted the reads and aligned them against the reference sequence at the variant site at single-base resolution. If the variant still could not be confirmed, Sanger sequencing or real-time PCR was performed. In some cases, however, the pathogenic variant was not in a candidate gene; we then searched for candidate variants in other genes in the target region and inferred the disease in reverse.

In this study, we performed homology modeling on some proteins to explain how the variants change protein structure. Sample S32, from a 7-year-old patient, shows clinical manifestations of hematuria and C3 glomerulopathy. The heterozygous missense variant c.3062A>T (p.D1021V) was detected in the coding region of the ANLN gene (NM_018685.4) in this sample. ANLN variants can cause focal segmental glomerulosclerosis type 8 (OMIM#: 616032), an autosomal dominant disease whose main clinical manifestations are segmental glomerular sclerosis, proteinuria, decreased glomerular filtration rate, and progressive decline in renal function. Both SIFT and PolyPhen-2 predict the variant to be deleterious. No frequency information for c.3062A>T was found in the dbSNP, HapMap, or 1000 Genomes databases or in the local database, and there is no documented pathogenicity. In the structure of ANLN, the variant p.D1021V is located in the PH (Pleckstrin homology) domain. Anillin is an actin-binding protein involved in cytokinesis. It interacts with GTP-bound Rho proteins and inhibits their GTPase activity. The PH domain has multiple functions, but it generally targets the protein to an appropriate cellular location or interacts with a binding partner. The PH domain is electrostatically polarized; aspartic acid is charged and polar and is often involved in forming protein active sites or binding sites, whereas valine is a non-polar amino acid. Comparing the wild-type and mutant conformations revealed no changes, but there were some differences in the hydrophobic surface. We speculate that the variant affects the electrostatic polarity of the PH domain, resulting in a change in protein function. Therefore, we speculate that ANLN c.3062A>T is a disease-causing variant in this subject.

CNVs are widely distributed in the human genome and are among the important pathogenic factors of human diseases. Pathogenic CNVs can cause intellectual disability, growth retardation, autism, various birth defects, leukemias, and tumors. Determining the copy number and the breakpoint position of the variant region are two crucial aspects of CNV detection. As technology has advanced, more and more methods have emerged for CNV detection, but different technology platforms and their corresponding computing strategies differ greatly in the accuracy of the detected copy number and breakpoint position. The CNVseq method uses genome-wide data, whereas this study uses data from the genomic target region. Although both methods detect CNVs with the circular binary segmentation algorithm, they still differ in data correction and comparison. For these reasons, the breakpoint positions obtained by the two methods are not very consistent; in our study, the breakpoint positions identified by the two methods all differ at the kilobase level. This study uses CNVkit, which detects CNVs with a read-depth method. Therefore, in addition to using the original data, we also simulated different sequencing depths for detecting CNV copy number and breakpoint position. As the depth decreases, the number of missed regions increases, and a sufficient number of reads helps reduce the rate of missed detection. In contrast, sequencing depth has no significant effect on the detection of breakpoint positions. Although both are based on similar capture sequencing technology, exome sequencing and target capture sequencing still usually differ significantly in experimental procedures and bioinformatic analysis.
Factors such as the GC content of the probes, the initial DNA concentration, and even the temperature of chip hybridization may affect the number of reads captured by each probe and lead to differences in capture efficiency, depth, and coverage. WES can indeed accurately detect CNVs larger than 1 Mb, but detecting these common chromosomal CNVs with a specific panel, as in our study, is extremely cost-effective.

Conclusion

In summary, we provide a diagnostic detection tool that combines capture arrays and NGS to capture the coding regions of 3043 genes associated with 4013 diseases and to detect 148 chromosomal abnormalities by targeting specific regions. The results of the evaluation suggest that our method has high accuracy and stability. Compared with traditional genetic testing methods, it integrates known data about single-gene diseases and frequent chromosomal abnormalities to achieve a “one-step” solution for detecting genetic variants. In our study, perhaps because of high GC content, missing enrichment probes, or other reasons, 154 CDS regions still cannot be covered at all. This incomplete coverage may be improved by using a high concentration of capture probes that cover difficult-to-enrich regions28,29. This technology can potentially be used in diagnostic testing to provide an effective basis for clinical diagnosis and genetic counseling and to improve the detection rate of diseases.