Samtools get consensus sequences

9/23/2023

samtools mpileup -uf reference.fasta sorted_aligned_reads.bam | bcftools call -c | vcfutils.pl vcf2fq > consensus.fastq

samtools mpileup: Generates a pileup of aligned reads at each position in the reference genome.
-u: Writes uncompressed BCF output, which is preferred when piping into bcftools.
-f reference.fasta: Specifies the reference genome in FASTA format.
sorted_aligned_reads.bam: The sorted and indexed BAM file.
bcftools call -c: Calls the consensus genotype for each position based on the pileup.
vcfutils.pl vcf2fq: Converts the consensus genotype in VCF format to FASTQ format, representing the consensus sequence.
consensus.fastq: The output file containing the consensus sequence in FASTQ format.

Please make sure to replace reference.fasta with the filename of your reference genome and sorted_aligned_reads.bam with the name of your sorted and indexed BAM file. Note that the -u option is only available in older samtools releases. In newer releases, VCF/BCF output has moved to bcftools, so the first step becomes bcftools mpileup -f reference.fasta sorted_aligned_reads.bam.

After running this command, you should obtain the consensus sequence in the consensus.fastq file. If you prefer FASTA format instead of FASTQ, you can use tools like seqtk or fastq_to_fasta to convert the file.

Removing duplicates might be considered a well-resolved problem in the next-generation sequencing (NGS) data processing domain. However, as NGS technology gains more recognition in clinical applications, researchers have started to pay more attention to its sequencing errors, and prefer to remove these errors while performing deduplication. Recently, a technology called the unique molecular identifier (UMI) has been developed to better identify sequencing reads derived from different DNA fragments. Most existing duplicate removal tools cannot handle UMI-integrated data. Some modern tools can work with UMIs, but are usually slow and use too much memory. Furthermore, existing tools rarely report rich statistical results, which are very important for quality control and downstream analysis. These unmet requirements drove us to develop an ultra-fast, simple, lightweight but powerful tool for duplicate removal and sequencing error suppression, with features for handling UMIs and reporting informative results.

This paper presents gencore, an efficient tool for duplicate removal and sequencing error suppression in NGS data. The tool clusters the mapped sequencing reads and merges the reads in each cluster to generate a single consensus read. While the consensus read is generated, the random errors introduced by library construction and sequencing can be removed. This error-suppressing feature makes gencore well suited to detecting ultra-low frequency mutations from deep sequencing data. When UMI technology is applied, gencore can use the UMIs to identify reads derived from the same original DNA fragment. Gencore reports statistical results in both HTML and JSON formats. The HTML report contains many interactive figures plotting coverage and duplication statistics. The JSON report contains all the statistical results and can be parsed by downstream programs.

Conclusions

Compared with conventional tools like Picard and SAMtools, gencore greatly reduces the mapping mismatches in the output data, which are mostly caused by errors. Compared with newer tools like UMI-Reducer and UMI-tools, gencore runs much faster, uses less memory, generates better consensus reads and provides simpler interfaces. To the best of our knowledge, gencore is the only duplicate removal tool that generates both informative HTML and JSON reports.

High-depth next-generation sequencing (NGS) has been widely used for precision cancer diagnosis and treatment. From such deep sequencing data, somatic mutations can be detected to guide personalized targeted therapy or immunotherapy. Recently, circulating tumor DNA (ctDNA) sequencing has been recognized as a promising biomarker for cancer treatment and monitoring. Since tumor-derived DNA is usually a small part of the total blood cell-free DNA, the mutant allele frequency (MAF) of a variant detected from ctDNA sequencing data can be very low (as low as 0.1%). To detect such low-frequency variants, we usually increase the sequencing depth (which can be higher than 10,000x). However, the processes of NGS library preparation and sequencing are not error-free. In particular, library amplification using PCR can cause particular sequences to become overrepresented, and consequently produce false positive mutations in the results of NGS data analysis. As a result of library amplification, NGS data can also contain many duplicates. The duplication level can be much higher for high-depth data generated by sequencing low-input DNA.
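The cluster-and-merge idea described above can be illustrated with a small Python sketch. This is not gencore's actual implementation. It assumes the reads are already mapped, groups them by a hypothetical key of (chromosome, start position, UMI), and takes a per-position majority vote over equal-length reads. The names Read, cluster_reads and consensus are made up for this example.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """A simplified mapped read (hypothetical structure for illustration)."""
    chrom: str
    start: int
    umi: str
    seq: str


def cluster_reads(reads):
    """Group reads that share chromosome, start position and UMI."""
    clusters = defaultdict(list)
    for read in reads:
        clusters[(read.chrom, read.start, read.umi)].append(read)
    return clusters


def consensus(cluster):
    """Majority-vote consensus over equal-length reads in one cluster.

    A tie for the most common base is reported as 'N'.
    """
    length = len(cluster[0].seq)
    if any(len(r.seq) != length for r in cluster):
        raise ValueError("this sketch only handles equal-length reads")
    bases = []
    for column in zip(*(r.seq for r in cluster)):
        ranked = Counter(column).most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            bases.append("N")
        else:
            bases.append(ranked[0][0])
    return "".join(bases)


reads = [
    Read("chr1", 100, "ACGT", "TTGACCA"),
    Read("chr1", 100, "ACGT", "TTGACCA"),
    Read("chr1", 100, "ACGT", "TTGTCCA"),  # one random error in the 4th base
    Read("chr1", 100, "GGCA", "TTGACCA"),  # same position, different UMI
]
merged = {key: consensus(group) for key, group in cluster_reads(reads).items()}
for key, seq in merged.items():
    print(key, seq)
```

In this toy input the three reads sharing a UMI collapse into one consensus read in which the single mismatching base is outvoted, while the read with a different UMI stays in its own cluster. A real tool would also have to deal with base qualities, paired ends, indels and UMI sequencing errors, which this sketch ignores.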

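For the FASTQ-to-FASTA conversion mentioned above, seqtk or fastq_to_fasta remain the practical choice. As an illustration of what the conversion involves, here is a minimal Python sketch. It assumes the consensus FASTQ may wrap sequence and quality strings over several lines, so it reads quality lines until their total length matches the sequence length instead of relying on a fixed four-line record. The function names read_fastq and fastq_to_fasta are hypothetical helpers for this example.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence) pairs from a FASTQ stream.

    Sequence and quality strings may be wrapped over several lines.
    """
    line = handle.readline()
    while line:
        line = line.strip()
        if not line:
            line = handle.readline()
            continue
        if not line.startswith("@"):
            raise ValueError(f"expected '@' header, got: {line!r}")
        name = line[1:].split()[0]
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        seq = "".join(seq_parts)
        # Consume quality lines until they cover the whole sequence.
        # Quality lines can themselves start with '@' or '+'.
        qual_len = 0
        while qual_len < len(seq):
            line = handle.readline()
            if not line:
                raise ValueError(f"truncated quality string for {name}")
            qual_len += len(line.strip())
        yield name, seq
        line = handle.readline()


def fastq_to_fasta(fastq_handle, fasta_handle, width=60):
    """Write each FASTQ record as FASTA, wrapping sequence lines at `width`."""
    for name, seq in read_fastq(fastq_handle):
        fasta_handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            fasta_handle.write(seq[i:i + width] + "\n")


# Small in-memory example with a wrapped sequence and wrapped qualities.
example = io.StringIO("@chr1\nACGTAC\nGTnn\n+\nIIII@I\n+III\n")
out = io.StringIO()
fastq_to_fasta(example, out, width=8)
print(out.getvalue(), end="")
```

For real files, the same function can be called with handles from open("consensus.fastq") and open("consensus.fasta", "w").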