Preprint / Version 1

TaKoHigh enables accurate variant calling and phasing of PCR-based long-read sequencing data

DOI:

https://doi.org/10.51094/jxiv.1517

Keywords:

long-read sequencing, PCR amplicons, variant calling, haplotype phasing, chimeric reads, ADAMTS13, thrombotic thrombocytopenic purpura

Abstract

Long-read sequencing (LRS) is a powerful approach for analyzing causative variants in hereditary diseases. When the target gene is known, long-range PCR amplicons provide high coverage and efficiency. However, artifacts such as chimeric reads, allelic imbalance, and uneven coverage across overlapping amplicons can compromise the accuracy of variant calling and phasing, and no existing software is tailored to the properties of PCR-based LRS. We developed TaKoHigh, the first variant calling and phasing tool optimized for LRS of PCR amplicons. TaKoHigh analyzes each amplicon individually, applies thresholds based on allelic balance, and connects haplotypes through overlaps between adjacent amplicons. In ADAMTS13-associated cases, TaKoHigh achieved 98% variant calling accuracy, outperforming Clair3 (64%) and Longshot (50%). It also resolved compound heterozygosity in cases where conventional tools failed. TaKoHigh enables robust interpretation of PCR-based LRS data without requiring specialized experimental protocols, making it broadly applicable in both clinical and research settings.
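
The three steps named in the abstract (per-amplicon analysis, allelic-balance thresholds, and haplotype joining through amplicon overlaps) can be illustrated with a minimal Python sketch. This is not TaKoHigh's implementation: the function names, the threshold values (`het_min`, `hom_min`, `min_depth`), and the data layout are all hypothetical and chosen only to make the idea concrete.

```python
def call_genotype(ref_count, alt_count, het_min=0.25, hom_min=0.85, min_depth=20):
    """Classify one site within one amplicon from its allele counts.

    All thresholds are illustrative assumptions, not values from TaKoHigh.
    """
    depth = ref_count + alt_count
    if depth < min_depth:
        return "no_call"
    alt_fraction = alt_count / depth
    if alt_fraction >= hom_min:
        return "hom_alt"
    if alt_fraction >= het_min:
        return "het"
    return "hom_ref"


def link_amplicons(left, right):
    """Join the phased heterozygous sites of two adjacent amplicons.

    Each argument maps position -> (allele on haplotype A, allele on haplotype B).
    Sites present in both amplicons (the overlap) decide whether the haplotype
    labels of `right` must be swapped. Returns the merged mapping, or None when
    there is no shared site or the shared sites disagree about the orientation.
    """
    shared = sorted(set(left) & set(right))
    if not shared:
        return None
    same = sum(left[p] == right[p] for p in shared)
    flipped = sum(left[p] == right[p][::-1] for p in shared)
    if same == len(shared):
        oriented = right
    elif flipped == len(shared):
        # Swap haplotype labels so that `right` agrees with `left`.
        oriented = {p: alleles[::-1] for p, alleles in right.items()}
    else:
        return None
    merged = dict(left)
    merged.update(oriented)
    return merged


if __name__ == "__main__":
    print(call_genotype(50, 50))
    amplicon_1 = {100: ("A", "G"), 200: ("C", "T")}
    amplicon_2 = {200: ("T", "C"), 300: ("G", "A")}
    print(link_amplicons(amplicon_1, amplicon_2))
```

In this sketch, calling genotypes separately for each amplicon keeps a skewed allele ratio in one PCR product from affecting its neighbors, and the overlap check declines to join haplotypes when the shared sites conflict, which is one simple way to avoid propagating errors from chimeric reads.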

Conflict of interest disclosure

The authors declare no competing interests.

References

Chan, K. W. et al. Targeted Gene Sanger Sequencing Should Remain the First-Tier Genetic Test for Children Suspected to Have the Five Common X-Linked Inborn Errors of Immunity. Front. Immunol. 13, 883446 (2022).

Marx, V. Method of the year: long-read sequencing. Nat. Methods 20, 6-11 (2023).

van Dijk, E. L. et al. Genomics in the long-read sequencing era. Trends Genet. 39, 649-671 (2023).

Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).

Zheng, Z. et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nat. Comput. Sci. 2, 797-803 (2022).

Edge, P. & Bansal, V. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nat. Commun. 10, 4660 (2019).

Laver, T. W. et al. Pitfalls of haplotype phasing from amplicon-based long-read sequencing. Sci. Rep. 6, 21746 (2016).

Levy, G. G. et al. Mutations in a member of the ADAMTS gene family cause thrombotic thrombocytopenic purpura. Nature 413, 488-494 (2001).

Kokame, K. et al. Mutations and common polymorphisms in ADAMTS13 gene responsible for von Willebrand factor-cleaving protease activity. Proc. Natl. Acad. Sci. U. S. A. 99, 11902-11907 (2002).

Matsumoto, M. et al. Molecular characterization of ADAMTS13 gene mutations in Japanese patients with Upshaw-Schulman syndrome. Blood 103, 1305-1310 (2004).

Moake, J. L. Thrombotic thrombocytopenic purpura: survival by "giving a dam". Trans. Am. Clin. Climatol. Assoc. 115, 201-219 (2004).

Sadler, J. E. Von Willebrand factor, ADAMTS13, and thrombotic thrombocytopenic purpura. Blood 112, 11-18 (2008).

Lotta, L. A., Garagiola, I., Palla, R., Cairo, A. & Peyvandi, F. ADAMTS13 mutations and polymorphisms in congenital thrombotic thrombocytopenic purpura. Hum. Mutat. 31, 11-19 (2010).

Fujimura, Y. et al. Natural history of Upshaw-Schulman syndrome based on ADAMTS13 gene analysis in Japan. J. Thromb. Haemost. 9 Suppl 1, 283-301 (2011).

Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094-3100 (2018).

Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37, 4572-4574 (2021).

Martin, M. et al. WhatsHap: fast and accurate read-based phasing. bioRxiv, 085050 (2016).

Thorvaldsdottir, H., Robinson, J. T. & Mesirov, J. P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform. 14, 178-192 (2013).

De Coster, W. & Rademakers, R. NanoPack2: population-scale evaluation of long-read sequencing data. Bioinformatics 39 (2023).

Published


Submitted: 2025-09-06 07:18:53 UTC

Published: 2025-09-10 08:22:28 UTC
Research field
Biology, Life Sciences, and Basic Medicine