Preprint / Version 1

The Gene’s GC Content Is the Greatest Source of Inter-Species Differences in Protein Amino Acid Composition

DOI:

https://doi.org/10.51094/jxiv.1061

Keywords:

Amino acid composition, GC content, TA skew, Species Difference, Diversity

Abstract

Organisms synthesize proteins as sequences of 20 amino acids specified by their genes, and a protein’s function is determined by its amino acid sequence and composition. Previous studies in Bacteria have shown that an organism’s genomic GC content is a key determinant of the amino acid composition of its proteins. However, whether this relationship generalizes to organisms in the other domains of life has remained unclear.

In this study, I performed principal component analysis (PCA) on the amino acid compositions of approximately 1.5 million proteins from 81 species spanning all three domains of life and examined how the principal component scores varied among species. The first principal component varied considerably among species, whereas the variation in all other principal components was far more limited.
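As a rough illustration of this kind of analysis, the sketch below computes 20-dimensional amino acid compositions, extracts the first principal component by power iteration on the covariance matrix, and averages the resulting scores per species. It is a minimal sketch under stated assumptions, not the procedure used in the study: the function names, the toy sequences, and the species labels are all hypothetical, and a real analysis of about 1.5 million proteins would more likely rely on a numerical library.

```python
import math
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(protein):
    """Return the fraction of each of the 20 standard amino acids."""
    counts = Counter(r for r in protein.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acid residues found")
    return [counts[aa] / total for aa in AMINO_ACIDS]

def first_pc_scores(rows, iterations=200):
    """Return (unit axis, scores) for the first principal component.

    Uses plain power iteration on the sample covariance matrix.
    The sign of the axis is arbitrary, as in any PCA.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Compositions sum to 1, so an all-ones start vector lies in the null
    # space of the covariance matrix. Start from the covariance column of
    # the most variable amino acid instead.
    k = max(range(d), key=lambda j: cov[j][j])
    v = cov[k][:]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("compositions show no variance")
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

def mean_score_by_species(scores, species):
    """Average the principal component scores within each species."""
    grouped = defaultdict(list)
    for score, name in zip(scores, species):
        grouped[name].append(score)
    return {name: sum(vals) / len(vals) for name, vals in grouped.items()}

# Hypothetical toy data: (species label, protein sequence).
TOY_PROTEINS = [
    ("gc_rich_toy", "GAPGARGAPG"),
    ("gc_rich_toy", "AGGPRAAGPA"),
    ("gc_rich_toy", "GPAARGGAPR"),
    ("at_rich_toy", "KNIFKYNIKF"),
    ("at_rich_toy", "NKKIFYNIIK"),
    ("at_rich_toy", "FKNIYKKNIF"),
]

rows = [composition(seq) for _, seq in TOY_PROTEINS]
labels = [name for name, _ in TOY_PROTEINS]
axis, scores = first_pc_scores(rows)
print(mean_score_by_species(scores, labels))
```

In this toy setting the two invented species fall on opposite sides of the first principal component, which mirrors the kind of among-species spread described above.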

To investigate this further, I developed a function that back-calculates the GC content of a gene from the amino acid composition of its protein, assuming equal usage of synonymous codons. I then compared this estimated GC content with the first principal component score from the PCA and obtained a correlation coefficient of 0.98, indicating almost perfect agreement. Because the first principal component of amino acid composition was essentially the only component that showed substantial interspecies variation, and its values correlated strongly with the back-calculated GC content, I conclude that the greatest source of diversity in protein amino acid composition lies in the gene’s GC content, which is substantially governed by the organism’s genomic GC content.
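One plausible reading of the back-calculation can be sketched as follows: for each amino acid, average the GC fraction over its synonymous codons in the standard genetic code, then weight those averages by the protein's amino acid composition. This is a sketch under assumptions rather than the function used in the study. The names `build_mean_gc_table` and `estimate_gc_content` are hypothetical, stop codons are ignored, and non-standard residue symbols are skipped.

```python
from collections import Counter

# Standard genetic code, codons enumerated with bases in the order T, C, A, G
# for the first, second, and third positions ("*" marks stop codons).
BASES = "TCAG"
AMINO_ACIDS_BY_CODON = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

def build_mean_gc_table():
    """Map each amino acid to the mean GC fraction of its synonymous codons."""
    gc_values = {}
    index = 0
    for b1 in BASES:
        for b2 in BASES:
            for b3 in BASES:
                amino_acid = AMINO_ACIDS_BY_CODON[index]
                index += 1
                if amino_acid == "*":
                    continue  # stop codons do not encode a residue
                codon = b1 + b2 + b3
                gc_fraction = sum(base in "GC" for base in codon) / 3
                gc_values.setdefault(amino_acid, []).append(gc_fraction)
    # Equal usage of synonymous codons: a plain average per amino acid.
    return {aa: sum(vals) / len(vals) for aa, vals in gc_values.items()}

MEAN_GC = build_mean_gc_table()

def estimate_gc_content(protein):
    """Estimate the GC fraction of the coding gene from a protein sequence."""
    counts = Counter(r for r in protein.upper() if r in MEAN_GC)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acid residues found")
    return sum(MEAN_GC[aa] * n for aa, n in counts.items()) / total

# Lysine (AAA, AAG) pulls the estimate down; glycine (GGN) pulls it up.
print(estimate_gc_content("KKKK"), estimate_gc_content("GGGG"))
```

Under this reading, the estimate depends only on amino acid composition, so it can be compared directly with a principal component score computed from the same compositions.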

Conflicts of Interest Disclosure

No competing interests are declared.

References

Du, M.-Z., Zhang, C., Wang, H., Liu, S., Wei, W., & Guo, F.-B. (2018). The GC Content as a Main Factor Shaping the Amino Acid Usage During Bacterial Evolution Process. Frontiers in Microbiology, 9, 2948. https://doi.org/10.3389/fmicb.2018.02948

EMBL-EBI. (2024). Reference Proteomes (Release 2024_02) [Database]. Retrieved January 7, 2025, from https://www.ebi.ac.uk/reference_proteomes/

Esumi, G. (2023). Statistical Extremes of Amino Acid Residue Composition of the Proteome Proteins Can Explain the Origin of the Universality of the Genetic Code. Jxiv. https://doi.org/10.51094/jxiv.575

Esumi, G. (2023). The Synonymous Codon Usage of a Protein Gene Is Primarily Determined by the Guanine + Cytosine Content of the Individual Gene Rather Than the Species to Which It Belongs To Synthesize Proteins With a Balanced Amino Acid Composition. Jxiv. https://doi.org/10.51094/jxiv.561

Posted


Submitted: 2025-01-27 00:32:42 UTC

Published: 2025-01-28 10:19:23 UTC
Section
Biology, Life Sciences & Basic Medicine