Amino‑acid composition is a selection target for coding‑sequence retention: evidence from out‑of‑frame translation comparisons in Escherichia coli
DOI:
https://doi.org/10.51094/jxiv.1565
Keywords:
Amino‑acid composition, Amino‑acid composition space, De novo gene, Reference proteome, Mutual‑constraint hypothesis
Abstract
Protein evolution is a major engine of biological diversification, but the selective pressures that determine which sequences are retained as genes remain debated. Here, a reference proteome is defined as the set of proteins encoded by the annotated coding sequences (CDS) of a reference genome. In previous work, we showed that, across the three domains of life, per‑protein amino‑acid residue fractions in reference proteomes consistently form species‑specific, bell‑shaped distributions well approximated by binomial expectations for each of the 20 amino acids. If a genome can, in principle, encode a wide range of amino‑acid compositions yet the realized set of coding sequences occupies only a narrow region of that space, this would imply that amino‑acid composition itself is a target of selection during gene retention. Here we test this idea in Escherichia coli. We compared the distributions of per‑CDS residue fractions for the reference proteome’s native (+1) translations with those for genome‑encoded out‑of‑frame translations obtained by re‑parsing the unaltered CDS in non‑native frames (+2, +3 on the plus strand; −1, −2, −3 on the reverse complement). We then located the reference proteome within the composition space spanned by these alternatives. The reference proteome was concentrated within a markedly narrower, shifted region of composition space than that spanned by the out‑of‑frame translations, with especially strong separations for cysteine, aspartate, glutamate, and arginine. Thus, despite the genome’s capacity to encode diverse compositions, the reference proteome lies within a restricted subset, consistent with amino‑acid residue composition being an important target of selection in coding‑sequence retention.
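The frame comparison described in the abstract can be illustrated with a minimal sketch. It is not the author's pipeline: the function names, the mapping of the labels −1, −2, −3 to offsets 0, 1, 2 on the reverse complement, and the choice to exclude stop codons and ambiguous codons from the fraction denominator are all assumptions made for illustration. The codon assignments follow the NCBI ordering shared by translation tables 1 and 11.

```python
from collections import Counter
from itertools import product

# Standard codon assignments in NCBI order (bases T, C, A, G; first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, offset):
    """Translate complete codons starting at `offset`; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(cds):
    """Re-parse one unaltered CDS in all six frames.

    Assumption: '+1' is the native frame, and '-1', '-2', '-3' correspond to
    offsets 0, 1, 2 on the reverse complement.
    """
    cds = cds.upper()
    rc = reverse_complement(cds)
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = translate(cds, k)
        frames[f"-{k + 1}"] = translate(rc, k)
    return frames


def residue_fractions(peptide):
    """Per-sequence residue fractions.

    Assumption: stop ('*') and ambiguous ('X') symbols are excluded from the
    denominator; the source does not state how stops are handled.
    """
    residues = [aa for aa in peptide if aa not in "*X"]
    counts = Counter(residues)
    total = len(residues)
    return {aa: counts[aa] / total for aa in sorted(counts)} if total else {}


if __name__ == "__main__":
    toy_cds = "ATGGCTTGCGATTAA"  # hypothetical toy sequence, not real E. coli data
    for frame, peptide in six_frame_translations(toy_cds).items():
        print(frame, peptide, residue_fractions(peptide))
```

Applying `residue_fractions` to every CDS in each frame would give, for each amino acid, one distribution for the native translations and one for each non-native frame, which is the kind of comparison the abstract describes.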
Conflicts of Interest Disclosure
The author declares no conflicts of interest associated with this manuscript.
References
Van Oss, S. B., & Carvunis, A.-R. (2019). De novo gene birth. PLOS Genetics, 15(5), e1008160. https://doi.org/10.1371/journal.pgen.1008160
Carvunis, A.-R., Rolland, T., Wapinski, I., Calderwood, M. A., Yildirim, M. A., Simonis, N., Charloteaux, B., Hidalgo, C. A., Barbette, J., Santhanam, B., Brar, G. A., Weissman, J. S., Regev, A., Thierry-Mieg, N., Cusick, M. E., & Vidal, M. (2012). Proto-genes and de novo gene birth. Nature, 487(7407), 370–374. https://doi.org/10.1038/nature11184
Zhao, L., Svetec, N., & Begun, D. J. (2024). De Novo Genes. Annual Review of Genetics, 58(1), 211–232. https://doi.org/10.1146/annurev-genet-111523-102413
Schmitz, J. F., & Bornberg-Bauer, E. (2017). Fact or fiction: updates on how protein-coding genes might emerge de novo from previously non-coding DNA. F1000Research, 6, 57. https://doi.org/10.12688/f1000research.10079.1
Iyengar, B. R., & Bornberg-Bauer, E. (2023). Neutral Models of De Novo Gene Emergence Suggest that Gene Evolution has a Preferred Trajectory. Molecular Biology and Evolution, 40(4). https://doi.org/10.1093/molbev/msad079
Esumi, G. (2023). The distributions of amino acid compositions of proteins in an organism’s proteome uniformly approximate binomial distributions [Preprint]. Jxiv. https://doi.org/10.51094/jxiv.408
Esumi, G. (2025). Chicken Eggs Are a Practical and Common Exome-Matched Diet for Multicellular Eukaryotic Organisms [Preprint]. Jxiv. https://doi.org/10.51094/jxiv.1056
National Center for Biotechnology Information (NCBI). (2025). NCBI Datasets Taxonomy: Escherichia coli K‑12 (TaxID 83333). National Library of Medicine (US). Retrieved September 29, 2025, from https://www.ncbi.nlm.nih.gov/datasets/taxonomy/83333/
Mir, K., Neuhaus, K., Scherer, S., Bossert, M., & Schober, S. (2012). Predicting Statistical Properties of Open Reading Frames in Bacterial Genomes. PLoS ONE, 7(9), e45103. https://doi.org/10.1371/journal.pone.0045103
Elzanowski, A., & Ostell, J. (2024, September 23). The Genetic Codes. National Center for Biotechnology Information (NCBI). https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
Akashi, H., & Gojobori, T. (2002). Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis. Proceedings of the National Academy of Sciences, 99(6), 3695–3700. https://doi.org/10.1073/pnas.062526999
Wilson, B. A., Foy, S. G., Neme, R., & Masel, J. (2017). Young genes are highly disordered as predicted by the preadaptation hypothesis of de novo gene birth. Nature Ecology & Evolution, 1(6), 0146. https://doi.org/10.1038/s41559-017-0146
Heames, B., Buchel, F., Aubel, M., Tretyachenko, V., Loginov, D., Novák, P., Lange, A., Bornberg-Bauer, E., & Hlouchová, K. (2023). Experimental characterization of de novo proteins and their unevolved random-sequence counterparts. Nature Ecology & Evolution, 7(4), 570–580. https://doi.org/10.1038/s41559-023-02010-2
Karlin, S., & Burge, C. (1995). Dinucleotide relative abundance extremes: a genomic signature. Trends in Genetics, 11(7), 283–290. https://doi.org/10.1016/S0168-9525(00)89076-9
Campbell, A., Mrázek, J., & Karlin, S. (1999). Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proceedings of the National Academy of Sciences, 96(16), 9184–9189. https://doi.org/10.1073/pnas.96.16.9184
Posted
Submitted: 2025-09-30 05:27:25 UTC
Published: 2025-10-06 02:30:36 UTC
License
Copyright (c) 2025 Genshiro Esumi

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.