Amino‑acid composition is a selection target for coding‑sequence retention: evidence from out‑of‑frame translation comparisons in Escherichia coli

Genshiro Esumi

doi:10.51094/jxiv.1565

Authors

Genshiro Esumi, Department of Pediatric Surgery, Hospital of the University of Occupational and Environmental Health
ORCID: https://orcid.org/0000-0003-0618-9943
researchmap: https://researchmap.jp/esumig

DOI:

https://doi.org/10.51094/jxiv.1565

Keywords:

Amino‑acid composition, Amino‑acid composition space, De novo gene, Reference proteome, Mutual‑constraint hypothesis

Abstract

Protein evolution is a major engine of biological diversification, but the selective pressures that determine which sequences are retained as genes remain debated. In previous work, we showed that, across the three domains of life, per‑protein amino‑acid residue fractions in reference proteomes, defined here as the set of proteins encoded by the annotated coding sequences (CDS) of a reference genome, consistently form species‑specific, bell‑shaped distributions well approximated by binomial expectations for each of the 20 amino acids. If a genome can, in principle, encode a wide range of amino‑acid compositions yet the realized set of coding sequences occupies only a narrow region of that space, this would imply that amino‑acid composition itself is a target of selection during gene retention. Here we test this idea in Escherichia coli. Using the above definition of the reference proteome, we compared the distributions of per‑CDS residue fractions for its native (+1) translations with those for genome‑encoded out‑of‑frame translations obtained by re‑parsing the unaltered CDS in non‑native frames (+2, +3 on the plus strand; −1, −2, −3 on the reverse complement). We then located the reference proteome within the composition space spanned by these alternatives. The reference proteome was concentrated within a markedly narrower, shifted region of composition space than that spanned by the out‑of‑frame translations, with especially strong separations for cysteine, aspartate, glutamate, and arginine. Thus, despite the genome’s capacity to encode diverse compositions, the reference proteome lies within a restricted subset, which is consistent with amino‑acid residue composition being an important target of selection in coding‑sequence retention.
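The frame comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the author's pipeline: the function names are hypothetical, the codon assignments are those of the standard genetic code (which NCBI translation table 11 shares for elongation), and two choices are assumptions made here for illustration. First, frame −1 is taken to start at the first base of the reverse complement. Second, stop codons in out‑of‑frame translations are simply excluded from the residue count, since the abstract does not say how they were handled.

```python
from collections import Counter
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G); '*' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
RESIDUES = sorted(set(AMINO_ACIDS) - {"*"})  # the 20 amino acids


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, offset):
    """Translate seq from the given offset; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(cds):
    """Translate one unaltered CDS in frames +1..+3 and -1..-3.

    Assumption: frame -1 begins at the first base of the reverse complement.
    """
    cds = cds.upper()
    rc = reverse_complement(cds)
    frames = {f"+{k + 1}": translate(cds, k) for k in range(3)}
    frames.update({f"-{k + 1}": translate(rc, k) for k in range(3)})
    return frames


def residue_fractions(protein):
    """Per-sequence fraction of each of the 20 amino acids.

    Assumption: stops ('*') and unknowns ('X') are excluded from the count.
    """
    counted = [aa for aa in protein if aa in CODON_TABLE.values() and aa != "*"]
    if not counted:
        return {}
    counts = Counter(counted)
    return {aa: counts[aa] / len(counted) for aa in RESIDUES}


if __name__ == "__main__":
    toy_cds = "ATGGCCTGA"  # toy example, not a real E. coli gene
    for frame, protein in six_frame_translations(toy_cds).items():
        print(frame, protein, residue_fractions(protein).get("C"))
```

Applying `residue_fractions` to every CDS in each frame yields, for each amino acid, one distribution per frame; the abstract's comparison is between the +1 distribution and the five out‑of‑frame distributions.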

Conflicts of Interest Disclosure

The author declares no conflicts of interest associated with this manuscript.


References

Van Oss, S. B., & Carvunis, A.-R. (2019). De novo gene birth. PLOS Genetics, 15(5), e1008160. https://doi.org/10.1371/journal.pgen.1008160

Carvunis, A.-R., Rolland, T., Wapinski, I., Calderwood, M. A., Yildirim, M. A., Simonis, N., Charloteaux, B., Hidalgo, C. A., Barbette, J., Santhanam, B., Brar, G. A., Weissman, J. S., Regev, A., Thierry-Mieg, N., Cusick, M. E., & Vidal, M. (2012). Proto-genes and de novo gene birth. Nature, 487(7407), 370–374. https://doi.org/10.1038/nature11184

Zhao, L., Svetec, N., & Begun, D. J. (2024). De Novo Genes. Annual Review of Genetics, 58(1), 211–232. https://doi.org/10.1146/annurev-genet-111523-102413

Schmitz, J. F., & Bornberg-Bauer, E. (2017). Fact or fiction: updates on how protein-coding genes might emerge de novo from previously non-coding DNA. F1000Research, 6, 57. https://doi.org/10.12688/f1000research.10079.1

Iyengar, B. R., & Bornberg-Bauer, E. (2023). Neutral Models of De Novo Gene Emergence Suggest that Gene Evolution has a Preferred Trajectory. Molecular Biology and Evolution, 40(4). https://doi.org/10.1093/molbev/msad079

Esumi, G. (2023). The distributions of amino acid compositions of proteins in an organism’s proteome uniformly approximate binomial distributions [Preprint]. Jxiv. https://doi.org/10.51094/jxiv.408

Esumi, G. (2025). Chicken Eggs Are a Practical and Common Exome-Matched Diet for Multicellular Eukaryotic Organisms [Preprint]. Jxiv. https://doi.org/10.51094/jxiv.1056

National Center for Biotechnology Information (NCBI). (2025). NCBI Datasets Taxonomy: Escherichia coli K‑12 (TaxID 83333). National Library of Medicine (US). Retrieved September 29, 2025, from https://www.ncbi.nlm.nih.gov/datasets/taxonomy/83333/

Mir, K., Neuhaus, K., Scherer, S., Bossert, M., & Schober, S. (2012). Predicting Statistical Properties of Open Reading Frames in Bacterial Genomes. PLoS ONE, 7(9), e45103. https://doi.org/10.1371/journal.pone.0045103

Elzanowski, A., & Ostell, J. (2024, September 23). The Genetic Codes. National Center for Biotechnology Information (NCBI). https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

Akashi, H., & Gojobori, T. (2002). Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis. Proceedings of the National Academy of Sciences, 99(6), 3695–3700. https://doi.org/10.1073/pnas.062526999

Wilson, B. A., Foy, S. G., Neme, R., & Masel, J. (2017). Young genes are highly disordered as predicted by the preadaptation hypothesis of de novo gene birth. Nature Ecology & Evolution, 1(6), 0146. https://doi.org/10.1038/s41559-017-0146

Heames, B., Buchel, F., Aubel, M., Tretyachenko, V., Loginov, D., Novák, P., Lange, A., Bornberg-Bauer, E., & Hlouchová, K. (2023). Experimental characterization of de novo proteins and their unevolved random-sequence counterparts. Nature Ecology & Evolution, 7(4), 570–580. https://doi.org/10.1038/s41559-023-02010-2

Karlin, S., & Burge, C. (1995). Dinucleotide relative abundance extremes: a genomic signature. Trends in Genetics, 11(7), 283–290. https://doi.org/10.1016/S0168-9525(00)89076-9

Campbell, A., Mrázek, J., & Karlin, S. (1999). Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proceedings of the National Academy of Sciences, 96(16), 9184–9189. https://doi.org/10.1073/pnas.96.16.9184
