
Heterogeneity of the GFP fitness landscape and data-driven protein design

Understanding the fitness landscape, that is, the relationship between genotype and phenotype, sheds light on the fundamental laws of heredity (Canale et al. 2018) and may ultimately lead to novel methods of protein design (Alley et al. 2019). The fitness landscape is often conceptualised as a multidimensional surface (Kondrashov and Kondrashov, 2015) in which one dimension represents fitness, or another phenotype, and each of the other dimensions represents a locus of the genotype.

Gonzalez Somermeyer L, Fleiss A, Mishin AS, Bozhanova NG, Igolkina AA, Meiler J, Alaball Pujol ME, Putintseva EV, Sarkisyan KS, Kondrashov FA

While several experimentally characterized fitness landscapes for specific proteins have been reported (Hartman and Tullman-Ercek 2019; Sarkisyan et al. 2016), such surveys of large proteins are still hindered by the enormity of the genotype space. Moreover, their characterization is hampered by epistatic interactions between amino acids, that is, the dependence of the effects of mutations on each other, which is common (Russ et al. 2020). Many of these interactions are too complex to predict with available data (Pokusaeva et al. 2019). To date, the best way of exploring fitness landscapes has been directed evolution, a laboratory procedure that mimics natural selection. This involves randomly mutating the genetic sequence of a naturally occurring protein to create multiple variants with slightly different amino acid sequences (Chen et al. 2018). Alternatively, a rational design approach is used, in which new proteins are built using principles learned from the study of known protein structures (Anishchenko et al. 2021).
In the present study, an international scientific team, in collaboration with scientists from the Group of synthetic biology and the Group of molecular tags for optical nanoscopy, used both approaches to engineer new variants of naturally occurring green fluorescent proteins (GFPs) by generating tens of thousands of GFP mutant variants and assessing their ability to fluoresce (Figure 1). In addition, machine learning algorithms were used to predict the performance of other GFP variants and to expand the fitness landscape of green fluorescent proteins.
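
To give a rough sense of the scale of the genotype space mentioned above, the short sketch below counts how many distinct sequences lie exactly k substitutions away from a reference protein. The protein length of 238 residues is an assumption made for illustration (it is roughly the length of avGFP), and the function name `count_variants` is hypothetical; neither is taken from the study.

```python
from math import comb


def count_variants(length: int, k: int, alphabet: int = 20) -> int:
    """Count sequences that differ from a reference at exactly k positions.

    Choose k of the `length` positions, then pick one of the (alphabet - 1)
    alternative amino acids at each chosen position.
    """
    return comb(length, k) * (alphabet - 1) ** k


# Assumption for illustration only: a protein of about 238 residues.
PROTEIN_LENGTH = 238

for k in range(1, 5):
    print(f"{k} substitution(s): {count_variants(PROTEIN_LENGTH, k):,} variants")
```

Even at two substitutions the count is already in the millions, so a library of tens of thousands of variants samples only a small part of the space close to the starting protein.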

Figure 1. The fitness landscape of green fluorescent proteins. Two naturally occurring green fluorescent proteins (GFPs) are shown as dots outlined in black, functional mutant proteins able to fluoresce as green dots, and non-functional ones as grey dots. Application of a machine learning algorithm expanded the fitness landscape (right; blue contour lines) by including mutations that are not generated by evolution. This led to the creation of functional, synthetic variants (green dot, bottom right) that reside on different fitness peaks from the naturally occurring variants.

The authors were able to design a functional GFP variant carrying 48 mutations relative to naturally occurring proteins. To see whether the developed algorithm could be as effective for other proteins, the authors experimented with three GFP proteins that originated from evolutionarily distant species: cgreGFP, amacGFP, and ppluGFP2. They found that machine learning was better at generating functional variants of cgreGFP than of amacGFP and ppluGFP2. Analysis of the fitness landscape revealed that the homologues differed in the number of mutations they could tolerate: on average three to four mutations for cgreGFP and avGFP, whose landscape had been characterized previously (Sarkisyan et al. 2016), but seven to eight mutations in the case of amacGFP and ppluGFP2. The proteins also differed in their general sturdiness: ppluGFP2 was stable when exposed to high temperatures, whereas the structure of cgreGFP was more sensitive to changes in temperature. The increased mutational sensitivity of avGFP and cgreGFP appeared to be due to negative epistasis, in which the combined negative effect of several mutations is greater than the sum of their individual effects.
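
As a worked illustration of negative epistasis, the sketch below compares a double mutant with the value expected if the two single-mutation effects simply added up. The additive expectation on a log-fluorescence scale is one common convention and is assumed here; the function name `epistasis` and all numbers are hypothetical and are not data from the study.

```python
def epistasis(f_wt: float, f_a: float, f_b: float, f_ab: float) -> float:
    """Deviation of a double mutant from the additive expectation.

    All inputs are phenotype values (for example log fluorescence) of the
    wild type, the two single mutants, and the double mutant.
    A negative result means the double mutant is worse than expected.
    """
    expected = f_wt + (f_a - f_wt) + (f_b - f_wt)
    return f_ab - expected


# Hypothetical values: each single mutation costs little,
# but together they reduce fluorescence far more than the sum.
eps = epistasis(f_wt=4.0, f_a=3.8, f_b=3.7, f_ab=2.5)
print(f"epistasis = {eps:.2f}")
```

Under these made-up numbers the additive expectation is 3.5, so the double mutant falls one unit short of it, which is the signature of negative epistasis.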
Overall, the published results indicate that, to generate functional protein variants and to predict a protein’s function, the algorithm requires only data on the effects of single-site mutations and low-order epistasis. This is good news for the protein engineering field, as it suggests that prior knowledge of high-order interactions between large sets of mutations is not needed for protein design.
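
To make the idea of a predictor built only from single-site effects and low-order epistasis more concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes a simple form in which single-mutation effects and pairwise interaction terms are summed into a latent score and passed through a sigmoid. The mutation labels, effect sizes, and the function name `predict_fluorescence` are all hypothetical.

```python
import math
from itertools import combinations


def predict_fluorescence(mutations, single_effects, pair_effects, bias=0.0):
    """Predict a fluorescence score in (0, 1) for a set of mutations.

    single_effects maps a mutation label to its individual effect.
    pair_effects maps a sorted (label, label) tuple to a pairwise term.
    Missing entries are treated as having no effect.
    """
    ordered = sorted(mutations)
    latent = bias + sum(single_effects.get(m, 0.0) for m in ordered)
    for a, b in combinations(ordered, 2):
        latent += pair_effects.get((a, b), 0.0)
    # The sigmoid turns the unbounded latent score into a bounded phenotype.
    return 1.0 / (1.0 + math.exp(-latent))


# Hypothetical parameters, for illustration only.
single = {"A10V": -1.0, "K52R": -1.5}
pairs = {("A10V", "K52R"): -2.0}  # a negative pairwise (epistatic) term

wild_type = predict_fluorescence([], single, pairs, bias=3.0)
additive_only = predict_fluorescence(["A10V", "K52R"], single, {}, bias=3.0)
with_epistasis = predict_fluorescence(["A10V", "K52R"], single, pairs, bias=3.0)
print(wild_type, additive_only, with_epistasis)
```

In this toy setting the pairwise term is what pushes the double mutant from a fairly high predicted score to a low one, while no terms involving three or more mutations are needed.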
The results are published in the journal eLife.


  1. Alley, Ethan C., Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M. Church. 2019. “Unified Rational Protein Engineering with Sequence-Based Deep Representation Learning.” Nature Methods 16 (12): 1315–22.
  2. Anishchenko, Ivan, Samuel J. Pellock, Tamuka M. Chidyausiku, Theresa A. Ramelot, Sergey Ovchinnikov, Jingzhou Hao, Khushboo Bafna, et al. 2021. “De Novo Protein Design by Deep Network Hallucination.” Nature 600 (7889): 547–52.
  3. Canale, Aneth S., Pamela A. Cote-Hammarlof, Julia M. Flynn, and Daniel N. A. Bolon. 2018. “Evolutionary Mechanisms Studied through Protein Fitness Landscapes.” Current Opinion in Structural Biology 48 (February): 141–48.
  4. Chen, Kai, Xiongyi Huang, S. B. Jennifer Kan, Ruijie K. Zhang, and Frances H. Arnold. 2018. “Enzymatic Construction of Highly Strained Carbocycles.” Science 360 (6384): 71–75.
  5. Hartman, Emily C., and Danielle Tullman-Ercek. 2019. “Learning from Protein Fitness Landscapes: A Review of Mutability, Epistasis, and Evolution.” Current Opinion in Systems Biology 14 (April): 25–31.
  6. Kondrashov, Dmitry A., and Fyodor A. Kondrashov. 2015. “Topological Features of Rugged Fitness Landscapes in Sequence Space.” Trends in Genetics: TIG 31 (1): 24–33.
  7. Pokusaeva, Victoria O., Dinara R. Usmanova, Ekaterina V. Putintseva, Lorena Espinar, Karen S. Sarkisyan, Alexander S. Mishin, Natalya S. Bogatyreva, et al. 2019. “An Experimental Assay of the Interactions of Amino Acids from Orthologous Sequences Shaping a Complex Fitness Landscape.” PLoS Genetics 15 (4): e1008079.
  8. Russ, William P., Matteo Figliuzzi, Christian Stocker, Pierre Barrat-Charlaix, Michael Socolich, Peter Kast, Donald Hilvert, et al. 2020. “An Evolution-Based Model for Designing Chorismate Mutase Enzymes.” Science 369 (6502): 440–45.

June 16