Machine Learning-Assisted Directed Protein Evolution with Combinatorial Libraries


To reduce the experimental effort associated with directed protein evolution and to explore the sequence space encoded by mutating multiple positions simultaneously, we incorporate machine learning into the directed evolution workflow. Combinatorial sequence space can be expensive to sample experimentally, but machine learning models trained on tested variants provide a fast way to evaluate sequence space computationally. We validate this approach on a large published empirical fitness landscape for human GB1 binding protein, demonstrating that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches. We then provide an example application in evolving an enzyme to produce each of the two possible product enantiomers (stereodivergence) of a new-to-nature carbene Si–H insertion reaction. The approach predicted libraries enriched in functional enzymes and fixed seven mutations in two rounds of evolution to identify variants for selective catalysis with 93% and 79% enantiomeric excess (ee). By greatly increasing throughput with in silico modeling, machine learning enhances the quality and diversity of sequence solutions for a protein engineering problem.
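The core loop described in the abstract (train a model on tested variants, then score the untested part of the combinatorial space in silico) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the models or code used in the study: the one-hot encoding, the closed-form ridge regression, and the helper names `one_hot`, `fit_ridge`, and `rank_untested` are all hypothetical.

```python
import itertools
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def one_hot(variant):
    """Encode the residues at the mutated positions as a flat one-hot vector."""
    vec = np.zeros(len(variant) * len(AMINO_ACIDS))
    for pos, aa in enumerate(variant):
        vec[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return vec


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is handled by centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def rank_untested(tested, n_positions, top_k=5, alpha=1.0):
    """Rank untested variants by predicted fitness.

    tested: dict mapping a variant string (one letter per mutated position)
            to its measured fitness. Illustrative stand-in for assay data.
    Returns a list of (variant, predicted_fitness), best first.
    """
    variants = list(tested)
    X = np.array([one_hot(v) for v in variants])
    y = np.array([tested[v] for v in variants], dtype=float)
    w, b = fit_ridge(X, y, alpha)

    # Enumerate the full combinatorial space and drop what was already measured.
    candidates = [
        "".join(combo)
        for combo in itertools.product(AMINO_ACIDS, repeat=n_positions)
        if "".join(combo) not in tested
    ]
    preds = np.array([one_hot(c) @ w + b for c in candidates])
    order = np.argsort(-preds)[:top_k]
    return [(candidates[i], float(preds[i])) for i in order]
```

In a workflow of this kind, the top-ranked variants would inform which library to build and screen next, and the new measurements would be added to `tested` before refitting. A linear model on one-hot features captures only additive effects; it is used here purely to keep the sketch short.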

Submission Details

ID: UGoTemyw

Submitter: Zach Wu

Submission Date: March 15, 2019, midnight

Version: 2

Publication Details

Zachary Wu, S. B. Jennifer Kan, Russell D. Lewis, Bruce J. Wittmann, Frances H. Arnold (2019) Machine Learning-Assisted Directed Protein Evolution with Combinatorial Libraries; unpublished work from Frances H. Arnold group

Additional Information

The original article can be found via: This is an updated version of study mnqBQFjF3. Warning: The sequence data is presented as obtained from Laragen, processed with Seqman Pro, and used for machine learning in our publication. However, in developing next-gen sequencing techniques tested on the same plates for later work, we have noticed that some of the sequences may have mixed populations.
