The Unreasonable Redundancy of Nature's Protein Folds
Summary
The post argues that nature reuses a limited set of protein folds, even across very large sequence datasets. It outlines a data-engineering pipeline that fragments, clusters, and reweights MGnify-derived structures to study fold diversity, and finds that most of the data concentrates in a small set of structural neighborhoods. The result suggests that simply adding more natural sequence data may yield few novel folds for enzyme design, which has implications for how AI models design and sample protein structures.
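To make the fragment, cluster, and reweight steps concrete, here is a minimal sketch in Python. It is not the post's actual pipeline: the function names, the toy 2-D descriptors, the greedy leader clustering, and the inverse-cluster-size weighting are all assumptions chosen for illustration. Real structural clustering would use a dedicated structure-comparison tool rather than Euclidean distance on made-up points.

```python
import math
from collections import Counter


def fragment(length, window, stride):
    """Return (start, end) index pairs for sliding windows over a chain.

    Hypothetical helper: the summary does not state window or stride sizes.
    """
    if length < window:
        return []
    return [(s, s + window) for s in range(0, length - window + 1, stride)]


def greedy_cluster(points, radius):
    """Leader clustering (an assumed stand-in for structural clustering).

    Each point joins the first representative within `radius`;
    otherwise it becomes a new representative.
    """
    reps, labels = [], []
    for p in points:
        for i, r in enumerate(reps):
            if math.dist(p, r) <= radius:
                labels.append(i)
                break
        else:
            reps.append(p)
            labels.append(len(reps) - 1)
    return labels


def inverse_size_weights(labels):
    """Weight each item by 1 / cluster size, normalized to sum to 1.

    Every cluster then carries the same total weight.
    """
    sizes = Counter(labels)
    raw = [1.0 / sizes[label] for label in labels]
    total = sum(raw)
    return [w / total for w in raw]


def top_k_share(labels, k):
    """Fraction of all items that fall in the k largest clusters."""
    sizes = Counter(labels)
    return sum(n for _, n in sizes.most_common(k)) / len(labels)


# Toy descriptors (made up): four near-duplicates and two outliers.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.05, 0.05), (5, 5), (9, 9)]
labels = greedy_cluster(points, radius=1.0)   # [0, 0, 0, 0, 1, 2]
weights = inverse_size_weights(labels)
share = top_k_share(labels, k=1)              # 4 of 6 items in one cluster
```

In this toy input, one neighborhood holds two thirds of the items, which mirrors the kind of concentration the summary describes. After reweighting, each of the three clusters contributes one third of the total weight, so the crowded neighborhood no longer dominates sampling.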