Generative models are increasingly used for protein design, but the lack of standardized evaluation frameworks limits comparison across model classes and hinders translation to experimental success. We developed a unified framework for sequence generation and benchmarking across multiple model types and tested it on Tobacco etch virus (TEV) protease. Experimental testing revealed substantial variation in performance, with machine learning-designed libraries achieving higher hit rates than conventional methods. Structure-based models performed best overall, whereas commonly used selection metrics did not correlate strongly with experimental activity, underscoring the importance of experimental validation in protein model development.