As part of my work, I do security research for various apps and websites. I wanted to see whether LLMs could reproduce a common class of exploits I've found in multiple apps. So I built a deliberately vulnerable book review app and spent $1,500 finding out…
TL;DR
Some models were more capable than others.
One of the lessons learned (the most relevant for me personally)