
which is pretty incredible, because what they did here is they just took multiple common capture the flag exercise environments, like Hack in the Box and others, and they would run the model to see if it can solve these capture the flags: network exploitation, web exploitation, cryptography, all the common capture the flag categories you're used to. What's fascinating to see is that the first models, 4o, when they started to do this, they just performed at 20% on average. Now we see models performing at 80, 90%, if they're allowed to use research, if they're allowed to use web browsing as well, and so on. And it's crazy to see this type of progress. But these models are, you know, they are generalizing across these specific capture the flag exercises.