Sciloop Frontier builds science evaluation datasets for high-achieving AI labs. We focus on problems that are outcome-verifiable, expert-crafted, and cross-checked, so you get signal you can trust.
Why this is needed
Public benchmarks are saturating. Models that look strong on leaderboards often fail on novel, slightly different problems. They're matching the training distribution, not reasoning from first principles. Labs need evaluations that can't be gamed.
Who builds the data
Every problem is written by Olympiad medalists, post-docs at top universities, and experts from MIT, Harvard, the IITs, UC Berkeley, and similar institutions, spanning math, physics, chemistry, and biology. We don't ship a problem until domain experts agree it's correct and novel.
Our bar for quality
We don't release a problem until it has been verified correct, confirmed novel, and, through our own model runs, calibrated to our difficulty standard.
On the set we maintain today, evaluated against Gemini 3 Pro, Claude Opus 4.6, and GPT-5.2 Thinking:
- pass@3: 0%
- pass@8: <5%
The data has to stay ahead of the frontier to be useful!
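For reference, the pass@k figures above use the standard sampling-based definition: a problem counts as solved at k if at least one of k sampled attempts is correct. Below is a minimal sketch of the commonly used unbiased estimator; it is illustrative only, the function and variable names are ours, and it is not a description of our evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem:
    probability that at least one of k samples, drawn without
    replacement from n total attempts of which c are correct,
    is correct."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-sample
        # draw must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a problem attempted 8 times with 0 correct answers
# contributes 0.0 to pass@3; dataset-level pass@k is the mean
# of this quantity over all problems.
print(pass_at_k(n=8, c=0, k=3))  # 0.0
```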
AI labs: talk to us and request sample eval sets at team@sciloop.dev.
Domain experts (post-docs at top universities, faculty, or those with an Olympiad background): we are always looking for top-notch talent. Apply here.
About
Founded by Bilal and Osman, IPhO medalists and MIT students. More importantly, they are frustrated mentors who keep hearing bold claims from AI labs while LLMs confidently get key concepts wrong on mildly challenging yet creative problems.