AI Coding Benchmark
DeepboxBench
How well can AI models write Deepbox code? 10 tasks, ranging from hard to very hard, cover tensors, neural networks, ML pipelines, and more.
| Tasks | Hard | Very Hard | Models |
|---|---|---|---|
| 10 | 5 | 5 | 1 |
Leaderboard
| # | Model | Overall | Hard | Very Hard | Last Run |
|---|---|---|---|---|---|
| 1 | Gemini 3 Flash (`google/gemini-2.5-flash-lite-preview-09-2025`) | 6.0% | 6.0% | 6.0% | Mar 5, 2026 |
Task Breakdown
| Task | Category | Difficulty | Gemini 3 Flash |
|---|---|---|---|
| Custom Autograd Operation | Autograd | Hard | 10.0% |
| DataFrame ETL Pipeline | DataFrame | Hard | 5.0% |
| Binary Classifier Eval | ML | Hard | 5.0% |
| Residual Network Module | NN | Hard | 5.0% |
| GridSearchCV Pipeline | ML | Hard | 5.0% |
| Transformer Encoder | NN | Very Hard | 5.0% |
| GAN Training Loop | NN | Very Hard | 5.0% |
| End-to-End ML System | ML | Very Hard | 5.0% |
| LSTM Forecaster | NN | Very Hard | 5.0% |
| Stacking Ensemble | ML | Very Hard | 10.0% |
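
The per-task scores above appear to roll up into the leaderboard figures as a plain unweighted mean. That aggregation rule is an assumption, since the benchmark does not state how Overall, Hard, and Very Hard are computed, but it does reproduce the published 6.0% values. A minimal TypeScript sketch of the arithmetic, using hypothetical names (`TaskResult`, `meanScore`) that are not part of Deepbox or DeepboxBench:

```typescript
type Difficulty = "Hard" | "Very Hard";

interface TaskResult {
  task: string;
  difficulty: Difficulty;
  score: number; // percentage in the range 0-100
}

// Scores copied from the Task Breakdown table for the single listed model.
const results: TaskResult[] = [
  { task: "Custom Autograd Operation", difficulty: "Hard", score: 10.0 },
  { task: "DataFrame ETL Pipeline", difficulty: "Hard", score: 5.0 },
  { task: "Binary Classifier Eval", difficulty: "Hard", score: 5.0 },
  { task: "Residual Network Module", difficulty: "Hard", score: 5.0 },
  { task: "GridSearchCV Pipeline", difficulty: "Hard", score: 5.0 },
  { task: "Transformer Encoder", difficulty: "Very Hard", score: 5.0 },
  { task: "GAN Training Loop", difficulty: "Very Hard", score: 5.0 },
  { task: "End-to-End ML System", difficulty: "Very Hard", score: 5.0 },
  { task: "LSTM Forecaster", difficulty: "Very Hard", score: 5.0 },
  { task: "Stacking Ensemble", difficulty: "Very Hard", score: 10.0 },
];

// Unweighted arithmetic mean. ASSUMPTION: the benchmark's real aggregation
// rule is not documented; this is simply the rule that matches the table.
function mean(values: number[]): number {
  if (values.length === 0) return NaN;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Mean score over all tasks, or over one difficulty tier when given.
function meanScore(rows: TaskResult[], difficulty?: Difficulty): number {
  const selected = difficulty
    ? rows.filter((r) => r.difficulty === difficulty)
    : rows;
  return mean(selected.map((r) => r.score));
}

const overall = meanScore(results);
const hard = meanScore(results, "Hard");
const veryHard = meanScore(results, "Very Hard");

console.log(overall.toFixed(1), hard.toFixed(1), veryHard.toFixed(1));
// prints "6.0 6.0 6.0"
```

Because both tiers contain five tasks, the overall mean equals the average of the two tier means here; with uneven tiers the two would differ, and the source does not say which one the leaderboard would report.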
About DeepboxBench
DeepboxBench measures how well AI models can write code using Deepbox, a zero-dependency TypeScript framework for AI, numerical computing, and machine learning.
Each model receives the same Deepbox API reference and must solve 10 tasks spanning tensors, autograd, neural networks, ML pipelines, DataFrames, and more. Tasks are graded on correct imports, API usage, completeness, type safety, and adherence to best practices.