AI Coding Benchmark

DeepboxBench

How well can AI models write Deepbox code? The benchmark has 10 tasks, ranging from hard to very hard, covering tensors, neural networks, ML pipelines, and more.

Tasks: 10
Hard: 5
Very Hard: 5
Models: 1

Leaderboard

| # | Model | Overall | Hard | Very Hard | Last Run |
|---|-------|---------|------|-----------|----------|
| 1 | Gemini 3 Flash (google/gemini-2.5-flash-lite-preview-09-2025) | 6.0% | 6.0% | 6.0% | Mar 5, 2026 |

Task Breakdown

| Task | Category | Difficulty | Gemini 3 Flash |
|------|----------|------------|----------------|
| Custom Autograd Operation | Autograd | Hard | 10.0% |
| DataFrame ETL Pipeline | DataFrame | Hard | 5.0% |
| Binary Classifier Eval | ML | Hard | 5.0% |
| Residual Network Module | NN | Hard | 5.0% |
| GridSearchCV Pipeline | ML | Hard | 5.0% |
| Transformer Encoder | NN | Very Hard | 5.0% |
| GAN Training Loop | NN | Very Hard | 5.0% |
| End-to-End ML System | ML | Very Hard | 5.0% |
| LSTM Forecaster | NN | Very Hard | 5.0% |
| Stacking Ensemble | ML | Very Hard | 10.0% |
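
The leaderboard figures are consistent with a plain unweighted mean of the per-task scores, although the page does not state the aggregation rule. The sketch below is plain TypeScript with no Deepbox calls and works under that assumption. The `TaskScore` type and the `mean` and `meanFor` helpers are illustrative names, not part of the benchmark.

```typescript
// Per-task scores (percent) copied from the Task Breakdown table.
type Difficulty = "Hard" | "Very Hard";

interface TaskScore {
  task: string;
  difficulty: Difficulty;
  score: number; // percent, 0 to 100
}

const scores: TaskScore[] = [
  { task: "Custom Autograd Operation", difficulty: "Hard", score: 10.0 },
  { task: "DataFrame ETL Pipeline", difficulty: "Hard", score: 5.0 },
  { task: "Binary Classifier Eval", difficulty: "Hard", score: 5.0 },
  { task: "Residual Network Module", difficulty: "Hard", score: 5.0 },
  { task: "GridSearchCV Pipeline", difficulty: "Hard", score: 5.0 },
  { task: "Transformer Encoder", difficulty: "Very Hard", score: 5.0 },
  { task: "GAN Training Loop", difficulty: "Very Hard", score: 5.0 },
  { task: "End-to-End ML System", difficulty: "Very Hard", score: 5.0 },
  { task: "LSTM Forecaster", difficulty: "Very Hard", score: 5.0 },
  { task: "Stacking Ensemble", difficulty: "Very Hard", score: 10.0 },
];

// Unweighted arithmetic mean. ASSUMPTION: the benchmark aggregates this way;
// the source only shows that the published numbers match this rule.
function mean(values: number[]): number {
  if (values.length === 0) return NaN;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Mean over all tasks, or over one difficulty tier when it is given.
function meanFor(difficulty?: Difficulty): number {
  const selected = scores.filter(
    (s) => difficulty === undefined || s.difficulty === difficulty,
  );
  return mean(selected.map((s) => s.score));
}

const overall = meanFor();
const hard = meanFor("Hard");
const veryHard = meanFor("Very Hard");

// Prints: Overall 6.0%, Hard 6.0%, Very Hard 6.0%
console.log(
  `Overall ${overall.toFixed(1)}%, Hard ${hard.toFixed(1)}%, Very Hard ${veryHard.toFixed(1)}%`,
);
```

Each tier sums to 30 points over 5 tasks, so both tier means and the overall mean come out at 6.0%, matching the leaderboard row.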

About DeepboxBench

DeepboxBench measures how well AI models can write code using Deepbox, a zero-dependency TypeScript framework for AI, numerical computing, and machine learning.

Each model receives the same Deepbox API reference and must solve 10 tasks spanning tensors, autograd, neural networks, ML pipelines, DataFrames, and more. Tasks are graded on correct imports, API usage, completeness, type safety, and adherence to best practices.
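
The page names five grading criteria but not how they are combined into a task score. As a purely hypothetical illustration, the sketch below scores each criterion from 0 to 1 and takes an equal-weight average. The `RubricScores` type, the 0-to-1 scale, the equal weights, and the `taskScorePercent` function are all assumptions made for this example, not the benchmark's actual grader.

```typescript
// The five criteria named in the text. ASSUMPTION: each is scored on a
// 0-to-1 scale; the real grader's scale and weights are not published here.
interface RubricScores {
  imports: number;
  apiUsage: number;
  completeness: number;
  typeSafety: number;
  bestPractices: number;
}

// Hypothetical equal-weight combination, returned as a percentage.
function taskScorePercent(r: RubricScores): number {
  const parts = [
    r.imports,
    r.apiUsage,
    r.completeness,
    r.typeSafety,
    r.bestPractices,
  ];
  for (const p of parts) {
    if (!Number.isFinite(p) || p < 0 || p > 1) {
      throw new RangeError("each criterion must be a number in [0, 1]");
    }
  }
  const total = parts.reduce((sum, p) => sum + p, 0);
  return (total / parts.length) * 100;
}

// Made-up inputs for illustration only: (1 + 0.5 + 0.5 + 0 + 0.5) / 5 = 0.5
const example = taskScorePercent({
  imports: 1,
  apiUsage: 0.5,
  completeness: 0.5,
  typeSafety: 0,
  bestPractices: 0.5,
});
console.log(`${example.toFixed(1)}%`); // prints 50.0%
```

A weighted variant would only need a second array of weights that sum to 1. Equal weights are used here because the source gives no basis for anything else.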