Now in private beta · 30-day trial, no card

Know which developers and teams truly deliver.

DevEval uses AI to analyse code, tasks, project context, and review discussions — turning real engineering work into explainable productivity, quality, and risk signals.

Commits, hours, story points, and raw Git statistics show activity. DevEval shows delivery value.

From small teams to enterprise engineering organisations.

Start free trial
30 days · no credit card · self-hosted from day 1
Alice Kowalska
Senior · Backend · 2 yrs
Active
Quality 88 · Velocity 71 · Stability 82 · Collab 79 · Cost 74
Elo · Code Quality
Last 30 days · Glicko-2
Q2 · 2026
1 · Alice K. · 1842 ±64 · +24
2 · Tomek W. · 1721 ±71 · +11
3 · Bob R. · 1690 ±88 · −6
4 · Marta J. · 1654 ±92 · +3
5 · Dawid P. · 1598 ±105 · −12
6 · Hanna L. · 1572 ±110 · +8
↑ Last update 14m ago · 23 PRs analyzed
Diff in / scores out: code never leaves your organization
Reads from: GitHub · GitLab · Bitbucket · Jira · Linear · Tempo · Confluence
01 Why DevEval

Because activity isn't value. And classic metrics only see activity.

Story Points, PR counts, commits, lines of code, hours logged — they all measure activity, not value. Two devs each ship 12 PRs. One refactored auth. The other changed copy on 12 buttons. Classic metrics call them equal. We don't.

The Same-Sprint Illusion
Same sprint, two devs. Alice owns the gnarly database migration nobody else dared to touch, plus three medium fixes — 25 CU delivered. Bob ships fifteen button-copy changes — 7 CU delivered. Both logged 8 days. Both closed roughly the same Story Points. By every traditional metric, they performed equally.
Hours & SP
tie
Looks fair on every dashboard
CU delivered
3.6×
Real difficulty gap — what DevEval sees
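The gap above is plain arithmetic on the figures in the example. A minimal sketch, with illustrative variable names that are not part of DevEval:

```python
# Figures taken from the Same-Sprint example; the dict layout is hypothetical.
sprint = {
    "Alice": {"days_logged": 8, "cu_delivered": 25},
    "Bob": {"days_logged": 8, "cu_delivered": 7},
}

# Traditional view: identical days logged, so the two developers look tied.
hours_tie = sprint["Alice"]["days_logged"] == sprint["Bob"]["days_logged"]

# CU view: ratio of the difficulty each developer actually delivered.
cu_gap = sprint["Alice"]["cu_delivered"] / sprint["Bob"]["cu_delivered"]

print(f"tie on hours: {hours_tie}, CU gap: {cu_gap:.1f}x")
```

With 25 CU against 7 CU the ratio rounds to the 3.6× shown on the card.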
Traditional | DevEval
Lines of code / commits | How much real difficulty did this dev actually solve
Story Points (team-relative) | Constant difficulty unit, comparable across teams
A single composite score per dev, no breakdown | "Better than 87% of org in code quality; top in Velocity, mid in Stability"
Single-number performance review | 5 dimensions: Quality / Velocity / Stability / Collab / Cost
Table of metrics, no explanation | "Alice > Bob: 5 reviews vs 1, lower quality variance"
02 The problem

Every metric you use today was broken by AI in 2025.

AI coding assistants deflate every time-based productivity signal. Lines, commits, hours, story points — all noise now. Without a quality-aware, AI-resistant model, every performance review becomes a negotiation about the metric, not the work.

What organisations measure | Why it fails in 2026 | Concrete failure mode
Number of commits | Rewards activity, not impact. | 50 commits to fix a typo. 1 commit for a vertical slice. Same line.
Story Points | Subjective, inconsistent across teams and vendors. | Team A's "5 points" is Team B's "13". And nobody on either team can tell you why.
Lines of code | Punishes refactoring. AI writes lines for free. | -2,400 LOC of dead code removal scores the same as +2,400 LOC of copy-paste.
Time spent / hours billed | No link to difficulty. AI deflates the time signal further. | Senior closes a gnarly migration in a day. Junior spends the week shipping repetitive forms. Same week on the timesheet, same line on the invoice, and the codebase moved very differently.
Velocity (points/sprint) | Gameable. No quality dimension. Rewards shipping bad code fast. | Doubled velocity this quarter by skipping reviews. The bugs land next quarter; the bonus was paid in this one.
Subjective manager evaluations | Bias-prone. Non-comparable across teams, BUs, vendors. | Five managers, five scales, five favourite people. Performance review = negotiation.
DevEval instead
1 axis of volume + 5 axes of expertise bonus, computed from the diff itself — not from what someone typed into a planning tool. AI-resistant by design.
03 How it works

Context first. Then signal, not noise.

DevEval first learns what your projects actually are, then evaluates every merged PR against that context, then ranks developers head-to-head with reasoning. A built-in chat lets you skip clicking — it just reads the analysis already done for you.

01 on connect · per repo
Project context

It learns your codebase first.

Before any scoring, DevEval profiles each project — tech stack, criticality, era, complexity. A 3-line tweak in a simple CRUD app is not the same as one in a 24/7 banking core. Every later score is calibrated against this context.

02 every merged PR
Per-PR analysis

Each merged PR is read like a senior reviewer would.

Every merged PR runs through seven layers: difficulty scoring (CU on 6 axes), effort estimate, classification, code-quality review, stability check, risk scan, and review-value attribution. One verdict per PR. Bug attribution traces regressions back to the introducing PR.

03 Glicko-2 · 5 dimensions
Head-to-head

Pairs are compared, with reasoning.

"Alice > Bob in code quality, 87% confidence." Three rating systems for three questions. ROI per dev / project / client falls out of the same data.

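The page does not say how a head-to-head confidence is derived. As one illustration, the expected-outcome formula from the original Glicko system (the predecessor of Glicko-2, on the familiar rating scale) turns two ratings and their uncertainties into a win probability. This is a sketch under assumptions: the function names are ours, and reading the leaderboard's "±" as the rating deviation is a guess.

```python
import math

Q = math.log(10) / 400  # Glicko scaling constant


def g(rd: float) -> float:
    """Attenuation factor: the larger the uncertainty, the more the rating gap is damped."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)


def expected_outcome(r_a: float, rd_a: float, r_b: float, rd_b: float) -> float:
    """Expected score of A against B, using both rating deviations combined."""
    combined_rd = math.hypot(rd_a, rd_b)
    return 1.0 / (1.0 + 10 ** (-g(combined_rd) * (r_a - r_b) / 400.0))


# Leaderboard figures from the page; "±" treated as rating deviation (assumption).
alice = (1842, 64)
bob = (1690, 88)
p = expected_outcome(*alice, *bob)
print(f"Alice over Bob, expected score: {p:.2f}")
```

For these inputs the result is roughly 0.70, which is not the 87% quoted in the example; that figure presumably comes from DevEval's own model.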
04 optional shortcut
Chat

Click around, or just ask.

Every screen is fully clickable — drill down to any PR, review, or ranking yourself. Or ask the chat: it reads the same data we already produced and assembles it for you, so you don't have to traverse five views to answer one question.

04 Complexity Units

One scale. From a typo to a multi-sprint outlier.

CU is a measure of contribution per PR — comparable across people, teams, vendors, tools. Volume (scope) plus expertise bonuses on 5 axes. Same task = same CU, forever. AI-resistant by construction.

CU Scale · 0 → 30
Steps: 0, 0.25, 0.5, 1, 1.5, 2, … 30
Scale anchors (CU value: example of work)
  • 2: 1 CRUD endpoint
  • 4: 5 screens (pattern)
  • 7.5: Multi-layer feature
  • 11.5: Vertical slice
  • 14: Senior foundations
  • 17.5: Infra greenfield
  • 20: Mid-dev sprint · 10 MD
  • 24.5: Multi-sprint epic
  • 30: Outlier (rare)
CALIBRATION ANCHOR
20 CU ≈ 10 MD
solid sprint of mid-dev work, no AI
PATTERN-FOLLOWING
~1.5 CU / MD
sublinear discount for repetition
EXPERTISE-HEAVY
up to 3–4 CU / MD
small diff, large bonus axes
MAX SCALE
30
multi-sprint outlier · rare
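The calibration anchor can be read as a conversion rate between CU and mid-level man-days. A small sketch, assuming the anchor scales linearly (the page only states the single 20 CU ≈ 10 MD point) and using hypothetical helper names:

```python
ANCHOR_CU_PER_MD = 20 / 10  # calibration anchor from the page: 20 CU ≈ 10 MD


def mid_dev_days(cu: float) -> float:
    """Man-days a mid-level developer would need at the anchor rate (linear assumption)."""
    return cu / ANCHOR_CU_PER_MD


def cu_rate(cu: float, man_days: float) -> float:
    """Observed CU delivered per man-day."""
    return cu / man_days


print(mid_dev_days(25))  # 25 CU expressed in mid-dev man-days
print(cu_rate(25, 8))    # rate for 25 CU delivered in 8 days
```

A rate above the anchor's 2 CU / MD points toward expertise-heavy work; a rate near 1.5 CU / MD points toward pattern-following work.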

Built from 1 volume axis + 5 expertise bonuses.

Each PR scored 0–10 on every axis. The formula reconciles them into the final CU.
Scope · 0–10
Amount of work delivered, with sublinear discount for repetition — but not to zero.
Mid-band (5–6)
6–10 pattern-following units with different decisions
Top (9–10)
20+ pattern-following units OR pure-volume mass change across 2+ sprints
Anti-pattern
File count ≠ concern count. 30 files threaded with one boolean = scope 1–2, not 8.
05 Quality Gate

Shipping fast with bad code is not a win.

Two delivery patterns. Same team. Same sprint. Velocity metrics call Dev A the top performer — DevEval doesn't. Quality, stability, and collaboration weigh more than raw volume by design.

Volume-first
Q1 2026 · 1 sprint
Developer A
Ships fast. Cuts corners.
Output (CU) 50
Code quality 30 /100
Stability 30 /100
Reviews given 0 CU
Bugs introduced 6
DevEval score
5
pts
Velocity rewarded by every traditional metric.
VS
7.4× score gap
Quality-first
Q1 2026 · 1 sprint
Developer B
Ships less code. Ships better code. Reviews peers.
Output (CU) 20
Code quality 90 /100
Stability 85 /100
Reviews given 10 CU · q. 80
Bugs introduced 1
DevEval score
37
pts
Less output, 7× the score. Quality, stability, and collaboration count more than raw volume.
Velocity doesn't override quality
Code quality and stability multiply into the final score. Cutting corners costs more than it earns.
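To see why multiplying by quality and stability changes the ranking, here is a deliberately simple sketch. The formula is hypothetical: the page does not publish DevEval's scoring function, and this one does not reproduce the 5 and 37 point figures above. It only shows that a multiplicative gate lets a lower-volume developer come out ahead.

```python
from dataclasses import dataclass


@dataclass
class SprintResult:
    output_cu: float
    code_quality: float          # 0-100
    stability: float             # 0-100
    review_cu: float = 0.0
    review_quality: float = 0.0  # 0-100


def illustrative_score(r: SprintResult) -> float:
    """Hypothetical multiplicative score; NOT DevEval's actual formula."""
    # Reviews count as contribution, weighted by their quality.
    contribution = r.output_cu + r.review_cu * r.review_quality / 100
    # Quality and stability act as multipliers, so low values gate the volume.
    return contribution * (r.code_quality / 100) * (r.stability / 100)


dev_a = SprintResult(output_cu=50, code_quality=30, stability=30)
dev_b = SprintResult(output_cu=20, code_quality=90, stability=85,
                     review_cu=10, review_quality=80)

print(illustrative_score(dev_a), illustrative_score(dev_b))
```

Under this assumed formula Developer A's 50 CU shrink to a single-digit score while Developer B's 20 CU plus reviews stay well above it, which matches the direction of the comparison even though the exact numbers differ.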
Reviews are first-class output
Reviewing a peer's PR with a quality-80 review counts as contribution. Helping the team isn't invisible work.
Bugs trace back to the author
Regressions are attributed to the introducing PR. Fast-and-broken doesn't hide behind sprint boundaries.
06 Use cases

Five places where the same data pays off.

One platform. One scoring model. Five concrete decisions it informs — from vendor renewals to hiring. The data is the same; the lens changes.

Vendor performance

Compare software houses on the same scale.

Challenge
You work with multiple software houses delivering against the same backlog. Comparing them today means comparing apples to oranges — different stacks, different reporting, different "definitions of done".
Outcome
Vendor selection and renewal decisions backed by quarter-over-quarter, evidence-grade data.
What DevEval enables
01 Standardised scorecards: Same 5 dimensions, same CU scale, normalised for project difficulty.
02 Cross-vendor benchmarking: Fair, statistically valid comparisons with a confidence interval, not gut feeling.
03 Underperformance alerts: Auto-flagged when a vendor's quality or stability degrades quarter over quarter.
04 Contract & SLA evidence: Defensible numbers in QBR meetings. Not narratives.
07 Three ratings + one stat

Three rating systems. Not a bug — a feature.

A chess grandmaster has three numbers: Elo 2400, top 5%, and games played. Each answers a different question. So does DevEval. On top of these comes one raw stat, the Productivity Index, which combines difficulty (CU) with working days. Not a ranking, just a measurement.

Absolute Points
Per hour 4.31 · 4-period avg 343 · vs org +0%
758 pts · Rising ↑ +539 (+246.5%)
POINTS HISTORY (chart: Jan, Feb, Mar, Apr; axis ticks 0, 250, 500, 930)
BREAKDOWN BY DIMENSION
Code Quality: 152 pts (20%)
Velocity: 379 pts (50%)
Stability: 91 pts (12%)
Collaboration: 76 pts (10%)
Cost Efficiency: 60 pts (8%)
Worked example
The whole org adopts AI tools. Everyone gets 2× faster.
Each rating system reacts differently — that's why you need all of them.
Alice's Glicko-2: 1842 → 1842 (±0) · relative, nothing changes
Alice's percentile: 95th → 95th (±0) · relative, nothing changes
Alice's points: 1500 → 3200 (+113%) · absolute, the org moved
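The reason relative ratings stay put is that a uniform speed-up preserves everyone's order. A minimal sketch with a made-up five-person org (only Alice's starting 1500 points comes from the page, and the exact 2× multiplier is a simplification of the +113% shown):

```python
def percentile_rank(values: dict[str, float], who: str) -> float:
    """Share of the org scoring at or below `who`, on a 0-100 scale."""
    target = values[who]
    return 100 * sum(v <= target for v in values.values()) / len(values)


# Hypothetical absolute points; names other than Alice's value are invented.
before = {"Alice": 1500, "Tomek": 1200, "Bob": 1100, "Marta": 900, "Dawid": 700}
after = {name: pts * 2 for name, pts in before.items()}  # everyone gets 2x faster

rank_before = sorted(before, key=before.get, reverse=True)
rank_after = sorted(after, key=after.get, reverse=True)

print(rank_before == rank_after)                                              # ordering unchanged
print(percentile_rank(before, "Alice") == percentile_rank(after, "Alice"))   # percentile unchanged
print(before["Alice"], "->", after["Alice"])                                  # absolute points move
```

Pairwise systems such as Glicko-2 depend only on who beats whom, so they inherit the same invariance; absolute points are the only number that records that the whole org moved.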
08 Deployment options

SaaS by default. Self-hosted when you need it.

Most teams run DevEval as a managed service and are productive in hours. Enterprises with stricter requirements deploy it inside their own perimeter — same product, same scores, your infrastructure and your AI key.

SaaS · managed by us
DEFAULT
Default · fastest path to value
We run DevEval for you in EU regions. Sign up, connect a repo, see results in hours. Best for teams who want zero ops overhead and standard governance.
EU-hosted · Managed updates · Pooled AI cost
Self-hosted · your infra
ENTERPRISE
Enterprise option · your perimeter
Deploys into your environment — On-Premise (VM / Kubernetes), Azure AKS, or AWS EKS. You bring your own AI provider key. Source code and analysis stay inside your boundary.
On-Prem · AKS · EKS · BYO AI key
Hybrid · scoped per client
CUSTOM
When the standard options don't fit
Air-gapped pilots, regulated industries, multi-region deployments, custom integrations. Scoped in a discovery call — quoted per engagement.
Custom scope · Air-gapped · Regulated
Self-hosted data flow
Diff in. Scores out. Source code never leaves your boundary.
For Enterprise deployments
01
Your Git
GitHub · GitLab · Bitbucket
02
PR diff + metadata
pulled via API — no agents
03
DevEval
inside your perimeter
04
AI provider
your key · direct / Bedrock / Vertex / Foundry
05
Scores & dashboards
in your environment
YOUR PERIMETER The 3 boxes in the middle stay inside your environment. The AI key is yours.
OAuth 2.0 · SAML/Entra ID · RBAC · full audit trail
YOUR INFRA
On-Prem · AKS · EKS
or air-gapped on request
YOUR AI KEY
BYO provider key
direct · Bedrock · Vertex · Foundry
YOUR DATA
No code exfil
only PR metadata + scores
YOUR SLA
Dedicated terms
scoped per engagement
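Step 02 of the data flow ("pulled via API, no agents") can be pictured with GitHub's REST API, which returns a pull request as a diff when the request carries the `application/vnd.github.diff` media type. The sketch below only builds the request; the function name, repository, and token are placeholders, and whether DevEval fetches diffs exactly this way is not stated on the page.

```python
import urllib.request


def build_pr_diff_request(owner: str, repo: str, number: int, token: str) -> urllib.request.Request:
    """Prepare a GET request for one pull request, asking GitHub for the diff representation."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github.diff",  # diff instead of the default JSON
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholder values; nothing is sent over the network here.
req = build_pr_diff_request("example-org", "example-repo", 42, "TOKEN")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req).read() would return the diff text given a real token.
```

Because the request originates inside the perimeter and only the provider call leaves it, no agent needs to be installed on the Git host.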
09 Pricing

Per active developer. Self-hosted on Enterprise.

30-day trial without a card. After that your account turns read-only — nothing disappears. Enterprise is scoped and quoted per client.

Trial
0 free
30 days · no card
5 devs
Up to 30 PRs / dev
Full feature access · read-only after
Starter
16 EUR / dev / mo
Monthly billing
Up to 10 devs
Up to 30 PRs / dev / mo
Email support · 12-mo retention
Most popular
Pro
29 EUR / dev / mo
Monthly · Annual −10%
Up to 100 devs
Up to 60 PRs / dev / mo
Priority support · 24-mo retention
Enterprise
Custom
Annual · scoped per client
Unlimited devs · multi-BU
Unlimited PRs
SaaS or self-hosted · BYO AI key
SSO/SAML · dedicated SLA
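For budgeting, the two self-service plans reduce to a small calculation. A sketch under assumptions: the 10% annual discount is applied to the full yearly total, every developer on the plan is billed, and the function and dictionary names are ours.

```python
from decimal import Decimal

# Prices and limits as listed on the page; the structure itself is hypothetical.
PLANS = {
    "starter": {"eur_per_dev": Decimal("16"), "max_devs": 10, "annual_discount": Decimal("0")},
    "pro": {"eur_per_dev": Decimal("29"), "max_devs": 100, "annual_discount": Decimal("0.10")},
}


def yearly_cost(plan: str, devs: int, annual: bool = False) -> Decimal:
    """Cost in EUR for 12 months, optionally with the plan's annual-billing discount."""
    p = PLANS[plan]
    if devs > p["max_devs"]:
        raise ValueError(f"{plan} supports at most {p['max_devs']} devs")
    total = p["eur_per_dev"] * devs * 12
    if annual:
        total *= 1 - p["annual_discount"]
    return total


print(yearly_cost("starter", 10))           # 10 devs, billed monthly for a year
print(yearly_cost("pro", 40))               # 40 devs, billed monthly for a year
print(yearly_cost("pro", 40, annual=True))  # same team with annual billing
```

Enterprise is left out on purpose, since it is quoted per client.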
Enterprise · how it works
Pricing for enterprise engagements is built per client in a short verification call — scope, integrations, deployment model, AI key arrangement, and support level all factor in. If you decide to go ahead, we can deploy DevEval into your environment, tune it to your stack, and route AI calls through your own provider key — so the sensitive bits stay under your control.
Request a quote
30 days · no card · cancel anytime

Stop measuring lines of code.
Start measuring value.

Connect a repo in 15 minutes. Historic PRs backfilled the same day. Full ratings and ROI ready in hours.

Start free trial
Private beta · early access

Self-service launches soon

We're prioritising enterprise rollouts right now. Leave your email and we'll let you know the moment self-service is open.

We'll only use this to reach out about DevEval. No newsletter, no sharing with third parties.