Now in private beta · 30-day trial, no card

Know which developers and teams truly deliver.

DevEval uses AI to analyse code, tasks, project context, and review discussions — turning real engineering work into explainable productivity, quality, and risk signals.

Commits, hours, story points, and raw Git statistics show activity. DevEval shows delivery value.

From small teams to enterprise engineering organisations.

Start free trial → 30-day cloud trial · no credit card

Alice Kowalska

Senior · Backend · 2 yrs

Active

Score 79 · Org avg 60

Quality

Velocity

Stability

Collab

Cost

Elo · Code Quality

Last 30 days · Glicko-2

Q2 · 2026

1 Alice K. · 1842 ±64 · ↑ 24
2 Tomek W. · 1721 ±71 · ↑ 11
3 Bob R. · 1690 ±88 · ↓ 6
4 Marta J. · 1654 ±92 · ↑ 3
5 Dawid P. · 1598 ±105 · ↓ 12
6 Hanna L. · 1572 ±110 · ↑ 8
7 Piotr S. · 1549 ±96 · ↑ 5

↑ Last update 14m ago · 23 PRs analyzed

Read-only integration: your repository remains the source of truth

GitHub • GitLab • Bitbucket • Jira • Linear • Tempo • Confluence

01 Why DevEval

Because activity isn't value. And classic metrics only see activity.

Story Points, PR counts, commits, lines of code, hours logged — they all measure activity, not value. Two devs each ship 12 PRs. One refactored auth. The other changed copy on 12 buttons. Classic metrics call them equal. We don't.

The Same-Sprint Illusion

Same sprint, two devs. Alice owns the gnarly database migration nobody else dared to touch, plus three medium fixes — 25 CU delivered. Bob ships fifteen button-copy changes — 7 CU delivered. Both logged 8 days. Both closed roughly the same Story Points. By every traditional metric, they performed equally.

Hours & SP

tie

Looks fair on every dashboard

CU delivered

3.6×

Real difficulty gap — what DevEval sees
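The 3.6× figure follows directly from the numbers in the example. A minimal sketch of the arithmetic, using only the values quoted above (the variable names are illustrative, not part of DevEval):

```python
# Figures quoted in the same-sprint example; CU = Complexity Units.
alice_cu = 25      # database migration plus three medium fixes
bob_cu = 7         # fifteen button-copy changes
days_logged = 8    # both developers logged the same time

# Hours and Story Points tie, so the only visible gap is in CU delivered.
ratio = alice_cu / bob_cu
print(f"CU gap: {ratio:.1f}x")

# Per-day view: same time on the timesheet, very different difficulty solved.
alice_per_day = alice_cu / days_logged
bob_per_day = bob_cu / days_logged
print(f"Alice: {alice_per_day} CU/day, Bob: {bob_per_day} CU/day")
```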

Traditional | DevEval
Lines of code / commits | How much real difficulty did this dev actually solve
Story Points (team-relative) | Constant difficulty unit, comparable across teams
A single composite score per dev, no breakdown | "Better than 87% of org in code quality: top in Velocity, mid in Stability"
Single-number performance review | 5 dimensions: Quality / Velocity / Stability / Collab / Cost
Table of metrics, no explanation | "Alice > Bob: 5 reviews vs 1, lower quality variance"

Deep dive Value & ROI — the five-layer model →

02 The problem

Every metric you use today was broken by AI in 2025.

AI coding assistants distort every activity-based productivity signal. Lines, commits, hours, and story points are all noise now. Without a quality-aware, AI-resistant model, every performance review becomes a negotiation about the metric, not the work.

What organisations measure | Why it fails in 2026 | Concrete failure mode
Number of commits | Rewards activity, not impact. | 50 commits to fix a typo. 1 commit for a vertical slice. Same line.
Story Points | Subjective, inconsistent across teams and vendors. | Team A's "5 points" is Team B's "13". And nobody on either team can tell you why.
Lines of code | Punishes refactoring. AI writes lines for free. | -2,400 LOC of dead code removal scores the same as +2,400 LOC of copy-paste.
Time spent / hours billed | No link to difficulty. AI deflates the time signal further. | Senior closes a gnarly migration in a day. Junior spends the week shipping repetitive forms. Same week on the timesheet, same line on the invoice, yet the codebase moved very differently.
Velocity (points/sprint) | Gameable. No quality dimension. Rewards shipping bad code fast. | Doubled velocity this quarter by skipping reviews. The bugs land next quarter; the bonus was paid in this one.
Subjective manager evaluations | Bias-prone. Non-comparable across teams, BUs, vendors. | Five managers, five scales, five favourite people. Performance review = negotiation.

DevEval instead

1 axis of volume + 5 axes of expertise bonus, computed from the diff itself — not from what someone typed into a planning tool. AI-resistant by design.

How → CU

03 How it works

Context first. Then signal, not noise.

DevEval first learns what your projects actually are, then evaluates every merged PR against that context, then ranks developers head-to-head with reasoning. A built-in chat lets you skip clicking — it just reads the analysis already done for you.

01 on connect · per repo

Project context

It learns your codebase first.

Before any scoring, DevEval profiles each project — tech stack, criticality, era, complexity. A 3-line tweak in a simple CRUD app is not the same as one in a 24/7 banking core. Every later score is calibrated against this context.

02 every merged PR

Per-PR analysis

Each merged PR is read like a senior reviewer would.

Every merged PR runs through seven layers: difficulty scoring (CU on 6 axes), effort estimate, classification, code-quality review, stability check, risk scan, and review-value attribution. One verdict per PR. Bug attribution traces regressions back to the introducing PR.

03 Glicko-2 · 5 dimensions

Head-to-head

Pairs are compared, with reasoning.

"Alice > Bob in code quality, 87% confidence." Three rating systems for three questions. ROI per dev / project / client falls out of the same data.

04 optional shortcut

Chat

Click around, or just ask.

Every screen is fully clickable — drill down to any PR, review, or ranking yourself. Or ask the chat: it reads the same data we already produced and assembles it for you, so you don't have to traverse five views to answer one question.
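For the head-to-head step, one common way to turn two ratings with uncertainty into a win probability is the Glicko expected-outcome formula. The sketch below assumes that approach; the page does not say how DevEval derives its confidence figure, so treat the function names and the reading of "±" as a rating deviation as assumptions rather than the product's method.

```python
import math

Q = math.log(10) / 400  # Glicko scaling constant on the familiar 1500-based scale


def attenuation(rd: float) -> float:
    """Glicko g() factor: shrinks the effect of a rating gap as uncertainty grows."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)


def win_probability(r_a: float, rd_a: float, r_b: float, rd_b: float) -> float:
    """Expected outcome of A vs B, combining both rating deviations."""
    combined_rd = math.hypot(rd_a, rd_b)  # sqrt(rd_a^2 + rd_b^2)
    exponent = -attenuation(combined_rd) * (r_a - r_b) / 400
    return 1.0 / (1.0 + 10**exponent)


# Ratings taken from the sample leaderboard; "±" is ASSUMED to be the deviation.
p_alice = win_probability(1842, 64, 1690, 88)
p_bob = win_probability(1690, 88, 1842, 64)
print(f"P(Alice ranks above Bob) = {p_alice:.2f}")
```

Wider deviations pull the probability toward 0.5, which is why a large rating gap between two developers with few analysed PRs should still be read cautiously.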

04 Complexity Units

One scale. From a typo to a multi-sprint outlier.

CU is a measure of contribution per PR — comparable across people, teams, vendors, tools. Volume (scope) plus expertise bonuses on 5 axes. Same task = same CU, forever. AI-resistant by construction.

CU Scale · 0 → 30

Steps: 0, 0.25, 0.5, 1, 1.5, 2, … 30

2 CU · 1 CRUD endpoint
4 CU · 5 screens (pattern)
7.5 CU · Multi-layer feature
11.5 CU · Vertical slice
14 CU · Senior foundations
17.5 CU · Infra greenfield
20 CU · Mid-dev sprint · 10 MD
24.5 CU · Multi-sprint epic
30 CU · Outlier (rare)
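The anchor points on the scale work as a lookup table: given a PR's CU, find the closest named reference. A minimal sketch, where `CU_ANCHORS` and `nearest_anchor` are hypothetical names and only the anchor values come from the scale above:

```python
# Anchor points of the CU scale as (CU, reference work item).
CU_ANCHORS = [
    (2.0, "1 CRUD endpoint"),
    (4.0, "5 screens (pattern)"),
    (7.5, "Multi-layer feature"),
    (11.5, "Vertical slice"),
    (14.0, "Senior foundations"),
    (17.5, "Infra greenfield"),
    (20.0, "Mid-dev sprint · 10 MD"),
    (24.5, "Multi-sprint epic"),
    (30.0, "Outlier (rare)"),
]


def nearest_anchor(cu: float) -> tuple[float, str]:
    """Return the anchor closest to a CU value on the 0-30 scale."""
    if not 0 <= cu <= 30:
        raise ValueError("CU must be within the 0-30 scale")
    return min(CU_ANCHORS, key=lambda anchor: abs(anchor[0] - cu))


print(nearest_anchor(12))   # closest reference for a 12 CU pull request
print(nearest_anchor(25))   # closest reference for a 25 CU pull request
```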

CALIBRATION ANCHOR

20 CU ≈ 10 MD

solid sprint of mid-dev work, no AI

PATTERN-FOLLOWING

~1.5 CU / MD

sublinear discount for repetition

EXPERTISE-HEAVY

up to 3–4 CU / MD

small diff, large bonus axes

MAX SCALE

30 CU

multi-sprint outlier · rare
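The calibration figures above imply simple CU-to-man-day conversions. A rough sketch using only the quoted rates; the constant and function names are illustrative, and the real model may not convert linearly:

```python
# CU-per-man-day rates quoted in the calibration cards.
BASELINE_CU_PER_MD = 20 / 10            # 20 CU ≈ 10 MD of mid-dev work, no AI
PATTERN_CU_PER_MD = 1.5                 # pattern-following work
EXPERTISE_CU_PER_MD_RANGE = (3.0, 4.0)  # expertise-heavy work, "up to 3-4"


def man_days(cu: float, cu_per_md: float = BASELINE_CU_PER_MD) -> float:
    """Convert a CU total into equivalent man-days at a given rate."""
    if cu < 0 or cu_per_md <= 0:
        raise ValueError("cu must be >= 0 and cu_per_md must be > 0")
    return cu / cu_per_md


print(man_days(20))                     # the calibration anchor itself
print(man_days(6, PATTERN_CU_PER_MD))   # 6 CU of repetitive, pattern-following work

# Expertise-heavy work: the same CU can come from far fewer days.
low_rate, high_rate = EXPERTISE_CU_PER_MD_RANGE
print(man_days(12, high_rate), man_days(12, low_rate))
```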

Built from 1 volume axis + 5 expertise bonuses.

Each PR is scored 0–10 on every axis. The formula reconciles them into the final CU.

Scope · 0–10

Amount of work delivered, with sublinear discount for repetition — but not to zero.

Mid-band (5–6)

6–10 pattern-following units with different decisions

Top (9–10)

20+ pattern-following units OR pure-volume mass change across 2+ sprints

Anti-pattern

File count ≠ concern count. 30 files threaded with one boolean = scope 1–2, not 8.

Deep dive Complexity Units — the full scale →

05 Quality Gate

Shipping fast with bad code is not a win.

Two delivery patterns. Same team. Same sprint. Velocity metrics call Dev A the top performer — DevEval doesn't. Quality, stability, and collaboration weigh more than raw volume by design.

Volume-first

Q1 2026 · 1 sprint

Developer A

Ships fast. Cuts corners.

Output (CU) 50

Code quality 30 /100

Stability 30 /100

Reviews given 0 CU

Bugs introduced 6

DevEval score

pts

Velocity rewarded by every traditional metric.

7.4× score gap

Quality-first

Q1 2026 · 1 sprint

Developer B

Ships less code. Ships better code. Reviews peers.

Output (CU) 20

Code quality 90 /100

Stability 85 /100

Reviews given 10 CU · q. 80

Bugs introduced 1

DevEval score

pts

Less output, 7.4× the score. Quality, stability, and collaboration count more than raw volume.

Velocity doesn't override quality

Code quality and stability multiply into the final score. Cutting corners costs more than it earns.

Reviews are first-class output

Reviewing a peer's PR with a quality-80 review counts as contribution. Helping the team isn't invisible work.

Bugs trace back to the author

Regressions are attributed to the introducing PR. Fast-and-broken doesn't hide behind sprint boundaries.
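To see why multiplying quality and stability into the score punishes cutting corners, here is a toy model. The formula is an assumption made up for illustration: it is not DevEval's scoring formula, it ignores bug attribution, and it will not reproduce the score gap shown above. Only the input figures for Developer A and Developer B come from the comparison.

```python
from dataclasses import dataclass


@dataclass
class SprintStats:
    output_cu: float             # CU delivered in the sprint
    code_quality: float          # 0-100
    stability: float             # 0-100
    review_cu: float = 0.0       # CU of reviews given
    review_quality: float = 0.0  # 0-100


def illustrative_score(s: SprintStats) -> float:
    """HYPOTHETICAL formula: quality and stability scale output multiplicatively."""
    own_work = s.output_cu * (s.code_quality / 100) * (s.stability / 100)
    reviews = s.review_cu * (s.review_quality / 100)
    return own_work + reviews


dev_a = SprintStats(output_cu=50, code_quality=30, stability=30)
dev_b = SprintStats(output_cu=20, code_quality=90, stability=85,
                    review_cu=10, review_quality=80)

score_a = illustrative_score(dev_a)
score_b = illustrative_score(dev_b)
print(f"Developer A: {score_a:.1f}, Developer B: {score_b:.1f}")
```

Even in this simplified form, Developer A's 2.5× output advantage is wiped out because two low multipliers compound, while Developer B's review work adds to the total instead of going unseen.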

Deep dive Scoring model — Score vs Absolute Points →

06 Use cases

Five places where the same data pays off.

One platform. One scoring model. Five concrete decisions it informs — from vendor renewals to hiring. The data is the same; the lens changes.

Vendor performance

Compare software houses on the same scale.

Challenge

You work with multiple software houses delivering against the same backlog. Comparing them today means comparing apples to oranges — different stacks, different reporting, different "definitions of done".

Outcome

Vendor selection and renewal decisions backed by quarter-over-quarter, evidence-grade data.

What DevEval enables

Standardised scorecards

Same 5 dimensions, same CU scale, normalised for project difficulty.

Cross-vendor benchmarking

Fair, statistically valid comparisons — confidence interval, not gut feeling.

Underperformance alerts

Auto-flagged when a vendor's quality or stability degrades quarter over quarter.

Contract & SLA evidence

Defensible numbers in QBR meetings. Not narratives.

07 Three ratings + one stat

Three rating systems. Not a bug — a feature.

A chess grandmaster has three numbers: Elo 2400, top 5%, and games played. Each answers a different question. DevEval works the same way. On top of that comes a raw stat, the Productivity Index, which combines difficulty (CU) with working days. Not a ranking, just a measurement.

◇ Absolute Points

Per hour 4.31 · 4-period avg 343 · vs org +0%

758

↗ Rising

↑ +539 (+246.5%)

POINTS HISTORY

BREAKDOWN BY DIMENSION

Worked example

The whole org adopts AI tools. Everyone gets 2× faster.

Each rating system reacts differently — that's why you need all of them.

Alice's Glicko-2

1842 → 1842

±0

relative — nothing changes

Alice's percentile

95th → 95th

±0

relative — nothing changes

Alice's points

1500 → 3200

+113%

absolute — the org moved
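The worked example rests on a simple property: a uniform speed-up changes everyone's absolute totals but leaves their order untouched. A small sketch with hypothetical point totals (the names and numbers are invented for illustration, and the speed-up is modelled as an exact 2× factor):

```python
def percentile_rank(points: dict[str, float], name: str) -> float:
    """Share of the org (in percent) with strictly fewer points than `name`."""
    mine = points[name]
    below = sum(1 for value in points.values() if value < mine)
    return 100 * below / len(points)


# HYPOTHETICAL absolute points before the org adopts AI tools.
before = {"alice": 1500, "tomek": 1100, "bob": 900, "marta": 700, "dawid": 400}

# Everyone gets 2x faster: every absolute total doubles.
after = {name: 2 * pts for name, pts in before.items()}

print(before["alice"], "->", after["alice"])    # absolute points move
print(percentile_rank(before, "alice"), "->",
      percentile_rank(after, "alice"))          # relative standing does not
```

Pairwise ratings such as Glicko-2 behave like the percentile here: they depend on who outperforms whom, so a shift shared by the whole org cancels out.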

Deep dive Scoring model — Score vs Absolute Points →

Deep dive Head-to-head ranking — how matches work →

08 Deployment options

SaaS by default. Self-hosted when you need it.

Most teams run DevEval as a managed service and are productive in hours. Enterprises with stricter requirements deploy it inside their own perimeter — same product, same scores, your infrastructure and your AI key.

SaaS · managed by us

DEFAULT

Default · fastest path to value

We run DevEval for you in EU regions. Sign up, connect a repo, see results in hours. Best for teams who want zero ops overhead and standard governance.

EU-hosted · Managed updates · Pooled AI cost

Self-hosted · your infra

ENTERPRISE

Enterprise option · your perimeter

Deploys into your environment — On-Premise (VM / Kubernetes), Azure AKS, or AWS EKS. You bring your own AI provider key. Source code and analysis stay inside your boundary.

On-Prem · AKS · EKS · BYO AI key

Hybrid · scoped per client

CUSTOM

When the standard options don't fit

Air-gapped pilots, regulated industries, multi-region deployments, custom integrations. Scoped in a discovery call — quoted per engagement.

Custom scope · Air-gapped · Regulated

Self-hosted data flow

Repository processing runs in your environment; AI traffic follows the agreed deployment architecture.

For Enterprise deployments

Your Git

GitHub · GitLab · Bitbucket

→

PR diff + metadata

pulled via API — no agents

→

DevEval

inside your perimeter

→

AI provider

your key · direct / Bedrock / Vertex / Foundry

→

Scores & dashboards

in your environment

YOUR PERIMETER · The 3 boxes in the middle stay inside your environment. The AI key is yours.

OAuth 2.0 · RBAC · financial masking · full audit trail

YOUR INFRA

On-Prem · AKS · EKS

or air-gapped on request

YOUR AI KEY

BYO provider key

direct · Bedrock · Vertex · Foundry

YOUR DATA

No code exfil

only PR metadata + scores

YOUR SLA

Dedicated terms

scoped per engagement

Deep dive Security & deployment — what crosses the boundary →

09 Pricing

Per active developer. Self-hosted on Enterprise.

30-day trial without a card. After that your account turns read-only — nothing disappears. Enterprise is scoped and quoted per client.

Trial

0 free

30 days · no card

5 devs

Up to 30 PRs / dev

Cloud product access · read-only after

Start trial

Starter

16 EUR / dev / mo

Monthly billing

Up to 10 devs

Up to 30 PRs / dev / mo

Email support · 12-mo retention

Choose Starter

Stop measuring lines of code.
Start measuring value.

Connect a repo in 15 minutes. Historic PRs backfilled the same day. Full ratings and ROI ready in hours.

Start free trial →