FACET

Code quality is a profile, not a score.

A single grade for “good code” is a category error. The things we call quality pull against each other, and the right balance depends on what the code is for. Facet measures all of them, works out the profile your code was actually built for, and reports the gaps against that, not against some universal ideal that does not exist.

There is no universal best

Push runtime performance to its limit and readability usually suffers. Maximise portability and you tend to leave performance on the table. Optimise for prototype speed and you defer tests, observability, and hardening on purpose. None of that is failure. They are tradeoffs, and a good engineer makes them deliberately.

So “is this good code?” is the wrong question. The one you can actually answer is “is this code well-matched to what it is for?” A throwaway analysis script that skips tests is passing, not failing. A security boundary that skips input validation is failing, however fast it runs. Quality only means something once you know the intent.

Why 14 dimensions

Collapse quality into one number and you hide the part you actually need. Facet measures fourteen dimensions separately, so the tradeoffs stay visible instead of averaging into mush. Together they describe the shape of a piece of code, its profile.

D1 Runtime performance. Minimising latency and maximising throughput on the execution path.
D2 Memory efficiency. Minimising peak and steady-state memory footprint.
D3 Readability & comprehensibility. Minimising the cognitive load on a competent reader understanding intent and mechanism.
D4 Maintainability & extensibility. Minimising the cost of future change.
D5 Robustness & defensive correctness. Behaving acceptably under invalid input, partial failure, and hostile conditions.
D6 Security. Resisting adversarial input and protecting secrets, with fail-closed defaults.
D7 Portability & dependency minimalism. Running across environments with minimal external requirements.
D8 Development speed & prototype economy. Minimising time from intent to a working artefact, accepting bounded, deliberate debt.
D9 Testability. Minimising the cost of verifying behaviour.
D10 Observability & debuggability. Minimising time-to-diagnosis in operation.
D11 Auditability & compliance. Supporting external verification that the code does what is claimed, and only that.
D12 Concurrency safety & scalability. Correctness under parallel execution and growth under load.
D13 API ergonomics & interface stability. For library and SDK code, minimising consumer error and breakage.
D14 Resource cost & energy efficiency. Minimising compute spend in deployment (cloud cost, battery, thermal).

From that fourteen-dimension fingerprint, Facet infers the profile your code most resembles (a hot path, a public library, a regulated core, a security boundary, and so on) and shows you where it diverges from the profile it appears built for.

Three kinds of finding

Violations (harm pole). Negative-polarity indicators that fired on your code: a harmful pattern is actively exhibited (e.g. eval on input, tls_verification_disabled, unbounded concurrent spawn). These are concrete, line-citable defects under any profile and are reported as a separate count, never subtracted from the capability score. Absence of a positive is not the same thing as presence of a negative, so the two are never netted.

Internal incoherence. Code that contradicts itself, where a higher-ladder practice sits over a missing lower-ladder foundation. Under compensatory scoring (the 13 formative dims) the satisfied higher rung is credited; the diagnostic flags it as a possible cargo-cult pattern worth a human glance, not a penalty. Under cumulative scoring (D8 prototype economy only) the cap holds.

Fit gaps. Most other findings are conditional on intent: to reach what this profile asks for, address X. These are suggestions about fit, not universal verdicts. Code that scores low on a dimension your profile does not prioritise is not wrong. It is optimised somewhere else on purpose.
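As a rough illustration of how a fit gap differs from a universal verdict, the sketch below compares measured levels against the levels a profile asks for. The function name, the profile targets, and the measured levels are all hypothetical, chosen only to show the idea; the text above does not say how Facet represents profiles internally.

```python
def fit_gaps(measured, targets):
    """List dimensions where the measured level falls short of the profile target.

    Dimensions the profile does not prioritise (absent from `targets`) and
    dimensions with no surface (level is None) never produce a gap.
    """
    gaps = []
    for dimension, target in targets.items():
        level = measured.get(dimension)
        if level is None:
            continue  # no surface: nothing to compare
        if level < target:
            gaps.append((dimension, level, target))
    return gaps


# Hypothetical "security boundary" profile: only D5 and D6 are prioritised.
security_boundary_targets = {"D5": 4, "D6": 4}

# Hypothetical measured levels on a five-rung scale.
measured_levels = {"D1": 5, "D3": 1, "D5": 4, "D6": 2}

for dimension, level, target in fit_gaps(measured_levels, security_boundary_targets):
    print(f"to reach what this profile asks for, raise {dimension} from {level} to {target}")
```

In this toy case the low D3 level produces no finding, because the hypothetical profile does not prioritise readability; only the D6 shortfall is reported.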

The dimensions are formative, not reflective

This is the most important methodological finding of 2026-06. In classical measurement theory, a reflective construct is a latent trait that causes its indicators (depression causes each symptom); a formative construct is constituted by its indicators (socio-economic status is composed of income + education + occupation, which need not correlate). The two call for different validity tests and different scoring models.

A within-dimension unidimensionality test on a real-code corpus (inter-feature correlation and first-eigenvalue share) showed that 13 of 14 dimensions are formative composites: their features are genuinely independent properties that constitute the construct rather than symptoms of one latent trait. Only D8 (prototype economy) is unidimensional. The dimensions themselves are distinct: no inter-dimension pair has |r| ≥ 0.70 on real code, the ~6-factor scree shows no dominant general-quality halo (first factor only ~28% of variance), and the indicator overlaps a casual reader might predict (readability vs maintainability, robustness vs security) do not translate into co-moving scores.

Consequence for scoring. A cumulative Guttman ladder over independent features mis-scores (it caps at the first unsatisfied rung even when higher rungs are independently satisfied). A graded-response IRT model assumes the very unidimensionality the data deny. Compensatory counting is the theoretically correct model for a formative composite, not merely the cheaper one. The 13 formative dimensions use compensatory; D8, legitimately unidimensional, keeps cumulative.
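The difference between the two scoring models is easiest to see on one ladder. The sketch below is illustrative only: the function names and the example ladder are assumptions, and the last helper mirrors the internal-incoherence diagnostic described earlier (a satisfied rung sitting above a missing one).

```python
def compensatory_level(rungs):
    """Count every satisfied rung, regardless of order."""
    return sum(1 for satisfied in rungs if satisfied)


def cumulative_level(rungs):
    """Count satisfied rungs from the bottom, stopping at the first gap."""
    level = 0
    for satisfied in rungs:
        if not satisfied:
            break
        level += 1
    return level


def incoherent_rungs(rungs):
    """Indices of satisfied rungs that sit above an unsatisfied lower rung."""
    flagged = []
    gap_seen = False
    for index, satisfied in enumerate(rungs):
        if not satisfied:
            gap_seen = True
        elif gap_seen:
            flagged.append(index)
    return flagged


# Hypothetical five-rung ladder, lowest rung first; the second rung is missing.
ladder = [True, False, True, True, False]

print(compensatory_level(ladder))  # 3
print(cumulative_level(ladder))    # 1
print(incoherent_rungs(ladder))    # [2, 3]
```

Under compensatory counting the two higher rungs are credited and merely flagged for a human glance; under the cumulative model the level is capped at the first unsatisfied rung.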

How a dimension scores: two sub-scores, never netted

Each dimension reports two numbers kept separate:

  • Capability count. Positive indicators marked present — the good things the code does (e.g. parameterised_queries, resource_cleanup_all_paths, bounded_parameter_count).
  • Violation count. Negative-polarity indicators marked present — harmful patterns it exhibits.

A feature-rich-but-vulnerable file and a minimal-but-clean file are genuinely different and should not collapse to the same number. The five-rung level is preserved as a familiar overall readout (count of satisfied rungs under compensatory; contiguous cap under cumulative); the two sub-scores are what make the level interpretable.

When every feature of a dimension is not applicable (e.g. security on a throwaway with no attack surface), the dimension reports no surface (N/A) and is excluded from profile-fit. A security-irrelevant script is not graded as “insecure” for lacking authorisation on code that has none.
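A minimal sketch of the two sub-scores and the no-surface case, under assumed data structures: the Feature class, the polarity labels, and the score_dimension function are hypothetical; the indicator names are the examples quoted above.

```python
from dataclasses import dataclass

PRESENT, ABSENT, NOT_APPLICABLE = "present", "absent", "n/a"


@dataclass(frozen=True)
class Feature:
    name: str
    polarity: str  # "positive" (capability) or "negative" (violation)
    reading: str   # PRESENT, ABSENT, or NOT_APPLICABLE


def score_dimension(features):
    """Return separate capability and violation counts, or None for no surface."""
    applicable = [f for f in features if f.reading != NOT_APPLICABLE]
    if not applicable:
        return None  # no surface: excluded from profile-fit, not graded low
    capability = sum(1 for f in applicable
                     if f.polarity == "positive" and f.reading == PRESENT)
    violations = sum(1 for f in applicable
                     if f.polarity == "negative" and f.reading == PRESENT)
    return {"capability": capability, "violations": violations}


rich_but_vulnerable = [
    Feature("parameterised_queries", "positive", PRESENT),
    Feature("resource_cleanup_all_paths", "positive", PRESENT),
    Feature("tls_verification_disabled", "negative", PRESENT),
]
minimal_but_clean = [
    Feature("parameterised_queries", "positive", PRESENT),
    Feature("resource_cleanup_all_paths", "positive", ABSENT),
    Feature("tls_verification_disabled", "negative", ABSENT),
]
throwaway = [Feature(f.name, f.polarity, NOT_APPLICABLE) for f in minimal_but_clean]

print(score_dimension(rich_but_vulnerable))  # {'capability': 2, 'violations': 1}
print(score_dimension(minimal_but_clean))    # {'capability': 1, 'violations': 0}
print(score_dimension(throwaway))            # None
```

Netting the two counts would give both of the first two files the same value of 1, which is exactly the collapse that keeping them separate avoids.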

Reliability: the four-bar gate

A dimension is labelled reliable only when its judge clears four bars on a held-out minimal-pair bundle:

  • Surface-weighted agreement ≥ 0.60 — the judge tracks a defensible answer key rather than guessing.
  • Minority-class recall ≥ 0.50 — the rare degradation is caught, not majority-guessed away.
  • Schema coverage ≥ 0.80 — the judge returns a parseable, usable reading on the large majority of samples.
  • Test-retest reproducibility ≥ 0.90 — the same code profiled twice yields the same feature vector.
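The four bars amount to a conjunction of threshold checks. The sketch below uses the thresholds listed above; the metric key names, the gate function, and the example metric values are assumptions for illustration, and the "provisional" label is the one this page uses for a dimension that is not reliable.

```python
# Thresholds as listed above; the key names are illustrative.
THRESHOLDS = {
    "surface_weighted_agreement": 0.60,
    "minority_class_recall": 0.50,
    "schema_coverage": 0.80,
    "test_retest_reproducibility": 0.90,
}


def gate(metrics):
    """Return (label, failed_bars) for one judge on one dimension."""
    failed = [name for name, bar in THRESHOLDS.items() if metrics[name] < bar]
    label = "reliable" if not failed else "provisional"
    return label, failed


# Hypothetical measurements: high agreement, but the rare class is missed.
candidate = {
    "surface_weighted_agreement": 0.85,
    "minority_class_recall": 0.40,
    "schema_coverage": 0.95,
    "test_retest_reproducibility": 0.93,
}

print(gate(candidate))  # ('provisional', ['minority_class_recall'])
```

A judge has to clear every bar; a strong result on one metric cannot compensate for a miss on another.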

All 14 dimensions clear all four bars on the 2026-06-18 expanded-instrument requalification, each on an open-weight, on-prem-deployable judge. Reliability holds on real code, not just crafted items: a real-code test-retest probe (k=3 over a 50-file corpus) found mean feature-stability 0.938, in the crafted-bundle baseline range.

A minority-recall threshold is what most LLM-judge work omits and the one that matters most: a judge that says “present” to everything scores high agreement on healthy code yet never catches a degradation. Our gate rejected gemma on auditability (high agreement but minority-recall 0.40) in favour of qwen (minority-recall 0.80). Reliability over raw agreement.
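A small worked example shows why agreement alone misleads. It uses plain, unweighted agreement as a stand-in for the surface-weighted version, and the answer key is made up: eight healthy items and two degraded ones, judged by a model that answers “present” to everything.

```python
from collections import Counter


def agreement(key, judged):
    """Fraction of items where the judge matches the answer key."""
    return sum(k == j for k, j in zip(key, judged)) / len(key)


def minority_class_recall(key, judged):
    """Recall on whichever class is rarest in the answer key."""
    counts = Counter(key)
    minority = min(counts, key=counts.get)
    hits = sum(1 for k, j in zip(key, judged) if k == minority and j == minority)
    return hits / counts[minority]


answer_key = ["present"] * 8 + ["absent"] * 2
always_present = ["present"] * 10

print(agreement(answer_key, always_present))              # 0.8
print(minority_class_recall(answer_key, always_present))  # 0.0
```

The judge clears a 0.60 agreement bar comfortably while catching none of the degradations, which is the failure the minority-recall bar exists to reject.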

Construct validity: what we have, what we don’t

Reliability says the instrument is consistent. Construct validity asks whether the dimensions actually measure what they claim. We report this separately so the labels stay honest. The evidence today (Lane A, exploratory):

  • Content validity. A per-dimension content census against ISO/IEC 25010 + OWASP-CWE + 12-factor + FinOps + linter rules produced ~127 candidate indicators; the highest-confidence subset (53) was materialised and each was individually qualified with a minimal-pair bundle for judge-discrimination before scoring real data. 53 of 54 cleared the gate; the one that failed was removed, not kept for completeness.
  • Dimensionality. No inter-dimension pair has |r| ≥ 0.70, and the scree suggests ~6 weakly correlated factors with no halo (first factor ~28% of variance): 14 distinct dimensions that organise into about six families, neither 14 fully independent constructs nor one general-quality factor.
  • Known-groups discrimination. Authored high-quality vs degraded exemplars separate monotonically under the production judges on the dimensions whose absolute level was least certain (D1 +2, D5 +4, D6 +2, D9 +5, 4 of 4).
  • Convergent anchoring. D6 security features overlap Bandit / Semgrep rules and are cross-checked against static-analyser output; new D6 violations carry CWE/OWASP identifiers.
  • Scoring-model fit. Compensatory + cumulative is correct by direct evidence about the constructs, not by convention.

Open work (Lane B, confirmatory). Most Lane A studies used a single open judge and a modest real-code corpus (N ≈ 47–50). Confirmatory work needs a second independent judge, a larger held-out corpus, parallel analysis for factor retention, formal inter-rater κ, and formal DIF across languages/styles. None of these undercut the claims; they bound how strongly the claims may be stated today.

What this instrument does not do

Defensibility is as much about disclosed limits as positive results.

  • It does not measure functional correctness. The judge extracts feature presence, never whether the code computes the right answer. Broken code with all the right machinery can score well on the structural dimensions. Correctness needs an oracle or test-execution gate — a different instrument. We state this plainly rather than pretend the rubric captures it.
  • One judge per dimension. A known, accepted simplification. We mitigate run-noise on the noisier judges with an ensemble (majority vote across N extractions), but we do not run multiple independent judges per dimension; cross-judge convergent validity is on the Lane B list.
  • Floor and ceiling effects are interpreted, not papered over. D10 (observability) and D14 (resource cost) score low on most real code; we verified that this is a true base rate (most code lacks observability instrumentation and paid-resource surfaces), not an instrument defect — the crafted bundles discriminate perfectly. D3 readability runs near-ceiling on real code for the same kind of reason (most code is readable on the standard features); the original D3 features sit at borderline crafted minority-recall, which is queued for ensemble extraction.
  • An indicator that could not be measured reliably was removed. One change-amplification feature (cross-site dispersion) failed qualification — the judge marked the violation absent on the variant and present on the clean gold. After one reword it was dropped. We do not score what we cannot measure.

How the measurement works (the engine)

  • The model finds, the code decides. A model reads your code and extracts atomic, line-cited features (present / absent / not applicable, with a one-clause basis). It never assigns a score. Deterministic code, version-controlled and inspectable, computes the score from the feature vector, so two runs of the same code agree.
  • Claims are not evidence. Every prompt enforces that names, comments and docstrings are claims, not evidence: a function named sanitize_input that does not sanitise is absent. This is the explicit defence against the dominant LLM-grading failure mode (being fooled by reassuring surface language).
  • One judge per dimension. No single model is best at everything; each dimension is routed to the model that proved most reliable at measuring it.
  • Chunked extraction where the prompt got too long. Five dimensions (D1, D7, D11, D12, D14) split their feature list into two extraction calls and merge the result, because the indicator expansion enlarged their prompts to the point where per-feature attention diluted. The chunking restored minority-recall on each of those dims. The fix was measured and validated before shipping, not assumed.
  • Provider-pinned + seeded judge calls. Different OpenRouter backends of “the same” model can give different greedy outputs; we pin one provider and pass a fixed seed so the extraction is as stable as the provider allows, then majority-vote across N extractions to mop up residual flicker.
  • Reliable or provisional, never silent. Every dimension carries an explicit label. If a future revision dips a judge below the bar, the dim becomes provisional rather than disappearing or pretending.
  • Your code is never stored. We keep secret-scrubbed measurements and a content hash, scoped to your account. The source itself is never written down.
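Two of the mechanisms in this list, merging chunked extractions and majority-voting across N extractions, can be sketched as plain deterministic functions over feature vectors. This is an assumed shape, not Facet's code: the function names and the dict-of-readings representation are hypothetical, and the feature names are examples quoted elsewhere on this page.

```python
from collections import Counter


def merge_chunks(*chunks):
    """Merge feature vectors from extraction calls over disjoint feature lists."""
    merged = {}
    for chunk in chunks:
        overlap = merged.keys() & chunk.keys()
        if overlap:
            raise ValueError(f"feature extracted in more than one chunk: {sorted(overlap)}")
        merged.update(chunk)
    return merged


def majority_vote(extractions):
    """Combine N feature vectors by per-feature majority.

    Ties resolve to the reading seen first, so an odd N is the safer choice.
    """
    merged = {}
    for feature in extractions[0]:
        votes = Counter(run[feature] for run in extractions)
        merged[feature] = votes.most_common(1)[0][0]
    return merged


# Three hypothetical runs; each run is two chunked calls merged into one vector.
runs = [
    merge_chunks({"parameterised_queries": "present"}, {"tls_verification_disabled": "absent"}),
    merge_chunks({"parameterised_queries": "present"}, {"tls_verification_disabled": "present"}),
    merge_chunks({"parameterised_queries": "absent"}, {"tls_verification_disabled": "absent"}),
]

print(majority_vote(runs))
# {'parameterised_queries': 'present', 'tls_verification_disabled': 'absent'}
```

Because both steps are ordinary code with no model in the loop, the same set of extractions always produces the same merged vector, which is what lets the scoring stage stay deterministic.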
