Independent HCI Research · 2025–2026 · Live study

everydayservices.studyOpen data collection

Users finish the task. They still cannot explain what happened.

A between-subjects experiment across banking, healthcare, and government services — isolating interface language to measure the gap between task completion and user comprehension. Validating cognitive distance as a construct distinct from cognitive load.

Role

Researcher & EngineerEnd-to-end platform owner

Format

Between-subjects A/BPlain vs institutional copy

Locales

EN · ES · PT-BRShell copy identical A/B

Length

~15 minutes3 simulated tasks per participant

  • 3

    Service domains

    Banking, medical, license renewal

  • 5

    Primary hypotheses

    H1 to H5, pre-registered directional

  • 4

    Cognitive distance dimensions

    Understanding · prediction · confidence · transparency

  • 2

    Isolated arms

    Same clicks, fields, outcomes. Words only.

TL;DR

The gap performance metrics miss

People send money, book doctors, renew licenses — and leave the screen unsure of what just happened. The standard explanation is cognitive overload. But the user who finishes without errors and still cannot say whether the money has left the account was not overloaded. They were not informed.

This is a live international experiment that makes that gap visible. I designed the construct (cognitive distance), operationalised it into four dimensions, pre-registered five hypotheses, and shipped the platform collecting data today. The conceptual paper is under peer review.

The result is a research instrument for questions performance metrics have ignored for thirty years — and a demonstration of taking a construct from theory to instrument to data.

01 · The reframe

Load measures effort. Distance measures understanding.

Cognitive load theory (Sweller, 1988) gave HCI a vocabulary for capacity: working memory is finite, and tasks that exceed it fail. Decades of design advice followed — simplify, reduce, declutter.

Yet the most consequential usability failures persist in systems that have already lowered load. A user completes a transfer in three taps, sees the confirmation, and still cannot say whether the money has left. The interface was not too heavy. It was too opaque — the user executed system logic they could not parse.

That is the difference this study operationalises. Cognitive load measures how much effort a task demands. Cognitive distance measures whether the user can explain, predict, and transfer the system's logic into their own reasoning. Load is about capacity. Distance is about alignment.

Two phenomena, two different problems

Cognitive load measures effort. Cognitive distance measures alignment.

The traditional frame

Cognitive Load

the working memory cost of acting

  • AsksHow much effort does this task demand?
  • MeasuresMental resources consumed per task
  • Unit of analysisThe single interaction
  • InadequacyA simple task can still be misunderstood

The reframing

Cognitive Distance

the gap between system logic and user model

  • AsksCan the user explain, predict, and transfer?
  • MeasuresAlignment between system and mental model
  • Unit of analysisThe relationship, over time
  • CapturesSuccess without comprehension. The blind spot.

02 · Why it matters

Completion rate hides comprehension failure

Usability research relies on time-on-task, completion rate, and error frequency. ISO 9241-11 codified them; Nielsen operationalised them; most product teams use them weekly. They are robust — and none of them answers: did the user understand what just happened?

A high completion rate can sit on widespread misunderstanding. A low error rate can co-exist with users who cannot say whether their task succeeded. The proxies register success while interpretation collapses.

The cost falls hardest on users with the least interpretive surplus: people with cognitive disabilities, neurodivergent users, older adults, anyone working under time pressure or in a second language. For them, opacity translates directly into exclusion, dependence, and repeated error.

Production welcome screen in English with split layout: narrative left, session card with numbered tasks right

The study's production welcome screen, in English.

eConsent visual language: elevated session card, three-line headline, sticky agree button. Reads as international research, not consumer marketing.

03 · The framework

Four dimensions that make distance measurable

Moving cognitive distance from slogan to construct requires treating it as a multidimensional variable. Four observable dimensions — drawn from cognitive ergonomics and HCI — triangulate the construct. Each is independently measurable. Together they form a Cognitive Distance Index (CDI) portable across studies, product lines, and design reviews.

Operational framework

Four dimensions, one construct

Each dimension is observable; together they triangulate cognitive distance.

01

Perceived Understanding

Can users explain what happened in their own words?

Self-assessed clarity. The felt sense of having a model.

02

Predictive Accuracy

Can users anticipate outcomes before acting?

Recurrent surprise is a sign of misalignment, not surprise itself.

03

Interaction Confidence

Do users decide without recurrent hesitation?

Stable decisions indicate an internalised model. Fragile ones don't.

04

Conceptual Transparency

Does the system make its causal links visible?

The interface's burden, not the user's, not the model's.

Method-agnostic: surveys, prediction tasks, behavioural coding, retrospective explanation.

Load and distance can dissociate

A task can be demanding and still transparent; effortless and still opaque. The quadrant below maps those four combinations. The study's central prediction is that Version B (institutional register) shifts participants into the high-load, high-distance quadrant, without changing the behavioural completion rate.

Analytical dissociation

Four logically possible combinations

High loadCognitive load increases upward · Cognitive distance increases rightwardHigh distance

Demanding, transparent

Long, well-explained flow (e.g. tax filing, walked step-by-step).

Q2

Demanding, opaque

Document submission under hidden rules (Version B targets this quadrant).

Study's central observation

Brief, predictable

One-click confirmation, obvious consequences.

Q1

Easy, inexplicable

Single click accepted or rejected with no intelligible reason.

Q4

Low loadLow distanceHigh distance

From score to action: four diagnostic bands

The CDI is diagnostic, not comparative. Each band carries a recommended response: Preferred (maintain), Watch, Action needed, and Critical (governance-level redesign).

Cognitive Distance Index · Interpretation

Four bands, four design actions

A diagnostic protocol, not a single comparable score. Bands travel with a recommended response.

01 · Preferred

Conceptual alignment is strong.

Users can predict, explain, transfer. Maintain and monitor for drift.

Lower CDI

02 · Watch

Early signs of misunderstanding.

Confined to specific tasks or groups. Refine labels, feedback, and guidance.

03 · Action needed

Misunderstanding is repeated.

Affects usability and accessibility. Prioritise redesign and validate iteratively.

04 · Critical

Misalignment is severe.

Disrupts interaction. Launch governance-level redesign and remediation.

Higher CDI

04 · Study design

One variable, two arms, three domains, five hypotheses

The design is deliberately austere. Interface language register is the only variable that moves. Every click, every field, every outcome stays identical. If the arms diverge on completion time or effort, the manipulation failed — and manipulation checks guard exactly that.

Study architecture

One variable. Two arms. Three domains. Five hypotheses.

Participant → Balanced assignment

Silent variant assignment (least-used arm) + balanced module-order rotation.

Device-local duplicate detection · One variant per participant, all three modules.

Arm A · Plain language

Explains what, why, and what next.

Same clicks, same fields, same outcomes as Arm B.

Arm B · Institutional register

Bureaucratic labels, no explanations.

Same clicks, same fields, same outcomes as Arm A.

Module 01

Banking

Transfer to a new recipient

Module 02

Medical

Book a specialist appointment

Module 03

License

Renew with prefilled data

Per-module measurement

Prediction MCQs

Comprehension MCQs

Confidence

Open explanation

Post-task scales

Between-subjects, not within. Each participant sees one register across all three modules. Once someone decodes an institutional phrase, they carry that decoding forward — between-subjects is the only design that keeps the register stable.

Balanced, not random. Silent server-side assignment picks whichever arm has fewer starters (ties break randomly). Module order is balanced across six rotations of [banking, medical, license]. Device-local duplicate detection prevents re-entry.

Three domains, one protocol. Online banking (transfer), medical appointments (specialist booking), and driver license renewal (pay, review, mail). Each module follows the same arc: task intro, prototype flow, prediction MCQs, result screen, comprehension MCQs, post-task scales.

Hypotheses

Five primary, four manipulation checks

All pre-registered as directional. Checks guard against confounds: if the flows themselves feel different, the manipulation failed its isolation test.

Primary hypotheses · Per-module

H1

Perceived understanding

Self-reported clarity, 1 to 7

Plain > Institutional

H2

Open explanation accuracy

Blind rubric, 0/1/2 per fact

Plain > Institutional

H3

Prediction accuracy

MCQs before result

Plain > Institutional

H4

Confidence calibration

Confidence × correctness

Plain better calibrated

H5

Task-done certainty

Single-item, post-task

Plain > Institutional

Primary hypotheses · Per-module
IDHypothesisMeasureDirection
H1Perceived understandingSelf-reported clarity, 1 to 7Plain > Institutional
H2Open explanation accuracyBlind rubric, 0/1/2 per factPlain > Institutional
H3Prediction accuracyMCQs before resultPlain > Institutional
H4Confidence calibrationConfidence × correctnessPlain better calibrated
H5Task-done certaintySingle-item, post-taskPlain > Institutional

Manipulation checks

Must not differ.

If the two arms feel different on these, the manipulation failed.

  • Ease (1 to 7)
  • Effort (1 to 7)
  • Completion time
  • Three agree-items

Direction pre-registered. Inferential tests run on exported CSV. Descriptive comparisons ship in-product.

Manipulation checks guard isolation. If the arms diverge on ease, effort, or completion time, something other than language moved. The platform flags this in real time on the researcher panel. Failed checks trigger sensitivity analysis.

Version A · Plain language

Explains what each step does, why information is needed, and what happens after submission. Result screens name the next actor and timeline in everyday terms. Same steps and final state as Version B.

Version B · Institutional register

Same flows and outcomes. Labels mirror bureaucratic systems ("Operation processed", "Pending allocation"); explanatory panels removed; result copy assumes institutional literacy. No extra clicks — only words change.

05 · Platform engineering

Built for real participants on real devices

A good research design means nothing without a platform participants trust, on the devices they use, in the languages they read. Building it end-to-end — front-end, back-end, deployment, researcher panel, walkthrough mode — was the bulk of the work. Six decisions mattered most.

One-click consent, GDPR / LGPD aligned

No localStorage before consent. No cookies. No tracking pixels. A compact summary line, a collapsible disclosure, and a single 'Agree and start' backed by `consented_at` and `consent_version` on the participant row. End-of-study data choice: keep or delete, one click each.

Silent balanced assignment

A server-side route reads current `module_runs` counts, picks the least-used arm and the least-used module order, and writes `variant` + `order` atomically. Participants never see the assignment. The field never sees a 60/40 imbalance.

Mobile-first from 320px to 1440px

Every phase audited on iPhone SE, iPhone 14 Pro, iPad, and desktop. Bottom sheet on mobile (up to 55vh) and a 420px dock on desktop, so the background never reflows when questions open. WCAG 2.2 AA: fieldset/legend, 44px targets, focus management, `aria-live` progress.

Three locales, one manipulation system

EN, ES, and PT-BR. Locale detection happens pre-consent (in memory only); persistence starts after agreement. Shell copy is identical across arms; only manipulation copy differs. A mistranslated button cannot become a hidden variable.

Transparent panel analytics

Passcode-protected researcher panel with H1 to H5, manipulation checks, and drill-downs to participant rows. Every chart links to 'How is this computed?' and a rule-based verdict. Descriptive comparisons ship in-product; inferential tests run on exported CSV.

Documented design system + walkthrough mode

About 35 documented components cover the experiment funnel: welcome, steppers, question dock, finding cards. Walkthrough mode replays all six module/variant combinations non-interactively, for training and IRB walkthroughs.

06 · Methodological rigor

Built to be replicated, not admired

Another researcher could rebuild the protocol from public documentation alone. Module rules live in plain Markdown (RULES-BANKING, RULES-MEDICAL, RULES-LICENSE). Business rules, Version A/B result copy, and scoring keys for every question are versioned alongside the code.

Anonymisation is aggressive by design. No name, email, IP address, device identifier, or full user-agent in the research dataset. No participant cookies. The researcher panel uses an httpOnly session cookie scoped to staff. This shaped the consent flow, welcome copy, assignment route, and deletion API — not a compliance afterthought.

Participant flow

The fifteen-minute funnel

Step 01

Welcome

Layered, one-click

Step 02

Demographics

5 questions

Step 03

Assignment

Silent, balanced, server-side

Step 04

Module 1

~5 min per module

Step 05

Module 2 to 3

~5 min per module

Inside each module

1. Prototype flow5 to 6 steps, mobile-first
2. Prediction2 MCQs before result
3. Comprehension2 to 3 MCQs with result
4. Open explanationMin 10 chars
5. ScalesConfidence · ease · effort

07 · Diagnostic concept

Externalized processing: the pattern behind the problem

The construct points at a recurring design pattern that produces cognitive distance. The conceptual paper introduces externalized processing: systems that specify valid input conditions without providing internal means to meet them. Format restrictions, size limits, naming rules — each becomes a user-side preprocessing task with an interpretive burden on top.

The diagram below contrasts the pattern that pushes work outward (letting interpretation collapse) with the pattern that keeps it in-system (keeping the causal chain legible). Every module embeds a version of this trade-off, which the comprehension and prediction questions are designed to detect.

Diagnostic concept

Externalized processing

The recurring pattern: systems specify constraints, users absorb the transformation.

Pattern A

User-side preprocessing

constraint without accommodation

  • System states constraintsformat · size · naming · structure
  • User searches, converts, compressesacross external tools and trial-and-error
  • Completion depends on rulesthat may remain only partly understood

Operational work and interpretive burden move outward to users.

Pattern B

System-side accommodation

feasible transformation retained in-system

  • User submits available inputthrough the primary interface
  • System validates and, when feasible,normalises or transforms internally
  • System returns clear guidanceor completes submission directly

Computational work stays with the system and causal relations become legible.

08 · In the interface

Production screens from the live study

The platform reads as a calm research instrument — not a survey app, not a marketing flow. The question dock keeps results visible while participants answer. Result screens name the next actor and timeline in plain language (Version A) or in the system's own voice (Version B).

Production welcome screen in English, large desktop view

The full welcome screen at desktop scale.

Elevated session card, three-line headline, trust strip, sticky agree button. The desktop split layout earns its place here.

Mobile welcome screen in English with three-line headline, trust strip, and sticky agree button

Mobile welcome, in English. Calm eConsent language, single elevated session card, sticky CTA.

Three-line headline with clamp() typography. Native disclosure for privacy summary; full notice one tap away.

Mobile banking flow with prediction question overlaying the review screen

Prediction question on mobile at Step 5. The review screen stays visible while participants answer.

Bottom sheet up to 55vh, with the primary action fixed outside the scroll area. The background never reflows when the sheet opens.

Production banking result screen in English with vertical status timeline

Banking result, Version A in English. Plain language next steps and a human status tracker.

Medical appointment booked result screen with plain language next steps

Medical result, Version A. The screen names who will contact the patient and when.

Desktop banking transfer sent result with vertical sub-stepper showing payment stages

Desktop banking result. The vertical sub-stepper tracks payment lifecycle without leaving the service context.

Post-task scale question asking about mental effort on a 1 to 7 scale

Post-task scales. Ease, effort, understanding, and task-done certainty, captured after each module.

These scales are the manipulation checks. The 1 to 7 effort scale shown here is one of four post-module measures. If A and B differ on them, the manipulation failed and the data is flagged for sensitivity analysis.

09 · Early signals

Directional observations from active data collection

Data collection is active. The observations below are descriptive from early cohorts — directional, not headline findings. Inferential analysis follows manuscript submission. The panel surfaces every chart with that caveat inline.

Predicted most-uncertain service

Version A and Version B rank medical differently. Version B participants are more likely to flag license as the source of uncertainty.

Open explanations

Version A explanations name who, when, and what next. Version B explanations frequently stop at the system's surface phrasing, even when participants clearly understand the question.

Confidence calibration

Version A participants are more often correctly calibrated. Version B participants are more often over- or under-confident, a sign the mental model is unstable.

10 · Implications

What cognitive distance changes, if the data holds

If the hypotheses are supported, the study shifts what counts as a usability problem. Three implications travel with the construct.

  1. Accessibility moves from accommodation to prevention. Neurodivergent HCI research shows systems designed around neurotypical defaults impose invisible access work. Cognitive distance makes that work visible and measurable in design review.
  2. AI-mediated interaction becomes diagnosable. Automation, adaptive logic, and AI-generated outputs produce what Sarter, Woods, and Billings called "automation surprise" — usable in the moment, opaque in rationale. CDI offers a mediating variable for trust, fairness, and reliance.
  3. Evaluation redefines what counts as evidence. Beyond whether users succeeded: can they explain the response, predict the next state, and identify the causal structure connecting action to outcome? That is the evidence base for inclusive, accountable design.

11 · Stack

What runs the study

Next.js 16 (App Router), React 19, Tailwind CSS 4. Supabase (Postgres + Row Level Security) for participants, module_runs, events, answers, and open_scores. Vercel deployment with /api/health check. No PII in the research dataset; no participant cookies; httpOnly session cookie for researcher access only.

Production stack

  • Application: Next.js 16 · React 19 · TypeScript · Tailwind 4
  • Data: Supabase Postgres · typed tables · Row Level Security
  • Auth: httpOnly session cookie (researcher panel) · anonymous key (participant funnel)
  • Infra: Vercel · health check · environment-scoped secrets
  • Documentation: rules in plain Markdown · design system · walkthrough mode

12 · Read next

Related work and next steps

Outcome

From theory to live instrument, collecting data

The platform is live, collecting data internationally. The conceptual paper is under peer review. The framework extends to other services, locales, and types of opacity — automation, AI, data dashboards.

This case study documents the work of taking a construct from intuition to instrument: a research design that survives review, a platform participants trust, and a measurement practice honest enough to publish.

Most usability failures are not about effort. They are about explanation — systems that work procedurally and fail interpretively. This study makes those failures measurable.

Open data collection · Manuscript under peer review · Platform live and extensible to adjacent domains.