qbank/SPEC.md

# QBank — Execution Plan

A self-hosted quiz app. Import PDF/DOCX documents containing multiple-choice questions, parse them with an LLM, store them, generate randomised tests, take them on phone or laptop, and review wrong answers afterward.

This document is the build plan. Work through phases in order. Each phase has a clear acceptance test — don't move on until it passes.

---

## Locked-in tech decisions

- **Language:** Go (1.22+ for `net/http` pattern routing, though we'll use chi).
- **Router:** `github.com/go-chi/chi/v5`.
- **DB:** SQLite via `modernc.org/sqlite` (pure Go, no CGo — simplifies Docker builds).
- **Sessions:** `github.com/alexedwards/scs/v2` with SQLite store.
- **Passwords:** `golang.org/x/crypto/bcrypt`.
- **PDF text extraction:** `github.com/ledongthuc/pdf` as the primary; fall back to shelling out to `pdftotext` (poppler-utils) if quality is poor on real docs.
- **DOCX text extraction:** hand-rolled — DOCX is a zip; unzip in memory, parse `word/document.xml`, concatenate `<w:t>` text nodes. ~50 lines, no extra deps beyond `archive/zip` and `encoding/xml`.
- **OpenAI client:** `github.com/sashabaranov/go-openai`. Use the current cost-efficient model that supports JSON schema response_format (verify the current model name in OpenAI docs at build time; e.g. `gpt-4o-mini` or its successor).
- **Templates:** `html/template` standard library.
- **CSS:** Tailwind via the standalone CLI (no Node.js needed) — download the binary in the Dockerfile.
- **Auth model:** two named accounts seeded at first run. Each user has their own test history.
- **Hosting target:** self-hosted on a local Portainer instance. Deployed as a Docker container via a stack (docker-compose). A named volume holds `qbank.db` and uploaded files.

---

## Project layout

```
qbank/
├── cmd/server/main.go              # entry point
├── internal/
│   ├── config/                     # env + flags
│   ├── db/
│   │   ├── schema.sql
│   │   ├── db.go                   # open, migrate, helpers
│   │   └── repo.go                 # queries
│   ├── models/                     # Question, Answer, Test, etc.
│   ├── parse/
│   │   ├── pdf.go
│   │   ├── docx.go
│   │   └── chunk.go
│   ├── llm/
│   │   └── openai.go               # ExtractQuestions(text) -> []Question
│   ├── auth/                       # login, middleware, session
│   ├── handlers/                   # http handlers, one file per area
│   └── web/
│       ├── templates/*.html
│       └── static/                 # tailwind output, manifest, sw.js, icons
├── data/                           # gitignored: qbank.db, uploads/
├── .env.example
├── Dockerfile
├── docker-compose.yml
├── go.mod
└── README.md
```

---

## Phase 0 — Skeleton & config

Tasks:
- `go mod init qbank`.
- Wire `cmd/server/main.go` with chi, a `/healthz` route returning 200, structured logging via `log/slog`.
- Config loader in `internal/config`: reads `OPENAI_API_KEY`, `SESSION_SECRET`, `DATA_DIR` (default `./data`), `PORT` (default `8080`), `ADMIN_USERS` (comma-separated `name:password` pairs, seeded on first start only).
- `.env.example` with the above keys.
- `Makefile` with `run`, `build`, `tailwind`, `tidy`.

**Acceptance:** `go run ./cmd/server` boots, `curl localhost:8080/healthz` returns OK.

---

## Phase 1 — Database & migrations

Schema in `internal/db/schema.sql`:

```sql
CREATE TABLE users (
  id INTEGER PRIMARY KEY,
  name TEXT UNIQUE NOT NULL,
  password_hash TEXT NOT NULL,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE questions (
  id TEXT PRIMARY KEY,            -- sha256(question_text)[:16]
  text TEXT NOT NULL,
  source TEXT,
  tags TEXT,                      -- comma-separated, simple
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE answers (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  question_id TEXT NOT NULL REFERENCES questions(id) ON DELETE CASCADE,
  text TEXT NOT NULL,
  is_correct BOOLEAN NOT NULL,
  position INTEGER NOT NULL        -- canonical order as imported
);

CREATE TABLE tests (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  user_id INTEGER NOT NULL REFERENCES users(id),
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  completed_at DATETIME,
  n_questions INTEGER NOT NULL,
  question_ids TEXT NOT NULL       -- JSON array, in chosen order
);

CREATE TABLE test_answers (
  test_id INTEGER NOT NULL REFERENCES tests(id) ON DELETE CASCADE,
  question_id TEXT NOT NULL REFERENCES questions(id),
  selected_answer_id INTEGER REFERENCES answers(id),
  is_correct BOOLEAN,
  answered_at DATETIME,
  PRIMARY KEY (test_id, question_id)
);

-- Per-user mastery stats. Drives weighted sampling in Phase 6 and the
-- accuracy display in Phase 5. A missing row means "never seen" — treat
-- as (times_seen=0, times_correct=0).
CREATE TABLE user_question_stats (
  user_id INTEGER NOT NULL REFERENCES users(id),
  question_id TEXT NOT NULL REFERENCES questions(id) ON DELETE CASCADE,
  times_seen INTEGER NOT NULL DEFAULT 0,
  times_correct INTEGER NOT NULL DEFAULT 0,
  last_seen_at DATETIME,
  PRIMARY KEY (user_id, question_id)
);

CREATE INDEX idx_test_answers_test ON test_answers(test_id);
CREATE INDEX idx_answers_question ON answers(question_id);
CREATE INDEX idx_stats_user ON user_question_stats(user_id);
```

Tasks:
- `db.Open(path)` — runs schema on first start (idempotent via `IF NOT EXISTS`).
- Repository functions: `CreateUser`, `GetUserByName`, `InsertQuestion`, `GetQuestion`, `ListQuestions(filter)`, `CountQuestions`, `CountAnswers`, `CreateTest`, `RecordAnswer`, `FinishTest`, `GetTest`, `ListTestsForUser`, `UpsertStat(userID, questionID, gotItRight bool)` (atomically increments `times_seen` and conditionally `times_correct`, sets `last_seen_at`), `GetStatsForUser(userID, questionIDs)` (returns a map; missing entries mean unseen).
- Seed admin users from `ADMIN_USERS` env var on first start only (skip if `users` table non-empty).

**Acceptance:** delete `qbank.db`, start the server, verify users are seeded; `sqlite3 qbank.db ".schema"` shows all tables.

---

## Phase 2 — Auth & layout

- Login page (POST → set session via scs).
- `RequireAuth` middleware on everything except `/login`, `/healthz`, `/static/*`.
- Base template with header (app name, current user, logout, nav: Library / Upload / Take Test / History).
- Tailwind setup: download standalone CLI in Dockerfile and `make tailwind`; `web/static/tailwind.css` is the build output. Use mobile-first layout — single column under `md:`, content max-width `2xl` on desktop.
- Add `web/static/manifest.json` and `web/static/sw.js` (minimal — cache shell, network-first for API). Wire them in the base template so "Add to Home Screen" works on iOS and Android.

**Acceptance:** log in as each seeded user, see your name in the header, log out. On a phone, "Add to Home Screen" produces a standalone-looking icon.

---

## Phase 3 — Document parsing & LLM extraction

`internal/parse`:
- `ExtractPDF(r io.Reader) (string, error)` using ledongthuc/pdf. If text comes back empty or gibberish (heuristic: <50 chars or <2% alphanumeric ratio), return a sentinel error so the handler can show "scan-based PDF — please convert to text first."
- `ExtractDOCX(r io.Reader) (string, error)` — open as zip, find `word/document.xml`, walk XML, concatenate `<w:t>` text content, insert newlines on `<w:p>` boundaries.
- `Chunk(text string, maxRunes int) []string` — split on double-newlines greedily up to ~10k runes per chunk (well under model context).

`internal/llm`:
- `ExtractQuestions(ctx, chunk string) ([]ParsedQuestion, error)` calls OpenAI chat completions with `response_format` set to a JSON schema:

```json
{
  "type": "object",
  "properties": {
    "questions": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["question", "answers"],
        "properties": {
          "question": {"type": "string"},
          "answers": {
            "type": "array",
            "minItems": 2,
            "items": {
              "type": "object",
              "required": ["text", "correct"],
              "properties": {
                "text": {"type": "string"},
                "correct": {"type": "boolean"}
              }
            }
          }
        }
      }
    }
  },
  "required": ["questions"]
}
```

System prompt: "You extract multiple-choice questions from study material. Return every question found. Exactly one answer per question must be marked correct. If the source doesn't clearly mark a correct answer, omit that question entirely. Do not invent questions not present in the text."

Validate the response: drop any question without exactly one `correct: true`. Dedupe by question text hash.

**Acceptance:** feed in a small handcrafted PDF and DOCX with 3–4 known Q&As, the function returns them correctly.

---

## Phase 4 — Upload & import review flow

- `GET /upload` — file input (accepts `.pdf`,`.docx`), optional comma-separated tags, optional source override.
- `POST /upload`:
  - Save file to `data/uploads/<timestamp>_<filename>`.
  - Extract text → chunk → call LLM per chunk → merge & dedupe → stash candidate questions in the session (or in a temp `import_drafts` table keyed by a UUID; the temp table is cleaner, do that).
  - Redirect to `/import/<draft_id>`.
- `GET /import/<draft_id>` — editable preview. Each candidate question shows: question text (editable), each answer (editable, with radio for which is correct), tag input, "delete this question" checkbox.
- `POST /import/<draft_id>` — write surviving questions to `questions` and `answers`. Delete the draft. Redirect to library with a flash message ("Imported N questions, M skipped").

Files larger than ~5MB or with >100 candidate questions should still work — paginate the review page if needed.

**Acceptance:** upload a real PDF, edit a couple of answers in the preview, confirm; library count goes up by the right amount.

---

## Phase 5 — Library & stats

`GET /` (library page):
- Total questions, total answers (sum across all questions), per-source breakdown, per-tag breakdown.
- Searchable/filterable list of all questions. Each row shows the current user's mastery for that question: "seen 8× · 3 correct (38%)" — pull from `user_question_stats`. Unseen questions show "—".
- Sort options: alphabetical, lowest accuracy first (i.e. "weakest first"), most-seen first.
- "Take a test" CTA.

`GET /questions/<id>` — view & edit a single question; show the same per-user stats prominently.
`POST /questions/<id>` — save edits.
`POST /questions/<id>/delete` — delete (cascades to answers; archive instead by adding a `deleted_at` column if you'd rather, but hard delete is fine for v1).

**Acceptance:** counts match `SELECT COUNT(*)`; editing a question persists; deleting a question removes its answers too.

---

## Phase 6 — Take a test

`GET /test/new` — form:
- Number of questions (default 10, capped at total available after filters).
- Optional tag filter, optional source filter. Show count of matching questions live.
- **Sampling mode** (radio): "Focus on weak spots" (default, weighted) vs "Mix evenly" (uniform random). Weighted mode biases toward questions the current user gets wrong more often; uniform gives a representative cross-section. See "Weighting algorithm" appendix below.

`POST /test/new` —
- Resolve the candidate pool from filters.
- For each candidate, compute the user's weight from `user_question_stats` (formula in appendix). Uniform mode skips weighting (all weights = 1).
- Weighted-sample N candidates without replacement using the A-Res algorithm.
- Shuffle the resulting order.
- Create `tests` row with `question_ids` JSON.
- Redirect to `/test/<id>/q/1`.

`GET /test/<id>/q/<n>` — show the n-th question. Answers shuffled deterministically per (test_id, question_id) using a seeded RNG so refreshing doesn't re-shuffle. Progress indicator ("Question 3 of 10"). Mobile-friendly: large tap targets for answer choices.

`POST /test/<id>/q/<n>` — inside one transaction:
- Record selection in `test_answers` (compute `is_correct` server-side).
- Call `UpsertStat(userID, questionID, isCorrect)` to bump `times_seen`, conditionally `times_correct`, and update `last_seen_at`.
- If more questions remain → redirect to next.
- Else → mark test complete (`completed_at`), redirect to `/test/<id>/results`.

Only the owning user can access their test.

**Acceptance:**
- Start a 5-question test, answer all 5, end on results page.
- After several tests, weighted mode demonstrably surfaces previously-wrong questions more often than ones consistently answered correctly — verify by inspecting `user_question_stats` and re-running test creation a few times.
- A mastered question (e.g. seen 10×, correct 10×) still has a non-zero chance of appearing — confirm by running enough tests with a small pool that the floor weight is exercised.

---

## Phase 7 — Results & review

`GET /test/<id>/results`:
- Score: X / N (Y%).
- Time taken.
- For each question: question text, every answer listed, with markers: ✅ correct answer, ❌ user's choice (if wrong), ✅ both (if user got it right).
- Source & tags shown per question.

**Acceptance:** wrong answers are clearly visible, correct answer always shown, page renders cleanly on a phone.

---

## Phase 8 — History

`GET /history`:
- List of past tests for the current user: date, score, n_questions, link to results.
- Aggregate stat: overall % correct across all attempts.
- Stretch: "Weak spots" — questions this user has gotten wrong more than once, with link to the question.

**Acceptance:** taking multiple tests builds up a history; clicking back into a past test shows the full review.

---

## Phase 9 — Containerise & deploy to Portainer

Build:
- `Dockerfile`: multi-stage.
  - Stage 1 (`golang:1.22-alpine` or newer): `go mod download`, run the Tailwind standalone CLI to produce `web/static/tailwind.css`, then `CGO_ENABLED=0 go build -o /out/qbank ./cmd/server`.
  - Stage 2 (`gcr.io/distroless/static-debian12` if not using `pdftotext`, otherwise `alpine:3.20` with `apk add --no-cache poppler-utils`). Copy the binary, `web/templates/`, and `web/static/`. `EXPOSE 8080`. Non-root user. `ENTRYPOINT ["/qbank"]`.
- `.dockerignore`: at minimum `data/`, `*.db`, `.env`, `.git`, `node_modules` (if any sneak in).

`docker-compose.yml` (this is what you paste into Portainer's "Stacks → Add stack → Web editor"):

```yaml
services:
  qbank:
    image: ghcr.io/<you>/qbank:latest   # or build locally and push, see below
    container_name: qbank
    restart: unless-stopped
    ports:
      - "8080:8080"                     # or put it behind your existing reverse proxy
    environment:
      DATA_DIR: /data
      PORT: "8080"
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      SESSION_SECRET: ${SESSION_SECRET}
      ADMIN_USERS: ${ADMIN_USERS}       # e.g. "alice:hunter2,bob:correcthorse"
    volumes:
      - qbank-data:/data
    healthcheck:
      test: ["CMD", "/qbank", "healthcheck"]   # implement a tiny subcommand, or use wget if base image has it
      interval: 30s
      timeout: 5s
      retries: 3

volumes:
  qbank-data:
```

Set the three env vars in Portainer's stack "Environment variables" section so they don't end up in the compose file. `ADMIN_USERS` is only read on first start (when the `users` table is empty) — subsequent restarts ignore it, so it's safe to leave set.

Image options — pick one:
1. **Build on your machine, push to a registry:** `docker build -t ghcr.io/<you>/qbank:latest .` then `docker push`. Portainer pulls it. Simplest if you already use GHCR/Docker Hub.
2. **Build on the Portainer host directly:** in the stack, replace `image:` with `build: { context: ., dockerfile: Dockerfile }` and use Portainer's git-based stack deployment pointing at the repo. Portainer pulls the repo and builds. Good for iterating without a registry round-trip.

HTTPS / external access:
- If you have an existing reverse proxy on that host (Traefik, Caddy, Nginx Proxy Manager), don't publish 8080 — put `qbank` on the proxy's network and add labels/config so the proxy terminates TLS and forwards to the container's internal `:8080`. Mention this in the README so future-you remembers.
- If you don't have a reverse proxy yet and want it reachable from her phone outside your LAN, the easiest add-on is Caddy in front, or a Tailscale sidecar so the app is only reachable on your tailnet (private, no public exposure, works on her phone with the Tailscale app installed).

Healthcheck note: the distroless image has no shell, so the healthcheck command must be the binary itself. Implement a `healthcheck` subcommand in `cmd/server/main.go` that does a localhost GET against `/healthz` and exits non-zero on failure. Or, if you switch the base image to `alpine`, you can use `wget --spider -q http://localhost:8080/healthz` and skip the subcommand.

Backups: the `qbank-data` volume holds everything that matters. A simple cron on the host runs `docker run --rm -v qbank-data:/data -v /backups:/out alpine tar czf /out/qbank-$(date +%F).tar.gz /data` weekly. Document this in the README.

**Acceptance:** stack comes up in Portainer, container is healthy, you can log in from your laptop and from her phone, restarting the container preserves all questions and history.

---

## Appendix A — Weighting algorithm

Each question has a per-user weight that determines its sampling probability in "Focus on weak spots" mode. Implement in `internal/sampling` (new package).

### Weight formula

For a question with `times_seen = s`, `times_correct = c`, and `last_seen_at = t`:

```
wrong = s - c

# Laplace-smoothed error rate. Pretending you've seen the question once
# right and once wrong dampens noise from tiny sample sizes — one wrong
# answer doesn't catapult the question to top priority, and one right
# answer doesn't bury it.
error_rate = (wrong + 1) / (s + 2)

# Floor so mastered questions still appear occasionally. 0.15 means a
# perfectly-mastered question shows up at roughly 15% the frequency of a
# brand-new one. Tune to taste.
base = max(0.15, error_rate)

# Recency nudge. Long-unseen questions creep back up regardless of past
# accuracy — this prevents the "staleness death-spiral" where low-weight
# questions never get sampled, never get updated, and stay low forever.
# Caps at 2× after ~30 days; unseen questions (t is NULL) get the full 2×.
days_since = (now - t).days  if t != NULL else 30
recency = 1 + min(days_since / 30.0, 1.0)

weight = base * recency
```

Unseen questions (no row in `user_question_stats`) get `base = 0.5` (mid-range, not punished and not boosted) and full recency multiplier — they end up at weight `~1.0`, which is roughly the same as a question you've gotten wrong half the time.

Knobs worth exposing as constants at the top of the file (don't make them config — just easy to find for tuning):

```go
const (
    FloorWeight       = 0.15  // never drop below this
    RecencyCapDays    = 30.0  // days at which recency multiplier saturates
    RecencyMaxMult    = 2.0   // peak recency multiplier
    UnseenBaseWeight  = 0.5   // base for questions with no stats row
)
```

### Weighted sample without replacement (A-Res)

```go
// SelectWeighted picks n distinct items from candidates using their weights.
// Equivalent to repeated weighted draws without replacement, in one pass.
// Time: O(m log m) where m = len(candidates).
func SelectWeighted(candidates []Candidate, n int, rng *rand.Rand) []Candidate {
    if n >= len(candidates) {
        return candidates
    }
    type keyed struct {
        c   Candidate
        key float64
    }
    keys := make([]keyed, len(candidates))
    for i, c := range candidates {
        u := rng.Float64()
        if u == 0 {
            u = 1e-12 // avoid log(0) / pow weirdness
        }
        // Efraimidis–Spirakis A-Res key: u^(1/weight)
        keys[i] = keyed{c, math.Pow(u, 1.0/c.Weight)}
    }
    sort.Slice(keys, func(i, j int) bool { return keys[i].key > keys[j].key })
    out := make([]Candidate, n)
    for i := 0; i < n; i++ {
        out[i] = keys[i].c
    }
    return out
}
```

### Testing the sampler

Write a deterministic test in `internal/sampling/sampling_test.go` that:
- Builds a fake pool of 100 questions with hand-crafted stats (some mastered, some weak, some unseen).
- Runs `SelectWeighted` 10,000 times with a seeded RNG.
- Asserts that high-error-rate questions are sampled at >3× the rate of mastered ones, and that mastered ones are still sampled at least N times (proving the floor works).

### Where it plugs in

- `internal/sampling/weight.go` — `ComputeWeight(stat *UserQuestionStat, now time.Time) float64`. Takes `nil` for unseen.
- `internal/sampling/select.go` — `SelectWeighted` above.
- `internal/handlers/test.go` `POST /test/new` —
  1. `candidates := repo.ListQuestions(filter)`.
  2. `stats := repo.GetStatsForUser(userID, candidateIDs)` (map keyed by question_id).
  3. For each candidate: `cand.Weight = sampling.ComputeWeight(stats[c.ID], time.Now())` — or `1.0` if mode is "uniform".
  4. `picked := sampling.SelectWeighted(candidates, n, rng)`.
  5. Shuffle `picked`, persist as the test's `question_ids`.

---

## Conventions for the build

- Errors bubble up; handlers translate to HTTP via a small `httpErr` helper. Never `panic` in a handler.
- All DB access goes through `internal/db/repo.go` — no inline SQL in handlers.
- Templates use a single `layout.html` with `{{block "content"}}`. Pages define `content` only.
- CSRF: scs has built-in support; enable it on POST routes.
- File upload limit: 20MB, enforced via `http.MaxBytesReader`.
- All user-supplied text rendered with `html/template`'s default escaping — never use `template.HTML` on imported content.
- Log: one structured line per request with method, path, status, duration, user.
- Tests: write table-driven tests for `parse/`, `llm/` (mock the OpenAI client behind an interface), and the question-dedup logic. Handler tests with `httptest` for the auth flow and the test-taking flow.

---

## Open questions to confirm before Phase 0

1. Two separate accounts or one shared? (Plan above assumes separate.)
2. Should the LLM auto-suggest tags during import, or are tags manual-only? (Plan above is manual; auto-tagging is a small addition to the prompt + schema.)
3. Hard delete or soft delete for questions? (Plan above is hard delete.)
4. Comfortable with the weighting defaults in Appendix A (floor 0.15, recency cap at 30 days, unseen base 0.5)? These are tuneable later but it's worth a sanity check up front — if you want mastered questions to appear more rarely, raise the floor; more often, lower it.