QBank — Execution Plan
A self-hosted quiz app. Import PDF/DOCX documents containing multiple-choice questions, parse them with an LLM, store them, generate randomised tests, take them on phone or laptop, and review wrong answers afterward.
This document is the build plan. Work through phases in order. Each phase has a clear acceptance test — don't move on until it passes.
Locked-in tech decisions
- Language: Go (1.22+ for net/http pattern routing, though we'll use chi).
- Router: github.com/go-chi/chi/v5.
- DB: SQLite via modernc.org/sqlite (pure Go, no CGo, which simplifies Docker builds).
- Sessions: github.com/alexedwards/scs/v2 with SQLite store.
- Passwords: golang.org/x/crypto/bcrypt.
- PDF text extraction: github.com/ledongthuc/pdf as the primary; fall back to shelling out to pdftotext (poppler-utils) if quality is poor on real docs.
- DOCX text extraction: hand-rolled. DOCX is a zip; unzip in memory, parse word/document.xml, concatenate <w:t> text nodes. ~50 lines, no extra deps beyond archive/zip and encoding/xml.
- OpenAI client: github.com/sashabaranov/go-openai. Use the current cost-efficient model that supports JSON schema response_format (verify the current model name in OpenAI docs at build time; e.g. gpt-4o-mini or its successor).
- Templates: html/template standard library.
- CSS: Tailwind via the standalone CLI (no Node.js needed); download the binary in the Dockerfile.
- Auth model: two named accounts seeded at first run. Each user has their own test history.
- Hosting target: self-hosted on a local Portainer instance. Deployed as a Docker container via a stack (docker-compose). A named volume holds qbank.db and uploaded files.
Project layout
qbank/
├── cmd/server/main.go # entry point
├── internal/
│ ├── config/ # env + flags
│ ├── db/
│ │ ├── schema.sql
│ │ ├── db.go # open, migrate, helpers
│ │ └── repo.go # queries
│ ├── models/ # Question, Answer, Test, etc.
│ ├── parse/
│ │ ├── pdf.go
│ │ ├── docx.go
│ │ └── chunk.go
│ ├── llm/
│ │ └── openai.go # ExtractQuestions(text) -> []Question
│ ├── auth/ # login, middleware, session
│ ├── handlers/ # http handlers, one file per area
│ └── web/
│   ├── templates/*.html
│   └── static/ # tailwind output, manifest, sw.js, icons
├── data/ # gitignored: qbank.db, uploads/
├── .env.example
├── Dockerfile
├── docker-compose.yml
├── go.mod
└── README.md
Phase 0 — Skeleton & config
Tasks:
- go mod init qbank.
- Wire cmd/server/main.go with chi, a /healthz route returning 200, structured logging via log/slog.
- Config loader in internal/config: reads OPENAI_API_KEY, SESSION_SECRET, DATA_DIR (default ./data), PORT (default 8080), ADMIN_USERS (comma-separated name:password pairs, seeded on first start only).
- .env.example with the above keys.
- Makefile with run, build, tailwind, tidy.
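The ADMIN_USERS format is small enough to parse by hand. Here is a minimal sketch, assuming each entry is split on its first colon so passwords may contain colons (commas in passwords cannot be represented in this format). The AdminUser type and the ParseAdminUsers name are illustrative, not something the plan fixes.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// AdminUser is one name/password pair parsed from ADMIN_USERS.
// Hypothetical type, introduced only for this sketch.
type AdminUser struct {
	Name     string
	Password string
}

// ParseAdminUsers parses "alice:pw1,bob:pw2" into pairs.
// Each entry is split on the first colon only.
func ParseAdminUsers(raw string) ([]AdminUser, error) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return nil, errors.New("ADMIN_USERS is empty")
	}
	var users []AdminUser
	seen := make(map[string]bool)
	for i, entry := range strings.Split(raw, ",") {
		name, password, ok := strings.Cut(strings.TrimSpace(entry), ":")
		name = strings.TrimSpace(name)
		if !ok || name == "" || password == "" {
			// Report the position, not the entry, so passwords stay out of logs.
			return nil, fmt.Errorf("ADMIN_USERS entry %d: want name:password", i+1)
		}
		if seen[name] {
			return nil, fmt.Errorf("ADMIN_USERS entry %d: duplicate user %q", i+1, name)
		}
		seen[name] = true
		users = append(users, AdminUser{Name: name, Password: password})
	}
	return users, nil
}

func main() {
	users, err := ParseAdminUsers("alice:hunter2, bob:correct:horse")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, u := range users {
		fmt.Printf("%s has a %d-character password\n", u.Name, len(u.Password))
	}
}
```

Failing fast on a malformed entry at startup is probably friendlier than silently seeding one user out of two.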
Acceptance: go run ./cmd/server boots, curl localhost:8080/healthz returns OK.
Phase 1 — Database & migrations
Schema in internal/db/schema.sql:
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    password_hash TEXT NOT NULL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS questions (
    id TEXT PRIMARY KEY, -- sha256(question_text)[:16]
    text TEXT NOT NULL,
    source TEXT,
    tags TEXT, -- comma-separated, simple
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS answers (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    question_id TEXT NOT NULL REFERENCES questions(id) ON DELETE CASCADE,
    text TEXT NOT NULL,
    is_correct BOOLEAN NOT NULL,
    position INTEGER NOT NULL -- canonical order as imported
);
CREATE TABLE IF NOT EXISTS tests (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL REFERENCES users(id),
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    completed_at DATETIME,
    n_questions INTEGER NOT NULL,
    question_ids TEXT NOT NULL -- JSON array, in chosen order
);
CREATE TABLE IF NOT EXISTS test_answers (
    test_id INTEGER NOT NULL REFERENCES tests(id) ON DELETE CASCADE,
    question_id TEXT NOT NULL REFERENCES questions(id),
    selected_answer_id INTEGER REFERENCES answers(id),
    is_correct BOOLEAN,
    answered_at DATETIME,
    PRIMARY KEY (test_id, question_id)
);
-- Per-user mastery stats. Drives weighted sampling in Phase 6 and the
-- accuracy display in Phase 5. A missing row means "never seen": treat
-- it as (times_seen=0, times_correct=0).
CREATE TABLE IF NOT EXISTS user_question_stats (
    user_id INTEGER NOT NULL REFERENCES users(id),
    question_id TEXT NOT NULL REFERENCES questions(id) ON DELETE CASCADE,
    times_seen INTEGER NOT NULL DEFAULT 0,
    times_correct INTEGER NOT NULL DEFAULT 0,
    last_seen_at DATETIME,
    PRIMARY KEY (user_id, question_id)
);
CREATE INDEX IF NOT EXISTS idx_test_answers_test ON test_answers(test_id);
CREATE INDEX IF NOT EXISTS idx_answers_question ON answers(question_id);
CREATE INDEX IF NOT EXISTS idx_stats_user ON user_question_stats(user_id);
Tasks:
- db.Open(path): opens the database, turns on foreign-key enforcement (PRAGMA foreign_keys = ON; SQLite leaves it off by default, and the ON DELETE CASCADE clauses do nothing without it), and runs the schema at startup (idempotent via IF NOT EXISTS).
- Repository functions: CreateUser, GetUserByName, InsertQuestion, GetQuestion, ListQuestions(filter), CountQuestions, CountAnswers, CreateTest, RecordAnswer, FinishTest, GetTest, ListTestsForUser, UpsertStat(userID, questionID, gotItRight bool) (atomically increments times_seen and conditionally times_correct, sets last_seen_at), GetStatsForUser(userID, questionIDs) (returns a map; missing entries mean unseen).
- Seed admin users from the ADMIN_USERS env var on first start only (skip if the users table is non-empty).
Acceptance: delete qbank.db, start the server, verify users are seeded; sqlite3 qbank.db ".schema" shows all tables.
Phase 2 — Auth & layout
- Login page (POST → set session via scs).
- RequireAuth middleware on everything except /login, /healthz, /static/*.
- Base template with header (app name, current user, logout, nav: Library / Upload / Take Test / History).
- Tailwind setup: download standalone CLI in Dockerfile and make tailwind; web/static/tailwind.css is the build output. Use mobile-first layout: single column under md:, content max-width 2xl on desktop.
- Add web/static/manifest.json and web/static/sw.js (minimal: cache shell, network-first for API). Wire them in the base template so "Add to Home Screen" works on iOS and Android.
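The RequireAuth rule (everything is protected except /login, /healthz, and /static/*) can be sketched with the standard library alone. The currentUser callback is a stand-in for whatever session lookup the app ends up using and is an assumption of this sketch. The returned func(http.Handler) http.Handler shape is the form chi's Use accepts.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// isPublic reports whether a path is reachable without logging in.
func isPublic(path string) bool {
	return path == "/login" || path == "/healthz" || strings.HasPrefix(path, "/static/")
}

// RequireAuth redirects anonymous requests for non-public paths to /login.
// currentUser is a hypothetical session lookup: it returns the logged-in
// user's name, or "" when there is no session.
func RequireAuth(currentUser func(*http.Request) string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if isPublic(r.URL.Path) || currentUser(r) != "" {
				next.ServeHTTP(w, r)
				return
			}
			http.Redirect(w, r, "/login", http.StatusSeeOther)
		})
	}
}

// statusFor sends one GET through the middleware and returns the status code.
// It exists only to demonstrate the behaviour.
func statusFor(path, user string) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	h := RequireAuth(func(*http.Request) string { return user })(ok)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor("/", ""))        // anonymous, protected page
	fmt.Println(statusFor("/", "alice"))   // logged in
	fmt.Println(statusFor("/healthz", "")) // anonymous, public route
}
```

Keeping the allow-list in one small function makes it easy to cover with a table-driven test.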
Acceptance: log in as each seeded user, see your name in the header, log out. On a phone, "Add to Home Screen" produces a standalone-looking icon.
Phase 3 — Document parsing & LLM extraction
internal/parse:
- ExtractPDF(r io.Reader) (string, error) using ledongthuc/pdf. If text comes back empty or gibberish (heuristic: <50 chars or <2% alphanumeric ratio), return a sentinel error so the handler can show "scan-based PDF, please convert to text first."
- ExtractDOCX(r io.Reader) (string, error): open as zip, find word/document.xml, walk XML, concatenate <w:t> text content, insert newlines on <w:p> boundaries.
- Chunk(text string, maxRunes int) []string: split on double-newlines greedily up to ~10k runes per chunk (well under model context).
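A minimal sketch of the hand-rolled DOCX extractor, using only archive/zip and encoding/xml. It matches elements by local name ("t" and "p") and ignores the namespace, which is a simplification. buildSampleDOCX and extractSample are hypothetical helpers that make the example self-contained; a real DOCX contains more parts than the single file built here.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// ExtractDOCX reads the whole document into memory, opens it as a zip,
// and concatenates <w:t> text, adding a newline at the end of each <w:p>.
func ExtractDOCX(r io.Reader) (string, error) {
	data, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return "", err
	}
	f, err := zr.Open("word/document.xml")
	if err != nil {
		return "", err
	}
	defer f.Close()

	var sb strings.Builder
	dec := xml.NewDecoder(f)
	inText := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "t" {
				inText = true
			}
		case xml.EndElement:
			switch t.Name.Local {
			case "t":
				inText = false
			case "p":
				sb.WriteString("\n")
			}
		case xml.CharData:
			if inText {
				sb.Write(t)
			}
		}
	}
	return sb.String(), nil
}

// buildSampleDOCX returns an in-memory zip holding only word/document.xml.
func buildSampleDOCX() ([]byte, error) {
	const doc = `<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>
<w:p><w:r><w:t>Q1. What is </w:t></w:r><w:r><w:t>2+2?</w:t></w:r></w:p>
<w:p><w:r><w:t>A. 4</w:t></w:r></w:p>
</w:body></w:document>`
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("word/document.xml")
	if err != nil {
		return nil, err
	}
	if _, err := w.Write([]byte(doc)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// extractSample runs the extractor over the sample document.
func extractSample() (string, error) {
	data, err := buildSampleDOCX()
	if err != nil {
		return "", err
	}
	return ExtractDOCX(bytes.NewReader(data))
}

func main() {
	text, err := extractSample()
	fmt.Printf("%q %v\n", text, err)
}
```

Word often splits one sentence across several runs, which is why the text of adjacent <w:t> nodes is joined without a separator.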
internal/llm:
ExtractQuestions(ctx, chunk string) ([]ParsedQuestion, error) calls OpenAI chat completions with response_format set to a JSON schema:
{
"type": "object",
"properties": {
"questions": {
"type": "array",
"items": {
"type": "object",
"required": ["question", "answers"],
"properties": {
"question": {"type": "string"},
"answers": {
"type": "array",
"minItems": 2,
"items": {
"type": "object",
"required": ["text", "correct"],
"properties": {
"text": {"type": "string"},
"correct": {"type": "boolean"}
}
}
}
}
}
}
},
"required": ["questions"]
}
System prompt: "You extract multiple-choice questions from study material. Return every question found. Exactly one answer per question must be marked correct. If the source doesn't clearly mark a correct answer, omit that question entirely. Do not invent questions not present in the text."
Validate the response: drop any question without exactly one correct: true. Dedupe by question text hash.
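The validation and dedupe step could look roughly like this. Two things here are assumptions rather than decisions from the plan: the ID is taken as the first 16 hex characters of the SHA-256 digest, and the text is lowercased with whitespace collapsed before hashing so that trivial variations dedupe. FilterValid is an illustrative name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// ParsedAnswer and ParsedQuestion mirror the JSON schema above.
type ParsedAnswer struct {
	Text    string `json:"text"`
	Correct bool   `json:"correct"`
}

type ParsedQuestion struct {
	Question string         `json:"question"`
	Answers  []ParsedAnswer `json:"answers"`
}

// QuestionID hashes the normalized question text and keeps 16 hex chars.
// The normalization is an assumption of this sketch.
func QuestionID(text string) string {
	normalized := strings.ToLower(strings.Join(strings.Fields(text), " "))
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])[:16]
}

// FilterValid keeps questions with at least two answers and exactly one
// answer marked correct, and drops later duplicates of the same ID.
func FilterValid(in []ParsedQuestion) []ParsedQuestion {
	seen := make(map[string]bool)
	var out []ParsedQuestion
	for _, q := range in {
		correct := 0
		for _, a := range q.Answers {
			if a.Correct {
				correct++
			}
		}
		if len(q.Answers) < 2 || correct != 1 {
			continue
		}
		id := QuestionID(q.Question)
		if seen[id] {
			continue
		}
		seen[id] = true
		out = append(out, q)
	}
	return out
}

func main() {
	qs := []ParsedQuestion{
		{Question: "What is 2+2?", Answers: []ParsedAnswer{{Text: "3"}, {Text: "4", Correct: true}}},
		{Question: "what is  2+2?", Answers: []ParsedAnswer{{Text: "4", Correct: true}, {Text: "5"}}},
		{Question: "No key given", Answers: []ParsedAnswer{{Text: "a"}, {Text: "b"}}},
	}
	for _, q := range FilterValid(qs) {
		fmt.Println(QuestionID(q.Question), q.Question)
	}
}
```

Because the same ID function would produce the questions primary key, a chunk that repeats a question already in the library shows up as a key conflict at insert time, which the import step can count as "skipped".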
Acceptance: feed in a small handcrafted PDF and DOCX with 3–4 known Q&As, the function returns them correctly.
Phase 4 — Upload & import review flow
- GET /upload: file input (accepts .pdf, .docx), optional comma-separated tags, optional source override.
- POST /upload:
  - Save file to data/uploads/<timestamp>_<filename>.
  - Extract text → chunk → call LLM per chunk → merge & dedupe → stash candidate questions in the session (or in a temp import_drafts table keyed by a UUID; the temp table is cleaner, do that).
  - Redirect to /import/<draft_id>.
- GET /import/<draft_id>: editable preview. Each candidate question shows: question text (editable), each answer (editable, with radio for which is correct), tag input, "delete this question" checkbox.
- POST /import/<draft_id>: write surviving questions to questions and answers. Delete the draft. Redirect to library with a flash message ("Imported N questions, M skipped").
Files larger than ~5MB or with >100 candidate questions should still work — paginate the review page if needed.
Acceptance: upload a real PDF, edit a couple of answers in the preview, confirm; library count goes up by the right amount.
Phase 5 — Library & stats
GET / (library page):
- Total questions, total answers (sum across all questions), per-source breakdown, per-tag breakdown.
- Searchable/filterable list of all questions. Each row shows the current user's mastery for that question: "seen 8× · 3 correct (38%)", pulled from user_question_stats. Unseen questions show "—".
- Sort options: alphabetical, lowest accuracy first (i.e. "weakest first"), most-seen first.
- "Take a test" CTA.
GET /questions/<id> — view & edit a single question; show the same per-user stats prominently.
POST /questions/<id> — save edits.
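POST /questions/<id>/delete: delete (cascades to answers; archive instead by adding a deleted_at column if you'd rather, but hard delete is fine for v1). One catch: test_answers.question_id has no ON DELETE action, so with foreign keys enforced, deleting a question that already appears in a past test fails until that reference is dealt with.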
Acceptance: counts match SELECT COUNT(*); editing a question persists; deleting a question removes its answers too.
Phase 6 — Take a test
GET /test/new — form:
- Number of questions (default 10, capped at total available after filters).
- Optional tag filter, optional source filter. Show count of matching questions live.
- Sampling mode (radio): "Focus on weak spots" (default, weighted) vs "Mix evenly" (uniform random). Weighted mode biases toward questions the current user gets wrong more often; uniform gives a representative cross-section. See "Weighting algorithm" appendix below.
POST /test/new —
- Resolve the candidate pool from filters.
- For each candidate, compute the user's weight from user_question_stats (formula in appendix). Uniform mode skips weighting (all weights = 1).
- Weighted-sample N candidates without replacement using the A-Res algorithm.
- Shuffle the resulting order.
- Create tests row with question_ids JSON.
- Redirect to /test/<id>/q/1.
GET /test/<id>/q/<n> — show the n-th question. Answers shuffled deterministically per (test_id, question_id) using a seeded RNG so refreshing doesn't re-shuffle. Progress indicator ("Question 3 of 10"). Mobile-friendly: large tap targets for answer choices.
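One way to get a refresh-stable answer order is to derive the RNG seed from the (test_id, question_id) pair. The sketch below hashes the pair with FNV-1a; the hash choice and the function names are assumptions. The answer IDs must be passed in canonical order (the position column) for the result to be reproducible.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// shuffleSeed derives a stable seed from the test and question IDs.
func shuffleSeed(testID int64, questionID string) int64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", testID, questionID)
	return int64(h.Sum64())
}

// ShuffledAnswerIDs returns a shuffled copy of answerIDs. The order depends
// only on testID, questionID, and the input order, so reloading the page
// shows the same arrangement. The input slice is not modified.
func ShuffledAnswerIDs(testID int64, questionID string, answerIDs []int64) []int64 {
	out := append([]int64(nil), answerIDs...)
	rng := rand.New(rand.NewSource(shuffleSeed(testID, questionID)))
	rng.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

func main() {
	ids := []int64{11, 12, 13, 14}
	fmt.Println(ShuffledAnswerIDs(7, "a1b2c3d4e5f60718", ids))
	fmt.Println(ShuffledAnswerIDs(7, "a1b2c3d4e5f60718", ids)) // same order again
	fmt.Println(ShuffledAnswerIDs(8, "a1b2c3d4e5f60718", ids)) // seeded differently
}
```

Nothing about the shuffle needs to be stored: the handler can recompute it on every GET and on the POST that checks the selected answer.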
POST /test/<id>/q/<n> — inside one transaction:
- Record selection in test_answers (compute is_correct server-side).
- Call UpsertStat(userID, questionID, isCorrect) to bump times_seen, conditionally times_correct, and update last_seen_at.
- If more questions remain → redirect to next.
- Else → mark test complete (completed_at), redirect to /test/<id>/results.
Only the owning user can access their test.
Acceptance:
- Start a 5-question test, answer all 5, end on results page.
- After several tests, weighted mode demonstrably surfaces previously-wrong questions more often than ones consistently answered correctly. Verify by inspecting user_question_stats and re-running test creation a few times.
- A mastered question (e.g. seen 10×, correct 10×) still has a non-zero chance of appearing. Confirm by running enough tests with a small pool that the floor weight is exercised.
Phase 7 — Results & review
GET /test/<id>/results:
- Score: X / N (Y%).
- Time taken.
- For each question: question text, every answer listed, with markers: ✅ correct answer, ❌ user's choice (if wrong), ✅ both (if user got it right).
- Source & tags shown per question.
Acceptance: wrong answers are clearly visible, correct answer always shown, page renders cleanly on a phone.
Phase 8 — History
GET /history:
- List of past tests for the current user: date, score, n_questions, link to results.
- Aggregate stat: overall % correct across all attempts.
- Stretch: "Weak spots" — questions this user has gotten wrong more than once, with link to the question.
Acceptance: taking multiple tests builds up a history; clicking back into a past test shows the full review.
Phase 9 — Containerise & deploy to Portainer
Build:
- Dockerfile: multi-stage.
  - Stage 1 (golang:1.22-alpine or newer): go mod download, run the Tailwind standalone CLI to produce web/static/tailwind.css, then CGO_ENABLED=0 go build -o /out/qbank ./cmd/server.
  - Stage 2 (gcr.io/distroless/static-debian12 if not using pdftotext, otherwise alpine:3.20 with apk add --no-cache poppler-utils). Copy the binary, web/templates/, and web/static/. EXPOSE 8080. Non-root user. ENTRYPOINT ["/qbank"].
- .dockerignore: at minimum data/, *.db, .env, .git, node_modules (if any sneak in).
docker-compose.yml (this is what you paste into Portainer's "Stacks → Add stack → Web editor"):
services:
  qbank:
    image: ghcr.io/<you>/qbank:latest # or build locally and push, see below
    container_name: qbank
    restart: unless-stopped
    ports:
      - "8080:8080" # or put it behind your existing reverse proxy
    environment:
      DATA_DIR: /data
      PORT: "8080"
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      SESSION_SECRET: ${SESSION_SECRET}
      ADMIN_USERS: ${ADMIN_USERS} # e.g. "alice:hunter2,bob:correcthorse"
    volumes:
      - qbank-data:/data
    healthcheck:
      test: ["CMD", "/qbank", "healthcheck"] # implement a tiny subcommand, or use wget if base image has it
      interval: 30s
      timeout: 5s
      retries: 3
volumes:
  qbank-data:
Set the three env vars in Portainer's stack "Environment variables" section so they don't end up in the compose file. ADMIN_USERS is only read on first start (when the users table is empty) — subsequent restarts ignore it, so it's safe to leave set.
Image options — pick one:
- Build on your machine, push to a registry:
docker build -t ghcr.io/<you>/qbank:latest .thendocker push. Portainer pulls it. Simplest if you already use GHCR/Docker Hub. - Build on the Portainer host directly: in the stack, replace
image:withbuild: { context: ., dockerfile: Dockerfile }and use Portainer's git-based stack deployment pointing at the repo. Portainer pulls the repo and builds. Good for iterating without a registry round-trip.
HTTPS / external access:
- If you have an existing reverse proxy on that host (Traefik, Caddy, Nginx Proxy Manager), don't publish 8080. Put qbank on the proxy's network and add labels/config so the proxy terminates TLS and forwards to the container's internal port 8080. Mention this in the README so future-you remembers.
- If you don't have a reverse proxy yet and want it reachable from her phone outside your LAN, the easiest add-on is Caddy in front, or a Tailscale sidecar so the app is only reachable on your tailnet (private, no public exposure, works on her phone with the Tailscale app installed).
Healthcheck note: the distroless image has no shell, so the healthcheck command must be the binary itself. Implement a healthcheck subcommand in cmd/server/main.go that does a localhost GET against /healthz and exits non-zero on failure. Or, if you switch the base image to alpine, you can use wget --spider -q http://localhost:8080/healthz and skip the subcommand.
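A sketch of what the healthcheck subcommand might look like. The 3-second client timeout is an assumption, chosen to stay under the 5s healthcheck timeout in the compose file, and reading PORT here simply mirrors the config loader.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// runHealthcheck GETs url and returns a process exit code:
// 0 when the server answers 200, 1 on any error or other status.
func runHealthcheck(url string) int {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "healthcheck:", err)
		return 1
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Fprintln(os.Stderr, "healthcheck: status", resp.StatusCode)
		return 1
	}
	return 0
}

func main() {
	// "qbank healthcheck" probes the running server and exits.
	if len(os.Args) > 1 && os.Args[1] == "healthcheck" {
		port := os.Getenv("PORT")
		if port == "" {
			port = "8080"
		}
		os.Exit(runHealthcheck("http://127.0.0.1:" + port + "/healthz"))
	}
	// Normal server startup would continue here.
	fmt.Println("starting server (omitted in this sketch)")
}
```

Docker treats a non-zero exit code from the healthcheck command as a failed probe, so returning 1 on every failure path is enough.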
Backups: the qbank-data volume holds everything that matters. A weekly cron job on the host can run docker run --rm -v qbank-data:/data -v /backups:/out alpine tar czf /out/qbank-$(date +%F).tar.gz /data. Inside a crontab entry, write the % as \%, because cron treats an unescaped % as a newline. Document this in the README.
Acceptance: stack comes up in Portainer, container is healthy, you can log in from your laptop and from her phone, restarting the container preserves all questions and history.
Appendix A — Weighting algorithm
Each question has a per-user weight that determines its sampling probability in "Focus on weak spots" mode. Implement in internal/sampling (new package).
Weight formula
For a question with times_seen = s, times_correct = c, and last_seen_at = t:
wrong = s - c
# Laplace-smoothed error rate. Pretending you've seen the question once
# right and once wrong dampens noise from tiny sample sizes — one wrong
# answer doesn't catapult the question to top priority, and one right
# answer doesn't bury it.
error_rate = (wrong + 1) / (s + 2)
# Floor so mastered questions still appear occasionally. 0.15 means a
# perfectly-mastered question shows up at roughly 15% the frequency of a
# brand-new one. Tune to taste.
base = max(0.15, error_rate)
# Recency nudge. Long-unseen questions creep back up regardless of past
# accuracy — this prevents the "staleness death-spiral" where low-weight
# questions never get sampled, never get updated, and stay low forever.
# Caps at 2× after ~30 days; unseen questions (t is NULL) get the full 2×.
days_since = (now - t).days if t != NULL else 30
recency = 1 + min(days_since / 30.0, 1.0)
weight = base * recency
Unseen questions (no row in user_question_stats) get base = 0.5 (mid-range, not punished and not boosted; it is also what the smoothed error rate gives at s = 0) and the full recency multiplier. They end up at weight 1.0, the same as a question you've gotten wrong about half the time and haven't seen for 30 days or more.
Knobs worth exposing as constants at the top of the file (don't make them config — just easy to find for tuning):
const (
	FloorWeight      = 0.15 // never drop below this
	RecencyCapDays   = 30.0 // days at which recency multiplier saturates
	RecencyMaxMult   = 2.0  // peak recency multiplier
	UnseenBaseWeight = 0.5  // base for questions with no stats row
)
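The weight formula translates to Go fairly directly. In this sketch the arithmetic lives in a small helper, weightFrom, which is an assumed name, so the numbers can be checked without constructing timestamps. UserQuestionStat is assumed to mirror one row of user_question_stats, with a zero LastSeenAt standing in for NULL.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const (
	FloorWeight      = 0.15
	RecencyCapDays   = 30.0
	RecencyMaxMult   = 2.0
	UnseenBaseWeight = 0.5
)

// UserQuestionStat mirrors one row of user_question_stats (assumed shape).
type UserQuestionStat struct {
	TimesSeen    int
	TimesCorrect int
	LastSeenAt   time.Time // zero value means NULL
}

// weightFrom applies the formula to raw numbers.
func weightFrom(seen, correct int, daysSince float64) float64 {
	// Laplace-smoothed error rate: (wrong + 1) / (seen + 2).
	errorRate := float64(seen-correct+1) / float64(seen+2)
	base := math.Max(FloorWeight, errorRate)
	// Recency grows linearly from 1x to RecencyMaxMult over RecencyCapDays.
	frac := math.Min(math.Max(daysSince, 0)/RecencyCapDays, 1.0)
	return base * (1 + (RecencyMaxMult-1)*frac)
}

// ComputeWeight returns the sampling weight; nil means "never seen".
func ComputeWeight(stat *UserQuestionStat, now time.Time) float64 {
	if stat == nil || stat.TimesSeen == 0 {
		return UnseenBaseWeight * RecencyMaxMult
	}
	days := RecencyCapDays // NULL last_seen_at counts as fully stale
	if !stat.LastSeenAt.IsZero() {
		days = now.Sub(stat.LastSeenAt).Hours() / 24
	}
	return weightFrom(stat.TimesSeen, stat.TimesCorrect, days)
}

func main() {
	now := time.Date(2025, 1, 31, 0, 0, 0, 0, time.UTC)
	weak := &UserQuestionStat{TimesSeen: 8, TimesCorrect: 3, LastSeenAt: now.AddDate(0, 0, -15)}
	mastered := &UserQuestionStat{TimesSeen: 10, TimesCorrect: 10, LastSeenAt: now}
	fmt.Printf("unseen   %.2f\n", ComputeWeight(nil, now))
	fmt.Printf("weak     %.2f\n", ComputeWeight(weak, now))
	fmt.Printf("mastered %.2f\n", ComputeWeight(mastered, now))
}
```

Worked through by hand: the weak question has error rate (5 + 1) / (8 + 2) = 0.6 and recency 1 + 15/30 = 1.5, so its weight is 0.90. The mastered one has error rate 1/12, which is lifted to the 0.15 floor, and recency 1.0, so it stays at 0.15. The unseen question gets 0.5 × 2 = 1.00.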
Weighted sample without replacement (A-Res)
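package sampling

import (
	"math"
	"math/rand"
	"sort"
)

// Candidate is a question eligible for the test plus its sampling weight.
// Assumed shape: the plan only relies on the ID and Weight fields.
type Candidate struct {
	ID     string
	Weight float64
}

// SelectWeighted picks n distinct items from candidates using their weights.
// Equivalent to repeated weighted draws without replacement, in one pass.
// Weights must be positive. Time: O(m log m) where m = len(candidates).
func SelectWeighted(candidates []Candidate, n int, rng *rand.Rand) []Candidate {
	if n <= 0 {
		return nil // also avoids a panic in make() for negative n
	}
	if n >= len(candidates) {
		return candidates
	}
	type keyed struct {
		c   Candidate
		key float64
	}
	keys := make([]keyed, len(candidates))
	for i, c := range candidates {
		u := rng.Float64()
		if u == 0 {
			u = 1e-12 // keep the key strictly positive
		}
		// Efraimidis–Spirakis A-Res key: u^(1/weight)
		keys[i] = keyed{c, math.Pow(u, 1.0/c.Weight)}
	}
	// The n largest keys form the sample.
	sort.Slice(keys, func(i, j int) bool { return keys[i].key > keys[j].key })
	out := make([]Candidate, n)
	for i := 0; i < n; i++ {
		out[i] = keys[i].c
	}
	return out
}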
Testing the sampler
Write a deterministic test in internal/sampling/sampling_test.go that:
- Builds a fake pool of 100 questions with hand-crafted stats (some mastered, some weak, some unseen).
- Runs SelectWeighted 10,000 times with a seeded RNG.
- Asserts that high-error-rate questions are sampled at >3× the rate of mastered ones, and that mastered ones are still sampled at least N times (proving the floor works).
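The frequency check could be sketched as below. To stay self-contained the block carries its own index-based copy of the A-Res selection (selectWeighted, lowercase, is not the plan's function), and the pool composition in samplePool is made up for illustration: 20 weak questions at weight 0.9, 60 unseen at 1.0, and 20 mastered at the 0.15 floor.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"sort"
)

// selectWeighted returns the indices of n items drawn without replacement,
// using A-Res keys u^(1/w). Weights must be positive.
func selectWeighted(weights []float64, n int, rng *rand.Rand) []int {
	type keyed struct {
		idx int
		key float64
	}
	keys := make([]keyed, len(weights))
	for i, w := range weights {
		u := rng.Float64()
		if u == 0 {
			u = 1e-12
		}
		keys[i] = keyed{i, math.Pow(u, 1.0/w)}
	}
	sort.Slice(keys, func(a, b int) bool { return keys[a].key > keys[b].key })
	if n > len(keys) {
		n = len(keys)
	}
	out := make([]int, n)
	for i := range out {
		out[i] = keys[i].idx
	}
	return out
}

// samplePool builds 100 weights: indices 0-19 weak, 20-79 unseen,
// 80-99 mastered. The values are illustrative.
func samplePool() []float64 {
	weights := make([]float64, 100)
	for i := range weights {
		switch {
		case i < 20:
			weights[i] = 0.9
		case i < 80:
			weights[i] = 1.0
		default:
			weights[i] = 0.15
		}
	}
	return weights
}

// inclusionCounts draws n items `rounds` times with a seeded RNG and counts
// how often each index is picked.
func inclusionCounts(weights []float64, n, rounds int, seed int64) []int {
	rng := rand.New(rand.NewSource(seed))
	counts := make([]int, len(weights))
	for r := 0; r < rounds; r++ {
		for _, idx := range selectWeighted(weights, n, rng) {
			counts[idx]++
		}
	}
	return counts
}

func main() {
	counts := inclusionCounts(samplePool(), 10, 10000, 42)
	weak, mastered := 0, 0
	for i, c := range counts {
		if i < 20 {
			weak += c
		} else if i >= 80 {
			mastered += c
		}
	}
	fmt.Printf("weak picks: %d, mastered picks: %d\n", weak, mastered)
}
```

With a weight ratio of 0.9 to 0.15 and only 10 of 100 questions drawn per test, the weak group should come out several times more frequent than the mastered group, which leaves comfortable headroom over the 3× assertion.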
Where it plugs in
- internal/sampling/weight.go: ComputeWeight(stat *UserQuestionStat, now time.Time) float64. Takes nil for unseen.
- internal/sampling/select.go: SelectWeighted above.
- internal/handlers/test.go POST /test/new:
  - candidates := repo.ListQuestions(filter).
  - stats := repo.GetStatsForUser(userID, candidateIDs) (map keyed by question_id).
  - For each candidate: cand.Weight = sampling.ComputeWeight(stats[cand.ID], time.Now()), or 1.0 if mode is "uniform".
  - picked := sampling.SelectWeighted(candidates, n, rng).
  - Shuffle picked, persist as the test's question_ids.
Conventions for the build
- Errors bubble up; handlers translate to HTTP via a small httpErr helper. Never panic in a handler.
- All DB access goes through internal/db/repo.go; no inline SQL in handlers.
- Templates use a single layout.html with {{block "content" .}}{{end}}. Pages define content only.
- CSRF: scs does not ship CSRF protection, so add a dedicated middleware (for example github.com/justinas/nosurf) and apply it to POST routes.
- File upload limit: 20MB, enforced via http.MaxBytesReader.
- All user-supplied text rendered with html/template's default escaping; never use template.HTML on imported content.
- Log: one structured line per request with method, path, status, duration, user.
- Tests: write table-driven tests for parse/, llm/ (mock the OpenAI client behind an interface), and the question-dedup logic. Handler tests with httptest for the auth flow and the test-taking flow.
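For the upload limit, a sketch of how http.MaxBytesReader behaves in a handler. Draining the body with io.Copy is only for the demo (the real handler would parse the multipart form instead), and limitedUpload and tryUpload are hypothetical names. The typed *http.MaxBytesError used to detect the over-limit case is available from Go 1.19.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

const maxUploadBytes = 20 << 20 // 20 MiB

// limitedUpload caps the request body at limit bytes and answers 413 when
// the client sends more.
func limitedUpload(limit int64) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, limit)
		n, err := io.Copy(io.Discard, r.Body)
		var tooBig *http.MaxBytesError
		if errors.As(err, &tooBig) {
			http.Error(w, "file too large", http.StatusRequestEntityTooLarge)
			return
		}
		if err != nil {
			http.Error(w, "bad upload", http.StatusBadRequest)
			return
		}
		fmt.Fprintf(w, "received %d bytes\n", n)
	}
}

// tryUpload posts size bytes to a handler with the given limit and returns
// the response status code. Demo helper only.
func tryUpload(limit int64, size int) int {
	rec := httptest.NewRecorder()
	body := strings.NewReader(strings.Repeat("x", size))
	req := httptest.NewRequest(http.MethodPost, "/upload", body)
	limitedUpload(limit).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(tryUpload(maxUploadBytes, 1024)) // small file, under the limit
	fmt.Println(tryUpload(16, 17))               // one byte over a tiny limit
}
```

A body of exactly limit bytes is still accepted; only the first byte beyond it triggers the error.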
Open questions to confirm before Phase 0
- Two separate accounts or one shared? (Plan above assumes separate.)
- Should the LLM auto-suggest tags during import, or are tags manual-only? (Plan above is manual; auto-tagging is a small addition to the prompt + schema.)
- Hard delete or soft delete for questions? (Plan above is hard delete.)
- Comfortable with the weighting defaults in Appendix A (floor 0.15, recency cap at 30 days, unseen base 0.5)? These are tuneable later but it's worth a sanity check up front: if you want mastered questions to appear more rarely, lower the floor; more often, raise it.