9 Commits

Author SHA1 Message Date
Nico
b35610cf6f Add node-level test suite for ErasExpertNode
6 tests that instantiate ErasExpertNode directly (no HTTP, no pipeline).
Assert SQL table selection, JOIN patterns, and response hygiene.
2 LLM calls per test vs 4+ for the matrix suite; runs in ~22s total locally.
Requires pymysql in venv and DB access (WireGuard or NodePort).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 22:12:01 +02:00
Nico
d8e832d2d4 Add --parallel=N for concurrent test execution
- run_tests.py: ThreadPoolExecutor runs N tests concurrently within a suite
- Each testcase has its own session_id, so parallel execution is safe
- Engine tests: fixed asyncio.new_event_loop() for thread safety
- Usage: python tests/run_tests.py testcases --parallel=3
- Wall time reduction: ~3x for testcases (15min → 5min with parallel=3)
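
The concurrency pattern described above can be sketched roughly as follows. This is not the code from run_tests.py: `fake_test`, `run_in_own_loop`, and `run_parallel` are hypothetical names, and the sketch assumes each async test body is driven by an event loop created inside its worker thread.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fake_test(name):
    # Hypothetical stand-in for a real async test body.
    await asyncio.sleep(0.01)
    return {"name": name, "passed": True}


def run_in_own_loop(name):
    # Each worker thread creates and closes its own event loop, so no loop
    # object is ever shared between threads.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(fake_test(name))
    finally:
        loop.close()


def run_parallel(names, parallel):
    # Executor.map preserves input order, so results line up with names.
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(run_in_own_loop, names))
```

With `parallel=3`, at most three tests are in flight at once, so the achievable speedup is bounded by N and by the slowest test in the suite.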

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 20:01:06 +02:00
Nico
c21ff08211 Unify testcases into run_tests.py: SSE client, session isolation, dashboard
- tests/test_testcases.py: new ChatClient using /api/chat SSE (replaces
  /api/send polling), each testcase gets own session_id
- Registered as 'testcases' suite in run_tests.py (25 markdown testcases)
- Results post to /api/test-results for real-time /tests dashboard
- Reuses parser + assertion engine from runtime_test.py
- Usage: python tests/run_tests.py testcases/fast
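
For readers unfamiliar with SSE, a minimal sketch of parsing a text/event-stream body into events is shown here. It is not the ChatClient from tests/test_testcases.py: `parse_sse` is a hypothetical helper, and the event names and payloads that /api/chat actually emits are not stated in this log.

```python
def parse_sse(lines):
    # Yields (event_type, data) pairs from an iterable of raw stream lines.
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            # Comment line, commonly used as a keepalive.
            continue
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is stripped
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
```

Unlike polling, the client reads one long-lived response and reacts to each event as it arrives.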

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 19:48:58 +02:00
Nico
d3350fd502 Add 4 red tests for Phase 2: shared node pool + contextvar HUD
Tests that will pass once implemented:
- pool_creates_shared_nodes: NodePool has shared stateless nodes
- pool_excludes_stateful: sensor/memorizer/ui not shared
- pool_reuses_instances: same pool returns same objects
- contextvar_hud_isolation: concurrent tasks get isolated HUD
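
The isolation property targeted by contextvar_hud_isolation can be illustrated with a small sketch. The names `current_hud` and `handle` are hypothetical and the real HUD structure is not shown in this log; the sketch only demonstrates that concurrent asyncio tasks each see their own value of a context variable.

```python
import asyncio
import contextvars

# Hypothetical context variable holding the HUD event list of the current task.
current_hud = contextvars.ContextVar("current_hud")


async def handle(session):
    # asyncio runs every task in a copy of the creating context, so this
    # set() is invisible to sibling tasks.
    current_hud.set([])
    current_hud.get().append(f"{session}:start")
    await asyncio.sleep(0)  # yield so the two tasks interleave
    current_hud.get().append(f"{session}:end")
    return current_hud.get()


async def main():
    return await asyncio.gather(handle("a"), handle("b"))
```

Because the value lives in the task's context rather than on the node object, stateless nodes could in principle be shared across sessions without their HUD output mixing.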

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 18:35:48 +02:00
Nico
b4031611e2 Add --repeat=N mode to test runner with timing stats (avg/p50/p95)
- run_tests.py: --repeat=N runs each test N times, aggregates into one result
- Stats include: runs, pass_rate, min/avg/p50/p95/max_ms
- Stats posted in result.stats field for dashboard display
- Works with all suites (engine, api, matrix, roundtrip)
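
One way the aggregated stats could be computed is sketched below. The key names follow the list above, but the exact field names and the percentile method used by run_tests.py are not stated here; this sketch assumes nearest-rank percentiles, and `percentile` and `aggregate` are hypothetical helpers.

```python
import math
import statistics


def percentile(sorted_ms, pct):
    # Nearest-rank percentile: the smallest sample with at least pct percent
    # of all samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(sorted_ms)))
    return sorted_ms[rank - 1]


def aggregate(durations_ms, passed):
    # Collapses N repeated runs of one test into a single stats dict.
    ordered = sorted(durations_ms)
    return {
        "runs": len(ordered),
        "pass_rate": sum(passed) / len(passed),
        "min_ms": ordered[0],
        "avg_ms": statistics.fmean(ordered),
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "max_ms": ordered[-1],
    }
```

With small N, p95 is usually the same sample as the maximum, so the tail numbers only become informative at higher repeat counts.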

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 18:18:09 +02:00
Nico
4e679a3ad9 Add model matrix test suite: 3 tests × 3 variants = 9 combos
New 'matrix' suite runs same API tests with different LLM model configs:
- Variants: gemini-flash (baseline), haiku, gpt-4o-mini
- Tests: eras_query (SQL correctness), eras_artifact (data output), social_reflex (fast path)
- Posts results as test_name[variant] to /tests dashboard
- All 9 combos passing (6/9 verified locally, ~35s for ERAS tests)
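
The test × variant expansion and the test_name[variant] naming can be sketched as below. The test and variant names come from the list above; `matrix_names` is a hypothetical helper, not the suite's actual code.

```python
from itertools import product

VARIANTS = ["gemini-flash", "haiku", "gpt-4o-mini"]
TESTS = ["eras_query", "eras_artifact", "social_reflex"]


def matrix_names(tests, variants):
    # One result name per (test, variant) pair, in the dashboard's format.
    return [f"{t}[{v}]" for t, v in product(tests, variants)]
```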

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 18:12:24 +02:00
Nico
58734c34d2 Wire model overrides end-to-end: API → runtime → frame engine
- /api/chat accepts {"models": {"role": "provider/model"}} for per-request overrides
- runtime.handle_message passes model_overrides through to frame engine
- All 4 graph definitions (v1-v4) now declare MODELS dicts
- test_graph_has_models expanded to verify all graphs
- 11/11 engine tests green
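
A minimal sketch of how per-request overrides might be layered over a graph's MODELS dict follows. The role names and model ids are placeholders and `resolve_models` is a hypothetical helper; only the payload shape {"models": {"role": "provider/model"}} comes from this commit.

```python
# Hypothetical graph-level defaults; role names and model ids are made up.
MODELS = {"expert": "provider-a/model-x", "interpreter": "provider-a/model-y"}

# Shape of the /api/chat payload described above, with placeholder values.
payload = {"models": {"expert": "provider-b/model-z"}}


def resolve_models(graph_models, model_overrides=None):
    # Per-request overrides win over graph defaults; the defaults are not mutated.
    return {**graph_models, **(model_overrides or {})}
```

Roles missing from the override keep their graph default, so a request only needs to name the roles it wants to change.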

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 18:07:42 +02:00
Nico
ecfbc86676 Add 3 red tests for Phase 1: config-driven models
Tests that will pass once implemented:
- graph_has_models: graph definition includes MODELS dict
- instantiate_applies_graph_models: node.model set from graph config
- model_override_per_request: process_message accepts model_overrides

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 18:03:56 +02:00
Nico
097c7f31f3 Add engine test suite: 8 tests for graph loading, conditions, frame traces
New 'engine' suite in run_tests.py with tests that verify frame engine
mechanics without LLM calls. Covers graph loading, node instantiation,
edge type completeness, reflex/tool_output conditions, and frame trace
structure for reflex/expert/expert+interpreter pipelines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 18:01:06 +02:00