o4-mini, OpenAI's reasoning model. Chain-of-thought reasoning enables better logical structure, more nuanced analysis, and more accurate consumer-rights coverage.
V3 uses o4-mini, OpenAI's reasoning-optimized model. Before generating the visible content, the model "thinks" internally using reasoning tokens, which are billed as output but never shown. The result is more logically structured articles with better argument coherence than standard completion models produce.
| Attribute | Value |
|---|---|
| Model | o4-mini |
| Web searches | None |
| Avg article length | 1,300–1,800 words (reasoning consumes part of the token budget) |
| Avg quality score | 9/10 |
| Cost per article | €0.018–0.030 |
| Avg generation time | 45–80 seconds |
| Reasoning tokens | Billed but not shown in output |
| Cost × 1,000 articles | ≈ €18–30 |
o4-mini uses a two-phase internal process:
```
┌──────────────────────────────────────────────┐
│ PHASE 1: REASONING (internal, not visible)   │
│                                              │
│ Model "thinks" about:                        │
│ - Best article structure for this service    │
│ - Which competitor claims are plausible      │
│ - How to structure the cancellation guide    │
│ - Consumer rights applicable in Australia    │
│                                              │
│ Uses N reasoning tokens (billed as output)   │
└──────────────────┬───────────────────────────┘
                   │
┌──────────────────▼───────────────────────────┐
│ PHASE 2: GENERATION (visible HTML output)    │
│                                              │
│ Produces final article based on reasoning    │
│ Uses completion tokens (also billed)         │
└──────────────────────────────────────────────┘
```
Key implication: Set max_completion_tokens high enough (10,000+) to allow both reasoning AND generation. Setting it too low results in truncated articles.
V3 requires a system message to enforce Billoff branding and word count. Without it, the model tends to drop brand mentions:
```python
system_msg = (
    "You are a senior SEO content writer for Billoff (billoff.com). "
    "CRITICAL: The word 'Billoff' MUST appear multiple times in your output. "
    "NEVER write 'Postclic'. "
    "Target output: 1,600–2,200 words of pure HTML."
)
```
Why needed: Reasoning models tend to be very literal — if the system message is absent, they may produce excellent structure but miss branding requirements specified only in the user prompt.
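Since a branding miss is silent (the article otherwise looks fine), it is worth validating the output after generation. A minimal sketch of such a check; `check_branding` and its threshold are illustrative assumptions, not code from the repo:

```python
import re

def check_branding(html, min_mentions=3):
    """Return a list of branding problems in a generated article.

    min_mentions is an assumed threshold; tune it to your own rules.
    """
    problems = []
    if len(re.findall(r"\bBilloff\b", html)) < min_mentions:
        problems.append(f"fewer than {min_mentions} 'Billoff' mentions")
    if re.search(r"\bPostclic\b", html, re.IGNORECASE):
        problems.append("forbidden brand 'Postclic' present")
    return problems
```

An empty list means the article passes; otherwise regenerate it or flag it for manual review.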
```python
# Python (Billoff/scripts/03_generate_v3.py)
resp = client.chat.completions.create(
    model="o4-mini",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": prompt},
    ],
    max_completion_tokens=10000,  # includes reasoning + completion
)

# Extract reasoning tokens
reasoning_t = getattr(
    resp.usage.completion_tokens_details, "reasoning_tokens", 0
)
```
```javascript
// Browser (openai.js) — streaming
const resp = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json',  // required for a JSON body
  },
  body: JSON.stringify({
    model: 'o4-mini',
    messages: [systemMsg, ...messages],
    max_completion_tokens: 10000,
    stream: true,
  }),
});
// Reasoning tokens are reported in usage.completion_tokens_details.reasoning_tokens
```
| Component | Tokens (avg) | Cost (USD) | Cost (EUR) |
|---|---|---|---|
| Prompt tokens | ~1,900 | $0.00209 | €0.00192 |
| Completion (output) tokens | ~4,800 | $0.02112 | €0.01943 |
| Reasoning tokens (avg) | ~500–2,000 | $0.0022–0.0088 | €0.002–0.008 |
| TOTAL per article | ~7,200 | ~$0.026 | ~€0.024 |
| × 20 articles | ~144,000 | $0.52 | €0.48 |
| × 1,000 articles | ~7.2M | $26 | €24 |
Model pricing: Input $1.10/1M, Output $4.40/1M, Reasoning $4.40/1M tokens.
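Plugging the table's average token counts into those rates reproduces the per-article figure. A quick sanity check (the 800 reasoning tokens are an assumed mid-range value from the 500–2,000 band):

```python
def article_cost_usd(prompt_t, completion_t, reasoning_t):
    """Cost in USD at the per-million-token rates quoted above."""
    return (
        prompt_t / 1e6 * 1.10        # input: $1.10 / 1M
        + completion_t / 1e6 * 4.40  # output: $4.40 / 1M
        + reasoning_t / 1e6 * 4.40   # reasoning billed at the output rate
    )

cost = article_cost_usd(1900, 4800, 800)  # ≈ $0.0267, matching "~$0.026"
```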
```bash
# Run V3 batch (all 20 sample services, 3 parallel workers)
python scripts/03_generate_v3.py
# Output: Billoff/data/results_v3.json

# Compare all 3 methods side by side on 1 service
python scripts/test_compare_3methods.py

# Specific service
python scripts/test_compare_3methods.py --service "Spotify"
```
```python
# The script tracks reasoning tokens for accurate cost calculation
reasoning_t = 0
if hasattr(usage, "completion_tokens_details") and usage.completion_tokens_details:
    reasoning_t = getattr(usage.completion_tokens_details, "reasoning_tokens", 0)
elif hasattr(usage, "reasoning_tokens"):
    reasoning_t = usage.reasoning_tokens or 0

# Cost = (prompt × input_rate) + (completion × output_rate) + (reasoning × reasoning_rate)
cost_usd = (prompt_t / 1e6) * 1.10 + (completion_t / 1e6) * 4.40 + (reasoning_t / 1e6) * 4.40
```
| Metric | Result | vs V2 | Notes |
|---|---|---|---|
| Word count | 1,276 | −651 | Reasoning tokens reduce budget for output |
| Tables | 4 | Same | |
| H2 sections | 14 | Same | |
| H3 sub-sections | 38 | −2 | |
| Company fact box | ✅ | Same | Better structured than V2 |
| FAQ | ✅ | Same | |
| Logical coherence | ⭐⭐⭐⭐⭐ | Better | Most nuanced analysis |
| Quality score | 9/10 | Same | |
| Cost | €0.024 | 2.2× more | |
| Time | 56s | 41% faster | Reasoning is efficient |
Tip: Increase max_completion_tokens to 12,000–15,000 for longer articles.
| ✅ Pros | ❌ Cons |
|---|---|
| Best logical structure & flow | 2× more expensive than V2 |
| Fastest wall-clock time (56s) | Shorter articles (reasoning uses token budget) |
| Most nuanced consumer advice | Branding requires explicit system message |
| Consistent section structure | No real-time data (same as V2) |
| Better at edge cases & exceptions | Reasoning tokens not visible/auditable |
| Great for legal/rights sections | May be overkill for simple services |
| Criterion | V1 Research | V2 GPT-5 Mini | V3 O4-Mini |
|---|---|---|---|
| Quality (avg) | 10/10 | 9/10 | 9/10 |
| Word count (avg) | 2,100+ | 1,900+ | 1,300+ |
| Cost/article | €0.10 | €0.011 | €0.024 |
| Speed | 137s | 95s | 56s |
| Real-time data | ✅ Yes | ❌ No | ❌ No |
| Best for | Top 100 services | Mass generation | Editorial quality |
| Scale (1,000 articles) | €102 | €11 | €24 |
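The "Best for" row implies a simple routing rule when generating at scale. A hypothetical sketch; the rank thresholds are illustrative assumptions, not values from the repo:

```python
def choose_method(traffic_rank, needs_fresh_data=False):
    """Pick a generation method per the comparison table's trade-offs."""
    if needs_fresh_data or traffic_rank <= 100:
        return "V1"  # real-time research justifies ~€0.10/article
    if traffic_rank <= 500:
        return "V3"  # editorial quality for mid-tier services
    return "V2"      # cheapest option for long-tail mass generation
```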