METHOD 3

🧠 O4-Mini Reasoning (V3)

OpenAI's reasoning model. Chain-of-thought enables better logical structure, nuanced analysis, and accurate consumer rights coverage.

Overview

V3 uses o4-mini, OpenAI's reasoning-optimized model. Before generating the visible content, the model "thinks" internally using reasoning tokens that are billed but not visible. This produces more logically structured articles with better argument coherence than standard completion models.

| Attribute | Value |
|---|---|
| Model | o4-mini |
| Web searches | None |
| Avg article length | 1,300–1,800 words (reasoning uses tokens) |
| Avg quality score | 9/10 |
| Cost per article | €0.018–0.030 |
| Avg generation time | 45–80 seconds |
| Reasoning tokens | Billed but not shown in output |
| Cost × 1,000 articles | ≈ €18–30 |

How Reasoning Works

o4-mini uses a two-phase internal process:

┌──────────────────────────────────────────────┐
│  PHASE 1: REASONING (internal, not visible)  │
│                                              │
│  Model "thinks" about:                       │
│  - Best article structure for this service   │
│  - Which competitor claims are plausible     │
│  - How to structure the cancellation guide   │
│  - Consumer rights applicable in Australia   │
│                                              │
│  Uses N reasoning tokens (billed as output)  │
└──────────────────┬───────────────────────────┘
┌──────────────────▼───────────────────────────┐
│  PHASE 2: GENERATION (visible HTML output)   │
│                                              │
│  Produces final article based on reasoning   │
│  Uses completion tokens (also billed)        │
└──────────────────────────────────────────────┘

Key implication: Set max_completion_tokens high enough (10,000+) to allow both reasoning AND generation. Setting it too low results in truncated articles.

System Message (Important)

V3 requires a system message to enforce Billoff branding and word count. Without it, the model tends to drop brand mentions:

"You are a senior SEO content writer for Billoff (billoff.com). "
"CRITICAL: The word 'Billoff' MUST appear multiple times in your output. "
"NEVER write 'Postclic'. "
"Target output: 1,600–2,200 words of pure HTML."

Why needed: Reasoning models tend to be very literal — if the system message is absent, they may produce excellent structure but miss branding requirements specified only in the user prompt.
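A post-generation check of those branding rules can be sketched as follows (the helper name, the minimum-mention threshold, and the tag-stripping heuristic for word counting are assumptions, not taken from the scripts):

```python
import re

def check_branding(html: str, min_mentions: int = 3) -> list[str]:
    """Validate V3 output against the system-message branding rules.

    Returns a list of human-readable issues; an empty list means the
    article passed. min_mentions is an assumed threshold.
    """
    issues = []
    if len(re.findall(r"\bBilloff\b", html)) < min_mentions:
        issues.append("too few 'Billoff' mentions")
    if "Postclic" in html:
        issues.append("forbidden brand 'Postclic' present")
    # Rough word count: strip HTML tags, then count word tokens
    words = len(re.findall(r"\b\w+\b", re.sub(r"<[^>]+>", " ", html)))
    if not 1600 <= words <= 2200:
        issues.append(f"word count {words} outside 1,600-2,200 target")
    return issues
```

Running this on each generated article before saving catches the branding drift described above without a manual review pass.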

API Call Configuration

# Python (Billoff/scripts/03_generate_v3.py)
resp = client.chat.completions.create(
    model="o4-mini",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user",   "content": prompt},
    ],
    max_completion_tokens=10000,  # includes reasoning + completion
)

# Extract reasoning tokens (completion_tokens_details may be None
# on some SDK versions, so guard before reading the attribute)
details = resp.usage.completion_tokens_details
reasoning_t = getattr(details, "reasoning_tokens", 0) if details else 0
// Browser (openai.js) — streaming
const resp = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'o4-mini',
    messages: [systemMsg, ...messages],
    max_completion_tokens: 10000,
    stream: true,
  })
});
// Reasoning tokens visible in usage.completion_tokens_details.reasoning_tokens

Cost Breakdown

| Component | Tokens (avg) | Cost (USD) | Cost (EUR) |
|---|---|---|---|
| Prompt tokens | ~1,900 | $0.00209 | €0.00192 |
| Completion (output) tokens | ~4,800 | $0.02112 | €0.01943 |
| Reasoning tokens (avg) | ~500–2,000 | $0.0022–0.0088 | €0.002–0.008 |
| TOTAL per article | ~7,200 | ~$0.026 | ~€0.024 |
| × 20 articles | ~144,000 | $0.52 | €0.48 |
| × 1,000 articles | ~7.2M | $26 | €24 |

Model pricing: input $1.10/1M tokens, output $4.40/1M tokens; reasoning tokens are billed at the output rate ($4.40/1M).
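As a sanity check on the breakdown above, the per-article total can be recomputed from the listed rates (function name is illustrative; token counts are the table's averages):

```python
RATE_IN = 1.10    # USD per 1M input tokens
RATE_OUT = 4.40   # USD per 1M output tokens (reasoning billed at this rate)

def article_cost_usd(prompt_t: int, completion_t: int, reasoning_t: int) -> float:
    """Per-article cost: reasoning tokens are charged like completion tokens."""
    return (prompt_t * RATE_IN + (completion_t + reasoning_t) * RATE_OUT) / 1e6

# ~1,900 prompt + ~4,800 completion tokens, with the reasoning range
low = article_cost_usd(1900, 4800, 500)    # ≈ $0.025
high = article_cost_usd(1900, 4800, 2000)  # ≈ $0.032
```

The ~$0.026 figure in the table sits near the low end of the reasoning-token range, which matches the observed averages.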

Python Script Reference

# Run V3 batch (all 20 sample services, 3 parallel workers)
python scripts/03_generate_v3.py

# Output: Billoff/data/results_v3.json

# Compare all 3 methods side by side on 1 service
python scripts/test_compare_3methods.py

# Specific service
python scripts/test_compare_3methods.py --service "Spotify"

Reasoning token tracking

# The script tracks reasoning tokens for accurate cost calculation
reasoning_t = 0
if hasattr(usage, "completion_tokens_details") and usage.completion_tokens_details:
    reasoning_t = getattr(usage.completion_tokens_details, "reasoning_tokens", 0)
elif hasattr(usage, "reasoning_tokens"):
    reasoning_t = usage.reasoning_tokens or 0

# Cost = (prompt × input_rate) + (completion × output_rate) + (reasoning × reasoning_rate)
cost_usd = (prompt_t/1e6)*1.10 + (completion_t/1e6)*4.40 + (reasoning_t/1e6)*4.40

Quality Results (Ocado test)

| Metric | Result | vs V2 | Notes |
|---|---|---|---|
| Word count | 1,276 | −651 | Reasoning tokens reduce budget for output |
| Tables | 4 | Same | |
| H2 sections | 14 | Same | |
| H3 sub-sections | 38 | −2 | |
| Company fact box | ✅ | Same | Better structured than V2 |
| FAQ | ✅ | Same | |
| Logical coherence | ⭐⭐⭐⭐⭐ | Better | Most nuanced analysis |
| Quality score | 9/10 | Same | |
| Cost | €0.024 | 2.2× more | |
| Time | 56s | −41% faster | Reasoning is efficient |

Tip: Increase max_completion_tokens to 12,000–15,000 for longer articles.

Pros & Cons

| ✅ Pros | ❌ Cons |
|---|---|
| Best logical structure & flow | 2.2× more expensive than V2 |
| Fastest wall-clock time (56s) | Shorter articles (reasoning uses token budget) |
| Most nuanced consumer advice | Branding requires explicit system message |
| Consistent section structure | No real-time data (same as V2) |
| Better at edge cases & exceptions | Reasoning tokens not visible/auditable |
| Great for legal/rights sections | May be overkill for simple services |

Method Comparison Summary

| Criterion | V1 Research | V2 GPT-5 Mini | V3 O4-Mini |
|---|---|---|---|
| Quality (avg) | 10/10 | 9/10 | 9/10 |
| Word count (avg) | 2,100+ | 1,900+ | 1,300+ |
| Cost/article | €0.10 | €0.011 | €0.024 |
| Speed | 137s | 95s | 56s |
| Real-time data | ✅ Yes | ❌ No | ❌ No |
| Best for | Top 100 services | Mass generation | Editorial quality |
| Scale (1,000 articles) | €102 | €11 | €24 |
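The comparison suggests a simple routing rule per service. A hypothetical sketch (function name and rank thresholds are illustrative assumptions, not from the scripts):

```python
def pick_method(service_rank: int, needs_live_data: bool) -> str:
    """Route a service to a generation method based on the comparison table.

    service_rank: popularity rank of the service (1 = most popular).
    needs_live_data: True when the article requires real-time information,
    which only V1's web research can supply.
    """
    if needs_live_data or service_rank <= 100:
        return "v1"  # research: real-time data, highest quality, highest cost
    if service_rank <= 1000:
        return "v3"  # o4-mini: editorial quality for mid-tier services
    return "v2"      # gpt-5-mini: cheapest option for mass generation
```

The thresholds would need tuning against the actual service catalogue; the point is that the three methods are complementary rather than competing.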

Recommendation