METHOD 3

🧠 O4-Mini Reasoning (V3)

OpenAI's reasoning model. Chain-of-thought enables better logical structure, nuanced analysis, and accurate consumer rights coverage.

Overview

V3 uses o4-mini, OpenAI's reasoning-optimized model. Before generating the visible content, the model "thinks" internally using reasoning tokens that are billed but not visible. This produces more logically structured articles with better argument coherence than standard completion models.

| Attribute | Value |
|---|---|
| Model | o4-mini |
| Web searches | None |
| Avg article length | 1,300–1,800 words (reasoning uses tokens) |
| Avg quality score | 9/10 |
| Cost per article | €0.018–0.030 |
| Avg generation time | 45–80 seconds |
| Reasoning tokens | Billed but not shown in output |
| Cost × 1,000 articles | ≈ €18–30 |

How Reasoning Works

o4-mini uses a two-phase internal process:

┌──────────────────────────────────────────────┐
│  PHASE 1: REASONING (internal, not visible)  │
│                                              │
│  Model "thinks" about:                       │
│  - Best article structure for this service   │
│  - Which competitor claims are plausible     │
│  - How to structure the cancellation guide   │
│  - Consumer rights applicable in Australia   │
│                                              │
│  Uses N reasoning tokens (billed as output)  │
└──────────────────┬───────────────────────────┘
┌──────────────────▼───────────────────────────┐
│  PHASE 2: GENERATION (visible HTML output)   │
│                                              │
│  Produces final article based on reasoning   │
│  Uses completion tokens (also billed)        │
└──────────────────────────────────────────────┘

Key implication: Set max_completion_tokens high enough (10,000+) to allow both reasoning AND generation. Setting it too low results in truncated articles.

System Message (Important)

V3 requires a system message to enforce Billoff branding and word count. Without it, the model tends to drop brand mentions:

"You are a senior SEO content writer for Billoff (billoff.com). "
"CRITICAL: The word 'Billoff' MUST appear multiple times in your output. "
"NEVER write 'Postclic'. "
"Target output: 1,600–2,200 words of pure HTML."

Why needed: Reasoning models tend to be very literal — if the system message is absent, they may produce excellent structure but miss branding requirements specified only in the user prompt.
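A post-generation check of those branding rules can be sketched as follows (the helper name, the minimum-mention threshold, and the tag-stripping heuristic for word counting are assumptions, not taken from the scripts):

```python
import re

def check_branding(html: str, min_mentions: int = 3) -> list[str]:
    """Validate V3 output against the system-message branding rules.

    Returns a list of human-readable issues; an empty list means the
    article passed. min_mentions is an assumed threshold.
    """
    issues = []
    if len(re.findall(r"\bBilloff\b", html)) < min_mentions:
        issues.append("too few 'Billoff' mentions")
    if "Postclic" in html:
        issues.append("forbidden brand 'Postclic' present")
    # Rough word count: strip HTML tags, then count word tokens
    words = len(re.findall(r"\b\w+\b", re.sub(r"<[^>]+>", " ", html)))
    if not 1600 <= words <= 2200:
        issues.append(f"word count {words} outside 1,600-2,200 target")
    return issues
```

Running this on each generated article before saving catches the branding drift described above without a manual review pass.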

API Call Configuration

# Python (Billoff/scripts/03_generate_v3.py)
resp = client.chat.completions.create(
    model="o4-mini",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user",   "content": prompt},
    ],
    max_completion_tokens=10000,  # includes reasoning + completion
)

# Extract reasoning tokens (completion_tokens_details may be None
# on some SDK versions, so guard before reading the attribute)
details = resp.usage.completion_tokens_details
reasoning_t = getattr(details, "reasoning_tokens", 0) if details else 0
// Browser (openai.js) — streaming
const resp = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'o4-mini',
    messages: [systemMsg, ...messages],
    max_completion_tokens: 10000,
    stream: true,
  })
});
// Reasoning tokens visible in usage.completion_tokens_details.reasoning_tokens

Cost Breakdown

| Component | Tokens (avg) | Cost (USD) | Cost (EUR) |
|---|---|---|---|
| Prompt tokens | ~1,900 | $0.00209 | €0.00192 |
| Completion (output) tokens | ~4,800 | $0.02112 | €0.01943 |
| Reasoning tokens (avg) | ~500–2,000 | $0.0022–0.0088 | €0.002–0.008 |
| TOTAL per article | ~7,200 | ~$0.026 | ~€0.024 |
| × 20 articles | ~144,000 | $0.52 | €0.48 |
| × 1,000 articles | ~7.2M | $26 | €24 |

Model pricing: input $1.10/1M tokens, output $4.40/1M tokens; reasoning tokens are billed at the output rate ($4.40/1M).
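As a sanity check on the breakdown above, the per-article total can be recomputed from the listed rates (function name is illustrative; token counts are the table's averages):

```python
RATE_IN = 1.10    # USD per 1M input tokens
RATE_OUT = 4.40   # USD per 1M output tokens (reasoning billed at this rate)

def article_cost_usd(prompt_t: int, completion_t: int, reasoning_t: int) -> float:
    """Per-article cost: reasoning tokens are charged like completion tokens."""
    return (prompt_t * RATE_IN + (completion_t + reasoning_t) * RATE_OUT) / 1e6

# ~1,900 prompt + ~4,800 completion tokens, with the reasoning range
low = article_cost_usd(1900, 4800, 500)    # ≈ $0.025
high = article_cost_usd(1900, 4800, 2000)  # ≈ $0.032
```

The ~$0.026 figure in the table sits near the low end of the reasoning-token range, which matches the observed averages.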

Python Script Reference

# Run V3 batch (all 20 sample services, 3 parallel workers)
python scripts/03_generate_v3.py

# Output: Billoff/data/results_v3.json

# Compare all 3 methods side by side on 1 service
python scripts/test_compare_3methods.py

# Specific service
python scripts/test_compare_3methods.py --service "Spotify"

Reasoning token tracking

# The script tracks reasoning tokens for accurate cost calculation
reasoning_t = 0
if hasattr(usage, "completion_tokens_details") and usage.completion_tokens_details:
    reasoning_t = getattr(usage.completion_tokens_details, "reasoning_tokens", 0)
elif hasattr(usage, "reasoning_tokens"):
    reasoning_t = usage.reasoning_tokens or 0

# Cost = (prompt × input_rate) + (completion × output_rate) + (reasoning × reasoning_rate)
cost_usd = (prompt_t/1e6)*1.10 + (completion_t/1e6)*4.40 + (reasoning_t/1e6)*4.40

Quality Results (Ocado test)

| Metric | Result | vs V2 | Notes |
|---|---|---|---|
| Word count | 1,276 | −651 | Reasoning tokens reduce budget for output |
| Tables | 4 | Same | |
| H2 sections | 14 | Same | |
| H3 sub-sections | 38 | −2 | |
| Company fact box | ✅ | Same | Better structured than V2 |
| FAQ | ✅ | Same | |
| Logical coherence | ⭐⭐⭐⭐⭐ | Better | Most nuanced analysis |
| Quality score | 9/10 | Same | |
| Cost | €0.024 | 2.2× more | |
| Time | 56s | −41% faster | Reasoning is efficient |

Tip: Increase max_completion_tokens to 12,000–15,000 for longer articles.

Pros & Cons

| ✅ Pros | ❌ Cons |
|---|---|
| Best logical structure & flow | 2.2× more expensive than V2 |
| Fastest wall-clock time (56s) | Shorter articles (reasoning uses token budget) |
| Most nuanced consumer advice | Branding requires explicit system message |
| Consistent section structure | No real-time data (same as V2) |
| Better at edge cases & exceptions | Reasoning tokens not visible/auditable |
| Great for legal/rights sections | May be overkill for simple services |

Method Comparison Summary

| Criterion | V1 Research | V2 GPT-5 Mini | V3 O4-Mini |
|---|---|---|---|
| Quality (avg) | 10/10 | 9/10 | 9/10 |
| Word count (avg) | 2,100+ | 1,900+ | 1,300+ |
| Cost/article | €0.10 | €0.011 | €0.024 |
| Speed | 137s | 95s | 56s |
| Real-time data | ✅ Yes | ❌ No | ❌ No |
| Best for | Top 100 services | Mass generation | Editorial quality |
| Scale (1,000 articles) | €102 | €11 | €24 |
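The comparison suggests a simple routing rule per service. A hypothetical sketch (function name and rank thresholds are illustrative assumptions, not from the scripts):

```python
def pick_method(service_rank: int, needs_live_data: bool) -> str:
    """Route a service to a generation method based on the comparison table.

    service_rank: popularity rank of the service (1 = most popular).
    needs_live_data: True when the article requires real-time information,
    which only V1's web research can supply.
    """
    if needs_live_data or service_rank <= 100:
        return "v1"  # research: real-time data, highest quality, highest cost
    if service_rank <= 1000:
        return "v3"  # o4-mini: editorial quality for mid-tier services
    return "v2"      # gpt-5-mini: cheapest option for mass generation
```

The thresholds would need tuning against the actual service catalogue; the point is that the three methods are complementary rather than competing.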

Recommendation