IntentForge — Performance¶

Overview¶

This document captures the defensible baseline numbers behind IntentForge's scalability claims (2000-agent budget on a single game thread), plus the exact reproduction steps so anyone can re-run the suite on their own hardware and confirm the numbers — or beat them. It is not a profiler tutorial; if you want to learn how to profile your own game's IntentForge usage, start with Unreal Insights and the IntentForge.* stat groups.

The numbers below come from the v0.33.0 benchmark suite, captured on the reference machine specified in the "Reference machine" section. They are the median-of-three-runs p99 (per-cycle) tail latency — see "Methodology notes" for why p99 is the lead statistic.

Headline numbers¶

Per-cycle wall time for one full replan pass across N agents, plus per-tick dispatcher cost at the same N:

Agent count	Replan p99 (ms)	Per-agent p99 (us)	Dispatcher tick p99 (ms)
100	0.016	0.16	0.0012
250	0.040	0.16	0.0030
500	0.105	0.21	0.0083
1000	0.201	0.20	0.019
2000	0.345	0.17	0.026
4000	0.678	0.17	0.054

The two numbers that matter for the 2000-agent scalability claim are Replan p99 = 0.345 ms and Dispatcher tick p99 = 0.026 ms at N=2000. Combined, that is well under one millisecond per frame at 2000 agents on a single thread, with room to spare for the rest of the game.

The framework remains roughly linear past the marketed 2000-agent budget: at N=4000 the replan p99 grows 1.97x to 0.678 ms, close to the 2x an O(N) loop would predict.
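The linearity claim can be checked directly from the headline table. The following is a minimal Python sketch with the table values copied in by hand. The 60 fps frame budget and the helper names are assumptions made for illustration, not part of the benchmark suite.

```python
# Headline p99 numbers copied from the table above: N -> (replan ms, dispatcher ms).
HEADLINE = {
    100: (0.016, 0.0012),
    250: (0.040, 0.0030),
    500: (0.105, 0.0083),
    1000: (0.201, 0.019),
    2000: (0.345, 0.026),
    4000: (0.678, 0.054),
}

def per_agent_us(n: int) -> float:
    """Replan p99 divided by agent count, converted from ms to microseconds."""
    replan_ms, _ = HEADLINE[n]
    return replan_ms * 1000.0 / n

def scaling_ratio(n_small: int, n_large: int) -> float:
    """Ratio of replan p99 between two agent counts (O(N) predicts n_large / n_small)."""
    return HEADLINE[n_large][0] / HEADLINE[n_small][0]

def frame_share(n: int, frame_ms: float = 1000.0 / 60.0) -> float:
    """Fraction of one frame spent on replan + dispatcher (60 fps is an assumption)."""
    replan_ms, dispatcher_ms = HEADLINE[n]
    return (replan_ms + dispatcher_ms) / frame_ms

if __name__ == "__main__":
    for n in HEADLINE:
        print(f"N={n:5d}  per-agent p99 = {per_agent_us(n):.2f} us")
    print(f"2000 -> 4000 ratio: {scaling_ratio(2000, 4000):.2f}x (O(N) predicts 2.00x)")
    print(f"frame share at N=2000: {frame_share(2000):.1%}")
```

Under the assumed 60 fps budget, the combined replan and dispatcher cost at N=2000 comes to a little over 2% of a frame.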

Reference machine¶

Item	Value
OS	Windows 11 Pro 10.0.26100
UE version	5.7
Build config	Development Editor
RHI	NullRHI (automation run, headless)
Run date (UTC)	2026-05-21

The numbers in this document were captured on the plugin author's workstation. Community contributions from other machines are welcome (see "Community contributions" below). A 2-3x spread in single-threaded throughput is normal between consumer CPUs (see "Known caveats").

Benchmark catalogue¶

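Eleven automation tests live under the IntentForge.Performance.* namespace. Three were inherited from earlier releases, six shipped with v0.33 (together the nine-test baseline suite), and two (PlannerOnlyReplan and WorldStateSetScalarCost) were added in v0.35.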

Test	What it measures	Asserts?
`SingleStepPlan`	Trivial archetype planner cost	mean < 0.1 ms (hard)
`BranchingArchetype`	Branching A* search cost	mean < 0.5 ms (hard)
`MediumArchetype`	20-action linear chain (depth-limit bail-out)	mean < 2.0 ms (hard)
`ActivityHistoryRingAppend`	One `RecordActivity` call at depths 32/128/256	depth-256 mean < 0.005 ms (hard)
`AgentMemoryFootprint`	KB per active component at N=1/100/1000	none
`ManyAgentReplan`	Per-cycle wall time for N agents to each replan	N=100 p99 < 50 ms (hard)
`PlannerOnlyReplan`	Same N sweep, direct `Plan()` calls — no component, no coalesce, no timer	none
`AntiFlapSelectionCost`	Momentum + filter + latch overhead vs raw v0.30	none
`WorldStateSetScalarCost`	Per-call `SetScalar` cost: baseline / filter-only / latch-only / both	none
`DispatcherTickCost`	One dispatcher Tick(0.016) with N components	none
`SensorSampleCost`	Per-class `Sample()` cost (Core sensors only)	none

Implementation lives in Plugins/IntentForge/Source/IntentForgeCore/Private/Tests/:

PerfBenchmarkUtils.h/.cpp — shared stats, CSV writer, transient-world helpers
PerformanceBenchmarks.cpp — the three planner-only legacy tests
PerfBenchmarks_Memory.cpp — ring append + memory footprint
PerfBenchmarks_Sensors.cpp — per-class sensor cost
PerfBenchmarks_Dispatcher.cpp — dispatcher tick cost
PerfBenchmarks_MultiAgent.cpp — many-agent replan + anti-flap cost
PerfBenchmarks_WorldState.cpp — SetScalar filter+latch micro-benchmark
PerfBenchmarks_BenchExecutor.h — bench-only running executor used by the dispatcher and multi-agent tests

Reproduction steps¶

Build the editor target (Windows Command Prompt; `^` continues the line):

"C:\Program Files\Epic Games\UE_5.7\Engine\Build\BatchFiles\Build.bat" ^
  CrucibleEditor Win64 Development -Project="<path-to>\Crucible.uproject" -WaitMutex

Run the full IntentForge automation suite (includes all benchmarks):

"C:\Program Files\Epic Games\UE_5.7\Engine\Binaries\Win64\UnrealEditor-Cmd.exe" ^
  "<path-to>\Crucible.uproject" ^
  -ExecCmds="Automation RunTests IntentForge; Quit" ^
  -unattended -nopause -nullrhi

Or run just the perf subset:

-ExecCmds="Automation RunTests IntentForge.Performance; Quit"

CSV output lands at:

<ProjectDir>/Saved/Logs/IntentForgePerf/perf_<UTC-timestamp>.csv

One row per (benchmark, config, run). Columns: Benchmark, Config, RunIndex, Iterations, MeanMs, P50Ms, P95Ms, P99Ms, MaxMs, Notes.
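To turn the per-run rows into the median-of-three p99 used throughout this document, a short script is enough. This is a minimal sketch that assumes the CSV is comma-delimited with a header row matching the column names above. The sample rows and the `N=2000` Config string are hypothetical, included only so the example runs on its own.

```python
import csv
import io
import statistics
from collections import defaultdict

def median_p99_by_config(csv_text: str) -> dict:
    """Group rows by (Benchmark, Config) and return the median of the per-run P99Ms values."""
    runs = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        runs[(row["Benchmark"], row["Config"])].append(float(row["P99Ms"]))
    return {key: statistics.median(values) for key, values in runs.items()}

# Hypothetical sample rows for illustration only; real files come from the suite.
SAMPLE = """Benchmark,Config,RunIndex,Iterations,MeanMs,P50Ms,P95Ms,P99Ms,MaxMs,Notes
ManyAgentReplan,N=2000,0,100,0.30,0.29,0.33,0.35,0.40,
ManyAgentReplan,N=2000,1,100,0.31,0.30,0.34,0.34,0.52,
ManyAgentReplan,N=2000,2,100,0.30,0.29,0.33,0.36,0.41,
"""

if __name__ == "__main__":
    for (bench, config), p99 in median_p99_by_config(SAMPLE).items():
        print(f"{bench} [{config}]: median p99 = {p99:.3f} ms")
```

For a real run, read the file contents from the output path above and pass the text to `median_p99_by_config`. Benchmarks that emit a single run simply return that run's value.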

Expected wall-time for a full sweep on a developer machine: 10-30 seconds for the suite end-to-end (most of the cost is ManyAgentReplan at N=4000 and DispatcherTickCost at N=4000).

Anti-flap stack overhead¶

The v0.31+ anti-flap stack (goal momentum bonus, EMA scalar filter, Schmitt latch) adds runtime cost on every replan and on every scalar fact write. The AntiFlapSelectionCost benchmark isolates the momentum dimension by running the same archetype with GoalMomentumBonus = 0 then again with the default 0.15.

At N=500 with 1000 iterations per config and 3 runs:

Configuration	Mean per cycle (ms)	p99 per cycle (ms)
Momentum off	0.0798	0.083
Momentum on	0.0780	0.084
Delta	-0.0018 (-2.3%)	+0.001

The delta is within measurement noise. The momentum bonus is effectively free at the per-cycle level — it costs one multiplication per candidate goal during goal ranking, which is negligible against the rest of the selection path.
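The shape of that cost can be sketched in a few lines. How the bonus enters the score is not spelled out in this document, so the Python sketch below assumes a multiplicative boost applied to the currently active goal. The function name and the goal names are hypothetical, and this is not the plugin's implementation.

```python
def rank_goals(scores: dict, current_goal, momentum_bonus: float = 0.15):
    """Return the best goal after boosting the currently active goal's score.

    Assumption: the bonus is applied as score * (1 + bonus) to the active goal
    and score * 1.0 to every other goal, i.e. one multiply per candidate.
    """
    best_goal, best_score = None, float("-inf")
    for goal, score in scores.items():
        factor = 1.0 + momentum_bonus if goal == current_goal else 1.0
        adjusted = score * factor  # the single multiplication per candidate
        if adjusted > best_score:
            best_goal, best_score = goal, adjusted
    return best_goal

if __name__ == "__main__":
    scores = {"Patrol": 1.0, "Flee": 1.1}
    print(rank_goals(scores, current_goal="Patrol"))                      # momentum on
    print(rank_goals(scores, current_goal="Patrol", momentum_bonus=0.0))  # momentum off
```

With the bonus on, the active goal holds against a slightly better challenger. With it off, the agent switches. Either way the loop does the same amount of work, which is consistent with a delta that is indistinguishable from noise.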

The EMA filter + Schmitt latch costs live inside FWorldState::SetScalar (see concepts for the world-state model) and only fire when the schema has a configured filter/latch. AntiFlapSelectionCost does not isolate them — the benchmark archetype intentionally does not configure filter/latch on its facts, keeping the goal-selection cost surface clean. The companion micro-benchmark WorldStateSetScalarCost (added in v0.35, INT-16) covers those per-call costs directly: four configurations (baseline / filter-only / latch-only / both) at K=100 batched calls per timing window. The Baseline → FilterOnly delta is the EMA cost (one MUL + one MAD plus the runtime map probe); Baseline → LatchOnly is the worst-case latch cost with every-call threshold flips. Numbers land in the next baseline refresh.
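For readers unfamiliar with the two techniques, here is a generic Python sketch of an EMA filter and a Schmitt (hysteresis) latch. It illustrates the arithmetic being measured: one multiply plus one multiply-add for the filter, and two threshold comparisons for the latch. The class names, parameters, and thresholds are assumptions, and this is not the code inside FWorldState::SetScalar.

```python
class EmaFilter:
    """Exponential moving average: y = keep * y + alpha * x, with keep = 1 - alpha."""

    def __init__(self, alpha: float, initial: float = 0.0):
        self.alpha = alpha
        self.keep = 1.0 - alpha  # precomputed so update() is one MUL + one MAD
        self.value = initial

    def update(self, sample: float) -> float:
        decayed = self.keep * self.value            # one multiply
        self.value = self.alpha * sample + decayed  # one multiply-add
        return self.value


class SchmittLatch:
    """Hysteresis latch: turns on at or above `high`, off at or below `low`."""

    def __init__(self, low: float, high: float, state: bool = False):
        if low > high:
            raise ValueError("low threshold must not exceed high threshold")
        self.low, self.high, self.state = low, high, state

    def update(self, value: float) -> bool:
        if self.state and value <= self.low:
            self.state = False
        elif not self.state and value >= self.high:
            self.state = True
        return self.state


if __name__ == "__main__":
    ema = EmaFilter(alpha=0.5)
    latch = SchmittLatch(low=0.4, high=0.6)
    for raw in [1.0, 1.0, 0.0, 0.0, 0.0]:
        smoothed = ema.update(raw)
        print(f"raw={raw:.1f} smoothed={smoothed:.3f} latched={latch.update(smoothed)}")
```

The gap between `low` and `high` is what prevents flapping: a value hovering between the two thresholds leaves the latch in whatever state it already had.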

Memory footprint¶

AgentMemoryFootprint snapshots process-level memory before and after spawning N components on a transient world, force-GCs, snapshots again, and divides the delta by N.

N	Total delta (KB)	Approx per-agent (KB)
1	8.0	8.0
100	~0	(noise)
1000	220	0.22

The N=100 row falling to zero is measurement noise: process-level FPlatformMemory::GetStats().UsedPhysical varies by tens of KB between quiescent snapshots, so small deltas disappear. The N=1 reading of 8.0 KB sits inside the same noise band and should not be read as a per-agent cost. The N=1000 row gives the most stable per-agent estimate: roughly 220 bytes per agent including the component itself, the live FWorldState, and the two history ring buffers at their default depths.
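The per-agent arithmetic, including the decision to discard deltas that sit inside the noise band, can be sketched as follows. The 50 KB noise floor is an assumed stand-in for "tens of KB", and the function name is hypothetical.

```python
def per_agent_bytes(delta_kb: float, agent_count: int, noise_floor_kb: float = 50.0):
    """Return per-agent bytes, or None when the total delta is inside the noise floor."""
    if agent_count <= 0:
        raise ValueError("agent_count must be positive")
    if abs(delta_kb) < noise_floor_kb:
        return None  # delta is indistinguishable from snapshot-to-snapshot noise
    return delta_kb * 1024.0 / agent_count

if __name__ == "__main__":
    # (N, total delta in KB) pairs copied from the table above; ~0 entered as 0.0.
    for n, delta_kb in [(1, 8.0), (100, 0.0), (1000, 220.0)]:
        estimate = per_agent_bytes(delta_kb, n)
        label = "noise" if estimate is None else f"{estimate:.0f} bytes/agent"
        print(f"N={n:4d}: {label}")
```

With 1 KB taken as 1024 bytes, the N=1000 row works out to about 225 bytes per agent, in line with the rounded figure quoted above.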

This is approximate, intentionally so. The exact number depends on the schema (more facts = more bitset bits), history depth settings, and the number of sensor instances per agent. For sizing purposes, plan on "under 1 KB per agent for the framework data" plus whatever your sensors and executors carry.

Sensor cost reference¶

Per-call cost of one Sample() invocation for each Core sensor. Each timing window batches K=100 Sample() calls, because a single call is at or near the ~100 ns QPC resolution floor; the table reports per-call microseconds across 5000 windows:

Sensor class	Mean (us)	p99 (us)
`GameplayTagOnActor`	0.034	0.065
`DistanceToWorldLocation`	0.055	0.095
`DistanceToReferencedActor`	0.073	0.125
`LineOfSightToActor`	0.208	0.357

LineOfSightToActor is the most expensive Core sensor because it does an actual line trace against the world. The empty-scene case measured here is the best case; expect 2-5x in a dense scene with collision geometry.

Perception sensors (AIPerception, BlackboardBool, BlackboardFloat) are out of scope for this benchmark by design — their per-Sample cost is dominated by the AI subsystem bridges they hook into (AIPerception event subscription, Blackboard reads/writes), not the sensor pattern itself. Publishing those numbers as "IntentForge sensor cost" would misattribute AI-subsystem cost to the framework. EnvQuery is excluded for the same reason; EQS evaluation cost is owned by the EQS system and varies wildly by query. The framework's contribution to any of these is identical to the Core sensor cost above — one Sample() virtual call plus a fact write — so the headline cost reference is the table above.

Methodology notes¶

Warmup: 10 iterations per config before timing starts. First-call costs (executor JIT, cache misses, first allocation) are excluded.

Repeats: 1000 iterations per config (100 at N >= 2000, where wall time stretches). Most benchmarks (ManyAgentReplan, DispatcherTickCost, ActivityHistoryRingAppend, AntiFlapSelectionCost) do 3 runs per config and report the median of the three p99s to remove worst-case background noise without applying outlier rejection within a run.

SensorSampleCost and AgentMemoryFootprint are intentional exceptions — each runs once with a very high sample count (5000 batched windows of 100 calls for sensors; one snapshot for memory) because their per-call cost is at or below the timer noise floor and only the high-iteration window produces a stable estimate. Cross-check the CSV: those two benchmarks emit RunIndex=0 rows only.

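Percentiles: nearest-rank, no linear interpolation. At 1000+ samples the difference between interpolation methods is below the timer noise floor. For the 100-iteration configs (N >= 2000), the nearest-rank p99 is the 99th of the 100 sorted samples, so the single slowest iteration in each run is excluded from the lead statistic and shows up only in MaxMs.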
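For reference, the nearest-rank method fits in a few lines of Python. This is a sketch of the textbook definition (1-based rank = ceil(p/100 * n)). Whether the suite's shared stats helper uses exactly this rank convention is an assumption.

```python
import math

def nearest_rank_percentile(samples, pct: float) -> float:
    """Nearest-rank percentile: the value at 1-based rank ceil(pct * n / 100) in sorted order."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[rank - 1]

if __name__ == "__main__":
    samples = list(range(1, 101))  # 100 samples: 1, 2, ..., 100
    print(nearest_rank_percentile(samples, 99))
    print(nearest_rank_percentile(samples, 50))
```

The result is always one of the measured samples, never a value interpolated between two of them.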

Lead statistic: p99, not mean. Tail latency is what users feel — a mean that hides a 10x spike every 100 frames is a hung-frame problem, not a "5% mean" problem.

GC policy: CollectGarbage(RF_NoFlags, /*bPerformFullPurge=*/true) between runs, followed by a few TimerManager ticks to let any deferred cleanup land. Avoids cross-run memory pressure influencing the next benchmark.

Timer: FPlatformTime::Seconds(). At sub-microsecond per call the resolution floor dominates (Windows QPC is ~100 ns). Two benchmarks batch K=100 calls per timing window and divide for per-call ms so the window stays above the floor: ActivityHistoryRingAppend and SensorSampleCost. Without batching, the mean and p50 of these sub-100ns operations are dominated by quantization, not real cost.
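The batching idea is independent of the engine. A minimal Python sketch, using `time.perf_counter_ns` as a stand-in for FPlatformTime::Seconds(), looks like this. The function name and defaults are hypothetical, and in Python the measured figure also includes interpreter loop overhead.

```python
import time

def time_batched(fn, batch_size: int = 100, windows: int = 1000):
    """Time `fn` in windows of `batch_size` calls; return per-call nanoseconds per window."""
    per_call_ns = []
    for _ in range(windows):
        start = time.perf_counter_ns()
        for _ in range(batch_size):
            fn()
        elapsed = time.perf_counter_ns() - start
        # One timer read pair per window keeps each measurement above the timer floor.
        per_call_ns.append(elapsed / batch_size)
    return per_call_ns

if __name__ == "__main__":
    results = sorted(time_batched(lambda: None))
    print(f"median per-call cost: {results[len(results) // 2]:.1f} ns")
```

The trade-off is that batching averages away any spike from a single call inside a window, so the percentiles describe windows of K calls rather than individual calls.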

Known caveats¶

Single-thread, game-thread only. IntentForge does not currently parallelize across worker threads, so the numbers here are the full cost: no additional work is hidden on async threads.
No GPU. Everything in this document is CPU work. Sensors that trigger collision queries hit the physics scene, which the engine may schedule async, but the per-Sample cost reported here is the game-thread blocking portion only.
Transient-world setup excludes movement/animation. The benchmark worlds have actors but no movement components, no skeletal meshes, no nav agents. Real games carry significant per-actor cost outside the framework; the benchmarks here measure the framework's contribution only.
UIntentSensor_EnvQuery is not benchmarked. EQS resolution pulls in NavSystem and is not representative of sensor cost.
ManyAgentReplan measures per-cycle wall time, not per-plan compute. Each cycle issues N SetFactScalar + N RequestReplan (renamed from ForceReplan in v0.34) + 3 TimerManager ticks. Depending on coalescing and plan reuse, the per-cycle work can be less than "plan from scratch per agent" — which is the realistic production case. The numbers reflect the framework's actual behavior under repeated replan requests.
AgentMemoryFootprint uses FPlatformMemory::GetStats, which is process-global and noisy at the tens-of-KB level. Treat the per-agent numbers as order-of-magnitude.
CI machine variance is real. Consumer CPUs vary by 2-3x at the single-threaded throughput level. Server-grade Xeons can be slower per-core than recent consumer chips. Your numbers will differ — the point of this document is the shape (linear scaling, bounded tail latency), not the absolute milliseconds.

Community contributions¶

Re-run the benchmarks on your own hardware and add a row here via PR. If you also drop the CSV file alongside, the project keeps a per-machine record of how the framework performs in the wild.

Contributor	CPU	RAM	OS	UE version	p99 replan @ N=2000	CSV link
your row here	e.g. Ryzen 7 7800X3D	32GB DDR5-6000	Win 11	5.7	e.g. 0.41 ms	link to CSV in PR

Planner-only vs full-cycle cost¶

PlannerOnlyReplan (added in v0.35) calls UIntentForgePlannerSubsystem::Plan(WorldState, Archetype) directly on N independent FWorldState instances, with no component, no coalesce, no TimerManager, no dispatcher acknowledgement. Compare its per-cycle number against ManyAgentReplan at the same N to back out the "tax" of running through the component machinery vs driving the planner from a custom orchestrator.

This benchmark exists for two reasons. First, it lets us regression-test the planner in isolation: if A* gets slower, it shows up cleanly here without the noise of timer and component overhead. Second, it gives a defensible answer to "what does the planner actually cost?", a question ManyAgentReplan cannot answer because its per-cycle wall time bakes in three TimerManager ticks per cycle and a stack of component-side bookkeeping.

The numbers for this benchmark land in the next baseline refresh.
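Until those numbers land, the subtraction itself can be sketched. In the Python example below, the full-cycle figure is the N=2000 replan p99 from the headline table, while the planner-only figure is a placeholder chosen purely for illustration and is not a measurement.

```python
def component_tax(full_cycle_ms: float, planner_only_ms: float):
    """Return (absolute overhead in ms, overhead as a fraction of the full cycle)."""
    if full_cycle_ms <= 0:
        raise ValueError("full_cycle_ms must be positive")
    overhead_ms = full_cycle_ms - planner_only_ms
    return overhead_ms, overhead_ms / full_cycle_ms

if __name__ == "__main__":
    full_cycle_ms = 0.345    # ManyAgentReplan p99 at N=2000 (headline table)
    planner_only_ms = 0.200  # PLACEHOLDER, not a measured PlannerOnlyReplan result
    overhead_ms, fraction = component_tax(full_cycle_ms, planner_only_ms)
    print(f"component overhead: {overhead_ms:.3f} ms ({fraction:.0%} of the full cycle)")
```

A negative result is possible: as noted under "Known caveats", coalescing and plan reuse can make the full cycle cheaper than planning from scratch for every agent.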

Version history¶

v0.35.0 (unreleased) — Added PlannerOnlyReplan (INT-15) and WorldStateSetScalarCost (INT-16) benchmarks. The first isolates planner cost from the component/coalesce/timer machinery; the second isolates per-call FWorldState::SetScalar cost across four filter+latch configurations. Together they let users back out per-cycle replan overhead and per-fact-write derivation overhead independently.
v0.33.0 (2026-05-21) — Baseline established with the nine-test perf suite. p99 replan at N=2000 = 0.345 ms; per-agent p99 ~ 0.17 us; dispatcher tick at N=2000 = 0.026 ms.