IntentForge — Performance¶
Overview¶
This document captures the defensible baseline numbers behind IntentForge's
scalability claims (2000-agent budget on a single game thread), plus the
exact reproduction steps so anyone can re-run the suite on their own hardware
and confirm the numbers — or beat them. It is not a profiler tutorial; if
you want to learn how to profile your own game's IntentForge usage, start
with Unreal Insights and the IntentForge.* stat groups.
The numbers below come from the v0.33.0 benchmark suite, captured on the reference machine specified in the "Reference machine" section. They are the median-of-three-runs p99 (per-cycle) tail latency — see "Methodology notes" for why p99 is the lead statistic.
Headline numbers¶
Per-cycle wall time for one full replan pass across N agents, plus per-tick dispatcher cost at the same N:
| Agent count | Replan p99 (ms) | Per-agent p99 (us) | Dispatcher tick p99 (ms) |
|---|---|---|---|
| 100 | 0.016 | 0.16 | 0.0012 |
| 250 | 0.040 | 0.16 | 0.0030 |
| 500 | 0.105 | 0.21 | 0.0083 |
| 1000 | 0.201 | 0.20 | 0.019 |
| 2000 | 0.345 | 0.17 | 0.026 |
| 4000 | 0.678 | 0.17 | 0.054 |
The two numbers that matter for the 2000-agent scalability claim are
Replan p99 = 0.345 ms and Dispatcher tick p99 = 0.026 ms at N=2000.
Summed, that is about 0.37 ms per frame at 2000 agents on a single thread:
well under one millisecond, with room to spare for the rest of the game.
The framework remains roughly linear past the marketed 2000-agent budget: at N=4000 the replan p99 roughly doubles, from 0.345 ms to 0.678 ms (about 1.97x), which is what an O(N) loop would predict.
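The per-agent column and the linearity claim can be re-derived from the table itself. The following is a minimal sketch of that arithmetic, using only the p99 values listed above; the function and variable names are illustrative and not part of IntentForge.

```python
# Headline p99 values copied from the table above: N -> (replan ms, dispatcher tick ms).
HEADLINE = {
    100: (0.016, 0.0012),
    250: (0.040, 0.0030),
    500: (0.105, 0.0083),
    1000: (0.201, 0.019),
    2000: (0.345, 0.026),
    4000: (0.678, 0.054),
}


def per_agent_us(n: int) -> float:
    """Replan p99 divided across N agents, converted from ms to microseconds."""
    replan_ms, _ = HEADLINE[n]
    return replan_ms * 1000.0 / n


def combined_ms(n: int) -> float:
    """Simple sum of the replan p99 and the dispatcher tick p99 at the same N."""
    replan_ms, tick_ms = HEADLINE[n]
    return replan_ms + tick_ms


def scaling_ratio(n_small: int, n_large: int) -> float:
    """Growth factor of the replan p99 between two agent counts."""
    return HEADLINE[n_large][0] / HEADLINE[n_small][0]


if __name__ == "__main__":
    for n in HEADLINE:
        print(f"N={n}: {per_agent_us(n):.2f} us/agent, {combined_ms(n):.3f} ms combined")
    print(f"2000 -> 4000 replan ratio: {scaling_ratio(2000, 4000):.2f}x")
```

A ratio close to 2.0 when N doubles is the signature of linear scaling; a ratio drifting toward 4.0 would point to quadratic behavior instead.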
Reference machine¶
| Item | Value |
|---|---|
| OS | Windows 11 Pro 10.0.26100 |
| UE version | 5.7 |
| Build config | Development Editor |
| RHI | NullRHI (automation run, headless) |
| Run date (UTC) | 2026-05-21 |
The numbers in this document were captured on the plugin author's workstation. Community contributions from other machines are welcomed (see "Community contributions" below) — variance of 2x is normal between consumer CPUs.
Benchmark catalogue¶
Eleven automation tests live under the IntentForge.Performance.* namespace.
Three were inherited from earlier releases, six shipped with v0.33, and two
(PlannerOnlyReplan and WorldStateSetScalarCost) were added in v0.35.
| Test | What it measures | Asserts? |
|---|---|---|
| SingleStepPlan | Trivial archetype planner cost | mean < 0.1 ms (hard) |
| BranchingArchetype | Branching A* search cost | mean < 0.5 ms (hard) |
| MediumArchetype | 20-action linear chain (depth-limit bail-out) | mean < 2.0 ms (hard) |
| ActivityHistoryRingAppend | One RecordActivity call at depths 32/128/256 | depth-256 mean < 0.005 ms (hard) |
| AgentMemoryFootprint | KB per active component at N=1/100/1000 | none |
| ManyAgentReplan | Per-cycle wall time for N agents to each replan | N=100 p99 < 50 ms (hard) |
| PlannerOnlyReplan | Same N sweep, direct Plan() calls: no component, no coalesce, no timer | none |
| AntiFlapSelectionCost | Momentum + filter + latch overhead vs raw v0.30 | none |
| WorldStateSetScalarCost | Per-call SetScalar cost: baseline / filter-only / latch-only / both | none |
| DispatcherTickCost | One dispatcher Tick(0.016) with N components | none |
| SensorSampleCost | Per-class Sample() cost (Core sensors only) | none |
Implementation lives in Plugins/IntentForge/Source/IntentForgeCore/Private/Tests/:
- PerfBenchmarkUtils.h/.cpp: shared stats, CSV writer, transient-world helpers
- PerformanceBenchmarks.cpp: the three planner-only legacy tests
- PerfBenchmarks_Memory.cpp: ring append + memory footprint
- PerfBenchmarks_Sensors.cpp: per-class sensor cost
- PerfBenchmarks_Dispatcher.cpp: dispatcher tick cost
- PerfBenchmarks_MultiAgent.cpp: many-agent replan + anti-flap cost
- PerfBenchmarks_WorldState.cpp: SetScalar filter+latch micro-benchmark
- PerfBenchmarks_BenchExecutor.h: bench-only running executor used by the dispatcher and multi-agent tests
Reproduction steps¶
- Build the editor target:
"C:\Program Files\Epic Games\UE_5.7\Engine\Build\BatchFiles\Build.bat"
CrucibleEditor Win64 Development -Project="<path-to>\Crucible.uproject" -WaitMutex
- Run the full IntentForge automation suite (includes all benchmarks):
"C:\Program Files\Epic Games\UE_5.7\Engine\Binaries\Win64\UnrealEditor-Cmd.exe"
"<path-to>\Crucible.uproject"
-ExecCmds="Automation RunTests IntentForge; Quit"
-unattended -nopause -nullrhi
Or run just the perf subset:
- CSV output lands at:
One row per (benchmark, config, run). Columns: Benchmark, Config, RunIndex, Iterations, MeanMs, P50Ms, P95Ms, P99Ms, MaxMs, Notes.
Expected wall-time for a full sweep on a developer machine: 10-30 seconds
for the suite end-to-end (most of the cost is ManyAgentReplan at N=4000
and DispatcherTickCost at N=4000).
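Once the CSV exists, the headline statistic can be recomputed from it. The sketch below assumes the file has a header row with the column names listed above; it groups rows by (Benchmark, Config) and takes the median of the per-run P99Ms values. The sample rows and the "N=2000" Config format are made up for illustration and are not real benchmark output.

```python
import csv
import io
import statistics
from collections import defaultdict


def median_p99_by_config(csv_file) -> dict:
    """Group rows by (Benchmark, Config) and return the median of their P99Ms values."""
    p99s = defaultdict(list)
    for row in csv.DictReader(csv_file):
        p99s[(row["Benchmark"], row["Config"])].append(float(row["P99Ms"]))
    return {key: statistics.median(values) for key, values in p99s.items()}


# Illustrative rows only: the values and the Config string format are assumptions.
SAMPLE_CSV = """Benchmark,Config,RunIndex,Iterations,MeanMs,P50Ms,P95Ms,P99Ms,MaxMs,Notes
ManyAgentReplan,N=2000,0,100,0.30,0.29,0.33,0.36,0.50,
ManyAgentReplan,N=2000,1,100,0.30,0.29,0.32,0.34,0.41,
ManyAgentReplan,N=2000,2,100,0.31,0.30,0.33,0.35,0.45,
"""

if __name__ == "__main__":
    medians = median_p99_by_config(io.StringIO(SAMPLE_CSV))
    for (bench, config), p99 in medians.items():
        print(f"{bench} {config}: median p99 = {p99:.3f} ms")
```

To run it against a real capture, pass `open(path, newline="")` instead of the `io.StringIO` sample. Benchmarks that emit only RunIndex=0 rows simply return that single p99.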
Anti-flap stack overhead¶
The v0.31+ anti-flap stack (goal momentum bonus, EMA scalar
filter, Schmitt latch) adds runtime cost on every replan and on every
scalar fact write. The AntiFlapSelectionCost benchmark isolates the
momentum dimension by running the same archetype with
GoalMomentumBonus = 0 then again with the default 0.15.
At N=500 with 1000 iterations per config and 3 runs:
| Configuration | Mean per cycle (ms) | p99 per cycle (ms) |
|---|---|---|
| Momentum off | 0.0798 | 0.083 |
| Momentum on | 0.0780 | 0.084 |
| Delta | -0.0018 (-2.3%) | +0.001 |
The delta is within measurement noise. The momentum bonus is effectively free at the per-cycle level — it costs one multiplication per candidate goal during goal ranking, which is negligible against the rest of the selection path.
The EMA filter + Schmitt latch costs live inside FWorldState::SetScalar
(see concepts for the world-state model) and only fire when
the schema has a configured filter/latch. AntiFlapSelectionCost does not
isolate them — the benchmark archetype intentionally does not configure
filter/latch on its facts, keeping the goal-selection cost surface clean.
The companion micro-benchmark WorldStateSetScalarCost (added in v0.35,
INT-16) covers those per-call costs directly: four configurations
(baseline / filter-only / latch-only / both) at K=100 batched calls per
timing window. The Baseline → FilterOnly delta is the EMA cost
(one MUL + one MAD plus the runtime map probe); Baseline → LatchOnly is
the worst-case latch cost with every-call threshold flips. Numbers land
in the next baseline refresh.
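For readers unfamiliar with the two write-path stages, here is a generic sketch of an EMA filter feeding a Schmitt (hysteresis) latch. It is a textbook illustration in Python, not the FWorldState::SetScalar implementation; the class name, `alpha`, and the two thresholds are hypothetical.

```python
class EmaSchmittFact:
    """Generic EMA smoothing followed by a two-threshold (hysteresis) latch.

    Hypothetical illustration: this is not IntentForge's data layout or API.
    """

    def __init__(self, alpha: float, low: float, high: float, initial: float = 0.0):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        if low >= high:
            raise ValueError("low threshold must be below high threshold")
        self.alpha = alpha
        self.one_minus_alpha = 1.0 - alpha  # precomputed once, outside the hot path
        self.low = low
        self.high = high
        self.filtered = initial
        self.latched = False

    def set_scalar(self, raw: float) -> bool:
        """Feed one raw sample; return the latched boolean state."""
        # EMA update: one multiply plus one multiply-add per write.
        self.filtered = self.alpha * raw + self.one_minus_alpha * self.filtered
        # Schmitt latch: switch on only at or above `high`, off only at or below `low`.
        if not self.latched and self.filtered >= self.high:
            self.latched = True
        elif self.latched and self.filtered <= self.low:
            self.latched = False
        return self.latched


if __name__ == "__main__":
    fact = EmaSchmittFact(alpha=0.5, low=0.3, high=0.7)
    for raw in (1.0, 1.0, 0.5, 0.0, 0.0):
        print(raw, round(fact.filtered, 5), fact.set_scalar(raw))
```

The gap between the two thresholds is what suppresses flapping: a filtered value hovering between `low` and `high` leaves the latched state unchanged.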
Memory footprint¶
AgentMemoryFootprint snapshots process-level memory before and after
spawning N components on a transient world, force-GCs, snapshots again,
and divides the delta by N.
| N | Total delta (KB) | Approx per-agent (KB) |
|---|---|---|
| 1 | 8.0 | 8.0 |
| 100 | ~0 | (noise) |
| 1000 | 220 | 0.22 |
The N=100 row falling to zero is measurement noise: process-level
FPlatformMemory::GetStats().UsedPhysical varies by tens of KB between
quiescent snapshots, so small deltas disappear. The N=1000 row gives the
most stable per-agent estimate: roughly 220 bytes per agent including
the component itself, the live FWorldState, and the two history ring
buffers at their default depths.
This is approximate, intentionally so. The exact number depends on the schema (more facts = more bitset bits), history depth settings, and the number of sensor instances per agent. For sizing purposes, plan on "under 1 KB per agent for the framework data" plus whatever your sensors and executors carry.
Sensor cost reference¶
Per-Core-sensor cost for one Sample() invocation. Each timing window
batches K=100 Sample() calls, because a single call falls below the ~100 ns
QPC resolution floor; the table reports per-call cost in microseconds across 5000 windows:
| Sensor class | Mean (us) | p99 (us) |
|---|---|---|
| GameplayTagOnActor | 0.034 | 0.065 |
| DistanceToWorldLocation | 0.055 | 0.095 |
| DistanceToReferencedActor | 0.073 | 0.125 |
| LineOfSightToActor | 0.208 | 0.357 |
LineOfSightToActor is the most expensive Core sensor because it does
an actual line trace against the world. The empty-scene case measured
here is the best case; expect 2-5x in a dense scene with collision
geometry.
Perception sensors (AIPerception, BlackboardBool, BlackboardFloat)
are out of scope for this benchmark by design — their per-Sample cost is
dominated by the AI subsystem bridges they hook into (AIPerception event
subscription, Blackboard reads/writes), not the sensor pattern itself.
Publishing those numbers as "IntentForge sensor cost" would misattribute
AI-subsystem cost to the framework. EnvQuery is excluded for the same
reason; EQS evaluation cost is owned by the EQS system and varies wildly
by query. The framework's contribution to any of these is identical to
the Core sensor cost above — one Sample() virtual call plus a fact
write — so the headline cost reference is the table above.
Methodology notes¶
Warmup: 10 iterations per config before timing starts. First-call costs (executor JIT, cache misses, first allocation) are excluded.
Repeats: 1000 iterations per config (100 at N >= 2000, where wall time
stretches). Most benchmarks (ManyAgentReplan, DispatcherTickCost,
ActivityHistoryRingAppend, AntiFlapSelectionCost) do 3 runs per
config and report the median of the three p99s to remove worst-case
background noise without applying outlier rejection within a run.
SensorSampleCost and AgentMemoryFootprint are intentional exceptions
— each runs once with a very high sample count (5000 batched windows of
100 calls for sensors; one snapshot for memory) because their per-call
cost is at or below the timer noise floor and only the high-iteration
window produces a stable estimate. Cross-check the CSV: those two
benchmarks emit RunIndex=0 rows only.
Percentiles: nearest-rank, no linear interpolation. At 1000+ samples the difference between interpolation methods is below the timer noise floor.
Lead statistic: p99, not mean. Tail latency is what users feel — a mean that hides a 10x spike every 100 frames is a hung-frame problem, not a "5% mean" problem.
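As a concrete reading of the nearest-rank rule, the sketch below picks the sample at 1-based rank ceil(pct/100 * n) from the sorted data, with no interpolation. The function name and the synthetic frame-time series are illustrative assumptions, not code or data from the benchmark suite.

```python
import math
import statistics


def nearest_rank_percentile(samples, pct: float) -> float:
    """Return the sample at 1-based rank ceil(pct/100 * n) of the sorted data."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0.0 < pct <= 100.0:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to keep the rank exact for whole-number percentiles.
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Synthetic series: 1.0 ms per cycle, with a 10.0 ms spike on every 50th cycle.
    frames = [10.0 if i % 50 == 0 else 1.0 for i in range(1000)]
    print("mean:", statistics.mean(frames))
    print("p50: ", nearest_rank_percentile(frames, 50))
    print("p99: ", nearest_rank_percentile(frames, 99))
```

On this synthetic series the mean moves only modestly while the p99 reports the full spike, which is the behavior the lead statistic is chosen for.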
GC policy: CollectGarbage(RF_NoFlags, /*bPerformFullPurge=*/true)
between runs, followed by a few TimerManager ticks to let any deferred
cleanup land. Avoids cross-run memory pressure influencing the next
benchmark.
Timer: FPlatformTime::Seconds(). At sub-microsecond per call the
resolution floor dominates (Windows QPC is ~100 ns). Two benchmarks in the
v0.33 baseline batch K=100 calls per timing window and divide for per-call
ms so the window stays above the floor: ActivityHistoryRingAppend and
SensorSampleCost (the v0.35 WorldStateSetScalarCost micro-benchmark uses the
same K=100 batching). Without batching, the mean and p50 of these
sub-100ns operations are dominated by quantization, not real cost.
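The batching idea can be sketched independently of Unreal. The example below times `batch` back-to-back calls per window and divides by the batch size; `time.perf_counter` stands in for FPlatformTime::Seconds(), and the function name is hypothetical. Python's loop overhead is included in each window, so this shows the structure of the measurement only, not comparable numbers.

```python
import time


def time_batched(fn, windows: int, batch: int = 100) -> list:
    """Time `batch` back-to-back calls per window; return per-call ms for each window."""
    if windows <= 0 or batch <= 0:
        raise ValueError("windows and batch must be positive")
    per_call_ms = []
    for _ in range(windows):
        start = time.perf_counter()
        for _ in range(batch):
            fn()
        elapsed_s = time.perf_counter() - start
        # One timer read pair covers the whole batch, so timer quantization is
        # spread over `batch` calls instead of being charged to a single call.
        per_call_ms.append(elapsed_s * 1000.0 / batch)
    return per_call_ms


if __name__ == "__main__":
    samples = time_batched(lambda: None, windows=1000, batch=100)
    print(f"windows: {len(samples)}, min per-call ms: {min(samples):.6f}")
```

The trade-off is that per-call outliers inside a window are averaged away, so the resulting percentiles describe windows of K calls rather than individual calls.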
Known caveats¶
- Single-thread, game-thread only. IntentForge does not currently parallelize across worker threads. The numbers here are the full single-thread cost; there is no "real cost" that includes async work.
- No GPU. Everything in this document is CPU work. Sensors that trigger collision queries hit the physics scene, which the engine may schedule async, but the per-Sample cost reported here is the game-thread blocking portion only.
- Transient-world setup excludes movement/animation. The benchmark worlds have actors but no movement components, no skeletal meshes, no nav agents. Real games carry significant per-actor cost outside the framework; the benchmarks here measure the framework's contribution only.
- UIntentSensor_EnvQuery is not benchmarked. EQS resolution pulls in NavSystem and is not representative of sensor cost.
- ManyAgentReplan measures per-cycle wall time, not per-plan compute. Each cycle issues N SetFactScalar + N RequestReplan (renamed from ForceReplan in v0.34) + 3 TimerManager ticks. Depending on coalescing and plan reuse, the per-cycle work can be less than "plan from scratch per agent", which is the realistic production case. The numbers reflect the framework's actual behavior under repeated replan requests.
- AgentMemoryFootprint uses FPlatformMemory::GetStats, which is process-global and noisy at the tens-of-KB level. Treat the per-agent numbers as order-of-magnitude.
- CI machine variance is real. Consumer CPUs vary by 2-3x at the single-threaded throughput level. Server-grade Xeons can be slower per-core than recent consumer chips. Your numbers will differ; the point of this document is the shape (linear scaling, bounded tail latency), not the absolute milliseconds.
Community contributions¶
Re-run the benchmarks on your own hardware and add a row here via PR. If you also drop the CSV file alongside, the project keeps a per-machine record of how the framework performs in the wild.
| Contributor | CPU | RAM | OS | UE version | p99 replan @ N=2000 | CSV link |
|---|---|---|---|---|---|---|
| your row here | e.g. Ryzen 7 7800X3D | 32GB DDR5-6000 | Win 11 | 5.7 | e.g. 0.41 ms | link to CSV in PR |
Planner-only vs full-cycle cost¶
PlannerOnlyReplan (added in v0.35) calls
UIntentForgePlannerSubsystem::Plan(WorldState, Archetype) directly on N
independent FWorldState instances, with no component, no coalesce, no
TimerManager, no dispatcher acknowledgement. Compare its per-cycle number
against ManyAgentReplan at the same N to back out the "tax" of running
through the component machinery vs driving the planner from a custom
orchestrator.
This benchmark exists for two reasons. First, it lets us regress-test the
planner itself in isolation — if A* gets slower, we see it cleanly here
without the noise of timer + component overhead. Second, it gives a
defensible answer to "what does the planner actually cost?" — a question
the existing ManyAgentReplan cannot answer because its per-cycle wall
time bakes in three TimerManager ticks per cycle and a stack of
component-side bookkeeping.
The numbers for this benchmark land in the next baseline refresh.
Version history¶
- v0.35.0 (unreleased): Added PlannerOnlyReplan (INT-15) and WorldStateSetScalarCost (INT-16) benchmarks. The first isolates planner cost from the component/coalesce/timer machinery; the second isolates per-call FWorldState::SetScalar cost across four filter+latch configurations. Together they let users back out per-cycle replan overhead and per-fact-write derivation overhead independently.
- v0.33.0 (2026-05-21): Baseline established with the nine-test perf suite. p99 replan at N=2000 = 0.345 ms; per-agent p99 ~ 0.17 us; dispatcher tick at N=2000 = 0.026 ms.