Shader and Shader Graph Compiler
This page explains how both the code-based (.shader) and visual
(.shadergraph) pipelines compile to the same intermediate representation,
how the register-based VM executes that IR, and how to write shaders that
stay within the hardware budget.
Compilation Pipeline
Both authoring workflows produce the same Shader Graph IR — a directed acyclic graph of typed nodes and links — that feeds the same backend compiler.
From a .shader file (Daslang source)
.shader (Daslang source)
│
▼ [pixel_shader] macro
│ walks the Daslang AST, maps each expression to an IR node
│
▼
Shader Graph IR (nodes + links + material properties)
│
▼ C++ backend compiler
│ classifies nodes, assigns registers, emits bytecode
│
▼
VM bytecode (executed per-fragment on the GPU)
The [pixel_shader] annotation triggers a Daslang compile-time macro that
traverses the function body’s AST. Each expression — arithmetic operator,
function call, field read, constant literal — becomes one IR node with typed
input and output pins. Data flow between expressions becomes links between
pins. Module-level var declarations become the material property list.
Unsupported constructs are rejected with a compile-time error.
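The AST walk described above can be modeled in a few lines. This is an illustrative sketch only: it uses Python's own `ast` module as a stand-in for the Daslang AST, and the function name `lower` and the tuple layout of each node are assumptions, not the engine's actual IR.

```python
import ast


def lower(expr_src):
    """Walk an expression AST and emit one node tuple per expression.

    Integer fields in a tuple are links to earlier nodes in the list.
    Hypothetical layout: ("Const", value), ("Read", name), (op, lhs, rhs).
    """
    nodes = []

    def visit(e):
        if isinstance(e, ast.Constant):
            nodes.append(("Const", e.value))
        elif isinstance(e, ast.Name):
            nodes.append(("Read", e.id))
        elif isinstance(e, ast.BinOp):
            lhs = visit(e.left)
            rhs = visit(e.right)
            nodes.append((type(e.op).__name__, lhs, rhs))
        else:
            # Mirrors "unsupported constructs are rejected at compile time".
            raise SyntaxError("unsupported construct: " + type(e).__name__)
        return len(nodes) - 1

    visit(ast.parse(expr_src, mode="eval").body)
    return nodes


graph = lower("uv * 2.0 + tint")
```

Because every node only links to nodes emitted before it, the resulting list is already in a valid evaluation order.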
From a .shadergraph file (visual editor)
.shadergraph (DataBlock node graph)
│
▼ C++ graph loader
│ reads node types, connections, and property values from the file
│
▼
Shader Graph IR (same structure as above)
│
▼ C++ backend compiler
│
▼
VM bytecode
The visual editor writes node positions and connections to a DataBlock file. The C++ loader reads it and builds exactly the same IR structure that the Daslang macro produces.
Hot-reload repeats the full pipeline whenever the file is saved. While a shader is being recompiled, the previously compiled version stays active. If compilation fails, the previous version remains active until the next successful compile.
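The keep-the-last-good-version behavior can be sketched as follows. The class `ShaderSlot` and the function `toy_compile` are hypothetical names invented for this illustration; the real reload logic lives in the engine and may differ.

```python
class ShaderSlot:
    """Holds the active program for one shader file (hypothetical helper)."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn  # source text -> program, raises on error
        self.active = None            # program currently used for rendering
        self.last_error = None

    def reload(self, source):
        """Recompile; swap in the result only if compilation succeeds."""
        try:
            program = self.compile_fn(source)
        except Exception as err:
            self.last_error = str(err)  # self.active is left untouched
            return False
        self.active = program
        self.last_error = None
        return True


def toy_compile(source):
    """Stand-in compiler: rejects any source containing the word 'error'."""
    if "error" in source:
        raise ValueError("compile failed")
    return "bytecode(" + source + ")"


slot = ShaderSlot(toy_compile)
slot.reload("v1")          # succeeds, v1 becomes active
slot.reload("error here")  # fails, v1 stays active
```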
The Register-Based VM
The compiled IR executes on a custom register-based virtual machine that runs once per fragment inside the GPU pixel shader.
Registers
The VM operates on a register file of 16 float4 registers.
Each value produced by a node occupies one register slot regardless of
width — a float4 costs the same one slot as a float. Internally all
registers store float4; narrower types broadcast across unused components
at no extra cost.
A slot is freed as soon as the last node that reads from it is evaluated. The compiler measures the peak live count — the maximum number of simultaneously occupied slots.
If peak liveness exceeds the available slots, compilation fails with:
Register file overflow: N registers needed, max is 16.
Simplify the graph or split it into multiple materials.
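The peak live count can be estimated with a single pass over the nodes in evaluation order. The sketch below is a simplified model, not the engine's allocator: it assumes every node produces exactly one value, that a node's output slot is allocated before its inputs are released, and that a value nobody reads is released immediately. The names `peak_live` and `check_register_budget` are hypothetical.

```python
def peak_live(nodes):
    """nodes[i] is the list of node indices that node i reads.

    Returns the maximum number of simultaneously occupied slots.
    """
    # Record the last reader of every value.
    last_use = {}
    for i, inputs in enumerate(nodes):
        for src in inputs:
            last_use[src] = i

    live = set()
    peak = 0
    for i, inputs in enumerate(nodes):
        live.add(i)                 # the output needs a slot
        peak = max(peak, len(live))
        for src in inputs:
            if last_use[src] == i:  # this node was the last reader
                live.discard(src)
        if i not in last_use:       # value is never read again
            live.discard(i)
    return peak


def check_register_budget(nodes, max_regs=16):
    """Raise in the style of the compiler error when the budget is exceeded."""
    needed = peak_live(nodes)
    if needed > max_regs:
        raise RuntimeError(
            "Register file overflow: %d registers needed, max is %d."
            % (needed, max_regs)
        )
    return needed


# a = load, b = load, c = a * b, d = c + a
example = [[], [], [0, 1], [2, 0]]
```

In this model the order in which nodes are evaluated matters: loading many inputs up front keeps them all live, while consuming each value soon after it is produced lowers the peak.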
Instructions
Each Op node maps to one VM instruction. An instruction reads from one or more registers, computes the result, and writes to one register. The maximum number of instructions in one shader program is 128. If exceeded:
Too many instructions (max 128)
In practice, most complex shaders run out of registers before they reach the instruction limit.
Node Kinds
| Kind | Description |
|---|---|
|  | Reads a per-fragment attribute (UV, worldPos, etc.) or a per-frame global (g_Time, g_LightDirection). No inputs; one output. |
|  | Writes one component of the master output (albedo, roughness, etc.). One input; no outputs. |
| Const | Inline constant; the value is stored in the node. No inputs; one output. |
| Var | Material property reference; the value comes from the per-instance property table. No inputs; one output. |
| Op | Pure math operation. One or more inputs; one output. |
AnyFloat Polymorphism
Most Op nodes accept AnyFloat pins — the actual width (float / float2 /
float3 / float4) is resolved once at compile time from the widest
connected input. A narrower float scalar broadcasts into any wider type
for free (the register file stores all values as float4 internally).
Practical consequence: connecting a float3 and a float to a Multiply
node produces a float3 result without any explicit widening node.
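The width rule can be illustrated with tuples standing in for float, float2, float3, and float4 values. This is a sketch under assumptions: the helper names `broadcast` and `multiply` are hypothetical, and rejecting a float2 paired with a float3 is a guess, since the text only describes scalar broadcasting.

```python
def broadcast(value, width):
    """Expand a scalar (1-tuple) to `width` components."""
    if len(value) == width:
        return value
    if len(value) == 1:
        return value * width  # tuple repetition: (s,) -> (s, s, ...)
    # Assumption: only scalars broadcast; other mismatches are errors.
    raise TypeError("cannot widen float%d to float%d" % (len(value), width))


def multiply(a, b):
    """Component-wise multiply; result width is the widest input."""
    width = max(len(a), len(b))
    a = broadcast(a, width)
    b = broadcast(b, width)
    return tuple(x * y for x, y in zip(a, b))


color = (1.0, 2.0, 3.0)         # float3
scale = (2.0,)                  # float
result = multiply(color, scale) # float3, no explicit widening step
```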
Automatic Optimizations
The compiler applies several optimizations before register assignment.
Per-Frame Constant Folding
The compiler classifies every node into one of three tiers:
| Tier | Condition |
|---|---|
| Const | Depends only on inline constants (Const nodes) and material properties (Var nodes). Value never changes during rendering. |
| PerFrame | Depends on at least one per-frame global (g_Time, g_LightDirection) and on no per-fragment attribute. |
| PerPixel | Depends on any per-fragment attribute: UV, worldPos, worldNormal, etc. Must be evaluated on the GPU for every fragment. |
PerFrame nodes are executed by the CPU once per frame. Their results are uploaded as shader constants before the draw call. Only PerPixel nodes run inside the GPU pixel shader.
This means sub-expressions that involve only g_Time, g_LightDirection,
and material properties cost zero GPU instructions:
// All three lines are PerFrame — computed on CPU, free on GPU:
let pulse = sin(g_Time * speed) // g_Time + material property
let lightFactor = dot(g_LightDirection, float3(0, 1, 0)) // g_LightDirection + const
let combined = pulse * lightFactor // both inputs are PerFrame
// This line is PerPixel — must run on GPU:
let detail = noise(inp.worldPos) * combined // worldPos forces PerPixel
When a PerFrame value feeds a PerPixel node, the compiler creates a crossing pin: the PerFrame result is pre-computed on CPU and passed as a constant to the GPU shader. The GPU sees it as an ordinary register value.
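One way to picture the tier classification is as a "highest tier of any input" rule applied in evaluation order. The sketch below models the five-line shader example as a node list. The node layout, the `"Input"` kind label, and the function names are assumptions made for this illustration, not the compiler's real data structures.

```python
CONST, PER_FRAME, PER_PIXEL = 0, 1, 2
PER_FRAME_GLOBALS = {"g_Time", "g_LightDirection"}


def classify(nodes):
    """nodes: list of (kind, name, input_indices). Returns one tier per node."""
    tiers = []
    for kind, name, inputs in nodes:
        if kind in ("Const", "Var"):
            tier = CONST
        elif kind == "Input":  # hypothetical label for attribute/global reads
            tier = PER_FRAME if name in PER_FRAME_GLOBALS else PER_PIXEL
        else:                  # "Op": inherits the highest tier of its inputs
            tier = max(tiers[i] for i in inputs)
        tiers.append(tier)
    return tiers


def crossings(nodes, tiers):
    """Links where a PerFrame value feeds a PerPixel node."""
    return [
        (src, dst)
        for dst, (_, _, inputs) in enumerate(nodes)
        if tiers[dst] == PER_PIXEL
        for src in inputs
        if tiers[src] == PER_FRAME
    ]


shader = [
    ("Input", "g_Time", ()),            # 0
    ("Var", "speed", ()),               # 1
    ("Op", "mul", (0, 1)),              # 2
    ("Op", "sin", (2,)),                # 3  pulse
    ("Input", "g_LightDirection", ()),  # 4
    ("Const", "float3(0, 1, 0)", ()),   # 5
    ("Op", "dot", (4, 5)),              # 6  lightFactor
    ("Op", "mul", (3, 6)),              # 7  combined
    ("Input", "worldPos", ()),          # 8
    ("Op", "noise", (8,)),              # 9
    ("Op", "mul", (9, 7)),              # 10 detail
]
tiers = classify(shader)
```

In this model only nodes 8, 9, and 10 are PerPixel, and the single crossing is the link from `combined` (node 7) into `detail` (node 10).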
Instruction Fusion (MAD)
a * b + c is always fused into a single multiply-add instruction:
let r = a * b + c // one MAD instruction
let r = mad(a, b, c) // identical
let r = c + a * b // identical
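The fusion can be thought of as a pattern match on the expression tree. The sketch below is a toy model: expressions are nested tuples such as `("add", ("mul", "a", "b"), "c")`, and the function name `fuse_mad` is hypothetical. It does not claim to reproduce how or at which stage the real compiler performs the rewrite.

```python
def fuse_mad(expr):
    """Rewrite add(mul(a, b), c) and add(c, mul(a, b)) into mad(a, b, c)."""
    if not isinstance(expr, tuple):
        return expr  # leaf: a variable name or constant
    op = expr[0]
    args = [fuse_mad(a) for a in expr[1:]]
    if op == "add":
        lhs, rhs = args
        if isinstance(lhs, tuple) and lhs[0] == "mul":
            return ("mad", lhs[1], lhs[2], rhs)
        if isinstance(rhs, tuple) and rhs[0] == "mul":
            return ("mad", rhs[1], rhs[2], lhs)
    return (op,) + tuple(args)


left_form = fuse_mad(("add", ("mul", "a", "b"), "c"))   # a * b + c
right_form = fuse_mad(("add", "c", ("mul", "a", "b")))  # c + a * b
```

Both spellings collapse to the same `("mad", "a", "b", "c")` tuple, which matches the claim that the three source forms are identical after compilation.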
Zero-Cost Expressions
Several constructs emit no IR node and consume no register slot:
- Sequential swizzles: .xyz, .xy, .xyzw, .rgba, .rgb, .rg are identity masks; they pass the existing register through unchanged.
- Scalar broadcast constructors: float2(s), float3(s), float4(s) with a single scalar argument emit no Combine node. The scalar broadcasts via the register file's float4 layout.
- Alias variables: let a = b, where b is another variable, is a pure alias resolved at compile time. No node or slot is allocated for a.
- Unary + emits nothing.
Common Subexpression Elimination (CSE)
The compiler deduplicates identical subgraphs. Two nodes are considered identical if they have the same operation and recursively identical inputs.
If the same expression appears in multiple places, it is computed once and its
result is shared — no manual caching with let is required:
// noise(inp.worldPos) appears twice — CSE computes it once, result shared.
let base = noise(inp.worldPos) * 2.0 - 1.0
let glow = noise(inp.worldPos) * tint
CSE requires the full expression trees to match. Constant values that differ between two otherwise identical sub-expressions prevent deduplication.
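The "same operation and recursively identical inputs" rule is commonly implemented with a lookup table keyed by the operation and the ids of its already-deduplicated inputs. The sketch below shows that general technique on nested tuples; the function name `build_with_cse` and the key layout are assumptions, and the real compiler may deduplicate differently.

```python
def build_with_cse(exprs):
    """Turn expression trees into a node list, sharing identical subtrees.

    Returns (nodes, roots): the unique nodes and the node id of each input tree.
    """
    table = {}  # key -> node id
    nodes = []

    def intern(expr):
        if isinstance(expr, tuple):
            # Key on the op and the ids of the (already interned) inputs.
            key = (expr[0],) + tuple(intern(a) for a in expr[1:])
        else:
            key = (None, expr)  # leaf: variable name or constant value
        if key not in table:
            table[key] = len(nodes)
            nodes.append(key)
        return table[key]

    roots = [intern(e) for e in exprs]
    return nodes, roots


# let base = noise(inp.worldPos) * 2.0 - 1.0
base = ("sub", ("mul", ("noise", "worldPos"), 2.0), 1.0)
# let glow = noise(inp.worldPos) * tint
glow = ("mul", ("noise", "worldPos"), "tint")

nodes, roots = build_with_cse([base, glow])
```

Here the `noise` node is created once and referenced by both multiplies. Two multiplies by different constants get different keys, so they are not merged.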
Swizzle Optimizations
- Single component (.x, .y, .z, .w) → one Extract node, one output slot.
- Arbitrary multi-component (.zy, .wzyx, .xxyy, …) → one compact Swizzle node with the mask packed into a single byte.
- .xyxy on float2 → one CombineFloat4F2F2 node instead of four splats and a combine.
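A four-component mask fits in one byte because each component index (x, y, z, w) needs only two bits. The sketch below shows one plausible packing, with the first component in the lowest two bits. The bit order, the helper names, and the need to store the component count separately are assumptions; the engine's actual encoding is not documented here.

```python
COMPONENT_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}


def pack_swizzle(mask):
    """Pack a mask such as "wzyx" into one byte, two bits per component."""
    if not 1 <= len(mask) <= 4:
        raise ValueError("mask must have 1 to 4 components")
    packed = 0
    for slot, ch in enumerate(mask):
        packed |= COMPONENT_INDEX[ch] << (2 * slot)
    return packed


def unpack_swizzle(packed, count):
    """Recover the mask string; `count` is the number of components."""
    return "".join("xyzw"[(packed >> (2 * slot)) & 3] for slot in range(count))


reverse = pack_swizzle("wzyx")
restored = unpack_swizzle(reverse, 4)
```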
Reducing Register and Instruction Usage
When you hit a limit, apply the strategies below.
Vectorize Parallel Operations
Replace N independent scalar operations with one vector operation of width N. Separate the constant part so the compiler can pre-compute it on the CPU:
// 3 separate sin calls — 3 instructions, 3 slots live simultaneously
let a = sin(phase + 0.0)
let b = sin(phase + 2.09)
let c = sin(phase + 4.19)
// 1 sin call on float3 — 1 instruction, 1 slot
// float3(0.0, 2.09, 4.19) is a Const node → folded to CPU, free on GPU
// float3(phase) is a scalar broadcast → zero-cost
let abc = sin(float3(phase) + float3(0.0, 2.09, 4.19))
// access components as abc.x, abc.y, abc.z
Prefer Swizzles Over Manual Channel Extraction
Building a vector from individually extracted components creates multiple Splat + Combine nodes. A swizzle mask does the same work in one node:
// float3(v.x, v.x, v.z): SplatX, SplatX, SplatZ, CombineFloat3 — 4 nodes
let bad = float3(v.x, v.x, v.z)
// v.xxz: one Swizzle node
let good = v.xxz
Exploit Per-Frame Folding
Separate per-frame terms so the compiler can hoist them to the CPU:
// Bad: g_Time * freq mixed with per-pixel inp.uv.x — entire expr runs on GPU
let phase = sin(inp.uv.x + g_Time * freq)
// Good: isolated PerFrame sub-expression hoisted to CPU, free on GPU
let timeOffset = g_Time * freq
let phase = sin(inp.uv.x + timeOffset)