@darksylinc
Last active October 2, 2023 09:21

Godot Shader System Proposal

Godot 4's current shader/PSO system assumes the following:

  1. Specialization Constants reduce compilation time (true), but they are not free.
  2. Specialization Constants are well optimized by the driver (blatantly false on Android).
  3. Uniform static branches are free. They're very cheap, but I am seeing incremental improvements when they are removed entirely via #ifdef. Sometimes major improvements.

Godot 4 wants these assumptions to be true because they help reduce the number of shader & PSO variants, which are key to reducing compilation stutter.

Nonetheless Godot 4 has visible stutter, although it could probably be worse.

Performance

  • On Android, Godot is often GPU bound.
    • To fix this we need a system that can #ifdef out a lot of unwanted code.
    • It needs to scale to the resulting number of variants.
    • Each on/off setting doubles the number of variants (N settings give up to 2^N).
  • On high end Desktop, I notice Godot will often be CPU bound, although it's not very relevant because it can still reach 60 fps.
    • Biggest bottleneck is merging all the disjoint caches.
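
To make the scaling requirement concrete, here is a minimal sketch that assumes every setting is an independent on/off toggle compiled in or out via #ifdef. The helper name `max_variant_count` is made up for illustration; real settings are often not independent, so this is only an upper bound.

```cpp
#include <cstdint>

// Hypothetical helper: upper bound on the number of shader variants when
// every setting is an independent boolean #ifdef toggle.
constexpr uint64_t max_variant_count(unsigned p_bool_settings) {
    return uint64_t(1) << p_bool_settings;
}

static_assert(max_variant_count(1) == 2, "one toggle gives two variants");
static_assert(max_variant_count(10) == 1024, "ten toggles give 1024 variants");
```

Even a modest number of toggles quickly outgrows a fixed-size array indexed by variant, which is why the lookup structure matters.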

Disjoint caches

I find this deeply wasteful (this is from RenderForwardClustered::_render_list_template. It has been simplified for clarity):

for( uint32_t i = 0; i < num_elements; i++ )
{
    // Get the mesh for this instance.
    const GeometryInstanceSurfaceDataCache *surf = p_params->elements[i];
    // Get the shader for this mesh (I'm skipping shadow map passes & misc. for clarity).
    SceneShaderForwardClustered::ShaderData *shader = surf->shader;

    // This is actually part of a complex switch statement
    SceneShaderForwardClustered::PipelineVersion pipeline_version =
        SceneShaderForwardClustered::PIPELINE_VERSION_COLOR_PASS;

    // Get a 'pipeline'. Which variable to access depends on whether it's a color pass.
    // shader->pipelines[3][5][10]
    // shader->color_pipelines[3][5][32]
    PipelineCacheRD *pipeline = &shader->color_pipelines[cull_variant][primitive][pipeline_version];
    mesh_storage->mesh_surface_get_vertex_arrays_and_format(mesh_surface,
					pipeline->get_vertex_input_mask(), pipeline_motion_vectors, vertex_array_rd,
					vertex_format);
    RID pipeline_rd = pipeline->get_render_pipeline(
                    vertex_format, framebuffer_format, p_params->force_wireframe, 0, pipeline_specialization);

    RD::draw_list_bind_render_pipeline(draw_list, pipeline_rd);
}

And when we get inside both mesh_surface_get_vertex_arrays_and_format and get_render_pipeline:

RID get_render_pipeline(RD::VertexFormatID p_vertex_format_id, RD::FramebufferFormatID p_framebuffer_format_id, bool p_wireframe, uint32_t p_render_pass, uint32_t p_bool_specializations ) {
    spin_lock.lock();
    p_wireframe |= rasterization_state.wireframe;

    RID result;
    for (uint32_t i = 0; i < version_count; i++) {
        if (versions[i].vertex_id == p_vertex_format_id && versions[i].framebuffer_id == p_framebuffer_format_id && versions[i].wireframe == p_wireframe && versions[i].render_pass == p_render_pass && versions[i].bool_specializations == p_bool_specializations) {
            result = versions[i].pipeline;
            spin_lock.unlock();
            return result;
        }
    }
    result = _generate_version(p_vertex_format_id, p_framebuffer_format_id, p_wireframe, p_render_pass, p_bool_specializations);
    spin_lock.unlock();
    return result;
}

void mesh_surface_get_vertex_arrays_and_format(void *p_surface, uint32_t p_input_mask,
			bool p_input_motion_vectors, RID &r_vertex_array_rd, RD::VertexFormatID &r_vertex_format) {
    Mesh::Surface *s = reinterpret_cast<Mesh::Surface *>(p_surface);

    s->version_lock.lock();

    //there will never be more than, at much, 3 or 4 versions, so iterating is the fastest way

    for (uint32_t i = 0; i < s->version_count; i++) {
        if (s->versions[i].input_mask != p_input_mask ||
                s->versions[i].input_motion_vectors != p_input_motion_vectors) {
            // Find the version that matches the inputs required.
            continue;
        }

        //we have this version, hooray
        r_vertex_format = s->versions[i].vertex_format;
        r_vertex_array_rd = s->versions[i].vertex_array;
        s->version_lock.unlock();
        return;
    }

    uint32_t version = s->version_count;
    s->version_count++;
    s->versions = (Mesh::Surface::Version *)memrealloc(
            s->versions, sizeof(Mesh::Surface::Version) * s->version_count);

    _mesh_surface_generate_version_for_input_mask(
            s->versions[version], s, p_input_mask, p_input_motion_vectors);

    r_vertex_format = s->versions[version].vertex_format;
    r_vertex_array_rd = s->versions[version].vertex_array;

    s->version_lock.unlock();
}

That is, in both cases it performs a linear search, assuming the number of versions is low (which appears to be true).

However I will ask: WHY???

There are two linear lookups plus an O(1) lookup on color_pipelines[][][], which touches a ridiculous number of cache lines. pipeline_version is also reevaluated for every object. If it weren't for lightmaps, pipeline_version could be calculated once per pass.

sizeof(PipelineCacheRD) == 264. Thus color_pipelines[3][5][32] alone needs about 124 KiB per material (480 entries × 264 bytes).
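
The arithmetic behind that figure can be checked with a few compile-time constants. This is only a sketch: the array dimensions are the ones quoted in the listing above, the 264-byte size is taken from the text rather than measured, and the constant names are invented here.

```cpp
#include <cstddef>

// Figures quoted in the text; names are hypothetical.
constexpr size_t PIPELINE_CACHE_SIZE = 264;          // sizeof(PipelineCacheRD)
constexpr size_t COLOR_PIPELINE_COUNT = 3 * 5 * 32;  // color_pipelines[3][5][32]
constexpr size_t OTHER_PIPELINE_COUNT = 3 * 5 * 10;  // pipelines[3][5][10]

constexpr size_t color_bytes = COLOR_PIPELINE_COUNT * PIPELINE_CACHE_SIZE;
constexpr size_t other_bytes = OTHER_PIPELINE_COUNT * PIPELINE_CACHE_SIZE;

static_assert(color_bytes == 126720, "roughly 124 KiB for color_pipelines");
static_assert(other_bytes == 39600, "roughly 39 KiB more for pipelines");
```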

All of that per object.

There's a much better way.

Pass hash and Object hash

This is the concept I use in the HLMS from OgreNext (see "Creation of Shaders" in its documentation).

The HLMS does a lot more than that, but we don't need it since Godot has its own stuff.

The core concept is that we have 3 types of data and 3 "moments":

Types of data.

All 3 of the following are required to create a PSO (the lists are non-exhaustive):

  1. Pass data.
    • Whether we're doing the depth prepass or the colour pass.
    • VkRenderPass info (e.g. colour formats, depth buffer & stencil format, MSAA settings).
    • Number of directional lights.
    • Global Overrides (e.g. render everything as wireframe, flip culling mode, force-disable depth writes).
    • Anything that stays the same for all objects about to be rendered.
  2. Material data.
    • Colour, BRDF type, metalness, textures, samplers to use, etc.
    • Some of these settings might override the pass's settings (e.g. disable receiving shadows, disable lights).
    • Culling mode, depth read, depth write, wireframe setting, blending mode.
  3. Instance / Mesh data.
    • Vertex Format.
    • Any override to Material & Pass.
    • Mesh data has relevant info for the material: Material's Normal mapping can't be used if the mesh doesn't have tangents. Textures can't be used if there are no UVs. Vertex Colours can't be used if the mesh doesn't have them.
    • Any flag that indicates this mesh requires special treatment (e.g. wind shader for trees, if it's not a material setting).

The most daunting thing is that everything can potentially be combined with anything.

A lot of people feel overwhelmed by this: Any material can be bound to any mesh. Meshes may have multiple vertex formats. They may be used in multiple render passes with wildly different settings (e.g. RGBA10_2 vs depth only passes, vs RGBA8_UNORM with 8xMSAA, etc).

Once we merge all these settings we can:

  • Get or generate the right shader (or SPIR-V if already cached).
  • Generate the right PSO.

Moments

We can identify 3 moments in the lifetime of an instance:

  1. When an Instance is assigned to a Material.
    • Material data can be merged with Instance data.
    • This happens rarely, i.e. once per object in the whole lifetime of the object.
    • We will assume exceptions are rare.
  2. When a Pass starts.
    • Pass data must be generated now.
  3. When an Instance is about to be rendered.
    • We can merge Pass data with the already merged Instance + Material data.
    • We can generate the Shader, SPIR-V and PSO now.

Implementation

The goal is to turn the data into hashes on merge:

instance->set_material( my_material );

This is moment #1. We can already start merging the data:

void Mesh::set_material( const Ref<Material> &p_material )
{
    MaterialInstance cache_entry = {};

    if( p_material->receives_shadows() )
        cache_entry.defines += "\n#define RECEIVE_SHADOWS";
    if( p_material->has_normal_maps() && vertex_format.has( tangents ) )
        cache_entry.defines += "\n#define NORMAL_MAPS";

    // find() is assumed to return the index of the entry, or -1 if it is not cached yet.
    // Testing the index directly would wrongly treat entry 0 as "not found".
    const int find_result = global_cache.find( cache_entry );
    if( find_result != -1 ) {
        // Already in cache.
        this->material_instance_hash = find_result;
    }
    else
    {
        // The hash is an index to global_cache for O(1) look ups, but this is optional.
        // The disadvantage is that the hash numbers on each run depend on the order in which set_material() is called.
        this->material_instance_hash = global_cache.size();
        global_cache.push_back( cache_entry );
    }
}
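
As a self-contained illustration of the "hash is an index" idea, the sketch below uses standard containers instead of Godot's. `MaterialInstance` is reduced to its define string and `find_or_add` is a hypothetical helper, so treat it as one possible shape rather than the intended implementation. The linear search is acceptable here because it only runs at moment #1, which is rare.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the merged Material + Instance data.
struct MaterialInstance {
    std::string defines;
    bool operator==(const MaterialInstance &p_other) const {
        return defines == p_other.defines;
    }
};

// Returns the index of p_entry in p_cache, appending it first if it is new.
// The index doubles as the "hash", so later lookups by hash are O(1).
uint32_t find_or_add(std::vector<MaterialInstance> &p_cache, const MaterialInstance &p_entry) {
    for (uint32_t i = 0; i < p_cache.size(); i++) {
        if (p_cache[i] == p_entry) {
            return i;
        }
    }
    p_cache.push_back(p_entry);
    return uint32_t(p_cache.size() - 1);
}
```

Two instances whose merged settings produce the same defines end up sharing one index, which is what later allows PSO reuse to be detected with a plain integer comparison.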

The moment #2 is what happens in RenderForwardClustered::_render_scene: A lot of pass data is already gathered and sent to _render_list_with_threads().

The only difference is that we would also need to add it to a cache for hashing:

void RenderForwardClustered::_render_scene( /* ... */ )
{
    PassData cache_entry = {};

    cache_entry.directional_light_count = ...;
    cache_entry.vkRenderPass = ...;
    cache_entry.force_wireframe = ...;

    // Look the entry up before adding it. Pushing it first would make the
    // search always succeed and the cache grow every frame.
    // find() is assumed to return the index of the entry, or -1 if missing.
    const int find_result = global_pass_cache.find( cache_entry );
    if( find_result != -1 ) {
        // Already in cache.
        pass_hash = find_result;
    }
    else {
        // New entry
        pass_hash = global_pass_cache.size();
        global_pass_cache.push_back( cache_entry );
    }

   _render_list_with_threads( pass_hash, ... );
}

Now we just need to pass pass_hash all the way down to _render_list_template and create moment #3:

void RenderForwardClustered::_render_list_template(RenderingDevice::DrawListID p_draw_list,
		RenderingDevice::FramebufferFormatID p_framebuffer_Format, RenderListParameters *p_params,
		uint32_t p_from_element, uint32_t p_to_element, uint32_t pass_hash)
{
    uint64_t last_used_hash = 0xFFFFFFFFFFFFFFFF;

    for (uint32_t i = p_from_element; i < p_to_element; i++) {
        const GeometryInstanceSurfaceDataCache *surf = p_params->elements[i];

        // Caches from moment #1 + #2. Widen to 64 bits before shifting:
        // shifting a 32-bit value by 32 bits is undefined behaviour.
        uint64_t final_hash = uint64_t(surf->material_instance_hash) | (uint64_t(pass_hash) << 32);

        // This is our first improvement: Without a single lookup, we can detect whether
        // the PSO from the previous object is the same and already bound. If so, then
        // we don't do any lookups at all!
        if( final_hash != last_used_hash ) {
            // global_pso_final_cache must be a binary search.
            RID pso = global_pso_final_cache.find( final_hash );
            if( !pso.is_valid() ) {
                // We must generate the PSO. These two are already guaranteed to be cached
                const PassData &pass_data = global_pass_cache[pass_hash];
                const MaterialInstance &mat_inst_data = global_cache[surf->material_instance_hash];
                pso = create_pso( pass_data, mat_inst_data );
                global_pso_final_cache[final_hash] = pso;
            }
            RD::get_singleton()->draw_list_bind_render_pipeline(p_draw_list, pso);
            // Remember what is bound so the next object can skip the lookup.
            last_used_hash = final_hash;
        }
    }
}
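
The per-object path can be exercised in isolation. The following is a minimal sketch, not Godot code: `PsoHandle`, `PsoCache` and `count_binds` are hypothetical names, PSO creation is replaced by a counter, and the global cache is modelled as a sorted array searched with `std::lower_bound`.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using PsoHandle = uint32_t; // Hypothetical stand-in for an RID.

// Combine the two 32-bit indices into one 64-bit key. Both operands are
// widened first, because shifting a 32-bit value by 32 bits is undefined.
inline uint64_t make_final_hash(uint32_t p_material_instance_hash, uint32_t p_pass_hash) {
    return uint64_t(p_material_instance_hash) | (uint64_t(p_pass_hash) << 32);
}

// Sorted array of (key, PSO) pairs; lookups are O(log N).
struct PsoCache {
    std::vector<std::pair<uint64_t, PsoHandle>> entries;
    uint32_t created = 0;

    PsoHandle get_or_create(uint64_t p_key) {
        auto it = std::lower_bound(entries.begin(), entries.end(), p_key,
                [](const std::pair<uint64_t, PsoHandle> &p_entry, uint64_t p_k) {
                    return p_entry.first < p_k;
                });
        if (it != entries.end() && it->first == p_key) {
            return it->second;
        }
        PsoHandle pso = ++created; // Placeholder for real PSO creation.
        entries.emplace(it, p_key, pso); // Inserting at 'it' keeps the array sorted.
        return pso;
    }
};

// Counts the binds issued for a list of per-object keys. Consecutive objects
// with the same key skip both the cache lookup and the bind.
uint32_t count_binds(PsoCache &p_cache, const std::vector<uint64_t> &p_keys) {
    uint64_t last_used_hash = UINT64_MAX;
    uint32_t binds = 0;
    for (uint64_t key : p_keys) {
        if (key != last_used_hash) {
            p_cache.get_or_create(key);
            binds++;
            last_used_hash = key;
        }
    }
    return binds;
}
```

Sorting the render list so that objects with equal keys are adjacent makes the early-out hit more often, at which point the O(log N) search only runs on PSO changes.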

Analysis

Advantages

  • It replaces all the disjoint caches (material, shaders, vertex_format, pso) with 3 caches:
    1. global_cache
    2. global_pass_cache
    3. global_pso_final_cache
  • Only global_pso_final_cache gets touched per object per frame. And we can even avoid the lookup entirely if the previous object used the same PSO, because merging the pass & material+instance hashes can be done in constant time without lookup.
  • Per pass settings means we can support a lot of shader variants. We can hyper specialize shaders for slow GPUs.

Disadvantages

  • It assumes all LODs must have the same vertex format. Otherwise we'd need one Material+Instance hash per LOD.
  • Godot's current method has a lot of ugly lookups, but they're all O(1) or O(N) with N < 4. This method uses a global cache with a binary search that is O(log N).
  • Per pass settings means we can support a lot of shader variants. If it gets out of hand it can cause a lot of stutter.
  • It is hard to predict shaders a priori. The shaders to use are not known until the last possible moment.
    • This can be worked around though.

Ubershaders

Dario proposed the following brilliant scheme:

// As an ubershader
#define us_has_directional_shadows ubo.has_directional_shadows
if( us_has_directional_shadows ) {
    // process directional shadows
}
// As a specialized shader
#define us_has_directional_shadows true
if( us_has_directional_shadows ) {
    // process directional shadows
}

It doesn't just work for true/false parameters. The number of directional lights is supported too. It basically replaces what Specialization Constants are doing right now.
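
On the engine side, the two variants could come from the same shader source with a different generated prefix. The sketch below shows one way to emit that prefix; `make_us_define` and the parameter name `directional_light_count` are hypothetical, and the `us_` / `ubo.` naming simply follows the example above.

```cpp
#include <string>

// Hypothetical helper: emits the #define backing one "us_" parameter.
// An empty p_value selects the ubershader path (value read from the UBO at
// run time); otherwise the value is baked into the shader as a literal.
std::string make_us_define(const std::string &p_name, const std::string &p_value) {
    const std::string rhs = p_value.empty() ? "ubo." + p_name : p_value;
    return "#define us_" + p_name + " " + rhs + "\n";
}
```

For example, `make_us_define("directional_light_count", "2")` would produce a specialized variant with the light count fixed at 2, while passing an empty value keeps the run-time read.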

This ubershader method is compatible with almost any PSO creation strategy, including the one I'm proposing.

The ubershader can be used whenever the specialized version hasn't finished compiling, thus heavily alleviating stutter. In the meantime, we compile the shader (and create the PSO) in background threads.
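
A minimal sketch of that hand-over, assuming one slot per PSO key: the render thread never blocks, and a background thread publishes the specialized PSO once it is ready. `PipelineSlot` and its members are hypothetical names, and the actual compilation is left out.

```cpp
#include <atomic>
#include <cstdint>

using PsoHandle = uint32_t; // Hypothetical stand-in for an RID.

struct PipelineSlot {
    PsoHandle ubershader_pso = 0;
    PsoHandle specialized_pso = 0;
    std::atomic<bool> specialized_ready{ false };

    // Called once by the background compile thread when it finishes.
    // The release store makes specialized_pso visible before the flag.
    void publish_specialized(PsoHandle p_pso) {
        specialized_pso = p_pso;
        specialized_ready.store(true, std::memory_order_release);
    }

    // Called by the render thread for every draw; never waits.
    PsoHandle get_for_draw() const {
        return specialized_ready.load(std::memory_order_acquire) ? specialized_pso : ubershader_pso;
    }
};
```

Until `publish_specialized` runs, every draw uses the ubershader; afterwards the same call site picks up the specialized PSO without any extra synchronization.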

Console support and predicting shaders a priori

Godot can't compile on the console. Shaders must be compiled offline before shipping.

The solution requires a bit of tracking what can be rendered.

From there on, we can pretend we are rendering, even if we are not. For example we don't need vertex or index buffers, but we do need vertex formats.

We don't need a render texture, but we do need the pixel format.

After tracking which meshes (their vertex formats), materials, pixel formats, MSAA settings, etc. are being used, we brute force a simulated render of all these combinations.

We can use past runs to gather this info.

Whatever isn't caught by the cache or by the predictor will have to fall back to the ubershader.

We will also need more tools to report how many PSOs had to fall back to an ubershader (so that the binary can be exported again with an improved cache).
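
The brute-force step can be sketched as a Cartesian product over what was recorded. Everything here is hypothetical: `UsageLog` stands for whatever the tracker collects from past runs, and plain integers stand in for vertex formats, materials and pass configurations.

```cpp
#include <cstdint>
#include <set>
#include <tuple>
#include <vector>

// Hypothetical record of what past runs actually used.
struct UsageLog {
    std::set<uint32_t> vertex_formats;
    std::set<uint32_t> materials;
    std::set<uint32_t> pass_configs; // Pixel format, MSAA setting, etc.
};

// One simulated draw: (vertex format, material, pass config).
using SimulatedDraw = std::tuple<uint32_t, uint32_t, uint32_t>;

// Enumerates every combination so each one can be compiled offline.
std::vector<SimulatedDraw> enumerate_simulated_draws(const UsageLog &p_log) {
    std::vector<SimulatedDraw> draws;
    draws.reserve(p_log.vertex_formats.size() * p_log.materials.size() * p_log.pass_configs.size());
    for (uint32_t vertex_format : p_log.vertex_formats) {
        for (uint32_t material : p_log.materials) {
            for (uint32_t pass_config : p_log.pass_configs) {
                draws.emplace_back(vertex_format, material, pass_config);
            }
        }
    }
    return draws;
}
```

The product grows multiplicatively, so in practice it would likely need pruning with pairs that were actually seen together; combinations that are still missed end up on the ubershader path.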

@reduz

reduz commented Oct 2, 2023

Here are some ideas I have that aim to solve the same problem but should IMO be more efficient and require fewer changes.

Use a last pass cache

Instead of drowning ourselves in hashmaps or binary sorted arrays and lots of comparisons to find the right caches, I think an extra layer of "last pass" cache should be a lot more efficient. Here's how I would implement it.

struct PipelineLastUsage {
	uint64_t last_input_format : 32;
	uint64_t last_primitive : 3;
	uint64_t last_cull_mode : 2;
	uint64_t last_mirror : 1;
	uint64_t last_uses_lightmap : 1;
	uint64_t last_uses_multiview : 1;
	uint64_t last_wireframe : 1;
	uint64_t last_subpass : 4;
	RD::VertexFormatID last_vertex_format;
	RD::FramebufferFormatID last_framebuffer_format;
	void *last_surface;
	RID pipeline;
	RID vertex_buffer;
	RID index_buffer;
	uint32_t base_spec_constants;
	float last_lod_min_depth;
	float last_lod_max_depth;
};

enum PipelineLastUsageType {
	PIPELINE_LAST_USAGE_SHADOW_PASS,
	PIPELINE_LAST_USAGE_DEPTH_PREPASS,
	PIPELINE_LAST_USAGE_COLOR_PASS,
	PIPELINE_LAST_USAGE_MAX
};

struct GeometryInstanceForwardMobile {
	//...//
	PipelineLastUsage pipeline_last_usage[PIPELINE_LAST_USAGE_MAX];
};

This is only needed for color, depth and shadow passes; the other types of passes do not execute every frame and do not involve nearly as many objects.

This way, all you need to do is compare whether the current and last frame are using the same configuration and reuse the pipelines. Zero lookups on any table. We know for a fact that these values do not change from frame to frame. For LOD, the search function will also return the depth range for the current LOD, so we can tell whether it went out of range and request another index buffer if that is the case.

If any of the things that make up this cache change (mesh, material, etc), Godot already clears GeometryInstanceForwardMobile and re-creates it, so using this cache poses no risk. The optimization should be really simple and just scoped to RenderForwardMobile::_render_list_template.
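
The comparison itself could look roughly like the sketch below. It is deliberately reduced: `PipelineKey` holds only a few of the fields listed above, `LastUsage` and `get_pipeline` are hypothetical names, and the existing (slower) lookup is passed in as a callable so the example stays self-contained.

```cpp
#include <cstdint>

using PipelineHandle = uint32_t; // Hypothetical stand-in for an RID.

// Reduced, hypothetical version of the configuration being compared.
struct PipelineKey {
    uint32_t input_format;
    uint8_t primitive;
    uint8_t cull_mode;
    bool wireframe;

    bool operator==(const PipelineKey &p_other) const {
        return input_format == p_other.input_format && primitive == p_other.primitive &&
                cull_mode == p_other.cull_mode && wireframe == p_other.wireframe;
    }
};

struct LastUsage {
    bool valid = false;
    PipelineKey key{};
    PipelineHandle pipeline = 0;
};

// Reuses the pipeline from the previous frame when the configuration is
// unchanged; otherwise runs the slow lookup and refreshes the cache entry.
template <typename SlowLookup>
PipelineHandle get_pipeline(LastUsage &r_last, const PipelineKey &p_key, SlowLookup p_slow_lookup) {
    if (r_last.valid && r_last.key == p_key) {
        return r_last.pipeline;
    }
    r_last.key = p_key;
    r_last.pipeline = p_slow_lookup(p_key);
    r_last.valid = true;
    return r_last.pipeline;
}
```

In the steady state, where nothing changes between frames, each object costs one struct comparison and no table access.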

Using this two-level cache, ALL memory accesses during rendering will effectively happen on these two pools:

PagedAllocator<GeometryInstanceForwardMobile> geometry_instance_alloc;
PagedAllocator<GeometryInstanceSurfaceDataCache> geometry_instance_surface_alloc;

This ensures cache usage is 100% optimal on a large scene.

Use a mixture of shader versions and specialization constants.

Currently, we have these shader versions:

  • SHADER_VERSION_COLOR_PASS,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS,
  • SHADER_VERSION_SHADOW_PASS,
  • SHADER_VERSION_SHADOW_PASS_DP,
  • SHADER_VERSION_DEPTH_PASS_WITH_MATERIAL,
  • SHADER_VERSION_COLOR_PASS_MULTIVIEW,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_MULTIVIEW,
  • SHADER_VERSION_SHADOW_PASS_MULTIVIEW,

The ones that use specialization constants are 4:

  • SHADER_VERSION_COLOR_PASS,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS,
  • SHADER_VERSION_COLOR_PASS_MULTIVIEW,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_MULTIVIEW,

By default the multiview ones are enabled only on XR (Quest), so technically we only care about two with SC:

  • SHADER_VERSION_COLOR_PASS,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS,

Lightmap needs to be a separate version because a lot of code is simply gone when lightmaps are enabled, as indirect light is read from the lightmap. As such, the code that reads it from probes, radiance cubemaps, etc. needs to be disabled.

My proposal here is to extend these versions with a few of the configurations most commonly used in mobile games:

  • A version with lighting but no shadows
  • A version with lighting and directional shadows, but spot/omni lights without shadows
  • A version with all shadows
  • A version with soft shadows and projector

Additionally, all of the above have versions with and without decals, for a total of 8 versions.
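
One way to picture the 8 versions is as a small index space: four lighting configurations times two decal states. The enum and function names below are made up for this sketch and are not the SHADER_VERSION_* identifiers used by Godot.

```cpp
#include <cstdint>

// Hypothetical names for the four lighting configurations listed above.
enum LightingVersion : uint32_t {
    LIGHTING_NO_SHADOWS,
    LIGHTING_DIRECTIONAL_SHADOWS,
    LIGHTING_ALL_SHADOWS,
    LIGHTING_SOFT_SHADOWS_AND_PROJECTOR,
    LIGHTING_VERSION_MAX
};

// Each lighting configuration exists with and without decals.
constexpr uint32_t VERSION_COUNT = LIGHTING_VERSION_MAX * 2;

constexpr uint32_t get_version_index(LightingVersion p_lighting, bool p_decals) {
    return uint32_t(p_lighting) * 2 + (p_decals ? 1 : 0);
}

static_assert(VERSION_COUNT == 8, "4 lighting versions x 2 decal states");
```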

For these SCs:

layout(constant_id = 3) const uint sc_soft_shadow_samples = 4;
layout(constant_id = 4) const uint sc_penumbra_shadow_samples = 4;

layout(constant_id = 5) const uint sc_directional_soft_shadow_samples = 4;
layout(constant_id = 6) const uint sc_directional_penumbra_shadow_samples = 4;

layout(constant_id = 8) const bool sc_projector_use_mipmaps = true;

layout(constant_id = 7) const bool sc_decal_use_mipmaps = true;

These are global settings in Godot; they can be moved to shader defines and trigger a recompilation of all shaders if they change.

This one makes no sense as a specialization constant and should be made a global variable:
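
A rough sketch of what moving them to defines could involve: the settings are turned into a generated #define block, and a full recompilation is needed only when that block changes. The struct, the function names and the macro names are all assumptions made for illustration; only the setting names and default values come from the constants above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical mirror of the global settings listed above.
struct GlobalShaderSettings {
    uint32_t soft_shadow_samples = 4;
    uint32_t penumbra_shadow_samples = 4;
    uint32_t directional_soft_shadow_samples = 4;
    uint32_t directional_penumbra_shadow_samples = 4;
    bool projector_use_mipmaps = true;
    bool decal_use_mipmaps = true;
};

// Builds the #define block that would replace the specialization constants.
// The macro names are invented for this sketch.
std::string build_global_defines(const GlobalShaderSettings &p_settings) {
    std::string defines;
    defines += "#define SOFT_SHADOW_SAMPLES " + std::to_string(p_settings.soft_shadow_samples) + "\n";
    defines += "#define PENUMBRA_SHADOW_SAMPLES " + std::to_string(p_settings.penumbra_shadow_samples) + "\n";
    defines += "#define DIRECTIONAL_SOFT_SHADOW_SAMPLES " + std::to_string(p_settings.directional_soft_shadow_samples) + "\n";
    defines += "#define DIRECTIONAL_PENUMBRA_SHADOW_SAMPLES " + std::to_string(p_settings.directional_penumbra_shadow_samples) + "\n";
    if (p_settings.projector_use_mipmaps) {
        defines += "#define PROJECTOR_USE_MIPMAPS\n";
    }
    if (p_settings.decal_use_mipmaps) {
        defines += "#define DECAL_USE_MIPMAPS\n";
    }
    return defines;
}

// All shaders need recompiling only when the generated block changes.
bool needs_full_recompile(const GlobalShaderSettings &p_old, const GlobalShaderSettings &p_new) {
    return build_global_defines(p_old) != build_global_defines(p_new);
}
```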

layout(constant_id = 15) const float sc_luminance_multiplier = 2.0;

Finally, I would leave the following as specialization constants, which should be mostly harmless:

  • fog
  • refprobes

So finally, I would have the following shader versions:

  • SHADER_VERSION_COLOR_PASS,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS,
  • SHADER_VERSION_COLOR_PASS_NO_SHADOW,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_NO_SHADOW,
  • SHADER_VERSION_COLOR_PASS_DIRECTIONAL_SHADOW,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_DIRECTIONAL_SHADOW,
  • SHADER_VERSION_COLOR_PASS_DECAL,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_DECAL,
  • SHADER_VERSION_COLOR_PASS_NO_SHADOW_DECAL,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_NO_SHADOW_DECAL,
  • SHADER_VERSION_COLOR_PASS_DIRECTIONAL_SHADOW_DECAL,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_DIRECTIONAL_SHADOW_DECAL,
  • SHADER_VERSION_SHADOW_PASS,
  • SHADER_VERSION_SHADOW_PASS_DP,
  • SHADER_VERSION_DEPTH_PASS_WITH_MATERIAL,

// Multiview versions only compiled for mobile XR (Quest), so not much harm done.

  • SHADER_VERSION_COLOR_PASS_MULTIVIEW,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_MULTIVIEW,
  • SHADER_VERSION_COLOR_PASS_NO_SHADOW_MULTIVIEW,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_NO_SHADOW_MULTIVIEW,
  • SHADER_VERSION_COLOR_PASS_DIRECTIONAL_SHADOW_MULTIVIEW,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_DIRECTIONAL_SHADOW_MULTIVIEW,
  • SHADER_VERSION_COLOR_PASS_DECAL_MULTIVIEW,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_DECAL_MULTIVIEW,
  • SHADER_VERSION_COLOR_PASS_NO_SHADOW_DECAL_MULTIVIEW,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_NO_SHADOW_DECAL_MULTIVIEW,
  • SHADER_VERSION_COLOR_PASS_DIRECTIONAL_SHADOW_DECAL_MULTIVIEW,
  • SHADER_VERSION_LIGHTMAP_COLOR_PASS_DIRECTIONAL_SHADOW_DECAL_MULTIVIEW,
