Ethan ethan-tqa

GPU Optimization for GameDev

Graphics Pipeline / GPU Architecture Overview

2011 - A trip through the Graphics Pipeline 2011
2013 - Performance Optimization Guidelines and the GPU Architecture behind them
2015 - Life of a triangle - NVIDIA's logical pipeline
2015 - Render Hell 2.0
2016 - How bad are small triangles on GPU and why?
2017 - GPU Performance for Game Artists
2019 - Understanding the anatomy of GPUs using Pokémon

High-Performance Matrix Multiplication

This is a short post that explains how to write a high-performance matrix multiplication program on modern processors. In this tutorial I will use a single core of the Skylake-client CPU with AVX2, but the principles in this post also apply to other processors with different instruction sets (such as AVX512).

Intro

Matrix multiplication is a mathematical operation that defines the product of

Effective Modern CMake

Getting Started

For a brief user-level introduction to CMake, watch C++ Weekly, Episode 78, Intro to CMake by Jason Turner. LLVM’s CMake Primer provides a good high-level introduction to the CMake syntax. Go read it now.

After that, watch Mathieu Ropert’s CppCon 2017 talk Using Modern CMake Patterns to Enforce a Good Modular Design (slides). It provides a thorough explanation of what modern CMake is and why it is so much better than “old school” CMake. The modular design ideas in this talk are based on the book [Large-Scale C++ Software Design](https://www.amazon.de/Large-Scale-Soft

Orthodox C++

This article has been updated and is available here.

This is my little Christmas-break experiment trying to (among other things) reduce the amount of generated code for containers.

THIS CODE WILL CONTAIN BUGS AND IS ONLY PRESENTED AS AN EXAMPLE.

The C++ STL is still an undesirable library for many reasons I have extolled in the past. But it's also a good library. Demons lie in this here debate and I have no interest in revisiting it right now.

The goals that I have achieved with this approach are:

Please consider using http://lygia.xyz instead of copy/pasting this functions. It expand suport for voronoi, voronoise, fbm, noise, worley, noise, derivatives and much more, through simple file dependencies. Take a look to https://github.com/patriciogonzalezvivo/lygia/tree/main/generative

Generic 1,2,3 Noise

float rand(float n){return fract(sin(n) * 43758.5453123);}

float noise(float p){
	float fl = floor(p);
  float fc = fract(p);

	In shader programming, you often run into a problem where you want to iterate an array in memory over all pixels in a compute shader
	group (tile). Tiled deferred lighting is the most common case. 8x8 tile loops over a light list culled for that tile.

	Simplified HLSL code looks like this:

	Buffer<float4> lightDatas;
	Texture2D<uint2> lightStartCounts;
	RWTexture2D<float4> output;

	[numthreads(8, 8, 1)]

	Setup:
	1. Index buffer containing N quads (each 2 triangles), where N is the max amount of spheres. Repeating pattern of {0,1,2,1,3,2} + K*4.
	2. No vertex buffer.

	Render N*2 triangles, where N is the number of spheres you have.

	Vertex shader:
	1. Sphere index = N/4 (N = SV_VertexId)
	2. Quad coord: Q = float2(N%2, (N%4)/2) * 2.0 - 1.0
	3. Transform sphere center -> pos