Dietmar Suoch didito

Unexpected Uses for the Galois Field Affine Transformation Instruction

Intel added the Galois Field instruction set (GFNI) extensions to their Sunny Cove and Tremont cores. What’s particularly interesting is that GFNI is the only new SIMD extension that came with SSE and VEX/AVX encodings (in addition to EVEX/AVX512), to allow it to be supported on all future Intel cores, including those which don’t support AVX512 (such as the Atom line, as well as Celeron/Pentium branded “big” cores).

I suspect GFNI was aimed at accelerating SM4 encryption; however, one of its instructions can be used for many other purposes. The extension includes three instructions, but of particular interest here is the Affine Transformation instruction (GF2P8AFFINEQB), aka bit-matrix multiply.
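To make "bit-matrix multiply" concrete, here is a minimal scalar sketch of what the instruction computes for a single byte, as I understand Intel's documented pseudocode: result bit i is the parity of (matrix byte 7-i AND the input byte), XORed with bit i of the immediate. The function name `gf2p8affine_byte` and the constant names are my own for illustration; the real instruction applies this to every byte of the vector, taking the matrix from the corresponding 64-bit lane.

```python
def gf2p8affine_byte(matrix: int, x: int, imm8: int = 0) -> int:
    """Emulate the affine transform on one byte (illustrative, not optimized).

    matrix: 64-bit integer holding an 8x8 bit matrix, one row per byte.
    x:      input byte.
    imm8:   constant byte XORed into the result.
    """
    result = 0
    for i in range(8):
        # Row used for output bit i is byte (7 - i) of the 64-bit matrix.
        row = (matrix >> (8 * (7 - i))) & 0xFF
        # Dot product over GF(2): parity of the bitwise AND.
        parity = bin(row & x).count("1") & 1
        result |= (parity ^ ((imm8 >> i) & 1)) << i
    return result


# With this row ordering, the identity matrix has 0x01 in the top byte.
IDENTITY_MATRIX = 0x0102040810204080
# Mirroring the rows reverses the bit order within each byte.
BIT_REVERSE_MATRIX = 0x8040201008040201
```

For example, `gf2p8affine_byte(BIT_REVERSE_MATRIX, 0b11010000)` gives `0b00001011`, and passing `imm8=0xFF` with the identity matrix inverts every bit of the input.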

There have been various articles which discuss out-of-band

How do you descriptor set?

Descriptor sets have vexed me at every step of development. They're new and different, and they have a lot of rules which aren't all that obvious. This document will hopefully lay out everything in one convenient place that I - and also you - can refer to.

First, let's talk about what we're trying to do.

Use Case

Most renderers need some way for shaders to access resources like textures, buffers, etc. For the Vulkan API, this way is the almighty descriptor set. Descriptors, as I understand them, are essentially pointers to resources. You update your descriptor sets with your resources, then you bind the descriptor sets to your command buffer, then shaders involved in subsequent drawcalls can look at the descriptors to know what resources they should actually read from. I'm not entirely sure why there's this indirection - and in fact, on AMD GPUs descriptor sets are actually just pointers - but the indirection exists, and we all have to find a way to deal with it.

Volumetric Clouds Resources List

A. Schneider, "Real-Time Volumetric Cloudscapes," in GPU Pro 7: Advanced Rendering Techniques, 2016, pp. 97-127. (Follow up presentations here, and here.)
S. Hillaire, "Physically Based Sky, Atmosphere and Cloud Rendering in Frostbite" in Physically Based Shading in Theory and Practice course, SIGGRAPH 2016. [video] [course notes] [scatter integral shadertoy]
[R. Högfeldt, "Convincing Cloud Rendering – An Implementation of Real-Time Dynamic Volumetric Clouds in Frostbite"](https://odr.chalmers.se/hand

State of Roblox graphics API across all platforms, with percentage deltas since EOY 2019. Updated December 27 2020.

Windows

API	Share
Direct3D 11+	89% (+4%)
Direct3D 10.1	7% (-2%)
Direct3D 10.0	3.5% (-1.5%)
Direct3D 9	0.5% (-0.5%)

GPU Optimization for Games

By person (random order)

Emil Persson @Humus
- Blog
- <2013> Low-Level Thinking in High-Level Shading Languages
- <2014> Low-Level Shader Optimization for Next-Gen and DX11
- <2018> Rule of optimization
Matt Pettineo @mynameismjp

The counters that are the easiest to understand and the best for making ratios that are internally consistent (i.e., always fall in the range 0.0 to 1.0) are the mem_load_retired events, e.g., mem_load_retired.l1_hit and mem_load_retired.l1_miss.

These count at the instruction level, i.e., over the universe of retired instructions. For example, you could make a reasonable hit ratio from mem_load_retired.l1_hit / mem_inst_retired.all_loads and it will be sane (it will never indicate a hit rate of more than 100%, for example).
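As a small sketch of that ratio, assuming you have already collected the two counter values (the function name and the numbers in the example are made up for illustration):

```python
def l1_hit_ratio(l1_hit: int, all_loads: int) -> float:
    """Return mem_load_retired.l1_hit / mem_inst_retired.all_loads.

    Both arguments are raw event counts from the same measurement run.
    Returns 0.0 when no loads retired, to avoid dividing by zero.
    """
    if all_loads == 0:
        return 0.0
    return l1_hit / all_loads


# Hypothetical counts, not measurements from a real run.
example_ratio = l1_hit_ratio(l1_hit=900_000, all_loads=1_000_000)
```

With those made-up counts, `example_ratio` is 0.9, i.e. 90% of retired loads hit in L1.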

That one isn't perfect though, in that it may not reflect the true costs of cache misses and the behavior of the program for at least the following reasons:

It applies only to loads and can't catch misses imposed by stores (AFAICT there is no event that counts store misses).
It only counts loads that retire - a lot of the load activity in your process may be due to loads on a speculative path that never retire. Loads on a speculative path may bring in data that is never used, causing misses and d

GPU Optimization for GameDev

Graphics Pipeline / GPU Architecture Overview

2011 - A trip through the Graphics Pipeline 2011
2013 - Performance Optimization Guidelines and the GPU Architecture behind them
2015 - Life of a triangle - NVIDIA's logical pipeline
2015 - Render Hell 2.0
2016 - How bad are small triangles on GPU and why?
2017 - GPU Performance for Game Artists
2019 - Understanding the anatomy of GPUs using Pokémon

	// SV - Save Version
	// This file has an API for saving binary data with versioning support for backwards compatibility.
	// Original API design by Media Molecule.
	// See: https://gist.githubusercontent.com/OswaldHurlem/4810ad510669097db872c6de305c9df0/raw/2fdf47eead527e954d29950aa41debf34547e5bd/mmalex_serialization_and_formats.log
	//
	// Design specs:
	// + Very fast reads/writes
	// + Backwards compat
	// - Not self-describing (serializes opaque data)
	// + The code itself describes the data, and versioning, all in one place
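The last design point (the code itself describes the data and the versioning in one place) can be sketched roughly as below. This is not the SV API itself, only a minimal Python illustration of the idea under my own assumptions: every name here (`Serializer`, `u32`, `serialize_player`, the `health` and `armor` fields, the version numbers) is hypothetical. Each field records the version in which it was added, and one function drives both reading and writing.

```python
import struct
from io import BytesIO

LATEST_VERSION = 2  # hypothetical: bump whenever a field is added


class Serializer:
    """Reads or writes little-endian fields, prefixed by a version number."""

    def __init__(self, stream, writing, version=LATEST_VERSION):
        self.stream = stream
        self.writing = writing
        if writing:
            self.version = version
            stream.write(struct.pack("<I", version))
        else:
            # When reading, the file tells us which version wrote it.
            (self.version,) = struct.unpack("<I", stream.read(4))

    def u32(self, added_in, value, default=0):
        # Fields newer than the file's version are absent: use the default.
        if self.version < added_in:
            return default
        if self.writing:
            self.stream.write(struct.pack("<I", value))
            return value
        (loaded,) = struct.unpack("<I", self.stream.read(4))
        return loaded


def serialize_player(s, player):
    # One function describes the layout and its history for both directions.
    player["health"] = s.u32(1, player.get("health", 0))
    player["armor"] = s.u32(2, player.get("armor", 0), default=50)
    return player


def save(player, version=LATEST_VERSION):
    buf = BytesIO()
    serialize_player(Serializer(buf, writing=True, version=version), dict(player))
    return buf.getvalue()


def load(data):
    return serialize_player(Serializer(BytesIO(data), writing=False), {})
```

Loading a blob written with `version=1` yields the stored `health` plus the default `armor`, which is how old files stay readable without the format being self-describing.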

	Volition, Inc. Programmer's Test
	Created: October 12, 1999
	Last Revision: Tuesday, January 7, 2003 (MWA)

	Please attempt all questions on this test. Type your answers immediately
	after the questions. If you are unable to solve a problem, typing your
	thoughts as you attempt the problem is useful.

	There are eleven questions on this test. If you get stuck on one, move to the
	next one. Please be sure that you completely understand the problem