<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0"><channel><title>Blog</title><link>https://www.gfxstrand.net/faith/blog/</link><description>Blog</description><docs>http://www.rssboard.org/rss-specification</docs><generator>python-feedgen</generator><language>en</language><lastBuildDate>Mon, 26 May 2025 17:50:46 +0000</lastBuildDate><item><title>Descriptors are hard</title><link>https://www.gfxstrand.net/faith/blog/2022/08/descriptors-are-hard/</link><description>&lt;h1 id="descriptors-are-hard"&gt;Descriptors are hard&lt;/h1&gt;
&lt;p&gt;Over the weekend, I &lt;a href="https://twitter.com/jekstrand_/status/1556494610222010369"&gt;asked
on twitter&lt;/a&gt; if people would be interested in a rant about descriptor
sets. As of the writing of this post, it has 46 likes so I’ll count that
as a yes.&lt;/p&gt;
&lt;p&gt;I kind-of hate descriptor sets…&lt;/p&gt;
&lt;p&gt;Well, not descriptor sets per se. More descriptor set layouts. The
fundamental problem, I think, was that we too closely tied memory layout
to the shader interface. The Vulkan model works ok if your objective is
to implement GL on top of Vulkan. You want 32 textures, 16 images, 24
UBOs, etc. and everything in your engine fits into those limits. As long
as they’re always separate bindings in the shader, it works fine. It
also works fine if you attempt to implement HLSL SM6.6 bindless on top
of it. Have one giant descriptor set with all resources ever in giant
arrays and pass indices into the shader somehow as part of the
material.&lt;/p&gt;
&lt;p&gt;The moment you want to use different binding interfaces in different
shaders (pretty common if artists author shaders), things start to get
painful. If you want to avoid excess descriptor set switching, you need
multiple pipelines with different interfaces to use the same set. This
makes the already painful situation with pipelines worse. Now you need
to know the binding interfaces of all pipelines that are going to be
used together so you can build the combined descriptor set layout and
you need to know that before you can compile ANY pipelines. We tried to
solve this a bit with multiple descriptor sets and pipeline layout
compatibility which is supposed to let you mix-and-match a bit. It’s
probably good enough for VS/FS mixing but not for mixing whole
materials.&lt;/p&gt;
&lt;h2 id="the-problem-space"&gt;The problem space&lt;/h2&gt;
&lt;p&gt;So, how did we get here? As with most things in Vulkan, a big part of
the problem is that Vulkan targets a very diverse spread of hardware and
everyone does descriptor binding a bit differently. In order to
understand the problem space a bit, we need to look at the hardware…&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;DISCLAIMER:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I’m about to spill a truckload of hardware beans. Let me reassure you
all that I am not violating any NDAs here. Everything I’m about to tell
you is either publicly documented (AMD and Intel) or can be gleaned from
reading public Mesa source code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Descriptor binding methods in hardware can be roughly broken down
into 4 broad categories, each with its own advantages and
disadvantages:&lt;/p&gt;
&lt;ol type="1"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Direct access (D):&lt;/strong&gt; This is where the shader
passes the entire descriptor to the access instruction directly. The
descriptor may have been loaded from a buffer somewhere but the shader
instructions do not reference that buffer in any way; they just take
what they’re given. The classic example here is implementing SSBOs as
“raw” pointer access. Direct access is extremely flexible because the
descriptors can live literally anywhere but it comes at the cost of
having to pass the full descriptor through the shader every
time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Descriptor buffers (B):&lt;/strong&gt; Instead of passing the
entire descriptor through the shader, descriptors live in a buffer. The
buffers themselves are bound to fixed binding points or have their base
addresses pushed into the shader somehow. The shader instruction takes
either a fixed descriptor buffer binding index or a base address (as
appropriate) along with some form of offset to the descriptor in the
buffer. The difference between this and the direct access model is that
the descriptor data lives in some other bit of memory that the hardware
must first read before it can do the actual access. Changing buffer
bindings, while definitely not free, is typically not incredibly
expensive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Descriptor heaps (H):&lt;/strong&gt; Descriptors of a
particular type all live in a single global table or heap. Because the
table is global, changing it typically involves a full GPU stall and
maybe dumping a bunch of caches. This makes changing out the table
fairly expensive. Shader instructions which access these descriptors are
passed an index into the global table. Because everything is fixed and
global, this requires the least amount of data to pass through the
shader of the three bindless mechanisms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fixed HW bindings (F):&lt;/strong&gt; In this model, resources
are bound to fixed HW slots, often by setting registers from the command
streamer or filling out small tables in memory. With the push towards
bindless, fixed HW bindings are typically only used for fixed-function
things on modern hardware such as render targets and vertex, index, and
streamout buffers. However, we still need to consider them because
Vulkan 1.0 was designed to support pre-bindless hardware which might not
be quite as nice.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
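&lt;p&gt;To make the distinction concrete, here’s a toy Python model of what
the shader hands to the access instruction in each of the four cases.
Everything in it (the dictionary-based descriptor format, the table
names, the slot numbers) is made up for illustration and doesn’t match
any particular piece of hardware.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Toy model of the four binding categories (D, B, H, F). All names and the
# descriptor format are hypothetical; real hardware differs in the details.
MEMORY = {0x1000: "texel data A", 0x2000: "texel data B"}

# A "descriptor" here is just a dict holding the resource address.
DESC_A = {"addr": 0x1000}
DESC_B = {"addr": 0x2000}

GLOBAL_HEAP = [DESC_A, DESC_B]              # H: one global table
DESCRIPTOR_BUFFERS = {0: [DESC_B, DESC_A]}  # B: buffers bound to binding points
FIXED_SLOTS = {3: DESC_A}                   # F: slots filled by the command streamer


def access_direct(descriptor):
    # D: the shader passes the whole descriptor to the instruction.
    return MEMORY[descriptor["addr"]]


def access_buffer(binding, offset):
    # B: the shader passes a buffer binding plus an offset into that buffer.
    return MEMORY[DESCRIPTOR_BUFFERS[binding][offset]["addr"]]


def access_heap(index):
    # H: the shader passes only an index into the global table.
    return MEMORY[GLOBAL_HEAP[index]["addr"]]


def access_fixed(slot):
    # F: the instruction names a hardware slot; no descriptor data is passed.
    return MEMORY[FIXED_SLOTS[slot]["addr"]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The amount of data the shader has to carry shrinks as you go from
&lt;code&gt;access_direct&lt;/code&gt; to &lt;code&gt;access_heap&lt;/code&gt;, which is the
trade-off described in the list above.&lt;/p&gt;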
&lt;p&gt;Here’s a quick run-down on where things sit with most of the hardware
shipping today:&lt;/p&gt;
&lt;table style="width:100%;"&gt;
&lt;colgroup&gt;
&lt;col style="width: 18%"/&gt;
&lt;col style="width: 13%"/&gt;
&lt;col style="width: 11%"/&gt;
&lt;col style="width: 13%"/&gt;
&lt;col style="width: 11%"/&gt;
&lt;col style="width: 12%"/&gt;
&lt;col style="width: 9%"/&gt;
&lt;col style="width: 9%"/&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class="header"&gt;
&lt;th style="text-align: left;"&gt;Hardware&lt;/th&gt;
&lt;th style="text-align: center;"&gt;Textures&lt;/th&gt;
&lt;th style="text-align: center;"&gt;Images&lt;/th&gt;
&lt;th style="text-align: center;"&gt;Samplers&lt;/th&gt;
&lt;th style="text-align: center;"&gt;Border Colors&lt;/th&gt;
&lt;th style="text-align: center;"&gt;Typed buffers&lt;/th&gt;
&lt;th style="text-align: center;"&gt;UBOs&lt;/th&gt;
&lt;th style="text-align: center;"&gt;SSBOs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class="odd"&gt;
&lt;td style="text-align: left;"&gt;NVIDIA (Kepler+)&lt;/td&gt;
&lt;td style="text-align: center;"&gt;H&lt;/td&gt;
&lt;td style="text-align: center;"&gt;H&lt;/td&gt;
&lt;td style="text-align: center;"&gt;H&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;H&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D/F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="even"&gt;
&lt;td style="text-align: left;"&gt;AMD&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;td style="text-align: center;"&gt;H&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td style="text-align: left;"&gt;Intel (Skylake+)&lt;/td&gt;
&lt;td style="text-align: center;"&gt;H&lt;/td&gt;
&lt;td style="text-align: center;"&gt;H&lt;/td&gt;
&lt;td style="text-align: center;"&gt;H&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;H&lt;/td&gt;
&lt;td style="text-align: center;"&gt;H/D/F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;H/D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="even"&gt;
&lt;td style="text-align: left;"&gt;Intel (pre-Skylake)&lt;/td&gt;
&lt;td style="text-align: center;"&gt;F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D/F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;F&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td style="text-align: left;"&gt;Arm (Valhall+)&lt;/td&gt;
&lt;td style="text-align: center;"&gt;B&lt;/td&gt;
&lt;td style="text-align: center;"&gt;B&lt;/td&gt;
&lt;td style="text-align: center;"&gt;B&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;B&lt;/td&gt;
&lt;td style="text-align: center;"&gt;B/D/F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;B/D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="even"&gt;
&lt;td style="text-align: left;"&gt;Arm (pre-Valhall)&lt;/td&gt;
&lt;td style="text-align: center;"&gt;F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D/F&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="odd"&gt;
&lt;td style="text-align: left;"&gt;Qualcomm (a5xx+)&lt;/td&gt;
&lt;td style="text-align: center;"&gt;B&lt;/td&gt;
&lt;td style="text-align: center;"&gt;B&lt;/td&gt;
&lt;td style="text-align: center;"&gt;B&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;B&lt;/td&gt;
&lt;td style="text-align: center;"&gt;B&lt;/td&gt;
&lt;td style="text-align: center;"&gt;B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class="even"&gt;
&lt;td style="text-align: left;"&gt;Broadcom (vc5)&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;td style="text-align: center;"&gt;D&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The line above for “Intel (pre-Skylake)” is a bit misleading. I’m
labeling everything as fixed HW bindings but it’s actually a bit more
flexible than most fixed HW binding mechanisms. It’s a sort of heap model
but where, instead of indexing into heaps directly from the shader,
everything goes through a second layer of indirection called a binding
table which is restricted to 240 entries. On Skylake and later hardware,
the binding table hardware still exists and uses a different set of
heaps which provides a nice back-door for drivers. More on that when we
talk about D3D12.&lt;/p&gt;
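&lt;p&gt;As a rough sketch of that second layer of indirection, assuming a
heap that is just a Python list and a binding table that is a list of
heap indices (both hypothetical simplifications):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical model of a binding table: a small table of heap indices
# that sits between the shader and the heap itself.
MAX_BINDING_TABLE_ENTRIES = 240  # the limit quoted in the text


def build_binding_table(heap_indices):
    # The driver fills the table; the hardware limit caps its size.
    if len(heap_indices) not in range(MAX_BINDING_TABLE_ENTRIES + 1):
        raise ValueError("binding table is limited to 240 entries")
    return list(heap_indices)


def lookup(heap, binding_table, binding_index):
    # First indirection: binding index to heap index.
    # Second indirection: heap index to descriptor.
    return heap[binding_table[binding_index]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The shader only ever sees the small binding index, so the driver can
repoint an entry at a different heap slot without touching the
shader.&lt;/p&gt;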
&lt;h2 id="the-vulkan-1.0-descriptor-set-model"&gt;The Vulkan 1.0 descriptor
set model&lt;/h2&gt;
&lt;p&gt;As you can see from above, the hardware landscape is quite diverse
when it comes to descriptor binding. Everyone has made slightly
different choices depending on the type of descriptor and picking a
single model for everyone isn’t easy. The Vulkan answer to this was, of
course, descriptor sets and their dreaded layouts.&lt;/p&gt;
&lt;p&gt;Ignoring UBOs for the moment, the mapping from the Vulkan API to
these hardware descriptors is conceptually fairly simple. The descriptor
set layout describes a set of bindings, each with a binding type and a
number of descriptors in that binding. The driver maps the binding type
to the type of HW binding it uses and computes how much GPU or CPU
memory is needed to store all the bindings. Fixed HW bindings are
typically stored CPU-side and the actual bindings get set as part of
&lt;code&gt;vkCmdBindDescriptorSets()&lt;/code&gt; or
&lt;code&gt;vkCmdDraw/Dispatch()&lt;/code&gt;. For everything in one of the three
bindless categories, they allocate GPU memory. For heap descriptors,
descriptors may be allocated as part of the descriptor set or, to save
memory, as part of the image or buffer view object. Given that
descriptor heaps are often limited in size, allocating them as part of
the view object is often preferred.&lt;/p&gt;
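&lt;p&gt;The “compute how much memory is needed” step can be sketched roughly
like this. The per-type descriptor sizes are invented for the example;
real sizes depend entirely on the hardware and driver.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical per-descriptor sizes in bytes; real values are hardware-specific.
DESCRIPTOR_SIZE = {"sampled_image": 32, "sampler": 16, "storage_buffer": 16}


def layout_offsets(bindings):
    """Return (offsets, total_size) for a list of (type, count) bindings."""
    offsets = []
    cursor = 0
    for binding_type, count in bindings:
        # Each binding starts where the previous one ended.
        offsets.append(cursor)
        cursor += DESCRIPTOR_SIZE[binding_type] * count
    return offsets, cursor


offsets, total = layout_offsets(
    [("sampled_image", 4), ("sampler", 2), ("storage_buffer", 1)]
)
print(offsets, total)  # [0, 128, 160] 176&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A real driver would also deal with alignment and with bindings that
don’t live in GPU memory at all, which this sketch ignores.&lt;/p&gt;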
&lt;p&gt;UBOs get weird. I’m not going to try and go into all of the details
because there are often heuristics involved and it gets complicated
fast. However, as you can see from the above table, most hardware has
some sort of fixed HW binding for UBOs, even on bindless hardware. This
is because UBOs are the hottest of hot paths and even small differences
in UBO fetch speed turn into real FPS differences in games. This is why,
even with descriptor indexing, UBOs aren’t required to support
update-after-bind. The Intel Linux driver has three or four different
paths a UBO may take based on how often it’s used relative to other
UBOs, update-after-bind, and which shader stage it’s being accessed
from.&lt;/p&gt;
&lt;p&gt;The other thing I have yet to mention is dynamic buffers. These
typically look like a fixed HW binding. How they’re implemented varies
by hardware and driver. Often they use fixed HW bindings or the
descriptors are loaded into the shader as push constants. Even if the
buffer pointer comes from descriptor set memory, the dynamic offset has
to get loaded in via some push-like mechanism.&lt;/p&gt;
&lt;h2 id="the-d3d12-descriptor-heap"&gt;The D3D12 descriptor heap&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;DISCLAIMER:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Again, I’m going to talk details here. Again, in spite of the fact
that there are exactly zero open-source D3D12 drivers, I can safely say
that I’m not violating any NDAs. I’ve literally never seen the inside of
a D3D12 driver. I’ve just read public documentation and am familiar with
how hardware works and is driven. This is all based on D3D12 drivers
I’ve written in my head, not the real deal. I may get a few things
wrong.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For D3D12, Microsoft took a very different approach. They embraced
heaps. D3D12 has these heavy-weight descriptor heap objects which have
to be bound before you can execute any 3D or compute commands. Shaders
have the usual HLSL register notation for describing the descriptor
interface. When shaders are compiled into pipelines, descriptor tables
are used to map the bindings in the shader to ranges in the relevant
descriptor heap. While the size of a descriptor heap range remains
fixed, each such range has a dynamic offset which allows the application
to move it around at will.&lt;/p&gt;
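&lt;p&gt;Here’s a minimal sketch of that mapping, assuming a descriptor table
covers one contiguous run of shader registers. The class and its field
names are hypothetical and are not the D3D12 API.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical model: a descriptor table maps a contiguous run of shader
# registers onto a range of the bound heap, starting at a movable offset.
class DescriptorTable:
    def __init__(self, first_register, size):
        self.first_register = first_register  # e.g. 0 for register t0
        self.size = size                      # fixed once the pipeline is built
        self.heap_offset = 0                  # moved freely by the application

    def heap_index(self, register):
        slot = register - self.first_register
        if slot not in range(self.size):
            raise IndexError("register is outside this descriptor table")
        return self.heap_offset + slot


table = DescriptorTable(first_register=0, size=4)
table.heap_offset = 100
print(table.heap_index(2))  # 102
table.heap_offset = 500     # same shader, different part of the heap
print(table.heap_index(2))  # 502&lt;/code&gt;&lt;/pre&gt;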
&lt;p&gt;With SM6.6, Microsoft added significant flexibility and further
embraced heaps. Now, instead of having to use descriptor tables in the
root descriptor, applications can use heap indices directly. This
provides a full bindless experience. All the application developer has
to do is manage heap allocations with resource lifetimes and figure out
how to get indices into their shader. Gone are the days of fiddling with
fixed interface layouts through side-band pipeline create APIs. From
what I’ve heard, most developers love it.&lt;/p&gt;
&lt;p&gt;If D3D12 has embraced heaps, how does it work on AMD? They use
descriptor buffers, don’t they? Yup. But, fortunately for Microsoft, a
descriptor heap is just a very restrictive descriptor buffer. The AMD
driver just uses two of their descriptor buffer bindings (resource and
sampler heaps are separate in D3D12) and implements the heap as a
descriptor buffer.&lt;/p&gt;
&lt;p&gt;One downside to the descriptor heap approach is that it forces some
amount of extra indirection, especially with the SM6.6 bindless model.
If your application is using bindless, you first have to load a heap
index from a constant buffer somewhere and then pass that to the
load/store op. The load/store turns into a sequence of instructions that
fetches the descriptor from the heap, does the offset calculation, and
then does the actual load or store from the corresponding pointer.
Depending on how often the shader does this, how many unique descriptors
are involved, and the compiler’s ability to optimize away redundant
descriptor fetches, this can add up to real shader time in a hurry.&lt;/p&gt;
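&lt;p&gt;Spelled out as a toy Python function, the chain looks something like
the following. The data layout (a list for the constant buffer,
dictionaries for the heap and for memory) is purely an assumption made
to keep the example small.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Toy model of a bindless load through a descriptor heap. The layout of
# the constant buffer, heap, and memory is hypothetical.
def bindless_load(constant_buffer, heap, memory, material_slot, element):
    heap_index = constant_buffer[material_slot]  # 1: fetch the heap index
    descriptor = heap[heap_index]                # 2: fetch the descriptor
    # 3: offset calculation from the descriptor contents
    address = descriptor["base"] + element * descriptor["stride"]
    return memory[address]                       # 4: the actual load


constant_buffer = [7]
heap = {7: {"base": 64, "stride": 4}}
memory = {72: "payload"}
print(bindless_load(constant_buffer, heap, memory, 0, 2))  # payload&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Steps 1 and 2 are dependent memory reads that happen before the load
the shader actually wanted, which is why redundant descriptor fetches
are worth optimizing away.&lt;/p&gt;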
&lt;p&gt;The other major downside to the D3D12 model is that handing control
of the hardware heaps to the application really ties driver writers’
hands. Any time the client does a copy or blit operation which isn’t
implemented directly in the DMA hardware, the driver has to spin up the
3D hardware, set up a pipeline, and do a few draws. In order to do a
blit, the pixel shader needs to be able to read from the blit source
image. This means it needs a texture or UAV descriptor which needs to
live in the heap which is now owned by the client. On AMD, this isn’t a
problem because they can re-bind descriptor sets relatively cheaply or
just use one of the high descriptor set bindings which they’re not using
for heaps. On Intel, they have the very convenient back-door I mentioned
above where the old binding table hardware still exists for fragment
shaders.&lt;/p&gt;
&lt;p&gt;Where this gets especially bad is on NVIDIA, which is a bit ironic
given that the D3D12 model is basically exactly NVIDIA hardware. NVIDIA
hardware only has one texture/image heap and switching it is expensive.
How do they implement these DMA operations, then? First off, as far as I
can tell, the only DMA operation in D3D12 that isn’t directly supported
by NVIDIA’s DMA engine is MSAA resolves. D3D12 doesn’t have an
equivalent of &lt;code&gt;vkCmdBlitImage()&lt;/code&gt;. Applications are told to
implement that themselves if they really want it. What saves them, I
think (I can’t confirm), is that D3D12 exposes &lt;span class="math inline"&gt;10&lt;sup&gt;6&lt;/sup&gt;&lt;/span&gt; descriptors to the application
but NVIDIA hardware supports &lt;span class="math inline"&gt;2&lt;sup&gt;20&lt;/sup&gt;&lt;/span&gt; descriptors. That leaves about
48k descriptors for internal usage. Some of those are reserved by
Microsoft for tools such as PIX but I’m guessing a few of them are
reserved for the driver as well. As long as the hardware is able to copy
descriptors around a bit (NVIDIA is very good at doing tiny DMA ops),
they can manage their internal descriptors inside this range. It’s not
ideal, but it does work.&lt;/p&gt;
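&lt;p&gt;For anyone who wants to check the arithmetic behind that “about 48k”
figure, using only the two numbers quoted above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;HARDWARE_DESCRIPTORS = 2 ** 20  # 1,048,576 supported by the hardware
EXPOSED_DESCRIPTORS = 10 ** 6   # 1,000,000 exposed to the application

# Whatever is left over is available for tools and driver-internal use.
spare = HARDWARE_DESCRIPTORS - EXPOSED_DESCRIPTORS
print(spare)  # 48576&lt;/code&gt;&lt;/pre&gt;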
&lt;h2 id="towards-a-better-future"&gt;Towards a better future?&lt;/h2&gt;
&lt;p&gt;I have nothing to announce but me and others have been thinking about
descriptors in Vulkan and how to make them better. I think we should be
able to do something that’s better than the descriptor sets we have
today. What is that? I’m personally not sure yet.&lt;/p&gt;
&lt;p&gt;The good news is that, if we’re willing to ignore non-bindless
hardware (I think we are for forward-looking things), there are really
only two models: heaps and buffers. (Anything direct access can be
stored in the heap or buffer and it won’t hurt anything.) I too can hear
the siren call of D3D12 heaps but I’d really like to avoid tying the
drivers’ hands like that. Even if NVIDIA were to rework their hardware to
support two heaps today to get around the internal descriptors problem
and make it part of the next generation of GPUs, we wouldn’t be able to
rely on users having that for 5-10 years, longer depending on
application targets.&lt;/p&gt;
&lt;p&gt;If we keep letting drivers manage their own heaps, D3D12 layering
on top of Vulkan becomes difficult. D3D12 doesn’t have image or buffer
view objects in the same sense that Vulkan does. You just create
descriptors and stick them in the heap somewhere. This means we either
need to come up with a way to get rid of view objects in Vulkan or a
D3D12 layer needs a giant cache of view objects, the lifetimes of which
are difficult to manage to say the least. It’s quite the pickle.&lt;/p&gt;
&lt;p&gt;As with many of my rant posts, I don’t really have a solution. I’m
not even really asking for feedback and ideas. My primary goal is to
educate people and help them understand the problem space. Graphics is
insanely complicated and hardware vendors are notoriously cagey about
the details. I’m hoping that, by demystifying things a bit, I can at the
very least garner a bit of sympathy for what we at Khronos are trying to
do and help people understand that it’s a near miracle that we’ve gotten
where we are. 😅&lt;/p&gt;
</description><guid isPermaLink="true">https://www.gfxstrand.net/faith/blog/2022/08/descriptors-are-hard/</guid><pubDate>Mon, 08 Aug 2022 15:09:00 -0500</pubDate></item><item><title>In defense of NIR</title><link>https://www.gfxstrand.net/faith/blog/2022/01/in-defense-of-nir/</link><description>&lt;h1 id="in-defense-of-nir"&gt;In defense of NIR&lt;/h1&gt;
&lt;p&gt;NIR has been an integral part of the Mesa driver stack for about six
or seven years now (depending on how you count) and a lot has changed
since NIR first landed at the end of 2014 and I wrote my initial &lt;a href="https://www.gfxstrand.net/faith/projects/mesa/nir-notes/"&gt;NIR
notes&lt;/a&gt;. Also, for various reasons, I’ve had to give my NIR elevator
pitch a few times lately. I think it’s time for a new post. This time on
why, after working on this mess for seven years, I still think NIR was
the right call.&lt;/p&gt;
&lt;h2 id="a-bit-of-history"&gt;A bit of history&lt;/h2&gt;
&lt;p&gt;Shortly after I joined the Mesa team at Intel in the summer of 2014,
I was sitting in the cube area asking Ken questions, trying to figure
out how Mesa was put together, and I asked, “Why don’t you use LLVM?”
Suddenly, all eyes turned towards Ken and myself and I realized I’d
poked a bear. Ken calmly explained a bunch of the packaging/shipping
issues around having your compiler in a different project as well as
issues radeonsi had run into with apps bundling their own LLVM that
didn’t work. But for the more technical question of whether or not it
was a good idea, his answer was something about trade-offs and how it’s
really not clear if LLVM would really gain them much.&lt;/p&gt;
&lt;p&gt;That same summer, Connor Abbott showed up as our intern and started
developing NIR. By the end of the summer, he had a bunch of data
structures, a few mostly untested passes, and a validator. He also had
most of a GLSL IR to NIR pass which mostly passed validation. Later that
year, after Connor had gone off to school, I took over NIR, finished the
Intel scalar back-end NIR consumer, fixed piles of bugs, and wrote
out-of-SSA and a bunch of optimization passes to get it to the point
where we could finally land it in the tree at the end of 2014.
Initially, it was only a few Intel folks and Emma Anholt (Broadcom, at
the time) who were all that interested in NIR. Today, it’s integral to
the Mesa project and at the core of every driver that’s still seeing
active development. Over the past seven years, we (the Mesa community)
have poured thousands of man hours (probably millions of engineering
dollars) into NIR and it’s gone from something only capable of handling
fragment shaders to supporting full Vulkan 1.2 plus ray-tracing (task
and mesh are coming) along with OpenCL 1.2 compute.&lt;/p&gt;
&lt;p&gt;Was it worth it? That’s the multi-million dollar (literally)
question. 2014 was a simpler time. Compute shaders were still newish and
people didn’t use them for all that much more than they would have used
a fancy fragment shader for a couple years earlier. More advanced
features like Vulkan’s variable pointers weren’t even on the horizon.
Had I known at the time how much work we’d have to put into NIR to keep
up, I may have said, “Nah, this is too much effort; let’s just use
LLVM.” If I had, I think I would have made the wrong call.&lt;/p&gt;
&lt;h2 id="distro-and-packaging-issues"&gt;Distro and packaging issues&lt;/h2&gt;
&lt;p&gt;I’d like to get this one out of the way first because, while these
issues are definitely real, it’s easily the least compelling reason to
write a whole new piece of software. Having your compiler in a separate
project and in LLVM in particular comes with an annoying set of
problems.&lt;/p&gt;
&lt;p&gt;First, there’s release cycles. Mesa releases on a rough 3-month
cadence whereas LLVM releases on a 6-month cadence and there’s nothing
syncing the two release cycles. This means that any new feature enabled
in Mesa that requires new LLVM compiler work can’t be enabled until they
pick up a new LLVM. Not only does this make the question “what Mesa
version has X?” unanswerable, it also means every one of these features
needs conditional paths in the driver to be enabled or not depending on
LLVM version. Also, because we can’t guarantee which LLVM version a
distro will choose to pair with any given Mesa version, radeonsi (the
only LLVM-based hardware driver in Mesa) has to support the latest two
releases of LLVM as well as tip-of-tree at all times. While this has
certainly gotten better in recent years, it used to be that LLVM would
switch around C++ data structures on you requiring a bunch of wrapper
classes in Mesa to deal with the mess. (They still reserve the right, it
just happens less these days.)&lt;/p&gt;
&lt;p&gt;Second is bug fixing. What do you do if there’s a compiler bug? You
fix it in LLVM, of course, right? But what if the bug is in an old
version of the AMD LLVM back-end and AMD’s LLVM people refuse to
back-port the fix? You work around it in Mesa, of course! Yup, even
though Mesa and LLVM are both open-source projects that theoretically
have a stable bugfix release cycle, Mesa has to carry LLVM work-around
patches because we can’t get the other team/project to back-port fixes.
Things also get sticky whenever there’s a compiler bug which touches on
the interface between the LLVM back-end compiler and the driver. How do
you fix that in a backwards-compatible way? Sometimes, you don’t. Those
interfaces can be absurdly subtle and complex and sometimes the bug is
in the interface itself so you either have to fix it in LLVM tip-of-tree
and work around it in Mesa for older versions, or you have to break
backwards compatibility somewhere and hope users pick up the LLVM
bug-fix release.&lt;/p&gt;
&lt;p&gt;Third is that some games actually link against LLVM and,
historically, LLVM hasn’t done well with two different versions of it
loaded at the same time. Some of this is LLVM and some of it is the way
C++ shared library loading is handled on Linux. I won’t get into all the
details but the point is that there have been some games in the past
which simply can’t run on radeonsi because of LLVM library version
conflicts. Some of this could probably be solved if Mesa were linked
against LLVM statically but distros tend to be pretty sour on static
linking unless you have a really good reason. A closed-source game
pulling in their own LLVM isn’t generally considered to be a good
reason.&lt;/p&gt;
&lt;p&gt;And that, in the words of Forrest Gump, is all I have to say about
that.&lt;/p&gt;
&lt;h2 id="a-compiler-built-for-gpus"&gt;A compiler built for GPUs&lt;/h2&gt;
&lt;p&gt;One of the key differences between NIR and LLVM is that NIR is a
GPU-focused compiler whereas LLVM is CPU-focused. Yes, AMD has an
upstream LLVM back-end for their GPU hardware, Intel likes to brag about
their out-of-tree LLVM back-end, and many other vendors use it in their
drivers as well, even if their back-ends are closed-source and internal.
However, none of that actually means that LLVM understands GPUs or is
any good at compiling for them. Most HW vendors have made that choice
because they needed LLVM for OpenCL support and they wanted a unified
compiler so they figured out how to make LLVM do graphics. It works but
that doesn’t mean it works well.&lt;/p&gt;
&lt;p&gt;To demonstrate this, let’s look at the following GLSL shader I stole
from the &lt;code&gt;texelFetch&lt;/code&gt; piglit test:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="pygments-cp"&gt;#version 120&lt;/span&gt;

&lt;span class="pygments-cp"&gt;#extension GL_EXT_gpu_shader4: require&lt;/span&gt;
&lt;span class="pygments-cp"&gt;#define ivec1 int&lt;/span&gt;
&lt;span class="pygments-k"&gt;flat&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;varying&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;ivec4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-k"&gt;uniform&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;divisor&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-k"&gt;uniform&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;sampler2D&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;tex&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-k"&gt;out&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;fragColor&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-kt"&gt;void&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;main&lt;/span&gt;&lt;span class="pygments-p"&gt;()&lt;/span&gt;
&lt;span class="pygments-p"&gt;{&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;color&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;texelFetch2D&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;tex&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;ivec2&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;),&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;.&lt;/span&gt;&lt;span class="pygments-n"&gt;w&lt;/span&gt;&lt;span class="pygments-p"&gt;);&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;fragColor&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;color&lt;/span&gt;&lt;span class="pygments-o"&gt;/&lt;/span&gt;&lt;span class="pygments-n"&gt;divisor&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When compiled to NIR, this turns into&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shader: MESA_SHADER_FRAGMENT
name: GLSL3
inputs: 1
outputs: 1
uniforms: 1
ubos: 1
shared: 0
decl_var uniform INTERP_MODE_NONE sampler2D tex (1, 0, 0)
decl_var ubo INTERP_MODE_NONE vec4[1] uniform_0 (0, 0, 0)
decl_function main (0 params)

impl main {
    block block_0:
    /* preds: */
    vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
    vec3 32 ssa_1 = intrinsic load_input (ssa_0) (0, 0, 34, 160) /* base=0 */ /* component=0 */ /* dest_type=int32 */ /* location=32 slots=1 */
    vec1 32 ssa_2 = deref_var &amp;amp;tex (uniform sampler2D)
    vec2 32 ssa_3 = vec2 ssa_1.x, ssa_1.y
    vec1 32 ssa_4 = mov ssa_1.z
    vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod)
    vec4 32 ssa_6 = intrinsic load_ubo (ssa_0, ssa_0) (0, 1073741824, 0, 0, 16) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=0 */ /* range_base=0 */ /* range=16 */
    vec1 32 ssa_7 = frcp ssa_6.x
    vec1 32 ssa_8 = frcp ssa_6.y
    vec1 32 ssa_9 = frcp ssa_6.z
    vec1 32 ssa_10 = frcp ssa_6.w
    vec1 32 ssa_11 = fmul ssa_5.x, ssa_7
    vec1 32 ssa_12 = fmul ssa_5.y, ssa_8
    vec1 32 ssa_13 = fmul ssa_5.z, ssa_9
    vec1 32 ssa_14 = fmul ssa_5.w, ssa_10
    vec4 32 ssa_15 = vec4 ssa_11, ssa_12, ssa_13, ssa_14
    intrinsic store_output (ssa_15, ssa_0) (0, 15, 0, 160, 132) /* base=0 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */
    /* succs: block_1 */
    block block_1:
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, the AMD driver turns it into the following LLVM IR:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="pygments-c"&gt;; ModuleID = 'mesa-shader'&lt;/span&gt;
&lt;span class="pygments-k"&gt;source_filename&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-s"&gt;"mesa-shader"&lt;/span&gt;
&lt;span class="pygments-k"&gt;target&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;datalayout&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-s"&gt;"e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"&lt;/span&gt;
&lt;span class="pygments-k"&gt;target&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;triple&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-s"&gt;"amdgcn--"&lt;/span&gt;

&lt;span class="pygments-k"&gt;define&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;amdgpu_ps&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span 
class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@main&lt;/span&gt;&lt;span class="pygments-p"&gt;(&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;addrspace&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-m"&gt;6&lt;/span&gt;&lt;span class="pygments-p"&gt;)*&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;inreg&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;noalias&lt;/span&gt;&lt;span 
class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;align&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;dereferenceable&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-m"&gt;18446744073709551615&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%0&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;8&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;addrspace&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-m"&gt;6&lt;/span&gt;&lt;span class="pygments-p"&gt;)*&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;inreg&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;noalias&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;align&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;dereferenceable&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-m"&gt;18446744073709551615&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%1&lt;/span&gt;&lt;span 
class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;addrspace&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-m"&gt;6&lt;/span&gt;&lt;span class="pygments-p"&gt;)*&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;inreg&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;noalias&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;align&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;dereferenceable&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-m"&gt;18446744073709551615&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%2&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;8&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;addrspace&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-m"&gt;6&lt;/span&gt;&lt;span class="pygments-p"&gt;)*&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;inreg&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;noalias&lt;/span&gt;&lt;span class="pygments-w"&gt; 
&lt;/span&gt;&lt;span class="pygments-k"&gt;align&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;dereferenceable&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-m"&gt;18446744073709551615&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%3&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;inreg&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%4&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;inreg&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%5&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%6&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span 
class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%7&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%8&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;3&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%9&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%10&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; 
&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%11&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%12&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%13&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%14&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%15&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv 
pygments-nv-Anonymous"&gt;%16&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%17&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%18&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%19&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%20&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%21&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#0&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;
&lt;span class="pygments-nl"&gt;main_body:&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%22&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.interp.mov&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;2&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%5&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%23&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;bitcast&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%22&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;to&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%24&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.interp.mov&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;2&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%5&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%25&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;bitcast&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%24&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;to&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%26&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.interp.mov&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;2&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;2&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%5&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%27&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;bitcast&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%26&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;to&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%28&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;getelementptr&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;inbounds&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;8&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;8&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;addrspace&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-m"&gt;6&lt;/span&gt;&lt;span class="pygments-p"&gt;)*&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%3&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv"&gt;!amdgpu.uniform&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv 
pygments-nv-Anonymous"&gt;!0&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%29&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;load&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;8&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;8&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;addrspace&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-m"&gt;6&lt;/span&gt;&lt;span class="pygments-p"&gt;)*&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%28&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;align&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv"&gt;!invariant.load&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;!0&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%30&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.image.load.mip.2d.v4f32.i32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;15&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%23&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%25&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%27&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span 
class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;8&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%29&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%31&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;ptrtoint&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;addrspace&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-m"&gt;6&lt;/span&gt;&lt;span class="pygments-p"&gt;)*&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;to&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;insertelement&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;poison&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;16&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;163756&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%31&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span 
class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%33&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.s.buffer.load.f32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%34&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.s.buffer.load.f32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%35&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.s.buffer.load.f32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;8&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%36&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.s.buffer.load.f32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;12&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%37&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.rcp.f32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%33&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%38&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.rcp.f32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%34&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%39&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.rcp.f32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%35&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%40&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.rcp.f32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%36&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%41&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;extractelement&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%30&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%42&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;fmul&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%41&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%37&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%43&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;extractelement&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%30&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;1&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%44&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;fmul&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%43&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%38&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%45&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;extractelement&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%30&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;2&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%46&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;fmul&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%45&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%39&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%47&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;extractelement&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%30&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;3&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%48&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;fmul&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%47&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%40&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%49&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;insertvalue&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span 
class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;undef&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%4&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%50&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;insertvalue&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span 
class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%49&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%42&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;5&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%51&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;insertvalue&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span 
class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%50&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%44&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;6&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%52&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;insertvalue&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span 
class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%51&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%46&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;7&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%53&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;insertvalue&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span 
class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%52&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%48&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;8&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%54&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;insertvalue&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span 
class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%53&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%20&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;19&lt;/span&gt;
&lt;span class="pygments-w"&gt;  &lt;/span&gt;&lt;span class="pygments-k"&gt;ret&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span 
class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%54&lt;/span&gt;
&lt;span class="pygments-p"&gt;}&lt;/span&gt;

&lt;span class="pygments-c"&gt;; Function Attrs: nounwind readnone speculatable willreturn&lt;/span&gt;
&lt;span class="pygments-k"&gt;declare&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.interp.mov&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;immarg&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;immarg&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;immarg&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#1&lt;/span&gt;

&lt;span class="pygments-c"&gt;; Function Attrs: nounwind readonly willreturn&lt;/span&gt;
&lt;span class="pygments-k"&gt;declare&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.image.load.mip.2d.v4f32.i32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;immarg&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;8&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;immarg&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span 
class="pygments-k"&gt;immarg&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#2&lt;/span&gt;

&lt;span class="pygments-c"&gt;; Function Attrs: nounwind readnone willreturn&lt;/span&gt;
&lt;span class="pygments-k"&gt;declare&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.s.buffer.load.f32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;immarg&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#3&lt;/span&gt;

&lt;span class="pygments-c"&gt;; Function Attrs: nounwind readnone speculatable willreturn&lt;/span&gt;
&lt;span class="pygments-k"&gt;declare&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.rcp.f32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#1&lt;/span&gt;

&lt;span class="pygments-k"&gt;attributes&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#0&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-s"&gt;"InitialPSInputAddr"&lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-s"&gt;"0xb077"&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-s"&gt;"denormal-fp-math"&lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-s"&gt;"ieee,ieee"&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-s"&gt;"denormal-fp-math-f32"&lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-s"&gt;"preserve-sign,preserve-sign"&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-s"&gt;"target-features"&lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-s"&gt;"+DumpCode"&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&lt;/span&gt;
&lt;span class="pygments-k"&gt;attributes&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#1&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nounwind&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;readnone&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;speculatable&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;willreturn&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&lt;/span&gt;
&lt;span class="pygments-k"&gt;attributes&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nounwind&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;readonly&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;willreturn&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&lt;/span&gt;
&lt;span class="pygments-k"&gt;attributes&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#3&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nounwind&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;readnone&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;willreturn&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&lt;/span&gt;
&lt;span class="pygments-k"&gt;attributes&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nounwind&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;readnone&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&lt;/span&gt;

&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;!0&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;!{}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For those of you who can’t read NIR and/or LLVM or don’t want to sift
through all that, let me reduce it down to the important lines:&lt;/p&gt;
&lt;p&gt;GLSL:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="pygments-kt"&gt;vec4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;color&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;texelFetch2D&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;tex&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;ivec2&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;),&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;.&lt;/span&gt;&lt;span class="pygments-n"&gt;w&lt;/span&gt;&lt;span class="pygments-p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;NIR:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;LLVM:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%30&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;call&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nsz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;arcp&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.image.load.mip.2d.v4f32.i32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;15&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%23&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%25&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%27&lt;/span&gt;&lt;span 
class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;8&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-nv pygments-nv-Anonymous"&gt;%29&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-m"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;

&lt;span class="pygments-c"&gt;; Function Attrs: nounwind readonly willreturn&lt;/span&gt;
&lt;span class="pygments-k"&gt;declare&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;@llvm.amdgcn.image.load.mip.2d.v4f32.i32&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;immarg&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-m"&gt;8&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-p"&gt;&amp;gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;immarg&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;i32&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span 
class="pygments-k"&gt;immarg&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#2&lt;/span&gt;

&lt;span class="pygments-k"&gt;attributes&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nounwind&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;readonly&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;willreturn&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&lt;/span&gt;
&lt;span class="pygments-k"&gt;attributes&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-vg"&gt;#4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;nounwind&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;readnone&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In NIR, a &lt;code&gt;texelFetch()&lt;/code&gt; shows up as a texture
instruction. NIR has a special instruction type just for textures called
&lt;code&gt;nir_tex_instr&lt;/code&gt; to handle the combinatorial explosion of
possibilities when it comes to all the different ways you can access a
texture. In this particular case, the texture opcode is
&lt;code&gt;nir_texop_txf&lt;/code&gt; for a texel fetch and it is passed a texture,
a sampler, a coordinate and an LOD. Pretty standard stuff.&lt;/p&gt;
&lt;p&gt;In AMD-flavored LLVM IR, this turns into a magic intrinsic function
called &lt;code&gt;llvm.amdgcn.image.load.mip.2d.v4f32.i32&lt;/code&gt;. A bunch of
information about the operation such as the fact that it takes a mip
parameter and returns a &lt;code&gt;vec4&lt;/code&gt; is encoded in the function
name. The AMD back-end then knows how to turn this into the right
sequence of hardware instructions to load from a texture.&lt;/p&gt;
&lt;p&gt;There are a couple of important things to note here. First is the
&lt;code&gt;@llvm.amdgcn&lt;/code&gt; prefix on the function name. This is an
entirely AMD-specific function. If I dumped out the LLVM from the Intel
Windows drivers for that same GLSL, it would use a different function
name with a different encoding for the various bits of ancillary
information such as the return type. Even though both drivers share
LLVM, in theory, the way they encode graphics operations is entirely
different. If you looked at NVIDIA, you would find a third encoding.
There is no standardization.&lt;/p&gt;
&lt;p&gt;Why is this important? Well, one of the most common arguments I hear
from people for why we should all be using LLVM for graphics is that
it allows for code sharing. Everyone can leverage all that great work
that happens in upstream LLVM. Except it doesn’t. Not really. Sure, you
can get LLVM’s algebraic optimizations and code motion etc. But you
can’t share any of the optimizations that are really interesting for
graphics because nothing graphics-related is common. Could it be
standardized? Probably. But, in the state it’s in today, any claim that
two graphics compilers are sharing significant optimizations because
they’re both LLVM-based is a half-truth at best. And it will never
become standardized unless someone other than AMD decides to put their
back-end into upstream LLVM and they decide to work together.&lt;/p&gt;
&lt;p&gt;The second important bit about that LLVM function call is that LLVM
has absolutely no idea what that function does. All it knows is that
it’s been decorated &lt;code&gt;nounwind&lt;/code&gt;, &lt;code&gt;readonly&lt;/code&gt;, and
&lt;code&gt;willreturn&lt;/code&gt;. The &lt;code&gt;readonly&lt;/code&gt; gives it a bit of
information so it knows it can move the function call around a bit since
it won’t write misc data. However, it can’t even eliminate redundant
texture ops because, for all LLVM knows, a second call will return a
different result. While LLVM has pretty good visibility into the basic
math in the shader, when it comes to anything that touches image or
buffer memory, it’s flying entirely blind. The Intel LLVM-based graphics
compiler tries to improve this somewhat by using actual LLVM pointers
for buffer memory so LLVM gets a bit more visibility but you still end
up with a pile of out-of-thin-air pointers that all potentially alias
each other so it’s pretty limited.&lt;/p&gt;
&lt;p&gt;In contrast, NIR knows exactly what sort of thing
&lt;code&gt;nir_texop_txf&lt;/code&gt; is and what it does. It knows, for instance,
that, even though it accesses external memory, the API guarantees that
nothing shifts out from under you so it’s fine to eliminate redundant
texture calls. For &lt;code&gt;nir_texop_tex&lt;/code&gt; (&lt;code&gt;texture()&lt;/code&gt; in
GLSL), it knows that it takes implicit derivatives and so it can’t be
moved into non-uniform control-flow. For things like SSBO and workgroup
memory, we know what kind of memory they’re touching and can do alias
analysis that’s actually aware of buffer bindings.&lt;/p&gt;
&lt;h2 id="code-sharing"&gt;Code sharing&lt;/h2&gt;
&lt;p&gt;When people try to justify their use of LLVM to me, there are
typically two major benefits they cite. The first is that LLVM lets them
take advantage of all this academic compiler work. In the previous
section, I explained why this is a weak argument at best. The second is
that embracing LLVM for graphics lets them share code with their compute
compiler. Does that mean that we’re against sharing code? Not at all! In
fact, NIR lets us get far more code sharing than most companies do by
using LLVM.&lt;/p&gt;
&lt;p&gt;The difference is the axis for sharing. This is something I ran into
trying to explain myself to people at Intel all the time. They’re
usually only thinking about how to get the Intel OpenCL driver and the
Intel D3D12 driver to share code. With NIR, we have compiler code shared
effectively across 20 years of hardware from 8 different vendors and
at least 4 APIs. So while Intel’s Linux Vulkan and OpenCL drivers don’t
share a single line of compiler code, it’s not like we went off and
hand-coded a whole compiler stack just for Intel Linux Vulkan.&lt;/p&gt;
&lt;p&gt;As an example of this, consider &lt;code&gt;nir_lower_tex()&lt;/code&gt;, a pass
that lowers various different types of texture operations to other
texture operations. It can, among other things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Lower texture projectors away by doing the division in the
shader,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lower &lt;code&gt;texelFetchOffset()&lt;/code&gt; to
&lt;code&gt;texelFetch()&lt;/code&gt;,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lower rectangle textures by dividing the coordinate by the result
of &lt;code&gt;textureSize()&lt;/code&gt;,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lower texture swizzles to swizzling in the shader,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lower various forms of &lt;code&gt;textureGrad*()&lt;/code&gt; to
&lt;code&gt;textureLod*()&lt;/code&gt; under various conditions,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lower &lt;code&gt;imageSize(i, lod)&lt;/code&gt; with an LOD to
&lt;code&gt;imageSize(i, 0)&lt;/code&gt; and some shader math,&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And much more…&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Exactly what lowering is needed is highly hardware dependent (except
projectors; only old Qualcomm hardware has those) but most of them are
needed by at least two different vendors’ hardware. While most of these
are pretty simple, when you get into things like turning derivatives
into LODs, the calculations get complex and we really don’t want
everyone typing it themselves if we can avoid it.&lt;/p&gt;
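&lt;p&gt;To give a feel for the derivatives-to-LOD case, here is a minimal Python sketch of the textbook isotropic formula: scale the derivatives into texel space, take the longer of the two, and take its log2. This is an illustration only, not the code in &lt;code&gt;nir_lower_tex()&lt;/code&gt;; the function name &lt;code&gt;grad_to_lod&lt;/code&gt; and its arguments are made up, and the real pass also has to deal with cube maps, clamping, and per-hardware quirks.&lt;/p&gt;

```python
import math


def grad_to_lod(ddx, ddy, size):
    """Approximate the LOD that explicit derivatives would select.

    ddx, ddy: derivatives of the normalized texture coordinate.
    size:     texture dimensions in texels (the textureSize() result).
    """
    # Scale the derivatives from normalized coordinates to texel units.
    ddx_texels = [d * s for d, s in zip(ddx, size)]
    ddy_texels = [d * s for d, s in zip(ddy, size)]
    # rho is the larger of the two footprint lengths, in texels per pixel.
    rho = max(
        math.sqrt(sum(d * d for d in ddx_texels)),
        math.sqrt(sum(d * d for d in ddy_texels)),
    )
    # log2(0) is undefined; treat a zero footprint as "most detailed level".
    return math.log2(rho) if rho > 0.0 else float("-inf")


# One texel per pixel, then two texels per pixel, on a 256x256 texture.
print(grad_to_lod((1 / 256, 0.0), (0.0, 1 / 256), (256, 256)))
print(grad_to_lod((2 / 256, 0.0), (0.0, 2 / 256), (256, 256)))
```

&lt;p&gt;Even this simplified version has an edge case (a zero footprint), which is a small hint at why nobody wants to re-derive the full thing per driver.&lt;/p&gt;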
&lt;p&gt;And texture lowering is just one example. We’ve got dozens of passes
for everything from lowering read-only images to textures for OpenCL to
lowering built-in functions like &lt;code&gt;frexp()&lt;/code&gt; to simpler math to
flipping &lt;code&gt;gl_FragCoord&lt;/code&gt; and &lt;code&gt;gl_PointCoord&lt;/code&gt; when
rendering upside down, which is required to implement OpenGL on Linux
window-systems. All that code is in one central place where it’s usable
by all the graphics drivers on Linux.&lt;/p&gt;
&lt;h2 id="tight-driver-integration"&gt;Tight driver integration&lt;/h2&gt;
&lt;p&gt;I mentioned earlier that having your compiler out-of-tree is painful
from a packaging and release point-of-view. What I haven’t addressed yet
is just how tight driver/compiler integration has to be. It depends a
lot on the API and hardware, of course, but the interface between
compiler and driver is often very complex. We make it look very simple
on the API side where you have descriptor sets (or bindings in GL) and
then you access things from them in the shader. Simple, right? Hah!&lt;/p&gt;
&lt;p&gt;In the Intel Linux Vulkan driver, we can access a UBO in one of four
ways depending on a complex heuristic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We try to find up to 4 small UBO ranges of commonly used constants
and push those into the shader as push constants.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If we can’t push it all and it fits inside the hardware’s
240-entry binding table, we create a descriptor for it and put it in the
binding table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Depending on the hardware generation, UBOs successfully bound to
descriptors might be accessed as SSBOs or we might access them through
the texture unit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If we ran out of entries in the binding table or if it’s in a
ray-tracing stage (those don’t have binding tables), we fall back to
doing bounds checking in the shader and access it using raw 64-bit GPU
addresses.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And that’s just UBOs! SSBO binding has a similar level of complexity
and also depends on the SSBO operations done in the shader. Textures
have silent fall-back to bindless if we have too many, etc. In order to
handle all this insanity, we have a compiler pass called
&lt;code&gt;anv_nir_apply_pipeline_layout()&lt;/code&gt; which lives in the driver.
The interface between that pass and the rest of the driver is quite
complex and can communicate information about exactly how things are
actually laid out. We do have to serialize it to put it all in the
pipeline cache so that limits the complexity some but we don’t have to
worry about keeping the interface stable at all because it lives in the
driver.&lt;/p&gt;
&lt;p&gt;We also have passes for handling YCbCr format conversion, turning
multiview into instanced rendering and constructing a
&lt;code&gt;gl_ViewID&lt;/code&gt; in the shader based on the view mask and the
instance number, and a handful of other tasks. Each of these requires
information from the &lt;code&gt;VkPipelineCreateInfo&lt;/code&gt; and some of them
result in magic push constants which the driver has to know need
pushing.&lt;/p&gt;
&lt;p&gt;Trying to do that with your compiler in another project would be
insane. So how does AMD do it with their LLVM compiler? Good question!
They either do it in NIR or as part of the NIR to LLVM conversion. By
the time the shader gets to LLVM, most of the GL or Vulkanisms have been
translated to simpler constructs, keeping the driver/LLVM interface
manageable. It also helps that AMD’s hardware binding model is crazy
simple and was basically designed for an API like Vulkan.&lt;/p&gt;
&lt;h2 id="structured-control-flow"&gt;Structured control-flow&lt;/h2&gt;
&lt;p&gt;One of the riskier decisions we made when designing NIR was to make
all control-flow inherently structured. Instead of branch and
conditional branch instructions like LLVM or SPIR-V has, NIR has
control-flow nodes in a tree structure. The root of the tree is always a
&lt;code&gt;nir_function_impl&lt;/code&gt;. Each function contains a list of
control-flow nodes that may be &lt;code&gt;nir_block&lt;/code&gt;,
&lt;code&gt;nir_if&lt;/code&gt;, or &lt;code&gt;nir_loop&lt;/code&gt;. An if has a condition and
then and else cases. A loop is a simple infinite loop and there are
&lt;code&gt;nir_jump_break&lt;/code&gt; and &lt;code&gt;nir_jump_continue&lt;/code&gt;
instructions which act exactly as their C counterparts.&lt;/p&gt;
&lt;p&gt;At the time, this decision was made from pure pragmatism. We had
structure coming out of GLSL and most of the back-ends expected
structure. Why break everything? It did mean that, when we started
writing control-flow manipulation passes, things were a lot harder. A
dead control-flow pass in an unstructured IR is trivial. For any
conditional branch with a constant condition, delete it if the condition
is false and replace it with an unconditional branch if the condition is
true. Then delete any
unreachable blocks and merge blocks as necessary. Done. In a structured
IR, it’s a lot more fiddly. You have to manually collapse if ladders and
deleting the unconditional break at the end of a loop is equivalent to
loop unrolling. But we got over that hump, built tools to make it less
painful, and have implemented most of the important control-flow
optimizations at this point. In exchange, back-ends get structure which
is something most GPUs want thanks to the SIMT model they use.&lt;/p&gt;
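&lt;p&gt;To make the unstructured recipe concrete, here is a minimal Python sketch of it on a toy CFG. The block representation and the names &lt;code&gt;fold_branches&lt;/code&gt; and &lt;code&gt;remove_unreachable&lt;/code&gt; are invented for illustration; this is not NIR or LLVM code, and the final block-merging step is left out.&lt;/p&gt;

```python
# Toy unstructured CFG: each block name maps to its terminator.
#   ("br", target)                     unconditional branch
#   ("cbr", cond, then_tgt, else_tgt)  conditional branch; cond is True,
#                                      False, or a string for "unknown"
#   ("ret",)                           function exit


def fold_branches(cfg):
    """Turn conditional branches on constant conditions into plain branches."""
    out = {}
    for name, term in cfg.items():
        if term[0] == "cbr" and isinstance(term[1], bool):
            _, cond, then_tgt, else_tgt = term
            term = ("br", then_tgt if cond else else_tgt)
        out[name] = term
    return out


def remove_unreachable(cfg, entry):
    """Keep only the blocks reachable from the entry block."""
    seen, stack = set(), [entry]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        term = cfg[name]
        if term[0] == "br":
            stack.append(term[1])
        elif term[0] == "cbr":
            stack.extend(term[2:])
    return {name: term for name, term in cfg.items() if name in seen}


cfg = {
    "entry": ("cbr", False, "dead", "live"),
    "dead": ("br", "exit"),
    "live": ("cbr", "x_gt_0", "exit", "other"),
    "other": ("br", "exit"),
    "exit": ("ret",),
}
cleaned = remove_unreachable(fold_branches(cfg), "entry")
print(sorted(cleaned))
```

&lt;p&gt;Nothing here needs to know about nesting, which is exactly what makes it easy. The structured equivalent has to rewrite the tree of ifs and loops while keeping it a valid tree.&lt;/p&gt;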
&lt;p&gt;What we didn’t see coming when we made that decision (2014,
remember?) was wave/subgroup ops. In the last several years, the SIMT
nature of shader execution has slowly gone from an implementation detail
to something that’s baked into all modern 3D and compute APIs and shader
languages. With that shift has come the need to be consistent about
re-convergence. If we say “&lt;code&gt;texture()&lt;/code&gt; has to be in uniform
control flow”, is the following shader ok?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="pygments-cp"&gt;#version 120&lt;/span&gt;

&lt;span class="pygments-k"&gt;varying&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-k"&gt;uniform&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;sampler2D&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;tex&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-k"&gt;out&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;fragColor&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-kt"&gt;void&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;main&lt;/span&gt;&lt;span class="pygments-p"&gt;()&lt;/span&gt;
&lt;span class="pygments-p"&gt;{&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-k"&gt;if&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;.&lt;/span&gt;&lt;span class="pygments-n"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-mf"&gt;1.0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;
&lt;span class="pygments-w"&gt;        &lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;.&lt;/span&gt;&lt;span class="pygments-n"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-mf"&gt;1.0&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;

&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;fragColor&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;texture&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;tex&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;);&lt;/span&gt;
&lt;span class="pygments-p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Obviously, it should be. But what guarantees that you’re actually in
uniform control-flow by the time you get to the &lt;code&gt;texture()&lt;/code&gt;
call? In an unstructured IR, once you diverge, it’s really hard to
guarantee convergence. Of course, every GPU vendor with an LLVM-based
compiler has algorithms for trying to maintain or re-create the
structure but it’s always a bit fragile. Here’s an even more subtle
example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="pygments-cp"&gt;#version 120&lt;/span&gt;

&lt;span class="pygments-k"&gt;varying&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-k"&gt;uniform&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;sampler2D&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;tex&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-k"&gt;out&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;fragColor&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-kt"&gt;void&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;main&lt;/span&gt;&lt;span class="pygments-p"&gt;()&lt;/span&gt;
&lt;span class="pygments-p"&gt;{&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-cm"&gt;/* Block 0 */&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-kt"&gt;float&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;.&lt;/span&gt;&lt;span class="pygments-n"&gt;x&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-k"&gt;while&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-mi"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;
&lt;span class="pygments-w"&gt;        &lt;/span&gt;&lt;span class="pygments-cm"&gt;/* Block 1 */&lt;/span&gt;
&lt;span class="pygments-w"&gt;        &lt;/span&gt;&lt;span class="pygments-k"&gt;if&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-mf"&gt;1.0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;
&lt;span class="pygments-w"&gt;            &lt;/span&gt;&lt;span class="pygments-cm"&gt;/* Block 2 */&lt;/span&gt;
&lt;span class="pygments-w"&gt;            &lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;.&lt;/span&gt;&lt;span class="pygments-n"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;x&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-w"&gt;            &lt;/span&gt;&lt;span class="pygments-k"&gt;break&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-w"&gt;        &lt;/span&gt;&lt;span class="pygments-p"&gt;}&lt;/span&gt;

&lt;span class="pygments-w"&gt;        &lt;/span&gt;&lt;span class="pygments-cm"&gt;/* Block 3 */&lt;/span&gt;
&lt;span class="pygments-w"&gt;        &lt;/span&gt;&lt;span class="pygments-n"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;x&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;-&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-mf"&gt;1.0&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-p"&gt;}&lt;/span&gt;

&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-cm"&gt;/* Block 4 */&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;fragColor&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;texture&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;tex&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;tc&lt;/span&gt;&lt;span class="pygments-p"&gt;);&lt;/span&gt;
&lt;span class="pygments-p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The same question of validity holds but there’s something even
trickier in here. Can the compiler merge block 4 and block 2? If so,
where should it put it? To a CPU-centric compiler like LLVM, it looks
like it would be fine to merge the two and put it all in block 2. In
fact, since texture ops are expensive and block 2 is deeper inside
control-flow, it may think the resulting shader would be more efficient
if it did. And it would be wrong on both counts.&lt;/p&gt;
&lt;p&gt;First, the loop exit condition is non-uniform and, since
&lt;code&gt;texture()&lt;/code&gt; takes derivatives, it’s illegal to put it in
non-uniform control-flow. (Yes, in this particular case, the result of
those derivatives might be a bit wonky.) Second, due to the SIMT nature
of execution, you really don’t want the texture op in the loop. In the
worst case, a 32-wide execution will hit block 2 32 separate times
whereas, if you guarantee re-convergence, it only hits block 4 once.&lt;/p&gt;
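&lt;p&gt;A rough way to see that worst case is to simulate the loop lane by lane. The sketch below is a toy Python model, not how any real GPU or compiler works: the &lt;code&gt;simulate&lt;/code&gt; function, the lock-step execution mask, and the starting values are all assumptions, chosen so that every lane exits on a different iteration.&lt;/p&gt;

```python
def simulate(xs):
    """Run the loop from the shader above in lock-step, one x per lane.

    Returns how many times block 2 is issued (once per loop trip in which
    at least one active lane breaks) and how many times block 4 is issued
    (once, after all lanes have re-converged).
    """
    xs = list(xs)
    active = [True] * len(xs)
    block2_issues = 0
    while any(active):
        # Block 1: active lanes whose x has dropped below 1.0 take the break.
        exiting = [a and 1.0 > x for a, x in zip(active, xs)]
        if any(exiting):
            block2_issues += 1  # block 2 runs for the exiting lanes only
        active = [a and not e for a, e in zip(active, exiting)]
        # Block 3: the remaining lanes decrement x and loop again.
        xs = [x - 1.0 if a else x for a, x in zip(active, xs)]
    block4_issues = 1  # everyone is back together after the loop
    return block2_issues, block4_issues


# Worst case for a 32-wide subgroup: every lane exits on its own iteration.
worst = simulate(lane + 0.5 for lane in range(32))
# Best case: all lanes exit on the first iteration.
best = simulate(0.5 for _ in range(32))
print(worst, best)
```

&lt;p&gt;A texture op sitting in block 2 would be issued once per count in the first number, while one in block 4 is issued once no matter how the lanes diverged inside the loop.&lt;/p&gt;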
&lt;p&gt;The fact that NIR’s control-flow is structured from start to finish
has been a hidden blessing here. Once we get the structure figured out
from SPIR-V decorations (which is annoyingly challenging at times), we
never lose that structure and the re-convergence information it implies.
NIR knows better than to move derivatives into non-uniform control-flow
and its code-motion passes are tuned assuming a SIMT execution model.
What has become a constant fight for people working with LLVM is a
non-issue for us. The only thing that has been a challenge has been
dealing with SPIR-V’s less than obvious structure rules and trying to
make sure we properly structurize everything that’s legal. (It’s been
getting better recently.)&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Side-note:&lt;/strong&gt; NIR does support OpenCL SPIR-V which is
unstructured. To handle this, we have &lt;code&gt;nir_jump_goto&lt;/code&gt; and
&lt;code&gt;nir_jump_goto_if&lt;/code&gt; instructions which are allowed only for a
very brief period of time. After the initial SPIR-V to NIR conversion,
we run a couple passes and then structurize. After that, it remains
structured for the rest of the compile.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="algebraic-optimizations"&gt;Algebraic optimizations&lt;/h2&gt;
&lt;p&gt;Every GPU compiler engineer has horror stories about something some
app developer did in a shader. Sometimes it’s the fault of the developer
and sometimes it’s just an artifact of whatever node-based visual shader
building system the game engine presents to the artists and how it’s
been abused. On Linux, however, it can get even more entertaining. Not
only do we have those shaders that were written for DX9 and someone lost
the code so they ran them through a DX9 to HLSL translator and then
through FXC, but they then ported the app to OpenGL so it can run on
Linux and did a DXBC to GLSL conversion with some horrid tool. The end
result is &lt;code&gt;x != 0&lt;/code&gt; implemented with three levels of nested
function calls, multiple splats out to a &lt;code&gt;vec4&lt;/code&gt; and a truly
impressive pile of control-flow. I only wish I were joking….&lt;/p&gt;
&lt;p&gt;To chew through this mess, we have &lt;code&gt;nir_opt_algebraic()&lt;/code&gt;.
We’ve implemented a little language for expressing search-and-replace
rules on expression trees using Python tuples. To get a sense for
what this looks like, let’s look at some excerpts from
&lt;code&gt;nir_opt_algebraic.py&lt;/code&gt; starting with the simple description
at the top:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="pygments-c1"&gt;# Written in the form (&amp;lt;search&amp;gt;, &amp;lt;replace&amp;gt;) where &amp;lt;search&amp;gt; is an expression&lt;/span&gt;
&lt;span class="pygments-c1"&gt;# and &amp;lt;replace&amp;gt; is either an expression or a value.  An expression is&lt;/span&gt;
&lt;span class="pygments-c1"&gt;# defined as a tuple of the form ([~]&amp;lt;op&amp;gt;, &amp;lt;src0&amp;gt;, &amp;lt;src1&amp;gt;, &amp;lt;src2&amp;gt;, &amp;lt;src3&amp;gt;)&lt;/span&gt;
&lt;span class="pygments-c1"&gt;# where each source is either an expression or a value.  A value can be&lt;/span&gt;
&lt;span class="pygments-c1"&gt;# either a numeric constant or a string representing a variable name.&lt;/span&gt;
&lt;span class="pygments-c1"&gt;#&lt;/span&gt;
&lt;span class="pygments-c1"&gt;# &amp;lt;more details&amp;gt;&lt;/span&gt;

&lt;span class="pygments-n"&gt;optimizations&lt;/span&gt; &lt;span class="pygments-o"&gt;=&lt;/span&gt; &lt;span class="pygments-p"&gt;[&lt;/span&gt;
   &lt;span class="pygments-o"&gt;...&lt;/span&gt;
   &lt;span class="pygments-p"&gt;((&lt;/span&gt;&lt;span class="pygments-s1"&gt;'iadd'&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;a&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-mi"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;),&lt;/span&gt; &lt;span class="pygments-n"&gt;a&lt;/span&gt;&lt;span class="pygments-p"&gt;),&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This rule is a good starting example because it’s so straightforward.
It looks for an integer add operation of something with zero and gets
rid of it. A slightly more complex example removes redundant
&lt;code&gt;fmax&lt;/code&gt; opcodes:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="pygments-p"&gt;((&lt;/span&gt;&lt;span class="pygments-s1"&gt;'fmax'&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-s1"&gt;'fmax'&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;a&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;b&lt;/span&gt;&lt;span class="pygments-p"&gt;),&lt;/span&gt; &lt;span class="pygments-n"&gt;b&lt;/span&gt;&lt;span class="pygments-p"&gt;),&lt;/span&gt; &lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-s1"&gt;'fmax'&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;a&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;b&lt;/span&gt;&lt;span class="pygments-p"&gt;)),&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Since it’s written in Python, we can also write little rule
generators if the same thing applies to a bunch of opcodes or if you
want to generalize across types:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="pygments-c1"&gt;# For any float comparison operation, "cmp", if you have "a == a &amp;amp;&amp;amp; a cmp b"&lt;/span&gt;
&lt;span class="pygments-c1"&gt;# then the "a == a" is redundant because it's equivalent to "a is not NaN"&lt;/span&gt;
&lt;span class="pygments-c1"&gt;# and, if a is a NaN then the second comparison will fail anyway.&lt;/span&gt;
&lt;span class="pygments-k"&gt;for&lt;/span&gt; &lt;span class="pygments-n"&gt;op&lt;/span&gt; &lt;span class="pygments-ow"&gt;in&lt;/span&gt; &lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-s1"&gt;'flt'&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-s1"&gt;'fge'&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-s1"&gt;'feq'&lt;/span&gt;&lt;span class="pygments-p"&gt;]:&lt;/span&gt;
   &lt;span class="pygments-n"&gt;optimizations&lt;/span&gt; &lt;span class="pygments-o"&gt;+=&lt;/span&gt; &lt;span class="pygments-p"&gt;[&lt;/span&gt;
      &lt;span class="pygments-p"&gt;((&lt;/span&gt;&lt;span class="pygments-s1"&gt;'iand'&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-s1"&gt;'feq'&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;a&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;a&lt;/span&gt;&lt;span class="pygments-p"&gt;),&lt;/span&gt; &lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;op&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;a&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;b&lt;/span&gt;&lt;span class="pygments-p"&gt;)),&lt;/span&gt; &lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-s1"&gt;'!'&lt;/span&gt; &lt;span class="pygments-o"&gt;+&lt;/span&gt; &lt;span class="pygments-n"&gt;op&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;a&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;b&lt;/span&gt;&lt;span class="pygments-p"&gt;)),&lt;/span&gt;
      &lt;span class="pygments-p"&gt;((&lt;/span&gt;&lt;span class="pygments-s1"&gt;'iand'&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-s1"&gt;'feq'&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;a&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;a&lt;/span&gt;&lt;span class="pygments-p"&gt;),&lt;/span&gt; &lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;op&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;b&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;a&lt;/span&gt;&lt;span class="pygments-p"&gt;)),&lt;/span&gt; &lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-s1"&gt;'!'&lt;/span&gt; &lt;span class="pygments-o"&gt;+&lt;/span&gt; &lt;span class="pygments-n"&gt;op&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;b&lt;/span&gt;&lt;span class="pygments-p"&gt;,&lt;/span&gt; &lt;span class="pygments-n"&gt;a&lt;/span&gt;&lt;span class="pygments-p"&gt;)),&lt;/span&gt;
   &lt;span class="pygments-p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Because we’ve made adding new optimizations so incredibly easy, we
have a lot of them. Not just the simple stuff I’ve highlighted above,
either. We’ve got at least two cases where someone hand-rolled
&lt;code&gt;bitfieldReverse()&lt;/code&gt; and we match a giant pattern and turn it
into a single HW instruction. (Some UE4 demo and Cyberpunk 2077, if you
want to know who to blame. They hand-roll it differently, of course.) We
also have patterns to chew through all the garbage from D3D9 to HLSL
conversion where they emit piles of &lt;code&gt;x ? 1.0 : 0.0&lt;/code&gt;
everywhere because D3D9 didn’t have real Boolean types. All told, as of
the writing of this blog post, we have 1911 such search and replace
patterns.&lt;/p&gt;
&lt;p&gt;Not only have we made it easy to add new patterns but the
&lt;code&gt;nir_search&lt;/code&gt; framework has some pretty useful smarts in it.
The expression I first showed matches &lt;code&gt;a + 0&lt;/code&gt; and replaces it
with &lt;code&gt;a&lt;/code&gt; but &lt;code&gt;nir_search&lt;/code&gt; is smart enough to know
that &lt;code&gt;nir_op_iadd&lt;/code&gt; is commutative and so it also matches
&lt;code&gt;0 + a&lt;/code&gt; without having to write two expressions. We also have
syntax for detecting constants, handling different bit sizes, and
applying arbitrary C predicates based on the SSA value. Since NIR is
actually a vector IR (we support a lot of vec4-based hardware),
&lt;code&gt;nir_search&lt;/code&gt; also magically handles swizzles for you.&lt;/p&gt;
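&lt;p&gt;To make the commutativity point concrete, here’s a toy matcher in
Python. This is only a sketch and not how &lt;code&gt;nir_search&lt;/code&gt;
is implemented (the real one is written in C). The tuple encoding, the
&lt;code&gt;COMMUTATIVE&lt;/code&gt; set, and the &lt;code&gt;match()&lt;/code&gt; helper
are all made up for illustration.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Toy expression matcher. Expressions are tuples like ('iadd', 'a', 0).
# In a pattern, a bare string operand is a variable; anything else that
# is not a tuple is a constant which has to match exactly.
COMMUTATIVE = {'iadd', 'imul', 'iand', 'ior'}  # illustrative subset


def match(pattern, expr, binds=None):
    """Return a dict of variable bindings, or None if there is no match."""
    binds = dict(binds or {})
    if isinstance(pattern, str):
        # A variable matches anything but must bind consistently.
        if pattern in binds and binds[pattern] != expr:
            return None
        binds[pattern] = expr
        return binds
    if not isinstance(pattern, tuple):
        return binds if pattern == expr else None
    if not isinstance(expr, tuple) or len(expr) != len(pattern):
        return None
    if pattern[0] != expr[0]:
        return None
    orders = [expr[1:]]
    if pattern[0] in COMMUTATIVE and len(expr) == 3:
        # Also try the swapped operand order for commutative opcodes.
        orders.append((expr[2], expr[1]))
    for operands in orders:
        trial = binds
        for sub_pattern, sub_expr in zip(pattern[1:], operands):
            trial = match(sub_pattern, sub_expr, trial)
            if trial is None:
                break
        else:
            return trial
    return None&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this, the single pattern &lt;code&gt;('iadd', 'a', 0)&lt;/code&gt;
matches both &lt;code&gt;('iadd', 'x', 0)&lt;/code&gt; and
&lt;code&gt;('iadd', 0, 'x')&lt;/code&gt; and binds &lt;code&gt;a&lt;/code&gt; to
&lt;code&gt;'x'&lt;/code&gt; either way, while the same pattern written with a
non-commutative opcode only matches in the order it was written.&lt;/p&gt;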
&lt;p&gt;You might think 1911 patterns is a lot and it is. Doesn’t that take
forever? Isn’t it &lt;span class="math inline"&gt;&lt;em&gt;O&lt;/em&gt;(&lt;em&gt;N&lt;/em&gt;&lt;em&gt;P&lt;/em&gt;&lt;em&gt;S&lt;/em&gt;)&lt;/span&gt;
where &lt;span class="math inline"&gt;&lt;em&gt;N&lt;/em&gt;&lt;/span&gt; is the number of
instructions, &lt;span class="math inline"&gt;&lt;em&gt;P&lt;/em&gt;&lt;/span&gt; is the number
of patterns and &lt;span class="math inline"&gt;&lt;em&gt;S&lt;/em&gt;&lt;/span&gt; is the
average pattern size or something like that? Nope! A couple years ago,
Connor Abbott converted it to use a finite-state automaton,
built at driver compile time, to filter out impossible matches as we go.
The result is that the whole pass effectively runs in linear time in the
number of instructions.&lt;/p&gt;
&lt;h2 id="nir-is-a-lowish-level-ir"&gt;NIR is a low(ish) level IR&lt;/h2&gt;
&lt;p&gt;This one continues to surprise me. When we set out to design NIR, the
goal was something that was SSA and used flat lists of instructions (not
expression trees). That was pretty much the extent of the design
requirements. However, whenever you build an IR, you inevitably make a
series of choices about what kinds of things you’re going to support
natively and what things are going to require emulation or be a bit more
painful.&lt;/p&gt;
&lt;p&gt;One of the most fundamental choices we made in NIR was that SSA
values would be typeless vectors. Each &lt;code&gt;nir_ssa_def&lt;/code&gt; has a
bit size and a number of vector components and that’s it. We don’t
distinguish between integers and floats and we don’t support matrix or
composite types. Not supporting matrix types was a bit controversial but
it’s turned out fine. We also have to do a bit of juggling to support
hardware that doesn’t have native integers because we have to lower
integer operations to float and we’ve lost the type information. When
working with shaders that come from D3D to OpenGL or Vulkan translators,
the type information does more harm than good. I can’t count the number
of shaders I’ve seen where they declare &lt;code&gt;vec4 x1&lt;/code&gt; through
&lt;code&gt;vec4 x80&lt;/code&gt; at the top and then &lt;code&gt;uintBitsToFloat()&lt;/code&gt;
and &lt;code&gt;floatBitsToUint()&lt;/code&gt; all over everywhere.&lt;/p&gt;
&lt;p&gt;We also made adding new ALU ops and intrinsics really easy but also
added a fairly powerful metadata system for both so the compiler can
still reason about them. The lines we drew between ALU ops, intrinsics,
texture instructions, and control-flow like &lt;code&gt;break&lt;/code&gt; and
&lt;code&gt;continue&lt;/code&gt; were pretty arbitrary at the time if we’re honest.
Texturing was going to be a lot of intrinsics so Connor added an
instruction type. That was pretty much it.&lt;/p&gt;
&lt;p&gt;The end result, however, has been an IR that’s incredibly versatile.
It’s somehow both a high-level and low-level IR at the same time. When
we do SPIR-V to NIR translation, we don’t have a separate IR for parsing
SPIR-V. We have some data structures to deal with composite types and a
handful of other stuff but when we parse SPIR-V opcodes, we go straight
to NIR. We’ve got variables with fairly standard dereference chains
(those do support composite types), bindings, all the crazy built-ins
like &lt;code&gt;frexp()&lt;/code&gt;, and a bunch of other language-level stuff. By
the time the NIR shows up in your back-end, however, all that’s gone.
Crazy built-in functions have been lowered. GL/Vulkan binding with
derefs, descriptors, and locations has been turned into byte offsets and
indices in a flat binding table. Some drivers have even attempted to
emit hardware instructions directly from NIR. (It’s never quite worked
but says a lot that they even tried.)&lt;/p&gt;
&lt;p&gt;The Intel compiler back-end has probably shrunk by half in terms of
optimization and lowering passes in the last seven years because we’re
able to do so much in NIR. We’ve got code that lowers storage image
access with unsupported formats to other image formats or even SSBO
access, splitting of vector UBO/SSBO access that’s too wide for
hardware, workarounds for imprecise trig ops, and a bunch of others. All
of the interesting lowering is done in NIR. One reason for this is that
Intel has two back-ends, one for scalar and one that’s vec4 and any
lowering we can do in NIR is lowering that only happens once. But, also,
it’s nice to be able to have the full power of NIR’s optimizer run on
your lowered code.&lt;/p&gt;
&lt;p&gt;As I said earlier, I find the versatility of NIR astounding. We never
intended to write an IR that could get that close to hardware. We just
wanted SSA for easier optimization writing. But the end result has been
absolutely fantastic and has done a lot to accelerate driver development
in Mesa.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;If you’ve gotten this far, I both applaud and thank you! NIR has been
a lot of fun to build and, as you can probably tell, I’m quite proud of
it. It’s also been a huge investment involving thousands of man hours
but I think it’s been well worth it. There’s a lot more work to do, of
course. We still don’t have the ray-tracing situation where it needs to
be and OpenCL-style compute needs some help to be really competent. But
it’s come an incredibly long way in the last seven years and I’m
incredibly proud of what we’ve built and forever thankful to the many
many developers who have chipped in and fixed bugs and contributed
optimization and lowering passes.&lt;/p&gt;
&lt;p&gt;Hopefully, this post provides some additional background and
explanation for the big question of why Mesa carries its own compiler
stack. And maybe, just maybe, someone will get excited enough about it
to play around with it and even contribute! One can hope, right?&lt;/p&gt;
</description><guid isPermaLink="true">https://www.gfxstrand.net/faith/blog/2022/01/in-defense-of-nir/</guid><pubDate>Thu, 27 Jan 2022 17:07:00 -0600</pubDate></item><item><title>Hello, Collabora!</title><link>https://www.gfxstrand.net/faith/blog/2022/01/hello-collabora/</link><description>&lt;h1 id="hello-collabora"&gt;Hello, Collabora!&lt;/h1&gt;
&lt;p&gt;Ever since &lt;a href="https://twitter.com/jekstrand_/status/1471961379150176256"&gt;I
announced that I was leaving Intel&lt;/a&gt;, there’s been a lot of
speculation as to where I’d end up. I left it a bit quiet over the
holidays but, now that we’re solidly in 2022, it’s time to let it spill.
As of January 24, I’ll be at Collabora!&lt;/p&gt;
&lt;p&gt;For those of you that don’t know, Collabora is an open-source
consultancy. They sell engineering services to companies who are making
devices that run Linux and want to contribute to open-source
technologies. They’ve worked on everything from automotive to gaming
consoles to smart TVs to infotainment systems to VR platforms. I’m not
an expert on what Collabora has done over the years so I’ll refer you to
their &lt;a href="https://www.collabora.com/industries/"&gt;brag sheet&lt;/a&gt; for
that. Unlike some contract houses, Collabora doesn’t just do engineering
for hire. They’re also an ideologically driven company that really
believes in upstream and invests directly in upstream projects such as
Mesa, Wayland, and others.&lt;/p&gt;
&lt;p&gt;My personal history with Collabora is as old as my history as an
open-source software developer. My first real upstream work was on
Wayland in early 2013. I jumped in with a cunning plan for running a
graphics-enabled desktop Linux chroot on an Android device and
absolutely no idea what I was getting myself into. Two of the people who
not only helped me understand the underbelly of Linux window systems but
also helped me learn to navigate the world of open-source software were
Daniel Stone and Pekka Paalanen, both of whom were at Collabora then and
still are today.&lt;/p&gt;
&lt;p&gt;After switching to Mesa when I joined Intel in 2014, I didn’t
interact with Collabora devs quite as much since they mostly stayed in
the window-system world and I tried to stay in 3D. In the last few
years, however, they’ve been building up their 3D team and doing some
really interesting work. Alyssa Rosenzweig and I have worked quite a bit
together on various NIR passes as part of her work on Panfrost and now
agx. I also worked with Boris Brezillon and Erik Faye-Lund on some of
the CLOn12, GLOn12, and Zink work which layers OpenGL and OpenCL on top
of D3D12 and Vulkan. In case you haven’t figured it out already from my
glowing review, Collabora has some top-notch people who are doing great
work and I’m excited to be joining the team and working more closely
with them.&lt;/p&gt;
&lt;p&gt;So how did this happen? What convinced me to leave the cushy
corporate job and join a tiny (compared to Intel) open-source company?
It’s not been for lack of opportunities. I get pinged by recruiters on
LinkedIn on a regular basis and certain teams in the industry have been
rather persistent. I’ve thought quite a lot over the years about where
I’d want to go if I ever left Intel. Intel has been my engineering home
for 7.5 years and has provided the strange cocktail on which I’ve built
my career: a stable team, well-funded upstream open-source work, fairly
cutting edge hardware, and an IHV seat at Khronos. Every place I’d ever
considered going would mean losing one or more of those things and,
until Collabora, no one had given me a good enough reason to give any of
that up.&lt;/p&gt;
&lt;p&gt;Back in September, I was chatting on IRC with other Mesa devs about
OpenCL, SPIR-V, and some corner-case we were missing in the compiler
when the following exchange happened:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;11:39 &amp;lt; jenatali&amp;gt; I hope I get time to get back to CL at some point, I
      hate leaving it half-finished, but stupid corporate priorities
      mean I have to do other stuff instead :P
11:41 &amp;lt; jekstrand&amp;gt; Yeah... Corporations... Why do we work for them
      again?  Oh, right, so we can afford to eat.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;About an hour later, Daniel Stone replied privately:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;12:40 &amp;lt;daniels&amp;gt; hey so if corporations ever get you down, there are
      always less-corporate options … :)
12:40 &amp;lt;daniels&amp;gt; timing completely coincidental of course
12:42 &amp;lt;jekstrand&amp;gt; Of course...
12:42 &amp;lt;jekstrand&amp;gt; I'm always open to new things if the offer is
      right...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This kicked off the weirdest and most interesting career conversation
I’ve had to date. At first, I didn’t believe him. The job he was
describing doesn’t exist. No one gets that offer. Not unless you’re Dave
Airlie or Linus Torvalds. But, after multiple 1 – 2 hour video chats,
more IRC chatter, and an hour chatting with Philippe Kalaf (Collabora’s
CEO), they had me convinced. This is real.&lt;/p&gt;
&lt;p&gt;So what did Collabora finally offer me that no one else has? Total
autonomy. In my new role at Collabora, my mandate consists of two
things: invest in and mentor the Collabora 3D graphics team and invest
in upstream Linux and open-source graphics however I see fit. I won’t be
expected to do any contract work. I may meet with clients from time to
time and I’ll likely get involved more with the various Collabora-driven
Mesa projects but my primary focus will be on ensuring that upstream is
healthy. I won’t be tied to any one driver or hardware vendor either.
Sure, it’d be good to do a bit of Panfrost work so I can help Alyssa out
since she’s now my coworker and I’ll likely still work on Intel drivers
a bit since that’s my home turf. But, at the end of the day, I’m now
free to put my effort wherever it’s needed in the stack without concern
for corporate priorities. Ray-tracing in RADV? Why not. OpenCL 3.0 for
everyone? Sure. Hacking on a new kernel interface for Freedreno? That’s
fine too. As far as I’m concerned, when it comes to how I spend my
engineering effort, I now report directly to upstream. No strings
attached.&lt;/p&gt;
&lt;p&gt;One of the interesting side effects of this is how it will affect my
role within Khronos. Collabora is a Khronos member so I still plan to be
involved there but it will look different. For several years now (as
long as RADV has been a competent driver, really), I’ve always worn two
hats at Khronos: Intel and Mesa/Linux. Most of the time, I was
representing Intel but there were always those weird awkward moments
where I helped out the Igalia team working on V3DV or the RADV team. Now
that I’m no longer at a hardware vendor, I can really embrace the role
of representing Mesa and Linux upstream within Khronos. This doesn’t
mean that I’m suddenly going to fix all your Vulkan spec problems
overnight but it does mean I’ll be paying a bit more attention to the
non-Intel drivers and doing what I can to ensure that all the Vulkan
drivers in Mesa are in good shape.&lt;/p&gt;
&lt;p&gt;Honestly, I’m still in shock that I was offered this role. It’s a
great testament to Collabora’s belief in upstream that they’re willing
to fund such a role and it shows an incredible amount of faith in my
work. At Intel, I was blessed to be able to work upstream as part of my
day job, which isn’t something most open-source software developers get.
To have someone believe in your work so much that they’re willing to cut
you a pay check just to keep doing what you’re doing is mind boggling.
I’m truly honored and I hope the work I do in the days, months, and
years to come will prove that their faith was well placed.&lt;/p&gt;
&lt;p&gt;So, what am I going to be working on with my new found freedom? Do I
have any cool new projects planned that are going to turn the industry
upside-down? Of course I do! But those are topics for other blog
posts.&lt;/p&gt;
</description><guid isPermaLink="true">https://www.gfxstrand.net/faith/blog/2022/01/hello-collabora/</guid><pubDate>Mon, 17 Jan 2022 09:48:00 -0600</pubDate></item><item><title>Getting the most out of your Intel integrated GPU on Linux</title><link>https://www.gfxstrand.net/faith/blog/2020/11/getting-the-most-out-of-your-intel/</link><description>&lt;h1 id="getting-the-most-out-of-your-intel-integrated-gpu-on-linux"&gt;Getting
the most out of your Intel integrated GPU on Linux&lt;/h1&gt;
&lt;p&gt;About a year ago, I got a new laptop: a late 2019 Razer Blade
Stealth 13. It sports an Intel i7-1065G7 with the best of Intel’s Ice Lake
graphics along with an NVIDIA GeForce GTX 1650. Apart from needing an
ACPI lid quirk and the power management issues described here, it’s been
a great laptop so far and the Linux experience has been very smooth.&lt;/p&gt;
&lt;p&gt;Unfortunately, the out-of-the-box integrated graphics performance of
my new laptop was less than stellar. My first task with the new laptop
was to debug a rendering issue in the Linux port of Shadow of the Tomb
Raider which turned out to be a bug in the game. In the process, I
discovered that the performance of the game’s built-in benchmark was
almost half of Windows. We’ve had some performance issues with Mesa from
time to time on some games but half seemed a bit extreme. Looking at
system-level performance data with gputop revealed that GPU clock rate
was unable to get above about 60-70% of the maximum in spite of the GPU
being busy the whole time. Why? The GPU wasn’t able to get enough power.
Once I sorted out my power management problems, the benchmark went from
about 50-60% the speed of Windows to more like 104% the speed of Windows
(yes, that’s more than 100%).&lt;/p&gt;
&lt;p&gt;This blog post is intended to serve as a bit of a guide to
understanding memory throughput and power management issues and
configuring your system properly to get the most out of your Intel
integrated GPU. Not everything in this post will affect all laptops so
you may have to do some experimentation with your system to see what
does and does not matter. I also make no claim that this post is in any
way complete; there are almost certainly other configuration issues of
which I’m not aware or which I’ve forgotten.&lt;/p&gt;
&lt;h2 id="update-your-drivers"&gt;Update your drivers&lt;/h2&gt;
&lt;p&gt;This should go without saying but if you want the best performance
out of your hardware, running the latest drivers is always recommended.
This is especially true for hardware that has just been released.
Generally, for graphics, most of the big performance improvements are
going to be in Mesa but your Linux kernel version can matter as well. In
the case of Intel Ice Lake processors, some of the power management
features aren’t enabled until Linux 5.4.&lt;/p&gt;
&lt;p&gt;I’m not going to give a complete guide to updating your drivers here.
If you’re running a distro like Arch, chances are that you’re already
running something fairly close to the latest available. If you’re on
Ubuntu, the padoka PPA provides versions of the userspace components
(Mesa, X11, etc.) that are usually no more than about a week out-of-date
but upgrading your kernel is more complicated. Other distros may have
something similar but I’ll leave that as an exercise to the reader.&lt;/p&gt;
&lt;p&gt;This doesn’t mean that you need to be obsessive about updating
kernels and drivers. If you’re happy with the performance and stability
of your system, go ahead and leave it alone. However, if you have brand
new hardware and want to make sure you have new enough drivers, it may
be worth attempting an update. Or, if you have the patience, you can
just wait 6 months for the next distro release cycle and hope to pick up
newer drivers with a distro update.&lt;/p&gt;
&lt;h2 id="make-sure-you-have-dual-channel-ram"&gt;Make sure you have
dual-channel RAM&lt;/h2&gt;
&lt;p&gt;One of the big bottleneck points in 3D rendering applications is
memory bandwidth. Most standard monitors run at a resolution of
1920x1080 and a refresh rate of 60 Hz. A 1920x1080 RGBA (32bpp) image is
just shy of 8 MiB in size and, if the GPU is rendering at 60 FPS, that
adds up to about 474 MiB/s of memory bandwidth to write out the image
every frame. If you’re running a 4K monitor, multiply by 4 and you get
about 1.8 GiB/s. Those numbers are only for the final color image,
assume we write every pixel of the image exactly once, and don’t take
into account any other memory access. Even in a simple 3D scene, there
are other images than just the color image being written such as depth
buffers or auxiliary gbuffers, each pixel typically gets written more
than once depending on app over-draw, and shading typically involves
reading from uniform buffers and textures. Modern 3D applications
typically also have things such as depth pre-passes, lighting passes,
and post-processing filters for depth-of-field and/or motion blur. The
result of this is that actual memory bandwidth for rendering a 3D scene
can be 10-100x the bandwidth required to simply write the color
image.&lt;/p&gt;
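&lt;p&gt;If you want to redo this arithmetic for your own monitor, here’s a
small Python sketch. The function name and its defaults (60 FPS, 4 bytes
per pixel) are just choices made for this example and it only models
writing the final color image once per frame.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;MIB = 1024 ** 2
GIB = 1024 ** 3


def color_write_bandwidth(width, height, fps=60, bytes_per_pixel=4):
    """Bytes per second to write every pixel exactly once per frame."""
    return width * height * bytes_per_pixel * fps


full_hd = color_write_bandwidth(1920, 1080)
uhd_4k = color_write_bandwidth(3840, 2160)

print(f"1080p frame size: {1920 * 1080 * 4 / MIB:.2f} MiB")
print(f"1080p at 60 FPS:  {full_hd / MIB:.1f} MiB/s")
print(f"4K at 60 FPS:     {uhd_4k / GIB:.2f} GiB/s")

# The 10-100x estimate for a real 3D scene, applied to the 1080p case.
print(f"Real scene, low:  {10 * full_hd / GIB:.1f} GiB/s")
print(f"Real scene, high: {100 * full_hd / GIB:.1f} GiB/s")&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last two lines simply scale the 1080p figure by the 10-100x
range; they’re a rough guide and not a measurement of any particular
game.&lt;/p&gt;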
&lt;p&gt;Because of the incredible amount of bandwidth required for 3D
rendering, discrete GPUs use memories which are optimized for bandwidth
above all else. These go by different names such as GDDR6 or HBM2
(current as of the writing of this post) but they all use extremely wide
buses and access many bits of memory in parallel to get the highest
throughput they can. CPU memory, on the other hand, is typically DDR4
(current as of the writing of this post) which runs on a narrower 64-bit
bus and so the over-all maximum memory bandwidth is lower. However, as
with anything in engineering, there is a trade-off being made here.
While narrower buses have lower over-all throughput, they are much
better at random access which is necessary for good CPU memory
performance when crawling complex data structures and doing other normal
CPU tasks. When 3D rendering, on the other hand, the vast majority of
your memory bandwidth is consumed in reading/writing large contiguous
blocks of memory and so the trade-off falls in favor of wider buses.&lt;/p&gt;
&lt;p&gt;With integrated graphics, the GPU uses the same DDR RAM as the CPU so
it can’t get as much raw memory throughput as a discrete GPU. Some of
the memory bottlenecks can be mitigated via large caches inside the GPU
but caching can only do so much. At the end of the day, if you’re
fetching 2 GiB of memory to draw a scene, you’re going to blow out your
caches and load most of that from main memory.&lt;/p&gt;
&lt;p&gt;The good news is that most motherboards support a dual-channel RAM
configuration where, if your DDR units are installed in identical
pairs, the memory controller will split memory access between the two
DDR units in the pair. This has similar benefits to running on a 128-bit
bus but without some of the drawbacks. The result is about a 2x
improvement in over-all memory throughput. While this may not affect
your CPU performance significantly outside of some very special cases,
it makes a huge difference to your integrated GPU which cares far more
about total throughput than random access. If you are unsure how your
computer’s RAM is configured, you can run “dmidecode -t memory” and see
if you have two identical devices reported in different channels.&lt;/p&gt;
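&lt;p&gt;If you’d rather not read the raw output by eye, something like the
following Python sketch can summarize it. The &lt;code&gt;SAMPLE&lt;/code&gt;
text is invented for illustration and heavily trimmed; real output has
many more fields, and the locator strings and size units vary between
OEMs and dmidecode versions, so treat the parsing as an assumption to
adapt and not as a guarantee.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Invented, trimmed-down example of "dmidecode -t memory" output.
SAMPLE = """\
Handle 0x0040, DMI type 17, 40 bytes
Memory Device
\tSize: 8192 MB
\tLocator: ChannelA-DIMM0
\tBank Locator: BANK 0

Handle 0x0041, DMI type 17, 40 bytes
Memory Device
\tSize: No Module Installed
\tLocator: ChannelA-DIMM1
\tBank Locator: BANK 1

Handle 0x0042, DMI type 17, 40 bytes
Memory Device
\tSize: 8192 MB
\tLocator: ChannelB-DIMM0
\tBank Locator: BANK 2
"""


def populated_devices(dmidecode_text):
    """Return (locator, bank, size) for each populated memory device."""
    devices = []
    for block in dmidecode_text.split("\n\n"):
        if "Memory Device" not in block:
            continue
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.strip().partition(": ")
            if sep:
                fields[key] = value
        size = fields.get("Size", "No Module Installed")
        if size != "No Module Installed":
            devices.append(
                (fields.get("Locator"), fields.get("Bank Locator"), size))
    return devices


for locator, bank, size in populated_devices(SAMPLE):
    print(f"{locator} ({bank}): {size}")&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On a real system you would feed it the output of
&lt;code&gt;dmidecode -t memory&lt;/code&gt; (which typically needs root) in
place of &lt;code&gt;SAMPLE&lt;/code&gt;. Two populated devices of the same
size whose locators name different channels is what you’re hoping to
see.&lt;/p&gt;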
&lt;h2 id="power-management-101"&gt;Power management 101&lt;/h2&gt;
&lt;p&gt;Before getting into the details of how to fix power management
issues, I should explain a bit about how power management works and,
more importantly, how it doesn’t. If you don’t care to learn about power
management and are just here for the system configuration tips, feel
free to skip this section.&lt;/p&gt;
&lt;p&gt;Why is power management important? Because the clock rate (and
therefore the speed) of your CPU or GPU is heavily dependent on how much
power is available to the system. If it’s unable to get enough power for
some reason, it will run at a lower clock rate and you’ll see that as
processes taking more time or lower frame rates in the case of graphics.
There are some things that you, as the user, cannot control such as the
physical limitations of the chip or the way the OEM has configured
things on your particular laptop. However, there are some things which
you can do from a system configuration perspective which can greatly
affect power management and your performance.&lt;/p&gt;
&lt;p&gt;First, we need to talk about thermal design power or TDP. There is a
lot of misunderstanding on the internet about TDP and we need to clear
some of it up. Wikipedia defines TDP as “the maximum amount of heat
generated by a computer chip or component that the cooling system in a
computer is designed to dissipate under any workload.” The Intel Product
Specifications site defines TDP as follows:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Thermal Design Power (TDP) represents the average power, in watts,
the processor dissipates when operating at Base Frequency with all cores
active under an Intel-defined, high-complexity workload. Refer to
Datasheet for thermal solution requirements.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;In other words, the TDP value provided on the Intel spec sheet is a
pretty good design target for OEMs but doesn’t provide nearly as many
guarantees as one might hope. In particular, there are several things
that the TDP value on the spec sheet is not:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It’s not the exact maximum power. It’s an “average
power”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It may not match any particular workload. It’s based on “an
Intel-defined, high-complexity workload”. Power consumption on any other
workload is likely to be slightly different.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It’s not the actual maximum. It’s based on when the processor is
“operating at Base Frequency with all cores active.” Technologies such
as Turbo Boost can cause the CPU to operate at a higher power for short
periods of time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you look at the Intel Product Specifications page for the
i7-1065G7, you’ll see three TDP values: the nominal TDP of 15W, a
configurable TDP-up value of 25W and a configurable TDP-down value of
12W. The nominal TDP (simply called “TDP”) is the base TDP which is
enough for the CPU to run all of its cores at the base frequency which,
given sufficient cooling, it can do in the steady state. The TDP-up and
TDP-down values provide configurability that gives the OEM options when
they go to make a laptop based on the i7-1065G7. If they’re making a
performance laptop like Razer and are willing to put in enough cooling,
they can configure it to 25W and get more performance. On the other
hand, if they’re going for battery life, they can put the exact same
chip in the laptop but configure it to run as low as 12W. They can also
configure the chip to run at 12W or 15W and then ship software with the
computer which will bump it to 25W once Windows boots up. We’ll talk
more about this reconfiguration later on.&lt;/p&gt;
&lt;p&gt;Beyond just the numbers on the spec sheet, there are other things
which may affect how much power the chip can get. One of the big ones is
cooling. The law of conservation of energy dictates that energy is never
created or destroyed. In particular, your CPU doesn’t really consume
energy; it turns that electrical energy into heat. For every Watt of
electrical power that goes into the CPU, a Watt of heat has to be pumped
out by the cooling system. (Yes, a Watt is also a measure of heat flow.)
If the CPU is using more electrical energy than the cooling system can
pump back out, energy gets temporarily stored in the CPU as heat and you
see this as the CPU temperature rising. Eventually, however, the CPU has
to back off and let the cooling system catch up or else that built up
heat may cause permanent damage to the chip.&lt;/p&gt;
&lt;p&gt;Another thing which can affect CPU power is the actual power delivery
capabilities of the motherboard itself. In a desktop, the discrete GPU
is typically powered directly by the power supply and it can draw 300W
or more without affecting the amount of power available to the CPU. In a
laptop, however, you may have more power limitations. If you have
multiple components requiring significant amounts of power such as a CPU
and a discrete GPU, the motherboard may not be able to provide enough
power for both of them to run flat-out so it may have to limit CPU power
while the discrete GPU is running. These types of power balancing
decisions can happen at a very deep firmware level and may not be
visible to software.&lt;/p&gt;
&lt;p&gt;The moral of this story is that the TDP listed on the spec sheet for
the chip isn’t what matters; what matters is how the chip is configured
by the OEM, how much power the motherboard is able to deliver, and how
much power the cooling system is able to remove. Just because two
laptops have the same processor with the same part number doesn’t mean
you should expect them to get the same performance. This is unfortunate
for laptop buyers but it’s the reality of the world we live in. There
are some things that you, as the user, cannot control such as the
physical limitations of the chip or the way the OEM has configured
things on your particular laptop. However, there are some things which
you can do from a system configuration perspective and that’s what we’ll
talk about next.&lt;/p&gt;
&lt;p&gt;If you want to experiment with your system and understand what’s
going on with power, there are two tools which are very useful for this:
powertop and turbostat. Both are open-source and should be available
through your distro package manager. I personally prefer the turbostat
interface for CPU power investigations but powertop is able to split
your power usage up per-process which can be really useful as well.&lt;/p&gt;
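&lt;p&gt;If you just want a quick number without either tool, the kernel’s
powercap interface exposes an energy counter for the CPU package. The
Python sketch below assumes the package shows up as
&lt;code&gt;/sys/class/powercap/intel-rapl:0&lt;/code&gt; (which is common but
not guaranteed) and that you have permission to read
&lt;code&gt;energy_uj&lt;/code&gt;, which on recent kernels usually means
running as root. The function names are my own.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import time
from pathlib import Path

# Assumed location of the package-0 powercap zone; check your system.
RAPL = Path("/sys/class/powercap/intel-rapl:0")


def average_watts(start_uj, end_uj, seconds, max_range_uj):
    """Average power between two energy_uj samples.

    The counter wraps around at roughly max_range_uj, so the difference
    is taken modulo that range.
    """
    delta_uj = (end_uj - start_uj) % max_range_uj
    return delta_uj / 1e6 / seconds


def sample_package_power(interval=1.0):
    """Read the package energy counter twice and return average Watts."""
    max_range = int((RAPL / "max_energy_range_uj").read_text())
    start = int((RAPL / "energy_uj").read_text())
    time.sleep(interval)
    end = int((RAPL / "energy_uj").read_text())
    return average_watts(start, end, interval, max_range)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Calling &lt;code&gt;sample_package_power()&lt;/code&gt; while a game is
running gives you a rough idea of how close the package is to its
configured TDP. This is a coarse average over the whole package so it
won’t tell you how the power is split between the CPU cores and the
GPU.&lt;/p&gt;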
&lt;h2 id="update-gamemode-to-at-least-version-1.5"&gt;Update GameMode to at
least version 1.5&lt;/h2&gt;
&lt;p&gt;About two and a half years ago (1.0 was released in May of 2018),
Feral Interactive released their GameMode daemon which is able to tweak
some of your system settings when a game starts up to get maximal
performance. One of the settings that GameMode tweaks is your CPU
performance governor. By default, GameMode will set it to “performance”
when a game is running. While this seems like a good idea (“performance”
is better, right?), it can actually be counterproductive on integrated
GPUs and cause you to get worse over-all performance.&lt;/p&gt;
&lt;p&gt;Why would the “performance” governor cause worse performance? First,
understand that the names “performance” and “powersave” for CPU
governors are a bit misleading. The powersave governor isn’t just for
when you’re running on battery and want to use as little power as
possible. When on the powersave governor, your system will clock all the
way up if it needs to and can even turbo if you have a heavy workload.
The difference between the two governors is that the powersave governor
tries to give you as much performance as possible while also caring
about power; it’s quite well balanced. Intel typically recommends the
powersave governor even in data centers because, even though they have
piles of power and cooling available, data centers typically care about
their power bill. The performance governor, on the other hand, doesn’t
care about power consumption and only cares about getting the maximum
possible performance out of the CPU so it will typically burn
significantly more power than needed.&lt;/p&gt;
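&lt;p&gt;To see which governor your CPUs are actually using, you can read it
back from sysfs. Here’s a minimal Python sketch which assumes the usual
cpufreq layout under &lt;code&gt;/sys/devices/system/cpu&lt;/code&gt;; the
&lt;code&gt;governor_counts()&lt;/code&gt; helper is just something written
for this example.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from collections import Counter
from pathlib import Path


def governor_counts(cpu_root="/sys/devices/system/cpu"):
    """Count how many CPUs are currently using each cpufreq governor."""
    paths = Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor")
    return Counter(path.read_text().strip() for path in paths)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Printing &lt;code&gt;governor_counts()&lt;/code&gt; once at the desktop
and once with a game running is an easy way to see whether something on
your system is switching governors behind your back. If the result is
empty, your system probably isn’t exposing cpufreq at that
path.&lt;/p&gt;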
&lt;p&gt;So what does this have to do with GPU performance? On an integrated
GPU, the GPU and CPU typically share a power budget and every Watt of
power the CPU is using is a Watt that’s unavailable to the GPU. In some
configurations, the TDP is enough to run both the GPU and CPU flat-out
but that’s uncommon. Most of the time, however, the CPU is capable of
using the entire TDP if you clock it high enough. When running with the
performance governor, that extra unnecessary CPU power consumption can
eat into the power available to the GPU and cause it to clock down.&lt;/p&gt;
&lt;p&gt;This problem should be mostly fixed as of GameMode version 1.5 which
adds an integrated GPU heuristic. The heuristic detects when the
integrated GPU is using significant power and puts the CPU back to using
the powersave governor. In the testing I’ve done, this pretty reliably
chooses the powersave governor in the cases where the GPU is likely to
be TDP limited. The heuristic is dynamic so it will still use the
performance governor if the CPU power usage way overpowers the GPU power
usage such as when compiling shaders at a loading screen.&lt;/p&gt;
&lt;p&gt;What do you need to do on your system? First, check what version of
GameMode you have installed on your system (if any). If it’s version 1.4
or earlier and you intend to play games on an integrated GPU, I
recommend either upgrading GameMode or disabling or uninstalling the
GameMode daemon.&lt;/p&gt;
&lt;h2 id="use-thermald"&gt;Use thermald&lt;/h2&gt;
&lt;p&gt;In “power management 101” I talked about how sometimes OEMs will
configure a laptop to 12W or 15W in BIOS and then re-configure it to 25W
in software. This is done via the “Intel Dynamic Platform and Thermal
Framework” driver on Windows. The DPTF driver manages your overall
system thermals and keeps the system within its thermal budget. This is
especially important for fanless or ultra-thin laptops where the cooling
may not be sufficient for the system to run flat-out for long periods.
One thing the DPTF driver does is dynamically adjust the TDP of your
CPU. It can adjust it both up if the laptop is running cool and you need
the power or down if the laptop is running hot and needs to cool down.
Some OEMs choose to be very conservative with their TDP defaults in BIOS
to prevent the laptop from overheating or constantly running hot if the
Windows DPTF driver is not available.&lt;/p&gt;
&lt;p&gt;On Linux, the equivalent to this is thermald. When installed and
enabled on your system, it reads the same OEM configuration data from
ACPI as the Windows DPTF driver and is also able to scale up your
package TDP threshold past the BIOS default as per the OEM
configuration. You can also write your own configuration files if you
really wish but you do so at your own risk.&lt;/p&gt;
&lt;p&gt;Most distros package thermald but it may not be enabled nor work
quite properly out-of-the-box. This is because, historically, it has
relied on the closed-source dptfxtract utility that’s provided by Intel
as a binary. It requires dptfxtract to fetch the OEM provided
configuration data from the ACPI tables. Since most distros don’t
usually ship closed-source software in their main repositories and since
thermald doesn’t do much without that data, a lot of distros don’t
bother to ship or enable it by default. You’ll have to turn it on
manually.&lt;/p&gt;
&lt;p&gt;To fix this, install both thermald and dptfxtract and ensure that
thermald is enabled. On most distros, thermald is packaged normally even
if it isn’t enabled by default because it is open-source. The dptfxtract
utility is usually available in your distro’s non-free repositories. On
Ubuntu, dptfxtract is available as a package in multiverse. For Fedora,
dptfxtract is available via RPM Fusion’s non-free repo. There are also
packages for Arch and likely others as well. If no one packages it for
your distro, it’s just one binary so it’s pretty easy to install
manually.&lt;/p&gt;
&lt;p&gt;Some of this may change going forward, however. Recently,
Matthew Garrett did some work to reverse-engineer the DPTF framework and
provide support for fetching the DPTF data from ACPI without the need
for the binary blob. When running with a recent kernel and Matthew’s
fork of thermald, you should be able to get OEM-configured thermals
without the need for the dptfxtract blob at least on some hardware.
Whether or not you get the right configuration will depend on your
hardware, your kernel version, your distro, and whether they ship the
Intel version of thermald or Matthew’s fork. Even there, your distro may
leave it uninstalled or disabled by default. It’s still disabled by
default in Fedora 33, for instance.&lt;/p&gt;
&lt;p&gt;It should be noted at this point that, if thermald and dptfxtract are
doing their job, your laptop is likely to start running much hotter when
under heavy load than it did before. This is because thermald is
re-configuring your processor with a higher thermal budget which means
it can now run faster but it will also generate more heat and may drain
your battery faster. In theory, thermald should keep your laptop’s
thermals within safe limits; just not within the more conservative
limits the OEM programmed into BIOS. If all the additional heat makes
you uncomfortable, you can just disable thermald and it should go back
to the BIOS defaults.&lt;/p&gt;
&lt;h2 id="enable-nvidias-dynamic-power-management"&gt;Enable NVIDIA’s dynamic
power-management&lt;/h2&gt;
&lt;p&gt;On my laptop (the late 2019 Razer Blade Stealth 13), the BIOS has the
CPU configured to 35W out-of-the-box. (Yes, 35W is higher than TDP-up
and I’ve never seen it burn anything close to that much power; I have no
idea why it’s configured that way.) This means that we have no need for
DPTF and the cooling is good enough that I don’t really need thermald on
it either. Instead, its power management problems come from the power
balancing that the motherboard does between the CPU and the discrete
NVIDIA GPU.&lt;/p&gt;
&lt;p&gt;If the NVIDIA GPU is powered on at all, the motherboard configures
the CPU to the TDP-down value of 12W. I don’t know exactly how it’s
doing this but it’s at a very deep firmware level that seems completely
opaque to software. To make matters worse, it doesn’t just restrict CPU
power when the discrete GPU is doing real rendering; it restricts CPU
power whenever the GPU is powered on at all. In the default
configuration with the NVIDIA proprietary drivers, that’s all the
time.&lt;/p&gt;
&lt;p&gt;Fortunately, if you know where to find it, there is a configuration
option available in recent drivers for Turing and later GPUs which lets
the NVIDIA driver completely power down the discrete GPU when it isn’t
in use. You can find this documented in Chapter 22 of the NVIDIA driver
README. The runtime power management feature is still beta as of the
writing of this post and does come with some caveats such as that it
doesn’t work if you have audio or USB controllers (for USB-C video) on
your GPU. Fortunately, with many laptops with a hybrid Intel+NVIDIA
graphics solution, the discrete GPU exists only for render off-loading
and doesn’t have any displays connected to it. In that case, the audio
and USB-C can be disabled and don’t cause any problems. On my laptop, as
soon as I properly enabled runtime power management in the NVIDIA
driver, the motherboard stopped throttling my CPU and it started running
at the full TDP-up of 25W.&lt;/p&gt;
&lt;p&gt;I believe that nouveau has some capabilities for runtime power
management. However, I don’t know for sure how good they are and whether
or not they’re able to completely power down the GPU.&lt;/p&gt;
&lt;h2 id="look-for-other-things-which-might-be-limiting-power"&gt;Look for
other things which might be limiting power&lt;/h2&gt;
&lt;p&gt;In this blog post, I’ve covered some of the things which I’ve
personally seen limit GPU power when playing games and running
benchmarks. However, it is by no means an exhaustive list. If there’s
one thing that’s true about power management, it’s that every machine is
a bit different. The biggest challenge with my laptop was the NVIDIA
discrete GPU draining power. On some other laptop, it may be something
else.&lt;/p&gt;
&lt;p&gt;You can also look for background processes which may be using
significant CPU cycles. With a discrete GPU, a modest amount of
background CPU work will often not hurt you unless the game is
particularly CPU-hungry. With an integrated GPU, however, it’s far more
likely that a background task such as a backup or software update will
eat into the GPU’s power budget. Just this last week, a friend of mine
was playing a game on Proton and discovered that the game launcher
itself was burning enough power with the CPU to prevent the GPU from
running at full power. Once he suspended the game launcher, his GPU was
able to run at full power.&lt;/p&gt;
&lt;p&gt;Especially with laptops, you’re also likely to be affected by the
computer’s cooling system as was mentioned earlier. Some laptops such as
my Razer are designed with high-end cooling systems that let the laptop
run at full power. Others, particularly the ultra-thin laptops, are far
more thermally limited and may never be able to hit the advertised TDP
for extended periods of time.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;When trying to get the most performance possible out of a laptop, RAM
configuration and power management are key. Unfortunately, due to the
issues documented above (and possibly others), the out-of-the-box
experience on Linux is not what it should be. Hopefully, we’ll see this
situation improve in the coming years but, for now, this post should
give people the tools they need to configure their machines
properly and get the full performance out of their hardware.&lt;/p&gt;
</description><guid isPermaLink="true">https://www.gfxstrand.net/faith/blog/2020/11/getting-the-most-out-of-your-intel/</guid><pubDate>Fri, 06 Nov 2020 10:27:25 -0800</pubDate></item><item><title>Does subgroup/wave size matter?</title><link>https://www.gfxstrand.net/faith/blog/2020/10/does-subgroup-wave-size-matter/</link><description>&lt;h1 id="does-subgroupwave-size-matter"&gt;Does subgroup/wave size
matter?&lt;/h1&gt;
&lt;p&gt;This week, I had a conversation with one of my coworkers about our
subgroup/wave size heuristic and, in particular, whether or not
control-flow divergence should be considered as part of the choice. This
led me down a fun path of looking into the statistics of control-flow
divergence and the end result is somewhat surprising: Once you get above
about an 8-wide subgroup, the subgroup size doesn’t matter.&lt;/p&gt;
&lt;p&gt;Before I get into the details, let’s talk nomenclature. As you’re
likely aware, GPUs often execute code in groups of 1 or more
invocations. In D3D terminology, these are called waves. In Vulkan and
OpenGL terminology, these are called subgroups. The two terms are
interchangeable and, for the rest of this post, I’ll use the
Vulkan/OpenGL conventions.&lt;/p&gt;
&lt;h2 id="control-flow-divergence"&gt;Control-flow divergence&lt;/h2&gt;
&lt;p&gt;Before we dig into the statistics, let’s talk for a minute about
control-flow divergence. This is mostly going to be a primer on SIMT
execution and control-flow divergence in GPU architectures. If you’re
already familiar, skip ahead to the next section.&lt;/p&gt;
&lt;p&gt;Most modern GPUs use a Single Instruction Multiple Thread (SIMT)
model. This means that the graphics programmer writes a shader which,
for instance, colors a single pixel (fragment/pixel shader) but what the
shader compiler produces is a program which colors, say, 32 pixels using
a vector instruction set architecture (ISA). Each logical single-pixel
execution of the shader is called an “invocation” while the physical
vectorized execution of the shader which covers multiple pixels is
called a wave or a subgroup. The size of the subgroup (number of pixels
colored by a single hardware execution) varies depending on your
architecture. On Intel, it can be 8, 16, or 32; on AMD, it’s 32 or 64;
and, on Nvidia (if my knowledge is accurate), it’s always 32.&lt;/p&gt;
&lt;p&gt;This conversion from logical single-pixel version of the shader to a
physical multi-pixel version is often fairly straightforward. The GPU
registers each hold N values and the instructions provided by the GPU
ISA operate on N pieces of data at a time. If, for instance, you have an
add in the logical shader, it’s converted to an add provided by the
hardware ISA which adds N values. (This is, of course, an
over-simplification but it’s sufficient for now.) Sounds simple,
right?&lt;/p&gt;
&lt;p&gt;Where things get more complicated is when you have control-flow in
your shader. Suppose you have an if statement with both then and else
sections. What should we do when we hit that if statement? The if
condition will be N Boolean values. If all of them are true or all of
them are false, the answer is pretty simple: we do the then or the else
respectively. If you have a mix of true and false values, we have to
execute both sides. More specifically, the physical shader has to
disable all of the invocations for which the condition is false and run
the “then” side of the if statement. Once that’s complete, it has to
re-enable those channels and disable the channels for which the
condition is true and run the “else” side of the if statement. Once
that’s complete, it re-enables all the channels and continues executing
the code after the if statement.&lt;/p&gt;
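&lt;p&gt;To make the masking concrete, here is a toy Python model of a single subgroup executing an if/else. It is only a sketch of the idea described above, not how any real GPU or compiler implements it; the function name run_if_else and its return shape are invented for illustration.&lt;/p&gt;

```python
def run_if_else(conditions, then_fn, else_fn, values):
    """Simulate one subgroup executing an if/else over per-invocation values.

    Returns the new values and the number of sides actually executed.
    """
    results = list(values)
    sides_executed = 0

    # "then" side: only channels whose condition is true are enabled.
    if any(conditions):
        sides_executed += 1
        for i, enabled in enumerate(conditions):
            if enabled:
                results[i] = then_fn(values[i])

    # "else" side: only channels whose condition is false are enabled.
    if not all(conditions):
        sides_executed += 1
        for i, enabled in enumerate(conditions):
            if not enabled:
                results[i] = else_fn(values[i])

    return results, sides_executed


def double(x):
    return x * 2


def negate(x):
    return -x


# Uniform condition: only the "then" side runs.
print(run_if_else([True, True, True, True], double, negate, [1, 2, 3, 4]))
# ([2, 4, 6, 8], 1)

# Mixed condition: both sides run, each with some channels disabled.
print(run_if_else([True, False, True, False], double, negate, [1, 2, 3, 4]))
# ([2, -2, 6, -4], 2)
```

&lt;p&gt;The second return value is the interesting one: it is 1 when the condition is uniform across the subgroup and 2 as soon as a single channel disagrees.&lt;/p&gt;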
&lt;p&gt;When you start nesting if statements and throw loops into the mix,
things get even more complicated. Loop continues have to disable all
those channels until the next iteration of the loop, loop breaks have to
disable all those channels until the loop is entirely complete, and the
physical shader has to figure out when there are no channels left and
complete the loop. This makes for some fun and interesting challenges
for GPU compiler developers. Also, believe it or not, everything I just
said is a massive over-simplification. :-)&lt;/p&gt;
&lt;p&gt;The point which most graphics developers need to understand and
what’s important for this blog post is that the physical shader has to
execute every path taken by any invocation in the subgroup. For loops,
this means that it has to execute the loop enough times for the worst
case in the subgroup. This means that if you have the same work in both
the then and else sides of an if statement, that work may get executed
twice rather than once and you may be better off pulling it outside the
if. It also means that if you have something particularly expensive and
you put it inside an if statement, that doesn’t mean that you only pay
for it when needed, it means you pay for it whenever any invocation in
the subgroup needs it.&lt;/p&gt;
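&lt;p&gt;The loop case can be put into numbers with a very small cost model. This is a hedged sketch that assumes every iteration costs the same and ignores everything else the hardware does; subgroup_loop_cost is a hypothetical helper, not something from a real driver.&lt;/p&gt;

```python
def subgroup_loop_cost(trip_counts):
    """Return (iterations executed by the subgroup, idle channel-iterations).

    trip_counts holds the number of loop iterations each invocation needs.
    """
    executed = max(trip_counts)      # the loop runs until the last channel is done
    useful = sum(trip_counts)        # channel-iterations that did real work
    idle = executed * len(trip_counts) - useful
    return executed, idle


# Three invocations need one iteration, one needs nine:
print(subgroup_loop_cost([1, 1, 1, 9]))   # (9, 24)
# Everyone needs the same amount of work, so nothing is wasted:
print(subgroup_loop_cost([3, 3, 3, 3]))   # (3, 0)
```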
&lt;h2 id="fun-with-statistics"&gt;Fun with statistics&lt;/h2&gt;
&lt;p&gt;At the end of the last section, I said that one of the problems with
the SIMT model used by GPUs is that they end up having worst-case
performance for the subgroup. Every path through the shader which has to
be executed for any invocation in the subgroup has to be taken by the
shader as a whole. The question that naturally arises is, “does a larger
subgroup size make this worst-case behavior worse?” Clearly, the naive
answer is, “yes”. If you have a subgroup size of 1, you only execute
exactly what’s needed and if you have a subgroup size of 2 or more, you
end up hitting this worst-case behavior. If you go higher, the bad cases
should be more likely, right? Yes, but maybe not quite like you
think.&lt;/p&gt;
&lt;p&gt;This is one of those cases where statistics can be surprising. Let’s
say you have an if statement with a boolean condition b. That condition
is actually a vector (b1, b2, b3, …, bN) and if any two of those vector
elements differ, we pay the cost of both paths. Assuming that the
conditions are independent identically distributed (IID) random
variables, the probability of the entire vector being true is P(all(bi =
true)) = P(b1 = true) * P(b2 = true) * … * P(bN = true) = P(bi = true)^N
where N is the size of the subgroup. Therefore, the probability of
having uniform control-flow is P(bi = true)^N + P(bi = false)^N. The
probability of non-uniform control-flow, on the other hand, is 1 - P(bi
= true)^N - P(bi = false)^N.&lt;/p&gt;
&lt;p&gt;Before we go further with the math, let’s put some solid numbers on
it. Let’s say we have a subgroup size of 8 (the smallest Intel can do)
and let’s say that our input data is a series of coin flips where bi is
“flip i was heads”. Then P(bi = true) = P(bi = false) = 1/2. Using the
math in the previous paragraph, P(uniform) = P(bi = true)^8 + P(bi =
false)^8 = 1/128. This means that there is only a 1:128 chance that
you’ll get uniform control-flow and a 127:128 chance that you’ll
end up taking both paths of your if statement. If we increase the
subgroup size to 64 (the maximum among AMD, Intel, and Nvidia), you get
a 1:2^63 chance of having uniform control-flow and a
(2^63 - 1):2^63 chance of executing both halves. If we assume
that the shader takes T time units when control-flow is uniform and 2T
time units when control-flow is non-uniform, then the amortized cost of
the shader for a subgroup size of 8 is 1/128 * T + 127/128 * 2T =
255/128 T and, by a similar calculation, the cost of a shader with a
subgroup size of 64 is (2^64 - 1)/2^63 T. Both of those are within
rounding error of 2T and the added cost of using the massively wider
subgroup size is less than 1%. Playing with the statistics a bit, the
following chart shows the probability of divergence vs. the subgroup
size for various choices of P(bi = true):&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="probability of divergence vs. the subgroup size for various choices of P(divergent)" src="https://www.gfxstrand.net/faith/blog/2020/10/does-subgroup-wave-size-matter/../divergence.png"/&gt;
&lt;figcaption aria-hidden="true"&gt;probability of divergence vs. the
subgroup size for various choices of P(divergent)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;One thing to immediately notice is that because we’re only concerned
about the probability of divergence and not of the two halves of the if
independently, the graph is symmetric (p=0.9 and p=0.1 are the same).
Second, and the point I was trying to make with all of the math above,
is that until your probability gets pretty extreme (&amp;gt; 90%) the
probability of divergence is reasonably high at any subgroup size. From
the perspective of a compiler with no knowledge of the input data, we
have to assume every if condition is a 50/50 chance at which point we
can basically assume it will always diverge.&lt;/p&gt;
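&lt;p&gt;The arithmetic in this section is easy to reproduce. The sketch below uses Python’s fractions module so the results stay exact; it assumes the same IID model and the same “T when uniform, 2T when divergent” cost as the text, and the function names are just for illustration.&lt;/p&gt;

```python
from fractions import Fraction


def p_divergent(p_true, n):
    """Probability that n IID conditions are not all equal."""
    p_true = Fraction(p_true)
    return 1 - p_true ** n - (1 - p_true) ** n


def expected_cost(p_true, n):
    """Expected cost in units of T: T when uniform, 2T when divergent."""
    d = p_divergent(p_true, n)
    return (1 - d) * 1 + d * 2


half = Fraction(1, 2)
print(p_divergent(half, 8))              # 127/128
print(expected_cost(half, 8))            # 255/128
print(float(expected_cost(half, 64)))    # indistinguishable from 2 in floating point

# The symmetry of the chart: p and 1 - p give the same divergence probability.
print(p_divergent(Fraction(1, 10), 8) == p_divergent(Fraction(9, 10), 8))  # True
```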
&lt;p&gt;Instead of only considering divergence, let’s take a quick look at
another case. Let’s say that you have a one-sided if statement (no
else) that is expensive but rare. To put numbers on it, let’s say the
probability of the if statement being taken is 1/16 for any given
invocation. Then P(taken) = P(any(bi = true)) = 1 - P(all(bi = false)) =
1 - P(bi = false)^N = 1 - (15/16)^N. This works out to about 0.4 for a
subgroup size of 8, 0.64 for 16, 0.87 for 32, and 0.98 for 64. The
following chart shows what happens if we play around with the
probabilities of our if condition a bit more:&lt;/p&gt;
&lt;figure&gt;
&lt;img alt="probability of divergence vs. the subgroup size for various choices of P(bi = true)" src="https://www.gfxstrand.net/faith/blog/2020/10/does-subgroup-wave-size-matter/../probability-chart.png"/&gt;
&lt;figcaption aria-hidden="true"&gt;probability of divergence vs. the
subgroup size for various choices of P(bi = true)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;As we saw with the earlier divergence plot, even events with a fairly
low probability (10%) are fairly likely to happen even with a subgroup
size of 8 (57%) and are even more likely the higher the subgroup size
goes. Again, from the perspective of a compiler with no knowledge of the
data trying to make heuristic decisions, it looks like “ifs always
happen” is a reasonable assumption. However, if we have something
expensive like a texture instruction that we can easily move into an if
statement, we may as well. There are no guarantees but if the probability
of that if statement is low enough, we might be able to avoid it at
least some of the time.&lt;/p&gt;
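&lt;p&gt;For the one-sided case, the formula P(taken) = 1 - P(bi = false)^N is a one-liner. Here is a small Python sketch, again assuming IID conditions, that you can use to try other probabilities and subgroup sizes; p_taken is an illustrative name.&lt;/p&gt;

```python
def p_taken(p_true, n):
    """Probability that at least one of n IID invocations takes the branch."""
    return 1 - (1 - p_true) ** n


# The rare-but-expensive branch from the text, taken 1/16 of the time.
for n in (8, 16, 32, 64):
    print(n, round(p_taken(1 / 16, n), 2))

# A 10% event at a subgroup size of 8.
print(round(p_taken(0.10, 8), 2))   # 0.57
```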
&lt;h2 id="statistical-independence"&gt;Statistical independence&lt;/h2&gt;
&lt;p&gt;A keen statistical eye may have caught a subtle statement I made very
early on in the previous section:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Assuming that the conditions are independent identically distributed
(IID) random variables…&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;While less statistically minded readers may have glossed over this as
meaningless math jargon, it’s actually a very important assumption. Let’s
take a minute to break it down. A random variable in statistics is just
an event. In our case, it’s something like “the if condition was true”.
To say that a set of random variables is identically distributed means
that they have the same underlying probabilities. Two coin tosses, for
instance, are identically distributed while the distribution of “coin
came up heads” and “die came up 6” are very different. When combining
random variables, we have to be careful to ensure that we’re not mixing
apples and oranges. All of the analysis above was looking at the
evaluation of a boolean in the same if condition but across different
subgroup invocations. These should be identically distributed.&lt;/p&gt;
&lt;p&gt;The remaining word that’s of critical importance in the IID
assumption is “independent”. Two random variables are said to be
independent if they have no effect on one another or, to be more
precise, knowing the value of one tells you nothing whatsoever about the
value of the other. Random variables which are not independent are said to
be “correlated”. One example of random variables which are very much not
independent would be housing prices in a neighborhood because the first
thing home appraisers look at to determine the value of a house is the
value of other houses in the same area that have sold recently. In my
computations above, I used the rule that P(X and Y) = P(X) * P(Y) but
this only holds if X and Y are independent random variables. If they’re
dependent, the statistics look very different. This raises an obvious
question: Are if conditions statistically independent across a subgroup?
The short answer is “no”.&lt;/p&gt;
&lt;p&gt;How does this correlation and lack of independence (those are the
same) affect the statistics? If two events X and Y are negatively
correlated then P(X and Y) &amp;lt; P(X) * P(Y) and if two events are
positively correlated then P(X and Y) &amp;gt; P(X) * P(Y). When it comes to
if conditions across a subgroup, most correlations that matter are
positive. Going back to our statistics calculations, the probability of
an if condition diverging is 1 - P(all(bi = true)) - P(all(bi = false)) and
P(all(bi = true)) = P(b1 = true and b2 = true and… bN = true). So, if
the data is positively correlated, we get P(all(bi = true)) &amp;gt; P(bi =
true)^N and P(divergent) = 1 - P(all(bi = true)) - P(all(bi = false))
&amp;lt; 1 - P(bi = true)^N - P(bi = false)^N. So correlation for us
typically reduces the probability of divergence. This is a good thing
because divergence is expensive. How much does it reduce the probability
of divergence? That’s hard to tell without deep knowledge of the data
but there are a few easy cases to analyze.&lt;/p&gt;
&lt;p&gt;One particular example of dependence that comes up all the time is
uniform values. Many values passed into a shader are the same for all
invocations within a draw call or for all pixels within a group of
primitives. Sometimes the compiler is privy to this information (if it
comes from a uniform or constant buffer, for instance) but often it
isn’t. It’s fairly common for apps to pass some bit of data as a vertex
attribute which, even though it’s specified per-vertex, is actually the
same for all of them. If a bit of data is uniform (even if the compiler
doesn’t know it is), then any if conditions based on that data (or from
a calculation using entirely uniform values) will be the same. From a
statistics perspective, this means that P(all(bi = true)) + P(all(bi =
false)) = 1 and P(divergent) = 0. From a shader execution perspective,
this means that it will never diverge no matter the probability of the
condition because our entire wave will evaluate the same value.&lt;/p&gt;
&lt;p&gt;What about non-uniform values such as vertex positions, texture
coordinates, and computed values? In your average vertex, geometry, or
tessellation shader, these are likely to be effectively independent.
Yes, there are patterns in the data such as common edges and some
triangles being closer to others. However, there is typically a lot of
vertex data and the way that vertices get mapped to subgroups is random
enough that these correlations between vertices aren’t likely to show up
in any meaningful way. (I don’t have a mathematical proof for this
off-hand.) When they’re independent, all the statistics we did in the
previous section apply directly.&lt;/p&gt;
&lt;p&gt;With pixel/fragment shaders, on the other hand, things get more
interesting. Most GPUs rasterize pixels in groups of 2x2 pixels where
each 2x2 pixel group comes from the same primitive. Each subgroup is
made up of a series of these 2x2 pixel groups so, if the subgroup size
is 16, it’s actually 4 groups of 2x2 pixels each. Within a given 2x2
pixel group, the chances of a given value within the shader being the
same for each pixel in that 2x2 group is quite high. If we have a
condition which is the same within each 2x2 pixel group then, from the
perspective of divergence analysis, the subgroup size is effectively
divided by 4. As you can see in the earlier charts (for which I
conveniently provided small subgroup sizes), the difference between a
subgroup size of 2 and 4 is typically much larger than between 8 and
16.&lt;/p&gt;
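&lt;p&gt;As a rough illustration of the 2x2 effect, the sketch below treats each 2x2 pixel group as a single coin flip, so a subgroup of size N only makes N/4 independent decisions. That perfect correlation within a group is an assumption made for the sake of the example, and p_divergent_quads is a made-up name.&lt;/p&gt;

```python
def p_divergent_quads(p_true, subgroup_size):
    """Divergence probability when each 2x2 pixel group shares one condition."""
    quads = subgroup_size // 4      # independent decisions per subgroup
    return 1 - p_true ** quads - (1 - p_true) ** quads


for size in (8, 16, 32):
    print(size, p_divergent_quads(0.5, size))
```

&lt;p&gt;Comparing these values against the fully independent case shows how much the effective subgroup size shrinks when neighbouring pixels agree.&lt;/p&gt;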
&lt;p&gt;Another common source of correlation in fragment shader data comes
from the primitives themselves. Even if they may be different between
triangles, values are often the same or very tightly correlated between
pixels in the same triangle. This is sort of a super-set of the 2x2
pixel group issue we just covered. This is important because this is a
type of correlation that hardware has the ability to encourage. For
instance, hardware can choose to dispatch subgroups such that each
subgroup only contains pixels from the same primitive. Even if the
hardware typically mixes primitives within the same subgroup, it can
attempt to group things together to increase data correlation and reduce
divergence.&lt;/p&gt;
&lt;h2 id="why-bother-with-subgroups"&gt;Why bother with subgroups?&lt;/h2&gt;
&lt;p&gt;All this discussion of control-flow divergence might leave you
wondering why we bother with subgroups at all. Clearly, they’re a pain.
They definitely are. Oh, you have no idea…&lt;/p&gt;
&lt;p&gt;But they also bring some significant advantages in that the
parallelism allows us to get better throughput out of the hardware. One
obvious way this helps is that we can spend less hardware on instruction
decoding (we only have to decode once for the whole wave) and put those
gates into more floating-point arithmetic units. Also, most processors
are pipelined and, while they can start processing a new instruction
each cycle, it takes several cycles before an instruction makes its way
from the start of the pipeline to the end and its result can be used in
a subsequent instruction. If you have a lot of back-to-back dependent
calculations in the shader, you can end up with lots of stalls where an
instruction goes into the pipeline and the next instruction depends on
its value and so you have to wait 10ish cycles for the previous
instruction to complete. On Intel, each SIMD32 instruction is actually
four SIMD8 instructions that pipeline very nicely and so it’s easier to
keep the ALU busy.&lt;/p&gt;
&lt;p&gt;Ok, so wider subgroups are good, right? Go as wide as you can! Well,
yes and no. Generally, there’s a point of diminishing returns. Is one
instruction decoder per 32 invocations of ALU really that much more
hardware than one per 64 invocations? Probably not. Generally, the
subgroup size is determined based on what’s required to keep the
underlying floating-point arithmetic hardware full. If you have 4 ALUs
per execution unit and a pipeline depth of 10 cycles, then an 8-wide
subgroup is going to have trouble keeping the ALU full. A 32-wide
subgroup, on the other hand, will keep it 80% full even with
back-to-back dependent instructions so going 64-wide is pointless.&lt;/p&gt;
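&lt;p&gt;The numbers in that example fall out of a very simple model, sketched here in Python. It assumes a single chain of back-to-back dependent instructions, that an N-wide instruction occupies the ALUs for N / (number of ALUs) cycles, and that the next instruction cannot start until the pipeline depth has elapsed. Real hardware is far more complicated, and alu_utilization is a hypothetical helper.&lt;/p&gt;

```python
def alu_utilization(subgroup_width, alus=4, pipeline_depth=10):
    """Fraction of cycles the ALUs are busy on a chain of dependent instructions."""
    issue_cycles = subgroup_width / alus    # cycles to feed one instruction through
    return issue_cycles / max(issue_cycles, pipeline_depth)


print(alu_utilization(8))    # 0.2
print(alu_utilization(32))   # 0.8
```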
&lt;p&gt;On Intel GPU hardware, there are additional considerations. While
most GPUs have a fixed subgroup size, ours is configurable and the
subgroup size is chosen by the compiler. What’s less flexible for us is
our register file. We have a fixed register file size of 4KB regardless
of the subgroup size so, depending on how many temporary values your
shader uses, it may be difficult to compile it 16 or 32-wide and still
fit everything in registers. While wider programs generally yield better
parallelism, the additional register pressure can easily negate any
parallelism benefits.&lt;/p&gt;
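&lt;p&gt;To get a feel for that register pressure, here is a back-of-the-envelope Python sketch. It takes the 4KB register file from above and assumes every temporary is a 32-bit value that needs one slot per invocation; uniform values and wider types would change the numbers, and values_per_invocation is an illustrative name.&lt;/p&gt;

```python
REGISTER_FILE_BYTES = 4 * 1024   # fixed size, regardless of subgroup size


def values_per_invocation(subgroup_width, bytes_per_value=4):
    """How many per-invocation values fit in the register file at a given width."""
    return REGISTER_FILE_BYTES // (subgroup_width * bytes_per_value)


for width in (8, 16, 32):
    print(width, values_per_invocation(width))
```

&lt;p&gt;Each doubling of the subgroup size halves the number of temporaries a shader can keep live before it has to spill.&lt;/p&gt;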
&lt;p&gt;There are also other issues such as cache utilization and thrashing
but those are way out of scope for this blog post…&lt;/p&gt;
&lt;h2 id="what-does-this-all-mean"&gt;What does this all mean?&lt;/h2&gt;
&lt;p&gt;This topic came up this week in the context of tuning our subgroup
size heuristic in the Intel Linux 3D drivers. In particular, how should
that heuristic reason about control-flow and divergence? Are wider
programs more expensive because they have the potential to diverge
more?&lt;/p&gt;
&lt;p&gt;After all the analysis above, the conclusion I’ve come to is that any
given if condition falls roughly into one of three categories:&lt;/p&gt;
&lt;ol type="1"&gt;
&lt;li&gt;&lt;p&gt;Effectively uniform. It never (or very rarely ever) diverges. In
this case, there is no difference between subgroup sizes because it
never diverges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Random. Since we have no knowledge about the data in the
compiler, we have to assume that random if conditions are basically a
coin flip every time. Even with our smallest subgroup size of 8, this
means it’s going to diverge with a probability of 99.2%. Even if you
assume 2x2 subspans in fragment shaders are strongly correlated,
divergence still happens with a probability of 50% for SIMD8 shaders,
87.5% for SIMD16, and 99.2% for SIMD32.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Random but very one-sided. These conditions are the type where we
can actually get serious statistical differences between the different
subgroup sizes. Unfortunately, we have no way of knowing when an if
condition will be in this category so it’s impossible to make heuristic
decisions based on it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Where does that leave our heuristic? The only interesting case in the
above three is random data in fragment shaders. In our experience, the
increased parallelism going from SIMD8 to SIMD16 is huge so it probably
makes up for the increased divergence. The parallelism increase from
SIMD16 to SIMD32 isn’t huge but the change in the probability of a
random if diverging is fairly small (87.5% vs. 99.2%) so, all other things
being equal, it’s probably better to go SIMD32.&lt;/p&gt;
</description><guid isPermaLink="true">https://www.gfxstrand.net/faith/blog/2020/10/does-subgroup-wave-size-matter/</guid><pubDate>Tue, 13 Oct 2020 14:36:44 -0700</pubDate></item><item><title>Transform feedback is terrible, so why are we doing it?</title><link>https://www.gfxstrand.net/faith/blog/2018/10/transform-feedback-is-terrible-so-why/</link><description>&lt;h1 id="transform-feedback-is-terrible-so-why-are-we-doing-it"&gt;Transform
feedback is terrible, so why are we doing it?&lt;/h1&gt;
&lt;p&gt;In the latest Vulkan spec update from Khronos (version 1.1.88),
there’s a new extension called VK_EXT_transform_feedback. Some of you
might be thinking, “Finally! Why’d it take them so long to add this
obviously useful feature? It should have been there on day 1.” The
answer to that question is that transform feedback (or streamout in D3D
lingo) is a terrible feature that we all regret putting into OpenGL and
OpenGL ES and we didn’t want that baggage in Vulkan.&lt;/p&gt;
&lt;h2 id="why-is-transform-feedback-terrible"&gt;Why is transform feedback
terrible?&lt;/h2&gt;
&lt;p&gt;Transform feedback didn’t start off terrible. When it was first added
to OpenGL in 2006, it provided some very useful functionality. You could
now take the result of your geometry pipeline and use it for whatever you
wanted. You could read it from the CPU and feed it back into your
physics engine or you could re-use it directly on the GPU and feed it
back into another draw call. In some ways, this was OpenGL’s first form
of compute shaders. Since the only other way to get data out of shaders
prior to transform feedback was glReadPixels and friends, it was a
pretty neat feature.&lt;/p&gt;
&lt;p&gt;The real difficulty with transform feedback is a subtle requirement
that isn’t explicitly stated anywhere in the spec that the data land in
the transform feedback buffer in the same order as the input data. The
OpenGL and Vulkan graphics pipelines are specified in terms of a
theoretical pipeline which is executed one primitive at a time. Even
though a modern GPU has thousands of shader cores all executing in
parallel and potentially out-of-order, the end result has to be as if
they executed serially. This is very important for things such as
blending and depth/stencil testing because those calculations are
potentially non-commutative and you can’t get consistent results without
controlling the order in which those calculations occur. The reality,
however, is that GPUs don’t have to accomplish this by processing the
primitives in-order; the only real requirement is that they process the
blending operations in-order on a per-pixel basis. So, while on one part
of the image, the GPU is blending primitive 17, it may be blending
primitive 182 in some other part of the image.&lt;/p&gt;
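&lt;p&gt;As a rough illustration of why blend order matters, here is a minimal
sketch that blends two half-transparent fragments onto a single-channel
pixel in both orders. The blend equation is the common “source over” one;
the function names and fragment values are made up for the example and
don’t come from any driver.&lt;/p&gt;
&lt;pre&gt;
```python
def blend_over(dst, src_color, src_alpha):
    """Classic 'source over' blend of one channel: src*a + dst*(1 - a)."""
    return src_color * src_alpha + dst * (1.0 - src_alpha)


def blend_in_order(background, fragments):
    """Blend (color, alpha) fragments onto one pixel in the given order."""
    result = background
    for color, alpha in fragments:
        result = blend_over(result, color, alpha)
    return result


# Two hypothetical half-transparent fragments covering the same pixel.
frag_a = (1.0, 0.5)
frag_b = (0.0, 0.5)

a_then_b = blend_in_order(0.0, [frag_a, frag_b])  # 0.25
b_then_a = blend_in_order(0.0, [frag_b, frag_a])  # 0.5
```
&lt;/pre&gt;
&lt;p&gt;Same fragments, different order, different pixel. That is why the
per-pixel order has to be pinned down even though the rest of the work can
happen in whatever order the hardware likes.&lt;/p&gt;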
&lt;p&gt;With transform feedback, you have a similar ordering requirement.
Without this requirement the feature would be almost useless since you
wouldn’t be able to match input data to output data. However, this
requirement is also the feature’s Achilles’ heel. While the
serialization required for blending only occurs at the very end and
happens on a per-pixel basis, the serialization required for transform
feedback happens much earlier in the pipeline and serializes across the
entire draw call and not just per-pixel. In 2006, when the feature was
first added to OpenGL, GPUs still had lots of fixed-function hardware
and very few shader cores. On modern GPUs with thousands of shaders
in-flight at any given time, the primitive ordering requirement becomes
much more painful.&lt;/p&gt;
&lt;p&gt;You may be thinking, “What’s the big deal? You know the order the
data came in, can’t you just write out-of-order but in the right spot in
the buffer?” If only life were that easy… With a simple pipeline
containing only a vertex shader, yes, you can do that. However,
transform feedback also has to interact with geometry and tessellation
shaders which produce an unknown number of primitives. Since transform
feedback is specified using OpenGL’s theoretical serial execution model,
that means that you first get all the primitives resulting from input
primitive 0 followed by all the primitives resulting from input
primitive 1, followed by 2, etc. Because you have no idea up-front how
many output primitives will be produced from any given input primitive
until the entire pipeline has been run, you really do have to wait until
the last shader stage for primitive 41 has been executed before you know
where to put the data resulting from primitive 42. Most desktop GPU
vendors are carrying special hardware just to sort all this out without
running the entire pipeline serially.&lt;/p&gt;
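&lt;p&gt;One way to picture the dependency: the starting offset of each input
primitive’s output is an exclusive prefix sum over the output counts of
every primitive before it. The sketch below is only a CPU-side model with
invented counts, not how any hardware actually does it.&lt;/p&gt;
&lt;pre&gt;
```python
def output_offsets(counts):
    """Exclusive prefix sum: where each input primitive's output starts."""
    offsets = []
    running_total = 0
    for count in counts:
        offsets.append(running_total)
        running_total += count
    return offsets


# Hypothetical number of output primitives produced per input primitive.
counts = [2, 0, 5, 1]
offsets = output_offsets(counts)  # [0, 2, 2, 7]

# If input primitive 1 had produced 3 outputs instead of 0, every later
# primitive's offset would move.
shifted = output_offsets([2, 3, 5, 1])  # [0, 2, 5, 10]
```
&lt;/pre&gt;
&lt;p&gt;No offset can be computed until all the counts before it are known,
and the counts only exist once the geometry or tessellation stage has run
for those primitives.&lt;/p&gt;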
&lt;p&gt;There’s a second issue that has arisen since 2006 which is also
somewhat non-obvious: the rise of tiled architectures. Tiling GPU
architectures have been around for a long time but in 2006, tiling had
fallen out of favor and all three of the GPU vendors implementing OpenGL
were immediate-mode renderers. On a tiled architecture, you frequently
run part of the vertex pipeline up-front to perform the binning step and
then re-run the pipeline a second time per-tile to actually generate all
the information needed by the fragment shader. This means that the
vertex shader may get run multiple times for any particular vertex. It
may sound crazy to do duplicate work like that but it does end up being
more efficient on those architectures and it’s allowed because the
vertex shader doesn’t have any side-effects and the only thing it does
is dump data into the fragment shader. The moment transform feedback is
enabled, all that goes out the window because you have to process all
the primitives in full (can’t drop any output) and in order. This leads
to a significant performance drop because they can no longer play all
their binning games and keep that data on-chip. It’s worth noting that
tiling architectures do run into similar issues without transform
feedback if you enable a geometry or tessellation shader but transform
feedback certainly isn’t helping.&lt;/p&gt;
&lt;p&gt;To sum it all up, transform feedback isn’t as great as it looks on
the surface. In the modern world, we have compute shaders which can do
basically everything that people actually need transform feedback to do.
Want to transform some geometry and feed back into your physics engine?
Use a compute shader. Want to compute some geometry for use in a future
draw call? Use a compute shader. There isn’t nearly as much need for
transform feedback now as there was then. Transform feedback does
provide one bit of functionality over compute shaders which is that you
can generate an arbitrary amount of output data from a single piece of
input data and it’s guaranteed to be in-order. However, that feature
isn’t nearly as useful as it sounds because you can’t figure out where
that piece of data is in the output stream without starting at the
beginning and adding up all the geometry shader outputs.&lt;/p&gt;
&lt;p&gt;In light of the fact that transform feedback is painful to implement,
comes at a significant performance cost on some architectures, and
doesn’t provide significant functionality over compute shaders, we
decided not to put it in Vulkan. This is a decision I supported then and
I still support now. It should be considered legacy functionality and
not used in new software.&lt;/p&gt;
&lt;h2 id="so-why-are-we-implementing-it"&gt;So why are we implementing
it?&lt;/h2&gt;
&lt;p&gt;Hopefully, the last section convinced you that transform feedback is
a terrible legacy feature and doesn’t belong in a modern graphics API.
The question then naturally arises, “Why are we adding it now?”&lt;/p&gt;
&lt;p&gt;The answer is API translation. Over the course of the last year, many
projects have arisen which attempt to translate other graphics APIs to
Vulkan: DXVK, VKD3D, ANGLE, Zink, and GLOVE just to name a few. One
thing that’s common among all of them is that the API which they are
attempting to translate has some form of transform feedback. There are
also tools such as RenderDoc that use transform feedback to capture the
result of the geometry pipeline for debugging purposes.&lt;/p&gt;
&lt;p&gt;For simple geometry pipelines containing only vertex shaders or where
you can statically determine the number of primitives produced by the
geometry shader, there are other options. If the Vulkan implementation
supports the vertexPipelineStoresAndAtomics feature, you can simply add SSBO
writes to the last shader stage and compute the offset in the buffer to
write based on gl_VertexIndex or gl_PrimitiveID. If the implementation does
not support SSBO writes from vertex and geometry shaders, you can still
fairly easily translate it into a compute shader at fairly little cost.
For the more complex geometry and tessellation shader cases, however,
the ordering guarantees come into play and cause significant
headaches.&lt;/p&gt;
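&lt;p&gt;To see why the simple case is easy, here is a small CPU-side model
(not shader code) of the SSBO approach when the output size per primitive
is fixed. Each simulated invocation computes its destination purely from
its own primitive ID, so the order in which invocations finish doesn’t
matter. The constant and function names are assumptions for the
example.&lt;/p&gt;
&lt;pre&gt;
```python
import random

VERTS_PER_PRIM = 3  # assumed fixed output size per input primitive


def run_out_of_order(num_prims, seed=0):
    """Simulate invocations finishing in a shuffled order, each writing to
    a slot computed only from its own primitive ID."""
    buffer = [None] * (num_prims * VERTS_PER_PRIM)
    order = list(range(num_prims))
    random.Random(seed).shuffle(order)
    for prim_id in order:
        base = prim_id * VERTS_PER_PRIM  # no dependence on other primitives
        for v in range(VERTS_PER_PRIM):
            buffer[base + v] = (prim_id, v)
    return buffer


result = run_out_of_order(4)
```
&lt;/pre&gt;
&lt;p&gt;Whatever the shuffle, the buffer ends up in input order. Once the
per-primitive output count varies, the base offset can no longer be
computed locally like this.&lt;/p&gt;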
&lt;p&gt;Initially, our answer to these complex use-cases was the same as our
answer to new application developers: “Use compute shaders.” While
compute shaders are a better fit for most applications, taking an entire
geometry pipeline which has already been described in terms of vertex,
tessellation, and geometry shaders and translating that into a compute
shader is a giant pain. Such a translation is also likely to be
significantly slower than what the GPU’s dedicated hardware can do. If
we were only looking at one or two translation layers and we didn’t care
about performance, that would likely still be the answer. However, with
people wanting to run D3D games at full frame-rate on Vulkan via layers
like DXVK and VKD3D, that’s not really a good answer.&lt;/p&gt;
&lt;p&gt;In the end, then, we decided that the functionality was needed badly
enough that we begrudgingly drafted the extension and accepted the
burden of supporting the legacy functionality. As is explicitly stated
in the extension text, the intention is that VK_EXT_transform_feedback
will likely never become core Vulkan functionality and that new applications
(or even Vulkan ports of old ones) should find some other way to
transform geometry on the GPU. However, for those cases where it really
is needed, the functionality is now there.&lt;/p&gt;
</description><guid isPermaLink="true">https://www.gfxstrand.net/faith/blog/2018/10/transform-feedback-is-terrible-so-why/</guid><pubDate>Sat, 13 Oct 2018 19:37:59 -0700</pubDate></item><item><title>The quest for known behavior</title><link>https://www.gfxstrand.net/faith/blog/2018/10/the-quest-for-known-behavior/</link><description>&lt;h1 id="the-quest-for-known-behavior"&gt;The quest for known behavior&lt;/h1&gt;
&lt;p&gt;This post is one I’ve been meaning to write for a while to explain my
personal philosophy about designing, testing, and tooling APIs to
provide the best experience for the implementers and users of that
API.&lt;/p&gt;
&lt;p&gt;In my position on Intel’s Linux 3D driver team, I see the way this
all plays out from multiple angles. As a member of the Khronos Vulkan
working group, I am one of the many spec authors and get my hands dirty
with the minutiae of exactly how all the various bits of the API are
specified to work. As a driver author, I see how we implement the APIs
and all of the various corner cases where things can go wrong. As
someone who debugs game issues and communicates with game developers, I
see the pain of debugging issues in applications and drivers that cause anything
from rendering errors to full system crashes. One of those is obviously
worse than the other but neither leads to happy users.&lt;/p&gt;
&lt;p&gt;My objective as a spec author and driver developer is to make the
Vulkan specification the best it can be and provide the best experience
possible for both game developers and the users who enjoy playing their
games. So how do we go about accomplishing this?&lt;/p&gt;
&lt;h2 id="what-is-undefined-behavior"&gt;What is undefined behavior?&lt;/h2&gt;
&lt;p&gt;Fundamentally, an API specification like the Vulkan specification is
a contract between the client and the implementation that if the client
does X, Y, and Z, then the implementation will do A, B, and C. The
difficult part is what happens when that contract is broken. In Vulkan,
any misuse of the API on the part of the client results in what we call
“undefined behavior.” Here’s a short quote from the Vulkan 1.1
specification:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The core layer assumes applications are using the API correctly.
Except as documented elsewhere in the Specification, the behavior of the
core layer to an application using the API incorrectly is undefined, and
may include program termination.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The consequences of misusing Vulkan are pretty bad. “May include
program termination” means that using the API wrong may cause your
program to crash or the kernel to decide to kill it. This sits in stark
contrast to OpenGL where the worst that happens for most common
programming errors is that whatever function you just called harmlessly
sets an error code and does nothing. Almost worse than the program
crashing is that the undefined behavior may be that it works perfectly
and the developer remains blissfully unaware of the problem until
someone runs the application on a different Vulkan implementation and it
immediately crashes.&lt;/p&gt;
&lt;h2 id="how-can-we-avoid-undefined-behavior"&gt;How can we avoid undefined
behavior?&lt;/h2&gt;
&lt;p&gt;How can anyone write software against an API that provides no
feedback about errors and where the consequences for violating any one
of the specification’s more than four thousand “valid usage” statements
are so dire? For that, we have a set of what we call “validation layers”
which do piles of error checking to inform the developer when they are
in violation of their side of the API contract. In theory, if the
validation layers give the application the green light then it’s
fulfilling its side of the contract and will get correct rendering.&lt;/p&gt;
&lt;p&gt;There is a second issue here which comes from the other side of the
API. The specification is a contract and we also have to ensure that the
implementation (driver) lives up to its side of the bargain. For that,
we have a conformance test suite (CTS) that vendors are
required to run and pass before they can claim that what they have is a
Vulkan driver. These tests attempt to test a broad cross-section of the
API to give some sense of security that the driver is, indeed,
implementing it correctly. In theory, if you pass the conformance test
suite then any application which uses the Vulkan API correctly will
render correctly on your implementation.&lt;/p&gt;
&lt;p&gt;Those are both nice theories but we know that theory and practice are
often two different things. That only works if both the validation
layers and the conformance test suites are perfect. The reality,
however, is that not every corner of API validity is covered by
validation. On the implementation side, when you consider both software
and hardware, the complexity of the implementation is such that perfect
test coverage is impossible to achieve.&lt;/p&gt;
&lt;p&gt;I think, by now, I’ve probably successfully convinced you that making
this whole mess work reliably is a hard problem. In fact, it’s
impossible. Before we get too depressed about the future or lost in the
details of validation and conformance testing, let’s take a step back
and look at the big picture again.&lt;/p&gt;
&lt;h2 id="the-quest-for-known-behavior-1"&gt;The quest for known
behavior:&lt;/h2&gt;
&lt;p&gt;At the end of the day, what do our customers want? In particular,
what do the software developers who write applications that use the Vulkan
API want? It’s really very simple: they want to know that their
application will run correctly and perform well on their user’s
computer. They don’t care that every possible theoretical correct Vulkan
program runs correctly on implementation A. They also don’t really care
that Vulkan application B works correctly on every theoretically
possible correct Vulkan implementation. They care that their application
will run correctly and perform well on their user’s computer.&lt;/p&gt;
&lt;p&gt;In other words, they want what I call “known behavior”. They want to
know that their application will run as intended. Ensuring this in
general is still an impossible task but keeping the correct perspective
helps us prioritize so that we can come as close as possible to the real
goal of happy customers. Our goal as spec authors, driver developers,
test writers, and validation layer developers should be to ensure that,
if an application passes validation, then the developer has a pretty
good idea that it will actually work when deployed in the wild.&lt;/p&gt;
&lt;h2 id="where-do-we-go-wrong"&gt;Where do we go wrong?&lt;/h2&gt;
&lt;p&gt;The goal I stated in the previous paragraph sounds obvious, but it’s
amazingly easy to get so caught up in the details that you forget the
big picture. Let me give two examples.&lt;/p&gt;
&lt;p&gt;First, let’s look at the group of CTS tests called
dEQP-VK.pipeline.stencil. There were around 16,000 tests in this test
group that test every possible combination of depth/stencil image format
and stencil pass/fail and depth fail op in the API. On the face of
things, this sounds like fantastic coverage because it covers all
combinations of some things. However, when Vulkan 1.0 was released, this
was about 10% of the tests in the CTS, the whole lot caught exactly one
bug in our driver, and it took three lines of code to fix it. Meanwhile,
there was not a single test in the CTS which tested using depth or
stencil on a mip level or array slice other than zero nor were there any
tests for different clear operations nor were there any multisampled
depth/stencil tests. So, while we had tens of thousands of stencil
tests, they exhaustively covered one tiny corner of the API and
left vast swaths completely untested.&lt;/p&gt;
&lt;p&gt;A second example is SPIR-V testing. The SPIR-V spec is on the same
order of complexity (as far as combinatorial explosions go) as the
Vulkan spec itself. It’s a very general spec which specifies a binary
language for the exchange of shaders between the Vulkan application and
driver. Because no one likes to write SPIR-V directly, the choice was
made early on to write most CTS shaders in GLSL and use GLSLang to
compile them to SPIR-V. We also wrote a few hundred tests directly in
SPIR-V to test various control-flow conditions that weren’t likely to be
generated by GLSLang. The result was that the CTS does a pretty good job
of testing that implementations can consume the subset of SPIR-V that’s
produced by GLSLang. However, as people have started developing SPIR-V
compilers for other languages such as HLSL and OpenCL C, we’ve
discovered that driver quality is not so good the moment you step off of
the path of what’s generated by GLSLang.&lt;/p&gt;
&lt;p&gt;The point of those two examples is not to poke fun at any particular
person or to make you think that the state of Vulkan testing is bad.
Quite the contrary, I feel like the state of Vulkan testing is actually
pretty good these days (it was very bad at first) and it only keeps
improving. The point is to show how easy it can be to leave giant gaping
test coverage holes if you aren’t careful.&lt;/p&gt;
&lt;h2 id="how-can-we-achieve-known-behavior"&gt;How can we achieve known
behavior?&lt;/h2&gt;
&lt;p&gt;We can’t actually get there; not really. However, we can make strides
in that direction and we can actually get pretty close if we keep the
real goal in focus. How do we do that? There are some basic guiding
principles that I use when writing spec, working on the driver, or
developing tests to help keep my priorities in order and keep focused on
the ultimate goal of happy users:&lt;/p&gt;
&lt;ol type="1"&gt;
&lt;li&gt;&lt;p&gt;Write specifications that are clear and easy to validate. In the
Vulkan specification, we make it easy to validate by describing as much
of the client side of the contract as we can in terms of simple “valid
usage” statements which are straightforward to turn into validation
code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep the API surface small and easy to test. The more different
ways you have to do a particular thing, the more different combinations
you end up with. For example, it works differently in our implementation
when you clear with LOAD_OP_CLEAR vs. LOAD_OP_DONT_CARE followed by
vkCmdClearAttachments vs. vkCmdClearColorImage followed by LOAD_OP_LOAD.
Throw in multi-sampling, depth, and stencil, and you have a testing
nightmare. In the case of clears, all those mechanisms are there for
good reasons but they come at the cost of a higher testing burden. When
you can make the API simpler, you should as it reduces the testing
burden.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Watch out for edge cases. This applies to all areas:&lt;/p&gt;
&lt;ol type="1"&gt;
&lt;li&gt;&lt;p&gt;When writing the spec, try to design edge cases out. It can
sometimes be tempting to start off with something with lots of edge
cases and then try to fix them one by one. Often, it’s better to step
back and rework the spec or implementation and structure it in such a
way that it has fewer edge cases by design.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When implementing the API, try to design your software with the
right level of generality so there are fewer internal edge-cases that
need testing. It also often helps to have fewer layers and abstractions
that interact in strange ways which can lead to more edge
cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When implementing the API, watch out for edge cases and ensure
they are tested. Only someone actively working on our driver would
understand all of the different image clearing paths we have and
know that separately testing rendering to an image and texturing from an
identical image doesn’t actually cover the case of rendering,
transitioning to SHADER_READ_ONLY_OPTIMAL, and then texturing. Whenever
I’m implementing a new feature, I actively pay attention to places where
I know it could go wrong and ensure that there are tests in the CTS
which test those cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When writing tests, look for all the non-obvious combinations.
It’s impossible to test every combination of everything. However, it’s
better to test a lot of different types of combinations than to
exhaustively test one tiny corner. See also my story about the stencil
tests.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Test everything. This really should go without saying, but
there’s no excuse for having a feature that simply isn’t tested at all.
It doesn’t matter how small it is or how it’s classified, or how many
people are implementing or using it, it needs to be tested. Our team
makes it a policy that nothing lands in our driver without independent
tests that can be run in our CI system. It doesn’t matter if an
application uses the feature successfully so you know it works; the
tests need to run in CI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write tests/validation for bugs. Every time you find a bug in an
application, it’s something the validator didn’t catch. Every time you
find a bug in a driver, it’s something the CTS didn’t catch. Take
advantage of the opportunity when bugs arise, to identify the testing or
validation hole which allowed that bug to creep through and fix
it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Testing and validation aren’t and never will be perfect. However,
with a little care, we can get pretty close to a state of known
behavior. As I said above, the state of Vulkan testing and validation
today is miles ahead of where it was two and a half years ago. When our
driver first shipped, it passed the entire CTS and couldn’t render
either of the two available games correctly. Today, users are constantly
running random Vulkan applications that we (the driver team) have never
seen before with good results. That’s known behavior!&lt;/p&gt;
</description><guid isPermaLink="true">https://www.gfxstrand.net/faith/blog/2018/10/the-quest-for-known-behavior/</guid><pubDate>Thu, 04 Oct 2018 18:15:15 -0700</pubDate></item><item><title>Optimizing DXVK apps</title><link>https://www.gfxstrand.net/faith/blog/2018/09/optimizing-dxvk-apps/</link><description>&lt;h1 id="optimizing-dxvk-apps"&gt;Optimizing DXVK apps&lt;/h1&gt;
&lt;p&gt;One of the recent happenings in the world of Linux graphics is the rise
of DXVK. For those who don’t know, DXVK is a translation layer which
translates D3D11 and D3D10 API calls to Vulkan. It’s intended to be used
together with Wine to allow more Windows game titles to run directly on
Linux without modification. Wine already has a D3D10/11 to OpenGL
translator but DXVK has generally better performance and compatibility
than what is built into core Wine.&lt;/p&gt;
&lt;p&gt;For Linux gamers, this has meant a wealth of new titles to play on
their favorite operating system. For driver developers, it means more
workloads which have different shaders and API usage patterns. This
means more bugs and more opportunities for performance optimization.
While a lot of stuff works fine and performs very well out-of-the-box,
we’ve gotten a handful of new GPU hangs and other issues reported. Much
of the work I’ve done over the course of the last three months or so has
been focused around fixing or improving the performance of games running
under DXVK.&lt;/p&gt;
&lt;p&gt;Because bug fixing is boring, let’s talk about making games
faster!&lt;/p&gt;
&lt;h2 id="skyrim-special-edition"&gt;Skyrim Special Edition&lt;/h2&gt;
&lt;p&gt;One of the first titles I tested on DXVK (the third, if I recall
correctly) was The Elder Scrolls V: Skyrim Special Edition. When I first
fired the game up, there were two immediately obvious problems:
everything was green (this turned out to be a DXVK bug) and it was a
slide-show. I don’t recall the details exactly but it may have been in
the seconds-per-frame range. While Skyrim may have once been considered
graphically intensive, that was a long time ago and I knew we could do
better.&lt;/p&gt;
&lt;p&gt;The first thing I did to try and narrow down the problem was to use
RenderDoc to capture a frame of the game so I could inspect it
draw-by-draw. Even though RenderDoc doesn’t have actual performance
counter support yet, it does use timestamps to tell you how long each
draw takes. I was quickly able to identify a particular draw call that
was dominating the frame render time even though it was just rendering a
quad with some shading.&lt;/p&gt;
&lt;p&gt;With a bit more work, I was able to isolate the offending shader and
look at the assembly. The shader was an ambient occlusion shader which
had a couple of large constant arrays in the shader which it used as a
look-up table for part of the calculation. Due to the size of the
arrays, they were taking considerable shader resources and causing a
large amount of spilling in the shader. Also, since they were accessed
indirectly, we were generating large if-ladders for accessing them.&lt;/p&gt;
&lt;p&gt;Isn’t this a fairly obvious thing we should be optimizing? Yes, and
we have been in OpenGL. Unfortunately, the optimization pass for this
lives at the GLSL IR level and not in NIR so the SPIR-V path can’t take
advantage of it. Using more-or-less the same idea as the GLSL IR pass, I
wrote a NIR pass which pulls large constant arrays out into a blob of
constant data associated with the shader which we then turn into a UBO
in the Vulkan driver. The optimization successfully got rid of all of
the spilling in that and similar shaders, reduced the time required for
that draw by 99.6% (no joke!), brought the framerate from slide-show to
nicely playable and roughly in-line with the performance of the same
game under native D3D11.&lt;/p&gt;
&lt;p&gt;This all goes to show that sometimes the difference between garbage
performance and good performance is just that one tiny thing you were
missing all along.&lt;/p&gt;
&lt;h2 id="batman-arkham-city"&gt;Batman: Arkham City&lt;/h2&gt;
&lt;p&gt;Some time later, a user was complaining on the DXVK issue tracker
about GPU hangs with Batman: Arkham City on Intel. How I fixed the hangs
is a very boring story but, while I was looking at GPU error states
trying to figure out the hangs, I noticed that the tessellation shaders
were spilling like mad. (As it turns out, that had nothing to do with
the hangs and our spilling was working perfectly.)&lt;/p&gt;
&lt;p&gt;Why were they spilling so badly? The problem turned out to be
the shadow variables that DXVK was creating for inputs. There are
very good reasons why it creates these shadows, which have to do with
differences between the D3D shader interface and Vulkan. However, our
compiler was having difficulty eliminating them and so we were storing
4K of temporary data, which blew out the register file and made us
spill like mad. The pattern in DXVK looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="pygments-k"&gt;layout&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;location&lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;in&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec3&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;v0&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;3&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;
&lt;span class="pygments-k"&gt;layout&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;location&lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-mi"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;in&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;v1&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;3&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;
&lt;span class="pygments-k"&gt;layout&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;location&lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;out&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;oVertex&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;3&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mi"&gt;32&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;

&lt;span class="pygments-kt"&gt;vec4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;shader_in&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;3&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mi"&gt;32&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;

&lt;span class="pygments-kt"&gt;void&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;hs_main&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;()&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;oVertex&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-nb"&gt;gl_InvocationId&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xyz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;shader_in&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-nb"&gt;gl_InvocationId&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xyz&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;oVertex&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-nb"&gt;gl_InvocationId&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mi"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xy&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;shader_in&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-nb"&gt;gl_InvocationId&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mi"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xy&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-c1"&gt;// Do some other stuff&lt;/span&gt;
&lt;span class="pygments-p"&gt;}&lt;/span&gt;

&lt;span class="pygments-kt"&gt;void&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;main&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;()&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;shader_in&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xyz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;v0&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;shader_in&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xyz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;v0&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;shader_in&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;2&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xyz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;v0&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;2&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;shader_in&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mi"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xy&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;v1&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;shader_in&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mi"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xy&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;v1&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;shader_in&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;2&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mi"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xy&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;v1&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;2&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;

&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;hs_main&lt;/span&gt;&lt;span class="pygments-p"&gt;();&lt;/span&gt;
&lt;span class="pygments-p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To deal with it, I wrote a series of four optimizations
that chew through the above mess and turn it into, effectively,
this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="pygments-k"&gt;layout&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;location&lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;in&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec3&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;v0&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;3&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;
&lt;span class="pygments-k"&gt;layout&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;location&lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;in&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec2&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;v1&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;3&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;
&lt;span class="pygments-k"&gt;layout&lt;/span&gt;&lt;span class="pygments-p"&gt;(&lt;/span&gt;&lt;span class="pygments-n"&gt;location&lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;)&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-k"&gt;out&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-kt"&gt;vec4&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;oVertex&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-mi"&gt;3&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mi"&gt;32&lt;/span&gt;&lt;span class="pygments-p"&gt;];&lt;/span&gt;

&lt;span class="pygments-kt"&gt;void&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;main&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;()&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-p"&gt;{&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;oVertex&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-nb"&gt;gl_InvocationId&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mo"&gt;0&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xyz&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;v0&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-nb"&gt;gl_InvocationId&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xyz&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-n"&gt;oVertex&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-nb"&gt;gl_InvocationId&lt;/span&gt;&lt;span class="pygments-p"&gt;][&lt;/span&gt;&lt;span class="pygments-mi"&gt;1&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xy&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-o"&gt;=&lt;/span&gt;&lt;span class="pygments-w"&gt; &lt;/span&gt;&lt;span class="pygments-n"&gt;v1&lt;/span&gt;&lt;span class="pygments-p"&gt;[&lt;/span&gt;&lt;span class="pygments-nb"&gt;gl_InvocationId&lt;/span&gt;&lt;span class="pygments-p"&gt;].&lt;/span&gt;&lt;span class="pygments-n"&gt;xy&lt;/span&gt;&lt;span class="pygments-p"&gt;;&lt;/span&gt;
&lt;span class="pygments-w"&gt;    &lt;/span&gt;&lt;span class="pygments-c1"&gt;// Do some other stuff&lt;/span&gt;
&lt;span class="pygments-p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Not only are the temporary arrays gone, but the array access
indexed by gl_InvocationId now lands directly on an input variable
rather than on a temporary. It’s much easier for our hardware to do an
indirect access on a vertex input than on a temporary so, again, we
dropped the if-ladders and almost all of the spilling.&lt;/p&gt;
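&lt;p&gt;As a rough illustration of why this rewrite is safe, here is a toy
model in Python. It is not GLSL and not the Mesa implementation; the
function names &lt;code&gt;before&lt;/code&gt; and &lt;code&gt;after&lt;/code&gt; and the input
values are made up for the sketch. It only shows that indexing the inputs
directly gives the same result as copying them into a temporary array
first and indexing that.&lt;/p&gt;

```python
# Toy Python model (not GLSL, not Mesa code) of the two shader shapes.
# v0 and v1 stand in for the per-vertex inputs; the values are made up.


def before(v0, v1, invocation_id):
    """Copy every input into a temporary array, then index the temporary."""
    shader_in = [[None, None] for _ in range(3)]
    for i in range(3):
        shader_in[i][0] = v0[i]
        shader_in[i][1] = v1[i]
    # The indirect (invocation-dependent) access lands on the temporary.
    return shader_in[invocation_id][0], shader_in[invocation_id][1]


def after(v0, v1, invocation_id):
    """Index the inputs directly; the temporary array is gone."""
    return v0[invocation_id], v1[invocation_id]


v0 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
v1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Both shapes must agree for every invocation.
for invocation_id in range(3):
    assert before(v0, v1, invocation_id) == after(v0, v1, invocation_id)
```

&lt;p&gt;The model says nothing about performance. It only checks that the
two forms compute the same values, which is the property the
optimizations have to preserve.&lt;/p&gt;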
&lt;p&gt;The improvement in Batman: Arkham City wasn’t nearly as dramatic as
in Skyrim, but the game’s built-in benchmark still showed around a 15%
FPS increase.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;So what’s the moral of the story? It’s not that bad shaders or
spilling are the root of all performance problems. (I could just as
easily tell you stories of badly placed HiZ resolves.) It’s that
sometimes big performance problems are caused by small things (which
doesn’t mean they’re easy to find!). It’s also that we (the developers
on the Intel Mesa team) care about Linux gamers and are hard at work
trying to make our open-source Vulkan and OpenGL drivers the best they
can be.&lt;/p&gt;
</description><guid isPermaLink="true">https://www.gfxstrand.net/faith/blog/2018/09/optimizing-dxvk-apps/</guid><pubDate>Sun, 16 Sep 2018 23:45:58 -0700</pubDate></item></channel></rss>