>2. Double precision is unlocked on AMD cards, Nvidia "consumer" cards (that includes everything up to the 40-fucking-90 Ti!) will do double, but nerf the performance for no other reason than Nvidia wanting more money. The chips can do it, the just won't.
The chips can't actually do it. They stripped the SMs down to a couple of fp64 units each because the die area wasn't worth it, which leaves the chips at a 1:64 fp64:fp32 ratio; the architecture docs say the remaining units are only there so software that expects the type to exist still runs. Similarly, one of the two compute datapaths per SM partition can issue either int32 or fp32 in a given cycle, but not both, so int32 always runs at half rate and eats into fp32 throughput.

NVidia (for the most part) doesn't operate as a "real" vectorized parallel processor the way the AMD GPUs do: if you check clinfo on a 7900XTX and a 4090 you'll get preferred vector widths (32b:16b:8b) of 1:2:4 on the AMD card and 1:1:1 on the NVidia card, because the regular CUDA instructions only really operate on 32-bit values. Some of the AMD instructions are 2/clock as well. That's part of the reason NVidia needs twice as many stream processors to get its base fp32 numbers roughly 1.5x higher than AMD's.

The ~200TF of dedicated raytracing hardware, which AMD doesn't have at all (they just have schedulers), is why NVidia currently beats the pants off them in workstation loads, more than any lack of software support. I own both cards and tested against AMD's own open-source physically accurate renderer, which doesn't even use the raytracing cores yet, and it's still almost twice as fast on NVidia. LuisaRender, which uses OptiX but also has a DirectX backend, shows an even bigger gap; I tested the DirectX backends against each other there. If all you're doing is gaming, buying a 4090 is probably sheer stupidity, but for 3D work, if I can GPU-render a frame in 1 minute instead of 4 at the same power draw, it's kind of a no-brainer to spend the extra 1.5x on the GPU, especially when it gets things fast enough that I can render the viewport with Karma XPU in Houdini and know more or less exactly what the final render will look like while still interactively modifying things.
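If you want to check the clinfo claim yourself, here's a minimal sketch of the query it does under the hood; this is just the stock OpenCL host API (nothing vendor-specific), printing the preferred vector widths for every GPU it finds:

```cpp
// Query preferred vector widths (what clinfo reports) for every OpenCL GPU.
// Core OpenCL 1.x host API only; link with -lOpenCL.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

    for (cl_platform_id p : platforms) {
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_GPU, 0, nullptr, &num_devices) != CL_SUCCESS)
            continue;  // platform has no GPUs
        std::vector<cl_device_id> devices(num_devices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_GPU, num_devices, devices.data(), nullptr);

        for (cl_device_id d : devices) {
            char name[256] = {0};
            cl_uint w_float = 0, w_half = 0, w_char = 0;
            clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(name), name, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                            sizeof(w_float), &w_float, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF,
                            sizeof(w_half), &w_half, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR,
                            sizeof(w_char), &w_char, nullptr);
            printf("%s: 32b=%u 16b=%u 8b=%u\n", name, w_float, w_half, w_char);
        }
    }
    return 0;
}
```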
The tensor cores are a different story: although they don't publish throughput numbers for it, there's a single FP64 matrix-FMA variant listed in the instruction set manuals (a 16x16 shape) with no mention of it taking any longer than the rest of the tensor instructions, so there's more fp64 speed available than it looks, assuming you can coax your math into those small matrix tiles.
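For what it's worth, the fp64 shape the CUDA C++ WMMA API exposes is 8x8x4 (the raw hardware instruction may tile differently from what the manuals list), and it's documented for compute capability 8.0 and up, so whether the consumer parts run it at full rate is exactly the open question. A minimal one-warp sketch of D = A*B + C through that API:

```cpp
// One warp issuing an fp64 tensor-core MMA through the WMMA API.
// The fp64 fragment shape is m8 n8 k4; compile with nvcc -arch=sm_80 (or newer).
#include <mma.h>
using namespace nvcuda;

__global__ void dmma_tile(const double* A,   // 8x4 tile, row-major
                          const double* B,   // 4x8 tile, column-major
                          double* C) {       // 8x8 accumulator, row-major
    wmma::fragment<wmma::matrix_a, 8, 8, 4, double, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 8, 8, 4, double, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 8, 8, 4, double> acc;

    wmma::fill_fragment(acc, 0.0);            // start the accumulator at zero
    wmma::load_matrix_sync(a, A, 4);          // leading dimension = 4 columns
    wmma::load_matrix_sync(b, B, 4);          // leading dimension = 4 rows
    wmma::mma_sync(acc, a, b, acc);           // acc = A*B + acc on the tensor core
    wmma::store_matrix_sync(C, acc, 8, wmma::mem_row_major);
}
// Launch with a single warp, e.g. dmma_tile<<<1, 32>>>(dA, dB, dC);
```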
AMD is more guilty of software-locking the 1:2 fp64 ratio on consumer Vega hardware, if anything. IIRC it was enabled on one of the cards for a while at launch, then a driver update removed it, and people started cross-flashing them to workstation firmware to get it back.
I don't think anything since Vega has had the 1:2 ratio available even in unlocked form. I can't recall where I read it, but somebody interviewed them and they said the hardware compromises weren't worth it: they already need to dedicate something like 64K of registers per stream processor to keep things flowing, and SRAM that fast isn't cheap. The dual-issue approach lets them keep the registers the same width and have the compiler emit dual-issue instructions instead of 2-wide vector instructions, which isn't much harder and keeps the hardware prices... well, let's face it, still incredibly expensive.

AMD has been going up in price every generation, while top-end NVidia consumer pricing has been coming down since the Titan, since those were always luxury items / introductory workstation cards. Most motherboards can only run one GPU at full speed anymore, even though CPUs have advanced to the point that many machines could serve as workstations, so there's a market for a single huge fast-clocked card versus the usual multiple slimmer ones that draw reasonable power. (I was surprised too, but the 4090 is the lowest-MSRP x090-series card they've made, by a good amount.)

CDNA is 1:1, but that's a different architecture entirely; it has better support for branched code and real matrix hardware, among other things.
The current high-end consumer card FP64 numbers are:

7900XTX - 1.919 TFLOPS
RTX4090 - 1.290 TFLOPS

Which isn't much of a difference.
The cheapest, fastest FP64 cards I'm aware of that are actually affordable at the moment are both older, but if that's what you need, they're functional:

NVidia Tesla V100 FHHL 16GB HBM2 - this FHHL model has been showing up on eBay for $500-600 and has a 1:2 FP64/FP32 ratio. It won't be fast for much else, but it does 6.605 TF FP64. It's clocked lower than the other Tesla models, and anything above 16GB is too expensive except the SXM2 module versions, which are dirt cheap precisely because motherboards for them are nearly impossible to find.

The Radeon Instinct MI100 is showing up at $1000-1200 on eBay; it's CDNA, so more modern, has actual matrix hardware, and does 11.54 TF FP64.

It also has a nightmarish flaw that will make you want to flee from it: the only drivers AMD lists for it are for VMware vSphere ESXi. There's a free version of that, but it's CLI-only, and from talking to friends, managing it even with the GUI is bad enough.

The Instinct MI50 and MI60 are a bit faster than the V100 and are Vega 20 based, but I'd be wary of buying anything Vega-series used given how many were snapped up and mined to death during that whole mess; I haven't looked at their prices either. Radeon Pro VIIs are also 1:2 and seem to go for about the same as the V100 I mentioned, at roughly the same FP64 speed, so that's an option if you trust the seller.
I currently have a 7900XTX and an RTX4090 running side by side in my system, and I wouldn't have touched NVidia with a 10-foot pole until last year because of their prior deceptive advertising. Then I looked at some numbers for what I was doing and realized they'd finally cut the advertising BS and were way farther ahead than I'd realized. So I'm not particularly trying to defend them, just pointing out that it's a bit more complicated than them only putting fp64 on expensive hardware.

The 7900 is still in the machine, but not for much longer. Some of the basic features the card was sold with _still_ haven't been implemented on Windows: the "AI cores" (really WMMA schedulers), and the raytracing accelerators, which only work in games (poorly) and in some specific build of Blender I don't use. They're missing essential libs required for Torch to produce a build that can use the matrix acceleration, and they only have halfway-functional support for HIP (which is just a CUDA compatibility shim anyway); that should have had day-1 support, IMO. This was a very good way for a company I liked to piss me off. They've pulled this before with the geometry acceleration hardware on Vega, which just mysteriously stopped being mentioned after a year of no driver support and turned out to be flawed hardware. I didn't really care then because I wasn't doing anything that would benefit from it, and it was a minor feature compared to these two.

I'm planning to sell the thing for what I can get and replace it with a second 4090, or just stay on a single card since the performance gains aren't linear anyway. I still need to test the system with the 7900XTX as primary video and the 4090 in the compute-only TCC driver mode to see if that accomplishes anything, but I get the feeling it'll cause too many issues with things like the viewport Karma render, which will take a large latency hit if frames have to be copied between GPUs, or won't work at all if OpenGL is involved at any stage.
>Assuming we just take the standard library of UF formulas, one would need to program shaders for each of the 20-40 fractals and coloring algorithms, which would probably be quite doable, even if tedious. But there's hundreds if not thousands of user formulas which may use features not available in shaders, like classes.
Right, but Ultra Fractal _compiles_ those. They didn't compile to ARM either, but I'll go out on a limb and say Frederik didn't run off and hand-write assembly for every user fractal to make them work in the new update. I mean, in a sense he did, since that's what the fractal compiler is doing now: some lucky bastard gets to write the part that converts everything to assembly, or interprets it / JITs it to some kind of bytecode that it can then run optimally against a table of bytecode-to-machine-code mappings. But the compiler only needs to know how to handle the things it's capable of doing on the architectures it targets.
I don't know what its compiler looks like; I'm assuming it's home-cooked, since Frederik was working on the ARM version for a while. (Don't worry, they'll be switching to RISC-V without telling anybody in advance soon, I'm sure.) The language it uses isn't complex and could be ported to a compiler system like LLVM as a frontend. Once you get LLVM generating IR, you can literally take IR files that weren't even built for compatible architectures, feed them into llc targeting random other architectures, and get working optimized binaries, within limitations that Ultra Fractal meets (they're mostly about calling incompatible or non-standard CRT functions or OS-specific crap). I've lowered IR generated for 32-bit PPC MacOS to JavaScript via the Emscripten backend before, just to demonstrate to people at work that the compiler basically ignores data types if you override them. This doesn't work when the frontend needs to generate processor-specific intrinsics, but that isn't a thing here; I'm just giving it as an example. See also the Julia language.
Classes will run on anything once they're compiled down to its assembly language; they're just an HLL abstraction. (See the CUDA code in the CGBN repo linked below: it's all namespaced C++ template classes that expand during preprocessing into variants for each size used, and get lowered as... let's see, 128-bit to 32 Kbit integers whose bitness must be divisible by 32... so potentially a metric buttload of C++ classes that need to be lowered to GPU assembly.)
For the GPU backends you really just need to inline everything to avoid the performance hit from branching, and UF classes aren't anywhere near the complexity of C++ classes, which makes that a lot easier; see the sketch below.
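As a purely hypothetical sketch (the formula, class, and kernel names here are all made up, not anything from Ultra Fractal), this is roughly what "compile the class down and inline it" looks like on the CUDA side, so the per-pixel loop ends up with no virtual calls and no indirect dispatch:

```cpp
// Hypothetical sketch: a UF-style formula "class" lowered to a CUDA functor.
// Everything is a template/__forceinline__, so the compiled kernel has no
// indirect calls; the formula body is fully inlined into the pixel loop.
#include <cuda_runtime.h>
#include <cuComplex.h>

struct MandelbrotFormula {                      // stand-in for a compiled UF class
    __device__ __forceinline__ cuDoubleComplex
    iterate(cuDoubleComplex z, cuDoubleComplex c) const {
        return cuCadd(cuCmul(z, z), c);         // z = z^2 + c
    }
    __device__ __forceinline__ bool bailed(cuDoubleComplex z) const {
        return cuCreal(z) * cuCreal(z) + cuCimag(z) * cuCimag(z) > 4.0;
    }
};

template <typename Formula>
__global__ void render(const Formula f, int width, int height,
                       double x0, double y0, double step,
                       int maxIter, int* iterations) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    cuDoubleComplex c = make_cuDoubleComplex(x0 + px * step, y0 + py * step);
    cuDoubleComplex z = make_cuDoubleComplex(0.0, 0.0);
    int i = 0;
    while (i < maxIter && !f.bailed(z)) {       // both calls inline away completely
        z = f.iterate(z, c);
        ++i;
    }
    iterations[py * width + px] = i;            // escape-time count per pixel
}
// Launch e.g.: render<<<grid, block>>>(MandelbrotFormula{}, w, h, -2.0, -1.5, 3.0/w, 1000, dIter);
```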
Compute shaders don't really have many instruction set limitations inherent to them vs. regular CPUs anymore, only standard library limitations.
> Then there's the precision limit, single precision float will only let you zoom in a bit before breaking down and modern consumer GPUs are crippled when it comes to double precision and as far as I know no extended or even arbitrary precision.
As I mentioned above, this can be done on the GPU without a ton of effort. It's not full-speed native GPU math, but it's still faster than the CPU, and there isn't a ton of data being shuffled between host and device, which more or less eliminates the cost of running on a PCIe device.
[Integer bignum library from NVidia, up to 32 Kbit](https://github.com/NVlabs/CGBN)
The speedup on add and mul for 128-bit numbers on a V100 (NVidia cards suck at integers, by the way) versus a 20-core Xeon E5-2698v4 was **174x**. It drops as the numbers get bigger in most cases, but it's still 3.5x faster on the slowest operation (sqrt, which may be a software implementation) on an 8 Kbit integer. If you look at the CUDA source you'll notice it's implemented as C++ template classes, which is why people like CUDA over plain-C OpenCL.
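For a feel of what that looks like in practice, here's a minimal sketch following the pattern of the repo's own add sample; the instance struct layout is my own and the host-side setup (allocating the instances, `cgbn_error_report_alloc`, copying, launching with count*TPI threads) is omitted:

```cpp
// Minimal CGBN sketch, modeled on the repo's add sample.
#include <gmp.h>              // CGBN's host helpers expect GMP to be included first
#include "cgbn/cgbn.h"

#define TPI  32               // threads cooperating on one big integer
#define BITS 1024             // width of each big integer

typedef cgbn_context_t<TPI>         context_t;
typedef cgbn_env_t<context_t, BITS> env_t;

typedef struct {                      // one problem instance in device memory
  cgbn_mem_t<BITS> a, b, sum;
} instance_t;

__global__ void add_kernel(cgbn_error_report_t *report,
                           instance_t *instances, uint32_t count) {
  int32_t instance = (blockIdx.x * blockDim.x + threadIdx.x) / TPI;
  if (instance >= count) return;

  context_t ctx(cgbn_report_monitor, report, instance);
  env_t     env(ctx.env<env_t>());
  env_t::cgbn_t a, b, r;

  cgbn_load(env, a, &(instances[instance].a));  // 1024-bit load, spread across TPI threads
  cgbn_load(env, b, &(instances[instance].b));
  cgbn_add(env, r, a, b);                       // the 1024-bit add is a single call
  cgbn_store(env, &(instances[instance].sum), r);
}
```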
Since then, NVidia has implemented the fastest of those sizes (`__int128`) as a native CUDA type, even though no 128-bit hardware exists on the GPU (the compiler lowers it to a handful of 32-bit operations in parallel, with at worst O(n^2) work for things like multiplication and division; presumably they plan on adding hardware support to something, and it might already exist on Hopper). And even though this link isn't showing up right, [there's already a fixed point library using it.](https://github.com/rapidsai/cudf/blob/6638b5248fdf8cfcdff29f8209799f02abf77de1/cpp/include/cudf/fixed_point/fixed_point.hpp "there's already a fixed point library using it.") The previous one I found was OpenCL, so more usable cross-platform, but I'm just pointing it out.
Most of the code is the same as what you'd write for x86; it just runs in parallel across a single operation implicitly, so instead of 4 clocks per step it's one, I think up to around the 1024-bit range where it might start needing 2 cycles for an L1 cache hit.
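For the native 128-bit type, a toy sketch (my own, nothing from the cudf header; I believe device-side `__int128` needs CUDA 11.5 or newer) really does look like ordinary host code:

```cpp
// Toy sketch: 128-bit integer math in device code with the built-in __int128.
// The compiler emits multi-word 32/64-bit instructions; no library calls needed.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mul128(const unsigned __int128* a,
                       const unsigned __int128* b,
                       unsigned __int128* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] * b[i] + 1;   // ordinary operators on 128-bit values
}

int main() {
    const int n = 256;
    unsigned __int128 *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(*a));
    cudaMallocManaged(&b, n * sizeof(*b));
    cudaMallocManaged(&out, n * sizeof(*out));
    for (int i = 0; i < n; ++i) {
        a[i] = (static_cast<unsigned __int128>(1) << 100) + i;  // well past 64 bits
        b[i] = i + 2;
    }
    mul128<<<(n + 127) / 128, 128>>>(a, b, out, n);
    cudaDeviceSynchronize();
    // Print the low 64 bits of one result just to show it ran.
    printf("low 64 bits of out[3]: %llu\n",
           static_cast<unsigned long long>(out[3]));
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```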