>2. Double precision is unlocked on AMD cards, Nvidia "consumer" cards (that includes everything up to the 40-fucking-90 Ti!) will do double, but nerf the performance for no other reason than Nvidia wanting more money. The chips can do it, the just won't.
The chips can't actually do it. They stripped the SMs down to a couple of fp64 units each because the die area wasn't worth it, which leaves the chips at a 1:64 fp64:fp32 ratio; the architecture docs say the remaining units are only there so software that expects the type to exist still runs. Similarly, one of the two compute datapaths per SM partition can issue either int32 or fp32 in a given cycle, but not both, so int32 always runs at half rate and eats into fp32 throughput.

NVidia (for the most part) doesn't operate as a "real" vectorized parallel processor the way the AMD GPUs do: if you check clinfo on a 7900XTX and a 4090 you'll get preferred vector widths (32b:16b:8b) of 1:2:4 on the AMD card and 1:1:1 on the NVidia card, because the regular CUDA instructions only really operate on 32-bit values. Some of the AMD instructions are 2/clock as well. That's part of the reason NVidia needs twice as many stream processors to get its base fp32 numbers roughly 1.5x higher than AMD's.

The ~200TF of dedicated raytracing hardware, which AMD doesn't have at all (they just have schedulers), is why NVidia currently beats the pants off them in workstation loads, more than any lack of software support. I own both cards and tested against AMD's own open-source physically accurate renderer, which doesn't even use the raytracing cores yet, and it's still almost twice as fast on NVidia. LuisaRender, which uses OptiX but also has a DirectX backend, shows an even bigger gap; I tested the DirectX backends against each other there. If all you're doing is gaming, buying a 4090 is probably sheer stupidity, but for 3D work, if I can GPU-render a frame in 1 minute instead of 4 at the same power draw, it's kind of a no-brainer to spend the extra 1.5x on the GPU, especially when it gets things fast enough that I can render the viewport with Karma XPU in Houdini and know more or less exactly what the final render will look like while still interactively modifying things.
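If you want to check the clinfo claim yourself, here's a minimal sketch of the query it does under the hood; this is just the stock OpenCL host API (nothing vendor-specific), printing the preferred vector widths for every GPU it finds:

```cpp
// Query preferred vector widths (what clinfo reports) for every OpenCL GPU.
// Core OpenCL 1.x host API only; link with -lOpenCL.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

    for (cl_platform_id p : platforms) {
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_GPU, 0, nullptr, &num_devices) != CL_SUCCESS)
            continue;  // platform has no GPUs
        std::vector<cl_device_id> devices(num_devices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_GPU, num_devices, devices.data(), nullptr);

        for (cl_device_id d : devices) {
            char name[256] = {0};
            cl_uint w_float = 0, w_half = 0, w_char = 0;
            clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(name), name, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                            sizeof(w_float), &w_float, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF,
                            sizeof(w_half), &w_half, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR,
                            sizeof(w_char), &w_char, nullptr);
            printf("%s: 32b=%u 16b=%u 8b=%u\n", name, w_float, w_half, w_char);
        }
    }
    return 0;
}
```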
The tensor cores are a different story: although they don't publish throughput numbers for it, there's a single FP64 matrix-FMA variant listed in the instruction set manuals (a 16x16 shape) with no mention of it taking any longer than the rest of the tensor instructions, so there's more fp64 speed available than it looks, assuming you can coax your math into those small matrix tiles.
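For what it's worth, the fp64 shape the CUDA C++ WMMA API exposes is 8x8x4 (the raw hardware instruction may tile differently from what the manuals list), and it's documented for compute capability 8.0 and up, so whether the consumer parts run it at full rate is exactly the open question. A minimal one-warp sketch of D = A*B + C through that API:

```cpp
// One warp issuing an fp64 tensor-core MMA through the WMMA API.
// The fp64 fragment shape is m8 n8 k4; compile with nvcc -arch=sm_80 (or newer).
#include <mma.h>
using namespace nvcuda;

__global__ void dmma_tile(const double* A,   // 8x4 tile, row-major
                          const double* B,   // 4x8 tile, column-major
                          double* C) {       // 8x8 accumulator, row-major
    wmma::fragment<wmma::matrix_a, 8, 8, 4, double, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 8, 8, 4, double, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 8, 8, 4, double> acc;

    wmma::fill_fragment(acc, 0.0);            // start the accumulator at zero
    wmma::load_matrix_sync(a, A, 4);          // leading dimension = 4 columns
    wmma::load_matrix_sync(b, B, 4);          // leading dimension = 4 rows
    wmma::mma_sync(acc, a, b, acc);           // acc = A*B + acc on the tensor core
    wmma::store_matrix_sync(C, acc, 8, wmma::mem_row_major);
}
// Launch with a single warp, e.g. dmma_tile<<<1, 32>>>(dA, dB, dC);
```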
AMD is more guilty of software-locking the 1:2 fp64 ratio on consumer Vega hardware, if anything. IIRC it was enabled on one of the cards for a while at launch, then a driver update removed it, and people started cross-flashing them to workstation firmware to get it back.
I don't think anything since Vega has had the 1:2 ratio available even in unlocked form. I can't recall where I read it, but somebody interviewed them and they said the hardware compromises weren't worth it: they already need to dedicate something like 64K of registers per stream processor to keep things flowing, and SRAM that fast isn't cheap. The dual-issue approach lets them keep the registers the same width and have the compiler emit dual-issue instructions instead of 2-wide vector instructions, which isn't much harder and keeps the hardware prices... well, let's face it, still incredibly expensive.

AMD has been going up in price every generation, while top-end NVidia consumer pricing has been coming down since the Titan, since those were always luxury items / introductory workstation cards. Most motherboards can only run one GPU at full speed anymore, even though CPUs have advanced to the point that many machines could serve as workstations, so there's a market for a single huge fast-clocked card versus the usual multiple slimmer ones that draw reasonable power. (I was surprised too, but the 4090 is the lowest-MSRP x090-series card they've made, by a good amount.)

CDNA is 1:1, but that's a different architecture entirely; it has better support for branched code and real matrix hardware, among other things.
The current high-end consumer card FP64 numbers are:

7900XTX - 1.919 TFLOPS
RTX4090 - 1.290 TFLOPS

Which isn't much of a difference.
The cheapest, fastest FP64 cards I'm aware of that are actually affordable at the moment are both older, but if that's what you need, they're functional:

NVidia Tesla V100 FHHL 16GB HBM2 - this FHHL model has been showing up on eBay for $500-600 and has a 1:2 FP64/FP32 ratio. It won't be fast for much else, but it does 6.605 TF FP64. It's clocked lower than the other Tesla models, and anything above 16GB is too expensive except the SXM2 module versions, which are dirt cheap precisely because motherboards for them are nearly impossible to find.

The Radeon Instinct MI100 is showing up at $1000-1200 on eBay; it's CDNA, so more modern, has actual matrix hardware, and does 11.54 TF FP64.

It also has a nightmarish flaw that will make you want to flee from it: the only drivers AMD lists for it are for VMware vSphere ESXi. There's a free version of that, but it's CLI-only, and from talking to friends, managing it even with the GUI is bad enough.

The Instinct MI50 and MI60 are a bit faster than the V100 and are Vega 20 based, but I'd be wary of buying anything Vega-series used given how many were snapped up and mined to death during that whole mess; I haven't looked at their prices either. Radeon Pro VIIs are also 1:2 and seem to go for about the same as the V100 I mentioned, at roughly the same FP64 speed, so that's an option if you trust the seller.
I currently have a 7900XTX and an RTX4090 running side by side in my system, and I wouldn't have touched NVidia with a 10-foot pole until last year because of their prior deceptive advertising. Then I looked at some numbers for what I was doing and realized they'd finally cut the advertising BS and were way farther ahead than I'd realized. So I'm not particularly trying to defend them, just pointing out that it's a bit more complicated than them only putting fp64 on expensive hardware.

The 7900 is still in the machine, but not for much longer. Some of the basic features the card was sold with _still_ haven't been implemented on Windows: the "AI cores" (really WMMA schedulers), and the raytracing accelerators, which only work in games (poorly) and in some specific build of Blender I don't use. They're missing essential libs required for Torch to produce a build that can use the matrix acceleration, and they only have halfway-functional support for HIP (which is just a CUDA compatibility shim anyway); that should have had day-1 support, IMO. This was a very good way for a company I liked to piss me off. They've pulled this before with the geometry acceleration hardware on Vega, which just mysteriously stopped being mentioned after a year of no driver support and turned out to be flawed hardware. I didn't really care then because I wasn't doing anything that would benefit from it, and it was a minor feature compared to these two.

I'm planning to sell the thing for what I can get and replace it with a second 4090, or just stay on a single card since the performance gains aren't linear anyway. I still need to test the system with the 7900XTX as primary video and the 4090 in the compute-only TCC driver mode to see if that accomplishes anything, but I get the feeling it'll cause too many issues with things like the viewport Karma render, which will take a large latency hit if frames have to be copied between GPUs, or won't work at all if OpenGL is involved at any stage.
>Assuming we just take the standard library of UF formulas, one would need to program shaders for each of the 20-40 fractals and coloring algorithms, which would probably be quite doable, even if tedious. But there's hundreds if not thousands of user formulas which may use features not available in shaders, like classes.
Right, but Ultra Fractal _compiles_ those. They didn't compile to ARM either, but I'll go out on a limb and say Frederik didn't run off and hand-write assembly for every user fractal to make them work in the new update. I mean, in a sense he did, since that's what the fractal compiler is doing now: some lucky bastard gets to write the part that converts everything to assembly, or interprets it / JITs it to some kind of bytecode that it can then run optimally against a table of bytecode-to-machine-code mappings. But the compiler only needs to know how to handle the things it's capable of doing on the architectures it targets.
I don't know what its compiler looks like; I'm assuming it's home-cooked, since Frederik was working on the ARM version for a while. (Don't worry, they'll be switching to RISC-V without telling anybody in advance soon, I'm sure.) The language it uses isn't complex and could be ported to a compiler system like LLVM as a frontend. Once you get LLVM generating IR, you can literally take IR files that weren't even built for compatible architectures, feed them into llc targeting random other architectures, and get working optimized binaries, within limitations that Ultra Fractal meets (they're mostly about calling incompatible or non-standard CRT functions or OS-specific crap). I've lowered IR generated for 32-bit PPC MacOS to JavaScript via the Emscripten backend before, just to demonstrate to people at work that the compiler basically ignores data types if you override them. This doesn't work when the frontend needs to generate processor-specific intrinsics, but that isn't a thing here; I'm just giving it as an example. See also the Julia language.
Classes will run on anything once they're compiled down to its assembly language; they're just an HLL abstraction. (See the CUDA code in the CGBN repo linked below: it's all namespaced C++ template classes that expand during preprocessing into variants for each size used, and get lowered as... let's see, 128-bit to 32 Kbit integers whose bitness must be divisible by 32... so potentially a metric buttload of C++ classes that need to be lowered to GPU assembly.)
For the GPU backends you really just need to inline everything to avoid the performance hit from branching, and UF classes aren't anywhere near the complexity of C++ classes, which makes that a lot easier; see the sketch below.
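As a purely hypothetical sketch (the formula, class, and kernel names here are all made up, not anything from Ultra Fractal), this is roughly what "compile the class down and inline it" looks like on the CUDA side, so the per-pixel loop ends up with no virtual calls and no indirect dispatch:

```cpp
// Hypothetical sketch: a UF-style formula "class" lowered to a CUDA functor.
// Everything is a template/__forceinline__, so the compiled kernel has no
// indirect calls; the formula body is fully inlined into the pixel loop.
#include <cuda_runtime.h>
#include <cuComplex.h>

struct MandelbrotFormula {                      // stand-in for a compiled UF class
    __device__ __forceinline__ cuDoubleComplex
    iterate(cuDoubleComplex z, cuDoubleComplex c) const {
        return cuCadd(cuCmul(z, z), c);         // z = z^2 + c
    }
    __device__ __forceinline__ bool bailed(cuDoubleComplex z) const {
        return cuCreal(z) * cuCreal(z) + cuCimag(z) * cuCimag(z) > 4.0;
    }
};

template <typename Formula>
__global__ void render(const Formula f, int width, int height,
                       double x0, double y0, double step,
                       int maxIter, int* iterations) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    cuDoubleComplex c = make_cuDoubleComplex(x0 + px * step, y0 + py * step);
    cuDoubleComplex z = make_cuDoubleComplex(0.0, 0.0);
    int i = 0;
    while (i < maxIter && !f.bailed(z)) {       // both calls inline away completely
        z = f.iterate(z, c);
        ++i;
    }
    iterations[py * width + px] = i;            // escape-time count per pixel
}
// Launch e.g.: render<<<grid, block>>>(MandelbrotFormula{}, w, h, -2.0, -1.5, 3.0/w, 1000, dIter);
```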
Compute shaders don't really have many instruction set limitations inherent to them vs. regular CPUs anymore, only standard library limitations.
> Then there's the precision limit, single precision float will only let you zoom in a bit before breaking down and modern consumer GPUs are crippled when it comes to double precision and as far as I know no extended or even arbitrary precision.
As I mentioned above, this can be done on the GPU without a ton of effort. It's not full-speed native GPU math, but it's still faster than the CPU, and there isn't a ton of data being shuffled between host and device, which more or less eliminates the cost of running on a PCIe device.
[Integer bignum library from NVidia, up to 32 Kbit](https://github.com/NVlabs/CGBN)
The speedup on add and mul for 128-bit numbers on a V100 (NVidia cards suck at integers, by the way) versus a 20-core Xeon E5-2698v4 was **174x**. It drops as the numbers get bigger in most cases, but it's still 3.5x faster on the slowest operation (sqrt, which may be a software implementation) on an 8 Kbit integer. If you look at the CUDA source you'll notice it's implemented as C++ template classes, which is why people like CUDA over plain-C OpenCL.
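For a feel of what that looks like in practice, here's a minimal sketch following the pattern of the repo's own add sample; the instance struct layout is my own and the host-side setup (allocating the instances, `cgbn_error_report_alloc`, copying, launching with count*TPI threads) is omitted:

```cpp
// Minimal CGBN sketch, modeled on the repo's add sample.
#include <gmp.h>              // CGBN's host helpers expect GMP to be included first
#include "cgbn/cgbn.h"

#define TPI  32               // threads cooperating on one big integer
#define BITS 1024             // width of each big integer

typedef cgbn_context_t<TPI>         context_t;
typedef cgbn_env_t<context_t, BITS> env_t;

typedef struct {                      // one problem instance in device memory
  cgbn_mem_t<BITS> a, b, sum;
} instance_t;

__global__ void add_kernel(cgbn_error_report_t *report,
                           instance_t *instances, uint32_t count) {
  int32_t instance = (blockIdx.x * blockDim.x + threadIdx.x) / TPI;
  if (instance >= count) return;

  context_t ctx(cgbn_report_monitor, report, instance);
  env_t     env(ctx.env<env_t>());
  env_t::cgbn_t a, b, r;

  cgbn_load(env, a, &(instances[instance].a));  // 1024-bit load, spread across TPI threads
  cgbn_load(env, b, &(instances[instance].b));
  cgbn_add(env, r, a, b);                       // the 1024-bit add is a single call
  cgbn_store(env, &(instances[instance].sum), r);
}
```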
Since then, NVidia has implemented the fastest of those sizes (`__int128`) as a native CUDA type, even though no 128-bit hardware exists on the GPU (the compiler lowers it to a handful of 32-bit operations in parallel, with at worst O(n^2) work for things like multiplication and division; presumably they plan on adding hardware support to something, and it might already exist on Hopper). And even though this link isn't showing up right, [there's already a fixed point library using it.](https://github.com/rapidsai/cudf/blob/6638b5248fdf8cfcdff29f8209799f02abf77de1/cpp/include/cudf/fixed_point/fixed_point.hpp "there's already a fixed point library using it.") The previous one I found was OpenCL, so more usable cross-platform, but I'm just pointing it out.
Most of the code is the same as what you'd write for x86; it just runs in parallel across a single operation implicitly, so instead of 4 clocks per step it's one, I think up to around the 1024-bit range where it might start needing 2 cycles for an L1 cache hit.
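For the native 128-bit type, a toy sketch (my own, nothing from the cudf header; I believe device-side `__int128` needs CUDA 11.5 or newer) really does look like ordinary host code:

```cpp
// Toy sketch: 128-bit integer math in device code with the built-in __int128.
// The compiler emits multi-word 32/64-bit instructions; no library calls needed.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mul128(const unsigned __int128* a,
                       const unsigned __int128* b,
                       unsigned __int128* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] * b[i] + 1;   // ordinary operators on 128-bit values
}

int main() {
    const int n = 256;
    unsigned __int128 *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(*a));
    cudaMallocManaged(&b, n * sizeof(*b));
    cudaMallocManaged(&out, n * sizeof(*out));
    for (int i = 0; i < n; ++i) {
        a[i] = (static_cast<unsigned __int128>(1) << 100) + i;  // well past 64 bits
        b[i] = i + 2;
    }
    mul128<<<(n + 127) / 128, 128>>>(a, b, out, n);
    cudaDeviceSynchronize();
    // Print the low 64 bits of one result just to show it ran.
    printf("low 64 bits of out[3]: %llu\n",
           static_cast<unsigned long long>(out[3]));
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```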