Finally it turns out torturing the kid was unnecessary and spreading out the suffering would have worked fine. All Omelas had to do was raise their income tax a little bit.
GPU programs (specifically CUDA, although other vendors’ stacks are similar) combine code for the host system, written in a conventional programming language (typically C++), with code for the GPU written in the CUDA language. Even if the host-side C++ can be optimized with hand-written assembly, that’s not going to lead to significant gains when the performance bottleneck is on the GPU side.
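To make the split concrete, here’s a minimal sketch of such a program (the kernel and sizes are made up for illustration, not taken from any real codebase): the __global__ function is compiled for the GPU, while everything else is ordinary host C++ that happens to use CUDA’s launch syntax.

#include <cstdio>
#include <cuda_runtime.h>

// Device code: compiled by nvcc into PTX (and eventually native GPU code).
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Host code: plain C++, compiled for the CPU like any other program.
int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // The <<<grid, block>>> launch syntax is the CUDA-specific part of
    // the host code; the rest is conventional C++.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    std::printf("done\n");
    return 0;
}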
The CUDA compiler translates the high-level CUDA code into something called PTX, machine code for a “virtual ISA”, which the GPU driver then translates into native machine language for the GPU’s proprietary instruction set. This seems roughly comparable to a compiler intermediate representation such as LLVM IR. It’s plausible that hand-written PTX assembly could have been used to optimize parts of the program, but that would be somewhat unusual.
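For a sense of what hand-written PTX looks like in practice without bypassing the toolchain: nvcc supports inline PTX through an asm() construct, so virtual-ISA assembly can be spliced into an otherwise normal kernel. The kernel below is a hypothetical illustration, not anything from the program in question.

#include <cuda_runtime.h>

// Elementwise add where the core instruction is hand-written PTX.
__global__ void add_inline_ptx(const int *a, const int *b, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int result;
        // The compiler emits this PTX instruction verbatim into its
        // generated PTX; the driver still translates it to native code.
        asm("add.s32 %0, %1, %2;"
            : "=r"(result)
            : "r"(a[i]), "r"(b[i]));
        out[i] = result;
    }
}

In real code you’d only reach for this when the compiler’s output for some hot inner loop is demonstrably suboptimal.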
For yet another layer of assembly/machine languages, they could technically have reverse engineered the actual native ISA of the GPU cores and written machine code for it directly, bypassing the compiler in the driver. This is also quite unlikely, as it would practically mean writing their own driver for latest-gen Nvidia cards that vastly outperforms the official one, and that would be at least as big a news story as Yet Another Slightly Better Chatbot.
While JITs and runtimes do have overhead compared to direct native machine code, that overhead is relatively small, approximately constant, and easily amortized if the JIT is able to optimize a tight loop. For car analogy enjoyers: imagine a racecar that takes ten seconds to get moving from the starting line in exchange for completing each lap one second faster. If the race is more than ten laps long, the tradeoff is worth it, and the longer the race, the more it pays off. Ahead-of-time optimization can do the same thing at the cost of portability, but unless you’re running Gentoo, most of the C programs on your computer are likely compiled for the lowest common denominator of whatever x86/AMD64/ARM instruction sets your OS happens to support.
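Spelling out the arithmetic of the analogy (the numbers are the ones from the paragraph above plus a made-up base lap time, not measurements of anything):

#include <cstdio>

// Back-of-envelope amortization: a 10 s one-time startup cost is repaid
// at 1 s saved per lap, so the break-even point is exactly 10 laps.
int main() {
    const double startup = 10.0;  // one-time JIT/startup cost (s)
    const double saving  = 1.0;   // time saved per lap (s)
    const double lap     = 60.0;  // hypothetical base lap time (s)

    const int laps_list[] = {5, 10, 20, 100};
    for (int laps : laps_list) {
        double without_jit = laps * lap;
        double with_jit    = startup + laps * (lap - saving);
        std::printf("%3d laps: %.0f s vs %.0f s -> JIT %s\n",
                    laps, without_jit, with_jit,
                    with_jit < without_jit ? "wins" : "doesn't win");
    }
    return 0;
}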
If the overhead of a JIT and runtime is significant in the overall performance of a program, it’s probably a small program to begin with. No shame in small programs, but unless you’re running one very frequently, it’s unlikely to matter whether the execution takes five or fifty milliseconds.
“Wow, this Penny Arcade comic featuring toxic yaoi of submissive Sam Altman is lowkey kinda hot” is a sentence that neither I nor any LLM, Markov chain, or monkey on a typewriter could have predicted, but it now exists.
Meanwhile I’m reverse engineering a very much not performance-sensitive video game binary patcher some guy made a decade ago, and Ghidra interprets a string-splitting function as a no-op because MSVC decided calling conventions are a spook and made up a new one at link time. And it was right to do that.
EDIT: Also me, looking for audio data from another old video game, patiently waiting for my program to take about half an hour on my laptop every time I run it. Then I remember to add --release to cargo run, and while the compilation takes three seconds longer, the runtime shrinks to about ten seconds. I wonder if the above guy ever tried adding -O2 to his CFLAGS?
I hear Private Reasoning of the first through nth LLM Understander Corps is highly motivated
There’s no way to know since they didn’t have the money to test.
to /dev/null preferably