▲Constrained languages are easier to optimizejyn.dev
37 points by PaulHoule 7 hours ago | 28 comments
JimDabell 3 hours ago [-]
This is just one aspect of the principle of least power:

https://www.w3.org/DesignIssues/Principles.html#PLP

By restricting the power of a language, you enable it to be used in more ways than just “execute it and see what the result is”.

holowoodman 4 hours ago [-]
But the potential for easy optimization does not mean that they will be optimized. Compilers for higher-level languages are full of broken promises around their potential. "Faster than C because we can optimize" is never really realized, except maybe in one benchmark that only works in that one exact scenario and nowhere in my code...
Ygg2 4 hours ago [-]
> Compilers for higher-level languages are full of broken promises

Sometimes that's because C is the lingua franca of low-level programming.

Some noalias optimizations Rust relied on didn't work in LLVM, because no one had bothered to exercise them from C before.

This goes even further, down to the hardware. SIMD search over a C-style null-terminated string is faster than over a saner (pointer + len) string view. So it's faster to append a null to the end of a string view and run the null-terminated SIMD search than to invoke SIMD on the string slice directly.

holowoodman 3 hours ago [-]
And here we go again with the compiler excuses.

C standards with the 'restrict' keyword to allow aliasing optimisations have existed for longer than Rust. LLVM just never bothered, despite the early claims that the intermediate language and "more modern" compiler design would enable more optimisation potential. LLVM is still slower than GCC, even in plain C.

What is the problem with quickly null-terminating a pointer+len string and then using the quick null-terminated SIMD search? It should only be one move slower than the native C. "Space after that string" shouldn't be a problem, since in higher-level languages allocations can make enough room, it is higher-level after all (and you need null-termination for syscalls anyway).

It is always the same story. Compiler and programming language people make claims about future optimisations but never ever follow through. It's always just theory, never implementation.
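For reference, the C99 `restrict` qualifier under discussion looks like this; a minimal sketch (function and parameter names are made up for illustration):

```c
#include <stddef.h>

/* With restrict, the compiler may assume dst and src never alias,
   so it can keep loaded src values in registers and vectorize the
   loop. Without it, a store to dst[i] could legally modify src[i+1],
   forcing a reload on every iteration. */
void scale(float *restrict dst, const float *restrict src,
           float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
```

Whether a given compiler actually exploits the qualifier is a separate question, which is exactly the complaint above.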

chrismorgan 2 hours ago [-]
LLVM did do some work with noalias, but it wasn’t very mature, because noalias wasn’t broadly useful in C/C++: it was too hard to use correctly.

Then Rust made it trivial to use correctly, and found that what LLVM did have was quite buggy, because it hadn’t been exercised. These were bugs that could in theory have been found in C/C++, but in practice never would have been. And so for quite some years Rust would toggle the noalias optimisations on and off, depending on whether there were any known bad-codegen issues at the time. I think the last toggle was a few years ago by now; this stuff is finally stable and appreciated.

My recollection is that the figures for the Rust compiler are in the vicinity of a 5% improvement from emitting noalias in the LLVM IR.

And it’s likely that quite a bit more is possible.

Ygg2 2 hours ago [-]
> LLVM just never bothered, despite the early claims that all the intermediate language and "more modern" compiler would enable more optimisation potential.

They did bother but no one seemed to be using it, so a bug snuck in. It took Rust exercising that corner of LLVM to find the bug.

> LLVM is still slower than GCC, even in plain C.

Citation needed. Last time I checked LLVM beat GCC on -O3. Unless you mean compilation performance.

> Where is the problem to quickly null-terminate a pointer+len string and use the quick null-terminated SIMD search?

Why should two nearly identical operations have such wildly different performance? And why isn't the safer/saner interface the more performant one?

My point is: if compilers and hardware are optimizing for C, it's no surprise that nothing else can approach its speed.

It's like asking why when everyone is optimizing for Chrome no other web browser can approach its speed.

holowoodman 2 hours ago [-]
>> LLVM is still slower than GCC, even in plain C.

> Citation needed. Last time I checked LLVM beat GCC on -O3.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

https://www.phoronix.com/review/gcc-clang-eoy2023/8

There is tons more, just ask your favourite internet search engine.

> Why should two nearly idenfical operations have such wildly different performance? And why isn't the safer/saner interface more performant?

Because in hardware, checking a zero-flag on a register is very cheap. Big NAND over all bits of a sub-register, big OR over all those flags for the whole SIMD register. Very short path, very few transistors.

Checking a length is expensive: you increment a length register through an addition, including a carry, which is a long path, then compare with another register to find out if you are at the last iteration. The compare is usually also expensive, since internally it is a subtract with carry, even though you could get away with xor + zero flag. But you can't get around the length register addition.

There can be an optimisation, because you have to increment the address you fetch from in any case. But in most modern CPUs that happens in special pointer registers internally; if you do additional arithmetic on those, it'll be expensive again. x86 actually had more complex addressing modes that handled this, but compilers never really used them, so they were dropped for SIMD.
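The two loop shapes being compared can be sketched in scalar C (hypothetical helpers, much simpler than the vectorized libc versions):

```c
#include <stddef.h>

/* Null-terminated scan: the exit test checks the byte just loaded
   against zero. (Simplified: unlike strchr, this does not treat the
   terminator itself as a match.) */
const char *find_nul(const char *s, char c) {
    for (; *s != '\0'; s++)
        if (*s == c) return s;
    return NULL;
}

/* Pointer+length scan: every iteration also increments a counter and
   compares it against the length to decide whether to exit. */
const char *find_len(const char *s, size_t len, char c) {
    for (size_t i = 0; i < len; i++)
        if (s[i] == c) return s + i;
    return NULL;
}
```

Whether the extra counter actually costs anything on a modern out-of-order core is exactly what the replies below dispute.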

miloignis 20 seconds ago [-]
Your Phoronix link shows clang as faster than GCC (more is better on the graph). The benchmark game results seem to go back and forth.

I don't think that it's convincing evidence that clang is slower than GCC.

jcranmer 58 minutes ago [-]
A loop that's basically:

  for (int i = 0; i < len; i++)
    /* do something */
is going to compile down to something like:

  label: ; blah blah blah
    INC r0
    CMP r0, r1
    JNZ label
(Okay, to be fair, compilers tend to change the loop bounds so that the final comparison is against 0. This reduces the register usage of the loop, but otherwise doesn't change the analysis.) Taking micro-op fusion into account, there are at best two actual hardware ops in that assembly pattern, since the comparison and branch will be fused, and the increment(/decrement) is also a decent candidate for fusion. The branch can also be predicted with a high degree of fidelity: an advanced processor should be able to predict the branch with exactly 100% accuracy.

A null-terminated loop instead compiles down to:

  label: ; blah blah blah
  LD r0, r1
  INC r1
  CMP r0, 0
  JNZ label
You still have the same basic pattern at the end of the loop, except that the INC (which might not have been fused before) is no longer on the critical path; the critical path now contains a load instead. That load is a much slower instruction than a test-against-0 or a comparison: no load completes in one cycle, whereas both tests (bitwise and) and comparisons (subtraction) complete in one cycle. Splitting hairs by saying that the test-against-0 is less expensive than a subtract-and-check-carry, and therefore null-terminated is faster, ignores the fact that they execute in the same time, and brushes aside the extra load, which is definitely slower, in the null-terminated version.

It gets worse, though. With a pointer+length combo, the length of the array, and the legal dereferenceability, is known to the compiler in a symbolic form that is easy to compute with. That makes it possible to unroll loops, vectorize loops, rewrite loop bounds into better forms, and so on. When the string is null-terminated, the length comes from reading memory, and the amount of memory you can legally read is not known ahead of time. Unrolling the loop and similar transformations are no longer really possible, since you don't know how many iterations the loop will take, and the extra loop exits you'd generate tend not to be tolerable in optimization.
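The difference in what the compiler knows can be sketched with two illustrative functions (not from the thread):

```c
#include <stddef.h>

/* Trip count n is known symbolically on entry, so the compiler is
   free to unroll or vectorize: it knows p[0..n-1] is dereferenceable. */
int sum_len(const unsigned char *p, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}

/* Trip count depends on data loaded inside the loop: any iteration
   may be the last, so the compiler cannot assume it is allowed to
   read a whole vector's worth of bytes ahead. */
int sum_nul(const unsigned char *p) {
    int total = 0;
    for (; *p != '\0'; p++)
        total += *p;
    return total;
}
```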

holowoodman 39 minutes ago [-]
No. First that isn't SIMD code. Second, CMP r0,0 is redundant, operating on a register that is zero will set the zero flag automatically. You also skipped the string search/comparison part, but that isn't relevant, except that the operation will implicitly set the zero flag. Generally, plain JZ is faster than CMP + JZ. Even more so if it is SIMD.

And loads are amortized by tons of speculative preloading. Which isn't (yet, unfortunately) optimized by preload-length hints.

Oh, and since we are talking about comparing strings, there is no extra load. This isn't strlen(), this is strchr() or strstr() we are talking about. You always have to load the string into some register.

habibur 3 hours ago [-]
Brings back nostalgic memories from Java days in the 2000s. "Java will be faster than C" for exactly the same reasons laid out in the article.

And here we are, 20 years later.

yccs27 8 minutes ago [-]
TFA only talks about having strong or weak static guarantees. There's more that typically comes with a "high-level" language like Java: Nonzero-cost abstractions like boxed objects, automatic memory management and more.
pegasus 2 hours ago [-]
Yes, 20 years later, the JVM is considered a marvel of engineering, and, surprisingly, some of those marketing claims have in fact come true. Or at least, more than would be expected after doing the hype-correction always called for in such situations. Yes, Java can be faster in certain contexts: the kind of contexts Java is usually employed in addressing these days. Stuff like enterprise backend systems and big data processing, where startup time or precise memory control are not important. Then there's GraalVM which allows for AOT compilation, etc.
holowoodman 2 hours ago [-]
I'd like to see those numbers, please. While there have been improvements, Java performance work seems to have stopped after 2011 or so. Mobile CPUs became fast enough for Android, so no movement there. And commercial software got stuck on numerous proprietary Java VMs that didn't improve after the Oracle takeover of Sun. Alternative VMs improved, but didn't get adopted, fragmented, and in the end didn't really raise the bar much.

To this day, Java applications are the slowest and most memory-hungry long-running server applications by far. Only some scripting languages are worse, and only in performance, almost never in memory.

hyperpape 24 minutes ago [-]
> While there have been improvements, Java performance work seems to have stopped after 2011 or so.

No idea how you got that impression. Here is an incomplete set of important changes:

- major developments on Graal and native code compilation for applications that prioritize startup time and memory use (I think Graal was started before 2011, but has changed a huge amount since then).

- the concurrent G1 garbage collector became the default, making most applications have much lower pause times (I believe it was released in Java 7 in 2011, but was not mature, only became a default in Java 9, and has regular improvements since then).

- the Shenandoah and ZGC concurrent garbage collectors were not started until well after 2011

- Project Lilliput, which reduces the header size of Java objects from 16 to 8 bytes, is landing in Java 25, which comes out this fall

- specializing strings to use 1 byte/character for ascii strings, and 2 bytes for non-ascii strings

- virtual threads shipped in Java 21

- work on a vectorization API

- plenty of work on reducing the startup time of the JDK and modularizing it

One where I'm not sure about the timeline:

- https://docs.oracle.com/javase/8/docs/technotes/guides/vm/cl... class data sharing to lower overhead of running multiple JVMs on a single node

P.S. I'm not writing this to comment on the parent's claim that Java is faster in some contexts.

holowoodman 7 minutes ago [-]
The grandparent made specific claims about scenarios where Java is supposedly faster:

> Yes, Java can be faster in certain contexts: the kind of contexts Java is usually employed in addressing these days. Stuff like enterprise backend systems and big data processing, where startup time or precise memory control are not important.

The only thing that might be relevant in your list is the vectorization API.

The rest of your list is just a bunch of band-aids for the worst problems that Java specifically has, that other languages lack. That is at best playing catch-up with everything else, since in C there is no GC overhead, almost no startup time, no object overhead, no GC-induced memory hogging, no GC pauses. Yes, I agree that things did happen. But nothing in there makes Java any more desirable for relevant workloads, it just makes it less undesirable.

pegasus 1 hours ago [-]
Seriously? Since 2011, so since Java 7, stagnation in performance? That's a startling claim. I'd like to see some numbers as well. I'll start the first round with this article:

https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html

Note: this is comparing Java 17 to Java 8 (released in 2014), and we are now at Java 24!

holowoodman 30 minutes ago [-]
That is garbage collection performance, something that only GC languages even have to consider as additional overhead. It also doesn't even try to compare against anything non-Java.

There are no serious Java performance comparisons on stuff like transaction throughput, which is what the "long running server application" use case should aim for. You yourself claimed that enterprise data processing is what you are talking about. There are benchmarks where, even with long warmup times, Java comes out far worse than all the compiled languages (even Golang nowadays). https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Imho, people have stopped benchmarking Java ages ago, since it is known to be hopeless.

carlmr 3 hours ago [-]
I dunno, I was there in the 2000s and can't remember anyone saying that. I remember a lot of people saying Java's slightly slower execution will not matter that much when processing hardware gets better.
chubot 2 hours ago [-]
I was also there in the 2000's and I remember people saying Java can be just as fast as C

And in particular that garbage collection can be faster than malloc/free.

This is technically true in the case where you have a naive C program. There are tons of C programs out there, and some are slow. (But there are more slow Java programs out there.)

But it's 100% clear now that GC has a significant cost, and while you can optimize a bad C program, it can be very hard in many cases to optimize the cost of GC away.

C offers more freedom and thus more footguns, but the freedom also means you don't really have a performance cap.

And I prefer GC -- most programs I write will have GC; I just recognize the performance cap and design accordingly. (In fact I wrote a GC for a subset of C++ for https://oils.pub -- I actually think this is a language with nicer performance properties than Java, mainly due to being AOT.)
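One concrete form that freedom takes is swapping general-purpose malloc/free for a bump arena when allocation is hot; a minimal sketch (illustrative, not from the comment above):

```c
#include <stddef.h>

/* A bump arena: allocation is a bounds check plus a pointer bump,
   and the whole arena is freed at once by resetting `used`. No
   per-object free, no GC pauses -- and also no safety net. */
typedef struct {
    char  *buf;
    size_t cap;
    size_t used;
} Arena;

void *arena_alloc(Arena *a, size_t size) {
    size = (size + 7) & ~(size_t)7;       /* round up to 8-byte alignment */
    if (a->used + size > a->cap)
        return NULL;                      /* out of arena space */
    void *p = a->buf + a->used;
    a->used += size;
    return p;
}

void arena_reset(Arena *a) { a->used = 0; }
```

This is the kind of allocation strategy a GC'd runtime chooses for you, for better or worse, whereas C leaves the choice (and the footguns) to the programmer.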

holowoodman 3 hours ago [-]
There were tons of people making those claims. https://stackoverflow.com/questions/64582/c-java-performance...

Especially profile-guided-optimization was hailed as the next big thing, only JIT-ed languages were the ones to be fast, because after some warmup time they would adapt to your data and hardware. Java is still dog-slow...

Sesse__ 3 hours ago [-]
I was there in the 2000s and remember several people saying that.
mrkeen 2 hours ago [-]
Look at your sibling comments. Some people are still saying it now.
donatj 3 hours ago [-]
They are also easier to statically analyze and write tooling around.
holowoodman 3 hours ago [-]
Then where is my easy Haskell tooling to find the place where I should put a strictness hint? Where is the equivalent of kcachegrind?

That it's supposedly easier might be true. But "easier" doesn't usually lead to anyone actually doing it.

ImPostingOnHN 2 hours ago [-]
I have constructed a language which can only say "hello world" and there is only 1 way to do it.

It is the most easily optimized language I know. If you write a working program in it, you know you're doing it in the most optimal way for that language already.

withoutboats3 3 hours ago [-]
Initial comments as I write this are all negative, and also responding to something the blog post didn't claim. The only time it says faster than C is when talking about a language targeting the GPU; it is not controversial that GPUs perform better than CPUs for many workloads.

On the other hand, the results of this true fact are already seen in practice in real systems. Rust's iterator adapters, for example, allow high level "functional style" with closures to be compiled to code that is exactly the same as hand rolled C loops without unamortized bounds checks - despite guaranteeing the absence of out of bound accesses even in the face of programmer error - because the constraints of these interfaces and the design of the language enable tons of optimizations by LLVM. This has been true since Rust 1.0 more than a decade ago, it's not a promise of the future.

greatgib 5 hours ago [-]
Thank you Captain Obvious!

But easier to optimize doesn't necessarily mean that they can be optimized to be more performant than less constrained languages.
