Optimize ArrayList #316
Conversation
Did you profile these changes (with a proper JMH setup)? I'm doubtful that these changes have any noticeable performance improvements (the JVM is incredibly good at figuring out these things). In particular, copying fields to local variables in my experience just adds clutter and doesn't yield anything. (Why I come to this conclusion: I also thought that it helps; after all, the compiler cannot prove the absence of concurrent modification. But even in really costly iterative methods this made practically no difference. I think this is partly because such a local variable needs to be copied and written to memory, since, indeed, the compiler cannot prove that not copying the value will have no influence.) Similarly, marking local variables as final has hardly any influence. To my knowledge, the only possibility is that the compiler can perform some static analysis (which the JIT would very likely figure out, too).
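For readers unfamiliar with the pattern being debated, here is a minimal sketch (class and field names are hypothetical, not fastutil's actual code) of an iteration method that copies the backing array and size fields into locals before looping:

```java
import java.util.function.IntConsumer;

// Hypothetical sketch of the field-caching pattern under discussion;
// the class and field names are illustrative, not fastutil's real ones.
public class IntBag {
	private int[] elements;
	private int size;

	public IntBag(final int[] data) {
		this.elements = data.clone();
		this.size = data.length;
	}

	// Plain version: the fields are (conceptually) re-read on each iteration,
	// since the JIT cannot always prove the callback leaves them untouched.
	public void forEachPlain(final IntConsumer action) {
		for (int i = 0; i < size; i++) action.accept(elements[i]);
	}

	// Cached version: both fields are copied into locals once before the loop.
	public void forEachCached(final IntConsumer action) {
		final int[] a = this.elements;
		for (int i = 0, max = this.size; i < max; i++) action.accept(a[i]);
	}

	public static void main(String[] args) {
		IntBag bag = new IntBag(new int[] {1, 2, 3});
		int[] sum = {0};
		bag.forEachPlain(x -> sum[0] += x);
		bag.forEachCached(x -> sum[0] += x);
		System.out.println(sum[0]); // prints 12
	}
}
```

Whether the cached form is actually any faster is exactly what the benchmarks in this thread try to settle.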
I conducted a benchmark for List forEach. The benchmark code is as follows:

```java
@State(Scope.Thread)
@Warmup(iterations = 3)
@Measurement(iterations = 10)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IterationBenchmark {
	@Param({"10", "1000", "100000"})
	private int n;
	private int[] a;

	@Setup
	public void setup() {
		int[] a = new int[n];
		for (int i = 0; i < n; i++) {
			a[i] = i;
		}
		this.a = a;
	}

	@Benchmark
	public void forEach(Blackhole blackhole) {
		for (int i = 0; i < n; i++) {
			blackhole.consume(a[i]);
		}
	}

	@Benchmark
	public void forEachCacheArray(Blackhole blackhole) {
		int[] a = this.a;
		for (int i = 0; i < n; i++) {
			blackhole.consume(a[i]);
		}
	}

	@Benchmark
	public void forEachCacheSize(Blackhole blackhole) {
		int[] a = this.a;
		for (int i = 0, max = n; i < max; i++) {
			blackhole.consume(a[i]);
		}
	}
}
```

The benchmark results are as follows:
Based on the benchmark results, caching the array object and the size of the list reduces iteration time by over 70% compared to not caching. Regarding the benchmark results, my opinions are as follows:
Furthermore, using the final modifier to declare local variables conforms to the existing code style of fastutil. Looking forward to your reply.
... on this benchmark that doesn't have dynamic dispatch etc. involved and is about forEach (I'm totally in favor of specializing). I doubt that there will be any difference (there might even be a penalty) for
Also: Which version of Java are you using? I cannot replicate these findings. For me, all three have the same timing:
All of these are assumptions. The bytecode is not at all what is actually produced by the JIT, which in turn is not at all what the actual operations executed on the CPU are (see pipelining, speculative execution / branch prediction, etc.). Such constant-factor optimizations need (as realistic as possible) profiling as justification. (Regarding point 2: the field however already exists on the heap, so there is no need to write and then free that memory at all!)
Oh, right! Then making all effectively final variables final definitely is good :)
PS: I don't want to sound overly negative! I've just seen way too often that people try to outsmart a compiler, which more often than not actually is counterproductive.
It's interesting. I ran the very same benchmark out of curiosity on Hotspot 17.0.2, both on Mac M1 and on x86, and see the same improvements as @mouse0w0 when caching the reference to the array and its length. Even for very small data sets.
Oh, that is surprising! I am on the same major version, to be precise. Can you try with .0.9? I doubt that this changes with a minor version, but still. I don't get why there should be such drastic differences.
Could you try this benchmark variant:

```java
@State(Scope.Benchmark)
@Warmup(time = 2, iterations = 2)
@Measurement(time = 2, iterations = 4)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IterationBenchmark {
	@Param({"100000"})
	private int n;
	private int[] a;

	@Setup
	public void setup() {
		int[] a = new int[n];
		Arrays.setAll(a, IntUnaryOperator.identity());
		this.a = a;
	}

	@Benchmark
	public void forEach(Blackhole blackhole) {
		for (int i = 0; i < n; i++) {
			a[i] += i;
		}
		blackhole.consume(a);
	}

	@Benchmark
	public void forEachCacheArray(Blackhole blackhole) {
		int[] a = this.a;
		for (int i = 0; i < n; i++) {
			a[i] += i;
		}
		blackhole.consume(a);
	}

	@Benchmark
	public void forEachCacheSize(Blackhole blackhole) {
		int[] a = this.a;
		for (int i = 0, max = n; i < max; i++) {
			a[i] += i;
		}
		blackhole.consume(a);
	}
}
```

(I just want to eliminate the low probability that it has something to do with the used blackhole implementation)
My benchmark result is based on
Ok, this is very surprising. Let me try if I can reproduce with
EDIT: Nope.
Run your benchmark variant in
I think the reason for this result is that the benchmark variant only consumes the array object itself, rather than consuming the data within the array object. |
But the compiler cannot know that. I re-ran the original benchmark with HotSpot and get the same results:
For completeness:
But my benchmark aligns with the function of forEach, obtaining and consuming one element at a time. We need to clarify what the JVM is doing. |
My JMH version is
Absolutely. This was just an attempt at pinpointing where the difference comes from. Let's try this: I started from an empty folder and ran what the JMH guide suggests:
Then, I put the following code into

```java
package bench;

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IterationBenchmark {
	@Param({"100000"})
	private int n;
	private int[] a;

	@Setup
	public void setup() {
		int[] a = new int[n];
		for (int i = 0; i < n; i++) {
			a[i] = i;
		}
		this.a = a;
	}

	@Benchmark
	public void forEach(Blackhole blackhole) {
		for (int i = 0; i < n; i++) {
			blackhole.consume(a[i]);
		}
	}

	@Benchmark
	public void forEachCacheArray(Blackhole blackhole) {
		int[] a = this.a;
		for (int i = 0; i < n; i++) {
			blackhole.consume(a[i]);
		}
	}

	@Benchmark
	public void forEachCacheSize(Blackhole blackhole) {
		int[] a = this.a;
		for (int i = 0, max = n; i < max; i++) {
			blackhole.consume(a[i]);
		}
	}
}
```

followed by

Output:
and
(* I ran this with
Original benchmark; M1 native;
Numbers on 17.0.9 or newer look similar to 21 for me
Upgrade JVM to
and
Huh! I didn't expect this to come out of a minor version change. Well, that at least explains what happens in that particular case ... Thanks for trying with an upgraded version!
Benchmark in JVM versions |
Aha! The 17.0.9 changelog includes
Since your performance was similar when the blackhole was moved outside the loop, I am guessing this is the culprit. This would mean the performance difference was caused by the blackhole and would not be observable in real applications (which aligns with what I have seen in e.g. JBDD)
Wow. That's a great and constructive discussion! My 2€¢: I welcomed the PRs from @mouse0w0 because this is how fastutil code is written. Beyond 2 accesses to the same array I try to cache. I know it's up and down with actual improvements etc., but semantically it really makes sense and often works (remember we're supporting Java 8). @mouse0w0 noted that in some places I wasn't being coherent and fixed those places. So I'm totally happy to see this discussion, but at least for the time being (and seeing different results on different architectures) for 3 accesses or more the rule is to cache. If you spot a relevant method where this does not happen I'm happy to accept PRs.
I'm fine with that. It's just important to note that a) this primarily is a style decision, not performance, and b) it should be tested that such a change doesn't actually negatively impact performance (which can happen). The only thing I am "unhappy" about is the
Well, the rule tries to avoid the bad case. It is physically impossible to try with all type combinations and all instances whether there will be a negative impact on performance. Thus the 3+ rule. What about for (int i = 0, _size = size; i < _size; i++)?
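Spelled out as a minimal sketch (assuming a plain java.util.List; the `_size` name mirrors the style proposed above), the idiom is a single-initializer loop that reads the size exactly once:

```java
import java.util.List;

public class SizeCachingLoop {
	// Sums the list, caching size() in the for initializer so it is
	// evaluated exactly once; both i and _size live in one declaration,
	// so the loop still has a single initialization clause.
	public static int sum(final List<Integer> list) {
		int s = 0;
		for (int i = 0, _size = list.size(); i < _size; i++) s += list.get(i);
		return s;
	}

	public static void main(String[] args) {
		System.out.println(sum(List.of(1, 2, 3, 4))); // prints 10
	}
}
```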
Sure, I just meant that we can't just assume that local caching makes things faster. We tested it, saw that it doesn't make a difference, great.
I equally dislike that; for me, a for loop should just have a single initialization. But it's your project and your style decision :)
This code style has been applied to some of fastutil's templates, and if we don't use it, we'll need to change some of the templates to keep it consistent.
Personally, I prefer this code style as well, |
Ok, so let's go for
```diff
 public void forEachRemaining(final METHOD_ARG_KEY_CONSUMER action) {
 	final KEY_GENERIC_TYPE[] a = ARRAY_LIST.this.a;
-	final int max = to - from;
-	while(pos < max) {
-		action.accept(a[from + (lastReturned = pos++)]);
+	final int from = SubList.this.from, to = SubList.this.to;
+	for (int i = from + pos; i < to; i++) {
+		action.accept(a[i]);
 	}
+	final int max = to - from;
+	pos = max;
+	lastReturned = max - 1;
 }
```
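Translated from fastutil's type templates (METHOD_ARG_KEY_CONSUMER, KEY_GENERIC_TYPE) into plain int-based Java, the two variants being compared look roughly like this. Field names follow the diff; the surrounding class is invented for illustration:

```java
import java.util.function.IntConsumer;

// Illustrative stand-in for the SubList iterator state in the diff.
public class IteratorSketch {
	private final int[] a;
	private final int from, to;
	int pos = 0, lastReturned = -1;

	public IteratorSketch(final int[] a, final int from, final int to) {
		this.a = a;
		this.from = from;
		this.to = to;
	}

	// Old variant: advances pos and lastReturned on every element.
	public void forEachRemainingOld(final IntConsumer action) {
		final int max = to - from;
		while (pos < max) action.accept(a[from + (lastReturned = pos++)]);
	}

	// New variant: loops over a local index and writes the iterator
	// state exactly once, after the loop.
	public void forEachRemainingNew(final IntConsumer action) {
		for (int i = from + pos; i < to; i++) action.accept(a[i]);
		final int max = to - from;
		pos = max;
		lastReturned = max - 1;
	}
}
```

Both variants visit a[from] through a[to - 1] and leave pos == to - from and lastReturned == to - from - 1, so they are behaviorally equivalent; the difference is only how often the iterator state is written.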
The forEachRemaining change is meaningless (maybe slower) in the latest benchmark, and I'm not sure if I should keep it. I did this optimization because Java's ArrayList uses the same code, and it seems more efficient (fewer assignments and increments).
Benchmark:

```java
@Fork(1)
@State(Scope.Thread)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ForEachRemainingBenchmark {
	@Param({"10", "1000", "100000"})
	private int max;
	private int[] a;

	@Setup
	public void setup() {
		a = new int[max];
		Arrays.setAll(a, IntUnaryOperator.identity());
	}

	private int pos;
	private int last;

	@Benchmark
	public void raw(Blackhole blackhole) {
		pos = max >> 1;
		last = pos - 1;
		while (pos < max) {
			blackhole.consume(a[last = pos++]);
		}
		blackhole.consume(pos);
		blackhole.consume(last);
	}

	@Benchmark
	public void optimize(Blackhole blackhole) {
		pos = max >> 1;
		last = pos - 1;
		for (int i = pos; i < max; i++) {
			blackhole.consume(a[i]);
		}
		pos = max;
		last = max - 1;
		blackhole.consume(pos);
		blackhole.consume(last);
	}
}
```
Output:
```text
# JMH version: 1.37
# VM version: JDK 17.0.10, Java HotSpot(TM) 64-Bit Server VM, 17.0.10+11-LTS-240
# VM invoker: C:\Program Files\jdk-17\bin\java.exe
# VM options: -javaagent:C:\Program Files\IntelliJ IDEA 2023.2.5\lib\idea_rt.jar=8024:C:\Program Files\IntelliJ IDEA 2023.2.5\bin -Dfile.encoding=UTF-8
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 3 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op

Benchmark                              (max)  Mode  Cnt     Score    Error  Units
ForEachRemainingBenchmark.optimize        10  avgt    5     3.331 ±  0.018  ns/op
ForEachRemainingBenchmark.optimize      1000  avgt    5    56.202 ±  0.023  ns/op
ForEachRemainingBenchmark.optimize    100000  avgt    5  6028.862 ±  7.591  ns/op
ForEachRemainingBenchmark.raw             10  avgt    5     3.786 ±  0.006  ns/op
ForEachRemainingBenchmark.raw           1000  avgt    5    53.319 ±  0.026  ns/op
ForEachRemainingBenchmark.raw         100000  avgt    5  6007.510 ± 14.774  ns/op
```
Well, that's a sensible criterion, but the ArrayList code has been written by many people over the years, and optimized for different implementations of Collections and different JVMs, so it's not necessarily the best.
However, the difference is so small that I'd keep the new ArrayList-style code. It's definitely cleaner, and writing in Rust I realized that going overboard with side effects in expression evaluation doesn't do a good service to anybody.
This PR makes tests fail (didn't you check?). Apparently there is a change of behavior in the list used to check the behavior of big lists. I don't have time at the moment to chase this, as it might be an involved one. I'm releasing 8.5.14 as it's overdue. If you can find the problem I'll add this to the next release.
My fault. I am not yet familiar with running JUnit via Ant, and the lack of prominent failures made me overlook it. I will fix it.
You can run the test
Note that in principle something that you fixed might have uncovered a bug in big lists. I am not keen to think that, because those tests ran happily for 20+ years. Unfortunately the failed test is a stress test that calls itself recursively, which makes it not easy to track the point at which the divergence between the big list and the standard list happens.
The issue is located in
According to the implementation of Java's
I'm totally confused. What did you fix codewise (forget the docs for the moment)?
The problem lies in my
fastutil/src/it/unimi/dsi/fastutil/Arrays.java, lines 52 to 53 in 5eaffbb
fastutil/src/it/unimi/dsi/fastutil/Arrays.java, lines 63 to 65 in 5eaffbb
It should be:
* @param to an end index (
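For context, the javadoc under discussion documents half-open ranges: from is a start index (inclusive), to is an end index (exclusive), so from == to denotes the empty range. A hypothetical sketch of such a range check (not fastutil's actual code; the method name and exception messages are invented for illustration):

```java
// Hypothetical from/to range check with half-open semantics:
// from is inclusive, to is exclusive, from == to means "empty range".
public class RangeCheck {
	public static void ensureFromTo(final int length, final int from, final int to) {
		if (from < 0)
			throw new ArrayIndexOutOfBoundsException("Start index (" + from + ") is negative");
		if (to < from)
			throw new IllegalArgumentException("Start index (" + from + ") is greater than end index (" + to + ")");
		if (to > length)
			throw new ArrayIndexOutOfBoundsException("End index (" + to + ") is greater than length (" + length + ")");
	}
}
```

With these semantics, ensureFromTo(5, 0, 5) and ensureFromTo(5, 2, 2) both pass, while ensureFromTo(5, 3, 2) throws.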
Ahh OK, I see. The text is correct. The parameter description is wrong.
WTF, I have the new release but Sonatype gives me Unauthorized (401). Nothing has changed in my configuration in the last five months. If someone has suggestions, let me know; I tried on the community forum but I just met other people with the same problem...
Sonatype now enforces token authentication instead of username and password. Refer to the user manual to generate your token.
Thanks for the tip! They could provide a more sensible error message, frankly (or I'm just getting addicted to the Rust ecosystem).
In the previous pull request (#311), I optimized the array field access of some classes. Recently, as my understanding of fastutil has deepened, I have discovered more optimizations and technical debt that need attention. However, resolving these issues is challenging and time-consuming, requiring contributions from the community across multiple pull requests. This pull request covers my recent work on ArrayList.