import java.io.IOException;
import java.util.ArrayList;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.TimeValue;

@State(Scope.Benchmark)
public class ToArrayBench {

    ArrayList<Byte> list;

    @Setup
    public void setup() throws Throwable
    {
        list = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            list.add((byte) i);
        }
    }

    @Benchmark
    public void zero_sized_array()
    {
        list.toArray(new Byte[0]);
    }

    @Benchmark
    public void simple_toArray()
    {
        list.toArray();
    }

    @Benchmark
    public void sized_array_from_list()
    {
        list.toArray(new Byte[list.size()]);
    }

    @Benchmark
    public void sized_array_fixed_size()
    {
        list.toArray(new Byte[100]);
    }

    @Benchmark
    public void defensive_copy()
    {
        new ArrayList<>(list);
    }

    public static void main(String[] args) throws RunnerException, IOException
    {
        Options opt = new OptionsBuilder().include(".*" + ToArrayBench.class.getSimpleName() + ".*")
                .warmupIterations(20)
                .warmupTime(TimeValue.seconds(1))
                .measurementIterations(20)
                .timeUnit(TimeUnit.MILLISECONDS)
                .forks(1)
                // .addProfiler(LinuxPerfProfiler.class)
                .build();
        new Runner(opt).run();
    }
}
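One semantic wrinkle worth noting about the benchmark above (an aside, not from the original thread): the `new Byte[100]` variant is not quite equivalent to the others. Per the `List.toArray(T[])` contract, when the passed array is longer than the list, the same array instance is returned and the slot just past the last element is set to null. A minimal, self-contained demonstration:

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates the List.toArray(T[]) contract for oversized arrays:
// the caller's array is reused and null-terminated after the last element.
public class ToArrayContractDemo {
    public static void main(String[] args) {
        List<Byte> list = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            list.add((byte) i);
        }

        Byte[] oversized = list.toArray(new Byte[100]);
        System.out.println(oversized.length);   // 100: the passed array is reused
        System.out.println(oversized[9]);       // 9: last list element
        System.out.println(oversized[10]);      // null: contract-mandated terminator

        Byte[] exact = list.toArray(new Byte[0]);
        System.out.println(exact.length);       // 10: a fresh, exactly-sized array
    }
}
```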
I do not understand these results. Everyone (PMD, IntelliJ, Stack Overflow) seems to agree that toArray(new String[0]) is less efficient than toArray(new String[list.size()]), but my benchmark suggests the opposite.
That's an interesting one. I've simplified your setup to include just the two cases I'm most interested in:
@State(Scope.Thread)
public class ToArrayBench {

    @Param("10")
    private int size;

    private ArrayList<Byte> list;

    @Setup
    public void setup() throws Throwable
    {
        list = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            list.add((byte) i);
        }
    }

    @Benchmark
    public void zero_sized_array(Blackhole bh)
    {
        Byte[] bytes = list.toArray(new Byte[0]);
        bh.consume(bytes);
    }

    @Benchmark
    public void sized_array_fixed_size(Blackhole bh)
    {
        Byte[] bytes = list.toArray(new Byte[size]);
        bh.consume(bytes);
    }

    public static void main(String[] args) throws Throwable {
        Options opt = new OptionsBuilder().include(".*" + ToArrayBench.class.getSimpleName() + ".*")
                .warmupIterations(10)
                .warmupTime(TimeValue.seconds(1))
                .measurementIterations(20)
                .timeUnit(TimeUnit.MILLISECONDS)
                .threads(1)
                .forks(1)
                // .addProfiler(LinuxPerfProfiler.class)
                .build();
        new Runner(opt).run();
    }
}
and indeed the results are counter-intuitive:
Benchmark (size) Mode Samples Score Score error Units
o.o.j.s.ToArrayBench.sized_array_fixed_size 10 thrpt 20 29170.782 2361.987 ops/ms
o.o.j.s.ToArrayBench.zero_sized_array 10 thrpt 20 41963.943 2474.507 ops/ms
However, when I change the sized_array_fixed_size() method to use a constant, the results are more in line with expectations:
@Benchmark
public void sized_array_fixed_size(Blackhole bh)
{
    Byte[] bytes = list.toArray(new Byte[10]);
    bh.consume(bytes);
}
Benchmark (size) Mode Samples Score Score error Units
o.o.j.s.ToArrayBench.sized_array_fixed_size 10 thrpt 20 47657.534 607.251 ops/ms
o.o.j.s.ToArrayBench.zero_sized_array 10 thrpt 20 43036.401 716.582 ops/ms
I can see that the original version has a lower instructions-per-cycle count.
This is the original sized_array_fixed_size():
23289.588754 task-clock (msec) # 0.639 CPUs utilized
14,374 context-switches # 0.617 K/sec
3,054 cpu-migrations # 0.131 K/sec
453 page-faults # 0.019 K/sec
79,254,722,599 cycles # 3.403 GHz [31.19%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
187,817,995,702 instructions # 2.37 insns per cycle [38.93%]
36,922,468,827 branches # 1585.364 M/sec [38.93%]
51,694,474 branch-misses # 0.14% of all branches [38.71%]
46,459,233,680 L1-dcache-loads # 1994.850 M/sec [38.77%]
654,546,759 L1-dcache-load-misses # 1.41% of all L1-dcache hits [38.58%]
57,970,689 LLC-loads # 2.489 M/sec [30.90%]
<not supported> LLC-load-misses:HG
<not supported> L1-icache-loads:HG
8,075,209 L1-icache-load-misses:HG # 0.00% of all L1-icache hits [31.19%]
46,181,846,166 dTLB-loads:HG # 1982.940 M/sec [31.09%]
720,667 dTLB-load-misses:HG # 0.00% of all dTLB cache hits [31.05%]
32,330,750 iTLB-loads:HG # 1.388 M/sec [31.14%]
637,481 iTLB-load-misses:HG # 1.97% of all iTLB cache hits [31.39%]
<not supported> L1-dcache-prefetches:HG
0 L1-dcache-prefetch-misses:HG # 0.000 K/sec [31.40%]
36.442513922 seconds time elapsed
vs. the zero_sized_array():
23279.164686 task-clock (msec) # 0.639 CPUs utilized
15,372 context-switches # 0.660 K/sec
3,401 cpu-migrations # 0.146 K/sec
532 page-faults # 0.023 K/sec
78,414,546,516 cycles # 3.368 GHz [31.04%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
260,885,594,417 instructions # 3.33 insns per cycle [39.01%]
49,221,494,283 branches # 2114.401 M/sec [39.16%]
69,222,874 branch-misses # 0.14% of all branches [39.19%]
62,745,594,182 L1-dcache-loads # 2695.354 M/sec [39.11%]
1,121,079,378 L1-dcache-load-misses # 1.79% of all L1-dcache hits [38.64%]
99,591,088 LLC-loads # 4.278 M/sec [30.57%]
<not supported> LLC-load-misses:HG
<not supported> L1-icache-loads:HG
10,673,042 L1-icache-load-misses:HG # 0.00% of all L1-icache hits [30.94%]
63,298,980,413 dTLB-loads:HG # 2719.126 M/sec [31.00%]
1,173,107 dTLB-load-misses:HG # 0.00% of all dTLB cache hits [32.64%]
32,181,074 iTLB-loads:HG # 1.382 M/sec [32.58%]
463,867 iTLB-load-misses:HG # 1.44% of all iTLB cache hits [32.50%]
<not supported> L1-dcache-prefetches:HG
0 L1-dcache-prefetch-misses:HG # 0.000 K/sec [32.42%]
36.426116098 seconds time elapsed
The new version of sized_array_fixed_size() with the constant has an IPC similar to the zero_sized_array():
23346.537627 task-clock (msec) # 0.639 CPUs utilized
14,690 context-switches # 0.629 K/sec
3,524 cpu-migrations # 0.151 K/sec
502 page-faults # 0.022 K/sec
79,594,753,008 cycles # 3.409 GHz [30.88%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
254,160,699,602 instructions # 3.19 insns per cycle [38.88%]
50,744,484,027 branches # 2173.534 M/sec [38.94%]
70,854,819 branch-misses # 0.14% of all branches [39.08%]
62,767,371,751 L1-dcache-loads # 2688.509 M/sec [39.04%]
908,061,562 L1-dcache-load-misses # 1.45% of all L1-dcache hits [38.96%]
83,239,667 LLC-loads # 3.565 M/sec [30.64%]
<not supported> LLC-load-misses:HG
<not supported> L1-icache-loads:HG
10,329,198 L1-icache-load-misses:HG # 0.00% of all L1-icache hits [30.71%]
63,590,675,367 dTLB-loads:HG # 2723.773 M/sec [30.74%]
845,789 dTLB-load-misses:HG # 0.00% of all dTLB cache hits [30.69%]
29,044,917 iTLB-loads:HG # 1.244 M/sec [31.01%]
489,904 iTLB-load-misses:HG # 1.69% of all iTLB cache hits [30.74%]
<not supported> L1-dcache-prefetches:HG
0 L1-dcache-prefetch-misses:HG # 0.000 K/sec [30.72%]
36.531843880 seconds time elapsed
vs. the zero_sized_array():
23167.723860 task-clock (msec) # 0.638 CPUs utilized
14,513 context-switches # 0.626 K/sec
3,424 cpu-migrations # 0.148 K/sec
494 page-faults # 0.021 K/sec
78,630,215,762 cycles # 3.394 GHz [30.94%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
265,990,291,006 instructions # 3.38 insns per cycle [38.66%]
50,353,702,113 branches # 2173.442 M/sec [38.65%]
67,371,027 branch-misses # 0.13% of all branches [38.84%]
63,736,003,459 L1-dcache-loads # 2751.069 M/sec [38.83%]
1,126,015,959 L1-dcache-load-misses # 1.77% of all L1-dcache hits [38.63%]
181,894,495 LLC-loads # 7.851 M/sec [30.82%]
<not supported> LLC-load-misses:HG
<not supported> L1-icache-loads:HG
10,028,501 L1-icache-load-misses:HG # 0.00% of all L1-icache hits [32.29%]
59,875,036,318 dTLB-loads:HG # 2584.416 M/sec [32.92%]
1,291,552 dTLB-load-misses:HG # 0.00% of all dTLB cache hits [32.84%]
36,123,503 iTLB-loads:HG # 1.559 M/sec [32.87%]
716,049 iTLB-load-misses:HG # 1.98% of all iTLB cache hits [32.79%]
<not supported> L1-dcache-prefetches:HG
0 L1-dcache-prefetch-misses:HG # 0.000 K/sec [32.54%]
36.333910926 seconds time elapsed
This is giving me similarly confusing results:
@State(Scope.Thread)
public class ToArrayBench {

    @Param("10")
    private int size;

    private Byte[] buffer;

    @Setup
    public void setup() throws Throwable {
        buffer = new Byte[size];
        for (byte i = 0; i < size; i++) {
            buffer[i] = i;
        }
    }

    @Benchmark
    public void fast(Blackhole bh) {
        int s = buffer.length;
        Byte[] copy = Arrays.copyOf(buffer, s, Byte[].class);
        bh.consume(copy);
    }

    @Benchmark
    public void slow(Blackhole bh) {
        int s = buffer.length;
        Byte[] copy = (Byte[]) Array.newInstance(Byte[].class.getComponentType(), s);
        System.arraycopy(buffer, 0, copy, 0, s);
        bh.consume(copy);
    }

    public static void main(String[] args) throws Throwable {
        Options opt = new OptionsBuilder().include(".*" + ToArrayBench.class.getSimpleName() + ".*")
                .warmupIterations(10)
                .warmupTime(TimeValue.seconds(1))
                .measurementIterations(20)
                .timeUnit(TimeUnit.MILLISECONDS)
                .threads(1)
                .forks(1)
                .addProfiler(LinuxPerfProfiler.class)
                .build();
        new Runner(opt).run();
    }
}
Benchmark (size) Mode Samples Score Score error Units
o.o.j.s.ToArrayBench.fast 10 thrpt 20 134199.249 15573.195 ops/ms
o.o.j.s.ToArrayBench.fast:@cpi 10 thrpt 1 0.367 NaN CPI
o.o.j.s.ToArrayBench.slow 10 thrpt 20 49499.140 2661.012 ops/ms
o.o.j.s.ToArrayBench.slow:@cpi 10 thrpt 1 0.764 NaN CPI
perf stats for fast():
23044.197935 task-clock (msec) # 0.632 CPUs utilized
14,410 context-switches # 0.625 K/sec
3,227 cpu-migrations # 0.140 K/sec
423 page-faults # 0.018 K/sec
78,964,852,601 cycles # 3.427 GHz [30.93%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
214,945,315,883 instructions # 2.72 insns per cycle [38.77%]
40,570,072,182 branches # 1760.533 M/sec [39.00%]
8,633,173 branch-misses # 0.02% of all branches [38.99%]
55,630,805,896 L1-dcache-loads # 2414.092 M/sec [39.18%]
2,727,361,923 L1-dcache-load-misses # 4.90% of all L1-dcache hits [38.89%]
1,403,751,595 LLC-loads # 60.916 M/sec [30.76%]
<not supported> LLC-load-misses:HG
<not supported> L1-icache-loads:HG
10,486,830 L1-icache-load-misses:HG # 0.00% of all L1-icache hits [31.88%]
54,778,270,818 dTLB-loads:HG # 2377.096 M/sec [31.78%]
584,070 dTLB-load-misses:HG # 0.00% of all dTLB cache hits [31.70%]
23,538,830 iTLB-loads:HG # 1.021 M/sec [31.62%]
312,318 iTLB-load-misses:HG # 1.33% of all iTLB cache hits [31.64%]
<not supported> L1-dcache-prefetches:HG
0 L1-dcache-prefetch-misses:HG # 0.000 K/sec [31.57%]
36.459311589 seconds time elapsed
perf stats for slow():
23121.720993 task-clock (msec) # 0.635 CPUs utilized
14,615 context-switches # 0.632 K/sec
3,168 cpu-migrations # 0.137 K/sec
463 page-faults # 0.020 K/sec
79,644,122,769 cycles # 3.445 GHz [30.98%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
104,254,073,200 instructions # 1.31 insns per cycle [38.76%]
17,489,978,097 branches # 756.431 M/sec [38.87%]
42,822,998 branch-misses # 0.24% of all branches [38.74%]
21,942,957,608 L1-dcache-loads # 949.019 M/sec [38.73%]
1,022,663,441 L1-dcache-load-misses # 4.66% of all L1-dcache hits [38.71%]
88,585,236 LLC-loads # 3.831 M/sec [30.86%]
<not supported> LLC-load-misses:HG
<not supported> L1-icache-loads:HG
11,208,770 L1-icache-load-misses:HG # 0.00% of all L1-icache hits [31.33%]
21,891,428,217 dTLB-loads:HG # 946.791 M/sec [31.21%]
807,245 dTLB-load-misses:HG # 0.00% of all dTLB cache hits [31.04%]
24,236,835 iTLB-loads:HG # 1.048 M/sec [30.99%]
387,664 iTLB-load-misses:HG # 1.60% of all iTLB cache hits [31.06%]
<not supported> L1-dcache-prefetches:HG
0 L1-dcache-prefetch-misses:HG # 0.000 K/sec [31.13%]
36.428755980 seconds time elapsed
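A quick sanity check (my addition, not part of the thread): the fast() and slow() paths above produce identical array contents, so the throughput gap is not a correctness difference. Presumably it comes from Arrays.copyOf being a HotSpot intrinsic, while the reflective Array.newInstance allocation plus a separate System.arraycopy is not fused the same way. A minimal equivalence sketch:

```java
import java.lang.reflect.Array;
import java.util.Arrays;

// Shows that both copy strategies from the benchmark yield equal Byte[] results;
// only their performance differs.
public class CopyEquivalence {
    public static void main(String[] args) {
        Byte[] buffer = new Byte[10];
        for (byte i = 0; i < 10; i++) buffer[i] = i;

        // "fast" path: Arrays.copyOf, which HotSpot can intrinsify
        Byte[] viaCopyOf = Arrays.copyOf(buffer, buffer.length, Byte[].class);

        // "slow" path: reflective allocation followed by a plain arraycopy
        Byte[] viaReflect = (Byte[]) Array.newInstance(Byte.class, buffer.length);
        System.arraycopy(buffer, 0, viaReflect, 0, buffer.length);

        System.out.println(Arrays.equals(viaCopyOf, viaReflect)); // true
    }
}
```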
@jerrinot thanks for your time.
Still struggling to understand the problem :)
No idea either. Can someone please elaborate on that?
Updated:
@State(Scope.Benchmark)
public class ToArrayBench {

    private static final int SIZE = 100;

    ArrayList<Byte> list;

    @Setup
    public void setup() throws Throwable
    {
        list = new ArrayList<>();
        for (int i = 0; i < SIZE; i++) {
            list.add((byte) i);
        }
    }

    @Benchmark
    public void zero_sized_array(Blackhole bh)
    {
        bh.consume(list.toArray(new Byte[0]));
    }

    @Benchmark
    public void simple_toArray(Blackhole bh)
    {
        bh.consume(list.toArray());
    }

    @Benchmark
    public void sized_array_from_list(Blackhole bh)
    {
        bh.consume(list.toArray(new Byte[list.size()]));
    }

    @Benchmark
    public void sized_array_fixed_size(Blackhole bh)
    {
        bh.consume(list.toArray(new Byte[SIZE]));
    }

    @Benchmark
    public void defensive_copy(Blackhole bh)
    {
        bh.consume(new ArrayList<>(list));
    }

    public static void main(String[] args) throws RunnerException, IOException
    {
        Options opt = new OptionsBuilder().include(".*" + ToArrayBench.class.getSimpleName() + ".*")
                .warmupIterations(20)
                .warmupTime(TimeValue.seconds(1))
                .measurementIterations(20)
                .timeUnit(TimeUnit.MILLISECONDS)
                .forks(1)
                // .addProfiler(LinuxPerfAsmProfiler.class)
                .build();
        new Runner(opt).run();
    }
}
| Benchmark | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|
| ToArrayBench.defensive_copy | thrpt | 200 | 16 714 192 | ± 129515.217 | ops/s |
| ToArrayBench.simple_toArray | thrpt | 200 | 17 918 950 | ± 102801.298 | ops/s |
| ToArrayBench.sized_array_fixed_size | thrpt | 200 | 5 799 136 | ± 65921.564 | ops/s |
| ToArrayBench.sized_array_from_list | thrpt | 200 | 5 643 162 | ± 85215.009 | ops/s |
| ToArrayBench.zero_sized_array | thrpt | 200 | 6 529 068 | ± 78960.062 | ops/s |
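One detail worth keeping in mind when reading that table (my aside): simple_toArray() measures List.toArray(), which returns an Object[], not a Byte[]. That result is not interchangeable with the typed variants, and presumably the untyped copy also avoids the type-checked copy path the Byte[] variants go through. The type difference is easy to demonstrate:

```java
import java.util.ArrayList;
import java.util.List;

// Shows that List.toArray() returns Object[] at runtime, so its result
// cannot be cast to Byte[] -- a classic pitfall distinct from toArray(T[]).
public class ToArrayTypeDemo {
    public static void main(String[] args) {
        List<Byte> list = new ArrayList<>();
        list.add((byte) 1);

        Object[] untyped = list.toArray();
        System.out.println(untyped.getClass().getSimpleName()); // Object[]

        try {
            Byte[] typed = (Byte[]) (Object) list.toArray(); // throws at runtime
            System.out.println(typed.length);
        } catch (ClassCastException expected) {
            System.out.println("toArray() is Object[], not Byte[]");
        }
    }
}
```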
Routinely, I would chew on people for not using the perfasm profiler, but this is not your fault: it wasn't helping here. ;) Only in JMH 1.5+ (released yesterday) can perfasm decode the VM stubs, and the VM stubs are the crucial piece of info needed to untangle this. See: http://cr.openjdk.java.net/~shade/scratch/ToArrayBench.java
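For context on why the two calls behave so differently, here is a self-contained paraphrase of the toArray(T[]) logic in OpenJDK's ArrayList (JDK 7 era; my sketch, not verbatim, with the backing fields mimicked locally). The zero-sized call always takes the Arrays.copyOf branch, which HotSpot can turn into an allocate-and-copy stub; the pre-sized call always takes the System.arraycopy branch into an array that was already zeroed at allocation.

```java
import java.util.Arrays;

// Paraphrase of java.util.ArrayList#toArray(T[]), with elementData/size
// mimicked as local fields so the two branches can be exercised directly.
public class ToArrayParaphrase {
    private final Object[] elementData;
    private final int size;

    ToArrayParaphrase(Object[] data) {
        this.elementData = data;
        this.size = data.length;
    }

    @SuppressWarnings("unchecked")
    public <T> T[] toArray(T[] a) {
        if (a.length < size) {
            // Too small: allocate a fresh array of a's runtime type and copy.
            return (T[]) Arrays.copyOf(elementData, size, a.getClass());
        }
        // Big enough: copy into the caller's (already zeroed) array.
        System.arraycopy(elementData, 0, a, 0, size);
        if (a.length > size) {
            a[size] = null; // contract: null-terminate oversized arrays
        }
        return a;
    }

    public static void main(String[] args) {
        ToArrayParaphrase list = new ToArrayParaphrase(new Object[] {(byte) 0, (byte) 1});
        Byte[] fresh = list.toArray(new Byte[0]);   // copyOf branch
        Byte[] reused = list.toArray(new Byte[5]);  // arraycopy branch
        System.out.println(fresh.length);   // 2
        System.out.println(reused[2]);      // null
    }
}
```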
java -version
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
Ubuntu 14.10, 64-bit
Run complete. Total time: 00:40:26