Casper Node LRT3 report - Oct 12, 2020
- memory bloating after stests workloads (gossiper, etc.)
- sync on restart https://casperlabs.atlassian.net/browse/NDRS-486
- http server doesn't start until sync complete https://casperlabs.atlassian.net/browse/NDRS-480
- linear chain memory hoard https://casperlabs.atlassian.net/browse/NDRS-478
- queue metrics https://casperlabs.atlassian.net/browse/NDRS-422
- http server sometimes starts (unexpectedly) before sync is complete https://casperlabs.atlassian.net/browse/NDRS-505
- finalized block that hasn't been proposed! https://casperlabs.atlassian.net/browse/NDRS-506
- panicked at 'should create ro txn: ReadersFull' https://casperlabs.atlassian.net/browse/NDRS-508
- live stake too low yet the LFB moved https://casperlabs.atlassian.net/browse/NDRS-510
- repetitive INFO logs for "added deploy hash" for the same hash https://casperlabs.atlassian.net/browse/NDRS-511
- memory linear chain eviction https://casperlabs.atlassian.net/browse/NDRS-482
- joiner queue kind suppressed in priority https://casperlabs.atlassian.net/browse/NDRS-490
- identifier in getresponse and getrequest reactor event traces https://casperlabs.atlassian.net/browse/NDRS-489
- defaults in chainspec https://casperlabs.atlassian.net/browse/NDRS-488
- gauge metric for reactor queues - https://casperlabs.atlassian.net/browse/NDRS-487
----------------------------------------
idx ip port public_key_hex stake stake_pct
0 172.85.63.221 7777 017448378f74b316c569f6466b6bf4863c929a30dd59e3... 100000000000000 0.333333
1 172.85.59.70 7777 0123806a5ea3b6ea8b460287a63f4f9cd369461abbdf02... 200000000000000 0.666667
2 172.85.77.211 7777 012249e7d56f03863fc6da3f2e2ebaa221fe876e97e1aa... 300000000000000 1.000000
3 172.85.76.71 7777 01564089b0327cb66c4a39a55c03621afb43598edd9601... 400000000000000 1.333333
4 172.85.83.255 7777 01cacdafc1e37050b4b9bce6834a7f34d4a0a7ec6e3718... 500000000000000 1.666667
5 172.85.92.130 7777 01a30191e0edfda795c46f3306fd1bd9eaf732c63aa763... 600000000000000 2.000000
6 172.51.55.0 7777 0126c4ca463bfffb975c16bb504dfb73c2ca3556087633... 700000000000000 2.333333
7 172.51.61.128 7777 01f61621b42ba1a6b7ad031092d89226074e14ed2f70be... 800000000000000 2.666667
8 172.51.94.194 7777 01a57c18f62d3b21b726b1b0fc5cbb4931fc0039e51c63... 900000000000000 3.000000
9 172.51.86.16 7777 012477fc1166fc2ab33fe442b83e702d329f8ece431d68... 1000000000000000 3.333333
10 172.51.66.159 7777 01993fd7ce7fbeb101fef00a676264f715a55cb8900285... 1100000000000000 3.666667
11 172.51.76.219 7777 01d43b5936675e20d8310ead32c231dd21f126eb2b257c... 1200000000000000 4.000000
12 172.142.66.161 7777 0110caecda77ed48374889fbab3b23051dbeed5e0154eb... 1300000000000000 4.333333
13 172.142.65.48 7777 01d9efcab2ea41b63baae674fd883b8fb49a74d424e1be... 1400000000000000 4.666667
14 172.142.50.8 7777 018ba1e48925a48743c7a7ba81f2b7b16eaa1f9878e9a4... 1500000000000000 5.000000
15 172.142.56.109 7777 0165f450c0096451b9b337aa01f13c9e8738d352ded13c... 1600000000000000 5.333333
16 172.142.82.82 7777 018b230de03f532e380c782815e23d8891def6ef838dcb... 1700000000000000 5.666667
17 172.142.93.130 7777 0103fa078970f8f2b73a3db2af246573221ad673b9f37b... 1800000000000000 6.000000
18 172.96.86.203 7777 01ad8cac60e805f6488b0b270540bb23d61cbbe4eff076... 1900000000000000 6.333333
19 172.96.82.84 7777 018c3c5185fcd0838df84e7b4ea2091cb76b1ad74e7ecd... 2000000000000000 6.666667
20 172.96.67.60 7777 0122322b8303abc6b572958b4e09a664e10a5f8572cdcc... 2100000000000000 7.000000
21 172.96.73.159 7777 01f2124a331a91d60ea03b873cb0fee34185ac74825461... 2200000000000000 7.333333
22 172.96.52.96 7777 01da733017a213fa6fb25b065201b8c79e67b8b35947bb... 2300000000000000 7.666667
23 172.96.52.130 7777 016b2fdcaf5701a054ee0720763fc548a309195be796ea... 2400000000000000 8.000000
----------------------------------------
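The stake_pct column above is each validator's stake as a percentage of the total genesis stake. As a sanity check, the column can be recomputed from the stakes in the table (the linear 100e12 × (i + 1) pattern is read off the table, not from any config):

```python
# Recompute the stake_pct column from the table above.
# Stakes follow the pattern 100e12 * (i + 1) motes for node index i.
stakes = [100_000_000_000_000 * (i + 1) for i in range(24)]
total = sum(stakes)

stake_pct = [100 * s / total for s in stakes]
print(f"{stake_pct[0]:.6f}")   # node 0  -> 0.333333
print(f"{stake_pct[23]:.6f}")  # node 23 -> 8.000000
```

This matches the first and last rows of the table (0.333333 and 8.000000).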
- Version - Oct 2: git commit f1f284910244fb20a4a83488634177fd1e586228 on master
- chainspec defaults:
era_duration = '30seconds'
minimum_era_height = 10
booking_duration = '10days'
entropy_duration = '3hours'
voting_period_duration = '2days'
finality_threshold_percent = 12
minimum_round_exponent = 12
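The duration values in the chainspec are humantime-style strings like '30seconds' and '10days'. A minimal, pure-Python sketch of how such strings map to seconds (the unit table below is an assumption; the real Rust-side parser accepts more unit spellings):

```python
import re

# Sketch: convert chainspec duration strings ('30seconds', '3hours',
# '10days') into seconds. The unit set here is an assumption.
UNITS = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}

def duration_seconds(s: str) -> int:
    m = re.fullmatch(r"(\d+)\s*([a-z]+)", s.strip())
    if not m:
        raise ValueError(f"unparseable duration: {s!r}")
    value, unit = int(m.group(1)), m.group(2)
    return value * UNITS[unit]

assert duration_seconds("30seconds") == 30
assert duration_seconds("3hours") == 10_800
assert duration_seconds("10days") == 864_000
```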
- stopped 3 nodes for restart tests: indexes 0, 1, 23 in _nodes_stake_distribution.md above.
172.85.59.70 - pre-existing with TRACE
172.85.63.221 - pre-existing with TRACE
172.96.52.130 - clear state with TRACE
- Fails to sync:
- Post restart (doesn't pick up sync)
- Late bloomer - attempts to catch up after a couple of eras. Synced partially, but super slowly.
- Version - Oct 8
casper-node v0.1.0_3-438-geccab1cf-eccab1cf
- chainspec defaults:
era_duration = '1500seconds'
minimum_era_height = 100
booking_duration = '10days'
entropy_duration = '3hours'
voting_period_duration = '2days'
finality_threshold_percent = 12
minimum_round_exponent = 14
$ stests-cache-view-run --run 1 --type wg-100
Phase / Step Execution Start Time Execution End Time Duration (s) Deploys Action Status
R-001 - - 27.073 50 -- COMPLETE
R-001.P-01 - - 26.857 50 -- COMPLETE
R-001.P-01.S-01 - - 0.063 0 create-accounts COMPLETE
R-001.P-01.S-02 - - 26.653 50 fund-run-accounts COMPLETE
LRT-03 - WG-100 - Run 1
$ stests-cache-view-run --run 1 --type wg-110
Phase / Step Execution Start Time Execution End Time Duration (s) Deploys Action Status
R-001 - -- 14031.505 49 -- IN_PROGRESS
R-001.P-01 - -- 14031.474 49 -- IN_PROGRESS
R-001.P-01.S-01 - - 0.053 0 create-accounts COMPLETE
R-001.P-01.S-02 - -- 14031.354 49 fund-run-accounts IN_PROGRESS
LRT-03 - WG-110 - Run 1
- Repetitive deploy log messages for a single node.
- After the full wasm transfers, the majority of the network panicked and only 5 nodes remained up.
Apparently the fix that went in 15 days ago (casper-network/casper-node#282) may need a follow-up.
ec2-uw2-1a-LRT3-benchN1 - 172.142.66.161 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-uw2-1a-LRT3-benchN4 - 172.142.65.48 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-uw2-1a-LRT3-benchN0 - 172.142.50.8 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-uw2-1a-LRT3-benchN3 - 172.142.56.109 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-uw2-1a-LRT3-benchN5 - 172.142.82.82 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ew2-1a-LRT3-benchN2 - 172.96.86.203 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ec1-1a-LRT3-benchN3 - 172.85.59.70 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ec1-1a-LRT3-benchN1 - 172.85.77.211 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ec1-1a-LRT3-benchN4 - 172.85.76.71 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ase2-1a-LRT3-benchN3 - 172.51.55.0 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ase2-1a-LRT3-benchN0 - 172.51.61.128 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ase2-1a-LRT3-benchN5 - 172.51.94.194 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ase2-1a-LRT3-benchN2 - 172.51.86.16 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ase2-1a-LRT3-benchN4 - 172.51.66.159 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ase2-1a-LRT3-benchN1 - 172.51.76.219 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ec1-1a-LRT3-benchN5 - 172.85.83.255 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-uw2-1a-LRT3-benchN2 - 172.142.93.130 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ew2-1a-LRT3-benchN5 - 172.96.82.84 - running
ec2-ew2-1a-LRT3-benchN4 - 172.96.67.60 - running
ec2-ew2-1a-LRT3-benchN1 - 172.96.73.159 - running
ec2-ew2-1a-LRT3-benchN0 - 172.96.52.96 - running
ec2-ec1-1a-LRT3-benchN2 - 172.85.92.130 - running
ec2-ew2-1a-LRT3-benchN3 - 172.96.52.130 - stopped / restart sync super slow - REST metrics started before sync completion
ec2-ec1-1a-LRT3-benchN0 - 172.85.63.221 - stopped / restart sync tests failed
- Even though live stake is only about 30% (i.e., only 5 of 24 nodes were up) while the FTT was 12, the network is somehow still live:
port hex stake lfb_hash era_id height tstamp stake_pct
172.85.92.130 7777 01a30191e0edfda795c46f3306fd1bd9eaf732c63aa763... 600000000000000 4f47e205e8da6b0a5bacdb31c36e7377514cf18d05bb00... 117.0 11891.0 2020-10-11T07:09:34.080Z 2.000000
172.96.82.84 7777 018c3c5185fcd0838df84e7b4ea2091cb76b1ad74e7ecd... 2000000000000000 4f47e205e8da6b0a5bacdb31c36e7377514cf18d05bb00... 117.0 11891.0 2020-10-11T07:09:34.080Z 6.666667
172.96.67.60 7777 0122322b8303abc6b572958b4e09a664e10a5f8572cdcc... 2100000000000000 4f47e205e8da6b0a5bacdb31c36e7377514cf18d05bb00... 117.0 11891.0 2020-10-11T07:09:34.080Z 7.000000
172.96.73.159 7777 01f2124a331a91d60ea03b873cb0fee34185ac74825461... 2200000000000000 4f47e205e8da6b0a5bacdb31c36e7377514cf18d05bb00... 117.0 11891.0 2020-10-11T07:09:34.080Z 7.333333
172.96.52.96 7777 01da733017a213fa6fb25b065201b8c79e67b8b35947bb... 2300000000000000 4f47e205e8da6b0a5bacdb31c36e7377514cf18d05bb00... 117.0 11891.0 2020-10-11T07:09:34.080Z 7.666667
----------------------------------------
Current live stake in network: 30.67
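The live-stake figure is just the sum of stake_pct over the nodes still running. A quick sketch using the five survivors from the table above:

```python
# Sketch: live stake = sum of stake_pct over nodes still running.
# The five survivors and their stake_pct values are from the table above.
live_stake_pct = {
    "172.85.92.130": 2.000000,
    "172.96.82.84": 6.666667,
    "172.96.67.60": 7.000000,
    "172.96.73.159": 7.333333,
    "172.96.52.96": 7.666667,
}
live = sum(live_stake_pct.values())
print(f"Current live stake in network: {live:.2f}")  # -> 30.67
```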
- Version - Oct 10
casper-node v0.1.0_3-484-ga7d0f324-a7d0f324
- chainspec defaults:
era_duration = '1500seconds'
minimum_era_height = 100
booking_duration = '10days'
entropy_duration = '3hours'
voting_period_duration = '2days'
finality_threshold_percent = 12
minimum_round_exponent = 14
Had 2 stopped nodes for this test. Failed again with results similar to Run 1. This time, though:
- one of the nodes repeatedly got a 6/ABRT signal and died.
Oct 12 12:27:21 ec2-ew2-1a-LRT3-benchN3 systemd[1]: casper-node.service: Main process exited, code=killed, status=6/ABRT
Oct 12 12:27:21 ec2-ew2-1a-LRT3-benchN3 systemd[1]: casper-node.service: Failed with result 'signal'.
- The other node didn't complete sync but started its http server.
Network = lrt3.
Generator ID Execution Start Time Execution End Time Duration (s) Deploys Status Step
WG-100 R-001 2020-10-12T13:45:28.950 2020-10-12T13:45:57.043 28.094 50 COMPLETE --
WG-110 R-001 2020-10-12T13:59:37.909 2020-10-12T14:00:04.970 27.061 50 COMPLETE --
WG-110 R-002 2020-10-12T14:07:08.123 2020-10-12T17:46:04.775 13136.652 50 COMPLETE --
wg-100
ERROR on
{"timestamp":"Oct 12 13:46:23.918","level":"ERROR","target":"casper_node::components::deploy_buffer","fields":{"message":"finalized block that hasn't been proposed!"}}
wg-110
As soon as I launched this afterwards, 9 nodes panicked with the same error as in Run 2:
panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
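For context on the panic: LMDB keeps a fixed-size reader table, and opening a read-only transaction when every slot is taken fails with MDB_READERS_FULL. The toy model below (pure Python, not LMDB itself; the max_readers value is arbitrary) illustrates the failure mode:

```python
# Illustration only (not LMDB itself): a fixed-size reader table, as in
# LMDB, where opening a read txn beyond capacity fails the way
# "should create ro txn: ReadersFull" does. max_readers is arbitrary here.
class ReadersFull(Exception):
    pass

class ReaderTable:
    def __init__(self, max_readers: int):
        self.max_readers = max_readers
        self.active = 0

    def begin_ro_txn(self) -> None:
        if self.active >= self.max_readers:
            raise ReadersFull("should create ro txn: ReadersFull")
        self.active += 1

    def abort(self) -> None:
        self.active -= 1

table = ReaderTable(max_readers=2)
table.begin_ro_txn()
table.begin_ro_txn()
try:
    table.begin_ro_txn()  # third concurrent reader: table is full
except ReadersFull as e:
    print(e)
```

The usual remedies are making sure read transactions are aborted/dropped promptly, or raising the environment's max-readers limit.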
Current live stake in network: 60.67
stake_pct lfb_hash era_id height tstamp
172.85.77.211 1.000000 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.85.76.71 1.333333 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.85.92.130 2.000000 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.51.61.128 2.666667 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.51.86.16 3.333333 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.51.66.159 3.666667 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.142.66.161 4.333333 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.142.65.48 4.666667 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.142.50.8 5.000000 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.142.56.109 5.333333 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.142.93.130 6.000000 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.96.86.203 6.333333 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.96.73.159 7.333333 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
172.96.52.96 7.666667 fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d... 10.0 1054.0 2020-10-12T14:04:30.592Z
----------------------------------------
- Run 3 hammered the network completely; current state:
----------------------------------------
Current live stake in network: 27.33
stake_pct lfb_hash era_id height tstamp
172.85.77.211 1.000000 0ff99ab7f512681b0d1512b0e2eddc5730c7eef48d16d7... 10.0 1059.0 2020-10-12T14:06:25.280Z
172.85.92.130 2.000000 0ff99ab7f512681b0d1512b0e2eddc5730c7eef48d16d7... 10.0 1059.0 2020-10-12T14:06:25.280Z
172.142.65.48 4.666667 0ff99ab7f512681b0d1512b0e2eddc5730c7eef48d16d7... 10.0 1059.0 2020-10-12T14:06:25.280Z
172.142.93.130 6.000000 0ff99ab7f512681b0d1512b0e2eddc5730c7eef48d16d7... 10.0 1059.0 2020-10-12T14:06:25.280Z
172.96.86.203 6.333333 0ff99ab7f512681b0d1512b0e2eddc5730c7eef48d16d7... 10.0 1059.0 2020-10-12T14:06:25.280Z
172.96.73.159 7.333333 0ff99ab7f512681b0d1512b0e2eddc5730c7eef48d16d7... 10.0 1059.0 2020-10-12T14:06:25.280Z
----------------------------------------
Network is stalled and not moving at this point (as opposed to the LFB movement observed in Run 2 at ~30% live stake after the wg-110 workloads).
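A stall like this shows up as the reported LFB height not advancing on any surviving node between polls. A sketch of that check (heights are the Run 3 values above; the second poll is hypothetical):

```python
# Sketch: the chain is stalled if no surviving node's LFB height has
# advanced between two polls. Heights are the Run 3 values above; the
# second poll here is hypothetical.
def is_stalled(heights_then: dict, heights_now: dict) -> bool:
    return all(heights_now.get(ip, h) <= h for ip, h in heights_then.items())

then = {"172.85.77.211": 1059, "172.85.92.130": 1059, "172.96.86.203": 1059}
now = {"172.85.77.211": 1059, "172.85.92.130": 1059, "172.96.86.203": 1059}
print(is_stalled(then, now))  # -> True: no node's LFB height moved
```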
Impossible at the moment - https://casperlabs.atlassian.net/browse/FEAT-227 (ref: https://casperlabs-team.slack.com/archives/CENLF23FG/p1602141096066000)