@relyt0925
Created August 18, 2024 21:34
[root@tyler-a100-newimage-val instructlab]# nohup /root/bin/ilab.sh train --strategy lab-multiphase --phased-phase1-data /var/mnt/inststg1/instructlab/generated/knowledge_train_msgs_2024-08-18T15_57_14.jsonl --phased-phase2-data /var/mnt/inststg1/instructlab/generated/skills_train_msgs_2024-08-18T15_57_14.jsonl --phased-base-dir /var/mnt/inststg1/instructlab/phasedbasedir --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2 --phased-mt-bench-judge /var/mnt/inststg1/instructlab/models/prometheus-eval/prometheus-8x7b-v2.0/ --max-batch-len 10000 --max-seq-len 4096 --phased-phase1-effective-batch-size 128 --phased-phase2-effective-batch-size 3840 --enable-serving-output --gpus 8 --skip-user-confirm --model-path /var/mnt/inststg1/instructlab/models/granite-7b-starter1.1/ &
[root@tyler-a100-newimage-val instructlab]# cat nohup.out
time="2024-08-18T20:04:24Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
You are using an aliased command, this will be deprecated in a future release. Please consider using `ilab model train` instead
Training Phase 1/2...
TrainingArgs for current phase: TrainingArgs(model_path='/var/mnt/inststg1/instructlab/models/granite-7b-starter1.1/', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/var/mnt/inststg1/instructlab/generated/knowledge_train_msgs_2024-08-18T15_57_14.jsonl', ckpt_output_dir='/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints', data_output_dir='/var/mnt/inststg1/instructlab/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=10000, num_epochs=2, effective_batch_size=128, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=False, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=None)
[2024-08-18 20:04:33,199] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /var/mnt/inststg1/instructlab/.triton/autotune: No such file or directory
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
INFO 2024-08-18 20:04:40,050 numexpr.utils:145: Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO 2024-08-18 20:04:40,051 numexpr.utils:148: Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO 2024-08-18 20:04:40,051 numexpr.utils:161: NumExpr defaulting to 16 threads.
INFO 2024-08-18 20:04:40,206 datasets:58: PyTorch version 2.3.1 available.
INFO 2024-08-18 20:04:40,465 root:611: eos: 32001, pad: 32002, system: 32003, user: 32004, assistant: 32005
Generating train split: 267 examples [00:00, 11032.42 examples/s]
tokenizing the dataset with /var/mnt/inststg1/instructlab/models/granite-7b-starter1.1/ tokenizer...
Map (num_proc=16): 100% 267/267 [00:00<00:00, 398.01 examples/s]
ten largest length percentiles:
Map (num_proc=16): 100% 267/267 [00:00<00:00, 1626.59 examples/s]
quantile 90th: 1116.4
quantile 91th: 1138.42
quantile 92th: 1165.36
quantile 93th: 1181.0400000000002
quantile 94th: 1210.08
quantile 95th: 1226.2999999999997
quantile 96th: 1278.36
quantile 97th: 1652.6199999999994
quantile 98th: 1689.72
quantile 99th: 1712.7599999999998
quantile 100th: 1734.0
at 4096 max sequence length, the number of samples to be dropped is 0
(0.00% of total)
quantile 0th: 255.0
quantile 1th: 284.66
quantile 2th: 288.32
quantile 3th: 295.88
quantile 4th: 301.0
quantile 5th: 303.0
quantile 6th: 318.84
quantile 7th: 320.62
quantile 8th: 322.28
quantile 9th: 324.94
quantile 10th: 327.6
at 20 min sequence length, the number of samples to be dropped is 0
checking the validity of the samples...
Filter (num_proc=16): 100% 267/267 [00:00<00:00, 435.71 examples/s]
INFO 2024-08-18 20:04:48,018 root:611: number of dropped samples: 0 -- out of 267
Categorizing training data type...
Data type sorting: 100% 267/267 [00:00<00:00, 468764.83it/s]
unmasking the appropriate message content...
Map (num_proc=16): 100% 267/267 [00:00<00:00, 1418.85 examples/s]
The following are some examples of the processed data, with masked tokens (not to be learned) represented with <mask>. The unmasked tokens are the ones the model will learn to predict. Please review these samples to ensure the model is learning to predict expected tokens.
Pretraining ex sample 186: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask>
The Social Security number (SSN) is a nine-digit identifier in the format "AAA-GG-SSSS," consisting of an area number, group number, and serial number. Prior to June 25, 2011, area numbers were assigned based on geographical region, with numbers issued from the northeast to the southwest. However, the SSN assignment process was randomized in 2011, eliminating the geographical significance of the first three digits and the significance of the highest group number assigned for each area number. Unassigned area numbers, excluding 000, 666, and 900-999, were introduced for assignment. The middle two digits, the group number, range from 01 to 99 and were not assigned consecutively in an area. The last four digits are the serial number. Individual Taxpayer Identification Numbers (ITINs) are not affected by this SSA change as they are issued by the IRS.
What are the three parts of a Social Security number?
<mask>
A Social Security number consists of an area number, group number, and serial number.
<|endoftext|>
Original Input: <|system|>
I am, Red Hat® Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.
<|user|>
The Social Security number (SSN) is a nine-digit identifier in the format "AAA-GG-SSSS," consisting of an area number, group number, and serial number. Prior to June 25, 2011, area numbers were assigned based on geographical region, with numbers issued from the northeast to the southwest. However, the SSN assignment process was randomized in 2011, eliminating the geographical significance of the first three digits and the significance of the highest group number assigned for each area number. Unassigned area numbers, excluding 000, 666, and 900-999, were introduced for assignment. The middle two digits, the group number, range from 01 to 99 and were not assigned consecutively in an area. The last four digits are the serial number. Individual Taxpayer Identification Numbers (ITINs) are not affected by this SSA change as they are issued by the IRS.
What are the three parts of a Social Security number?
<|assistant|>
A Social Security number consists of an area number, group number, and serial number.
<|endoftext|>
Pretraining ex sample 75: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask>
Personally Identifiable Information (PII) refers to any information that can be used to identify a specific individual, such as their social security number, full name, email address, or phone number. With the increasing reliance on information technology, the amount of PII shared with organizations has grown, making it a target for cybercriminals. Hackers steal PII to commit identity theft, sell it on the black market, or hold it captive via ransomware, leading to significant costs for individuals and organizations.
PII can be categorized into direct and indirect identifiers. Direct identifiers, such as passport or driver's license numbers, are unique to a person and sufficient to determine their identity. Indirect identifiers, like race and place of birth, are not unique but can identify a person when combined, such as gender, ZIP code, and date of birth.
PII can also be classified as sensitive or non-sensitive. Sensitive PII, such as social security numbers, unique identification numbers, biometric data, financial information, and medical records, directly identifies an individual and could cause significant harm if leaked or stolen. Non-sensitive PII, like a person's full name, mother's maiden name, telephone number, IP address, place of birth, date of birth, geographical details, employment information, email address or mailing address, race or ethnicity, and religion, may or may not be unique to a person but would not cause significant harm if leaked or stolen in isolation. However, when combined, they can still pose risks.
Data privacy laws typically require organizations to safeguard sensitive PII with encryption, access control, or other cybersecurity measures, while non-sensitive PII may or may not be protected depending on the regulations and the organization's policies. The classification of PII as sensitive or non-sensitive depends on the context, such as the specific use case or potential harm resulting from a breach.
What is the difference between sensitive and non-sensitive PII?
<mask>
Sensitive PII, such as social security numbers, unique identification numbers, biometric data, financial information, and medical records, directly identifies an individual and could cause significant harm if leaked or stolen. Non-sensitive PII, like a person's full name, mother's maiden name, telephone number, IP address, place of birth, date of birth, geographical details, employment information, email address or mailing address, race or ethnicity, and religion, may or may not be unique to a person but would not cause significant harm if leaked or stolen in isolation. However, when combined, they can still pose risks.
<|endoftext|>
Original Input: <|system|>
I am, Red Hat® Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.
<|user|>
Personally Identifiable Information (PII) refers to any information that can be used to identify a specific individual, such as their social security number, full name, email address, or phone number. With the increasing reliance on information technology, the amount of PII shared with organizations has grown, making it a target for cybercriminals. Hackers steal PII to commit identity theft, sell it on the black market, or hold it captive via ransomware, leading to significant costs for individuals and organizations.
PII can be categorized into direct and indirect identifiers. Direct identifiers, such as passport or driver's license numbers, are unique to a person and sufficient to determine their identity. Indirect identifiers, like race and place of birth, are not unique but can identify a person when combined, such as gender, ZIP code, and date of birth.
PII can also be classified as sensitive or non-sensitive. Sensitive PII, such as social security numbers, unique identification numbers, biometric data, financial information, and medical records, directly identifies an individual and could cause significant harm if leaked or stolen. Non-sensitive PII, like a person's full name, mother's maiden name, telephone number, IP address, place of birth, date of birth, geographical details, employment information, email address or mailing address, race or ethnicity, and religion, may or may not be unique to a person but would not cause significant harm if leaked or stolen in isolation. However, when combined, they can still pose risks.
Data privacy laws typically require organizations to safeguard sensitive PII with encryption, access control, or other cybersecurity measures, while non-sensitive PII may or may not be protected depending on the regulations and the organization's policies. The classification of PII as sensitive or non-sensitive depends on the context, such as the specific use case or potential harm resulting from a breach.
What is the difference between sensitive and non-sensitive PII?
<|assistant|>
Sensitive PII, such as social security numbers, unique identification numbers, biometric data, financial information, and medical records, directly identifies an individual and could cause significant harm if leaked or stolen. Non-sensitive PII, like a person's full name, mother's maiden name, telephone number, IP address, place of birth, date of birth, geographical details, employment information, email address or mailing address, race or ethnicity, and religion, may or may not be unique to a person but would not cause significant harm if leaked or stolen in isolation. However, when combined, they can still pose risks.
<|endoftext|>
Creating json from Arrow format: 100% 1/1 [00:00<00:00, 23.07ba/s]
Running command: torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=123 --rdzv_endpoint=127.0.0.1:12222 /opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py --model_name_or_path=/var/mnt/inststg1/instructlab/models/granite-7b-starter1.1/ --data_path=/var/mnt/inststg1/instructlab/.local/share/instructlab/internal/data.jsonl --output_dir=/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints --num_epochs=2 --effective_batch_size=128 --learning_rate=2e-05 --num_warmup_steps=25 --save_samples=0 --log_level=INFO --max_batch_len=10000 --seed=42 --chat-tmpl-path=/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py --checkpoint_at_epoch
W0818 20:04:49.993000 140562764190144 torch/distributed/run.py:757]
W0818 20:04:49.993000 140562764190144 torch/distributed/run.py:757] *****************************************
W0818 20:04:49.993000 140562764190144 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0818 20:04:49.993000 140562764190144 torch/distributed/run.py:757] *****************************************
[2024-08-18 20:04:52,891] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:04:53,058] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:04:53,222] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:04:53,242] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:04:53,264] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:04:53,304] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:04:53,305] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:04:53,335] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
model_name_or_path: /var/mnt/inststg1/instructlab/models/granite-7b-starter1.1/
data_path: /var/mnt/inststg1/instructlab/.local/share/instructlab/internal/data.jsonl
output_dir: /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints
num_epochs: 2
last_step: 0
effective_batch_size: 128
learning_rate: 2.0e-05
lr_scheduler: cosine
num_warmup_steps: 25
save_samples: 0
save_samples_ds: null
save_last: false
checkpoint_at_epoch: true
log_level: INFO
seed: 42
mock_data: false
mock_len: 2600
sharding_strategy: FULL_SHARD
is_granite: false
lora_r: 0
lora_alpha: 32
lora_dropout: 0.1
lora_quant_bits: null
lora_target_modules: null
max_batch_len: 10000
cpu_offload_optimizer: false
cpu_offload_optimizer_pin_memory: false
cpu_offload_optimizer_ratio: 1.0
NEFTune_alpha: null
chat_tmpl_path: /opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py
disable_flash_attn: false
{
"script_params": {
"model_name_or_path": "/var/mnt/inststg1/instructlab/models/granite-7b-starter1.1/",
"data_path": "/var/mnt/inststg1/instructlab/.local/share/instructlab/internal/data.jsonl",
"output_dir": "/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints",
"num_epochs": 2,
"last_step": 0,
"effective_batch_size": 128,
"learning_rate": 2e-05,
"lr_scheduler": "cosine",
"num_warmup_steps": 25,
"save_samples": 0,
"save_samples_ds": null,
"save_last": false,
"checkpoint_at_epoch": true,
"log_level": "INFO",
"seed": 42,
"mock_data": false,
"mock_len": 2600,
"sharding_strategy": "FULL_SHARD",
"is_granite": false,
"lora_r": 0,
"lora_alpha": 32,
"lora_dropout": 0.1,
"lora_quant_bits": null,
"lora_target_modules": null,
"max_batch_len": 10000,
"cpu_offload_optimizer": false,
"cpu_offload_optimizer_pin_memory": false,
"cpu_offload_optimizer_ratio": 1.0,
"NEFTune_alpha": null,
"chat_tmpl_path": "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py",
"disable_flash_attn": false
},
"timestamp": "2024-08-18T20:04:56.779187"
}
[2024-08-18 20:04:56,857] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-18 20:04:56,857] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-08-18 20:04:57,155] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-18 20:04:57,392] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-18 20:04:57,719] [INFO] [comm.py:637:init_distributed] cdb=None
tyler-a100-newimage-val:570:570 [0] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:570:570 [0] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:570:570 [0] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-a100-newimage-val:572:572 [2] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:572:572 [2] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:572:572 [2] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-a100-newimage-val:573:573 [3] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:573:573 [3] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:573:573 [3] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-a100-newimage-val:574:574 [4] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:574:574 [4] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:574:574 [4] NCCL INFO NCCL version 2.22.3+cuda12.5
[2024-08-18 20:04:57,854] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-18 20:04:57,858] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-18 20:04:57,865] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-18 20:04:57,877] [INFO] [comm.py:637:init_distributed] cdb=None
tyler-a100-newimage-val:576:576 [6] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:576:576 [6] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:576:576 [6] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-a100-newimage-val:571:571 [1] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:571:571 [1] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:571:571 [1] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-a100-newimage-val:577:577 [7] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:577:577 [7] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:577:577 [7] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-a100-newimage-val:575:575 [5] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:575:575 [5] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:575:575 [5] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-a100-newimage-val:570:1300 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:570:1300 [0] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:570:1300 [0] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Using network Socket
tyler-a100-newimage-val:572:1301 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:572:1301 [2] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:572:1301 [2] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:572:1301 [2] NCCL INFO Using network Socket
tyler-a100-newimage-val:574:1303 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:573:1302 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:573:1302 [3] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:574:1303 [4] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:574:1303 [4] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:573:1302 [3] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:573:1302 [3] NCCL INFO Using network Socket
tyler-a100-newimage-val:574:1303 [4] NCCL INFO Using network Socket
tyler-a100-newimage-val:576:1312 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:576:1312 [6] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:576:1312 [6] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:576:1312 [6] NCCL INFO Using network Socket
tyler-a100-newimage-val:577:1314 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:577:1314 [7] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:577:1314 [7] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:577:1314 [7] NCCL INFO Using network Socket
tyler-a100-newimage-val:575:1315 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:575:1315 [5] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:575:1315 [5] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:575:1315 [5] NCCL INFO Using network Socket
tyler-a100-newimage-val:571:1313 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:571:1313 [1] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:571:1313 [1] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:571:1313 [1] NCCL INFO Using network Socket
tyler-a100-newimage-val:575:1315 [5] NCCL INFO ncclCommInitRank comm 0x5636163750f0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0xa1bb5af6fed5ca65 - Init START
tyler-a100-newimage-val:572:1301 [2] NCCL INFO ncclCommInitRank comm 0x55e6aed5c790 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0xa1bb5af6fed5ca65 - Init START
tyler-a100-newimage-val:571:1313 [1] NCCL INFO ncclCommInitRank comm 0x55f7ae069780 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0xa1bb5af6fed5ca65 - Init START
tyler-a100-newimage-val:577:1314 [7] NCCL INFO ncclCommInitRank comm 0x558d760ca530 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0xa1bb5af6fed5ca65 - Init START
tyler-a100-newimage-val:576:1312 [6] NCCL INFO ncclCommInitRank comm 0x5640cf2b7db0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0xa1bb5af6fed5ca65 - Init START
tyler-a100-newimage-val:573:1302 [3] NCCL INFO ncclCommInitRank comm 0x55e7b0065170 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0xa1bb5af6fed5ca65 - Init START
tyler-a100-newimage-val:570:1300 [0] NCCL INFO ncclCommInitRank comm 0x55d34ff865c0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0xa1bb5af6fed5ca65 - Init START
tyler-a100-newimage-val:574:1303 [4] NCCL INFO ncclCommInitRank comm 0x560ada6ca910 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0xa1bb5af6fed5ca65 - Init START
tyler-a100-newimage-val:574:1303 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
tyler-a100-newimage-val:574:1303 [4] NCCL INFO NVLS multicast support is not available on dev 4
tyler-a100-newimage-val:573:1302 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
tyler-a100-newimage-val:573:1302 [3] NCCL INFO NVLS multicast support is not available on dev 3
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
tyler-a100-newimage-val:570:1300 [0] NCCL INFO NVLS multicast support is not available on dev 0
tyler-a100-newimage-val:575:1315 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
tyler-a100-newimage-val:575:1315 [5] NCCL INFO NVLS multicast support is not available on dev 5
tyler-a100-newimage-val:577:1314 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
tyler-a100-newimage-val:577:1314 [7] NCCL INFO NVLS multicast support is not available on dev 7
tyler-a100-newimage-val:576:1312 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
tyler-a100-newimage-val:576:1312 [6] NCCL INFO NVLS multicast support is not available on dev 6
tyler-a100-newimage-val:572:1301 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
tyler-a100-newimage-val:572:1301 [2] NCCL INFO NVLS multicast support is not available on dev 2
tyler-a100-newimage-val:571:1313 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
tyler-a100-newimage-val:571:1313 [1] NCCL INFO NVLS multicast support is not available on dev 1
tyler-a100-newimage-val:577:1314 [7] NCCL INFO comm 0x558d760ca530 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
tyler-a100-newimage-val:576:1312 [6] NCCL INFO comm 0x5640cf2b7db0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
tyler-a100-newimage-val:574:1303 [4] NCCL INFO comm 0x560ada6ca910 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
tyler-a100-newimage-val:574:1303 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
tyler-a100-newimage-val:577:1314 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
tyler-a100-newimage-val:576:1312 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
tyler-a100-newimage-val:574:1303 [4] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:577:1314 [7] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:576:1312 [6] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:570:1300 [0] NCCL INFO comm 0x55d34ff865c0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
tyler-a100-newimage-val:573:1302 [3] NCCL INFO comm 0x55e7b0065170 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:575:1315 [5] NCCL INFO comm 0x5636163750f0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
tyler-a100-newimage-val:573:1302 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
tyler-a100-newimage-val:570:1300 [0] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:573:1302 [3] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:575:1315 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
tyler-a100-newimage-val:575:1315 [5] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:572:1301 [2] NCCL INFO comm 0x55e6aed5c790 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
tyler-a100-newimage-val:572:1301 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
tyler-a100-newimage-val:572:1301 [2] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:571:1313 [1] NCCL INFO comm 0x55f7ae069780 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
tyler-a100-newimage-val:571:1313 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
tyler-a100-newimage-val:571:1313 [1] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:574:1303 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:574:1303 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:575:1315 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:575:1315 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:571:1313 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:571:1313 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:573:1302 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:573:1302 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:577:1314 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:577:1314 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:576:1312 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:576:1312 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:570:1300 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:570:1300 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:572:1301 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:572:1301 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:570:1300 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
tyler-a100-newimage-val:572:1301 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:572:1301 [2] NCCL INFO ncclCommInitRank comm 0x55e6aed5c790 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
tyler-a100-newimage-val:572:1301 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.82 (kernels 0.13, bootstrap 0.36, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:577:1314 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:570:1300 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:577:1314 [7] NCCL INFO ncclCommInitRank comm 0x558d760ca530 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
tyler-a100-newimage-val:570:1300 [0] NCCL INFO ncclCommInitRank comm 0x55d34ff865c0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
tyler-a100-newimage-val:577:1314 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.75 (kernels 0.25, bootstrap 0.17, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:574:1303 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:575:1315 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:576:1312 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:570:1300 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.84 (kernels 0.15, bootstrap 0.36, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:574:1303 [4] NCCL INFO ncclCommInitRank comm 0x560ada6ca910 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
tyler-a100-newimage-val:575:1315 [5] NCCL INFO ncclCommInitRank comm 0x5636163750f0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
tyler-a100-newimage-val:576:1312 [6] NCCL INFO ncclCommInitRank comm 0x5640cf2b7db0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
tyler-a100-newimage-val:573:1302 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:574:1303 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.82 (kernels 0.15, bootstrap 0.34, allgathers 0.01, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:575:1315 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.75 (kernels 0.25, bootstrap 0.17, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:576:1312 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.75 (kernels 0.24, bootstrap 0.18, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:573:1302 [3] NCCL INFO ncclCommInitRank comm 0x55e7b0065170 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
tyler-a100-newimage-val:573:1302 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.82 (kernels 0.15, bootstrap 0.34, allgathers 0.01, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:571:1313 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:571:1313 [1] NCCL INFO ncclCommInitRank comm 0x55f7ae069780 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
tyler-a100-newimage-val:571:1313 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.76 (kernels 0.37, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1335 [3] NCCL INFO Connected all rings
tyler-a100-newimage-val:572:1338 [2] NCCL INFO Connected all rings
tyler-a100-newimage-val:571:1339 [1] NCCL INFO Connected all rings
tyler-a100-newimage-val:570:1336 [0] NCCL INFO Connected all rings
tyler-a100-newimage-val:574:1333 [4] NCCL INFO Connected all rings
tyler-a100-newimage-val:577:1334 [7] NCCL INFO Connected all rings
tyler-a100-newimage-val:575:1337 [5] NCCL INFO Connected all rings
tyler-a100-newimage-val:576:1332 [6] NCCL INFO Connected all rings
Generating train split: 267 examples [00:00, 6232.18 examples/s]
Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1973.19it/s]
Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1948.77it/s]
Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1801.53it/s]
Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1713.33it/s]
Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1638.71it/s]
Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1757.64it/s]
Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1982.74it/s]
Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1824.15it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 3.69it/s]
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Creating extension directory /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
{
"num_gpus": 8,
"avg_sample_len": 622.9588014981273,
"effective_batch_size": 128,
"max_batch_len_per_gpu": 10000,
"packing_max_batch_len": 9079,
"grad_accum": 2,
"num_batches": 2,
"avg_samples_per_batch": 133.5,
"samples_per_gpu": 8,
"timestamp": "2024-08-18T20:05:10.241425"
}
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 33%|███▎ | 1/3 [00:00<00:00, 2.84it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 67%|██████▋ | 2/3 [00:00<00:00, 3.30it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 3.49it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 3.48it/s]
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 3.39it/s]
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 3.40it/s]
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 3.49it/s]
Loading checkpoint shards: 33%|███▎ | 1/3 [00:00<00:00, 2.70it/s]
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 3.31it/s]
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Loading checkpoint shards: 100%|██████████| 3/3 [00:01<00:00, 2.79it/s]
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/opt/app-root/lib64/python3.11/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 34.718958377838135 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 30.327874183654785 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 30.326067447662354 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 30.226067066192627 seconds
Time to load fused_adam op: 29.229661464691162 seconds
[2024-08-18 20:05:41,506] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-08-18 20:05:41,506] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
Loading extension module fused_adam...
Time to load fused_adam op: 30.326636791229248 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 30.327282190322876 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 29.730340242385864 seconds
tyler-a100-newimage-val:576:1419 [6] NCCL INFO Using network Socket
tyler-a100-newimage-val:574:1418 [4] NCCL INFO Using network Socket
tyler-a100-newimage-val:572:1420 [2] NCCL INFO Using network Socket
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Using network Socket
tyler-a100-newimage-val:575:1423 [5] NCCL INFO Using network Socket
tyler-a100-newimage-val:577:1422 [7] NCCL INFO Using network Socket
tyler-a100-newimage-val:571:1426 [1] NCCL INFO Using network Socket
tyler-a100-newimage-val:573:1417 [3] NCCL INFO Using network Socket
tyler-a100-newimage-val:571:1426 [1] NCCL INFO bootstrapSplit: comm 0x55f7afbd0a60 parent 0x55f7ae069780 rank 1 nranks 8 color -934961569 key 1 prev 0 next 2 - DONE
tyler-a100-newimage-val:573:1417 [3] NCCL INFO bootstrapSplit: comm 0x55e7b1bda750 parent 0x55e7b0065170 rank 3 nranks 8 color -934961569 key 3 prev 2 next 4 - DONE
tyler-a100-newimage-val:575:1423 [5] NCCL INFO bootstrapSplit: comm 0x563617fd12e0 parent 0x5636163750f0 rank 5 nranks 8 color -934961569 key 5 prev 4 next 6 - DONE
tyler-a100-newimage-val:577:1422 [7] NCCL INFO bootstrapSplit: comm 0x558d77c33a30 parent 0x558d760ca530 rank 7 nranks 8 color -934961569 key 7 prev 6 next 0 - DONE
tyler-a100-newimage-val:572:1420 [2] NCCL INFO bootstrapSplit: comm 0x55e6b08cfc90 parent 0x55e6aed5c790 rank 2 nranks 8 color -934961569 key 2 prev 1 next 3 - DONE
tyler-a100-newimage-val:574:1418 [4] NCCL INFO bootstrapSplit: comm 0x560adc23dc90 parent 0x560ada6ca910 rank 4 nranks 8 color -934961569 key 4 prev 3 next 5 - DONE
tyler-a100-newimage-val:573:1417 [3] NCCL INFO ncclCommSplit comm 0x55e7b1bda750 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x55e7b0065170 color -934961569 key 3 commId 0xed88d95c67fb6a92 - Init START
tyler-a100-newimage-val:570:1421 [0] NCCL INFO bootstrapSplit: comm 0x55d351b0d620 parent 0x55d34ff865c0 rank 0 nranks 8 color -934961569 key 0 prev 7 next 1 - DONE
tyler-a100-newimage-val:577:1422 [7] NCCL INFO ncclCommSplit comm 0x558d77c33a30 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x558d760ca530 color -934961569 key 7 commId 0xed88d95c67fb6a92 - Init START
tyler-a100-newimage-val:572:1420 [2] NCCL INFO ncclCommSplit comm 0x55e6b08cfc90 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55e6aed5c790 color -934961569 key 2 commId 0xed88d95c67fb6a92 - Init START
tyler-a100-newimage-val:576:1419 [6] NCCL INFO bootstrapSplit: comm 0x5640d0e2b2b0 parent 0x5640cf2b7db0 rank 6 nranks 8 color -934961569 key 6 prev 5 next 7 - DONE
tyler-a100-newimage-val:575:1423 [5] NCCL INFO ncclCommSplit comm 0x563617fd12e0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x5636163750f0 color -934961569 key 5 commId 0xed88d95c67fb6a92 - Init START
tyler-a100-newimage-val:574:1418 [4] NCCL INFO ncclCommSplit comm 0x560adc23dc90 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x560ada6ca910 color -934961569 key 4 commId 0xed88d95c67fb6a92 - Init START
tyler-a100-newimage-val:571:1426 [1] NCCL INFO ncclCommSplit comm 0x55f7afbd0a60 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x55f7ae069780 color -934961569 key 1 commId 0xed88d95c67fb6a92 - Init START
tyler-a100-newimage-val:570:1421 [0] NCCL INFO ncclCommSplit comm 0x55d351b0d620 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x55d34ff865c0 color -934961569 key 0 commId 0xed88d95c67fb6a92 - Init START
tyler-a100-newimage-val:576:1419 [6] NCCL INFO ncclCommSplit comm 0x5640d0e2b2b0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x5640cf2b7db0 color -934961569 key 6 commId 0xed88d95c67fb6a92 - Init START
tyler-a100-newimage-val:571:1426 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
tyler-a100-newimage-val:571:1426 [1] NCCL INFO NVLS multicast support is not available on dev 1
tyler-a100-newimage-val:575:1423 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
tyler-a100-newimage-val:575:1423 [5] NCCL INFO NVLS multicast support is not available on dev 5
tyler-a100-newimage-val:572:1420 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
tyler-a100-newimage-val:572:1420 [2] NCCL INFO NVLS multicast support is not available on dev 2
tyler-a100-newimage-val:574:1418 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
tyler-a100-newimage-val:574:1418 [4] NCCL INFO NVLS multicast support is not available on dev 4
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
tyler-a100-newimage-val:570:1421 [0] NCCL INFO NVLS multicast support is not available on dev 0
tyler-a100-newimage-val:573:1417 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
tyler-a100-newimage-val:576:1419 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
tyler-a100-newimage-val:576:1419 [6] NCCL INFO NVLS multicast support is not available on dev 6
tyler-a100-newimage-val:573:1417 [3] NCCL INFO NVLS multicast support is not available on dev 3
tyler-a100-newimage-val:577:1422 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
tyler-a100-newimage-val:577:1422 [7] NCCL INFO NVLS multicast support is not available on dev 7
tyler-a100-newimage-val:577:1422 [7] NCCL INFO comm 0x558d77c33a30 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
tyler-a100-newimage-val:576:1419 [6] NCCL INFO comm 0x5640d0e2b2b0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
tyler-a100-newimage-val:572:1420 [2] NCCL INFO comm 0x55e6b08cfc90 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
tyler-a100-newimage-val:575:1423 [5] NCCL INFO comm 0x563617fd12e0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
tyler-a100-newimage-val:570:1421 [0] NCCL INFO comm 0x55d351b0d620 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
tyler-a100-newimage-val:571:1426 [1] NCCL INFO comm 0x55f7afbd0a60 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
tyler-a100-newimage-val:574:1418 [4] NCCL INFO comm 0x560adc23dc90 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
tyler-a100-newimage-val:573:1417 [3] NCCL INFO comm 0x55e7b1bda750 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:577:1422 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
tyler-a100-newimage-val:577:1422 [7] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:576:1419 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
tyler-a100-newimage-val:576:1419 [6] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:575:1423 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:575:1423 [5] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:574:1418 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:574:1418 [4] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:572:1420 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:572:1420 [2] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:571:1426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:573:1417 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:573:1417 [3] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:571:1426 [1] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
tyler-a100-newimage-val:570:1421 [0] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:570:1421 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:570:1421 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:576:1419 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:576:1419 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:570:1421 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
tyler-a100-newimage-val:571:1426 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:571:1426 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:575:1423 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:575:1423 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:573:1417 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:573:1417 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:574:1418 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:574:1418 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:577:1422 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:577:1422 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:572:1420 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:572:1420 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:576:1419 [6] NCCL INFO ncclCommSplit comm 0x5640d0e2b2b0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x5640cf2b7db0 color -934961569 key 6 commId 0xed88d95c67fb6a92 - Init COMPLETE
tyler-a100-newimage-val:570:1421 [0] NCCL INFO ncclCommSplit comm 0x55d351b0d620 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x55d34ff865c0 color -934961569 key 0 commId 0xed88d95c67fb6a92 - Init COMPLETE
tyler-a100-newimage-val:576:1419 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:572:1420 [2] NCCL INFO ncclCommSplit comm 0x55e6b08cfc90 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55e6aed5c790 color -934961569 key 2 commId 0xed88d95c67fb6a92 - Init COMPLETE
tyler-a100-newimage-val:575:1423 [5] NCCL INFO ncclCommSplit comm 0x563617fd12e0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x5636163750f0 color -934961569 key 5 commId 0xed88d95c67fb6a92 - Init COMPLETE
tyler-a100-newimage-val:570:1421 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:572:1420 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.06, rest 0.02)
tyler-a100-newimage-val:571:1426 [1] NCCL INFO ncclCommSplit comm 0x55f7afbd0a60 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x55f7ae069780 color -934961569 key 1 commId 0xed88d95c67fb6a92 - Init COMPLETE
tyler-a100-newimage-val:575:1423 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:571:1426 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.33 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.06, rest 0.02)
tyler-a100-newimage-val:574:1418 [4] NCCL INFO ncclCommSplit comm 0x560adc23dc90 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x560ada6ca910 color -934961569 key 4 commId 0xed88d95c67fb6a92 - Init COMPLETE
tyler-a100-newimage-val:577:1422 [7] NCCL INFO ncclCommSplit comm 0x558d77c33a30 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x558d760ca530 color -934961569 key 7 commId 0xed88d95c67fb6a92 - Init COMPLETE
tyler-a100-newimage-val:574:1418 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.06, rest 0.02)
tyler-a100-newimage-val:577:1422 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:573:1417 [3] NCCL INFO ncclCommSplit comm 0x55e7b1bda750 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x55e7b0065170 color -934961569 key 3 commId 0xed88d95c67fb6a92 - Init COMPLETE
tyler-a100-newimage-val:573:1417 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.06, rest 0.02)
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:572:1444 [2] NCCL INFO Connected all rings
tyler-a100-newimage-val:573:1446 [3] NCCL INFO Connected all rings
tyler-a100-newimage-val:574:1450 [4] NCCL INFO Connected all rings
tyler-a100-newimage-val:571:1449 [1] NCCL INFO Connected all rings
tyler-a100-newimage-val:570:1445 [0] NCCL INFO Connected all rings
tyler-a100-newimage-val:577:1448 [7] NCCL INFO Connected all rings
tyler-a100-newimage-val:575:1447 [5] NCCL INFO Connected all rings
tyler-a100-newimage-val:576:1443 [6] NCCL INFO Connected all rings
[2024-08-18 20:05:47,116] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-08-18 20:05:47,117] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-08-18 20:05:47,117] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-08-18 20:05:47,130] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-08-18 20:05:47,130] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-08-18 20:05:47,130] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-08-18 20:05:47,130] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
[2024-08-18 20:05:47,130] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
[2024-08-18 20:05:47,130] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-08-18 20:05:47,130] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-08-18 20:05:59,693] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:06:00,706] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:06:00,871] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:06:01,012] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:06:01,166] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:06:01,497] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:06:01,620] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:06:01,858] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-08-18 20:06:01,859] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 17.26 GB CA 17.26 GB Max_CA 17 GB
[2024-08-18 20:06:01,860] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 30.63 GB, percent = 2.4%
[2024-08-18 20:06:02,079] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-08-18 20:06:02,080] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 18.83 GB CA 20.4 GB Max_CA 20 GB
[2024-08-18 20:06:02,080] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 30.64 GB, percent = 2.4%
[2024-08-18 20:06:02,080] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-08-18 20:06:02,301] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-08-18 20:06:02,302] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 15.69 GB CA 20.4 GB Max_CA 20 GB
[2024-08-18 20:06:02,302] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 30.64 GB, percent = 2.4%
[2024-08-18 20:06:02,304] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-08-18 20:06:02,304] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-08-18 20:06:02,304] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7eff204ab310>
[2024-08-18 20:06:02,304] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
[2024-08-18 20:06:02,305] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-08-18 20:06:02,305] [INFO] [config.py:1001:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-08-18 20:06:02,305] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-08-18 20:06:02,305] [INFO] [config.py:1001:print] amp_enabled .................. False
[2024-08-18 20:06:02,305] [INFO] [config.py:1001:print] amp_params ................... False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] bfloat16_enabled ............. True
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f0270163fd0>
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] communication_data_type ...... None
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] dataloader_drop_last ......... False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] disable_allgather ............ False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] dump_state ................... False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] elasticity_enabled ........... False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] fp16_auto_cast ............... None
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] fp16_enabled ................. False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] global_rank .................. 0
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] grad_accum_dtype ............. None
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 2
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] gradient_clipping ............ 1.0
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] graph_harvesting ............. False
[2024-08-18 20:06:02,306] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 1
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] load_universal_checkpoint .... False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] loss_scale ................... 1.0
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] memory_breakdown ............. False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] mics_shard_size .............. -1
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] optimizer_name ............... None
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] optimizer_params ............. None
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] pld_enabled .................. False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] pld_params ................... False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] prescale_gradients ........... False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] scheduler_name ............... None
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] scheduler_params ............. None
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] sparse_attention ............. None
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] steps_per_print .............. 1
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] train_batch_size ............. 128
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 8
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] use_node_local_storage ....... False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] weight_quantization_config ... None
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] world_size ................... 8
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] zero_allow_untested_optimizer False
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] zero_enabled ................. True
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True
[2024-08-18 20:06:02,307] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2
[2024-08-18 20:06:02,307] [INFO] [config.py:987:print_user_config] json = {
"train_batch_size": 128,
"gradient_accumulation_steps": 2,
"train_micro_batch_size_per_gpu": 8,
"steps_per_print": 1,
"zero_optimization": {
"stage": 2,
"offload_param": {
"device": "none"
},
"offload_optimizer": {
"device": "none"
}
},
"bf16": {
"enabled": true
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false
}
[2024-08-18 20:06:02,308] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
Epoch 0: 0%| | 0/2 [00:00<?, ?it/s]
total tokens: 8715 num samples: 15 num padding tokens: 392 - rank: 5 max len: 581 min len: 525 avg len: 554.8666666666667 num_loss_counted_tokens: 7438
total tokens: 8565 num samples: 15 num padding tokens: 401 - rank: 5 max len: 571 min len: 519 avg len: 544.2666666666667 num_loss_counted_tokens: 7279
total tokens: 8477 num samples: 7 num padding tokens: 255 - rank: 1 max len: 1211 min len: 1136 avg len: 1174.5714285714287 num_loss_counted_tokens: 7809
total tokens: 8666 num samples: 7 num padding tokens: 540 - rank: 1 max len: 1238 min len: 1097 avg len: 1160.857142857143 num_loss_counted_tokens: 7713
total tokens: 8896 num samples: 8 num padding tokens: 933 - rank: 2 max len: 1112 min len: 883 avg len: 995.375 num_loss_counted_tokens: 7491
total tokens: 8200 num samples: 8 num padding tokens: 824 - rank: 2 max len: 1025 min len: 886 avg len: 922.0 num_loss_counted_tokens: 6904
total tokens: 8492 num samples: 22 num padding tokens: 1012 - rank: 7 max len: 386 min len: 288 avg len: 340.0 num_loss_counted_tokens: 6182
total tokens: 8579 num samples: 23 num padding tokens: 1024 - rank: 7 max len: 373 min len: 282 avg len: 328.4782608695652 num_loss_counted_tokens: 6198
total tokens: 8567 num samples: 13 num padding tokens: 536 - rank: 4 max len: 659 min len: 577 avg len: 617.7692307692307 num_loss_counted_tokens: 7264
total tokens: 8723 num samples: 13 num padding tokens: 546 - rank: 4 max len: 671 min len: 581 avg len: 629.0 num_loss_counted_tokens: 7410
total tokens: 8600 num samples: 5 num padding tokens: 654 - rank: 0 max len: 1720 min len: 1230 avg len: 1589.2 num_loss_counted_tokens: 7651
total tokens: 8908 num samples: 17 num padding tokens: 1767 - rank: 6 max len: 524 min len: 384 avg len: 420.05882352941177 num_loss_counted_tokens: 6138
total tokens: 8704 num samples: 17 num padding tokens: 1162 - rank: 6 max len: 512 min len: 388 avg len: 443.6470588235294 num_loss_counted_tokens: 6539
total tokens: 8660 num samples: 5 num padding tokens: 130 - rank: 0 max len: 1732 min len: 1681 avg len: 1706.0 num_loss_counted_tokens: 8235
total tokens: 8688 num samples: 12 num padding tokens: 317 - rank: 3 max len: 724 min len: 662 avg len: 697.5833333333334 num_loss_counted_tokens: 7663
total tokens: 8712 num samples: 12 num padding tokens: 241 - rank: 3 max len: 726 min len: 673 avg len: 705.9166666666666 num_loss_counted_tokens: 7763
Per-token loss scaled by world size: 0.00018896172696258873
Per-token loss scaled by world size: 0.00022490561241284013
Per-token loss scaled by world size: 0.00020114783546887338
Per-token loss scaled by world size: 0.00021589698735624552
Per-token loss scaled by world size: 0.0002045775472652167
Per-token loss scaled by world size: 0.0001989303418667987
Per-token loss scaled by world size: 0.00020551522902678698
Epoch: 0, Step: 1, Rank: 3, loss = 1.6271358728408813
Epoch: 0, Step: 1, Rank: 6, loss = 1.3670908212661743
Epoch: 0, Step: 1, Rank: 5, loss = 1.4800673723220825
Epoch: 0, Step: 1, Rank: 2, loss = 1.455254316329956
Epoch: 0, Step: 1, Rank: 1, loss = 1.5619606971740723
Epoch: 0, Step: 1, Rank: 7, loss = 1.4392112493515015
Epoch: 0, Step: 1, Rank: 4, loss = 1.4868513345718384
Per-token loss scaled by world size: 0.0001951669983100146
Epoch: 0, Step: 1, Rank: 0, loss = 1.4119844436645508
Epoch 0: 50%|█████ | 1/2 [00:03<00:03, 3.83s/it]
{
"epoch": 0,
"step": 1,
"rank": 0,
"loss": 1.4119844436645508,
"overall_throughput": 18.825078649542128,
"lr": 0.0,
"cuda_mem_allocated": 18.31652021408081,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 57878,
"batch_size": 99,
"total_loss": 1.4786945581436157,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:06:06.148365"
}
Per-token loss scaled by world size: 0.00022342400916386396
Per-token loss scaled by world size: 0.00018432749493513256
Per-token loss scaled by world size: 0.00020338631293270737
Per-token loss scaled by world size: 0.00018564658239483833
Per-token loss scaled by world size: 0.00020801158098038286
Per-token loss scaled by world size: 0.00018980413733515888
Per-token loss scaled by world size: 0.00019465763762127608
Epoch: 0, Step: 2, Rank: 1, loss = 1.3317431211471558
Epoch: 0, Step: 2, Rank: 5, loss = 1.4694406986236572
Epoch: 0, Step: 2, Rank: 3, loss = 1.6142104864120483
Epoch: 0, Step: 2, Rank: 6, loss = 1.341273307800293
Epoch: 0, Step: 2, Rank: 2, loss = 1.5028576850891113
Epoch: 0, Step: 2, Rank: 4, loss = 1.3713111877441406
Epoch: 0, Step: 2, Rank: 7, loss = 1.4063770771026611
Per-token loss scaled by world size: 0.00021443456353154033
Epoch: 0, Step: 2, Rank: 0, loss = 1.5492628812789917
[2024-08-18 20:06:08,978] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[8.000000000000001e-07], mom=[(0.9, 0.95)]
Epoch 0: 100%|██████████| 2/2 [00:06<00:00, 3.29s/it]
{
"epoch": 0,
"step": 2,
"rank": 0,
"loss": 1.5492628812789917,
"overall_throughput": 22.634999728904948,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 23.017526626586914,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 57799,
"batch_size": 100,
"total_loss": 1.4483095407485962,
"gradnorm": 3.2187001705169678,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:06:09.050231"
}
Saving model in huggingface format at samples_seen: 192
Model saved in /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_192
[20:06:27] INFO saving took 18.58629083633423 seconds utils.py:611
Epoch 0: 100%|██████████| 2/2 [00:25<00:00, 12.69s/it]
total tokens: 8908 num samples: 17 num padding tokens: 1042 - rank: 6 max len: 524 min len: 389 avg len: 462.70588235294116 num_loss_counted_tokens: 6863
total tokens: 8904 num samples: 21 num padding tokens: 847 - rank: 6 max len: 424 min len: 347 avg len: 383.6666666666667 num_loss_counted_tokens: 6818
total tokens: 8996 num samples: 13 num padding tokens: 686 - rank: 4 max len: 692 min len: 600 avg len: 639.2307692307693 num_loss_counted_tokens: 7543
total tokens: 8610 num samples: 14 num padding tokens: 511 - rank: 4 max len: 615 min len: 546 avg len: 578.5 num_loss_counted_tokens: 7273
total tokens: 8495 num samples: 5 num padding tokens: 1475 - rank: 0 max len: 1699 min len: 1186 avg len: 1404.0 num_loss_counted_tokens: 6725
total tokens: 8660 num samples: 5 num padding tokens: 88 - rank: 0 max len: 1732 min len: 1685 avg len: 1714.4 num_loss_counted_tokens: 8277
total tokens: 8536 num samples: 22 num padding tokens: 1108 - rank: 7 max len: 388 min len: 283 avg len: 337.6363636363636 num_loss_counted_tokens: 6130
total tokens: 8835 num samples: 15 num padding tokens: 400 - rank: 5 max len: 589 min len: 524 avg len: 562.3333333333334 num_loss_counted_tokens: 7550
total tokens: 8688 num samples: 16 num padding tokens: 651 - rank: 5 max len: 543 min len: 428 avg len: 502.3125 num_loss_counted_tokens: 7093
total tokens: 8970 num samples: 26 num padding tokens: 730 - rank: 7 max len: 345 min len: 253 avg len: 316.9230769230769 num_loss_counted_tokens: 6706
total tokens: 8397 num samples: 9 num padding tokens: 1203 - rank: 2 max len: 933 min len: 709 avg len: 799.3333333333334 num_loss_counted_tokens: 6663
total tokens: 8288 num samples: 7 num padding tokens: 408 - rank: 1 max len: 1184 min len: 1017 avg len: 1125.7142857142858 num_loss_counted_tokens: 7467
total tokens: 8470 num samples: 7 num padding tokens: 755 - rank: 2 max len: 1210 min len: 931 avg len: 1102.142857142857 num_loss_counted_tokens: 7302
total tokens: 8950 num samples: 10 num padding tokens: 1495 - rank: 3 max len: 895 min len: 694 avg len: 745.5 num_loss_counted_tokens: 6865
total tokens: 8405 num samples: 5 num padding tokens: 928 - rank: 1 max len: 1681 min len: 1230 avg len: 1495.4 num_loss_counted_tokens: 7182
total tokens: 8508 num samples: 12 num padding tokens: 397 - rank: 3 max len: 709 min len: 628 avg len: 675.9166666666666 num_loss_counted_tokens: 7403
Per-token loss scaled by world size: 0.00018219766207039356
Per-token loss scaled by world size: 0.00019190594321116805
Per-token loss scaled by world size: 0.00019805562624242157
Per-token loss scaled by world size: 0.0001988127187360078
Per-token loss scaled by world size: 0.00019518414046615362
Per-token loss scaled by world size: 0.00021522259339690208
Per-token loss scaled by world size: 0.00020449883595574647
Epoch: 1, Step: 3, Rank: 6, loss = 1.3844094276428223
Epoch: 1, Step: 3, Rank: 1, loss = 1.3143739700317383
Epoch: 1, Step: 3, Rank: 4, loss = 1.428773283958435
Epoch: 1, Step: 3, Rank: 7, loss = 1.4080584049224854
Epoch: 1, Step: 3, Rank: 3, loss = 1.4342349767684937
Epoch: 1, Step: 3, Rank: 5, loss = 1.552615761756897
Epoch: 1, Step: 3, Rank: 2, loss = 1.4752546548843384
Per-token loss scaled by world size: 0.00021497253328561783
Epoch: 1, Step: 3, Rank: 0, loss = 1.5508118867874146
Epoch 1: 50%|█████ | 1/2 [00:03<00:03, 3.17s/it]
{
"epoch": 1,
"step": 3,
"rank": 0,
"loss": 1.5508118867874146,
"overall_throughput": 23.69786494018801,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.60161828994751,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 57712,
"batch_size": 94,
"total_loss": 1.4435664415359497,
"gradnorm": 3.2187001705169678,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:06:30.868391"
}
Per-token loss scaled by world size: 0.00020728030358441174
Per-token loss scaled by world size: 0.00018060434376820922
Per-token loss scaled by world size: 0.0002226187934866175
Per-token loss scaled by world size: 0.00021563439804594964
Per-token loss scaled by world size: 0.00020198585116304457
Per-token loss scaled by world size: 0.0002103917795466259
Per-token loss scaled by world size: 0.0002255949075333774
Epoch: 1, Step: 4, Rank: 3, loss = 1.5624500513076782
Epoch: 1, Step: 4, Rank: 5, loss = 1.5134299993515015
Epoch: 1, Step: 4, Rank: 0, loss = 1.2675715684890747
Epoch: 1, Step: 4, Rank: 4, loss = 1.4547967910766602
Epoch: 1, Step: 4, Rank: 1, loss = 1.4176377058029175
Epoch: 1, Step: 4, Rank: 2, loss = 1.4766347408294678
Epoch: 1, Step: 4, Rank: 7, loss = 1.5833379030227661
Per-token loss scaled by world size: 0.00022538744087796658
Epoch: 1, Step: 4, Rank: 6, loss = 1.5818817615509033
[2024-08-18 20:06:33,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[1.6000000000000001e-06], mom=[(0.9, 0.95)]
Epoch 1: 100%|██████████| 2/2 [00:06<00:00, 2.98s/it]
{
"epoch": 1,
"step": 4,
"rank": 0,
"loss": 1.2675715684890747,
"overall_throughput": 23.434918025287512,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 22.997975826263428,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 56148,
"batch_size": 110,
"total_loss": 1.4822176694869995,
"gradnorm": 3.2527425289154053,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:06:33.767906"
}
Saving model in huggingface format at samples_seen: 320
Model saved in /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320
[20:06:52] INFO saving took 18.69700264930725 seconds utils.py:611
Epoch 1: 100%|██████████| 2/2 [00:24<00:00, 12.41s/it]
tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:550 -> 3
tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:550 -> 3
tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:550 -> 3
tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:550 -> 3
tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:550 -> 3
tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:573 -> 3
tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:621 -> 3
tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:550 -> 3
tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:550 -> 3
tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:573 -> 3
tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:573 -> 3
tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:573 -> 3
tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:573 -> 3
tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:550 -> 3
tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:621 -> 3
tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:573 -> 3
tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:573 -> 3
tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:621 -> 3
tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:621 -> 3
tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:621 -> 3
tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:573 -> 3
tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:621 -> 3
tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:621 -> 3
tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:621 -> 3
tyler-a100-newimage-val:574:1319 [4] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:576:1316 [6] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:574:1319 [4] NCCL INFO misc/socket.cc:752 -> 3
tyler-a100-newimage-val:574:1319 [4] NCCL INFO misc/socket.cc:428 -> 3
tyler-a100-newimage-val:576:1316 [6] NCCL INFO misc/socket.cc:752 -> 3
tyler-a100-newimage-val:574:1319 [4] NCCL INFO misc/socket.cc:564 -> 3
tyler-a100-newimage-val:576:1316 [6] NCCL INFO misc/socket.cc:428 -> 3
tyler-a100-newimage-val:574:1319 [4] NCCL INFO misc/socket.cc:668 -> 3
tyler-a100-newimage-val:576:1316 [6] NCCL INFO misc/socket.cc:564 -> 3
tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:58 -> 3
tyler-a100-newimage-val:576:1316 [6] NCCL INFO misc/socket.cc:668 -> 3
tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:58 -> 3
tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:775 -> 3
tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:775 -> 3
tyler-a100-newimage-val:571:1330 [1] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:58 -> 3
tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:775 -> 3
tyler-a100-newimage-val:574:1319 [4] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-a100-newimage-val:570:1321 [0] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:58 -> 3
tyler-a100-newimage-val:572:1327 [2] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:577:1318 [7] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:58 -> 3
tyler-a100-newimage-val:570:1321 [0] NCCL INFO misc/socket.cc:752 -> 3
tyler-a100-newimage-val:573:1324 [3] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:574:1319 [4] NCCL INFO misc/socket.cc:826 -> 3
tyler-a100-newimage-val:571:1330 [1] NCCL INFO misc/socket.cc:752 -> 3
tyler-a100-newimage-val:577:1318 [7] NCCL INFO misc/socket.cc:752 -> 3
tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:775 -> 3
tyler-a100-newimage-val:571:1330 [1] NCCL INFO misc/socket.cc:428 -> 3
tyler-a100-newimage-val:577:1318 [7] NCCL INFO misc/socket.cc:428 -> 3
tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:571:1330 [1] NCCL INFO misc/socket.cc:564 -> 3
tyler-a100-newimage-val:570:1321 [0] NCCL INFO misc/socket.cc:428 -> 3
tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:58 -> 3
tyler-a100-newimage-val:574:1319 [4] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 4, res=3, closed=0
tyler-a100-newimage-val:570:1321 [0] NCCL INFO misc/socket.cc:564 -> 3
tyler-a100-newimage-val:571:1330 [1] NCCL INFO misc/socket.cc:668 -> 3
tyler-a100-newimage-val:577:1318 [7] NCCL INFO misc/socket.cc:564 -> 3
tyler-a100-newimage-val:572:1327 [2] NCCL INFO misc/socket.cc:752 -> 3
tyler-a100-newimage-val:573:1324 [3] NCCL INFO misc/socket.cc:752 -> 3
tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:775 -> 3
tyler-a100-newimage-val:576:1316 [6] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-a100-newimage-val:570:1321 [0] NCCL INFO misc/socket.cc:668 -> 3
tyler-a100-newimage-val:572:1327 [2] NCCL INFO misc/socket.cc:428 -> 3
tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:58 -> 3
tyler-a100-newimage-val:574:1319 [4] proxy.cc:1521 NCCL WARN [Proxy Service 4] Failed to execute operation Close from rank 4, retcode 3
tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:775 -> 3
tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:775 -> 3
tyler-a100-newimage-val:572:1327 [2] NCCL INFO misc/socket.cc:564 -> 3
tyler-a100-newimage-val:577:1318 [7] NCCL INFO misc/socket.cc:668 -> 3
tyler-a100-newimage-val:575:1325 [5] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:576:1316 [6] NCCL INFO misc/socket.cc:826 -> 3
tyler-a100-newimage-val:571:1330 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-a100-newimage-val:572:1327 [2] NCCL INFO misc/socket.cc:668 -> 3
tyler-a100-newimage-val:570:1321 [0] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:47 -> 3
tyler-a100-newimage-val:576:1316 [6] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 6, res=3, closed=0
tyler-a100-newimage-val:575:1325 [5] NCCL INFO misc/socket.cc:752 -> 3
tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:58 -> 3
tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:775 -> 3
tyler-a100-newimage-val:577:1318 [7] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-a100-newimage-val:575:1325 [5] NCCL INFO misc/socket.cc:428 -> 3
tyler-a100-newimage-val:576:1316 [6] proxy.cc:1521 NCCL WARN [Proxy Service 6] Failed to execute operation Close from rank 6, retcode 3
tyler-a100-newimage-val:572:1327 [2] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-a100-newimage-val:575:1325 [5] NCCL INFO misc/socket.cc:564 -> 3
tyler-a100-newimage-val:570:1321 [0] NCCL INFO misc/socket.cc:826 -> 3
tyler-a100-newimage-val:573:1324 [3] NCCL INFO misc/socket.cc:428 -> 3
tyler-a100-newimage-val:571:1330 [1] NCCL INFO misc/socket.cc:826 -> 3
tyler-a100-newimage-val:575:1325 [5] NCCL INFO misc/socket.cc:668 -> 3
tyler-a100-newimage-val:570:1321 [0] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0
tyler-a100-newimage-val:573:1324 [3] NCCL INFO misc/socket.cc:564 -> 3
tyler-a100-newimage-val:570:1321 [0] proxy.cc:1521 NCCL WARN [Proxy Service 0] Failed to execute operation Close from rank 0, retcode 3
tyler-a100-newimage-val:571:1330 [1] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 1, res=3, closed=0
tyler-a100-newimage-val:577:1318 [7] NCCL INFO misc/socket.cc:826 -> 3
tyler-a100-newimage-val:573:1324 [3] NCCL INFO misc/socket.cc:668 -> 3
tyler-a100-newimage-val:577:1318 [7] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 7, res=3, closed=0
tyler-a100-newimage-val:572:1327 [2] NCCL INFO misc/socket.cc:826 -> 3
tyler-a100-newimage-val:577:1318 [7] proxy.cc:1521 NCCL WARN [Proxy Service 7] Failed to execute operation Close from rank 7, retcode 3
tyler-a100-newimage-val:575:1325 [5] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-a100-newimage-val:572:1327 [2] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 2, res=3, closed=0
tyler-a100-newimage-val:572:1327 [2] proxy.cc:1521 NCCL WARN [Proxy Service 2] Failed to execute operation Close from rank 2, retcode 3
tyler-a100-newimage-val:571:1330 [1] proxy.cc:1521 NCCL WARN [Proxy Service 1] Failed to execute operation Close from rank 1, retcode 3
tyler-a100-newimage-val:573:1324 [3] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
tyler-a100-newimage-val:575:1325 [5] NCCL INFO misc/socket.cc:826 -> 3
tyler-a100-newimage-val:575:1325 [5] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 5, res=3, closed=0
tyler-a100-newimage-val:573:1324 [3] NCCL INFO misc/socket.cc:826 -> 3
tyler-a100-newimage-val:573:1324 [3] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 3, res=3, closed=0
tyler-a100-newimage-val:575:1325 [5] proxy.cc:1521 NCCL WARN [Proxy Service 5] Failed to execute operation Close from rank 5, retcode 3
tyler-a100-newimage-val:573:1324 [3] proxy.cc:1521 NCCL WARN [Proxy Service 3] Failed to execute operation Close from rank 3, retcode 3
tyler-a100-newimage-val:570:9772 [0] NCCL INFO comm 0x55d34ff865c0 rank 0 nranks 8 cudaDev 0 busId 8010 - Abort COMPLETE
tyler-a100-newimage-val:576:9777 [6] NCCL INFO comm 0x5640cf2b7db0 rank 6 nranks 8 cudaDev 6 busId e070 - Abort COMPLETE
tyler-a100-newimage-val:574:9775 [4] NCCL INFO comm 0x560ada6ca910 rank 4 nranks 8 cudaDev 4 busId c050 - Abort COMPLETE
tyler-a100-newimage-val:575:9771 [5] NCCL INFO comm 0x5636163750f0 rank 5 nranks 8 cudaDev 5 busId c060 - Abort COMPLETE
tyler-a100-newimage-val:572:9778 [2] NCCL INFO comm 0x55e6aed5c790 rank 2 nranks 8 cudaDev 2 busId a030 - Abort COMPLETE
tyler-a100-newimage-val:577:9774 [7] NCCL INFO comm 0x558d760ca530 rank 7 nranks 8 cudaDev 7 busId e080 - Abort COMPLETE
tyler-a100-newimage-val:573:9776 [3] NCCL INFO comm 0x55e7b0065170 rank 3 nranks 8 cudaDev 3 busId a040 - Abort COMPLETE
tyler-a100-newimage-val:571:9773 [1] NCCL INFO comm 0x55f7ae069780 rank 1 nranks 8 cudaDev 1 busId 8020 - Abort COMPLETE
Operation completed successfully! 🎉
MMLU evaluation for Phase 1...
INFO 2024-08-18 20:07:09,101 lm-eval:152: Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
INFO 2024-08-18 20:07:09,102 lm-eval:189: Initializing hf model, with arguments: {'pretrained': '/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_192', 'dtype': 'bfloat16'}
INFO 2024-08-18 20:07:09,231 lm-eval:170: Using device 'cuda'
Downloading builder script: 100% 5.86k/5.86k [00:00<00:00, 23.9MB/s]
Downloading readme: 100% 1.11k/1.11k [00:00<00:00, 10.3MB/s]
Downloading data: 100% 166M/166M [00:01<00:00, 102MB/s]
Generating test split: 100 examples [00:00, 1153.22 examples/s]
Generating validation split: 11 examples [00:00, 3373.60 examples/s]
Generating dev split: 5 examples [00:00, 59.65 examples/s]
Generating test split: 135 examples [00:00, 1635.55 examples/s]
Generating validation split: 14 examples [00:00, 5139.63 examples/s]
Generating dev split: 5 examples [00:00, 61.55 examples/s]
Generating test split: 152 examples [00:00, 1850.10 examples/s]
Generating validation split: 16 examples [00:00, 5953.06 examples/s]
Generating dev split: 5 examples [00:00, 60.12 examples/s]
Generating test split: 100 examples [00:00, 1177.68 examples/s]
Generating validation split: 11 examples [00:00, 4201.56 examples/s]
Generating dev split: 5 examples [00:00, 60.37 examples/s]
Generating test split: 265 examples [00:00, 2954.66 examples/s]
Generating validation split: 29 examples [00:00, 7941.16 examples/s]
Generating dev split: 5 examples [00:00, 61.76 examples/s]
Generating test split: 144 examples [00:00, 1774.03 examples/s]
Generating validation split: 16 examples [00:00, 3479.13 examples/s]
Generating dev split: 5 examples [00:00, 61.13 examples/s]
Generating test split: 100 examples [00:00, 1222.35 examples/s]
Generating validation split: 8 examples [00:00, 1853.63 examples/s]
Generating dev split: 5 examples [00:00, 60.05 examples/s]
Generating test split: 100 examples [00:00, 1183.27 examples/s]
Generating validation split: 11 examples [00:00, 2736.66 examples/s]
Generating dev split: 5 examples [00:00, 60.60 examples/s]
Generating test split: 100 examples [00:00, 1185.71 examples/s]
Generating validation split: 11 examples [00:00, 3130.71 examples/s]
Generating dev split: 5 examples [00:00, 61.68 examples/s]
Generating test split: 173 examples [00:00, 2042.05 examples/s]
Generating validation split: 22 examples [00:00, 6416.88 examples/s]
Generating dev split: 5 examples [00:00, 62.23 examples/s]
Generating test split: 102 examples [00:00, 1194.64 examples/s]
Generating validation split: 11 examples [00:00, 4022.44 examples/s]
Generating dev split: 5 examples [00:00, 61.46 examples/s]
Generating test split: 100 examples [00:00, 1207.67 examples/s]
Generating validation split: 11 examples [00:00, 2994.57 examples/s]
Generating dev split: 5 examples [00:00, 60.33 examples/s]
Generating test split: 235 examples [00:00, 2704.80 examples/s]
Generating validation split: 26 examples [00:00, 8991.75 examples/s]
Generating dev split: 5 examples [00:00, 60.17 examples/s]
Generating test split: 114 examples [00:00, 1390.51 examples/s]
Generating validation split: 12 examples [00:00, 3749.38 examples/s]
Generating dev split: 5 examples [00:00, 60.87 examples/s]
Generating test split: 145 examples [00:00, 1763.73 examples/s]
Generating validation split: 16 examples [00:00, 5013.74 examples/s]
Generating dev split: 5 examples [00:00, 60.14 examples/s]
Generating test split: 378 examples [00:00, 4030.96 examples/s]
Generating validation split: 41 examples [00:00, 7887.28 examples/s]
Generating dev split: 5 examples [00:00, 60.97 examples/s]
Generating test split: 126 examples [00:00, 1470.10 examples/s]
Generating validation split: 14 examples [00:00, 4072.99 examples/s]
Generating dev split: 5 examples [00:00, 59.27 examples/s]
Generating test split: 100 examples [00:00, 1255.89 examples/s]
Generating validation split: 10 examples [00:00, 3940.16 examples/s]
Generating dev split: 5 examples [00:00, 61.03 examples/s]
Generating test split: 310 examples [00:00, 3516.83 examples/s]
Generating validation split: 32 examples [00:00, 8804.05 examples/s]
Generating dev split: 5 examples [00:00, 61.07 examples/s]
Generating test split: 203 examples [00:00, 2410.97 examples/s]
Generating validation split: 22 examples [00:00, 4926.05 examples/s]
Generating dev split: 5 examples [00:00, 62.35 examples/s]
Generating test split: 100 examples [00:00, 1268.03 examples/s]
Generating validation split: 9 examples [00:00, 3895.64 examples/s]
Generating dev split: 5 examples [00:00, 62.00 examples/s]
Generating test split: 165 examples [00:00, 1938.49 examples/s]
Generating validation split: 18 examples [00:00, 3426.10 examples/s]
Generating dev split: 5 examples [00:00, 62.50 examples/s]
Generating test split: 198 examples [00:00, 2282.64 examples/s]
Generating validation split: 22 examples [00:00, 7912.42 examples/s]
Generating dev split: 5 examples [00:00, 61.81 examples/s]
Generating test split: 193 examples [00:00, 2366.57 examples/s]
Generating validation split: 21 examples [00:00, 7132.01 examples/s]
Generating dev split: 5 examples [00:00, 61.24 examples/s]
Generating test split: 390 examples [00:00, 4338.53 examples/s]
Generating validation split: 43 examples [00:00, 9807.77 examples/s]
Generating dev split: 5 examples [00:00, 62.74 examples/s]
Generating test split: 270 examples [00:00, 3156.55 examples/s]
Generating validation split: 29 examples [00:00, 8374.17 examples/s]
Generating dev split: 5 examples [00:00, 61.80 examples/s]
Generating test split: 238 examples [00:00, 2714.10 examples/s]
Generating validation split: 26 examples [00:00, 5558.48 examples/s]
Generating dev split: 5 examples [00:00, 60.55 examples/s]
Generating test split: 151 examples [00:00, 1801.49 examples/s]
Generating validation split: 17 examples [00:00, 4671.64 examples/s]
Generating dev split: 5 examples [00:00, 61.15 examples/s]
Generating test split: 545 examples [00:00, 5738.37 examples/s]
Generating validation split: 60 examples [00:00, 9898.84 examples/s]
Generating dev split: 5 examples [00:00, 61.26 examples/s]
Generating test split: 216 examples [00:00, 2474.35 examples/s]
Generating validation split: 23 examples [00:00, 6018.78 examples/s]
Generating dev split: 5 examples [00:00, 61.26 examples/s]
Generating test split: 204 examples [00:00, 2282.53 examples/s]
Generating validation split: 22 examples [00:00, 4064.43 examples/s]
Generating dev split: 5 examples [00:00, 61.72 examples/s]
Generating test split: 237 examples [00:00, 2575.45 examples/s]
Generating validation split: 26 examples [00:00, 4640.90 examples/s]
Generating dev split: 5 examples [00:00, 61.22 examples/s]
Generating test split: 223 examples [00:00, 2635.67 examples/s]
Generating validation split: 23 examples [00:00, 5733.33 examples/s]
Generating dev split: 5 examples [00:00, 62.89 examples/s]
Generating test split: 131 examples [00:00, 1591.09 examples/s]
Generating validation split: 12 examples [00:00, 3250.77 examples/s]
Generating dev split: 5 examples [00:00, 61.29 examples/s]
Generating test split: 121 examples [00:00, 1415.60 examples/s]
Generating validation split: 13 examples [00:00, 3564.02 examples/s]
Generating dev split: 5 examples [00:00, 61.46 examples/s]
Generating test split: 108 examples [00:00, 1342.97 examples/s]
Generating validation split: 11 examples [00:00, 2489.47 examples/s]
Generating dev split: 5 examples [00:00, 58.93 examples/s]
Generating test split: 163 examples [00:00, 2010.64 examples/s]
Generating validation split: 18 examples [00:00, 4509.73 examples/s]
Generating dev split: 5 examples [00:00, 61.46 examples/s]
Generating test split: 112 examples [00:00, 1324.02 examples/s]
Generating validation split: 11 examples [00:00, 3809.85 examples/s]
Generating dev split: 5 examples [00:00, 61.62 examples/s]
Generating test split: 103 examples [00:00, 1277.65 examples/s]
Generating validation split: 11 examples [00:00, 3080.55 examples/s]
Generating dev split: 5 examples [00:00, 59.10 examples/s]
Generating test split: 234 examples [00:00, 2697.17 examples/s]
Generating validation split: 25 examples [00:00, 5543.62 examples/s]
Generating dev split: 5 examples [00:00, 60.37 examples/s]
Generating test split: 100 examples [00:00, 1202.59 examples/s]
Generating validation split: 11 examples [00:00, 4425.22 examples/s]
Generating dev split: 5 examples [00:00, 62.03 examples/s]
Generating test split: 783 examples [00:00, 7376.63 examples/s]
Generating validation split: 86 examples [00:00, 16538.75 examples/s]
Generating dev split: 5 examples [00:00, 59.82 examples/s]
Generating test split: 346 examples [00:00, 3763.46 examples/s]
Generating validation split: 38 examples [00:00, 6601.65 examples/s]
Generating dev split: 5 examples [00:00, 61.74 examples/s]
Generating test split: 895 examples [00:00, 7717.62 examples/s]
Generating validation split: 100 examples [00:00, 14271.19 examples/s]
Generating dev split: 5 examples [00:00, 61.34 examples/s]
Generating test split: 306 examples [00:00, 3463.47 examples/s]
Generating validation split: 33 examples [00:00, 7868.34 examples/s]
Generating dev split: 5 examples [00:00, 62.63 examples/s]
Generating test split: 311 examples [00:00, 3373.99 examples/s]
Generating validation split: 34 examples [00:00, 9395.59 examples/s]
Generating dev split: 5 examples [00:00, 61.80 examples/s]
Generating test split: 324 examples [00:00, 3525.14 examples/s]
Generating validation split: 35 examples [00:00, 7428.05 examples/s]
Generating dev split: 5 examples [00:00, 61.91 examples/s]
Generating test split: 282 examples [00:00, 3107.75 examples/s]
Generating validation split: 31 examples [00:00, 5669.96 examples/s]
Generating dev split: 5 examples [00:00, 62.70 examples/s]
Generating test split: 1534 examples [00:00, 10061.95 examples/s]
Generating validation split: 170 examples [00:00, 14781.84 examples/s]
Generating dev split: 5 examples [00:00, 59.81 examples/s]
Generating test split: 272 examples [00:00, 2957.00 examples/s]
Generating validation split: 31 examples [00:00, 6405.41 examples/s]
Generating dev split: 5 examples [00:00, 61.40 examples/s]
Generating test split: 612 examples [00:00, 6144.16 examples/s]
Generating validation split: 69 examples [00:00, 10691.85 examples/s]
Generating dev split: 5 examples [00:00, 59.34 examples/s]
Generating test split: 110 examples [00:00, 1355.61 examples/s]
Generating validation split: 12 examples [00:00, 3074.06 examples/s]
Generating dev split: 5 examples [00:00, 61.81 examples/s]
Generating test split: 245 examples [00:00, 2837.26 examples/s]
Generating validation split: 27 examples [00:00, 4828.44 examples/s]
Generating dev split: 5 examples [00:00, 61.01 examples/s]
Generating test split: 201 examples [00:00, 2289.38 examples/s]
Generating validation split: 22 examples [00:00, 4368.45 examples/s]
Generating dev split: 5 examples [00:00, 61.55 examples/s]
Generating test split: 100 examples [00:00, 1184.69 examples/s]
Generating validation split: 11 examples [00:00, 3479.70 examples/s]
Generating dev split: 5 examples [00:00, 61.74 examples/s]
Generating test split: 166 examples [00:00, 2008.24 examples/s]
Generating validation split: 18 examples [00:00, 4386.07 examples/s]
Generating dev split: 5 examples [00:00, 59.23 examples/s]
Generating test split: 171 examples [00:00, 2049.86 examples/s]
Generating validation split: 19 examples [00:00, 5742.72 examples/s]
Generating dev split: 5 examples [00:00, 61.61 examples/s]
WARNING 2024-08-18 20:08:01,836 lm-eval:251: Overwriting default num_fewshot of mmlu_world_religions from None to 5
INFO 2024-08-18 20:08:01,836 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,836 lm-eval:251: Overwriting default num_fewshot of mmlu_virology from None to 5
INFO 2024-08-18 20:08:01,836 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,836 lm-eval:251: Overwriting default num_fewshot of mmlu_us_foreign_policy from None to 5
INFO 2024-08-18 20:08:01,836 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,836 lm-eval:251: Overwriting default num_fewshot of mmlu_sociology from None to 5
INFO 2024-08-18 20:08:01,836 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,836 lm-eval:251: Overwriting default num_fewshot of mmlu_security_studies from None to 5
INFO 2024-08-18 20:08:01,836 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,836 lm-eval:251: Overwriting default num_fewshot of mmlu_public_relations from None to 5
INFO 2024-08-18 20:08:01,836 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_psychology from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_medicine from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_law from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_accounting from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_prehistory from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_philosophy from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_nutrition from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_moral_scenarios from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_moral_disputes from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_miscellaneous from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_medical_genetics from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_marketing from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_management from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_machine_learning from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_logical_fallacies from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_jurisprudence from None to 5
INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_international_law from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_human_sexuality from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_human_aging from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_world_history from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_us_history from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_statistics from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_psychology from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_physics from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_microeconomics from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_mathematics from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_macroeconomics from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_government_and_politics from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_geography from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_european_history from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_computer_science from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_chemistry from None to 5
INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_biology from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_global_facts from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_formal_logic from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_elementary_mathematics from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_electrical_engineering from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_econometrics from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_conceptual_physics from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_computer_security from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_college_physics from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_college_medicine from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_college_mathematics from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_college_computer_science from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_college_chemistry from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_college_biology from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_clinical_knowledge from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_business_ethics from None to 5
INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_astronomy from None to 5
INFO 2024-08-18 20:08:01,840 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,840 lm-eval:251: Overwriting default num_fewshot of mmlu_anatomy from None to 5
INFO 2024-08-18 20:08:01,840 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:08:01,840 lm-eval:251: Overwriting default num_fewshot of mmlu_abstract_algebra from None to 5
INFO 2024-08-18 20:08:01,840 lm-eval:261: Setting fewshot random generator seed to 1234
INFO 2024-08-18 20:08:01,845 lm-eval:411: Building contexts for mmlu_world_religions on rank 0...
100% 171/171 [00:01<00:00, 136.70it/s]
INFO 2024-08-18 20:08:03,104 lm-eval:411: Building contexts for mmlu_virology on rank 0...
100% 166/166 [00:01<00:00, 137.65it/s]
INFO 2024-08-18 20:08:04,318 lm-eval:411: Building contexts for mmlu_us_foreign_policy on rank 0...
100% 100/100 [00:00<00:00, 137.17it/s]
INFO 2024-08-18 20:08:05,053 lm-eval:411: Building contexts for mmlu_sociology on rank 0...
100% 201/201 [00:01<00:00, 137.20it/s]
INFO 2024-08-18 20:08:06,528 lm-eval:411: Building contexts for mmlu_security_studies on rank 0...
100% 245/245 [00:01<00:00, 137.06it/s]
INFO 2024-08-18 20:08:08,328 lm-eval:411: Building contexts for mmlu_public_relations on rank 0...
100% 110/110 [00:00<00:00, 137.91it/s]
INFO 2024-08-18 20:08:09,132 lm-eval:411: Building contexts for mmlu_professional_psychology on rank 0...
100% 612/612 [00:04<00:00, 137.30it/s]
INFO 2024-08-18 20:08:13,618 lm-eval:411: Building contexts for mmlu_professional_medicine on rank 0...
100% 272/272 [00:01<00:00, 137.57it/s]
INFO 2024-08-18 20:08:15,610 lm-eval:411: Building contexts for mmlu_professional_law on rank 0...
100% 1534/1534 [00:11<00:00, 137.50it/s]
INFO 2024-08-18 20:08:26,840 lm-eval:411: Building contexts for mmlu_professional_accounting on rank 0...
100% 282/282 [00:02<00:00, 137.72it/s]
INFO 2024-08-18 20:08:28,902 lm-eval:411: Building contexts for mmlu_prehistory on rank 0...
100% 324/324 [00:02<00:00, 137.90it/s]
INFO 2024-08-18 20:08:31,267 lm-eval:411: Building contexts for mmlu_philosophy on rank 0...
100% 311/311 [00:02<00:00, 137.34it/s]
INFO 2024-08-18 20:08:33,547 lm-eval:411: Building contexts for mmlu_nutrition on rank 0...
100% 306/306 [00:02<00:00, 137.31it/s]
INFO 2024-08-18 20:08:35,790 lm-eval:411: Building contexts for mmlu_moral_scenarios on rank 0...
100% 895/895 [00:06<00:00, 137.60it/s]
INFO 2024-08-18 20:08:42,336 lm-eval:411: Building contexts for mmlu_moral_disputes on rank 0...
100% 346/346 [00:02<00:00, 137.95it/s]
INFO 2024-08-18 20:08:44,861 lm-eval:411: Building contexts for mmlu_miscellaneous on rank 0...
100% 783/783 [00:05<00:00, 138.18it/s]
INFO 2024-08-18 20:08:50,564 lm-eval:411: Building contexts for mmlu_medical_genetics on rank 0...
100% 100/100 [00:00<00:00, 137.18it/s]
INFO 2024-08-18 20:08:51,298 lm-eval:411: Building contexts for mmlu_marketing on rank 0...
100% 234/234 [00:01<00:00, 137.53it/s]
INFO 2024-08-18 20:08:53,011 lm-eval:411: Building contexts for mmlu_management on rank 0...
100% 103/103 [00:00<00:00, 137.86it/s]
INFO 2024-08-18 20:08:53,764 lm-eval:411: Building contexts for mmlu_machine_learning on rank 0...
100% 112/112 [00:00<00:00, 137.95it/s]
INFO 2024-08-18 20:08:54,582 lm-eval:411: Building contexts for mmlu_logical_fallacies on rank 0...
100% 163/163 [00:01<00:00, 137.78it/s]
INFO 2024-08-18 20:08:55,773 lm-eval:411: Building contexts for mmlu_jurisprudence on rank 0...
100% 108/108 [00:00<00:00, 138.26it/s]
INFO 2024-08-18 20:08:56,559 lm-eval:411: Building contexts for mmlu_international_law on rank 0...
100% 121/121 [00:00<00:00, 137.86it/s]
INFO 2024-08-18 20:08:57,444 lm-eval:411: Building contexts for mmlu_human_sexuality on rank 0...
100% 131/131 [00:00<00:00, 137.92it/s]
INFO 2024-08-18 20:08:58,400 lm-eval:411: Building contexts for mmlu_human_aging on rank 0...
100% 223/223 [00:01<00:00, 138.55it/s]
INFO 2024-08-18 20:09:00,021 lm-eval:411: Building contexts for mmlu_high_school_world_history on rank 0...
100% 237/237 [00:01<00:00, 137.41it/s]
INFO 2024-08-18 20:09:01,757 lm-eval:411: Building contexts for mmlu_high_school_us_history on rank 0...
100% 204/204 [00:01<00:00, 137.87it/s]
INFO 2024-08-18 20:09:03,248 lm-eval:411: Building contexts for mmlu_high_school_statistics on rank 0...
100% 216/216 [00:01<00:00, 138.70it/s]
INFO 2024-08-18 20:09:04,816 lm-eval:411: Building contexts for mmlu_high_school_psychology on rank 0...
100% 545/545 [00:03<00:00, 138.10it/s]
INFO 2024-08-18 20:09:08,787 lm-eval:411: Building contexts for mmlu_high_school_physics on rank 0...
100% 151/151 [00:01<00:00, 137.64it/s]
INFO 2024-08-18 20:09:09,892 lm-eval:411: Building contexts for mmlu_high_school_microeconomics on rank 0...
100% 238/238 [00:01<00:00, 138.03it/s]
INFO 2024-08-18 20:09:11,628 lm-eval:411: Building contexts for mmlu_high_school_mathematics on rank 0...
100% 270/270 [00:01<00:00, 137.88it/s]
INFO 2024-08-18 20:09:13,599 lm-eval:411: Building contexts for mmlu_high_school_macroeconomics on rank 0...
100% 390/390 [00:02<00:00, 138.12it/s]
INFO 2024-08-18 20:09:16,441 lm-eval:411: Building contexts for mmlu_high_school_government_and_politics on rank 0...
100% 193/193 [00:01<00:00, 137.78it/s]
INFO 2024-08-18 20:09:17,851 lm-eval:411: Building contexts for mmlu_high_school_geography on rank 0...
100% 198/198 [00:01<00:00, 138.13it/s]
INFO 2024-08-18 20:09:19,294 lm-eval:411: Building contexts for mmlu_high_school_european_history on rank 0...
100% 165/165 [00:01<00:00, 136.67it/s]
INFO 2024-08-18 20:09:20,511 lm-eval:411: Building contexts for mmlu_high_school_computer_science on rank 0...
100% 100/100 [00:00<00:00, 137.40it/s]
INFO 2024-08-18 20:09:21,244 lm-eval:411: Building contexts for mmlu_high_school_chemistry on rank 0...
100% 203/203 [00:01<00:00, 107.90it/s]
INFO 2024-08-18 20:09:23,135 lm-eval:411: Building contexts for mmlu_high_school_biology on rank 0...
100% 310/310 [00:02<00:00, 137.74it/s]
INFO 2024-08-18 20:09:25,401 lm-eval:411: Building contexts for mmlu_global_facts on rank 0...
100% 100/100 [00:00<00:00, 138.25it/s]
INFO 2024-08-18 20:09:26,129 lm-eval:411: Building contexts for mmlu_formal_logic on rank 0...
100% 126/126 [00:00<00:00, 137.97it/s]
INFO 2024-08-18 20:09:27,049 lm-eval:411: Building contexts for mmlu_elementary_mathematics on rank 0...
100% 378/378 [00:02<00:00, 138.66it/s]
INFO 2024-08-18 20:09:29,792 lm-eval:411: Building contexts for mmlu_electrical_engineering on rank 0...
100% 145/145 [00:01<00:00, 138.30it/s]
INFO 2024-08-18 20:09:30,848 lm-eval:411: Building contexts for mmlu_econometrics on rank 0...
100% 114/114 [00:00<00:00, 138.62it/s]
INFO 2024-08-18 20:09:31,676 lm-eval:411: Building contexts for mmlu_conceptual_physics on rank 0...
100% 235/235 [00:01<00:00, 138.75it/s]
INFO 2024-08-18 20:09:33,382 lm-eval:411: Building contexts for mmlu_computer_security on rank 0...
100% 100/100 [00:00<00:00, 138.84it/s]
INFO 2024-08-18 20:09:34,107 lm-eval:411: Building contexts for mmlu_college_physics on rank 0...
100% 102/102 [00:00<00:00, 138.92it/s]
INFO 2024-08-18 20:09:34,846 lm-eval:411: Building contexts for mmlu_college_medicine on rank 0...
100% 173/173 [00:01<00:00, 138.30it/s]
INFO 2024-08-18 20:09:36,106 lm-eval:411: Building contexts for mmlu_college_mathematics on rank 0...
100% 100/100 [00:00<00:00, 138.92it/s]
INFO 2024-08-18 20:09:36,831 lm-eval:411: Building contexts for mmlu_college_computer_science on rank 0...
100% 100/100 [00:00<00:00, 138.70it/s]
INFO 2024-08-18 20:09:37,557 lm-eval:411: Building contexts for mmlu_college_chemistry on rank 0...
100% 100/100 [00:00<00:00, 138.24it/s]
INFO 2024-08-18 20:09:38,286 lm-eval:411: Building contexts for mmlu_college_biology on rank 0...
100% 144/144 [00:01<00:00, 139.08it/s]
INFO 2024-08-18 20:09:39,329 lm-eval:411: Building contexts for mmlu_clinical_knowledge on rank 0...
100% 265/265 [00:01<00:00, 138.97it/s]
INFO 2024-08-18 20:09:41,248 lm-eval:411: Building contexts for mmlu_business_ethics on rank 0...
100% 100/100 [00:00<00:00, 138.40it/s]
INFO 2024-08-18 20:09:41,976 lm-eval:411: Building contexts for mmlu_astronomy on rank 0...
100% 152/152 [00:01<00:00, 139.01it/s]
INFO 2024-08-18 20:09:43,077 lm-eval:411: Building contexts for mmlu_anatomy on rank 0...
100% 135/135 [00:00<00:00, 138.23it/s]
INFO 2024-08-18 20:09:44,061 lm-eval:411: Building contexts for mmlu_abstract_algebra on rank 0...
100% 100/100 [00:00<00:00, 139.49it/s]
INFO 2024-08-18 20:09:44,783 lm-eval:438: Running loglikelihood requests
Running loglikelihood requests: 0% 0/56168 [00:00<?, ?it/s]
Passed argument batch_size = auto:1. Detecting largest batch size
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Determined largest batch size: 16
Running loglikelihood requests: 100% 56168/56168 [13:58<00:00, 66.96it/s]
WARNING 2024-08-18 20:26:55,272 lm-eval:1315: Failed to get model SHA for /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_192 at revision main. Error: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_192'. Use `repo_type` argument if needed.
fatal: not a git repository (or any of the parent directories): .git
CHECKPOINT EVALUATION: /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_192 SCORED 0.5271100893470168
INFO 2024-08-18 20:27:00,200 lm-eval:152: Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
INFO 2024-08-18 20:27:00,200 lm-eval:189: Initializing hf model, with arguments: {'pretrained': '/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320', 'dtype': 'bfloat16'}
INFO 2024-08-18 20:27:00,202 lm-eval:170: Using device 'cuda'
WARNING 2024-08-18 20:27:41,442 lm-eval:251: Overwriting default num_fewshot of mmlu_world_religions from None to 5
INFO 2024-08-18 20:27:41,442 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,442 lm-eval:251: Overwriting default num_fewshot of mmlu_virology from None to 5
INFO 2024-08-18 20:27:41,442 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,442 lm-eval:251: Overwriting default num_fewshot of mmlu_us_foreign_policy from None to 5
INFO 2024-08-18 20:27:41,442 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_sociology from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_security_studies from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_public_relations from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_psychology from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_medicine from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_law from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_accounting from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_prehistory from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_philosophy from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_nutrition from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_moral_scenarios from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_moral_disputes from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_miscellaneous from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_medical_genetics from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_marketing from None to 5
INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_management from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_machine_learning from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_logical_fallacies from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_jurisprudence from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_international_law from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_human_sexuality from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_human_aging from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_world_history from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_us_history from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_statistics from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_psychology from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_physics from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_microeconomics from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_mathematics from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_macroeconomics from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_government_and_politics from None to 5
INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_geography from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_european_history from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_computer_science from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_chemistry from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_biology from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_global_facts from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_formal_logic from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_elementary_mathematics from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_electrical_engineering from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_econometrics from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_conceptual_physics from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_computer_security from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_college_physics from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_college_medicine from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_college_mathematics from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_college_computer_science from None to 5
INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_college_chemistry from None to 5
INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_college_biology from None to 5
INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_clinical_knowledge from None to 5
INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_business_ethics from None to 5
INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_astronomy from None to 5
INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_anatomy from None to 5
INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_abstract_algebra from None to 5
INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
INFO 2024-08-18 20:27:41,451 lm-eval:411: Building contexts for mmlu_world_religions on rank 0...
100% 171/171 [00:01<00:00, 136.86it/s]
INFO 2024-08-18 20:27:42,709 lm-eval:411: Building contexts for mmlu_virology on rank 0...
100% 166/166 [00:01<00:00, 137.15it/s]
INFO 2024-08-18 20:27:43,928 lm-eval:411: Building contexts for mmlu_us_foreign_policy on rank 0...
100% 100/100 [00:00<00:00, 136.60it/s]
INFO 2024-08-18 20:27:44,666 lm-eval:411: Building contexts for mmlu_sociology on rank 0...
100% 201/201 [00:01<00:00, 136.57it/s]
INFO 2024-08-18 20:27:46,148 lm-eval:411: Building contexts for mmlu_security_studies on rank 0...
100% 245/245 [00:01<00:00, 136.39it/s]
INFO 2024-08-18 20:27:47,957 lm-eval:411: Building contexts for mmlu_public_relations on rank 0...
100% 110/110 [00:00<00:00, 137.57it/s]
INFO 2024-08-18 20:27:48,763 lm-eval:411: Building contexts for mmlu_professional_psychology on rank 0...
100% 612/612 [00:04<00:00, 136.85it/s]
INFO 2024-08-18 20:27:53,266 lm-eval:411: Building contexts for mmlu_professional_medicine on rank 0...
100% 272/272 [00:01<00:00, 136.23it/s]
INFO 2024-08-18 20:27:55,277 lm-eval:411: Building contexts for mmlu_professional_law on rank 0...
100% 1534/1534 [00:11<00:00, 137.38it/s]
INFO 2024-08-18 20:28:06,522 lm-eval:411: Building contexts for mmlu_professional_accounting on rank 0...
100% 282/282 [00:02<00:00, 137.60it/s]
INFO 2024-08-18 20:28:08,586 lm-eval:411: Building contexts for mmlu_prehistory on rank 0...
100% 324/324 [00:02<00:00, 137.90it/s]
INFO 2024-08-18 20:28:10,951 lm-eval:411: Building contexts for mmlu_philosophy on rank 0...
100% 311/311 [00:02<00:00, 138.05it/s]
INFO 2024-08-18 20:28:13,219 lm-eval:411: Building contexts for mmlu_nutrition on rank 0...
100% 306/306 [00:02<00:00, 138.17it/s]
INFO 2024-08-18 20:28:15,448 lm-eval:411: Building contexts for mmlu_moral_scenarios on rank 0...
100% 895/895 [00:06<00:00, 138.76it/s]
INFO 2024-08-18 20:28:21,941 lm-eval:411: Building contexts for mmlu_moral_disputes on rank 0...
100% 346/346 [00:02<00:00, 138.65it/s]
INFO 2024-08-18 20:28:24,454 lm-eval:411: Building contexts for mmlu_miscellaneous on rank 0...
100% 783/783 [00:05<00:00, 138.51it/s]
INFO 2024-08-18 20:28:30,144 lm-eval:411: Building contexts for mmlu_medical_genetics on rank 0...
100% 100/100 [00:00<00:00, 138.01it/s]
INFO 2024-08-18 20:28:30,874 lm-eval:411: Building contexts for mmlu_marketing on rank 0...
100% 234/234 [00:01<00:00, 138.10it/s]
INFO 2024-08-18 20:28:32,580 lm-eval:411: Building contexts for mmlu_management on rank 0...
100% 103/103 [00:00<00:00, 138.77it/s]
INFO 2024-08-18 20:28:33,327 lm-eval:411: Building contexts for mmlu_machine_learning on rank 0...
100% 112/112 [00:00<00:00, 138.61it/s]
INFO 2024-08-18 20:28:34,141 lm-eval:411: Building contexts for mmlu_logical_fallacies on rank 0...
100% 163/163 [00:01<00:00, 139.13it/s]
INFO 2024-08-18 20:28:35,321 lm-eval:411: Building contexts for mmlu_jurisprudence on rank 0...
100% 108/108 [00:00<00:00, 138.78it/s]
INFO 2024-08-18 20:28:36,105 lm-eval:411: Building contexts for mmlu_international_law on rank 0...
100% 121/121 [00:00<00:00, 138.84it/s]
INFO 2024-08-18 20:28:36,982 lm-eval:411: Building contexts for mmlu_human_sexuality on rank 0...
100% 131/131 [00:00<00:00, 138.95it/s]
INFO 2024-08-18 20:28:37,932 lm-eval:411: Building contexts for mmlu_human_aging on rank 0...
100% 223/223 [00:01<00:00, 139.25it/s]
INFO 2024-08-18 20:28:39,544 lm-eval:411: Building contexts for mmlu_high_school_world_history on rank 0...
100% 237/237 [00:01<00:00, 138.81it/s]
INFO 2024-08-18 20:28:41,264 lm-eval:411: Building contexts for mmlu_high_school_us_history on rank 0...
100% 204/204 [00:01<00:00, 136.33it/s]
INFO 2024-08-18 20:28:42,771 lm-eval:411: Building contexts for mmlu_high_school_statistics on rank 0...
100% 216/216 [00:01<00:00, 138.37it/s]
INFO 2024-08-18 20:28:44,343 lm-eval:411: Building contexts for mmlu_high_school_psychology on rank 0...
100% 545/545 [00:03<00:00, 137.23it/s]
INFO 2024-08-18 20:28:48,339 lm-eval:411: Building contexts for mmlu_high_school_physics on rank 0...
100% 151/151 [00:01<00:00, 138.28it/s]
INFO 2024-08-18 20:28:49,439 lm-eval:411: Building contexts for mmlu_high_school_microeconomics on rank 0...
100% 238/238 [00:01<00:00, 138.44it/s]
INFO 2024-08-18 20:28:51,170 lm-eval:411: Building contexts for mmlu_high_school_mathematics on rank 0...
100% 270/270 [00:01<00:00, 137.69it/s]
INFO 2024-08-18 20:28:53,144 lm-eval:411: Building contexts for mmlu_high_school_macroeconomics on rank 0...
100% 390/390 [00:02<00:00, 138.61it/s]
INFO 2024-08-18 20:28:55,975 lm-eval:411: Building contexts for mmlu_high_school_government_and_politics on rank 0...
100% 193/193 [00:01<00:00, 138.85it/s]
INFO 2024-08-18 20:28:57,375 lm-eval:411: Building contexts for mmlu_high_school_geography on rank 0...
100% 198/198 [00:01<00:00, 138.88it/s]
INFO 2024-08-18 20:28:58,810 lm-eval:411: Building contexts for mmlu_high_school_european_history on rank 0...
100% 165/165 [00:01<00:00, 135.99it/s]
INFO 2024-08-18 20:29:00,033 lm-eval:411: Building contexts for mmlu_high_school_computer_science on rank 0...
100% 100/100 [00:00<00:00, 138.28it/s]
INFO 2024-08-18 20:29:00,761 lm-eval:411: Building contexts for mmlu_high_school_chemistry on rank 0...
100% 203/203 [00:01<00:00, 135.51it/s]
INFO 2024-08-18 20:29:02,269 lm-eval:411: Building contexts for mmlu_high_school_biology on rank 0...
100% 310/310 [00:02<00:00, 115.72it/s]
INFO 2024-08-18 20:29:04,963 lm-eval:411: Building contexts for mmlu_global_facts on rank 0...
100% 100/100 [00:00<00:00, 135.44it/s]
INFO 2024-08-18 20:29:05,706 lm-eval:411: Building contexts for mmlu_formal_logic on rank 0...
100% 126/126 [00:00<00:00, 135.56it/s]
INFO 2024-08-18 20:29:06,642 lm-eval:411: Building contexts for mmlu_elementary_mathematics on rank 0...
100% 378/378 [00:02<00:00, 135.60it/s]
INFO 2024-08-18 20:29:09,447 lm-eval:411: Building contexts for mmlu_electrical_engineering on rank 0...
100% 145/145 [00:01<00:00, 136.37it/s]
INFO 2024-08-18 20:29:10,518 lm-eval:411: Building contexts for mmlu_econometrics on rank 0...
100% 114/114 [00:00<00:00, 135.71it/s]
INFO 2024-08-18 20:29:11,363 lm-eval:411: Building contexts for mmlu_conceptual_physics on rank 0...
100% 235/235 [00:01<00:00, 136.95it/s]
INFO 2024-08-18 20:29:13,091 lm-eval:411: Building contexts for mmlu_computer_security on rank 0...
100% 100/100 [00:00<00:00, 136.87it/s]
INFO 2024-08-18 20:29:13,827 lm-eval:411: Building contexts for mmlu_college_physics on rank 0...
100% 102/102 [00:00<00:00, 136.91it/s]
INFO 2024-08-18 20:29:14,577 lm-eval:411: Building contexts for mmlu_college_medicine on rank 0...
100% 173/173 [00:01<00:00, 136.72it/s]
INFO 2024-08-18 20:29:15,851 lm-eval:411: Building contexts for mmlu_college_mathematics on rank 0...
100% 100/100 [00:00<00:00, 136.59it/s]
INFO 2024-08-18 20:29:16,589 lm-eval:411: Building contexts for mmlu_college_computer_science on rank 0...
100% 100/100 [00:00<00:00, 136.43it/s]
INFO 2024-08-18 20:29:17,327 lm-eval:411: Building contexts for mmlu_college_chemistry on rank 0...
100% 100/100 [00:00<00:00, 136.50it/s]
INFO 2024-08-18 20:29:18,065 lm-eval:411: Building contexts for mmlu_college_biology on rank 0...
100% 144/144 [00:01<00:00, 136.85it/s]
INFO 2024-08-18 20:29:19,125 lm-eval:411: Building contexts for mmlu_clinical_knowledge on rank 0...
100% 265/265 [00:01<00:00, 137.03it/s]
INFO 2024-08-18 20:29:21,071 lm-eval:411: Building contexts for mmlu_business_ethics on rank 0...
100% 100/100 [00:00<00:00, 136.58it/s]
INFO 2024-08-18 20:29:21,808 lm-eval:411: Building contexts for mmlu_astronomy on rank 0...
100% 152/152 [00:01<00:00, 136.77it/s]
INFO 2024-08-18 20:29:22,927 lm-eval:411: Building contexts for mmlu_anatomy on rank 0...
100% 135/135 [00:00<00:00, 137.49it/s]
INFO 2024-08-18 20:29:23,916 lm-eval:411: Building contexts for mmlu_abstract_algebra on rank 0...
100% 100/100 [00:00<00:00, 138.03it/s]
INFO 2024-08-18 20:29:24,646 lm-eval:438: Running loglikelihood requests
Running loglikelihood requests: 0% 0/56168 [00:00<?, ?it/s]
Passed argument batch_size = auto:1. Detecting largest batch size
Determined largest batch size: 16
Running loglikelihood requests: 100% 56168/56168 [13:58<00:00, 66.98it/s]
WARNING 2024-08-18 20:46:42,688 lm-eval:1315: Failed to get model SHA for /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320 at revision main. Error: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320'. Use `repo_type` argument if needed.
fatal: not a git repository (or any of the parent directories): .git
CHECKPOINT EVALUATION: /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320 SCORED 0.5283330937684491
Training Phase 2/2...
TrainingArgs for current phase: TrainingArgs(model_path='/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/var/mnt/inststg1/instructlab/generated/skills_train_msgs_2024-08-18T15_57_14.jsonl', ckpt_output_dir='/var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints', data_output_dir='/var/mnt/inststg1/instructlab/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=10000, num_epochs=2, effective_batch_size=3840, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=False, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=None)
INFO 2024-08-18 20:46:47,145 root:611: eos: 32001, pad: 32002, system: 32003, user: 32004, assistant: 32005
Generating train split: 10000 examples [00:00, 100155.07 examples/s]
tokenizing the dataset with /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320 tokenizer...
Setting TOKENIZERS_PARALLELISM=false for forked processes.
WARNING 2024-08-18 20:46:47,316 datasets.arrow_dataset:3211: Setting TOKENIZERS_PARALLELISM=false for forked processes.
Map (num_proc=16): 100% 10000/10000 [00:02<00:00, 3765.08 examples/s]
ten largest length percentiles:
Setting TOKENIZERS_PARALLELISM=false for forked processes.
WARNING 2024-08-18 20:46:50,962 datasets.arrow_dataset:3211: Setting TOKENIZERS_PARALLELISM=false for forked processes.
Map (num_proc=16): 100% 10000/10000 [00:00<00:00, 16336.33 examples/s]
quantile 90th: 1283.0
quantile 91th: 1367.0
quantile 92th: 1453.0
quantile 93th: 1579.0
quantile 94th: 1704.0599999999995
quantile 95th: 1843.1499999999978
quantile 96th: 2046.1199999999972
quantile 97th: 2356.179999999993
quantile 98th: 2724.100000000002
quantile 99th: 3213.0200000000004
quantile 100th: 5765.0
at 4096 max sequence length, the number of samples to be dropped is 19
(0.19% of total)
quantile 0th: 70.0
quantile 1th: 81.0
quantile 2th: 85.0
quantile 3th: 87.0
quantile 4th: 91.0
quantile 5th: 94.0
quantile 6th: 97.93999999999994
quantile 7th: 102.0
quantile 8th: 108.0
quantile 9th: 113.0
quantile 10th: 118.0
at 20 min sequence length, the number of samples to be dropped is 0
checking the validity of the samples...
Setting TOKENIZERS_PARALLELISM=false for forked processes.
WARNING 2024-08-18 20:46:52,663 datasets.arrow_dataset:3211: Setting TOKENIZERS_PARALLELISM=false for forked processes.
Filter (num_proc=16): 100% 10000/10000 [00:01<00:00, 8244.08 examples/s]
INFO 2024-08-18 20:46:54,896 root:611: number of dropped samples: 19 -- out of 10000
Categorizing training data type...
Data type sorting: 100% 9981/9981 [00:00<00:00, 112525.30it/s]
unmasking the appropriate message content...
Setting TOKENIZERS_PARALLELISM=false for forked processes.
WARNING 2024-08-18 20:46:57,636 datasets.arrow_dataset:3211: Setting TOKENIZERS_PARALLELISM=false for forked processes.
Map (num_proc=16): 100% 9981/9981 [00:01<00:00, 9644.98 examples/s]
The following are some examples of the processed data, with masked tokens (not to be learned) represented with <mask>. The unmasked tokens are the ones the model will learn to predict. Please review these samples to ensure the model is learning to predict expected tokens.
Instruction ex sample 6312: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask>
The TVs in "2001: A Space Odyssey" might initially seem random, but they are actually thoughtfully integrated into the film's narrative and symbolism. Here are some instances where TVs play a significant role:
1. **Planetary Alignment:** In the opening scene, a TV screen displays the alignment of the planets, which is crucial to the plot and the film's exploration of extraterrestrial intelligence.
2. **News Broadcasts:** Later in the movie, TVs are used to show news broadcasts, providing context and information about the ongoing space exploration and human presence beyond Earth.
3. **Interviews:** TVs are also used to present interviews with characters, offering insights into their thoughts and motivations.
4. **Monolith's Influence:** The iconic scene where the monolith appears to the apes is accompanied by a flashing TV screen. This juxtaposition highlights the monolith's influence over technology and human progress, as if the TV screen is a modern-day monolith guiding human development.
In essence, the TVs in "2001: A Space Odyssey" are carefully crafted elements that contribute to the film's visual language and themes, such as technology, human evolution, and the search for extraterrestrial intelligence.<|endoftext|>
Original Input: <|system|>
I am, Red Hat® Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.
<|user|>
What are the random TVs in 2001?
<|assistant|>
The TVs in "2001: A Space Odyssey" might initially seem random, but they are actually thoughtfully integrated into the film's narrative and symbolism. Here are some instances where TVs play a significant role:
1. **Planetary Alignment:** In the opening scene, a TV screen displays the alignment of the planets, which is crucial to the plot and the film's exploration of extraterrestrial intelligence.
2. **News Broadcasts:** Later in the movie, TVs are used to show news broadcasts, providing context and information about the ongoing space exploration and human presence beyond Earth.
3. **Interviews:** TVs are also used to present interviews with characters, offering insights into their thoughts and motivations.
4. **Monolith's Influence:** The iconic scene where the monolith appears to the apes is accompanied by a flashing TV screen. This juxtaposition highlights the monolith's influence over technology and human progress, as if the TV screen is a modern-day monolith guiding human development.
In essence, the TVs in "2001: A Space Odyssey" are carefully crafted elements that contribute to the film's visual language and themes, such as technology, human evolution, and the search for extraterrestrial intelligence.<|endoftext|>
Instruction ex sample 6891: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask>
Mars, the fourth planet from the Sun, has a fascinating and complex weather system. Here are some intriguing facts about the weather on Mars:
- **Thin Atmosphere:** Mars has a very thin atmosphere, which is only about 1% as dense as Earth's. It is primarily composed of carbon dioxide (CO2), with minor amounts of nitrogen and argon.
- **Extreme Temperature Swings:** Due to the thin atmosphere and Mars' distance from the Sun, the planet experiences extreme temperature fluctuations. Daytime temperatures can reach up to 70°F (20°C), while nighttime temperatures can plummet to -100°F (-73°C).
- **Dust Storms:** Mars is known for its massive dust storms that can cover the entire planet and last for months. These storms can reach speeds of up to 70 mph (113 km/h) and are so intense that they can alter the planet's albedo (reflectivity) and even be detected by telescopes on Earth.
- **Winds:** Mars has a surprisingly active wind system, with average wind speeds of about 22 mph (35 km/h). These winds are primarily caused by the planet's rotation and temperature differences between the equator and the poles.
- **Frozen Carbon Dioxide:** During the Martian winter, temperatures at the poles drop low enough for carbon dioxide to freeze, forming a layer of dry ice. This process contributes to the formation of the polar ice caps, which are primarily composed of water ice and dust.
- **Seasonal Changes:** Mars experiences seasonal changes due to its elliptical orbit around the Sun. The planet is closer to the Sun during its summer, leading to more extreme temperature variations and more frequent dust storms.
- **Absence of Liquid Water:** Due to Mars' low atmospheric pressure and temperature, liquid water cannot exist on the surface for an extended period. However, evidence suggests that water may have flowed on the planet's surface in the past, and there is the possibility of subsurface water reservoirs.
These fascinating weather characteristics make Mars an intriguing subject for scientists and researchers studying planetary atmospheres and climates.<|endoftext|>
Original Input: <|system|>
I am, Red Hat® Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.
<|user|>
What are some interesting facts about the weather on Mars? Please use bullet points.
<|assistant|>
Mars, the fourth planet from the Sun, has a fascinating and complex weather system. Here are some intriguing facts about the weather on Mars:
- **Thin Atmosphere:** Mars has a very thin atmosphere, which is only about 1% as dense as Earth's. It is primarily composed of carbon dioxide (CO2), with minor amounts of nitrogen and argon.
- **Extreme Temperature Swings:** Due to the thin atmosphere and Mars' distance from the Sun, the planet experiences extreme temperature fluctuations. Daytime temperatures can reach up to 70°F (20°C), while nighttime temperatures can plummet to -100°F (-73°C).
- **Dust Storms:** Mars is known for its massive dust storms that can cover the entire planet and last for months. These storms can reach speeds of up to 70 mph (113 km/h) and are so intense that they can alter the planet's albedo (reflectivity) and even be detected by telescopes on Earth.
- **Winds:** Mars has a surprisingly active wind system, with average wind speeds of about 22 mph (35 km/h). These winds are primarily caused by the planet's rotation and temperature differences between the equator and the poles.
- **Frozen Carbon Dioxide:** During the Martian winter, temperatures at the poles drop low enough for carbon dioxide to freeze, forming a layer of dry ice. This process contributes to the formation of the polar ice caps, which are primarily composed of water ice and dust.
- **Seasonal Changes:** Mars experiences seasonal changes due to its elliptical orbit around the Sun. The planet is closer to the Sun during its summer, leading to more extreme temperature variations and more frequent dust storms.
- **Absence of Liquid Water:** Due to Mars' low atmospheric pressure and temperature, liquid water cannot exist on the surface for an extended period. However, evidence suggests that water may have flowed on the planet's surface in the past, and there is the possibility of subsurface water reservoirs.
These fascinating weather characteristics make Mars an intriguing subject for scientists and researchers studying planetary atmospheres and climates.<|endoftext|>
Creating json from Arrow format: 100% 10/10 [00:01<00:00, 7.02ba/s]
Running command: torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=123 --rdzv_endpoint=127.0.0.1:12222 /opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py --model_name_or_path=/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320 --data_path=/var/mnt/inststg1/instructlab/.local/share/instructlab/internal/data.jsonl --output_dir=/var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints --num_epochs=2 --effective_batch_size=3840 --learning_rate=2e-05 --num_warmup_steps=25 --save_samples=0 --log_level=INFO --max_batch_len=10000 --seed=42 --chat-tmpl-path=/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py --checkpoint_at_epoch
W0818 20:47:08.033000 139686984577472 torch/distributed/run.py:757]
W0818 20:47:08.033000 139686984577472 torch/distributed/run.py:757] *****************************************
W0818 20:47:08.033000 139686984577472 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0818 20:47:08.033000 139686984577472 torch/distributed/run.py:757] *****************************************
[2024-08-18 20:47:10,896] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:47:11,042] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:47:11,152] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:47:11,205] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:47:11,218] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:47:11,232] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:47:11,260] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-18 20:47:11,287] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
model_name_or_path: /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320
data_path: /var/mnt/inststg1/instructlab/.local/share/instructlab/internal/data.jsonl
output_dir: /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints
num_epochs: 2
last_step: 0
effective_batch_size: 3840
learning_rate: 2.0e-05
lr_scheduler: cosine
num_warmup_steps: 25
save_samples: 0
save_samples_ds: null
save_last: false
checkpoint_at_epoch: true
log_level: INFO
seed: 42
mock_data: false
mock_len: 2600
sharding_strategy: FULL_SHARD
is_granite: false
lora_r: 0
lora_alpha: 32
lora_dropout: 0.1
lora_quant_bits: null
lora_target_modules: null
max_batch_len: 10000
cpu_offload_optimizer: false
cpu_offload_optimizer_pin_memory: false
cpu_offload_optimizer_ratio: 1.0
NEFTune_alpha: null
chat_tmpl_path: /opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py
disable_flash_attn: false
{
"script_params": {
"model_name_or_path": "/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320",
"data_path": "/var/mnt/inststg1/instructlab/.local/share/instructlab/internal/data.jsonl",
"output_dir": "/var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints",
"num_epochs": 2,
"last_step": 0,
"effective_batch_size": 3840,
"learning_rate": 2e-05,
"lr_scheduler": "cosine",
"num_warmup_steps": 25,
"save_samples": 0,
"save_samples_ds": null,
"save_last": false,
"checkpoint_at_epoch": true,
"log_level": "INFO",
"seed": 42,
"mock_data": false,
"mock_len": 2600,
"sharding_strategy": "FULL_SHARD",
"is_granite": false,
"lora_r": 0,
"lora_alpha": 32,
"lora_dropout": 0.1,
"lora_quant_bits": null,
"lora_target_modules": null,
"max_batch_len": 10000,
"cpu_offload_optimizer": false,
"cpu_offload_optimizer_pin_memory": false,
"cpu_offload_optimizer_ratio": 1.0,
"NEFTune_alpha": null,
"chat_tmpl_path": "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py",
"disable_flash_attn": false
},
"timestamp": "2024-08-18T20:47:14.720513"
}
[2024-08-18 20:47:14,794] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-18 20:47:14,794] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
tyler-a100-newimage-val:10546:10546 [0] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10546:10546 [0] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:10546:10546 [0] NCCL INFO NCCL version 2.22.3+cuda12.5
[2024-08-18 20:47:15,959] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-18 20:47:15,969] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-18 20:47:15,974] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-18 20:47:15,976] [INFO] [comm.py:637:init_distributed] cdb=None
tyler-a100-newimage-val:10548:10548 [2] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:10548:10548 [2] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10548:10548 [2] NCCL INFO NCCL version 2.22.3+cuda12.5
[2024-08-18 20:47:15,987] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-18 20:47:15,995] [INFO] [comm.py:637:init_distributed] cdb=None
tyler-a100-newimage-val:10550:10550 [4] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:10550:10550 [4] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10550:10550 [4] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-a100-newimage-val:10552:10552 [6] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:10552:10552 [6] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10552:10552 [6] NCCL INFO NCCL version 2.22.3+cuda12.5
[2024-08-18 20:47:16,031] [INFO] [comm.py:637:init_distributed] cdb=None
tyler-a100-newimage-val:10551:10551 [5] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:10551:10551 [5] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10551:10551 [5] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-a100-newimage-val:10549:10549 [3] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:10549:10549 [3] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10549:10549 [3] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-a100-newimage-val:10547:10547 [1] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:10547:10547 [1] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10547:10547 [1] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-a100-newimage-val:10553:10553 [7] NCCL INFO cudaDriverVersion 12040
tyler-a100-newimage-val:10553:10553 [7] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10553:10553 [7] NCCL INFO NCCL version 2.22.3+cuda12.5
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Using network Socket
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO Using network Socket
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO Using network Socket
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO Using network Socket
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO Using network Socket
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO Using network Socket
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO Using network Socket
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO NET/IB : No device found.
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO Using network Socket
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO ncclCommInitRank comm 0x56556127bf10 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x8ecf20a94c156f4c - Init START
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO ncclCommInitRank comm 0x562f40810030 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x8ecf20a94c156f4c - Init START
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO ncclCommInitRank comm 0x55812d458df0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x8ecf20a94c156f4c - Init START
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO ncclCommInitRank comm 0x55ed47dad9a0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x8ecf20a94c156f4c - Init START
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO ncclCommInitRank comm 0x55e00ec7dc40 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x8ecf20a94c156f4c - Init START
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO ncclCommInitRank comm 0x558cdebf44c0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x8ecf20a94c156f4c - Init START
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO ncclCommInitRank comm 0x55e4c1493930 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x8ecf20a94c156f4c - Init START
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO ncclCommInitRank comm 0x55919a812fd0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x8ecf20a94c156f4c - Init START
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO NVLS multicast support is not available on dev 3
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO NVLS multicast support is not available on dev 2
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO NVLS multicast support is not available on dev 1
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO NVLS multicast support is not available on dev 4
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO NVLS multicast support is not available on dev 5
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO NVLS multicast support is not available on dev 0
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO NVLS multicast support is not available on dev 6
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO NVLS multicast support is not available on dev 7
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO comm 0x55e00ec7dc40 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO comm 0x56556127bf10 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO comm 0x55812d458df0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO comm 0x55ed47dad9a0 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO comm 0x558cdebf44c0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO comm 0x55919a812fd0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO comm 0x55e4c1493930 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO comm 0x562f40810030 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO ncclCommInitRank comm 0x562f40810030 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x8ecf20a94c156f4c - Init COMPLETE
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:10547:11290 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.76 (kernels 0.14, bootstrap 0.27, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO ncclCommInitRank comm 0x55e4c1493930 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x8ecf20a94c156f4c - Init COMPLETE
tyler-a100-newimage-val:10548:11283 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.79 (kernels 0.16, bootstrap 0.29, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO ncclCommInitRank comm 0x55812d458df0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x8ecf20a94c156f4c - Init COMPLETE
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO ncclCommInitRank comm 0x55e00ec7dc40 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x8ecf20a94c156f4c - Init COMPLETE
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:10551:11288 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.77 (kernels 0.14, bootstrap 0.28, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10552:11287 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.77 (kernels 0.14, bootstrap 0.28, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO ncclCommInitRank comm 0x56556127bf10 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x8ecf20a94c156f4c - Init COMPLETE
tyler-a100-newimage-val:10549:11289 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.76 (kernels 0.14, bootstrap 0.28, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO ncclCommInitRank comm 0x55ed47dad9a0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x8ecf20a94c156f4c - Init COMPLETE
tyler-a100-newimage-val:10550:11286 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.77 (kernels 0.15, bootstrap 0.28, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO ncclCommInitRank comm 0x55919a812fd0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x8ecf20a94c156f4c - Init COMPLETE
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.91 (kernels 0.18, bootstrap 0.39, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO ncclCommInitRank comm 0x558cdebf44c0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x8ecf20a94c156f4c - Init COMPLETE
tyler-a100-newimage-val:10553:11291 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.75 (kernels 0.16, bootstrap 0.25, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Connected all rings
tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Connected all rings
tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Connected all rings
tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Connected all rings
tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Connected all rings
tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Connected all rings
tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Connected all rings
tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Connected all rings
Generating train split: 9981 examples [00:01, 7916.84 examples/s]
Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1972.79it/s]
Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1957.24it/s]
Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1893.76it/s]
Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1968.18it/s]
Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1937.07it/s]
Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1968.84it/s]
Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1891.13it/s]
Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1843.73it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.1128382682800293 seconds
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.11071896553039551 seconds
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.11228275299072266 seconds
{
    "num_gpus": 8,
    "avg_sample_len": 608.8641418695521,
    "effective_batch_size": 3840,
    "max_batch_len_per_gpu": 10000,
    "packing_max_batch_len": 8118,
    "grad_accum": 36,
    "num_batches": 121,
    "avg_samples_per_batch": 82.48760330578513,
    "samples_per_gpu": 13,
    "timestamp": "2024-08-18T20:47:39.867974"
}
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.1144556999206543 seconds
[2024-08-18 20:47:40,239] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-08-18 20:47:40,239] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
Loading extension module fused_adam...
Time to load fused_adam op: 0.10187482833862305 seconds
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.11427879333496094 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10196876525878906 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10352158546447754 seconds
tyler-a100-newimage-val:10551:11410 [5] NCCL INFO Using network Socket
tyler-a100-newimage-val:10550:11411 [4] NCCL INFO Using network Socket
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Using network Socket
tyler-a100-newimage-val:10553:11409 [7] NCCL INFO Using network Socket
tyler-a100-newimage-val:10549:11415 [3] NCCL INFO Using network Socket
tyler-a100-newimage-val:10547:11408 [1] NCCL INFO Using network Socket
tyler-a100-newimage-val:10552:11412 [6] NCCL INFO Using network Socket
tyler-a100-newimage-val:10548:11407 [2] NCCL INFO Using network Socket
tyler-a100-newimage-val:10549:11415 [3] NCCL INFO bootstrapSplit: comm 0x565562e66530 parent 0x56556127bf10 rank 3 nranks 8 color -934961569 key 3 prev 2 next 4 - DONE
tyler-a100-newimage-val:10548:11407 [2] NCCL INFO bootstrapSplit: comm 0x55e4c30806d0 parent 0x55e4c1493930 rank 2 nranks 8 color -934961569 key 2 prev 1 next 3 - DONE
tyler-a100-newimage-val:10550:11411 [4] NCCL INFO bootstrapSplit: comm 0x55ed499a84f0 parent 0x55ed47dad9a0 rank 4 nranks 8 color -934961569 key 4 prev 3 next 5 - DONE
tyler-a100-newimage-val:10547:11408 [1] NCCL INFO bootstrapSplit: comm 0x562f423f9400 parent 0x562f40810030 rank 1 nranks 8 color -934961569 key 1 prev 0 next 2 - DONE
tyler-a100-newimage-val:10553:11409 [7] NCCL INFO bootstrapSplit: comm 0x558ce0813360 parent 0x558cdebf44c0 rank 7 nranks 8 color -934961569 key 7 prev 6 next 0 - DONE
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO bootstrapSplit: comm 0x55919c3fd580 parent 0x55919a812fd0 rank 0 nranks 8 color -934961569 key 0 prev 7 next 1 - DONE
tyler-a100-newimage-val:10551:11410 [5] NCCL INFO bootstrapSplit: comm 0x55812f05a9a0 parent 0x55812d458df0 rank 5 nranks 8 color -934961569 key 5 prev 4 next 6 - DONE
tyler-a100-newimage-val:10548:11407 [2] NCCL INFO ncclCommSplit comm 0x55e4c30806d0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55e4c1493930 color -934961569 key 2 commId 0xc6ecd14a22a5889f - Init START
tyler-a100-newimage-val:10550:11411 [4] NCCL INFO ncclCommSplit comm 0x55ed499a84f0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x55ed47dad9a0 color -934961569 key 4 commId 0xc6ecd14a22a5889f - Init START
tyler-a100-newimage-val:10552:11412 [6] NCCL INFO bootstrapSplit: comm 0x55e0108836a0 parent 0x55e00ec7dc40 rank 6 nranks 8 color -934961569 key 6 prev 5 next 7 - DONE
tyler-a100-newimage-val:10553:11409 [7] NCCL INFO ncclCommSplit comm 0x558ce0813360 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x558cdebf44c0 color -934961569 key 7 commId 0xc6ecd14a22a5889f - Init START
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO ncclCommSplit comm 0x55919c3fd580 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x55919a812fd0 color -934961569 key 0 commId 0xc6ecd14a22a5889f - Init START
tyler-a100-newimage-val:10547:11408 [1] NCCL INFO ncclCommSplit comm 0x562f423f9400 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x562f40810030 color -934961569 key 1 commId 0xc6ecd14a22a5889f - Init START
tyler-a100-newimage-val:10551:11410 [5] NCCL INFO ncclCommSplit comm 0x55812f05a9a0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x55812d458df0 color -934961569 key 5 commId 0xc6ecd14a22a5889f - Init START
tyler-a100-newimage-val:10549:11415 [3] NCCL INFO ncclCommSplit comm 0x565562e66530 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x56556127bf10 color -934961569 key 3 commId 0xc6ecd14a22a5889f - Init START
tyler-a100-newimage-val:10552:11412 [6] NCCL INFO ncclCommSplit comm 0x55e0108836a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x55e00ec7dc40 color -934961569 key 6 commId 0xc6ecd14a22a5889f - Init START
tyler-a100-newimage-val:10549:11415 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
tyler-a100-newimage-val:10549:11415 [3] NCCL INFO NVLS multicast support is not available on dev 3
tyler-a100-newimage-val:10548:11407 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
tyler-a100-newimage-val:10548:11407 [2] NCCL INFO NVLS multicast support is not available on dev 2
tyler-a100-newimage-val:10547:11408 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
tyler-a100-newimage-val:10547:11408 [1] NCCL INFO NVLS multicast support is not available on dev 1
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO NVLS multicast support is not available on dev 0
tyler-a100-newimage-val:10550:11411 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
tyler-a100-newimage-val:10550:11411 [4] NCCL INFO NVLS multicast support is not available on dev 4
tyler-a100-newimage-val:10552:11412 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
tyler-a100-newimage-val:10552:11412 [6] NCCL INFO NVLS multicast support is not available on dev 6
tyler-a100-newimage-val:10553:11409 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
tyler-a100-newimage-val:10553:11409 [7] NCCL INFO NVLS multicast support is not available on dev 7
tyler-a100-newimage-val:10551:11410 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
tyler-a100-newimage-val:10551:11410 [5] NCCL INFO NVLS multicast support is not available on dev 5
tyler-a100-newimage-val:10548:11407 [2] NCCL INFO comm 0x55e4c30806d0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
tyler-a100-newimage-val:10547:11408 [1] NCCL INFO comm 0x562f423f9400 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
tyler-a100-newimage-val:10553:11409 [7] NCCL INFO comm 0x558ce0813360 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO comm 0x55919c3fd580 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
tyler-a100-newimage-val:10552:11412 [6] NCCL INFO comm 0x55e0108836a0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
tyler-a100-newimage-val:10551:11410 [5] NCCL INFO comm 0x55812f05a9a0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10550:11411 [4] NCCL INFO comm 0x55ed499a84f0 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
tyler-a100-newimage-val:10549:11415 [3] NCCL INFO comm 0x565562e66530 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10548:11407 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10548:11407 [2] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10547:11408 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10551:11410 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
tyler-a100-newimage-val:10547:11408 [1] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10551:11410 [5] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10553:11409 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10552:11412 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
tyler-a100-newimage-val:10553:11409 [7] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10552:11412 [6] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10550:11411 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10549:11415 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10550:11411 [4] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10549:11415 [3] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO P2P Chunksize set to 524288
tyler-a100-newimage-val:10551:11410 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10551:11410 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10548:11407 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10548:11407 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
tyler-a100-newimage-val:10549:11415 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10549:11415 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10552:11412 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10552:11412 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10547:11408 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10547:11408 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10553:11409 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10553:11409 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10550:11411 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
tyler-a100-newimage-val:10550:11411 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
tyler-a100-newimage-val:10553:11409 [7] NCCL INFO ncclCommSplit comm 0x558ce0813360 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x558cdebf44c0 color -934961569 key 7 commId 0xc6ecd14a22a5889f - Init COMPLETE
tyler-a100-newimage-val:10549:11415 [3] NCCL INFO ncclCommSplit comm 0x565562e66530 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x56556127bf10 color -934961569 key 3 commId 0xc6ecd14a22a5889f - Init COMPLETE
tyler-a100-newimage-val:10551:11410 [5] NCCL INFO ncclCommSplit comm 0x55812f05a9a0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x55812d458df0 color -934961569 key 5 commId 0xc6ecd14a22a5889f - Init COMPLETE
tyler-a100-newimage-val:10547:11408 [1] NCCL INFO ncclCommSplit comm 0x562f423f9400 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x562f40810030 color -934961569 key 1 commId 0xc6ecd14a22a5889f - Init COMPLETE
tyler-a100-newimage-val:10553:11409 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.03)
tyler-a100-newimage-val:10551:11410 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10547:11408 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.03)
tyler-a100-newimage-val:10549:11415 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.34 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO ncclCommSplit comm 0x55919c3fd580 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x55919a812fd0 color -934961569 key 0 commId 0xc6ecd14a22a5889f - Init COMPLETE
tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10552:11412 [6] NCCL INFO ncclCommSplit comm 0x55e0108836a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x55e00ec7dc40 color -934961569 key 6 commId 0xc6ecd14a22a5889f - Init COMPLETE
tyler-a100-newimage-val:10548:11407 [2] NCCL INFO ncclCommSplit comm 0x55e4c30806d0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55e4c1493930 color -934961569 key 2 commId 0xc6ecd14a22a5889f - Init COMPLETE
tyler-a100-newimage-val:10552:11412 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10548:11407 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
tyler-a100-newimage-val:10550:11411 [4] NCCL INFO ncclCommSplit comm 0x55ed499a84f0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x55ed47dad9a0 color -934961569 key 4 commId 0xc6ecd14a22a5889f - Init COMPLETE
tyler-a100-newimage-val:10550:11411 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.03)
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Connected all rings
tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Connected all rings
tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Connected all rings
tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Connected all rings
tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Connected all rings
tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Connected all rings
tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Connected all rings
tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Connected all rings
[2024-08-18 20:47:46,090] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-08-18 20:47:46,091] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-08-18 20:47:46,091] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-08-18 20:47:46,104] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-08-18 20:47:46,104] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-08-18 20:47:46,104] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-08-18 20:47:46,104] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
[2024-08-18 20:47:46,104] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
[2024-08-18 20:47:46,104] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-08-18 20:47:46,104] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-08-18 20:47:59,000] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:47:59,024] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:48:00,036] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:48:00,385] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:48:00,831] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:48:00,924] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:48:01,063] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2024-08-18 20:48:01,367] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-08-18 20:48:01,367] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 17.26 GB CA 17.26 GB Max_CA 17 GB
[2024-08-18 20:48:01,368] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 33.57 GB, percent = 2.7%
[2024-08-18 20:48:01,588] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-08-18 20:48:01,589] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 18.83 GB CA 20.4 GB Max_CA 20 GB
[2024-08-18 20:48:01,589] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 33.58 GB, percent = 2.7%
[2024-08-18 20:48:01,590] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-08-18 20:48:01,807] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-08-18 20:48:01,808] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB Max_MA 15.69 GB CA 20.4 GB Max_CA 20 GB
[2024-08-18 20:48:01,808] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 33.58 GB, percent = 2.7%
[2024-08-18 20:48:01,810] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-08-18 20:48:01,810] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-08-18 20:48:01,810] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f171cc77e10>
[2024-08-18 20:48:01,810] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
[2024-08-18 20:48:01,811] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] amp_enabled .................. False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] amp_params ................... False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] bfloat16_enabled ............. True
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f171cc59a90>
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] communication_data_type ...... None
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] dataloader_drop_last ......... False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] disable_allgather ............ False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] dump_state ................... False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] elasticity_enabled ........... False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] fp16_auto_cast ............... None
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] fp16_enabled ................. False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] global_rank .................. 0
[2024-08-18 20:48:01,812] [INFO] [config.py:1001:print] grad_accum_dtype ............. None
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 36
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] gradient_clipping ............ 1.0
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] graph_harvesting ............. False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 1
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] load_universal_checkpoint .... False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] loss_scale ................... 1.0
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] memory_breakdown ............. False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] mics_shard_size .............. -1
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] optimizer_name ............... None
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] optimizer_params ............. None
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] pld_enabled .................. False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] pld_params ................... False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] prescale_gradients ........... False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] scheduler_name ............... None
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] scheduler_params ............. None
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] sparse_attention ............. None
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] steps_per_print .............. 1
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] train_batch_size ............. 3744
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 13
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] use_node_local_storage ....... False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] weight_quantization_config ... None
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] world_size ................... 8
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] zero_allow_untested_optimizer False
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] zero_enabled ................. True
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True
[2024-08-18 20:48:01,813] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2
[2024-08-18 20:48:01,814] [INFO] [config.py:987:print_user_config] json = {
"train_batch_size": 3.744000e+03,
"gradient_accumulation_steps": 36,
"train_micro_batch_size_per_gpu": 13,
"steps_per_print": 1,
"zero_optimization": {
"stage": 2,
"offload_param": {
"device": "none"
},
"offload_optimizer": {
"device": "none"
}
},
"bf16": {
"enabled": true
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false
}
[2024-08-18 20:48:01,814] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
Epoch 0: 0%| | 0/121 [00:00<?, ?it/s]
total tokens: 7992 num samples: 18 num padding tokens: 2022 - rank: 6 max len: 444 min len: 240 avg len: 331.6666666666667 num_loss_counted_tokens: 3649
total tokens: 7752 num samples: 17 num padding tokens: 1950 - rank: 6 max len: 456 min len: 267 avg len: 341.29411764705884 num_loss_counted_tokens: 3614
total tokens: 7784 num samples: 14 num padding tokens: 1814 - rank: 6 max len: 556 min len: 342 avg len: 426.42857142857144 num_loss_counted_tokens: 3061
total tokens: 8094 num samples: 19 num padding tokens: 2216 - rank: 6 max len: 426 min len: 215 avg len: 309.36842105263156 num_loss_counted_tokens: 3253
total tokens: 7524 num samples: 12 num padding tokens: 2256 - rank: 6 max len: 627 min len: 286 avg len: 439.0 num_loss_counted_tokens: 3465
total tokens: 8056 num samples: 19 num padding tokens: 1799 - rank: 6 max len: 424 min len: 237 avg len: 329.3157894736842 num_loss_counted_tokens: 3512
total tokens: 7803 num samples: 17 num padding tokens: 1146 - rank: 6 max len: 459 min len: 281 avg len: 391.5882352941176 num_loss_counted_tokens: 4201
total tokens: 6561 num samples: 3 num padding tokens: 1099 - rank: 1 max len: 2187 min len: 1316 avg len: 1820.6666666666667 num_loss_counted_tokens: 687
total tokens: 8118 num samples: 3 num padding tokens: 403 - rank: 1 max len: 2706 min len: 2307 avg len: 2571.6666666666665 num_loss_counted_tokens: 336
total tokens: 7851 num samples: 3 num padding tokens: 789 - rank: 1 max len: 2617 min len: 1940 avg len: 2354.0 num_loss_counted_tokens: 768
total tokens: 8096 num samples: 16 num padding tokens: 1551 - rank: 6 max len: 506 min len: 312 avg len: 409.0625 num_loss_counted_tokens: 3732
total tokens: 7089 num samples: 3 num padding tokens: 249 - rank: 1 max len: 2363 min len: 2182 avg len: 2280.0 num_loss_counted_tokens: 237
total tokens: 7629 num samples: 3 num padding tokens: 370 - rank: 1 max len: 2543 min len: 2326 avg len: 2419.6666666666665 num_loss_counted_tokens: 668
total tokens: 7752 num samples: 24 num padding tokens: 3285 - rank: 7 max len: 323 min len: 83 avg len: 186.125 num_loss_counted_tokens: 2064
total tokens: 7776 num samples: 16 num padding tokens: 2005 - rank: 6 max len: 486 min len: 266 avg len: 360.6875 num_loss_counted_tokens: 3202
total tokens: 7794 num samples: 9 num padding tokens: 1201 - rank: 4 max len: 866 min len: 628 avg len: 732.5555555555555 num_loss_counted_tokens: 3797
total tokens: 8075 num samples: 19 num padding tokens: 2226 - rank: 6 max len: 425 min len: 232 avg len: 307.8421052631579 num_loss_counted_tokens: 3319
total tokens: 7872 num samples: 16 num padding tokens: 1566 - rank: 6 max len: 492 min len: 271 avg len: 394.125 num_loss_counted_tokens: 3233
total tokens: 5934 num samples: 23 num padding tokens: 2656 - rank: 7 max len: 258 min len: 72 avg len: 142.52173913043478 num_loss_counted_tokens: 1182
total tokens: 8024 num samples: 8 num padding tokens: 985 - rank: 4 max len: 1003 min len: 741 avg len: 879.875 num_loss_counted_tokens: 5064
total tokens: 7840 num samples: 8 num padding tokens: 691 - rank: 4 max len: 980 min len: 778 avg len: 893.625 num_loss_counted_tokens: 4008
total tokens: 8016 num samples: 3 num padding tokens: 730 - rank: 1 max len: 2672 min len: 2280 avg len: 2428.6666666666665 num_loss_counted_tokens: 336
total tokens: 8032 num samples: 8 num padding tokens: 791 - rank: 4 max len: 1004 min len: 846 avg len: 905.125 num_loss_counted_tokens: 6538
total tokens: 8094 num samples: 19 num padding tokens: 1536 - rank: 6 max len: 426 min len: 259 avg len: 345.1578947368421 num_loss_counted_tokens: 3814
total tokens: 7476 num samples: 4 num padding tokens: 1909 - rank: 1 max len: 1869 min len: 1081 avg len: 1391.75 num_loss_counted_tokens: 2210
total tokens: 6380 num samples: 29 num padding tokens: 2359 - rank: 7 max len: 220 min len: 77 avg len: 138.6551724137931 num_loss_counted_tokens: 1514
total tokens: 7328 num samples: 32 num padding tokens: 2417 - rank: 7 max len: 229 min len: 83 avg len: 153.46875 num_loss_counted_tokens: 2050
total tokens: 6348 num samples: 23 num padding tokens: 1924 - rank: 7 max len: 276 min len: 79 avg len: 192.34782608695653 num_loss_counted_tokens: 1895
total tokens: 7460 num samples: 5 num padding tokens: 807 - rank: 1 max len: 1492 min len: 1178 avg len: 1330.6 num_loss_counted_tokens: 2708
total tokens: 7890 num samples: 30 num padding tokens: 2915 - rank: 7 max len: 263 min len: 77 avg len: 165.83333333333334 num_loss_counted_tokens: 1979
total tokens: 7704 num samples: 9 num padding tokens: 736 - rank: 4 max len: 856 min len: 719 avg len: 774.2222222222222 num_loss_counted_tokens: 2859
total tokens: 7920 num samples: 10 num padding tokens: 665 - rank: 4 max len: 792 min len: 622 avg len: 725.5 num_loss_counted_tokens: 4725
total tokens: 7812 num samples: 9 num padding tokens: 769 - rank: 4 max len: 868 min len: 744 avg len: 782.5555555555555 num_loss_counted_tokens: 5045
total tokens: 6171 num samples: 3 num padding tokens: 752 - rank: 1 max len: 2057 min len: 1416 avg len: 1806.3333333333333 num_loss_counted_tokens: 750
total tokens: 7432 num samples: 4 num padding tokens: 296 - rank: 1 max len: 1858 min len: 1710 avg len: 1784.0 num_loss_counted_tokens: 829
total tokens: 5684 num samples: 2 num padding tokens: 688 - rank: 1 max len: 2842 min len: 2154 avg len: 2498.0 num_loss_counted_tokens: 182
total tokens: 7992 num samples: 18 num padding tokens: 1629 - rank: 6 max len: 444 min len: 270 avg len: 353.5 num_loss_counted_tokens: 3206
total tokens: 7625 num samples: 25 num padding tokens: 3089 - rank: 7 max len: 305 min len: 83 avg len: 181.44 num_loss_counted_tokens: 2298
total tokens: 7812 num samples: 31 num padding tokens: 2673 - rank: 7 max len: 252 min len: 81 avg len: 165.7741935483871 num_loss_counted_tokens: 2047
total tokens: 7871 num samples: 17 num padding tokens: 2098 - rank: 6 max len: 463 min len: 248 avg len: 339.5882352941176 num_loss_counted_tokens: 3582
total tokens: 8060 num samples: 20 num padding tokens: 1971 - rank: 6 max len: 403 min len: 249 avg len: 304.45 num_loss_counted_tokens: 3294
total tokens: 7288 num samples: 4 num padding tokens: 745 - rank: 1 max len: 1822 min len: 1504 avg len: 1635.75 num_loss_counted_tokens: 956
total tokens: 6479 num samples: 31 num padding tokens: 2352 - rank: 7 max len: 209 min len: 79 avg len: 133.1290322580645 num_loss_counted_tokens: 1439
total tokens: 6892 num samples: 4 num padding tokens: 402 - rank: 1 max len: 1723 min len: 1368 avg len: 1622.5 num_loss_counted_tokens: 752
total tokens: 6423 num samples: 3 num padding tokens: 305 - rank: 1 max len: 2141 min len: 1976 avg len: 2039.3333333333333 num_loss_counted_tokens: 448
total tokens: 7634 num samples: 11 num padding tokens: 386 - rank: 4 max len: 694 min len: 621 avg len: 658.9090909090909 num_loss_counted_tokens: 5313
total tokens: 8070 num samples: 10 num padding tokens: 726 - rank: 4 max len: 807 min len: 627 avg len: 734.4 num_loss_counted_tokens: 5736
total tokens: 8096 num samples: 11 num padding tokens: 777 - rank: 4 max len: 736 min len: 594 avg len: 665.3636363636364 num_loss_counted_tokens: 4205
total tokens: 7560 num samples: 10 num padding tokens: 629 - rank: 4 max len: 756 min len: 626 avg len: 693.1 num_loss_counted_tokens: 4492
total tokens: 7998 num samples: 31 num padding tokens: 3034 - rank: 7 max len: 258 min len: 74 avg len: 160.1290322580645 num_loss_counted_tokens: 2091
total tokens: 8090 num samples: 10 num padding tokens: 409 - rank: 4 max len: 809 min len: 692 avg len: 768.1 num_loss_counted_tokens: 5031
total tokens: 7740 num samples: 18 num padding tokens: 1725 - rank: 6 max len: 430 min len: 265 avg len: 334.1666666666667 num_loss_counted_tokens: 3515
total tokens: 8080 num samples: 5 num padding tokens: 765 - rank: 1 max len: 1616 min len: 1290 avg len: 1463.0 num_loss_counted_tokens: 2912
total tokens: 7904 num samples: 32 num padding tokens: 2758 - rank: 7 max len: 247 min len: 84 avg len: 160.8125 num_loss_counted_tokens: 2044
total tokens: 7461 num samples: 9 num padding tokens: 766 - rank: 4 max len: 829 min len: 675 avg len: 743.8888888888889 num_loss_counted_tokens: 4179
total tokens: 6168 num samples: 2 num padding tokens: 447 - rank: 1 max len: 3084 min len: 2637 avg len: 2860.5 num_loss_counted_tokens: 177
total tokens: 7395 num samples: 29 num padding tokens: 2459 - rank: 7 max len: 255 min len: 81 avg len: 170.20689655172413 num_loss_counted_tokens: 1981
total tokens: 6830 num samples: 5 num padding tokens: 395 - rank: 1 max len: 1366 min len: 1223 avg len: 1287.0 num_loss_counted_tokens: 2516
total tokens: 7786 num samples: 17 num padding tokens: 1200 - rank: 6 max len: 458 min len: 290 avg len: 387.4117647058824 num_loss_counted_tokens: 3600
total tokens: 6888 num samples: 24 num padding tokens: 2293 - rank: 7 max len: 287 min len: 81 avg len: 191.45833333333334 num_loss_counted_tokens: 2153
total tokens: 7627 num samples: 29 num padding tokens: 2769 - rank: 7 max len: 263 min len: 78 avg len: 167.51724137931035 num_loss_counted_tokens: 2159
total tokens: 5475 num samples: 25 num padding tokens: 1896 - rank: 7 max len: 219 min len: 81 avg len: 143.16 num_loss_counted_tokens: 1372
total tokens: 6916 num samples: 28 num padding tokens: 2558 - rank: 7 max len: 247 min len: 77 avg len: 155.64285714285714 num_loss_counted_tokens: 1696
total tokens: 7304 num samples: 8 num padding tokens: 673 - rank: 4 max len: 913 min len: 718 avg len: 828.875 num_loss_counted_tokens: 4366
total tokens: 7950 num samples: 10 num padding tokens: 340 - rank: 4 max len: 795 min len: 724 avg len: 761.0 num_loss_counted_tokens: 5963
total tokens: 7964 num samples: 11 num padding tokens: 504 - rank: 4 max len: 724 min len: 630 avg len: 678.1818181818181 num_loss_counted_tokens: 4558
total tokens: 7410 num samples: 30 num padding tokens: 2942 - rank: 7 max len: 247 min len: 75 avg len: 148.93333333333334 num_loss_counted_tokens: 1669
total tokens: 7630 num samples: 7 num padding tokens: 606 - rank: 4 max len: 1090 min len: 831 avg len: 1003.4285714285714 num_loss_counted_tokens: 3943
total tokens: 7368 num samples: 4 num padding tokens: 761 - rank: 2 max len: 1842 min len: 1539 avg len: 1651.75 num_loss_counted_tokens: 2748
total tokens: 7596 num samples: 6 num padding tokens: 1064 - rank: 2 max len: 1266 min len: 985 avg len: 1088.6666666666667 num_loss_counted_tokens: 3149
total tokens: 7623 num samples: 11 num padding tokens: 1032 - rank: 5 max len: 693 min len: 517 avg len: 599.1818181818181 num_loss_counted_tokens: 4483
total tokens: 7410 num samples: 10 num padding tokens: 803 - rank: 5 max len: 741 min len: 563 avg len: 660.7 num_loss_counted_tokens: 5017
total tokens: 7969 num samples: 13 num padding tokens: 1180 - rank: 5 max len: 613 min len: 448 avg len: 522.2307692307693 num_loss_counted_tokens: 4157
total tokens: 7596 num samples: 9 num padding tokens: 893 - rank: 5 max len: 844 min len: 632 avg len: 744.7777777777778 num_loss_counted_tokens: 4339
total tokens: 7044 num samples: 4 num padding tokens: 776 - rank: 2 max len: 1761 min len: 1377 avg len: 1567.0 num_loss_counted_tokens: 1196
total tokens: 7678 num samples: 11 num padding tokens: 1417 - rank: 5 max len: 698 min len: 476 avg len: 569.1818181818181 num_loss_counted_tokens: 4381
total tokens: 7576 num samples: 4 num padding tokens: 2237 - rank: 2 max len: 1894 min len: 1097 avg len: 1334.75 num_loss_counted_tokens: 2139
total tokens: 7656 num samples: 11 num padding tokens: 1011 - rank: 5 max len: 696 min len: 496 avg len: 604.0909090909091 num_loss_counted_tokens: 4175
total tokens: 7204 num samples: 4 num padding tokens: 443 - rank: 2 max len: 1801 min len: 1576 avg len: 1690.25 num_loss_counted_tokens: 2660
total tokens: 8073 num samples: 13 num padding tokens: 925 - rank: 5 max len: 621 min len: 461 avg len: 549.8461538461538 num_loss_counted_tokens: 5127
total tokens: 8076 num samples: 4 num padding tokens: 1351 - rank: 2 max len: 2019 min len: 1214 avg len: 1681.25 num_loss_counted_tokens: 936
total tokens: 7032 num samples: 6 num padding tokens: 478 - rank: 2 max len: 1172 min len: 1016 avg len: 1092.3333333333333 num_loss_counted_tokens: 3008
total tokens: 7917 num samples: 13 num padding tokens: 1158 - rank: 5 max len: 609 min len: 429 avg len: 519.9230769230769 num_loss_counted_tokens: 4603
total tokens: 7329 num samples: 7 num padding tokens: 546 - rank: 2 max len: 1047 min len: 911 avg len: 969.0 num_loss_counted_tokens: 3492
total tokens: 7020 num samples: 5 num padding tokens: 506 - rank: 2 max len: 1404 min len: 1134 avg len: 1302.8 num_loss_counted_tokens: 3834
total tokens: 7692 num samples: 6 num padding tokens: 974 - rank: 2 max len: 1282 min len: 958 avg len: 1119.6666666666667 num_loss_counted_tokens: 3913
total tokens: 7667 num samples: 11 num padding tokens: 908 - rank: 5 max len: 697 min len: 509 avg len: 614.4545454545455 num_loss_counted_tokens: 3445
total tokens: 8099 num samples: 13 num padding tokens: 1144 - rank: 5 max len: 623 min len: 461 avg len: 535.0 num_loss_counted_tokens: 4390
total tokens: 5862 num samples: 2 num padding tokens: 136 - rank: 0 max len: 2931 min len: 2795 avg len: 2863.0 num_loss_counted_tokens: 203
total tokens: 5944 num samples: 2 num padding tokens: 101 - rank: 0 max len: 2972 min len: 2871 avg len: 2921.5 num_loss_counted_tokens: 163
total tokens: 6814 num samples: 2 num padding tokens: 689 - rank: 0 max len: 3407 min len: 2718 avg len: 3062.5 num_loss_counted_tokens: 1104
total tokens: 7872 num samples: 12 num padding tokens: 1453 - rank: 5 max len: 656 min len: 432 avg len: 534.9166666666666 num_loss_counted_tokens: 5009
total tokens: 5966 num samples: 2 num padding tokens: 300 - rank: 0 max len: 2983 min len: 2683 avg len: 2833.0 num_loss_counted_tokens: 223
total tokens: 7100 num samples: 5 num padding tokens: 1259 - rank: 3 max len: 1420 min len: 1008 avg len: 1168.2 num_loss_counted_tokens: 3506
total tokens: 7858 num samples: 2 num padding tokens: 1075 - rank: 0 max len: 3929 min len: 2854 avg len: 3391.5 num_loss_counted_tokens: 419
total tokens: 6586 num samples: 2 num padding tokens: 534 - rank: 0 max len: 3293 min len: 2759 avg len: 3026.0 num_loss_counted_tokens: 208
total tokens: 6802 num samples: 2 num padding tokens: 582 - rank: 0 max len: 3401 min len: 2819 avg len: 3110.0 num_loss_counted_tokens: 197
total tokens: 8076 num samples: 12 num padding tokens: 941 - rank: 5 max len: 673 min len: 519 avg len: 594.5833333333334 num_loss_counted_tokens: 4341
total tokens: 8021 num samples: 13 num padding tokens: 958 - rank: 5 max len: 617 min len: 492 avg len: 543.3076923076923 num_loss_counted_tokens: 5546
total tokens: 7404 num samples: 6 num padding tokens: 691 - rank: 2 max len: 1234 min len: 1037 avg len: 1118.8333333333333 num_loss_counted_tokens: 4452
total tokens: 6990 num samples: 6 num padding tokens: 455 - rank: 3 max len: 1165 min len: 1010 avg len: 1089.1666666666667 num_loss_counted_tokens: 2326
total tokens: 6835 num samples: 5 num padding tokens: 883 - rank: 2 max len: 1367 min len: 1028 avg len: 1190.4 num_loss_counted_tokens: 1769
total tokens: 6480 num samples: 3 num padding tokens: 1330 - rank: 2 max len: 2160 min len: 1455 avg len: 1716.6666666666667 num_loss_counted_tokens: 3402
total tokens: 7852 num samples: 13 num padding tokens: 1008 - rank: 5 max len: 604 min len: 462 avg len: 526.4615384615385 num_loss_counted_tokens: 4343
total tokens: 6516 num samples: 4 num padding tokens: 643 - rank: 2 max len: 1629 min len: 1320 avg len: 1468.25 num_loss_counted_tokens: 3036
total tokens: 7765 num samples: 5 num padding tokens: 937 - rank: 2 max len: 1553 min len: 1107 avg len: 1365.6 num_loss_counted_tokens: 2840
total tokens: 7014 num samples: 6 num padding tokens: 643 - rank: 2 max len: 1169 min len: 980 avg len: 1061.8333333333333 num_loss_counted_tokens: 3641
total tokens: 8106 num samples: 7 num padding tokens: 448 - rank: 2 max len: 1158 min len: 1032 avg len: 1094.0 num_loss_counted_tokens: 3634
total tokens: 7014 num samples: 3 num padding tokens: 944 - rank: 0 max len: 2338 min len: 1749 avg len: 2023.3333333333333 num_loss_counted_tokens: 1782
total tokens: 7696 num samples: 8 num padding tokens: 623 - rank: 3 max len: 962 min len: 803 avg len: 884.125 num_loss_counted_tokens: 4778
total tokens: 7618 num samples: 13 num padding tokens: 886 - rank: 5 max len: 586 min len: 437 avg len: 517.8461538461538 num_loss_counted_tokens: 3929
total tokens: 7155 num samples: 5 num padding tokens: 513 - rank: 3 max len: 1431 min len: 1168 avg len: 1328.4 num_loss_counted_tokens: 3100
total tokens: 7350 num samples: 7 num padding tokens: 751 - rank: 3 max len: 1050 min len: 873 avg len: 942.7142857142857 num_loss_counted_tokens: 4426
total tokens: 7744 num samples: 8 num padding tokens: 283 - rank: 3 max len: 968 min len: 869 avg len: 932.625 num_loss_counted_tokens: 4872
total tokens: 5448 num samples: 2 num padding tokens: 825 - rank: 0 max len: 2724 min len: 1899 avg len: 2311.5 num_loss_counted_tokens: 314
total tokens: 7836 num samples: 6 num padding tokens: 1317 - rank: 3 max len: 1306 min len: 968 avg len: 1086.5 num_loss_counted_tokens: 4937
total tokens: 7854 num samples: 11 num padding tokens: 871 - rank: 5 max len: 714 min len: 532 avg len: 634.8181818181819 num_loss_counted_tokens: 4452
total tokens: 7788 num samples: 3 num padding tokens: 965 - rank: 0 max len: 2596 min len: 1888 avg len: 2274.3333333333335 num_loss_counted_tokens: 304
total tokens: 5614 num samples: 2 num padding tokens: 364 - rank: 0 max len: 2807 min len: 2443 avg len: 2625.0 num_loss_counted_tokens: 241
total tokens: 7086 num samples: 3 num padding tokens: 1107 - rank: 0 max len: 2362 min len: 1776 avg len: 1993.0 num_loss_counted_tokens: 301
total tokens: 8037 num samples: 9 num padding tokens: 856 - rank: 3 max len: 893 min len: 698 avg len: 797.8888888888889 num_loss_counted_tokens: 6093
total tokens: 5792 num samples: 2 num padding tokens: 18 - rank: 0 max len: 2896 min len: 2878 avg len: 2887.0 num_loss_counted_tokens: 176
total tokens: 6306 num samples: 2 num padding tokens: 290 - rank: 0 max len: 3153 min len: 2863 avg len: 3008.0 num_loss_counted_tokens: 181
total tokens: 7942 num samples: 11 num padding tokens: 1739 - rank: 5 max len: 722 min len: 430 avg len: 563.9090909090909 num_loss_counted_tokens: 2898
total tokens: 7796 num samples: 4 num padding tokens: 715 - rank: 0 max len: 1949 min len: 1412 avg len: 1770.25 num_loss_counted_tokens: 1543
total tokens: 7984 num samples: 8 num padding tokens: 514 - rank: 3 max len: 998 min len: 886 avg len: 933.75 num_loss_counted_tokens: 3889
total tokens: 7248 num samples: 8 num padding tokens: 747 - rank: 3 max len: 906 min len: 734 avg len: 812.625 num_loss_counted_tokens: 5216
total tokens: 6489 num samples: 3 num padding tokens: 267 - rank: 0 max len: 2163 min len: 1989 avg len: 2074.0 num_loss_counted_tokens: 774
total tokens: 8000 num samples: 8 num padding tokens: 939 - rank: 3 max len: 1000 min len: 809 avg len: 882.625 num_loss_counted_tokens: 4397
total tokens: 7146 num samples: 6 num padding tokens: 436 - rank: 3 max len: 1191 min len: 1062 avg len: 1118.3333333333333 num_loss_counted_tokens: 3632
total tokens: 7693 num samples: 7 num padding tokens: 1196 - rank: 3 max len: 1099 min len: 833 avg len: 928.1428571428571 num_loss_counted_tokens: 4428
total tokens: 7592 num samples: 8 num padding tokens: 548 - rank: 3 max len: 949 min len: 799 avg len: 880.5 num_loss_counted_tokens: 5791
total tokens: 8064 num samples: 8 num padding tokens: 1193 - rank: 3 max len: 1008 min len: 741 avg len: 858.875 num_loss_counted_tokens: 5624
total tokens: 7960 num samples: 8 num padding tokens: 519 - rank: 3 max len: 995 min len: 797 avg len: 930.125 num_loss_counted_tokens: 4667
total tokens: 7140 num samples: 5 num padding tokens: 805 - rank: 3 max len: 1428 min len: 1157 avg len: 1267.0 num_loss_counted_tokens: 2481
total tokens: 7376 num samples: 2 num padding tokens: 122 - rank: 0 max len: 3688 min len: 3566 avg len: 3627.0 num_loss_counted_tokens: 334
Per-token loss scaled by world size: 2.2650606297247577e-06
Per-token loss scaled by world size: 0.0005290773115120828
Per-token loss scaled by world size: 0.00031778833363205194
Per-token loss scaled by world size: 0.0002596491831354797
Per-token loss scaled by world size: 0.00032042598468251526
Per-token loss scaled by world size: 0.00037021367461420596
Per-token loss scaled by world size: 3.6662072488979902e-06
Epoch: 0, Step: 1, Rank: 3, loss = 0.8319301605224609
Epoch: 0, Step: 1, Rank: 2, loss = 0.6797291040420532
Epoch: 0, Step: 1, Rank: 5, loss = 1.3850582838058472
Epoch: 0, Step: 1, Rank: 7, loss = 0.8388351798057556
Epoch: 0, Step: 1, Rank: 1, loss = 0.005929645616561174
Epoch: 0, Step: 1, Rank: 4, loss = 0.9691731333732605
Epoch: 0, Step: 1, Rank: 0, loss = 0.009597672149538994
Per-token loss scaled by world size: 0.0004498241178225726
Epoch: 0, Step: 1, Rank: 6, loss = 1.1775833368301392
Epoch 0: 1%| | 1/121 [00:03<06:45, 3.38s/it]
{
"epoch": 0,
"step": 1,
"rank": 0,
"loss": 0.009597672149538994,
"overall_throughput": 35.709783908823596,
"lr": 0.0,
"cuda_mem_allocated": 17.990560054779053,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20943,
"batch_size": 70,
"total_loss": 0.737229585647583,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:05.199538"
}
total tokens: 6234 num samples: 3 num padding tokens: 200 - rank: 2 max len: 2078 min len: 1890 avg len: 2011.3333333333333 num_loss_counted_tokens: 1341
total tokens: 5714 num samples: 2 num padding tokens: 5 - rank: 0 max len: 2857 min len: 2852 avg len: 2854.5 num_loss_counted_tokens: 145
total tokens: 7392 num samples: 8 num padding tokens: 1326 - rank: 5 max len: 924 min len: 561 avg len: 758.25 num_loss_counted_tokens: 3044
total tokens: 7340 num samples: 4 num padding tokens: 699 - rank: 3 max len: 1835 min len: 1455 avg len: 1660.25 num_loss_counted_tokens: 2623
total tokens: 7627 num samples: 29 num padding tokens: 3037 - rank: 7 max len: 263 min len: 77 avg len: 158.27586206896552 num_loss_counted_tokens: 1957
total tokens: 7242 num samples: 3 num padding tokens: 171 - rank: 1 max len: 2414 min len: 2311 avg len: 2357.0 num_loss_counted_tokens: 254
total tokens: 7125 num samples: 5 num padding tokens: 1031 - rank: 4 max len: 1425 min len: 945 avg len: 1218.8 num_loss_counted_tokens: 3947
total tokens: 8025 num samples: 15 num padding tokens: 2508 - rank: 6 max len: 535 min len: 266 avg len: 367.8 num_loss_counted_tokens: 3113
Per-token loss scaled by world size: 0.00031914791907183826
Per-token loss scaled by world size: 0.0003141801571473479
Per-token loss scaled by world size: 0.0003882426244672388
Per-token loss scaled by world size: 0.00020227984350640327
Per-token loss scaled by world size: 5.1077040552627295e-05
Per-token loss scaled by world size: 5.200964369578287e-05
Per-token loss scaled by world size: 0.0002763153170235455
Epoch: 0, Step: 2, Rank: 4, loss = 0.963906466960907
Epoch: 0, Step: 2, Rank: 3, loss = 0.9489026069641113
Epoch: 0, Step: 2, Rank: 2, loss = 0.6109356880187988
Epoch: 0, Step: 2, Rank: 5, loss = 1.1725897789001465
Epoch: 0, Step: 2, Rank: 0, loss = 0.15426543354988098
Epoch: 0, Step: 2, Rank: 7, loss = 0.8345413208007812
Per-token loss scaled by world size: 0.0004248657787684351
Epoch: 0, Step: 2, Rank: 1, loss = 0.15708212554454803
Epoch: 0, Step: 2, Rank: 6, loss = 1.2832008600234985
Epoch 0: 2%|▏ | 2/121 [00:05<05:38, 2.85s/it]
total tokens: 7986 num samples: 11 num padding tokens: 734 - rank: 4 max len: 726 min len: 605 avg len: 659.2727272727273 num_loss_counted_tokens: 4132
{
"epoch": 0,
"step": 2,
"rank": 0,
"loss": 0.15426543354988098,
"overall_throughput": 43.2007637745835,
"lr": 0.0,
"cuda_mem_allocated": 18.104323863983154,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24162,
"batch_size": 93,
"total_loss": 0.7656780481338501,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:07.698724"
}
total tokens: 7735 num samples: 17 num padding tokens: 1736 - rank: 6 max len: 455 min len: 245 avg len: 352.88235294117646 num_loss_counted_tokens: 3396
total tokens: 5966 num samples: 2 num padding tokens: 555 - rank: 0 max len: 2983 min len: 2428 avg len: 2705.5 num_loss_counted_tokens: 181
total tokens: 7852 num samples: 13 num padding tokens: 1205 - rank: 5 max len: 604 min len: 464 avg len: 511.3076923076923 num_loss_counted_tokens: 4550
total tokens: 7928 num samples: 4 num padding tokens: 535 - rank: 1 max len: 1982 min len: 1721 avg len: 1848.25 num_loss_counted_tokens: 2525
total tokens: 7836 num samples: 6 num padding tokens: 1664 - rank: 2 max len: 1306 min len: 926 avg len: 1028.6666666666667 num_loss_counted_tokens: 3620
total tokens: 7821 num samples: 9 num padding tokens: 747 - rank: 3 max len: 869 min len: 729 avg len: 786.0 num_loss_counted_tokens: 5513
total tokens: 4598 num samples: 19 num padding tokens: 1719 - rank: 7 max len: 242 min len: 75 avg len: 151.52631578947367 num_loss_counted_tokens: 1068
Per-token loss scaled by world size: 0.00018360439571551979
Per-token loss scaled by world size: 0.0003279669035691768
Per-token loss scaled by world size: 2.2890385480422992e-06
Per-token loss scaled by world size: 6.500220479210839e-05
Per-token loss scaled by world size: 0.00032116335933096707
Per-token loss scaled by world size: 0.00036416525836102664
Per-token loss scaled by world size: 0.0005080102127976716
Epoch: 0, Step: 3, Rank: 5, loss = 0.84148108959198
Epoch: 0, Step: 3, Rank: 3, loss = 0.47108298540115356
Epoch: 0, Step: 3, Rank: 1, loss = 0.16677941381931305
Epoch: 0, Step: 3, Rank: 4, loss = 0.8240249156951904
Epoch: 0, Step: 3, Rank: 0, loss = 0.005873100366443396
Epoch: 0, Step: 3, Rank: 6, loss = 1.3034272193908691
Per-token loss scaled by world size: 7.88167308201082e-05
Epoch: 0, Step: 3, Rank: 7, loss = 0.9343570470809937
Epoch: 0, Step: 3, Rank: 2, loss = 0.2022240310907364
Epoch 0: 2%|▏ | 3/121 [00:08<05:19, 2.70s/it]
{
"epoch": 0,
"step": 3,
"rank": 0,
"loss": 0.005873100366443396,
"overall_throughput": 42.42993987932287,
"lr": 0.0,
"cuda_mem_allocated": 18.00035810470581,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20526,
"batch_size": 75,
"total_loss": 0.5936562418937683,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:10.210211"
}
total tokens: 6324 num samples: 2 num padding tokens: 135 - rank: 1 max len: 3162 min len: 3027 avg len: 3094.5 num_loss_counted_tokens: 175
total tokens: 7644 num samples: 7 num padding tokens: 892 - rank: 4 max len: 1092 min len: 825 avg len: 964.5714285714286 num_loss_counted_tokens: 4646
total tokens: 7192 num samples: 2 num padding tokens: 318 - rank: 0 max len: 3596 min len: 3278 avg len: 3437.0 num_loss_counted_tokens: 213
total tokens: 7620 num samples: 10 num padding tokens: 1177 - rank: 5 max len: 762 min len: 501 avg len: 644.3 num_loss_counted_tokens: 4818
total tokens: 7776 num samples: 16 num padding tokens: 1778 - rank: 6 max len: 486 min len: 269 avg len: 374.875 num_loss_counted_tokens: 3462
total tokens: 8060 num samples: 31 num padding tokens: 2946 - rank: 7 max len: 260 min len: 79 avg len: 164.96774193548387 num_loss_counted_tokens: 2135
total tokens: 8095 num samples: 5 num padding tokens: 1374 - rank: 3 max len: 1619 min len: 1120 avg len: 1344.2 num_loss_counted_tokens: 2765
total tokens: 7320 num samples: 3 num padding tokens: 432 - rank: 2 max len: 2440 min len: 2027 avg len: 2296.0 num_loss_counted_tokens: 857
Per-token loss scaled by world size: 0.0004511360311880708
Per-token loss scaled by world size: 0.0004869260301347822
Per-token loss scaled by world size: 4.640718543669209e-05
Per-token loss scaled by world size: 8.355799946002662e-05
Per-token loss scaled by world size: 6.561249392689206e-06
Per-token loss scaled by world size: 0.00017396389739587903
Epoch: 0, Step: 4, Rank: 1, loss = 0.12441766262054443
Epoch: 0, Step: 4, Rank: 6, loss = 1.2094956636428833
Epoch: 0, Step: 4, Rank: 5, loss = 1.3054486513137817
Epoch: 0, Step: 4, Rank: 2, loss = 0.22401900589466095
Epoch: 0, Step: 4, Rank: 0, loss = 0.017590709030628204
Per-token loss scaled by world size: 0.00043431558879092336
Epoch: 0, Step: 4, Rank: 7, loss = 0.4663971960544586
Per-token loss scaled by world size: 0.00029926959541626275
Epoch: 0, Step: 4, Rank: 4, loss = 1.1644001007080078
Epoch: 0, Step: 4, Rank: 3, loss = 0.8023418188095093
Epoch 0: 3%|▎ | 4/121 [00:10<05:08, 2.63s/it]
total tokens: 7940 num samples: 10 num padding tokens: 987 - rank: 4 max len: 794 min len: 627 avg len: 695.3 num_loss_counted_tokens: 4306
{
"epoch": 0,
"step": 4,
"rank": 0,
"loss": 0.017590709030628204,
"overall_throughput": 42.48427743949919,
"lr": 0.0,
"cuda_mem_allocated": 18.00298833847046,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21448,
"batch_size": 75,
"total_loss": 0.664263904094696,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:12.736014"
}
total tokens: 7260 num samples: 4 num padding tokens: 876 - rank: 1 max len: 1815 min len: 1442 avg len: 1596.0 num_loss_counted_tokens: 1808
total tokens: 8097 num samples: 3 num padding tokens: 887 - rank: 0 max len: 2699 min len: 2231 avg len: 2403.3333333333335 num_loss_counted_tokens: 260
total tokens: 7776 num samples: 32 num padding tokens: 2768 - rank: 7 max len: 243 min len: 77 avg len: 156.5 num_loss_counted_tokens: 1979
total tokens: 6895 num samples: 5 num padding tokens: 531 - rank: 2 max len: 1379 min len: 1160 avg len: 1272.8 num_loss_counted_tokens: 3055
total tokens: 7923 num samples: 19 num padding tokens: 1608 - rank: 6 max len: 417 min len: 252 avg len: 332.36842105263156 num_loss_counted_tokens: 3042
total tokens: 7512 num samples: 8 num padding tokens: 560 - rank: 3 max len: 939 min len: 795 avg len: 869.0 num_loss_counted_tokens: 4478
total tokens: 8047 num samples: 13 num padding tokens: 1167 - rank: 5 max len: 619 min len: 439 avg len: 529.2307692307693 num_loss_counted_tokens: 4239
Per-token loss scaled by world size: 0.00024933897657319903
Per-token loss scaled by world size: 0.000386894796974957
Per-token loss scaled by world size: 0.00021959797595627606
Per-token loss scaled by world size: 3.401555431992165e-06
Per-token loss scaled by world size: 5.781253548775567e-06
Per-token loss scaled by world size: 0.00047684554010629654
Per-token loss scaled by world size: 0.0002837673237081617
Epoch: 0, Step: 5, Rank: 4, loss = 1.0060231685638428
Epoch: 0, Step: 5, Rank: 2, loss = 0.6483436822891235
Epoch: 0, Step: 5, Rank: 0, loss = 0.008844894357025623
Epoch: 0, Step: 5, Rank: 3, loss = 0.571009635925293
Epoch: 0, Step: 5, Rank: 1, loss = 0.015032704919576645
Epoch: 0, Step: 5, Rank: 5, loss = 1.2399176359176636
Epoch: 0, Step: 5, Rank: 7, loss = 0.7378659844398499
Per-token loss scaled by world size: 0.00046695370110683143
Epoch: 0, Step: 5, Rank: 6, loss = 1.2141963243484497
Epoch 0: 4%|▍ | 5/121 [00:13<04:59, 2.58s/it]
total tokens: 7651 num samples: 7 num padding tokens: 746 - rank: 4 max len: 1093 min len: 866 avg len: 986.4285714285714 num_loss_counted_tokens: 5678
{
"epoch": 0,
"step": 5,
"rank": 0,
"loss": 0.008844894357025623,
"overall_throughput": 43.14041038651036,
"lr": 0.0,
"cuda_mem_allocated": 18.102890491485596,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20802,
"batch_size": 80,
"total_loss": 0.6801542043685913,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:15.218258"
}
total tokens: 7038 num samples: 2 num padding tokens: 1091 - rank: 1 max len: 3519 min len: 2428 avg len: 2973.5 num_loss_counted_tokens: 159
total tokens: 8028 num samples: 2 num padding tokens: 123 - rank: 0 max len: 4014 min len: 3891 avg len: 3952.5 num_loss_counted_tokens: 168
total tokens: 7596 num samples: 12 num padding tokens: 1879 - rank: 6 max len: 633 min len: 336 avg len: 476.4166666666667 num_loss_counted_tokens: 3821
total tokens: 8064 num samples: 24 num padding tokens: 3562 - rank: 7 max len: 336 min len: 89 avg len: 187.58333333333334 num_loss_counted_tokens: 1860
total tokens: 7970 num samples: 5 num padding tokens: 726 - rank: 3 max len: 1594 min len: 1187 avg len: 1448.8 num_loss_counted_tokens: 2543
total tokens: 7890 num samples: 10 num padding tokens: 666 - rank: 5 max len: 789 min len: 634 avg len: 722.4 num_loss_counted_tokens: 3801
total tokens: 7000 num samples: 4 num padding tokens: 270 - rank: 2 max len: 1750 min len: 1627 avg len: 1682.5 num_loss_counted_tokens: 759
Per-token loss scaled by world size: 0.0001642795541556552
Per-token loss scaled by world size: 0.00021280848886817694
Per-token loss scaled by world size: 0.00032824286608956754
Per-token loss scaled by world size: 8.065341717156116e-06
Per-token loss scaled by world size: 3.2945732527878135e-05
Per-token loss scaled by world size: 0.00023678457364439964
Per-token loss scaled by world size: 0.00048681392217986286
Epoch: 0, Step: 6, Rank: 2, loss = 0.5886548757553101
Epoch: 0, Step: 6, Rank: 4, loss = 0.907960832118988
Epoch: 0, Step: 6, Rank: 3, loss = 0.45441779494285583
Epoch: 0, Step: 6, Rank: 0, loss = 0.022309742867946625
Epoch: 0, Step: 6, Rank: 1, loss = 0.0911320149898529
Epoch: 0, Step: 6, Rank: 6, loss = 1.346588134765625
Epoch: 0, Step: 6, Rank: 7, loss = 0.6549757122993469
Per-token loss scaled by world size: 0.0005791043513454497
Epoch: 0, Step: 6, Rank: 5, loss = 1.6018750667572021
Epoch 0: 5%|▍ | 6/121 [00:15<04:56, 2.58s/it]
total tokens: 7911 num samples: 9 num padding tokens: 476 - rank: 4 max len: 879 min len: 707 avg len: 826.1111111111111 num_loss_counted_tokens: 6136
total tokens: 6741 num samples: 3 num padding tokens: 433 - rank: 1 max len: 2247 min len: 1894 avg len: 2102.6666666666665 num_loss_counted_tokens: 1931
total tokens: 7502 num samples: 11 num padding tokens: 1303 - rank: 5 max len: 682 min len: 424 avg len: 563.5454545454545 num_loss_counted_tokens: 3560
{
"epoch": 0,
"step": 6,
"rank": 0,
"loss": 0.022309742867946625,
"overall_throughput": 41.65718757774177,
"lr": 0.0,
"cuda_mem_allocated": 18.077077388763428,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22129,
"batch_size": 78,
"total_loss": 0.7084892988204956,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:17.804745"
}
total tokens: 8020 num samples: 20 num padding tokens: 1280 - rank: 6 max len: 401 min len: 273 avg len: 337.0 num_loss_counted_tokens: 3403
total tokens: 7714 num samples: 29 num padding tokens: 2377 - rank: 7 max len: 266 min len: 79 avg len: 184.0344827586207 num_loss_counted_tokens: 2540
total tokens: 6716 num samples: 4 num padding tokens: 1107 - rank: 2 max len: 1679 min len: 1232 avg len: 1402.25 num_loss_counted_tokens: 2407
total tokens: 6960 num samples: 6 num padding tokens: 657 - rank: 3 max len: 1160 min len: 960 avg len: 1050.5 num_loss_counted_tokens: 5018
total tokens: 6996 num samples: 2 num padding tokens: 327 - rank: 0 max len: 3498 min len: 3171 avg len: 3334.5 num_loss_counted_tokens: 153
Per-token loss scaled by world size: 0.0002843443362507969
Per-token loss scaled by world size: 0.00017875904450193048
Per-token loss scaled by world size: 0.00013562251115217805
Per-token loss scaled by world size: 0.0001140675667556934
Per-token loss scaled by world size: 0.00023188847990240902
Per-token loss scaled by world size: 0.00043197604827582836
Per-token loss scaled by world size: 0.00019801303278654814
Epoch: 0, Step: 7, Rank: 6, loss = 0.8738256692886353
Epoch: 0, Step: 7, Rank: 4, loss = 0.5493488907814026
Epoch: 0, Step: 7, Rank: 0, loss = 0.4167849123477936
Epoch: 0, Step: 7, Rank: 7, loss = 0.7126222848892212
Epoch: 0, Step: 7, Rank: 1, loss = 0.35054388642311096
Epoch: 0, Step: 7, Rank: 2, loss = 0.6085187792778015
Epoch: 0, Step: 7, Rank: 5, loss = 1.3275164365768433
Per-token loss scaled by world size: 0.00021533554536290467
Epoch: 0, Step: 7, Rank: 3, loss = 0.6617530584335327
Epoch 0: 6%|▌ | 7/121 [00:18<04:50, 2.55s/it]
total tokens: 7806 num samples: 3 num padding tokens: 966 - rank: 1 max len: 2602 min len: 1917 avg len: 2280.0 num_loss_counted_tokens: 944
total tokens: 7488 num samples: 9 num padding tokens: 542 - rank: 4 max len: 832 min len: 711 avg len: 771.7777777777778 num_loss_counted_tokens: 3897
{
"epoch": 0,
"step": 7,
"rank": 0,
"loss": 0.4167849123477936,
"overall_throughput": 43.4992160904002,
"lr": 0.0,
"cuda_mem_allocated": 18.127665996551514,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24585,
"batch_size": 88,
"total_loss": 0.6876142621040344,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:20.296605"
}
total tokens: 7400 num samples: 4 num padding tokens: 570 - rank: 2 max len: 1850 min len: 1579 avg len: 1707.5 num_loss_counted_tokens: 1604
total tokens: 7920 num samples: 16 num padding tokens: 2028 - rank: 6 max len: 495 min len: 286 avg len: 368.25 num_loss_counted_tokens: 3190
total tokens: 7140 num samples: 6 num padding tokens: 1155 - rank: 3 max len: 1190 min len: 845 avg len: 997.5 num_loss_counted_tokens: 3914
total tokens: 7424 num samples: 2 num padding tokens: 302 - rank: 0 max len: 3712 min len: 3410 avg len: 3561.0 num_loss_counted_tokens: 261
total tokens: 7744 num samples: 11 num padding tokens: 907 - rank: 5 max len: 704 min len: 540 avg len: 621.5454545454545 num_loss_counted_tokens: 4091
total tokens: 7672 num samples: 28 num padding tokens: 2305 - rank: 7 max len: 274 min len: 83 avg len: 191.67857142857142 num_loss_counted_tokens: 2631
Per-token loss scaled by world size: 0.00014226992789190263
Per-token loss scaled by world size: 0.00031214559567160904
Per-token loss scaled by world size: 0.00031845251214690506
Per-token loss scaled by world size: 0.0002571563527453691
Per-token loss scaled by world size: 0.00039170257514342666
Per-token loss scaled by world size: 0.00012162854545749724
Per-token loss scaled by world size: 0.00013610723544843495
Epoch: 0, Step: 8, Rank: 6, loss = 1.0465071201324463
Epoch: 0, Step: 8, Rank: 4, loss = 1.0676518678665161
Epoch: 0, Step: 8, Rank: 2, loss = 0.47697773575782776
Epoch: 0, Step: 8, Rank: 3, loss = 0.8621488213539124
Epoch: 0, Step: 8, Rank: 1, loss = 0.4077748954296112
Epoch: 0, Step: 8, Rank: 5, loss = 1.3132318258285522
Epoch: 0, Step: 8, Rank: 7, loss = 0.4563165009021759
Per-token loss scaled by world size: 2.5762397854123265e-05
Epoch: 0, Step: 8, Rank: 0, loss = 0.08637166023254395
Epoch 0: 7%|▋ | 8/121 [00:21<04:49, 2.56s/it]
total tokens: 6510 num samples: 3 num padding tokens: 1131 - rank: 1 max len: 2170 min len: 1481 avg len: 1793.0 num_loss_counted_tokens: 969
total tokens: 8001 num samples: 9 num padding tokens: 764 - rank: 4 max len: 889 min len: 750 avg len: 804.1111111111111 num_loss_counted_tokens: 5953
total tokens: 6042 num samples: 19 num padding tokens: 2404 - rank: 7 max len: 318 min len: 88 avg len: 191.47368421052633 num_loss_counted_tokens: 1475
{
"epoch": 0,
"step": 8,
"rank": 0,
"loss": 0.08637166023254395,
"overall_throughput": 41.99042053701942,
"lr": 0.0,
"cuda_mem_allocated": 18.229600429534912,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26821,
"batch_size": 90,
"total_loss": 0.7146224975585938,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:22.870728"
}
total tokens: 7525 num samples: 7 num padding tokens: 620 - rank: 3 max len: 1075 min len: 938 avg len: 986.4285714285714 num_loss_counted_tokens: 5075
total tokens: 7710 num samples: 6 num padding tokens: 714 - rank: 2 max len: 1285 min len: 1097 avg len: 1166.0 num_loss_counted_tokens: 4799
total tokens: 7742 num samples: 14 num padding tokens: 1362 - rank: 6 max len: 553 min len: 343 avg len: 455.7142857142857 num_loss_counted_tokens: 3929
total tokens: 6622 num samples: 2 num padding tokens: 836 - rank: 0 max len: 3311 min len: 2475 avg len: 2893.0 num_loss_counted_tokens: 693
total tokens: 8096 num samples: 11 num padding tokens: 746 - rank: 5 max len: 736 min len: 587 avg len: 668.1818181818181 num_loss_counted_tokens: 5289
Per-token loss scaled by world size: 0.00032484609982930124
Per-token loss scaled by world size: 0.000325270724715665
Per-token loss scaled by world size: 0.000487986282678321
Per-token loss scaled by world size: 0.00039277857285924256
Per-token loss scaled by world size: 0.00037055223947390914
Per-token loss scaled by world size: 1.2821310519939288e-06
Per-token loss scaled by world size: 2.2653903215541504e-05
Epoch: 0, Step: 9, Rank: 4, loss = 0.9635331630706787
Epoch: 0, Step: 9, Rank: 6, loss = 1.1635082960128784
Epoch: 0, Step: 9, Rank: 5, loss = 1.4455373287200928
Epoch: 0, Step: 9, Rank: 0, loss = 0.0037979925982654095
Epoch: 0, Step: 9, Rank: 7, loss = 1.0976684093475342
Epoch: 0, Step: 9, Rank: 2, loss = 0.9622753262519836
Epoch: 0, Step: 9, Rank: 1, loss = 0.06710652261972427
Per-token loss scaled by world size: 0.0003276054630987346
Epoch: 0, Step: 9, Rank: 3, loss = 0.9704492688179016
Epoch 0: 7%|▋ | 9/121 [00:23<04:45, 2.55s/it]
total tokens: 7932 num samples: 3 num padding tokens: 1331 - rank: 1 max len: 2644 min len: 1965 avg len: 2200.3333333333335 num_loss_counted_tokens: 863
total tokens: 7389 num samples: 9 num padding tokens: 630 - rank: 4 max len: 821 min len: 683 avg len: 751.0 num_loss_counted_tokens: 4429
{
"epoch": 0,
"step": 9,
"rank": 0,
"loss": 0.0037979925982654095,
"overall_throughput": 42.802430491911245,
"lr": 0.0,
"cuda_mem_allocated": 17.982967853546143,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23698,
"batch_size": 89,
"total_loss": 0.8342345356941223,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:25.401772"
}
total tokens: 8088 num samples: 12 num padding tokens: 982 - rank: 5 max len: 674 min len: 500 avg len: 592.1666666666666 num_loss_counted_tokens: 4684
total tokens: 6798 num samples: 2 num padding tokens: 340 - rank: 0 max len: 3399 min len: 3059 avg len: 3229.0 num_loss_counted_tokens: 156
total tokens: 6642 num samples: 27 num padding tokens: 2019 - rank: 7 max len: 246 min len: 76 avg len: 171.22222222222223 num_loss_counted_tokens: 1990
total tokens: 6668 num samples: 4 num padding tokens: 1622 - rank: 2 max len: 1667 min len: 1049 avg len: 1261.5 num_loss_counted_tokens: 1065
total tokens: 7939 num samples: 17 num padding tokens: 2169 - rank: 6 max len: 467 min len: 248 avg len: 339.4117647058824 num_loss_counted_tokens: 3544
total tokens: 7528 num samples: 8 num padding tokens: 554 - rank: 3 max len: 941 min len: 832 avg len: 871.75 num_loss_counted_tokens: 4720
Per-token loss scaled by world size: 0.00029833969892933965
Per-token loss scaled by world size: 0.0003632043662946671
Per-token loss scaled by world size: 0.00018327771977055818
Per-token loss scaled by world size: 0.00017725562793202698
Per-token loss scaled by world size: 1.4638754691986833e-05
Per-token loss scaled by world size: 0.0002214470150647685
Per-token loss scaled by world size: 5.87268550589215e-06
Epoch: 0, Step: 10, Rank: 2, loss = 0.5432663559913635
Epoch: 0, Step: 10, Rank: 6, loss = 0.9143738746643066
Epoch: 0, Step: 10, Rank: 4, loss = 1.1131759881973267
Epoch: 0, Step: 10, Rank: 3, loss = 0.5617232918739319
Epoch: 0, Step: 10, Rank: 0, loss = 0.01799904741346836
Epoch: 0, Step: 10, Rank: 1, loss = 0.04486595466732979
Epoch: 0, Step: 10, Rank: 7, loss = 0.6787074208259583
Per-token loss scaled by world size: 0.0003915868583135307
Epoch: 0, Step: 10, Rank: 5, loss = 1.200164794921875
Epoch 0: 8%|▊ | 10/121 [00:26<04:41, 2.54s/it]
total tokens: 7056 num samples: 6 num padding tokens: 1539 - rank: 4 max len: 1176 min len: 760 avg len: 919.5 num_loss_counted_tokens: 3752
total tokens: 6318 num samples: 2 num padding tokens: 744 - rank: 1 max len: 3159 min len: 2415 avg len: 2787.0 num_loss_counted_tokens: 641
{
"epoch": 0,
"step": 10,
"rank": 0,
"loss": 0.01799904741346836,
"overall_throughput": 43.20698130216357,
"lr": 0.0,
"cuda_mem_allocated": 17.961437225341797,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24519,
"batch_size": 84,
"total_loss": 0.6342846155166626,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:27.904244"
}
total tokens: 7876 num samples: 11 num padding tokens: 1081 - rank: 5 max len: 716 min len: 528 avg len: 617.7272727272727 num_loss_counted_tokens: 3741
total tokens: 6586 num samples: 2 num padding tokens: 33 - rank: 0 max len: 3293 min len: 3260 avg len: 3276.5 num_loss_counted_tokens: 401
total tokens: 7084 num samples: 4 num padding tokens: 1038 - rank: 3 max len: 1771 min len: 1326 avg len: 1511.5 num_loss_counted_tokens: 2816
total tokens: 7956 num samples: 17 num padding tokens: 1245 - rank: 6 max len: 468 min len: 316 avg len: 394.7647058823529 num_loss_counted_tokens: 3928
total tokens: 6378 num samples: 3 num padding tokens: 279 - rank: 2 max len: 2126 min len: 1873 avg len: 2033.0 num_loss_counted_tokens: 1061
total tokens: 7852 num samples: 26 num padding tokens: 3174 - rank: 7 max len: 302 min len: 84 avg len: 179.92307692307693 num_loss_counted_tokens: 2103
Per-token loss scaled by world size: 0.00014277359878178686
Per-token loss scaled by world size: 0.00017172133084386587
Per-token loss scaled by world size: 0.00032692356035113335
Per-token loss scaled by world size: 0.00021291511075105518
Per-token loss scaled by world size: 0.0002838084474205971
Per-token loss scaled by world size: 8.463160156679805e-06
Per-token loss scaled by world size: 0.0002184695185860619
Epoch: 0, Step: 11, Rank: 5, loss = 1.096379041671753
Epoch: 0, Step: 11, Rank: 0, loss = 0.02838226407766342
Epoch: 0, Step: 11, Rank: 2, loss = 0.5758889317512512
Epoch: 0, Step: 11, Rank: 1, loss = 0.47880908846855164
Epoch: 0, Step: 11, Rank: 6, loss = 0.7140374183654785
Epoch: 0, Step: 11, Rank: 4, loss = 0.9517871141433716
Epoch: 0, Step: 11, Rank: 7, loss = 0.7326648235321045
Per-token loss scaled by world size: 0.00020073003543075174
Epoch: 0, Step: 11, Rank: 3, loss = 0.6731732487678528
Epoch 0: 9%|▉ | 11/121 [00:28<04:39, 2.54s/it]
total tokens: 7000 num samples: 4 num padding tokens: 1372 - rank: 1 max len: 1750 min len: 1261 avg len: 1407.0 num_loss_counted_tokens: 902
total tokens: 7730 num samples: 10 num padding tokens: 533 - rank: 4 max len: 773 min len: 666 avg len: 719.7 num_loss_counted_tokens: 4342
total tokens: 7980 num samples: 12 num padding tokens: 984 - rank: 5 max len: 665 min len: 484 avg len: 583.0 num_loss_counted_tokens: 4069
{
"epoch": 0,
"step": 11,
"rank": 0,
"loss": 0.02838226407766342,
"overall_throughput": 42.203161621451606,
"lr": 0.0,
"cuda_mem_allocated": 18.220097064971924,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26829,
"batch_size": 89,
"total_loss": 0.6563901901245117,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:30.459939"
}
total tokens: 7854 num samples: 17 num padding tokens: 1998 - rank: 6 max len: 462 min len: 265 avg len: 344.47058823529414 num_loss_counted_tokens: 3040
total tokens: 7953 num samples: 3 num padding tokens: 883 - rank: 0 max len: 2651 min len: 1841 avg len: 2356.6666666666665 num_loss_counted_tokens: 501
total tokens: 7992 num samples: 8 num padding tokens: 643 - rank: 3 max len: 999 min len: 846 avg len: 918.625 num_loss_counted_tokens: 5458
total tokens: 8060 num samples: 31 num padding tokens: 2438 - rank: 7 max len: 260 min len: 78 avg len: 181.3548387096774 num_loss_counted_tokens: 2341
total tokens: 7314 num samples: 6 num padding tokens: 570 - rank: 2 max len: 1219 min len: 1012 avg len: 1124.0 num_loss_counted_tokens: 2551
Per-token loss scaled by world size: 0.0002459367678966373
Per-token loss scaled by world size: 0.00032957716030068696
Per-token loss scaled by world size: 2.192043893955997e-06
Per-token loss scaled by world size: 3.339715112815611e-05
Per-token loss scaled by world size: 0.0003242892271373421
Per-token loss scaled by world size: 0.00027953533572144806
Per-token loss scaled by world size: 0.0003221108636353165
Epoch: 0, Step: 12, Rank: 0, loss = 0.005932766944169998
Epoch: 0, Step: 12, Rank: 2, loss = 0.8920005559921265
Epoch: 0, Step: 12, Rank: 1, loss = 0.09038939327001572
Epoch: 0, Step: 12, Rank: 3, loss = 0.6656278967857361
Epoch: 0, Step: 12, Rank: 5, loss = 0.8776887655258179
Epoch: 0, Step: 12, Rank: 4, loss = 0.7565624117851257
Epoch: 0, Step: 12, Rank: 7, loss = 0.8717930316925049
Per-token loss scaled by world size: 0.00047857032041065395
Epoch: 0, Step: 12, Rank: 6, loss = 1.2952505350112915
Epoch 0: 10%|▉ | 12/121 [00:31<04:35, 2.53s/it]
total tokens: 7648 num samples: 8 num padding tokens: 826 - rank: 4 max len: 956 min len: 722 avg len: 852.75 num_loss_counted_tokens: 4969
total tokens: 7680 num samples: 3 num padding tokens: 714 - rank: 1 max len: 2560 min len: 2201 avg len: 2322.0 num_loss_counted_tokens: 995
{
"epoch": 0,
"step": 12,
"rank": 0,
"loss": 0.005932766944169998,
"overall_throughput": 43.402175224914764,
"lr": 0.0,
"cuda_mem_allocated": 17.94186305999756,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21652,
"batch_size": 76,
"total_loss": 0.6819056272506714,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:32.991146"
}
total tokens: 7664 num samples: 16 num padding tokens: 1534 - rank: 6 max len: 479 min len: 276 avg len: 383.125 num_loss_counted_tokens: 3860
total tokens: 8106 num samples: 7 num padding tokens: 750 - rank: 3 max len: 1158 min len: 961 avg len: 1050.857142857143 num_loss_counted_tokens: 5265
total tokens: 7888 num samples: 29 num padding tokens: 2471 - rank: 7 max len: 272 min len: 82 avg len: 186.79310344827587 num_loss_counted_tokens: 2429
total tokens: 7306 num samples: 2 num padding tokens: 864 - rank: 0 max len: 3653 min len: 2789 avg len: 3221.0 num_loss_counted_tokens: 191
total tokens: 6890 num samples: 5 num padding tokens: 518 - rank: 2 max len: 1378 min len: 1170 avg len: 1274.4 num_loss_counted_tokens: 2274
total tokens: 7689 num samples: 11 num padding tokens: 1127 - rank: 5 max len: 699 min len: 483 avg len: 596.5454545454545 num_loss_counted_tokens: 4497
Per-token loss scaled by world size: 0.00013505632523447275
Per-token loss scaled by world size: 0.00034081621561199427
Per-token loss scaled by world size: 0.0004060634528286755
Per-token loss scaled by world size: 0.0003271996683906764
Per-token loss scaled by world size: 2.6962425181409344e-06
Per-token loss scaled by world size: 0.00019702856661751866
Per-token loss scaled by world size: 2.0444547317310935e-06
Epoch: 0, Step: 13, Rank: 2, loss = 0.39850056171417236
Epoch: 0, Step: 13, Rank: 6, loss = 1.0056208372116089
Epoch: 0, Step: 13, Rank: 4, loss = 1.1981409788131714
Epoch: 0, Step: 13, Rank: 3, loss = 0.9654435515403748
Epoch: 0, Step: 13, Rank: 0, loss = 0.007955600507557392
Epoch: 0, Step: 13, Rank: 7, loss = 0.5813574194908142
Epoch: 0, Step: 13, Rank: 1, loss = 0.006032418925315142
Per-token loss scaled by world size: 0.000633390387520194
Epoch: 0, Step: 13, Rank: 5, loss = 1.868897557258606
Epoch 0: 11%|█ | 13/121 [00:33<04:32, 2.53s/it]
total tokens: 8085 num samples: 11 num padding tokens: 649 - rank: 4 max len: 735 min len: 614 avg len: 676.0 num_loss_counted_tokens: 3645
total tokens: 6609 num samples: 3 num padding tokens: 940 - rank: 1 max len: 2203 min len: 1630 avg len: 1889.6666666666667 num_loss_counted_tokens: 239
total tokens: 7854 num samples: 17 num padding tokens: 1445 - rank: 6 max len: 462 min len: 277 avg len: 377.0 num_loss_counted_tokens: 3417
{
"epoch": 0,
"step": 13,
"rank": 0,
"loss": 0.007955600507557392,
"overall_throughput": 42.85566372115263,
"lr": 0.0,
"cuda_mem_allocated": 18.043649196624756,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23605,
"batch_size": 89,
"total_loss": 0.7539936304092407,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:35.510670"
}
total tokens: 7722 num samples: 13 num padding tokens: 1086 - rank: 5 max len: 594 min len: 465 avg len: 510.46153846153845 num_loss_counted_tokens: 3752
total tokens: 4062 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4062 min len: 4062 avg len: 4062.0 num_loss_counted_tokens: 85
total tokens: 7917 num samples: 29 num padding tokens: 2683 - rank: 7 max len: 273 min len: 76 avg len: 180.48275862068965 num_loss_counted_tokens: 2192
total tokens: 8091 num samples: 9 num padding tokens: 685 - rank: 3 max len: 899 min len: 760 avg len: 822.8888888888889 num_loss_counted_tokens: 6014
total tokens: 7815 num samples: 5 num padding tokens: 2132 - rank: 2 max len: 1563 min len: 901 avg len: 1136.6 num_loss_counted_tokens: 3256
Per-token loss scaled by world size: 0.00030266563408076763
Per-token loss scaled by world size: 0.0001793744886526838
Per-token loss scaled by world size: 0.00013343075988814235
Per-token loss scaled by world size: 8.247572259278968e-05
Per-token loss scaled by world size: 0.00023257164866663516
Per-token loss scaled by world size: 0.00025159039068967104
Epoch: 0, Step: 14, Rank: 3, loss = 0.6006578803062439
Epoch: 0, Step: 14, Rank: 5, loss = 1.0135136842727661
Epoch: 0, Step: 14, Rank: 0, loss = 0.2761802673339844
Epoch: 0, Step: 14, Rank: 1, loss = 0.4468095600605011
Per-token loss scaled by world size: 0.0003185720997862518
Epoch: 0, Step: 14, Rank: 4, loss = 0.8424819111824036
Epoch: 0, Step: 14, Rank: 7, loss = 0.7787952423095703
Per-token loss scaled by world size: 9.590814443072304e-05
Epoch: 0, Step: 14, Rank: 6, loss = 1.066778540611267
Epoch: 0, Step: 14, Rank: 2, loss = 0.3211604058742523
Epoch 0: 12%|█▏ | 14/121 [00:36<04:31, 2.54s/it]
total tokens: 6432 num samples: 3 num padding tokens: 621 - rank: 1 max len: 2144 min len: 1779 avg len: 1937.0 num_loss_counted_tokens: 328
total tokens: 7650 num samples: 9 num padding tokens: 968 - rank: 4 max len: 850 min len: 696 avg len: 742.4444444444445 num_loss_counted_tokens: 4090
total tokens: 7776 num samples: 16 num padding tokens: 1817 - rank: 6 max len: 486 min len: 267 avg len: 372.4375 num_loss_counted_tokens: 2945
{
"epoch": 0,
"step": 14,
"rank": 0,
"loss": 0.2761802673339844,
"overall_throughput": 41.914163559432374,
"lr": 0.0,
"cuda_mem_allocated": 18.220842361450195,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26789,
"batch_size": 100,
"total_loss": 0.6682971715927124,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:38.048693"
}
total tokens: 7546 num samples: 11 num padding tokens: 994 - rank: 5 max len: 686 min len: 507 avg len: 595.6363636363636 num_loss_counted_tokens: 4306
total tokens: 7953 num samples: 33 num padding tokens: 2532 - rank: 7 max len: 241 min len: 83 avg len: 164.27272727272728 num_loss_counted_tokens: 2377
total tokens: 7696 num samples: 8 num padding tokens: 273 - rank: 3 max len: 962 min len: 903 avg len: 927.875 num_loss_counted_tokens: 3069
total tokens: 7708 num samples: 2 num padding tokens: 1229 - rank: 0 max len: 3854 min len: 2625 avg len: 3239.5 num_loss_counted_tokens: 176
total tokens: 6624 num samples: 4 num padding tokens: 1021 - rank: 2 max len: 1656 min len: 1051 avg len: 1400.75 num_loss_counted_tokens: 1421
Per-token loss scaled by world size: 0.0004944643005728722
Per-token loss scaled by world size: 0.0002606770722195506
Per-token loss scaled by world size: 0.0004838722525164485
Per-token loss scaled by world size: 5.0423706852598116e-05
Per-token loss scaled by world size: 0.0004016592283733189
Per-token loss scaled by world size: 6.617276085307822e-05
Epoch: 0, Step: 15, Rank: 6, loss = 0.7229878306388855
Epoch: 0, Step: 15, Rank: 5, loss = 1.3713966608047485
Epoch: 0, Step: 15, Rank: 4, loss = 1.3420196771621704
Epoch: 0, Step: 15, Rank: 0, loss = 0.13985015451908112
Epoch: 0, Step: 15, Rank: 1, loss = 0.18353015184402466
Epoch: 0, Step: 15, Rank: 7, loss = 1.1140018701553345
Per-token loss scaled by world size: 8.276257722172886e-05
Per-token loss scaled by world size: 0.000270542484940961
Epoch: 0, Step: 15, Rank: 2, loss = 0.22954201698303223
Epoch: 0, Step: 15, Rank: 3, loss = 0.7503495812416077
Epoch 0: 12%|█▏ | 15/121 [00:38<04:29, 2.54s/it]
total tokens: 7595 num samples: 7 num padding tokens: 570 - rank: 4 max len: 1085 min len: 923 avg len: 1003.5714285714286 num_loss_counted_tokens: 4160
total tokens: 6930 num samples: 2 num padding tokens: 569 - rank: 1 max len: 3465 min len: 2896 avg len: 3180.5 num_loss_counted_tokens: 174
total tokens: 7542 num samples: 9 num padding tokens: 1278 - rank: 5 max len: 838 min len: 611 avg len: 696.0 num_loss_counted_tokens: 3843
total tokens: 8061 num samples: 3 num padding tokens: 917 - rank: 2 max len: 2687 min len: 1952 avg len: 2381.3333333333335 num_loss_counted_tokens: 302
{
"epoch": 0,
"step": 15,
"rank": 0,
"loss": 0.13985015451908112,
"overall_throughput": 42.649491925532274,
"lr": 0.0,
"cuda_mem_allocated": 18.06497097015381,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22188,
"batch_size": 90,
"total_loss": 0.7317097187042236,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:40.589466"
}
total tokens: 4070 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4070 min len: 4070 avg len: 4070.0 num_loss_counted_tokens: 1038
total tokens: 7565 num samples: 5 num padding tokens: 974 - rank: 3 max len: 1513 min len: 1101 avg len: 1318.2 num_loss_counted_tokens: 2867
total tokens: 7852 num samples: 13 num padding tokens: 2212 - rank: 6 max len: 604 min len: 287 avg len: 433.84615384615387 num_loss_counted_tokens: 3284
total tokens: 5720 num samples: 20 num padding tokens: 1667 - rank: 7 max len: 286 min len: 83 avg len: 202.65 num_loss_counted_tokens: 1807
Per-token loss scaled by world size: 0.00036834663478657603
Per-token loss scaled by world size: 0.00037199081270955503
Per-token loss scaled by world size: 0.00025284758885391057
Per-token loss scaled by world size: 0.00017388923151884228
Per-token loss scaled by world size: 4.525140013811324e-07
Per-token loss scaled by world size: 0.00022274823277257383
Per-token loss scaled by world size: 9.103088814299554e-05
Epoch: 0, Step: 16, Rank: 4, loss = 1.2040791511535645
Epoch: 0, Step: 16, Rank: 5, loss = 1.215991497039795
Epoch: 0, Step: 16, Rank: 3, loss = 0.8265271782875061
Epoch: 0, Step: 16, Rank: 7, loss = 0.5684221386909485
Epoch: 0, Step: 16, Rank: 0, loss = 0.0014792117290198803
Epoch: 0, Step: 16, Rank: 2, loss = 0.7281361222267151
Epoch: 0, Step: 16, Rank: 1, loss = 0.29756858944892883
Per-token loss scaled by world size: 0.00036798955989070237
Epoch: 0, Step: 16, Rank: 6, loss = 1.2029118537902832
Epoch 0: 13%|█▎ | 16/121 [00:41<04:25, 2.52s/it]
total tokens: 7950 num samples: 10 num padding tokens: 556 - rank: 4 max len: 795 min len: 706 avg len: 739.4 num_loss_counted_tokens: 4706
total tokens: 7167 num samples: 3 num padding tokens: 1839 - rank: 1 max len: 2389 min len: 1434 avg len: 1776.0 num_loss_counted_tokens: 2175
{
"epoch": 0,
"step": 16,
"rank": 0,
"loss": 0.0014792117290198803,
"overall_throughput": 43.42571067791906,
"lr": 0.0,
"cuda_mem_allocated": 18.137038707733154,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26151,
"batch_size": 89,
"total_loss": 0.7556394934654236,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:43.079559"
}
total tokens: 7904 num samples: 16 num padding tokens: 2012 - rank: 6 max len: 494 min len: 282 avg len: 368.25 num_loss_counted_tokens: 3677
total tokens: 7689 num samples: 11 num padding tokens: 1030 - rank: 5 max len: 699 min len: 501 avg len: 605.3636363636364 num_loss_counted_tokens: 4515
total tokens: 6075 num samples: 25 num padding tokens: 2510 - rank: 7 max len: 243 min len: 84 avg len: 142.6 num_loss_counted_tokens: 1404
total tokens: 7752 num samples: 8 num padding tokens: 648 - rank: 3 max len: 969 min len: 812 avg len: 888.0 num_loss_counted_tokens: 4511
total tokens: 7658 num samples: 7 num padding tokens: 214 - rank: 2 max len: 1094 min len: 982 avg len: 1063.4285714285713 num_loss_counted_tokens: 4703
total tokens: 6666 num samples: 2 num padding tokens: 735 - rank: 0 max len: 3333 min len: 2598 avg len: 2965.5 num_loss_counted_tokens: 156
Per-token loss scaled by world size: 0.0003832130169030279
Per-token loss scaled by world size: 0.00025972415460273623
Per-token loss scaled by world size: 0.0007209046743810177
Per-token loss scaled by world size: 3.690614175866358e-05
Per-token loss scaled by world size: 0.0004276617255527526
Per-token loss scaled by world size: 7.648386599612422e-06
Per-token loss scaled by world size: 0.00031741950078867376
Epoch: 0, Step: 17, Rank: 0, loss = 0.0850963369011879
Epoch: 0, Step: 17, Rank: 4, loss = 0.8835934400558472
Epoch: 0, Step: 17, Rank: 6, loss = 1.6622259616851807
Epoch: 0, Step: 17, Rank: 3, loss = 0.5988589525222778
Epoch: 0, Step: 17, Rank: 2, loss = 0.9860810041427612
Epoch: 0, Step: 17, Rank: 1, loss = 0.017635267227888107
Epoch: 0, Step: 17, Rank: 7, loss = 0.7318900227546692
Per-token loss scaled by world size: 0.00037659640656784177
Epoch: 0, Step: 17, Rank: 5, loss = 0.8683371543884277
Epoch 0: 14%|█▍ | 17/121 [00:43<04:21, 2.51s/it]
total tokens: 8082 num samples: 9 num padding tokens: 734 - rank: 4 max len: 898 min len: 742 avg len: 816.4444444444445 num_loss_counted_tokens: 4032
total tokens: 8112 num samples: 4 num padding tokens: 808 - rank: 1 max len: 2028 min len: 1729 avg len: 1826.0 num_loss_counted_tokens: 1009
total tokens: 7680 num samples: 12 num padding tokens: 950 - rank: 5 max len: 640 min len: 496 avg len: 560.8333333333334 num_loss_counted_tokens: 4500
{
"epoch": 0,
"step": 17,
"rank": 0,
"loss": 0.0850963369011879,
"overall_throughput": 43.64044780600529,
"lr": 0.0,
"cuda_mem_allocated": 18.1714825630188,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18446,
"batch_size": 76,
"total_loss": 0.7292147874832153,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:45.559713"
}
total tokens: 7664 num samples: 16 num padding tokens: 2265 - rank: 6 max len: 479 min len: 253 avg len: 337.4375 num_loss_counted_tokens: 2968
total tokens: 7756 num samples: 7 num padding tokens: 792 - rank: 3 max len: 1108 min len: 923 avg len: 994.8571428571429 num_loss_counted_tokens: 4875
total tokens: 7194 num samples: 2 num padding tokens: 1367 - rank: 0 max len: 3597 min len: 2230 avg len: 2913.5 num_loss_counted_tokens: 192
total tokens: 6448 num samples: 26 num padding tokens: 1927 - rank: 7 max len: 248 min len: 98 avg len: 173.8846153846154 num_loss_counted_tokens: 2050
total tokens: 7075 num samples: 5 num padding tokens: 857 - rank: 2 max len: 1415 min len: 1116 avg len: 1243.6 num_loss_counted_tokens: 3663
Per-token loss scaled by world size: 0.00015676565817557275
Per-token loss scaled by world size: 0.0004898877814412117
Per-token loss scaled by world size: 0.00045096693793311715
Per-token loss scaled by world size: 5.226111625233898e-06
Per-token loss scaled by world size: 8.227287253248505e-06
Per-token loss scaled by world size: 0.0005342444637790322
Per-token loss scaled by world size: 0.0005171209922991693
Epoch: 0, Step: 18, Rank: 3, loss = 1.0057395696640015
Epoch: 0, Step: 18, Rank: 2, loss = 0.3218398988246918
Epoch: 0, Step: 18, Rank: 5, loss = 0.925835132598877
Epoch: 0, Step: 18, Rank: 1, loss = 0.01072920672595501
Epoch: 0, Step: 18, Rank: 0, loss = 0.016890620812773705
Epoch: 0, Step: 18, Rank: 7, loss = 1.096803903579712
Epoch: 0, Step: 18, Rank: 4, loss = 1.0616494417190552
Per-token loss scaled by world size: 0.0007897767936810851
Epoch: 0, Step: 18, Rank: 6, loss = 1.6214118003845215
Epoch 0: 15%|█▍ | 18/121 [00:46<04:17, 2.50s/it]
total tokens: 7335 num samples: 3 num padding tokens: 432 - rank: 1 max len: 2445 min len: 2093 avg len: 2301.0 num_loss_counted_tokens: 1145
total tokens: 7455 num samples: 7 num padding tokens: 746 - rank: 4 max len: 1065 min len: 855 avg len: 958.4285714285714 num_loss_counted_tokens: 3764
{
"epoch": 0,
"step": 18,
"rank": 0,
"loss": 0.016890620812773705,
"overall_throughput": 43.99619876252009,
"lr": 0.0,
"cuda_mem_allocated": 17.973182678222656,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 16424,
"batch_size": 69,
"total_loss": 0.7576124668121338,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:48.023775"
}
total tokens: 7188 num samples: 4 num padding tokens: 887 - rank: 2 max len: 1797 min len: 1322 avg len: 1575.25 num_loss_counted_tokens: 1620
total tokens: 8016 num samples: 16 num padding tokens: 1990 - rank: 6 max len: 501 min len: 289 avg len: 376.625 num_loss_counted_tokens: 3443
total tokens: 7200 num samples: 25 num padding tokens: 2743 - rank: 7 max len: 288 min len: 77 avg len: 178.28 num_loss_counted_tokens: 2061
total tokens: 7980 num samples: 10 num padding tokens: 1229 - rank: 5 max len: 798 min len: 551 avg len: 675.1 num_loss_counted_tokens: 3751
total tokens: 6852 num samples: 2 num padding tokens: 182 - rank: 0 max len: 3426 min len: 3244 avg len: 3335.0 num_loss_counted_tokens: 203
total tokens: 7290 num samples: 6 num padding tokens: 282 - rank: 3 max len: 1215 min len: 1115 avg len: 1168.0 num_loss_counted_tokens: 4808
Per-token loss scaled by world size: 0.0002617795253172517
Per-token loss scaled by world size: 0.0003632722655311227
Per-token loss scaled by world size: 0.00038097533979453146
Per-token loss scaled by world size: 0.00018842382996808738
Per-token loss scaled by world size: 0.00016033223073463887
Per-token loss scaled by world size: 2.0858458356087795e-06
Per-token loss scaled by world size: 0.00016425059584435076
Epoch: 0, Step: 19, Rank: 2, loss = 0.5884711742401123
Epoch: 0, Step: 19, Rank: 4, loss = 1.1345447301864624
Epoch: 0, Step: 19, Rank: 1, loss = 0.5007376074790955
Epoch: 0, Step: 19, Rank: 6, loss = 0.8175702095031738
Epoch: 0, Step: 19, Rank: 5, loss = 1.189833641052246
Epoch: 0, Step: 19, Rank: 0, loss = 0.006514357402920723
Epoch: 0, Step: 19, Rank: 7, loss = 0.5129751563072205
Per-token loss scaled by world size: 0.00031291748746298254
Epoch: 0, Step: 19, Rank: 3, loss = 0.9772804379463196
Epoch 0: 16%|█▌ | 19/121 [00:48<04:16, 2.51s/it]
total tokens: 7540 num samples: 10 num padding tokens: 800 - rank: 4 max len: 754 min len: 591 avg len: 674.0 num_loss_counted_tokens: 3840
total tokens: 6924 num samples: 4 num padding tokens: 344 - rank: 1 max len: 1731 min len: 1592 avg len: 1645.0 num_loss_counted_tokens: 954
total tokens: 7018 num samples: 29 num padding tokens: 2429 - rank: 7 max len: 242 min len: 75 avg len: 158.24137931034483 num_loss_counted_tokens: 1886
{
"epoch": 0,
"step": 19,
"rank": 0,
"loss": 0.006514357402920723,
"overall_throughput": 42.48141431618144,
"lr": 0.0,
"cuda_mem_allocated": 18.00298833847046,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24985,
"batch_size": 81,
"total_loss": 0.7159909009933472,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:50.578884"
}
total tokens: 7980 num samples: 14 num padding tokens: 1456 - rank: 5 max len: 570 min len: 372 avg len: 466.0 num_loss_counted_tokens: 4358
total tokens: 7658 num samples: 7 num padding tokens: 1001 - rank: 3 max len: 1094 min len: 779 avg len: 951.0 num_loss_counted_tokens: 3845
total tokens: 6948 num samples: 3 num padding tokens: 823 - rank: 0 max len: 2316 min len: 1860 avg len: 2041.6666666666667 num_loss_counted_tokens: 2568
total tokens: 8030 num samples: 22 num padding tokens: 1466 - rank: 6 max len: 365 min len: 248 avg len: 298.3636363636364 num_loss_counted_tokens: 3426
total tokens: 7350 num samples: 5 num padding tokens: 547 - rank: 2 max len: 1470 min len: 1198 avg len: 1360.6 num_loss_counted_tokens: 936
Per-token loss scaled by world size: 4.6467721404042095e-06
Per-token loss scaled by world size: 1.0636807928676717e-05
Per-token loss scaled by world size: 7.807435031281784e-05
Per-token loss scaled by world size: 0.000532266276422888
Per-token loss scaled by world size: 0.0005103853181935847
Per-token loss scaled by world size: 0.0004919093335047364
Per-token loss scaled by world size: 0.0005770818097516894
Epoch: 0, Step: 20, Rank: 0, loss = 0.025356819853186607
Epoch: 0, Step: 20, Rank: 2, loss = 0.18611949682235718
Epoch: 0, Step: 20, Rank: 1, loss = 0.01107732392847538
Epoch: 0, Step: 20, Rank: 6, loss = 1.2688562870025635
Epoch: 0, Step: 20, Rank: 4, loss = 1.1726503372192383
Epoch: 0, Step: 20, Rank: 7, loss = 1.2166948318481445
Epoch: 0, Step: 20, Rank: 5, loss = 1.3756909370422363
Per-token loss scaled by world size: 0.00012562941992655396
Epoch: 0, Step: 20, Rank: 3, loss = 0.29948481917381287
Epoch 0: 17%|█▋ | 20/121 [00:51<04:14, 2.52s/it]
total tokens: 5636 num samples: 2 num padding tokens: 753 - rank: 1 max len: 2818 min len: 2065 avg len: 2441.5 num_loss_counted_tokens: 151
total tokens: 8030 num samples: 11 num padding tokens: 694 - rank: 4 max len: 730 min len: 602 avg len: 666.9090909090909 num_loss_counted_tokens: 4998
{
"epoch": 0,
"step": 20,
"rank": 0,
"loss": 0.025356819853186607,
"overall_throughput": 42.41401913459215,
"lr": 0.0,
"cuda_mem_allocated": 18.149494647979736,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19071,
"batch_size": 76,
"total_loss": 0.6944913268089294,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:53.125315"
}
total tokens: 7472 num samples: 8 num padding tokens: 683 - rank: 3 max len: 934 min len: 765 avg len: 848.625 num_loss_counted_tokens: 4018
total tokens: 7984 num samples: 4 num padding tokens: 1049 - rank: 2 max len: 1996 min len: 1356 avg len: 1733.75 num_loss_counted_tokens: 1408
total tokens: 7440 num samples: 24 num padding tokens: 2407 - rank: 7 max len: 310 min len: 86 avg len: 209.70833333333334 num_loss_counted_tokens: 2199
total tokens: 7635 num samples: 15 num padding tokens: 1392 - rank: 6 max len: 509 min len: 313 avg len: 416.2 num_loss_counted_tokens: 3986
total tokens: 5858 num samples: 2 num padding tokens: 28 - rank: 0 max len: 2929 min len: 2901 avg len: 2915.0 num_loss_counted_tokens: 167
total tokens: 7813 num samples: 13 num padding tokens: 588 - rank: 5 max len: 601 min len: 517 avg len: 555.7692307692307 num_loss_counted_tokens: 4116
Per-token loss scaled by world size: 0.0003165990929119289
Per-token loss scaled by world size: 0.00013289590424392372
Per-token loss scaled by world size: 0.00024944794131442904
Per-token loss scaled by world size: 0.00046029582154005766
Per-token loss scaled by world size: 0.00031609757570549846
Per-token loss scaled by world size: 0.000298293714877218
Per-token loss scaled by world size: 0.0003350040642544627
Epoch: 0, Step: 21, Rank: 1, loss = 0.3848499357700348
Epoch: 0, Step: 21, Rank: 4, loss = 1.3329591751098633
Epoch: 0, Step: 21, Rank: 2, loss = 0.7223700284957886
Epoch: 0, Step: 21, Rank: 3, loss = 0.9168314337730408
Epoch: 0, Step: 21, Rank: 6, loss = 0.9153790473937988
Epoch: 0, Step: 21, Rank: 7, loss = 0.8638213276863098
Epoch: 0, Step: 21, Rank: 5, loss = 0.9701299071311951
Per-token loss scaled by world size: 2.031196345342323e-06
Epoch: 0, Step: 21, Rank: 0, loss = 0.005882090888917446
Epoch 0: 17%|█▋ | 21/121 [00:53<04:14, 2.55s/it]
total tokens: 7450 num samples: 10 num padding tokens: 522 - rank: 4 max len: 745 min len: 656 avg len: 692.8 num_loss_counted_tokens: 3966
total tokens: 6090 num samples: 3 num padding tokens: 775 - rank: 1 max len: 2030 min len: 1546 avg len: 1771.6666666666667 num_loss_counted_tokens: 2594
{
"epoch": 0,
"step": 21,
"rank": 0,
"loss": 0.005882090888917446,
"overall_throughput": 41.659196665794404,
"lr": 0.0,
"cuda_mem_allocated": 18.256999015808105,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23167,
"batch_size": 94,
"total_loss": 0.7640278339385986,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:55.723471"
}
total tokens: 7656 num samples: 12 num padding tokens: 886 - rank: 5 max len: 638 min len: 429 avg len: 564.1666666666666 num_loss_counted_tokens: 4895
total tokens: 7392 num samples: 8 num padding tokens: 529 - rank: 3 max len: 924 min len: 753 avg len: 857.875 num_loss_counted_tokens: 5456
total tokens: 7110 num samples: 3 num padding tokens: 189 - rank: 0 max len: 2370 min len: 2243 avg len: 2307.0 num_loss_counted_tokens: 433
total tokens: 7885 num samples: 19 num padding tokens: 1785 - rank: 6 max len: 415 min len: 258 avg len: 321.05263157894734 num_loss_counted_tokens: 3103
total tokens: 7453 num samples: 29 num padding tokens: 2759 - rank: 7 max len: 257 min len: 79 avg len: 161.86206896551724 num_loss_counted_tokens: 1834
total tokens: 8082 num samples: 6 num padding tokens: 1602 - rank: 2 max len: 1347 min len: 925 avg len: 1080.0 num_loss_counted_tokens: 3092
Per-token loss scaled by world size: 0.00046551000559702516
Per-token loss scaled by world size: 0.00031762762228026986
Per-token loss scaled by world size: 0.0006262522656470537
Per-token loss scaled by world size: 0.00032212832593359053
Per-token loss scaled by world size: 0.0005319734336808324
Per-token loss scaled by world size: 7.452299178112298e-05
Per-token loss scaled by world size: 8.590232027927414e-06
Epoch: 0, Step: 22, Rank: 4, loss = 1.093308448791504
Epoch: 0, Step: 22, Rank: 7, loss = 0.7459881901741028
Epoch: 0, Step: 22, Rank: 6, loss = 1.4708317518234253
Epoch: 0, Step: 22, Rank: 3, loss = 0.7565586566925049
Epoch: 0, Step: 22, Rank: 5, loss = 1.249406099319458
Epoch: 0, Step: 22, Rank: 2, loss = 0.1750265657901764
Epoch: 0, Step: 22, Rank: 1, loss = 0.020175233483314514
Per-token loss scaled by world size: 8.447452273685485e-06
Epoch: 0, Step: 22, Rank: 0, loss = 0.019839897751808167
Epoch 0: 18%|█▊ | 22/121 [00:56<04:16, 2.59s/it]
total tokens: 7168 num samples: 7 num padding tokens: 1126 - rank: 4 max len: 1024 min len: 712 avg len: 863.1428571428571 num_loss_counted_tokens: 4103
total tokens: 6628 num samples: 2 num padding tokens: 99 - rank: 1 max len: 3314 min len: 3215 avg len: 3264.5 num_loss_counted_tokens: 217
{
"epoch": 0,
"step": 22,
"rank": 0,
"loss": 0.019839897751808167,
"overall_throughput": 40.13505236925628,
"lr": 0.0,
"cuda_mem_allocated": 18.249396324157715,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18789,
"batch_size": 66,
"total_loss": 0.6913918256759644,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:48:58.416941"
}
total tokens: 7821 num samples: 11 num padding tokens: 1199 - rank: 5 max len: 711 min len: 507 avg len: 602.0 num_loss_counted_tokens: 4500
total tokens: 7330 num samples: 5 num padding tokens: 1004 - rank: 3 max len: 1466 min len: 1087 avg len: 1265.2 num_loss_counted_tokens: 1284
total tokens: 7936 num samples: 16 num padding tokens: 1776 - rank: 6 max len: 496 min len: 286 avg len: 385.0 num_loss_counted_tokens: 4076
total tokens: 8070 num samples: 30 num padding tokens: 2921 - rank: 7 max len: 269 min len: 82 avg len: 171.63333333333333 num_loss_counted_tokens: 2107
total tokens: 7552 num samples: 2 num padding tokens: 409 - rank: 0 max len: 3776 min len: 3367 avg len: 3571.5 num_loss_counted_tokens: 182
total tokens: 8049 num samples: 3 num padding tokens: 2360 - rank: 2 max len: 2683 min len: 1483 avg len: 1896.3333333333333 num_loss_counted_tokens: 515
Per-token loss scaled by world size: 0.0002995161630678922
Per-token loss scaled by world size: 0.0001995390048250556
Per-token loss scaled by world size: 4.880544565821765e-06
Per-token loss scaled by world size: 0.00033291871659457684
Per-token loss scaled by world size: 0.00010466719686519355
Per-token loss scaled by world size: 0.00026131211780011654
Per-token loss scaled by world size: 0.00020597832917701453
Epoch: 0, Step: 23, Rank: 0, loss = 0.015341991558670998
Epoch: 0, Step: 23, Rank: 2, loss = 0.6272508502006531
Epoch: 0, Step: 23, Rank: 3, loss = 0.9415290355682373
Epoch: 0, Step: 23, Rank: 5, loss = 1.04653000831604
Epoch: 0, Step: 23, Rank: 7, loss = 0.8214346170425415
Epoch: 0, Step: 23, Rank: 1, loss = 0.3290213346481323
Epoch: 0, Step: 23, Rank: 4, loss = 0.6474928855895996
Per-token loss scaled by world size: 0.00029968167655169964
Epoch: 0, Step: 23, Rank: 6, loss = 0.9420493841171265
Epoch 0: 19%|█▉ | 23/121 [00:59<04:11, 2.57s/it]
total tokens: 7217 num samples: 7 num padding tokens: 812 - rank: 4 max len: 1031 min len: 810 avg len: 915.0 num_loss_counted_tokens: 5570
total tokens: 7335 num samples: 3 num padding tokens: 893 - rank: 1 max len: 2445 min len: 1991 avg len: 2147.3333333333335 num_loss_counted_tokens: 303
{
"epoch": 0,
"step": 23,
"rank": 0,
"loss": 0.015341991558670998,
"overall_throughput": 43.085386891030744,
"lr": 0.0,
"cuda_mem_allocated": 18.126072883605957,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25148,
"batch_size": 84,
"total_loss": 0.6713312864303589,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:00.926547"
}
total tokens: 8025 num samples: 15 num padding tokens: 1566 - rank: 6 max len: 535 min len: 337 avg len: 430.6 num_loss_counted_tokens: 4270
total tokens: 7872 num samples: 24 num padding tokens: 3099 - rank: 7 max len: 328 min len: 81 avg len: 198.875 num_loss_counted_tokens: 2380
total tokens: 7830 num samples: 10 num padding tokens: 881 - rank: 5 max len: 783 min len: 590 avg len: 694.9 num_loss_counted_tokens: 5375
total tokens: 7664 num samples: 4 num padding tokens: 1037 - rank: 2 max len: 1916 min len: 1514 avg len: 1656.75 num_loss_counted_tokens: 1726
total tokens: 7794 num samples: 6 num padding tokens: 783 - rank: 3 max len: 1299 min len: 1032 avg len: 1168.5 num_loss_counted_tokens: 3507
total tokens: 7062 num samples: 2 num padding tokens: 743 - rank: 0 max len: 3531 min len: 2788 avg len: 3159.5 num_loss_counted_tokens: 192
Per-token loss scaled by world size: 0.00014580012066289783
Per-token loss scaled by world size: 0.0005001741228625178
Per-token loss scaled by world size: 0.0002923521969933063
Per-token loss scaled by world size: 0.00035334189306013286
Per-token loss scaled by world size: 0.00044065553811378777
Per-token loss scaled by world size: 6.498985749203712e-05
Epoch: 0, Step: 24, Rank: 2, loss = 0.37419599294662476
Epoch: 0, Step: 24, Rank: 4, loss = 1.2836968898773193
Epoch: 0, Step: 24, Rank: 6, loss = 0.9068519473075867
Epoch: 0, Step: 24, Rank: 1, loss = 0.1667964607477188
Epoch: 0, Step: 24, Rank: 3, loss = 0.7503219246864319
Epoch: 0, Step: 24, Rank: 7, loss = 1.130942463874817
Per-token loss scaled by world size: 0.0006109050591476262
Epoch: 0, Step: 24, Rank: 5, loss = 1.567887783050537
Per-token loss scaled by world size: 1.688349584583193e-05
Epoch: 0, Step: 24, Rank: 0, loss = 0.04333149269223213
Epoch 0: 20%|█▉ | 24/121 [01:01<04:07, 2.55s/it]
total tokens: 7492 num samples: 4 num padding tokens: 997 - rank: 1 max len: 1873 min len: 1355 avg len: 1623.75 num_loss_counted_tokens: 1310
total tokens: 8000 num samples: 10 num padding tokens: 696 - rank: 4 max len: 800 min len: 666 avg len: 730.4 num_loss_counted_tokens: 5339
{
"epoch": 0,
"step": 24,
"rank": 0,
"loss": 0.04333149269223213,
"overall_throughput": 42.861407474478405,
"lr": 0.0,
"cuda_mem_allocated": 18.177217483520508,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20532,
"batch_size": 79,
"total_loss": 0.7780030965805054,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:03.453834"
}
total tokens: 7904 num samples: 19 num padding tokens: 1180 - rank: 6 max len: 416 min len: 291 avg len: 353.89473684210526 num_loss_counted_tokens: 3756
total tokens: 7920 num samples: 12 num padding tokens: 1406 - rank: 5 max len: 660 min len: 441 avg len: 542.8333333333334 num_loss_counted_tokens: 4093
total tokens: 7830 num samples: 27 num padding tokens: 2551 - rank: 7 max len: 290 min len: 78 avg len: 195.5185185185185 num_loss_counted_tokens: 2643
total tokens: 5734 num samples: 2 num padding tokens: 736 - rank: 0 max len: 2867 min len: 2131 avg len: 2499.0 num_loss_counted_tokens: 169
total tokens: 8088 num samples: 6 num padding tokens: 1050 - rank: 2 max len: 1348 min len: 1049 avg len: 1173.0 num_loss_counted_tokens: 3708
total tokens: 8040 num samples: 8 num padding tokens: 661 - rank: 3 max len: 1005 min len: 825 avg len: 922.375 num_loss_counted_tokens: 5105
Per-token loss scaled by world size: 0.00023669454094488174
Per-token loss scaled by world size: 0.00023404715466313064
Per-token loss scaled by world size: 0.00022105168318375945
Per-token loss scaled by world size: 0.00032421553623862565
Per-token loss scaled by world size: 1.7478114386904053e-05
Per-token loss scaled by world size: 0.00014570211351383477
Per-token loss scaled by world size: 5.8525503845885396e-05
Epoch: 0, Step: 25, Rank: 2, loss = 0.8244895935058594
Epoch: 0, Step: 25, Rank: 4, loss = 0.7787098288536072
Epoch: 0, Step: 25, Rank: 6, loss = 1.1421302556991577
Epoch: 0, Step: 25, Rank: 3, loss = 0.8338156938552856
Epoch: 0, Step: 25, Rank: 0, loss = 0.06157102435827255
Epoch: 0, Step: 25, Rank: 1, loss = 0.2061707228422165
Epoch: 0, Step: 25, Rank: 7, loss = 0.5132721066474915
Per-token loss scaled by world size: 0.00032130838371813297
Epoch: 0, Step: 25, Rank: 5, loss = 1.1318891048431396
Epoch 0: 21%|██ | 25/121 [01:04<04:04, 2.55s/it]
total tokens: 7385 num samples: 7 num padding tokens: 1010 - rank: 4 max len: 1055 min len: 819 avg len: 910.7142857142857 num_loss_counted_tokens: 2829
total tokens: 6350 num samples: 2 num padding tokens: 809 - rank: 1 max len: 3175 min len: 2366 avg len: 2770.5 num_loss_counted_tokens: 1136
total tokens: 7100 num samples: 25 num padding tokens: 2255 - rank: 7 max len: 284 min len: 85 avg len: 193.8 num_loss_counted_tokens: 2161
{
"epoch": 0,
"step": 25,
"rank": 0,
"loss": 0.06157102435827255,
"overall_throughput": 42.60322198576968,
"lr": 0.0,
"cuda_mem_allocated": 18.081379890441895,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 28182,
"batch_size": 71,
"total_loss": 0.6865060329437256,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:05.995018"
}
total tokens: 7800 num samples: 15 num padding tokens: 2044 - rank: 6 max len: 520 min len: 291 avg len: 383.73333333333335 num_loss_counted_tokens: 4084
total tokens: 7464 num samples: 4 num padding tokens: 487 - rank: 2 max len: 1866 min len: 1672 avg len: 1744.25 num_loss_counted_tokens: 862
total tokens: 7970 num samples: 10 num padding tokens: 1552 - rank: 5 max len: 797 min len: 528 avg len: 641.8 num_loss_counted_tokens: 3059
total tokens: 7472 num samples: 2 num padding tokens: 315 - rank: 0 max len: 3736 min len: 3421 avg len: 3578.5 num_loss_counted_tokens: 178
total tokens: 7635 num samples: 5 num padding tokens: 1225 - rank: 3 max len: 1527 min len: 1071 avg len: 1282.0 num_loss_counted_tokens: 2498
Per-token loss scaled by world size: 0.0005163732566870749
Per-token loss scaled by world size: 0.00043581612408161163
Per-token loss scaled by world size: 0.0003451558295637369
Per-token loss scaled by world size: 0.00011102599819423631
Per-token loss scaled by world size: 7.73636857047677e-05
Per-token loss scaled by world size: 1.7906730818140204e-06
Per-token loss scaled by world size: 0.00031262030825018883
Epoch: 0, Step: 26, Rank: 4, loss = 1.1685864925384521
Epoch: 0, Step: 26, Rank: 2, loss = 0.2977023422718048
Epoch: 0, Step: 26, Rank: 3, loss = 0.9254922270774841
Epoch: 0, Step: 26, Rank: 0, loss = 0.004801466129720211
Epoch: 0, Step: 26, Rank: 6, loss = 1.3845903873443604
Epoch: 0, Step: 26, Rank: 1, loss = 0.207441046833992
Epoch: 0, Step: 26, Rank: 7, loss = 0.8382523059844971
Per-token loss scaled by world size: 0.0004404305072966963
Epoch: 0, Step: 26, Rank: 5, loss = 1.1809593439102173
Epoch 0: 21%|██▏ | 26/121 [01:06<04:02, 2.55s/it]
total tokens: 8012 num samples: 4 num padding tokens: 1086 - rank: 1 max len: 2003 min len: 1474 avg len: 1731.5 num_loss_counted_tokens: 2801
total tokens: 7784 num samples: 8 num padding tokens: 708 - rank: 4 max len: 973 min len: 760 avg len: 884.5 num_loss_counted_tokens: 4202
{
"epoch": 0,
"step": 26,
"rank": 0,
"loss": 0.004801466129720211,
"overall_throughput": 42.41860146698506,
"lr": 0.0,
"cuda_mem_allocated": 18.102412223815918,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21451,
"batch_size": 82,
"total_loss": 0.7509781718254089,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:08.547049"
}
total tokens: 6995 num samples: 5 num padding tokens: 461 - rank: 2 max len: 1399 min len: 1155 avg len: 1306.8 num_loss_counted_tokens: 2776
total tokens: 7920 num samples: 11 num padding tokens: 1235 - rank: 5 max len: 720 min len: 497 avg len: 607.7272727272727 num_loss_counted_tokens: 4165
total tokens: 7553 num samples: 7 num padding tokens: 401 - rank: 3 max len: 1079 min len: 985 avg len: 1021.7142857142857 num_loss_counted_tokens: 5788
total tokens: 7540 num samples: 26 num padding tokens: 2466 - rank: 7 max len: 290 min len: 86 avg len: 195.15384615384616 num_loss_counted_tokens: 2386
total tokens: 7776 num samples: 16 num padding tokens: 1715 - rank: 6 max len: 486 min len: 293 avg len: 378.8125 num_loss_counted_tokens: 3442
total tokens: 8067 num samples: 3 num padding tokens: 761 - rank: 0 max len: 2689 min len: 2058 avg len: 2435.3333333333335 num_loss_counted_tokens: 864
Per-token loss scaled by world size: 0.0004792925319634378
Per-token loss scaled by world size: 0.0002461661642882973
Per-token loss scaled by world size: 0.0004854958679061383
Per-token loss scaled by world size: 3.6859477404505014e-05
Per-token loss scaled by world size: 0.00043515616562217474
Per-token loss scaled by world size: 3.782783096539788e-05
Per-token loss scaled by world size: 1.9859728126903065e-05
Epoch: 0, Step: 27, Rank: 3, loss = 0.5675053000450134
Epoch: 0, Step: 27, Rank: 5, loss = 1.1192500591278076
Epoch: 0, Step: 27, Rank: 0, loss = 0.08497491478919983
Epoch: 0, Step: 27, Rank: 7, loss = 1.1049489974975586
Epoch: 0, Step: 27, Rank: 4, loss = 1.0031981468200684
Epoch: 0, Step: 27, Rank: 2, loss = 0.0457841195166111
Epoch: 0, Step: 27, Rank: 1, loss = 0.08720733970403671
Per-token loss scaled by world size: 0.0007065049139782786
Epoch: 0, Step: 27, Rank: 6, loss = 1.6287587881088257
Epoch 0: 22%|██▏ | 27/121 [01:09<03:58, 2.54s/it]
total tokens: 7665 num samples: 3 num padding tokens: 261 - rank: 1 max len: 2555 min len: 2313 avg len: 2468.0 num_loss_counted_tokens: 2129
total tokens: 7147 num samples: 7 num padding tokens: 782 - rank: 4 max len: 1021 min len: 860 avg len: 909.2857142857143 num_loss_counted_tokens: 4214
total tokens: 7680 num samples: 10 num padding tokens: 867 - rank: 5 max len: 768 min len: 585 avg len: 681.3 num_loss_counted_tokens: 4557
{
"epoch": 0,
"step": 27,
"rank": 0,
"loss": 0.08497491478919983,
"overall_throughput": 43.23576441221499,
"lr": 0.0,
"cuda_mem_allocated": 18.077077388763428,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18443,
"batch_size": 71,
"total_loss": 0.7052034735679626,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:11.052914"
}
total tokens: 5712 num samples: 2 num padding tokens: 241 - rank: 0 max len: 2856 min len: 2615 avg len: 2735.5 num_loss_counted_tokens: 234
total tokens: 6724 num samples: 4 num padding tokens: 898 - rank: 3 max len: 1681 min len: 1052 avg len: 1456.5 num_loss_counted_tokens: 2574
total tokens: 8092 num samples: 14 num padding tokens: 1753 - rank: 6 max len: 578 min len: 324 avg len: 452.7857142857143 num_loss_counted_tokens: 2859
total tokens: 8060 num samples: 4 num padding tokens: 681 - rank: 2 max len: 2015 min len: 1739 avg len: 1844.75 num_loss_counted_tokens: 813
total tokens: 8092 num samples: 28 num padding tokens: 3097 - rank: 7 max len: 289 min len: 81 avg len: 178.39285714285714 num_loss_counted_tokens: 2310
Per-token loss scaled by world size: 0.0003594549198169261
Per-token loss scaled by world size: 0.0002452041080687195
Per-token loss scaled by world size: 0.0001836109149735421
Per-token loss scaled by world size: 0.00030733394669368863
Per-token loss scaled by world size: 0.0003232009767089039
Per-token loss scaled by world size: 0.000357407407136634
Per-token loss scaled by world size: 3.657307388493791e-05
Epoch: 0, Step: 28, Rank: 3, loss = 0.7112144827842712
Epoch: 0, Step: 28, Rank: 2, loss = 0.5325634479522705
Epoch: 0, Step: 28, Rank: 6, loss = 1.0425989627838135
Epoch: 0, Step: 28, Rank: 4, loss = 0.8914220929145813
Epoch: 0, Step: 28, Rank: 5, loss = 1.0366601943969727
Epoch: 0, Step: 28, Rank: 7, loss = 0.9374444484710693
Epoch: 0, Step: 28, Rank: 1, loss = 0.10608020424842834
Per-token loss scaled by world size: 4.0188886487158015e-05
Epoch: 0, Step: 28, Rank: 0, loss = 0.11656786501407623
Epoch 0: 23%|██▎ | 28/121 [01:11<03:56, 2.55s/it]
total tokens: 7986 num samples: 11 num padding tokens: 839 - rank: 4 max len: 726 min len: 536 avg len: 649.7272727272727 num_loss_counted_tokens: 4414
total tokens: 7616 num samples: 4 num padding tokens: 1782 - rank: 1 max len: 1904 min len: 1228 avg len: 1458.5 num_loss_counted_tokens: 1665
{
"epoch": 0,
"step": 28,
"rank": 0,
"loss": 0.11656786501407623,
"overall_throughput": 42.00576591775989,
"lr": 0.0,
"cuda_mem_allocated": 18.240712642669678,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23204,
"batch_size": 91,
"total_loss": 0.6718189716339111,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:13.626212"
}
total tokens: 8040 num samples: 20 num padding tokens: 1891 - rank: 6 max len: 402 min len: 208 avg len: 307.45 num_loss_counted_tokens: 3493
total tokens: 4944 num samples: 24 num padding tokens: 1763 - rank: 7 max len: 206 min len: 77 avg len: 132.54166666666666 num_loss_counted_tokens: 1195
total tokens: 6352 num samples: 2 num padding tokens: 1168 - rank: 0 max len: 3176 min len: 2008 avg len: 2592.0 num_loss_counted_tokens: 1140
total tokens: 7960 num samples: 10 num padding tokens: 318 - rank: 3 max len: 796 min len: 730 avg len: 764.2 num_loss_counted_tokens: 6637
total tokens: 7294 num samples: 7 num padding tokens: 669 - rank: 2 max len: 1042 min len: 832 avg len: 946.4285714285714 num_loss_counted_tokens: 3958
total tokens: 7905 num samples: 15 num padding tokens: 563 - rank: 5 max len: 527 min len: 412 avg len: 489.46666666666664 num_loss_counted_tokens: 4169
Per-token loss scaled by world size: 0.0004329077200964093
Per-token loss scaled by world size: 5.7912915508495644e-05
Per-token loss scaled by world size: 0.0001508180284872651
Per-token loss scaled by world size: 3.933108018827625e-06
Per-token loss scaled by world size: 0.00037304253783077
Per-token loss scaled by world size: 0.0003069050144404173
Per-token loss scaled by world size: 0.00022216846991796046
Epoch: 0, Step: 29, Rank: 0, loss = 0.0120353102684021
Epoch: 0, Step: 29, Rank: 5, loss = 1.3246976137161255
Epoch: 0, Step: 29, Rank: 1, loss = 0.17721351981163025
Epoch: 0, Step: 29, Rank: 2, loss = 0.46150317788124084
Epoch: 0, Step: 29, Rank: 4, loss = 0.6798354983329773
Epoch: 0, Step: 29, Rank: 7, loss = 0.9391293525695801
Epoch: 0, Step: 29, Rank: 6, loss = 1.1415101289749146
Per-token loss scaled by world size: 0.00031356202089227736
Epoch: 0, Step: 29, Rank: 3, loss = 0.9594997763633728
Epoch 0: 24%|██▍ | 29/121 [01:14<03:55, 2.55s/it]
total tokens: 8085 num samples: 11 num padding tokens: 512 - rank: 4 max len: 735 min len: 626 avg len: 688.4545454545455 num_loss_counted_tokens: 5373
total tokens: 6820 num samples: 4 num padding tokens: 921 - rank: 1 max len: 1705 min len: 1306 avg len: 1474.75 num_loss_counted_tokens: 1453
{
"epoch": 0,
"step": 29,
"rank": 0,
"loss": 0.0120353102684021,
"overall_throughput": 42.0762720880897,
"lr": 0.0,
"cuda_mem_allocated": 18.163118362426758,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24480,
"batch_size": 81,
"total_loss": 0.711928129196167,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:16.196587"
}
total tokens: 8056 num samples: 8 num padding tokens: 879 - rank: 3 max len: 1007 min len: 755 avg len: 897.125 num_loss_counted_tokens: 5796
total tokens: 8021 num samples: 13 num padding tokens: 880 - rank: 5 max len: 617 min len: 473 avg len: 549.3076923076923 num_loss_counted_tokens: 4711
total tokens: 7992 num samples: 2 num padding tokens: 2114 - rank: 0 max len: 3996 min len: 1882 avg len: 2939.0 num_loss_counted_tokens: 854
total tokens: 7803 num samples: 17 num padding tokens: 1971 - rank: 6 max len: 459 min len: 257 avg len: 343.05882352941177 num_loss_counted_tokens: 3525
total tokens: 8064 num samples: 32 num padding tokens: 3179 - rank: 7 max len: 252 min len: 75 avg len: 152.65625 num_loss_counted_tokens: 2045
total tokens: 7854 num samples: 7 num padding tokens: 337 - rank: 2 max len: 1122 min len: 1010 avg len: 1073.857142857143 num_loss_counted_tokens: 4505
Per-token loss scaled by world size: 0.00014981771528255194
Per-token loss scaled by world size: 0.00036861959961242974
Per-token loss scaled by world size: 0.00042424429557286203
Per-token loss scaled by world size: 6.0521342675201595e-06
Per-token loss scaled by world size: 0.00027959863655269146
Per-token loss scaled by world size: 0.00027874435181729496
Per-token loss scaled by world size: 2.435126134514576e-06
Epoch: 0, Step: 30, Rank: 6, loss = 1.0413503646850586
Epoch: 0, Step: 30, Rank: 5, loss = 1.1984901428222656
Epoch: 0, Step: 30, Rank: 2, loss = 0.4232350289821625
Epoch: 0, Step: 30, Rank: 0, loss = 0.01709727942943573
Epoch: 0, Step: 30, Rank: 4, loss = 0.7898661494255066
Epoch: 0, Step: 30, Rank: 7, loss = 0.7874528169631958
Epoch: 0, Step: 30, Rank: 1, loss = 0.0068792314268648624
Per-token loss scaled by world size: 0.0004282180452719331
Epoch: 0, Step: 30, Rank: 3, loss = 1.2097159624099731
Epoch 0: 25%|██▍ | 30/121 [01:16<03:52, 2.56s/it]
total tokens: 7893 num samples: 9 num padding tokens: 1033 - rank: 4 max len: 877 min len: 680 avg len: 762.2222222222222 num_loss_counted_tokens: 5017
{
"epoch": 0,
"step": 30,
"rank": 0,
"loss": 0.01709727942943573,
"overall_throughput": 42.42562330633586,
"lr": 0.0,
"cuda_mem_allocated": 17.776453971862793,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22600,
"batch_size": 88,
"total_loss": 0.684260904788971,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:18.756982"
}
total tokens: 6591 num samples: 3 num padding tokens: 424 - rank: 1 max len: 2197 min len: 1974 avg len: 2055.6666666666665 num_loss_counted_tokens: 2362
total tokens: 7974 num samples: 2 num padding tokens: 1251 - rank: 0 max len: 3987 min len: 2736 avg len: 3361.5 num_loss_counted_tokens: 364
total tokens: 7440 num samples: 6 num padding tokens: 982 - rank: 3 max len: 1240 min len: 953 avg len: 1076.3333333333333 num_loss_counted_tokens: 2782
total tokens: 7192 num samples: 31 num padding tokens: 2694 - rank: 7 max len: 232 min len: 81 avg len: 145.09677419354838 num_loss_counted_tokens: 1761
total tokens: 8056 num samples: 19 num padding tokens: 2186 - rank: 6 max len: 424 min len: 234 avg len: 308.94736842105266 num_loss_counted_tokens: 3218
total tokens: 7920 num samples: 12 num padding tokens: 1246 - rank: 5 max len: 660 min len: 465 avg len: 556.1666666666666 num_loss_counted_tokens: 4600
total tokens: 7096 num samples: 4 num padding tokens: 1017 - rank: 2 max len: 1774 min len: 1287 avg len: 1519.75 num_loss_counted_tokens: 2154
Per-token loss scaled by world size: 0.00010154087067348883
Per-token loss scaled by world size: 8.184825674106833e-06
Per-token loss scaled by world size: 1.3264020708447788e-06
Per-token loss scaled by world size: 0.0004786914505530149
Per-token loss scaled by world size: 0.0006691211019642651
Per-token loss scaled by world size: 0.0006968624657019973
Per-token loss scaled by world size: 0.0005032268818467855
Epoch: 0, Step: 31, Rank: 0, loss = 0.01914430782198906
Epoch: 0, Step: 31, Rank: 6, loss = 1.1196593046188354
Epoch: 0, Step: 31, Rank: 1, loss = 0.0031024543568491936
Epoch: 0, Step: 31, Rank: 2, loss = 0.23750409483909607
Epoch: 0, Step: 31, Rank: 4, loss = 1.6299612522125244
Epoch: 0, Step: 31, Rank: 5, loss = 1.5650743246078491
Epoch: 0, Step: 31, Rank: 7, loss = 1.1770477294921875
Per-token loss scaled by world size: 0.00016472380957566202
Epoch: 0, Step: 31, Rank: 3, loss = 0.3852889835834503
Epoch 0: 26%|██▌ | 31/121 [01:19<03:49, 2.55s/it]
total tokens: 5876 num samples: 2 num padding tokens: 1112 - rank: 1 max len: 2938 min len: 1826 avg len: 2382.0 num_loss_counted_tokens: 148
total tokens: 7476 num samples: 7 num padding tokens: 1132 - rank: 4 max len: 1068 min len: 776 avg len: 906.2857142857143 num_loss_counted_tokens: 5713
{
"epoch": 0,
"step": 31,
"rank": 0,
"loss": 0.01914430782198906,
"overall_throughput": 42.41999987357855,
"lr": 0.0,
"cuda_mem_allocated": 18.21198320388794,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18712,
"batch_size": 86,
"total_loss": 0.7670978307723999,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:21.283912"
}
total tokens: 7170 num samples: 6 num padding tokens: 395 - rank: 3 max len: 1195 min len: 1074 avg len: 1129.1666666666667 num_loss_counted_tokens: 4609
total tokens: 7808 num samples: 16 num padding tokens: 1337 - rank: 6 max len: 488 min len: 315 avg len: 404.4375 num_loss_counted_tokens: 4322
total tokens: 6692 num samples: 4 num padding tokens: 769 - rank: 2 max len: 1673 min len: 1240 avg len: 1480.75 num_loss_counted_tokens: 983
total tokens: 8036 num samples: 28 num padding tokens: 2902 - rank: 7 max len: 287 min len: 83 avg len: 183.35714285714286 num_loss_counted_tokens: 2351
total tokens: 6152 num samples: 2 num padding tokens: 42 - rank: 0 max len: 3076 min len: 3034 avg len: 3055.0 num_loss_counted_tokens: 181
total tokens: 8030 num samples: 11 num padding tokens: 1077 - rank: 5 max len: 730 min len: 530 avg len: 632.0909090909091 num_loss_counted_tokens: 4192
Per-token loss scaled by world size: 0.0005925196455791593
Per-token loss scaled by world size: 0.0004467185935936868
Per-token loss scaled by world size: 0.0003141453198622912
Per-token loss scaled by world size: 0.0005347821279428899
Per-token loss scaled by world size: 7.6175206231710035e-06
Per-token loss scaled by world size: 0.0005325234378688037
Per-token loss scaled by world size: 7.270013156812638e-05
Epoch: 0, Step: 32, Rank: 3, loss = 0.6862111687660217
Epoch: 0, Step: 32, Rank: 6, loss = 1.2942850589752197
Epoch: 0, Step: 32, Rank: 4, loss = 0.9758009314537048
Epoch: 0, Step: 32, Rank: 1, loss = 0.016639521345496178
Epoch: 0, Step: 32, Rank: 5, loss = 1.1681647300720215
Epoch: 0, Step: 32, Rank: 0, loss = 0.15880435705184937
Epoch: 0, Step: 32, Rank: 7, loss = 1.1632308959960938
Per-token loss scaled by world size: 2.2659537535218988e-06
Epoch: 0, Step: 32, Rank: 2, loss = 0.004949692636728287
Epoch 0: 26%|██▋ | 32/121 [01:22<03:48, 2.57s/it]
{
"epoch": 0,
"step": 32,
"rank": 0,
"loss": 0.15880435705184937,
"overall_throughput": 41.6872406777795,
"lr": 0.0,
"cuda_mem_allocated": 17.77738618850708,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 17475,
"batch_size": 60,
"total_loss": 0.6835108399391174,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:23.890605"
}
total tokens: 8016 num samples: 4 num padding tokens: 1255 - rank: 1 max len: 2004 min len: 1389 avg len: 1690.25 num_loss_counted_tokens: 3118
total tokens: 7668 num samples: 9 num padding tokens: 475 - rank: 4 max len: 852 min len: 747 avg len: 799.2222222222222 num_loss_counted_tokens: 4806
total tokens: 6662 num samples: 2 num padding tokens: 558 - rank: 0 max len: 3331 min len: 2773 avg len: 3052.0 num_loss_counted_tokens: 194
total tokens: 6006 num samples: 22 num padding tokens: 1923 - rank: 7 max len: 273 min len: 87 avg len: 185.5909090909091 num_loss_counted_tokens: 1883
total tokens: 8000 num samples: 16 num padding tokens: 1855 - rank: 6 max len: 500 min len: 279 avg len: 384.0625 num_loss_counted_tokens: 3217
total tokens: 8043 num samples: 7 num padding tokens: 1088 - rank: 3 max len: 1149 min len: 859 avg len: 993.5714285714286 num_loss_counted_tokens: 2802
total tokens: 7986 num samples: 6 num padding tokens: 566 - rank: 2 max len: 1331 min len: 1187 avg len: 1236.6666666666667 num_loss_counted_tokens: 1663
total tokens: 7920 num samples: 11 num padding tokens: 899 - rank: 5 max len: 720 min len: 534 avg len: 638.2727272727273 num_loss_counted_tokens: 5060
Per-token loss scaled by world size: 0.00021643155196215957
Per-token loss scaled by world size: 0.00016316254914272577
Per-token loss scaled by world size: 0.0003613443404901773
Per-token loss scaled by world size: 0.0002246944495709613
Per-token loss scaled by world size: 0.00030351741588674486
Per-token loss scaled by world size: 1.8829136934073176e-06
Per-token loss scaled by world size: 0.0001424902438884601
Epoch: 0, Step: 33, Rank: 6, loss = 0.7259596586227417
Epoch: 0, Step: 33, Rank: 2, loss = 0.6992632746696472
Epoch: 0, Step: 33, Rank: 1, loss = 0.5271577835083008
Epoch: 0, Step: 33, Rank: 5, loss = 1.167458415031433
Epoch: 0, Step: 33, Rank: 0, loss = 0.006083458662033081
Epoch: 0, Step: 33, Rank: 4, loss = 0.9806268215179443
Epoch: 0, Step: 33, Rank: 7, loss = 0.46036818623542786
Per-token loss scaled by world size: 0.0003278390795458108
Epoch: 0, Step: 33, Rank: 3, loss = 1.0592070817947388
Epoch 0: 27%|██▋ | 33/121 [01:24<03:44, 2.55s/it]
total tokens: 7733 num samples: 11 num padding tokens: 698 - rank: 4 max len: 703 min len: 598 avg len: 639.5454545454545 num_loss_counted_tokens: 4778
total tokens: 6987 num samples: 3 num padding tokens: 912 - rank: 1 max len: 2329 min len: 1634 avg len: 2025.0 num_loss_counted_tokens: 1273
{
"epoch": 0,
"step": 33,
"rank": 0,
"loss": 0.006083458662033081,
"overall_throughput": 42.6693002092833,
"lr": 0.0,
"cuda_mem_allocated": 18.08671236038208,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25847,
"batch_size": 82,
"total_loss": 0.7032655477523804,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:26.403211"
}
total tokens: 7062 num samples: 6 num padding tokens: 1127 - rank: 2 max len: 1177 min len: 929 avg len: 989.1666666666666 num_loss_counted_tokens: 3984
total tokens: 7810 num samples: 22 num padding tokens: 1750 - rank: 6 max len: 355 min len: 236 avg len: 275.45454545454544 num_loss_counted_tokens: 2894
total tokens: 7312 num samples: 8 num padding tokens: 597 - rank: 3 max len: 914 min len: 774 avg len: 839.375 num_loss_counted_tokens: 4040
total tokens: 7990 num samples: 34 num padding tokens: 2478 - rank: 7 max len: 235 min len: 87 avg len: 162.11764705882354 num_loss_counted_tokens: 1905
total tokens: 7683 num samples: 13 num padding tokens: 1541 - rank: 5 max len: 591 min len: 394 avg len: 472.46153846153845 num_loss_counted_tokens: 3832
total tokens: 6662 num samples: 2 num padding tokens: 451 - rank: 0 max len: 3331 min len: 2880 avg len: 3105.5 num_loss_counted_tokens: 160
Per-token loss scaled by world size: 0.0002979582059197128
Per-token loss scaled by world size: 5.8393885410623625e-05
Per-token loss scaled by world size: 8.860254183673533e-07
Per-token loss scaled by world size: 0.00030089422944001853
Per-token loss scaled by world size: 0.0003473999386187643
Per-token loss scaled by world size: 0.00043760568951256573
Per-token loss scaled by world size: 0.00025744541198946536
Epoch: 0, Step: 34, Rank: 0, loss = 0.002579330699518323
Epoch: 0, Step: 34, Rank: 1, loss = 0.1699918955564499
Epoch: 0, Step: 34, Rank: 2, loss = 0.8673935532569885
Epoch: 0, Step: 34, Rank: 4, loss = 0.87594074010849
Epoch: 0, Step: 34, Rank: 5, loss = 1.2739248275756836
Epoch: 0, Step: 34, Rank: 6, loss = 1.0113246440887451
Epoch: 0, Step: 34, Rank: 7, loss = 0.7494558095932007
Per-token loss scaled by world size: 0.00023933911870699376
Epoch: 0, Step: 34, Rank: 3, loss = 0.6967461109161377
Epoch 0: 28%|██▊ | 34/121 [01:27<03:42, 2.56s/it]
total tokens: 7905 num samples: 3 num padding tokens: 745 - rank: 1 max len: 2635 min len: 2036 avg len: 2386.6666666666665 num_loss_counted_tokens: 966
total tokens: 7504 num samples: 7 num padding tokens: 972 - rank: 4 max len: 1072 min len: 783 avg len: 933.1428571428571 num_loss_counted_tokens: 3983
{
"epoch": 0,
"step": 34,
"rank": 0,
"loss": 0.002579330699518323,
"overall_throughput": 41.96938385940714,
"lr": 0.0,
"cuda_mem_allocated": 18.149734020233154,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23289,
"batch_size": 81,
"total_loss": 0.7059195637702942,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:28.982671"
}
total tokens: 7917 num samples: 13 num padding tokens: 2041 - rank: 6 max len: 609 min len: 308 avg len: 452.0 num_loss_counted_tokens: 4345
total tokens: 7310 num samples: 5 num padding tokens: 1026 - rank: 3 max len: 1462 min len: 1079 avg len: 1256.8 num_loss_counted_tokens: 3091
total tokens: 6994 num samples: 26 num padding tokens: 2708 - rank: 7 max len: 269 min len: 78 avg len: 164.84615384615384 num_loss_counted_tokens: 1792
total tokens: 7628 num samples: 4 num padding tokens: 1180 - rank: 2 max len: 1907 min len: 1481 avg len: 1612.0 num_loss_counted_tokens: 3200
total tokens: 6582 num samples: 2 num padding tokens: 537 - rank: 0 max len: 3291 min len: 2754 avg len: 3022.5 num_loss_counted_tokens: 226
total tokens: 7740 num samples: 10 num padding tokens: 701 - rank: 5 max len: 774 min len: 622 avg len: 703.9 num_loss_counted_tokens: 5163
Per-token loss scaled by world size: 0.00038499291986227036
Per-token loss scaled by world size: 0.00033336589694954455
Per-token loss scaled by world size: 6.31858165434096e-06
Per-token loss scaled by world size: 0.0004092359740752727
Per-token loss scaled by world size: 0.000349278881913051
Per-token loss scaled by world size: 9.329826571047306e-05
Per-token loss scaled by world size: 0.000329840142512694
Epoch: 0, Step: 35, Rank: 1, loss = 0.8665429949760437
Epoch: 0, Step: 35, Rank: 6, loss = 0.9079067707061768
Epoch: 0, Step: 35, Rank: 3, loss = 1.0007410049438477
Epoch: 0, Step: 35, Rank: 0, loss = 0.01642436347901821
Epoch: 0, Step: 35, Rank: 4, loss = 1.0637577772140503
Epoch: 0, Step: 35, Rank: 2, loss = 0.24251717329025269
Epoch: 0, Step: 35, Rank: 7, loss = 0.8573781847953796
Per-token loss scaled by world size: 0.0002692708803806454
Epoch: 0, Step: 35, Rank: 5, loss = 0.6999359726905823
Epoch 0: 29%|██▉ | 35/121 [01:29<03:38, 2.55s/it]
total tokens: 7188 num samples: 4 num padding tokens: 317 - rank: 1 max len: 1797 min len: 1593 avg len: 1717.75 num_loss_counted_tokens: 2604
total tokens: 7240 num samples: 8 num padding tokens: 858 - rank: 4 max len: 905 min len: 728 avg len: 797.75 num_loss_counted_tokens: 5344
{
"epoch": 0,
"step": 35,
"rank": 0,
"loss": 0.01642436347901821,
"overall_throughput": 43.07030586294916,
"lr": 0.0,
"cuda_mem_allocated": 18.10886526107788,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20795,
"batch_size": 73,
"total_loss": 0.7069005370140076,
"gradnorm": null,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:31.496950"
}
total tokens: 7574 num samples: 7 num padding tokens: 648 - rank: 3 max len: 1082 min len: 913 avg len: 989.4285714285714 num_loss_counted_tokens: 4812
total tokens: 7353 num samples: 3 num padding tokens: 1218 - rank: 0 max len: 2451 min len: 1808 avg len: 2045.0 num_loss_counted_tokens: 890
total tokens: 7225 num samples: 5 num padding tokens: 790 - rank: 2 max len: 1445 min len: 1127 avg len: 1287.0 num_loss_counted_tokens: 2573
total tokens: 7777 num samples: 11 num padding tokens: 857 - rank: 5 max len: 707 min len: 516 avg len: 629.0909090909091 num_loss_counted_tokens: 3335
total tokens: 7672 num samples: 28 num padding tokens: 2867 - rank: 7 max len: 274 min len: 75 avg len: 171.60714285714286 num_loss_counted_tokens: 2189
total tokens: 7650 num samples: 15 num padding tokens: 1519 - rank: 6 max len: 510 min len: 298 avg len: 408.73333333333335 num_loss_counted_tokens: 3925
Per-token loss scaled by world size: 0.00017819351342041045
Per-token loss scaled by world size: 0.00010071766882902011
Per-token loss scaled by world size: 0.00047568074660375714
Per-token loss scaled by world size: 0.00022655159409623593
Per-token loss scaled by world size: 0.00010502615623408929
Per-token loss scaled by world size: 0.0003213490708731115
Per-token loss scaled by world size: 0.00031152847805060446
Epoch: 0, Step: 36, Rank: 3, loss = 0.6177212595939636
Epoch: 0, Step: 36, Rank: 2, loss = 0.28636693954467773
Epoch: 0, Step: 36, Rank: 5, loss = 1.2970030307769775
Epoch: 0, Step: 36, Rank: 4, loss = 0.876198410987854
Epoch: 0, Step: 36, Rank: 0, loss = 0.485866904258728
Epoch: 0, Step: 36, Rank: 1, loss = 0.27461931109428406
Epoch: 0, Step: 36, Rank: 7, loss = 0.8494213223457336
Per-token loss scaled by world size: 0.0004315480182413012
Epoch: 0, Step: 36, Rank: 6, loss = 1.1766695976257324
[2024-08-18 20:49:34,071] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[8.000000000000001e-07], mom=[(0.9, 0.95)]
Epoch 0: 30%|██▉ | 36/121 [01:32<03:38, 2.57s/it]
total tokens: 8019 num samples: 11 num padding tokens: 830 - rank: 4 max len: 729 min len: 589 avg len: 653.5454545454545 num_loss_counted_tokens: 4361
total tokens: 7575 num samples: 5 num padding tokens: 525 - rank: 1 max len: 1515 min len: 1255 avg len: 1410.0 num_loss_counted_tokens: 2654
{
"epoch": 0,
"step": 36,
"rank": 0,
"loss": 0.485866904258728,
"overall_throughput": 41.05076993788474,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 22.813036918640137,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21813,
"batch_size": 94,
"total_loss": 0.7329833507537842,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:34.202497"
}
total tokens: 8078 num samples: 14 num padding tokens: 1294 - rank: 5 max len: 577 min len: 385 avg len: 484.57142857142856 num_loss_counted_tokens: 4682
total tokens: 8085 num samples: 21 num padding tokens: 1544 - rank: 6 max len: 385 min len: 243 avg len: 311.4761904761905 num_loss_counted_tokens: 3795
total tokens: 7680 num samples: 32 num padding tokens: 2281 - rank: 7 max len: 240 min len: 78 avg len: 168.71875 num_loss_counted_tokens: 2460
total tokens: 7004 num samples: 2 num padding tokens: 1514 - rank: 0 max len: 3502 min len: 1988 avg len: 2745.0 num_loss_counted_tokens: 188
total tokens: 7840 num samples: 7 num padding tokens: 725 - rank: 2 max len: 1120 min len: 917 avg len: 1016.4285714285714 num_loss_counted_tokens: 4215
total tokens: 7911 num samples: 9 num padding tokens: 701 - rank: 3 max len: 879 min len: 758 avg len: 801.1111111111111 num_loss_counted_tokens: 5865
Per-token loss scaled by world size: 0.0003182947402819991
Per-token loss scaled by world size: 0.00048430776223540306
Per-token loss scaled by world size: 0.00047009342233650386
Per-token loss scaled by world size: 0.00042154916445724666
Per-token loss scaled by world size: 5.139104814588791e-06
Per-token loss scaled by world size: 0.0004850963596254587
Epoch: 0, Step: 37, Rank: 4, loss = 1.2365219593048096
Epoch: 0, Step: 37, Rank: 5, loss = 1.2739109992980957
Epoch: 0, Step: 37, Rank: 3, loss = 0.8372344970703125
Epoch: 0, Step: 37, Rank: 7, loss = 1.1088323593139648
Epoch: 0, Step: 37, Rank: 0, loss = 0.013517772778868675
Per-token loss scaled by world size: 2.1868495423404966e-06
Epoch: 0, Step: 37, Rank: 6, loss = 1.2759853601455688
Per-token loss scaled by world size: 0.00011173654638696462
Epoch: 0, Step: 37, Rank: 1, loss = 0.005752234254032373
Epoch: 0, Step: 37, Rank: 2, loss = 0.2939090132713318
Epoch 0: 31%|███ | 37/121 [01:34<03:36, 2.58s/it]
total tokens: 8060 num samples: 10 num padding tokens: 654 - rank: 4 max len: 806 min len: 663 avg len: 740.6 num_loss_counted_tokens: 4953
total tokens: 7920 num samples: 4 num padding tokens: 1463 - rank: 1 max len: 1980 min len: 1324 avg len: 1614.25 num_loss_counted_tokens: 2571
{
"epoch": 0,
"step": 37,
"rank": 0,
"loss": 0.013517772778868675,
"overall_throughput": 42.263056677916275,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.268142223358154,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21043,
"batch_size": 79,
"total_loss": 0.7557079792022705,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:36.725200"
}
total tokens: 7872 num samples: 12 num padding tokens: 1903 - rank: 5 max len: 656 min len: 391 avg len: 497.4166666666667 num_loss_counted_tokens: 3985
total tokens: 7497 num samples: 7 num padding tokens: 1399 - rank: 3 max len: 1071 min len: 808 avg len: 871.1428571428571 num_loss_counted_tokens: 4584
total tokens: 7800 num samples: 20 num padding tokens: 2069 - rank: 6 max len: 390 min len: 221 avg len: 286.55 num_loss_counted_tokens: 3264
total tokens: 7914 num samples: 6 num padding tokens: 519 - rank: 2 max len: 1319 min len: 1143 avg len: 1232.5 num_loss_counted_tokens: 2296
total tokens: 6765 num samples: 3 num padding tokens: 359 - rank: 0 max len: 2255 min len: 2010 avg len: 2135.3333333333335 num_loss_counted_tokens: 396
total tokens: 4796 num samples: 22 num padding tokens: 1785 - rank: 7 max len: 218 min len: 77 avg len: 136.86363636363637 num_loss_counted_tokens: 1137
Per-token loss scaled by world size: 0.00023003398382570595
Per-token loss scaled by world size: 0.00023557165695820004
Per-token loss scaled by world size: 0.0003318150993436575
Per-token loss scaled by world size: 3.583161378628574e-05
Per-token loss scaled by world size: 0.00029603790608234704
Per-token loss scaled by world size: 0.00038251461228355765
Per-token loss scaled by world size: 0.00028330745408311486
Epoch: 0, Step: 38, Rank: 6, loss = 0.7295815348625183
Epoch: 0, Step: 38, Rank: 0, loss = 0.11364444345235825
Epoch: 0, Step: 38, Rank: 7, loss = 0.7471449375152588
Epoch: 0, Step: 38, Rank: 1, loss = 1.0523930788040161
Epoch: 0, Step: 38, Rank: 4, loss = 0.8985450267791748
Epoch: 0, Step: 38, Rank: 3, loss = 0.9389212131500244
Epoch: 0, Step: 38, Rank: 5, loss = 1.2131929397583008
Per-token loss scaled by world size: 0.00017416744958609343
Epoch: 0, Step: 38, Rank: 2, loss = 0.5523938536643982
Epoch 0: 31%|███▏ | 38/121 [01:37<03:32, 2.56s/it]
total tokens: 8055 num samples: 9 num padding tokens: 294 - rank: 4 max len: 895 min len: 809 avg len: 862.3333333333334 num_loss_counted_tokens: 5164
total tokens: 8007 num samples: 3 num padding tokens: 615 - rank: 1 max len: 2669 min len: 2320 avg len: 2464.0 num_loss_counted_tokens: 299
total tokens: 7556 num samples: 4 num padding tokens: 692 - rank: 2 max len: 1889 min len: 1486 avg len: 1716.0 num_loss_counted_tokens: 3448
total tokens: 7990 num samples: 10 num padding tokens: 1519 - rank: 5 max len: 799 min len: 505 avg len: 647.1 num_loss_counted_tokens: 3664
{
"epoch": 0,
"step": 38,
"rank": 0,
"loss": 0.11364444345235825,
"overall_throughput": 42.686060661097876,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.416475296020508,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25373,
"batch_size": 90,
"total_loss": 0.7807271480560303,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:39.251905"
}
total tokens: 8064 num samples: 16 num padding tokens: 1612 - rank: 6 max len: 504 min len: 282 avg len: 403.25 num_loss_counted_tokens: 3518
total tokens: 7266 num samples: 6 num padding tokens: 1143 - rank: 3 max len: 1211 min len: 903 avg len: 1020.5 num_loss_counted_tokens: 5513
total tokens: 6204 num samples: 22 num padding tokens: 2569 - rank: 7 max len: 282 min len: 76 avg len: 165.22727272727272 num_loss_counted_tokens: 1531
total tokens: 7148 num samples: 2 num padding tokens: 587 - rank: 0 max len: 3574 min len: 2987 avg len: 3280.5 num_loss_counted_tokens: 224
Per-token loss scaled by world size: 0.000557654129806906
Per-token loss scaled by world size: 0.0007022880017757416
Per-token loss scaled by world size: 0.0008294832659885287
Per-token loss scaled by world size: 0.00019126593542750925
Per-token loss scaled by world size: 0.00040766337770037353
Per-token loss scaled by world size: 6.8775539148191456e-06
Per-token loss scaled by world size: 7.391309281956637e-06
Epoch: 0, Step: 39, Rank: 6, loss = 1.4909573793411255
Epoch: 0, Step: 39, Rank: 0, loss = 0.014601047150790691
Epoch: 0, Step: 39, Rank: 5, loss = 1.7609930038452148
Epoch: 0, Step: 39, Rank: 7, loss = 1.1838997602462769
Epoch: 0, Step: 39, Rank: 3, loss = 0.40605756640434265
Epoch: 0, Step: 39, Rank: 4, loss = 0.8654693365097046
Epoch: 0, Step: 39, Rank: 1, loss = 0.01569174975156784
Per-token loss scaled by world size: 8.316225284943357e-05
Epoch: 0, Step: 39, Rank: 2, loss = 0.17655345797538757
Epoch 0: 32%|███▏ | 39/121 [01:39<03:29, 2.55s/it]
total tokens: 7136 num samples: 4 num padding tokens: 383 - rank: 1 max len: 1784 min len: 1652 avg len: 1688.25 num_loss_counted_tokens: 2347
total tokens: 7360 num samples: 8 num padding tokens: 773 - rank: 4 max len: 920 min len: 750 avg len: 823.375 num_loss_counted_tokens: 5270
{
"epoch": 0,
"step": 39,
"rank": 0,
"loss": 0.014601047150790691,
"overall_throughput": 42.970882858906755,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.469857692718506,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 16984,
"batch_size": 76,
"total_loss": 0.7392778992652893,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:41.778226"
}
total tokens: 7925 num samples: 5 num padding tokens: 458 - rank: 2 max len: 1585 min len: 1370 avg len: 1493.4 num_loss_counted_tokens: 2405
total tokens: 8022 num samples: 14 num padding tokens: 2297 - rank: 6 max len: 573 min len: 288 avg len: 408.92857142857144 num_loss_counted_tokens: 2953
total tokens: 7733 num samples: 11 num padding tokens: 641 - rank: 5 max len: 703 min len: 579 avg len: 644.7272727272727 num_loss_counted_tokens: 5059
total tokens: 7326 num samples: 6 num padding tokens: 857 - rank: 3 max len: 1221 min len: 972 avg len: 1078.1666666666667 num_loss_counted_tokens: 4306
total tokens: 5446 num samples: 2 num padding tokens: 263 - rank: 0 max len: 2723 min len: 2460 avg len: 2591.5 num_loss_counted_tokens: 241
total tokens: 8100 num samples: 30 num padding tokens: 2620 - rank: 7 max len: 270 min len: 87 avg len: 182.66666666666666 num_loss_counted_tokens: 2660
Per-token loss scaled by world size: 0.0004144566773902625
Per-token loss scaled by world size: 0.0004883252549916506
Per-token loss scaled by world size: 0.00025119862402789295
Per-token loss scaled by world size: 0.0002532459329813719
Per-token loss scaled by world size: 0.0001632209459785372
Per-token loss scaled by world size: 6.004169335938059e-06
Per-token loss scaled by world size: 2.448785608066828e-06
Epoch: 0, Step: 40, Rank: 0, loss = 0.017504405230283737
Epoch: 0, Step: 40, Rank: 5, loss = 1.4236512184143066
Epoch: 0, Step: 40, Rank: 7, loss = 0.7323381900787354
Epoch: 0, Step: 40, Rank: 2, loss = 0.47585028409957886
Epoch: 0, Step: 40, Rank: 4, loss = 0.7383068799972534
Epoch: 0, Step: 40, Rank: 6, loss = 1.2082966566085815
Epoch: 0, Step: 40, Rank: 1, loss = 0.007139128167182207
Per-token loss scaled by world size: 0.00020900421077385545
Epoch: 0, Step: 40, Rank: 3, loss = 0.609325647354126
Epoch 0: 33%|███▎ | 40/121 [01:42<03:25, 2.54s/it]
total tokens: 5720 num samples: 2 num padding tokens: 100 - rank: 1 max len: 2860 min len: 2760 avg len: 2810.0 num_loss_counted_tokens: 172
total tokens: 7550 num samples: 10 num padding tokens: 676 - rank: 4 max len: 755 min len: 637 avg len: 687.4 num_loss_counted_tokens: 3236
{
"epoch": 0,
"step": 40,
"rank": 0,
"loss": 0.017504405230283737,
"overall_throughput": 43.000670432034845,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.411304473876953,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23323,
"batch_size": 71,
"total_loss": 0.6515514850616455,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:44.295336"
}
total tokens: 7617 num samples: 3 num padding tokens: 1777 - rank: 2 max len: 2539 min len: 1345 avg len: 1946.6666666666667 num_loss_counted_tokens: 824
total tokens: 7627 num samples: 29 num padding tokens: 2077 - rank: 7 max len: 263 min len: 91 avg len: 191.3793103448276 num_loss_counted_tokens: 2397
total tokens: 8024 num samples: 8 num padding tokens: 1091 - rank: 3 max len: 1003 min len: 759 avg len: 866.625 num_loss_counted_tokens: 3193
total tokens: 7800 num samples: 20 num padding tokens: 1486 - rank: 6 max len: 390 min len: 264 avg len: 315.7 num_loss_counted_tokens: 3339
total tokens: 6426 num samples: 2 num padding tokens: 109 - rank: 0 max len: 3213 min len: 3104 avg len: 3158.5 num_loss_counted_tokens: 160
total tokens: 8099 num samples: 13 num padding tokens: 1516 - rank: 5 max len: 623 min len: 406 avg len: 506.38461538461536 num_loss_counted_tokens: 4165
Per-token loss scaled by world size: 0.00030310056172311306
Per-token loss scaled by world size: 0.0002651209069881588
Per-token loss scaled by world size: 0.00026846988475881517
Per-token loss scaled by world size: 0.00022448382514994591
Per-token loss scaled by world size: 2.106957708747359e-06
Per-token loss scaled by world size: 0.0002726210805121809
Per-token loss scaled by world size: 9.102401236305013e-05
Epoch: 0, Step: 41, Rank: 2, loss = 0.8657191395759583
Epoch: 0, Step: 41, Rank: 6, loss = 0.876654863357544
Epoch: 0, Step: 41, Rank: 5, loss = 0.9897370338439941
Epoch: 0, Step: 41, Rank: 4, loss = 0.7330238819122314
Epoch: 0, Step: 41, Rank: 7, loss = 0.8902100324630737
Epoch: 0, Step: 41, Rank: 0, loss = 0.006880007218569517
Epoch: 0, Step: 41, Rank: 1, loss = 0.29722753167152405
Per-token loss scaled by world size: 0.00029196811374276876
Epoch: 0, Step: 41, Rank: 3, loss = 0.9533854126930237
Epoch 0: 34%|███▍ | 41/121 [01:45<03:23, 2.55s/it]
total tokens: 7158 num samples: 3 num padding tokens: 546 - rank: 1 max len: 2386 min len: 2074 avg len: 2204.0 num_loss_counted_tokens: 274
total tokens: 8024 num samples: 8 num padding tokens: 1137 - rank: 4 max len: 1003 min len: 719 avg len: 860.875 num_loss_counted_tokens: 6152
{
"epoch": 0,
"step": 41,
"rank": 0,
"loss": 0.006880007218569517,
"overall_throughput": 42.297091054292345,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.25260829925537,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26123,
"batch_size": 88,
"total_loss": 0.7016047239303589,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:46.855893"
}
total tokens: 6468 num samples: 22 num padding tokens: 2174 - rank: 7 max len: 294 min len: 86 avg len: 195.1818181818182 num_loss_counted_tokens: 2218
total tokens: 8112 num samples: 4 num padding tokens: 1504 - rank: 2 max len: 2028 min len: 1392 avg len: 1652.0 num_loss_counted_tokens: 1907
total tokens: 7936 num samples: 16 num padding tokens: 1181 - rank: 6 max len: 496 min len: 317 avg len: 422.1875 num_loss_counted_tokens: 3813
total tokens: 5644 num samples: 2 num padding tokens: 18 - rank: 0 max len: 2822 min len: 2804 avg len: 2813.0 num_loss_counted_tokens: 165
total tokens: 7667 num samples: 11 num padding tokens: 1375 - rank: 5 max len: 697 min len: 508 avg len: 572.0 num_loss_counted_tokens: 4263
total tokens: 6950 num samples: 5 num padding tokens: 1094 - rank: 3 max len: 1390 min len: 1006 avg len: 1171.2 num_loss_counted_tokens: 2509
Per-token loss scaled by world size: 0.0001050201608450152
Per-token loss scaled by world size: 0.00023053436598274857
Per-token loss scaled by world size: 0.0009069386287592351
Per-token loss scaled by world size: 0.0004071406729053706
Per-token loss scaled by world size: 0.0005334314191713929
Per-token loss scaled by world size: 4.823124982067384e-06
Per-token loss scaled by world size: 7.580199599033222e-05
Epoch: 0, Step: 42, Rank: 3, loss = 0.4843238890171051
Epoch: 0, Step: 42, Rank: 6, loss = 1.9053646326065063
Epoch: 0, Step: 42, Rank: 0, loss = 0.010132783092558384
Epoch: 0, Step: 42, Rank: 4, loss = 0.8553516864776611
Epoch: 0, Step: 42, Rank: 2, loss = 0.22063423693180084
Epoch: 0, Step: 42, Rank: 7, loss = 1.1206727027893066
Epoch: 0, Step: 42, Rank: 1, loss = 0.15925051271915436
Per-token loss scaled by world size: 0.0006252930616028607
Epoch: 0, Step: 42, Rank: 5, loss = 1.3136625289916992
Epoch 0: 35%|███▍ | 42/121 [01:47<03:20, 2.53s/it]
total tokens: 8063 num samples: 11 num padding tokens: 996 - rank: 4 max len: 733 min len: 561 avg len: 642.4545454545455 num_loss_counted_tokens: 4498
total tokens: 6616 num samples: 4 num padding tokens: 1100 - rank: 1 max len: 1654 min len: 1144 avg len: 1379.0 num_loss_counted_tokens: 966
{
"epoch": 0,
"step": 42,
"rank": 0,
"loss": 0.010132783092558384,
"overall_throughput": 43.158935099078796,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.46029806137085,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 16807,
"batch_size": 70,
"total_loss": 0.758674144744873,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:49.361713"
}
total tokens: 7756 num samples: 14 num padding tokens: 706 - rank: 5 max len: 554 min len: 470 avg len: 503.57142857142856 num_loss_counted_tokens: 4826
total tokens: 7803 num samples: 17 num padding tokens: 1773 - rank: 6 max len: 459 min len: 256 avg len: 354.70588235294116 num_loss_counted_tokens: 2701
total tokens: 7983 num samples: 9 num padding tokens: 500 - rank: 3 max len: 887 min len: 781 avg len: 831.4444444444445 num_loss_counted_tokens: 4578
total tokens: 7525 num samples: 7 num padding tokens: 786 - rank: 2 max len: 1075 min len: 894 avg len: 962.7142857142857 num_loss_counted_tokens: 3223
total tokens: 6855 num samples: 3 num padding tokens: 919 - rank: 0 max len: 2285 min len: 1795 avg len: 1978.6666666666667 num_loss_counted_tokens: 310
total tokens: 7808 num samples: 32 num padding tokens: 2971 - rank: 7 max len: 244 min len: 79 avg len: 151.15625 num_loss_counted_tokens: 1950
Per-token loss scaled by world size: 0.00014639626897405833
Per-token loss scaled by world size: 0.00042091766954399645
Per-token loss scaled by world size: 0.00011771616118494421
Per-token loss scaled by world size: 0.00020373229926917702
Per-token loss scaled by world size: 0.00029096510843373835
Per-token loss scaled by world size: 0.0002795852196868509
Per-token loss scaled by world size: 0.0003187089751008898
Epoch: 0, Step: 43, Rank: 5, loss = 1.3902910947799683
Epoch: 0, Step: 43, Rank: 4, loss = 0.6729277968406677
Epoch: 0, Step: 43, Rank: 2, loss = 0.3888164758682251
Epoch: 0, Step: 43, Rank: 1, loss = 0.4835468530654907
Epoch: 0, Step: 43, Rank: 3, loss = 1.0526957511901855
Epoch: 0, Step: 43, Rank: 7, loss = 0.9234700202941895
Epoch: 0, Step: 43, Rank: 6, loss = 0.9610577821731567
Per-token loss scaled by world size: 1.8369590179645456e-05
Epoch: 0, Step: 43, Rank: 0, loss = 0.0606747567653656
Epoch 0: 36%|███▌ | 43/121 [01:50<03:19, 2.56s/it]
total tokens: 7484 num samples: 4 num padding tokens: 327 - rank: 1 max len: 1871 min len: 1693 avg len: 1789.25 num_loss_counted_tokens: 3106
total tokens: 8050 num samples: 10 num padding tokens: 630 - rank: 4 max len: 805 min len: 673 avg len: 742.0 num_loss_counted_tokens: 6379
total tokens: 7696 num samples: 8 num padding tokens: 569 - rank: 3 max len: 962 min len: 816 avg len: 890.875 num_loss_counted_tokens: 5177
{
"epoch": 0,
"step": 43,
"rank": 0,
"loss": 0.0606747567653656,
"overall_throughput": 41.53024928853794,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.530761241912842,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26424,
"batch_size": 80,
"total_loss": 0.7416850328445435,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:51.968138"
}
total tokens: 7056 num samples: 28 num padding tokens: 3094 - rank: 7 max len: 252 min len: 75 avg len: 141.5 num_loss_counted_tokens: 1402
total tokens: 8086 num samples: 13 num padding tokens: 1584 - rank: 5 max len: 622 min len: 401 avg len: 500.15384615384613 num_loss_counted_tokens: 4091
total tokens: 5748 num samples: 2 num padding tokens: 405 - rank: 0 max len: 2874 min len: 2469 avg len: 2671.5 num_loss_counted_tokens: 176
total tokens: 7158 num samples: 6 num padding tokens: 886 - rank: 2 max len: 1193 min len: 985 avg len: 1045.3333333333333 num_loss_counted_tokens: 2751
total tokens: 7780 num samples: 20 num padding tokens: 1751 - rank: 6 max len: 389 min len: 257 avg len: 301.45 num_loss_counted_tokens: 3387
Per-token loss scaled by world size: 0.0004709336790256202
Per-token loss scaled by world size: 0.00041861337376758456
Per-token loss scaled by world size: 0.00045743229566141963
Per-token loss scaled by world size: 0.00041176279773935676
Per-token loss scaled by world size: 1.3287355614011176e-05
Per-token loss scaled by world size: 0.00043119132169522345
Per-token loss scaled by world size: 0.0003093344275839627
Epoch: 0, Step: 44, Rank: 5, loss = 1.1258552074432373
Epoch: 0, Step: 44, Rank: 1, loss = 1.030312180519104
Epoch: 0, Step: 44, Rank: 7, loss = 1.1590855121612549
Epoch: 0, Step: 44, Rank: 0, loss = 0.03270350396633148
Epoch: 0, Step: 44, Rank: 4, loss = 1.0134512186050415
Epoch: 0, Step: 44, Rank: 6, loss = 1.0612696409225464
Epoch: 0, Step: 44, Rank: 3, loss = 0.7613493800163269
Per-token loss scaled by world size: 6.520144233945757e-05
Epoch: 0, Step: 44, Rank: 2, loss = 0.16047704219818115
Epoch 0: 36%|███▋ | 44/121 [01:52<03:17, 2.57s/it]
total tokens: 5546 num samples: 2 num padding tokens: 467 - rank: 1 max len: 2773 min len: 2306 avg len: 2539.5 num_loss_counted_tokens: 136
total tokens: 7947 num samples: 9 num padding tokens: 1026 - rank: 4 max len: 883 min len: 694 avg len: 769.0 num_loss_counted_tokens: 4965
{
"epoch": 0,
"step": 44,
"rank": 0,
"loss": 0.03270350396633148,
"overall_throughput": 41.61664148118311,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.250525951385498,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19690,
"batch_size": 72,
"total_loss": 0.7930629253387451,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:54.572590"
}
total tokens: 7632 num samples: 6 num padding tokens: 867 - rank: 3 max len: 1272 min len: 946 avg len: 1127.5 num_loss_counted_tokens: 1800
total tokens: 6420 num samples: 30 num padding tokens: 2314 - rank: 7 max len: 214 min len: 73 avg len: 136.86666666666667 num_loss_counted_tokens: 1458
total tokens: 6984 num samples: 4 num padding tokens: 806 - rank: 2 max len: 1746 min len: 1334 avg len: 1544.5 num_loss_counted_tokens: 1942
total tokens: 6094 num samples: 2 num padding tokens: 150 - rank: 0 max len: 3047 min len: 2897 avg len: 2972.0 num_loss_counted_tokens: 493
total tokens: 7900 num samples: 20 num padding tokens: 1997 - rank: 6 max len: 395 min len: 231 avg len: 295.15 num_loss_counted_tokens: 3387
total tokens: 8052 num samples: 12 num padding tokens: 956 - rank: 5 max len: 671 min len: 482 avg len: 591.3333333333334 num_loss_counted_tokens: 5097
Per-token loss scaled by world size: 0.0003629309358075261
Per-token loss scaled by world size: 0.00019072710711043328
Per-token loss scaled by world size: 0.0003041441086679697
Per-token loss scaled by world size: 0.00040468695806339383
Per-token loss scaled by world size: 5.418559885583818e-05
Per-token loss scaled by world size: 7.64827273087576e-05
Per-token loss scaled by world size: 0.00011607163469307125
Epoch: 0, Step: 45, Rank: 6, loss = 1.2099664211273193
Epoch: 0, Step: 45, Rank: 2, loss = 0.6358603239059448
Epoch: 0, Step: 45, Rank: 0, loss = 0.2549838423728943
Epoch: 0, Step: 45, Rank: 5, loss = 1.3491756916046143
Epoch: 0, Step: 45, Rank: 4, loss = 1.0139784812927246
Epoch: 0, Step: 45, Rank: 1, loss = 0.18064801394939423
Per-token loss scaled by world size: 0.00039161398308351636
Epoch: 0, Step: 45, Rank: 7, loss = 0.38696831464767456
Epoch: 0, Step: 45, Rank: 3, loss = 1.3055920600891113
Epoch 0: 37%|███▋ | 45/121 [01:55<03:14, 2.57s/it]
total tokens: 7215 num samples: 3 num padding tokens: 642 - rank: 1 max len: 2405 min len: 2050 avg len: 2191.0 num_loss_counted_tokens: 335
total tokens: 8046 num samples: 9 num padding tokens: 1094 - rank: 4 max len: 894 min len: 726 avg len: 772.4444444444445 num_loss_counted_tokens: 5000
total tokens: 4725 num samples: 25 num padding tokens: 1257 - rank: 7 max len: 189 min len: 75 avg len: 138.72 num_loss_counted_tokens: 1367
{
"epoch": 0,
"step": 45,
"rank": 0,
"loss": 0.2549838423728943,
"overall_throughput": 42.42738944609394,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.32686471939087,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26671,
"batch_size": 93,
"total_loss": 0.792146623134613,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:57.124489"
}
total tokens: 7820 num samples: 17 num padding tokens: 2391 - rank: 6 max len: 460 min len: 196 avg len: 319.3529411764706 num_loss_counted_tokens: 2893
total tokens: 7320 num samples: 6 num padding tokens: 832 - rank: 3 max len: 1220 min len: 937 avg len: 1081.3333333333333 num_loss_counted_tokens: 4277
total tokens: 7832 num samples: 11 num padding tokens: 1541 - rank: 5 max len: 712 min len: 471 avg len: 571.9090909090909 num_loss_counted_tokens: 4862
total tokens: 6974 num samples: 2 num padding tokens: 808 - rank: 0 max len: 3487 min len: 2679 avg len: 3083.0 num_loss_counted_tokens: 194
total tokens: 6796 num samples: 4 num padding tokens: 645 - rank: 2 max len: 1699 min len: 1367 avg len: 1537.75 num_loss_counted_tokens: 2448
Per-token loss scaled by world size: 0.00021682196529582143
Per-token loss scaled by world size: 0.00032285196357406676
Per-token loss scaled by world size: 0.00028426622156985104
Per-token loss scaled by world size: 0.000325443601468578
Per-token loss scaled by world size: 0.00019496992172207683
Per-token loss scaled by world size: 0.0003603589429985732
Per-token loss scaled by world size: 9.507144568488002e-05
Epoch: 0, Step: 46, Rank: 5, loss = 1.2730580568313599
Epoch: 0, Step: 46, Rank: 6, loss = 1.0042414665222168
Epoch: 0, Step: 46, Rank: 3, loss = 0.7659777998924255
Epoch: 0, Step: 46, Rank: 4, loss = 1.1497108936309814
Epoch: 0, Step: 46, Rank: 2, loss = 1.1405552625656128
Epoch: 0, Step: 46, Rank: 7, loss = 0.6887800097465515
Epoch: 0, Step: 46, Rank: 1, loss = 0.3358636498451233
Per-token loss scaled by world size: 4.201751289656386e-05
Epoch: 0, Step: 46, Rank: 0, loss = 0.14843736588954926
Epoch 0: 38%|███▊ | 46/121 [01:57<03:12, 2.57s/it]
total tokens: 6948 num samples: 4 num padding tokens: 662 - rank: 1 max len: 1737 min len: 1416 avg len: 1571.5 num_loss_counted_tokens: 3855
total tokens: 7308 num samples: 9 num padding tokens: 773 - rank: 4 max len: 812 min len: 625 avg len: 726.1111111111111 num_loss_counted_tokens: 4386
{
"epoch": 0,
"step": 46,
"rank": 0,
"loss": 0.14843736588954926,
"overall_throughput": 42.16837318958055,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.52260398864746,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 28262,
"batch_size": 94,
"total_loss": 0.8133281469345093,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:49:59.691426"
}
total tokens: 8021 num samples: 13 num padding tokens: 1587 - rank: 5 max len: 617 min len: 431 avg len: 494.9230769230769 num_loss_counted_tokens: 4488
total tokens: 7343 num samples: 7 num padding tokens: 562 - rank: 3 max len: 1049 min len: 905 avg len: 968.7142857142857 num_loss_counted_tokens: 4549
total tokens: 7704 num samples: 18 num padding tokens: 1519 - rank: 6 max len: 428 min len: 278 avg len: 343.6111111111111 num_loss_counted_tokens: 3313
total tokens: 8004 num samples: 29 num padding tokens: 3202 - rank: 7 max len: 276 min len: 86 avg len: 165.58620689655172 num_loss_counted_tokens: 1941
total tokens: 8016 num samples: 6 num padding tokens: 940 - rank: 2 max len: 1336 min len: 1075 avg len: 1179.3333333333333 num_loss_counted_tokens: 3672
total tokens: 7050 num samples: 3 num padding tokens: 713 - rank: 0 max len: 2350 min len: 1747 avg len: 2112.3333333333335 num_loss_counted_tokens: 1822
Per-token loss scaled by world size: 0.00035897750058211386
Per-token loss scaled by world size: 0.00042433346970938146
Per-token loss scaled by world size: 0.00017432670574635267
Per-token loss scaled by world size: 0.0005883869016543031
Per-token loss scaled by world size: 0.00017508945893496275
Per-token loss scaled by world size: 0.00023119074467103928
Per-token loss scaled by world size: 0.0002113927184836939
Epoch: 0, Step: 47, Rank: 5, loss = 1.6370394229888916
Epoch: 0, Step: 47, Rank: 6, loss = 1.1806018352508545
Epoch: 0, Step: 47, Rank: 1, loss = 0.4850204885005951
Epoch: 0, Step: 47, Rank: 4, loss = 0.9987651109695435
Epoch: 0, Step: 47, Rank: 7, loss = 0.6432304382324219
Epoch: 0, Step: 47, Rank: 2, loss = 0.4871426522731781
Epoch: 0, Step: 47, Rank: 3, loss = 0.5881474018096924
Per-token loss scaled by world size: 3.264834595029242e-05
Epoch: 0, Step: 47, Rank: 0, loss = 0.09083586186170578
Epoch 0: 39%|███▉ | 47/121 [02:00<03:10, 2.58s/it]
total tokens: 5918 num samples: 2 num padding tokens: 291 - rank: 1 max len: 2959 min len: 2668 avg len: 2813.5 num_loss_counted_tokens: 895
total tokens: 5478 num samples: 22 num padding tokens: 2168 - rank: 7 max len: 249 min len: 85 avg len: 150.45454545454547 num_loss_counted_tokens: 1254
total tokens: 7456 num samples: 8 num padding tokens: 650 - rank: 4 max len: 932 min len: 776 avg len: 850.75 num_loss_counted_tokens: 5212
total tokens: 7815 num samples: 15 num padding tokens: 1566 - rank: 6 max len: 521 min len: 301 avg len: 416.6 num_loss_counted_tokens: 3998
{
"epoch": 0,
"step": 47,
"rank": 0,
"loss": 0.09083586186170578,
"overall_throughput": 41.45521968839562,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.520647048950195,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22258,
"batch_size": 86,
"total_loss": 0.7638478875160217,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:02.300320"
}
total tokens: 7480 num samples: 4 num padding tokens: 1532 - rank: 2 max len: 1870 min len: 1277 avg len: 1487.0 num_loss_counted_tokens: 1969
total tokens: 6978 num samples: 6 num padding tokens: 608 - rank: 3 max len: 1163 min len: 946 avg len: 1061.6666666666667 num_loss_counted_tokens: 4439
total tokens: 7890 num samples: 2 num padding tokens: 381 - rank: 0 max len: 3945 min len: 3564 avg len: 3754.5 num_loss_counted_tokens: 441
total tokens: 8019 num samples: 11 num padding tokens: 950 - rank: 5 max len: 729 min len: 528 avg len: 642.6363636363636 num_loss_counted_tokens: 5004
Per-token loss scaled by world size: 0.0004503819509409368
Per-token loss scaled by world size: 0.0003560640325304121
Per-token loss scaled by world size: 0.0002961498685181141
Per-token loss scaled by world size: 0.0003928189689759165
Per-token loss scaled by world size: 6.809273327235132e-05
Per-token loss scaled by world size: 3.832683887594612e-06
Per-token loss scaled by world size: 5.2919685913366266e-06
Epoch: 0, Step: 48, Rank: 7, loss = 1.0013855695724487
Epoch: 0, Step: 48, Rank: 3, loss = 0.8328844904899597
Epoch: 0, Step: 48, Rank: 6, loss = 1.2666429281234741
Epoch: 0, Step: 48, Rank: 4, loss = 1.1047542095184326
Epoch: 0, Step: 48, Rank: 0, loss = 0.010778944939374924
Epoch: 0, Step: 48, Rank: 2, loss = 0.19150230288505554
Epoch: 0, Step: 48, Rank: 1, loss = 0.014883000403642654
Per-token loss scaled by world size: 0.00040955503936856985
Epoch: 0, Step: 48, Rank: 5, loss = 1.1518223285675049
Epoch 0: 40%|███▉ | 48/121 [02:03<03:07, 2.56s/it]
total tokens: 7580 num samples: 10 num padding tokens: 754 - rank: 4 max len: 758 min len: 613 avg len: 682.6 num_loss_counted_tokens: 3916
total tokens: 7806 num samples: 3 num padding tokens: 1441 - rank: 1 max len: 2602 min len: 1696 avg len: 2121.6666666666665 num_loss_counted_tokens: 1824
{
"epoch": 0,
"step": 48,
"rank": 0,
"loss": 0.010778944939374924,
"overall_throughput": 42.77144474456274,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.303375244140625,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22499,
"batch_size": 76,
"total_loss": 0.6968317627906799,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:04.867177"
}
total tokens: 7440 num samples: 30 num padding tokens: 2674 - rank: 7 max len: 248 min len: 85 avg len: 158.86666666666667 num_loss_counted_tokens: 1971
total tokens: 7644 num samples: 7 num padding tokens: 772 - rank: 3 max len: 1092 min len: 894 avg len: 981.7142857142857 num_loss_counted_tokens: 5193
total tokens: 7580 num samples: 2 num padding tokens: 877 - rank: 0 max len: 3790 min len: 2913 avg len: 3351.5 num_loss_counted_tokens: 999
total tokens: 8025 num samples: 5 num padding tokens: 1188 - rank: 2 max len: 1605 min len: 1101 avg len: 1367.4 num_loss_counted_tokens: 2038
total tokens: 8100 num samples: 18 num padding tokens: 2074 - rank: 6 max len: 450 min len: 258 avg len: 334.77777777777777 num_loss_counted_tokens: 3301
total tokens: 7709 num samples: 13 num padding tokens: 980 - rank: 5 max len: 593 min len: 457 avg len: 517.6153846153846 num_loss_counted_tokens: 4236
Per-token loss scaled by world size: 0.0005574136739596725
Per-token loss scaled by world size: 0.0002091079077217728
Per-token loss scaled by world size: 0.0003604785306379199
Per-token loss scaled by world size: 0.00025925517547875643
Per-token loss scaled by world size: 0.00025941740022972226
Per-token loss scaled by world size: 0.0002179427247028798
Per-token loss scaled by world size: 5.799124210170703e-06
Epoch: 0, Step: 49, Rank: 6, loss = 1.024795413017273
Epoch: 0, Step: 49, Rank: 3, loss = 0.5944676399230957
Epoch: 0, Step: 49, Rank: 5, loss = 1.5846574306488037
Epoch: 0, Step: 49, Rank: 1, loss = 0.6195839047431946
Epoch: 0, Step: 49, Rank: 4, loss = 0.7370300889015198
Epoch: 0, Step: 49, Rank: 0, loss = 0.016486184671521187
Epoch: 0, Step: 49, Rank: 7, loss = 0.737491250038147
Per-token loss scaled by world size: 0.00016608217265456915
Epoch: 0, Step: 49, Rank: 2, loss = 0.47215086221694946
Epoch 0: 40%|████ | 49/121 [02:05<03:04, 2.57s/it]
total tokens: 7560 num samples: 6 num padding tokens: 946 - rank: 1 max len: 1260 min len: 979 avg len: 1102.3333333333333 num_loss_counted_tokens: 3644
total tokens: 7788 num samples: 12 num padding tokens: 734 - rank: 4 max len: 649 min len: 534 avg len: 587.8333333333334 num_loss_counted_tokens: 4810
{
"epoch": 0,
"step": 49,
"rank": 0,
"loss": 0.016486184671521187,
"overall_throughput": 41.99615706250001,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.364055633544922,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22743,
"batch_size": 77,
"total_loss": 0.7233328223228455,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:07.406672"
}
total tokens: 7995 num samples: 15 num padding tokens: 1460 - rank: 5 max len: 533 min len: 364 avg len: 435.6666666666667 num_loss_counted_tokens: 4262
total tokens: 3168 num samples: 18 num padding tokens: 844 - rank: 7 max len: 176 min len: 76 avg len: 129.11111111111111 num_loss_counted_tokens: 813
total tokens: 7920 num samples: 22 num padding tokens: 2223 - rank: 6 max len: 360 min len: 188 avg len: 258.95454545454544 num_loss_counted_tokens: 2786
total tokens: 7416 num samples: 8 num padding tokens: 720 - rank: 2 max len: 927 min len: 750 avg len: 837.0 num_loss_counted_tokens: 4900
total tokens: 7248 num samples: 3 num padding tokens: 1314 - rank: 0 max len: 2416 min len: 1340 avg len: 1978.0 num_loss_counted_tokens: 2507
total tokens: 7500 num samples: 10 num padding tokens: 393 - rank: 3 max len: 750 min len: 674 avg len: 710.7 num_loss_counted_tokens: 4132
Per-token loss scaled by world size: 0.0003894062538165599
Per-token loss scaled by world size: 0.0003162138455081731
Per-token loss scaled by world size: 0.00023851577134337276
Per-token loss scaled by world size: 0.00032866382389329374
Per-token loss scaled by world size: 0.00045591729576699436
Per-token loss scaled by world size: 4.146520223002881e-05
Per-token loss scaled by world size: 5.4783604355179705e-06
Epoch: 0, Step: 50, Rank: 7, loss = 0.9038181900978088
Epoch: 0, Step: 50, Rank: 5, loss = 1.113020420074463
Epoch: 0, Step: 50, Rank: 4, loss = 1.3031256198883057
Epoch: 0, Step: 50, Rank: 3, loss = 0.6817377209663391
Epoch: 0, Step: 50, Rank: 2, loss = 0.9394033551216125
Epoch: 0, Step: 50, Rank: 0, loss = 0.01565852388739586
Epoch: 0, Step: 50, Rank: 1, loss = 0.1185179129242897
Per-token loss scaled by world size: 0.00044163045822642744
Epoch: 0, Step: 50, Rank: 6, loss = 1.2622902393341064
Epoch 0: 41%|████▏ | 50/121 [02:08<03:00, 2.54s/it]
total tokens: 7206 num samples: 3 num padding tokens: 1221 - rank: 1 max len: 2402 min len: 1648 avg len: 1995.0 num_loss_counted_tokens: 854
total tokens: 7592 num samples: 8 num padding tokens: 1031 - rank: 4 max len: 949 min len: 737 avg len: 820.125 num_loss_counted_tokens: 4482
{
"epoch": 0,
"step": 50,
"rank": 0,
"loss": 0.01565852388739586,
"overall_throughput": 43.999185393816596,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.364055633544922,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22866,
"batch_size": 99,
"total_loss": 0.79219651222229,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:09.869463"
}
total tokens: 7938 num samples: 14 num padding tokens: 2583 - rank: 6 max len: 567 min len: 269 avg len: 382.5 num_loss_counted_tokens: 3524
total tokens: 7014 num samples: 6 num padding tokens: 736 - rank: 3 max len: 1169 min len: 949 avg len: 1046.3333333333333 num_loss_counted_tokens: 3459
total tokens: 7525 num samples: 5 num padding tokens: 656 - rank: 2 max len: 1505 min len: 1205 avg len: 1373.8 num_loss_counted_tokens: 3152
total tokens: 8070 num samples: 30 num padding tokens: 3026 - rank: 7 max len: 269 min len: 86 avg len: 168.13333333333333 num_loss_counted_tokens: 2116
total tokens: 7821 num samples: 11 num padding tokens: 546 - rank: 5 max len: 711 min len: 583 avg len: 661.3636363636364 num_loss_counted_tokens: 4062
total tokens: 6454 num samples: 2 num padding tokens: 77 - rank: 0 max len: 3227 min len: 3150 avg len: 3188.5 num_loss_counted_tokens: 196
Per-token loss scaled by world size: 0.00025689046015031636
Per-token loss scaled by world size: 0.0004693289229180664
Per-token loss scaled by world size: 0.0002972199581563473
Per-token loss scaled by world size: 0.00019820936722680926
Per-token loss scaled by world size: 0.0002842875546775758
Per-token loss scaled by world size: 4.825916403206065e-05
Per-token loss scaled by world size: 2.7838473215524573e-06
Epoch: 0, Step: 51, Rank: 6, loss = 1.3355927467346191
Epoch: 0, Step: 51, Rank: 7, loss = 0.8458136916160583
Epoch: 0, Step: 51, Rank: 2, loss = 0.7310460209846497
Epoch: 0, Step: 51, Rank: 1, loss = 0.13733351230621338
Epoch: 0, Step: 51, Rank: 4, loss = 0.5640543103218079
Epoch: 0, Step: 51, Rank: 3, loss = 0.8090112805366516
Epoch: 0, Step: 51, Rank: 0, loss = 0.007922133430838585
Per-token loss scaled by world size: 0.0003865555045194924
Epoch: 0, Step: 51, Rank: 5, loss = 1.100040316581726
Epoch 0: 42%|████▏ | 51/121 [02:10<02:58, 2.55s/it]
total tokens: 7845 num samples: 5 num padding tokens: 525 - rank: 1 max len: 1569 min len: 1326 avg len: 1464.0 num_loss_counted_tokens: 3941
total tokens: 7520 num samples: 10 num padding tokens: 686 - rank: 4 max len: 752 min len: 619 avg len: 683.4 num_loss_counted_tokens: 4307
{
"epoch": 0,
"step": 51,
"rank": 0,
"loss": 0.007922133430838585,
"overall_throughput": 42.29003789361672,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.354268074035645,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22766,
"batch_size": 70,
"total_loss": 0.6913517117500305,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:12.436230"
}
total tokens: 7780 num samples: 20 num padding tokens: 1768 - rank: 6 max len: 389 min len: 232 avg len: 300.6 num_loss_counted_tokens: 3299
total tokens: 5290 num samples: 23 num padding tokens: 1933 - rank: 7 max len: 230 min len: 81 avg len: 145.95652173913044 num_loss_counted_tokens: 1376
total tokens: 7761 num samples: 13 num padding tokens: 933 - rank: 5 max len: 597 min len: 418 avg len: 525.2307692307693 num_loss_counted_tokens: 4968
total tokens: 7912 num samples: 8 num padding tokens: 555 - rank: 3 max len: 989 min len: 821 avg len: 919.625 num_loss_counted_tokens: 5323
total tokens: 7374 num samples: 3 num padding tokens: 1194 - rank: 0 max len: 2458 min len: 1773 avg len: 2060.0 num_loss_counted_tokens: 336
total tokens: 7728 num samples: 6 num padding tokens: 549 - rank: 2 max len: 1288 min len: 1035 avg len: 1196.5 num_loss_counted_tokens: 3900
Per-token loss scaled by world size: 0.00019784610776696354
Per-token loss scaled by world size: 0.000183841708349064
Per-token loss scaled by world size: 9.399676491739228e-05
Per-token loss scaled by world size: 0.0001539696240797639
Per-token loss scaled by world size: 0.0002607290807645768
Per-token loss scaled by world size: 0.000336777011398226
Per-token loss scaled by world size: 0.0004033475706819445
Epoch: 0, Step: 52, Rank: 0, loss = 0.30163562297821045
Epoch: 0, Step: 52, Rank: 3, loss = 0.5899480581283569
Epoch: 0, Step: 52, Rank: 2, loss = 0.6348881721496582
Epoch: 0, Step: 52, Rank: 1, loss = 0.4940885007381439
Epoch: 0, Step: 52, Rank: 6, loss = 1.0807174444198608
Epoch: 0, Step: 52, Rank: 7, loss = 0.8366796374320984
Epoch: 0, Step: 52, Rank: 4, loss = 1.2943423986434937
Per-token loss scaled by world size: 0.00024687970289960504
Epoch: 0, Step: 52, Rank: 5, loss = 0.7922369241714478
Epoch 0: 43%|████▎ | 52/121 [02:13<02:54, 2.53s/it]
total tokens: 5434 num samples: 2 num padding tokens: 361 - rank: 1 max len: 2717 min len: 2356 avg len: 2536.5 num_loss_counted_tokens: 214
total tokens: 8091 num samples: 9 num padding tokens: 724 - rank: 4 max len: 899 min len: 723 avg len: 818.5555555555555 num_loss_counted_tokens: 4597
{
"epoch": 0,
"step": 52,
"rank": 0,
"loss": 0.30163562297821045,
"overall_throughput": 43.565651051419955,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.44568157196045,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25672,
"batch_size": 81,
"total_loss": 0.7530670762062073,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:14.922017"
}
total tokens: 7384 num samples: 4 num padding tokens: 597 - rank: 2 max len: 1846 min len: 1598 avg len: 1696.75 num_loss_counted_tokens: 648
total tokens: 5825 num samples: 25 num padding tokens: 2240 - rank: 7 max len: 233 min len: 87 avg len: 143.4 num_loss_counted_tokens: 1267
total tokens: 8040 num samples: 15 num padding tokens: 2522 - rank: 6 max len: 536 min len: 238 avg len: 367.8666666666667 num_loss_counted_tokens: 3089
total tokens: 7434 num samples: 6 num padding tokens: 1318 - rank: 3 max len: 1239 min len: 914 avg len: 1019.3333333333334 num_loss_counted_tokens: 4080
total tokens: 7854 num samples: 11 num padding tokens: 785 - rank: 5 max len: 714 min len: 537 avg len: 642.6363636363636 num_loss_counted_tokens: 3685
total tokens: 7772 num samples: 2 num padding tokens: 602 - rank: 0 max len: 3886 min len: 3284 avg len: 3585.0 num_loss_counted_tokens: 194
Per-token loss scaled by world size: 0.00032663694582879543
Per-token loss scaled by world size: 0.0003048842481803149
Per-token loss scaled by world size: 0.000158169845235534
Per-token loss scaled by world size: 0.00026047308347187936
Per-token loss scaled by world size: 0.00023948316811583936
Per-token loss scaled by world size: 0.00011947691382374614
Per-token loss scaled by world size: 5.034709374740487e-06
Epoch: 0, Step: 53, Rank: 6, loss = 1.0754791498184204
Epoch: 0, Step: 53, Rank: 5, loss = 1.1522117853164673
Epoch: 0, Step: 53, Rank: 1, loss = 0.557944118976593
Epoch: 0, Step: 53, Rank: 4, loss = 0.9188187718391418
Epoch: 0, Step: 53, Rank: 2, loss = 0.4214548170566559
Epoch: 0, Step: 53, Rank: 7, loss = 0.8447768688201904
Epoch: 0, Step: 53, Rank: 0, loss = 0.017759937793016434
Per-token loss scaled by world size: 0.00029550379258580506
Epoch: 0, Step: 53, Rank: 3, loss = 1.0423896312713623
Epoch 0: 44%|████▍ | 53/121 [02:15<02:51, 2.53s/it]
total tokens: 6792 num samples: 3 num padding tokens: 382 - rank: 1 max len: 2264 min len: 2050 avg len: 2136.6666666666665 num_loss_counted_tokens: 2284
total tokens: 7866 num samples: 9 num padding tokens: 794 - rank: 4 max len: 874 min len: 678 avg len: 785.7777777777778 num_loss_counted_tokens: 5164
{
"epoch": 0,
"step": 53,
"rank": 0,
"loss": 0.017759937793016434,
"overall_throughput": 42.74150473533702,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.40515947341919,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 28220,
"batch_size": 101,
"total_loss": 0.7538543939590454,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:17.454732"
}
total tokens: 7777 num samples: 7 num padding tokens: 867 - rank: 3 max len: 1111 min len: 895 avg len: 987.1428571428571 num_loss_counted_tokens: 4712
total tokens: 5152 num samples: 23 num padding tokens: 1800 - rank: 7 max len: 224 min len: 71 avg len: 145.7391304347826 num_loss_counted_tokens: 1286
total tokens: 7932 num samples: 12 num padding tokens: 1042 - rank: 5 max len: 661 min len: 452 avg len: 574.1666666666666 num_loss_counted_tokens: 4158
total tokens: 6598 num samples: 2 num padding tokens: 710 - rank: 0 max len: 3299 min len: 2589 avg len: 2944.0 num_loss_counted_tokens: 177
total tokens: 8100 num samples: 18 num padding tokens: 1853 - rank: 6 max len: 450 min len: 235 avg len: 347.05555555555554 num_loss_counted_tokens: 3720
total tokens: 8076 num samples: 4 num padding tokens: 2220 - rank: 2 max len: 2019 min len: 1211 avg len: 1464.0 num_loss_counted_tokens: 1401
Per-token loss scaled by world size: 0.00026205729227513075
Per-token loss scaled by world size: 0.00021129030210431665
Per-token loss scaled by world size: 0.00041250750655308366
Per-token loss scaled by world size: 0.0004137573123443872
Per-token loss scaled by world size: 0.00038468287675641477
Per-token loss scaled by world size: 6.601931090699509e-06
Per-token loss scaled by world size: 0.0001686068280832842
Epoch: 0, Step: 54, Rank: 5, loss = 1.1955498456954956
Epoch: 0, Step: 54, Rank: 1, loss = 0.612372100353241
Epoch: 0, Step: 54, Rank: 3, loss = 0.7595075368881226
Epoch: 0, Step: 54, Rank: 6, loss = 1.1991721391677856
Epoch: 0, Step: 54, Rank: 4, loss = 1.114907145500183
Epoch: 0, Step: 54, Rank: 0, loss = 0.019134046509861946
Epoch: 0, Step: 54, Rank: 7, loss = 0.48866474628448486
Per-token loss scaled by world size: 0.00018809801258612424
Epoch: 0, Step: 54, Rank: 2, loss = 0.5451550483703613
Epoch 0: 45%|████▍ | 54/121 [02:18<02:49, 2.53s/it]
total tokens: 6536 num samples: 4 num padding tokens: 440 - rank: 1 max len: 1634 min len: 1437 avg len: 1524.0 num_loss_counted_tokens: 1512
total tokens: 7530 num samples: 10 num padding tokens: 291 - rank: 4 max len: 753 min len: 677 avg len: 723.9 num_loss_counted_tokens: 3325
total tokens: 6312 num samples: 24 num padding tokens: 2423 - rank: 7 max len: 263 min len: 87 avg len: 162.04166666666666 num_loss_counted_tokens: 1548
{
"epoch": 0,
"step": 54,
"rank": 0,
"loss": 0.019134046509861946,
"overall_throughput": 42.550455941342655,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.375274658203125,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23186,
"batch_size": 84,
"total_loss": 0.7418078184127808,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:20.000394"
}
total tokens: 7968 num samples: 12 num padding tokens: 974 - rank: 5 max len: 664 min len: 458 avg len: 582.8333333333334 num_loss_counted_tokens: 4692
total tokens: 7020 num samples: 5 num padding tokens: 1164 - rank: 2 max len: 1404 min len: 1078 avg len: 1171.2 num_loss_counted_tokens: 2148
total tokens: 7786 num samples: 17 num padding tokens: 1313 - rank: 6 max len: 458 min len: 268 avg len: 380.7647058823529 num_loss_counted_tokens: 3857
total tokens: 6240 num samples: 3 num padding tokens: 692 - rank: 0 max len: 2080 min len: 1719 avg len: 1849.3333333333333 num_loss_counted_tokens: 339
total tokens: 8032 num samples: 8 num padding tokens: 1051 - rank: 3 max len: 1004 min len: 774 avg len: 872.625 num_loss_counted_tokens: 4797
Per-token loss scaled by world size: 0.00031934864819049835
Per-token loss scaled by world size: 0.00028757311520166695
Per-token loss scaled by world size: 0.0002729191619437188
Per-token loss scaled by world size: 5.039776624471415e-06
Per-token loss scaled by world size: 5.635480647470104e-06
Per-token loss scaled by world size: 0.00025633463519625366
Per-token loss scaled by world size: 0.00021784953423775733
Epoch: 0, Step: 55, Rank: 1, loss = 0.014716777950525284
Epoch: 0, Step: 55, Rank: 4, loss = 0.7969580292701721
Epoch: 0, Step: 55, Rank: 2, loss = 0.8397494554519653
Epoch: 0, Step: 55, Rank: 5, loss = 0.9325379729270935
Epoch: 0, Step: 55, Rank: 0, loss = 0.016456307843327522
Epoch: 0, Step: 55, Rank: 3, loss = 0.7485291957855225
Epoch: 0, Step: 55, Rank: 7, loss = 0.6361478567123413
Per-token loss scaled by world size: 0.0004028878756798804
Epoch: 0, Step: 55, Rank: 6, loss = 1.176482915878296
Epoch 0: 45%|████▌ | 55/121 [02:20<02:48, 2.55s/it]
total tokens: 6669 num samples: 3 num padding tokens: 1016 - rank: 1 max len: 2223 min len: 1683 avg len: 1884.3333333333333 num_loss_counted_tokens: 2050
total tokens: 7551 num samples: 9 num padding tokens: 500 - rank: 4 max len: 839 min len: 731 avg len: 783.4444444444445 num_loss_counted_tokens: 5307
total tokens: 7815 num samples: 5 num padding tokens: 1302 - rank: 2 max len: 1563 min len: 1114 avg len: 1302.6 num_loss_counted_tokens: 3153
total tokens: 6000 num samples: 25 num padding tokens: 1935 - rank: 7 max len: 240 min len: 71 avg len: 162.6 num_loss_counted_tokens: 1696
{
"epoch": 0,
"step": 55,
"rank": 0,
"loss": 0.016456307843327522,
"overall_throughput": 41.61745941199812,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.42158031463623,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23361,
"batch_size": 72,
"total_loss": 0.6451972723007202,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:22.600693"
}
total tokens: 8008 num samples: 8 num padding tokens: 692 - rank: 3 max len: 1001 min len: 855 avg len: 914.5 num_loss_counted_tokens: 2914
total tokens: 7755 num samples: 11 num padding tokens: 872 - rank: 5 max len: 705 min len: 568 avg len: 625.7272727272727 num_loss_counted_tokens: 3890
total tokens: 5474 num samples: 2 num padding tokens: 71 - rank: 0 max len: 2737 min len: 2666 avg len: 2701.5 num_loss_counted_tokens: 203
total tokens: 7952 num samples: 16 num padding tokens: 1838 - rank: 6 max len: 497 min len: 242 avg len: 382.125 num_loss_counted_tokens: 3616
Per-token loss scaled by world size: 0.00027937223785556853
Per-token loss scaled by world size: 0.0003551024419721216
Per-token loss scaled by world size: 0.0003812481591012329
Per-token loss scaled by world size: 0.00039887617458589375
Per-token loss scaled by world size: 0.00017063321138266474
Per-token loss scaled by world size: 0.0001592883054399863
Per-token loss scaled by world size: 5.90530635236064e-06
Epoch: 0, Step: 56, Rank: 5, loss = 1.202885627746582
Epoch: 0, Step: 56, Rank: 6, loss = 1.1203925609588623
Epoch: 0, Step: 56, Rank: 7, loss = 0.8814542889595032
Epoch: 0, Step: 56, Rank: 4, loss = 1.2585041522979736
Epoch: 0, Step: 56, Rank: 3, loss = 0.5025745034217834
Epoch: 0, Step: 56, Rank: 1, loss = 0.5383691191673279
Epoch: 0, Step: 56, Rank: 0, loss = 0.018631979823112488
Per-token loss scaled by world size: 0.0001534399198135361
Epoch: 0, Step: 56, Rank: 2, loss = 0.4841221272945404
Epoch 0: 46%|████▋ | 56/121 [02:23<02:46, 2.55s/it]
total tokens: 8090 num samples: 10 num padding tokens: 807 - rank: 4 max len: 809 min len: 667 avg len: 728.3 num_loss_counted_tokens: 3324
total tokens: 7916 num samples: 4 num padding tokens: 1432 - rank: 1 max len: 1979 min len: 1204 avg len: 1621.0 num_loss_counted_tokens: 3189
{
"epoch": 0,
"step": 56,
"rank": 0,
"loss": 0.018631979823112488,
"overall_throughput": 42.399025208803295,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.21819305419922,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25241,
"batch_size": 80,
"total_loss": 0.7508668899536133,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:25.156511"
}
total tokens: 7794 num samples: 18 num padding tokens: 1747 - rank: 6 max len: 433 min len: 260 avg len: 335.94444444444446 num_loss_counted_tokens: 3466
total tokens: 7560 num samples: 3 num padding tokens: 565 - rank: 0 max len: 2520 min len: 2123 avg len: 2331.6666666666665 num_loss_counted_tokens: 291
total tokens: 7956 num samples: 12 num padding tokens: 1282 - rank: 5 max len: 663 min len: 471 avg len: 556.1666666666666 num_loss_counted_tokens: 3682
total tokens: 7158 num samples: 6 num padding tokens: 910 - rank: 2 max len: 1193 min len: 973 avg len: 1041.3333333333333 num_loss_counted_tokens: 3894
total tokens: 7020 num samples: 27 num padding tokens: 2005 - rank: 7 max len: 260 min len: 75 avg len: 185.74074074074073 num_loss_counted_tokens: 2158
total tokens: 7528 num samples: 8 num padding tokens: 511 - rank: 3 max len: 941 min len: 825 avg len: 877.125 num_loss_counted_tokens: 5733
Per-token loss scaled by world size: 0.0006988957757130265
Per-token loss scaled by world size: 0.0001346730423392728
Per-token loss scaled by world size: 0.0006961524486541748
Per-token loss scaled by world size: 0.0004565907292999327
Per-token loss scaled by world size: 0.0009174557635560632
Per-token loss scaled by world size: 6.04268007009523e-06
Per-token loss scaled by world size: 2.950769612652948e-06
Epoch: 0, Step: 57, Rank: 2, loss = 0.29436159133911133
Epoch: 0, Step: 57, Rank: 7, loss = 1.5216152667999268
Epoch: 0, Step: 57, Rank: 4, loss = 0.9979931712150574
Epoch: 0, Step: 57, Rank: 6, loss = 1.527611494064331
Epoch: 0, Step: 57, Rank: 5, loss = 2.005328893661499
Epoch: 0, Step: 57, Rank: 0, loss = 0.013207787647843361
Epoch: 0, Step: 57, Rank: 1, loss = 0.006449644919484854
Per-token loss scaled by world size: 0.0004939697682857513
Epoch: 0, Step: 57, Rank: 3, loss = 1.079694390296936
Epoch 0: 47%|████▋ | 57/121 [02:25<02:43, 2.55s/it]
total tokens: 7420 num samples: 4 num padding tokens: 675 - rank: 1 max len: 1855 min len: 1520 avg len: 1686.25 num_loss_counted_tokens: 2384
total tokens: 7770 num samples: 10 num padding tokens: 828 - rank: 4 max len: 777 min len: 623 avg len: 694.2 num_loss_counted_tokens: 2018
{
"epoch": 0,
"step": 57,
"rank": 0,
"loss": 0.013207787647843361,
"overall_throughput": 42.49635893812786,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.335302352905273,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 17486,
"batch_size": 87,
"total_loss": 0.9307827949523926,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:27.695911"
}
total tokens: 8062 num samples: 29 num padding tokens: 2902 - rank: 7 max len: 278 min len: 80 avg len: 177.93103448275863 num_loss_counted_tokens: 2355
total tokens: 7464 num samples: 6 num padding tokens: 1141 - rank: 2 max len: 1244 min len: 904 avg len: 1053.8333333333333 num_loss_counted_tokens: 4123
total tokens: 7808 num samples: 16 num padding tokens: 1561 - rank: 6 max len: 488 min len: 309 avg len: 390.4375 num_loss_counted_tokens: 4152
total tokens: 7865 num samples: 13 num padding tokens: 912 - rank: 5 max len: 605 min len: 489 avg len: 534.8461538461538 num_loss_counted_tokens: 4080
total tokens: 7839 num samples: 9 num padding tokens: 355 - rank: 3 max len: 871 min len: 779 avg len: 831.5555555555555 num_loss_counted_tokens: 4751
total tokens: 6874 num samples: 2 num padding tokens: 130 - rank: 0 max len: 3437 min len: 3307 avg len: 3372.0 num_loss_counted_tokens: 164
Per-token loss scaled by world size: 0.0002828103897627443
Per-token loss scaled by world size: 0.0004572872712742537
Per-token loss scaled by world size: 1.0020711442848551e-06
Per-token loss scaled by world size: 0.0006151991547085345
Per-token loss scaled by world size: 0.00043195782927796245
Per-token loss scaled by world size: 7.227377864182927e-06
Per-token loss scaled by world size: 0.00031230703461915255
Epoch: 0, Step: 58, Rank: 6, loss = 1.217584490776062
Epoch: 0, Step: 58, Rank: 0, loss = 0.0026681397575885057
Epoch: 0, Step: 58, Rank: 3, loss = 0.7530180215835571
Epoch: 0, Step: 58, Rank: 4, loss = 1.150141716003418
Epoch: 0, Step: 58, Rank: 5, loss = 1.6380445957183838
Epoch: 0, Step: 58, Rank: 1, loss = 0.019243797287344933
Epoch: 0, Step: 58, Rank: 7, loss = 0.831556499004364
Per-token loss scaled by world size: 7.78582543716766e-05
Epoch: 0, Step: 58, Rank: 2, loss = 0.2073073387145996
Epoch 0: 48%|████▊ | 58/121 [02:28<02:40, 2.56s/it]
total tokens: 6570 num samples: 3 num padding tokens: 500 - rank: 1 max len: 2190 min len: 1919 avg len: 2023.3333333333333 num_loss_counted_tokens: 450
total tokens: 7944 num samples: 8 num padding tokens: 930 - rank: 4 max len: 993 min len: 794 avg len: 876.75 num_loss_counted_tokens: 6270
{
"epoch": 0,
"step": 58,
"rank": 0,
"loss": 0.0026681397575885057,
"overall_throughput": 42.239983019087354,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.242695808410645,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21301,
"batch_size": 71,
"total_loss": 0.7274456024169922,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:30.263573"
}
total tokens: 5496 num samples: 24 num padding tokens: 2346 - rank: 7 max len: 229 min len: 77 avg len: 131.25 num_loss_counted_tokens: 1044
total tokens: 7190 num samples: 2 num padding tokens: 692 - rank: 0 max len: 3595 min len: 2903 avg len: 3249.0 num_loss_counted_tokens: 177
total tokens: 8113 num samples: 19 num padding tokens: 2176 - rank: 6 max len: 427 min len: 230 avg len: 312.4736842105263 num_loss_counted_tokens: 3174
total tokens: 7020 num samples: 4 num padding tokens: 898 - rank: 2 max len: 1755 min len: 1310 avg len: 1530.5 num_loss_counted_tokens: 778
total tokens: 7494 num samples: 6 num padding tokens: 623 - rank: 3 max len: 1249 min len: 1027 avg len: 1145.1666666666667 num_loss_counted_tokens: 2420
total tokens: 8107 num samples: 11 num padding tokens: 1465 - rank: 5 max len: 737 min len: 430 avg len: 603.8181818181819 num_loss_counted_tokens: 4656
Per-token loss scaled by world size: 0.00026943007833324373
Per-token loss scaled by world size: 0.00027157709700986743
Per-token loss scaled by world size: 0.00050318957073614
Per-token loss scaled by world size: 0.0003404158051125705
Per-token loss scaled by world size: 0.0005186275229789317
Per-token loss scaled by world size: 5.279351626086282e-06
Per-token loss scaled by world size: 6.721797399222851e-05
Epoch: 0, Step: 59, Rank: 4, loss = 1.4499406814575195
Epoch: 0, Step: 59, Rank: 7, loss = 0.7825493812561035
Epoch: 0, Step: 59, Rank: 6, loss = 0.7763627767562866
Epoch: 0, Step: 59, Rank: 5, loss = 1.4944251775741577
Epoch: 0, Step: 59, Rank: 2, loss = 0.9809081554412842
Epoch: 0, Step: 59, Rank: 0, loss = 0.015212451107800007
Epoch: 0, Step: 59, Rank: 1, loss = 0.19368860125541687
Per-token loss scaled by world size: 0.0002404269325779751
Epoch: 0, Step: 59, Rank: 3, loss = 0.6927902102470398
Epoch 0: 49%|████▉ | 59/121 [02:31<02:38, 2.56s/it]
total tokens: 7623 num samples: 11 num padding tokens: 1253 - rank: 4 max len: 693 min len: 515 avg len: 579.0909090909091 num_loss_counted_tokens: 4023
total tokens: 7490 num samples: 5 num padding tokens: 1411 - rank: 1 max len: 1498 min len: 1040 avg len: 1215.8 num_loss_counted_tokens: 1224
{
"epoch": 0,
"step": 59,
"rank": 0,
"loss": 0.015212451107800007,
"overall_throughput": 42.27999530101241,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.386998653411865,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23052,
"batch_size": 97,
"total_loss": 0.7982346415519714,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:32.865411"
}
total tokens: 7981 num samples: 23 num padding tokens: 1404 - rank: 6 max len: 347 min len: 241 avg len: 285.95652173913044 num_loss_counted_tokens: 2899
total tokens: 7140 num samples: 30 num padding tokens: 2458 - rank: 7 max len: 238 min len: 77 avg len: 156.06666666666666 num_loss_counted_tokens: 1738
total tokens: 7938 num samples: 9 num padding tokens: 784 - rank: 3 max len: 882 min len: 719 avg len: 794.8888888888889 num_loss_counted_tokens: 5307
total tokens: 7189 num samples: 7 num padding tokens: 565 - rank: 2 max len: 1027 min len: 894 avg len: 946.2857142857143 num_loss_counted_tokens: 3237
total tokens: 8096 num samples: 16 num padding tokens: 1168 - rank: 5 max len: 506 min len: 348 avg len: 433.0 num_loss_counted_tokens: 3888
total tokens: 7446 num samples: 3 num padding tokens: 1212 - rank: 0 max len: 2482 min len: 1681 avg len: 2078.0 num_loss_counted_tokens: 1401
Per-token loss scaled by world size: 0.0002737885224632919
Per-token loss scaled by world size: 0.00014497540541924536
Per-token loss scaled by world size: 0.00019299837003927678
Per-token loss scaled by world size: 0.0003046545316465199
Per-token loss scaled by world size: 0.0004120226949453354
Per-token loss scaled by world size: 0.00017287737864535302
Per-token loss scaled by world size: 9.607095989849768e-07
Epoch: 0, Step: 60, Rank: 2, loss = 0.6385592222213745
Epoch: 0, Step: 60, Rank: 1, loss = 0.4796692430973053
Epoch: 0, Step: 60, Rank: 6, loss = 1.00798761844635
Epoch: 0, Step: 60, Rank: 4, loss = 1.3632285594940186
Epoch: 0, Step: 60, Rank: 3, loss = 0.9058635830879211
Epoch: 0, Step: 60, Rank: 7, loss = 0.5719864368438721
Epoch: 0, Step: 60, Rank: 0, loss = 0.0031786279287189245
Per-token loss scaled by world size: 0.0002788409183267504
Epoch: 0, Step: 60, Rank: 5, loss = 0.9225800633430481
Epoch 0: 50%|████▉ | 60/121 [02:33<02:35, 2.55s/it]
total tokens: 6675 num samples: 3 num padding tokens: 898 - rank: 1 max len: 2225 min len: 1369 avg len: 1925.6666666666667 num_loss_counted_tokens: 1482
total tokens: 7610 num samples: 10 num padding tokens: 836 - rank: 4 max len: 761 min len: 583 avg len: 677.4 num_loss_counted_tokens: 5671
{
"epoch": 0,
"step": 60,
"rank": 0,
"loss": 0.0031786279287189245,
"overall_throughput": 42.89061324541665,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.254440784454346,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26469,
"batch_size": 91,
"total_loss": 0.7366316914558411,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:35.353516"
}
total tokens: 7735 num samples: 17 num padding tokens: 2066 - rank: 6 max len: 455 min len: 253 avg len: 333.47058823529414 num_loss_counted_tokens: 2881
total tokens: 7116 num samples: 6 num padding tokens: 703 - rank: 2 max len: 1186 min len: 991 avg len: 1068.8333333333333 num_loss_counted_tokens: 4466
total tokens: 5920 num samples: 2 num padding tokens: 120 - rank: 0 max len: 2960 min len: 2840 avg len: 2900.0 num_loss_counted_tokens: 179
total tokens: 7696 num samples: 8 num padding tokens: 616 - rank: 3 max len: 962 min len: 775 avg len: 885.0 num_loss_counted_tokens: 4386
total tokens: 8096 num samples: 32 num padding tokens: 2221 - rank: 7 max len: 253 min len: 72 avg len: 183.59375 num_loss_counted_tokens: 2722
total tokens: 7566 num samples: 13 num padding tokens: 544 - rank: 5 max len: 582 min len: 459 avg len: 540.1538461538462 num_loss_counted_tokens: 4576
Per-token loss scaled by world size: 0.0001878267794381827
Per-token loss scaled by world size: 0.0002473706554155797
Per-token loss scaled by world size: 0.0005609646323136985
Per-token loss scaled by world size: 2.916532139352057e-05
Per-token loss scaled by world size: 0.0005521869170479476
Per-token loss scaled by world size: 2.548310840211343e-05
Per-token loss scaled by world size: 0.0002686498628463596
Epoch: 0, Step: 61, Rank: 3, loss = 0.4526155889034271
Epoch: 0, Step: 61, Rank: 0, loss = 0.07028113305568695
Epoch: 0, Step: 61, Rank: 2, loss = 0.5961014032363892
Epoch: 0, Step: 61, Rank: 6, loss = 1.3306324481964111
Epoch: 0, Step: 61, Rank: 4, loss = 1.3517844676971436
Epoch: 0, Step: 61, Rank: 1, loss = 0.061407919973134995
Epoch: 0, Step: 61, Rank: 7, loss = 0.6473789811134338
Per-token loss scaled by world size: 0.000720518350135535
Epoch: 0, Step: 61, Rank: 5, loss = 1.7362691164016724
Epoch 0: 50%|█████ | 61/121 [02:36<02:32, 2.55s/it]
total tokens: 6764 num samples: 4 num padding tokens: 645 - rank: 1 max len: 1691 min len: 1416 avg len: 1529.75 num_loss_counted_tokens: 1674
total tokens: 7416 num samples: 9 num padding tokens: 608 - rank: 4 max len: 824 min len: 677 avg len: 756.4444444444445 num_loss_counted_tokens: 5244
total tokens: 8088 num samples: 12 num padding tokens: 844 - rank: 5 max len: 674 min len: 545 avg len: 603.6666666666666 num_loss_counted_tokens: 4651
{
"epoch": 0,
"step": 61,
"rank": 0,
"loss": 0.07028113305568695,
"overall_throughput": 42.67067345316129,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.295628547668457,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19278,
"batch_size": 85,
"total_loss": 0.7808088660240173,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:37.891642"
}
total tokens: 7800 num samples: 24 num padding tokens: 3935 - rank: 7 max len: 325 min len: 82 avg len: 161.04166666666666 num_loss_counted_tokens: 1576
total tokens: 7965 num samples: 15 num padding tokens: 1199 - rank: 6 max len: 531 min len: 325 avg len: 451.06666666666666 num_loss_counted_tokens: 4050
total tokens: 7488 num samples: 8 num padding tokens: 552 - rank: 3 max len: 936 min len: 830 avg len: 867.0 num_loss_counted_tokens: 2535
total tokens: 6438 num samples: 2 num padding tokens: 1075 - rank: 0 max len: 3219 min len: 2144 avg len: 2681.5 num_loss_counted_tokens: 201
total tokens: 7952 num samples: 7 num padding tokens: 583 - rank: 2 max len: 1136 min len: 993 avg len: 1052.7142857142858 num_loss_counted_tokens: 5968
Per-token loss scaled by world size: 0.0004150475433561951
Per-token loss scaled by world size: 0.0003149463445879519
Per-token loss scaled by world size: 0.000596669502556324
Per-token loss scaled by world size: 6.351516731228912e-06
Per-token loss scaled by world size: 8.575078709327499e-07
Per-token loss scaled by world size: 0.00024007105093915015
Per-token loss scaled by world size: 0.00015118405281100422
Epoch: 0, Step: 62, Rank: 5, loss = 1.5943009853363037
Epoch: 0, Step: 62, Rank: 3, loss = 0.8415366411209106
Epoch: 0, Step: 62, Rank: 0, loss = 0.016971252858638763
Epoch: 0, Step: 62, Rank: 4, loss = 1.1090070009231567
Epoch: 0, Step: 62, Rank: 2, loss = 0.6414698362350464
Epoch: 0, Step: 62, Rank: 1, loss = 0.0022912609856575727
Epoch: 0, Step: 62, Rank: 7, loss = 0.40396377444267273
Per-token loss scaled by world size: 0.0003845185856334865
Epoch: 0, Step: 62, Rank: 6, loss = 1.0274336338043213
Epoch 0: 51%|█████ | 62/121 [02:38<02:29, 2.53s/it]
total tokens: 7392 num samples: 4 num padding tokens: 338 - rank: 1 max len: 1848 min len: 1671 avg len: 1763.5 num_loss_counted_tokens: 2320
total tokens: 8112 num samples: 8 num padding tokens: 1667 - rank: 4 max len: 1014 min len: 705 avg len: 805.625 num_loss_counted_tokens: 4978
{
"epoch": 0,
"step": 62,
"rank": 0,
"loss": 0.016971252858638763,
"overall_throughput": 43.357387151746714,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.4012451171875,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21376,
"batch_size": 77,
"total_loss": 0.7046218514442444,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:40.388446"
}
total tokens: 7860 num samples: 30 num padding tokens: 2842 - rank: 7 max len: 262 min len: 79 avg len: 167.26666666666668 num_loss_counted_tokens: 2118
total tokens: 7711 num samples: 11 num padding tokens: 1331 - rank: 5 max len: 701 min len: 445 avg len: 580.0 num_loss_counted_tokens: 5182
total tokens: 7974 num samples: 18 num padding tokens: 2210 - rank: 6 max len: 443 min len: 270 avg len: 320.22222222222223 num_loss_counted_tokens: 2977
total tokens: 6676 num samples: 4 num padding tokens: 617 - rank: 2 max len: 1669 min len: 1336 avg len: 1514.75 num_loss_counted_tokens: 2305
total tokens: 7974 num samples: 6 num padding tokens: 795 - rank: 3 max len: 1329 min len: 1046 avg len: 1196.5 num_loss_counted_tokens: 3823
total tokens: 6596 num samples: 2 num padding tokens: 195 - rank: 0 max len: 3298 min len: 3103 avg len: 3200.5 num_loss_counted_tokens: 198
Per-token loss scaled by world size: 0.00032304422347806394
Per-token loss scaled by world size: 0.0002737718168646097
Per-token loss scaled by world size: 0.0002592895762063563
Per-token loss scaled by world size: 0.0002177765272790566
Per-token loss scaled by world size: 0.00020476435020100325
Per-token loss scaled by world size: 0.00020230024529155344
Per-token loss scaled by world size: 5.3382074838737026e-05
Epoch: 0, Step: 63, Rank: 0, loss = 0.1870107501745224
Epoch: 0, Step: 63, Rank: 6, loss = 0.908356249332428
Epoch: 0, Step: 63, Rank: 4, loss = 0.7173407077789307
Epoch: 0, Step: 63, Rank: 5, loss = 1.1317046880722046
Epoch: 0, Step: 63, Rank: 1, loss = 0.959091067314148
Epoch: 0, Step: 63, Rank: 3, loss = 0.7087083458900452
Epoch: 0, Step: 63, Rank: 7, loss = 0.7629256248474121
Per-token loss scaled by world size: 0.0001610093895578757
Epoch: 0, Step: 63, Rank: 2, loss = 0.5640561580657959
Epoch 0: 52%|█████▏ | 63/121 [02:41<02:27, 2.54s/it]
total tokens: 7680 num samples: 8 num padding tokens: 903 - rank: 4 max len: 960 min len: 763 avg len: 847.125 num_loss_counted_tokens: 3503
total tokens: 7896 num samples: 3 num padding tokens: 1173 - rank: 1 max len: 2632 min len: 1954 avg len: 2241.0 num_loss_counted_tokens: 499
{
"epoch": 0,
"step": 63,
"rank": 0,
"loss": 0.1870107501745224,
"overall_throughput": 42.41853134282842,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.40930986404419,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 28026,
"batch_size": 89,
"total_loss": 0.7423991560935974,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:42.939707"
}
total tokens: 7400 num samples: 10 num padding tokens: 1048 - rank: 5 max len: 740 min len: 555 avg len: 635.2 num_loss_counted_tokens: 4022
total tokens: 7010 num samples: 5 num padding tokens: 576 - rank: 2 max len: 1402 min len: 1179 avg len: 1286.8 num_loss_counted_tokens: 3377
total tokens: 7182 num samples: 27 num padding tokens: 2131 - rank: 7 max len: 266 min len: 74 avg len: 187.07407407407408 num_loss_counted_tokens: 2558
total tokens: 7935 num samples: 15 num padding tokens: 1793 - rank: 6 max len: 529 min len: 277 avg len: 409.46666666666664 num_loss_counted_tokens: 3824
total tokens: 7548 num samples: 2 num padding tokens: 861 - rank: 0 max len: 3774 min len: 2913 avg len: 3343.5 num_loss_counted_tokens: 218
total tokens: 6996 num samples: 6 num padding tokens: 573 - rank: 3 max len: 1166 min len: 965 avg len: 1070.5 num_loss_counted_tokens: 3633
Per-token loss scaled by world size: 0.0005391178419813514
Per-token loss scaled by world size: 0.00022075393644627184
Per-token loss scaled by world size: 8.797919872449711e-05
Per-token loss scaled by world size: 0.00035083515103906393
Per-token loss scaled by world size: 0.0003944748896174133
Per-token loss scaled by world size: 3.479456063359976e-05
Per-token loss scaled by world size: 0.0002142135490430519
Epoch: 0, Step: 64, Rank: 3, loss = 0.64051753282547
Epoch: 0, Step: 64, Rank: 2, loss = 0.25527164340019226
Epoch: 0, Step: 64, Rank: 5, loss = 1.5642504692077637
Epoch: 0, Step: 64, Rank: 4, loss = 1.0179481506347656
Epoch: 0, Step: 64, Rank: 6, loss = 1.144568920135498
Epoch: 0, Step: 64, Rank: 1, loss = 0.10095641762018204
Epoch: 0, Step: 64, Rank: 7, loss = 0.6215406060218811
Per-token loss scaled by world size: 2.743201912380755e-05
Epoch: 0, Step: 64, Rank: 0, loss = 0.07959400117397308
Epoch 0: 53%|█████▎ | 64/121 [02:43<02:26, 2.57s/it]
total tokens: 7266 num samples: 2 num padding tokens: 794 - rank: 1 max len: 3633 min len: 2839 avg len: 3236.0 num_loss_counted_tokens: 186
total tokens: 7720 num samples: 8 num padding tokens: 848 - rank: 4 max len: 965 min len: 791 avg len: 859.0 num_loss_counted_tokens: 4634
{
"epoch": 0,
"step": 64,
"rank": 0,
"loss": 0.07959400117397308,
"overall_throughput": 41.08188082412845,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.510859966278076,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23212,
"batch_size": 70,
"total_loss": 0.6780809760093689,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:45.576193"
}
total tokens: 7564 num samples: 31 num padding tokens: 2701 - rank: 7 max len: 244 min len: 82 avg len: 156.8709677419355 num_loss_counted_tokens: 1913
total tokens: 4065 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4065 min len: 4065 avg len: 4065.0 num_loss_counted_tokens: 82
total tokens: 8030 num samples: 5 num padding tokens: 1753 - rank: 3 max len: 1606 min len: 989 avg len: 1255.4 num_loss_counted_tokens: 3735
total tokens: 7720 num samples: 10 num padding tokens: 1003 - rank: 5 max len: 772 min len: 589 avg len: 671.7 num_loss_counted_tokens: 4179
total tokens: 7860 num samples: 4 num padding tokens: 620 - rank: 2 max len: 1965 min len: 1643 avg len: 1810.0 num_loss_counted_tokens: 1863
total tokens: 7800 num samples: 15 num padding tokens: 1926 - rank: 6 max len: 520 min len: 284 avg len: 391.6 num_loss_counted_tokens: 3775
Per-token loss scaled by world size: 0.000316505174851045
Per-token loss scaled by world size: 0.00018296584312338382
Per-token loss scaled by world size: 0.00035575314541347325
Per-token loss scaled by world size: 0.00033105004695244133
Per-token loss scaled by world size: 0.00041151116602122784
Per-token loss scaled by world size: 4.141435056226328e-05
Per-token loss scaled by world size: 0.0004561956156976521
Epoch: 0, Step: 65, Rank: 6, loss = 0.928863525390625
Epoch: 0, Step: 65, Rank: 0, loss = 0.12154076248407364
Epoch: 0, Step: 65, Rank: 1, loss = 0.5369589924812317
Epoch: 0, Step: 65, Rank: 3, loss = 1.0440465211868286
Epoch: 0, Step: 65, Rank: 7, loss = 0.9715490937232971
Epoch: 0, Step: 65, Rank: 5, loss = 1.3388200998306274
Epoch: 0, Step: 65, Rank: 4, loss = 1.2076823711395264
Per-token loss scaled by world size: 0.00015677251212764531
Epoch: 0, Step: 65, Rank: 2, loss = 0.4600881338119507
Epoch 0: 54%|█████▎ | 65/121 [02:46<02:23, 2.57s/it]
total tokens: 6615 num samples: 3 num padding tokens: 1003 - rank: 1 max len: 2205 min len: 1609 avg len: 1870.6666666666667 num_loss_counted_tokens: 351
total tokens: 7942 num samples: 11 num padding tokens: 995 - rank: 4 max len: 722 min len: 567 avg len: 631.5454545454545 num_loss_counted_tokens: 4015
total tokens: 7820 num samples: 20 num padding tokens: 1923 - rank: 6 max len: 391 min len: 224 avg len: 294.85 num_loss_counted_tokens: 3347
{
"epoch": 0,
"step": 65,
"rank": 0,
"loss": 0.12154076248407364,
"overall_throughput": 42.203863937195706,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.473669052124023,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23478,
"batch_size": 88,
"total_loss": 0.8261936902999878,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:48.138254"
}
total tokens: 4515 num samples: 21 num padding tokens: 1409 - rank: 7 max len: 215 min len: 76 avg len: 147.9047619047619 num_loss_counted_tokens: 1105
total tokens: 7320 num samples: 8 num padding tokens: 903 - rank: 3 max len: 915 min len: 724 avg len: 802.125 num_loss_counted_tokens: 4454
total tokens: 7735 num samples: 5 num padding tokens: 1812 - rank: 2 max len: 1547 min len: 1021 avg len: 1184.6 num_loss_counted_tokens: 1072
total tokens: 7896 num samples: 14 num padding tokens: 1195 - rank: 5 max len: 564 min len: 397 avg len: 478.64285714285717 num_loss_counted_tokens: 4241
total tokens: 6444 num samples: 2 num padding tokens: 493 - rank: 0 max len: 3222 min len: 2729 avg len: 2975.5 num_loss_counted_tokens: 461
Per-token loss scaled by world size: 0.00020512452465482056
Per-token loss scaled by world size: 0.00037579398485831916
Per-token loss scaled by world size: 0.00022079057816881686
Per-token loss scaled by world size: 0.0002576705301180482
Per-token loss scaled by world size: 5.168040661374107e-05
Per-token loss scaled by world size: 0.00019716547103598714
Per-token loss scaled by world size: 7.161292160162702e-05
Epoch: 0, Step: 66, Rank: 6, loss = 0.8971443176269531
Epoch: 0, Step: 66, Rank: 4, loss = 1.3084206581115723
Epoch: 0, Step: 66, Rank: 0, loss = 0.17993825674057007
Epoch: 0, Step: 66, Rank: 3, loss = 0.7687376141548157
Epoch: 0, Step: 66, Rank: 2, loss = 0.7141923308372498
Epoch: 0, Step: 66, Rank: 1, loss = 0.6864808797836304
Epoch: 0, Step: 66, Rank: 7, loss = 0.249338299036026
Per-token loss scaled by world size: 0.0002357129706069827
Epoch: 0, Step: 66, Rank: 5, loss = 0.8206936120986938
Epoch 0: 55%|█████▍ | 66/121 [02:48<02:20, 2.55s/it]
total tokens: 6852 num samples: 3 num padding tokens: 921 - rank: 1 max len: 2284 min len: 1726 avg len: 1977.0 num_loss_counted_tokens: 3152
total tokens: 7542 num samples: 9 num padding tokens: 1092 - rank: 4 max len: 838 min len: 652 avg len: 716.6666666666666 num_loss_counted_tokens: 3407
total tokens: 7788 num samples: 33 num padding tokens: 2754 - rank: 7 max len: 236 min len: 71 avg len: 152.54545454545453 num_loss_counted_tokens: 2032
{
"epoch": 0,
"step": 66,
"rank": 0,
"loss": 0.17993825674057007,
"overall_throughput": 43.1739543464684,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.432954788208008,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 27854,
"batch_size": 94,
"total_loss": 0.7031182050704956,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:50.647621"
}
total tokens: 7980 num samples: 19 num padding tokens: 1539 - rank: 6 max len: 420 min len: 240 avg len: 339.0 num_loss_counted_tokens: 3367
total tokens: 7175 num samples: 7 num padding tokens: 309 - rank: 3 max len: 1025 min len: 903 avg len: 980.8571428571429 num_loss_counted_tokens: 5453
total tokens: 7130 num samples: 5 num padding tokens: 1200 - rank: 2 max len: 1426 min len: 1084 avg len: 1186.0 num_loss_counted_tokens: 3905
total tokens: 7388 num samples: 2 num padding tokens: 641 - rank: 0 max len: 3694 min len: 3053 avg len: 3373.5 num_loss_counted_tokens: 603
total tokens: 7668 num samples: 12 num padding tokens: 1327 - rank: 5 max len: 639 min len: 423 avg len: 528.4166666666666 num_loss_counted_tokens: 3896
Per-token loss scaled by world size: 0.00023726793006062508
Per-token loss scaled by world size: 0.00031606658012606204
Per-token loss scaled by world size: 0.000504097668454051
Per-token loss scaled by world size: 0.00039712167927064
Per-token loss scaled by world size: 0.0004929095157422125
Per-token loss scaled by world size: 6.066870355425635e-06
Per-token loss scaled by world size: 2.3820351998438127e-05
Epoch: 0, Step: 67, Rank: 2, loss = 0.6478897333145142
Epoch: 0, Step: 67, Rank: 3, loss = 0.8630593419075012
Epoch: 0, Step: 67, Rank: 6, loss = 1.3765016794204712
Epoch: 0, Step: 67, Rank: 0, loss = 0.01656634733080864
Epoch: 0, Step: 67, Rank: 7, loss = 1.08439040184021
Epoch: 0, Step: 67, Rank: 4, loss = 1.3459510803222656
Epoch: 0, Step: 67, Rank: 1, loss = 0.06504444777965546
Per-token loss scaled by world size: 0.00038537452928721905
Epoch: 0, Step: 67, Rank: 5, loss = 1.0523133277893066
Epoch 0: 55%|█████▌ | 67/121 [02:51<02:16, 2.53s/it]
total tokens: 6858 num samples: 3 num padding tokens: 879 - rank: 1 max len: 2286 min len: 1606 avg len: 1993.0 num_loss_counted_tokens: 2226
total tokens: 8001 num samples: 9 num padding tokens: 744 - rank: 4 max len: 889 min len: 750 avg len: 806.3333333333334 num_loss_counted_tokens: 3999
{
"epoch": 0,
"step": 67,
"rank": 0,
"loss": 0.01656634733080864,
"overall_throughput": 43.254684230948214,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.338647842407227,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21845,
"batch_size": 79,
"total_loss": 0.8064644932746887,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:53.149207"
}
total tokens: 8085 num samples: 11 num padding tokens: 2256 - rank: 5 max len: 735 min len: 410 avg len: 529.9090909090909 num_loss_counted_tokens: 4137
total tokens: 7679 num samples: 7 num padding tokens: 788 - rank: 3 max len: 1097 min len: 910 avg len: 984.4285714285714 num_loss_counted_tokens: 4646
total tokens: 6030 num samples: 2 num padding tokens: 245 - rank: 0 max len: 3015 min len: 2770 avg len: 2892.5 num_loss_counted_tokens: 180
total tokens: 8040 num samples: 20 num padding tokens: 1710 - rank: 6 max len: 402 min len: 246 avg len: 316.5 num_loss_counted_tokens: 2948
total tokens: 7520 num samples: 32 num padding tokens: 2159 - rank: 7 max len: 235 min len: 85 avg len: 167.53125 num_loss_counted_tokens: 2187
total tokens: 7895 num samples: 5 num padding tokens: 1603 - rank: 2 max len: 1579 min len: 1158 avg len: 1258.4 num_loss_counted_tokens: 2148
Per-token loss scaled by world size: 0.00019273992802482098
Per-token loss scaled by world size: 0.00030933329253457487
Per-token loss scaled by world size: 0.00017293139535468072
Per-token loss scaled by world size: 0.00035478913923725486
Per-token loss scaled by world size: 0.0003922785690519959
Per-token loss scaled by world size: 4.257708951627137e-06
Per-token loss scaled by world size: 0.0001415474253008142
Epoch: 0, Step: 68, Rank: 6, loss = 1.0613998174667358
Epoch: 0, Step: 68, Rank: 2, loss = 0.6613388657569885
Epoch: 0, Step: 68, Rank: 5, loss = 1.3460057973861694
Epoch: 0, Step: 68, Rank: 0, loss = 0.014609264209866524
Epoch: 0, Step: 68, Rank: 4, loss = 1.2173702716827393
Epoch: 0, Step: 68, Rank: 1, loss = 0.5933708548545837
Epoch: 0, Step: 68, Rank: 7, loss = 0.4856846034526825
Per-token loss scaled by world size: 0.00021395196381490678
Epoch: 0, Step: 68, Rank: 3, loss = 0.7341226935386658
Epoch 0: 56%|█████▌ | 68/121 [02:53<02:14, 2.54s/it]
total tokens: 8082 num samples: 9 num padding tokens: 769 - rank: 4 max len: 898 min len: 707 avg len: 812.5555555555555 num_loss_counted_tokens: 4363
total tokens: 7876 num samples: 4 num padding tokens: 706 - rank: 1 max len: 1969 min len: 1606 avg len: 1792.5 num_loss_counted_tokens: 565
{
"epoch": 0,
"step": 68,
"rank": 0,
"loss": 0.014609264209866524,
"overall_throughput": 42.56272046948144,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.448001861572266,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 27450,
"batch_size": 88,
"total_loss": 0.7642378211021423,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:55.692310"
}
total tokens: 7612 num samples: 11 num padding tokens: 1187 - rank: 5 max len: 692 min len: 458 avg len: 584.0909090909091 num_loss_counted_tokens: 4392
total tokens: 5586 num samples: 2 num padding tokens: 652 - rank: 0 max len: 2793 min len: 2141 avg len: 2467.0 num_loss_counted_tokens: 151
total tokens: 8060 num samples: 31 num padding tokens: 3056 - rank: 7 max len: 260 min len: 75 avg len: 161.41935483870967 num_loss_counted_tokens: 2101
total tokens: 7220 num samples: 5 num padding tokens: 823 - rank: 2 max len: 1444 min len: 1143 avg len: 1279.4 num_loss_counted_tokens: 1187
total tokens: 7902 num samples: 18 num padding tokens: 1530 - rank: 6 max len: 439 min len: 282 avg len: 354.0 num_loss_counted_tokens: 3796
total tokens: 7882 num samples: 7 num padding tokens: 581 - rank: 3 max len: 1126 min len: 922 avg len: 1043.0 num_loss_counted_tokens: 4213
Per-token loss scaled by world size: 0.0005160618457011878
Per-token loss scaled by world size: 0.00044540074304677546
Per-token loss scaled by world size: 4.21712247771211e-05
Per-token loss scaled by world size: 0.0002244754577986896
Per-token loss scaled by world size: 0.0007427233504131436
Per-token loss scaled by world size: 9.168142241833266e-06
Per-token loss scaled by world size: 0.0003082228358834982
Epoch: 0, Step: 69, Rank: 2, loss = 0.09369391947984695
Epoch: 0, Step: 69, Rank: 5, loss = 0.9895691275596619
Epoch: 0, Step: 69, Rank: 6, loss = 1.1465604305267334
Epoch: 0, Step: 69, Rank: 3, loss = 0.49872833490371704
Epoch: 0, Step: 69, Rank: 4, loss = 1.6501456499099731
Epoch: 0, Step: 69, Rank: 1, loss = 0.02036931924521923
Epoch: 0, Step: 69, Rank: 7, loss = 0.6847940683364868
Per-token loss scaled by world size: 4.934850949211977e-06
Epoch: 0, Step: 69, Rank: 0, loss = 0.010964005254209042
Epoch 0: 57%|█████▋ | 69/121 [02:56<02:12, 2.55s/it]
total tokens: 7194 num samples: 6 num padding tokens: 688 - rank: 1 max len: 1199 min len: 1008 avg len: 1084.3333333333333 num_loss_counted_tokens: 3201
total tokens: 7668 num samples: 12 num padding tokens: 514 - rank: 4 max len: 639 min len: 540 avg len: 596.1666666666666 num_loss_counted_tokens: 3626
{
"epoch": 0,
"step": 69,
"rank": 0,
"loss": 0.010964005254209042,
"overall_throughput": 41.668886771410214,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.496148586273193,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 17774,
"batch_size": 74,
"total_loss": 0.6368531584739685,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:50:58.281411"
}
total tokens: 3168 num samples: 24 num padding tokens: 653 - rank: 7 max len: 132 min len: 86 avg len: 104.79166666666667 num_loss_counted_tokens: 693
total tokens: 5566 num samples: 2 num padding tokens: 709 - rank: 0 max len: 2783 min len: 2074 avg len: 2428.5 num_loss_counted_tokens: 267
total tokens: 8040 num samples: 15 num padding tokens: 1436 - rank: 5 max len: 536 min len: 357 avg len: 440.26666666666665 num_loss_counted_tokens: 4304
total tokens: 7872 num samples: 24 num padding tokens: 2389 - rank: 6 max len: 328 min len: 136 avg len: 228.45833333333334 num_loss_counted_tokens: 2377
total tokens: 7784 num samples: 8 num padding tokens: 564 - rank: 2 max len: 973 min len: 847 avg len: 902.5 num_loss_counted_tokens: 5185
total tokens: 7800 num samples: 10 num padding tokens: 617 - rank: 3 max len: 780 min len: 641 avg len: 718.3 num_loss_counted_tokens: 5599
Per-token loss scaled by world size: 0.00038898465572856367
Per-token loss scaled by world size: 0.0003429916687309742
Per-token loss scaled by world size: 0.0005293539143167436
Per-token loss scaled by world size: 2.475303517712746e-06
Per-token loss scaled by world size: 0.00041178142419084907
Per-token loss scaled by world size: 0.00016001032781787217
Per-token loss scaled by world size: 3.3767562854336575e-05
Epoch: 0, Step: 70, Rank: 6, loss = 0.9818993806838989
Epoch: 0, Step: 70, Rank: 4, loss = 1.515407919883728
Epoch: 0, Step: 70, Rank: 5, loss = 1.1135658025741577
Epoch: 0, Step: 70, Rank: 0, loss = 0.00708617502823472
Epoch: 0, Step: 70, Rank: 3, loss = 1.1788272857666016
Epoch: 0, Step: 70, Rank: 7, loss = 0.4580695629119873
Epoch: 0, Step: 70, Rank: 1, loss = 0.09666808694601059
Per-token loss scaled by world size: 0.00012752025213558227
Epoch: 0, Step: 70, Rank: 2, loss = 0.3650586009025574
Epoch 0: 58%|█████▊ | 70/121 [02:58<02:09, 2.54s/it]
total tokens: 7424 num samples: 8 num padding tokens: 763 - rank: 4 max len: 928 min len: 726 avg len: 832.625 num_loss_counted_tokens: 3620
total tokens: 7068 num samples: 4 num padding tokens: 755 - rank: 1 max len: 1767 min len: 1403 avg len: 1578.25 num_loss_counted_tokens: 1097
{
"epoch": 0,
"step": 70,
"rank": 0,
"loss": 0.00708617502823472,
"overall_throughput": 42.86151276291768,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.356226444244385,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22902,
"batch_size": 78,
"total_loss": 0.7145729064941406,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:00.845490"
}
total tokens: 6700 num samples: 25 num padding tokens: 2499 - rank: 7 max len: 268 min len: 77 avg len: 168.04 num_loss_counted_tokens: 1712
total tokens: 7984 num samples: 16 num padding tokens: 1302 - rank: 6 max len: 499 min len: 278 avg len: 417.625 num_loss_counted_tokens: 3478
total tokens: 7644 num samples: 7 num padding tokens: 654 - rank: 3 max len: 1092 min len: 950 avg len: 998.5714285714286 num_loss_counted_tokens: 5657
total tokens: 6950 num samples: 5 num padding tokens: 699 - rank: 2 max len: 1390 min len: 1102 avg len: 1250.2 num_loss_counted_tokens: 2977
total tokens: 7590 num samples: 11 num padding tokens: 1060 - rank: 5 max len: 690 min len: 509 avg len: 593.6363636363636 num_loss_counted_tokens: 4071
total tokens: 7488 num samples: 3 num padding tokens: 1132 - rank: 0 max len: 2496 min len: 1787 avg len: 2118.6666666666665 num_loss_counted_tokens: 1930
Per-token loss scaled by world size: 0.00010220974945696071
Per-token loss scaled by world size: 0.0004755923873744905
Per-token loss scaled by world size: 0.00013658934039995074
Per-token loss scaled by world size: 0.0005745669477619231
Per-token loss scaled by world size: 0.00038079574005678296
Per-token loss scaled by world size: 1.1699220294758561e-06
Per-token loss scaled by world size: 0.0002442343102302402
Epoch: 0, Step: 71, Rank: 6, loss = 1.3208389282226562
Epoch: 0, Step: 71, Rank: 5, loss = 1.595716118812561
Epoch: 0, Step: 71, Rank: 0, loss = 0.0032491658348590136
Epoch: 0, Step: 71, Rank: 1, loss = 0.28386202454566956
Epoch: 0, Step: 71, Rank: 2, loss = 0.3793427646160126
Epoch: 0, Step: 71, Rank: 4, loss = 1.0575649738311768
Epoch: 0, Step: 71, Rank: 7, loss = 0.6782997250556946
Per-token loss scaled by world size: 0.0002879296080209315
Epoch: 0, Step: 71, Rank: 3, loss = 0.7996525168418884
Epoch 0: 59%|█████▊ | 71/121 [03:01<02:07, 2.54s/it]
total tokens: 7900 num samples: 10 num padding tokens: 433 - rank: 4 max len: 790 min len: 700 avg len: 746.7 num_loss_counted_tokens: 4392
total tokens: 7940 num samples: 5 num padding tokens: 1033 - rank: 1 max len: 1588 min len: 1214 avg len: 1381.4 num_loss_counted_tokens: 4515
{
"epoch": 0,
"step": 71,
"rank": 0,
"loss": 0.0032491658348590136,
"overall_throughput": 42.58421912234305,
"lr": 8.000000000000001e-07,
"cuda_mem_allocated": 24.31266736984253,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22218,
"batch_size": 83,
"total_loss": 0.7648157477378845,
"gradnorm": 0.9589425325393677,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:03.350519"
}
total tokens: 7645 num samples: 11 num padding tokens: 1259 - rank: 5 max len: 695 min len: 477 avg len: 580.5454545454545 num_loss_counted_tokens: 4944
total tokens: 8092 num samples: 17 num padding tokens: 1854 - rank: 6 max len: 476 min len: 261 avg len: 366.94117647058823 num_loss_counted_tokens: 3900
total tokens: 7931 num samples: 7 num padding tokens: 505 - rank: 2 max len: 1133 min len: 979 avg len: 1060.857142857143 num_loss_counted_tokens: 4912
total tokens: 7776 num samples: 8 num padding tokens: 515 - rank: 3 max len: 972 min len: 822 avg len: 907.625 num_loss_counted_tokens: 5257
total tokens: 7904 num samples: 32 num padding tokens: 2745 - rank: 7 max len: 247 min len: 71 avg len: 161.21875 num_loss_counted_tokens: 2232
total tokens: 8004 num samples: 4 num padding tokens: 730 - rank: 0 max len: 2001 min len: 1656 avg len: 1818.5 num_loss_counted_tokens: 2493
Per-token loss scaled by world size: 0.0004883022629655898
Per-token loss scaled by world size: 0.00043922686018049717
Per-token loss scaled by world size: 0.0004386535147204995
Per-token loss scaled by world size: 0.00020862463861703873
Per-token loss scaled by world size: 0.0003196638426743448
Per-token loss scaled by world size: 4.3209151954215486e-06
Per-token loss scaled by world size: 0.0001734672114253044
Epoch: 0, Step: 72, Rank: 6, loss = 1.25338876247406
Epoch: 0, Step: 72, Rank: 5, loss = 1.2517526149749756
Epoch: 0, Step: 72, Rank: 4, loss = 1.393431544303894
Epoch: 0, Step: 72, Rank: 2, loss = 0.5953364968299866
Epoch: 0, Step: 72, Rank: 7, loss = 0.9122007489204407
Epoch: 0, Step: 72, Rank: 0, loss = 0.012330272234976292
Epoch: 0, Step: 72, Rank: 1, loss = 0.4950103759765625
Per-token loss scaled by world size: 0.00020103121642023325
Epoch: 0, Step: 72, Rank: 3, loss = 0.5736677050590515
[2024-08-18 20:51:05,925] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[1.6000000000000001e-06], mom=[(0.9, 0.95)]
Epoch 0: 60%|█████▉ | 72/121 [03:04<02:06, 2.58s/it]
total tokens: 7434 num samples: 7 num padding tokens: 801 - rank: 4 max len: 1062 min len: 831 avg len: 947.5714285714286 num_loss_counted_tokens: 3620
total tokens: 7284 num samples: 3 num padding tokens: 180 - rank: 1 max len: 2428 min len: 2292 avg len: 2368.0 num_loss_counted_tokens: 273
{
"epoch": 0,
"step": 72,
"rank": 0,
"loss": 0.012330272234976292,
"overall_throughput": 41.0709419873187,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 22.637446880340576,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22829,
"batch_size": 79,
"total_loss": 0.8108897805213928,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:06.059291"
}
total tokens: 6880 num samples: 5 num padding tokens: 854 - rank: 3 max len: 1376 min len: 1145 avg len: 1205.2 num_loss_counted_tokens: 2855
total tokens: 6178 num samples: 2 num padding tokens: 622 - rank: 0 max len: 3089 min len: 2467 avg len: 2778.0 num_loss_counted_tokens: 309
total tokens: 7306 num samples: 26 num padding tokens: 2608 - rank: 7 max len: 281 min len: 72 avg len: 180.69230769230768 num_loss_counted_tokens: 1912
total tokens: 7960 num samples: 4 num padding tokens: 988 - rank: 2 max len: 1990 min len: 1586 avg len: 1743.0 num_loss_counted_tokens: 1479
total tokens: 7548 num samples: 12 num padding tokens: 2286 - rank: 6 max len: 629 min len: 305 avg len: 438.5 num_loss_counted_tokens: 3206
total tokens: 8040 num samples: 10 num padding tokens: 894 - rank: 5 max len: 804 min len: 634 avg len: 714.6 num_loss_counted_tokens: 4636
Per-token loss scaled by world size: 0.00029286538483574986
Per-token loss scaled by world size: 0.0002602968306746334
Per-token loss scaled by world size: 0.00021679738711100072
Per-token loss scaled by world size: 0.00021336728241294622
Per-token loss scaled by world size: 0.0002807514392770827
Per-token loss scaled by world size: 2.977332087539253e-06
Per-token loss scaled by world size: 0.00019802094902843237
Epoch: 0, Step: 73, Rank: 1, loss = 0.8374074697494507
Epoch: 0, Step: 73, Rank: 6, loss = 0.9421845078468323
Epoch: 0, Step: 73, Rank: 4, loss = 0.6974642872810364
Epoch: 0, Step: 73, Rank: 0, loss = 0.00957844965159893
Epoch: 0, Step: 73, Rank: 3, loss = 0.6864292025566101
Epoch: 0, Step: 73, Rank: 2, loss = 0.9032124280929565
Epoch: 0, Step: 73, Rank: 7, loss = 0.6370581388473511
Per-token loss scaled by world size: 0.0002398234064457938
Epoch: 0, Step: 73, Rank: 5, loss = 0.7715418934822083
Epoch 0: 60%|██████ | 73/121 [03:06<02:03, 2.57s/it]
total tokens: 7832 num samples: 8 num padding tokens: 827 - rank: 4 max len: 979 min len: 822 avg len: 875.625 num_loss_counted_tokens: 5623
total tokens: 7455 num samples: 3 num padding tokens: 291 - rank: 1 max len: 2485 min len: 2232 avg len: 2388.0 num_loss_counted_tokens: 600
{
"epoch": 0,
"step": 73,
"rank": 0,
"loss": 0.00957844965159893,
"overall_throughput": 41.76512858398771,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.471110343933105,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25737,
"batch_size": 88,
"total_loss": 0.6856094598770142,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:08.555309"
}
total tokens: 8020 num samples: 10 num padding tokens: 1418 - rank: 5 max len: 802 min len: 571 avg len: 660.2 num_loss_counted_tokens: 4480
total tokens: 5714 num samples: 2 num padding tokens: 293 - rank: 0 max len: 2857 min len: 2564 avg len: 2710.5 num_loss_counted_tokens: 246
total tokens: 7260 num samples: 6 num padding tokens: 413 - rank: 3 max len: 1210 min len: 1056 avg len: 1141.1666666666667 num_loss_counted_tokens: 3104
total tokens: 6768 num samples: 4 num padding tokens: 992 - rank: 2 max len: 1692 min len: 1291 avg len: 1444.0 num_loss_counted_tokens: 3075
total tokens: 7602 num samples: 14 num padding tokens: 1899 - rank: 6 max len: 543 min len: 306 avg len: 407.35714285714283 num_loss_counted_tokens: 3724
total tokens: 7930 num samples: 26 num padding tokens: 2791 - rank: 7 max len: 305 min len: 79 avg len: 197.65384615384616 num_loss_counted_tokens: 2193
Per-token loss scaled by world size: 0.0004040475469082594
Per-token loss scaled by world size: 0.00014303348143585026
Per-token loss scaled by world size: 0.00015468306082766503
Per-token loss scaled by world size: 0.00038016383768990636
Per-token loss scaled by world size: 0.0002839408116415143
Per-token loss scaled by world size: 0.00020860570657532662
Per-token loss scaled by world size: 3.69467556993186e-06
Epoch: 0, Step: 74, Rank: 5, loss = 1.1417745351791382
Epoch: 0, Step: 74, Rank: 1, loss = 0.4645712375640869
Epoch: 0, Step: 74, Rank: 4, loss = 0.4295831620693207
Epoch: 0, Step: 74, Rank: 6, loss = 1.2135063409805298
Epoch: 0, Step: 74, Rank: 7, loss = 0.8527806997299194
Epoch: 0, Step: 74, Rank: 0, loss = 0.011096496134996414
Epoch: 0, Step: 74, Rank: 2, loss = 0.6265211701393127
Per-token loss scaled by world size: 0.00020632839004974812
Epoch: 0, Step: 74, Rank: 3, loss = 0.6196815371513367
Epoch 0: 61%|██████ | 74/121 [03:09<02:00, 2.55s/it]
total tokens: 7696 num samples: 8 num padding tokens: 1763 - rank: 4 max len: 962 min len: 651 avg len: 741.625 num_loss_counted_tokens: 5122
total tokens: 7854 num samples: 3 num padding tokens: 565 - rank: 1 max len: 2618 min len: 2101 avg len: 2429.6666666666665 num_loss_counted_tokens: 298
{
"epoch": 0,
"step": 74,
"rank": 0,
"loss": 0.011096496134996414,
"overall_throughput": 41.849181139415684,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.389501094818115,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24027,
"batch_size": 89,
"total_loss": 0.6699394583702087,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:11.072098"
}
total tokens: 7530 num samples: 5 num padding tokens: 1325 - rank: 3 max len: 1506 min len: 1053 avg len: 1241.0 num_loss_counted_tokens: 2113
total tokens: 7860 num samples: 30 num padding tokens: 3145 - rank: 7 max len: 262 min len: 80 avg len: 157.16666666666666 num_loss_counted_tokens: 1940
total tokens: 7596 num samples: 12 num padding tokens: 1105 - rank: 5 max len: 633 min len: 472 avg len: 540.9166666666666 num_loss_counted_tokens: 4291
total tokens: 4061 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4061 min len: 4061 avg len: 4061.0 num_loss_counted_tokens: 393
total tokens: 6255 num samples: 3 num padding tokens: 930 - rank: 2 max len: 2085 min len: 1507 avg len: 1775.0 num_loss_counted_tokens: 1265
total tokens: 7905 num samples: 17 num padding tokens: 1989 - rank: 6 max len: 465 min len: 271 avg len: 348.0 num_loss_counted_tokens: 3447
Per-token loss scaled by world size: 0.0008345923852175474
Per-token loss scaled by world size: 0.0001891565480036661
Per-token loss scaled by world size: 0.0006257555796764791
Per-token loss scaled by world size: 5.515092198038474e-06
Per-token loss scaled by world size: 0.00020594018860720098
Per-token loss scaled by world size: 4.789793456438929e-05
Per-token loss scaled by world size: 9.402850264450535e-05
Epoch: 0, Step: 75, Rank: 5, loss = 1.483744740486145
Epoch: 0, Step: 75, Rank: 3, loss = 0.44851380586624146
Epoch: 0, Step: 75, Rank: 0, loss = 0.013076973147690296
Epoch: 0, Step: 75, Rank: 4, loss = 1.9789228439331055
Epoch: 0, Step: 75, Rank: 2, loss = 0.22295333445072174
Epoch: 0, Step: 75, Rank: 1, loss = 0.11357199400663376
Epoch: 0, Step: 75, Rank: 7, loss = 0.48830991983413696
Per-token loss scaled by world size: 0.0005321354838088155
Epoch: 0, Step: 75, Rank: 6, loss = 1.2617597579956055
Epoch 0: 62%|██████▏ | 75/121 [03:11<01:57, 2.54s/it]
total tokens: 7851 num samples: 3 num padding tokens: 1455 - rank: 1 max len: 2617 min len: 1665 avg len: 2132.0 num_loss_counted_tokens: 271
total tokens: 7112 num samples: 7 num padding tokens: 1148 - rank: 4 max len: 1016 min len: 665 avg len: 852.0 num_loss_counted_tokens: 3987
total tokens: 8086 num samples: 13 num padding tokens: 1573 - rank: 5 max len: 622 min len: 389 avg len: 501.0 num_loss_counted_tokens: 4062
{
"epoch": 0,
"step": 75,
"rank": 0,
"loss": 0.013076973147690296,
"overall_throughput": 41.95248737744596,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.426692962646484,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18969,
"batch_size": 77,
"total_loss": 0.7513566613197327,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:13.595856"
}
total tokens: 7182 num samples: 6 num padding tokens: 591 - rank: 3 max len: 1197 min len: 1043 avg len: 1098.5 num_loss_counted_tokens: 5313
total tokens: 6628 num samples: 4 num padding tokens: 547 - rank: 2 max len: 1657 min len: 1381 avg len: 1520.25 num_loss_counted_tokens: 947
total tokens: 3496 num samples: 19 num padding tokens: 1312 - rank: 7 max len: 184 min len: 78 avg len: 114.94736842105263 num_loss_counted_tokens: 656
total tokens: 7938 num samples: 21 num padding tokens: 2607 - rank: 6 max len: 378 min len: 187 avg len: 253.85714285714286 num_loss_counted_tokens: 2571
total tokens: 7226 num samples: 2 num padding tokens: 50 - rank: 0 max len: 3613 min len: 3563 avg len: 3588.0 num_loss_counted_tokens: 179
Per-token loss scaled by world size: 0.00010904129885602742
Per-token loss scaled by world size: 0.00042939232662320137
Per-token loss scaled by world size: 0.0003037904389202595
Per-token loss scaled by world size: 0.00046344727161340415
Per-token loss scaled by world size: 6.435919203795493e-05
Per-token loss scaled by world size: 0.0002804531832225621
Per-token loss scaled by world size: 0.00045534392120316625
Epoch: 0, Step: 76, Rank: 5, loss = 1.2729872465133667
Epoch: 0, Step: 76, Rank: 6, loss = 1.3739473819732666
Epoch: 0, Step: 76, Rank: 2, loss = 0.900624692440033
Epoch: 0, Step: 76, Rank: 0, loss = 0.19080086052417755
Epoch: 0, Step: 76, Rank: 1, loss = 0.32326656579971313
Epoch: 0, Step: 76, Rank: 7, loss = 0.8314384818077087
Epoch: 0, Step: 76, Rank: 4, loss = 1.3499239683151245
Per-token loss scaled by world size: 0.000354817311745137
Epoch: 0, Step: 76, Rank: 3, loss = 1.0519002676010132
Epoch 0: 63%|██████▎ | 76/121 [03:14<01:54, 2.54s/it]
total tokens: 7095 num samples: 5 num padding tokens: 648 - rank: 4 max len: 1419 min len: 1179 avg len: 1289.4 num_loss_counted_tokens: 3185
total tokens: 6052 num samples: 2 num padding tokens: 305 - rank: 1 max len: 3026 min len: 2721 avg len: 2873.5 num_loss_counted_tokens: 349
{
"epoch": 0,
"step": 76,
"rank": 0,
"loss": 0.19080086052417755,
"overall_throughput": 41.77716057032778,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.457417488098145,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23717,
"batch_size": 104,
"total_loss": 0.9118610620498657,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:16.129445"
}
total tokens: 7774 num samples: 13 num padding tokens: 2031 - rank: 6 max len: 598 min len: 311 avg len: 441.7692307692308 num_loss_counted_tokens: 3287
total tokens: 8034 num samples: 3 num padding tokens: 672 - rank: 2 max len: 2678 min len: 2124 avg len: 2454.0 num_loss_counted_tokens: 1267
total tokens: 7544 num samples: 4 num padding tokens: 572 - rank: 3 max len: 1886 min len: 1658 avg len: 1743.0 num_loss_counted_tokens: 711
total tokens: 7700 num samples: 7 num padding tokens: 2236 - rank: 5 max len: 1100 min len: 637 avg len: 780.5714285714286 num_loss_counted_tokens: 3735
total tokens: 6648 num samples: 24 num padding tokens: 2581 - rank: 7 max len: 277 min len: 90 avg len: 169.45833333333334 num_loss_counted_tokens: 1750
total tokens: 6410 num samples: 2 num padding tokens: 53 - rank: 0 max len: 3205 min len: 3152 avg len: 3178.5 num_loss_counted_tokens: 196
Per-token loss scaled by world size: 0.00034956797026097775
Per-token loss scaled by world size: 0.00019042924395762384
Per-token loss scaled by world size: 0.00021594665304291993
Per-token loss scaled by world size: 0.000333549891365692
Per-token loss scaled by world size: 0.00039773472235538065
Per-token loss scaled by world size: 1.5378537909782608e-06
Per-token loss scaled by world size: 1.5691426597186364e-05
Epoch: 0, Step: 77, Rank: 7, loss = 1.0991719961166382
Epoch: 0, Step: 77, Rank: 6, loss = 0.7116252183914185
Epoch: 0, Step: 77, Rank: 4, loss = 1.3106850385665894
Epoch: 0, Step: 77, Rank: 5, loss = 1.1519575119018555
Epoch: 0, Step: 77, Rank: 2, loss = 0.6275357604026794
Epoch: 0, Step: 77, Rank: 1, loss = 0.0517091378569603
Epoch: 0, Step: 77, Rank: 0, loss = 0.005067804828286171
Per-token loss scaled by world size: 0.0002473424538038671
Epoch: 0, Step: 77, Rank: 3, loss = 0.8150861859321594
Epoch 0: 64%|██████▎ | 77/121 [03:16<01:51, 2.53s/it]
total tokens: 7920 num samples: 9 num padding tokens: 741 - rank: 4 max len: 880 min len: 742 avg len: 797.6666666666666 num_loss_counted_tokens: 3310
total tokens: 5446 num samples: 2 num padding tokens: 50 - rank: 1 max len: 2723 min len: 2673 avg len: 2698.0 num_loss_counted_tokens: 175
total tokens: 7689 num samples: 11 num padding tokens: 1344 - rank: 5 max len: 699 min len: 481 avg len: 576.8181818181819 num_loss_counted_tokens: 3004
total tokens: 8041 num samples: 17 num padding tokens: 1807 - rank: 6 max len: 473 min len: 265 avg len: 366.70588235294116 num_loss_counted_tokens: 4129
{
"epoch": 0,
"step": 77,
"rank": 0,
"loss": 0.005067804828286171,
"overall_throughput": 42.455461086288004,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.274834632873535,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26363,
"batch_size": 91,
"total_loss": 0.7216048836708069,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:18.625029"
}
total tokens: 6488 num samples: 2 num padding tokens: 319 - rank: 0 max len: 3244 min len: 2925 avg len: 3084.5 num_loss_counted_tokens: 219
total tokens: 7337 num samples: 29 num padding tokens: 2640 - rank: 7 max len: 253 min len: 79 avg len: 161.9655172413793 num_loss_counted_tokens: 1852
total tokens: 7398 num samples: 6 num padding tokens: 1043 - rank: 3 max len: 1233 min len: 937 avg len: 1059.1666666666667 num_loss_counted_tokens: 4139
total tokens: 7473 num samples: 3 num padding tokens: 1919 - rank: 2 max len: 2491 min len: 1521 avg len: 1851.3333333333333 num_loss_counted_tokens: 1621
Per-token loss scaled by world size: 0.00012608377437572926
Per-token loss scaled by world size: 0.00035810453118756413
Per-token loss scaled by world size: 0.00015491498925257474
Per-token loss scaled by world size: 0.00043326299055479467
Per-token loss scaled by world size: 0.00016809521184768528
Per-token loss scaled by world size: 3.594179133870057e-06
Per-token loss scaled by world size: 0.0003268007712904364
Epoch: 0, Step: 78, Rank: 6, loss = 1.1593186855316162
Epoch: 0, Step: 78, Rank: 5, loss = 1.4026347398757935
Epoch: 0, Step: 78, Rank: 7, loss = 0.5015178918838501
Epoch: 0, Step: 78, Rank: 3, loss = 0.40818047523498535
Epoch: 0, Step: 78, Rank: 0, loss = 0.011635705828666687
Epoch: 0, Step: 78, Rank: 1, loss = 0.5441872477531433
Epoch: 0, Step: 78, Rank: 4, loss = 1.0579766035079956
Per-token loss scaled by world size: 0.000327433692291379
Epoch: 0, Step: 78, Rank: 2, loss = 1.060025691986084
Epoch 0: 64%|██████▍ | 78/121 [03:19<01:49, 2.54s/it]
total tokens: 7496 num samples: 8 num padding tokens: 719 - rank: 4 max len: 937 min len: 761 avg len: 847.125 num_loss_counted_tokens: 4720
total tokens: 8100 num samples: 4 num padding tokens: 633 - rank: 1 max len: 2025 min len: 1692 avg len: 1866.75 num_loss_counted_tokens: 1316
{
"epoch": 0,
"step": 78,
"rank": 0,
"loss": 0.011635705828666687,
"overall_throughput": 41.25017196728111,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.33673620223999,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25899,
"batch_size": 81,
"total_loss": 0.7681846618652344,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:21.186251"
}
total tokens: 8107 num samples: 11 num padding tokens: 1051 - rank: 5 max len: 737 min len: 542 avg len: 641.4545454545455 num_loss_counted_tokens: 4630
total tokens: 7305 num samples: 5 num padding tokens: 524 - rank: 2 max len: 1461 min len: 1128 avg len: 1356.2 num_loss_counted_tokens: 3964
total tokens: 7209 num samples: 27 num padding tokens: 2040 - rank: 7 max len: 267 min len: 82 avg len: 191.44444444444446 num_loss_counted_tokens: 2311
total tokens: 7672 num samples: 7 num padding tokens: 538 - rank: 3 max len: 1096 min len: 942 avg len: 1019.1428571428571 num_loss_counted_tokens: 4249
total tokens: 7226 num samples: 2 num padding tokens: 698 - rank: 0 max len: 3613 min len: 2915 avg len: 3264.0 num_loss_counted_tokens: 204
total tokens: 8070 num samples: 15 num padding tokens: 1557 - rank: 6 max len: 538 min len: 297 avg len: 434.2 num_loss_counted_tokens: 3942
Per-token loss scaled by world size: 0.0003785255830734968
Per-token loss scaled by world size: 0.00020959046378266066
Per-token loss scaled by world size: 0.0004416834854055196
Per-token loss scaled by world size: 0.0002668427478056401
Per-token loss scaled by world size: 0.00010036973981186748
Per-token loss scaled by world size: 0.00018033267406281084
Per-token loss scaled by world size: 4.8121955842361785e-06
Epoch: 0, Step: 79, Rank: 6, loss = 0.6261777281761169
Epoch: 0, Step: 79, Rank: 5, loss = 1.319584608078003
Epoch: 0, Step: 79, Rank: 4, loss = 1.1308925151824951
Epoch: 0, Step: 79, Rank: 7, loss = 0.797226071357727
Epoch: 0, Step: 79, Rank: 0, loss = 0.014377035200595856
Epoch: 0, Step: 79, Rank: 1, loss = 0.5387663841247559
Epoch: 0, Step: 79, Rank: 2, loss = 0.2998671531677246
Per-token loss scaled by world size: 0.00022867463121656328
Epoch: 0, Step: 79, Rank: 3, loss = 0.6831940412521362
Epoch 0: 65%|██████▌ | 79/121 [03:21<01:46, 2.55s/it]
total tokens: 7794 num samples: 9 num padding tokens: 491 - rank: 4 max len: 866 min len: 750 avg len: 811.4444444444445 num_loss_counted_tokens: 5401
total tokens: 7551 num samples: 3 num padding tokens: 552 - rank: 1 max len: 2517 min len: 2018 avg len: 2333.0 num_loss_counted_tokens: 491
{
"epoch": 0,
"step": 79,
"rank": 0,
"loss": 0.014377035200595856,
"overall_throughput": 41.31761838872816,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.35622549057007,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23901,
"batch_size": 83,
"total_loss": 0.6762607097625732,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:23.748903"
}
total tokens: 7966 num samples: 14 num padding tokens: 2590 - rank: 6 max len: 569 min len: 271 avg len: 384.0 num_loss_counted_tokens: 3308
total tokens: 7416 num samples: 4 num padding tokens: 852 - rank: 2 max len: 1854 min len: 1322 avg len: 1641.0 num_loss_counted_tokens: 506
total tokens: 7931 num samples: 11 num padding tokens: 889 - rank: 5 max len: 721 min len: 597 avg len: 640.1818181818181 num_loss_counted_tokens: 5323
total tokens: 7806 num samples: 6 num padding tokens: 1777 - rank: 3 max len: 1301 min len: 876 avg len: 1004.8333333333334 num_loss_counted_tokens: 4853
total tokens: 5476 num samples: 2 num padding tokens: 96 - rank: 0 max len: 2738 min len: 2642 avg len: 2690.0 num_loss_counted_tokens: 179
total tokens: 8100 num samples: 30 num padding tokens: 2273 - rank: 7 max len: 270 min len: 83 avg len: 194.23333333333332 num_loss_counted_tokens: 2747
Per-token loss scaled by world size: 0.00042449356988072395
Per-token loss scaled by world size: 0.00028748821932822466
Per-token loss scaled by world size: 0.0002529154298827052
Per-token loss scaled by world size: 0.0005231253453530371
Per-token loss scaled by world size: 0.0002102917933370918
Per-token loss scaled by world size: 5.35248773303465e-06
Per-token loss scaled by world size: 0.00035050552105531096
Epoch: 0, Step: 80, Rank: 6, loss = 1.4146617650985718
Epoch: 0, Step: 80, Rank: 0, loss = 0.01447446458041668
Epoch: 0, Step: 80, Rank: 4, loss = 0.5686815977096558
Epoch: 0, Step: 80, Rank: 3, loss = 0.7774400115013123
Epoch: 0, Step: 80, Rank: 5, loss = 1.1479367017745972
Epoch: 0, Step: 80, Rank: 2, loss = 0.6839465498924255
Epoch: 0, Step: 80, Rank: 7, loss = 0.9478545188903809
Per-token loss scaled by world size: 1.0431926966703031e-06
Epoch: 0, Step: 80, Rank: 1, loss = 0.002821053843945265
Epoch 0: 66%|██████▌ | 80/121 [03:24<01:44, 2.54s/it]
total tokens: 7600 num samples: 10 num padding tokens: 502 - rank: 4 max len: 760 min len: 672 avg len: 709.8 num_loss_counted_tokens: 4962
total tokens: 7035 num samples: 5 num padding tokens: 571 - rank: 1 max len: 1407 min len: 1097 avg len: 1292.8 num_loss_counted_tokens: 3599
{
"epoch": 0,
"step": 80,
"rank": 0,
"loss": 0.01447446458041668,
"overall_throughput": 41.61237764414521,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.469753742218018,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21634,
"batch_size": 76,
"total_loss": 0.6947270631790161,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:26.290599"
}
total tokens: 7760 num samples: 16 num padding tokens: 1837 - rank: 6 max len: 485 min len: 278 avg len: 370.1875 num_loss_counted_tokens: 3781
total tokens: 7595 num samples: 7 num padding tokens: 592 - rank: 2 max len: 1085 min len: 934 avg len: 1000.4285714285714 num_loss_counted_tokens: 4208
total tokens: 7444 num samples: 4 num padding tokens: 741 - rank: 0 max len: 1861 min len: 1498 avg len: 1675.75 num_loss_counted_tokens: 2607
total tokens: 7440 num samples: 8 num padding tokens: 519 - rank: 3 max len: 930 min len: 764 avg len: 865.125 num_loss_counted_tokens: 6064
total tokens: 8100 num samples: 30 num padding tokens: 2932 - rank: 7 max len: 270 min len: 75 avg len: 172.26666666666668 num_loss_counted_tokens: 2182
total tokens: 8016 num samples: 12 num padding tokens: 980 - rank: 5 max len: 668 min len: 496 avg len: 586.3333333333334 num_loss_counted_tokens: 5943
Per-token loss scaled by world size: 0.0002643285261001438
Per-token loss scaled by world size: 0.000505154428537935
Per-token loss scaled by world size: 0.0003831658395938575
Per-token loss scaled by world size: 0.0005561576108448207
Per-token loss scaled by world size: 4.442329100129427e-06
Per-token loss scaled by world size: 0.000311601092107594
Per-token loss scaled by world size: 3.0491105462715495e-06
Epoch: 0, Step: 81, Rank: 5, loss = 1.2860599756240845
Epoch: 0, Step: 81, Rank: 3, loss = 0.6729474067687988
Epoch: 0, Step: 81, Rank: 6, loss = 1.4159077405929565
Epoch: 0, Step: 81, Rank: 4, loss = 0.9754922986030579
Epoch: 0, Step: 81, Rank: 1, loss = 0.011309614405035973
Epoch: 0, Step: 81, Rank: 7, loss = 0.7932974100112915
Epoch: 0, Step: 81, Rank: 0, loss = 0.007762654218822718
Per-token loss scaled by world size: 0.00019164555124007165
Epoch: 0, Step: 81, Rank: 2, loss = 0.4879056215286255
Epoch 0: 67%|██████▋ | 81/121 [03:27<01:41, 2.55s/it]
{
"epoch": 0,
"step": 81,
"rank": 0,
"loss": 0.007762654218822718,
"overall_throughput": 41.32577987310498,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.05413246154785,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20367,
"batch_size": 76,
"total_loss": 0.7063353061676025,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:28.868839"
}
total tokens: 5876 num samples: 2 num padding tokens: 1017 - rank: 1 max len: 2938 min len: 1921 avg len: 2429.5 num_loss_counted_tokens: 591
total tokens: 7936 num samples: 8 num padding tokens: 1689 - rank: 4 max len: 992 min len: 598 avg len: 780.875 num_loss_counted_tokens: 4729
total tokens: 6752 num samples: 2 num padding tokens: 73 - rank: 0 max len: 3376 min len: 3303 avg len: 3339.5 num_loss_counted_tokens: 449
total tokens: 8018 num samples: 19 num padding tokens: 2548 - rank: 6 max len: 422 min len: 234 avg len: 287.89473684210526 num_loss_counted_tokens: 3302
total tokens: 7156 num samples: 4 num padding tokens: 826 - rank: 2 max len: 1789 min len: 1335 avg len: 1582.5 num_loss_counted_tokens: 3840
total tokens: 6552 num samples: 28 num padding tokens: 2284 - rank: 7 max len: 234 min len: 79 avg len: 152.42857142857142 num_loss_counted_tokens: 1601
total tokens: 7540 num samples: 13 num padding tokens: 837 - rank: 5 max len: 580 min len: 438 avg len: 515.6153846153846 num_loss_counted_tokens: 4083
total tokens: 7512 num samples: 6 num padding tokens: 558 - rank: 3 max len: 1252 min len: 1059 avg len: 1159.0 num_loss_counted_tokens: 3038
Per-token loss scaled by world size: 0.000629897927865386
Per-token loss scaled by world size: 0.0006152652204036713
Per-token loss scaled by world size: 0.00011580222053453326
Per-token loss scaled by world size: 0.0004951037117280066
Per-token loss scaled by world size: 0.000213472536415793
Per-token loss scaled by world size: 1.2029913705191575e-05
Per-token loss scaled by world size: 5.17758380738087e-05
Epoch: 0, Step: 82, Rank: 4, loss = 1.4647926092147827
Epoch: 0, Step: 82, Rank: 6, loss = 1.4996294975280762
Epoch: 0, Step: 82, Rank: 2, loss = 0.27569612860679626
Epoch: 0, Step: 82, Rank: 3, loss = 1.1787182092666626
Epoch: 0, Step: 82, Rank: 7, loss = 0.5082247257232666
Epoch: 0, Step: 82, Rank: 1, loss = 0.02864021621644497
Epoch: 0, Step: 82, Rank: 0, loss = 0.1232653260231018
Per-token loss scaled by world size: 0.0006569805555045605
Epoch: 0, Step: 82, Rank: 5, loss = 1.5641064643859863
Epoch 0: 68%|██████▊ | 82/121 [03:29<01:38, 2.53s/it]
total tokens: 7308 num samples: 9 num padding tokens: 762 - rank: 4 max len: 812 min len: 669 avg len: 727.3333333333334 num_loss_counted_tokens: 4589
total tokens: 7874 num samples: 31 num padding tokens: 2845 - rank: 7 max len: 254 min len: 81 avg len: 162.2258064516129 num_loss_counted_tokens: 2174
total tokens: 7372 num samples: 4 num padding tokens: 311 - rank: 1 max len: 1843 min len: 1686 avg len: 1765.25 num_loss_counted_tokens: 1401
{
"epoch": 0,
"step": 82,
"rank": 0,
"loss": 0.1232653260231018,
"overall_throughput": 42.003063993949816,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.33745241165161,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19046,
"batch_size": 84,
"total_loss": 0.8303841352462769,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:31.350016"
}
total tokens: 7866 num samples: 19 num padding tokens: 1707 - rank: 6 max len: 414 min len: 255 avg len: 324.1578947368421 num_loss_counted_tokens: 3285
total tokens: 6580 num samples: 4 num padding tokens: 1811 - rank: 2 max len: 1645 min len: 1002 avg len: 1192.25 num_loss_counted_tokens: 2171
total tokens: 8094 num samples: 3 num padding tokens: 948 - rank: 0 max len: 2698 min len: 2046 avg len: 2382.0 num_loss_counted_tokens: 2180
total tokens: 7992 num samples: 8 num padding tokens: 875 - rank: 3 max len: 999 min len: 819 avg len: 889.625 num_loss_counted_tokens: 5913
total tokens: 7982 num samples: 13 num padding tokens: 1310 - rank: 5 max len: 614 min len: 424 avg len: 513.2307692307693 num_loss_counted_tokens: 3981
Per-token loss scaled by world size: 0.00023041688837110996
Per-token loss scaled by world size: 0.0001501823280705139
Per-token loss scaled by world size: 0.000244573806412518
Per-token loss scaled by world size: 0.0003567738749552518
Per-token loss scaled by world size: 0.00027550142840482295
Per-token loss scaled by world size: 3.432213998166844e-05
Per-token loss scaled by world size: 0.00022764307504985482
Epoch: 0, Step: 83, Rank: 5, loss = 1.1512646675109863
Epoch: 0, Step: 83, Rank: 4, loss = 0.4846196174621582
Epoch: 0, Step: 83, Rank: 7, loss = 0.7435265183448792
Epoch: 0, Step: 83, Rank: 0, loss = 0.11075326055288315
Epoch: 0, Step: 83, Rank: 1, loss = 0.7892091274261475
Epoch: 0, Step: 83, Rank: 3, loss = 0.8890087008476257
Epoch: 0, Step: 83, Rank: 2, loss = 0.7345757484436035
Per-token loss scaled by world size: 0.0002607592905405909
Epoch: 0, Step: 83, Rank: 6, loss = 0.8414376378059387
Epoch 0: 69%|██████▊ | 83/121 [03:32<01:35, 2.52s/it]
total tokens: 7168 num samples: 7 num padding tokens: 519 - rank: 4 max len: 1024 min len: 877 avg len: 949.8571428571429 num_loss_counted_tokens: 3980
total tokens: 6612 num samples: 3 num padding tokens: 963 - rank: 1 max len: 2204 min len: 1680 avg len: 1883.0 num_loss_counted_tokens: 659
{
"epoch": 0,
"step": 83,
"rank": 0,
"loss": 0.11075326055288315,
"overall_throughput": 42.37488901314667,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.450260639190674,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25815,
"batch_size": 90,
"total_loss": 0.7180494070053101,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:33.841762"
}
total tokens: 6544 num samples: 4 num padding tokens: 763 - rank: 2 max len: 1636 min len: 1316 avg len: 1445.25 num_loss_counted_tokens: 481
total tokens: 7930 num samples: 10 num padding tokens: 919 - rank: 5 max len: 793 min len: 605 avg len: 701.1 num_loss_counted_tokens: 4841
total tokens: 7994 num samples: 14 num padding tokens: 1260 - rank: 6 max len: 571 min len: 361 avg len: 481.0 num_loss_counted_tokens: 5292
total tokens: 8096 num samples: 23 num padding tokens: 2810 - rank: 7 max len: 352 min len: 91 avg len: 229.82608695652175 num_loss_counted_tokens: 2516
total tokens: 7590 num samples: 6 num padding tokens: 565 - rank: 3 max len: 1265 min len: 1087 avg len: 1170.8333333333333 num_loss_counted_tokens: 4039
total tokens: 6258 num samples: 2 num padding tokens: 564 - rank: 0 max len: 3129 min len: 2565 avg len: 2847.0 num_loss_counted_tokens: 179
Per-token loss scaled by world size: 0.00019247813906986266
Per-token loss scaled by world size: 0.00029174372320994735
Per-token loss scaled by world size: 0.0004131880996283144
Per-token loss scaled by world size: 0.0003363724099472165
Per-token loss scaled by world size: 0.0005087562603875995
Per-token loss scaled by world size: 0.0001915783795993775
Per-token loss scaled by world size: 3.281491217421717e-06
Epoch: 0, Step: 84, Rank: 3, loss = 0.8194716572761536
Epoch: 0, Step: 84, Rank: 2, loss = 0.540647029876709
Epoch: 0, Step: 84, Rank: 7, loss = 0.9448280930519104
Epoch: 0, Step: 84, Rank: 4, loss = 1.1605937480926514
Epoch: 0, Step: 84, Rank: 0, loss = 0.009217298589646816
Epoch: 0, Step: 84, Rank: 5, loss = 1.429032802581787
Epoch: 0, Step: 84, Rank: 1, loss = 0.5381197333335876
Per-token loss scaled by world size: 0.0003772681811824441
Epoch: 0, Step: 84, Rank: 6, loss = 1.0596991777420044
Epoch 0: 69%|██████▉ | 84/121 [03:34<01:33, 2.52s/it]
total tokens: 6480 num samples: 3 num padding tokens: 343 - rank: 1 max len: 2160 min len: 1943 avg len: 2045.6666666666667 num_loss_counted_tokens: 909
total tokens: 7448 num samples: 8 num padding tokens: 1052 - rank: 4 max len: 931 min len: 740 avg len: 799.5 num_loss_counted_tokens: 3154
{
"epoch": 0,
"step": 84,
"rank": 0,
"loss": 0.009217298589646816,
"overall_throughput": 41.98783375148776,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.287980556488037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22471,
"batch_size": 89,
"total_loss": 0.8127012252807617,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:36.362094"
}
total tokens: 7896 num samples: 28 num padding tokens: 3153 - rank: 7 max len: 282 min len: 74 avg len: 169.39285714285714 num_loss_counted_tokens: 2024
total tokens: 8115 num samples: 15 num padding tokens: 2233 - rank: 6 max len: 541 min len: 292 avg len: 392.1333333333333 num_loss_counted_tokens: 3345
total tokens: 5744 num samples: 2 num padding tokens: 367 - rank: 0 max len: 2872 min len: 2505 avg len: 2688.5 num_loss_counted_tokens: 161
total tokens: 7436 num samples: 4 num padding tokens: 1471 - rank: 2 max len: 1859 min len: 1182 avg len: 1491.25 num_loss_counted_tokens: 773
total tokens: 8071 num samples: 7 num padding tokens: 817 - rank: 3 max len: 1153 min len: 974 avg len: 1036.2857142857142 num_loss_counted_tokens: 4123
total tokens: 7788 num samples: 11 num padding tokens: 766 - rank: 5 max len: 708 min len: 547 avg len: 638.3636363636364 num_loss_counted_tokens: 3782
Per-token loss scaled by world size: 0.0005787216359749436
Per-token loss scaled by world size: 0.0005308112595230341
Per-token loss scaled by world size: 0.00033112603705376387
Per-token loss scaled by world size: 0.00014354031009133905
Per-token loss scaled by world size: 0.00046847882913425565
Per-token loss scaled by world size: 3.301608558103908e-06
Per-token loss scaled by world size: 3.5768789530266076e-06
Epoch: 0, Step: 85, Rank: 6, loss = 1.3779860734939575
Epoch: 0, Step: 85, Rank: 5, loss = 1.2161710262298584
Epoch: 0, Step: 85, Rank: 7, loss = 0.8596031665802002
Epoch: 0, Step: 85, Rank: 1, loss = 0.008570975624024868
Epoch: 0, Step: 85, Rank: 4, loss = 1.5023614168167114
Epoch: 0, Step: 85, Rank: 2, loss = 0.37263065576553345
Epoch: 0, Step: 85, Rank: 0, loss = 0.009285577572882175
Per-token loss scaled by world size: 0.00042734169983305037
Epoch: 0, Step: 85, Rank: 3, loss = 1.1093790531158447
Epoch 0: 70%|███████ | 85/121 [03:37<01:31, 2.53s/it]
total tokens: 7898 num samples: 11 num padding tokens: 499 - rank: 4 max len: 718 min len: 607 avg len: 672.6363636363636 num_loss_counted_tokens: 3857
total tokens: 8060 num samples: 5 num padding tokens: 394 - rank: 1 max len: 1612 min len: 1437 avg len: 1533.2 num_loss_counted_tokens: 1786
{
"epoch": 0,
"step": 85,
"rank": 0,
"loss": 0.009285577572882175,
"overall_throughput": 41.311858360949856,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.234922885894775,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20768,
"batch_size": 87,
"total_loss": 0.8069984912872314,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:38.922540"
}
total tokens: 7898 num samples: 22 num padding tokens: 1234 - rank: 6 max len: 359 min len: 264 avg len: 302.90909090909093 num_loss_counted_tokens: 3332
total tokens: 6834 num samples: 3 num padding tokens: 886 - rank: 0 max len: 2278 min len: 1616 avg len: 1982.6666666666667 num_loss_counted_tokens: 909
total tokens: 7680 num samples: 6 num padding tokens: 1041 - rank: 2 max len: 1280 min len: 938 avg len: 1106.5 num_loss_counted_tokens: 3266
total tokens: 7683 num samples: 13 num padding tokens: 1257 - rank: 5 max len: 591 min len: 368 avg len: 494.3076923076923 num_loss_counted_tokens: 4067
total tokens: 7395 num samples: 29 num padding tokens: 2483 - rank: 7 max len: 255 min len: 89 avg len: 169.3793103448276 num_loss_counted_tokens: 2026
total tokens: 7408 num samples: 8 num padding tokens: 818 - rank: 3 max len: 926 min len: 725 avg len: 823.75 num_loss_counted_tokens: 4017
Per-token loss scaled by world size: 0.00045185594353824854
Per-token loss scaled by world size: 0.0003287219151388854
Per-token loss scaled by world size: 0.00010264909360557795
Per-token loss scaled by world size: 0.0003051054081879556
Per-token loss scaled by world size: 0.00028172050951980054
Per-token loss scaled by world size: 1.640593291085679e-05
Per-token loss scaled by world size: 8.82493841345422e-05
Epoch: 0, Step: 86, Rank: 2, loss = 1.0376107692718506
Epoch: 0, Step: 86, Rank: 6, loss = 0.9630652666091919
Epoch: 0, Step: 86, Rank: 1, loss = 0.3240118622779846
Epoch: 0, Step: 86, Rank: 3, loss = 1.4262832403182983
Epoch: 0, Step: 86, Rank: 0, loss = 0.05178532749414444
Epoch: 0, Step: 86, Rank: 4, loss = 0.8892507553100586
Epoch: 0, Step: 86, Rank: 7, loss = 0.2785591781139374
Per-token loss scaled by world size: 0.000460325536550954
Epoch: 0, Step: 86, Rank: 5, loss = 1.4530175924301147
Epoch 0: 71%|███████ | 86/121 [03:39<01:28, 2.53s/it]
total tokens: 3744 num samples: 18 num padding tokens: 1271 - rank: 7 max len: 208 min len: 81 avg len: 137.38888888888889 num_loss_counted_tokens: 914
total tokens: 7208 num samples: 4 num padding tokens: 870 - rank: 1 max len: 1802 min len: 1447 avg len: 1584.5 num_loss_counted_tokens: 2098
total tokens: 7308 num samples: 9 num padding tokens: 993 - rank: 4 max len: 812 min len: 635 avg len: 701.6666666666666 num_loss_counted_tokens: 3474
{
"epoch": 0,
"step": 86,
"rank": 0,
"loss": 0.05178532749414444,
"overall_throughput": 41.951886200619036,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.232909202575684,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25252,
"batch_size": 101,
"total_loss": 0.802947998046875,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:41.442803"
}
total tokens: 7896 num samples: 21 num padding tokens: 2014 - rank: 6 max len: 376 min len: 211 avg len: 280.0952380952381 num_loss_counted_tokens: 3459
total tokens: 7364 num samples: 7 num padding tokens: 790 - rank: 3 max len: 1052 min len: 812 avg len: 939.1428571428571 num_loss_counted_tokens: 3186
total tokens: 7872 num samples: 6 num padding tokens: 665 - rank: 2 max len: 1312 min len: 1060 avg len: 1201.1666666666667 num_loss_counted_tokens: 5354
total tokens: 7982 num samples: 13 num padding tokens: 1356 - rank: 5 max len: 614 min len: 393 avg len: 509.6923076923077 num_loss_counted_tokens: 4015
total tokens: 6650 num samples: 2 num padding tokens: 959 - rank: 0 max len: 3325 min len: 2366 avg len: 2845.5 num_loss_counted_tokens: 183
Per-token loss scaled by world size: 0.0001364344934700057
Per-token loss scaled by world size: 0.00033442748826928437
Per-token loss scaled by world size: 0.00019135570619255304
Per-token loss scaled by world size: 0.00039014805224724114
Per-token loss scaled by world size: 7.375221321126446e-05
Per-token loss scaled by world size: 3.955068677896634e-05
Per-token loss scaled by world size: 0.00019416131544858217
Epoch: 0, Step: 87, Rank: 4, loss = 0.4185469150543213
Epoch: 0, Step: 87, Rank: 2, loss = 0.5870314836502075
Epoch: 0, Step: 87, Rank: 5, loss = 1.1968766450881958
Epoch: 0, Step: 87, Rank: 0, loss = 0.12133162468671799
Epoch: 0, Step: 87, Rank: 3, loss = 1.02593994140625
Epoch: 0, Step: 87, Rank: 1, loss = 0.22625336050987244
Epoch: 0, Step: 87, Rank: 7, loss = 0.5956383943557739
Per-token loss scaled by world size: 0.00031522451899945736
Epoch: 0, Step: 87, Rank: 6, loss = 0.9670300483703613
Epoch 0: 72%|███████▏ | 87/121 [03:42<01:25, 2.52s/it]
total tokens: 7015 num samples: 5 num padding tokens: 1264 - rank: 4 max len: 1403 min len: 1017 avg len: 1150.2 num_loss_counted_tokens: 3249
total tokens: 5650 num samples: 2 num padding tokens: 51 - rank: 1 max len: 2825 min len: 2774 avg len: 2799.5 num_loss_counted_tokens: 191
total tokens: 5482 num samples: 2 num padding tokens: 83 - rank: 2 max len: 2741 min len: 2658 avg len: 2699.5 num_loss_counted_tokens: 171
{
"epoch": 0,
"step": 87,
"rank": 0,
"loss": 0.12133162468671799,
"overall_throughput": 42.28776662058589,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.46161460876465,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24542,
"batch_size": 79,
"total_loss": 0.642331063747406,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:43.944539"
}
total tokens: 7560 num samples: 4 num padding tokens: 1086 - rank: 3 max len: 1890 min len: 1404 avg len: 1618.5 num_loss_counted_tokens: 1603
total tokens: 7830 num samples: 9 num padding tokens: 1197 - rank: 5 max len: 870 min len: 619 avg len: 737.0 num_loss_counted_tokens: 3609
total tokens: 7956 num samples: 13 num padding tokens: 2029 - rank: 6 max len: 612 min len: 285 avg len: 455.9230769230769 num_loss_counted_tokens: 3596
total tokens: 5358 num samples: 19 num padding tokens: 1354 - rank: 7 max len: 282 min len: 86 avg len: 210.73684210526315 num_loss_counted_tokens: 1995
total tokens: 7698 num samples: 2 num padding tokens: 950 - rank: 0 max len: 3849 min len: 2899 avg len: 3374.0 num_loss_counted_tokens: 1217
Per-token loss scaled by world size: 0.00014240843302104622
Per-token loss scaled by world size: 0.000148817416629754
Per-token loss scaled by world size: 0.0001530916924821213
Per-token loss scaled by world size: 0.00020883062097709626
Per-token loss scaled by world size: 0.00023989545297808945
Per-token loss scaled by world size: 0.00017666697385720909
Per-token loss scaled by world size: 0.0001427593524567783
Epoch: 0, Step: 88, Rank: 5, loss = 0.8521594405174255
Epoch: 0, Step: 88, Rank: 6, loss = 0.9789233803749084
Epoch: 0, Step: 88, Rank: 2, loss = 0.6247097849845886
Epoch: 0, Step: 88, Rank: 3, loss = 0.6072680950164795
Epoch: 0, Step: 88, Rank: 4, loss = 0.5811154246330261
Epoch: 0, Step: 88, Rank: 1, loss = 0.7209116816520691
Epoch: 0, Step: 88, Rank: 7, loss = 0.5825473666191101
Per-token loss scaled by world size: 0.00011780338536482304
Epoch: 0, Step: 88, Rank: 0, loss = 0.480711430311203
Epoch 0: 73%|███████▎ | 88/121 [03:44<01:23, 2.54s/it]
total tokens: 7821 num samples: 11 num padding tokens: 369 - rank: 4 max len: 711 min len: 644 avg len: 677.4545454545455 num_loss_counted_tokens: 4895
total tokens: 8108 num samples: 4 num padding tokens: 1270 - rank: 1 max len: 2027 min len: 1314 avg len: 1709.5 num_loss_counted_tokens: 973
total tokens: 7920 num samples: 8 num padding tokens: 1147 - rank: 3 max len: 990 min len: 714 avg len: 846.625 num_loss_counted_tokens: 2895
{
"epoch": 0,
"step": 88,
"rank": 0,
"loss": 0.480711430311203,
"overall_throughput": 41.16723941483895,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.52360773086548,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 32645,
"batch_size": 94,
"total_loss": 0.6785432696342468,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:46.514923"
}
total tokens: 7888 num samples: 17 num padding tokens: 1450 - rank: 6 max len: 464 min len: 277 avg len: 378.70588235294116 num_loss_counted_tokens: 4530
total tokens: 6446 num samples: 2 num padding tokens: 340 - rank: 0 max len: 3223 min len: 2883 avg len: 3053.0 num_loss_counted_tokens: 686
total tokens: 7860 num samples: 30 num padding tokens: 2767 - rank: 7 max len: 262 min len: 77 avg len: 169.76666666666668 num_loss_counted_tokens: 2196
total tokens: 7398 num samples: 6 num padding tokens: 659 - rank: 2 max len: 1233 min len: 1028 avg len: 1123.1666666666667 num_loss_counted_tokens: 3542
total tokens: 7536 num samples: 12 num padding tokens: 714 - rank: 5 max len: 628 min len: 502 avg len: 568.5 num_loss_counted_tokens: 5605
Per-token loss scaled by world size: 0.0003699270309880376
Per-token loss scaled by world size: 0.0005684850038960576
Per-token loss scaled by world size: 5.893620254937559e-06
Per-token loss scaled by world size: 3.489888695185073e-05
Per-token loss scaled by world size: 0.0005643228068947792
Per-token loss scaled by world size: 0.0003445304755587131
Per-token loss scaled by world size: 0.00016975219477899373
Epoch: 0, Step: 89, Rank: 5, loss = 1.299698829650879
Epoch: 0, Step: 89, Rank: 1, loss = 0.013474289327859879
Epoch: 0, Step: 89, Rank: 3, loss = 0.8457456827163696
Epoch: 0, Step: 89, Rank: 0, loss = 0.07978758215904236
Epoch: 0, Step: 89, Rank: 4, loss = 0.3880959451198578
Epoch: 0, Step: 89, Rank: 6, loss = 1.2901830673217773
Epoch: 0, Step: 89, Rank: 7, loss = 0.7876827716827393
Per-token loss scaled by world size: 0.00024518067948520184
Epoch: 0, Step: 89, Rank: 2, loss = 0.5605443120002747
Epoch 0: 74%|███████▎ | 89/121 [03:47<01:21, 2.55s/it]
total tokens: 5448 num samples: 2 num padding tokens: 812 - rank: 1 max len: 2724 min len: 1912 avg len: 2318.0 num_loss_counted_tokens: 205
total tokens: 7690 num samples: 10 num padding tokens: 487 - rank: 4 max len: 769 min len: 675 avg len: 720.3 num_loss_counted_tokens: 4105
{
"epoch": 0,
"step": 89,
"rank": 0,
"loss": 0.07978758215904236,
"overall_throughput": 41.16520368540104,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.30566644668579,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18290,
"batch_size": 69,
"total_loss": 0.6581515669822693,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:49.084401"
}
total tokens: 7360 num samples: 4 num padding tokens: 675 - rank: 2 max len: 1840 min len: 1482 avg len: 1671.25 num_loss_counted_tokens: 836
total tokens: 7812 num samples: 12 num padding tokens: 1280 - rank: 5 max len: 651 min len: 447 avg len: 544.3333333333334 num_loss_counted_tokens: 2926
total tokens: 7095 num samples: 5 num padding tokens: 1469 - rank: 3 max len: 1419 min len: 943 avg len: 1125.2 num_loss_counted_tokens: 1900
total tokens: 7740 num samples: 18 num padding tokens: 1493 - rank: 6 max len: 430 min len: 282 avg len: 347.05555555555554 num_loss_counted_tokens: 3796
total tokens: 7248 num samples: 2 num padding tokens: 215 - rank: 0 max len: 3624 min len: 3409 avg len: 3516.5 num_loss_counted_tokens: 197
total tokens: 7772 num samples: 29 num padding tokens: 2475 - rank: 7 max len: 268 min len: 82 avg len: 182.6551724137931 num_loss_counted_tokens: 2574
Per-token loss scaled by world size: 0.0002169163926737383
Per-token loss scaled by world size: 0.00031902806949801743
Per-token loss scaled by world size: 0.000320168532198295
Per-token loss scaled by world size: 0.00028694834327325225
Per-token loss scaled by world size: 2.5503815777483396e-05
Per-token loss scaled by world size: 1.9868204617523588e-05
Per-token loss scaled by world size: 0.00033540837466716766
Epoch: 0, Step: 90, Rank: 4, loss = 0.9190002083778381
Epoch: 0, Step: 90, Rank: 6, loss = 0.9222854375839233
Epoch: 0, Step: 90, Rank: 3, loss = 0.8265905380249023
Epoch: 0, Step: 90, Rank: 1, loss = 0.07346692681312561
Epoch: 0, Step: 90, Rank: 0, loss = 0.057232845574617386
Epoch: 0, Step: 90, Rank: 2, loss = 0.6248548030853271
Epoch: 0, Step: 90, Rank: 7, loss = 0.9661857485771179
Per-token loss scaled by world size: 0.0004802969633601606
Epoch: 0, Step: 90, Rank: 5, loss = 1.3835554122924805
Epoch 0: 74%|███████▍ | 90/121 [03:49<01:18, 2.54s/it]
total tokens: 7656 num samples: 11 num padding tokens: 851 - rank: 4 max len: 696 min len: 552 avg len: 618.6363636363636 num_loss_counted_tokens: 4472
total tokens: 7875 num samples: 5 num padding tokens: 1795 - rank: 1 max len: 1575 min len: 1090 avg len: 1216.0 num_loss_counted_tokens: 4251
{
"epoch": 0,
"step": 90,
"rank": 0,
"loss": 0.057232845574617386,
"overall_throughput": 42.04731734702061,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.250526905059814,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23045,
"batch_size": 73,
"total_loss": 0.7216464877128601,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:51.641363"
}
total tokens: 8100 num samples: 15 num padding tokens: 1058 - rank: 5 max len: 540 min len: 390 avg len: 469.46666666666664 num_loss_counted_tokens: 4269
total tokens: 7154 num samples: 7 num padding tokens: 458 - rank: 2 max len: 1022 min len: 870 avg len: 956.5714285714286 num_loss_counted_tokens: 2467
total tokens: 7200 num samples: 3 num padding tokens: 1334 - rank: 0 max len: 2400 min len: 1618 avg len: 1955.3333333333333 num_loss_counted_tokens: 241
total tokens: 7389 num samples: 9 num padding tokens: 493 - rank: 3 max len: 821 min len: 703 avg len: 766.2222222222222 num_loss_counted_tokens: 4922
total tokens: 7945 num samples: 35 num padding tokens: 2409 - rank: 7 max len: 227 min len: 85 avg len: 158.17142857142858 num_loss_counted_tokens: 2310
total tokens: 7986 num samples: 22 num padding tokens: 1512 - rank: 6 max len: 363 min len: 227 avg len: 294.27272727272725 num_loss_counted_tokens: 3717
Per-token loss scaled by world size: 0.0006446933257393539
Per-token loss scaled by world size: 0.00028916815062984824
Per-token loss scaled by world size: 0.00012860735296271741
Per-token loss scaled by world size: 0.00038613073411397636
Per-token loss scaled by world size: 3.499255626593367e-06
Per-token loss scaled by world size: 0.0005611648084595799
Per-token loss scaled by world size: 2.536307329137344e-05
Epoch: 0, Step: 91, Rank: 3, loss = 0.6820392608642578
Epoch: 0, Step: 91, Rank: 1, loss = 0.008253431878983974
Epoch: 0, Step: 91, Rank: 5, loss = 1.520589828491211
Epoch: 0, Step: 91, Rank: 7, loss = 0.9107375741004944
Epoch: 0, Step: 91, Rank: 2, loss = 0.303336501121521
Epoch: 0, Step: 91, Rank: 4, loss = 1.3235772848129272
Epoch: 0, Step: 91, Rank: 0, loss = 0.05982197821140289
Per-token loss scaled by world size: 0.0006536963628605008
Epoch: 0, Step: 91, Rank: 6, loss = 1.5418245792388916
Epoch 0: 75%|███████▌ | 91/121 [03:52<01:16, 2.55s/it]
total tokens: 7800 num samples: 8 num padding tokens: 997 - rank: 4 max len: 975 min len: 702 avg len: 850.375 num_loss_counted_tokens: 5549
total tokens: 7290 num samples: 3 num padding tokens: 1298 - rank: 1 max len: 2430 min len: 1736 avg len: 1997.3333333333333 num_loss_counted_tokens: 1662
{
"epoch": 0,
"step": 91,
"rank": 0,
"loss": 0.05982197821140289,
"overall_throughput": 41.22796863364372,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.05379819869995,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18869,
"batch_size": 79,
"total_loss": 0.7937725186347961,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:54.169866"
}
total tokens: 5968 num samples: 2 num padding tokens: 197 - rank: 0 max len: 2984 min len: 2787 avg len: 2885.5 num_loss_counted_tokens: 209
total tokens: 7667 num samples: 11 num padding tokens: 919 - rank: 5 max len: 697 min len: 558 avg len: 613.4545454545455 num_loss_counted_tokens: 4489
total tokens: 8062 num samples: 29 num padding tokens: 2764 - rank: 7 max len: 278 min len: 78 avg len: 182.68965517241378 num_loss_counted_tokens: 2067
total tokens: 6764 num samples: 4 num padding tokens: 631 - rank: 2 max len: 1691 min len: 1371 avg len: 1533.25 num_loss_counted_tokens: 1920
total tokens: 7920 num samples: 15 num padding tokens: 1922 - rank: 6 max len: 528 min len: 310 avg len: 399.8666666666667 num_loss_counted_tokens: 3579
total tokens: 7693 num samples: 7 num padding tokens: 417 - rank: 3 max len: 1099 min len: 981 avg len: 1039.4285714285713 num_loss_counted_tokens: 6131
Per-token loss scaled by world size: 0.0005888827727176249
Per-token loss scaled by world size: 0.0006443694583140314
Per-token loss scaled by world size: 8.987231012724806e-06
Per-token loss scaled by world size: 0.00010111679148394614
Per-token loss scaled by world size: 1.0885350093303714e-05
Per-token loss scaled by world size: 0.0005767960683442652
Per-token loss scaled by world size: 0.00016675007645972073
Epoch: 0, Step: 92, Rank: 6, loss = 1.448703646659851
Epoch: 0, Step: 92, Rank: 1, loss = 0.02447298914194107
Epoch: 0, Step: 92, Rank: 0, loss = 0.0202055424451828
Epoch: 0, Step: 92, Rank: 3, loss = 1.3239556550979614
Epoch: 0, Step: 92, Rank: 4, loss = 1.2967817783355713
Epoch: 0, Step: 92, Rank: 2, loss = 0.2273358255624771
Epoch: 0, Step: 92, Rank: 7, loss = 0.3748958706855774
Per-token loss scaled by world size: 0.0007324381731450558
Epoch: 0, Step: 92, Rank: 5, loss = 1.646704077720642
Epoch 0: 76%|███████▌ | 92/121 [03:54<01:13, 2.54s/it]
total tokens: 5964 num samples: 2 num padding tokens: 23 - rank: 1 max len: 2982 min len: 2959 avg len: 2970.5 num_loss_counted_tokens: 702
total tokens: 4664 num samples: 22 num padding tokens: 1595 - rank: 7 max len: 212 min len: 82 avg len: 139.5 num_loss_counted_tokens: 1228
total tokens: 8070 num samples: 10 num padding tokens: 724 - rank: 4 max len: 807 min len: 638 avg len: 734.6 num_loss_counted_tokens: 4997
{
"epoch": 0,
"step": 92,
"rank": 0,
"loss": 0.0202055424451828,
"overall_throughput": 41.45973903821548,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.430901527404785,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 17986,
"batch_size": 75,
"total_loss": 0.7953818440437317,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:56.703392"
}
total tokens: 6888 num samples: 3 num padding tokens: 1202 - rank: 2 max len: 2296 min len: 1438 avg len: 1895.3333333333333 num_loss_counted_tokens: 633
total tokens: 8040 num samples: 20 num padding tokens: 2357 - rank: 6 max len: 402 min len: 223 avg len: 284.15 num_loss_counted_tokens: 2838
total tokens: 7665 num samples: 7 num padding tokens: 534 - rank: 3 max len: 1095 min len: 894 avg len: 1018.7142857142857 num_loss_counted_tokens: 4973
total tokens: 7764 num samples: 2 num padding tokens: 79 - rank: 0 max len: 3882 min len: 3803 avg len: 3842.5 num_loss_counted_tokens: 230
total tokens: 7596 num samples: 12 num padding tokens: 1505 - rank: 5 max len: 633 min len: 403 avg len: 507.5833333333333 num_loss_counted_tokens: 3739
Per-token loss scaled by world size: 0.0010100876679643989
Per-token loss scaled by world size: 0.0010313765378668904
Per-token loss scaled by world size: 0.0004698181292042136
Per-token loss scaled by world size: 0.00015131689724512398
Per-token loss scaled by world size: 1.1349918167979922e-05
Per-token loss scaled by world size: 0.0004358472360763699
Per-token loss scaled by world size: 5.984314611851005e-06
Epoch: 0, Step: 93, Rank: 5, loss = 1.8667914867401123
Epoch: 0, Step: 93, Rank: 3, loss = 0.273883581161499
Epoch: 0, Step: 93, Rank: 4, loss = 0.8503708243370056
Epoch: 0, Step: 93, Rank: 6, loss = 1.828258752822876
Epoch: 0, Step: 93, Rank: 0, loss = 0.020543351769447327
Epoch: 0, Step: 93, Rank: 1, loss = 0.01083160936832428
Epoch: 0, Step: 93, Rank: 7, loss = 0.7888835072517395
Per-token loss scaled by world size: 9.832592331804335e-05
Epoch: 0, Step: 93, Rank: 2, loss = 0.17796991765499115
Epoch 0: 77%|███████▋ | 93/121 [03:57<01:11, 2.57s/it]
total tokens: 7368 num samples: 8 num padding tokens: 711 - rank: 4 max len: 921 min len: 712 avg len: 832.125 num_loss_counted_tokens: 4742
total tokens: 6549 num samples: 3 num padding tokens: 1202 - rank: 1 max len: 2183 min len: 1533 avg len: 1782.3333333333333 num_loss_counted_tokens: 405
{
"epoch": 0,
"step": 93,
"rank": 0,
"loss": 0.020543351769447327,
"overall_throughput": 40.324540881871776,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.333390712738037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 14480,
"batch_size": 60,
"total_loss": 0.7271916270256042,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:51:59.324128"
}
total tokens: 8115 num samples: 15 num padding tokens: 1727 - rank: 6 max len: 541 min len: 302 avg len: 425.8666666666667 num_loss_counted_tokens: 4209
total tokens: 7450 num samples: 5 num padding tokens: 482 - rank: 2 max len: 1490 min len: 1299 avg len: 1393.6 num_loss_counted_tokens: 4364
total tokens: 7832 num samples: 11 num padding tokens: 625 - rank: 5 max len: 712 min len: 561 avg len: 655.1818181818181 num_loss_counted_tokens: 4908
total tokens: 6486 num samples: 2 num padding tokens: 374 - rank: 0 max len: 3243 min len: 2869 avg len: 3056.0 num_loss_counted_tokens: 161
total tokens: 7693 num samples: 7 num padding tokens: 643 - rank: 3 max len: 1099 min len: 949 avg len: 1007.1428571428571 num_loss_counted_tokens: 4763
total tokens: 8073 num samples: 27 num padding tokens: 3166 - rank: 7 max len: 299 min len: 83 avg len: 181.74074074074073 num_loss_counted_tokens: 2216
Per-token loss scaled by world size: 0.00047986634308472276
Per-token loss scaled by world size: 0.0005184172769077122
Per-token loss scaled by world size: 0.00042661072802729905
Per-token loss scaled by world size: 5.770879943156615e-05
Per-token loss scaled by world size: 0.0003191411087755114
Per-token loss scaled by world size: 1.1559887752810027e-05
Per-token loss scaled by world size: 1.194144033433986e-06
Epoch: 0, Step: 94, Rank: 3, loss = 1.1955350637435913
Epoch: 0, Step: 94, Rank: 5, loss = 0.9838176965713501
Epoch: 0, Step: 94, Rank: 2, loss = 0.13308370113372803
Epoch: 0, Step: 94, Rank: 7, loss = 1.1066317558288574
Epoch: 0, Step: 94, Rank: 4, loss = 0.7359793186187744
Epoch: 0, Step: 94, Rank: 0, loss = 0.026658546179533005
Epoch: 0, Step: 94, Rank: 1, loss = 0.002753845416009426
Per-token loss scaled by world size: 0.0007509095594286919
Epoch: 0, Step: 94, Rank: 6, loss = 1.7316913604736328
Epoch 0: 78%|███████▊ | 94/121 [04:00<01:08, 2.55s/it]
total tokens: 7608 num samples: 8 num padding tokens: 1057 - rank: 4 max len: 951 min len: 740 avg len: 818.875 num_loss_counted_tokens: 3486
total tokens: 5880 num samples: 2 num padding tokens: 154 - rank: 1 max len: 2940 min len: 2786 avg len: 2863.0 num_loss_counted_tokens: 481
total tokens: 6471 num samples: 3 num padding tokens: 972 - rank: 2 max len: 2157 min len: 1602 avg len: 1833.0 num_loss_counted_tokens: 1746
{
"epoch": 0,
"step": 94,
"rank": 0,
"loss": 0.026658546179533005,
"overall_throughput": 42.25155762364806,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.342710971832275,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18449,
"batch_size": 79,
"total_loss": 0.7395188808441162,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:01.831915"
}
total tokens: 7725 num samples: 5 num padding tokens: 1821 - rank: 3 max len: 1545 min len: 1004 avg len: 1180.8 num_loss_counted_tokens: 4073
total tokens: 7725 num samples: 25 num padding tokens: 2825 - rank: 7 max len: 309 min len: 84 avg len: 196.0 num_loss_counted_tokens: 2238
total tokens: 7540 num samples: 13 num padding tokens: 2357 - rank: 6 max len: 580 min len: 317 avg len: 398.6923076923077 num_loss_counted_tokens: 3252
total tokens: 7920 num samples: 11 num padding tokens: 539 - rank: 5 max len: 720 min len: 583 avg len: 671.0 num_loss_counted_tokens: 4509
total tokens: 7732 num samples: 2 num padding tokens: 720 - rank: 0 max len: 3866 min len: 3146 avg len: 3506.0 num_loss_counted_tokens: 197
Per-token loss scaled by world size: 0.00016949654673226178
Per-token loss scaled by world size: 0.00019448986859060824
Per-token loss scaled by world size: 0.0003709697921294719
Per-token loss scaled by world size: 0.00028232726617716253
Per-token loss scaled by world size: 0.00031966116512194276
Per-token loss scaled by world size: 4.294802783988416e-05
Per-token loss scaled by world size: 4.4173757487442344e-06
Epoch: 0, Step: 95, Rank: 6, loss = 1.1748613119125366
Epoch: 0, Step: 95, Rank: 7, loss = 0.8941304087638855
Epoch: 0, Step: 95, Rank: 1, loss = 0.13601639866828918
Epoch: 0, Step: 95, Rank: 3, loss = 0.5367955565452576
Epoch: 0, Step: 95, Rank: 0, loss = 0.013989828526973724
Epoch: 0, Step: 95, Rank: 2, loss = 0.6159493923187256
Epoch: 0, Step: 95, Rank: 4, loss = 1.0123668909072876
Per-token loss scaled by world size: 0.00031159218633547425
Epoch: 0, Step: 95, Rank: 5, loss = 0.9868124723434448
Epoch 0: 79%|███████▊ | 95/121 [04:02<01:06, 2.56s/it]
total tokens: 7953 num samples: 11 num padding tokens: 802 - rank: 4 max len: 723 min len: 574 avg len: 650.0909090909091 num_loss_counted_tokens: 3865
total tokens: 7895 num samples: 5 num padding tokens: 903 - rank: 1 max len: 1579 min len: 1252 avg len: 1398.4 num_loss_counted_tokens: 4598
{
"epoch": 0,
"step": 95,
"rank": 0,
"loss": 0.013989828526973724,
"overall_throughput": 40.78303709583472,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.430901527404785,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25336,
"batch_size": 79,
"total_loss": 0.6713653802871704,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:04.432513"
}
total tokens: 7994 num samples: 14 num padding tokens: 1263 - rank: 5 max len: 571 min len: 393 avg len: 480.7857142857143 num_loss_counted_tokens: 4425
total tokens: 5280 num samples: 24 num padding tokens: 2248 - rank: 7 max len: 220 min len: 77 avg len: 126.33333333333333 num_loss_counted_tokens: 1108
total tokens: 7212 num samples: 6 num padding tokens: 1461 - rank: 2 max len: 1202 min len: 820 avg len: 958.5 num_loss_counted_tokens: 3497
total tokens: 7820 num samples: 20 num padding tokens: 2085 - rank: 6 max len: 391 min len: 220 avg len: 286.75 num_loss_counted_tokens: 3565
total tokens: 7326 num samples: 9 num padding tokens: 374 - rank: 3 max len: 814 min len: 726 avg len: 772.4444444444445 num_loss_counted_tokens: 2645
total tokens: 7032 num samples: 3 num padding tokens: 98 - rank: 0 max len: 2344 min len: 2282 avg len: 2311.3333333333335 num_loss_counted_tokens: 312
Per-token loss scaled by world size: 0.0003659721987787634
Per-token loss scaled by world size: 1.1935087059100624e-05
Per-token loss scaled by world size: 0.00034196022897958755
Per-token loss scaled by world size: 4.9577370191400405e-06
Per-token loss scaled by world size: 0.000384376646252349
Per-token loss scaled by world size: 3.742313765542349e-06
Per-token loss scaled by world size: 0.0004321872256696224
Epoch: 0, Step: 96, Rank: 3, loss = 1.043386697769165
Epoch: 0, Step: 96, Rank: 2, loss = 0.03402693197131157
Epoch: 0, Step: 96, Rank: 6, loss = 0.974928617477417
Epoch: 0, Step: 96, Rank: 1, loss = 0.01413450762629509
Epoch: 0, Step: 96, Rank: 7, loss = 1.095857858657837
Epoch: 0, Step: 96, Rank: 0, loss = 0.010669336654245853
Epoch: 0, Step: 96, Rank: 4, loss = 1.232165813446045
Per-token loss scaled by world size: 0.00038899367791600525
Epoch: 0, Step: 96, Rank: 5, loss = 1.1090209484100342
Epoch 0: 79%|███████▉ | 96/121 [04:05<01:03, 2.55s/it]
total tokens: 6210 num samples: 3 num padding tokens: 1280 - rank: 1 max len: 2070 min len: 1379 avg len: 1643.3333333333333 num_loss_counted_tokens: 704
total tokens: 8100 num samples: 10 num padding tokens: 581 - rank: 4 max len: 810 min len: 687 avg len: 751.9 num_loss_counted_tokens: 4814
{
"epoch": 0,
"step": 96,
"rank": 0,
"loss": 0.010669336654245853,
"overall_throughput": 42.34869571304124,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.221776962280273,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22808,
"batch_size": 79,
"total_loss": 0.6892738342285156,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:06.964617"
}
total tokens: 5532 num samples: 2 num padding tokens: 327 - rank: 0 max len: 2766 min len: 2439 avg len: 2602.5 num_loss_counted_tokens: 211
total tokens: 7075 num samples: 25 num padding tokens: 2496 - rank: 7 max len: 283 min len: 84 avg len: 183.16 num_loss_counted_tokens: 1819
total tokens: 7546 num samples: 11 num padding tokens: 1067 - rank: 5 max len: 686 min len: 503 avg len: 589.0 num_loss_counted_tokens: 4558
total tokens: 7595 num samples: 7 num padding tokens: 688 - rank: 3 max len: 1085 min len: 846 avg len: 986.7142857142857 num_loss_counted_tokens: 4050
total tokens: 8048 num samples: 16 num padding tokens: 1729 - rank: 6 max len: 503 min len: 314 avg len: 394.9375 num_loss_counted_tokens: 3712
total tokens: 8070 num samples: 6 num padding tokens: 955 - rank: 2 max len: 1345 min len: 1089 avg len: 1185.8333333333333 num_loss_counted_tokens: 3253
Per-token loss scaled by world size: 0.00020825346291530877
Per-token loss scaled by world size: 0.0002562287845648825
Per-token loss scaled by world size: 8.292648271890357e-05
Per-token loss scaled by world size: 9.641618089517578e-05
Per-token loss scaled by world size: 0.00017162703443318605
Per-token loss scaled by world size: 9.9565637356136e-05
Per-token loss scaled by world size: 8.746929961489514e-05
Epoch: 0, Step: 97, Rank: 2, loss = 0.41501447558403015
Epoch: 0, Step: 97, Rank: 0, loss = 0.3645939230918884
Epoch: 0, Step: 97, Rank: 3, loss = 0.3456583023071289
Epoch: 0, Step: 97, Rank: 6, loss = 0.8680524826049805
Epoch: 0, Step: 97, Rank: 1, loss = 0.4018867313861847
Epoch: 0, Step: 97, Rank: 7, loss = 0.7153843641281128
Epoch: 0, Step: 97, Rank: 4, loss = 1.0680255889892578
Per-token loss scaled by world size: 0.00028643777477554977
Epoch: 0, Step: 97, Rank: 5, loss = 1.1939442157745361
Epoch 0: 80%|████████ | 97/121 [04:07<01:00, 2.54s/it]
total tokens: 8016 num samples: 8 num padding tokens: 1080 - rank: 4 max len: 1002 min len: 797 avg len: 867.0 num_loss_counted_tokens: 5620
total tokens: 7887 num samples: 3 num padding tokens: 742 - rank: 1 max len: 2629 min len: 1930 avg len: 2381.6666666666665 num_loss_counted_tokens: 1049
total tokens: 7076 num samples: 29 num padding tokens: 2680 - rank: 7 max len: 244 min len: 78 avg len: 151.58620689655172 num_loss_counted_tokens: 1550
{
"epoch": 0,
"step": 97,
"rank": 0,
"loss": 0.3645939230918884,
"overall_throughput": 41.762877356480914,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.456066131591797,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 33346,
"batch_size": 92,
"total_loss": 0.6715700030326843,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:09.460519"
}
total tokens: 8048 num samples: 16 num padding tokens: 2637 - rank: 6 max len: 503 min len: 255 avg len: 338.1875 num_loss_counted_tokens: 3126
total tokens: 7820 num samples: 10 num padding tokens: 1247 - rank: 5 max len: 782 min len: 563 avg len: 657.3 num_loss_counted_tokens: 4844
total tokens: 7668 num samples: 4 num padding tokens: 419 - rank: 2 max len: 1917 min len: 1666 avg len: 1812.25 num_loss_counted_tokens: 734
total tokens: 7176 num samples: 6 num padding tokens: 542 - rank: 3 max len: 1196 min len: 1025 avg len: 1105.6666666666667 num_loss_counted_tokens: 3353
total tokens: 7128 num samples: 2 num padding tokens: 2 - rank: 0 max len: 3564 min len: 3562 avg len: 3563.0 num_loss_counted_tokens: 172
Per-token loss scaled by world size: 0.0005401856615208089
Per-token loss scaled by world size: 0.000174855042132549
Per-token loss scaled by world size: 0.0003811018541455269
Per-token loss scaled by world size: 2.7653879442368634e-05
Per-token loss scaled by world size: 0.00025230227038264275
Per-token loss scaled by world size: 6.325829599518329e-05
Per-token loss scaled by world size: 0.0003237307828385383
Epoch: 0, Step: 98, Rank: 3, loss = 0.4728299081325531
Epoch: 0, Step: 98, Rank: 2, loss = 1.030547022819519
Epoch: 0, Step: 98, Rank: 0, loss = 0.07477954775094986
Epoch: 0, Step: 98, Rank: 5, loss = 1.4607295989990234
Epoch: 0, Step: 98, Rank: 4, loss = 0.6822568774223328
Epoch: 0, Step: 98, Rank: 1, loss = 0.17105834186077118
Epoch: 0, Step: 98, Rank: 7, loss = 0.8754084706306458
Per-token loss scaled by world size: 0.00048297818284481764
Epoch: 0, Step: 98, Rank: 6, loss = 1.3060333728790283
Epoch 0: 81%|████████ | 98/121 [04:10<00:57, 2.52s/it]
total tokens: 7452 num samples: 9 num padding tokens: 759 - rank: 4 max len: 828 min len: 690 avg len: 743.6666666666666 num_loss_counted_tokens: 4766
total tokens: 8040 num samples: 6 num padding tokens: 578 - rank: 1 max len: 1340 min len: 1184 avg len: 1243.6666666666667 num_loss_counted_tokens: 2162
{
"epoch": 0,
"step": 98,
"rank": 0,
"loss": 0.07477954775094986,
"overall_throughput": 42.80100675786857,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.374258518218994,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21633,
"batch_size": 82,
"total_loss": 0.7592054009437561,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:11.932654"
}
total tokens: 7994 num samples: 7 num padding tokens: 237 - rank: 2 max len: 1142 min len: 1003 avg len: 1108.142857142857 num_loss_counted_tokens: 2239
total tokens: 8008 num samples: 8 num padding tokens: 518 - rank: 3 max len: 1001 min len: 839 avg len: 936.25 num_loss_counted_tokens: 4873
total tokens: 7579 num samples: 11 num padding tokens: 1520 - rank: 5 max len: 689 min len: 432 avg len: 550.8181818181819 num_loss_counted_tokens: 3863
total tokens: 7808 num samples: 32 num padding tokens: 2556 - rank: 7 max len: 244 min len: 86 avg len: 164.125 num_loss_counted_tokens: 1928
total tokens: 8080 num samples: 4 num padding tokens: 1956 - rank: 0 max len: 2020 min len: 1348 avg len: 1531.0 num_loss_counted_tokens: 2972
total tokens: 7740 num samples: 18 num padding tokens: 1567 - rank: 6 max len: 430 min len: 249 avg len: 342.94444444444446 num_loss_counted_tokens: 3382
Per-token loss scaled by world size: 0.00014469273446593434
Per-token loss scaled by world size: 0.0002714892034418881
Per-token loss scaled by world size: 0.00024302249948959798
Per-token loss scaled by world size: 0.0003524755884427577
Per-token loss scaled by world size: 6.74632319714874e-05
Per-token loss scaled by world size: 0.0003295539354439825
Per-token loss scaled by world size: 0.0001440553314751014
Epoch: 0, Step: 99, Rank: 3, loss = 0.4647168815135956
Epoch: 0, Step: 99, Rank: 5, loss = 0.8719554543495178
Epoch: 0, Step: 99, Rank: 1, loss = 0.2166750431060791
Epoch: 0, Step: 99, Rank: 7, loss = 0.7805275321006775
Epoch: 0, Step: 99, Rank: 6, loss = 1.1320635080337524
Epoch: 0, Step: 99, Rank: 4, loss = 1.058444857597351
Per-token loss scaled by world size: 0.00019673565111588687
Epoch: 0, Step: 99, Rank: 2, loss = 0.4626697301864624
Epoch: 0, Step: 99, Rank: 0, loss = 0.6318657398223877
Epoch 0: 82%|████████▏ | 99/121 [04:12<00:55, 2.54s/it]
total tokens: 7893 num samples: 9 num padding tokens: 665 - rank: 4 max len: 877 min len: 700 avg len: 803.1111111111111 num_loss_counted_tokens: 3967
total tokens: 2590 num samples: 14 num padding tokens: 751 - rank: 7 max len: 185 min len: 86 avg len: 131.35714285714286 num_loss_counted_tokens: 581
total tokens: 7648 num samples: 8 num padding tokens: 131 - rank: 3 max len: 956 min len: 911 avg len: 939.625 num_loss_counted_tokens: 5962
total tokens: 7645 num samples: 5 num padding tokens: 395 - rank: 1 max len: 1529 min len: 1349 avg len: 1450.0 num_loss_counted_tokens: 2441
{
"epoch": 0,
"step": 99,
"rank": 0,
"loss": 0.6318657398223877,
"overall_throughput": 40.78082568043985,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.533984184265137,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25694,
"batch_size": 91,
"total_loss": 0.7023648619651794,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:14.529016"
}
total tokens: 7752 num samples: 6 num padding tokens: 918 - rank: 2 max len: 1292 min len: 958 avg len: 1139.0 num_loss_counted_tokens: 2321
total tokens: 7260 num samples: 3 num padding tokens: 1692 - rank: 0 max len: 2420 min len: 1564 avg len: 1856.0 num_loss_counted_tokens: 2978
total tokens: 7740 num samples: 20 num padding tokens: 1925 - rank: 6 max len: 387 min len: 196 avg len: 290.75 num_loss_counted_tokens: 3585
total tokens: 8064 num samples: 12 num padding tokens: 1582 - rank: 5 max len: 672 min len: 401 avg len: 540.1666666666666 num_loss_counted_tokens: 3589
Per-token loss scaled by world size: 0.0002912842610385269
Per-token loss scaled by world size: 0.00046773377107456326
Per-token loss scaled by world size: 0.00034302467247471213
Per-token loss scaled by world size: 0.0002211699465988204
Per-token loss scaled by world size: 6.557774031534791e-05
Per-token loss scaled by world size: 2.8064672733307816e-05
Per-token loss scaled by world size: 2.8628314794332255e-06
Epoch: 0, Step: 100, Rank: 5, loss = 1.2855077981948853
Epoch: 0, Step: 100, Rank: 7, loss = 0.9427604675292969
Epoch: 0, Step: 100, Rank: 4, loss = 0.6078579425811768
Epoch: 0, Step: 100, Rank: 3, loss = 0.8005583882331848
Epoch: 0, Step: 100, Rank: 2, loss = 0.1802322268486023
Epoch: 0, Step: 100, Rank: 1, loss = 0.07713224738836288
Epoch: 0, Step: 100, Rank: 0, loss = 0.007868134416639805
Per-token loss scaled by world size: 0.0006381099228747189
Epoch: 0, Step: 100, Rank: 6, loss = 1.753765344619751
Epoch 0: 83%|████████▎ | 100/121 [04:15<00:53, 2.54s/it]
total tokens: 5672 num samples: 2 num padding tokens: 1177 - rank: 1 max len: 2836 min len: 1659 avg len: 2247.5 num_loss_counted_tokens: 502
total tokens: 7504 num samples: 8 num padding tokens: 1233 - rank: 4 max len: 938 min len: 657 avg len: 783.875 num_loss_counted_tokens: 4542
{
"epoch": 0,
"step": 100,
"rank": 0,
"loss": 0.007868134416639805,
"overall_throughput": 41.97207337460162,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.315226078033447,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21987,
"batch_size": 69,
"total_loss": 0.7069603800773621,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:17.051456"
}
total tokens: 7908 num samples: 6 num padding tokens: 1459 - rank: 3 max len: 1318 min len: 961 avg len: 1074.8333333333333 num_loss_counted_tokens: 4507
total tokens: 7644 num samples: 12 num padding tokens: 705 - rank: 5 max len: 637 min len: 515 avg len: 578.25 num_loss_counted_tokens: 4604
total tokens: 8032 num samples: 16 num padding tokens: 1718 - rank: 6 max len: 502 min len: 297 avg len: 394.625 num_loss_counted_tokens: 3774
total tokens: 7306 num samples: 26 num padding tokens: 2547 - rank: 7 max len: 281 min len: 87 avg len: 183.03846153846155 num_loss_counted_tokens: 1982
total tokens: 6576 num samples: 4 num padding tokens: 507 - rank: 2 max len: 1644 min len: 1411 avg len: 1517.25 num_loss_counted_tokens: 3636
total tokens: 6544 num samples: 2 num padding tokens: 236 - rank: 0 max len: 3272 min len: 3036 avg len: 3154.0 num_loss_counted_tokens: 209
Per-token loss scaled by world size: 0.00010836837464012206
Per-token loss scaled by world size: 0.0003503480111248791
Per-token loss scaled by world size: 0.0005262716440483928
Per-token loss scaled by world size: 0.0004925990360789001
Per-token loss scaled by world size: 0.0006540374597534537
Per-token loss scaled by world size: 8.3443388575688e-05
Per-token loss scaled by world size: 2.478029045960284e-06
Epoch: 0, Step: 101, Rank: 6, loss = 1.2019386291503906
Epoch: 0, Step: 101, Rank: 4, loss = 0.8001510500907898
Epoch: 0, Step: 101, Rank: 5, loss = 1.4937398433685303
Epoch: 0, Step: 101, Rank: 2, loss = 0.24749982357025146
Epoch: 0, Step: 101, Rank: 7, loss = 1.1250346899032593
Epoch: 0, Step: 101, Rank: 1, loss = 0.1905742734670639
Epoch: 0, Step: 101, Rank: 0, loss = 0.005659508518874645
Per-token loss scaled by world size: 0.00031609582947567105
Epoch: 0, Step: 101, Rank: 3, loss = 0.7219233512878418
Epoch 0: 83%|████████▎ | 101/121 [04:17<00:50, 2.54s/it]
total tokens: 7903 num samples: 7 num padding tokens: 827 - rank: 4 max len: 1129 min len: 858 avg len: 1010.8571428571429 num_loss_counted_tokens: 3333
total tokens: 5778 num samples: 2 num padding tokens: 561 - rank: 1 max len: 2889 min len: 2328 avg len: 2608.5 num_loss_counted_tokens: 499
{
"epoch": 0,
"step": 101,
"rank": 0,
"loss": 0.005659508518874645,
"overall_throughput": 41.27268820447733,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.25443983078003,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18271,
"batch_size": 78,
"total_loss": 0.7233151197433472,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:19.615502"
}
total tokens: 7981 num samples: 23 num padding tokens: 4218 - rank: 7 max len: 347 min len: 81 avg len: 163.6086956521739 num_loss_counted_tokens: 1503
total tokens: 7659 num samples: 9 num padding tokens: 737 - rank: 5 max len: 851 min len: 617 avg len: 769.1111111111111 num_loss_counted_tokens: 5452
total tokens: 7125 num samples: 5 num padding tokens: 617 - rank: 3 max len: 1425 min len: 1151 avg len: 1301.6 num_loss_counted_tokens: 3466
total tokens: 6171 num samples: 3 num padding tokens: 616 - rank: 2 max len: 2057 min len: 1742 avg len: 1851.6666666666667 num_loss_counted_tokens: 894
total tokens: 6344 num samples: 2 num padding tokens: 281 - rank: 0 max len: 3172 min len: 2891 avg len: 3031.5 num_loss_counted_tokens: 198
total tokens: 7878 num samples: 13 num padding tokens: 1215 - rank: 6 max len: 606 min len: 373 avg len: 512.5384615384615 num_loss_counted_tokens: 4756
Per-token loss scaled by world size: 0.0004239458357915282
Per-token loss scaled by world size: 0.00031359592685475945
Per-token loss scaled by world size: 0.00021933596872258931
Per-token loss scaled by world size: 0.00028875406133010983
Per-token loss scaled by world size: 0.000378787808585912
Per-token loss scaled by world size: 6.144649523776025e-05
Per-token loss scaled by world size: 0.0003507360816001892
Epoch: 0, Step: 102, Rank: 2, loss = 0.9117801189422607
Epoch: 0, Step: 102, Rank: 5, loss = 1.232622504234314
Epoch: 0, Step: 102, Rank: 4, loss = 0.8395524621009827
Epoch: 0, Step: 102, Rank: 6, loss = 1.101325511932373
Epoch: 0, Step: 102, Rank: 3, loss = 0.6377193331718445
Epoch: 0, Step: 102, Rank: 0, loss = 0.1786556839942932
Epoch: 0, Step: 102, Rank: 7, loss = 1.0197651386260986
Per-token loss scaled by world size: 0.00012210274871904403
Epoch: 0, Step: 102, Rank: 1, loss = 0.35501372814178467
Epoch 0: 84%|████████▍ | 102/121 [04:20<00:48, 2.56s/it]
total tokens: 7592 num samples: 8 num padding tokens: 994 - rank: 4 max len: 949 min len: 717 avg len: 824.75 num_loss_counted_tokens: 4547
total tokens: 6717 num samples: 3 num padding tokens: 1106 - rank: 1 max len: 2239 min len: 1619 avg len: 1870.3333333333333 num_loss_counted_tokens: 2357
{
"epoch": 0,
"step": 102,
"rank": 0,
"loss": 0.1786556839942932,
"overall_throughput": 40.7946280716196,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.383514404296875,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23260,
"batch_size": 97,
"total_loss": 0.7845543026924133,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:22.207291"
}
total tokens: 8022 num samples: 14 num padding tokens: 1912 - rank: 6 max len: 573 min len: 313 avg len: 436.42857142857144 num_loss_counted_tokens: 3612
total tokens: 7575 num samples: 5 num padding tokens: 437 - rank: 2 max len: 1515 min len: 1354 avg len: 1427.6 num_loss_counted_tokens: 3239
total tokens: 7656 num samples: 11 num padding tokens: 467 - rank: 5 max len: 696 min len: 601 avg len: 653.5454545454545 num_loss_counted_tokens: 3901
total tokens: 7392 num samples: 24 num padding tokens: 2967 - rank: 7 max len: 308 min len: 80 avg len: 184.375 num_loss_counted_tokens: 1964
total tokens: 7478 num samples: 2 num padding tokens: 770 - rank: 0 max len: 3739 min len: 2969 avg len: 3354.0 num_loss_counted_tokens: 161
total tokens: 7944 num samples: 6 num padding tokens: 1452 - rank: 3 max len: 1324 min len: 951 avg len: 1082.0 num_loss_counted_tokens: 5314
Per-token loss scaled by world size: 0.000377753924112767
Per-token loss scaled by world size: 0.0003784565778914839
Per-token loss scaled by world size: 0.00017589255003258586
Per-token loss scaled by world size: 0.00025099579943343997
Per-token loss scaled by world size: 0.0002923366264440119
Per-token loss scaled by world size: 1.306815647694748e-06
Per-token loss scaled by world size: 8.565741882193834e-05
Epoch: 0, Step: 103, Rank: 5, loss = 1.0730663537979126
Epoch: 0, Step: 103, Rank: 3, loss = 0.7116672396659851
Epoch: 0, Step: 103, Rank: 1, loss = 0.49872133135795593
Epoch: 0, Step: 103, Rank: 0, loss = 0.0037053122650831938
Epoch: 0, Step: 103, Rank: 6, loss = 1.0710740089416504
Epoch: 0, Step: 103, Rank: 4, loss = 0.828883945941925
Epoch: 0, Step: 103, Rank: 7, loss = 0.24287091195583344
Per-token loss scaled by world size: 0.00028907370870001614
Epoch: 0, Step: 103, Rank: 2, loss = 0.819632351398468
Epoch 0: 85%|████████▌ | 103/121 [04:22<00:45, 2.55s/it]
total tokens: 7472 num samples: 8 num padding tokens: 859 - rank: 4 max len: 934 min len: 761 avg len: 826.625 num_loss_counted_tokens: 4441
total tokens: 5632 num samples: 2 num padding tokens: 254 - rank: 1 max len: 2816 min len: 2562 avg len: 2689.0 num_loss_counted_tokens: 324
total tokens: 7860 num samples: 30 num padding tokens: 3085 - rank: 7 max len: 262 min len: 70 avg len: 159.16666666666666 num_loss_counted_tokens: 2067
total tokens: 7635 num samples: 15 num padding tokens: 2223 - rank: 6 max len: 509 min len: 271 avg len: 360.8 num_loss_counted_tokens: 2976
{
"epoch": 0,
"step": 103,
"rank": 0,
"loss": 0.0037053122650831938,
"overall_throughput": 41.79496526057776,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.362098217010498,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22683,
"batch_size": 80,
"total_loss": 0.6562026739120483,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:24.741846"
}
total tokens: 7140 num samples: 5 num padding tokens: 839 - rank: 3 max len: 1428 min len: 1015 avg len: 1260.2 num_loss_counted_tokens: 4097
total tokens: 7520 num samples: 10 num padding tokens: 1082 - rank: 5 max len: 752 min len: 531 avg len: 643.8 num_loss_counted_tokens: 3194
total tokens: 6831 num samples: 3 num padding tokens: 752 - rank: 2 max len: 2277 min len: 1793 avg len: 2026.3333333333333 num_loss_counted_tokens: 2435
total tokens: 6642 num samples: 2 num padding tokens: 16 - rank: 0 max len: 3321 min len: 3305 avg len: 3313.0 num_loss_counted_tokens: 167
Per-token loss scaled by world size: 0.00023188922205008566
Per-token loss scaled by world size: 0.0005923461285419762
Per-token loss scaled by world size: 0.0007710273494012654
Per-token loss scaled by world size: 0.0006771996268071234
Per-token loss scaled by world size: 5.4260908655123785e-06
Per-token loss scaled by world size: 7.5567550084088e-06
Per-token loss scaled by world size: 0.00048243210767395794
Epoch: 0, Step: 104, Rank: 5, loss = 1.1573703289031982
Epoch: 0, Step: 104, Rank: 6, loss = 1.5064910650253296
Epoch: 0, Step: 104, Rank: 3, loss = 0.4530825614929199
Epoch: 0, Step: 104, Rank: 4, loss = 1.323163390159607
Epoch: 0, Step: 104, Rank: 2, loss = 0.010601903311908245
Epoch: 0, Step: 104, Rank: 1, loss = 0.014764954335987568
Epoch: 0, Step: 104, Rank: 7, loss = 0.9426120519638062
Per-token loss scaled by world size: 8.817338675726205e-05
Epoch: 0, Step: 104, Rank: 0, loss = 0.17227977514266968
Epoch 0: 86%|████████▌ | 104/121 [04:25<00:43, 2.55s/it]
total tokens: 7770 num samples: 10 num padding tokens: 458 - rank: 4 max len: 777 min len: 676 avg len: 731.2 num_loss_counted_tokens: 4387
total tokens: 7380 num samples: 5 num padding tokens: 709 - rank: 1 max len: 1476 min len: 1222 avg len: 1334.2 num_loss_counted_tokens: 2606
total tokens: 7931 num samples: 7 num padding tokens: 471 - rank: 2 max len: 1133 min len: 997 avg len: 1065.7142857142858 num_loss_counted_tokens: 2877
{
"epoch": 0,
"step": 104,
"rank": 0,
"loss": 0.17227977514266968,
"overall_throughput": 41.371928375269654,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.487372398376465,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 15631,
"batch_size": 56,
"total_loss": 0.6975457668304443,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:27.296912"
}
total tokens: 8100 num samples: 12 num padding tokens: 2394 - rank: 5 max len: 675 min len: 353 avg len: 475.5 num_loss_counted_tokens: 3145
total tokens: 7684 num samples: 34 num padding tokens: 2510 - rank: 7 max len: 226 min len: 80 avg len: 152.1764705882353 num_loss_counted_tokens: 1885
total tokens: 7496 num samples: 8 num padding tokens: 516 - rank: 3 max len: 937 min len: 786 avg len: 872.5 num_loss_counted_tokens: 4095
total tokens: 8073 num samples: 23 num padding tokens: 1383 - rank: 6 max len: 351 min len: 231 avg len: 290.8695652173913 num_loss_counted_tokens: 3443
total tokens: 6147 num samples: 3 num padding tokens: 667 - rank: 0 max len: 2049 min len: 1588 avg len: 1826.6666666666667 num_loss_counted_tokens: 458
Per-token loss scaled by world size: 0.00020804539963137358
Per-token loss scaled by world size: 0.00024777904036454856
Per-token loss scaled by world size: 0.00040929196984507143
Per-token loss scaled by world size: 0.00047369435196742415
Per-token loss scaled by world size: 0.00023292600235436112
Per-token loss scaled by world size: 0.0002660582831595093
Per-token loss scaled by world size: 2.4003027647268027e-05
Epoch: 0, Step: 105, Rank: 6, loss = 1.2955113649368286
Epoch: 0, Step: 105, Rank: 5, loss = 1.4993610382080078
Epoch: 0, Step: 105, Rank: 7, loss = 0.7842825651168823
Epoch: 0, Step: 105, Rank: 3, loss = 0.7372690439224243
Epoch: 0, Step: 105, Rank: 2, loss = 0.6585156917572021
Epoch: 0, Step: 105, Rank: 0, loss = 0.07597558200359344
Epoch: 0, Step: 105, Rank: 4, loss = 0.8421409726142883
Per-token loss scaled by world size: 0.0001044606979121454
Epoch: 0, Step: 105, Rank: 1, loss = 0.3306442201137543
Epoch 0: 87%|████████▋ | 105/121 [04:28<00:40, 2.56s/it]
{
"epoch": 0,
"step": 105,
"rank": 0,
"loss": 0.07597558200359344,
"overall_throughput": 41.19873338301903,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.338607788085938,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25322,
"batch_size": 90,
"total_loss": 0.7779626250267029,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:29.864769"
}
Per-token loss scaled by world size: 0.0007665135781280696
Per-token loss scaled by world size: 0.0007400316535495222
Per-token loss scaled by world size: 0.0002667310182005167
Per-token loss scaled by world size: 0.000634319381788373
Per-token loss scaled by world size: 0.00010545395343797281
Per-token loss scaled by world size: 4.777937192557147e-06
Per-token loss scaled by world size: 1.3243148941910476e-06
Epoch: 0, Step: 106, Rank: 2, loss = 0.21801286935806274
Epoch: 0, Step: 106, Rank: 4, loss = 1.5846710205078125
Epoch: 0, Step: 106, Rank: 3, loss = 0.5514330267906189
Epoch: 0, Step: 106, Rank: 6, loss = 1.3113759756088257
Epoch: 0, Step: 106, Rank: 0, loss = 0.00987778790295124
Epoch: 0, Step: 106, Rank: 7, loss = 1.5299229621887207
Epoch: 0, Step: 106, Rank: 1, loss = 0.002737855538725853
Per-token loss scaled by world size: 0.0005778921768069267
Epoch: 0, Step: 106, Rank: 5, loss = 1.1947197914123535
Epoch 0: 88%|████████▊ | 106/121 [04:30<00:38, 2.53s/it]
{
"epoch": 0,
"step": 106,
"rank": 0,
"loss": 0.00987778790295124,
"overall_throughput": 42.66049518366333,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.433530807495117,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 16539,
"batch_size": 82,
"total_loss": 0.800343930721283,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:32.345371"
}
Per-token loss scaled by world size: 0.00032297801226377487
Per-token loss scaled by world size: 0.00039764257962815464
Per-token loss scaled by world size: 0.00018069567158818245
Per-token loss scaled by world size: 0.00018984438793268055
Per-token loss scaled by world size: 0.00037407863419502974
Per-token loss scaled by world size: 2.2991487185208825e-06
Per-token loss scaled by world size: 0.00024643141659907997
Epoch: 0, Step: 107, Rank: 1, loss = 0.6019198894500732
Epoch: 0, Step: 107, Rank: 4, loss = 1.3245971202850342
Epoch: 0, Step: 107, Rank: 2, loss = 0.6323953866958618
Epoch: 0, Step: 107, Rank: 6, loss = 1.0758801698684692
Epoch: 0, Step: 107, Rank: 0, loss = 0.007658752147108316
Epoch: 0, Step: 107, Rank: 3, loss = 1.2461026906967163
Epoch: 0, Step: 107, Rank: 7, loss = 0.8208938241004944
Per-token loss scaled by world size: 0.00040240780799649656
Epoch: 0, Step: 107, Rank: 5, loss = 1.3404706716537476
Epoch 0: 88%|████████▊ | 107/121 [04:33<00:35, 2.54s/it]
{
"epoch": 0,
"step": 107,
"rank": 0,
"loss": 0.007658752147108316,
"overall_throughput": 41.73801803946499,
"lr": 1.6000000000000001e-06,
"cuda_mem_allocated": 24.428075790405273,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26649,
"batch_size": 107,
"total_loss": 0.8812397718429565,
"gradnorm": 1.0122549533843994,
"weight_norm": 433.0432434082031,
"timestamp": "2024-08-18T20:52:34.882920"
}
Per-token loss scaled by world size: 0.0003084656782448292
Per-token loss scaled by world size: 0.0003857612609863281
Per-token loss scaled by world size: 7.505448593292385e-05
Per-token loss scaled by world size: 3.2563605145696783e-06
Per-token loss scaled by world size: 6.232602754607797e-05
Per-token loss scaled by world size: 0.00027665658853948116
Per-token loss scaled by world size: 0.00029414540040306747
Epoch: 0, Step: 108, Rank: 5, loss = 1.2347253561019897
Epoch: 0, Step: 108, Rank: 1, loss = 0.24023064970970154
Epoch: 0, Step: 108, Rank: 0, loss = 0.01042279601097107
Epoch: 0, Step: 108, Rank: 2, loss = 0.199490025639534
Epoch: 0, Step: 108, Rank: 4, loss = 0.9873215556144714
Epoch: 0, Step: 108, Rank: 6, loss = 0.9414858818054199
Epoch: 0, Step: 108, Rank: 7, loss = 0.8855085372924805
Per-token loss scaled by world size: 0.0003316248476039618
Epoch: 0, Step: 108, Rank: 3, loss = 1.0614482164382935
[2024-08-18 20:52:37,412] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[2.4000000000000003e-06], mom=[(0.9, 0.95)]
[2024-08-18 20:52:37,489] [INFO] [timer.py:258:stop] epoch=0/micro_step=108/global_step=3, RunningAvgSamplesPerSec=41.632043299114166, CurrSamplesPerSec=41.632043299114166, MemAllocated=22.7GB, MaxMemAllocated=30.58GB
Epoch 0: 89%|████████▉ | 108/121 [04:35<00:33, 2.56s/it]
{
"epoch": 0,
"step": 108,
"rank": 0,
"loss": 0.01042279601097107,
"overall_throughput": 40.57748284835933,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 22.696479320526123,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25606,
"batch_size": 79,
"total_loss": 0.6950791478157043,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:52:37.550745"
}
Per-token loss scaled by world size: 0.0006956385332159698
Per-token loss scaled by world size: 0.0003228207933716476
Per-token loss scaled by world size: 3.4351643989793956e-05
Per-token loss scaled by world size: 7.628021558048204e-05
Per-token loss scaled by world size: 0.0002656075230333954
Per-token loss scaled by world size: 0.0005566730978898704
Epoch: 0, Step: 109, Rank: 3, loss = 0.7804192900657654
Epoch: 0, Step: 109, Rank: 5, loss = 1.681706190109253
Epoch: 0, Step: 109, Rank: 1, loss = 0.0830451026558876
Epoch: 0, Step: 109, Rank: 2, loss = 0.18440742790699005
Per-token loss scaled by world size: 0.00023115344811230898
Epoch: 0, Step: 109, Rank: 4, loss = 0.6421061754226685
Epoch: 0, Step: 109, Rank: 6, loss = 1.345757246017456
Per-token loss scaled by world size: 2.6328027161071077e-05
Epoch: 0, Step: 109, Rank: 7, loss = 0.5588134527206421
Epoch: 0, Step: 109, Rank: 0, loss = 0.06364800781011581
Epoch 0: 90%|█████████ | 109/121 [04:38<00:31, 2.59s/it]
{
"epoch": 0,
"step": 109,
"rank": 0,
"loss": 0.06364800781011581,
"overall_throughput": 40.20414871551739,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.495201587677002,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19340,
"batch_size": 78,
"total_loss": 0.6674879193305969,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:52:40.142481"
}
Per-token loss scaled by world size: 0.00015839148545637727
Per-token loss scaled by world size: 1.4700900692332652e-06
Per-token loss scaled by world size: 0.00025009570526890457
Per-token loss scaled by world size: 0.00011483808339107782
Per-token loss scaled by world size: 1.5624003935954534e-05
Per-token loss scaled by world size: 0.00039461886626668274
Per-token loss scaled by world size: 0.0002000233216676861
Epoch: 0, Step: 110, Rank: 2, loss = 0.8055582642555237
Epoch: 0, Step: 110, Rank: 5, loss = 1.2710673809051514
Epoch: 0, Step: 110, Rank: 0, loss = 0.004735160153359175
Epoch: 0, Step: 110, Rank: 4, loss = 0.36989346146583557
Epoch: 0, Step: 110, Rank: 1, loss = 0.05032491683959961
Epoch: 0, Step: 110, Rank: 3, loss = 0.5101789832115173
Epoch: 0, Step: 110, Rank: 7, loss = 0.6442751288414001
Per-token loss scaled by world size: 0.00032995041692629457
Epoch: 0, Step: 110, Rank: 6, loss = 1.0627702474594116
Epoch 0: 91%|█████████ | 110/121 [04:40<00:28, 2.56s/it]
{
"epoch": 0,
"step": 110,
"rank": 0,
"loss": 0.004735160153359175,
"overall_throughput": 41.98747405186681,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.342525005340576,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25768,
"batch_size": 78,
"total_loss": 0.5898504853248596,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:52:42.652816"
}
Per-token loss scaled by world size: 0.0003511311369948089
Per-token loss scaled by world size: 0.0003018661809619516
Per-token loss scaled by world size: 0.00043145185918547213
Per-token loss scaled by world size: 0.0006916387937963009
Per-token loss scaled by world size: 0.00025981958606280386
Per-token loss scaled by world size: 0.000197814850253053
Per-token loss scaled by world size: 5.381280425353907e-05
Epoch: 0, Step: 111, Rank: 6, loss = 0.877037763595581
Epoch: 0, Step: 111, Rank: 5, loss = 1.7275407314300537
Epoch: 0, Step: 111, Rank: 4, loss = 0.7539862394332886
Epoch: 0, Step: 111, Rank: 7, loss = 1.0776588916778564
Epoch: 0, Step: 111, Rank: 3, loss = 0.6489643454551697
Epoch: 0, Step: 111, Rank: 2, loss = 0.49409204721450806
Epoch: 0, Step: 111, Rank: 1, loss = 0.13441093266010284
Per-token loss scaled by world size: 6.49542971586925e-06
Epoch: 0, Step: 111, Rank: 0, loss = 0.016223959624767303
Epoch 0: 92%|█████████▏| 111/121 [04:43<00:25, 2.57s/it]
{
"epoch": 0,
"step": 111,
"rank": 0,
"loss": 0.016223959624767303,
"overall_throughput": 41.001441887063585,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.491368293762207,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19982,
"batch_size": 69,
"total_loss": 0.716239333152771,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:52:45.269018"
}
Per-token loss scaled by world size: 0.00018836244998965412
Per-token loss scaled by world size: 0.00022297118266578764
Per-token loss scaled by world size: 0.0004412824346218258
Per-token loss scaled by world size: 0.0002451048349030316
Per-token loss scaled by world size: 2.037934336840408e-06
Per-token loss scaled by world size: 0.00028445024508982897
Per-token loss scaled by world size: 0.00019181481911800802
Epoch: 0, Step: 112, Rank: 4, loss = 0.735774040222168
Epoch: 0, Step: 112, Rank: 6, loss = 1.3246747255325317
Epoch: 0, Step: 112, Rank: 1, loss = 0.5654405355453491
Epoch: 0, Step: 112, Rank: 0, loss = 0.00611762423068285
Epoch: 0, Step: 112, Rank: 3, loss = 0.6693316102027893
Epoch: 0, Step: 112, Rank: 2, loss = 0.8538841009140015
Epoch: 0, Step: 112, Rank: 7, loss = 0.5758041143417358
Per-token loss scaled by world size: 0.00042108085472136736
Epoch: 0, Step: 112, Rank: 5, loss = 1.2640321254730225
Epoch 0: 93%|█████████▎| 112/121 [04:45<00:23, 2.56s/it]
{
"epoch": 0,
"step": 112,
"rank": 0,
"loss": 0.00611762423068285,
"overall_throughput": 41.80928245712371,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.407159328460693,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24015,
"batch_size": 92,
"total_loss": 0.7493823766708374,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:52:47.762170"
}
Per-token loss scaled by world size: 0.0004581383545883
Per-token loss scaled by world size: 0.0004093858879059553
Per-token loss scaled by world size: 0.0004562860412988812
Per-token loss scaled by world size: 0.0003013689420185983
Per-token loss scaled by world size: 8.010071906028315e-05
Per-token loss scaled by world size: 9.55424093262991e-06
Per-token loss scaled by world size: 0.0002337862824788317
Epoch: 0, Step: 113, Rank: 6, loss = 1.3187236785888672
Epoch: 0, Step: 113, Rank: 5, loss = 1.1831763982772827
Epoch: 0, Step: 113, Rank: 4, loss = 1.3240771293640137
Epoch: 0, Step: 113, Rank: 1, loss = 0.23150108754634857
Epoch: 0, Step: 113, Rank: 3, loss = 0.8709939122200012
Epoch: 0, Step: 113, Rank: 0, loss = 0.027612950652837753
Epoch: 0, Step: 113, Rank: 7, loss = 0.6756715774536133
Per-token loss scaled by world size: 0.0002764550154097378
Epoch: 0, Step: 113, Rank: 2, loss = 0.7989895343780518
Epoch 0: 93%|█████████▎| 113/121 [04:48<00:20, 2.56s/it]
{
"epoch": 0,
"step": 113,
"rank": 0,
"loss": 0.027612950652837753,
"overall_throughput": 41.14323140579119,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.228994369506836,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23121,
"batch_size": 80,
"total_loss": 0.8038432598114014,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:52:50.330248"
}
Per-token loss scaled by world size: 0.0005671991966664791
Per-token loss scaled by world size: 0.00022458804596681148
Per-token loss scaled by world size: 0.00021035455574747175
Per-token loss scaled by world size: 4.355869896244258e-05
Per-token loss scaled by world size: 7.875097253418062e-06
Per-token loss scaled by world size: 0.0004406924417708069
Per-token loss scaled by world size: 0.0002821373345796019
Epoch: 0, Step: 114, Rank: 6, loss = 1.126409888267517
Epoch: 0, Step: 114, Rank: 0, loss = 0.020128747448325157
Epoch: 0, Step: 114, Rank: 5, loss = 1.449761152267456
Epoch: 0, Step: 114, Rank: 4, loss = 0.5376662611961365
Epoch: 0, Step: 114, Rank: 3, loss = 0.5740470290184021
Epoch: 0, Step: 114, Rank: 2, loss = 0.11133603751659393
Epoch: 0, Step: 114, Rank: 7, loss = 0.7211430072784424
Per-token loss scaled by world size: 2.8121936338720843e-05
Epoch: 0, Step: 114, Rank: 1, loss = 0.07187967002391815
Epoch 0: 94%|█████████▍| 114/121 [04:51<00:17, 2.56s/it]
{
"epoch": 0,
"step": 114,
"rank": 0,
"loss": 0.020128747448325157,
"overall_throughput": 41.23395474647752,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.419190883636475,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20448,
"batch_size": 78,
"total_loss": 0.5765464305877686,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:52:52.896686"
}
Per-token loss scaled by world size: 0.00020671900711022317
Per-token loss scaled by world size: 0.00019181812240276486
Per-token loss scaled by world size: 0.00029149872716516256
Per-token loss scaled by world size: 0.00032473698956891894
Per-token loss scaled by world size: 0.00032112447661347687
Per-token loss scaled by world size: 0.0003388051700312644
Per-token loss scaled by world size: 0.00025795208057388663
Epoch: 0, Step: 115, Rank: 3, loss = 0.9541117548942566
Epoch: 0, Step: 115, Rank: 1, loss = 0.6278446912765503
Epoch: 0, Step: 115, Rank: 0, loss = 0.6766171455383301
Epoch: 0, Step: 115, Rank: 6, loss = 1.062904715538025
Epoch: 0, Step: 115, Rank: 5, loss = 1.051080584526062
Epoch: 0, Step: 115, Rank: 7, loss = 0.8443094491958618
Epoch: 0, Step: 115, Rank: 4, loss = 1.1089516878128052
Per-token loss scaled by world size: 0.00011388205894036219
Epoch: 0, Step: 115, Rank: 2, loss = 0.3727502226829529
Epoch 0: 95%|█████████▌| 115/121 [04:53<00:15, 2.57s/it]
{
"epoch": 0,
"step": 115,
"rank": 0,
"loss": 0.6766171455383301,
"overall_throughput": 40.978208090508964,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.531991481781006,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26185,
"batch_size": 95,
"total_loss": 0.8373212814331055,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:52:55.477863"
}
Per-token loss scaled by world size: 0.00025488759274594486
Per-token loss scaled by world size: 0.00010854459105757996
Per-token loss scaled by world size: 0.00016505405073985457
Per-token loss scaled by world size: 0.00032808436662890017
Per-token loss scaled by world size: 0.00027797490474767983
Per-token loss scaled by world size: 0.00035925835254602134
Epoch: 0, Step: 116, Rank: 5, loss = 1.0426521301269531
Epoch: 0, Step: 116, Rank: 0, loss = 0.5245417952537537
Epoch: 0, Step: 116, Rank: 3, loss = 0.8100327849388123
Epoch: 0, Step: 116, Rank: 1, loss = 0.3449546992778778
Epoch: 0, Step: 116, Rank: 6, loss = 1.1417230367660522
Epoch: 0, Step: 116, Rank: 4, loss = 0.8834042549133301
Per-token loss scaled by world size: 6.462022429332137e-05
Per-token loss scaled by world size: 4.7339886805275455e-05
Epoch: 0, Step: 116, Rank: 2, loss = 0.15044616162776947
Epoch: 0, Step: 116, Rank: 7, loss = 0.20536306500434875
Epoch 0: 96%|█████████▌| 116/121 [04:56<00:12, 2.55s/it]
{
"epoch": 0,
"step": 116,
"rank": 0,
"loss": 0.5245417952537537,
"overall_throughput": 42.05718076221283,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.434387683868408,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25424,
"batch_size": 77,
"total_loss": 0.6378897428512573,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:52:57.991057"
}
Per-token loss scaled by world size: 0.00043317166273482144
Per-token loss scaled by world size: 0.0003517000295687467
Per-token loss scaled by world size: 0.0003901177551597357
Per-token loss scaled by world size: 0.00024814700009301305
Per-token loss scaled by world size: 0.0001685179740888998
Per-token loss scaled by world size: 2.624317403387977e-06
Per-token loss scaled by world size: 2.476824556651991e-05
Epoch: 0, Step: 117, Rank: 3, loss = 1.0443732738494873
Epoch: 0, Step: 117, Rank: 5, loss = 1.2863032817840576
Epoch: 0, Step: 117, Rank: 7, loss = 0.7368724942207336
Epoch: 0, Step: 117, Rank: 4, loss = 1.1584546566009521
Epoch: 0, Step: 117, Rank: 0, loss = 0.007792910560965538
Epoch: 0, Step: 117, Rank: 2, loss = 0.5004141330718994
Epoch: 0, Step: 117, Rank: 1, loss = 0.0735493078827858
Per-token loss scaled by world size: 0.00034316719393245876
Epoch: 0, Step: 117, Rank: 6, loss = 1.0190349817276
Epoch 0: 97%|█████████▋| 117/121 [04:58<00:10, 2.54s/it]
{
"epoch": 0,
"step": 117,
"rank": 0,
"loss": 0.007792910560965538,
"overall_throughput": 42.14977654824287,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.35035228729248,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23756,
"batch_size": 76,
"total_loss": 0.7283493876457214,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:00.498739"
}
Per-token loss scaled by world size: 0.00027391428011469543
Per-token loss scaled by world size: 0.00038771191611886024
Per-token loss scaled by world size: 0.0005671838880516589
Per-token loss scaled by world size: 6.881056378915673e-06
Per-token loss scaled by world size: 4.1250186768593267e-05
Per-token loss scaled by world size: 7.28774830349721e-05
Per-token loss scaled by world size: 0.00023806751414667815
Epoch: 0, Step: 118, Rank: 5, loss = 1.42512047290802
Epoch: 0, Step: 118, Rank: 0, loss = 0.01728951372206211
Epoch: 0, Step: 118, Rank: 3, loss = 0.9741746783256531
Epoch: 0, Step: 118, Rank: 4, loss = 0.6882438659667969
Epoch: 0, Step: 118, Rank: 2, loss = 0.18311378359794617
Epoch: 0, Step: 118, Rank: 1, loss = 0.10364624857902527
Epoch: 0, Step: 118, Rank: 7, loss = 0.5981743931770325
Per-token loss scaled by world size: 0.0005670187529176474
Epoch: 0, Step: 118, Rank: 6, loss = 1.4247055053710938
Epoch 0: 98%|█████████▊| 118/121 [05:01<00:07, 2.53s/it]
{
"epoch": 0,
"step": 118,
"rank": 0,
"loss": 0.01728951372206211,
"overall_throughput": 42.35613860527613,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.3255033493042,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20101,
"batch_size": 64,
"total_loss": 0.6768085956573486,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:03.034819"
}
Per-token loss scaled by world size: 0.0003004461759701371
Per-token loss scaled by world size: 0.0002987224142998457
Per-token loss scaled by world size: 0.0002618007711134851
Per-token loss scaled by world size: 0.0003349074686411768
Per-token loss scaled by world size: 3.861719960696064e-06
Per-token loss scaled by world size: 0.0002586914342828095
Per-token loss scaled by world size: 0.00010813030530698597
Epoch: 0, Step: 119, Rank: 0, loss = 0.012113732285797596
Epoch: 0, Step: 119, Rank: 2, loss = 0.821236252784729
Epoch: 0, Step: 119, Rank: 6, loss = 0.9424620866775513
Epoch: 0, Step: 119, Rank: 4, loss = 0.9370548725128174
Epoch: 0, Step: 119, Rank: 5, loss = 1.050562858581543
Epoch: 0, Step: 119, Rank: 7, loss = 0.8114826679229736
Epoch: 0, Step: 119, Rank: 1, loss = 0.3391912579536438
Per-token loss scaled by world size: 0.00019692791101988405
Epoch: 0, Step: 119, Rank: 3, loss = 0.6177382469177246
Epoch 0: 98%|█████████▊| 119/121 [05:03<00:05, 2.52s/it]
{
"epoch": 0,
"step": 119,
"rank": 0,
"loss": 0.012113732285797596,
"overall_throughput": 41.99486327647021,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.461923599243164,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25095,
"batch_size": 73,
"total_loss": 0.691480278968811,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:05.516499"
}
Per-token loss scaled by world size: 0.00044983928091824055
Per-token loss scaled by world size: 0.00034963246434926987
Per-token loss scaled by world size: 0.0002892126503866166
Per-token loss scaled by world size: 0.0003644507669378072
Per-token loss scaled by world size: 0.0004460133204702288
Per-token loss scaled by world size: 3.170213403791422e-06
Per-token loss scaled by world size: 2.0736099486384774e-06
Epoch: 0, Step: 120, Rank: 5, loss = 0.8975055813789368
Epoch: 0, Step: 120, Rank: 6, loss = 1.0983635187149048
Epoch: 0, Step: 120, Rank: 0, loss = 0.007807046640664339
Epoch: 0, Step: 120, Rank: 4, loss = 0.861013650894165
Epoch: 0, Step: 120, Rank: 2, loss = 0.7122223377227783
Epoch: 0, Step: 120, Rank: 3, loss = 1.1077854633331299
Epoch: 0, Step: 120, Rank: 1, loss = 0.005106523633003235
Per-token loss scaled by world size: 0.0004087797424290329
Epoch: 0, Step: 120, Rank: 7, loss = 1.0066711902618408
Epoch 0: 99%|█████████▉| 120/121 [05:06<00:02, 2.48s/it]
{
"epoch": 0,
"step": 120,
"rank": 0,
"loss": 0.007807046640664339,
"overall_throughput": 44.47716496365875,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.361114025115967,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19701,
"batch_size": 75,
"total_loss": 0.7120593786239624,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:07.898377"
}
Per-token loss scaled by world size: 0.00019174267072230577
Per-token loss scaled by world size: 0.0003345832519698888
Per-token loss scaled by world size: 0.0003169576812069863
Per-token loss scaled by world size: 0.0004366403736639768
Per-token loss scaled by world size: 0.00023642051382921636
Per-token loss scaled by world size: 1.6517016774741933e-05
Epoch: 0, Step: 121, Rank: 5, loss = 0.9575772881507874
Epoch: 0, Step: 121, Rank: 1, loss = 0.5487675070762634
Epoch: 0, Step: 121, Rank: 4, loss = 1.2496647834777832
Epoch: 0, Step: 121, Rank: 3, loss = 0.9071329236030579
Epoch: 0, Step: 121, Rank: 7, loss = 0.6766355037689209
Per-token loss scaled by world size: 0.00013164509437046945
Epoch: 0, Step: 121, Rank: 0, loss = 0.04727170243859291
Epoch: 0, Step: 121, Rank: 2, loss = 0.3767682611942291
Per-token loss scaled by world size: 0.0003866745682898909
Epoch: 0, Step: 121, Rank: 6, loss = 1.106662631034851
Epoch 0: 100%|██████████| 121/121 [05:08<00:00, 2.50s/it]
{
"epoch": 0,
"step": 121,
"rank": 0,
"loss": 0.04727170243859291,
"overall_throughput": 41.60628512913175,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.30147409439087,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22896,
"batch_size": 102,
"total_loss": 0.7338100075721741,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:10.441395"
}
Saving model in huggingface format at samples_seen: 12688
Model saved in /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/hf_format/samples_12688
[20:53:29] INFO saving took 18.56430721282959 seconds utils.py:611
Epoch 0: 100%|██████████| 121/121 [05:27<00:00, 2.70s/it]
total tokens: 6858 num samples: 3 num padding tokens: 1697 - rank: 1 max len: 2286 min len: 1282 avg len: 1720.3333333333333 num_loss_counted_tokens: 1935
total tokens: 7868 num samples: 7 num padding tokens: 1154 - rank: 3 max len: 1124 min len: 848 avg len: 959.1428571428571 num_loss_counted_tokens: 3674
total tokens: 6932 num samples: 4 num padding tokens: 207 - rank: 1 max len: 1733 min len: 1622 avg len: 1681.25 num_loss_counted_tokens: 1829
total tokens: 7551 num samples: 3 num padding tokens: 1038 - rank: 1 max len: 2517 min len: 1773 avg len: 2171.0 num_loss_counted_tokens: 344
total tokens: 7515 num samples: 3 num padding tokens: 578 - rank: 1 max len: 2505 min len: 2066 avg len: 2312.3333333333335 num_loss_counted_tokens: 990
total tokens: 7928 num samples: 4 num padding tokens: 1119 - rank: 1 max len: 1982 min len: 1503 avg len: 1702.25 num_loss_counted_tokens: 604
total tokens: 6279 num samples: 3 num padding tokens: 554 - rank: 1 max len: 2093 min len: 1575 avg len: 1908.3333333333333 num_loss_counted_tokens: 1272
total tokens: 6693 num samples: 3 num padding tokens: 401 - rank: 1 max len: 2231 min len: 1907 avg len: 2097.3333333333335 num_loss_counted_tokens: 402
total tokens: 6963 num samples: 3 num padding tokens: 692 - rank: 1 max len: 2321 min len: 1787 avg len: 2090.3333333333335 num_loss_counted_tokens: 386
total tokens: 6988 num samples: 4 num padding tokens: 1322 - rank: 1 max len: 1747 min len: 1233 avg len: 1416.5 num_loss_counted_tokens: 2279
total tokens: 6990 num samples: 5 num padding tokens: 1059 - rank: 2 max len: 1398 min len: 1070 avg len: 1186.2 num_loss_counted_tokens: 4472
total tokens: 7260 num samples: 6 num padding tokens: 870 - rank: 2 max len: 1210 min len: 983 avg len: 1065.0 num_loss_counted_tokens: 3195
total tokens: 7796 num samples: 4 num padding tokens: 966 - rank: 1 max len: 1949 min len: 1423 avg len: 1707.5 num_loss_counted_tokens: 4297
total tokens: 7130 num samples: 5 num padding tokens: 914 - rank: 2 max len: 1426 min len: 1115 avg len: 1243.2 num_loss_counted_tokens: 4551
total tokens: 6874 num samples: 2 num padding tokens: 795 - rank: 1 max len: 3437 min len: 2642 avg len: 3039.5 num_loss_counted_tokens: 158
total tokens: 8064 num samples: 12 num padding tokens: 1509 - rank: 5 max len: 672 min len: 468 avg len: 546.25 num_loss_counted_tokens: 3627
total tokens: 7714 num samples: 7 num padding tokens: 1215 - rank: 3 max len: 1102 min len: 813 avg len: 928.4285714285714 num_loss_counted_tokens: 4827
total tokens: 7722 num samples: 13 num padding tokens: 1136 - rank: 5 max len: 594 min len: 399 avg len: 506.61538461538464 num_loss_counted_tokens: 4746
total tokens: 7575 num samples: 5 num padding tokens: 945 - rank: 2 max len: 1515 min len: 1189 avg len: 1326.0 num_loss_counted_tokens: 2489
total tokens: 7609 num samples: 7 num padding tokens: 1227 - rank: 3 max len: 1087 min len: 823 avg len: 911.7142857142857 num_loss_counted_tokens: 1419
total tokens: 7942 num samples: 11 num padding tokens: 1142 - rank: 5 max len: 722 min len: 517 avg len: 618.1818181818181 num_loss_counted_tokens: 4497
total tokens: 8108 num samples: 4 num padding tokens: 1070 - rank: 2 max len: 2027 min len: 1386 avg len: 1759.5 num_loss_counted_tokens: 2087
total tokens: 6940 num samples: 5 num padding tokens: 950 - rank: 2 max len: 1388 min len: 1065 avg len: 1198.0 num_loss_counted_tokens: 3642
total tokens: 7680 num samples: 5 num padding tokens: 844 - rank: 2 max len: 1536 min len: 1208 avg len: 1367.2 num_loss_counted_tokens: 3999
total tokens: 7700 num samples: 14 num padding tokens: 1915 - rank: 5 max len: 550 min len: 316 avg len: 413.2142857142857 num_loss_counted_tokens: 3693
total tokens: 7966 num samples: 7 num padding tokens: 799 - rank: 3 max len: 1138 min len: 946 avg len: 1023.8571428571429 num_loss_counted_tokens: 5133
total tokens: 7566 num samples: 13 num padding tokens: 1066 - rank: 5 max len: 582 min len: 447 avg len: 500.0 num_loss_counted_tokens: 4009
total tokens: 7344 num samples: 4 num padding tokens: 439 - rank: 2 max len: 1836 min len: 1616 avg len: 1726.25 num_loss_counted_tokens: 888
total tokens: 7696 num samples: 8 num padding tokens: 1050 - rank: 3 max len: 962 min len: 715 avg len: 830.75 num_loss_counted_tokens: 5818
total tokens: 8076 num samples: 12 num padding tokens: 901 - rank: 5 max len: 673 min len: 477 avg len: 597.9166666666666 num_loss_counted_tokens: 3764
total tokens: 7953 num samples: 11 num padding tokens: 496 - rank: 5 max len: 723 min len: 589 avg len: 677.9090909090909 num_loss_counted_tokens: 4581
total tokens: 7320 num samples: 4 num padding tokens: 796 - rank: 1 max len: 1830 min len: 1451 avg len: 1631.0 num_loss_counted_tokens: 2250
total tokens: 7616 num samples: 8 num padding tokens: 348 - rank: 3 max len: 952 min len: 870 avg len: 908.5 num_loss_counted_tokens: 5221
total tokens: 7987 num samples: 7 num padding tokens: 817 - rank: 3 max len: 1141 min len: 951 avg len: 1024.2857142857142 num_loss_counted_tokens: 3832
total tokens: 7826 num samples: 7 num padding tokens: 953 - rank: 3 max len: 1118 min len: 839 avg len: 981.8571428571429 num_loss_counted_tokens: 4081
total tokens: 7764 num samples: 12 num padding tokens: 1322 - rank: 5 max len: 647 min len: 464 avg len: 536.8333333333334 num_loss_counted_tokens: 3693
total tokens: 7224 num samples: 8 num padding tokens: 547 - rank: 3 max len: 903 min len: 776 avg len: 834.625 num_loss_counted_tokens: 3728
total tokens: 7896 num samples: 3 num padding tokens: 906 - rank: 1 max len: 2632 min len: 1930 avg len: 2330.0 num_loss_counted_tokens: 1083
total tokens: 7769 num samples: 17 num padding tokens: 1525 - rank: 6 max len: 457 min len: 273 avg len: 367.29411764705884 num_loss_counted_tokens: 3637
total tokens: 7816 num samples: 4 num padding tokens: 1243 - rank: 1 max len: 1954 min len: 1404 avg len: 1643.25 num_loss_counted_tokens: 2194
total tokens: 7806 num samples: 6 num padding tokens: 497 - rank: 2 max len: 1301 min len: 1097 avg len: 1218.1666666666667 num_loss_counted_tokens: 3744
total tokens: 6950 num samples: 5 num padding tokens: 1216 - rank: 3 max len: 1390 min len: 974 avg len: 1146.8 num_loss_counted_tokens: 3368
total tokens: 7455 num samples: 3 num padding tokens: 849 - rank: 2 max len: 2485 min len: 1841 avg len: 2202.0 num_loss_counted_tokens: 231
total tokens: 7536 num samples: 6 num padding tokens: 917 - rank: 2 max len: 1256 min len: 968 avg len: 1103.1666666666667 num_loss_counted_tokens: 3262
total tokens: 7806 num samples: 6 num padding tokens: 1329 - rank: 2 max len: 1301 min len: 978 avg len: 1079.5 num_loss_counted_tokens: 3537
total tokens: 7709 num samples: 13 num padding tokens: 1295 - rank: 5 max len: 593 min len: 373 avg len: 493.38461538461536 num_loss_counted_tokens: 4385
total tokens: 7220 num samples: 5 num padding tokens: 746 - rank: 1 max len: 1444 min len: 1177 avg len: 1294.8 num_loss_counted_tokens: 3357
total tokens: 6820 num samples: 4 num padding tokens: 1422 - rank: 3 max len: 1705 min len: 1072 avg len: 1349.5 num_loss_counted_tokens: 1716
total tokens: 8019 num samples: 9 num padding tokens: 450 - rank: 3 max len: 891 min len: 790 avg len: 841.0 num_loss_counted_tokens: 4323
total tokens: 7668 num samples: 4 num padding tokens: 969 - rank: 2 max len: 1917 min len: 1447 avg len: 1674.75 num_loss_counted_tokens: 1838
total tokens: 7693 num samples: 7 num padding tokens: 403 - rank: 2 max len: 1099 min len: 978 avg len: 1041.4285714285713 num_loss_counted_tokens: 4254
total tokens: 7525 num samples: 7 num padding tokens: 708 - rank: 3 max len: 1075 min len: 867 avg len: 973.8571428571429 num_loss_counted_tokens: 4837
total tokens: 7740 num samples: 5 num padding tokens: 1159 - rank: 1 max len: 1548 min len: 1219 avg len: 1316.2 num_loss_counted_tokens: 2345
total tokens: 7843 num samples: 11 num padding tokens: 990 - rank: 5 max len: 713 min len: 541 avg len: 623.0 num_loss_counted_tokens: 5313
total tokens: 5614 num samples: 2 num padding tokens: 423 - rank: 0 max len: 2807 min len: 2384 avg len: 2595.5 num_loss_counted_tokens: 193
total tokens: 8037 num samples: 3 num padding tokens: 1697 - rank: 1 max len: 2679 min len: 1665 avg len: 2113.3333333333335 num_loss_counted_tokens: 1007
total tokens: 8022 num samples: 14 num padding tokens: 1252 - rank: 5 max len: 573 min len: 409 avg len: 483.57142857142856 num_loss_counted_tokens: 4146
total tokens: 7945 num samples: 7 num padding tokens: 727 - rank: 2 max len: 1135 min len: 918 avg len: 1031.142857142857 num_loss_counted_tokens: 4527
total tokens: 7891 num samples: 13 num padding tokens: 1039 - rank: 5 max len: 607 min len: 439 avg len: 527.0769230769231 num_loss_counted_tokens: 4697
total tokens: 7472 num samples: 2 num padding tokens: 1064 - rank: 0 max len: 3736 min len: 2672 avg len: 3204.0 num_loss_counted_tokens: 186
total tokens: 6628 num samples: 2 num padding tokens: 541 - rank: 0 max len: 3314 min len: 2773 avg len: 3043.5 num_loss_counted_tokens: 178
total tokens: 7560 num samples: 10 num padding tokens: 554 - rank: 5 max len: 756 min len: 656 avg len: 700.6 num_loss_counted_tokens: 4201
total tokens: 8021 num samples: 13 num padding tokens: 974 - rank: 5 max len: 617 min len: 468 avg len: 542.0769230769231 num_loss_counted_tokens: 4392
total tokens: 4062 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4062 min len: 4062 avg len: 4062.0 num_loss_counted_tokens: 85
total tokens: 7650 num samples: 9 num padding tokens: 478 - rank: 3 max len: 850 min len: 751 avg len: 796.8888888888889 num_loss_counted_tokens: 4488
total tokens: 8041 num samples: 11 num padding tokens: 905 - rank: 5 max len: 731 min len: 567 avg len: 648.7272727272727 num_loss_counted_tokens: 4931
total tokens: 6344 num samples: 2 num padding tokens: 190 - rank: 0 max len: 3172 min len: 2982 avg len: 3077.0 num_loss_counted_tokens: 709
total tokens: 6178 num samples: 2 num padding tokens: 302 - rank: 0 max len: 3089 min len: 2787 avg len: 2938.0 num_loss_counted_tokens: 161
total tokens: 7370 num samples: 5 num padding tokens: 663 - rank: 2 max len: 1474 min len: 1191 avg len: 1341.4 num_loss_counted_tokens: 6093
total tokens: 7680 num samples: 8 num padding tokens: 596 - rank: 3 max len: 960 min len: 833 avg len: 885.5 num_loss_counted_tokens: 5258
total tokens: 7776 num samples: 12 num padding tokens: 1190 - rank: 5 max len: 648 min len: 476 avg len: 548.8333333333334 num_loss_counted_tokens: 4105
total tokens: 7803 num samples: 17 num padding tokens: 1519 - rank: 6 max len: 459 min len: 281 avg len: 369.6470588235294 num_loss_counted_tokens: 4044
total tokens: 7980 num samples: 19 num padding tokens: 2139 - rank: 6 max len: 420 min len: 236 avg len: 307.42105263157896 num_loss_counted_tokens: 3018
total tokens: 8016 num samples: 16 num padding tokens: 1816 - rank: 6 max len: 501 min len: 300 avg len: 387.5 num_loss_counted_tokens: 3555
total tokens: 8064 num samples: 18 num padding tokens: 1914 - rank: 6 max len: 448 min len: 254 avg len: 341.6666666666667 num_loss_counted_tokens: 3451
total tokens: 6752 num samples: 2 num padding tokens: 43 - rank: 0 max len: 3376 min len: 3333 avg len: 3354.5 num_loss_counted_tokens: 441
total tokens: 8010 num samples: 15 num padding tokens: 2334 - rank: 6 max len: 534 min len: 273 avg len: 378.4 num_loss_counted_tokens: 3218
total tokens: 7800 num samples: 20 num padding tokens: 1809 - rank: 6 max len: 390 min len: 229 avg len: 299.55 num_loss_counted_tokens: 3534
total tokens: 8112 num samples: 26 num padding tokens: 2416 - rank: 6 max len: 312 min len: 150 avg len: 219.07692307692307 num_loss_counted_tokens: 2759
total tokens: 7189 num samples: 7 num padding tokens: 616 - rank: 3 max len: 1027 min len: 850 avg len: 939.0 num_loss_counted_tokens: 4467
total tokens: 5632 num samples: 2 num padding tokens: 674 - rank: 0 max len: 2816 min len: 2142 avg len: 2479.0 num_loss_counted_tokens: 167
total tokens: 7192 num samples: 2 num padding tokens: 1145 - rank: 0 max len: 3596 min len: 2451 avg len: 3023.5 num_loss_counted_tokens: 187
total tokens: 5518 num samples: 2 num padding tokens: 53 - rank: 0 max len: 2759 min len: 2706 avg len: 2732.5 num_loss_counted_tokens: 276
total tokens: 6324 num samples: 2 num padding tokens: 282 - rank: 0 max len: 3162 min len: 2880 avg len: 3021.0 num_loss_counted_tokens: 179
total tokens: 8109 num samples: 17 num padding tokens: 1877 - rank: 6 max len: 477 min len: 301 avg len: 366.5882352941176 num_loss_counted_tokens: 3402
total tokens: 7284 num samples: 3 num padding tokens: 841 - rank: 0 max len: 2428 min len: 1996 avg len: 2147.6666666666665 num_loss_counted_tokens: 2016
total tokens: 6622 num samples: 2 num padding tokens: 51 - rank: 0 max len: 3311 min len: 3260 avg len: 3285.5 num_loss_counted_tokens: 220
total tokens: 7606 num samples: 2 num padding tokens: 284 - rank: 0 max len: 3803 min len: 3519 avg len: 3661.0 num_loss_counted_tokens: 239
total tokens: 8021 num samples: 13 num padding tokens: 1426 - rank: 5 max len: 617 min len: 421 avg len: 507.3076923076923 num_loss_counted_tokens: 4445
total tokens: 7548 num samples: 12 num padding tokens: 1418 - rank: 6 max len: 629 min len: 356 avg len: 510.8333333333333 num_loss_counted_tokens: 4627
total tokens: 7623 num samples: 9 num padding tokens: 977 - rank: 4 max len: 847 min len: 676 avg len: 738.4444444444445 num_loss_counted_tokens: 4362
total tokens: 920 num samples: 8 num padding tokens: 143 - rank: 7 max len: 115 min len: 80 avg len: 97.125 num_loss_counted_tokens: 174
total tokens: 7865 num samples: 11 num padding tokens: 678 - rank: 4 max len: 715 min len: 597 avg len: 653.3636363636364 num_loss_counted_tokens: 4917
total tokens: 7890 num samples: 30 num padding tokens: 2636 - rank: 7 max len: 263 min len: 81 avg len: 175.13333333333333 num_loss_counted_tokens: 2257
total tokens: 8060 num samples: 20 num padding tokens: 1821 - rank: 6 max len: 403 min len: 229 avg len: 311.95 num_loss_counted_tokens: 3851
total tokens: 7644 num samples: 14 num padding tokens: 1953 - rank: 6 max len: 546 min len: 313 avg len: 406.5 num_loss_counted_tokens: 3666
total tokens: 7568 num samples: 8 num padding tokens: 610 - rank: 3 max len: 946 min len: 809 avg len: 869.75 num_loss_counted_tokens: 4321
total tokens: 7923 num samples: 19 num padding tokens: 1491 - rank: 6 max len: 417 min len: 251 avg len: 338.5263157894737 num_loss_counted_tokens: 3342
total tokens: 7767 num samples: 9 num padding tokens: 568 - rank: 4 max len: 863 min len: 734 avg len: 799.8888888888889 num_loss_counted_tokens: 4720
total tokens: 6913 num samples: 31 num padding tokens: 1814 - rank: 7 max len: 223 min len: 81 avg len: 164.48387096774192 num_loss_counted_tokens: 2029
total tokens: 8041 num samples: 17 num padding tokens: 2270 - rank: 6 max len: 473 min len: 266 avg len: 339.47058823529414 num_loss_counted_tokens: 3062
total tokens: 7710 num samples: 3 num padding tokens: 1745 - rank: 0 max len: 2570 min len: 1545 avg len: 1988.3333333333333 num_loss_counted_tokens: 932
total tokens: 8109 num samples: 9 num padding tokens: 1178 - rank: 4 max len: 901 min len: 683 avg len: 770.1111111111111 num_loss_counted_tokens: 4382
total tokens: 7461 num samples: 9 num padding tokens: 1078 - rank: 4 max len: 829 min len: 583 avg len: 709.2222222222222 num_loss_counted_tokens: 4008
total tokens: 8028 num samples: 18 num padding tokens: 1647 - rank: 6 max len: 446 min len: 269 avg len: 354.5 num_loss_counted_tokens: 3480
total tokens: 8010 num samples: 10 num padding tokens: 847 - rank: 4 max len: 801 min len: 646 avg len: 716.3 num_loss_counted_tokens: 5137
total tokens: 7812 num samples: 28 num padding tokens: 3453 - rank: 7 max len: 279 min len: 77 avg len: 155.67857142857142 num_loss_counted_tokens: 1761
total tokens: 7791 num samples: 21 num padding tokens: 1808 - rank: 6 max len: 371 min len: 232 avg len: 284.9047619047619 num_loss_counted_tokens: 2895
total tokens: 6410 num samples: 2 num padding tokens: 76 - rank: 0 max len: 3205 min len: 3129 avg len: 3167.0 num_loss_counted_tokens: 169
total tokens: 7540 num samples: 26 num padding tokens: 3128 - rank: 7 max len: 290 min len: 81 avg len: 169.69230769230768 num_loss_counted_tokens: 1955
total tokens: 7980 num samples: 35 num padding tokens: 2701 - rank: 7 max len: 228 min len: 76 avg len: 150.82857142857142 num_loss_counted_tokens: 1955
total tokens: 7116 num samples: 6 num padding tokens: 825 - rank: 2 max len: 1186 min len: 937 avg len: 1048.5 num_loss_counted_tokens: 3033
total tokens: 7714 num samples: 19 num padding tokens: 1776 - rank: 6 max len: 406 min len: 218 avg len: 312.5263157894737 num_loss_counted_tokens: 3452
total tokens: 6944 num samples: 28 num padding tokens: 2075 - rank: 7 max len: 248 min len: 85 avg len: 173.89285714285714 num_loss_counted_tokens: 2194
total tokens: 6546 num samples: 3 num padding tokens: 491 - rank: 0 max len: 2182 min len: 1721 avg len: 2018.3333333333333 num_loss_counted_tokens: 1774
total tokens: 8090 num samples: 10 num padding tokens: 775 - rank: 4 max len: 809 min len: 668 avg len: 731.5 num_loss_counted_tokens: 4241
total tokens: 6475 num samples: 25 num padding tokens: 2566 - rank: 7 max len: 259 min len: 80 avg len: 156.36 num_loss_counted_tokens: 1562
total tokens: 7461 num samples: 9 num padding tokens: 615 - rank: 4 max len: 829 min len: 730 avg len: 760.6666666666666 num_loss_counted_tokens: 5504
total tokens: 6972 num samples: 28 num padding tokens: 2290 - rank: 7 max len: 249 min len: 78 avg len: 167.21428571428572 num_loss_counted_tokens: 2098
total tokens: 7434 num samples: 9 num padding tokens: 997 - rank: 4 max len: 826 min len: 634 avg len: 715.2222222222222 num_loss_counted_tokens: 4684
total tokens: 7830 num samples: 9 num padding tokens: 687 - rank: 4 max len: 870 min len: 732 avg len: 793.6666666666666 num_loss_counted_tokens: 4598
total tokens: 7868 num samples: 28 num padding tokens: 3286 - rank: 7 max len: 281 min len: 71 avg len: 163.64285714285714 num_loss_counted_tokens: 1693
total tokens: 7700 num samples: 10 num padding tokens: 670 - rank: 4 max len: 770 min len: 672 avg len: 703.0 num_loss_counted_tokens: 3844
total tokens: 8019 num samples: 11 num padding tokens: 675 - rank: 4 max len: 729 min len: 607 avg len: 667.6363636363636 num_loss_counted_tokens: 5829
total tokens: 7824 num samples: 8 num padding tokens: 1092 - rank: 4 max len: 978 min len: 762 avg len: 841.5 num_loss_counted_tokens: 3700
total tokens: 7890 num samples: 30 num padding tokens: 2695 - rank: 7 max len: 263 min len: 77 avg len: 173.16666666666666 num_loss_counted_tokens: 2423
total tokens: 7668 num samples: 9 num padding tokens: 739 - rank: 4 max len: 852 min len: 673 avg len: 769.8888888888889 num_loss_counted_tokens: 3730
total tokens: 7514 num samples: 26 num padding tokens: 3137 - rank: 7 max len: 289 min len: 81 avg len: 168.34615384615384 num_loss_counted_tokens: 1545
total tokens: 7336 num samples: 28 num padding tokens: 3103 - rank: 7 max len: 262 min len: 76 avg len: 151.17857142857142 num_loss_counted_tokens: 1734
total tokens: 7774 num samples: 23 num padding tokens: 3246 - rank: 7 max len: 338 min len: 79 avg len: 196.8695652173913 num_loss_counted_tokens: 2197
total tokens: 8050 num samples: 35 num padding tokens: 2271 - rank: 7 max len: 230 min len: 71 avg len: 165.11428571428573 num_loss_counted_tokens: 2692
total tokens: 7960 num samples: 10 num padding tokens: 1262 - rank: 4 max len: 796 min len: 576 avg len: 669.8 num_loss_counted_tokens: 5376
total tokens: 7576 num samples: 8 num padding tokens: 1064 - rank: 4 max len: 947 min len: 736 avg len: 814.0 num_loss_counted_tokens: 2965
total tokens: 4393 num samples: 23 num padding tokens: 1496 - rank: 7 max len: 191 min len: 75 avg len: 125.95652173913044 num_loss_counted_tokens: 1010
total tokens: 7945 num samples: 35 num padding tokens: 2723 - rank: 7 max len: 227 min len: 78 avg len: 149.2 num_loss_counted_tokens: 2193
total tokens: 7760 num samples: 10 num padding tokens: 746 - rank: 4 max len: 776 min len: 627 avg len: 701.4 num_loss_counted_tokens: 4720
Per-token loss scaled by world size: 0.0004431476700119674
Per-token loss scaled by world size: 0.0004245223826728761
Per-token loss scaled by world size: 0.0004812271217815578
Per-token loss scaled by world size: 0.0004481318756006658
Per-token loss scaled by world size: 5.5284708651015535e-06
Per-token loss scaled by world size: 0.0003835844690911472
Epoch: 1, Step: 122, Rank: 0, loss = 0.014387845993041992
Per-token loss scaled by world size: 3.5970988392364234e-05
Epoch: 1, Step: 122, Rank: 5, loss = 1.1048195362091064
Epoch: 1, Step: 122, Rank: 6, loss = 1.2523936033248901
Epoch: 1, Step: 122, Rank: 3, loss = 0.9982785582542419
Epoch: 1, Step: 122, Rank: 4, loss = 1.1532918214797974
Epoch: 1, Step: 122, Rank: 7, loss = 1.166263222694397
Epoch: 1, Step: 122, Rank: 1, loss = 0.09361449629068375
Per-token loss scaled by world size: 0.00016121947555802763
Epoch: 1, Step: 122, Rank: 2, loss = 0.41957369446754456
total tokens: 7389 num samples: 9 num padding tokens: 463 - rank: 4 max len: 821 min len: 731 avg len: 769.5555555555555 num_loss_counted_tokens: 4484
total tokens: 7920 num samples: 4 num padding tokens: 1041 - rank: 1 max len: 1980 min len: 1501 avg len: 1719.75 num_loss_counted_tokens: 3030
total tokens: 7557 num samples: 11 num padding tokens: 1175 - rank: 5 max len: 687 min len: 471 avg len: 580.1818181818181 num_loss_counted_tokens: 3537
{
"epoch": 1,
"step": 122,
"rank": 0,
"loss": 0.014387845993041992,
"overall_throughput": 41.10590635951548,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.46029806137085,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20820,
"batch_size": 84,
"total_loss": 0.7753278613090515,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:32.058777"
}
total tokens: 6208 num samples: 2 num padding tokens: 1070 - rank: 0 max len: 3104 min len: 2034 avg len: 2569.0 num_loss_counted_tokens: 154
total tokens: 7758 num samples: 6 num padding tokens: 429 - rank: 2 max len: 1293 min len: 1127 avg len: 1221.5 num_loss_counted_tokens: 4122
total tokens: 7641 num samples: 27 num padding tokens: 2718 - rank: 7 max len: 283 min len: 90 avg len: 182.33333333333334 num_loss_counted_tokens: 2178
total tokens: 7786 num samples: 17 num padding tokens: 1303 - rank: 6 max len: 458 min len: 306 avg len: 381.3529411764706 num_loss_counted_tokens: 3550
total tokens: 7714 num samples: 7 num padding tokens: 1069 - rank: 3 max len: 1102 min len: 863 avg len: 949.2857142857143 num_loss_counted_tokens: 3857
Per-token loss scaled by world size: 0.00028338973061181605
Per-token loss scaled by world size: 0.00028131416183896363
Per-token loss scaled by world size: 0.00031643020338378847
Per-token loss scaled by world size: 2.5795485271373764e-05
Per-token loss scaled by world size: 0.0003353776701260358
Per-token loss scaled by world size: 3.061828238060116e-06
Epoch: 1, Step: 123, Rank: 2, loss = 0.9252774715423584
Epoch: 1, Step: 123, Rank: 6, loss = 0.9321042895317078
Epoch: 1, Step: 123, Rank: 4, loss = 1.0407785177230835
Per-token loss scaled by world size: 0.00020596390822902322
Epoch: 1, Step: 123, Rank: 1, loss = 0.08484457433223724
Epoch: 1, Step: 123, Rank: 0, loss = 0.010070735588669777
Epoch: 1, Step: 123, Rank: 5, loss = 1.1030991077423096
Per-token loss scaled by world size: 0.00040186592377722263
Epoch: 1, Step: 123, Rank: 3, loss = 1.3217872381210327
Epoch: 1, Step: 123, Rank: 7, loss = 0.6774410605430603
total tokens: 7158 num samples: 3 num padding tokens: 1822 - rank: 1 max len: 2386 min len: 1371 avg len: 1778.6666666666667 num_loss_counted_tokens: 859
total tokens: 8090 num samples: 10 num padding tokens: 658 - rank: 4 max len: 809 min len: 681 avg len: 743.2 num_loss_counted_tokens: 4692
{
"epoch": 1,
"step": 123,
"rank": 0,
"loss": 0.010070735588669777,
"overall_throughput": 41.52344953508732,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.238781452178955,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26313,
"batch_size": 94,
"total_loss": 0.7619253396987915,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:34.606466"
}
total tokens: 7752 num samples: 8 num padding tokens: 561 - rank: 3 max len: 969 min len: 810 avg len: 898.875 num_loss_counted_tokens: 4165
total tokens: 8109 num samples: 17 num padding tokens: 1626 - rank: 6 max len: 477 min len: 291 avg len: 381.3529411764706 num_loss_counted_tokens: 4491
total tokens: 7248 num samples: 2 num padding tokens: 588 - rank: 0 max len: 3624 min len: 3036 avg len: 3330.0 num_loss_counted_tokens: 185
total tokens: 7836 num samples: 6 num padding tokens: 921 - rank: 2 max len: 1306 min len: 997 avg len: 1152.5 num_loss_counted_tokens: 3095
total tokens: 7480 num samples: 11 num padding tokens: 811 - rank: 5 max len: 680 min len: 536 avg len: 606.2727272727273 num_loss_counted_tokens: 4943
total tokens: 7614 num samples: 27 num padding tokens: 2676 - rank: 7 max len: 282 min len: 88 avg len: 182.88888888888889 num_loss_counted_tokens: 2334
Per-token loss scaled by world size: 0.00031184396357275546
Per-token loss scaled by world size: 0.0001448883704142645
Per-token loss scaled by world size: 0.0002452973276376724
Per-token loss scaled by world size: 0.00019676386727951467
Per-token loss scaled by world size: 0.00030535017140209675
Per-token loss scaled by world size: 1.8834512047760654e-06
Per-token loss scaled by world size: 0.00019069209520239383
Epoch: 1, Step: 124, Rank: 6, loss = 0.9844914078712463
Epoch: 1, Step: 124, Rank: 1, loss = 0.45741257071495056
Epoch: 1, Step: 124, Rank: 3, loss = 0.9639905095100403
Epoch: 1, Step: 124, Rank: 4, loss = 0.7744036912918091
Epoch: 1, Step: 124, Rank: 7, loss = 0.6211835145950317
Epoch: 1, Step: 124, Rank: 0, loss = 0.005946055520325899
Epoch: 1, Step: 124, Rank: 2, loss = 0.60201495885849
Per-token loss scaled by world size: 0.00031234745983965695
Epoch: 1, Step: 124, Rank: 5, loss = 0.9860809445381165
total tokens: 7335 num samples: 3 num padding tokens: 518 - rank: 1 max len: 2445 min len: 1952 avg len: 2272.3333333333335 num_loss_counted_tokens: 357
total tokens: 7632 num samples: 8 num padding tokens: 1110 - rank: 4 max len: 954 min len: 772 avg len: 815.25 num_loss_counted_tokens: 4352
total tokens: 7540 num samples: 10 num padding tokens: 877 - rank: 5 max len: 754 min len: 579 avg len: 666.3 num_loss_counted_tokens: 4493
{
"epoch": 1,
"step": 124,
"rank": 0,
"loss": 0.005946055520325899,
"overall_throughput": 42.28005677176176,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.3601393699646,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25256,
"batch_size": 81,
"total_loss": 0.6744404435157776,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:37.110671"
}
total tokens: 7146 num samples: 6 num padding tokens: 707 - rank: 3 max len: 1191 min len: 954 avg len: 1073.1666666666667 num_loss_counted_tokens: 3282
total tokens: 6650 num samples: 2 num padding tokens: 396 - rank: 0 max len: 3325 min len: 2929 avg len: 3127.0 num_loss_counted_tokens: 164
total tokens: 7188 num samples: 4 num padding tokens: 897 - rank: 2 max len: 1797 min len: 1207 avg len: 1572.75 num_loss_counted_tokens: 3275
total tokens: 7749 num samples: 27 num padding tokens: 3523 - rank: 7 max len: 287 min len: 78 avg len: 156.5185185185185 num_loss_counted_tokens: 1794
total tokens: 7924 num samples: 14 num padding tokens: 2249 - rank: 6 max len: 566 min len: 291 avg len: 405.35714285714283 num_loss_counted_tokens: 3760
Per-token loss scaled by world size: 0.0005536731332540512
Per-token loss scaled by world size: 0.0003451017546467483
Per-token loss scaled by world size: 8.309840632136911e-05
Per-token loss scaled by world size: 0.00047730450751259923
Per-token loss scaled by world size: 0.0006338073872029781
Per-token loss scaled by world size: 3.222640589228831e-05
Per-token loss scaled by world size: 5.77377568333759e-06
Epoch: 1, Step: 125, Rank: 3, loss = 0.21387451887130737
Epoch: 1, Step: 125, Rank: 7, loss = 0.8882056474685669
Epoch: 1, Step: 125, Rank: 4, loss = 1.6312617063522339
Epoch: 1, Step: 125, Rank: 2, loss = 1.425016164779663
Epoch: 1, Step: 125, Rank: 1, loss = 0.014860255643725395
Epoch: 1, Step: 125, Rank: 5, loss = 1.2284624576568604
Epoch: 1, Step: 125, Rank: 0, loss = 0.08294271677732468
Per-token loss scaled by world size: 0.0003464070614427328
Epoch: 1, Step: 125, Rank: 6, loss = 0.8915651440620422
total tokens: 7860 num samples: 4 num padding tokens: 1157 - rank: 1 max len: 1965 min len: 1442 avg len: 1675.75 num_loss_counted_tokens: 559
total tokens: 7632 num samples: 9 num padding tokens: 954 - rank: 4 max len: 848 min len: 605 avg len: 742.0 num_loss_counted_tokens: 4296
{
"epoch": 1,
"step": 125,
"rank": 0,
"loss": 0.08294271677732468,
"overall_throughput": 42.30731411051347,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.3255033493042,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20590,
"batch_size": 94,
"total_loss": 0.7970236539840698,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:39.613939"
}
total tokens: 7974 num samples: 18 num padding tokens: 1582 - rank: 6 max len: 443 min len: 282 avg len: 355.1111111111111 num_loss_counted_tokens: 3319
total tokens: 8033 num samples: 29 num padding tokens: 2797 - rank: 7 max len: 277 min len: 75 avg len: 180.55172413793105 num_loss_counted_tokens: 2279
total tokens: 7904 num samples: 8 num padding tokens: 432 - rank: 3 max len: 988 min len: 857 avg len: 934.0 num_loss_counted_tokens: 3900
total tokens: 6880 num samples: 5 num padding tokens: 746 - rank: 2 max len: 1376 min len: 1087 avg len: 1226.8 num_loss_counted_tokens: 1611
total tokens: 7629 num samples: 3 num padding tokens: 752 - rank: 0 max len: 2543 min len: 1988 avg len: 2292.3333333333335 num_loss_counted_tokens: 470
total tokens: 7813 num samples: 13 num padding tokens: 1168 - rank: 5 max len: 601 min len: 444 avg len: 511.15384615384613 num_loss_counted_tokens: 3836
Per-token loss scaled by world size: 0.00016968029376585037
Per-token loss scaled by world size: 0.00043697707587853074
Per-token loss scaled by world size: 0.00022829265799373388
Per-token loss scaled by world size: 0.00044452777365222573
Per-token loss scaled by world size: 0.00034727680031210184
Per-token loss scaled by world size: 1.7178894040625892e-06
Per-token loss scaled by world size: 0.0002958408440463245
Epoch: 1, Step: 126, Rank: 5, loss = 1.2159979343414307
Epoch: 1, Step: 126, Rank: 2, loss = 0.6352813839912415
Epoch: 1, Step: 126, Rank: 6, loss = 1.2370096445083618
Epoch: 1, Step: 126, Rank: 1, loss = 0.4721778333187103
Epoch: 1, Step: 126, Rank: 4, loss = 0.9663845300674438
Epoch: 1, Step: 126, Rank: 0, loss = 0.004780456889420748
Epoch: 1, Step: 126, Rank: 7, loss = 0.8232511281967163
Per-token loss scaled by world size: 0.0001971422607311979
Epoch: 1, Step: 126, Rank: 3, loss = 0.5485976338386536
total tokens: 5944 num samples: 2 num padding tokens: 412 - rank: 1 max len: 2972 min len: 2560 avg len: 2766.0 num_loss_counted_tokens: 816
total tokens: 7911 num samples: 9 num padding tokens: 941 - rank: 4 max len: 879 min len: 688 avg len: 774.4444444444445 num_loss_counted_tokens: 5131
{
"epoch": 1,
"step": 126,
"rank": 0,
"loss": 0.004780456889420748,
"overall_throughput": 41.60576526421823,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.30566644668579,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22262,
"batch_size": 84,
"total_loss": 0.7379351258277893,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:42.158108"
}
total tokens: 8048 num samples: 16 num padding tokens: 2431 - rank: 6 max len: 503 min len: 247 avg len: 351.0625 num_loss_counted_tokens: 3335
total tokens: 6345 num samples: 27 num padding tokens: 1990 - rank: 7 max len: 235 min len: 78 avg len: 161.2962962962963 num_loss_counted_tokens: 1823
total tokens: 7467 num samples: 3 num padding tokens: 1752 - rank: 2 max len: 2489 min len: 1428 avg len: 1905.0 num_loss_counted_tokens: 538
total tokens: 7710 num samples: 6 num padding tokens: 1580 - rank: 3 max len: 1285 min len: 891 avg len: 1021.6666666666666 num_loss_counted_tokens: 3990
total tokens: 7513 num samples: 11 num padding tokens: 964 - rank: 5 max len: 683 min len: 534 avg len: 595.3636363636364 num_loss_counted_tokens: 4084
total tokens: 5974 num samples: 2 num padding tokens: 4 - rank: 0 max len: 2987 min len: 2983 avg len: 2985.0 num_loss_counted_tokens: 160
Per-token loss scaled by world size: 0.00019664443971123546
Per-token loss scaled by world size: 0.00033893511863425374
Per-token loss scaled by world size: 0.0002353027812205255
Per-token loss scaled by world size: 0.00019304313173051924
Per-token loss scaled by world size: 0.00022410067322198302
Per-token loss scaled by world size: 0.00028643259429372847
Per-token loss scaled by world size: 1.9502303985063918e-05
Epoch: 1, Step: 127, Rank: 5, loss = 1.0451911687850952
Epoch: 1, Step: 127, Rank: 3, loss = 0.6064022779464722
Epoch: 1, Step: 127, Rank: 4, loss = 0.5952967405319214
Epoch: 1, Step: 127, Rank: 1, loss = 0.7256149649620056
Epoch: 1, Step: 127, Rank: 2, loss = 0.6910704374313354
Epoch: 1, Step: 127, Rank: 0, loss = 0.060140229761600494
Epoch: 1, Step: 127, Rank: 7, loss = 0.8832864761352539
Per-token loss scaled by world size: 0.00036767972051166
Epoch: 1, Step: 127, Rank: 6, loss = 1.133832335472107
total tokens: 7504 num samples: 8 num padding tokens: 911 - rank: 4 max len: 938 min len: 741 avg len: 824.125 num_loss_counted_tokens: 5191
total tokens: 7206 num samples: 3 num padding tokens: 904 - rank: 1 max len: 2402 min len: 1894 avg len: 2100.6666666666665 num_loss_counted_tokens: 932
total tokens: 7887 num samples: 11 num padding tokens: 1542 - rank: 5 max len: 717 min len: 395 avg len: 576.8181818181819 num_loss_counted_tokens: 4690
{
"epoch": 1,
"step": 127,
"rank": 0,
"loss": 0.060140229761600494,
"overall_throughput": 42.34427644874209,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.374258518218994,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24670,
"batch_size": 85,
"total_loss": 0.7176043391227722,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:44.674459"
}
total tokens: 7096 num samples: 4 num padding tokens: 561 - rank: 2 max len: 1774 min len: 1529 avg len: 1633.75 num_loss_counted_tokens: 1104
total tokens: 7780 num samples: 20 num padding tokens: 1756 - rank: 6 max len: 389 min len: 236 avg len: 301.2 num_loss_counted_tokens: 2929
total tokens: 6770 num samples: 5 num padding tokens: 936 - rank: 3 max len: 1354 min len: 956 avg len: 1166.8 num_loss_counted_tokens: 3115
total tokens: 6110 num samples: 26 num padding tokens: 1710 - rank: 7 max len: 235 min len: 78 avg len: 169.23076923076923 num_loss_counted_tokens: 1981
total tokens: 5920 num samples: 2 num padding tokens: 555 - rank: 0 max len: 2960 min len: 2405 avg len: 2682.5 num_loss_counted_tokens: 172
Per-token loss scaled by world size: 0.0008036900544539094
Per-token loss scaled by world size: 0.0005513833602890372
Per-token loss scaled by world size: 0.0007896597380749881
Per-token loss scaled by world size: 9.589117689756677e-05
Per-token loss scaled by world size: 5.089726073492784e-06
Per-token loss scaled by world size: 1.1957185051869601e-05
Epoch: 1, Step: 128, Rank: 6, loss = 1.6164215803146362
Epoch: 1, Step: 128, Rank: 2, loss = 0.19286112487316132
Epoch: 1, Step: 128, Rank: 5, loss = 1.5882031917572021
Epoch: 1, Step: 128, Rank: 4, loss = 1.108969807624817
Per-token loss scaled by world size: 5.8463097957428545e-05
Epoch: 1, Step: 128, Rank: 1, loss = 0.010236711241304874
Epoch: 1, Step: 128, Rank: 0, loss = 0.02404888905584812
Per-token loss scaled by world size: 0.00048682422493584454
Epoch: 1, Step: 128, Rank: 3, loss = 0.9791252017021179
Epoch: 1, Step: 128, Rank: 7, loss = 0.11758390814065933
total tokens: 6630 num samples: 26 num padding tokens: 2393 - rank: 7 max len: 255 min len: 78 avg len: 162.96153846153845 num_loss_counted_tokens: 1758
total tokens: 7601 num samples: 11 num padding tokens: 831 - rank: 4 max len: 691 min len: 561 avg len: 615.4545454545455 num_loss_counted_tokens: 4864
total tokens: 7125 num samples: 5 num padding tokens: 1532 - rank: 1 max len: 1425 min len: 985 avg len: 1118.6 num_loss_counted_tokens: 3514
{
"epoch": 1,
"step": 128,
"rank": 0,
"loss": 0.02404888905584812,
"overall_throughput": 42.28494221885827,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.05379819869995,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 16090,
"batch_size": 72,
"total_loss": 0.7046812772750854,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:47.176000"
}
total tokens: 5476 num samples: 2 num padding tokens: 764 - rank: 0 max len: 2738 min len: 1974 avg len: 2356.0 num_loss_counted_tokens: 216
total tokens: 7714 num samples: 14 num padding tokens: 571 - rank: 5 max len: 551 min len: 429 avg len: 510.2142857142857 num_loss_counted_tokens: 5114
total tokens: 7731 num samples: 9 num padding tokens: 842 - rank: 3 max len: 859 min len: 715 avg len: 765.4444444444445 num_loss_counted_tokens: 4877
total tokens: 7923 num samples: 19 num padding tokens: 1796 - rank: 6 max len: 417 min len: 256 avg len: 322.4736842105263 num_loss_counted_tokens: 3232
total tokens: 7840 num samples: 8 num padding tokens: 324 - rank: 2 max len: 980 min len: 894 avg len: 939.5 num_loss_counted_tokens: 5169
Per-token loss scaled by world size: 0.0001966664713108912
Per-token loss scaled by world size: 0.0001095464758691378
Per-token loss scaled by world size: 0.00030617474112659693
Per-token loss scaled by world size: 0.0003008927742484957
Per-token loss scaled by world size: 0.0001922248484333977
Per-token loss scaled by world size: 6.467673188126355e-07
Per-token loss scaled by world size: 0.00025096320314332843
Epoch: 1, Step: 129, Rank: 6, loss = 1.0479342937469482
Epoch: 1, Step: 129, Rank: 4, loss = 1.066330075263977
Epoch: 1, Step: 129, Rank: 2, loss = 0.3815229833126068
Epoch: 1, Step: 129, Rank: 1, loss = 0.6849401593208313
Epoch: 1, Step: 129, Rank: 0, loss = 0.0022525289095938206
Epoch: 1, Step: 129, Rank: 7, loss = 0.6694710850715637
Per-token loss scaled by world size: 0.00032839240157045424
Epoch: 1, Step: 129, Rank: 5, loss = 1.14370858669281
Epoch: 1, Step: 129, Rank: 3, loss = 0.8740420937538147
total tokens: 6094 num samples: 2 num padding tokens: 254 - rank: 1 max len: 3047 min len: 2793 avg len: 2920.0 num_loss_counted_tokens: 165
total tokens: 7068 num samples: 6 num padding tokens: 383 - rank: 4 max len: 1178 min len: 1029 avg len: 1114.1666666666667 num_loss_counted_tokens: 3758
{
"epoch": 1,
"step": 129,
"rank": 0,
"loss": 0.0022525289095938206,
"overall_throughput": 41.47733724327907,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.227038383483887,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 27862,
"batch_size": 83,
"total_loss": 0.7337751984596252,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:49.705725"
}
total tokens: 8016 num samples: 16 num padding tokens: 1518 - rank: 6 max len: 501 min len: 265 avg len: 406.125 num_loss_counted_tokens: 3598
total tokens: 7964 num samples: 4 num padding tokens: 776 - rank: 3 max len: 1991 min len: 1616 avg len: 1797.0 num_loss_counted_tokens: 415
total tokens: 6264 num samples: 24 num padding tokens: 2144 - rank: 7 max len: 261 min len: 85 avg len: 171.66666666666666 num_loss_counted_tokens: 1744
total tokens: 7974 num samples: 2 num padding tokens: 703 - rank: 0 max len: 3987 min len: 3284 avg len: 3635.5 num_loss_counted_tokens: 349
total tokens: 8073 num samples: 9 num padding tokens: 2207 - rank: 5 max len: 897 min len: 515 avg len: 651.7777777777778 num_loss_counted_tokens: 4298
total tokens: 7056 num samples: 3 num padding tokens: 377 - rank: 2 max len: 2352 min len: 2050 avg len: 2226.3333333333335 num_loss_counted_tokens: 2384
Per-token loss scaled by world size: 0.0002643604821059853
Per-token loss scaled by world size: 0.000382772006560117
Per-token loss scaled by world size: 0.000271481869276613
Per-token loss scaled by world size: 3.5156226658727974e-06
Per-token loss scaled by world size: 6.239629328774754e-07
Per-token loss scaled by world size: 0.00018353613268118352
Epoch: 1, Step: 130, Rank: 5, loss = 1.2674537897109985
Epoch: 1, Step: 130, Rank: 6, loss = 0.8989443778991699
Epoch: 1, Step: 130, Rank: 0, loss = 0.011641105636954308
Epoch: 1, Step: 130, Rank: 2, loss = 0.8753636479377747
Per-token loss scaled by world size: 0.0003246103588026017
Epoch: 1, Step: 130, Rank: 1, loss = 0.0020660972222685814
Epoch: 1, Step: 130, Rank: 7, loss = 0.6077340245246887
Per-token loss scaled by world size: 0.00032646721228957176
Epoch: 1, Step: 130, Rank: 4, loss = 1.0748660564422607
Epoch: 1, Step: 130, Rank: 3, loss = 1.0810145139694214
total tokens: 7672 num samples: 7 num padding tokens: 936 - rank: 4 max len: 1096 min len: 867 avg len: 962.2857142857143 num_loss_counted_tokens: 3200
total tokens: 6610 num samples: 2 num padding tokens: 448 - rank: 1 max len: 3305 min len: 2857 avg len: 3081.0 num_loss_counted_tokens: 155
total tokens: 7690 num samples: 10 num padding tokens: 781 - rank: 5 max len: 769 min len: 604 avg len: 690.9 num_loss_counted_tokens: 5078
{
"epoch": 1,
"step": 130,
"rank": 0,
"loss": 0.011641105636954308,
"overall_throughput": 41.55595823632429,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.426838874816895,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26490,
"batch_size": 77,
"total_loss": 0.7273854613304138,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:52.249145"
}
total tokens: 5650 num samples: 2 num padding tokens: 223 - rank: 2 max len: 2825 min len: 2602 avg len: 2713.5 num_loss_counted_tokens: 210
total tokens: 4070 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4070 min len: 4070 avg len: 4070.0 num_loss_counted_tokens: 1038
total tokens: 8030 num samples: 5 num padding tokens: 1189 - rank: 3 max len: 1606 min len: 1172 avg len: 1368.2 num_loss_counted_tokens: 2403
total tokens: 5180 num samples: 20 num padding tokens: 2159 - rank: 7 max len: 259 min len: 76 avg len: 151.05 num_loss_counted_tokens: 1160
total tokens: 7800 num samples: 13 num padding tokens: 2266 - rank: 6 max len: 600 min len: 271 avg len: 425.6923076923077 num_loss_counted_tokens: 3635
Per-token loss scaled by world size: 0.0001472200092393905
Per-token loss scaled by world size: 0.00025629153242334723
Per-token loss scaled by world size: 0.0002756573085207492
Per-token loss scaled by world size: 0.00023288748343475163
Per-token loss scaled by world size: 0.0004771172534674406
Per-token loss scaled by world size: 1.5136585034269956e-06
Epoch: 1, Step: 131, Rank: 2, loss = 0.8319223523139954
Epoch: 1, Step: 131, Rank: 6, loss = 0.894783616065979
Epoch: 1, Step: 131, Rank: 3, loss = 0.755952775478363
Epoch: 1, Step: 131, Rank: 4, loss = 1.5487226247787476
Epoch: 1, Step: 131, Rank: 1, loss = 0.4778761565685272
Per-token loss scaled by world size: 0.0003685416013468057
Epoch: 1, Step: 131, Rank: 0, loss = 0.004913335666060448
Epoch: 1, Step: 131, Rank: 5, loss = 1.1962860822677612
Per-token loss scaled by world size: 0.0002723717479966581
Epoch: 1, Step: 131, Rank: 7, loss = 0.8841187357902527
total tokens: 7064 num samples: 4 num padding tokens: 383 - rank: 1 max len: 1766 min len: 1494 avg len: 1670.25 num_loss_counted_tokens: 2269
total tokens: 7308 num samples: 9 num padding tokens: 754 - rank: 4 max len: 812 min len: 666 avg len: 728.2222222222222 num_loss_counted_tokens: 4526
{
"epoch": 1,
"step": 131,
"rank": 0,
"loss": 0.004913335666060448,
"overall_throughput": 42.206020029222465,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.24073839187622,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25968,
"batch_size": 101,
"total_loss": 0.8243219256401062,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:54.758478"
}
total tokens: 8040 num samples: 8 num padding tokens: 747 - rank: 3 max len: 1005 min len: 816 avg len: 911.625 num_loss_counted_tokens: 5982
total tokens: 7415 num samples: 5 num padding tokens: 1415 - rank: 2 max len: 1483 min len: 1015 avg len: 1200.0 num_loss_counted_tokens: 2673
total tokens: 8106 num samples: 3 num padding tokens: 591 - rank: 0 max len: 2702 min len: 2308 avg len: 2505.0 num_loss_counted_tokens: 285
total tokens: 7936 num samples: 32 num padding tokens: 2881 - rank: 7 max len: 248 min len: 70 avg len: 157.96875 num_loss_counted_tokens: 2168
total tokens: 7824 num samples: 12 num padding tokens: 940 - rank: 5 max len: 652 min len: 506 avg len: 573.6666666666666 num_loss_counted_tokens: 5458
total tokens: 7664 num samples: 16 num padding tokens: 1917 - rank: 6 max len: 479 min len: 262 avg len: 359.1875 num_loss_counted_tokens: 3606
Per-token loss scaled by world size: 0.0003236977499909699
Per-token loss scaled by world size: 0.0009724997216835618
Per-token loss scaled by world size: 0.00038814375875517726
Per-token loss scaled by world size: 0.0005891940090805292
Per-token loss scaled by world size: 0.0001362602924928069
Per-token loss scaled by world size: 5.815729309688322e-06
Per-token loss scaled by world size: 8.939716281020083e-06
Epoch: 1, Step: 132, Rank: 5, loss = 1.257119059562683
Epoch: 1, Step: 132, Rank: 6, loss = 2.0749497413635254
Epoch: 1, Step: 132, Rank: 4, loss = 0.6906496286392212
Epoch: 1, Step: 132, Rank: 7, loss = 0.8281532526016235
Epoch: 1, Step: 132, Rank: 2, loss = 0.012408585287630558
Epoch: 1, Step: 132, Rank: 3, loss = 0.290728360414505
Epoch: 1, Step: 132, Rank: 1, loss = 0.019074002280831337
Per-token loss scaled by world size: 3.719959931913763e-05
Epoch: 1, Step: 132, Rank: 0, loss = 0.07936999201774597
total tokens: 7930 num samples: 10 num padding tokens: 877 - rank: 4 max len: 793 min len: 643 avg len: 705.3 num_loss_counted_tokens: 3435
total tokens: 7288 num samples: 4 num padding tokens: 780 - rank: 1 max len: 1822 min len: 1515 avg len: 1627.0 num_loss_counted_tokens: 3100
total tokens: 7105 num samples: 7 num padding tokens: 499 - rank: 3 max len: 1015 min len: 827 avg len: 943.7142857142857 num_loss_counted_tokens: 4654
{
"epoch": 1,
"step": 132,
"rank": 0,
"loss": 0.07936999201774597,
"overall_throughput": 42.231542345095384,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.476311683654785,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 17069,
"batch_size": 64,
"total_loss": 0.6565565466880798,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:57.301535"
}
total tokens: 7560 num samples: 30 num padding tokens: 2281 - rank: 7 max len: 252 min len: 77 avg len: 175.96666666666667 num_loss_counted_tokens: 2486
total tokens: 7176 num samples: 6 num padding tokens: 666 - rank: 2 max len: 1196 min len: 1024 avg len: 1085.0 num_loss_counted_tokens: 3309
total tokens: 5546 num samples: 2 num padding tokens: 144 - rank: 0 max len: 2773 min len: 2629 avg len: 2701.0 num_loss_counted_tokens: 194
total tokens: 8046 num samples: 18 num padding tokens: 1959 - rank: 6 max len: 447 min len: 255 avg len: 338.1666666666667 num_loss_counted_tokens: 3390
total tokens: 7656 num samples: 12 num padding tokens: 892 - rank: 5 max len: 638 min len: 458 avg len: 563.6666666666666 num_loss_counted_tokens: 4307
Per-token loss scaled by world size: 0.00027533259708434343
Per-token loss scaled by world size: 0.00012797772069461644
Per-token loss scaled by world size: 0.00020723696798086166
Per-token loss scaled by world size: 0.00024243281222879887
Per-token loss scaled by world size: 0.00023833484738133848
Per-token loss scaled by world size: 0.00026018035714514554
Per-token loss scaled by world size: 0.00010569631558610126
Epoch: 1, Step: 133, Rank: 2, loss = 0.7923144102096558
Epoch: 1, Step: 133, Rank: 6, loss = 0.9153087735176086
Epoch: 1, Step: 133, Rank: 1, loss = 0.42544591426849365
Epoch: 1, Step: 133, Rank: 4, loss = 0.8059375882148743
Epoch: 1, Step: 133, Rank: 3, loss = 0.8649370670318604
Epoch: 1, Step: 133, Rank: 7, loss = 0.6889333724975586
Epoch: 1, Step: 133, Rank: 0, loss = 0.35137417912483215
Per-token loss scaled by world size: 0.0003980571636930108
Epoch: 1, Step: 133, Rank: 5, loss = 1.323291301727295
total tokens: 7461 num samples: 9 num padding tokens: 647 - rank: 4 max len: 829 min len: 685 avg len: 757.1111111111111 num_loss_counted_tokens: 3551
total tokens: 7808 num samples: 4 num padding tokens: 946 - rank: 1 max len: 1952 min len: 1419 avg len: 1715.5 num_loss_counted_tokens: 1967
total tokens: 8010 num samples: 30 num padding tokens: 2819 - rank: 7 max len: 267 min len: 85 avg len: 173.03333333333333 num_loss_counted_tokens: 2333
{
"epoch": 1,
"step": 133,
"rank": 0,
"loss": 0.35137417912483215,
"overall_throughput": 41.395595194588644,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.437856197357178,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26595,
"batch_size": 92,
"total_loss": 0.7709429264068604,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:53:59.817004"
}
total tokens: 6995 num samples: 5 num padding tokens: 1102 - rank: 2 max len: 1399 min len: 998 avg len: 1178.6 num_loss_counted_tokens: 2712
total tokens: 7808 num samples: 8 num padding tokens: 442 - rank: 3 max len: 976 min len: 844 avg len: 920.75 num_loss_counted_tokens: 4240
total tokens: 7686 num samples: 3 num padding tokens: 937 - rank: 0 max len: 2562 min len: 2050 avg len: 2249.6666666666665 num_loss_counted_tokens: 556
total tokens: 7992 num samples: 18 num padding tokens: 1651 - rank: 6 max len: 444 min len: 270 avg len: 352.27777777777777 num_loss_counted_tokens: 3377
total tokens: 7656 num samples: 12 num padding tokens: 1025 - rank: 5 max len: 638 min len: 445 avg len: 552.5833333333334 num_loss_counted_tokens: 4330
Per-token loss scaled by world size: 0.0003683593822643161
Per-token loss scaled by world size: 0.0001245876046596095
Per-token loss scaled by world size: 0.0003223164821974933
Per-token loss scaled by world size: 0.0003114262653980404
Per-token loss scaled by world size: 0.0002396363124717027
Per-token loss scaled by world size: 0.0002762637159321457
Per-token loss scaled by world size: 7.41567782824859e-06
Epoch: 1, Step: 134, Rank: 1, loss = 0.3786684572696686
Epoch: 1, Step: 134, Rank: 6, loss = 0.9796406626701355
Epoch: 1, Step: 134, Rank: 5, loss = 1.1195822954177856
Epoch: 1, Step: 134, Rank: 7, loss = 0.9465411901473999
Epoch: 1, Step: 134, Rank: 4, loss = 0.7283446192741394
Epoch: 1, Step: 134, Rank: 3, loss = 0.8396689891815186
Epoch: 1, Step: 134, Rank: 0, loss = 0.02253902517259121
Per-token loss scaled by world size: 0.00010645172005752102
Epoch: 1, Step: 134, Rank: 2, loss = 0.32354670763015747
total tokens: 6520 num samples: 4 num padding tokens: 733 - rank: 1 max len: 1630 min len: 1198 avg len: 1446.75 num_loss_counted_tokens: 1720
total tokens: 7600 num samples: 10 num padding tokens: 457 - rank: 4 max len: 760 min len: 692 avg len: 714.3 num_loss_counted_tokens: 4057
{
"epoch": 1,
"step": 134,
"rank": 0,
"loss": 0.02253902517259121,
"overall_throughput": 41.963973566813905,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.358724117279053,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24315,
"batch_size": 87,
"total_loss": 0.6673164963722229,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:02.337863"
}
total tokens: 7712 num samples: 16 num padding tokens: 2123 - rank: 6 max len: 482 min len: 243 avg len: 349.3125 num_loss_counted_tokens: 3585
total tokens: 8064 num samples: 7 num padding tokens: 647 - rank: 2 max len: 1152 min len: 995 avg len: 1059.5714285714287 num_loss_counted_tokens: 4100
total tokens: 6102 num samples: 27 num padding tokens: 2162 - rank: 7 max len: 226 min len: 78 avg len: 145.92592592592592 num_loss_counted_tokens: 1630
total tokens: 7956 num samples: 12 num padding tokens: 1032 - rank: 5 max len: 663 min len: 497 avg len: 577.0 num_loss_counted_tokens: 5176
total tokens: 7920 num samples: 8 num padding tokens: 902 - rank: 3 max len: 990 min len: 770 avg len: 877.25 num_loss_counted_tokens: 3876
total tokens: 6987 num samples: 3 num padding tokens: 1078 - rank: 0 max len: 2329 min len: 1691 avg len: 1969.6666666666667 num_loss_counted_tokens: 437
Per-token loss scaled by world size: 0.0005423504626378417
Per-token loss scaled by world size: 0.00036555714905261993
Per-token loss scaled by world size: 0.00022980774519965053
Per-token loss scaled by world size: 2.886021502490621e-05
Per-token loss scaled by world size: 2.577510167611763e-05
Per-token loss scaled by world size: 0.00026930312742479146
Per-token loss scaled by world size: 4.34833509643795e-06
Epoch: 1, Step: 135, Rank: 3, loss = 0.5620235800743103
Epoch: 1, Step: 135, Rank: 4, loss = 0.8940157294273376
Epoch: 1, Step: 135, Rank: 2, loss = 0.07058126479387283
Epoch: 1, Step: 135, Rank: 1, loss = 0.0630362331867218
Epoch: 1, Step: 135, Rank: 6, loss = 1.3263858556747437
Epoch: 1, Step: 135, Rank: 0, loss = 0.010634397156536579
Epoch: 1, Step: 135, Rank: 7, loss = 0.658614456653595
Per-token loss scaled by world size: 0.0009107645018957555
Epoch: 1, Step: 135, Rank: 5, loss = 2.227388381958008
total tokens: 6480 num samples: 3 num padding tokens: 750 - rank: 1 max len: 2160 min len: 1742 avg len: 1910.0 num_loss_counted_tokens: 614
total tokens: 7314 num samples: 6 num padding tokens: 901 - rank: 4 max len: 1219 min len: 926 avg len: 1068.8333333333333 num_loss_counted_tokens: 4111
{
"epoch": 1,
"step": 135,
"rank": 0,
"loss": 0.010634397156536579,
"overall_throughput": 41.220052175245854,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.333390712738037,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19565,
"batch_size": 73,
"total_loss": 0.7265850305557251,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:04.904683"
}
total tokens: 7488 num samples: 9 num padding tokens: 392 - rank: 5 max len: 832 min len: 756 avg len: 788.4444444444445 num_loss_counted_tokens: 5496
total tokens: 7700 num samples: 11 num padding tokens: 1846 - rank: 6 max len: 700 min len: 394 avg len: 532.1818181818181 num_loss_counted_tokens: 3763
total tokens: 7185 num samples: 5 num padding tokens: 672 - rank: 3 max len: 1437 min len: 1230 avg len: 1302.6 num_loss_counted_tokens: 1982
total tokens: 6648 num samples: 4 num padding tokens: 299 - rank: 2 max len: 1662 min len: 1515 avg len: 1587.25 num_loss_counted_tokens: 916
total tokens: 7875 num samples: 21 num padding tokens: 3380 - rank: 7 max len: 375 min len: 88 avg len: 214.04761904761904 num_loss_counted_tokens: 2091
total tokens: 5508 num samples: 2 num padding tokens: 33 - rank: 0 max len: 2754 min len: 2721 avg len: 2737.5 num_loss_counted_tokens: 386
Per-token loss scaled by world size: 0.0004460816562641412
Per-token loss scaled by world size: 0.0001522299717180431
Per-token loss scaled by world size: 0.0003226569388061762
Per-token loss scaled by world size: 0.0003637947083916515
Per-token loss scaled by world size: 0.00024503390886820853
Per-token loss scaled by world size: 5.745379894506186e-05
Per-token loss scaled by world size: 6.490522537205834e-06
Epoch: 1, Step: 136, Rank: 6, loss = 1.0003172159194946
Epoch: 1, Step: 136, Rank: 5, loss = 1.3829646110534668
Epoch: 1, Step: 136, Rank: 3, loss = 0.4719509482383728
Epoch: 1, Step: 136, Rank: 1, loss = 0.17812113463878632
Epoch: 1, Step: 136, Rank: 4, loss = 1.127854585647583
Epoch: 1, Step: 136, Rank: 7, loss = 0.759666383266449
Epoch: 1, Step: 136, Rank: 0, loss = 0.020122243091464043
Per-token loss scaled by world size: 0.00021433050278574228
Epoch: 1, Step: 136, Rank: 2, loss = 0.6644781231880188
total tokens: 6660 num samples: 4 num padding tokens: 420 - rank: 1 max len: 1665 min len: 1335 avg len: 1560.0 num_loss_counted_tokens: 1802
total tokens: 7997 num samples: 11 num padding tokens: 948 - rank: 4 max len: 727 min len: 535 avg len: 640.8181818181819 num_loss_counted_tokens: 5687
{
"epoch": 1,
"step": 136,
"rank": 0,
"loss": 0.020122243091464043,
"overall_throughput": 41.5055278365255,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.32311248779297,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24802,
"batch_size": 88,
"total_loss": 0.7006844282150269,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:07.453575"
}
total tokens: 8010 num samples: 15 num padding tokens: 963 - rank: 5 max len: 534 min len: 399 avg len: 469.8 num_loss_counted_tokens: 4616
total tokens: 7308 num samples: 6 num padding tokens: 791 - rank: 2 max len: 1218 min len: 1008 avg len: 1086.1666666666667 num_loss_counted_tokens: 4945
total tokens: 6356 num samples: 28 num padding tokens: 2338 - rank: 7 max len: 227 min len: 81 avg len: 143.5 num_loss_counted_tokens: 1107
total tokens: 7782 num samples: 2 num padding tokens: 2057 - rank: 0 max len: 3891 min len: 1834 avg len: 2862.5 num_loss_counted_tokens: 230
total tokens: 7752 num samples: 8 num padding tokens: 1153 - rank: 3 max len: 969 min len: 740 avg len: 824.875 num_loss_counted_tokens: 4517
total tokens: 8085 num samples: 21 num padding tokens: 1827 - rank: 6 max len: 385 min len: 241 avg len: 298.0 num_loss_counted_tokens: 3422
Per-token loss scaled by world size: 0.00032276863930746913
Per-token loss scaled by world size: 0.0002541161666158587
Per-token loss scaled by world size: 4.045515743200667e-05
Per-token loss scaled by world size: 0.00033090231590904295
Per-token loss scaled by world size: 0.0003559431352186948
Per-token loss scaled by world size: 0.00022709915356244892
Epoch: 1, Step: 137, Rank: 0, loss = 0.13576750457286835
Epoch: 1, Step: 137, Rank: 6, loss = 1.1105082035064697
Epoch: 1, Step: 137, Rank: 3, loss = 0.8528138995170593
Per-token loss scaled by world size: 0.0001023018267005682
Epoch: 1, Step: 137, Rank: 5, loss = 1.0832115411758423
Epoch: 1, Step: 137, Rank: 4, loss = 1.1945451498031616
Epoch: 1, Step: 137, Rank: 1, loss = 0.7621447443962097
Per-token loss scaled by world size: 0.0002989015483763069
Epoch: 1, Step: 137, Rank: 7, loss = 0.3433249294757843
Epoch: 1, Step: 137, Rank: 2, loss = 1.0031136274337769
total tokens: 6924 num samples: 4 num padding tokens: 1102 - rank: 1 max len: 1731 min len: 1172 avg len: 1455.5 num_loss_counted_tokens: 1807
total tokens: 7890 num samples: 10 num padding tokens: 472 - rank: 4 max len: 789 min len: 675 avg len: 741.8 num_loss_counted_tokens: 3224
total tokens: 5649 num samples: 21 num padding tokens: 2159 - rank: 7 max len: 269 min len: 72 avg len: 166.1904761904762 num_loss_counted_tokens: 1470
{
"epoch": 1,
"step": 137,
"rank": 0,
"loss": 0.13576750457286835,
"overall_throughput": 42.12198553508762,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.488715171813965,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26848,
"batch_size": 89,
"total_loss": 0.8106787204742432,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:09.964344"
}
total tokens: 7440 num samples: 8 num padding tokens: 452 - rank: 3 max len: 930 min len: 829 avg len: 873.5 num_loss_counted_tokens: 4943
total tokens: 7932 num samples: 12 num padding tokens: 1107 - rank: 5 max len: 661 min len: 488 avg len: 568.75 num_loss_counted_tokens: 4833
total tokens: 7648 num samples: 16 num padding tokens: 1451 - rank: 6 max len: 478 min len: 284 avg len: 387.3125 num_loss_counted_tokens: 3233
total tokens: 5446 num samples: 2 num padding tokens: 973 - rank: 0 max len: 2723 min len: 1750 avg len: 2236.5 num_loss_counted_tokens: 235
total tokens: 7763 num samples: 7 num padding tokens: 774 - rank: 2 max len: 1109 min len: 938 avg len: 998.4285714285714 num_loss_counted_tokens: 4812
Per-token loss scaled by world size: 0.00015472256927751005
Per-token loss scaled by world size: 0.00039065544842742383
Per-token loss scaled by world size: 0.0003819867270067334
Per-token loss scaled by world size: 0.00013601673708762974
Per-token loss scaled by world size: 0.0002345130778849125
Per-token loss scaled by world size: 9.365750884171575e-05
Per-token loss scaled by world size: 0.00026829339913092554
Epoch: 1, Step: 138, Rank: 5, loss = 1.27397620677948
Epoch: 1, Step: 138, Rank: 1, loss = 0.5045696496963501
Epoch: 1, Step: 138, Rank: 2, loss = 0.4435676038265228
Epoch: 1, Step: 138, Rank: 4, loss = 1.2457064390182495
Epoch: 1, Step: 138, Rank: 0, loss = 0.3054288327693939
Epoch: 1, Step: 138, Rank: 7, loss = 0.8749383091926575
Epoch: 1, Step: 138, Rank: 3, loss = 0.7647764682769775
Per-token loss scaled by world size: 0.0003792343777604401
Epoch: 1, Step: 138, Rank: 6, loss = 1.236730694770813
total tokens: 7994 num samples: 7 num padding tokens: 1969 - rank: 4 max len: 1142 min len: 760 avg len: 860.7142857142857 num_loss_counted_tokens: 3971
total tokens: 6921 num samples: 3 num padding tokens: 423 - rank: 1 max len: 2307 min len: 1952 avg len: 2166.0 num_loss_counted_tokens: 2301
{
"epoch": 1,
"step": 138,
"rank": 0,
"loss": 0.3054288327693939,
"overall_throughput": 42.2029043837429,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.34983253479004,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26089,
"batch_size": 100,
"total_loss": 0.8312118053436279,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:12.475862"
}
total tokens: 7876 num samples: 11 num padding tokens: 1537 - rank: 5 max len: 716 min len: 452 avg len: 576.2727272727273 num_loss_counted_tokens: 4021
total tokens: 7676 num samples: 4 num padding tokens: 763 - rank: 2 max len: 1919 min len: 1594 avg len: 1728.25 num_loss_counted_tokens: 1719
total tokens: 7115 num samples: 5 num padding tokens: 406 - rank: 3 max len: 1423 min len: 1187 avg len: 1341.8 num_loss_counted_tokens: 3469
total tokens: 7202 num samples: 26 num padding tokens: 2407 - rank: 7 max len: 277 min len: 93 avg len: 184.42307692307693 num_loss_counted_tokens: 2108
total tokens: 6352 num samples: 2 num padding tokens: 277 - rank: 0 max len: 3176 min len: 2899 avg len: 3037.5 num_loss_counted_tokens: 710
total tokens: 8046 num samples: 18 num padding tokens: 1591 - rank: 6 max len: 447 min len: 283 avg len: 358.6111111111111 num_loss_counted_tokens: 3535
Per-token loss scaled by world size: 0.0002807814453262836
Per-token loss scaled by world size: 0.0002877341175917536
Per-token loss scaled by world size: 0.00017392370500601828
Per-token loss scaled by world size: 0.00026719356537796557
Per-token loss scaled by world size: 0.00031913904240354896
Per-token loss scaled by world size: 0.000377663760446012
Per-token loss scaled by world size: 3.4834424695873167e-06
Epoch: 1, Step: 139, Rank: 5, loss = 0.8960040807723999
Epoch: 1, Step: 139, Rank: 3, loss = 0.5415984392166138
Epoch: 1, Step: 139, Rank: 1, loss = 0.8743534088134766
Epoch: 1, Step: 139, Rank: 6, loss = 0.9937989711761475
Epoch: 1, Step: 139, Rank: 7, loss = 0.8320407271385193
Epoch: 1, Step: 139, Rank: 0, loss = 0.010847439989447594
Epoch: 1, Step: 139, Rank: 4, loss = 1.1760449409484863
Per-token loss scaled by world size: 0.0001241332065546885
Epoch: 1, Step: 139, Rank: 2, loss = 0.38655081391334534
total tokens: 7483 num samples: 7 num padding tokens: 916 - rank: 4 max len: 1069 min len: 850 avg len: 938.1428571428571 num_loss_counted_tokens: 5047
total tokens: 8067 num samples: 3 num padding tokens: 632 - rank: 1 max len: 2689 min len: 2157 avg len: 2478.3333333333335 num_loss_counted_tokens: 281
{
"epoch": 1,
"step": 139,
"rank": 0,
"loss": 0.010847439989447594,
"overall_throughput": 41.72034163026323,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.309249877929688,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24912,
"batch_size": 83,
"total_loss": 0.7139047980308533,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:15.011121"
}
total tokens: 7392 num samples: 4 num padding tokens: 517 - rank: 2 max len: 1848 min len: 1598 avg len: 1718.75 num_loss_counted_tokens: 1592
total tokens: 3971 num samples: 19 num padding tokens: 1215 - rank: 7 max len: 209 min len: 77 avg len: 145.05263157894737 num_loss_counted_tokens: 1089
total tokens: 7575 num samples: 5 num padding tokens: 836 - rank: 3 max len: 1515 min len: 1078 avg len: 1347.8 num_loss_counted_tokens: 3059
total tokens: 7148 num samples: 2 num padding tokens: 850 - rank: 0 max len: 3574 min len: 2724 avg len: 3149.0 num_loss_counted_tokens: 226
total tokens: 7308 num samples: 9 num padding tokens: 1380 - rank: 5 max len: 812 min len: 530 avg len: 658.6666666666666 num_loss_counted_tokens: 3339
total tokens: 8048 num samples: 16 num padding tokens: 2308 - rank: 6 max len: 503 min len: 219 avg len: 358.75 num_loss_counted_tokens: 3793
Per-token loss scaled by world size: 0.0002717878087423742
Per-token loss scaled by world size: 0.00017612801457289606
Per-token loss scaled by world size: 0.00045838873484171927
Per-token loss scaled by world size: 0.00027124761254526675
Per-token loss scaled by world size: 0.00033216923475265503
Per-token loss scaled by world size: 2.2576082301384304e-06
Per-token loss scaled by world size: 5.144028546055779e-05
Epoch: 1, Step: 140, Rank: 2, loss = 0.5452042818069458
Epoch: 1, Step: 140, Rank: 3, loss = 0.8413191437721252
Epoch: 1, Step: 140, Rank: 5, loss = 1.4189423322677612
Epoch: 1, Step: 140, Rank: 4, loss = 0.8396469950675964
Epoch: 1, Step: 140, Rank: 0, loss = 0.0069884262047708035
Epoch: 1, Step: 140, Rank: 7, loss = 1.028229832649231
Epoch: 1, Step: 140, Rank: 1, loss = 0.15923340618610382
Per-token loss scaled by world size: 0.0004096345801372081
Epoch: 1, Step: 140, Rank: 6, loss = 1.2680238485336304
total tokens: 7968 num samples: 8 num padding tokens: 1069 - rank: 4 max len: 996 min len: 754 avg len: 862.375 num_loss_counted_tokens: 4593
total tokens: 7478 num samples: 2 num padding tokens: 921 - rank: 1 max len: 3739 min len: 2818 avg len: 3278.5 num_loss_counted_tokens: 165
{
"epoch": 1,
"step": 140,
"rank": 0,
"loss": 0.0069884262047708035,
"overall_throughput": 41.71879346229527,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.433530807495117,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24764,
"batch_size": 84,
"total_loss": 0.7634485363960266,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:17.546839"
}
total tokens: 7473 num samples: 3 num padding tokens: 2287 - rank: 2 max len: 2491 min len: 1276 avg len: 1728.6666666666667 num_loss_counted_tokens: 634
total tokens: 7476 num samples: 28 num padding tokens: 2615 - rank: 7 max len: 267 min len: 72 avg len: 173.60714285714286 num_loss_counted_tokens: 2127
total tokens: 7326 num samples: 6 num padding tokens: 603 - rank: 3 max len: 1221 min len: 1065 avg len: 1120.5 num_loss_counted_tokens: 2131
total tokens: 7580 num samples: 2 num padding tokens: 14 - rank: 0 max len: 3790 min len: 3776 avg len: 3783.0 num_loss_counted_tokens: 616
total tokens: 7390 num samples: 10 num padding tokens: 1181 - rank: 5 max len: 739 min len: 522 avg len: 620.9 num_loss_counted_tokens: 3575
total tokens: 7740 num samples: 15 num padding tokens: 1604 - rank: 6 max len: 516 min len: 269 avg len: 409.06666666666666 num_loss_counted_tokens: 3156
Per-token loss scaled by world size: 0.0002609801304060966
Per-token loss scaled by world size: 0.0002962287690024823
Per-token loss scaled by world size: 0.0003096856235060841
Per-token loss scaled by world size: 0.00027970768860541284
Per-token loss scaled by world size: 2.0839811440964695e-06
Per-token loss scaled by world size: 0.0006479129078797996
Per-token loss scaled by world size: 5.338866685633548e-06
Epoch: 1, Step: 141, Rank: 4, loss = 0.7952631711959839
Epoch: 1, Step: 141, Rank: 3, loss = 0.8313897848129272
Epoch: 1, Step: 141, Rank: 7, loss = 0.7006337642669678
Epoch: 1, Step: 141, Rank: 2, loss = 0.750910222530365
Epoch: 1, Step: 141, Rank: 5, loss = 1.7394031286239624
Epoch: 1, Step: 141, Rank: 0, loss = 0.005594708025455475
Epoch: 1, Step: 141, Rank: 1, loss = 0.014332855120301247
Per-token loss scaled by world size: 0.0004338203580118716
Epoch: 1, Step: 141, Rank: 6, loss = 1.1646449565887451
total tokens: 6904 num samples: 4 num padding tokens: 1250 - rank: 1 max len: 1726 min len: 1224 avg len: 1413.5 num_loss_counted_tokens: 3292
total tokens: 7900 num samples: 10 num padding tokens: 629 - rank: 4 max len: 790 min len: 654 avg len: 727.1 num_loss_counted_tokens: 4621
{
"epoch": 1,
"step": 141,
"rank": 0,
"loss": 0.005594708025455475,
"overall_throughput": 43.02419669348744,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.362098217010498,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21477,
"batch_size": 74,
"total_loss": 0.7502715587615967,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:20.007363"
}
total tokens: 7999 num samples: 19 num padding tokens: 1858 - rank: 6 max len: 421 min len: 242 avg len: 323.2105263157895 num_loss_counted_tokens: 3598
total tokens: 7648 num samples: 8 num padding tokens: 656 - rank: 3 max len: 956 min len: 812 avg len: 874.0 num_loss_counted_tokens: 5090
total tokens: 6966 num samples: 6 num padding tokens: 548 - rank: 2 max len: 1161 min len: 993 avg len: 1069.6666666666667 num_loss_counted_tokens: 3065
total tokens: 7440 num samples: 31 num padding tokens: 2495 - rank: 7 max len: 240 min len: 77 avg len: 159.51612903225808 num_loss_counted_tokens: 2088
total tokens: 6690 num samples: 3 num padding tokens: 492 - rank: 0 max len: 2230 min len: 1959 avg len: 2066.0 num_loss_counted_tokens: 280
total tokens: 7728 num samples: 12 num padding tokens: 1190 - rank: 5 max len: 644 min len: 425 avg len: 544.8333333333334 num_loss_counted_tokens: 3724
Per-token loss scaled by world size: 0.0005912419874221087
Per-token loss scaled by world size: 2.4495158868376166e-05
Per-token loss scaled by world size: 0.0005119486595503986
Per-token loss scaled by world size: 4.400004763738252e-05
Per-token loss scaled by world size: 0.00011134906526422128
Per-token loss scaled by world size: 0.0005244921194389462
Per-token loss scaled by world size: 0.0003919812443200499
Epoch: 1, Step: 142, Rank: 1, loss = 0.06206461042165756
Epoch: 1, Step: 142, Rank: 6, loss = 1.297149896621704
Epoch: 1, Step: 142, Rank: 5, loss = 1.4980593919754028
Epoch: 1, Step: 142, Rank: 0, loss = 0.11148512363433838
Epoch: 1, Step: 142, Rank: 2, loss = 0.2821306884288788
Epoch: 1, Step: 142, Rank: 7, loss = 0.9931824803352356
Epoch: 1, Step: 142, Rank: 4, loss = 1.3289319276809692
Per-token loss scaled by world size: 0.00033672110293991864
Epoch: 1, Step: 142, Rank: 3, loss = 0.8531671166419983
total tokens: 6288 num samples: 3 num padding tokens: 423 - rank: 1 max len: 2096 min len: 1765 avg len: 1955.0 num_loss_counted_tokens: 4016
total tokens: 7448 num samples: 8 num padding tokens: 727 - rank: 4 max len: 931 min len: 688 avg len: 840.125 num_loss_counted_tokens: 5430
{
"epoch": 1,
"step": 142,
"rank": 0,
"loss": 0.11148512363433838,
"overall_throughput": 41.71373478675397,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.478935718536377,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20270,
"batch_size": 89,
"total_loss": 0.8032714128494263,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:22.542197"
}
total tokens: 7664 num samples: 16 num padding tokens: 2101 - rank: 6 max len: 479 min len: 261 avg len: 347.6875 num_loss_counted_tokens: 2998
total tokens: 7707 num samples: 7 num padding tokens: 488 - rank: 3 max len: 1101 min len: 961 avg len: 1031.2857142857142 num_loss_counted_tokens: 3858
total tokens: 7130 num samples: 5 num padding tokens: 559 - rank: 2 max len: 1426 min len: 1132 avg len: 1314.2 num_loss_counted_tokens: 3828
total tokens: 7540 num samples: 29 num padding tokens: 3170 - rank: 7 max len: 260 min len: 78 avg len: 150.68965517241378 num_loss_counted_tokens: 1742
total tokens: 6342 num samples: 2 num padding tokens: 886 - rank: 0 max len: 3171 min len: 2285 avg len: 2728.0 num_loss_counted_tokens: 169
total tokens: 7872 num samples: 12 num padding tokens: 877 - rank: 5 max len: 656 min len: 510 avg len: 582.9166666666666 num_loss_counted_tokens: 4836
Per-token loss scaled by world size: 0.0003404757590033114
Per-token loss scaled by world size: 0.0005307358223944902
Per-token loss scaled by world size: 6.328061135718599e-05
Per-token loss scaled by world size: 0.0002552252262830734
Per-token loss scaled by world size: 2.1218120309640653e-06
Per-token loss scaled by world size: 1.8243759768665768e-05
Per-token loss scaled by world size: 0.0003829057968687266
Epoch: 1, Step: 143, Rank: 5, loss = 1.3186794519424438
Epoch: 1, Step: 143, Rank: 2, loss = 0.15722858905792236
Epoch: 1, Step: 143, Rank: 3, loss = 0.6341390013694763
Epoch: 1, Step: 143, Rank: 0, loss = 0.005271907430142164
Epoch: 1, Step: 143, Rank: 4, loss = 0.8459545969963074
Epoch: 1, Step: 143, Rank: 1, loss = 0.04532890021800995
Epoch: 1, Step: 143, Rank: 7, loss = 0.9513773322105408
Per-token loss scaled by world size: 0.0005416726926341653
Epoch: 1, Step: 143, Rank: 6, loss = 1.3458534479141235
total tokens: 7796 num samples: 4 num padding tokens: 508 - rank: 1 max len: 1949 min len: 1564 avg len: 1822.0 num_loss_counted_tokens: 2068
total tokens: 7893 num samples: 9 num padding tokens: 666 - rank: 4 max len: 877 min len: 712 avg len: 803.0 num_loss_counted_tokens: 5521
total tokens: 7679 num samples: 7 num padding tokens: 655 - rank: 3 max len: 1097 min len: 888 avg len: 1003.4285714285714 num_loss_counted_tokens: 5783
{
"epoch": 1,
"step": 143,
"rank": 0,
"loss": 0.005271907430142164,
"overall_throughput": 42.46446277162131,
"lr": 2.4000000000000003e-06,
"cuda_mem_allocated": 24.28184461593628,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19877,
"batch_size": 76,
"total_loss": 0.6629791259765625,
"gradnorm": 1.007487177848816,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:25.071211"
}
total tokens: 2587 num samples: 13 num padding tokens: 827 - rank: 7 max len: 199 min len: 78 avg len: 135.3846153846154 num_loss_counted_tokens: 674
total tokens: 7755 num samples: 11 num padding tokens: 897 - rank: 5 max len: 705 min len: 483 avg len: 623.4545454545455 num_loss_counted_tokens: 3835
total tokens: 8028 num samples: 18 num padding tokens: 2679 - rank: 6 max len: 446 min len: 203 avg len: 297.1666666666667 num_loss_counted_tokens: 3221
total tokens: 7772 num samples: 2 num padding tokens: 1840 - rank: 0 max len: 3886 min len: 2046 avg len: 2966.0 num_loss_counted_tokens: 2038
total tokens: 7080 num samples: 5 num padding tokens: 517 - rank: 2 max len: 1416 min len: 1149 avg len: 1312.6 num_loss_counted_tokens: 2382
Per-token loss scaled by world size: 0.00031199524528346956
Per-token loss scaled by world size: 8.699101454112679e-05
Per-token loss scaled by world size: 1.0524022400204558e-06
Per-token loss scaled by world size: 0.00040260597597807646
Per-token loss scaled by world size: 0.0002991097862832248
Per-token loss scaled by world size: 9.484303154749796e-05
Epoch: 1, Step: 144, Rank: 1, loss = 0.21871715784072876
Epoch: 1, Step: 144, Rank: 0, loss = 0.0026460024528205395
Epoch: 1, Step: 144, Rank: 2, loss = 0.23845909535884857
Epoch: 1, Step: 144, Rank: 4, loss = 1.0122520923614502
Epoch: 1, Step: 144, Rank: 7, loss = 0.7844340801239014
Epoch: 1, Step: 144, Rank: 3, loss = 0.7520367503166199
Per-token loss scaled by world size: 0.0005959446425549686
Per-token loss scaled by world size: 0.00046541052870452404
Epoch: 1, Step: 144, Rank: 5, loss = 1.4983538389205933
Epoch: 1, Step: 144, Rank: 6, loss = 1.1701583862304688
[2024-08-18 20:54:27,526] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[3.2000000000000003e-06], mom=[(0.9, 0.95)]
[2024-08-18 20:54:27,603] [INFO] [timer.py:258:stop] epoch=0/micro_step=144/global_step=4, RunningAvgSamplesPerSec=41.73388834459918, CurrSamplesPerSec=41.83623290194428, MemAllocated=22.69GB, MaxMemAllocated=30.58GB
total tokens: 7693 num samples: 7 num padding tokens: 967 - rank: 4 max len: 1099 min len: 892 avg len: 960.8571428571429 num_loss_counted_tokens: 3764
total tokens: 8019 num samples: 3 num padding tokens: 876 - rank: 1 max len: 2673 min len: 2126 avg len: 2381.0 num_loss_counted_tokens: 283
total tokens: 8005 num samples: 5 num padding tokens: 1366 - rank: 3 max len: 1601 min len: 1136 avg len: 1327.8 num_loss_counted_tokens: 1813
{
"epoch": 1,
"step": 144,
"rank": 0,
"loss": 0.0026460024528205395,
"overall_throughput": 41.11764673168018,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 22.69074296951294,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20114,
"batch_size": 79,
"total_loss": 0.709632158279419,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:27.606216"
}
total tokens: 8100 num samples: 15 num padding tokens: 1493 - rank: 6 max len: 540 min len: 336 avg len: 440.46666666666664 num_loss_counted_tokens: 3689
total tokens: 8000 num samples: 25 num padding tokens: 2965 - rank: 7 max len: 320 min len: 78 avg len: 201.4 num_loss_counted_tokens: 2308
total tokens: 7860 num samples: 4 num padding tokens: 801 - rank: 2 max len: 1965 min len: 1608 avg len: 1764.75 num_loss_counted_tokens: 4235
total tokens: 7858 num samples: 2 num padding tokens: 1014 - rank: 0 max len: 3929 min len: 2915 avg len: 3422.0 num_loss_counted_tokens: 226
total tokens: 7686 num samples: 9 num padding tokens: 1386 - rank: 5 max len: 854 min len: 588 avg len: 700.0 num_loss_counted_tokens: 3557
Per-token loss scaled by world size: 0.00021755551279056817
Per-token loss scaled by world size: 0.00011090948828496039
Per-token loss scaled by world size: 0.0003043776086997241
Per-token loss scaled by world size: 0.00020149040210526437
Per-token loss scaled by world size: 1.0487364079381223e-06
Per-token loss scaled by world size: 0.000273953570285812
Per-token loss scaled by world size: 0.00017797687905840576
Epoch: 1, Step: 145, Rank: 5, loss = 1.0936287641525269
Epoch: 1, Step: 145, Rank: 6, loss = 0.7239550352096558
Epoch: 1, Step: 145, Rank: 3, loss = 0.7816769480705261
Epoch: 1, Step: 145, Rank: 0, loss = 0.0037681099493056536
Epoch: 1, Step: 145, Rank: 1, loss = 0.3984977900981903
Epoch: 1, Step: 145, Rank: 7, loss = 0.6394709348678589
Epoch: 1, Step: 145, Rank: 4, loss = 0.9843152165412903
Per-token loss scaled by world size: 0.00025657241349108517
Epoch: 1, Step: 145, Rank: 2, loss = 0.9218646883964539
total tokens: 5826 num samples: 2 num padding tokens: 6 - rank: 1 max len: 2913 min len: 2907 avg len: 2910.0 num_loss_counted_tokens: 567
total tokens: 7812 num samples: 7 num padding tokens: 1611 - rank: 4 max len: 1116 min len: 752 avg len: 885.8571428571429 num_loss_counted_tokens: 3680
{
"epoch": 1,
"step": 145,
"rank": 0,
"loss": 0.0037681099493056536,
"overall_throughput": 41.9281555047958,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.221776962280273,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 28744,
"batch_size": 94,
"total_loss": 0.6933972239494324,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:30.179291"
}
total tokens: 5792 num samples: 2 num padding tokens: 692 - rank: 2 max len: 2896 min len: 2204 avg len: 2550.0 num_loss_counted_tokens: 520
total tokens: 7612 num samples: 11 num padding tokens: 980 - rank: 5 max len: 692 min len: 509 avg len: 602.9090909090909 num_loss_counted_tokens: 4697
total tokens: 7952 num samples: 16 num padding tokens: 1525 - rank: 6 max len: 497 min len: 266 avg len: 401.6875 num_loss_counted_tokens: 4516
total tokens: 6225 num samples: 25 num padding tokens: 2041 - rank: 7 max len: 249 min len: 75 avg len: 167.36 num_loss_counted_tokens: 1838
total tokens: 6171 num samples: 3 num padding tokens: 1311 - rank: 3 max len: 2057 min len: 1365 avg len: 1620.0 num_loss_counted_tokens: 217
total tokens: 7190 num samples: 2 num padding tokens: 292 - rank: 0 max len: 3595 min len: 3303 avg len: 3449.0 num_loss_counted_tokens: 179
Per-token loss scaled by world size: 0.00040030613308772445
Per-token loss scaled by world size: 2.0256973130017286e-06
Per-token loss scaled by world size: 4.8830220293893944e-06
Per-token loss scaled by world size: 0.0004052049189340323
Per-token loss scaled by world size: 0.0007852399721741676
Per-token loss scaled by world size: 0.00055212079314515
Per-token loss scaled by world size: 0.0006642856751568615
Epoch: 1, Step: 146, Rank: 3, loss = 0.004231428261846304
Epoch: 1, Step: 146, Rank: 6, loss = 1.640268087387085
Epoch: 1, Step: 146, Rank: 2, loss = 0.8361894488334656
Epoch: 1, Step: 146, Rank: 5, loss = 1.3876097202301025
Epoch: 1, Step: 146, Rank: 7, loss = 0.8464224338531494
Epoch: 1, Step: 146, Rank: 4, loss = 1.1533113718032837
Epoch: 1, Step: 146, Rank: 1, loss = 0.010200022719800472
Per-token loss scaled by world size: 5.620273441309109e-05
Epoch: 1, Step: 146, Rank: 0, loss = 0.11740048974752426
total tokens: 5644 num samples: 2 num padding tokens: 996 - rank: 1 max len: 2822 min len: 1826 avg len: 2324.0 num_loss_counted_tokens: 207
total tokens: 7605 num samples: 9 num padding tokens: 737 - rank: 4 max len: 845 min len: 701 avg len: 763.1111111111111 num_loss_counted_tokens: 4932
{
"epoch": 1,
"step": 146,
"rank": 0,
"loss": 0.11740048974752426,
"overall_throughput": 40.05446798320092,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.520647048950195,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 16711,
"batch_size": 66,
"total_loss": 0.749454140663147,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:32.780987"
}
total tokens: 6615 num samples: 27 num padding tokens: 2358 - rank: 7 max len: 245 min len: 88 avg len: 157.66666666666666 num_loss_counted_tokens: 1644
total tokens: 6744 num samples: 4 num padding tokens: 435 - rank: 2 max len: 1686 min len: 1368 avg len: 1577.25 num_loss_counted_tokens: 992
total tokens: 6300 num samples: 2 num padding tokens: 261 - rank: 0 max len: 3150 min len: 2889 avg len: 3019.5 num_loss_counted_tokens: 190
total tokens: 6790 num samples: 5 num padding tokens: 1037 - rank: 3 max len: 1358 min len: 1051 avg len: 1150.6 num_loss_counted_tokens: 2865
total tokens: 7872 num samples: 12 num padding tokens: 1248 - rank: 5 max len: 656 min len: 441 avg len: 552.0 num_loss_counted_tokens: 3289
total tokens: 7776 num samples: 18 num padding tokens: 1655 - rank: 6 max len: 432 min len: 260 avg len: 340.05555555555554 num_loss_counted_tokens: 3056
Per-token loss scaled by world size: 0.0007255689124576747
Per-token loss scaled by world size: 0.0006010388606227934
Per-token loss scaled by world size: 0.0007319062133319676
Per-token loss scaled by world size: 7.5493703661777545e-06
Per-token loss scaled by world size: 0.00025839314912445843
Per-token loss scaled by world size: 5.159122338227462e-06
Per-token loss scaled by world size: 7.537942292401567e-05
Epoch: 1, Step: 147, Rank: 5, loss = 1.5308597087860107
Epoch: 1, Step: 147, Rank: 6, loss = 1.544230580329895
Epoch: 1, Step: 147, Rank: 4, loss = 1.26811683177948
Epoch: 1, Step: 147, Rank: 2, loss = 0.010885103605687618
Epoch: 1, Step: 147, Rank: 1, loss = 0.015928227454423904
Epoch: 1, Step: 147, Rank: 0, loss = 0.159041166305542
Epoch: 1, Step: 147, Rank: 7, loss = 0.5451772212982178
Per-token loss scaled by world size: 0.000249014439759776
Epoch: 1, Step: 147, Rank: 3, loss = 0.5253893136978149
total tokens: 5520 num samples: 2 num padding tokens: 43 - rank: 1 max len: 2760 min len: 2717 avg len: 2738.5 num_loss_counted_tokens: 221
Epoch 1: 21%|██▏ | 26/122 [01:06<04:05, 2.56s/it]
{
"epoch": 1,
"step": 147,
"rank": 0,
"loss": 0.159041166305542,
"overall_throughput": 41.27657023996106,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.05473041534424,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 16879,
"batch_size": 60,
"total_loss": 0.699953556060791,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:35.343018"
}
total tokens: 7721 num samples: 7 num padding tokens: 1814 - rank: 4 max len: 1103 min len: 719 avg len: 843.8571428571429 num_loss_counted_tokens: 3306
total tokens: 6106 num samples: 2 num padding tokens: 283 - rank: 0 max len: 3053 min len: 2770 avg len: 2911.5 num_loss_counted_tokens: 189
total tokens: 8064 num samples: 32 num padding tokens: 2634 - rank: 7 max len: 252 min len: 83 avg len: 169.6875 num_loss_counted_tokens: 2206
total tokens: 7017 num samples: 3 num padding tokens: 592 - rank: 2 max len: 2339 min len: 2001 avg len: 2141.6666666666665 num_loss_counted_tokens: 257
total tokens: 7960 num samples: 5 num padding tokens: 1198 - rank: 3 max len: 1592 min len: 1124 avg len: 1352.4 num_loss_counted_tokens: 3699
total tokens: 7840 num samples: 16 num padding tokens: 2136 - rank: 6 max len: 490 min len: 253 avg len: 356.5 num_loss_counted_tokens: 2793
total tokens: 7744 num samples: 11 num padding tokens: 773 - rank: 5 max len: 704 min len: 499 avg len: 633.7272727272727 num_loss_counted_tokens: 4777
Per-token loss scaled by world size: 0.00016060298366937786
Per-token loss scaled by world size: 0.0004763362812809646
Per-token loss scaled by world size: 1.0486909332030336e-06
Per-token loss scaled by world size: 0.00025925057707354426
Per-token loss scaled by world size: 0.00013205081631895155
Per-token loss scaled by world size: 0.0002572258817963302
Per-token loss scaled by world size: 0.0002500153495930135
Epoch: 1, Step: 148, Rank: 5, loss = 1.6056700944900513
Epoch: 1, Step: 148, Rank: 0, loss = 0.003535006195306778
Epoch: 1, Step: 148, Rank: 2, loss = 0.5413725972175598
Epoch: 1, Step: 148, Rank: 6, loss = 0.8739013075828552
Epoch: 1, Step: 148, Rank: 4, loss = 0.8670763373374939
Epoch: 1, Step: 148, Rank: 7, loss = 0.8427704572677612
Epoch: 1, Step: 148, Rank: 1, loss = 0.44512680172920227
Per-token loss scaled by world size: 0.00031931744888424873
Epoch: 1, Step: 148, Rank: 3, loss = 1.0763791799545288
total tokens: 7304 num samples: 8 num padding tokens: 608 - rank: 4 max len: 913 min len: 793 avg len: 837.0 num_loss_counted_tokens: 4081
total tokens: 7014 num samples: 3 num padding tokens: 554 - rank: 1 max len: 2338 min len: 1840 avg len: 2153.3333333333335 num_loss_counted_tokens: 349
{
"epoch": 1,
"step": 148,
"rank": 0,
"loss": 0.003535006195306778,
"overall_throughput": 40.32522306984037,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.535661697387695,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26967,
"batch_size": 89,
"total_loss": 0.781978964805603,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:37.948159"
}
total tokens: 7966 num samples: 14 num padding tokens: 3196 - rank: 6 max len: 569 min len: 266 avg len: 340.7142857142857 num_loss_counted_tokens: 2847
total tokens: 6400 num samples: 25 num padding tokens: 2104 - rank: 7 max len: 256 min len: 75 avg len: 171.84 num_loss_counted_tokens: 1881
total tokens: 7720 num samples: 10 num padding tokens: 795 - rank: 5 max len: 772 min len: 616 avg len: 692.5 num_loss_counted_tokens: 5407
total tokens: 5704 num samples: 2 num padding tokens: 490 - rank: 0 max len: 2852 min len: 2362 avg len: 2607.0 num_loss_counted_tokens: 206
total tokens: 7476 num samples: 6 num padding tokens: 582 - rank: 2 max len: 1246 min len: 1054 avg len: 1149.0 num_loss_counted_tokens: 3350
total tokens: 7350 num samples: 7 num padding tokens: 419 - rank: 3 max len: 1050 min len: 921 avg len: 990.1428571428571 num_loss_counted_tokens: 4766
Per-token loss scaled by world size: 0.00031092012068256736
Per-token loss scaled by world size: 0.00025245780125260353
Per-token loss scaled by world size: 0.0003971010446548462
Per-token loss scaled by world size: 0.00022266971063800156
Per-token loss scaled by world size: 0.0001838229363784194
Per-token loss scaled by world size: 0.0004638316167984158
Per-token loss scaled by world size: 4.563625225273427e-06
Epoch: 1, Step: 149, Rank: 4, loss = 0.7849859595298767
Epoch: 1, Step: 149, Rank: 5, loss = 1.2347360849380493
Epoch: 1, Step: 149, Rank: 1, loss = 0.6923636198043823
Epoch: 1, Step: 149, Rank: 7, loss = 0.9667672514915466
Epoch: 1, Step: 149, Rank: 2, loss = 0.5715744495391846
Epoch: 1, Step: 149, Rank: 3, loss = 1.4422264099121094
Epoch: 1, Step: 149, Rank: 0, loss = 0.014190022833645344
Per-token loss scaled by world size: 0.00031763844890519977
Epoch: 1, Step: 149, Rank: 6, loss = 0.9876570105552673
total tokens: 7170 num samples: 3 num padding tokens: 1425 - rank: 1 max len: 2390 min len: 1672 avg len: 1915.0 num_loss_counted_tokens: 1941
total tokens: 7462 num samples: 7 num padding tokens: 447 - rank: 4 max len: 1066 min len: 923 avg len: 1002.1428571428571 num_loss_counted_tokens: 3761
{
"epoch": 1,
"step": 149,
"rank": 0,
"loss": 0.014190022833645344,
"overall_throughput": 42.217359471248905,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.2309513092041,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24875,
"batch_size": 89,
"total_loss": 0.8368127346038818,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:40.455161"
}
total tokens: 7720 num samples: 5 num padding tokens: 320 - rank: 2 max len: 1544 min len: 1403 avg len: 1480.0 num_loss_counted_tokens: 2899
total tokens: 7384 num samples: 8 num padding tokens: 1322 - rank: 5 max len: 923 min len: 593 avg len: 757.75 num_loss_counted_tokens: 4939
total tokens: 7968 num samples: 24 num padding tokens: 3134 - rank: 7 max len: 332 min len: 91 avg len: 201.41666666666666 num_loss_counted_tokens: 2053
total tokens: 6318 num samples: 2 num padding tokens: 100 - rank: 0 max len: 3159 min len: 3059 avg len: 3109.0 num_loss_counted_tokens: 160
total tokens: 6785 num samples: 5 num padding tokens: 341 - rank: 3 max len: 1357 min len: 1186 avg len: 1288.8 num_loss_counted_tokens: 5137
total tokens: 7602 num samples: 14 num padding tokens: 1273 - rank: 6 max len: 543 min len: 371 avg len: 452.07142857142856 num_loss_counted_tokens: 4011
Per-token loss scaled by world size: 0.00026751268887892365
Per-token loss scaled by world size: 0.00014292483683675528
Per-token loss scaled by world size: 0.0001920466311275959
Per-token loss scaled by world size: 0.0004723104939330369
Per-token loss scaled by world size: 2.162046712328447e-06
Per-token loss scaled by world size: 0.0003416259423829615
Per-token loss scaled by world size: 0.0003082101175095886
Epoch: 1, Step: 150, Rank: 3, loss = 0.7713059782981873
Epoch: 1, Step: 150, Rank: 0, loss = 0.0062337215058505535
Epoch: 1, Step: 150, Rank: 2, loss = 0.5537184476852417
Epoch: 1, Step: 150, Rank: 5, loss = 1.3617892265319824
Epoch: 1, Step: 150, Rank: 1, loss = 0.4120880365371704
Epoch: 1, Step: 150, Rank: 7, loss = 0.9849929809570312
Epoch: 1, Step: 150, Rank: 4, loss = 0.8886467814445496
Per-token loss scaled by world size: 0.00027262946241535246
Epoch: 1, Step: 150, Rank: 6, loss = 0.7860589027404785
total tokens: 7392 num samples: 3 num padding tokens: 958 - rank: 1 max len: 2464 min len: 1969 avg len: 2144.6666666666665 num_loss_counted_tokens: 582
total tokens: 7800 num samples: 10 num padding tokens: 762 - rank: 4 max len: 780 min len: 667 avg len: 703.8 num_loss_counted_tokens: 4946
{
"epoch": 1,
"step": 150,
"rank": 0,
"loss": 0.0062337215058505535,
"overall_throughput": 41.782778937325965,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.4852614402771,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23066,
"batch_size": 89,
"total_loss": 0.7206042408943176,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:42.989502"
}
total tokens: 7866 num samples: 19 num padding tokens: 2154 - rank: 6 max len: 414 min len: 221 avg len: 300.63157894736844 num_loss_counted_tokens: 2912
total tokens: 7504 num samples: 4 num padding tokens: 1167 - rank: 2 max len: 1876 min len: 1283 avg len: 1584.25 num_loss_counted_tokens: 741
total tokens: 8016 num samples: 8 num padding tokens: 721 - rank: 3 max len: 1002 min len: 795 avg len: 911.875 num_loss_counted_tokens: 6005
total tokens: 5805 num samples: 27 num padding tokens: 2202 - rank: 7 max len: 215 min len: 71 avg len: 133.44444444444446 num_loss_counted_tokens: 1404
total tokens: 6802 num samples: 2 num padding tokens: 541 - rank: 0 max len: 3401 min len: 2860 avg len: 3130.5 num_loss_counted_tokens: 166
total tokens: 7920 num samples: 12 num padding tokens: 1266 - rank: 5 max len: 660 min len: 445 avg len: 554.5 num_loss_counted_tokens: 4048
Per-token loss scaled by world size: 0.0004141188692301512
Per-token loss scaled by world size: 0.0002827317512128502
Per-token loss scaled by world size: 0.00034681695979088545
Per-token loss scaled by world size: 0.0004108196299057454
Per-token loss scaled by world size: 1.1767973546739086e-06
Per-token loss scaled by world size: 9.391092316946015e-05
Per-token loss scaled by world size: 0.00017690712411422282
Epoch: 1, Step: 151, Rank: 3, loss = 1.0656384229660034
Epoch: 1, Step: 151, Rank: 5, loss = 1.2724319696426392
Epoch: 1, Step: 151, Rank: 4, loss = 0.8687286376953125
Epoch: 1, Step: 151, Rank: 0, loss = 0.003615856869146228
Epoch: 1, Step: 151, Rank: 6, loss = 1.2622946500778198
Epoch: 1, Step: 151, Rank: 7, loss = 0.5435692667961121
Epoch: 1, Step: 151, Rank: 1, loss = 0.28855305910110474
Per-token loss scaled by world size: 0.0003057793073821813
Epoch: 1, Step: 151, Rank: 2, loss = 0.9395451545715332
total tokens: 8016 num samples: 8 num padding tokens: 1103 - rank: 4 max len: 1002 min len: 696 avg len: 864.125 num_loss_counted_tokens: 4285
total tokens: 7425 num samples: 3 num padding tokens: 1417 - rank: 1 max len: 2475 min len: 1754 avg len: 2002.6666666666667 num_loss_counted_tokens: 868
{
"epoch": 1,
"step": 151,
"rank": 0,
"loss": 0.003615856869146228,
"overall_throughput": 41.2486467947231,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.402647495269775,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24581,
"batch_size": 87,
"total_loss": 0.780547022819519,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:45.555476"
}
total tokens: 8046 num samples: 6 num padding tokens: 1241 - rank: 3 max len: 1341 min len: 1041 avg len: 1134.1666666666667 num_loss_counted_tokens: 2869
total tokens: 7560 num samples: 30 num padding tokens: 2626 - rank: 7 max len: 252 min len: 77 avg len: 164.46666666666667 num_loss_counted_tokens: 2002
total tokens: 7755 num samples: 15 num padding tokens: 2376 - rank: 6 max len: 517 min len: 262 avg len: 358.6 num_loss_counted_tokens: 3077
total tokens: 6984 num samples: 4 num padding tokens: 492 - rank: 2 max len: 1746 min len: 1432 avg len: 1623.0 num_loss_counted_tokens: 2214
total tokens: 7491 num samples: 11 num padding tokens: 868 - rank: 5 max len: 681 min len: 536 avg len: 602.0909090909091 num_loss_counted_tokens: 3815
total tokens: 7306 num samples: 2 num padding tokens: 786 - rank: 0 max len: 3653 min len: 2867 avg len: 3260.0 num_loss_counted_tokens: 160
Per-token loss scaled by world size: 0.00020859052892774343
Per-token loss scaled by world size: 0.0006468938081525266
Per-token loss scaled by world size: 0.00038840470369905233
Per-token loss scaled by world size: 8.114238880807534e-05
Per-token loss scaled by world size: 4.119947334402241e-05
Per-token loss scaled by world size: 0.0005277044838294387
Per-token loss scaled by world size: 3.4815836897905683e-06
Epoch: 1, Step: 152, Rank: 5, loss = 1.5654021501541138
Epoch: 1, Step: 152, Rank: 3, loss = 0.5047630071640015
Epoch: 1, Step: 152, Rank: 2, loss = 0.1963544338941574
Epoch: 1, Step: 152, Rank: 7, loss = 0.9398908615112305
Epoch: 1, Step: 152, Rank: 4, loss = 1.276978850364685
Epoch: 1, Step: 152, Rank: 0, loss = 0.008424997329711914
Epoch: 1, Step: 152, Rank: 1, loss = 0.09969757497310638
Per-token loss scaled by world size: 0.0005147996125742793
Epoch: 1, Step: 152, Rank: 6, loss = 1.2457506656646729
total tokens: 5714 num samples: 2 num padding tokens: 255 - rank: 1 max len: 2857 min len: 2602 avg len: 2729.5 num_loss_counted_tokens: 173
total tokens: 7479 num samples: 9 num padding tokens: 752 - rank: 4 max len: 831 min len: 644 avg len: 747.4444444444445 num_loss_counted_tokens: 4908
{
"epoch": 1,
"step": 152,
"rank": 0,
"loss": 0.008424997329711914,
"overall_throughput": 43.280048077519346,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.22560167312622,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19359,
"batch_size": 61,
"total_loss": 0.7296578288078308,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:48.003629"
}
total tokens: 7932 num samples: 6 num padding tokens: 1038 - rank: 2 max len: 1322 min len: 990 avg len: 1149.0 num_loss_counted_tokens: 3464
total tokens: 6662 num samples: 2 num padding tokens: 247 - rank: 0 max len: 3331 min len: 3084 avg len: 3207.5 num_loss_counted_tokens: 180
total tokens: 3819 num samples: 19 num padding tokens: 1446 - rank: 7 max len: 201 min len: 76 avg len: 124.89473684210526 num_loss_counted_tokens: 686
total tokens: 8052 num samples: 22 num padding tokens: 2171 - rank: 6 max len: 366 min len: 201 avg len: 267.3181818181818 num_loss_counted_tokens: 3024
total tokens: 7912 num samples: 8 num padding tokens: 615 - rank: 3 max len: 989 min len: 855 avg len: 912.125 num_loss_counted_tokens: 4585
total tokens: 8047 num samples: 13 num padding tokens: 1226 - rank: 5 max len: 619 min len: 375 avg len: 524.6923076923077 num_loss_counted_tokens: 5490
Per-token loss scaled by world size: 0.0002795422915369272
Per-token loss scaled by world size: 0.00013167920405976474
Per-token loss scaled by world size: 0.0003253524482715875
Per-token loss scaled by world size: 0.00030560040613636374
Per-token loss scaled by world size: 0.00012384731962811202
Per-token loss scaled by world size: 1.194436777041119e-06
Epoch: 1, Step: 153, Rank: 3, loss = 1.0706535577774048
Per-token loss scaled by world size: 0.0002909142931457609
Epoch: 1, Step: 153, Rank: 1, loss = 0.407550573348999
Epoch: 1, Step: 153, Rank: 7, loss = 0.43332335352897644
Epoch: 1, Step: 153, Rank: 4, loss = 1.0056545734405518
Epoch: 1, Step: 153, Rank: 6, loss = 0.9199037551879883
Epoch: 1, Step: 153, Rank: 0, loss = 0.003930592909455299
Per-token loss scaled by world size: 0.0003188060945831239
Epoch: 1, Step: 153, Rank: 2, loss = 0.9573261737823486
Epoch: 1, Step: 153, Rank: 5, loss = 1.0491111278533936
total tokens: 7839 num samples: 9 num padding tokens: 793 - rank: 4 max len: 871 min len: 691 avg len: 782.8888888888889 num_loss_counted_tokens: 5172
total tokens: 6369 num samples: 3 num padding tokens: 846 - rank: 1 max len: 2123 min len: 1506 avg len: 1841.0 num_loss_counted_tokens: 2162
{
"epoch": 1,
"step": 153,
"rank": 0,
"loss": 0.003930592909455299,
"overall_throughput": 41.68838808478364,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.49734401702881,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26326,
"batch_size": 95,
"total_loss": 0.7309317588806152,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:50.537829"
}
total tokens: 7215 num samples: 5 num padding tokens: 508 - rank: 2 max len: 1443 min len: 1149 avg len: 1341.4 num_loss_counted_tokens: 3549
total tokens: 7714 num samples: 29 num padding tokens: 2672 - rank: 7 max len: 266 min len: 79 avg len: 173.86206896551724 num_loss_counted_tokens: 2096
total tokens: 8004 num samples: 3 num padding tokens: 275 - rank: 0 max len: 2668 min len: 2403 avg len: 2576.3333333333335 num_loss_counted_tokens: 964
total tokens: 7700 num samples: 7 num padding tokens: 838 - rank: 3 max len: 1100 min len: 894 avg len: 980.2857142857143 num_loss_counted_tokens: 4738
total tokens: 8076 num samples: 12 num padding tokens: 1148 - rank: 5 max len: 673 min len: 516 avg len: 577.3333333333334 num_loss_counted_tokens: 4820
total tokens: 8032 num samples: 16 num padding tokens: 2150 - rank: 6 max len: 502 min len: 274 avg len: 367.625 num_loss_counted_tokens: 3962
Per-token loss scaled by world size: 0.00019867185619659722
Per-token loss scaled by world size: 0.0002461467229295522
Per-token loss scaled by world size: 0.00024380745890084654
Per-token loss scaled by world size: 0.00019991688895970583
Per-token loss scaled by world size: 6.630049756495282e-05
Per-token loss scaled by world size: 1.6863944551914756e-07
Epoch: 1, Step: 154, Rank: 2, loss = 0.7483974695205688
Epoch: 1, Step: 154, Rank: 3, loss = 0.6136698722839355
Epoch: 1, Step: 154, Rank: 4, loss = 0.7555781602859497
Epoch: 1, Step: 154, Rank: 6, loss = 0.6098480820655823
Per-token loss scaled by world size: 0.00021128085791133344
Epoch: 1, Step: 154, Rank: 1, loss = 0.20351766049861908
Epoch: 1, Step: 154, Rank: 0, loss = 0.0005176598788239062
Per-token loss scaled by world size: 0.0005612249951809645
Epoch: 1, Step: 154, Rank: 7, loss = 0.6485530138015747
Epoch: 1, Step: 154, Rank: 5, loss = 1.7227503061294556
total tokens: 7904 num samples: 4 num padding tokens: 814 - rank: 1 max len: 1976 min len: 1533 avg len: 1772.5 num_loss_counted_tokens: 2873
total tokens: 7744 num samples: 11 num padding tokens: 596 - rank: 4 max len: 704 min len: 594 avg len: 649.8181818181819 num_loss_counted_tokens: 5417
{
"epoch": 1,
"step": 154,
"rank": 0,
"loss": 0.0005176598788239062,
"overall_throughput": 42.108806987198356,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.21819305419922,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24557,
"batch_size": 80,
"total_loss": 0.662854015827179,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:53.052708"
}
total tokens: 7440 num samples: 31 num padding tokens: 3180 - rank: 7 max len: 240 min len: 74 avg len: 137.41935483870967 num_loss_counted_tokens: 1408
total tokens: 7020 num samples: 6 num padding tokens: 1084 - rank: 2 max len: 1170 min len: 846 avg len: 989.3333333333334 num_loss_counted_tokens: 1599
total tokens: 8097 num samples: 3 num padding tokens: 808 - rank: 0 max len: 2699 min len: 2025 avg len: 2429.6666666666665 num_loss_counted_tokens: 947
total tokens: 7999 num samples: 19 num padding tokens: 1990 - rank: 6 max len: 421 min len: 240 avg len: 316.2631578947368 num_loss_counted_tokens: 3596
total tokens: 7657 num samples: 13 num padding tokens: 1095 - rank: 5 max len: 589 min len: 434 avg len: 504.7692307692308 num_loss_counted_tokens: 4851
total tokens: 7560 num samples: 9 num padding tokens: 485 - rank: 3 max len: 840 min len: 716 avg len: 786.1111111111111 num_loss_counted_tokens: 5656
Per-token loss scaled by world size: 0.0003278250514995307
Per-token loss scaled by world size: 0.0003366835881024599
Per-token loss scaled by world size: 0.0003885742917191237
Per-token loss scaled by world size: 0.00019582045206334442
Per-token loss scaled by world size: 0.00032262562308460474
Per-token loss scaled by world size: 0.00013199940440244973
Per-token loss scaled by world size: 3.822985672741197e-05
Epoch: 1, Step: 155, Rank: 5, loss = 1.060516357421875
Epoch: 1, Step: 155, Rank: 4, loss = 0.8947165012359619
Epoch: 1, Step: 155, Rank: 7, loss = 0.9188936948776245
Epoch: 1, Step: 155, Rank: 2, loss = 0.5344429612159729
Epoch: 1, Step: 155, Rank: 3, loss = 0.8805260062217712
Epoch: 1, Step: 155, Rank: 1, loss = 0.36025938391685486
Epoch: 1, Step: 155, Rank: 0, loss = 0.10433883965015411
Per-token loss scaled by world size: 0.00041954353218898177
Epoch: 1, Step: 155, Rank: 6, loss = 1.1450392007827759
total tokens: 8064 num samples: 12 num padding tokens: 820 - rank: 4 max len: 672 min len: 523 avg len: 603.6666666666666 num_loss_counted_tokens: 5303
total tokens: 6845 num samples: 5 num padding tokens: 859 - rank: 1 max len: 1369 min len: 1106 avg len: 1197.2 num_loss_counted_tokens: 4981
{
"epoch": 1,
"step": 155,
"rank": 0,
"loss": 0.10433883965015411,
"overall_throughput": 41.90741869411001,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.32686471939087,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21834,
"batch_size": 76,
"total_loss": 0.7373416423797607,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:55.565723"
}
total tokens: 7539 num samples: 7 num padding tokens: 396 - rank: 2 max len: 1077 min len: 913 avg len: 1020.4285714285714 num_loss_counted_tokens: 3046
total tokens: 8118 num samples: 9 num padding tokens: 1166 - rank: 3 max len: 902 min len: 687 avg len: 772.4444444444445 num_loss_counted_tokens: 4403
total tokens: 7740 num samples: 15 num padding tokens: 1204 - rank: 5 max len: 516 min len: 339 avg len: 435.73333333333335 num_loss_counted_tokens: 4055
total tokens: 7732 num samples: 2 num padding tokens: 2186 - rank: 0 max len: 3866 min len: 1680 avg len: 2773.0 num_loss_counted_tokens: 1642
total tokens: 6720 num samples: 30 num padding tokens: 2251 - rank: 7 max len: 224 min len: 83 avg len: 148.96666666666667 num_loss_counted_tokens: 1606
total tokens: 7872 num samples: 24 num padding tokens: 1261 - rank: 6 max len: 328 min len: 226 avg len: 275.4583333333333 num_loss_counted_tokens: 3708
Per-token loss scaled by world size: 0.00035203597508370876
Per-token loss scaled by world size: 0.0004669471236411482
Per-token loss scaled by world size: 0.00028399238362908363
Per-token loss scaled by world size: 0.0006751486216671765
Per-token loss scaled by world size: 8.408135727222543e-06
Per-token loss scaled by world size: 0.0003578344185370952
Epoch: 1, Step: 156, Rank: 2, loss = 0.6541054248809814
Epoch: 1, Step: 156, Rank: 3, loss = 1.075495958328247
Epoch: 1, Step: 156, Rank: 6, loss = 1.5550360679626465
Epoch: 1, Step: 156, Rank: 5, loss = 0.81082683801651
Epoch: 1, Step: 156, Rank: 0, loss = 0.019366038963198662
Per-token loss scaled by world size: 0.0002267559466417879
Epoch: 1, Step: 156, Rank: 4, loss = 0.8241821527481079
Per-token loss scaled by world size: 2.405712393738213e-06
Epoch: 1, Step: 156, Rank: 7, loss = 0.5222756266593933
Epoch: 1, Step: 156, Rank: 1, loss = 0.00554095720872283
total tokens: 6510 num samples: 3 num padding tokens: 687 - rank: 1 max len: 2170 min len: 1746 avg len: 1941.0 num_loss_counted_tokens: 859
total tokens: 7452 num samples: 9 num padding tokens: 620 - rank: 4 max len: 828 min len: 705 avg len: 759.1111111111111 num_loss_counted_tokens: 5560
total tokens: 8004 num samples: 29 num padding tokens: 2622 - rank: 7 max len: 276 min len: 75 avg len: 185.58620689655172 num_loss_counted_tokens: 2559
{
"epoch": 1,
"step": 156,
"rank": 0,
"loss": 0.019366038963198662,
"overall_throughput": 40.42005133127166,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.42158031463623,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18426,
"batch_size": 65,
"total_loss": 0.6833536028862,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:54:58.183018"
}
total tokens: 7995 num samples: 15 num padding tokens: 1805 - rank: 6 max len: 533 min len: 277 avg len: 412.6666666666667 num_loss_counted_tokens: 4022
total tokens: 7678 num samples: 11 num padding tokens: 680 - rank: 5 max len: 698 min len: 541 avg len: 636.1818181818181 num_loss_counted_tokens: 5943
total tokens: 6816 num samples: 4 num padding tokens: 1254 - rank: 2 max len: 1704 min len: 1064 avg len: 1390.5 num_loss_counted_tokens: 844
total tokens: 7217 num samples: 7 num padding tokens: 458 - rank: 3 max len: 1031 min len: 875 avg len: 965.5714285714286 num_loss_counted_tokens: 3971
total tokens: 5472 num samples: 2 num padding tokens: 101 - rank: 0 max len: 2736 min len: 2635 avg len: 2685.5 num_loss_counted_tokens: 183
Per-token loss scaled by world size: 0.000452109903562814
Per-token loss scaled by world size: 0.0004523490206338465
Per-token loss scaled by world size: 7.593091595481383e-06
Per-token loss scaled by world size: 0.0006878247950226068
Per-token loss scaled by world size: 9.245219553122297e-05
Per-token loss scaled by world size: 0.0003072120016440749
Per-token loss scaled by world size: 0.0005076072411611676
Epoch: 1, Step: 157, Rank: 5, loss = 0.9610720276832581
Epoch: 1, Step: 157, Rank: 6, loss = 1.4613697528839111
Epoch: 1, Step: 157, Rank: 3, loss = 0.6527103185653687
Epoch: 1, Step: 157, Rank: 2, loss = 0.19642624258995056
Epoch: 1, Step: 157, Rank: 4, loss = 0.9605640172958374
Epoch: 1, Step: 157, Rank: 1, loss = 0.016132472082972527
Epoch: 1, Step: 157, Rank: 7, loss = 1.078474998474121
Per-token loss scaled by world size: 3.8059803046053275e-05
Epoch: 1, Step: 157, Rank: 0, loss = 0.08086281269788742
total tokens: 7684 num samples: 4 num padding tokens: 608 - rank: 1 max len: 1921 min len: 1493 avg len: 1769.0 num_loss_counted_tokens: 2916
total tokens: 8118 num samples: 9 num padding tokens: 805 - rank: 4 max len: 902 min len: 707 avg len: 812.5555555555555 num_loss_counted_tokens: 5091
{
"epoch": 1,
"step": 157,
"rank": 0,
"loss": 0.08086281269788742,
"overall_throughput": 41.84706536508769,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.473669052124023,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 16997,
"batch_size": 74,
"total_loss": 0.6759515404701233,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:00.708630"
}
total tokens: 7973 num samples: 17 num padding tokens: 2030 - rank: 6 max len: 469 min len: 276 avg len: 349.5882352941176 num_loss_counted_tokens: 3185
total tokens: 7330 num samples: 5 num padding tokens: 579 - rank: 2 max len: 1466 min len: 1164 avg len: 1350.2 num_loss_counted_tokens: 1678
total tokens: 7656 num samples: 11 num padding tokens: 986 - rank: 5 max len: 696 min len: 545 avg len: 606.3636363636364 num_loss_counted_tokens: 3391
total tokens: 7714 num samples: 29 num padding tokens: 2374 - rank: 7 max len: 266 min len: 78 avg len: 184.13793103448276 num_loss_counted_tokens: 2574
total tokens: 7791 num samples: 7 num padding tokens: 885 - rank: 3 max len: 1113 min len: 906 avg len: 986.5714285714286 num_loss_counted_tokens: 4155
total tokens: 6052 num samples: 2 num padding tokens: 661 - rank: 0 max len: 3026 min len: 2365 avg len: 2695.5 num_loss_counted_tokens: 163
Per-token loss scaled by world size: 0.00020404128008522093
Per-token loss scaled by world size: 0.0002751105057541281
Per-token loss scaled by world size: 0.0002363547682762146
Per-token loss scaled by world size: 0.0002702484780456871
Per-token loss scaled by world size: 0.0001953808678081259
Per-token loss scaled by world size: 0.00024261375074274838
Per-token loss scaled by world size: 1.6901136632441194e-06
Epoch: 1, Step: 158, Rank: 5, loss = 0.8857870101928711
Epoch: 1, Step: 158, Rank: 4, loss = 0.870132565498352
Epoch: 1, Step: 158, Rank: 3, loss = 0.6569619178771973
Epoch: 1, Step: 158, Rank: 7, loss = 0.6290775537490845
Epoch: 1, Step: 158, Rank: 2, loss = 0.7610032558441162
Epoch: 1, Step: 158, Rank: 0, loss = 0.005441743414849043
Epoch: 1, Step: 158, Rank: 1, loss = 0.7811556458473206
Per-token loss scaled by world size: 0.0003746829752344638
Epoch: 1, Step: 158, Rank: 6, loss = 1.2063854932785034
total tokens: 7932 num samples: 3 num padding tokens: 1113 - rank: 1 max len: 2644 min len: 1935 avg len: 2273.0 num_loss_counted_tokens: 430
total tokens: 7551 num samples: 9 num padding tokens: 529 - rank: 4 max len: 839 min len: 726 avg len: 780.2222222222222 num_loss_counted_tokens: 3403
{
"epoch": 1,
"step": 158,
"rank": 0,
"loss": 0.005441743414849043,
"overall_throughput": 42.29815333288017,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.366318225860596,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25758,
"batch_size": 93,
"total_loss": 0.724493145942688,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:03.213265"
}
total tokens: 7185 num samples: 5 num padding tokens: 924 - rank: 2 max len: 1437 min len: 1091 avg len: 1252.2 num_loss_counted_tokens: 3371
total tokens: 6068 num samples: 2 num padding tokens: 230 - rank: 0 max len: 3034 min len: 2804 avg len: 2919.0 num_loss_counted_tokens: 189
total tokens: 7942 num samples: 11 num padding tokens: 1172 - rank: 5 max len: 722 min len: 546 avg len: 615.4545454545455 num_loss_counted_tokens: 3713
total tokens: 6292 num samples: 22 num padding tokens: 2266 - rank: 7 max len: 286 min len: 80 avg len: 183.0 num_loss_counted_tokens: 1812
total tokens: 7856 num samples: 16 num padding tokens: 1583 - rank: 6 max len: 491 min len: 299 avg len: 392.0625 num_loss_counted_tokens: 3733
total tokens: 7308 num samples: 7 num padding tokens: 647 - rank: 3 max len: 1044 min len: 844 avg len: 951.5714285714286 num_loss_counted_tokens: 4260
Per-token loss scaled by world size: 0.0002086303138639778
Per-token loss scaled by world size: 0.00018750393064692616
Per-token loss scaled by world size: 0.00023788934049662203
Per-token loss scaled by world size: 0.00018921871378552169
Per-token loss scaled by world size: 0.00015611379058100283
Per-token loss scaled by world size: 0.00034976963070221245
Per-token loss scaled by world size: 2.962535518236109e-06
Epoch: 1, Step: 159, Rank: 6, loss = 0.6299428939819336
Epoch: 1, Step: 159, Rank: 2, loss = 0.7992189526557922
Epoch: 1, Step: 159, Rank: 5, loss = 1.1750948429107666
Epoch: 1, Step: 159, Rank: 1, loss = 0.5244837999343872
Epoch: 1, Step: 159, Rank: 7, loss = 0.6357039213180542
Epoch: 1, Step: 159, Rank: 4, loss = 0.7009196281433105
Epoch: 1, Step: 159, Rank: 0, loss = 0.009953008033335209
Per-token loss scaled by world size: 0.0001541711390018463
Epoch: 1, Step: 159, Rank: 3, loss = 0.5179572105407715
total tokens: 7248 num samples: 3 num padding tokens: 1157 - rank: 1 max len: 2416 min len: 1450 avg len: 2030.3333333333333 num_loss_counted_tokens: 491
total tokens: 7848 num samples: 9 num padding tokens: 544 - rank: 4 max len: 872 min len: 752 avg len: 811.5555555555555 num_loss_counted_tokens: 5085
{
"epoch": 1,
"step": 159,
"rank": 0,
"loss": 0.009953008033335209,
"overall_throughput": 41.96663008841316,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.325264930725098,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26877,
"batch_size": 82,
"total_loss": 0.6241592168807983,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:05.729512"
}
total tokens: 7830 num samples: 18 num padding tokens: 1959 - rank: 6 max len: 435 min len: 237 avg len: 326.1666666666667 num_loss_counted_tokens: 3267
total tokens: 8118 num samples: 11 num padding tokens: 1074 - rank: 5 max len: 738 min len: 541 avg len: 640.3636363636364 num_loss_counted_tokens: 4228
total tokens: 7938 num samples: 7 num padding tokens: 978 - rank: 3 max len: 1134 min len: 898 avg len: 994.2857142857143 num_loss_counted_tokens: 3454
total tokens: 7010 num samples: 5 num padding tokens: 659 - rank: 2 max len: 1402 min len: 1181 avg len: 1270.2 num_loss_counted_tokens: 5158
total tokens: 6944 num samples: 31 num padding tokens: 2362 - rank: 7 max len: 224 min len: 74 avg len: 147.80645161290323 num_loss_counted_tokens: 1787
total tokens: 6818 num samples: 2 num padding tokens: 131 - rank: 0 max len: 3409 min len: 3278 avg len: 3343.5 num_loss_counted_tokens: 203
Per-token loss scaled by world size: 0.00015841875574551523
Per-token loss scaled by world size: 0.00011544318113010377
Per-token loss scaled by world size: 0.00030993111431598663
Per-token loss scaled by world size: 0.00033960427390411496
Per-token loss scaled by world size: 0.0002961684949696064
Per-token loss scaled by world size: 0.0003685772535391152
Epoch: 1, Step: 160, Rank: 5, loss = 0.9887577295303345
Epoch: 1, Step: 160, Rank: 3, loss = 1.0834225416183472
Epoch: 1, Step: 160, Rank: 6, loss = 0.9448515772819519
Epoch: 1, Step: 160, Rank: 4, loss = 1.1758536100387573
Epoch: 1, Step: 160, Rank: 1, loss = 0.36829259991645813
Epoch: 1, Step: 160, Rank: 2, loss = 0.5053954124450684
Per-token loss scaled by world size: 5.767856782767922e-05
Epoch: 1, Step: 160, Rank: 7, loss = 0.18400904536247253
Per-token loss scaled by world size: 0.00019863103807438165
Epoch: 1, Step: 160, Rank: 0, loss = 0.6336826682090759
total tokens: 7389 num samples: 9 num padding tokens: 610 - rank: 4 max len: 821 min len: 661 avg len: 753.2222222222222 num_loss_counted_tokens: 4659
total tokens: 7668 num samples: 4 num padding tokens: 927 - rank: 1 max len: 1917 min len: 1314 avg len: 1685.25 num_loss_counted_tokens: 682
total tokens: 7945 num samples: 35 num padding tokens: 3133 - rank: 7 max len: 227 min len: 79 avg len: 137.4857142857143 num_loss_counted_tokens: 1541
{
"epoch": 1,
"step": 160,
"rank": 0,
"loss": 0.6336826682090759,
"overall_throughput": 41.55102604571352,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.496148586273193,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25522,
"batch_size": 69,
"total_loss": 0.7355331778526306,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:08.279444"
}
total tokens: 7800 num samples: 12 num padding tokens: 1127 - rank: 5 max len: 650 min len: 421 avg len: 556.0833333333334 num_loss_counted_tokens: 4039
total tokens: 7800 num samples: 6 num padding tokens: 1207 - rank: 2 max len: 1300 min len: 984 avg len: 1098.8333333333333 num_loss_counted_tokens: 3151
total tokens: 8020 num samples: 20 num padding tokens: 2104 - rank: 6 max len: 401 min len: 238 avg len: 295.8 num_loss_counted_tokens: 3264
total tokens: 7640 num samples: 8 num padding tokens: 396 - rank: 3 max len: 955 min len: 852 avg len: 905.5 num_loss_counted_tokens: 6041
total tokens: 5680 num samples: 2 num padding tokens: 585 - rank: 0 max len: 2840 min len: 2255 avg len: 2547.5 num_loss_counted_tokens: 205
Per-token loss scaled by world size: 5.620659976557363e-06
Per-token loss scaled by world size: 0.00039823996485210955
Per-token loss scaled by world size: 0.0004752624372486025
Per-token loss scaled by world size: 0.00014614466635975987
Per-token loss scaled by world size: 0.0004995565977878869
Per-token loss scaled by world size: 0.00032942448160611093
Per-token loss scaled by world size: 0.0004525336844380945
Epoch: 1, Step: 161, Rank: 5, loss = 1.1807301044464111
Epoch: 1, Step: 161, Rank: 2, loss = 0.9893774390220642
Epoch: 1, Step: 161, Rank: 6, loss = 1.2410858869552612
Epoch: 1, Step: 161, Rank: 3, loss = 0.36307814717292786
Epoch: 1, Step: 161, Rank: 1, loss = 0.013963826932013035
Epoch: 1, Step: 161, Rank: 4, loss = 0.8184139132499695
Epoch: 1, Step: 161, Rank: 7, loss = 1.1242634057998657
Per-token loss scaled by world size: 9.448503078601789e-06
Epoch: 1, Step: 161, Rank: 0, loss = 0.023473624140024185
total tokens: 7380 num samples: 3 num padding tokens: 869 - rank: 1 max len: 2460 min len: 1846 avg len: 2170.3333333333335 num_loss_counted_tokens: 394
total tokens: 7407 num samples: 9 num padding tokens: 1144 - rank: 4 max len: 823 min len: 580 avg len: 695.8888888888889 num_loss_counted_tokens: 4039
{
"epoch": 1,
"step": 161,
"rank": 0,
"loss": 0.023473624140024185,
"overall_throughput": 40.61008048978514,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.50694465637207,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19875,
"batch_size": 70,
"total_loss": 0.7192983031272888,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:10.880154"
}
total tokens: 8008 num samples: 14 num padding tokens: 808 - rank: 5 max len: 572 min len: 460 avg len: 514.2857142857143 num_loss_counted_tokens: 4103
total tokens: 7992 num samples: 8 num padding tokens: 557 - rank: 3 max len: 999 min len: 868 avg len: 929.375 num_loss_counted_tokens: 6112
total tokens: 7704 num samples: 18 num padding tokens: 1248 - rank: 6 max len: 428 min len: 273 avg len: 358.6666666666667 num_loss_counted_tokens: 3541
total tokens: 8028 num samples: 2 num padding tokens: 1416 - rank: 0 max len: 4014 min len: 2598 avg len: 3306.0 num_loss_counted_tokens: 164
total tokens: 8076 num samples: 6 num padding tokens: 1073 - rank: 2 max len: 1346 min len: 1029 avg len: 1167.1666666666667 num_loss_counted_tokens: 4924
total tokens: 6233 num samples: 23 num padding tokens: 2022 - rank: 7 max len: 271 min len: 80 avg len: 183.08695652173913 num_loss_counted_tokens: 1936
Per-token loss scaled by world size: 0.0008293814607895911
Per-token loss scaled by world size: 0.0006533037521876395
Per-token loss scaled by world size: 0.0005620094598270953
Per-token loss scaled by world size: 1.2225326827319805e-05
Per-token loss scaled by world size: 5.4908236052142456e-05
Per-token loss scaled by world size: 5.549823254114017e-05
Per-token loss scaled by world size: 2.421064209556789e-06
Epoch: 1, Step: 162, Rank: 5, loss = 1.6809488534927368
Epoch: 1, Step: 162, Rank: 0, loss = 0.02477768063545227
Epoch: 1, Step: 162, Rank: 4, loss = 1.1390526294708252
Epoch: 1, Step: 162, Rank: 7, loss = 1.3240833282470703
Epoch: 1, Step: 162, Rank: 2, loss = 0.11248104274272919
Epoch: 1, Step: 162, Rank: 1, loss = 0.1112852692604065
Epoch: 1, Step: 162, Rank: 3, loss = 0.004906891845166683
Per-token loss scaled by world size: 0.0008481117547489703
Epoch: 1, Step: 162, Rank: 6, loss = 1.7189104557037354
total tokens: 7875 num samples: 3 num padding tokens: 1486 - rank: 1 max len: 2625 min len: 1737 avg len: 2129.6666666666665 num_loss_counted_tokens: 747
total tokens: 7448 num samples: 8 num padding tokens: 1221 - rank: 4 max len: 931 min len: 707 avg len: 778.375 num_loss_counted_tokens: 2823
total tokens: 6572 num samples: 4 num padding tokens: 264 - rank: 2 max len: 1643 min len: 1485 avg len: 1577.0 num_loss_counted_tokens: 3122
{
"epoch": 1,
"step": 162,
"rank": 0,
"loss": 0.02477768063545227,
"overall_throughput": 42.2840773469095,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.426692962646484,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 16214,
"batch_size": 68,
"total_loss": 0.7645557522773743,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:13.378530"
}
total tokens: 7760 num samples: 16 num padding tokens: 1908 - rank: 6 max len: 485 min len: 279 avg len: 365.75 num_loss_counted_tokens: 3363
total tokens: 7944 num samples: 6 num padding tokens: 1666 - rank: 3 max len: 1324 min len: 934 avg len: 1046.3333333333333 num_loss_counted_tokens: 2604
total tokens: 7590 num samples: 11 num padding tokens: 1066 - rank: 5 max len: 690 min len: 492 avg len: 593.0909090909091 num_loss_counted_tokens: 3430
total tokens: 7830 num samples: 29 num padding tokens: 2293 - rank: 7 max len: 270 min len: 77 avg len: 190.93103448275863 num_loss_counted_tokens: 2270
total tokens: 6446 num samples: 2 num padding tokens: 435 - rank: 0 max len: 3223 min len: 2788 avg len: 3005.5 num_loss_counted_tokens: 179
Per-token loss scaled by world size: 0.00018602880300022662
Per-token loss scaled by world size: 0.000572515360545367
Per-token loss scaled by world size: 0.0006904263282194734
Per-token loss scaled by world size: 0.0004460025520529598
Per-token loss scaled by world size: 0.0004231746424920857
Per-token loss scaled by world size: 5.620245701720705e-06
Per-token loss scaled by world size: 8.813981935418269e-07
Epoch: 1, Step: 163, Rank: 6, loss = 1.2291189432144165
Epoch: 1, Step: 163, Rank: 4, loss = 1.4822590351104736
Epoch: 1, Step: 163, Rank: 2, loss = 0.3993805944919586
Epoch: 1, Step: 163, Rank: 0, loss = 0.012065964750945568
Epoch: 1, Step: 163, Rank: 3, loss = 0.9575117230415344
Epoch: 1, Step: 163, Rank: 7, loss = 0.9085030555725098
Epoch: 1, Step: 163, Rank: 1, loss = 0.0018922517774626613
Per-token loss scaled by world size: 0.0005087603931315243
Epoch: 1, Step: 163, Rank: 5, loss = 1.0922449827194214
total tokens: 7695 num samples: 9 num padding tokens: 710 - rank: 4 max len: 855 min len: 712 avg len: 776.1111111111111 num_loss_counted_tokens: 5130
total tokens: 7552 num samples: 4 num padding tokens: 1186 - rank: 1 max len: 1888 min len: 1444 avg len: 1591.5 num_loss_counted_tokens: 754
{
"epoch": 1,
"step": 163,
"rank": 0,
"loss": 0.012065964750945568,
"overall_throughput": 42.679599605421174,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.32099151611328,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 17175,
"batch_size": 79,
"total_loss": 0.7603721022605896,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:15.854736"
}
total tokens: 6895 num samples: 5 num padding tokens: 862 - rank: 2 max len: 1379 min len: 1055 avg len: 1206.6 num_loss_counted_tokens: 1706
total tokens: 7287 num samples: 7 num padding tokens: 458 - rank: 3 max len: 1041 min len: 894 avg len: 975.5714285714286 num_loss_counted_tokens: 4839
total tokens: 6182 num samples: 22 num padding tokens: 2681 - rank: 7 max len: 281 min len: 75 avg len: 159.13636363636363 num_loss_counted_tokens: 1402
total tokens: 6306 num samples: 2 num padding tokens: 367 - rank: 0 max len: 3153 min len: 2786 avg len: 2969.5 num_loss_counted_tokens: 481
total tokens: 8016 num samples: 16 num padding tokens: 1294 - rank: 6 max len: 501 min len: 281 avg len: 420.125 num_loss_counted_tokens: 3734
total tokens: 7799 num samples: 11 num padding tokens: 922 - rank: 5 max len: 709 min len: 521 avg len: 625.1818181818181 num_loss_counted_tokens: 3657
Per-token loss scaled by world size: 0.000572259072214365
Per-token loss scaled by world size: 0.0008030128665268421
Per-token loss scaled by world size: 0.0005713719874620438
Per-token loss scaled by world size: 0.0008417390054091811
Per-token loss scaled by world size: 4.514108695730101e-06
Per-token loss scaled by world size: 1.0517849659663625e-05
Per-token loss scaled by world size: 8.813677595753688e-06
Epoch: 1, Step: 164, Rank: 5, loss = 1.8358327150344849
Epoch: 1, Step: 164, Rank: 6, loss = 1.7513710260391235
Epoch: 1, Step: 164, Rank: 7, loss = 1.2461622953414917
Epoch: 1, Step: 164, Rank: 2, loss = 0.009845270775258541
Epoch: 1, Step: 164, Rank: 4, loss = 1.2480970621109009
Epoch: 1, Step: 164, Rank: 0, loss = 0.02293943054974079
Epoch: 1, Step: 164, Rank: 1, loss = 0.019222630187869072
Per-token loss scaled by world size: 0.0004583366389852017
Epoch: 1, Step: 164, Rank: 3, loss = 0.9996321797370911
total tokens: 7084 num samples: 4 num padding tokens: 676 - rank: 1 max len: 1771 min len: 1482 avg len: 1602.0 num_loss_counted_tokens: 552
total tokens: 7983 num samples: 9 num padding tokens: 1125 - rank: 4 max len: 887 min len: 678 avg len: 762.0 num_loss_counted_tokens: 4997
{
"epoch": 1,
"step": 164,
"rank": 0,
"loss": 0.02293943054974079,
"overall_throughput": 41.52812214957308,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.29750394821167,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 17448,
"batch_size": 78,
"total_loss": 0.8916378021240234,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:18.404418"
}
total tokens: 7848 num samples: 12 num padding tokens: 795 - rank: 5 max len: 654 min len: 498 avg len: 587.75 num_loss_counted_tokens: 5391
total tokens: 7010 num samples: 5 num padding tokens: 860 - rank: 2 max len: 1402 min len: 1134 avg len: 1230.0 num_loss_counted_tokens: 3080
total tokens: 7448 num samples: 7 num padding tokens: 586 - rank: 3 max len: 1064 min len: 904 avg len: 980.2857142857143 num_loss_counted_tokens: 3922
total tokens: 7560 num samples: 28 num padding tokens: 2574 - rank: 7 max len: 270 min len: 79 avg len: 178.07142857142858 num_loss_counted_tokens: 2339
total tokens: 7263 num samples: 3 num padding tokens: 675 - rank: 0 max len: 2421 min len: 1797 avg len: 2196.0 num_loss_counted_tokens: 329
total tokens: 8109 num samples: 17 num padding tokens: 1719 - rank: 6 max len: 477 min len: 271 avg len: 375.88235294117646 num_loss_counted_tokens: 3620
Per-token loss scaled by world size: 0.0004442204663064331
Per-token loss scaled by world size: 0.00018070742953568697
Per-token loss scaled by world size: 0.0003504411142785102
Per-token loss scaled by world size: 4.3835102587763686e-06
Per-token loss scaled by world size: 0.00022767498739995062
Per-token loss scaled by world size: 7.03313219219126e-07
Per-token loss scaled by world size: 0.0002697974268812686
Epoch: 1, Step: 165, Rank: 2, loss = 0.5169813632965088
Epoch: 1, Step: 165, Rank: 5, loss = 1.2708592414855957
Epoch: 1, Step: 165, Rank: 3, loss = 1.002568244934082
Epoch: 1, Step: 165, Rank: 4, loss = 0.651349663734436
Epoch: 1, Step: 165, Rank: 7, loss = 0.7718567252159119
Epoch: 1, Step: 165, Rank: 1, loss = 0.012540674768388271
Epoch: 1, Step: 165, Rank: 0, loss = 0.0020120912231504917
Per-token loss scaled by world size: 0.00034711475018411875
Epoch: 1, Step: 165, Rank: 6, loss = 0.9930519461631775
total tokens: 8085 num samples: 11 num padding tokens: 750 - rank: 4 max len: 735 min len: 600 avg len: 666.8181818181819 num_loss_counted_tokens: 3756
total tokens: 7432 num samples: 4 num padding tokens: 599 - rank: 1 max len: 1858 min len: 1560 avg len: 1708.25 num_loss_counted_tokens: 1691
{
"epoch": 1,
"step": 165,
"rank": 0,
"loss": 0.0020120912231504917,
"overall_throughput": 42.759868613235994,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.2490234375,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22887,
"batch_size": 75,
"total_loss": 0.6526525020599365,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:20.878553"
}
total tokens: 2249 num samples: 13 num padding tokens: 686 - rank: 7 max len: 173 min len: 79 avg len: 120.23076923076923 num_loss_counted_tokens: 568
total tokens: 7657 num samples: 13 num padding tokens: 1363 - rank: 5 max len: 589 min len: 381 avg len: 484.15384615384613 num_loss_counted_tokens: 3837
total tokens: 7615 num samples: 5 num padding tokens: 1314 - rank: 2 max len: 1523 min len: 1058 avg len: 1260.2 num_loss_counted_tokens: 3900
total tokens: 7980 num samples: 21 num padding tokens: 2329 - rank: 6 max len: 380 min len: 178 avg len: 269.0952380952381 num_loss_counted_tokens: 2730
total tokens: 8040 num samples: 8 num padding tokens: 983 - rank: 3 max len: 1005 min len: 762 avg len: 882.125 num_loss_counted_tokens: 5456
total tokens: 8061 num samples: 3 num padding tokens: 594 - rank: 0 max len: 2687 min len: 2311 avg len: 2489.0 num_loss_counted_tokens: 259
Per-token loss scaled by world size: 0.0003275613998994231
Per-token loss scaled by world size: 0.00014095827646087855
Per-token loss scaled by world size: 0.0002651048998814076
Per-token loss scaled by world size: 0.00036449063918553293
Per-token loss scaled by world size: 0.00021203258074820042
Per-token loss scaled by world size: 0.0002020968240685761
Per-token loss scaled by world size: 1.8056784938380588e-06
Epoch: 1, Step: 166, Rank: 1, loss = 0.4387502372264862
Epoch: 1, Step: 166, Rank: 5, loss = 0.659977912902832
Epoch: 1, Step: 166, Rank: 6, loss = 1.019575834274292
Epoch: 1, Step: 166, Rank: 0, loss = 0.005620399955660105
Epoch: 1, Step: 166, Rank: 7, loss = 0.8251721262931824
Epoch: 1, Step: 166, Rank: 3, loss = 1.1345226764678955
Epoch: 1, Step: 166, Rank: 4, loss = 0.6290516257286072
Per-token loss scaled by world size: 0.0001444466906832531
Epoch: 1, Step: 166, Rank: 2, loss = 0.44960838556289673
total tokens: 5966 num samples: 2 num padding tokens: 617 - rank: 1 max len: 2983 min len: 2366 avg len: 2674.5 num_loss_counted_tokens: 241
total tokens: 7798 num samples: 7 num padding tokens: 1377 - rank: 4 max len: 1114 min len: 740 avg len: 917.2857142857143 num_loss_counted_tokens: 4598
{
"epoch": 1,
"step": 166,
"rank": 0,
"loss": 0.005620399955660105,
"overall_throughput": 41.7691118119109,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.322949409484863,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24901,
"batch_size": 68,
"total_loss": 0.64528489112854,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:23.412572"
}
total tokens: 8100 num samples: 12 num padding tokens: 1017 - rank: 5 max len: 675 min len: 510 avg len: 590.25 num_loss_counted_tokens: 5149
total tokens: 7824 num samples: 4 num padding tokens: 1097 - rank: 3 max len: 1956 min len: 1341 avg len: 1681.75 num_loss_counted_tokens: 3450
total tokens: 8000 num samples: 16 num padding tokens: 2121 - rank: 6 max len: 500 min len: 290 avg len: 367.4375 num_loss_counted_tokens: 3595
total tokens: 6925 num samples: 25 num padding tokens: 2774 - rank: 7 max len: 277 min len: 77 avg len: 166.04 num_loss_counted_tokens: 1782
total tokens: 6918 num samples: 3 num padding tokens: 285 - rank: 2 max len: 2306 min len: 2144 avg len: 2211.0 num_loss_counted_tokens: 208
total tokens: 6292 num samples: 2 num padding tokens: 131 - rank: 0 max len: 3146 min len: 3015 avg len: 3080.5 num_loss_counted_tokens: 174
Per-token loss scaled by world size: 0.0004856240702793002
Per-token loss scaled by world size: 0.0003434315149206668
Per-token loss scaled by world size: 3.41317463607993e-05
Per-token loss scaled by world size: 3.1640320230508223e-06
Per-token loss scaled by world size: 1.4528293377225054e-06
Per-token loss scaled by world size: 0.0005926437443122268
Per-token loss scaled by world size: 0.0002455389767419547
Epoch: 1, Step: 167, Rank: 5, loss = 1.262865424156189
Epoch: 1, Step: 167, Rank: 2, loss = 0.08875960856676102
Epoch: 1, Step: 167, Rank: 6, loss = 0.8930936455726624
Epoch: 1, Step: 167, Rank: 0, loss = 0.0037780827842652798
Epoch: 1, Step: 167, Rank: 4, loss = 1.5411700010299683
Epoch: 1, Step: 167, Rank: 1, loss = 0.008228065446019173
Epoch: 1, Step: 167, Rank: 7, loss = 0.6385241150856018
Per-token loss scaled by world size: 0.0002826468553394079
Epoch: 1, Step: 167, Rank: 3, loss = 0.7350231409072876
total tokens: 7464 num samples: 4 num padding tokens: 501 - rank: 1 max len: 1866 min len: 1478 avg len: 1740.75 num_loss_counted_tokens: 1361
total tokens: 7820 num samples: 10 num padding tokens: 879 - rank: 4 max len: 782 min len: 657 avg len: 694.1 num_loss_counted_tokens: 4357
{
"epoch": 1,
"step": 167,
"rank": 0,
"loss": 0.0037780827842652798,
"overall_throughput": 41.38714694400821,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.380234718322754,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20804,
"batch_size": 85,
"total_loss": 0.6464303731918335,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:25.970591"
}
total tokens: 6966 num samples: 27 num padding tokens: 2143 - rank: 7 max len: 258 min len: 77 avg len: 178.62962962962962 num_loss_counted_tokens: 1881
total tokens: 8032 num samples: 8 num padding tokens: 946 - rank: 3 max len: 1004 min len: 790 avg len: 885.75 num_loss_counted_tokens: 3467
total tokens: 7410 num samples: 6 num padding tokens: 411 - rank: 2 max len: 1235 min len: 1054 avg len: 1166.5 num_loss_counted_tokens: 3794
total tokens: 7668 num samples: 12 num padding tokens: 1179 - rank: 5 max len: 639 min len: 424 avg len: 540.75 num_loss_counted_tokens: 2919
total tokens: 8037 num samples: 19 num padding tokens: 1731 - rank: 6 max len: 423 min len: 269 avg len: 331.89473684210526 num_loss_counted_tokens: 3562
total tokens: 6488 num samples: 2 num padding tokens: 894 - rank: 0 max len: 3244 min len: 2350 avg len: 2797.0 num_loss_counted_tokens: 206
Per-token loss scaled by world size: 0.00039238386671058834
Per-token loss scaled by world size: 0.0004944170941598713
Per-token loss scaled by world size: 5.2391669669304974e-06
Per-token loss scaled by world size: 0.00014728681708220392
Per-token loss scaled by world size: 0.0004893930163234472
Per-token loss scaled by world size: 3.5653782106237486e-05
Per-token loss scaled by world size: 0.0004791621759068221
Epoch: 1, Step: 168, Rank: 0, loss = 0.012632940895855427
Epoch: 1, Step: 168, Rank: 6, loss = 1.180048942565918
Epoch: 1, Step: 168, Rank: 5, loss = 1.1921632289886475
Epoch: 1, Step: 168, Rank: 2, loss = 0.35514533519744873
Epoch: 1, Step: 168, Rank: 4, loss = 0.9461355805397034
Epoch: 1, Step: 168, Rank: 1, loss = 0.08597017824649811
Epoch: 1, Step: 168, Rank: 7, loss = 1.1553797721862793
Per-token loss scaled by world size: 0.00023344735382124782
Epoch: 1, Step: 168, Rank: 3, loss = 0.5628999471664429
total tokens: 7931 num samples: 11 num padding tokens: 374 - rank: 4 max len: 721 min len: 645 avg len: 687.0 num_loss_counted_tokens: 3685
total tokens: 6885 num samples: 3 num padding tokens: 1645 - rank: 1 max len: 2295 min len: 1266 avg len: 1746.6666666666667 num_loss_counted_tokens: 678
{
"epoch": 1,
"step": 168,
"rank": 0,
"loss": 0.012632940895855427,
"overall_throughput": 41.535125133568044,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.440462589263916,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19290,
"batch_size": 79,
"total_loss": 0.6862969994544983,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:28.521157"
}
total tokens: 8112 num samples: 16 num padding tokens: 1692 - rank: 6 max len: 507 min len: 284 avg len: 401.25 num_loss_counted_tokens: 3636
total tokens: 7821 num samples: 9 num padding tokens: 753 - rank: 3 max len: 869 min len: 722 avg len: 785.3333333333334 num_loss_counted_tokens: 5472
total tokens: 7440 num samples: 6 num padding tokens: 983 - rank: 2 max len: 1240 min len: 874 avg len: 1076.1666666666667 num_loss_counted_tokens: 3013
total tokens: 7728 num samples: 12 num padding tokens: 813 - rank: 5 max len: 644 min len: 530 avg len: 576.25 num_loss_counted_tokens: 4677
total tokens: 7306 num samples: 26 num padding tokens: 2436 - rank: 7 max len: 281 min len: 81 avg len: 187.30769230769232 num_loss_counted_tokens: 2351
total tokens: 7062 num samples: 2 num padding tokens: 653 - rank: 0 max len: 3531 min len: 2878 avg len: 3204.5 num_loss_counted_tokens: 193
Per-token loss scaled by world size: 0.00023193543893285096
Per-token loss scaled by world size: 0.00030478413100354373
Per-token loss scaled by world size: 0.00034480085014365613
Per-token loss scaled by world size: 4.721171990240691e-06
Per-token loss scaled by world size: 4.196311692794552e-06
Per-token loss scaled by world size: 0.00038670990034006536
Epoch: 1, Step: 169, Rank: 3, loss = 0.8575863242149353
Per-token loss scaled by world size: 8.75471014296636e-05
Epoch: 1, Step: 169, Rank: 6, loss = 0.9701833724975586
Epoch: 1, Step: 169, Rank: 0, loss = 0.013284197077155113
Epoch: 1, Step: 169, Rank: 2, loss = 0.652608335018158
Epoch: 1, Step: 169, Rank: 1, loss = 0.011807371862232685
Epoch: 1, Step: 169, Rank: 4, loss = 1.0881049633026123
Per-token loss scaled by world size: 0.0006382779683917761
Epoch: 1, Step: 169, Rank: 7, loss = 0.24633565545082092
Epoch: 1, Step: 169, Rank: 5, loss = 1.7959545850753784
total tokens: 6174 num samples: 3 num padding tokens: 245 - rank: 1 max len: 2058 min len: 1856 avg len: 1976.3333333333333 num_loss_counted_tokens: 674
total tokens: 7389 num samples: 9 num padding tokens: 742 - rank: 4 max len: 821 min len: 683 avg len: 738.5555555555555 num_loss_counted_tokens: 5047
total tokens: 7320 num samples: 30 num padding tokens: 2107 - rank: 7 max len: 244 min len: 91 avg len: 173.76666666666668 num_loss_counted_tokens: 2225
{
"epoch": 1,
"step": 169,
"rank": 0,
"loss": 0.013284197077155113,
"overall_throughput": 41.77523610688796,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.364055633544922,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22510,
"batch_size": 81,
"total_loss": 0.7044830918312073,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:31.054810"
}
total tokens: 7602 num samples: 7 num padding tokens: 811 - rank: 3 max len: 1086 min len: 880 avg len: 970.1428571428571 num_loss_counted_tokens: 5560
total tokens: 8100 num samples: 12 num padding tokens: 1477 - rank: 5 max len: 675 min len: 438 avg len: 551.9166666666666 num_loss_counted_tokens: 3835
total tokens: 7270 num samples: 5 num padding tokens: 1169 - rank: 2 max len: 1454 min len: 1115 avg len: 1220.2 num_loss_counted_tokens: 3611
total tokens: 5712 num samples: 2 num padding tokens: 500 - rank: 0 max len: 2856 min len: 2356 avg len: 2606.0 num_loss_counted_tokens: 161
total tokens: 7848 num samples: 18 num padding tokens: 1235 - rank: 6 max len: 436 min len: 265 avg len: 367.3888888888889 num_loss_counted_tokens: 4095
Per-token loss scaled by world size: 0.00010392792319180444
Per-token loss scaled by world size: 0.00014247662329580635
Per-token loss scaled by world size: 0.00030479932320304215
Per-token loss scaled by world size: 0.0002555457758717239
Per-token loss scaled by world size: 0.00023876398336142302
Per-token loss scaled by world size: 0.0003587114915717393
Per-token loss scaled by world size: 0.00022289040498435497
Epoch: 1, Step: 170, Rank: 6, loss = 0.8772567510604858
Epoch: 1, Step: 170, Rank: 4, loss = 1.0463379621505737
Epoch: 1, Step: 170, Rank: 2, loss = 0.4891044497489929
Epoch: 1, Step: 170, Rank: 7, loss = 0.8196468949317932
Epoch: 1, Step: 170, Rank: 3, loss = 0.7651548981666565
Epoch: 1, Step: 170, Rank: 5, loss = 1.2314116954803467
Epoch: 1, Step: 170, Rank: 1, loss = 0.3567715585231781
Per-token loss scaled by world size: 2.3532686100224964e-05
Epoch: 1, Step: 170, Rank: 0, loss = 0.08078476786613464
total tokens: 7016 num samples: 4 num padding tokens: 1240 - rank: 1 max len: 1754 min len: 1179 avg len: 1444.0 num_loss_counted_tokens: 1903
total tokens: 8085 num samples: 11 num padding tokens: 649 - rank: 4 max len: 735 min len: 609 avg len: 676.0 num_loss_counted_tokens: 4383
{
"epoch": 1,
"step": 170,
"rank": 0,
"loss": 0.08078476786613464,
"overall_throughput": 40.4277908497992,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.523925304412842,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 27463,
"batch_size": 84,
"total_loss": 0.7083086371421814,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:33.666058"
}
total tokens: 7812 num samples: 18 num padding tokens: 1829 - rank: 6 max len: 434 min len: 242 avg len: 332.3888888888889 num_loss_counted_tokens: 3261
total tokens: 7917 num samples: 13 num padding tokens: 1050 - rank: 5 max len: 609 min len: 441 avg len: 528.2307692307693 num_loss_counted_tokens: 3856
total tokens: 7644 num samples: 7 num padding tokens: 494 - rank: 2 max len: 1092 min len: 972 avg len: 1021.4285714285714 num_loss_counted_tokens: 4662
total tokens: 7260 num samples: 33 num padding tokens: 2169 - rank: 7 max len: 220 min len: 78 avg len: 154.27272727272728 num_loss_counted_tokens: 1876
total tokens: 7704 num samples: 8 num padding tokens: 1070 - rank: 3 max len: 963 min len: 747 avg len: 829.25 num_loss_counted_tokens: 5234
total tokens: 6336 num samples: 3 num padding tokens: 247 - rank: 0 max len: 2112 min len: 1890 avg len: 2029.6666666666667 num_loss_counted_tokens: 2181
Per-token loss scaled by world size: 0.0003628956328611821
Per-token loss scaled by world size: 0.0002919238177128136
Per-token loss scaled by world size: 0.0003382969880476594
Per-token loss scaled by world size: 7.853787246858701e-05
Per-token loss scaled by world size: 0.00014888570876792073
Per-token loss scaled by world size: 0.00044355227146297693
Per-token loss scaled by world size: 0.00015364577120635659
Epoch: 1, Step: 171, Rank: 4, loss = 0.961414635181427
Epoch: 1, Step: 171, Rank: 3, loss = 1.1141388416290283
Epoch: 1, Step: 171, Rank: 1, loss = 0.49033647775650024
Epoch: 1, Step: 171, Rank: 2, loss = 0.2586546540260315
Epoch: 1, Step: 171, Rank: 5, loss = 1.4607839584350586
Epoch: 1, Step: 171, Rank: 6, loss = 1.195151448249817
Epoch: 1, Step: 171, Rank: 7, loss = 0.5060131549835205
Per-token loss scaled by world size: 3.983392525697127e-05
Epoch: 1, Step: 171, Rank: 0, loss = 0.1311880499124527
total tokens: 7515 num samples: 9 num padding tokens: 869 - rank: 4 max len: 835 min len: 663 avg len: 738.4444444444445 num_loss_counted_tokens: 5088
total tokens: 8040 num samples: 4 num padding tokens: 1032 - rank: 1 max len: 2010 min len: 1576 avg len: 1752.0 num_loss_counted_tokens: 2036
{
"epoch": 1,
"step": 171,
"rank": 0,
"loss": 0.1311880499124527,
"overall_throughput": 40.56471721668778,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.534343242645264,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26347,
"batch_size": 96,
"total_loss": 0.7647101283073425,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:36.272989"
}
total tokens: 7752 num samples: 17 num padding tokens: 1950 - rank: 6 max len: 456 min len: 279 avg len: 341.29411764705884 num_loss_counted_tokens: 3147
total tokens: 7596 num samples: 12 num padding tokens: 817 - rank: 5 max len: 633 min len: 491 avg len: 564.9166666666666 num_loss_counted_tokens: 5545
total tokens: 7479 num samples: 27 num padding tokens: 3306 - rank: 7 max len: 277 min len: 84 avg len: 154.55555555555554 num_loss_counted_tokens: 1626
total tokens: 8016 num samples: 8 num padding tokens: 490 - rank: 3 max len: 1002 min len: 848 avg len: 940.75 num_loss_counted_tokens: 5067
total tokens: 5446 num samples: 2 num padding tokens: 669 - rank: 0 max len: 2723 min len: 2054 avg len: 2388.5 num_loss_counted_tokens: 841
total tokens: 6835 num samples: 5 num padding tokens: 1245 - rank: 2 max len: 1367 min len: 1005 avg len: 1118.0 num_loss_counted_tokens: 616
Per-token loss scaled by world size: 0.00026996861561201513
Per-token loss scaled by world size: 8.522550342604518e-05
Per-token loss scaled by world size: 0.00032177582033909857
Per-token loss scaled by world size: 9.002388833323494e-05
Per-token loss scaled by world size: 9.29309317143634e-05
Per-token loss scaled by world size: 0.00024938900605775416
Per-token loss scaled by world size: 0.00014789693523198366
Epoch: 1, Step: 172, Rank: 0, loss = 0.3234558403491974
Epoch: 1, Step: 172, Rank: 4, loss = 1.1561405658721924
Epoch: 1, Step: 172, Rank: 6, loss = 0.9699972867965698
Epoch: 1, Step: 172, Rank: 2, loss = 0.30621522665023804
Epoch: 1, Step: 172, Rank: 1, loss = 0.3339008390903473
Epoch: 1, Step: 172, Rank: 5, loss = 0.896054744720459
Epoch: 1, Step: 172, Rank: 7, loss = 0.5313937067985535
Per-token loss scaled by world size: 0.00018880210700444877
Epoch: 1, Step: 172, Rank: 3, loss = 0.67836594581604
total tokens: 6222 num samples: 3 num padding tokens: 850 - rank: 1 max len: 2074 min len: 1514 avg len: 1790.6666666666667 num_loss_counted_tokens: 599
total tokens: 7579 num samples: 11 num padding tokens: 750 - rank: 4 max len: 689 min len: 580 avg len: 620.8181818181819 num_loss_counted_tokens: 3571
{
"epoch": 1,
"step": 172,
"rank": 0,
"loss": 0.3234558403491974,
"overall_throughput": 41.46099612058457,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.491368293762207,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 28744,
"batch_size": 104,
"total_loss": 0.6494404673576355,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:38.822555"
}
total tokens: 7328 num samples: 32 num padding tokens: 2250 - rank: 7 max len: 229 min len: 81 avg len: 158.6875 num_loss_counted_tokens: 2126
total tokens: 7920 num samples: 10 num padding tokens: 439 - rank: 3 max len: 792 min len: 702 avg len: 748.1 num_loss_counted_tokens: 6006
total tokens: 7617 num samples: 3 num padding tokens: 480 - rank: 0 max len: 2539 min len: 2078 avg len: 2379.0 num_loss_counted_tokens: 282
total tokens: 7494 num samples: 6 num padding tokens: 1464 - rank: 2 max len: 1249 min len: 825 avg len: 1005.0 num_loss_counted_tokens: 3339
total tokens: 7938 num samples: 14 num padding tokens: 1254 - rank: 5 max len: 567 min len: 424 avg len: 477.42857142857144 num_loss_counted_tokens: 4110
total tokens: 8018 num samples: 19 num padding tokens: 1923 - rank: 6 max len: 422 min len: 231 avg len: 320.7894736842105 num_loss_counted_tokens: 3186
Per-token loss scaled by world size: 0.0002713052381295711
Per-token loss scaled by world size: 0.00035623108851723373
Per-token loss scaled by world size: 0.00047955545596778393
Per-token loss scaled by world size: 0.0002560637367423624
Per-token loss scaled by world size: 3.385763557162136e-05
Per-token loss scaled by world size: 5.4830157750984654e-05
Per-token loss scaled by world size: 2.089197550958488e-06
Epoch: 1, Step: 173, Rank: 3, loss = 0.766302764415741
Epoch: 1, Step: 173, Rank: 2, loss = 0.101323202252388
Epoch: 1, Step: 173, Rank: 5, loss = 1.4351296424865723
Epoch: 1, Step: 173, Rank: 4, loss = 1.066066026687622
Epoch: 1, Step: 173, Rank: 1, loss = 0.16408610343933105
Epoch: 1, Step: 173, Rank: 7, loss = 0.8119148015975952
Epoch: 1, Step: 173, Rank: 0, loss = 0.006252184975892305
Per-token loss scaled by world size: 0.00036022928543388844
Epoch: 1, Step: 173, Rank: 6, loss = 1.0780311822891235
total tokens: 7947 num samples: 9 num padding tokens: 497 - rank: 4 max len: 883 min len: 762 avg len: 827.7777777777778 num_loss_counted_tokens: 6018
total tokens: 6960 num samples: 3 num padding tokens: 675 - rank: 1 max len: 2320 min len: 1916 avg len: 2095.0 num_loss_counted_tokens: 1117
{
"epoch": 1,
"step": 173,
"rank": 0,
"loss": 0.006252184975892305,
"overall_throughput": 42.234314636277595,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.221298694610596,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23941,
"batch_size": 80,
"total_loss": 0.678638219833374,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:41.355552"
}
total tokens: 6350 num samples: 2 num padding tokens: 380 - rank: 0 max len: 3175 min len: 2795 avg len: 2985.0 num_loss_counted_tokens: 1089
total tokens: 6624 num samples: 4 num padding tokens: 786 - rank: 2 max len: 1656 min len: 1233 avg len: 1459.5 num_loss_counted_tokens: 2379
total tokens: 8041 num samples: 11 num padding tokens: 1409 - rank: 5 max len: 731 min len: 520 avg len: 602.9090909090909 num_loss_counted_tokens: 3830
total tokens: 8064 num samples: 16 num padding tokens: 2063 - rank: 6 max len: 504 min len: 251 avg len: 375.0625 num_loss_counted_tokens: 3722
total tokens: 7936 num samples: 32 num padding tokens: 2626 - rank: 7 max len: 248 min len: 74 avg len: 165.9375 num_loss_counted_tokens: 2201
total tokens: 7314 num samples: 6 num padding tokens: 1159 - rank: 3 max len: 1219 min len: 923 avg len: 1025.8333333333333 num_loss_counted_tokens: 4532
Per-token loss scaled by world size: 0.0002848200674634427
Per-token loss scaled by world size: 0.0003533354902174324
Per-token loss scaled by world size: 0.00015056866686791182
Per-token loss scaled by world size: 0.00036068688496015966
Per-token loss scaled by world size: 0.0002968825865536928
Per-token loss scaled by world size: 0.000274753401754424
Per-token loss scaled by world size: 2.8490408112702426e-06
Epoch: 1, Step: 174, Rank: 6, loss = 1.022597074508667
Epoch: 1, Step: 174, Rank: 2, loss = 0.4357645511627197
Epoch: 1, Step: 174, Rank: 4, loss = 0.8243048787117004
Epoch: 1, Step: 174, Rank: 5, loss = 1.0438729524612427
Epoch: 1, Step: 174, Rank: 7, loss = 0.8592153191566467
Epoch: 1, Step: 174, Rank: 1, loss = 0.7951706647872925
Epoch: 1, Step: 174, Rank: 0, loss = 0.008245480246841908
Per-token loss scaled by world size: 0.0002512831415515393
Epoch: 1, Step: 174, Rank: 3, loss = 0.7272448539733887
total tokens: 7217 num samples: 7 num padding tokens: 815 - rank: 4 max len: 1031 min len: 839 avg len: 914.5714285714286 num_loss_counted_tokens: 4377
total tokens: 5458 num samples: 2 num padding tokens: 51 - rank: 1 max len: 2729 min len: 2678 avg len: 2703.5 num_loss_counted_tokens: 801
{
"epoch": 1,
"step": 174,
"rank": 0,
"loss": 0.008245480246841908,
"overall_throughput": 41.476173818456004,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.290608882904053,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23153,
"batch_size": 84,
"total_loss": 0.7145519852638245,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:43.868184"
}
total tokens: 8085 num samples: 15 num padding tokens: 1660 - rank: 6 max len: 539 min len: 289 avg len: 428.3333333333333 num_loss_counted_tokens: 4071
total tokens: 6890 num samples: 5 num padding tokens: 1024 - rank: 3 max len: 1378 min len: 1085 avg len: 1173.2 num_loss_counted_tokens: 3068
total tokens: 7680 num samples: 3 num padding tokens: 1620 - rank: 2 max len: 2560 min len: 1464 avg len: 2020.0 num_loss_counted_tokens: 1227
total tokens: 7850 num samples: 10 num padding tokens: 1055 - rank: 5 max len: 785 min len: 541 avg len: 679.5 num_loss_counted_tokens: 4251
total tokens: 7830 num samples: 29 num padding tokens: 3032 - rank: 7 max len: 270 min len: 79 avg len: 165.44827586206895 num_loss_counted_tokens: 2136
total tokens: 6586 num samples: 2 num padding tokens: 49 - rank: 0 max len: 3293 min len: 3244 avg len: 3268.5 num_loss_counted_tokens: 217
Per-token loss scaled by world size: 0.0003542072663549334
Per-token loss scaled by world size: 0.00031207496067509055
Per-token loss scaled by world size: 0.000552273471839726
Per-token loss scaled by world size: 0.0003514452837407589
Per-token loss scaled by world size: 8.925243264457094e-07
Per-token loss scaled by world size: 2.603805114631541e-06
Per-token loss scaled by world size: 0.00034913059789687395
Epoch: 1, Step: 175, Rank: 4, loss = 0.9186340570449829
Epoch: 1, Step: 175, Rank: 6, loss = 1.4435738325119019
Epoch: 1, Step: 175, Rank: 1, loss = 0.0023329469840973616
Epoch: 1, Step: 175, Rank: 2, loss = 0.9258535504341125
Epoch: 1, Step: 175, Rank: 3, loss = 0.8157249093055725
Epoch: 1, Step: 175, Rank: 0, loss = 0.006806021090596914
Epoch: 1, Step: 175, Rank: 7, loss = 0.9125837683677673
Per-token loss scaled by world size: 0.00037907989462837577
Epoch: 1, Step: 175, Rank: 5, loss = 0.9908674359321594
total tokens: 7552 num samples: 8 num padding tokens: 1076 - rank: 4 max len: 944 min len: 723 avg len: 809.5 num_loss_counted_tokens: 4001
total tokens: 6219 num samples: 3 num padding tokens: 803 - rank: 1 max len: 2073 min len: 1455 avg len: 1805.3333333333333 num_loss_counted_tokens: 1982
{
"epoch": 1,
"step": 175,
"rank": 0,
"loss": 0.006806021090596914,
"overall_throughput": 41.26828372064641,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.29252052307129,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20911,
"batch_size": 75,
"total_loss": 0.752047061920166,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:46.430726"
}
total tokens: 5842 num samples: 23 num padding tokens: 1859 - rank: 7 max len: 254 min len: 86 avg len: 173.17391304347825 num_loss_counted_tokens: 1764
total tokens: 7518 num samples: 7 num padding tokens: 452 - rank: 3 max len: 1074 min len: 955 avg len: 1009.4285714285714 num_loss_counted_tokens: 5885
total tokens: 7803 num samples: 17 num padding tokens: 1386 - rank: 6 max len: 459 min len: 302 avg len: 377.47058823529414 num_loss_counted_tokens: 3507
total tokens: 7755 num samples: 11 num padding tokens: 586 - rank: 5 max len: 705 min len: 578 avg len: 651.7272727272727 num_loss_counted_tokens: 3489
total tokens: 7440 num samples: 6 num padding tokens: 402 - rank: 2 max len: 1240 min len: 1106 avg len: 1173.0 num_loss_counted_tokens: 4030
total tokens: 7226 num samples: 2 num padding tokens: 1117 - rank: 0 max len: 3613 min len: 2496 avg len: 3054.5 num_loss_counted_tokens: 257
Per-token loss scaled by world size: 0.00023268039512913674
Per-token loss scaled by world size: 0.00022860463650431484
Per-token loss scaled by world size: 0.00033770385198295116
Per-token loss scaled by world size: 4.125645318708848e-06
Per-token loss scaled by world size: 0.0003991488483734429
Per-token loss scaled by world size: 1.1589580026338808e-05
Per-token loss scaled by world size: 0.0002852912584785372
Epoch: 1, Step: 176, Rank: 3, loss = 0.6885303854942322
Epoch: 1, Step: 176, Rank: 0, loss = 0.012208300642669201
Epoch: 1, Step: 176, Rank: 6, loss = 0.9993079304695129
Epoch: 1, Step: 176, Rank: 4, loss = 1.181131362915039
Epoch: 1, Step: 176, Rank: 2, loss = 0.6764696836471558
Epoch: 1, Step: 176, Rank: 1, loss = 0.034295015037059784
Epoch: 1, Step: 176, Rank: 7, loss = 0.844212532043457
Per-token loss scaled by world size: 0.0003952819970436394
Epoch: 1, Step: 176, Rank: 5, loss = 1.1696888208389282
total tokens: 7189 num samples: 7 num padding tokens: 797 - rank: 4 max len: 1027 min len: 807 avg len: 913.1428571428571 num_loss_counted_tokens: 4546
total tokens: 7528 num samples: 4 num padding tokens: 414 - rank: 1 max len: 1882 min len: 1584 avg len: 1778.5 num_loss_counted_tokens: 2995
{
"epoch": 1,
"step": 176,
"rank": 0,
"loss": 0.012208300642669201,
"overall_throughput": 41.56666979221221,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.38214635848999,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23673,
"batch_size": 86,
"total_loss": 0.7007305026054382,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:49.012721"
}
total tokens: 8025 num samples: 15 num padding tokens: 1836 - rank: 6 max len: 535 min len: 285 avg len: 412.6 num_loss_counted_tokens: 4071
total tokens: 7525 num samples: 5 num padding tokens: 253 - rank: 2 max len: 1505 min len: 1379 avg len: 1454.4 num_loss_counted_tokens: 1812
total tokens: 7860 num samples: 10 num padding tokens: 1300 - rank: 5 max len: 786 min len: 557 avg len: 656.0 num_loss_counted_tokens: 4726
total tokens: 7980 num samples: 6 num padding tokens: 677 - rank: 3 max len: 1330 min len: 1078 avg len: 1217.1666666666667 num_loss_counted_tokens: 3981
total tokens: 8091 num samples: 29 num padding tokens: 3107 - rank: 7 max len: 279 min len: 72 avg len: 171.86206896551724 num_loss_counted_tokens: 2093
total tokens: 5826 num samples: 2 num padding tokens: 735 - rank: 0 max len: 2913 min len: 2178 avg len: 2545.5 num_loss_counted_tokens: 2214
Per-token loss scaled by world size: 0.0002194504631916061
Per-token loss scaled by world size: 0.0005286230007186532
Per-token loss scaled by world size: 0.00018121296307072043
Per-token loss scaled by world size: 0.00039247411768883467
Per-token loss scaled by world size: 3.4347518521826714e-05
Per-token loss scaled by world size: 4.241336228005821e-06
Per-token loss scaled by world size: 0.00030080656870268285
Epoch: 1, Step: 177, Rank: 2, loss = 0.5341705083847046
Epoch: 1, Step: 177, Rank: 5, loss = 1.156915545463562
Epoch: 1, Step: 177, Rank: 3, loss = 1.558248519897461
Epoch: 1, Step: 177, Rank: 7, loss = 0.646885097026825
Epoch: 1, Step: 177, Rank: 1, loss = 0.10124789923429489
Epoch: 1, Step: 177, Rank: 0, loss = 0.012502399273216724
Epoch: 1, Step: 177, Rank: 4, loss = 0.8867025375366211
Per-token loss scaled by world size: 0.00036306059337221086
Epoch: 1, Step: 177, Rank: 6, loss = 1.0702118873596191
total tokens: 7792 num samples: 8 num padding tokens: 837 - rank: 4 max len: 974 min len: 724 avg len: 869.375 num_loss_counted_tokens: 4288
total tokens: 7242 num samples: 3 num padding tokens: 572 - rank: 1 max len: 2414 min len: 1992 avg len: 2223.3333333333335 num_loss_counted_tokens: 799
{
"epoch": 1,
"step": 177,
"rank": 0,
"loss": 0.012502399273216724,
"overall_throughput": 42.43462885426557,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.246610641479492,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23582,
"batch_size": 96,
"total_loss": 0.7458605170249939,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:51.468642"
}
total tokens: 7854 num samples: 11 num padding tokens: 757 - rank: 5 max len: 714 min len: 600 avg len: 645.1818181818181 num_loss_counted_tokens: 5101
total tokens: 7854 num samples: 14 num padding tokens: 1807 - rank: 6 max len: 561 min len: 335 avg len: 431.92857142857144 num_loss_counted_tokens: 3657
total tokens: 7752 num samples: 24 num padding tokens: 3111 - rank: 7 max len: 323 min len: 76 avg len: 193.375 num_loss_counted_tokens: 2099
total tokens: 5566 num samples: 2 num padding tokens: 100 - rank: 0 max len: 2783 min len: 2683 avg len: 2733.0 num_loss_counted_tokens: 274
total tokens: 7026 num samples: 6 num padding tokens: 376 - rank: 3 max len: 1171 min len: 1022 avg len: 1108.3333333333333 num_loss_counted_tokens: 3075
total tokens: 7916 num samples: 4 num padding tokens: 1590 - rank: 2 max len: 1979 min len: 1340 avg len: 1581.5 num_loss_counted_tokens: 4584
Per-token loss scaled by world size: 0.0001956072374014184
Per-token loss scaled by world size: 0.00021871054195798934
Per-token loss scaled by world size: 0.0003932247345801443
Per-token loss scaled by world size: 1.0211075277766213e-05
Per-token loss scaled by world size: 0.00026568045723252
Per-token loss scaled by world size: 0.00030937412520870566
Epoch: 1, Step: 178, Rank: 2, loss = 0.689293622970581
Per-token loss scaled by world size: 0.000248032680246979
Epoch: 1, Step: 178, Rank: 3, loss = 0.6164806485176086
Epoch: 1, Step: 178, Rank: 5, loss = 1.2392969131469727
Epoch: 1, Step: 178, Rank: 1, loss = 0.032181479036808014
Epoch: 1, Step: 178, Rank: 6, loss = 0.8373252153396606
Epoch: 1, Step: 178, Rank: 4, loss = 0.9750311970710754
Per-token loss scaled by world size: 3.7826398511242587e-06
Epoch: 1, Step: 178, Rank: 7, loss = 0.7817060351371765
Epoch: 1, Step: 178, Rank: 0, loss = 0.01192146260291338
total tokens: 8037 num samples: 9 num padding tokens: 489 - rank: 4 max len: 893 min len: 768 avg len: 838.6666666666666 num_loss_counted_tokens: 5348
total tokens: 7436 num samples: 4 num padding tokens: 550 - rank: 1 max len: 1859 min len: 1539 avg len: 1721.5 num_loss_counted_tokens: 2082
total tokens: 7380 num samples: 5 num padding tokens: 481 - rank: 2 max len: 1476 min len: 1220 avg len: 1379.8 num_loss_counted_tokens: 1223
{
"epoch": 1,
"step": 178,
"rank": 0,
"loss": 0.01192146260291338,
"overall_throughput": 40.62829660225024,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.526740550994873,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25213,
"batch_size": 83,
"total_loss": 0.647904634475708,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:54.071173"
}
total tokens: 8000 num samples: 16 num padding tokens: 2833 - rank: 6 max len: 500 min len: 256 avg len: 322.9375 num_loss_counted_tokens: 2783
total tokens: 7994 num samples: 7 num padding tokens: 840 - rank: 3 max len: 1142 min len: 895 avg len: 1022.0 num_loss_counted_tokens: 3757
total tokens: 7520 num samples: 10 num padding tokens: 909 - rank: 5 max len: 752 min len: 566 avg len: 661.1 num_loss_counted_tokens: 3632
total tokens: 7560 num samples: 30 num padding tokens: 2931 - rank: 7 max len: 252 min len: 77 avg len: 154.3 num_loss_counted_tokens: 1778
total tokens: 6586 num samples: 2 num padding tokens: 527 - rank: 0 max len: 3293 min len: 2766 avg len: 3029.5 num_loss_counted_tokens: 357
Per-token loss scaled by world size: 0.0004986776039004326
Per-token loss scaled by world size: 0.0002882078697439283
Per-token loss scaled by world size: 0.0005514522199518979
Per-token loss scaled by world size: 0.0002959422126878053
Per-token loss scaled by world size: 0.0005512300995178521
Per-token loss scaled by world size: 4.8149313442991115e-06
Per-token loss scaled by world size: 8.981861901702359e-05
Epoch: 1, Step: 179, Rank: 5, loss = 1.2778526544570923
Epoch: 1, Step: 179, Rank: 6, loss = 1.1555607318878174
Epoch: 1, Step: 179, Rank: 4, loss = 0.6678496599197388
Epoch: 1, Step: 179, Rank: 7, loss = 1.277337908744812
Epoch: 1, Step: 179, Rank: 0, loss = 0.011157399974763393
Epoch: 1, Step: 179, Rank: 1, loss = 0.20813219249248505
Epoch: 1, Step: 179, Rank: 2, loss = 0.6857721209526062
Per-token loss scaled by world size: 0.00024368343292735517
Epoch: 1, Step: 179, Rank: 3, loss = 0.5646754503250122
total tokens: 5876 num samples: 2 num padding tokens: 220 - rank: 1 max len: 2938 min len: 2718 avg len: 2828.0 num_loss_counted_tokens: 745
total tokens: 7839 num samples: 9 num padding tokens: 725 - rank: 4 max len: 871 min len: 630 avg len: 790.4444444444445 num_loss_counted_tokens: 4180
total tokens: 7287 num samples: 7 num padding tokens: 860 - rank: 3 max len: 1041 min len: 872 avg len: 918.1428571428571 num_loss_counted_tokens: 4584
{
"epoch": 1,
"step": 179,
"rank": 0,
"loss": 0.011157399974763393,
"overall_throughput": 41.67005705220922,
"lr": 3.2000000000000003e-06,
"cuda_mem_allocated": 24.338607788085938,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18538,
"batch_size": 79,
"total_loss": 0.7310422658920288,
"gradnorm": 0.9710609316825867,
"weight_norm": 433.04327392578125,
"timestamp": "2024-08-18T20:55:56.610997"
}
total tokens: 8112 num samples: 13 num padding tokens: 833 - rank: 5 max len: 624 min len: 485 avg len: 559.9230769230769 num_loss_counted_tokens: 4530
total tokens: 7329 num samples: 3 num padding tokens: 1943 - rank: 2 max len: 2443 min len: 1182 avg len: 1795.3333333333333 num_loss_counted_tokens: 502
total tokens: 7812 num samples: 28 num padding tokens: 2498 - rank: 7 max len: 279 min len: 87 avg len: 189.78571428571428 num_loss_counted_tokens: 2490
total tokens: 6798 num samples: 2 num padding tokens: 177 - rank: 0 max len: 3399 min len: 3222 avg len: 3310.5 num_loss_counted_tokens: 146
total tokens: 8024 num samples: 17 num padding tokens: 1349 - rank: 6 max len: 472 min len: 304 avg len: 392.6470588235294 num_loss_counted_tokens: 4448
Per-token loss scaled by world size: 0.0003445638285484165
Per-token loss scaled by world size: 0.00042620761087164283
Per-token loss scaled by world size: 7.993769395397976e-05
Per-token loss scaled by world size: 4.4396303565008566e-05
Per-token loss scaled by world size: 0.00013256767124403268
Per-token loss scaled by world size: 0.0002054571668850258
Epoch: 1, Step: 180, Rank: 1, loss = 0.21686097979545593
Epoch: 1, Step: 180, Rank: 5, loss = 1.1562479734420776
Per-token loss scaled by world size: 0.0001180191757157445
Epoch: 1, Step: 180, Rank: 0, loss = 0.12044162303209305
Epoch: 1, Step: 180, Rank: 4, loss = 0.9347586035728455
Epoch: 1, Step: 180, Rank: 3, loss = 0.3596395254135132
Epoch: 1, Step: 180, Rank: 7, loss = 0.5573796033859253
Per-token loss scaled by world size: 0.00041190627962350845
Epoch: 1, Step: 180, Rank: 2, loss = 0.3201712667942047
Epoch: 1, Step: 180, Rank: 6, loss = 1.11745023727417
[2024-08-18 20:55:59,149] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[4.000000000000001e-06], mom=[(0.9, 0.95)]
[2024-08-18 20:55:59,226] [INFO] [timer.py:258:stop] epoch=0/micro_step=180/global_step=5, RunningAvgSamplesPerSec=41.67583330946055, CurrSamplesPerSec=41.56020644882237, MemAllocated=22.74GB, MaxMemAllocated=30.61GB
total tokens: 7120 num samples: 4 num padding tokens: 1624 - rank: 1 max len: 1780 min len: 1186 avg len: 1374.0 num_loss_counted_tokens: 1016
total tokens: 7790 num samples: 10 num padding tokens: 1050 - rank: 4 max len: 779 min len: 583 avg len: 674.0 num_loss_counted_tokens: 5078
{
"epoch": 1,
"step": 180,
"rank": 0,
"loss": 0.12044162303209305,
"overall_throughput": 40.381532723652576,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 22.73690176010132,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21703,
"batch_size": 76,
"total_loss": 0.5978687405586243,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:55:59.229405"
}
total tokens: 6960 num samples: 29 num padding tokens: 2603 - rank: 7 max len: 240 min len: 85 avg len: 150.24137931034483 num_loss_counted_tokens: 1642
total tokens: 7408 num samples: 8 num padding tokens: 579 - rank: 3 max len: 926 min len: 792 avg len: 853.625 num_loss_counted_tokens: 5923
total tokens: 8020 num samples: 20 num padding tokens: 1553 - rank: 6 max len: 401 min len: 267 avg len: 323.35 num_loss_counted_tokens: 3882
total tokens: 5448 num samples: 2 num padding tokens: 854 - rank: 0 max len: 2724 min len: 1870 avg len: 2297.0 num_loss_counted_tokens: 185
total tokens: 7812 num samples: 14 num padding tokens: 1049 - rank: 5 max len: 558 min len: 414 avg len: 483.07142857142856 num_loss_counted_tokens: 3625
total tokens: 7854 num samples: 7 num padding tokens: 726 - rank: 2 max len: 1122 min len: 943 avg len: 1018.2857142857143 num_loss_counted_tokens: 3158
Per-token loss scaled by world size: 0.00021158010349608958
Per-token loss scaled by world size: 9.194504673359916e-05
Per-token loss scaled by world size: 2.7597000098467106e-06
Per-token loss scaled by world size: 0.00044244344462640584
Per-token loss scaled by world size: 0.0002959812409244478
Per-token loss scaled by world size: 1.918704765557777e-05
Per-token loss scaled by world size: 0.0002461440162733197
Epoch: 1, Step: 181, Rank: 4, loss = 0.640823245048523
Epoch: 1, Step: 181, Rank: 5, loss = 1.3400505781173706
Epoch: 1, Step: 181, Rank: 2, loss = 0.27847856283187866
Epoch: 1, Step: 181, Rank: 0, loss = 0.008358441293239594
Epoch: 1, Step: 181, Rank: 1, loss = 0.058112770318984985
Epoch: 1, Step: 181, Rank: 7, loss = 0.8964532017707825
Epoch: 1, Step: 181, Rank: 3, loss = 0.7455086708068848
Per-token loss scaled by world size: 0.00037991307908669114
Epoch: 1, Step: 181, Rank: 6, loss = 1.1506617069244385
total tokens: 7336 num samples: 8 num padding tokens: 1044 - rank: 4 max len: 917 min len: 694 avg len: 786.5 num_loss_counted_tokens: 4411
total tokens: 5638 num samples: 2 num padding tokens: 201 - rank: 1 max len: 2819 min len: 2618 avg len: 2718.5 num_loss_counted_tokens: 197
{
"epoch": 1,
"step": 181,
"rank": 0,
"loss": 0.008358441293239594,
"overall_throughput": 41.99766118896557,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.434746265411377,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24230,
"batch_size": 85,
"total_loss": 0.6398059129714966,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:01.770174"
}
total tokens: 8088 num samples: 12 num padding tokens: 1035 - rank: 5 max len: 674 min len: 501 avg len: 587.75 num_loss_counted_tokens: 4793
total tokens: 7696 num samples: 16 num padding tokens: 1539 - rank: 6 max len: 481 min len: 270 avg len: 384.8125 num_loss_counted_tokens: 3508
total tokens: 7340 num samples: 4 num padding tokens: 1220 - rank: 2 max len: 1835 min len: 1287 avg len: 1530.0 num_loss_counted_tokens: 2015
total tokens: 7047 num samples: 27 num padding tokens: 2280 - rank: 7 max len: 261 min len: 83 avg len: 176.55555555555554 num_loss_counted_tokens: 1980
total tokens: 6842 num samples: 2 num padding tokens: 194 - rank: 0 max len: 3421 min len: 3227 avg len: 3324.0 num_loss_counted_tokens: 203
total tokens: 7308 num samples: 6 num padding tokens: 899 - rank: 3 max len: 1218 min len: 951 avg len: 1068.1666666666667 num_loss_counted_tokens: 2980
Per-token loss scaled by world size: 0.00044985805288888514
Per-token loss scaled by world size: 0.0003697045613080263
Per-token loss scaled by world size: 0.00020446558482944965
Per-token loss scaled by world size: 0.0002235985011793673
Per-token loss scaled by world size: 0.0003323642595205456
Per-token loss scaled by world size: 8.518856338923797e-05
Epoch: 1, Step: 182, Rank: 3, loss = 0.6204019784927368
Per-token loss scaled by world size: 0.00010683093569241464
Epoch: 1, Step: 182, Rank: 5, loss = 1.0257915258407593
Epoch: 1, Step: 182, Rank: 6, loss = 0.9221861958503723
Epoch: 1, Step: 182, Rank: 2, loss = 0.5673153400421143
Epoch: 1, Step: 182, Rank: 4, loss = 1.2481874227523804
Epoch: 1, Step: 182, Rank: 1, loss = 0.23636631667613983
Epoch: 1, Step: 182, Rank: 7, loss = 0.296415776014328
Per-token loss scaled by world size: 2.3087804947863333e-06
Epoch: 1, Step: 182, Rank: 0, loss = 0.006406000349670649
total tokens: 7911 num samples: 3 num padding tokens: 1870 - rank: 1 max len: 2637 min len: 1680 avg len: 2013.6666666666667 num_loss_counted_tokens: 268
total tokens: 3888 num samples: 18 num padding tokens: 1364 - rank: 7 max len: 216 min len: 80 avg len: 140.22222222222223 num_loss_counted_tokens: 1085
total tokens: 7695 num samples: 9 num padding tokens: 456 - rank: 4 max len: 855 min len: 750 avg len: 804.3333333333334 num_loss_counted_tokens: 5317
{
"epoch": 1,
"step": 182,
"rank": 0,
"loss": 0.006406000349670649,
"overall_throughput": 40.52672826708525,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.530043125152588,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22197,
"batch_size": 78,
"total_loss": 0.6153838038444519,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:04.374573"
}
total tokens: 8024 num samples: 17 num padding tokens: 2930 - rank: 6 max len: 472 min len: 227 avg len: 299.6470588235294 num_loss_counted_tokens: 3168
total tokens: 6560 num samples: 4 num padding tokens: 820 - rank: 2 max len: 1640 min len: 1225 avg len: 1435.0 num_loss_counted_tokens: 3801
total tokens: 7062 num samples: 6 num padding tokens: 912 - rank: 3 max len: 1177 min len: 874 avg len: 1025.0 num_loss_counted_tokens: 4465
total tokens: 7854 num samples: 11 num padding tokens: 1369 - rank: 5 max len: 714 min len: 483 avg len: 589.5454545454545 num_loss_counted_tokens: 3058
total tokens: 6820 num samples: 2 num padding tokens: 89 - rank: 0 max len: 3410 min len: 3321 avg len: 3365.5 num_loss_counted_tokens: 208
Per-token loss scaled by world size: 0.00046004995238035917
Per-token loss scaled by world size: 0.0006659817881882191
Per-token loss scaled by world size: 0.0003906514320988208
Per-token loss scaled by world size: 2.6746805815491825e-05
Per-token loss scaled by world size: 0.000365366053301841
Per-token loss scaled by world size: 6.110809408710338e-06
Per-token loss scaled by world size: 2.263839405713952e-06
Epoch: 1, Step: 183, Rank: 6, loss = 1.5981065034866333
Epoch: 1, Step: 183, Rank: 3, loss = 0.9374169707298279
Epoch: 1, Step: 183, Rank: 4, loss = 1.103947401046753
Epoch: 1, Step: 183, Rank: 2, loss = 0.06418230384588242
Epoch: 1, Step: 183, Rank: 7, loss = 0.8767415285110474
Epoch: 1, Step: 183, Rank: 0, loss = 0.005432365462183952
Epoch: 1, Step: 183, Rank: 1, loss = 0.014663650654256344
Per-token loss scaled by world size: 0.0007157879881560802
Epoch: 1, Step: 183, Rank: 5, loss = 1.7176227569580078
total tokens: 7851 num samples: 3 num padding tokens: 445 - rank: 1 max len: 2617 min len: 2389 avg len: 2468.6666666666665 num_loss_counted_tokens: 473
total tokens: 7480 num samples: 11 num padding tokens: 807 - rank: 4 max len: 680 min len: 536 avg len: 606.6363636363636 num_loss_counted_tokens: 4807
{
"epoch": 1,
"step": 183,
"rank": 0,
"loss": 0.005432365462183952,
"overall_throughput": 41.57406220293697,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.319289207458496,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19197,
"batch_size": 71,
"total_loss": 0.7897641658782959,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:06.915463"
}
total tokens: 976 num samples: 8 num padding tokens: 185 - rank: 7 max len: 122 min len: 80 avg len: 98.875 num_loss_counted_tokens: 195
total tokens: 7280 num samples: 7 num padding tokens: 746 - rank: 3 max len: 1040 min len: 851 avg len: 933.4285714285714 num_loss_counted_tokens: 5111
total tokens: 6990 num samples: 6 num padding tokens: 218 - rank: 2 max len: 1165 min len: 1063 avg len: 1128.6666666666667 num_loss_counted_tokens: 3805
total tokens: 7917 num samples: 21 num padding tokens: 2826 - rank: 6 max len: 377 min len: 129 avg len: 242.42857142857142 num_loss_counted_tokens: 2506
total tokens: 7920 num samples: 15 num padding tokens: 1424 - rank: 5 max len: 528 min len: 378 avg len: 433.06666666666666 num_loss_counted_tokens: 4285
total tokens: 5748 num samples: 2 num padding tokens: 35 - rank: 0 max len: 2874 min len: 2839 avg len: 2856.5 num_loss_counted_tokens: 168
Per-token loss scaled by world size: 0.000344208674505353
Per-token loss scaled by world size: 0.00029513309709727764
Per-token loss scaled by world size: 0.0003547095402609557
Per-token loss scaled by world size: 0.0004679278936237097
Per-token loss scaled by world size: 4.7765744966454804e-05
Per-token loss scaled by world size: 0.0003639253554865718
Per-token loss scaled by world size: 4.720650849776575e-06
Epoch: 1, Step: 184, Rank: 6, loss = 0.9553658366203308
Epoch: 1, Step: 184, Rank: 1, loss = 0.12865106761455536
Epoch: 1, Step: 184, Rank: 2, loss = 0.9270830750465393
Epoch: 1, Step: 184, Rank: 4, loss = 1.2603052854537964
Epoch: 1, Step: 184, Rank: 0, loss = 0.012714482843875885
Epoch: 1, Step: 184, Rank: 7, loss = 0.7949041128158569
Epoch: 1, Step: 184, Rank: 5, loss = 0.9801874756813049
Per-token loss scaled by world size: 0.00017599221609998494
Epoch: 1, Step: 184, Rank: 3, loss = 0.4740130305290222
total tokens: 6996 num samples: 2 num padding tokens: 1478 - rank: 1 max len: 3498 min len: 2020 avg len: 2759.0 num_loss_counted_tokens: 177
total tokens: 7335 num samples: 9 num padding tokens: 438 - rank: 4 max len: 815 min len: 722 avg len: 766.3333333333334 num_loss_counted_tokens: 4746
total tokens: 7648 num samples: 16 num padding tokens: 1473 - rank: 6 max len: 478 min len: 288 avg len: 385.9375 num_loss_counted_tokens: 3125
total tokens: 7722 num samples: 11 num padding tokens: 1234 - rank: 5 max len: 702 min len: 483 avg len: 589.8181818181819 num_loss_counted_tokens: 4245
{
"epoch": 1,
"step": 184,
"rank": 0,
"loss": 0.012714482843875885,
"overall_throughput": 41.53668738608439,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.342710971832275,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21547,
"batch_size": 88,
"total_loss": 0.6916530132293701,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:09.462656"
}
total tokens: 7588 num samples: 28 num padding tokens: 2405 - rank: 7 max len: 271 min len: 82 avg len: 185.10714285714286 num_loss_counted_tokens: 2309
total tokens: 7024 num samples: 4 num padding tokens: 1375 - rank: 2 max len: 1756 min len: 1193 avg len: 1412.25 num_loss_counted_tokens: 2078
total tokens: 7448 num samples: 7 num padding tokens: 379 - rank: 3 max len: 1064 min len: 961 avg len: 1009.8571428571429 num_loss_counted_tokens: 5273
total tokens: 4061 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4061 min len: 4061 avg len: 4061.0 num_loss_counted_tokens: 393
Per-token loss scaled by world size: 0.00041325463098473847
Per-token loss scaled by world size: 0.00040759582770988345
Per-token loss scaled by world size: 0.0002913470088969916
Per-token loss scaled by world size: 5.2812706599070225e-06
Per-token loss scaled by world size: 9.605777449905872e-05
Per-token loss scaled by world size: 4.5470653276424855e-05
Per-token loss scaled by world size: 0.000289949937723577
Epoch: 1, Step: 185, Rank: 5, loss = 1.2077573537826538
Epoch: 1, Step: 185, Rank: 6, loss = 1.2245250940322876
Epoch: 1, Step: 185, Rank: 0, loss = 0.015649065375328064
Epoch: 1, Step: 185, Rank: 4, loss = 0.8632975816726685
Epoch: 1, Step: 185, Rank: 2, loss = 0.2846311926841736
Epoch: 1, Step: 185, Rank: 1, loss = 0.13473522663116455
Epoch: 1, Step: 185, Rank: 7, loss = 0.859157919883728
Per-token loss scaled by world size: 0.00039052340434864163
Epoch: 1, Step: 185, Rank: 3, loss = 1.1571696996688843
total tokens: 7909 num samples: 11 num padding tokens: 619 - rank: 4 max len: 719 min len: 567 avg len: 662.7272727272727 num_loss_counted_tokens: 3030
total tokens: 7710 num samples: 5 num padding tokens: 1395 - rank: 1 max len: 1542 min len: 1135 avg len: 1263.0 num_loss_counted_tokens: 2357
total tokens: 7533 num samples: 9 num padding tokens: 442 - rank: 3 max len: 837 min len: 720 avg len: 787.8888888888889 num_loss_counted_tokens: 4365
{
"epoch": 1,
"step": 185,
"rank": 0,
"loss": 0.015649065375328064,
"overall_throughput": 42.01932932464499,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.411304473876953,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23705,
"batch_size": 85,
"total_loss": 0.7183653712272644,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:11.984726"
}
total tokens: 8030 num samples: 22 num padding tokens: 1301 - rank: 6 max len: 365 min len: 237 avg len: 305.8636363636364 num_loss_counted_tokens: 3940
total tokens: 7868 num samples: 14 num padding tokens: 1280 - rank: 5 max len: 562 min len: 411 avg len: 470.57142857142856 num_loss_counted_tokens: 4766
total tokens: 6728 num samples: 29 num padding tokens: 2588 - rank: 7 max len: 232 min len: 75 avg len: 142.75862068965517 num_loss_counted_tokens: 1514
total tokens: 6430 num samples: 2 num padding tokens: 1586 - rank: 0 max len: 3215 min len: 1629 avg len: 2422.0 num_loss_counted_tokens: 1620
total tokens: 7819 num samples: 7 num padding tokens: 799 - rank: 2 max len: 1117 min len: 938 avg len: 1002.8571428571429 num_loss_counted_tokens: 3592
Per-token loss scaled by world size: 0.00034539305488578975
Per-token loss scaled by world size: 0.0004060663341078907
Per-token loss scaled by world size: 0.0002557812840677798
Per-token loss scaled by world size: 0.00037697027437388897
Per-token loss scaled by world size: 0.00021574345009867102
Per-token loss scaled by world size: 1.922340288729174e-06
Per-token loss scaled by world size: 1.7185264368890785e-05
Epoch: 1, Step: 186, Rank: 6, loss = 1.187833309173584
Epoch: 1, Step: 186, Rank: 2, loss = 0.8059667944908142
Epoch: 1, Step: 186, Rank: 0, loss = 0.006057294551283121
Epoch: 1, Step: 186, Rank: 3, loss = 1.0883334875106812
Epoch: 1, Step: 186, Rank: 4, loss = 1.279515027999878
Epoch: 1, Step: 186, Rank: 7, loss = 0.6798076033592224
Epoch: 1, Step: 186, Rank: 1, loss = 0.054150767624378204
Per-token loss scaled by world size: 0.00038644636515527964
Epoch: 1, Step: 186, Rank: 5, loss = 1.217692494392395
total tokens: 8019 num samples: 9 num padding tokens: 612 - rank: 4 max len: 891 min len: 729 avg len: 823.0 num_loss_counted_tokens: 5812
total tokens: 7845 num samples: 3 num padding tokens: 1345 - rank: 1 max len: 2615 min len: 1815 avg len: 2166.6666666666665 num_loss_counted_tokens: 455
{
"epoch": 1,
"step": 186,
"rank": 0,
"loss": 0.006057294551283121,
"overall_throughput": 41.92610024295953,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.250525951385498,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25208,
"batch_size": 86,
"total_loss": 0.7899196147918701,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:14.513363"
}
total tokens: 8030 num samples: 5 num padding tokens: 2103 - rank: 3 max len: 1606 min len: 905 avg len: 1185.4 num_loss_counted_tokens: 1115
total tokens: 7990 num samples: 17 num padding tokens: 1874 - rank: 6 max len: 470 min len: 282 avg len: 359.7647058823529 num_loss_counted_tokens: 3470
total tokens: 7040 num samples: 4 num padding tokens: 176 - rank: 2 max len: 1760 min len: 1656 avg len: 1716.0 num_loss_counted_tokens: 1048
total tokens: 8091 num samples: 29 num padding tokens: 2852 - rank: 7 max len: 279 min len: 85 avg len: 180.6551724137931 num_loss_counted_tokens: 2267
total tokens: 6206 num samples: 2 num padding tokens: 163 - rank: 0 max len: 3103 min len: 2940 avg len: 3021.5 num_loss_counted_tokens: 193
total tokens: 7722 num samples: 11 num padding tokens: 1158 - rank: 5 max len: 702 min len: 471 avg len: 596.7272727272727 num_loss_counted_tokens: 4117
Per-token loss scaled by world size: 0.0002549219934735447
Per-token loss scaled by world size: 9.078537550522014e-05
Per-token loss scaled by world size: 0.0001285703619942069
Per-token loss scaled by world size: 0.00026954273926094174
Per-token loss scaled by world size: 0.00027302553644403815
Per-token loss scaled by world size: 0.00012735271593555808
Per-token loss scaled by world size: 0.00019962496298830956
Epoch: 1, Step: 187, Rank: 3, loss = 0.3104405999183655
Epoch: 1, Step: 187, Rank: 2, loss = 0.8717057108879089
Epoch: 1, Step: 187, Rank: 6, loss = 0.9217013716697693
Epoch: 1, Step: 187, Rank: 1, loss = 0.4396463632583618
Epoch: 1, Step: 187, Rank: 4, loss = 0.9336108565330505
Epoch: 1, Step: 187, Rank: 0, loss = 0.43548262119293213
Per-token loss scaled by world size: 0.00028984216623939574
Epoch: 1, Step: 187, Rank: 7, loss = 0.6826175451278687
Epoch: 1, Step: 187, Rank: 5, loss = 0.9911152720451355
total tokens: 6432 num samples: 3 num padding tokens: 1172 - rank: 1 max len: 2144 min len: 1513 avg len: 1753.3333333333333 num_loss_counted_tokens: 778
total tokens: 7992 num samples: 8 num padding tokens: 895 - rank: 4 max len: 999 min len: 761 avg len: 887.125 num_loss_counted_tokens: 3688
total tokens: 7580 num samples: 10 num padding tokens: 664 - rank: 5 max len: 758 min len: 591 avg len: 691.6 num_loss_counted_tokens: 5001
{
"epoch": 1,
"step": 187,
"rank": 0,
"loss": 0.43548262119293213,
"overall_throughput": 41.80966716175401,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.324402809143066,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 27356,
"batch_size": 97,
"total_loss": 0.6982901096343994,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:17.044420"
}
total tokens: 7756 num samples: 14 num padding tokens: 2347 - rank: 6 max len: 554 min len: 267 avg len: 386.35714285714283 num_loss_counted_tokens: 3784
total tokens: 5145 num samples: 21 num padding tokens: 1727 - rank: 7 max len: 245 min len: 84 avg len: 162.76190476190476 num_loss_counted_tokens: 1417
total tokens: 7055 num samples: 5 num padding tokens: 476 - rank: 2 max len: 1411 min len: 1211 avg len: 1315.8 num_loss_counted_tokens: 2646
total tokens: 7146 num samples: 6 num padding tokens: 724 - rank: 3 max len: 1191 min len: 1017 avg len: 1070.3333333333333 num_loss_counted_tokens: 3869
total tokens: 7376 num samples: 2 num padding tokens: 819 - rank: 0 max len: 3688 min len: 2869 avg len: 3278.5 num_loss_counted_tokens: 160
Per-token loss scaled by world size: 0.00011057691881433129
Per-token loss scaled by world size: 0.00035715868580155075
Per-token loss scaled by world size: 0.0003879719879478216
Per-token loss scaled by world size: 0.00021651088900398463
Per-token loss scaled by world size: 6.0199621657375246e-05
Per-token loss scaled by world size: 5.681112452293746e-05
Per-token loss scaled by world size: 0.0005215432029217482
Epoch: 1, Step: 188, Rank: 6, loss = 1.0699580907821655
Epoch: 1, Step: 188, Rank: 4, loss = 1.1622670888900757
Epoch: 1, Step: 188, Rank: 2, loss = 0.18034301698207855
Epoch: 1, Step: 188, Rank: 1, loss = 0.3312608003616333
Epoch: 1, Step: 188, Rank: 7, loss = 0.6486124992370605
Epoch: 1, Step: 188, Rank: 0, loss = 0.1701919287443161
Epoch: 1, Step: 188, Rank: 5, loss = 1.562412977218628
Per-token loss scaled by world size: 0.0001381830807076767
Epoch: 1, Step: 188, Rank: 3, loss = 0.4139619469642639
total tokens: 5862 num samples: 2 num padding tokens: 913 - rank: 1 max len: 2931 min len: 2018 avg len: 2474.5 num_loss_counted_tokens: 226
total tokens: 7819 num samples: 7 num padding tokens: 1165 - rank: 4 max len: 1117 min len: 742 avg len: 950.5714285714286 num_loss_counted_tokens: 4359
{
"epoch": 1,
"step": 188,
"rank": 0,
"loss": 0.1701919287443161,
"overall_throughput": 40.95455841532473,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.21819305419922,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23966,
"batch_size": 84,
"total_loss": 0.6923760771751404,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:19.630875"
}
total tokens: 8085 num samples: 11 num padding tokens: 1172 - rank: 5 max len: 735 min len: 538 avg len: 628.4545454545455 num_loss_counted_tokens: 4424
total tokens: 7680 num samples: 4 num padding tokens: 1258 - rank: 2 max len: 1920 min len: 1352 avg len: 1605.5 num_loss_counted_tokens: 1060
total tokens: 7875 num samples: 15 num padding tokens: 1673 - rank: 6 max len: 525 min len: 296 avg len: 413.46666666666664 num_loss_counted_tokens: 4419
total tokens: 6734 num samples: 2 num padding tokens: 36 - rank: 0 max len: 3367 min len: 3331 avg len: 3349.0 num_loss_counted_tokens: 164
total tokens: 7920 num samples: 6 num padding tokens: 486 - rank: 3 max len: 1320 min len: 1118 avg len: 1239.0 num_loss_counted_tokens: 1697
total tokens: 8064 num samples: 28 num padding tokens: 2898 - rank: 7 max len: 288 min len: 79 avg len: 184.5 num_loss_counted_tokens: 2302
Per-token loss scaled by world size: 0.0002291280252393335
Per-token loss scaled by world size: 0.00042932084761559963
Per-token loss scaled by world size: 0.0003715125494636595
Per-token loss scaled by world size: 0.0003486127534415573
Per-token loss scaled by world size: 0.00029553903732448816
Per-token loss scaled by world size: 2.541187996030203e-06
Per-token loss scaled by world size: 4.7232209908543155e-05
Epoch: 1, Step: 189, Rank: 5, loss = 1.2460501194000244
Epoch: 1, Step: 189, Rank: 6, loss = 1.0782687664031982
Epoch: 1, Step: 189, Rank: 0, loss = 0.007375480607151985
Epoch: 1, Step: 189, Rank: 2, loss = 0.665015459060669
Epoch: 1, Step: 189, Rank: 7, loss = 0.8577651381492615
Epoch: 1, Step: 189, Rank: 4, loss = 1.0118049383163452
Epoch: 1, Step: 189, Rank: 1, loss = 0.13708558678627014
Per-token loss scaled by world size: 0.00045083268196322024
Epoch: 1, Step: 189, Rank: 3, loss = 1.308485507965088
total tokens: 8012 num samples: 4 num padding tokens: 1085 - rank: 1 max len: 2003 min len: 1547 avg len: 1731.75 num_loss_counted_tokens: 501
total tokens: 7704 num samples: 9 num padding tokens: 539 - rank: 4 max len: 856 min len: 750 avg len: 796.1111111111111 num_loss_counted_tokens: 5817
{
"epoch": 1,
"step": 189,
"rank": 0,
"loss": 0.007375480607151985,
"overall_throughput": 41.729872575916524,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.477022171020508,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23219,
"batch_size": 98,
"total_loss": 0.7889814376831055,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:22.162410"
}
total tokens: 6978 num samples: 6 num padding tokens: 878 - rank: 3 max len: 1163 min len: 953 avg len: 1016.6666666666666 num_loss_counted_tokens: 3406
total tokens: 5496 num samples: 24 num padding tokens: 2060 - rank: 7 max len: 229 min len: 84 avg len: 143.16666666666666 num_loss_counted_tokens: 1259
total tokens: 8074 num samples: 11 num padding tokens: 921 - rank: 5 max len: 734 min len: 585 avg len: 650.2727272727273 num_loss_counted_tokens: 3399
total tokens: 8100 num samples: 15 num padding tokens: 2428 - rank: 6 max len: 540 min len: 251 avg len: 378.1333333333333 num_loss_counted_tokens: 2866
total tokens: 7430 num samples: 5 num padding tokens: 440 - rank: 2 max len: 1486 min len: 1279 avg len: 1398.0 num_loss_counted_tokens: 2676
total tokens: 7548 num samples: 2 num padding tokens: 467 - rank: 0 max len: 3774 min len: 3307 avg len: 3540.5 num_loss_counted_tokens: 178
Per-token loss scaled by world size: 0.00038093223702162504
Per-token loss scaled by world size: 0.0003549058164935559
Per-token loss scaled by world size: 0.0002593057288322598
Per-token loss scaled by world size: 0.0002407356078037992
Per-token loss scaled by world size: 0.0001624725991860032
Per-token loss scaled by world size: 9.274062176700681e-05
Per-token loss scaled by world size: 3.2670384825905785e-05
Epoch: 1, Step: 190, Rank: 6, loss = 1.1850801706314087
Epoch: 1, Step: 190, Rank: 7, loss = 0.8067001104354858
Epoch: 1, Step: 190, Rank: 4, loss = 1.1041120290756226
Epoch: 1, Step: 190, Rank: 3, loss = 0.7489284873008728
Epoch: 1, Step: 190, Rank: 0, loss = 0.10163756459951401
Epoch: 1, Step: 190, Rank: 1, loss = 0.2885160744190216
Epoch: 1, Step: 190, Rank: 2, loss = 0.5054522752761841
Per-token loss scaled by world size: 0.00030219164909794927
Epoch: 1, Step: 190, Rank: 5, loss = 0.9401181936264038
total tokens: 6304 num samples: 2 num padding tokens: 712 - rank: 1 max len: 3152 min len: 2440 avg len: 2796.0 num_loss_counted_tokens: 243
total tokens: 7343 num samples: 7 num padding tokens: 1218 - rank: 4 max len: 1049 min len: 690 avg len: 875.0 num_loss_counted_tokens: 4702
{
"epoch": 1,
"step": 190,
"rank": 0,
"loss": 0.10163756459951401,
"overall_throughput": 41.867834901558076,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.32686471939087,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24888,
"batch_size": 83,
"total_loss": 0.7100681662559509,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:24.689299"
}
total tokens: 8112 num samples: 12 num padding tokens: 910 - rank: 5 max len: 676 min len: 517 avg len: 600.1666666666666 num_loss_counted_tokens: 4936
total tokens: 4446 num samples: 19 num padding tokens: 1491 - rank: 7 max len: 234 min len: 76 avg len: 155.52631578947367 num_loss_counted_tokens: 1146
total tokens: 7266 num samples: 2 num padding tokens: 342 - rank: 0 max len: 3633 min len: 3291 avg len: 3462.0 num_loss_counted_tokens: 203
total tokens: 7984 num samples: 16 num padding tokens: 1646 - rank: 6 max len: 499 min len: 253 avg len: 396.125 num_loss_counted_tokens: 3681
total tokens: 7490 num samples: 5 num padding tokens: 639 - rank: 2 max len: 1498 min len: 1299 avg len: 1370.2 num_loss_counted_tokens: 3496
total tokens: 7590 num samples: 6 num padding tokens: 448 - rank: 3 max len: 1265 min len: 1117 avg len: 1190.3333333333333 num_loss_counted_tokens: 2752
Per-token loss scaled by world size: 0.0003764858120121062
Per-token loss scaled by world size: 0.0006537719164043665
Per-token loss scaled by world size: 0.00021446413302328438
Per-token loss scaled by world size: 9.457199485041201e-05
Per-token loss scaled by world size: 0.00034635854535736144
Per-token loss scaled by world size: 1.4150586139294319e-05
Per-token loss scaled by world size: 7.414004357997328e-05
Epoch: 1, Step: 191, Rank: 5, loss = 1.6465245485305786
Epoch: 1, Step: 191, Rank: 2, loss = 0.23817956447601318
Epoch: 1, Step: 191, Rank: 3, loss = 0.8723039627075195
Epoch: 1, Step: 191, Rank: 7, loss = 0.9481794834136963
Epoch: 1, Step: 191, Rank: 4, loss = 0.5401279330253601
Epoch: 1, Step: 191, Rank: 0, loss = 0.03563825041055679
Epoch: 1, Step: 191, Rank: 1, loss = 0.18672169744968414
Per-token loss scaled by world size: 0.0005529047921299934
Epoch: 1, Step: 191, Rank: 6, loss = 1.3924907445907593
total tokens: 7998 num samples: 3 num padding tokens: 968 - rank: 1 max len: 2666 min len: 1768 avg len: 2343.3333333333335 num_loss_counted_tokens: 631
total tokens: 7758 num samples: 9 num padding tokens: 1421 - rank: 4 max len: 862 min len: 603 avg len: 704.1111111111111 num_loss_counted_tokens: 3783
{
"epoch": 1,
"step": 191,
"rank": 0,
"loss": 0.03563825041055679,
"overall_throughput": 41.91096199447346,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.354421615600586,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20148,
"batch_size": 73,
"total_loss": 0.7325208187103271,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:27.214193"
}
total tokens: 7946 num samples: 29 num padding tokens: 2882 - rank: 7 max len: 274 min len: 81 avg len: 174.6206896551724 num_loss_counted_tokens: 2096
total tokens: 8094 num samples: 19 num padding tokens: 1335 - rank: 6 max len: 426 min len: 280 avg len: 355.7368421052632 num_loss_counted_tokens: 4052
total tokens: 8118 num samples: 6 num padding tokens: 1664 - rank: 3 max len: 1353 min len: 917 avg len: 1075.6666666666667 num_loss_counted_tokens: 3173
total tokens: 7826 num samples: 13 num padding tokens: 1193 - rank: 5 max len: 602 min len: 428 avg len: 510.2307692307692 num_loss_counted_tokens: 2941
total tokens: 7920 num samples: 5 num padding tokens: 502 - rank: 2 max len: 1584 min len: 1373 avg len: 1483.6 num_loss_counted_tokens: 2395
total tokens: 7424 num samples: 2 num padding tokens: 728 - rank: 0 max len: 3712 min len: 2984 avg len: 3348.0 num_loss_counted_tokens: 259
Per-token loss scaled by world size: 0.00023188829072751105
Per-token loss scaled by world size: 0.00031627173302695155
Per-token loss scaled by world size: 0.0003686068521346897
Per-token loss scaled by world size: 0.0001379517198074609
Per-token loss scaled by world size: 0.00017684763588476926
Per-token loss scaled by world size: 3.2488858323631575e-06
Per-token loss scaled by world size: 3.9887581806397066e-05
Epoch: 1, Step: 192, Rank: 6, loss = 1.147979974746704
Epoch: 1, Step: 192, Rank: 4, loss = 0.42963337898254395
Epoch: 1, Step: 192, Rank: 0, loss = 0.010118248872458935
Epoch: 1, Step: 192, Rank: 2, loss = 0.9849887490272522
Epoch: 1, Step: 192, Rank: 3, loss = 0.7221871018409729
Epoch: 1, Step: 192, Rank: 7, loss = 0.5507698655128479
Epoch: 1, Step: 192, Rank: 1, loss = 0.12422488629817963
Per-token loss scaled by world size: 0.00024023951846174896
Epoch: 1, Step: 192, Rank: 5, loss = 0.7481959462165833
total tokens: 6840 num samples: 3 num padding tokens: 728 - rank: 1 max len: 2280 min len: 1767 avg len: 2037.3333333333333 num_loss_counted_tokens: 324
total tokens: 8030 num samples: 10 num padding tokens: 837 - rank: 4 max len: 803 min len: 661 avg len: 719.3 num_loss_counted_tokens: 4290
{
"epoch": 1,
"step": 192,
"rank": 0,
"loss": 0.010118248872458935,
"overall_throughput": 42.505572644428966,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.430901527404785,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24915,
"batch_size": 77,
"total_loss": 0.589762270450592,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:29.701973"
}
total tokens: 7060 num samples: 5 num padding tokens: 1273 - rank: 2 max len: 1412 min len: 992 avg len: 1157.4 num_loss_counted_tokens: 3150
total tokens: 7864 num samples: 8 num padding tokens: 738 - rank: 3 max len: 983 min len: 809 avg len: 890.75 num_loss_counted_tokens: 4284
total tokens: 7627 num samples: 29 num padding tokens: 3124 - rank: 7 max len: 263 min len: 85 avg len: 155.27586206896552 num_loss_counted_tokens: 1771
total tokens: 7680 num samples: 12 num padding tokens: 957 - rank: 5 max len: 640 min len: 482 avg len: 560.25 num_loss_counted_tokens: 4379
total tokens: 7837 num samples: 17 num padding tokens: 1825 - rank: 6 max len: 461 min len: 273 avg len: 353.6470588235294 num_loss_counted_tokens: 2956
total tokens: 7128 num samples: 2 num padding tokens: 1238 - rank: 0 max len: 3564 min len: 2326 avg len: 2945.0 num_loss_counted_tokens: 463
Per-token loss scaled by world size: 0.00031890295213088393
Per-token loss scaled by world size: 0.0001828969834605232
Per-token loss scaled by world size: 0.0001293038367293775
Per-token loss scaled by world size: 0.00029604701558128
Per-token loss scaled by world size: 0.00024164760543499142
Per-token loss scaled by world size: 0.00021057862613815814
Per-token loss scaled by world size: 3.072937033721246e-05
Epoch: 1, Step: 193, Rank: 2, loss = 0.4273168742656708
Epoch: 1, Step: 193, Rank: 6, loss = 0.9783613681793213
Epoch: 1, Step: 193, Rank: 1, loss = 0.6044288277626038
Epoch: 1, Step: 193, Rank: 5, loss = 1.0538945198059082
Epoch: 1, Step: 193, Rank: 7, loss = 0.7985849380493164
Epoch: 1, Step: 193, Rank: 0, loss = 0.10155288130044937
Epoch: 1, Step: 193, Rank: 4, loss = 0.6959097385406494
Per-token loss scaled by world size: 0.00022084206284489483
Epoch: 1, Step: 193, Rank: 3, loss = 0.7298278212547302
total tokens: 6780 num samples: 5 num padding tokens: 682 - rank: 4 max len: 1356 min len: 1000 avg len: 1219.6 num_loss_counted_tokens: 3729
total tokens: 7341 num samples: 3 num padding tokens: 810 - rank: 1 max len: 2447 min len: 1943 avg len: 2177.0 num_loss_counted_tokens: 1272
{
"epoch": 1,
"step": 193,
"rank": 0,
"loss": 0.10155288130044937,
"overall_throughput": 41.24658740563778,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.264228343963623,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26438,
"batch_size": 78,
"total_loss": 0.6737346053123474,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:32.267001"
}
total tokens: 7935 num samples: 15 num padding tokens: 1362 - rank: 6 max len: 529 min len: 333 avg len: 438.2 num_loss_counted_tokens: 4416
total tokens: 7840 num samples: 8 num padding tokens: 1773 - rank: 5 max len: 980 min len: 597 avg len: 758.375 num_loss_counted_tokens: 3956
total tokens: 5744 num samples: 2 num padding tokens: 30 - rank: 0 max len: 2872 min len: 2842 avg len: 2857.0 num_loss_counted_tokens: 152
total tokens: 7596 num samples: 4 num padding tokens: 617 - rank: 2 max len: 1899 min len: 1609 avg len: 1744.75 num_loss_counted_tokens: 1330
total tokens: 7644 num samples: 26 num padding tokens: 3101 - rank: 7 max len: 294 min len: 79 avg len: 174.73076923076923 num_loss_counted_tokens: 1861
total tokens: 7405 num samples: 5 num padding tokens: 157 - rank: 3 max len: 1481 min len: 1402 avg len: 1449.6 num_loss_counted_tokens: 2757
Per-token loss scaled by world size: 0.00030297457124106586
Per-token loss scaled by world size: 0.000216660147998482
Per-token loss scaled by world size: 0.00037809842615388334
Per-token loss scaled by world size: 0.00046558064059354365
Per-token loss scaled by world size: 0.00023001583758741617
Per-token loss scaled by world size: 6.362871499732137e-05
Per-token loss scaled by world size: 1.6167678040801547e-05
Epoch: 1, Step: 194, Rank: 5, loss = 1.3895835876464844
Epoch: 1, Step: 194, Rank: 6, loss = 0.9042654633522034
Epoch: 1, Step: 194, Rank: 4, loss = 1.1284819841384888
Epoch: 1, Step: 194, Rank: 1, loss = 0.18990786373615265
Epoch: 1, Step: 194, Rank: 7, loss = 0.6466493010520935
Epoch: 1, Step: 194, Rank: 3, loss = 0.6865110397338867
Epoch: 1, Step: 194, Rank: 0, loss = 0.048254456371068954
Per-token loss scaled by world size: 0.0002928127069026232
Epoch: 1, Step: 194, Rank: 2, loss = 0.873936116695404
total tokens: 7677 num samples: 3 num padding tokens: 483 - rank: 1 max len: 2559 min len: 2205 avg len: 2398.0 num_loss_counted_tokens: 1070
total tokens: 7784 num samples: 8 num padding tokens: 1598 - rank: 4 max len: 973 min len: 692 avg len: 773.25 num_loss_counted_tokens: 3677
{
"epoch": 1,
"step": 194,
"rank": 0,
"loss": 0.048254456371068954,
"overall_throughput": 41.96848339305704,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.232909202575684,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23877,
"batch_size": 72,
"total_loss": 0.7334486842155457,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:34.786179"
}
total tokens: 6501 num samples: 3 num padding tokens: 1290 - rank: 2 max len: 2167 min len: 1410 avg len: 1737.0 num_loss_counted_tokens: 241
total tokens: 7221 num samples: 29 num padding tokens: 2731 - rank: 7 max len: 249 min len: 78 avg len: 154.82758620689654 num_loss_counted_tokens: 1846
total tokens: 8080 num samples: 20 num padding tokens: 1685 - rank: 6 max len: 404 min len: 251 avg len: 319.75 num_loss_counted_tokens: 3499
total tokens: 6885 num samples: 5 num padding tokens: 603 - rank: 3 max len: 1377 min len: 1090 avg len: 1256.4 num_loss_counted_tokens: 2165
total tokens: 7980 num samples: 12 num padding tokens: 1568 - rank: 5 max len: 665 min len: 429 avg len: 534.3333333333334 num_loss_counted_tokens: 4731
total tokens: 6814 num samples: 2 num padding tokens: 843 - rank: 0 max len: 3407 min len: 2564 avg len: 2985.5 num_loss_counted_tokens: 553
Per-token loss scaled by world size: 0.00011609335342654958
Per-token loss scaled by world size: 0.00037096577580086887
Per-token loss scaled by world size: 0.000276279344689101
Per-token loss scaled by world size: 0.0002262169582536444
Per-token loss scaled by world size: 0.0004427096282597631
Per-token loss scaled by world size: 0.00033836739021353424
Per-token loss scaled by world size: 1.8575705325929448e-05
Epoch: 1, Step: 195, Rank: 4, loss = 0.9719303250312805
Epoch: 1, Step: 195, Rank: 7, loss = 0.7238518595695496
Epoch: 1, Step: 195, Rank: 2, loss = 0.3041645884513855
Epoch: 1, Step: 195, Rank: 0, loss = 0.04866834729909897
Epoch: 1, Step: 195, Rank: 1, loss = 0.5926884412765503
Epoch: 1, Step: 195, Rank: 5, loss = 0.8865225911140442
Epoch: 1, Step: 195, Rank: 6, loss = 1.1598992347717285
Per-token loss scaled by world size: 0.00017340479826088995
Epoch: 1, Step: 195, Rank: 3, loss = 0.4543205797672272
total tokens: 6090 num samples: 3 num padding tokens: 726 - rank: 1 max len: 2030 min len: 1643 avg len: 1788.0 num_loss_counted_tokens: 1593
total tokens: 7690 num samples: 10 num padding tokens: 460 - rank: 4 max len: 769 min len: 680 avg len: 723.0 num_loss_counted_tokens: 4881
{
"epoch": 1,
"step": 195,
"rank": 0,
"loss": 0.04866834729909897,
"overall_throughput": 41.3547409747129,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.354421615600586,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20960,
"batch_size": 83,
"total_loss": 0.6427558064460754,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:37.382964"
}
total tokens: 7967 num samples: 31 num padding tokens: 2761 - rank: 7 max len: 257 min len: 83 avg len: 167.93548387096774 num_loss_counted_tokens: 2276
total tokens: 6960 num samples: 5 num padding tokens: 813 - rank: 2 max len: 1392 min len: 1089 avg len: 1229.4 num_loss_counted_tokens: 1530
total tokens: 7856 num samples: 16 num padding tokens: 1984 - rank: 6 max len: 491 min len: 260 avg len: 367.0 num_loss_counted_tokens: 3780
total tokens: 7602 num samples: 7 num padding tokens: 1100 - rank: 3 max len: 1086 min len: 795 avg len: 928.8571428571429 num_loss_counted_tokens: 5015
total tokens: 7469 num samples: 11 num padding tokens: 1062 - rank: 5 max len: 679 min len: 515 avg len: 582.4545454545455 num_loss_counted_tokens: 4102
total tokens: 6152 num samples: 2 num padding tokens: 117 - rank: 0 max len: 3076 min len: 2959 avg len: 3017.5 num_loss_counted_tokens: 200
Per-token loss scaled by world size: 0.00046574202133342624
Per-token loss scaled by world size: 2.0194725038891193e-06
Per-token loss scaled by world size: 0.0005153888487257063
Per-token loss scaled by world size: 0.0002995604299940169
Per-token loss scaled by world size: 0.0004275553219486028
Per-token loss scaled by world size: 3.1239229429047555e-05
Per-token loss scaled by world size: 3.525143984006718e-05
Epoch: 1, Step: 196, Rank: 6, loss = 1.3931604623794556
Epoch: 1, Step: 196, Rank: 3, loss = 1.2589589357376099
Epoch: 1, Step: 196, Rank: 0, loss = 0.005458886735141277
Epoch: 1, Step: 196, Rank: 4, loss = 0.8097493052482605
Epoch: 1, Step: 196, Rank: 2, loss = 0.08444354683160782
Epoch: 1, Step: 196, Rank: 1, loss = 0.09528905153274536
Epoch: 1, Step: 196, Rank: 7, loss = 1.1557354927062988
Per-token loss scaled by world size: 0.0005320304771885276
Epoch: 1, Step: 196, Rank: 5, loss = 1.4381449222564697
total tokens: 6570 num samples: 3 num padding tokens: 733 - rank: 1 max len: 2190 min len: 1806 avg len: 1945.6666666666667 num_loss_counted_tokens: 897
total tokens: 8064 num samples: 9 num padding tokens: 1305 - rank: 4 max len: 896 min len: 651 avg len: 751.0 num_loss_counted_tokens: 4732
{
"epoch": 1,
"step": 196,
"rank": 0,
"loss": 0.005458886735141277,
"overall_throughput": 41.48583814053106,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.379756450653076,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21625,
"batch_size": 81,
"total_loss": 0.7801175117492676,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:39.894850"
}
total tokens: 7520 num samples: 5 num padding tokens: 1864 - rank: 3 max len: 1504 min len: 905 avg len: 1131.2 num_loss_counted_tokens: 2489
total tokens: 7856 num samples: 16 num padding tokens: 1953 - rank: 6 max len: 491 min len: 301 avg len: 368.9375 num_loss_counted_tokens: 3951
total tokens: 7020 num samples: 4 num padding tokens: 441 - rank: 2 max len: 1755 min len: 1533 avg len: 1644.75 num_loss_counted_tokens: 2927
total tokens: 7965 num samples: 27 num padding tokens: 3021 - rank: 7 max len: 295 min len: 88 avg len: 183.11111111111111 num_loss_counted_tokens: 2199
total tokens: 7226 num samples: 2 num padding tokens: 1285 - rank: 0 max len: 3613 min len: 2328 avg len: 2970.5 num_loss_counted_tokens: 471
total tokens: 7764 num samples: 12 num padding tokens: 807 - rank: 5 max len: 647 min len: 495 avg len: 579.75 num_loss_counted_tokens: 4070
Per-token loss scaled by world size: 0.0001413007703376934
Per-token loss scaled by world size: 0.0003590704873204231
Per-token loss scaled by world size: 0.000253127800533548
Per-token loss scaled by world size: 0.00044859678018838167
Per-token loss scaled by world size: 7.146921416278929e-05
Per-token loss scaled by world size: 0.0002180417359340936
Per-token loss scaled by world size: 9.443299404665595e-07
Epoch: 1, Step: 197, Rank: 5, loss = 1.1000573635101318
Epoch: 1, Step: 197, Rank: 2, loss = 0.4328925609588623
Epoch: 1, Step: 197, Rank: 3, loss = 0.7754886150360107
Epoch: 1, Step: 197, Rank: 1, loss = 0.21895486116409302
Epoch: 1, Step: 197, Rank: 4, loss = 1.374332308769226
Epoch: 1, Step: 197, Rank: 7, loss = 0.6679981350898743
Epoch: 1, Step: 197, Rank: 0, loss = 0.002893072785809636
Per-token loss scaled by world size: 0.00030740915099158883
Epoch: 1, Step: 197, Rank: 6, loss = 0.9417863488197327
total tokens: 6920 num samples: 4 num padding tokens: 227 - rank: 1 max len: 1730 min len: 1622 avg len: 1673.25 num_loss_counted_tokens: 2502
total tokens: 7749 num samples: 9 num padding tokens: 674 - rank: 4 max len: 861 min len: 729 avg len: 786.1111111111111 num_loss_counted_tokens: 3445
{
"epoch": 1,
"step": 197,
"rank": 0,
"loss": 0.002893072785809636,
"overall_throughput": 41.993274454566276,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.219207286834717,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 24509,
"batch_size": 94,
"total_loss": 0.6893004179000854,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:42.416522"
}
total tokens: 7620 num samples: 15 num padding tokens: 2321 - rank: 6 max len: 508 min len: 250 avg len: 353.26666666666665 num_loss_counted_tokens: 3281
total tokens: 7766 num samples: 11 num padding tokens: 899 - rank: 5 max len: 706 min len: 532 avg len: 624.2727272727273 num_loss_counted_tokens: 4764
total tokens: 5782 num samples: 2 num padding tokens: 1020 - rank: 0 max len: 2891 min len: 1871 avg len: 2381.0 num_loss_counted_tokens: 1847
total tokens: 8015 num samples: 5 num padding tokens: 1364 - rank: 2 max len: 1603 min len: 1206 avg len: 1330.2 num_loss_counted_tokens: 1802
total tokens: 7968 num samples: 32 num padding tokens: 2753 - rank: 7 max len: 249 min len: 81 avg len: 162.96875 num_loss_counted_tokens: 2041
total tokens: 8008 num samples: 7 num padding tokens: 830 - rank: 3 max len: 1144 min len: 885 avg len: 1025.4285714285713 num_loss_counted_tokens: 4315
Per-token loss scaled by world size: 0.00019664198043756187
Per-token loss scaled by world size: 0.0002436544600641355
Per-token loss scaled by world size: 7.737488886050414e-06
Per-token loss scaled by world size: 0.0004965663538314402
Per-token loss scaled by world size: 5.262534159555798e-06
Per-token loss scaled by world size: 0.000477502413559705
Per-token loss scaled by world size: 0.00032058294164016843
Epoch: 1, Step: 198, Rank: 3, loss = 0.611785888671875
Epoch: 1, Step: 198, Rank: 0, loss = 0.019427867606282234
Epoch: 1, Step: 198, Rank: 6, loss = 1.2468160390853882
Epoch: 1, Step: 198, Rank: 2, loss = 0.4937434196472168
Epoch: 1, Step: 198, Rank: 1, loss = 0.013213565573096275
Epoch: 1, Step: 198, Rank: 4, loss = 1.198948860168457
Epoch: 1, Step: 198, Rank: 7, loss = 0.8049436807632446
Per-token loss scaled by world size: 0.0005729582044295967
Epoch: 1, Step: 198, Rank: 5, loss = 1.4386264085769653
total tokens: 7452 num samples: 9 num padding tokens: 706 - rank: 4 max len: 828 min len: 701 avg len: 749.5555555555555 num_loss_counted_tokens: 4285
total tokens: 7940 num samples: 5 num padding tokens: 529 - rank: 1 max len: 1588 min len: 1397 avg len: 1482.2 num_loss_counted_tokens: 3826
total tokens: 7634 num samples: 11 num padding tokens: 983 - rank: 5 max len: 694 min len: 488 avg len: 604.6363636363636 num_loss_counted_tokens: 4895
{
"epoch": 1,
"step": 198,
"rank": 0,
"loss": 0.019427867606282234,
"overall_throughput": 41.69130867994642,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.38558578491211,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20087,
"batch_size": 77,
"total_loss": 0.7284382581710815,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:44.992595"
}
total tokens: 7712 num samples: 16 num padding tokens: 2400 - rank: 6 max len: 482 min len: 260 avg len: 332.0 num_loss_counted_tokens: 2891
total tokens: 7206 num samples: 3 num padding tokens: 834 - rank: 0 max len: 2402 min len: 1692 avg len: 2124.0 num_loss_counted_tokens: 304
total tokens: 7511 num samples: 29 num padding tokens: 2662 - rank: 7 max len: 259 min len: 84 avg len: 167.20689655172413 num_loss_counted_tokens: 1997
total tokens: 7902 num samples: 6 num padding tokens: 937 - rank: 2 max len: 1317 min len: 1043 avg len: 1160.8333333333333 num_loss_counted_tokens: 5167
total tokens: 7928 num samples: 8 num padding tokens: 746 - rank: 3 max len: 991 min len: 833 avg len: 897.75 num_loss_counted_tokens: 4239
Per-token loss scaled by world size: 0.00035409454721957445
Per-token loss scaled by world size: 0.00035435750032775104
Per-token loss scaled by world size: 0.00030246065580286086
Per-token loss scaled by world size: 9.443990165891591e-06
Per-token loss scaled by world size: 1.890566181828035e-06
Per-token loss scaled by world size: 0.00016794257680885494
Per-token loss scaled by world size: 0.00030859385151416063
Epoch: 1, Step: 199, Rank: 3, loss = 0.9465774893760681
Epoch: 1, Step: 199, Rank: 1, loss = 0.0050501748919487
Epoch: 1, Step: 199, Rank: 0, loss = 0.0252272579818964
Epoch: 1, Step: 199, Rank: 2, loss = 0.8079480528831482
Epoch: 1, Step: 199, Rank: 4, loss = 0.9458750486373901
Epoch: 1, Step: 199, Rank: 5, loss = 0.8243313431739807
Epoch: 1, Step: 199, Rank: 7, loss = 0.4486165940761566
Per-token loss scaled by world size: 0.00041021130164153874
Epoch: 1, Step: 199, Rank: 6, loss = 1.095776915550232
total tokens: 7592 num samples: 13 num padding tokens: 445 - rank: 4 max len: 584 min len: 509 avg len: 549.7692307692307 num_loss_counted_tokens: 5215
total tokens: 6915 num samples: 5 num padding tokens: 522 - rank: 1 max len: 1383 min len: 1143 avg len: 1278.6 num_loss_counted_tokens: 1111
total tokens: 8064 num samples: 36 num padding tokens: 2321 - rank: 7 max len: 224 min len: 71 avg len: 159.52777777777777 num_loss_counted_tokens: 2309
{
"epoch": 1,
"step": 199,
"rank": 0,
"loss": 0.0252272579818964,
"overall_throughput": 42.03544117506762,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.382384777069092,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21370,
"batch_size": 70,
"total_loss": 0.6374253630638123,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:47.471975"
}
total tokens: 7920 num samples: 16 num padding tokens: 1517 - rank: 5 max len: 495 min len: 348 avg len: 400.1875 num_loss_counted_tokens: 4230
total tokens: 7693 num samples: 7 num padding tokens: 1201 - rank: 2 max len: 1099 min len: 763 avg len: 927.4285714285714 num_loss_counted_tokens: 3819
total tokens: 7610 num samples: 10 num padding tokens: 880 - rank: 3 max len: 761 min len: 590 avg len: 673.0 num_loss_counted_tokens: 4994
total tokens: 8004 num samples: 23 num padding tokens: 1596 - rank: 6 max len: 348 min len: 229 avg len: 278.60869565217394 num_loss_counted_tokens: 3462
total tokens: 7608 num samples: 3 num padding tokens: 1669 - rank: 0 max len: 2536 min len: 1608 avg len: 1979.6666666666667 num_loss_counted_tokens: 938
Per-token loss scaled by world size: 0.0003396017709746957
Per-token loss scaled by world size: 0.0003830210189335048
Per-token loss scaled by world size: 0.000320710358209908
Per-token loss scaled by world size: 0.0005103153525851667
Per-token loss scaled by world size: 0.0006997347227297723
Per-token loss scaled by world size: 1.8154447616325342e-06
Epoch: 1, Step: 200, Rank: 6, loss = 0.8558957576751709
Epoch: 1, Step: 200, Rank: 3, loss = 0.9063122272491455
Epoch: 1, Step: 200, Rank: 2, loss = 1.022187352180481
Epoch: 1, Step: 200, Rank: 5, loss = 1.3619040250778198
Epoch: 1, Step: 200, Rank: 4, loss = 1.8674170970916748
Per-token loss scaled by world size: 3.414201637497172e-05
Epoch: 1, Step: 200, Rank: 0, loss = 0.0048449682071805
Per-token loss scaled by world size: 5.252029586699791e-05
Epoch: 1, Step: 200, Rank: 7, loss = 0.09111650288105011
Epoch: 1, Step: 200, Rank: 1, loss = 0.14016354084014893
total tokens: 7851 num samples: 3 num padding tokens: 1199 - rank: 1 max len: 2617 min len: 1961 avg len: 2217.3333333333335 num_loss_counted_tokens: 624
total tokens: 8019 num samples: 27 num padding tokens: 2910 - rank: 7 max len: 297 min len: 79 avg len: 189.22222222222223 num_loss_counted_tokens: 2220
total tokens: 7539 num samples: 7 num padding tokens: 949 - rank: 4 max len: 1077 min len: 826 avg len: 941.4285714285714 num_loss_counted_tokens: 4098
{
"epoch": 1,
"step": 200,
"rank": 0,
"loss": 0.0048449682071805,
"overall_throughput": 41.06569124957365,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.254440784454346,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21350,
"batch_size": 73,
"total_loss": 0.7812302112579346,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:50.045786"
}
total tokens: 7458 num samples: 11 num padding tokens: 2562 - rank: 6 max len: 678 min len: 317 avg len: 445.09090909090907 num_loss_counted_tokens: 2861
total tokens: 7362 num samples: 9 num padding tokens: 766 - rank: 5 max len: 818 min len: 687 avg len: 732.8888888888889 num_loss_counted_tokens: 4108
total tokens: 7554 num samples: 6 num padding tokens: 667 - rank: 3 max len: 1259 min len: 1083 avg len: 1147.8333333333333 num_loss_counted_tokens: 931
total tokens: 5794 num samples: 2 num padding tokens: 43 - rank: 0 max len: 2897 min len: 2854 avg len: 2875.5 num_loss_counted_tokens: 721
total tokens: 7444 num samples: 4 num padding tokens: 878 - rank: 2 max len: 1861 min len: 1304 avg len: 1641.5 num_loss_counted_tokens: 1667
Per-token loss scaled by world size: 0.0003313705965410918
Per-token loss scaled by world size: 0.00037777406396344304
Per-token loss scaled by world size: 0.00013144082913640887
Per-token loss scaled by world size: 0.00039307758561335504
Per-token loss scaled by world size: 0.00033154338598251343
Per-token loss scaled by world size: 1.7019568986142986e-05
Per-token loss scaled by world size: 5.37650066689821e-06
Epoch: 1, Step: 201, Rank: 6, loss = 0.9256009459495544
Epoch: 1, Step: 201, Rank: 2, loss = 0.3671470880508423
Epoch: 1, Step: 201, Rank: 0, loss = 0.04753991216421127
Epoch: 1, Step: 201, Rank: 3, loss = 1.0552173852920532
Epoch: 1, Step: 201, Rank: 4, loss = 1.0979639291763306
Epoch: 1, Step: 201, Rank: 7, loss = 0.9260835647583008
Epoch: 1, Step: 201, Rank: 1, loss = 0.015017910860478878
Per-token loss scaled by world size: 0.0004141141544096172
Epoch: 1, Step: 201, Rank: 5, loss = 1.1567243337631226
total tokens: 7734 num samples: 3 num padding tokens: 874 - rank: 1 max len: 2578 min len: 2141 avg len: 2286.6666666666665 num_loss_counted_tokens: 217
total tokens: 7511 num samples: 7 num padding tokens: 887 - rank: 4 max len: 1073 min len: 833 avg len: 946.2857142857143 num_loss_counted_tokens: 5956
{
"epoch": 1,
"step": 201,
"rank": 0,
"loss": 0.04753991216421127,
"overall_throughput": 42.99976331813581,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.05379819869995,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 22346,
"batch_size": 78,
"total_loss": 0.6989118456840515,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:52.499632"
}
total tokens: 6544 num samples: 2 num padding tokens: 303 - rank: 0 max len: 3272 min len: 2969 avg len: 3120.5 num_loss_counted_tokens: 203
total tokens: 8040 num samples: 15 num padding tokens: 2803 - rank: 6 max len: 536 min len: 243 avg len: 349.1333333333333 num_loss_counted_tokens: 2637
total tokens: 6354 num samples: 3 num padding tokens: 420 - rank: 2 max len: 2118 min len: 1749 avg len: 1978.0 num_loss_counted_tokens: 450
total tokens: 7452 num samples: 9 num padding tokens: 1025 - rank: 5 max len: 828 min len: 555 avg len: 714.1111111111111 num_loss_counted_tokens: 3110
total tokens: 7005 num samples: 5 num padding tokens: 460 - rank: 3 max len: 1401 min len: 1187 avg len: 1309.0 num_loss_counted_tokens: 3212
total tokens: 6902 num samples: 29 num padding tokens: 3026 - rank: 7 max len: 238 min len: 78 avg len: 133.6551724137931 num_loss_counted_tokens: 1459
Per-token loss scaled by world size: 0.00022836425341665745
Per-token loss scaled by world size: 0.0001815208961488679
Per-token loss scaled by world size: 8.564612653572112e-05
Per-token loss scaled by world size: 0.0004483639495447278
Per-token loss scaled by world size: 0.00014287869271356612
Per-token loss scaled by world size: 0.0001819162571337074
Per-token loss scaled by world size: 0.0001619806425878778
Epoch: 1, Step: 202, Rank: 2, loss = 0.5714277625083923
Epoch: 1, Step: 202, Rank: 5, loss = 1.411449670791626
Epoch: 1, Step: 202, Rank: 1, loss = 0.26961401104927063
Epoch: 1, Step: 202, Rank: 3, loss = 0.7188906669616699
Epoch: 1, Step: 202, Rank: 0, loss = 0.4497821033000946
Epoch: 1, Step: 202, Rank: 4, loss = 0.5726723670959473
Epoch: 1, Step: 202, Rank: 7, loss = 0.5099150538444519
Per-token loss scaled by world size: 0.0003626368416007608
Epoch: 1, Step: 202, Rank: 6, loss = 1.1415808200836182
total tokens: 7960 num samples: 4 num padding tokens: 506 - rank: 1 max len: 1990 min len: 1681 avg len: 1863.5 num_loss_counted_tokens: 885
total tokens: 8028 num samples: 9 num padding tokens: 771 - rank: 4 max len: 892 min len: 741 avg len: 806.3333333333334 num_loss_counted_tokens: 4521
{
"epoch": 1,
"step": 202,
"rank": 0,
"loss": 0.4497821033000946,
"overall_throughput": 41.605158110193564,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.336650848388672,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25184,
"batch_size": 99,
"total_loss": 0.7056666016578674,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:55.026480"
}
total tokens: 8040 num samples: 30 num padding tokens: 2761 - rank: 7 max len: 268 min len: 77 avg len: 175.96666666666667 num_loss_counted_tokens: 2391
total tokens: 8090 num samples: 5 num padding tokens: 1606 - rank: 2 max len: 1618 min len: 1055 avg len: 1296.8 num_loss_counted_tokens: 2154
total tokens: 7315 num samples: 7 num padding tokens: 483 - rank: 3 max len: 1045 min len: 932 avg len: 976.0 num_loss_counted_tokens: 4783
total tokens: 7920 num samples: 16 num padding tokens: 1848 - rank: 6 max len: 495 min len: 269 avg len: 379.5 num_loss_counted_tokens: 3451
total tokens: 7579 num samples: 11 num padding tokens: 1108 - rank: 5 max len: 689 min len: 497 avg len: 588.2727272727273 num_loss_counted_tokens: 4657
total tokens: 7665 num samples: 3 num padding tokens: 442 - rank: 0 max len: 2555 min len: 2201 avg len: 2407.6666666666665 num_loss_counted_tokens: 2954
Per-token loss scaled by world size: 0.0001610679319128394
Per-token loss scaled by world size: 0.000670548586640507
Per-token loss scaled by world size: 5.642953510687221e-06
Per-token loss scaled by world size: 0.0005202414467930794
Per-token loss scaled by world size: 0.00042746320832520723
Per-token loss scaled by world size: 1.2397128557495307e-05
Per-token loss scaled by world size: 0.0004940929939039052
Epoch: 1, Step: 203, Rank: 5, loss = 1.5487158298492432
Epoch: 1, Step: 203, Rank: 2, loss = 0.37200650572776794
Epoch: 1, Step: 203, Rank: 0, loss = 0.013033106923103333
Epoch: 1, Step: 203, Rank: 6, loss = 1.2015626430511475
Epoch: 1, Step: 203, Rank: 1, loss = 0.028632719069719315
Epoch: 1, Step: 203, Rank: 4, loss = 1.141169548034668
Epoch: 1, Step: 203, Rank: 7, loss = 0.9872797131538391
Per-token loss scaled by world size: 0.00014795419701840729
Epoch: 1, Step: 203, Rank: 3, loss = 0.3417187035083771
total tokens: 5726 num samples: 2 num padding tokens: 666 - rank: 1 max len: 2863 min len: 2197 avg len: 2530.0 num_loss_counted_tokens: 483
total tokens: 7280 num samples: 8 num padding tokens: 395 - rank: 4 max len: 910 min len: 794 avg len: 860.625 num_loss_counted_tokens: 4783
{
"epoch": 1,
"step": 203,
"rank": 0,
"loss": 0.013033106923103333,
"overall_throughput": 41.573804652467835,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.309247970581055,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18477,
"batch_size": 80,
"total_loss": 0.7042648792266846,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:56:57.572853"
}
total tokens: 7865 num samples: 13 num padding tokens: 1975 - rank: 6 max len: 605 min len: 321 avg len: 453.0769230769231 num_loss_counted_tokens: 4477
total tokens: 7068 num samples: 4 num padding tokens: 740 - rank: 2 max len: 1767 min len: 1462 avg len: 1582.0 num_loss_counted_tokens: 1481
total tokens: 6864 num samples: 22 num padding tokens: 2832 - rank: 7 max len: 312 min len: 80 avg len: 183.27272727272728 num_loss_counted_tokens: 1790
total tokens: 7750 num samples: 10 num padding tokens: 711 - rank: 5 max len: 775 min len: 654 avg len: 703.9 num_loss_counted_tokens: 2490
total tokens: 7506 num samples: 6 num padding tokens: 1303 - rank: 3 max len: 1251 min len: 942 avg len: 1033.8333333333333 num_loss_counted_tokens: 3894
total tokens: 7992 num samples: 2 num padding tokens: 51 - rank: 0 max len: 3996 min len: 3945 avg len: 3970.5 num_loss_counted_tokens: 176
Per-token loss scaled by world size: 0.0005467137671075761
Per-token loss scaled by world size: 0.0001344903139397502
Per-token loss scaled by world size: 2.761472160273115e-06
Per-token loss scaled by world size: 0.0002821572998072952
Per-token loss scaled by world size: 0.00027734090690501034
Per-token loss scaled by world size: 4.0628190618008375e-05
Epoch: 1, Step: 204, Rank: 2, loss = 0.3588033616542816
Epoch: 1, Step: 204, Rank: 5, loss = 1.4585639238357544
Epoch: 1, Step: 204, Rank: 0, loss = 0.007367262616753578
Epoch: 1, Step: 204, Rank: 4, loss = 0.7527604103088379
Per-token loss scaled by world size: 0.00022612858447246253
Epoch: 1, Step: 204, Rank: 1, loss = 0.1083909347653389
Epoch: 1, Step: 204, Rank: 3, loss = 0.7399108409881592
Per-token loss scaled by world size: 0.0005020878161303699
Epoch: 1, Step: 204, Rank: 7, loss = 0.6032828092575073
Epoch: 1, Step: 204, Rank: 6, loss = 1.3395075798034668
total tokens: 6300 num samples: 3 num padding tokens: 1167 - rank: 1 max len: 2100 min len: 1431 avg len: 1711.0 num_loss_counted_tokens: 1709
total tokens: 7578 num samples: 9 num padding tokens: 661 - rank: 4 max len: 842 min len: 721 avg len: 768.5555555555555 num_loss_counted_tokens: 5406
{
"epoch": 1,
"step": 204,
"rank": 0,
"loss": 0.007367262616753578,
"overall_throughput": 42.3404581250765,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.448826789855957,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21343,
"batch_size": 69,
"total_loss": 0.6710734367370605,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:57:00.072190"
}
total tokens: 7812 num samples: 28 num padding tokens: 2702 - rank: 7 max len: 279 min len: 93 avg len: 182.5 num_loss_counted_tokens: 2270
total tokens: 8109 num samples: 17 num padding tokens: 1674 - rank: 6 max len: 477 min len: 282 avg len: 378.52941176470586 num_loss_counted_tokens: 3720
total tokens: 7854 num samples: 11 num padding tokens: 1060 - rank: 5 max len: 714 min len: 484 avg len: 617.6363636363636 num_loss_counted_tokens: 5323
total tokens: 5918 num samples: 2 num padding tokens: 56 - rank: 0 max len: 2959 min len: 2903 avg len: 2931.0 num_loss_counted_tokens: 173
total tokens: 7658 num samples: 7 num padding tokens: 462 - rank: 3 max len: 1094 min len: 870 avg len: 1028.0 num_loss_counted_tokens: 4939
total tokens: 8088 num samples: 6 num padding tokens: 933 - rank: 2 max len: 1348 min len: 1098 avg len: 1192.5 num_loss_counted_tokens: 6453
Per-token loss scaled by world size: 0.0004229408223181963
Per-token loss scaled by world size: 6.994641353230691e-06
Per-token loss scaled by world size: 2.8473236852732953e-06
Per-token loss scaled by world size: 0.0008069836185313761
Per-token loss scaled by world size: 0.00010431646660435945
Per-token loss scaled by world size: 0.00042939934064634144
Per-token loss scaled by world size: 0.0006417424301616848
Epoch: 1, Step: 205, Rank: 2, loss = 0.24320079386234283
Epoch: 1, Step: 205, Rank: 6, loss = 1.8813813924789429
Epoch: 1, Step: 205, Rank: 4, loss = 1.0010908842086792
Epoch: 1, Step: 205, Rank: 7, loss = 0.9860336780548096
Epoch: 1, Step: 205, Rank: 0, loss = 0.016307132318615913
Epoch: 1, Step: 205, Rank: 5, loss = 1.4961422681808472
Epoch: 1, Step: 205, Rank: 1, loss = 0.006638179067522287
Per-token loss scaled by world size: 0.00024272690643556416
Epoch: 1, Step: 205, Rank: 3, loss = 0.565887451171875
total tokens: 7544 num samples: 4 num padding tokens: 665 - rank: 1 max len: 1886 min len: 1535 avg len: 1719.75 num_loss_counted_tokens: 918
total tokens: 7810 num samples: 10 num padding tokens: 835 - rank: 4 max len: 781 min len: 634 avg len: 697.5 num_loss_counted_tokens: 3430
{
"epoch": 1,
"step": 205,
"rank": 0,
"loss": 0.016307132318615913,
"overall_throughput": 41.43704192183203,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.372108459472656,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 18651,
"batch_size": 75,
"total_loss": 0.7745852470397949,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:57:02.626787"
}
total tokens: 7680 num samples: 6 num padding tokens: 744 - rank: 2 max len: 1280 min len: 1064 avg len: 1156.0 num_loss_counted_tokens: 6296
total tokens: 6860 num samples: 28 num padding tokens: 2293 - rank: 7 max len: 245 min len: 78 avg len: 163.10714285714286 num_loss_counted_tokens: 1793
total tokens: 8112 num samples: 13 num padding tokens: 1290 - rank: 5 max len: 624 min len: 454 avg len: 524.7692307692307 num_loss_counted_tokens: 5007
total tokens: 8046 num samples: 18 num padding tokens: 1903 - rank: 6 max len: 447 min len: 248 avg len: 341.27777777777777 num_loss_counted_tokens: 3492
total tokens: 5766 num samples: 2 num padding tokens: 993 - rank: 0 max len: 2883 min len: 1890 avg len: 2386.5 num_loss_counted_tokens: 1080
total tokens: 7856 num samples: 8 num padding tokens: 604 - rank: 3 max len: 982 min len: 795 avg len: 906.5 num_loss_counted_tokens: 3800
Per-token loss scaled by world size: 0.0005271086702123284
Per-token loss scaled by world size: 0.00045938679249957204
Per-token loss scaled by world size: 3.3147989597637206e-06
Per-token loss scaled by world size: 2.3308498384722043e-06
Per-token loss scaled by world size: 0.0002847542054951191
Per-token loss scaled by world size: 0.0002816052583511919
Per-token loss scaled by world size: 0.00022615509806200862
Epoch: 1, Step: 206, Rank: 1, loss = 0.005856843199580908
Epoch: 1, Step: 206, Rank: 6, loss = 1.324492335319519
Epoch: 1, Step: 206, Rank: 0, loss = 0.008329261094331741
Epoch: 1, Step: 206, Rank: 2, loss = 0.7076036334037781
Epoch: 1, Step: 206, Rank: 3, loss = 0.7155161499977112
Epoch: 1, Step: 206, Rank: 4, loss = 1.1543241739273071
Epoch: 1, Step: 206, Rank: 7, loss = 0.5682712197303772
Per-token loss scaled by world size: 0.0005236774450168014
Epoch: 1, Step: 206, Rank: 5, loss = 1.3158705234527588
total tokens: 7600 num samples: 5 num padding tokens: 1200 - rank: 4 max len: 1520 min len: 945 avg len: 1280.0 num_loss_counted_tokens: 2473
total tokens: 6054 num samples: 2 num padding tokens: 522 - rank: 1 max len: 3027 min len: 2505 avg len: 2766.0 num_loss_counted_tokens: 167
total tokens: 6924 num samples: 3 num padding tokens: 422 - rank: 2 max len: 2308 min len: 2080 avg len: 2167.3333333333335 num_loss_counted_tokens: 427
{
"epoch": 1,
"step": 206,
"rank": 0,
"loss": 0.008329261094331741,
"overall_throughput": 41.21002850546722,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.469753742218018,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 20102,
"batch_size": 76,
"total_loss": 0.7250330448150635,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:57:05.194525"
}
total tokens: 7872 num samples: 24 num padding tokens: 2881 - rank: 7 max len: 328 min len: 83 avg len: 207.95833333333334 num_loss_counted_tokens: 2171
total tokens: 7536 num samples: 8 num padding tokens: 1132 - rank: 5 max len: 942 min len: 619 avg len: 800.5 num_loss_counted_tokens: 4424
total tokens: 7956 num samples: 13 num padding tokens: 1551 - rank: 6 max len: 612 min len: 344 avg len: 492.6923076923077 num_loss_counted_tokens: 3611
total tokens: 8112 num samples: 4 num padding tokens: 1080 - rank: 3 max len: 2028 min len: 1605 avg len: 1758.0 num_loss_counted_tokens: 520
total tokens: 7128 num samples: 2 num padding tokens: 321 - rank: 0 max len: 3564 min len: 3243 avg len: 3403.5 num_loss_counted_tokens: 173
Per-token loss scaled by world size: 0.0001850179978646338
Per-token loss scaled by world size: 3.4617110031831544e-06
Per-token loss scaled by world size: 0.0002683971542865038
Per-token loss scaled by world size: 0.0003899486910086125
Per-token loss scaled by world size: 0.00042667845264077187
Per-token loss scaled by world size: 8.096924830169883e-06
Per-token loss scaled by world size: 0.00022364444157574326
Epoch: 1, Step: 207, Rank: 2, loss = 0.7098768949508667Epoch: 1, Step: 207, Rank: 0, loss = 0.009155793115496635
Epoch: 1, Step: 207, Rank: 6, loss = 1.0313655138015747
Epoch: 1, Step: 207, Rank: 3, loss = 0.48934948444366455Epoch: 1, Step: 207, Rank: 1, loss = 0.021415354683995247
Epoch: 1, Step: 207, Rank: 4, loss = 1.1285111904144287
Epoch: 1, Step: 207, Rank: 7, loss = 0.591511607170105
Per-token loss scaled by world size: 0.00048212718684226274
Epoch: 1, Step: 207, Rank: 5, loss = 1.2751661539077759
total tokens: 7845 num samples: 5 num padding tokens: 421 - rank: 1 max len: 1569 min len: 1404 avg len: 1484.8 num_loss_counted_tokens: 4961
total tokens: 7887 num samples: 11 num padding tokens: 927 - rank: 4 max len: 717 min len: 576 avg len: 632.7272727272727 num_loss_counted_tokens: 3254
total tokens: 6467 num samples: 29 num padding tokens: 1852 - rank: 7 max len: 223 min len: 79 avg len: 159.13793103448276 num_loss_counted_tokens: 1984
{
"epoch": 1,
"step": 207,
"rank": 0,
"loss": 0.009155793115496635,
"overall_throughput": 41.547768908524304,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.43647813796997,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21159,
"batch_size": 69,
"total_loss": 0.6570440530776978,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:57:07.743363"
}
total tokens: 7980 num samples: 20 num padding tokens: 2050 - rank: 6 max len: 399 min len: 224 avg len: 296.5 num_loss_counted_tokens: 3262
total tokens: 8024 num samples: 8 num padding tokens: 757 - rank: 3 max len: 1003 min len: 749 avg len: 908.375 num_loss_counted_tokens: 5777
total tokens: 8022 num samples: 14 num padding tokens: 1040 - rank: 5 max len: 573 min len: 420 avg len: 498.7142857142857 num_loss_counted_tokens: 4255
total tokens: 7974 num samples: 6 num padding tokens: 1244 - rank: 2 max len: 1329 min len: 1027 avg len: 1121.6666666666667 num_loss_counted_tokens: 4559
total tokens: 5482 num samples: 2 num padding tokens: 958 - rank: 0 max len: 2741 min len: 1783 avg len: 2262.0 num_loss_counted_tokens: 672
Per-token loss scaled by world size: 0.0004824527713935822Per-token loss scaled by world size: 0.000384941027732566Per-token loss scaled by world size: 0.0005819547805003822Per-token loss scaled by world size: 0.0003608883998822421
Per-token loss scaled by world size: 0.00046005993499420583Per-token loss scaled by world size: 5.933908323640935e-05
Per-token loss scaled by world size: 1.4446718523686286e-05
Epoch: 1, Step: 208, Rank: 6, loss = 1.4061481952667236
Epoch: 1, Step: 208, Rank: 3, loss = 1.1657265424728394
Epoch: 1, Step: 208, Rank: 5, loss = 0.9301137924194336Epoch: 1, Step: 208, Rank: 1, loss = 0.14337806403636932
Epoch: 1, Step: 208, Rank: 7, loss = 0.8719965815544128
Epoch: 1, Step: 208, Rank: 0, loss = 0.03490688279271126
Epoch: 1, Step: 208, Rank: 4, loss = 1.1116198301315308
Per-token loss scaled by world size: 0.0002515815431252122
Epoch: 1, Step: 208, Rank: 2, loss = 0.607883870601654
total tokens: 6438 num samples: 2 num padding tokens: 348 - rank: 1 max len: 3219 min len: 2871 avg len: 3045.0 num_loss_counted_tokens: 205
total tokens: 7539 num samples: 7 num padding tokens: 636 - rank: 4 max len: 1077 min len: 894 avg len: 986.1428571428571 num_loss_counted_tokens: 4629
{
"epoch": 1,
"step": 208,
"rank": 0,
"loss": 0.03490688279271126,
"overall_throughput": 40.69133919277144,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.454561710357666,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19330,
"batch_size": 86,
"total_loss": 0.7839717268943787,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:57:10.341855"
}
total tokens: 7893 num samples: 9 num padding tokens: 1279 - rank: 5 max len: 877 min len: 621 avg len: 734.8888888888889 num_loss_counted_tokens: 3674
total tokens: 7657 num samples: 13 num padding tokens: 2094 - rank: 6 max len: 589 min len: 288 avg len: 427.9230769230769 num_loss_counted_tokens: 3574
total tokens: 5474 num samples: 2 num padding tokens: 292 - rank: 2 max len: 2737 min len: 2445 avg len: 2591.0 num_loss_counted_tokens: 182
total tokens: 7275 num samples: 5 num padding tokens: 1073 - rank: 3 max len: 1455 min len: 1087 avg len: 1240.4 num_loss_counted_tokens: 3573
total tokens: 6975 num samples: 25 num padding tokens: 2126 - rank: 7 max len: 279 min len: 79 avg len: 193.96 num_loss_counted_tokens: 2228
total tokens: 7194 num samples: 2 num padding tokens: 95 - rank: 0 max len: 3597 min len: 3502 avg len: 3549.5 num_loss_counted_tokens: 205
Per-token loss scaled by world size: 0.0006150374538265169Per-token loss scaled by world size: 0.00033614260610193014Per-token loss scaled by world size: 0.0003573091235011816Per-token loss scaled by world size: 0.0002815852640196681
Per-token loss scaled by world size: 0.00023183257144410163
Per-token loss scaled by world size: 2.870956450351514e-05
Per-token loss scaled by world size: 2.8179058062960394e-05
Epoch: 1, Step: 209, Rank: 6, loss = 0.96549391746521Epoch: 1, Step: 209, Rank: 4, loss = 0.9082993268966675
Epoch: 1, Step: 209, Rank: 5, loss = 1.6619080305099487Epoch: 1, Step: 209, Rank: 2, loss = 0.7608785629272461
Epoch: 1, Step: 209, Rank: 7, loss = 0.6264405846595764
Epoch: 1, Step: 209, Rank: 0, loss = 0.07757683098316193
Epoch: 1, Step: 209, Rank: 1, loss = 0.07614333927631378
Per-token loss scaled by world size: 0.00017559101979713887
Epoch: 1, Step: 209, Rank: 3, loss = 0.4744688868522644
total tokens: 7210 num samples: 5 num padding tokens: 1177 - rank: 1 max len: 1442 min len: 1085 avg len: 1206.6 num_loss_counted_tokens: 3033
total tokens: 7776 num samples: 9 num padding tokens: 1371 - rank: 4 max len: 864 min len: 614 avg len: 711.6666666666666 num_loss_counted_tokens: 2696
{
"epoch": 1,
"step": 209,
"rank": 0,
"loss": 0.07757683098316193,
"overall_throughput": 41.93509249352839,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.419190883636475,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21617,
"batch_size": 86,
"total_loss": 0.6939011812210083,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:57:12.865160"
}
total tokens: 7826 num samples: 13 num padding tokens: 1029 - rank: 5 max len: 602 min len: 416 avg len: 522.8461538461538 num_loss_counted_tokens: 4387
total tokens: 7904 num samples: 19 num padding tokens: 1674 - rank: 6 max len: 416 min len: 259 avg len: 327.89473684210526 num_loss_counted_tokens: 3581
total tokens: 7758 num samples: 3 num padding tokens: 1159 - rank: 0 max len: 2586 min len: 1912 avg len: 2199.6666666666665 num_loss_counted_tokens: 346
total tokens: 7568 num samples: 8 num padding tokens: 373 - rank: 3 max len: 946 min len: 874 avg len: 899.375 num_loss_counted_tokens: 4817
total tokens: 7441 num samples: 7 num padding tokens: 404 - rank: 2 max len: 1063 min len: 956 avg len: 1005.2857142857143 num_loss_counted_tokens: 3350
total tokens: 7967 num samples: 31 num padding tokens: 2428 - rank: 7 max len: 257 min len: 80 avg len: 178.67741935483872 num_loss_counted_tokens: 2210
Per-token loss scaled by world size: 0.0002890804025810212Per-token loss scaled by world size: 0.0005653423140756786Per-token loss scaled by world size: 0.000670413370244205Per-token loss scaled by world size: 9.497793507762253e-05Per-token loss scaled by world size: 2.160387111871387e-06
Per-token loss scaled by world size: 9.657991176936775e-05
Per-token loss scaled by world size: 0.00031631108140572906
Epoch: 1, Step: 210, Rank: 5, loss = 1.631869912147522Epoch: 1, Step: 210, Rank: 2, loss = 0.23118816316127777
Epoch: 1, Step: 210, Rank: 1, loss = 0.2350875735282898Epoch: 1, Step: 210, Rank: 4, loss = 1.3761138916015625Epoch: 1, Step: 210, Rank: 3, loss = 0.703657865524292
Epoch: 1, Step: 210, Rank: 0, loss = 0.005258652381598949
Epoch: 1, Step: 210, Rank: 7, loss = 0.7699407339096069
Per-token loss scaled by world size: 0.000578251841943711
Epoch: 1, Step: 210, Rank: 6, loss = 1.4075372219085693
total tokens: 7904 num samples: 8 num padding tokens: 750 - rank: 4 max len: 988 min len: 839 avg len: 894.25 num_loss_counted_tokens: 5760
total tokens: 7317 num samples: 3 num padding tokens: 636 - rank: 1 max len: 2439 min len: 1946 avg len: 2227.0 num_loss_counted_tokens: 845
{
"epoch": 1,
"step": 210,
"rank": 0,
"loss": 0.005258652381598949,
"overall_throughput": 42.45218456081296,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.25443983078003,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 19473,
"batch_size": 68,
"total_loss": 0.7950817346572876,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:57:15.357201"
}
total tokens: 7389 num samples: 9 num padding tokens: 1138 - rank: 5 max len: 821 min len: 562 avg len: 694.5555555555555 num_loss_counted_tokens: 4077
total tokens: 7860 num samples: 15 num padding tokens: 2296 - rank: 6 max len: 524 min len: 263 avg len: 370.93333333333334 num_loss_counted_tokens: 3347
total tokens: 6396 num samples: 26 num padding tokens: 1827 - rank: 7 max len: 246 min len: 83 avg len: 175.73076923076923 num_loss_counted_tokens: 2093
total tokens: 7170 num samples: 6 num padding tokens: 733 - rank: 3 max len: 1195 min len: 1001 avg len: 1072.8333333333333 num_loss_counted_tokens: 2366
total tokens: 7220 num samples: 4 num padding tokens: 329 - rank: 2 max len: 1805 min len: 1615 avg len: 1722.75 num_loss_counted_tokens: 1729
total tokens: 7124 num samples: 2 num padding tokens: 264 - rank: 0 max len: 3562 min len: 3298 avg len: 3430.0 num_loss_counted_tokens: 186
Per-token loss scaled by world size: 0.0008867266005836427Per-token loss scaled by world size: 0.00016864115605130792Per-token loss scaled by world size: 3.4453678381396458e-06
Per-token loss scaled by world size: 7.723766611889005e-05Per-token loss scaled by world size: 7.638386159669608e-05Per-token loss scaled by world size: 0.0004412019916344434
Per-token loss scaled by world size: 0.00035067120916210115
Epoch: 1, Step: 211, Rank: 5, loss = 1.9709715843200684Epoch: 1, Step: 211, Rank: 3, loss = 0.3748471140861511
Epoch: 1, Step: 211, Rank: 1, loss = 0.16978223621845245Epoch: 1, Step: 211, Rank: 0, loss = 0.1716800183057785
Epoch: 1, Step: 211, Rank: 2, loss = 0.007658191490918398
Epoch: 1, Step: 211, Rank: 4, loss = 0.9806817173957825Epoch: 1, Step: 211, Rank: 7, loss = 0.7794544100761414
Per-token loss scaled by world size: 0.0005940343835391104
Epoch: 1, Step: 211, Rank: 6, loss = 1.320389986038208
total tokens: 6598 num samples: 2 num padding tokens: 510 - rank: 1 max len: 3299 min len: 2789 avg len: 3044.0 num_loss_counted_tokens: 202
total tokens: 7126 num samples: 7 num padding tokens: 665 - rank: 4 max len: 1018 min len: 822 avg len: 923.0 num_loss_counted_tokens: 4429
{
"epoch": 1,
"step": 211,
"rank": 0,
"loss": 0.1716800183057785,
"overall_throughput": 42.027410025626445,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.381672859191895,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 17782,
"batch_size": 82,
"total_loss": 0.721933126449585,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:57:17.873318"
}
total tokens: 8049 num samples: 3 num padding tokens: 1897 - rank: 2 max len: 2683 min len: 1693 avg len: 2050.6666666666665 num_loss_counted_tokens: 2086
total tokens: 6725 num samples: 25 num padding tokens: 2254 - rank: 7 max len: 269 min len: 87 avg len: 178.84 num_loss_counted_tokens: 2062
total tokens: 7750 num samples: 10 num padding tokens: 610 - rank: 5 max len: 775 min len: 681 avg len: 714.0 num_loss_counted_tokens: 2811
total tokens: 8060 num samples: 13 num padding tokens: 2361 - rank: 6 max len: 620 min len: 304 avg len: 438.38461538461536 num_loss_counted_tokens: 3659
total tokens: 7035 num samples: 5 num padding tokens: 813 - rank: 3 max len: 1407 min len: 1064 avg len: 1244.4 num_loss_counted_tokens: 3760
total tokens: 7708 num samples: 2 num padding tokens: 291 - rank: 0 max len: 3854 min len: 3563 avg len: 3708.5 num_loss_counted_tokens: 181
Per-token loss scaled by world size: 0.00016636037616990507Per-token loss scaled by world size: 0.00016077600594144315Per-token loss scaled by world size: 0.00041039849747903645Per-token loss scaled by world size: 0.00037365706521086395
Per-token loss scaled by world size: 0.00032838378683663905Per-token loss scaled by world size: 0.0003097376029472798
Per-token loss scaled by world size: 3.766906047530938e-06
Epoch: 1, Step: 212, Rank: 5, loss = 1.1992356777191162
Epoch: 1, Step: 212, Rank: 3, loss = 0.46980756521224976
Epoch: 1, Step: 212, Rank: 1, loss = 1.0918726921081543Epoch: 1, Step: 212, Rank: 2, loss = 0.48612579703330994
Epoch: 1, Step: 212, Rank: 0, loss = 0.011007370427250862
Epoch: 1, Step: 212, Rank: 4, loss = 0.9595785140991211
Epoch: 1, Step: 212, Rank: 7, loss = 0.9050920009613037
Per-token loss scaled by world size: 0.00043126812670379877
Epoch: 1, Step: 212, Rank: 6, loss = 1.2602193355560303
total tokens: 7704 num samples: 8 num padding tokens: 1465 - rank: 4 max len: 963 min len: 685 avg len: 779.875 num_loss_counted_tokens: 4503
total tokens: 8007 num samples: 3 num padding tokens: 648 - rank: 1 max len: 2669 min len: 2232 avg len: 2453.0 num_loss_counted_tokens: 287
{
"epoch": 1,
"step": 212,
"rank": 0,
"loss": 0.011007370427250862,
"overall_throughput": 42.88947039606486,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.303375244140625,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23377,
"batch_size": 85,
"total_loss": 0.7978672981262207,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:57:20.342573"
}
total tokens: 7920 num samples: 12 num padding tokens: 1442 - rank: 5 max len: 660 min len: 470 avg len: 539.8333333333334 num_loss_counted_tokens: 4323
total tokens: 6372 num samples: 3 num padding tokens: 828 - rank: 2 max len: 2124 min len: 1644 avg len: 1848.0 num_loss_counted_tokens: 677
total tokens: 7533 num samples: 27 num padding tokens: 2411 - rank: 7 max len: 279 min len: 75 avg len: 189.7037037037037 num_loss_counted_tokens: 2282
total tokens: 7255 num samples: 5 num padding tokens: 1237 - rank: 3 max len: 1451 min len: 980 avg len: 1203.6 num_loss_counted_tokens: 3251
total tokens: 7388 num samples: 2 num padding tokens: 920 - rank: 0 max len: 3694 min len: 2774 avg len: 3234.0 num_loss_counted_tokens: 589
total tokens: 7973 num samples: 17 num padding tokens: 1660 - rank: 6 max len: 469 min len: 291 avg len: 371.3529411764706 num_loss_counted_tokens: 3245
Per-token loss scaled by world size: 0.0003884605539496988Per-token loss scaled by world size: 0.0004556115891318768Per-token loss scaled by world size: 0.0003792895295191556Per-token loss scaled by world size: 0.00035762478364631534Per-token loss scaled by world size: 0.00014603856834582984
Per-token loss scaled by world size: 3.432124140090309e-05
Per-token loss scaled by world size: 8.249920938396826e-05
Epoch: 1, Step: 213, Rank: 6, loss = 1.0305296182632446
Epoch: 1, Step: 213, Rank: 4, loss = 1.2378966808319092Epoch: 1, Step: 213, Rank: 7, loss = 0.9716665744781494
Epoch: 1, Step: 213, Rank: 3, loss = 0.39678677916526794
Epoch: 1, Step: 213, Rank: 0, loss = 0.0932508111000061Epoch: 1, Step: 213, Rank: 2, loss = 1.0554473400115967
Epoch: 1, Step: 213, Rank: 1, loss = 0.22415034472942352
Per-token loss scaled by world size: 0.00048252404667437077
Epoch: 1, Step: 213, Rank: 5, loss = 1.3110178709030151
total tokens: 7245 num samples: 3 num padding tokens: 751 - rank: 1 max len: 2415 min len: 2008 avg len: 2164.6666666666665 num_loss_counted_tokens: 1753
total tokens: 8024 num samples: 8 num padding tokens: 872 - rank: 4 max len: 1003 min len: 763 avg len: 894.0 num_loss_counted_tokens: 3911
{
"epoch": 1,
"step": 213,
"rank": 0,
"loss": 0.0932508111000061,
"overall_throughput": 42.19836853236888,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.430901527404785,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 21736,
"batch_size": 78,
"total_loss": 0.7900933623313904,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:57:22.852352"
}
total tokens: 7104 num samples: 24 num padding tokens: 3028 - rank: 7 max len: 296 min len: 82 avg len: 169.83333333333334 num_loss_counted_tokens: 1782
total tokens: 7856 num samples: 16 num padding tokens: 1303 - rank: 6 max len: 491 min len: 317 avg len: 409.5625 num_loss_counted_tokens: 3722
total tokens: 8107 num samples: 11 num padding tokens: 1365 - rank: 5 max len: 737 min len: 516 avg len: 612.9090909090909 num_loss_counted_tokens: 5379
total tokens: 7816 num samples: 4 num padding tokens: 627 - rank: 2 max len: 1954 min len: 1579 avg len: 1797.25 num_loss_counted_tokens: 1262
total tokens: 7015 num samples: 5 num padding tokens: 1303 - rank: 3 max len: 1403 min len: 1004 avg len: 1142.4 num_loss_counted_tokens: 2463
total tokens: 7132 num samples: 2 num padding tokens: 1084 - rank: 0 max len: 3566 min len: 2482 avg len: 3024.0 num_loss_counted_tokens: 887
Per-token loss scaled by world size: 0.0001603560958756134Per-token loss scaled by world size: 0.00011656123388092965Per-token loss scaled by world size: 0.00013124111865181476Per-token loss scaled by world size: 0.0004040842177346349Per-token loss scaled by world size: 0.0002459329552948475
Per-token loss scaled by world size: 0.00028513988945633173
Per-token loss scaled by world size: 0.0004030088894069195
Epoch: 1, Step: 214, Rank: 2, loss = 0.4810081720352173
Epoch: 1, Step: 214, Rank: 0, loss = 0.3936741352081299
Epoch: 1, Step: 214, Rank: 1, loss = 0.34963998198509216
Epoch: 1, Step: 214, Rank: 6, loss = 1.2121011018753052
Epoch: 1, Step: 214, Rank: 7, loss = 0.7377066612243652
Epoch: 1, Step: 214, Rank: 4, loss = 0.855312705039978
Epoch: 1, Step: 214, Rank: 5, loss = 1.2088755369186401
Per-token loss scaled by world size: 0.00028479599859565496
Epoch: 1, Step: 214, Rank: 3, loss = 0.8542811870574951
total tokens: 7895 num samples: 5 num padding tokens: 1253 - rank: 1 max len: 1579 min len: 1089 avg len: 1328.4 num_loss_counted_tokens: 1673
total tokens: 7480 num samples: 11 num padding tokens: 884 - rank: 4 max len: 680 min len: 516 avg len: 599.6363636363636 num_loss_counted_tokens: 4908
{
"epoch": 1,
"step": 214,
"rank": 0,
"loss": 0.3936741352081299,
"overall_throughput": 41.376296140594235,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.258357048034668,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23997,
"batch_size": 85,
"total_loss": 0.761574923992157,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:57:25.410853"
}
total tokens: 8085 num samples: 35 num padding tokens: 2995 - rank: 7 max len: 231 min len: 82 avg len: 145.42857142857142 num_loss_counted_tokens: 1964
total tokens: 6426 num samples: 2 num padding tokens: 1555 - rank: 0 max len: 3213 min len: 1658 avg len: 2435.5 num_loss_counted_tokens: 537
total tokens: 7791 num samples: 21 num padding tokens: 1378 - rank: 6 max len: 371 min len: 236 avg len: 305.3809523809524 num_loss_counted_tokens: 3598
total tokens: 7740 num samples: 15 num padding tokens: 793 - rank: 5 max len: 516 min len: 375 avg len: 463.1333333333333 num_loss_counted_tokens: 3953
total tokens: 8104 num samples: 8 num padding tokens: 972 - rank: 2 max len: 1013 min len: 810 avg len: 891.5 num_loss_counted_tokens: 5296
total tokens: 8100 num samples: 10 num padding tokens: 449 - rank: 3 max len: 810 min len: 698 avg len: 765.1 num_loss_counted_tokens: 3926
Per-token loss scaled by world size: 0.00016885650984477252Per-token loss scaled by world size: 0.00022565454128198326Per-token loss scaled by world size: 0.00031425835913978517Per-token loss scaled by world size: 7.733783036201203e-07
Per-token loss scaled by world size: 0.00022409454686567187
Per-token loss scaled by world size: 0.00017783122893888503
Per-token loss scaled by world size: 0.00018609287508297712
Epoch: 1, Step: 215, Rank: 2, loss = 0.7786210179328918
Epoch: 1, Step: 215, Rank: 5, loss = 1.084348440170288Epoch: 1, Step: 215, Rank: 1, loss = 0.5826393961906433
Epoch: 1, Step: 215, Rank: 0, loss = 0.002668541856110096
Epoch: 1, Step: 215, Rank: 6, loss = 0.7732382416725159
Epoch: 1, Step: 215, Rank: 4, loss = 0.6136066317558289
Epoch: 1, Step: 215, Rank: 7, loss = 0.642113447189331
Per-token loss scaled by world size: 0.00024193401623051614
Epoch: 1, Step: 215, Rank: 3, loss = 0.8347933292388916
total tokens: 7668 num samples: 9 num padding tokens: 1000 - rank: 4 max len: 852 min len: 647 avg len: 740.8888888888889 num_loss_counted_tokens: 4699
total tokens: 8060 num samples: 5 num padding tokens: 515 - rank: 1 max len: 1612 min len: 1420 avg len: 1509.0 num_loss_counted_tokens: 2649
{
"epoch": 1,
"step": 215,
"rank": 0,
"loss": 0.002668541856110096,
"overall_throughput": 41.217687956745806,
"lr": 4.000000000000001e-06,
"cuda_mem_allocated": 24.42807674407959,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 27604,
"batch_size": 87,
"total_loss": 0.6640036106109619,
"gradnorm": 0.8729047775268555,
"weight_norm": 433.0433044433594,
"timestamp": "2024-08-18T20:57:27.975308"
}
total tokens: 7025 num samples: 5 num padding tokens: 744 - rank: 2 max len: 1405 min len: 1116 avg len: 1256.2 num_loss_counted_tokens: 4200
total tokens: 7672 num samples: 7 num padding tokens: 693 - rank: 3 max len: 1096 min len: 866 avg len: 997.0 num_loss_counted_tokens: 5684
total tokens: 8112 num samples: 13 num padding tokens: 563 - rank: 5 max len: 624 min len: 528 avg len: 580.6923076923077 num_loss_counted_tokens: 5251
total tokens: 7856 num samples: 16 num padding tokens: 1465 - rank: 6 max len: 491 min len: 299 avg len: 399.4375 num_loss_counted_tokens: 3069
total tokens: 7812 num samples: 28 num padding tokens: 2476 - rank: 7 max len: 279 min len: 86 avg len: 190.57142857142858 num_loss_counted_tokens: 2263
total tokens: 6210 num samples: 3 num padding tokens: 683 - rank: 0 max len: 2070 min len: 1654 avg len: 1842.3333333333333 num_loss_counted_tokens: 1033
Per-token loss scaled by world size: 0.00016547582345083356Per-token loss scaled by world size: 0.0002693594142328948Per-token loss scaled by world size: 0.00031613183091394603Per-token loss scaled by world size: 5.018114825361408e-05
Per-token loss scaled by world size: 0.0004250952915754169Per-token loss scaled by world size: 0.00035303577897138894Per-token loss scaled by world size: 5.689787940355018e-05
Epoch: 1, Step: 216, Rank: 5, loss = 1.0305107831954956
Epoch: 1, Step: 216, Rank: 2, loss = 0.5394098162651062
Epoch: 1, Step: 216, Rank: 7, loss = 0.8780443072319031
Epoch: 1, Step: 216, Rank: 1, loss = 0.16357800364494324Epoch: 1, Step: 216, Rank: 0, loss = 0.18547286093235016
Epoch: 1, Step: 216, Rank: 4, loss = 1.3857043981552124
Epoch: 1, Step: 216, Rank: 3, loss = 1.150808334350586
Per-token loss scaled by world size: 0.00032035927870310843
Epoch: 1, Step: 216, Rank: 6, loss = 1.0442911386489868
[2024-08-18 20:57:30,497] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[4.800000000000001e-06], mom=[(0.9, 0.95)]
[2024-08-18 20:57:30,575] [INFO] [timer.py:258:stop] epoch=0/micro_step=216/global_step=6, RunningAvgSamplesPerSec=41.682838913878484, CurrSamplesPerSec=41.70386986575945, MemAllocated=22.89GB, MaxMemAllocated=30.61GB
total tokens: 8096 num samples: 11 num padding tokens: 403 - rank: 4 max len: 736 min len: 676 avg len: 699.3636363636364 num_loss_counted_tokens: 3366
total tokens: 7895 num samples: 5 num padding tokens: 1102 - rank: 1 max len: 1579 min len: 1157 avg len: 1358.6 num_loss_counted_tokens: 2579
{
"epoch": 1,
"step": 216,
"rank": 0,
"loss": 0.18547286093235016,
"overall_throughput": 40.61028464966421,
"lr": 4.800000000000001e-06,
"cuda_mem_allocated": 22.89185380935669,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 26078,
"batch_size": 113,
"total_loss": 0.7972275614738464,
"gradnorm": 0.7308346033096313,
"weight_norm": 433.0433349609375,
"timestamp": "2024-08-18T20:57:30.638455"
}
total tokens: 7917 num samples: 7 num padding tokens: 347 - rank: 2 max len: 1131 min len: 1017 avg len: 1081.4285714285713 num_loss_counted_tokens: 5762
total tokens: 7920 num samples: 12 num padding tokens: 1053 - rank: 5 max len: 660 min len: 459 avg len: 572.25 num_loss_counted_tokens: 3495
total tokens: 7784 num samples: 8 num padding tokens: 989 - rank: 3 max len: 973 min len: 781 avg len: 849.375 num_loss_counted_tokens: 3166
total tokens: 7786 num samples: 17 num padding tokens: 2184 - rank: 6 max len: 458 min len: 242 avg len: 329.52941176470586 num_loss_counted_tokens: 3198
total tokens: 7887 num samples: 33 num padding tokens: 2326 - rank: 7 max len: 239 min len: 82 avg len: 168.5151515151515 num_loss_counted_tokens: 2289
total tokens: 7284 num samples: 3 num padding tokens: 783 - rank: 0 max len: 2428 min len: 1739 avg len: 2167.0 num_loss_counted_tokens: 406
Per-token loss scaled by world size: 0.0005763740628026426Per-token loss scaled by world size: 0.0009438088163733482Per-token loss scaled by world size: 6.679360376438126e-05Per-token loss scaled by world size: 0.0001800585159799084Per-token loss scaled by world size: 0.0003732589539140463Per-token loss scaled by world size: 7.849858957342803e-05
Per-token loss scaled by world size: 0.00011912822810700163
Epoch: 1, Step: 217, Rank: 1, loss = 0.1438567191362381
Epoch: 1, Step: 217, Rank: 5, loss = 2.0327281951904297Epoch: 1, Step: 217, Rank: 3, loss = 0.38780102133750916
Epoch: 1, Step: 217, Rank: 6, loss = 1.241365671157837Epoch: 1, Step: 217, Rank: 0, loss = 0.16906633973121643Epoch: 1, Step: 217, Rank: 4, loss = 0.8039065003395081
Epoch: 1, Step: 217, Rank: 2, loss = 0.256572425365448
Per-token loss scaled by world size: 0.0005111052305437624
Epoch: 1, Step: 217, Rank: 7, loss = 1.1007928848266602
total tokens: 6974 num samples: 2 num padding tokens: 562 - rank: 1 max len: 3487 min len: 2925 avg len: 3206.0 num_loss_counted_tokens: 216
total tokens: 7728 num samples: 7 num padding tokens: 633 - rank: 4 max len: 1104 min len: 930 avg len: 1013.5714285714286 num_loss_counted_tokens: 5349
{
"epoch": 1,
"step": 217,
"rank": 0,
"loss": 0.16906633973121643,
"overall_throughput": 41.92863106467066,
"lr": 4.800000000000001e-06,
"cuda_mem_allocated": 24.260313034057617,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 17230,
"batch_size": 69,
"total_loss": 0.767011284828186,
"gradnorm": 0.7308346033096313,
"weight_norm": 433.0433349609375,
"timestamp": "2024-08-18T20:57:33.123706"
}
total tokens: 4065 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4065 min len: 4065 avg len: 4065.0 num_loss_counted_tokens: 82
total tokens: 6741 num samples: 3 num padding tokens: 1158 - rank: 2 max len: 2247 min len: 1528 avg len: 1861.0 num_loss_counted_tokens: 930
total tokens: 7992 num samples: 27 num padding tokens: 3238 - rank: 7 max len: 296 min len: 81 avg len: 176.07407407407408 num_loss_counted_tokens: 2221
total tokens: 7632 num samples: 12 num padding tokens: 2411 - rank: 6 max len: 636 min len: 301 avg len: 435.0833333333333 num_loss_counted_tokens: 3582
total tokens: 7080 num samples: 5 num padding tokens: 814 - rank: 3 max len: 1416 min len: 1134 avg len: 1253.2 num_loss_counted_tokens: 2449
total tokens: 8091 num samples: 9 num padding tokens: 1045 - rank: 5 max len: 899 min len: 643 avg len: 782.8888888888889 num_loss_counted_tokens: 5684
Per-token loss scaled by world size: 0.0003471940290182829Per-token loss scaled by world size: 0.0005411332240328193Per-token loss scaled by world size: 6.277004104049411e-06Per-token loss scaled by world size: 0.0005339714116416872
Per-token loss scaled by world size: 4.73601221528952e-06
Per-token loss scaled by world size: 0.00031345669412985444
Epoch: 1, Step: 218, Rank: 5, loss = 1.1664127111434937
Per-token loss scaled by world size: 5.614342057924659e-07Epoch: 1, Step: 218, Rank: 1, loss = 0.010208474472165108
Epoch: 1, Step: 218, Rank: 3, loss = 0.748376727104187Epoch: 1, Step: 218, Rank: 0, loss = 0.013530082069337368
Epoch: 1, Step: 218, Rank: 4, loss = 1.1509753465652466
Epoch: 1, Step: 218, Rank: 7, loss = 0.6756559014320374
Epoch: 1, Step: 218, Rank: 2, loss = 0.0012101713800802827Per-token loss scaled by world size: 0.00039466869202442467
Epoch: 1, Step: 218, Rank: 6, loss = 0.8507083654403687
total tokens: 6480 num samples: 3 num padding tokens: 206 - rank: 1 max len: 2160 min len: 1983 avg len: 2091.3333333333335 num_loss_counted_tokens: 1674