| INFO 09-09 23:50:30 [__init__.py:216] Automatically detected platform cuda. | |
| /usr/local/lib/python3.12/dist-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset. | |
| The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session" | |
| warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET)) | |
| ============================================================= test session starts ============================================================= | |
| platform linux -- Python 3.12.11, pytest-8.3.5, pluggy-1.5.0 -- /usr/bin/python3 | |
| cachedir: .pytest_cache | |
| hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/vllm-workspace/tests/.hypothesis/examples')) | |
| rootdir: /vllm-workspace | |
| configfile: pyproject.toml | |
| plugins: anyio-4.6.2.post1, buildkite-test-collector-0.1.9, hydra-core-1.3.2, hypothesis-6.131.0, asyncio-0.24.0, forked-1.6.0, mock-3.14.0, rerunfailures-14.0, shard-0.1.2, subtests-0.14.1, timeout-2.3.1, schemathesis-3.39.15 | |
| asyncio: mode=Mode.STRICT, default_loop_scope=None | |
| config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 928/928 [00:00<00:00, 4.00MB/s] | |
| INFO 09-09 23:51:02 [__init__.py:744] Resolved architecture: GraniteMoeForCausalLM | |
| INFO 09-09 23:51:05 [__init__.py:2946] Downcasting torch.float32 to torch.bfloat16. | |
| INFO 09-09 23:51:05 [__init__.py:1812] Using max model len 4096 | |
| WARNING 09-09 23:51:05 [interface.py:531] Current platform cuda does not have '_pytestfixturefunction' attribute. | |
| WARNING 09-09 23:51:06 [interface.py:531] Current platform cuda does not have '__test__' attribute. | |
| WARNING 09-09 23:51:06 [interface.py:531] Current platform cuda does not have '__bases__' attribute. | |
| WARNING 09-09 23:51:06 [interface.py:531] Current platform cuda does not have '__test__' attribute. | |
| WARNING 09-09 23:51:06 [interface.py:531] Current platform cuda does not have '_schemathesis_test' attribute. | |
| collected 1 item | |
| Running 1 items in this shard: tests/v1/test_async_llm_dp.py::test_load[True-mp-RequestOutputKind.DELTA] | |
| v1/test_async_llm_dp.py::test_load[True-mp-RequestOutputKind.DELTA] INFO 09-09 23:51:06 [__init__.py:744] Resolved architecture: GraniteMoeForCausalLM | |
| INFO 09-09 23:51:09 [__init__.py:2946] Downcasting torch.float32 to torch.bfloat16. | |
| INFO 09-09 23:51:09 [__init__.py:1812] Using max model len 4096 | |
| INFO 09-09 23:51:16 [arg_utils.py:1272] Using mp-based distributed executor backend for async scheduling. | |
| INFO 09-09 23:51:16 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048. | |
| INFO 09-09 23:51:16 [__init__.py:3582] Cudagraph is disabled under eager mode | |
| tokenizer_config.json: 4.13kB [00:00, 13.7MB/s] | |
| tokenizer.json: 2.06MB [00:00, 93.2MB/s] | |
| special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████| 906/906 [00:00<00:00, 6.32MB/s] | |
| generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.42MB/s] | |
| INFO 09-09 23:51:18 [utils.py:648] Started DP Coordinator process (PID: 748) | |
| (EngineCore_DP0 pid=753) INFO 09-09 23:51:18 [core.py:654] Waiting for init message from front-end. | |
| (EngineCore_DP1 pid=756) INFO 09-09 23:51:18 [core.py:654] Waiting for init message from front-end. | |
| [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1 | |
| [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1 | |
| (EngineCore_DP1 pid=756) INFO 09-09 23:51:18 [core.py:76] Initializing a V1 LLM engine (v0.10.1rc2.dev614+gc0243db7c) with config: model='ibm-research/PowerMoE-3b', speculative_config=None, tokenizer='ibm-research/PowerMoE-3b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=ibm-research/PowerMoE-3b, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null} | |
| (EngineCore_DP0 pid=753) INFO 09-09 23:51:18 [core.py:76] Initializing a V1 LLM engine (v0.10.1rc2.dev614+gc0243db7c) with config: model='ibm-research/PowerMoE-3b', speculative_config=None, tokenizer='ibm-research/PowerMoE-3b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=ibm-research/PowerMoE-3b, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null} | |
| (EngineCore_DP1 pid=756) WARNING 09-09 23:51:18 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. | |
| (EngineCore_DP0 pid=753) WARNING 09-09 23:51:18 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. | |
| (EngineCore_DP0 pid=753) INFO 09-09 23:51:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 16777216, 10, 'psm_988b5a8f'), local_subscribe_addr='ipc:///tmp/7c69c795-bb75-4aa2-be26-cd42ddf40591', remote_subscribe_addr=None, remote_addr_ipv6=False) | |
| (EngineCore_DP1 pid=756) INFO 09-09 23:51:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 16777216, 10, 'psm_c88a7c2a'), local_subscribe_addr='ipc:///tmp/48bab07c-915d-4974-bb32-add57a1f4835', remote_subscribe_addr=None, remote_addr_ipv6=False) | |
| (EngineCore_DP0 pid=753) W0909 23:51:22.478000 774 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. | |
| (EngineCore_DP0 pid=753) W0909 23:51:22.478000 774 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures. | |
| (EngineCore_DP1 pid=756) W0909 23:51:22.478000 772 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. | |
| (EngineCore_DP1 pid=756) W0909 23:51:22.478000 772 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures. | |
| (EngineCore_DP1 pid=756) INFO 09-09 23:51:23 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_38a047e4'), local_subscribe_addr='ipc:///tmp/de4c61b3-2e0e-4daf-9bb9-5bf9ad22cd8b', remote_subscribe_addr=None, remote_addr_ipv6=False) | |
| (EngineCore_DP0 pid=753) INFO 09-09 23:51:23 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_26f47cfa'), local_subscribe_addr='ipc:///tmp/257306b0-3f49-400e-8d9f-4a7673e003d8', remote_subscribe_addr=None, remote_addr_ipv6=False) | |
| (EngineCore_DP1 pid=756) INFO 09-09 23:51:23 [parallel_state.py:1004] Adjusting world_size=2 rank=1 distributed_init_method=tcp://127.0.0.1:57465 for DP | |
| (EngineCore_DP0 pid=753) INFO 09-09 23:51:23 [parallel_state.py:1004] Adjusting world_size=2 rank=0 distributed_init_method=tcp://127.0.0.1:57465 for DP | |
| [W909 23:51:24.138463461 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator()) | |
| [W909 23:51:24.140910301 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator()) | |
| [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1 | |
| [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1 | |
| [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1 | |
| (EngineCore_DP0 pid=753) INFO 09-09 23:51:24 [__init__.py:1432] Found nccl from library libnccl.so.2 | |
| (EngineCore_DP1 pid=756) INFO 09-09 23:51:24 [__init__.py:1432] Found nccl from library libnccl.so.2 | |
| (EngineCore_DP0 pid=753) INFO 09-09 23:51:24 [pynccl.py:70] vLLM is using nccl==2.27.3 | |
| (EngineCore_DP1 pid=756) INFO 09-09 23:51:24 [pynccl.py:70] vLLM is using nccl==2.27.3 | |
| 9b405963aa12:774:774 [0] NCCL INFO Bootstrap: Using eth0:172.17.0.3<0> | |
| 9b405963aa12:774:774 NCCL CALL ncclGetUniqueId(0x52bcd9be09a65e5a) | |
| 9b405963aa12:774:774 [0] NCCL INFO cudaDriverVersion 12080 | |
| 9b405963aa12:774:774 [0] NCCL INFO NCCL version 2.27.3+cuda12.9 | |
| 9b405963aa12:772:772 [0] NCCL INFO cudaDriverVersion 12080 | |
| 9b405963aa12:774:774 [0] NCCL INFO init.cc:1792 Cuda Host Alloc Size 4 pointer 0x7f13a2600000 | |
| 9b405963aa12:772:772 [0] NCCL INFO Bootstrap: Using eth0:172.17.0.3<0> | |
| 9b405963aa12:772:772 [0] NCCL INFO NCCL version 2.27.3+cuda12.9 | |
| 9b405963aa12:772:772 [0] NCCL INFO init.cc:1792 Cuda Host Alloc Size 4 pointer 0x7f13a2600000 | |
| 9b405963aa12:774:774 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. | |
| 9b405963aa12:772:772 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. | |
| 9b405963aa12:774:774 [0] NCCL INFO NET/IB : No device found. | |
| 9b405963aa12:774:774 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.3<0> | |
| 9b405963aa12:774:774 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0> | |
| 9b405963aa12:774:774 [0] NCCL INFO Initialized NET plugin Socket | |
| 9b405963aa12:774:774 [0] NCCL INFO Assigned NET plugin Socket to comm | |
| 9b405963aa12:774:774 [0] NCCL INFO Using network Socket | |
| 9b405963aa12:772:772 [0] NCCL INFO NET/IB : No device found. | |
| 9b405963aa12:772:772 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.3<0> | |
| 9b405963aa12:772:772 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0> | |
| 9b405963aa12:772:772 [0] NCCL INFO Initialized NET plugin Socket | |
| 9b405963aa12:772:772 [0] NCCL INFO Assigned NET plugin Socket to comm | |
| 9b405963aa12:772:772 [0] NCCL INFO Using network Socket | |
| 9b405963aa12:774:774 [0] NCCL INFO ncclCommInitRank comm 0x133d3e40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 38000 commId 0x52bcd9be09a65e5a - Init START | |
| 9b405963aa12:772:772 [0] NCCL INFO ncclCommInitRank comm 0x133c8440 rank 1 nranks 2 cudaDev 0 nvmlDev 1 busId 3a000 commId 0x52bcd9be09a65e5a - Init START | |
| 9b405963aa12:774:774 [0] NCCL INFO RAS network listening socket at 172.17.0.3<39139> | |
| 9b405963aa12:772:772 [0] NCCL INFO RAS network listening socket at 172.17.0.3<41085> | |
| 9b405963aa12:774:774 [0] NCCL INFO RAS client listening socket at 127.0.0.1<28028> | |
| 9b405963aa12:772:772 [0] NCCL INFO RAS client listening socket at 127.0.0.1<28028> | |
| 9b405963aa12:772:772 [0] NCCL INFO Mem Realloc old size 0, new size 256 pointer 0x16dd0160 | |
| 9b405963aa12:774:774 [0] NCCL INFO Mem Realloc old size 0, new size 256 pointer 0x16da3a60 | |
| 9b405963aa12:772:772 [0] NCCL INFO Bootstrap timings total 0.000727 (create 0.000031, send 0.000077, recv 0.000202, ring 0.000023, delay 0.000000) | |
| 9b405963aa12:774:774 [0] NCCL INFO Bootstrap timings total 0.001125 (create 0.000040, send 0.000145, recv 0.000431, ring 0.000022, delay 0.000000) | |
| 9b405963aa12:774:849 [0] NCCL INFO RAS thread started | |
| 9b405963aa12:774:849 [0] NCCL INFO Mem Realloc old size 0, new size 32 pointer 0x7f1354004e10 | |
| 9b405963aa12:772:848 [0] NCCL INFO RAS thread started | |
| 9b405963aa12:774:849 [0] NCCL INFO RAS handling local addRanks request (old nRasPeers 0) | |
| 9b405963aa12:772:848 [0] NCCL INFO Mem Realloc old size 0, new size 32 pointer 0x7f1358004e10 | |
| 9b405963aa12:774:849 [0] NCCL INFO RAS finished local processing of addRanks request (new nRasPeers 2, nRankPeers 2) | |
| 9b405963aa12:772:848 [0] NCCL INFO RAS handling local addRanks request (old nRasPeers 0) | |
| 9b405963aa12:772:848 [0] NCCL INFO RAS finished local processing of addRanks request (new nRasPeers 2, nRankPeers 2) | |
| 9b405963aa12:774:849 [0] NCCL INFO RAS peer 0: socket 172.17.0.3<39139>, pid 774, GPU 0 [this process] | |
| 9b405963aa12:774:849 [0] NCCL INFO RAS peer 1: socket 172.17.0.3<41085>, pid 772, GPU 0 (NVML 1) | |
| 9b405963aa12:772:848 [0] NCCL INFO RAS link 1: calculated deferred primary connection with 172.17.0.3<39139> | |
| 9b405963aa12:774:849 [0] NCCL INFO RAS peersHash 0x96c046b3de33d4c2 | |
| 9b405963aa12:772:848 [0] NCCL INFO RAS link -1: calculated deferred primary connection with 172.17.0.3<39139> | |
| 9b405963aa12:774:849 [0] NCCL INFO RAS link 1: opening new primary connection with 172.17.0.3<41085> | |
| 9b405963aa12:774:849 [0] NCCL INFO RAS link -1: calculated existing primary connection with 172.17.0.3<41085> | |
| 9b405963aa12:772:848 [0] NCCL INFO RAS new incoming socket connection from 172.17.0.3<39164> | |
| 9b405963aa12:772:848 [0] NCCL INFO RAS handling connInit from 172.17.0.3<39164> (version 22703, listeningAddr 172.17.0.3<39139>, peersHash 0x96c046b3de33d4c2, deadPeersHash 0x0) | |
| 9b405963aa12:774:849 [0] NCCL INFO RAS handling connInitAck from 172.17.0.3<41085> (nack 0) | |
| 9b405963aa12:774:849 [0] NCCL INFO RAS established connection with 172.17.0.3<41085> (sendQ empty, experiencingDelays 0, startRetryTime 0.00s) | |
| 9b405963aa12:774:774 [0] NCCL INFO TOPO/NET : Importing network plugins to topology | |
| 9b405963aa12:772:772 [0] NCCL INFO TOPO/NET : Importing network plugins to topology | |
| 9b405963aa12:774:774 [0] NCCL INFO Retrieving state for Socket | |
| 9b405963aa12:772:772 [0] NCCL INFO Retrieving state for Socket | |
| 9b405963aa12:774:774 [0] NCCL INFO Initialized state 0 for Socket | |
| 9b405963aa12:772:772 [0] NCCL INFO Initialized state 0 for Socket | |
| 9b405963aa12:774:774 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'eth0' | |
| 9b405963aa12:772:772 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'eth0' | |
| 9b405963aa12:774:774 [0] NCCL INFO ncclTopoPopulateNics : Filled eth0 in topo with pciPath=(null) keep=1 coll=(null) | |
| 9b405963aa12:772:772 [0] NCCL INFO ncclTopoPopulateNics : Filled eth0 in topo with pciPath=(null) keep=1 coll=(null) | |
| 9b405963aa12:774:774 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:772:772 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:772:772 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO === System : maxBw 12.0 totalBw 12.0 === | |
| 9b405963aa12:772:772 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO CPU/0-0 (1/2/-1) | |
| 9b405963aa12:772:772 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO + PCI[5000.0] - NIC/0-0 | |
| 9b405963aa12:774:774 [0] NCCL INFO + PCI[12.0] - GPU/0-38000 (0) | |
| 9b405963aa12:774:774 [0] NCCL INFO + PCI[12.0] - GPU/0-3a000 (1) | |
| 9b405963aa12:774:774 [0] NCCL INFO ========================================== | |
| 9b405963aa12:772:772 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:772:772 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO GPU/0-38000 :GPU/0-38000 (0/5000.0/LOC) GPU/0-3a000 (2/12.0/PHB) CPU/0-0 (1/12.0/PHB) | |
| 9b405963aa12:772:772 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:772:772 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO GPU/0-3a000 :GPU/0-38000 (2/12.0/PHB) GPU/0-3a000 (0/5000.0/LOC) CPU/0-0 (1/12.0/PHB) | |
| 9b405963aa12:772:772 [0] NCCL INFO === System : maxBw 12.0 totalBw 12.0 === | |
| 9b405963aa12:774:774 [0] NCCL INFO Setting affinity for GPU 0 to 0-47 | |
| 9b405963aa12:772:772 [0] NCCL INFO CPU/0-0 (1/2/-1) | |
| 9b405963aa12:772:772 [0] NCCL INFO + PCI[5000.0] - NIC/0-0 | |
| 9b405963aa12:772:772 [0] NCCL INFO + PCI[12.0] - GPU/0-38000 (0) | |
| 9b405963aa12:772:772 [0] NCCL INFO + PCI[12.0] - GPU/0-3a000 (1) | |
| 9b405963aa12:772:772 [0] NCCL INFO ========================================== | |
| 9b405963aa12:772:772 [0] NCCL INFO GPU/0-38000 :GPU/0-38000 (0/5000.0/LOC) GPU/0-3a000 (2/12.0/PHB) CPU/0-0 (1/12.0/PHB) | |
| 9b405963aa12:772:772 [0] NCCL INFO GPU/0-3a000 :GPU/0-38000 (2/12.0/PHB) GPU/0-3a000 (0/5000.0/LOC) CPU/0-0 (1/12.0/PHB) | |
| 9b405963aa12:772:772 [0] NCCL INFO Setting affinity for GPU 1 to 0-47 | |
| 9b405963aa12:774:774 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type PHB/PIX, sameChannels 1 | |
| 9b405963aa12:774:774 [0] NCCL INFO 0 : GPU/0 GPU/1 | |
| 9b405963aa12:774:774 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 12.000000/12.000000, type PHB/PIX, sameChannels 1 | |
| 9b405963aa12:772:772 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type PHB/PIX, sameChannels 1 | |
| 9b405963aa12:774:774 [0] NCCL INFO 0 : GPU/0 GPU/1 | |
| 9b405963aa12:772:772 [0] NCCL INFO 0 : GPU/0 GPU/1 | |
| 9b405963aa12:772:772 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 12.000000/12.000000, type PHB/PIX, sameChannels 1 | |
| 9b405963aa12:772:772 [0] NCCL INFO 0 : GPU/0 GPU/1 | |
| 9b405963aa12:772:772 [0] NCCL INFO comm 0x133c8440 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 | |
| 9b405963aa12:774:774 [0] NCCL INFO comm 0x133d3e40 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 | |
| 9b405963aa12:772:772 [0] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1 | |
| 9b405963aa12:774:774 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1 | |
| 9b405963aa12:772:772 [0] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1 | |
| 9b405963aa12:774:774 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1 | |
| 9b405963aa12:774:774 [0] NCCL INFO Channel 00/02 : 0 1 | |
| 9b405963aa12:772:772 [0] NCCL INFO Ring 00 : 0 -> 1 -> 0 | |
| 9b405963aa12:774:774 [0] NCCL INFO Channel 01/02 : 0 1 | |
| 9b405963aa12:772:772 [0] NCCL INFO Ring 01 : 0 -> 1 -> 0 | |
| 9b405963aa12:772:772 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 | |
| 9b405963aa12:774:774 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1 | |
| 9b405963aa12:772:772 [0] NCCL INFO P2P Chunksize set to 131072 | |
| 9b405963aa12:774:774 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1 | |
| 9b405963aa12:774:774 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 | |
| 9b405963aa12:774:774 [0] NCCL INFO P2P Chunksize set to 131072 | |
| 9b405963aa12:772:772 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. | |
| 9b405963aa12:774:774 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. | |
| 9b405963aa12:772:772 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:774 [0] NCCL INFO Check P2P Type isAllDirectP2p 0 directMode 0 | |
| 9b405963aa12:772:772 [0] NCCL INFO UDS: Creating service thread comm 0x133c8440 rank 1 | |
| 9b405963aa12:774:774 [0] NCCL INFO UDS: Creating service thread comm 0x133d3e40 rank 0 | |
| 9b405963aa12:772:772 [0] NCCL INFO misc/utils.cc:233 memory stack hunk malloc(65536) | |
| 9b405963aa12:774:774 [0] NCCL INFO misc/utils.cc:233 memory stack hunk malloc(65536) | |
| 9b405963aa12:774:853 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 23 | |
| 9b405963aa12:772:850 [0] NCCL INFO [Proxy Service] Device 0 CPU core 39 | |
| 9b405963aa12:774:851 [0] NCCL INFO [Proxy Service] Device 0 CPU core 31 | |
| 9b405963aa12:772:852 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 3 | |
| 9b405963aa12:774:774 [0] NCCL INFO channel.cc:42 Cuda Alloc Size 1216 pointer 0x7f13a4200000 | |
| 9b405963aa12:772:772 [0] NCCL INFO channel.cc:42 Cuda Alloc Size 1216 pointer 0x7f13a4200000 | |
| 9b405963aa12:772:772 [0] NCCL INFO channel.cc:45 Cuda Alloc Size 40 pointer 0x7f13a4400000 | |
| 9b405963aa12:774:774 [0] NCCL INFO channel.cc:45 Cuda Alloc Size 40 pointer 0x7f13a4400000 | |
| 9b405963aa12:774:774 [0] NCCL INFO channel.cc:56 Cuda Alloc Size 8 pointer 0x7f13a4600000 | |
| 9b405963aa12:772:772 [0] NCCL INFO channel.cc:56 Cuda Alloc Size 8 pointer 0x7f13a4600000 | |
| 9b405963aa12:774:774 [0] NCCL INFO channel.cc:42 Cuda Alloc Size 1216 pointer 0x7f13a4800000 | |
| 9b405963aa12:772:772 [0] NCCL INFO channel.cc:42 Cuda Alloc Size 1216 pointer 0x7f13a4800000 | |
| 9b405963aa12:774:774 [0] NCCL INFO channel.cc:45 Cuda Alloc Size 40 pointer 0x7f13a4a00000 | |
| 9b405963aa12:772:772 [0] NCCL INFO channel.cc:45 Cuda Alloc Size 40 pointer 0x7f13a4a00000 | |
| 9b405963aa12:774:774 [0] NCCL INFO channel.cc:56 Cuda Alloc Size 8 pointer 0x7f13a4c00000 | |
| 9b405963aa12:774:774 [0] NCCL INFO Algorithm | Tree | Ring | CollNetDirect | | |
| 9b405963aa12:774:774 [0] NCCL INFO Protocol | LL | LL128 | Simple | LL | LL128 | Simple | LL | LL128 | Simple | | |
| 9b405963aa12:772:772 [0] NCCL INFO channel.cc:56 Cuda Alloc Size 8 pointer 0x7f13a4c00000 | |
| 9b405963aa12:774:774 [0] NCCL INFO Max NThreads | 512 | 640 | 512 | 512 | 640 | 256 | 0 | 0 | 640 | | |
| 9b405963aa12:774:774 [0] NCCL INFO Broadcast | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 7.6/ 6.0 | 16.5/ 0.0 | 14.1/ 12.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO Reduce | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 7.6/ 6.0 | 16.5/ 0.0 | 14.1/ 12.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
| 9b405963aa12:772:772 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 | |
| 9b405963aa12:772:772 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer | |
| 9b405963aa12:774:774 [0] NCCL INFO AllGather | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 7.6/ 12.0 | 16.5/ 0.0 | 14.1/ 24.0 | 0.8/ 0.0 | 0.8/ 0.0 | 39.2/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO ReduceScatter | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 7.6/ 12.0 | 16.5/ 0.0 | 14.1/ 24.0 | 0.8/ 0.0 | 0.8/ 0.0 | 39.2/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO AllReduce | 8.8/ 1.5 | 17.8/ 0.0 | 16.4/ 5.5 | 8.6/ 6.0 | 19.0/ 0.0 | 19.8/ 12.0 | 0.8/ 0.0 | 0.8/ 0.0 | 39.2/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO Algorithm | CollNetChain | NVLS | NVLSTree | | |
| 9b405963aa12:774:774 [0] NCCL INFO Protocol | LL | LL128 | Simple | LL | LL128 | Simple | LL | LL128 | Simple | | |
| 9b405963aa12:774:774 [0] NCCL INFO Max NThreads | 0 | 0 | 640 | 0 | 0 | 640 | 0 | 0 | 640 | | |
| 9b405963aa12:774:774 [0] NCCL INFO Broadcast | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO Reduce | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO AllGather | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO ReduceScatter | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO AllReduce | 0.0/ 0.0 | 0.0/ 0.0 | 35.6/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO Algorithm | PAT | | |
| 9b405963aa12:774:774 [0] NCCL INFO Protocol | LL | LL128 | Simple | LL | LL128 | Simple | LL | LL128 | Simple | | |
| 9b405963aa12:774:774 [0] NCCL INFO Max NThreads | 0 | 0 | 0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO Broadcast | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO Reduce | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO AllGather | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO ReduceScatter | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO AllReduce | 0.0/ 0.0 | 0.0/ 0.0 | 0.0/ 0.0 | | |
| 9b405963aa12:774:774 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 | |
| 9b405963aa12:774:774 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer | |
| 9b405963aa12:772:772 [0] NCCL INFO init.cc:447 Cuda Alloc Size 23648 pointer 0x7f13a4e00000 | |
| 9b405963aa12:774:774 [0] NCCL INFO init.cc:447 Cuda Alloc Size 23648 pointer 0x7f13a4e00000 | |
| 9b405963aa12:772:772 [0] NCCL INFO init.cc:449 Cuda Alloc Size 8 pointer 0x7f13a5000000 | |
| 9b405963aa12:774:774 [0] NCCL INFO init.cc:449 Cuda Alloc Size 8 pointer 0x7f13a5000000 | |
| 9b405963aa12:774:774 [0] NCCL INFO CC Off, workFifoBytes 1048576 | |
| 9b405963aa12:772:772 [0] NCCL INFO init.cc:491 Cuda Host Alloc Size 1048576 pointer 0x7f13a2600200 | |
| 9b405963aa12:772:772 [0] NCCL INFO init.cc:501 Cuda Host Alloc Size 65536 pointer 0x7f13a2700200 | |
| 9b405963aa12:772:772 [0] NCCL INFO init.cc:502 Cuda Host Alloc Size 65536 pointer 0x7f13a2710200 | |
| 9b405963aa12:774:774 [0] NCCL INFO init.cc:491 Cuda Host Alloc Size 1048576 pointer 0x7f13a2600200 | |
| 9b405963aa12:774:774 [0] NCCL INFO init.cc:501 Cuda Host Alloc Size 65536 pointer 0x7f13a2700200 | |
| 9b405963aa12:774:774 [0] NCCL INFO init.cc:502 Cuda Host Alloc Size 65536 pointer 0x7f13a2710200 | |
| 9b405963aa12:772:772 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. | |
| 9b405963aa12:772:772 NCCL CALL ncclCommInitRank(0x133c8440, 2, 0x52bcd9be09a65e5a, 1, 0) | |
| 9b405963aa12:774:774 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin. | |
| 9b405963aa12:772:772 [0] NCCL INFO ncclCommInitRank comm 0x133c8440 rank 1 nranks 2 cudaDev 0 nvmlDev 1 busId 3a000 commId 0x52bcd9be09a65e5a - Init COMPLETE | |
| 9b405963aa12:774:774 NCCL CALL ncclCommInitRank(0x133d3e40, 2, 0x52bcd9be09a65e5a, 0, 0) | |
| 9b405963aa12:774:774 [0] NCCL INFO ncclCommInitRank comm 0x133d3e40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 38000 commId 0x52bcd9be09a65e5a - Init COMPLETE | |
| 9b405963aa12:772:772 [0] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 2 total 0.12 (kernels 0.10, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.00, rest 0.00) | |
| 9b405963aa12:774:774 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.12 (kernels 0.11, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.00, rest 0.00) | |
| 9b405963aa12:774:774 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f13a5200000 recvbuff 0x7f13a5200200 count 1 datatype 7 op 0 root 0 comm 0x133d3e40 [nranks=2] stream (nil) | |
| 9b405963aa12:772:772 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f13a5200000 recvbuff 0x7f13a5200200 count 1 datatype 7 op 0 root 0 comm 0x133c8440 [nranks=2] stream (nil) | |
| 9b405963aa12:774:774 NCCL CALL ncclAllReduce(7f13a5200000,7f13a5200200,1,7,0,0,0x133d3e40,(nil)) | |
| 9b405963aa12:772:772 NCCL CALL ncclAllReduce(7f13a5200000,7f13a5200200,1,7,0,0,0x133c8440,(nil)) | |
| 9b405963aa12:774:774 [0] NCCL INFO misc/utils.cc:233 memory stack hunk malloc(65536) | |
| 9b405963aa12:772:772 [0] NCCL INFO misc/utils.cc:233 memory stack hunk malloc(65536) | |
| 9b405963aa12:772:855 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:854 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:772:850 [0] NCCL INFO Mem Realloc old size 0, new size 8 pointer 0x7f1350004f20 | |
| 9b405963aa12:772:850 [0] NCCL INFO New proxy recv connection 0 from local rank 1, transport 1 | |
| 9b405963aa12:772:850 [0] NCCL INFO proxyProgressAsync opId=0x7f134800dca0 op.type=1 op.reqBuff=0x7f1350004ee0 op.respSize=16 done | |
| 9b405963aa12:772:855 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134800dca0 | |
| 9b405963aa12:772:850 [0] NCCL INFO Received and initiated operation=Init res=0 | |
| 9b405963aa12:772:855 [0] NCCL INFO resp.opId=0x7f134800dca0 matches expected opId=0x7f134800dca0 | |
| 9b405963aa12:772:855 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f1350004f40 | |
| 9b405963aa12:774:851 [0] NCCL INFO Mem Realloc old size 0, new size 8 pointer 0x7f134c004f20 | |
| 9b405963aa12:774:851 [0] NCCL INFO New proxy recv connection 0 from local rank 0, transport 1 | |
| 9b405963aa12:774:851 [0] NCCL INFO proxyProgressAsync opId=0x7f134400dca0 op.type=1 op.reqBuff=0x7f134c004ee0 op.respSize=16 done | |
| 9b405963aa12:774:854 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134400dca0 | |
| 9b405963aa12:774:851 [0] NCCL INFO Received and initiated operation=Init res=0 | |
| 9b405963aa12:774:854 [0] NCCL INFO resp.opId=0x7f134400dca0 matches expected opId=0x7f134400dca0 | |
| 9b405963aa12:774:854 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f134c004f40 | |
| 9b405963aa12:772:850 [0] NCCL INFO CUMEM Host Alloc Size 10485760 pointer 0x7f13a5400000 handle 7f1350008cd0 numa 0 dev 0 granularity 2097152 | |
| 9b405963aa12:772:850 [0] NCCL INFO CUMEM allocated shareable buffer 0x7f13a5400000 size 9637888 | |
| 9b405963aa12:772:850 [0] NCCL INFO proxyProgressAsync opId=0x7f134800dca0 op.type=3 op.reqBuff=0x7f1350008b50 op.respSize=112 done | |
| 9b405963aa12:772:855 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134800dca0 | |
| 9b405963aa12:772:850 [0] NCCL INFO Received and initiated operation=Setup res=0 | |
| 9b405963aa12:774:851 [0] NCCL INFO CUMEM Host Alloc Size 10485760 pointer 0x7f13a5400000 handle 7f134c008cd0 numa 0 dev 0 granularity 2097152 | |
| 9b405963aa12:772:855 [0] NCCL INFO resp.opId=0x7f134800dca0 matches expected opId=0x7f134800dca0 | |
| 9b405963aa12:774:851 [0] NCCL INFO CUMEM allocated shareable buffer 0x7f13a5400000 size 9637888 | |
| 9b405963aa12:772:855 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:851 [0] NCCL INFO proxyProgressAsync opId=0x7f134400dca0 op.type=3 op.reqBuff=0x7f134c008b50 op.respSize=112 done | |
| 9b405963aa12:774:854 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134400dca0 | |
| 9b405963aa12:774:854 [0] NCCL INFO resp.opId=0x7f134400dca0 matches expected opId=0x7f134400dca0 | |
| 9b405963aa12:772:850 [0] NCCL INFO New proxy recv connection 1 from local rank 1, transport 1 | |
| 9b405963aa12:774:854 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:772:850 [0] NCCL INFO proxyProgressAsync opId=0x7f134800dca0 op.type=1 op.reqBuff=0x7f135000ac00 op.respSize=16 done | |
| 9b405963aa12:772:855 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134800dca0 | |
| 9b405963aa12:774:851 [0] NCCL INFO Received and initiated operation=Setup res=0 | |
| 9b405963aa12:772:850 [0] NCCL INFO Received and initiated operation=Init res=0 | |
| 9b405963aa12:772:855 [0] NCCL INFO resp.opId=0x7f134800dca0 matches expected opId=0x7f134800dca0 | |
| 9b405963aa12:772:855 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f1350004fb8 | |
| 9b405963aa12:774:851 [0] NCCL INFO New proxy recv connection 1 from local rank 0, transport 1 | |
| 9b405963aa12:774:851 [0] NCCL INFO proxyProgressAsync opId=0x7f134400dca0 op.type=1 op.reqBuff=0x7f134c00ac00 op.respSize=16 done | |
| 9b405963aa12:774:854 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134400dca0 | |
| 9b405963aa12:774:851 [0] NCCL INFO Received and initiated operation=Init res=0 | |
| 9b405963aa12:774:854 [0] NCCL INFO resp.opId=0x7f134400dca0 matches expected opId=0x7f134400dca0 | |
| 9b405963aa12:774:854 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f134c004fb8 | |
| 9b405963aa12:772:850 [0] NCCL INFO CUMEM Host Alloc Size 10485760 pointer 0x7f13a5e00000 handle 7f135000adc0 numa 0 dev 0 granularity 2097152 | |
| 9b405963aa12:772:850 [0] NCCL INFO CUMEM allocated shareable buffer 0x7f13a5e00000 size 9637888 | |
| 9b405963aa12:772:850 [0] NCCL INFO proxyProgressAsync opId=0x7f134800dca0 op.type=3 op.reqBuff=0x7f135000ac40 op.respSize=112 done | |
| 9b405963aa12:772:855 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134800dca0 | |
| 9b405963aa12:772:855 [0] NCCL INFO resp.opId=0x7f134800dca0 matches expected opId=0x7f134800dca0 | |
| 9b405963aa12:774:851 [0] NCCL INFO CUMEM Host Alloc Size 10485760 pointer 0x7f13a5e00000 handle 7f134c00adc0 numa 0 dev 0 granularity 2097152 | |
| 9b405963aa12:772:855 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:851 [0] NCCL INFO CUMEM allocated shareable buffer 0x7f13a5e00000 size 9637888 | |
| 9b405963aa12:774:851 [0] NCCL INFO proxyProgressAsync opId=0x7f134400dca0 op.type=3 op.reqBuff=0x7f134c00ac40 op.respSize=112 done | |
| 9b405963aa12:772:850 [0] NCCL INFO Received and initiated operation=Setup res=0 | |
| 9b405963aa12:774:854 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134400dca0 | |
| 9b405963aa12:774:851 [0] NCCL INFO Received and initiated operation=Setup res=0 | |
| 9b405963aa12:772:850 [0] NCCL INFO New proxy send connection 2 from local rank 1, transport 1 | |
| 9b405963aa12:772:850 [0] NCCL INFO proxyProgressAsync opId=0x7f134800dca0 op.type=1 op.reqBuff=0x7f135000ca70 op.respSize=16 done | |
| 9b405963aa12:774:854 [0] NCCL INFO resp.opId=0x7f134400dca0 matches expected opId=0x7f134400dca0 | |
| 9b405963aa12:774:854 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:772:855 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134800dca0 | |
| 9b405963aa12:772:850 [0] NCCL INFO Received and initiated operation=Init res=0 | |
| 9b405963aa12:772:855 [0] NCCL INFO resp.opId=0x7f134800dca0 matches expected opId=0x7f134800dca0 | |
| 9b405963aa12:774:851 [0] NCCL INFO New proxy send connection 2 from local rank 0, transport 1 | |
| 9b405963aa12:772:855 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f1350005030 | |
| 9b405963aa12:774:851 [0] NCCL INFO proxyProgressAsync opId=0x7f134400dca0 op.type=1 op.reqBuff=0x7f134c00ca70 op.respSize=16 done | |
| 9b405963aa12:774:854 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134400dca0 | |
| 9b405963aa12:774:854 [0] NCCL INFO resp.opId=0x7f134400dca0 matches expected opId=0x7f134400dca0 | |
| 9b405963aa12:774:854 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f134c005030 | |
| 9b405963aa12:774:851 [0] NCCL INFO Received and initiated operation=Init res=0 | |
| 9b405963aa12:772:850 [0] NCCL INFO CUMEM Host Alloc Size 2097152 pointer 0x7f13a6800000 handle 7f135000cc10 numa 0 dev 0 granularity 2097152 | |
| 9b405963aa12:772:850 [0] NCCL INFO CUMEM allocated shareable buffer 0x7f13a6800000 size 4096 | |
| 9b405963aa12:772:850 [0] NCCL INFO proxyProgressAsync opId=0x7f134800dca0 op.type=3 op.reqBuff=0x7f135000ca90 op.respSize=112 done | |
| 9b405963aa12:772:855 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134800dca0 | |
| 9b405963aa12:772:850 [0] NCCL INFO Received and initiated operation=Setup res=0 | |
| 9b405963aa12:774:851 [0] NCCL INFO CUMEM Host Alloc Size 2097152 pointer 0x7f13a6800000 handle 7f134c00cc10 numa 0 dev 0 granularity 2097152 | |
| 9b405963aa12:774:851 [0] NCCL INFO CUMEM allocated shareable buffer 0x7f13a6800000 size 4096 | |
| 9b405963aa12:772:855 [0] NCCL INFO resp.opId=0x7f134800dca0 matches expected opId=0x7f134800dca0 | |
| 9b405963aa12:774:851 [0] NCCL INFO proxyProgressAsync opId=0x7f134400dca0 op.type=3 op.reqBuff=0x7f134c00ca90 op.respSize=112 done | |
| 9b405963aa12:772:855 [0] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct | |
| 9b405963aa12:772:855 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:774:854 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134400dca0 | |
| 9b405963aa12:774:851 [0] NCCL INFO Received and initiated operation=Setup res=0 | |
| 9b405963aa12:774:854 [0] NCCL INFO resp.opId=0x7f134400dca0 matches expected opId=0x7f134400dca0 | |
| 9b405963aa12:774:854 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct | |
| 9b405963aa12:774:854 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1. | |
| 9b405963aa12:772:850 [0] NCCL INFO New proxy send connection 3 from local rank 1, transport 1 | |
| 9b405963aa12:772:850 [0] NCCL INFO proxyProgressAsync opId=0x7f134800dca0 op.type=1 op.reqBuff=0x7f135000ca90 op.respSize=16 done | |
| 9b405963aa12:772:855 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134800dca0 | |
| 9b405963aa12:772:850 [0] NCCL INFO Received and initiated operation=Init res=0 | |
| 9b405963aa12:774:851 [0] NCCL INFO New proxy send connection 3 from local rank 0, transport 1 | |
| 9b405963aa12:774:851 [0] NCCL INFO proxyProgressAsync opId=0x7f134400dca0 op.type=1 op.reqBuff=0x7f134c00ca90 op.respSize=16 done | |
| 9b405963aa12:774:854 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134400dca0 | |
| 9b405963aa12:772:855 [0] NCCL INFO resp.opId=0x7f134800dca0 matches expected opId=0x7f134800dca0 | |
| 9b405963aa12:772:855 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f13500050a8 | |
| 9b405963aa12:774:851 [0] NCCL INFO Received and initiated operation=Init res=0 | |
| 9b405963aa12:774:854 [0] NCCL INFO resp.opId=0x7f134400dca0 matches expected opId=0x7f134400dca0 | |
| 9b405963aa12:774:854 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f134c0050a8 | |
| 9b405963aa12:772:850 [0] NCCL INFO CUMEM Host Alloc Size 2097152 pointer 0x7f13a6a00000 handle 7f135000ea40 numa 0 dev 0 granularity 2097152 | |
| 9b405963aa12:772:850 [0] NCCL INFO CUMEM allocated shareable buffer 0x7f13a6a00000 size 4096 | |
| 9b405963aa12:772:850 [0] NCCL INFO proxyProgressAsync opId=0x7f134800dca0 op.type=3 op.reqBuff=0x7f135000e8c0 op.respSize=112 done | |
| 9b405963aa12:772:855 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134800dca0 | |
| 9b405963aa12:774:851 [0] NCCL INFO CUMEM Host Alloc Size 2097152 pointer 0x7f13a6a00000 handle 7f134c00ea40 numa 0 dev 0 granularity 2097152 | |
| 9b405963aa12:772:850 [0] NCCL INFO Received and initiated operation=Setup res=0 | |
| 9b405963aa12:774:851 [0] NCCL INFO CUMEM allocated shareable buffer 0x7f13a6a00000 size 4096 | |
| 9b405963aa12:774:851 [0] NCCL INFO proxyProgressAsync opId=0x7f134400dca0 op.type=3 op.reqBuff=0x7f134c00e8c0 op.respSize=112 done | |
| 9b405963aa12:772:855 [0] NCCL INFO resp.opId=0x7f134800dca0 matches expected opId=0x7f134800dca0 | |
| 9b405963aa12:772:855 [0] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct | |
| 9b405963aa12:774:854 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134400dca0 | |
| 9b405963aa12:774:851 [0] NCCL INFO Received and initiated operation=Setup res=0 | |
| 9b405963aa12:774:854 [0] NCCL INFO resp.opId=0x7f134400dca0 matches expected opId=0x7f134400dca0 | |
| 9b405963aa12:774:854 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct | |
| 9b405963aa12:774:851 [0] NCCL INFO New proxy send connection 4 from local rank 1, transport 0 | |
| 9b405963aa12:774:851 [0] NCCL INFO proxyProgressAsync opId=0x7f134800dca0 op.type=1 op.reqBuff=0x7f134c00e8c0 op.respSize=16 done | |
| 9b405963aa12:772:850 [0] NCCL INFO New proxy send connection 4 from local rank 0, transport 0 | |
| 9b405963aa12:774:851 [0] NCCL INFO Received and initiated operation=Init res=0 | |
| 9b405963aa12:772:850 [0] NCCL INFO proxyProgressAsync opId=0x7f134400dca0 op.type=1 op.reqBuff=0x7f135000e8c0 op.respSize=16 done | |
| 9b405963aa12:772:855 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134800dca0 | |
| 9b405963aa12:772:855 [0] NCCL INFO resp.opId=0x7f134800dca0 matches expected opId=0x7f134800dca0 | |
| 9b405963aa12:774:854 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f134400dca0 | |
| 9b405963aa12:772:855 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f134c005120 | |
| 9b405963aa12:774:854 [0] NCCL INFO resp.opId=0x7f134400dca0 matches expected opId=0x7f134400dca0 | |
| 9b405963aa12:772:850 [0] NCCL INFO Received and initiated operation=Init res=0 | |
| 9b405963aa12:774:854 [0] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f1350005120 | |
| 9b405963aa12:772:855 [0] NCCL INFO ProxyCall UDS comm 0x133c8440 rank 1 tpRank 0(ca9c4d91f631c5d4) reqSize 8 respSize 0 respFd 0x7f162e7e0a20 opId 0xb6445018f727ebdd | |
| 9b405963aa12:774:854 [0] NCCL INFO ProxyCall UDS comm 0x133d3e40 rank 0 tpRank 1(b67c22d86ea01199) reqSize 8 respSize 0 respFd 0x7f162bddfa20 opId 0x44bae3e981a49b63 | |
| 9b405963aa12:774:853 [0] NCCL INFO proxyUDSRecvReq::ncclProxyMsgGetFd rank 1 opId 0xb6445018f727ebdd handle=0x7f134c008cd0 | |
| 9b405963aa12:774:853 [0] NCCL INFO UDS proxyGetFd received handle 0x7f134c008cd0 peer 1 opId b6445018f727ebdd | |
| 9b405963aa12:772:852 [0] NCCL INFO proxyUDSRecvReq::ncclProxyMsgGetFd rank 0 opId 0x44bae3e981a49b63 handle=0x7f1350008cd0 | |
| 9b405963aa12:772:852 [0] NCCL INFO UDS proxyGetFd received handle 0x7f1350008cd0 peer 0 opId 44bae3e981a49b63 | |
| 9b405963aa12:772:855 [0] NCCL INFO ProxyCall UDS comm 0x133c8440 rank 1 tpRank 0(ca9c4d91f631c5d4) reqSize 8 respSize 0 respFd 164 opId 0xb6445018f727ebdd - DONE | |
| 9b405963aa12:772:855 [0] NCCL INFO UDS: ClientGetFd handle 0x7f134c008cd0 tpRank 0 returned fd 164 sameProcess 0 | |
| 9b405963aa12:774:854 [0] NCCL INFO ProxyCall UDS comm 0x133d3e40 rank 0 tpRank 1(b67c22d86ea01199) reqSize 8 respSize 0 respFd 179 opId 0x44bae3e981a49b63 - DONE | |
| 9b405963aa12:774:854 [0] NCCL INFO UDS: ClientGetFd handle 0x7f1350008cd0 tpRank 1 returned fd 179 sameProcess 0 | |
| [2025-09-09 23:51:24] 9b405963aa12:772:855 [0] transport/shm.cc:590 NCCL WARN Cuda failure 217 'peer access is not supported between these two devices' | |
| 9b405963aa12:772:855 [0] NCCL INFO transport/shm.cc:169 -> 1 | |
| 9b405963aa12:772:855 [0] NCCL INFO transport.cc:198 -> 1 | |
| 9b405963aa12:772:855 [0] NCCL INFO transport/generic.cc:19 -> 1 | |
| 9b405963aa12:772:855 [0] NCCL INFO group.cc:146 -> 1 | |
| 9b405963aa12:772:855 [0] NCCL INFO group.cc:73 -> 1 [Async thread] | |
| 9b405963aa12:772:772 [0] NCCL INFO group.cc:545 -> 1 | |
| 9b405963aa12:772:772 [0] NCCL INFO group.cc:694 -> 1 | |
| 9b405963aa12:772:772 [0] NCCL INFO enqueue.cc:2432 -> 1 | |
| [2025-09-09 23:51:24] 9b405963aa12:774:854 [0] transport/shm.cc:590 NCCL WARN Cuda failure 217 'peer access is not supported between these two devices' | |
| 9b405963aa12:774:854 [0] NCCL INFO transport/shm.cc:169 -> 1 | |
| 9b405963aa12:774:854 [0] NCCL INFO transport.cc:198 -> 1 | |
| 9b405963aa12:774:854 [0] NCCL INFO transport/generic.cc:19 -> 1 | |
| 9b405963aa12:774:854 [0] NCCL INFO group.cc:146 -> 1 | |
| 9b405963aa12:774:854 [0] NCCL INFO group.cc:73 -> 1 [Async thread] | |
| 9b405963aa12:774:774 [0] NCCL INFO group.cc:545 -> 1 | |
| 9b405963aa12:774:774 [0] NCCL INFO group.cc:694 -> 1 | |
| 9b405963aa12:774:774 [0] NCCL INFO enqueue.cc:2432 -> 1 | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] WorkerProc failed to start. | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] Traceback (most recent call last): | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 559, in worker_main | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] worker = WorkerProc(*args, **kwargs) | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 420, in __init__ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.worker.init_device() | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 611, in init_device | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.worker.init_device() # type: ignore | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 193, in init_device | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] init_worker_distributed_environment(self.vllm_config, self.rank, | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 622, in init_worker_distributed_environment | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ensure_model_parallel_initialized( | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1184, in ensure_model_parallel_initialized | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] initialize_model_parallel(tensor_model_parallel_size, | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1149, in initialize_model_parallel | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] _DP = init_model_parallel_group(group_ranks, | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 882, in init_model_parallel_group | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] return GroupCoordinator( | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 261, in __init__ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.device_communicator = device_comm_cls( | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 52, in __init__ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.pynccl_comm = PyNcclCommunicator( | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 106, in __init__ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.all_reduce(data) | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 127, in all_reduce | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.nccl.ncclAllReduce(buffer_type(in_tensor.data_ptr()), | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 314, in ncclAllReduce | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.NCCL_CHECK(self._funcs["ncclAllReduce"](sendbuff, recvbuff, count, | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 272, in NCCL_CHECK | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] raise RuntimeError(f"NCCL error: {error_str}") | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:24 [multiproc_executor.py:585] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details) | |
| 9b405963aa12:774:851 [0] NCCL INFO [Service thread] Connection closed by localRank 1 | |
| 9b405963aa12:774:849 [0] NCCL INFO RAS current socket connection with 172.17.0.3<41085> closed by peer on receive; terminating it | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] WorkerProc failed to start. | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] Traceback (most recent call last): | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 559, in worker_main | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] worker = WorkerProc(*args, **kwargs) | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 420, in __init__ | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.worker.init_device() | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 611, in init_device | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.worker.init_device() # type: ignore | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 193, in init_device | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] init_worker_distributed_environment(self.vllm_config, self.rank, | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 622, in init_worker_distributed_environment | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ensure_model_parallel_initialized( | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1184, in ensure_model_parallel_initialized | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] initialize_model_parallel(tensor_model_parallel_size, | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1149, in initialize_model_parallel | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] _DP = init_model_parallel_group(group_ranks, | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 882, in init_model_parallel_group | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] return GroupCoordinator( | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 261, in __init__ | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.device_communicator = device_comm_cls( | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 52, in __init__ | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.pynccl_comm = PyNcclCommunicator( | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 106, in __init__ | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.all_reduce(data) | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 127, in all_reduce | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.nccl.ncclAllReduce(buffer_type(in_tensor.data_ptr()), | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 314, in ncclAllReduce | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] self.NCCL_CHECK(self._funcs["ncclAllReduce"](sendbuff, recvbuff, count, | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 272, in NCCL_CHECK | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] raise RuntimeError(f"NCCL error: {error_str}") | |
| (EngineCore_DP0 pid=753) ERROR 09-09 23:51:24 [multiproc_executor.py:585] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details) | |
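[annotation] Both DP workers die at the same spot: PyNcclCommunicator.__init__ issues a warm-up all_reduce, and the wrapper's NCCL_CHECK converts the non-success ncclResult_t into the RuntimeError seen in both tracebacks. A simplified sketch of that check pattern (not vLLM's exact code; it loads NCCL via ctypes in the same spirit as pynccl_wrapper.py):

```python
# A simplified sketch of the NCCL_CHECK pattern seen in the traceback above
# (not vLLM's exact implementation): every ncclResult_t return code is
# compared against ncclSuccess (0) and turned into a Python RuntimeError
# carrying ncclGetErrorString's message.
import ctypes

nccl = ctypes.CDLL("libnccl.so.2")  # assumption: NCCL 2.x on the loader path
nccl.ncclGetErrorString.restype = ctypes.c_char_p

def NCCL_CHECK(result: int) -> None:
    if result != 0:  # 0 == ncclSuccess
        error_str = nccl.ncclGetErrorString(ctypes.c_int(result)).decode()
        raise RuntimeError(f"NCCL error: {error_str}")
```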
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] EngineCore failed to start. | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] Traceback (most recent call last): | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 705, in run_engine_core | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] engine_core = DPEngineCoreProc(*args, **kwargs) | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 975, in __init__ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] super().__init__(vllm_config, local_client, handshake_address, | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 505, in __init__ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] super().__init__(vllm_config, executor_class, log_stats, | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] self.model_executor = executor_class(vllm_config) | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] self._init_executor() | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 99, in _init_executor | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] self.workers = WorkerProc.wait_for_ready(unready_workers) | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 497, in wait_for_ready | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] raise e from None | |
| (EngineCore_DP1 pid=756) ERROR 09-09 23:51:25 [core.py:718] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause. | |
| (EngineCore_DP1 pid=756) Process EngineCore_DP1: | |
| (EngineCore_DP1 pid=756) Traceback (most recent call last): | |
| (EngineCore_DP1 pid=756) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap | |
| (EngineCore_DP1 pid=756) self.run() | |
| (EngineCore_DP1 pid=756) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run | |
| (EngineCore_DP1 pid=756) self._target(*self._args, **self._kwargs) | |
| (EngineCore_DP1 pid=756) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 722, in run_engine_core | |
| (EngineCore_DP1 pid=756) raise e | |
| (EngineCore_DP1 pid=756) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 705, in run_engine_core | |
| (EngineCore_DP1 pid=756) engine_core = DPEngineCoreProc(*args, **kwargs) | |
| (EngineCore_DP1 pid=756) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP1 pid=756) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 975, in __init__ | |
| (EngineCore_DP1 pid=756) super().__init__(vllm_config, local_client, handshake_address, | |
| (EngineCore_DP1 pid=756) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 505, in __init__ | |
| (EngineCore_DP1 pid=756) super().__init__(vllm_config, executor_class, log_stats, | |
| (EngineCore_DP1 pid=756) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__ | |
| (EngineCore_DP1 pid=756) self.model_executor = executor_class(vllm_config) | |
| (EngineCore_DP1 pid=756) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP1 pid=756) File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__ | |
| (EngineCore_DP1 pid=756) self._init_executor() | |
| (EngineCore_DP1 pid=756) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 99, in _init_executor | |
| (EngineCore_DP1 pid=756) self.workers = WorkerProc.wait_for_ready(unready_workers) | |
| (EngineCore_DP1 pid=756) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| (EngineCore_DP1 pid=756) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 497, in wait_for_ready | |
| (EngineCore_DP1 pid=756) raise e from None | |
| (EngineCore_DP1 pid=756) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause. | |
| /usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown | |
| warnings.warn('resource_tracker: There appear to be %d ' | |
| /usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown | |
| warnings.warn('resource_tracker: There appear to be %d ' | |
| FAILED | |
| ================================================================== FAILURES =================================================================== | |
| _________________________________________________ test_load[True-mp-RequestOutputKind.DELTA] __________________________________________________ | |
| output_kind = <RequestOutputKind.DELTA: 1>, data_parallel_backend = 'mp', async_scheduling = True | |
| @pytest.mark.parametrize( | |
| "output_kind", | |
| [ | |
| RequestOutputKind.DELTA, | |
| RequestOutputKind.FINAL_ONLY, | |
| ], | |
| ) | |
| @pytest.mark.parametrize("data_parallel_backend", ["mp", "ray"]) | |
| @pytest.mark.parametrize("async_scheduling", [True, False]) | |
| @pytest.mark.asyncio | |
| async def test_load(output_kind: RequestOutputKind, data_parallel_backend: str, | |
| async_scheduling: bool): | |
| stats_loggers = {} | |
| @dataclass | |
| class SimpleStatsLogger(StatLoggerBase): | |
| init_count: int = 0 | |
| finished_req_count: int = 0 | |
| def __init__(self, vllm_config: VllmConfig, engine_index: int = 0): | |
| stats_loggers[engine_index] = self | |
| def record(self, | |
| scheduler_stats: Optional[SchedulerStats], | |
| iteration_stats: Optional[IterationStats], | |
| engine_idx: int = 0): | |
| if iteration_stats: | |
| self.finished_req_count += len( | |
| iteration_stats.finished_requests) | |
| def log_engine_initialized(self): | |
| self.init_count += 1 | |
| with ExitStack() as after: | |
| prompt = "This is a test of data parallel" | |
| engine_args.data_parallel_backend = data_parallel_backend | |
| engine_args.async_scheduling = async_scheduling | |
| > engine = AsyncLLM.from_engine_args(engine_args, | |
| stat_loggers=[SimpleStatsLogger]) | |
| v1/test_async_llm_dp.py:110: | |
| _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |
| /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py:233: in from_engine_args | |
| return cls( | |
| /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py:129: in __init__ | |
| self.engine_core = EngineCoreClient.make_async_mp_client( | |
| /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:101: in make_async_mp_client | |
| return DPLBAsyncMPClient(*client_args) | |
| /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:1122: in __init__ | |
| super().__init__(vllm_config, executor_class, log_stats, | |
| /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:972: in __init__ | |
| super().__init__(vllm_config, executor_class, log_stats, | |
| /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:767: in __init__ | |
| super().__init__( | |
| /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py:446: in __init__ | |
| with launch_core_engines(vllm_config, executor_class, | |
| /usr/lib/python3.12/contextlib.py:144: in __exit__ | |
| next(self.gen) | |
| /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py:729: in launch_core_engines | |
| wait_for_engine_startup( | |
| _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | |
| handshake_socket = <zmq.Socket(zmq.ROUTER) at 0x7f142503d860 closed> | |
| addresses = EngineZmqAddresses(inputs=['ipc:///tmp/57bfca7e-8144-4f9e-b06e-817eae501212'], outputs=['ipc:///tmp/ac6aec22-ecf4-4c64...3e48ee9-6f14-44f0-9b9e-2f35b5ac4021', frontend_stats_publish_address='ipc:///tmp/a771cb97-53f7-49d1-aa8c-586bfad372a6') | |
| core_engines = [<vllm.v1.engine.utils.CoreEngine object at 0x7f1451a56ab0>, <vllm.v1.engine.utils.CoreEngine object at 0x7f14254a4620>] | |
| parallel_config = ParallelConfig(pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=2, data_parallel_size_local=2, dat...'', world_size=1, rank=0, _data_parallel_master_port_list=[49781, 48023, 57465, 50107], decode_context_parallel_size=1) | |
| cache_config = CacheConfig(block_size=16, gpu_memory_utilization=0.9, swap_space=4.0, cache_dtype='auto', is_attention_free=False, nu...he_dtype='auto', mamba_ssm_cache_dtype='auto', num_gpu_blocks=None, num_cpu_blocks=None, kv_sharing_fast_prefill=False) | |
| proc_manager = <vllm.v1.engine.utils.CoreEngineProcManager object at 0x7f142540b8f0> | |
| coord_process = <ForkProcess name='VLLM_DP_Coordinator' pid=748 parent=515 stopped exitcode=-SIGTERM daemon> | |
| def wait_for_engine_startup( | |
| handshake_socket: zmq.Socket, | |
| addresses: EngineZmqAddresses, | |
| core_engines: list[CoreEngine], | |
| parallel_config: ParallelConfig, | |
| cache_config: CacheConfig, | |
| proc_manager: Optional[CoreEngineProcManager], | |
| coord_process: Optional[Process], | |
| ): | |
| # Wait for engine core process(es) to send ready messages. | |
| local_count = parallel_config.data_parallel_size_local | |
| remote_count = len(core_engines) - local_count | |
| # [local, remote] counts | |
| conn_pending, start_pending = [local_count, remote_count], [0, 0] | |
| poller = zmq.Poller() | |
| poller.register(handshake_socket, zmq.POLLIN) | |
| remote_should_be_headless = not parallel_config.data_parallel_hybrid_lb \ | |
| and not parallel_config.data_parallel_external_lb | |
| if proc_manager is not None: | |
| for sentinel in proc_manager.sentinels(): | |
| poller.register(sentinel, zmq.POLLIN) | |
| if coord_process is not None: | |
| poller.register(coord_process.sentinel, zmq.POLLIN) | |
| while any(conn_pending) or any(start_pending): | |
| events = poller.poll(STARTUP_POLL_PERIOD_MS) | |
| if not events: | |
| if any(conn_pending): | |
| logger.debug( | |
| "Waiting for %d local, %d remote core engine proc(s) " | |
| "to connect.", *conn_pending) | |
| if any(start_pending): | |
| logger.debug( | |
| "Waiting for %d local, %d remote core engine proc(s) " | |
| "to start.", *start_pending) | |
| continue | |
| if len(events) > 1 or events[0][0] != handshake_socket: | |
| # One of the local core processes exited. | |
| finished = proc_manager.finished_procs() if proc_manager else {} | |
| if coord_process is not None and coord_process.exitcode is not None: | |
| finished[coord_process.name] = coord_process.exitcode | |
| > raise RuntimeError("Engine core initialization failed. " | |
| "See root cause above. " | |
| f"Failed core proc(s): {finished}") | |
| E RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP1': 1} | |
| /usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py:782: RuntimeError | |
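[annotation] Before digging into vLLM itself, it can help to confirm that NCCL works at all across two processes on this machine, since the failing call is just an all-reduce during group initialization. A minimal, vLLM-independent sanity check; it assumes torchrun is available and two GPUs are visible, neither of which is stated in the log:

```python
# sanity_check_nccl.py -- a minimal sketch, independent of vLLM, to verify
# that two local ranks can complete an NCCL all-reduce (the same collective
# that failed during PyNcclCommunicator's warm-up above).
# Assumes two visible GPUs; run with:
#   torchrun --nproc-per-node=2 sanity_check_nccl.py
import os

import torch
import torch.distributed as dist

def main() -> None:
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(rank % torch.cuda.device_count())
    dist.init_process_group(backend="nccl")
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # mirrors the warm-up all_reduce in pynccl.py
    torch.cuda.synchronize()
    print(f"rank {rank}: all_reduce ok, value={x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```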
| ============================================================== warnings summary =============================================================== | |
| ../../usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63 | |
| /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. | |
| import pynvml # type: ignore[import] | |
| ../../usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305 | |
| /usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable. | |
| ref_error: type[Exception] = jsonschema.RefResolutionError, | |
| tests/v1/test_async_llm_dp.py::test_load[True-mp-RequestOutputKind.DELTA] | |
| tests/v1/test_async_llm_dp.py::test_load[True-mp-RequestOutputKind.DELTA] | |
| tests/v1/test_async_llm_dp.py::test_load[True-mp-RequestOutputKind.DELTA] | |
| /usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=515) is multi-threaded, use of fork() may lead to deadlocks in the child. | |
| self.pid = os.fork() | |
| -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html | |
| =========================================================== short test summary info =========================================================== | |
| FAILED v1/test_async_llm_dp.py::test_load[True-mp-RequestOutputKind.DELTA] - RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP1': 1} | |
| ======================================================= 1 failed, 5 warnings in 48.12s ======================================================== |
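[annotation] To iterate on a fix without re-running the whole shard, re-run only the failing parametrization. A sketch using pytest's programmatic entry point (equivalent to the shell invocation; the test id is copied verbatim from the summary above):

```python
# Re-run just the failing parametrization (test id taken from the short test
# summary above). pytest.main() is equivalent to invoking pytest from the shell.
import pytest

pytest.main([
    "-x",
    "v1/test_async_llm_dp.py::test_load[True-mp-RequestOutputKind.DELTA]",
])
```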