Created February 2, 2026 23:57
Gist: naveenkumarmarri/28f2be4e7d788820a1429f68879c8e77
[2026-02-02 23:00:50] WARNING server_args.py:806: The tool_call_parser 'glm45' is deprecated. Please use 'glm' instead.
[2026-02-02 23:00:52] WARNING server_args.py:1562: Disabling overlap schedule since MambaRadixCache no_buffer is not compatible with overlap schedule currently, try to use --mamba-scheduler-strategy extra_buffer to enable overlap schedule
[2026-02-02 23:00:52] INFO server_args.py:1697: Attention backend not specified. Use fa3 backend by default.
[2026-02-02 23:00:52] WARNING server_args.py:2136: Max running requests is reset to 48 for speculative decoding. You can override this by explicitly setting --max-running-requests.
[2026-02-02 23:00:52] WARNING server_args.py:2145: Spec v2 is enabled for eagle/eagle3 speculative decoding and overlap schedule is turned on.
[2026-02-02 23:00:52] Fail to set RLIMIT_STACK: current limit exceeds maximum limit
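The first two warnings above are actionable at launch time. A hedged sketch of the corresponding flags (flag spellings are taken from the warnings themselves and SGLang's `python -m sglang.launch_server` entry point; this is not the exact command that produced this log, so verify against your installed version's `--help`):

```shell
# Sketch only, under the assumptions stated above.
# --tool-call-parser glm                  : replaces the deprecated 'glm45'
# --mamba-scheduler-strategy extra_buffer : per the warning, keeps the
#                                           overlap schedule enabled with
#                                           MambaRadixCache
python -m sglang.launch_server \
  --model-path /mnt/models \
  --tp-size 4 \
  --tool-call-parser glm \
  --mamba-scheduler-strategy extra_buffer
```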
[2026-02-02 23:00:52] server_args=ServerArgs(model_path='/mnt/models', tokenizer_path='/mnt/models', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=80, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='fp8_e4m3', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.886, max_running_requests=48, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=4, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=298140868, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info',
log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='/mnt/models', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser='glm45', tool_call_parser='glm', tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend='flashinfer', prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', disable_flashinfer_autotune=False, speculative_algorithm='EAGLE', speculative_draft_model_path='/mnt/models', speculative_draft_model_revision=None, 
speculative_draft_load_format=None, speculative_num_steps=3, speculative_eagle_topk=1, speculative_num_draft_tokens=4, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=True, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend='nixl', hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, 
kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 40, 44, 48, 52, 56, 60, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', 
enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='in-seq-split', enable_fused_qk_norm_rope=True, enable_precise_embedding_interpolation=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', 
remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-02-02 23:00:59 TP0] Init torch distributed begin.
[2026-02-02 23:01:04 TP1] Init torch distributed begin.
[2026-02-02 23:01:08 TP2] Init torch distributed begin.
[2026-02-02 23:01:13 TP3] Init torch distributed begin.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2026-02-02 23:01:13 TP0] sglang is using nccl==2.28.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2026-02-02 23:01:14 TP0] Init torch distributed ends. mem usage=1.24 GB
[2026-02-02 23:01:14 TP3] Init torch distributed ends. mem usage=1.06 GB
[2026-02-02 23:01:14 TP1] Init torch distributed ends. mem usage=1.29 GB
[2026-02-02 23:01:14 TP2] Init torch distributed ends. mem usage=1.29 GB
[2026-02-02 23:01:14 TP2] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-02 23:01:14 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-02 23:01:14 TP3] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-02 23:01:14 TP1] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-02 23:01:15 TP3] Load weight begin. avail mem=138.21 GB
[2026-02-02 23:01:15 TP1] Load weight begin. avail mem=137.97 GB
[2026-02-02 23:01:15 TP2] Load weight begin. avail mem=137.97 GB
[2026-02-02 23:01:15 TP0] Load weight begin. avail mem=138.02 GB
[2026-02-02 23:01:15 TP0] Shared experts fusion optimization enabled.
[2026-02-02 23:01:15 TP3] Using CompressedTensorsW8A8Fp8MoEMethod
[2026-02-02 23:01:15 TP1] Using CompressedTensorsW8A8Fp8MoEMethod
[2026-02-02 23:01:15 TP2] Using CompressedTensorsW8A8Fp8MoEMethod
[2026-02-02 23:01:15 TP0] Using CompressedTensorsW8A8Fp8MoEMethod
[2026-02-02 23:01:15] Using default HuggingFace chat template with detected content format: openai
Loading safetensors checkpoint shards: 0% Completed | 0/93 [00:00<?, ?it/s]
...
Loading safetensors checkpoint shards: 100% Completed | 93/93 [00:53<00:00, 1.74it/s]
[2026-02-02 23:02:09 TP3] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2026-02-02 23:02:09 TP3] Load weight end. type=Glm4MoeForCausalLM, dtype=torch.bfloat16, avail mem=55.10 GB, mem usage=83.11 GB.
[2026-02-02 23:02:09 TP0] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2026-02-02 23:02:09 TP0] Load weight end. type=Glm4MoeForCausalLM, dtype=torch.bfloat16, avail mem=54.91 GB, mem usage=83.11 GB.
[2026-02-02 23:02:10 TP2] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2026-02-02 23:02:10 TP1] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2026-02-02 23:02:10 TP2] Load weight end. type=Glm4MoeForCausalLM, dtype=torch.bfloat16, avail mem=54.86 GB, mem usage=83.11 GB.
[2026-02-02 23:02:10 TP1] Load weight end. type=Glm4MoeForCausalLM, dtype=torch.bfloat16, avail mem=54.86 GB, mem usage=83.11 GB.
[2026-02-02 23:02:10 TP0] Using KV cache dtype: torch.float8_e4m3fn
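The scaling-factor warning above has a concrete consequence: fp8 E4M3 represents magnitudes only up to 448, so with the default scale of 1.0 any larger key/value entry clips. A toy illustration (plain Python, not SGLang code; the `amax` value is made up):

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def quant_dequant(x: float, scale: float) -> float:
    """Scale into the E4M3 range, clip, and scale back (rounding omitted)."""
    q = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))
    return q * scale

amax = 1200.0  # hypothetical largest |entry| in a KV tensor
print(quant_dequant(amax, scale=1.0))                  # scale 1.0: clips to 448.0
print(quant_dequant(amax, scale=amax / FP8_E4M3_MAX))  # calibrated scale: ~1200.0
```

With a calibrated per-tensor scale the full dynamic range survives (up to rounding), which is why the log flags the 1.0 default as potentially less accurate.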
[2026-02-02 23:02:11 TP1] KV Cache is allocated. #tokens: 892030, K size: 19.57 GB, V size: 19.57 GB
[2026-02-02 23:02:11 TP1] Memory pool end. avail mem=15.65 GB
[2026-02-02 23:02:11 TP3] KV Cache is allocated. #tokens: 892030, K size: 19.57 GB, V size: 19.57 GB
[2026-02-02 23:02:11 TP3] Memory pool end. avail mem=15.89 GB
[2026-02-02 23:02:11 TP2] KV Cache is allocated. #tokens: 892030, K size: 19.57 GB, V size: 19.57 GB
[2026-02-02 23:02:11 TP0] KV Cache is allocated. #tokens: 892030, K size: 19.57 GB, V size: 19.57 GB
[2026-02-02 23:02:11 TP2] Memory pool end. avail mem=15.65 GB
[2026-02-02 23:02:11 TP0] Memory pool end. avail mem=15.70 GB
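The logged per-rank cache size can be sanity-checked with simple arithmetic. The model geometry below is an assumption (GLM-4.5-style: 92 layers, 8 KV heads split across tp_size=4, head_dim 128), not read from the log; only the token count and the 1-byte fp8_e4m3 dtype come from the lines above:

```python
tokens       = 892_030   # "#tokens" from the log
layers       = 92        # assumed
kv_heads_tp  = 8 // 4    # assumed KV heads per tensor-parallel rank
head_dim     = 128       # assumed
bytes_per_el = 1         # fp8_e4m3 cache dtype

k_gib = tokens * layers * kv_heads_tp * head_dim * bytes_per_el / 2**30
print(round(k_gib, 2))  # → 19.57, matching "K size: 19.57 GB" (V is identical)
```

Under these assumptions the K and V pools together account for about 39 GB of the ~83 GB weight-adjacent footprint per rank.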
[2026-02-02 23:02:11 TP1] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3.
[2026-02-02 23:02:11 TP1] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem.
[2026-02-02 23:02:11 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=15.15 GB
[2026-02-02 23:02:11 TP3] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3.
[2026-02-02 23:02:11 TP3] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem.
[2026-02-02 23:02:11 TP3] Capture cuda graph begin. This can take up to several minutes. avail mem=15.38 GB
[2026-02-02 23:02:11 TP2] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3.
[2026-02-02 23:02:11 TP2] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem.
[2026-02-02 23:02:11 TP2] Capture cuda graph begin. This can take up to several minutes. avail mem=15.15 GB
[2026-02-02 23:02:11 TP0] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3.
[2026-02-02 23:02:11 TP0] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem.
[2026-02-02 23:02:11 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=15.19 GB
[2026-02-02 23:02:11 TP0] Capture cuda graph bs [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 40, 44, 48]
Capturing batches (bs=48 avail_mem=14.75 GB): 0%| | 0/23 [00:00<?, ?it/s]
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
  warnings.warn( # warn only once
[2026-02-02 23:02:23.738] [info] lamportInitialize start: buffer: 0x7f36c4000000, size: 125829120
[2026-02-02 23:02:23.738] [info] lamportInitialize start: buffer: 0x7f0fa6000000, size: 125829120
[2026-02-02 23:02:23.738] [info] lamportInitialize start: buffer: 0x7f6fb4000000, size: 125829120
[2026-02-02 23:02:23.738] [info] lamportInitialize start: buffer: 0x7f40c4000000, size: 125829120
[2026-02-02 23:02:23 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2026-02-02 23:02:23 TP3] FlashInfer workspace initialized for rank 3, world_size 4
[2026-02-02 23:02:23 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2026-02-02 23:02:23 TP2] FlashInfer workspace initialized for rank 2, world_size 4
[2026-02-02 23:02:25 TP0] Using MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True.json.
[2026-02-02 23:02:25 TP0] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-02-02 23:02:25 TP1] Using MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True.json.
[2026-02-02 23:02:25 TP1] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-02-02 23:02:25 TP2] Using MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True.json.
[2026-02-02 23:02:25 TP2] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-02-02 23:02:25 TP3] Using MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True.json.
[2026-02-02 23:02:25 TP3] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
| Capturing batches (bs=48 avail_mem=14.75 GB): 4%|▍ | 1/23 [00:15<05:38, 15.39s/it] | |
| Capturing batches (bs=44 avail_mem=13.45 GB): 4%|▍ | 1/23 [00:15<05:38, 15.39s/it] | |
| Capturing batches (bs=44 avail_mem=13.45 GB): 9%|▊ | 2/23 [00:16<02:21, 6.73s/it] | |
| Capturing batches (bs=40 avail_mem=13.45 GB): 9%|▊ | 2/23 [00:16<02:21, 6.73s/it] | |
| Capturing batches (bs=40 avail_mem=13.45 GB): 13%|█▎ | 3/23 [00:16<01:16, 3.84s/it] | |
| Capturing batches (bs=32 avail_mem=13.44 GB): 13%|█▎ | 3/23 [00:16<01:16, 3.84s/it] | |
| Capturing batches (bs=32 avail_mem=13.44 GB): 17%|█▋ | 4/23 [00:16<00:47, 2.48s/it] | |
| Capturing batches (bs=30 avail_mem=13.43 GB): 17%|█▋ | 4/23 [00:16<00:47, 2.48s/it] | |
| Capturing batches (bs=30 avail_mem=13.43 GB): 22%|██▏ | 5/23 [00:17<00:34, 1.93s/it] | |
| Capturing batches (bs=28 avail_mem=13.42 GB): 22%|██▏ | 5/23 [00:17<00:34, 1.93s/it] | |
| Capturing batches (bs=28 avail_mem=13.42 GB): 26%|██▌ | 6/23 [00:18<00:26, 1.56s/it] | |
| Capturing batches (bs=26 avail_mem=13.41 GB): 26%|██▌ | 6/23 [00:18<00:26, 1.56s/it] | |
| Capturing batches (bs=26 avail_mem=13.41 GB): 30%|███ | 7/23 [00:19<00:21, 1.33s/it] | |
| Capturing batches (bs=24 avail_mem=13.40 GB): 30%|███ | 7/23 [00:19<00:21, 1.33s/it] | |
| Capturing batches (bs=24 avail_mem=13.40 GB): 35%|███▍ | 8/23 [00:19<00:15, 1.04s/it] | |
| Capturing batches (bs=22 avail_mem=13.39 GB): 35%|███▍ | 8/23 [00:19<00:15, 1.04s/it] | |
| Capturing batches (bs=22 avail_mem=13.39 GB): 39%|███▉ | 9/23 [00:20<00:11, 1.19it/s] | |
| Capturing batches (bs=20 avail_mem=13.38 GB): 39%|███▉ | 9/23 [00:20<00:11, 1.19it/s] | |
| Capturing batches (bs=20 avail_mem=13.38 GB): 43%|████▎ | 10/23 [00:20<00:09, 1.41it/s] | |
| Capturing batches (bs=18 avail_mem=13.37 GB): 43%|████▎ | 10/23 [00:20<00:09, 1.41it/s] | |
| Capturing batches (bs=18 avail_mem=13.37 GB): 48%|████▊ | 11/23 [00:21<00:07, 1.62it/s] | |
| Capturing batches (bs=16 avail_mem=13.36 GB): 48%|████▊ | 11/23 [00:21<00:07, 1.62it/s] | |
| Capturing batches (bs=16 avail_mem=13.36 GB): 52%|█████▏ | 12/23 [00:21<00:06, 1.81it/s] | |
| Capturing batches (bs=14 avail_mem=13.35 GB): 52%|█████▏ | 12/23 [00:21<00:06, 1.81it/s] | |
| Capturing batches (bs=14 avail_mem=13.35 GB): 57%|█████▋ | 13/23 [00:22<00:06, 1.55it/s] | |
| Capturing batches (bs=12 avail_mem=13.35 GB): 57%|█████▋ | 13/23 [00:22<00:06, 1.55it/s] | |
| Capturing batches (bs=12 avail_mem=13.35 GB): 61%|██████ | 14/23 [00:23<00:06, 1.40it/s] | |
| Capturing batches (bs=10 avail_mem=13.34 GB): 61%|██████ | 14/23 [00:23<00:06, 1.40it/s] | |
| Capturing batches (bs=10 avail_mem=13.34 GB): 65%|██████▌ | 15/23 [00:24<00:06, 1.25it/s] | |
| Capturing batches (bs=8 avail_mem=13.33 GB): 65%|██████▌ | 15/23 [00:24<00:06, 1.25it/s] | |
| Capturing batches (bs=8 avail_mem=13.33 GB): 70%|██████▉ | 16/23 [00:25<00:07, 1.01s/it] | |
| Capturing batches (bs=7 avail_mem=13.32 GB): 70%|██████▉ | 16/23 [00:25<00:07, 1.01s/it] | |
| Capturing batches (bs=7 avail_mem=13.32 GB): 74%|███████▍ | 17/23 [00:26<00:05, 1.01it/s] | |
| Capturing batches (bs=6 avail_mem=13.31 GB): 74%|███████▍ | 17/23 [00:26<00:05, 1.01it/s] | |
| Capturing batches (bs=6 avail_mem=13.31 GB): 78%|███████▊ | 18/23 [00:27<00:04, 1.22it/s] | |
| Capturing batches (bs=5 avail_mem=13.31 GB): 78%|███████▊ | 18/23 [00:27<00:04, 1.22it/s] | |
| Capturing batches (bs=5 avail_mem=13.31 GB): 83%|████████▎ | 19/23 [00:28<00:03, 1.16it/s] | |
| Capturing batches (bs=4 avail_mem=13.30 GB): 83%|████████▎ | 19/23 [00:28<00:03, 1.16it/s] | |
| Capturing batches (bs=4 avail_mem=13.30 GB): 87%|████████▋ | 20/23 [00:29<00:02, 1.11it/s] | |
| Capturing batches (bs=3 avail_mem=13.29 GB): 87%|████████▋ | 20/23 [00:29<00:02, 1.11it/s] | |
| Capturing batches (bs=3 avail_mem=13.29 GB): 91%|█████████▏| 21/23 [00:30<00:01, 1.08it/s] | |
| Capturing batches (bs=2 avail_mem=13.28 GB): 91%|█████████▏| 21/23 [00:30<00:01, 1.08it/s] | |
| Capturing batches (bs=2 avail_mem=13.28 GB): 96%|█████████▌| 22/23 [00:30<00:00, 1.28it/s] | |
| Capturing batches (bs=1 avail_mem=13.27 GB): 96%|█████████▌| 22/23 [00:30<00:00, 1.28it/s] | |
| Capturing batches (bs=1 avail_mem=13.27 GB): 100%|██████████| 23/23 [00:31<00:00, 1.18it/s] | |
| Capturing batches (bs=1 avail_mem=13.27 GB): 100%|██████████| 23/23 [00:31<00:00, 1.37s/it] | |
| [2026-02-02 23:02:43 TP0] Registering 46 cuda graph addresses | |
| [2026-02-02 23:02:43 TP3] Capture cuda graph end. Time elapsed: 32.24 s. mem usage=1.74 GB. avail mem=13.64 GB. | |
| [2026-02-02 23:02:43 TP0] Capture cuda graph end. Time elapsed: 32.41 s. mem usage=1.93 GB. avail mem=13.27 GB. | |
| [2026-02-02 23:02:43 TP2] Capture cuda graph end. Time elapsed: 32.43 s. mem usage=1.97 GB. avail mem=13.17 GB. | |
| [2026-02-02 23:02:43 TP1] Capture cuda graph end. Time elapsed: 32.44 s. mem usage=1.97 GB. avail mem=13.17 GB. | |
| [2026-02-02 23:02:44 TP2] Init torch distributed begin. | |
| [2026-02-02 23:02:44 TP0] Init torch distributed begin. | |
| [2026-02-02 23:02:44 TP3] Init torch distributed begin. | |
| [2026-02-02 23:02:44 TP1] Init torch distributed begin. | |
| [2026-02-02 23:02:44 TP0] Init torch distributed ends. mem usage=0.00 GB | |
| [2026-02-02 23:02:44 TP3] Init torch distributed ends. mem usage=0.00 GB | |
| [2026-02-02 23:02:44 TP1] Init torch distributed ends. mem usage=0.00 GB | |
| [2026-02-02 23:02:44 TP2] Init torch distributed ends. mem usage=0.00 GB | |
| [2026-02-02 23:02:44 TP3] Load weight begin. avail mem=13.64 GB | |
| [2026-02-02 23:02:44 TP0] Load weight begin. avail mem=13.27 GB | |
| [2026-02-02 23:02:44 TP2] Load weight begin. avail mem=13.17 GB | |
| [2026-02-02 23:02:44 TP1] Load weight begin. avail mem=13.17 GB | |
| rank 0 allocated ipc_handles: [['0x7f0fdc000000', '0x7f0fc2000000', '0x7f0fbc000000', '0x7f0fb6000000'], ['0x7f0fbb000000', '0x7f0fbb200000', '0x7f0fbb400000', '0x7f0fbb600000'], ['0x7f0fa6000000', '0x7f0f96000000', '0x7f0f86000000', '0x7f0f76000000']] | |
| set flag_ptr[3] = lamport_comm_size: 83886080 | |
| Rank 0 workspace[0] 0x7f0fdc000000 | |
| Rank 0 workspace[1] 0x7f0fc2000000 | |
| Rank 0 workspace[2] 0x7f0fbc000000 | |
| Rank 0 workspace[3] 0x7f0fb6000000 | |
| Rank 0 workspace[4] 0x7f0fbb000000 | |
| Loading safetensors checkpoint shards: 0% Completed | 0/93 [00:00<?, ?it/s] | |
| Rank 0 workspace[5] 0x7f0fbb200000 | |
| Rank 0 workspace[6] 0x7f0fbb400000 | |
| Rank 0 workspace[7] 0x7f0fbb600000 | |
| Rank 0 workspace[8] 0x7f0fa6000000 | |
| Rank 0 workspace[9] 0x7f0f96000000 | |
| Rank 0 workspace[10] 0x7f0f86000000 | |
| Rank 0 workspace[11] 0x7f0f76000000 | |
| Rank 0 workspace[12] 0x7f30ce064400 | |
| Loading safetensors checkpoint shards: 1% Completed | 1/93 [00:00<00:45, 2.03it/s] | |
| Loading safetensors checkpoint shards: 2% Completed | 2/93 [00:00<00:24, 3.69it/s] | |
| Loading safetensors checkpoint shards: 10% Completed | 9/93 [00:00<00:04, 19.04it/s] | |
| Loading safetensors checkpoint shards: 17% Completed | 16/93 [00:00<00:02, 31.25it/s] | |
| Loading safetensors checkpoint shards: 25% Completed | 23/93 [00:00<00:01, 40.69it/s] | |
| Loading safetensors checkpoint shards: 32% Completed | 30/93 [00:01<00:01, 47.70it/s] | |
| Loading safetensors checkpoint shards: 40% Completed | 37/93 [00:01<00:01, 52.93it/s] | |
| Loading safetensors checkpoint shards: 47% Completed | 44/93 [00:01<00:00, 56.81it/s] | |
| Loading safetensors checkpoint shards: 55% Completed | 51/93 [00:01<00:00, 59.67it/s] | |
| Loading safetensors checkpoint shards: 62% Completed | 58/93 [00:01<00:00, 61.64it/s] | |
| Loading safetensors checkpoint shards: 70% Completed | 65/93 [00:01<00:00, 62.52it/s] | |
| Loading safetensors checkpoint shards: 77% Completed | 72/93 [00:01<00:00, 64.29it/s] | |
| Loading safetensors checkpoint shards: 85% Completed | 79/93 [00:01<00:00, 65.64it/s] | |
| Loading safetensors checkpoint shards: 92% Completed | 86/93 [00:01<00:00, 66.73it/s] | |
| [2026-02-02 23:02:46 TP2] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results! | |
| [2026-02-02 23:02:46 TP2] Load weight end. type=Glm4MoeForCausalLMNextN, dtype=torch.bfloat16, avail mem=11.43 GB, mem usage=1.74 GB. | |
| Loading safetensors checkpoint shards: 100% Completed | 93/93 [00:01<00:00, 48.19it/s] | |
| [2026-02-02 23:02:46 TP3] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results! | |
| [2026-02-02 23:02:46 TP3] Load weight end. type=Glm4MoeForCausalLMNextN, dtype=torch.bfloat16, avail mem=11.90 GB, mem usage=1.74 GB. | |
| [2026-02-02 23:02:46 TP1] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results! | |
| [2026-02-02 23:02:46 TP0] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results! | |
| [2026-02-02 23:02:46 TP1] Load weight end. type=Glm4MoeForCausalLMNextN, dtype=torch.bfloat16, avail mem=11.43 GB, mem usage=1.74 GB. | |
| [2026-02-02 23:02:46 TP0] Load weight end. type=Glm4MoeForCausalLMNextN, dtype=torch.bfloat16, avail mem=11.53 GB, mem usage=1.74 GB. | |
| [2026-02-02 23:02:46 TP0] Using KV cache dtype: torch.float8_e4m3fn | |
| [2026-02-02 23:02:46 TP3] KV Cache is allocated. #tokens: 892030, K size: 0.21 GB, V size: 0.21 GB | |
| [2026-02-02 23:02:46 TP0] KV Cache is allocated. #tokens: 892030, K size: 0.21 GB, V size: 0.21 GB | |
| [2026-02-02 23:02:46 TP1] KV Cache is allocated. #tokens: 892030, K size: 0.21 GB, V size: 0.21 GB | |
| [2026-02-02 23:02:46 TP2] KV Cache is allocated. #tokens: 892030, K size: 0.21 GB, V size: 0.21 GB | |
| [2026-02-02 23:02:46 TP3] Memory pool end. avail mem=11.48 GB | |
| [2026-02-02 23:02:46 TP0] Memory pool end. avail mem=11.10 GB | |
| [2026-02-02 23:02:46 TP1] Memory pool end. avail mem=11.01 GB | |
| [2026-02-02 23:02:46 TP2] Memory pool end. avail mem=11.01 GB | |
| [2026-02-02 23:02:46 TP3] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3. | |
| [2026-02-02 23:02:46 TP3] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem. | |
| [2026-02-02 23:02:46 TP1] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3. | |
| [2026-02-02 23:02:46 TP1] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem. | |
| [2026-02-02 23:02:46 TP2] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3. | |
| [2026-02-02 23:02:46 TP0] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3. | |
| [2026-02-02 23:02:46 TP2] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem. | |
| [2026-02-02 23:02:46 TP0] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem. | |
| [2026-02-02 23:02:46 TP3] Capture draft cuda graph begin. This can take up to several minutes. avail mem=12.12 GB | |
| [2026-02-02 23:02:46 TP1] Capture draft cuda graph begin. This can take up to several minutes. avail mem=11.65 GB | |
| [2026-02-02 23:02:46 TP0] Capture draft cuda graph begin. This can take up to several minutes. avail mem=11.75 GB | |
| [2026-02-02 23:02:46 TP2] Capture draft cuda graph begin. This can take up to several minutes. avail mem=11.65 GB | |
| rank 3 allocated ipc_handles: [['0x7f36e0000000', '0x7f36da000000', '0x7f36d4000000', '0x7f36fc000000'], ['0x7f36d9200000', '0x7f36d9400000', '0x7f36d9600000', '0x7f36d9000000'], ['0x7f36b4000000', '0x7f36a4000000', '0x7f3694000000', '0x7f36c4000000']] | |
| set flag_ptr[3] = lamport_comm_size: 83886080 | |
| Rank 3 workspace[0] 0x7f36e0000000 | |
| Rank 3 workspace[1] 0x7f36da000000 | |
| Rank 3 workspace[2] 0x7f36d4000000 | |
| Rank 3 workspace[3] 0x7f36fc000000 | |
| Rank 3 workspace[4] 0x7f36d9200000 | |
| Rank 3 workspace[5] 0x7f36d9400000 | |
| Rank 3 workspace[6] 0x7f36d9600000 | |
| Rank 3 workspace[7] 0x7f36d9000000 | |
| Rank 3 workspace[8] 0x7f36b4000000 | |
| Rank 3 workspace[9] 0x7f36a4000000 | |
| Rank 3 workspace[10] 0x7f3694000000 | |
| Rank 3 workspace[11] 0x7f36c4000000 | |
| Rank 3 workspace[12] 0x7f57e4064400 | |
| rank 1 allocated ipc_handles: [['0x7f40e0000000', '0x7f40fc000000', '0x7f40da000000', '0x7f40d4000000'], ['0x7f40d9200000', '0x7f40d9000000', '0x7f40d9400000', '0x7f40d9600000'], ['0x7f40b4000000', '0x7f40c4000000', '0x7f40a4000000', '0x7f4094000000']] | |
| set flag_ptr[3] = lamport_comm_size: 83886080 | |
| Rank 1 workspace[0] 0x7f40e0000000 | |
| Rank 1 workspace[1] 0x7f40fc000000 | |
| Rank 1 workspace[2] 0x7f40da000000 | |
| Rank 1 workspace[3] 0x7f40d4000000 | |
| Rank 1 workspace[4] 0x7f40d9200000 | |
| Rank 1 workspace[5] 0x7f40d9000000 | |
| Rank 1 workspace[6] 0x7f40d9400000 | |
| Rank 1 workspace[7] 0x7f40d9600000 | |
| Rank 1 workspace[8] 0x7f40b4000000 | |
| Rank 1 workspace[9] 0x7f40c4000000 | |
| Rank 1 workspace[10] 0x7f40a4000000 | |
| Rank 1 workspace[11] 0x7f4094000000 | |
| Rank 1 workspace[12] 0x7f61fa064400 | |
| rank 2 allocated ipc_handles: [['0x7f6fd0000000', '0x7f6fca000000', '0x7f6fec000000', '0x7f6fc4000000'], ['0x7f6fc9200000', '0x7f6fc9400000', '0x7f6fc9000000', '0x7f6fc9600000'], ['0x7f6fa4000000', '0x7f6f94000000', '0x7f6fb4000000', '0x7f6f84000000']] | |
| set flag_ptr[3] = lamport_comm_size: 83886080 | |
| Rank 2 workspace[0] 0x7f6fd0000000 | |
| Rank 2 workspace[1] 0x7f6fca000000 | |
| Rank 2 workspace[2] 0x7f6fec000000 | |
| Rank 2 workspace[3] 0x7f6fc4000000 | |
| Rank 2 workspace[4] 0x7f6fc9200000 | |
| Rank 2 workspace[5] 0x7f6fc9400000 | |
| Rank 2 workspace[6] 0x7f6fc9000000 | |
| Rank 2 workspace[7] 0x7f6fc9600000 | |
| Rank 2 workspace[8] 0x7f6fa4000000 | |
| Rank 2 workspace[9] 0x7f6f94000000 | |
| Rank 2 workspace[10] 0x7f6fb4000000 | |
| Rank 2 workspace[11] 0x7f6f84000000 | |
| Rank 2 workspace[12] 0x7f90ec064400 | |
| 0%| | 0/23 [00:00<?, ?it/s] | |
| Capturing batches (bs=48 avail_mem=11.64 GB): 0%| | 0/23 [00:00<?, ?it/s] | |
| Capturing batches (bs=48 avail_mem=11.64 GB): 0%| | 0/23 [02:20<?, ?it/s] | |
| [2026-02-02 23:05:07 TP0] Registering 0 cuda graph addresses | |
| [2026-02-02 23:05:08 TP2] Scheduler hit an exception: Traceback (most recent call last): | |
| File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/cpp_ext.py", line 328, in run_ninja | |
| subprocess.run( | |
| File "/usr/lib/python3.12/subprocess.py", line 571, in run | |
| raise CalledProcessError(retcode, process.args, | |
| subprocess.CalledProcessError: Command '['ninja', '-v', '-C', '/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90', '-f', '/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/build.ninja']' returned non-zero exit status 1. | |
| The above exception was the direct cause of the following exception: | |
| Traceback (most recent call last): | |
| File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 130, in __init__ | |
| self.capture() | |
| File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 179, in capture | |
| CudaGraphRunner.capture(self) | |
| File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 522, in capture | |
| _capture_one_stream() | |
| File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 509, in _capture_one_stream | |
| ) = self.capture_one_batch_size(bs, forward, stream_idx) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 278, in capture_one_batch_size | |
| self.model_runner.draft_attn_backend.init_forward_metadata_capture_cuda_graph( | |
| File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1591, in init_forward_metadata_capture_cuda_graph | |
| self.common_template(forward_batch, self.cuda_graph_kv_indices, call_fn) | |
| File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1542, in common_template | |
| call_fn(i, forward_batch) | |
| File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1581, in call_fn | |
| self.attn_backends[i].init_forward_metadata_capture_cuda_graph( | |
| File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 568, in init_forward_metadata_capture_cuda_graph | |
| self.indices_updater_decode.update( | |
| File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 964, in update_single_wrapper | |
| self.call_begin_forward( | |
| File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1145, in call_begin_forward | |
| wrapper.begin_forward( | |
| File "/usr/local/lib/python3.12/dist-packages/flashinfer/decode.py", line 1051, in plan | |
| self._cached_module = get_batch_prefill_module( | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 404, in get_batch_prefill_module | |
| module = gen_batch_prefill_module(backend, *args).build_and_load() | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/core.py", line 316, in build_and_load | |
| self.build(verbose, need_lock=False) | |
| File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/core.py", line 302, in build | |
| run_ninja(self.build_dir, self.ninja_path, verbose) | |
| File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/cpp_ext.py", line 340, in run_ninja | |
| raise RuntimeError(msg) from e | |
| RuntimeError: Ninja build failed. Ninja output: | |
| ninja: Entering directory `/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90' | |
| [1/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list | |
| argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>) | |
| using TiledMmaQK = decltype(cute::make_tiled_mma( | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Op const&, | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom, | |
| ^ | |
| detected during: | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu". | |
| [2/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list | |
| argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>) | |
| using TiledMmaQK = decltype(cute::make_tiled_mma( | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Op const&, | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom, | |
| ^ | |
| detected during: | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu". | |
| [3/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=128, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 128, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list | |
| argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>) | |
| using TiledMmaQK = decltype(cute::make_tiled_mma( | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Op const&, | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom, | |
| ^ | |
| detected during: | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=128, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 128, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=128, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 128, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu". | |
| [4/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list | |
| argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>) | |
| using TiledMmaQK = decltype(cute::make_tiled_mma( | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Op const&, | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom, | |
| ^ | |
| detected during: | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu". | |
| [5/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list | |
| argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>) | |
| using TiledMmaQK = decltype(cute::make_tiled_mma( | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Op const&, | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom, | |
| ^ | |
| detected during: | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu". | |
| [6/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list | |
| argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>) | |
| using TiledMmaQK = decltype(cute::make_tiled_mma( | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Op const&, | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom, | |
| ^ | |
| detected during: | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu". | |
| [7/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=true]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list | |
| argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>) | |
| using TiledMmaQK = decltype(cute::make_tiled_mma( | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Op const&, | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom, | |
| ^ | |
| detected during: | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=true]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=true]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu". | |
| [8/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list | |
| argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>) | |
| using TiledMmaQK = decltype(cute::make_tiled_mma( | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Op const&, | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom, | |
| ^ | |
| detected during: | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu". | |
| ninja: build stopped: subcommand failed. | |
| During handling of the above exception, another exception occurred: | |
| Traceback (most recent call last): | |
| File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2937, in run_scheduler_process | |
| scheduler = Scheduler( | |
| ^^^^^^^^^^ | |
| File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 346, in __init__ | |
| self.init_model_worker() | |
| File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 536, in init_model_worker | |
| self.maybe_init_draft_worker() | |
| File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 532, in maybe_init_draft_worker | |
| self.draft_worker = DraftWorkerClass(**draft_worker_kwargs) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 615, in __init__ | |
| self._draft_worker = EagleDraftWorker( | |
| ^^^^^^^^^^^^^^^^^ | |
| File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 170, in __init__ | |
| self.init_cuda_graphs() | |
| File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 263, in init_cuda_graphs | |
| self.cuda_graph_runner = Device2DraftCudaGraphRunner[ | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 132, in __init__ | |
| raise Exception( | |
| Exception: Capture cuda graph failed: Ninja build failed. Ninja output: | |
| ninja: Entering directory `/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90' | |
| [1/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o | |
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
      static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
      ^
          detected during:
            instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
            instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
            argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
      using TiledMmaQK = decltype(cute::make_tiled_mma(
      ^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
      make_tiled_mma(MMA_Op const&,
      ^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
      make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
      ^
          detected during:
            instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
      static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
      ^
          detected during:
            instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
            instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu".
[2/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
      static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
      ^
          detected during:
            instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
            instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
            argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
      using TiledMmaQK = decltype(cute::make_tiled_mma(
      ^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
      make_tiled_mma(MMA_Op const&,
      ^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
      make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
      ^
          detected during:
            instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
      static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
      ^
          detected during:
            instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
            instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu".
[3/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
      static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
      ^
          detected during:
            instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
            instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=128, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 128, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
            argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
      using TiledMmaQK = decltype(cute::make_tiled_mma(
      ^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
      make_tiled_mma(MMA_Op const&,
      ^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
      make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
      ^
          detected during:
            instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=128, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 128, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
            instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=128, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 128, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu". | |
| [4/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list | |
| argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>) | |
| using TiledMmaQK = decltype(cute::make_tiled_mma( | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Op const&, | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom, | |
| ^ | |
| detected during: | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu". | |
| [5/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list | |
| argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>) | |
| using TiledMmaQK = decltype(cute::make_tiled_mma( | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Op const&, | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom, | |
| ^ | |
| detected during: | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu". | |
| [6/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list | |
| argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>) | |
| using TiledMmaQK = decltype(cute::make_tiled_mma( | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Op const&, | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom, | |
| ^ | |
| detected during: | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu". | |
| [7/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=true]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list | |
| argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>) | |
| using TiledMmaQK = decltype(cute::make_tiled_mma( | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Op const&, | |
| ^ | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction | |
| make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom, | |
| ^ | |
| detected during: | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=true]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config." | |
| static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
| instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=true]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh | |
| instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu | |
| 3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu". | |
| [8/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o | |
| FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o | |
| /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o | |
| /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration." | |
| static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); | |
| ^ | |
| detected during: | |
| instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh | |
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu".
ninja: build stopped: subcommand failed.
Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
[2026-02-02 23:05:08] Received sigquit from a child process. It usually means the child failed.
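The "Possible solutions" list above maps directly onto SGLang launch flags. A minimal relaunch sketch applying suggestions (1) and (2), assuming the server is started via `python -m sglang.launch_server`; the model path, tp size, and KV-cache dtype are taken from the `server_args` line in this log, while the mitigation values are the examples from the error message, not verified fixes for this crash:

```shell
# Hedged sketch of a relaunch with the log's suggested mitigations.
# --mem-fraction-static / --cuda-graph-max-bs values are the examples
# quoted in the error message, not values confirmed to fix this build.
python -m sglang.launch_server \
  --model-path /mnt/models \
  --tp-size 4 \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.8 \
  --cuda-graph-max-bs 16
```

Note that the failure recorded here is a FlashInfer JIT compilation error ("3 errors detected in the compilation of …"), not an out-of-memory condition, so these memory-oriented flags may not address the root cause; the log's final suggestion of opening a GitHub issue may be the more relevant path.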