[2026-02-02 23:00:50] WARNING server_args.py:806: The tool_call_parser 'glm45' is deprecated. Please use 'glm' instead.
[2026-02-02 23:00:52] WARNING server_args.py:1562: Disabling overlap schedule since MambaRadixCache no_buffer is not compatible with overlap schedule currently, try to use --mamba-scheduler-strategy extra_buffer to enable overlap schedule
[2026-02-02 23:00:52] INFO server_args.py:1697: Attention backend not specified. Use fa3 backend by default.
[2026-02-02 23:00:52] WARNING server_args.py:2136: Max running requests is reset to 48 for speculative decoding. You can override this by explicitly setting --max-running-requests.
[2026-02-02 23:00:52] WARNING server_args.py:2145: Spec v2 is enabled for eagle/eagle3 speculative decoding and overlap schedule is turned on.
[2026-02-02 23:00:52] Fail to set RLIMIT_STACK: current limit exceeds maximum limit
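The warnings above are actionable at launch time: 'glm45' is a deprecated tool-call parser name, and the overlap schedule is disabled under the default MambaRadixCache strategy. A minimal sketch of the corresponding flags, assuming the standard `python -m sglang.launch_server` entry point; the model path and TP size here are illustrative, taken from the ServerArgs dump that follows:

```shell
# Sketch only, not the exact command that produced this log.
# --tool-call-parser glm            : 'glm45' is deprecated per the warning above
# --mamba-scheduler-strategy extra_buffer : re-enables overlap schedule with MambaRadixCache
python -m sglang.launch_server \
  --model-path /mnt/models \
  --tp-size 4 \
  --tool-call-parser glm \
  --mamba-scheduler-strategy extra_buffer
```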
[2026-02-02 23:00:52] server_args=ServerArgs(model_path='/mnt/models', tokenizer_path='/mnt/models', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=80, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='fp8_e4m3', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.886, max_running_requests=48, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=4, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=298140868, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', 
log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='/mnt/models', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser='glm45', tool_call_parser='glm', tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend='flashinfer', prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', disable_flashinfer_autotune=False, speculative_algorithm='EAGLE', speculative_draft_model_path='/mnt/models', speculative_draft_model_revision=None, 
speculative_draft_load_format=None, speculative_num_steps=3, speculative_eagle_topk=1, speculative_num_draft_tokens=4, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=True, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend='nixl', hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, 
kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 40, 44, 48, 52, 56, 60, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', 
enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='in-seq-split', enable_fused_qk_norm_rope=True, enable_precise_embedding_interpolation=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', 
remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-02-02 23:00:59 TP0] Init torch distributed begin.
[2026-02-02 23:01:04 TP1] Init torch distributed begin.
[2026-02-02 23:01:08 TP2] Init torch distributed begin.
[2026-02-02 23:01:13 TP3] Init torch distributed begin.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2026-02-02 23:01:13 TP0] sglang is using nccl==2.28.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2026-02-02 23:01:14 TP0] Init torch distributed ends. mem usage=1.24 GB
[2026-02-02 23:01:14 TP3] Init torch distributed ends. mem usage=1.06 GB
[2026-02-02 23:01:14 TP1] Init torch distributed ends. mem usage=1.29 GB
[2026-02-02 23:01:14 TP2] Init torch distributed ends. mem usage=1.29 GB
[2026-02-02 23:01:14 TP2] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-02 23:01:14 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-02 23:01:14 TP3] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-02 23:01:14 TP1] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-02 23:01:15 TP3] Load weight begin. avail mem=138.21 GB
[2026-02-02 23:01:15 TP1] Load weight begin. avail mem=137.97 GB
[2026-02-02 23:01:15 TP2] Load weight begin. avail mem=137.97 GB
[2026-02-02 23:01:15 TP0] Load weight begin. avail mem=138.02 GB
[2026-02-02 23:01:15 TP0] Shared experts fusion optimization enabled.
[2026-02-02 23:01:15 TP3] Using CompressedTensorsW8A8Fp8MoEMethod
[2026-02-02 23:01:15 TP1] Using CompressedTensorsW8A8Fp8MoEMethod
[2026-02-02 23:01:15 TP2] Using CompressedTensorsW8A8Fp8MoEMethod
[2026-02-02 23:01:15 TP0] Using CompressedTensorsW8A8Fp8MoEMethod
[2026-02-02 23:01:15] Using default HuggingFace chat template with detected content format: openai
Loading safetensors checkpoint shards: 0% Completed | 0/93 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 93/93 [00:53<00:00, 1.74it/s]
[2026-02-02 23:02:09 TP3] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2026-02-02 23:02:09 TP3] Load weight end. type=Glm4MoeForCausalLM, dtype=torch.bfloat16, avail mem=55.10 GB, mem usage=83.11 GB.
[2026-02-02 23:02:09 TP0] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2026-02-02 23:02:09 TP0] Load weight end. type=Glm4MoeForCausalLM, dtype=torch.bfloat16, avail mem=54.91 GB, mem usage=83.11 GB.
[2026-02-02 23:02:10 TP2] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2026-02-02 23:02:10 TP1] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2026-02-02 23:02:10 TP2] Load weight end. type=Glm4MoeForCausalLM, dtype=torch.bfloat16, avail mem=54.86 GB, mem usage=83.11 GB.
[2026-02-02 23:02:10 TP1] Load weight end. type=Glm4MoeForCausalLM, dtype=torch.bfloat16, avail mem=54.86 GB, mem usage=83.11 GB.
[2026-02-02 23:02:10 TP0] Using KV cache dtype: torch.float8_e4m3fn
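The FP8 KV-cache warnings above are expected when running with `kv_cache_dtype='fp8_e4m3'` but `quantization_param_path=None` (see the ServerArgs dump): with no scaling factors supplied, the server defaults them to 1.0. A hedged sketch of the two ways to address this, assuming the same launch entry point; the scales-file path is hypothetical:

```shell
# Option A: supply per-tensor KV-cache scaling factors, if the checkpoint
# ships them (path below is illustrative, not taken from this log).
python -m sglang.launch_server --model-path /mnt/models \
  --kv-cache-dtype fp8_e4m3 \
  --quantization-param-path /mnt/models/kv_cache_scales.json

# Option B: keep the default (model-dtype) KV cache to avoid the accuracy
# caveat, at the cost of roughly 2x KV-cache memory versus fp8_e4m3.
python -m sglang.launch_server --model-path /mnt/models --kv-cache-dtype auto
```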
[2026-02-02 23:02:11 TP1] KV Cache is allocated. #tokens: 892030, K size: 19.57 GB, V size: 19.57 GB
[2026-02-02 23:02:11 TP1] Memory pool end. avail mem=15.65 GB
[2026-02-02 23:02:11 TP3] KV Cache is allocated. #tokens: 892030, K size: 19.57 GB, V size: 19.57 GB
[2026-02-02 23:02:11 TP3] Memory pool end. avail mem=15.89 GB
[2026-02-02 23:02:11 TP2] KV Cache is allocated. #tokens: 892030, K size: 19.57 GB, V size: 19.57 GB
[2026-02-02 23:02:11 TP0] KV Cache is allocated. #tokens: 892030, K size: 19.57 GB, V size: 19.57 GB
[2026-02-02 23:02:11 TP2] Memory pool end. avail mem=15.65 GB
[2026-02-02 23:02:11 TP0] Memory pool end. avail mem=15.70 GB
[2026-02-02 23:02:11 TP1] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3.
[2026-02-02 23:02:11 TP1] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem.
[2026-02-02 23:02:11 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=15.15 GB
[2026-02-02 23:02:11 TP3] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3.
[2026-02-02 23:02:11 TP3] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem.
[2026-02-02 23:02:11 TP3] Capture cuda graph begin. This can take up to several minutes. avail mem=15.38 GB
[2026-02-02 23:02:11 TP2] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3.
[2026-02-02 23:02:11 TP2] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem.
[2026-02-02 23:02:11 TP2] Capture cuda graph begin. This can take up to several minutes. avail mem=15.15 GB
[2026-02-02 23:02:11 TP0] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3.
[2026-02-02 23:02:11 TP0] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem.
[2026-02-02 23:02:11 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=15.19 GB
[2026-02-02 23:02:11 TP0] Capture cuda graph bs [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 40, 44, 48]
Capturing batches (bs=48 avail_mem=14.75 GB): 0%| | 0/23 [00:00<?, ?it/s]
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
warnings.warn( # warn only once
[2026-02-02 23:02:23.738] [info] lamportInitialize start: buffer: 0x7f36c4000000, size: 125829120
[2026-02-02 23:02:23.738] [info] lamportInitialize start: buffer: 0x7f0fa6000000, size: 125829120
[2026-02-02 23:02:23.738] [info] lamportInitialize start: buffer: 0x7f6fb4000000, size: 125829120
[2026-02-02 23:02:23.738] [info] lamportInitialize start: buffer: 0x7f40c4000000, size: 125829120
[2026-02-02 23:02:23 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2026-02-02 23:02:23 TP3] FlashInfer workspace initialized for rank 3, world_size 4
[2026-02-02 23:02:23 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2026-02-02 23:02:23 TP2] FlashInfer workspace initialized for rank 2, world_size 4
[2026-02-02 23:02:25 TP0] Using MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True.json.
[2026-02-02 23:02:25 TP0] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-02-02 23:02:25 TP1] Using MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True.json.
[2026-02-02 23:02:25 TP1] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-02-02 23:02:25 TP2] Using MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True.json.
[2026-02-02 23:02:25 TP2] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-02-02 23:02:25 TP3] Using MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True.json.
[2026-02-02 23:02:25 TP3] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=161,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,per_channel_quant=True_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
Capturing batches (bs=48 avail_mem=14.75 GB): 4%|▍ | 1/23 [00:15<05:38, 15.39s/it]
Capturing batches (bs=44 avail_mem=13.45 GB): 9%|▊ | 2/23 [00:16<02:21, 6.73s/it]
Capturing batches (bs=40 avail_mem=13.45 GB): 13%|█▎ | 3/23 [00:16<01:16, 3.84s/it]
Capturing batches (bs=32 avail_mem=13.44 GB): 17%|█▋ | 4/23 [00:16<00:47, 2.48s/it]
Capturing batches (bs=30 avail_mem=13.43 GB): 22%|██▏ | 5/23 [00:17<00:34, 1.93s/it]
Capturing batches (bs=28 avail_mem=13.42 GB): 26%|██▌ | 6/23 [00:18<00:26, 1.56s/it]
Capturing batches (bs=26 avail_mem=13.41 GB): 30%|███ | 7/23 [00:19<00:21, 1.33s/it]
Capturing batches (bs=24 avail_mem=13.40 GB): 35%|███▍ | 8/23 [00:19<00:15, 1.04s/it]
Capturing batches (bs=22 avail_mem=13.39 GB): 39%|███▉ | 9/23 [00:20<00:11, 1.19it/s]
Capturing batches (bs=20 avail_mem=13.38 GB): 43%|████▎ | 10/23 [00:20<00:09, 1.41it/s]
Capturing batches (bs=18 avail_mem=13.37 GB): 48%|████▊ | 11/23 [00:21<00:07, 1.62it/s]
Capturing batches (bs=16 avail_mem=13.36 GB): 52%|█████▏ | 12/23 [00:21<00:06, 1.81it/s]
Capturing batches (bs=14 avail_mem=13.35 GB): 57%|█████▋ | 13/23 [00:22<00:06, 1.55it/s]
Capturing batches (bs=12 avail_mem=13.35 GB): 61%|██████ | 14/23 [00:23<00:06, 1.40it/s]
Capturing batches (bs=10 avail_mem=13.34 GB): 65%|██████▌ | 15/23 [00:24<00:06, 1.25it/s]
Capturing batches (bs=8 avail_mem=13.33 GB): 70%|██████▉ | 16/23 [00:25<00:07, 1.01s/it]
Capturing batches (bs=7 avail_mem=13.32 GB): 74%|███████▍ | 17/23 [00:26<00:05, 1.01it/s]
Capturing batches (bs=6 avail_mem=13.31 GB): 78%|███████▊ | 18/23 [00:27<00:04, 1.22it/s]
Capturing batches (bs=5 avail_mem=13.31 GB): 83%|████████▎ | 19/23 [00:28<00:03, 1.16it/s]
Capturing batches (bs=4 avail_mem=13.30 GB): 87%|████████▋ | 20/23 [00:29<00:02, 1.11it/s]
Capturing batches (bs=3 avail_mem=13.29 GB): 91%|█████████▏| 21/23 [00:30<00:01, 1.08it/s]
Capturing batches (bs=2 avail_mem=13.28 GB): 96%|█████████▌| 22/23 [00:30<00:00, 1.28it/s]
Capturing batches (bs=1 avail_mem=13.27 GB): 100%|██████████| 23/23 [00:31<00:00, 1.37s/it]
[2026-02-02 23:02:43 TP0] Registering 46 cuda graph addresses
[2026-02-02 23:02:43 TP3] Capture cuda graph end. Time elapsed: 32.24 s. mem usage=1.74 GB. avail mem=13.64 GB.
[2026-02-02 23:02:43 TP0] Capture cuda graph end. Time elapsed: 32.41 s. mem usage=1.93 GB. avail mem=13.27 GB.
[2026-02-02 23:02:43 TP2] Capture cuda graph end. Time elapsed: 32.43 s. mem usage=1.97 GB. avail mem=13.17 GB.
[2026-02-02 23:02:43 TP1] Capture cuda graph end. Time elapsed: 32.44 s. mem usage=1.97 GB. avail mem=13.17 GB.
[2026-02-02 23:02:44 TP2] Init torch distributed begin.
[2026-02-02 23:02:44 TP0] Init torch distributed begin.
[2026-02-02 23:02:44 TP3] Init torch distributed begin.
[2026-02-02 23:02:44 TP1] Init torch distributed begin.
[2026-02-02 23:02:44 TP0] Init torch distributed ends. mem usage=0.00 GB
[2026-02-02 23:02:44 TP3] Init torch distributed ends. mem usage=0.00 GB
[2026-02-02 23:02:44 TP1] Init torch distributed ends. mem usage=0.00 GB
[2026-02-02 23:02:44 TP2] Init torch distributed ends. mem usage=0.00 GB
[2026-02-02 23:02:44 TP3] Load weight begin. avail mem=13.64 GB
[2026-02-02 23:02:44 TP0] Load weight begin. avail mem=13.27 GB
[2026-02-02 23:02:44 TP2] Load weight begin. avail mem=13.17 GB
[2026-02-02 23:02:44 TP1] Load weight begin. avail mem=13.17 GB
rank 0 allocated ipc_handles: [['0x7f0fdc000000', '0x7f0fc2000000', '0x7f0fbc000000', '0x7f0fb6000000'], ['0x7f0fbb000000', '0x7f0fbb200000', '0x7f0fbb400000', '0x7f0fbb600000'], ['0x7f0fa6000000', '0x7f0f96000000', '0x7f0f86000000', '0x7f0f76000000']]
set flag_ptr[3] = lamport_comm_size: 83886080
Rank 0 workspace[0] 0x7f0fdc000000
Rank 0 workspace[1] 0x7f0fc2000000
Rank 0 workspace[2] 0x7f0fbc000000
Rank 0 workspace[3] 0x7f0fb6000000
Rank 0 workspace[4] 0x7f0fbb000000
Loading safetensors checkpoint shards: 0% Completed | 0/93 [00:00<?, ?it/s]
Rank 0 workspace[5] 0x7f0fbb200000
Rank 0 workspace[6] 0x7f0fbb400000
Rank 0 workspace[7] 0x7f0fbb600000
Rank 0 workspace[8] 0x7f0fa6000000
Rank 0 workspace[9] 0x7f0f96000000
Rank 0 workspace[10] 0x7f0f86000000
Rank 0 workspace[11] 0x7f0f76000000
Rank 0 workspace[12] 0x7f30ce064400
Loading safetensors checkpoint shards: 1% Completed | 1/93 [00:00<00:45, 2.03it/s]
Loading safetensors checkpoint shards: 2% Completed | 2/93 [00:00<00:24, 3.69it/s]
Loading safetensors checkpoint shards: 10% Completed | 9/93 [00:00<00:04, 19.04it/s]
Loading safetensors checkpoint shards: 17% Completed | 16/93 [00:00<00:02, 31.25it/s]
Loading safetensors checkpoint shards: 25% Completed | 23/93 [00:00<00:01, 40.69it/s]
Loading safetensors checkpoint shards: 32% Completed | 30/93 [00:01<00:01, 47.70it/s]
Loading safetensors checkpoint shards: 40% Completed | 37/93 [00:01<00:01, 52.93it/s]
Loading safetensors checkpoint shards: 47% Completed | 44/93 [00:01<00:00, 56.81it/s]
Loading safetensors checkpoint shards: 55% Completed | 51/93 [00:01<00:00, 59.67it/s]
Loading safetensors checkpoint shards: 62% Completed | 58/93 [00:01<00:00, 61.64it/s]
Loading safetensors checkpoint shards: 70% Completed | 65/93 [00:01<00:00, 62.52it/s]
Loading safetensors checkpoint shards: 77% Completed | 72/93 [00:01<00:00, 64.29it/s]
Loading safetensors checkpoint shards: 85% Completed | 79/93 [00:01<00:00, 65.64it/s]
Loading safetensors checkpoint shards: 92% Completed | 86/93 [00:01<00:00, 66.73it/s]
[2026-02-02 23:02:46 TP2] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2026-02-02 23:02:46 TP2] Load weight end. type=Glm4MoeForCausalLMNextN, dtype=torch.bfloat16, avail mem=11.43 GB, mem usage=1.74 GB.
Loading safetensors checkpoint shards: 100% Completed | 93/93 [00:01<00:00, 48.19it/s]
[2026-02-02 23:02:46 TP3] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2026-02-02 23:02:46 TP3] Load weight end. type=Glm4MoeForCausalLMNextN, dtype=torch.bfloat16, avail mem=11.90 GB, mem usage=1.74 GB.
[2026-02-02 23:02:46 TP1] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2026-02-02 23:02:46 TP0] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
[2026-02-02 23:02:46 TP1] Load weight end. type=Glm4MoeForCausalLMNextN, dtype=torch.bfloat16, avail mem=11.43 GB, mem usage=1.74 GB.
[2026-02-02 23:02:46 TP0] Load weight end. type=Glm4MoeForCausalLMNextN, dtype=torch.bfloat16, avail mem=11.53 GB, mem usage=1.74 GB.
[2026-02-02 23:02:46 TP0] Using KV cache dtype: torch.float8_e4m3fn
[2026-02-02 23:02:46 TP3] KV Cache is allocated. #tokens: 892030, K size: 0.21 GB, V size: 0.21 GB
[2026-02-02 23:02:46 TP0] KV Cache is allocated. #tokens: 892030, K size: 0.21 GB, V size: 0.21 GB
[2026-02-02 23:02:46 TP1] KV Cache is allocated. #tokens: 892030, K size: 0.21 GB, V size: 0.21 GB
[2026-02-02 23:02:46 TP2] KV Cache is allocated. #tokens: 892030, K size: 0.21 GB, V size: 0.21 GB
[2026-02-02 23:02:46 TP3] Memory pool end. avail mem=11.48 GB
[2026-02-02 23:02:46 TP0] Memory pool end. avail mem=11.10 GB
[2026-02-02 23:02:46 TP1] Memory pool end. avail mem=11.01 GB
[2026-02-02 23:02:46 TP2] Memory pool end. avail mem=11.01 GB
[2026-02-02 23:02:46 TP3] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3.
[2026-02-02 23:02:46 TP3] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem.
[2026-02-02 23:02:46 TP1] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3.
[2026-02-02 23:02:46 TP1] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem.
[2026-02-02 23:02:46 TP2] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3.
[2026-02-02 23:02:46 TP0] Using hybrid attention backend for decode and prefill: decode_backend=flashinfer, prefill_backend=fa3.
[2026-02-02 23:02:46 TP2] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem.
[2026-02-02 23:02:46 TP0] Warning: Attention backend specified by --attention-backend or default backend might be overridden. The feature of hybrid attention backend is experimental and unstable. Please raise an issue if you encounter any problem.
[2026-02-02 23:02:46 TP3] Capture draft cuda graph begin. This can take up to several minutes. avail mem=12.12 GB
[2026-02-02 23:02:46 TP1] Capture draft cuda graph begin. This can take up to several minutes. avail mem=11.65 GB
[2026-02-02 23:02:46 TP0] Capture draft cuda graph begin. This can take up to several minutes. avail mem=11.75 GB
[2026-02-02 23:02:46 TP2] Capture draft cuda graph begin. This can take up to several minutes. avail mem=11.65 GB
rank 3 allocated ipc_handles: [['0x7f36e0000000', '0x7f36da000000', '0x7f36d4000000', '0x7f36fc000000'], ['0x7f36d9200000', '0x7f36d9400000', '0x7f36d9600000', '0x7f36d9000000'], ['0x7f36b4000000', '0x7f36a4000000', '0x7f3694000000', '0x7f36c4000000']]
set flag_ptr[3] = lamport_comm_size: 83886080
Rank 3 workspace[0] 0x7f36e0000000
Rank 3 workspace[1] 0x7f36da000000
Rank 3 workspace[2] 0x7f36d4000000
Rank 3 workspace[3] 0x7f36fc000000
Rank 3 workspace[4] 0x7f36d9200000
Rank 3 workspace[5] 0x7f36d9400000
Rank 3 workspace[6] 0x7f36d9600000
Rank 3 workspace[7] 0x7f36d9000000
Rank 3 workspace[8] 0x7f36b4000000
Rank 3 workspace[9] 0x7f36a4000000
Rank 3 workspace[10] 0x7f3694000000
Rank 3 workspace[11] 0x7f36c4000000
Rank 3 workspace[12] 0x7f57e4064400
rank 1 allocated ipc_handles: [['0x7f40e0000000', '0x7f40fc000000', '0x7f40da000000', '0x7f40d4000000'], ['0x7f40d9200000', '0x7f40d9000000', '0x7f40d9400000', '0x7f40d9600000'], ['0x7f40b4000000', '0x7f40c4000000', '0x7f40a4000000', '0x7f4094000000']]
set flag_ptr[3] = lamport_comm_size: 83886080
Rank 1 workspace[0] 0x7f40e0000000
Rank 1 workspace[1] 0x7f40fc000000
Rank 1 workspace[2] 0x7f40da000000
Rank 1 workspace[3] 0x7f40d4000000
Rank 1 workspace[4] 0x7f40d9200000
Rank 1 workspace[5] 0x7f40d9000000
Rank 1 workspace[6] 0x7f40d9400000
Rank 1 workspace[7] 0x7f40d9600000
Rank 1 workspace[8] 0x7f40b4000000
Rank 1 workspace[9] 0x7f40c4000000
Rank 1 workspace[10] 0x7f40a4000000
Rank 1 workspace[11] 0x7f4094000000
Rank 1 workspace[12] 0x7f61fa064400
rank 2 allocated ipc_handles: [['0x7f6fd0000000', '0x7f6fca000000', '0x7f6fec000000', '0x7f6fc4000000'], ['0x7f6fc9200000', '0x7f6fc9400000', '0x7f6fc9000000', '0x7f6fc9600000'], ['0x7f6fa4000000', '0x7f6f94000000', '0x7f6fb4000000', '0x7f6f84000000']]
set flag_ptr[3] = lamport_comm_size: 83886080
Rank 2 workspace[0] 0x7f6fd0000000
Rank 2 workspace[1] 0x7f6fca000000
Rank 2 workspace[2] 0x7f6fec000000
Rank 2 workspace[3] 0x7f6fc4000000
Rank 2 workspace[4] 0x7f6fc9200000
Rank 2 workspace[5] 0x7f6fc9400000
Rank 2 workspace[6] 0x7f6fc9000000
Rank 2 workspace[7] 0x7f6fc9600000
Rank 2 workspace[8] 0x7f6fa4000000
Rank 2 workspace[9] 0x7f6f94000000
Rank 2 workspace[10] 0x7f6fb4000000
Rank 2 workspace[11] 0x7f6f84000000
Rank 2 workspace[12] 0x7f90ec064400
Capturing batches (bs=48 avail_mem=11.64 GB): 0%| | 0/23 [00:00<?, ?it/s]
Capturing batches (bs=48 avail_mem=11.64 GB): 0%| | 0/23 [02:20<?, ?it/s]
[2026-02-02 23:05:07 TP0] Registering 0 cuda graph addresses
[2026-02-02 23:05:08 TP2] Scheduler hit an exception: Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/cpp_ext.py", line 328, in run_ninja
subprocess.run(
File "/usr/lib/python3.12/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-C', '/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90', '-f', '/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/build.ninja']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 130, in __init__
self.capture()
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 179, in capture
CudaGraphRunner.capture(self)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 522, in capture
_capture_one_stream()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 509, in _capture_one_stream
) = self.capture_one_batch_size(bs, forward, stream_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 278, in capture_one_batch_size
self.model_runner.draft_attn_backend.init_forward_metadata_capture_cuda_graph(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1591, in init_forward_metadata_capture_cuda_graph
self.common_template(forward_batch, self.cuda_graph_kv_indices, call_fn)
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1542, in common_template
call_fn(i, forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1581, in call_fn
self.attn_backends[i].init_forward_metadata_capture_cuda_graph(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 568, in init_forward_metadata_capture_cuda_graph
self.indices_updater_decode.update(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 964, in update_single_wrapper
self.call_begin_forward(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1145, in call_begin_forward
wrapper.begin_forward(
File "/usr/local/lib/python3.12/dist-packages/flashinfer/decode.py", line 1051, in plan
self._cached_module = get_batch_prefill_module(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 404, in get_batch_prefill_module
module = gen_batch_prefill_module(backend, *args).build_and_load()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/core.py", line 316, in build_and_load
self.build(verbose, need_lock=False)
File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/core.py", line 302, in build
run_ninja(self.build_dir, self.ninja_path, verbose)
File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/cpp_ext.py", line 340, in run_ninja
raise RuntimeError(msg) from e
RuntimeError: Ninja build failed. Ninja output:
ninja: Entering directory `/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90'
[1/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu".
[2/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu".
[3/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=128, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 128, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=128, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 128, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=128, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 128, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu".
[4/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu".
[5/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu".
[6/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu".
[7/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=true]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=true]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=true]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu".
[8/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu".
ninja: build stopped: subcommand failed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2937, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 346, in __init__
self.init_model_worker()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 536, in init_model_worker
self.maybe_init_draft_worker()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 532, in maybe_init_draft_worker
self.draft_worker = DraftWorkerClass(**draft_worker_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 615, in __init__
self._draft_worker = EagleDraftWorker(
^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 170, in __init__
self.init_cuda_graphs()
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 263, in init_cuda_graphs
self.cuda_graph_runner = Device2DraftCudaGraphRunner[
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 132, in __init__
raise Exception(
Exception: Capture cuda graph failed: Ninja build failed. Ninja output:
ninja: Entering directory `/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90'
[1/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu".
[2/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_0.cu".
[3/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=128, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 128, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=128, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 128, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=128, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 128, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_1.cu".
[4/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu".
[5/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=true, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCausal, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_1.cu".
[6/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=false]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu".
[7/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeQ, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<96>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=true]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=true]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=PagedParams::DTypeKV, ElementB=PagedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<96>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=false, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=96, NUM_STAGES_=2, DTypeQ_=PagedParams::DTypeQ, DTypeKV_=PagedParams::DTypeKV, DTypeO_=PagedParams::DTypeO, IdType_=PagedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 358 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params,MULTIITEMSCORING>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<false, 128, 128, 128, 96, 2, PagedParams::DTypeQ, PagedParams::DTypeKV, PagedParams::DTypeO, PagedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=PagedParams, MULTIITEMSCORING=true]" at line 599 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=PagedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_3.cu".
[8/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu -o 
/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(6108): error: static assertion failed with "MajorB must be GMMA::Major::K for this config."
static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 79 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kMultiItemScoring, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu
3 errors detected in the compilation of "/root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_3.cu".
ninja: build stopped: subcommand failed.
Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
[2026-02-02 23:05:08] Received sigquit from a child process. It usually means the child failed.
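For reference, the mitigations the log suggests would be passed at launch roughly as below. This is a sketch only: the model path and flag values are placeholders taken from the log, not a verified fix. Note that the actual failure above is a FlashInfer JIT compilation error ("No eligible GMMA operator") while building the sm90 prefill kernel for bf16 queries with an fp8_e4m3 KV cache, so the memory-oriented flags alone may not resolve it; dropping `--kv-cache-dtype fp8_e4m3` or upgrading flashinfer are plausible alternatives to try.

```shell
# Sketch, not a tested fix: relaunch with the log's suggested mitigations.
# Flag values below (0.8, 16) are the examples printed in the log itself.
# Since the root error is a kernel compile failure for bf16 Q + fp8_e4m3 KV
# on sm90, also consider omitting --kv-cache-dtype fp8_e4m3 entirely.
python -m sglang.launch_server \
  --model-path /mnt/models \
  --tp-size 4 \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.8 \
  --cuda-graph-max-bs 16
```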