Created
December 31, 2025 01:24
| # P100 Extended Implementation Tasks - 100+ tasks for full agent utilization | |
| # Tesla P100 (GP100) - 56 SMs, 3584 CUDA cores, 16GB HBM2 @ 732 GB/s | |
| tasks: | |
| # ============================================ | |
| # CUDA Kernels (P0 - Critical; last 6 tasks are P1) - 20 tasks | |
| # ============================================ | |
| - name: kernel-vecadd-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/vecadd_sm60.cu | |
| Reference: contrib/p40/kernels/vecadd.cu | |
| Implement vector addition kernel optimized for sm_60: | |
| - Use __ldg() for read-only data | |
| - Optimize for 732 GB/s HBM2 bandwidth | |
| - Include FP32 and FP16 variants | |
| - Add warp-level primitives for reduction | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-matmul-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/matmul_sm60.cu | |
| Implement tiled matrix multiplication for P100: | |
| - Shared memory tiling sized for 64 KB per SM (48 KB max per thread block) | |
| - Register blocking for high arithmetic intensity | |
| - Support for transposed inputs | |
| - FP32/FP16 variants with mixed precision accumulation | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-softmax-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/softmax_sm60.cu | |
| Implement numerically stable softmax for P100: | |
| - Online softmax algorithm (single pass) | |
| - Warp-level reductions using __shfl_down_sync | |
| - Support for variable sequence lengths | |
| - Fused with attention score scaling | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-gelu-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/gelu_sm60.cu | |
| Implement GELU activation for P100: | |
| - Exact GELU using erf() | |
| - Fast GELU approximation (tanh-based) | |
| - Fused with bias addition | |
| - In-place variant for memory efficiency | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-layernorm-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/layernorm_sm60.cu | |
| Implement fused LayerNorm for P100: | |
| - Single-pass mean and variance | |
| - Warp-level reductions | |
| - Fused with residual addition | |
| - Support for RMSNorm variant | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-attention-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/attention_sm60.cu | |
| Implement scaled dot-product attention for P100: | |
| - Q×K^T computation with scaling | |
| - Softmax in registers where possible | |
| - Fused multiplication by V | |
| - Support for causal masking | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-flash-attn-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/flash_attention_sm60.cu | |
| Implement Flash Attention algorithm for P100: | |
| - Tiled Q, K, V processing | |
| - Online softmax with rescaling | |
| - Maximize HBM2 bandwidth utilization | |
| - Support for multi-head attention | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-rope-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/rope_sm60.cu | |
| Implement Rotary Position Embeddings for P100: | |
| - Fused sin/cos computation | |
| - In-place rotation | |
| - Support for variable positions | |
| - Batch processing for efficiency | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-silu-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/silu_sm60.cu | |
| Implement SiLU/Swish activation for P100: | |
| - Fused sigmoid and multiply | |
| - Support for gated variants (SiLU-Gate) | |
| - In-place computation | |
| - FP16 support | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-embedding-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/embedding_sm60.cu | |
| Implement embedding lookup for P100: | |
| - Coalesced memory access | |
| - Support for large vocabularies | |
| - Position embedding addition | |
| - Padding handling | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-reduce-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/reduce_sm60.cu | |
| Implement general reduction kernels for P100: | |
| - Sum, mean, max, min operations | |
| - Multi-stage reduction for large tensors | |
| - Warp-level and block-level variants | |
| - Support for different data types | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-transpose-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/transpose_sm60.cu | |
| Implement optimized transpose for P100: | |
| - Shared memory transpose to avoid bank conflicts | |
| - Batched transpose for 3D/4D tensors | |
| - Support for non-contiguous strides | |
| - Fused with permute operations | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-concat-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/concat_sm60.cu | |
| Implement tensor concatenation for P100: | |
| - Along any dimension | |
| - Memory-efficient for large tensors | |
| - Support for variable number of inputs | |
| - Fused with split operations | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-scatter-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/scatter_gather_sm60.cu | |
| Implement scatter/gather operations for P100: | |
| - Index-based scatter/gather | |
| - Atomic operations for overlapping indices | |
| - Support for multi-dimensional indexing | |
| - Optimized for sparse patterns | |
| priority: P0 | |
| dependencies: [] | |
| - name: kernel-conv1d-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/conv1d_sm60.cu | |
| Implement 1D convolution for P100: | |
| - Direct convolution for small kernels | |
| - FFT-based for large kernels | |
| - Causal padding support | |
| - Grouped convolution | |
| priority: P1 | |
| dependencies: [] | |
| - name: kernel-topk-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/topk_sm60.cu | |
| Implement top-k selection for P100: | |
| - Radix-based selection | |
| - Support for large k values | |
| - Sorted and unsorted output | |
| - Index tracking | |
| priority: P1 | |
| dependencies: [] | |
| - name: kernel-dropout-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/dropout_sm60.cu | |
| Implement dropout for P100: | |
| - Philox counter-based random number generator | |
| - Deterministic with seeds | |
| - Fused with scaling | |
| - Inference mode (pass-through) | |
| priority: P1 | |
| dependencies: [] | |
| - name: kernel-cross-entropy-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/cross_entropy_sm60.cu | |
| Implement cross-entropy loss for P100: | |
| - Numerically stable log-softmax | |
| - Label smoothing support | |
| - Ignore index handling | |
| - Gradient computation | |
| priority: P1 | |
| dependencies: [] | |
| - name: kernel-adamw-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/adamw_sm60.cu | |
| Implement AdamW optimizer step for P100: | |
| - Fused parameter update | |
| - Weight decay handling | |
| - FP32 master weights with FP16 params | |
| - Gradient clipping option | |
| priority: P1 | |
| dependencies: [] | |
| - name: kernel-rmsprop-sm60 | |
| prompt: | | |
| Create contrib/p100/kernels/rmsprop_sm60.cu | |
| Implement RMSprop optimizer for P100: | |
| - Running average of squared gradients | |
| - Momentum variant | |
| - Epsilon for numerical stability | |
| - Centered variant | |
| priority: P1 | |
| dependencies: [] | |
| # ============================================ | |
| # Memory Management (P1 - High; 7 tasks are P2) - 15 tasks | |
| # ============================================ | |
| - name: mem-pool-allocator | |
| prompt: | | |
| Create contrib/p100/worker/p100_pool_allocator.py | |
| Implement memory pool allocator for P100 HBM2: | |
| - Power-of-2 size classes | |
| - Thread-safe allocation | |
| - Memory coalescing | |
| - Fragmentation tracking | |
| priority: P1 | |
| dependencies: [] | |
| - name: mem-async-transfer | |
| prompt: | | |
| Create contrib/p100/worker/p100_async_transfer.py | |
| Implement async memory transfers: | |
| - Pinned host memory | |
| - Bidirectional DMA | |
| - Stream-ordered transfers | |
| - Overlap with compute | |
| priority: P1 | |
| dependencies: [] | |
| - name: mem-tensor-cache | |
| prompt: | | |
| Create contrib/p100/worker/p100_tensor_cache.py | |
| Implement tensor caching for P100: | |
| - LRU eviction policy | |
| - Reference counting | |
| - Cache coherency with host | |
| - Size-based eviction | |
| priority: P1 | |
| dependencies: [] | |
| - name: mem-kv-cache-manager | |
| prompt: | | |
| Create contrib/p100/worker/p100_kv_cache.py | |
| Implement KV cache for transformer inference: | |
| - Paged attention support | |
| - Dynamic cache growth | |
| - Multi-sequence batching | |
| - Memory-efficient layout | |
| priority: P1 | |
| dependencies: [] | |
| - name: mem-gradient-checkpointing | |
| prompt: | | |
| Create contrib/p100/worker/p100_checkpoint.py | |
| Implement gradient checkpointing: | |
| - Selective recomputation | |
| - Segment boundaries | |
| - Memory-compute tradeoff | |
| - Integration with autograd | |
| priority: P1 | |
| dependencies: [] | |
| - name: mem-zero-copy | |
| prompt: | | |
| Create contrib/p100/worker/p100_zero_copy.py | |
| Implement zero-copy memory for P100: | |
| - Direct GPU access to host memory | |
| - Memory mapping | |
| - Access pattern optimization | |
| - Coherency management | |
| priority: P1 | |
| dependencies: [] | |
| - name: mem-prefetch | |
| prompt: | | |
| Create contrib/p100/worker/p100_prefetch.py | |
| Implement memory prefetching: | |
| - Hardware prefetch hints | |
| - Software-managed prefetch | |
| - Prefetch scheduling | |
| - Bandwidth management | |
| priority: P1 | |
| dependencies: [] | |
| - name: mem-compaction | |
| prompt: | | |
| Create contrib/p100/worker/p100_compaction.py | |
| Implement memory compaction: | |
| - Defragmentation algorithm | |
| - Background compaction | |
| - Minimal disruption | |
| - Statistics tracking | |
| priority: P2 | |
| dependencies: [] | |
| - name: mem-oversubscription | |
| prompt: | | |
| Create contrib/p100/worker/p100_oversubscription.py | |
| Implement memory oversubscription: | |
| - Page eviction to host | |
| - Working set tracking | |
| - Demand paging | |
| - Access pattern learning | |
| priority: P2 | |
| dependencies: [] | |
| - name: mem-shared-pool | |
| prompt: | | |
| Create contrib/p100/worker/p100_shared_pool.py | |
| Implement shared memory pool across processes: | |
| - IPC memory sharing | |
| - Reference counting | |
| - Cleanup on process exit | |
| - Security boundaries | |
| priority: P2 | |
| dependencies: [] | |
| - name: mem-nvlink-p2p | |
| prompt: | | |
| Create contrib/p100/worker/p100_nvlink.py | |
| Implement NVLink P2P transfers (if available): | |
| - Direct GPU-to-GPU transfer | |
| - Topology detection | |
| - Optimal routing | |
| - Fallback to PCIe | |
| priority: P2 | |
| dependencies: [] | |
| - name: mem-staging-buffer | |
| prompt: | | |
| Create contrib/p100/worker/p100_staging.py | |
| Implement staging buffers: | |
| - Double buffering | |
| - Ring buffer for streams | |
| - Size optimization | |
| - Lifetime management | |
| priority: P2 | |
| dependencies: [] | |
| - name: mem-hbm2-optimizer | |
| prompt: | | |
| Create contrib/p100/worker/p100_hbm2_optimizer.py | |
| Implement HBM2-specific optimizations: | |
| - Stack interleaving patterns | |
| - Bank conflict avoidance | |
| - Access coalescing | |
| - Pseudo-channel awareness | |
| priority: P1 | |
| dependencies: [] | |
| - name: mem-ecc-manager | |
| prompt: | | |
| Create contrib/p100/worker/p100_ecc.py | |
| Implement ECC memory management: | |
| - ECC status monitoring | |
| - Error counters | |
| - Page retirement | |
| - Health reporting | |
| priority: P2 | |
| dependencies: [] | |
| - name: mem-numa-aware | |
| prompt: | | |
| Create contrib/p100/worker/p100_numa.py | |
| Implement NUMA-aware memory allocation: | |
| - CPU socket detection | |
| - Optimal placement | |
| - Migration support | |
| - Affinity management | |
| priority: P2 | |
| dependencies: [] | |
| # ============================================ | |
| # Neural Network Layers (P1 - High; 3 tasks are P2) - 20 tasks | |
| # ============================================ | |
| - name: nn-linear | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_linear.py | |
| Implement Linear layer for P100: | |
| - Optimized GEMM wrapper | |
| - Bias fusion | |
| - Mixed precision support | |
| - Batch processing | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-embedding | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_embedding.py | |
| Implement Embedding layer: | |
| - Lookup kernel wrapper | |
| - Gradient accumulation | |
| - Padding index handling | |
| - Sparse gradients | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-multihead-attention | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_mha.py | |
| Implement Multi-Head Attention: | |
| - QKV projection fused | |
| - Attention kernel dispatch | |
| - Output projection | |
| - KV caching integration | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-mlp-block | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_mlp.py | |
| Implement MLP/FFN block: | |
| - Fused linear + activation | |
| - Gated variants (SwiGLU, GeGLU) | |
| - Gradient checkpointing hooks | |
| - Memory-efficient backward | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-transformer-block | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_transformer_block.py | |
| Implement full Transformer block: | |
| - Pre/Post LayerNorm variants | |
| - Attention + MLP composition | |
| - Residual connections | |
| - Parallel attention+FFN option | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-conv1d-layer | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_conv1d.py | |
| Implement Conv1d layer: | |
| - Kernel dispatch | |
| - Padding modes | |
| - Dilation support | |
| - Groups support | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-groupnorm | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_groupnorm.py | |
| Implement GroupNorm layer: | |
| - Group-wise statistics | |
| - Affine parameters | |
| - Memory-efficient backward | |
| - Instance norm as special case | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-batchnorm | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_batchnorm.py | |
| Implement BatchNorm layer: | |
| - Running statistics | |
| - Training vs eval modes | |
| - Synchronized BatchNorm (SyncBN) for multi-GPU | |
| - Momentum for EMA | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-dropout-layer | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_dropout.py | |
| Implement Dropout layer: | |
| - Kernel wrapper | |
| - Deterministic mode | |
| - Various dropout patterns | |
| - Inference passthrough | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-positional-encoding | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_pos_encoding.py | |
| Implement positional encodings: | |
| - Sinusoidal encoding | |
| - Learned embeddings | |
| - RoPE integration | |
| - ALiBi support | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-softmax-layer | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_softmax.py | |
| Implement Softmax layer: | |
| - Stable computation | |
| - Temperature scaling | |
| - Log-softmax variant | |
| - Sparse softmax option | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-activation-layer | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_activations.py | |
| Implement activation functions: | |
| - GELU, SiLU, ReLU, Tanh | |
| - Fused variants | |
| - Custom activation support | |
| - In-place operations | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-cross-entropy-layer | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_loss.py | |
| Implement loss functions: | |
| - CrossEntropyLoss | |
| - Label smoothing | |
| - Focal loss variant | |
| - Gradient computation | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-optimizer-wrapper | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_optimizers.py | |
| Implement optimizer wrappers: | |
| - AdamW, SGD, RMSprop | |
| - Learning rate scheduling | |
| - Gradient clipping | |
| - Parameter groups | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-lm-head | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_lm_head.py | |
| Implement language model head: | |
| - Tied embeddings | |
| - Efficient logit computation | |
| - Temperature sampling | |
| - Top-k and top-p (nucleus) filtering | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-attention-mask | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_attention_mask.py | |
| Implement attention mask generation: | |
| - Causal masks | |
| - Padding masks | |
| - Sliding window | |
| - Custom patterns | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-weight-init | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_init.py | |
| Implement weight initialization: | |
| - Xavier/Glorot | |
| - Kaiming/He | |
| - Orthogonal | |
| - Custom init patterns | |
| priority: P2 | |
| dependencies: [] | |
| - name: nn-gradient-scale | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_grad_scale.py | |
| Implement gradient scaling: | |
| - Loss scaling for FP16 | |
| - Dynamic scaling | |
| - Overflow detection | |
| - Gradient accumulation | |
| priority: P1 | |
| dependencies: [] | |
| - name: nn-model-parallel | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_model_parallel.py | |
| Implement tensor parallelism: | |
| - Column parallel | |
| - Row parallel | |
| - Sequence parallel | |
| - All-reduce communication | |
| priority: P2 | |
| dependencies: [] | |
| - name: nn-pipeline-parallel | |
| prompt: | | |
| Create contrib/p100/worker/layers/p100_pipeline.py | |
| Implement pipeline parallelism: | |
| - Stage boundaries | |
| - Micro-batching | |
| - 1F1B scheduling | |
| - Memory optimization | |
| priority: P2 | |
| dependencies: [] | |
| # ============================================ | |
| # Benchmarks & Tests (P2 - Medium) - 20 tasks | |
| # ============================================ | |
| - name: bench-gemm | |
| prompt: | | |
| Create contrib/p100/benchmarks/bench_gemm.py | |
| GEMM benchmark suite: | |
| - Various matrix sizes | |
| - FP16/FP32 comparison | |
| - Batched GEMM | |
| - Peak TFLOPS measurement | |
| priority: P2 | |
| dependencies: [] | |
| - name: bench-memory-bandwidth | |
| prompt: | | |
| Create contrib/p100/benchmarks/bench_bandwidth.py | |
| Memory bandwidth benchmark: | |
| - HBM2 read/write bandwidth | |
| - Host-to-device transfers | |
| - Device-to-device transfers (multi-GPU only) | |
| - Sustained vs peak bandwidth | |
| priority: P2 | |
| dependencies: [] | |
| - name: bench-attention | |
| prompt: | | |
| Create contrib/p100/benchmarks/bench_attention.py | |
| Attention benchmark: | |
| - Standard vs Flash attention | |
| - Various sequence lengths | |
| - Multi-head configurations | |
| - Memory usage tracking | |
| priority: P2 | |
| dependencies: [] | |
| - name: bench-transformer | |
| prompt: | | |
| Create contrib/p100/benchmarks/bench_transformer.py | |
| Transformer layer benchmark: | |
| - Forward pass timing | |
| - Backward pass timing | |
| - Memory footprint | |
| - Batch size scaling | |
| priority: P2 | |
| dependencies: [] | |
| - name: bench-end-to-end | |
| prompt: | | |
| Create contrib/p100/benchmarks/bench_e2e.py | |
| End-to-end inference benchmark: | |
| - Tokens per second | |
| - Time to first token | |
| - Memory efficiency | |
| - Different model sizes | |
| priority: P2 | |
| dependencies: [] | |
| - name: bench-kernel-launch | |
| prompt: | | |
| Create contrib/p100/benchmarks/bench_launch.py | |
| Kernel launch overhead benchmark: | |
| - Empty kernel timing | |
| - Grid size impact | |
| - Stream switching cost | |
| - Async launch efficiency | |
| priority: P2 | |
| dependencies: [] | |
| - name: bench-allreduce | |
| prompt: | | |
| Create contrib/p100/benchmarks/bench_collective.py | |
| Collective communication benchmark: | |
| - All-reduce timing | |
| - All-gather timing | |
| - Ring vs tree algorithms | |
| - Message size scaling | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-kernel-vecadd | |
| prompt: | | |
| Create contrib/p100/tests/test_kernels_vecadd.py | |
| Vector addition kernel tests: | |
| - Correctness verification | |
| - Edge cases (empty, large) | |
| - Different data types | |
| - Random input fuzzing | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-kernel-matmul | |
| prompt: | | |
| Create contrib/p100/tests/test_kernels_matmul.py | |
| Matrix multiplication tests: | |
| - Compare with CPU reference | |
| - Transposed variants | |
| - Non-square matrices | |
| - Numerical accuracy | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-kernel-attention | |
| prompt: | | |
| Create contrib/p100/tests/test_kernels_attention.py | |
| Attention kernel tests: | |
| - Softmax stability | |
| - Causal masking | |
| - Multi-head correctness | |
| - Gradient verification | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-memory-alloc | |
| prompt: | | |
| Create contrib/p100/tests/test_memory.py | |
| Memory allocation tests: | |
| - Alloc/free cycles | |
| - Fragmentation behavior | |
| - OOM handling | |
| - Pool efficiency | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-layers | |
| prompt: | | |
| Create contrib/p100/tests/test_layers.py | |
| Layer implementation tests: | |
| - Forward correctness | |
| - Backward correctness | |
| - Parameter initialization | |
| - State serialization | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-transformer | |
| prompt: | | |
| Create contrib/p100/tests/test_transformer.py | |
| Transformer integration tests: | |
| - Block stacking | |
| - KV cache consistency | |
| - Sequence length handling | |
| - Batch processing | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-precision | |
| prompt: | | |
| Create contrib/p100/tests/test_precision.py | |
| Numerical precision tests: | |
| - FP16 overflow detection | |
| - Accuracy vs FP32 | |
| - Loss scaling verification | |
| - Gradient magnitudes | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-streams | |
| prompt: | | |
| Create contrib/p100/tests/test_streams.py | |
| CUDA stream tests: | |
| - Stream creation/deletion | |
| - Synchronization | |
| - Event timing | |
| - Multi-stream execution | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-error-handling | |
| prompt: | | |
| Create contrib/p100/tests/test_errors.py | |
| Error handling tests: | |
| - CUDA error recovery | |
| - OOM behavior | |
| - Invalid input handling | |
| - Timeout detection | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-regression | |
| prompt: | | |
| Create contrib/p100/tests/test_regression.py | |
| Regression test suite: | |
| - Known bug reproductions | |
| - Performance regression | |
| - Memory leak detection | |
| - API compatibility | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-stress | |
| prompt: | | |
| Create contrib/p100/tests/test_stress.py | |
| Stress testing: | |
| - Long-running workloads | |
| - Memory pressure | |
| - Concurrent operations | |
| - Resource exhaustion | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-compatibility | |
| prompt: | | |
| Create contrib/p100/tests/test_compat.py | |
| Compatibility tests: | |
| - P40 code compatibility | |
| - API consistency | |
| - Behavior parity | |
| - Migration validation | |
| priority: P2 | |
| dependencies: [] | |
| - name: test-integration | |
| prompt: | | |
| Create contrib/p100/tests/test_integration.py | |
| Full integration tests: | |
| - Model loading | |
| - Inference pipeline | |
| - Training loop | |
| - Checkpoint save/load | |
| priority: P2 | |
| dependencies: [] | |
| # ============================================ | |
| # Documentation (P2 - Medium) - 15 tasks | |
| # ============================================ | |
| - name: docs-architecture | |
| prompt: | | |
| Create contrib/p100/docs/ARCHITECTURE.md | |
| P100 architecture documentation: | |
| - GP100 die layout | |
| - SM organization | |
| - Memory hierarchy | |
| - Compute capabilities | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-kernels | |
| prompt: | | |
| Create contrib/p100/docs/KERNELS.md | |
| Kernel development guide: | |
| - sm_60 specifics | |
| - Register usage | |
| - Shared memory | |
| - Optimization tips | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-hbm2 | |
| prompt: | | |
| Create contrib/p100/docs/HBM2_GUIDE.md | |
| HBM2 memory guide: | |
| - Stack architecture | |
| - Bandwidth optimization | |
| - Access patterns | |
| - Bank conflicts | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-api-reference | |
| prompt: | | |
| Create contrib/p100/docs/API_REFERENCE.md | |
| API reference documentation: | |
| - All public classes | |
| - All public functions | |
| - Parameter descriptions | |
| - Usage examples | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-getting-started | |
| prompt: | | |
| Create contrib/p100/docs/GETTING_STARTED.md | |
| Getting started guide: | |
| - Prerequisites | |
| - Installation | |
| - First example | |
| - Common pitfalls | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-migration | |
| prompt: | | |
| Create contrib/p100/docs/MIGRATION_FROM_P40.md | |
| P40 to P100 migration guide: | |
| - API differences | |
| - Performance changes | |
| - Memory considerations | |
| - Code examples | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-performance | |
| prompt: | | |
| Create contrib/p100/docs/PERFORMANCE_TUNING.md | |
| Performance tuning guide: | |
| - Profiling tools | |
| - Bottleneck identification | |
| - Optimization techniques | |
| - Benchmark interpretation | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-troubleshooting | |
| prompt: | | |
| Create contrib/p100/docs/TROUBLESHOOTING.md | |
| Troubleshooting guide: | |
| - Common errors | |
| - Debug techniques | |
| - Log interpretation | |
| - Recovery procedures | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-examples-inference | |
| prompt: | | |
| Create contrib/p100/examples/inference_example.py | |
| Inference example: | |
| - Model loading | |
| - Input preprocessing | |
| - Generation loop | |
| - Output postprocessing | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-examples-benchmark | |
| prompt: | | |
| Create contrib/p100/examples/benchmark_example.py | |
| Benchmark example: | |
| - Setup and teardown | |
| - Timing methodology | |
| - Results reporting | |
| - Comparison scripts | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-examples-memory | |
| prompt: | | |
| Create contrib/p100/examples/memory_example.py | |
| Memory management example: | |
| - Allocation patterns | |
| - Pool usage | |
| - Transfer optimization | |
| - Monitoring | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-examples-multistream | |
| prompt: | | |
| Create contrib/p100/examples/multistream_example.py | |
| Multi-stream example: | |
| - Stream creation | |
| - Work distribution | |
| - Synchronization | |
| - Performance benefits | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-changelog | |
| prompt: | | |
| Create contrib/p100/CHANGELOG.md | |
| Changelog documentation: | |
| - Version history | |
| - Breaking changes | |
| - New features | |
| - Bug fixes | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-contributing | |
| prompt: | | |
| Create contrib/p100/CONTRIBUTING.md | |
| Contributing guide: | |
| - Code style | |
| - Test requirements | |
| - PR process | |
| - Review guidelines | |
| priority: P2 | |
| dependencies: [] | |
| - name: docs-faq | |
| prompt: | | |
| Create contrib/p100/docs/FAQ.md | |
| Frequently asked questions: | |
| - Common questions | |
| - Best practices | |
| - Limitations | |
| - Future plans | |
| priority: P2 | |
| dependencies: [] | |
| # ============================================ | |
| # Utilities & Tools (P2 - Medium) - 10 tasks | |
| # ============================================ | |
| - name: tool-profiler-analysis | |
| prompt: | | |
| Create contrib/p100/tools/profiler_analyzer.py | |
| Profiler output analyzer: | |
| - Parse profiler data | |
| - Generate reports | |
| - Identify hotspots | |
| - Visualization | |
| priority: P2 | |
| dependencies: [] | |
| - name: tool-memory-visualizer | |
| prompt: | | |
| Create contrib/p100/tools/memory_visualizer.py | |
| Memory usage visualizer: | |
| - Timeline view | |
| - Allocation tracking | |
| - Fragmentation display | |
| - Export to HTML | |
| priority: P2 | |
| dependencies: [] | |
| - name: tool-kernel-comparator | |
| prompt: | | |
| Create contrib/p100/tools/kernel_compare.py | |
| Kernel performance comparator: | |
| - Side-by-side comparison | |
| - Statistical analysis | |
| - Regression detection | |
| - Report generation | |
| priority: P2 | |
| dependencies: [] | |
| - name: tool-model-analyzer | |
| prompt: | | |
| Create contrib/p100/tools/model_analyzer.py | |
| Model analysis tool: | |
| - Layer breakdown | |
| - Parameter count | |
| - FLOP estimation | |
| - Memory requirements | |
| priority: P2 | |
| dependencies: [] | |
| - name: tool-config-validator | |
| prompt: | | |
| Create contrib/p100/tools/config_validator.py | |
| Configuration validator: | |
| - Schema validation | |
| - Constraint checking | |
| - Compatibility verification | |
| - Suggestions for improvement | |
| priority: P2 | |
| dependencies: [] | |
| - name: tool-debug-helper | |
| prompt: | | |
| Create contrib/p100/tools/debug_helper.py | |
| Debug helper utilities: | |
| - Memory dump | |
| - State inspection | |
| - Breakpoint helpers | |
| - Trace logging | |
| priority: P2 | |
| dependencies: [] | |
| - name: tool-benchmark-runner | |
| prompt: | | |
| Create contrib/p100/tools/benchmark_runner.py | |
| Benchmark automation: | |
| - Configuration loading | |
| - Sequential execution | |
| - Results aggregation | |
| - Comparison with baselines | |
| priority: P2 | |
| dependencies: [] | |
| - name: tool-test-runner | |
| prompt: | | |
| Create contrib/p100/tools/test_runner.py | |
| Test automation: | |
| - Test discovery | |
| - Parallel execution | |
| - Coverage reporting | |
| - Failure analysis | |
| priority: P2 | |
| dependencies: [] | |
| - name: tool-code-generator | |
| prompt: | | |
| Create contrib/p100/tools/codegen.py | |
| Code generation utilities: | |
| - Kernel templates | |
| - Binding generators | |
| - Test scaffolding | |
| - Documentation stubs | |
| priority: P2 | |
| dependencies: [] | |
| - name: tool-health-dashboard | |
| prompt: | | |
| Create contrib/p100/tools/health_dashboard.py | |
| GPU health dashboard: | |
| - Real-time monitoring | |
| - Temperature tracking | |
| - Utilization graphs | |
| - Alert system | |
| priority: P2 | |
| dependencies: [] | |
| # ============================================ | |
| # P100 Facility Scale Tasks - Additional 400 tasks | |
| # Focus: multi-human, multi-node facility operations and optimizations | |
| # ============================================ | |
| # Cluster Scheduling & Orchestration - 40 tasks | |
| - name: cluster-scheduling-priority-queues-design | |
| prompt: | | |
| Design priority queues for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-priority-queues-implement | |
| prompt: | | |
| Implement priority queues for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-priority-queues-test | |
| prompt: | | |
| Test priority queues for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-priority-queues-document | |
| prompt: | | |
| Document priority queues for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-fair-share-scheduling-design | |
| prompt: | | |
| Design fair-share scheduling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-fair-share-scheduling-implement | |
| prompt: | | |
| Implement fair-share scheduling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-fair-share-scheduling-test | |
| prompt: | | |
| Test fair-share scheduling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-fair-share-scheduling-document | |
| prompt: | | |
| Document fair-share scheduling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-preemption-policy-design | |
| prompt: | | |
| Design preemption policy for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-preemption-policy-implement | |
| prompt: | | |
| Implement preemption policy for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-preemption-policy-test | |
| prompt: | | |
| Test preemption policy for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-preemption-policy-document | |
| prompt: | | |
| Document preemption policy for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-backfilling-design | |
| prompt: | | |
| Design backfilling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-backfilling-implement | |
| prompt: | | |
| Implement backfilling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-backfilling-test | |
| prompt: | | |
| Test backfilling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-backfilling-document | |
| prompt: | | |
| Document backfilling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-gang-scheduling-design | |
| prompt: | | |
| Design gang scheduling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-gang-scheduling-implement | |
| prompt: | | |
| Implement gang scheduling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-gang-scheduling-test | |
| prompt: | | |
| Test gang scheduling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-gang-scheduling-document | |
| prompt: | | |
| Document gang scheduling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-reservation-windows-design | |
| prompt: | | |
| Design reservation windows for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-reservation-windows-implement | |
| prompt: | | |
| Implement reservation windows for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-reservation-windows-test | |
| prompt: | | |
| Test reservation windows for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-reservation-windows-document | |
| prompt: | | |
| Document reservation windows for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-node-labeling-and-constraints-design | |
| prompt: | | |
| Design node labeling and constraints for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-node-labeling-and-constraints-implement | |
| prompt: | | |
| Implement node labeling and constraints for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-node-labeling-and-constraints-test | |
| prompt: | | |
| Test node labeling and constraints for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-node-labeling-and-constraints-document | |
| prompt: | | |
| Document node labeling and constraints for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-gpu-health-aware-placement-design | |
| prompt: | | |
| Design GPU health-aware placement for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-gpu-health-aware-placement-implement | |
| prompt: | | |
| Implement GPU health-aware placement for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-gpu-health-aware-placement-test | |
| prompt: | | |
| Test GPU health-aware placement for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-gpu-health-aware-placement-document | |
| prompt: | | |
| Document GPU health-aware placement for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-data-locality-placement-design | |
| prompt: | | |
| Design data locality placement for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-data-locality-placement-implement | |
| prompt: | | |
| Implement data locality placement for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-data-locality-placement-test | |
| prompt: | | |
| Test data locality placement for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-data-locality-placement-document | |
| prompt: | | |
| Document data locality placement for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-mixed-workload-isolation-design | |
| prompt: | | |
| Design mixed workload isolation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-mixed-workload-isolation-implement | |
| prompt: | | |
| Implement mixed workload isolation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: cluster-scheduling-mixed-workload-isolation-test | |
| prompt: | | |
| Test mixed workload isolation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: cluster-scheduling-mixed-workload-isolation-document | |
| prompt: | | |
| Document mixed workload isolation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| # GPU/NUMA/Affinity Tuning - 40 tasks | |
| - name: numa-affinity-numa-pinning-design | |
| prompt: | | |
| Design NUMA pinning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-numa-pinning-implement | |
| prompt: | | |
| Implement NUMA pinning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-numa-pinning-test | |
| prompt: | | |
| Test NUMA pinning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-numa-pinning-document | |
| prompt: | | |
| Document NUMA pinning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-cpu-affinity-per-agent-design | |
| prompt: | | |
| Design CPU affinity per agent for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-cpu-affinity-per-agent-implement | |
| prompt: | | |
| Implement CPU affinity per agent for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-cpu-affinity-per-agent-test | |
| prompt: | | |
| Test CPU affinity per agent for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-cpu-affinity-per-agent-document | |
| prompt: | | |
| Document CPU affinity per agent for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-pcie-topology-awareness-design | |
| prompt: | | |
| Design PCIe topology awareness for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-pcie-topology-awareness-implement | |
| prompt: | | |
| Implement PCIe topology awareness for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-pcie-topology-awareness-test | |
| prompt: | | |
| Test PCIe topology awareness for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-pcie-topology-awareness-document | |
| prompt: | | |
| Document PCIe topology awareness for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-gpu-memory-partitioning-design | |
| prompt: | | |
| Design GPU memory partitioning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-gpu-memory-partitioning-implement | |
| prompt: | | |
| Implement GPU memory partitioning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-gpu-memory-partitioning-test | |
| prompt: | | |
| Test GPU memory partitioning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-gpu-memory-partitioning-document | |
| prompt: | | |
| Document GPU memory partitioning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-pcie-peer-access-design | |
| prompt: | | |
| Design PCIe peer access for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-pcie-peer-access-implement | |
| prompt: | | |
| Implement PCIe peer access for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-pcie-peer-access-test | |
| prompt: | | |
| Test PCIe peer access for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-pcie-peer-access-document | |
| prompt: | | |
| Document PCIe peer access for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-hbm2-bandwidth-throttling-design | |
| prompt: | | |
| Design HBM2 bandwidth throttling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-hbm2-bandwidth-throttling-implement | |
| prompt: | | |
| Implement HBM2 bandwidth throttling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-hbm2-bandwidth-throttling-test | |
| prompt: | | |
| Test HBM2 bandwidth throttling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-hbm2-bandwidth-throttling-document | |
| prompt: | | |
| Document HBM2 bandwidth throttling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-power-and-clock-management-design | |
| prompt: | | |
| Design power and clock management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-power-and-clock-management-implement | |
| prompt: | | |
| Implement power and clock management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-power-and-clock-management-test | |
| prompt: | | |
| Test power and clock management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-power-and-clock-management-document | |
| prompt: | | |
| Document power and clock management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-thermal-monitoring-design | |
| prompt: | | |
| Design thermal monitoring for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-thermal-monitoring-implement | |
| prompt: | | |
| Implement thermal monitoring for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-thermal-monitoring-test | |
| prompt: | | |
| Test thermal monitoring for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-thermal-monitoring-document | |
| prompt: | | |
| Document thermal monitoring for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-ecc-error-handling-design | |
| prompt: | | |
| Design ECC error handling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-ecc-error-handling-implement | |
| prompt: | | |
| Implement ECC error handling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-ecc-error-handling-test | |
| prompt: | | |
| Test ECC error handling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-ecc-error-handling-document | |
| prompt: | | |
| Document ECC error handling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-cuda-mps-sharing-design | |
| prompt: | | |
| Design CUDA MPS sharing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-cuda-mps-sharing-implement | |
| prompt: | | |
| Implement CUDA MPS sharing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: numa-affinity-cuda-mps-sharing-test | |
| prompt: | | |
| Test CUDA MPS sharing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: numa-affinity-cuda-mps-sharing-document | |
| prompt: | | |
| Document CUDA MPS sharing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| # Data Pipeline & Storage - 40 tasks | |
| - name: data-pipeline-dataset-cache-hierarchy-design | |
| prompt: | | |
| Design dataset cache hierarchy for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-dataset-cache-hierarchy-implement | |
| prompt: | | |
| Implement dataset cache hierarchy for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-dataset-cache-hierarchy-test | |
| prompt: | | |
| Test dataset cache hierarchy for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-dataset-cache-hierarchy-document | |
| prompt: | | |
| Document dataset cache hierarchy for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-object-store-ingest-design | |
| prompt: | | |
| Design object store ingest for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-object-store-ingest-implement | |
| prompt: | | |
| Implement object store ingest for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-object-store-ingest-test | |
| prompt: | | |
| Test object store ingest for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-object-store-ingest-document | |
| prompt: | | |
| Document object store ingest for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-shard-management-design | |
| prompt: | | |
| Design shard management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-shard-management-implement | |
| prompt: | | |
| Implement shard management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-shard-management-test | |
| prompt: | | |
| Test shard management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-shard-management-document | |
| prompt: | | |
| Document shard management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-prefetching-design | |
| prompt: | | |
| Design prefetching for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-prefetching-implement | |
| prompt: | | |
| Implement prefetching for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-prefetching-test | |
| prompt: | | |
| Test prefetching for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-prefetching-document | |
| prompt: | | |
| Document prefetching for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-streaming-decompression-design | |
| prompt: | | |
| Design streaming decompression for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-streaming-decompression-implement | |
| prompt: | | |
| Implement streaming decompression for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-streaming-decompression-test | |
| prompt: | | |
| Test streaming decompression for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-streaming-decompression-document | |
| prompt: | | |
| Document streaming decompression for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-format-conversion-design | |
| prompt: | | |
| Design format conversion for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-format-conversion-implement | |
| prompt: | | |
| Implement format conversion for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-format-conversion-test | |
| prompt: | | |
| Test format conversion for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-format-conversion-document | |
| prompt: | | |
| Document format conversion for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-data-integrity-checks-design | |
| prompt: | | |
| Design data integrity checks for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-data-integrity-checks-implement | |
| prompt: | | |
| Implement data integrity checks for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-data-integrity-checks-test | |
| prompt: | | |
| Test data integrity checks for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-data-integrity-checks-document | |
| prompt: | | |
| Document data integrity checks for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-snapshot-versioning-design | |
| prompt: | | |
| Design snapshot versioning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-snapshot-versioning-implement | |
| prompt: | | |
| Implement snapshot versioning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-snapshot-versioning-test | |
| prompt: | | |
| Test snapshot versioning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-snapshot-versioning-document | |
| prompt: | | |
| Document snapshot versioning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-metadata-catalog-design | |
| prompt: | | |
| Design metadata catalog for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-metadata-catalog-implement | |
| prompt: | | |
| Implement metadata catalog for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-metadata-catalog-test | |
| prompt: | | |
| Test metadata catalog for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-metadata-catalog-document | |
| prompt: | | |
| Document metadata catalog for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-cold-hot-tiering-design | |
| prompt: | | |
| Design cold/hot tiering for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-cold-hot-tiering-implement | |
| prompt: | | |
| Implement cold/hot tiering for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: data-pipeline-cold-hot-tiering-test | |
| prompt: | | |
| Test cold/hot tiering for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: data-pipeline-cold-hot-tiering-document | |
| prompt: | | |
| Document cold/hot tiering for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| # Network Transport & Messaging - 40 tasks | |
| - name: network-transport-zeromq-routing-scale-design | |
| prompt: | | |
| Design ZeroMQ routing at scale for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-zeromq-routing-scale-implement | |
| prompt: | | |
| Implement ZeroMQ routing at scale for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-zeromq-routing-scale-test | |
| prompt: | | |
| Test ZeroMQ routing at scale for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-zeromq-routing-scale-document | |
| prompt: | | |
| Document ZeroMQ routing at scale for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-redis-queue-scaling-design | |
| prompt: | | |
| Design Redis queue scaling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-redis-queue-scaling-implement | |
| prompt: | | |
| Implement Redis queue scaling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-redis-queue-scaling-test | |
| prompt: | | |
| Test Redis queue scaling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-redis-queue-scaling-document | |
| prompt: | | |
| Document Redis queue scaling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-message-batching-design | |
| prompt: | | |
| Design message batching for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-message-batching-implement | |
| prompt: | | |
| Implement message batching for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-message-batching-test | |
| prompt: | | |
| Test message batching for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-message-batching-document | |
| prompt: | | |
| Document message batching for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-backpressure-control-design | |
| prompt: | | |
| Design backpressure control for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-backpressure-control-implement | |
| prompt: | | |
| Implement backpressure control for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-backpressure-control-test | |
| prompt: | | |
| Test backpressure control for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-backpressure-control-document | |
| prompt: | | |
| Document backpressure control for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-transport-compression-design | |
| prompt: | | |
| Design transport compression for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-transport-compression-implement | |
| prompt: | | |
| Implement transport compression for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-transport-compression-test | |
| prompt: | | |
| Test transport compression for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-transport-compression-document | |
| prompt: | | |
| Document transport compression for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-tls-for-control-plane-design | |
| prompt: | | |
| Design control-plane TLS for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-tls-for-control-plane-implement | |
| prompt: | | |
| Implement control-plane TLS for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-tls-for-control-plane-test | |
| prompt: | | |
| Test control-plane TLS for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-tls-for-control-plane-document | |
| prompt: | | |
| Document control-plane TLS for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-heartbeats-and-timeouts-design | |
| prompt: | | |
| Design heartbeats and timeouts for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-heartbeats-and-timeouts-implement | |
| prompt: | | |
| Implement heartbeats and timeouts for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-heartbeats-and-timeouts-test | |
| prompt: | | |
| Test heartbeats and timeouts for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-heartbeats-and-timeouts-document | |
| prompt: | | |
| Document heartbeats and timeouts for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-node-discovery-design | |
| prompt: | | |
| Design node discovery for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-node-discovery-implement | |
| prompt: | | |
| Implement node discovery for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-node-discovery-test | |
| prompt: | | |
| Test node discovery for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-node-discovery-document | |
| prompt: | | |
| Document node discovery for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-bandwidth-shaping-design | |
| prompt: | | |
| Design bandwidth shaping for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-bandwidth-shaping-implement | |
| prompt: | | |
| Implement bandwidth shaping for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-bandwidth-shaping-test | |
| prompt: | | |
| Test bandwidth shaping for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-bandwidth-shaping-document | |
| prompt: | | |
| Document bandwidth shaping for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-cross-datacenter-links-design | |
| prompt: | | |
| Design cross-datacenter links for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-cross-datacenter-links-implement | |
| prompt: | | |
| Implement cross-datacenter links for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: network-transport-cross-datacenter-links-test | |
| prompt: | | |
| Test cross-datacenter links for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: network-transport-cross-datacenter-links-document | |
| prompt: | | |
| Document cross-datacenter links for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| # Observability & Dashboards - 40 tasks | |
| - name: observability-per-agent-metrics-design | |
| prompt: | | |
| Design per-agent metrics for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-per-agent-metrics-implement | |
| prompt: | | |
| Implement per-agent metrics for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-per-agent-metrics-test | |
| prompt: | | |
| Test per-agent metrics for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-per-agent-metrics-document | |
| prompt: | | |
| Document per-agent metrics for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-gpu-metrics-export-design | |
| prompt: | | |
| Design GPU metrics export for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-gpu-metrics-export-implement | |
| prompt: | | |
| Implement GPU metrics export for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-gpu-metrics-export-test | |
| prompt: | | |
| Test GPU metrics export for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-gpu-metrics-export-document | |
| prompt: | | |
| Document GPU metrics export for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-queue-depth-metrics-design | |
| prompt: | | |
| Design queue depth metrics for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-queue-depth-metrics-implement | |
| prompt: | | |
| Implement queue depth metrics for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-queue-depth-metrics-test | |
| prompt: | | |
| Test queue depth metrics for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-queue-depth-metrics-document | |
| prompt: | | |
| Document queue depth metrics for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-task-latency-tracing-design | |
| prompt: | | |
| Design task latency tracing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-task-latency-tracing-implement | |
| prompt: | | |
| Implement task latency tracing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-task-latency-tracing-test | |
| prompt: | | |
| Test task latency tracing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-task-latency-tracing-document | |
| prompt: | | |
| Document task latency tracing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-log-aggregation-design | |
| prompt: | | |
| Design log aggregation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-log-aggregation-implement | |
| prompt: | | |
| Implement log aggregation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-log-aggregation-test | |
| prompt: | | |
| Test log aggregation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-log-aggregation-document | |
| prompt: | | |
| Document log aggregation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-anomaly-detection-design | |
| prompt: | | |
| Design anomaly detection for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-anomaly-detection-implement | |
| prompt: | | |
| Implement anomaly detection for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-anomaly-detection-test | |
| prompt: | | |
| Test anomaly detection for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-anomaly-detection-document | |
| prompt: | | |
| Document anomaly detection for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-sla-dashboards-design | |
| prompt: | | |
| Design SLA dashboards for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-sla-dashboards-implement | |
| prompt: | | |
| Implement SLA dashboards for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-sla-dashboards-test | |
| prompt: | | |
| Test SLA dashboards for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-sla-dashboards-document | |
| prompt: | | |
| Document SLA dashboards for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-capacity-planning-reports-design | |
| prompt: | | |
| Design capacity planning reports for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-capacity-planning-reports-implement | |
| prompt: | | |
| Implement capacity planning reports for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-capacity-planning-reports-test | |
| prompt: | | |
| Test capacity planning reports for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-capacity-planning-reports-document | |
| prompt: | | |
| Document capacity planning reports for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-alerting-rules-design | |
| prompt: | | |
| Design alerting rules for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-alerting-rules-implement | |
| prompt: | | |
| Implement alerting rules for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-alerting-rules-test | |
| prompt: | | |
| Test alerting rules for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-alerting-rules-document | |
| prompt: | | |
| Document alerting rules for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-cost-accounting-design | |
| prompt: | | |
| Design cost accounting for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-cost-accounting-implement | |
| prompt: | | |
| Implement cost accounting for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: observability-cost-accounting-test | |
| prompt: | | |
| Test cost accounting for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: observability-cost-accounting-document | |
| prompt: | | |
| Document cost accounting for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| # Reliability & Fault Tolerance - 40 tasks | |
| - name: reliability-agent-crash-recovery-design | |
| prompt: | | |
| Design agent crash recovery for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-agent-crash-recovery-implement | |
| prompt: | | |
| Implement agent crash recovery for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-agent-crash-recovery-test | |
| prompt: | | |
| Test agent crash recovery for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-agent-crash-recovery-document | |
| prompt: | | |
| Document agent crash recovery for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-coordinator-failover-design | |
| prompt: | | |
| Design coordinator failover for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-coordinator-failover-implement | |
| prompt: | | |
| Implement coordinator failover for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-coordinator-failover-test | |
| prompt: | | |
| Test coordinator failover for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-coordinator-failover-document | |
| prompt: | | |
| Document coordinator failover for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-task-retry-policies-design | |
| prompt: | | |
| Design task retry policies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-task-retry-policies-implement | |
| prompt: | | |
| Implement task retry policies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-task-retry-policies-test | |
| prompt: | | |
| Test task retry policies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-task-retry-policies-document | |
| prompt: | | |
| Document task retry policies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-idempotency-keys-design | |
| prompt: | | |
| Design idempotency keys for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-idempotency-keys-implement | |
| prompt: | | |
| Implement idempotency keys for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-idempotency-keys-test | |
| prompt: | | |
| Test idempotency keys for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-idempotency-keys-document | |
| prompt: | | |
| Document idempotency keys for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-checkpointing-tasks-design | |
| prompt: | | |
| Design task checkpointing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-checkpointing-tasks-implement | |
| prompt: | | |
| Implement task checkpointing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-checkpointing-tasks-test | |
| prompt: | | |
| Test task checkpointing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-checkpointing-tasks-document | |
| prompt: | | |
| Document task checkpointing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-rate-limiting-design | |
| prompt: | | |
| Design rate limiting for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-rate-limiting-implement | |
| prompt: | | |
| Implement rate limiting for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-rate-limiting-test | |
| prompt: | | |
| Test rate limiting for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-rate-limiting-document | |
| prompt: | | |
| Document rate limiting for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-circuit-breakers-design | |
| prompt: | | |
| Design circuit breakers for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-circuit-breakers-implement | |
| prompt: | | |
| Implement circuit breakers for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-circuit-breakers-test | |
| prompt: | | |
| Test circuit breakers for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-circuit-breakers-document | |
| prompt: | | |
| Document circuit breakers for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-rolling-restarts-design | |
| prompt: | | |
| Design rolling restarts for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-rolling-restarts-implement | |
| prompt: | | |
| Implement rolling restarts for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-rolling-restarts-test | |
| prompt: | | |
| Test rolling restarts for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-rolling-restarts-document | |
| prompt: | | |
| Document rolling restarts for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-config-hot-reload-design | |
| prompt: | | |
| Design config hot reload for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-config-hot-reload-implement | |
| prompt: | | |
| Implement config hot reload for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-config-hot-reload-test | |
| prompt: | | |
| Test config hot reload for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-config-hot-reload-document | |
| prompt: | | |
| Document config hot reload for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-disaster-recovery-plan-design | |
| prompt: | | |
| Design a disaster recovery plan for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-disaster-recovery-plan-implement | |
| prompt: | | |
| Implement the disaster recovery plan for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: reliability-disaster-recovery-plan-test | |
| prompt: | | |
| Test the disaster recovery plan for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: reliability-disaster-recovery-plan-document | |
| prompt: | | |
| Document the disaster recovery plan for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| # Security & Multi-Tenant Controls - 40 tasks | |
| - name: security-tenant-isolation-design | |
| prompt: | | |
| Design tenant isolation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-tenant-isolation-implement | |
| prompt: | | |
| Implement tenant isolation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-tenant-isolation-test | |
| prompt: | | |
| Test tenant isolation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-tenant-isolation-document | |
| prompt: | | |
| Document tenant isolation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-api-authentication-design | |
| prompt: | | |
| Design API authentication for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-api-authentication-implement | |
| prompt: | | |
| Implement API authentication for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-api-authentication-test | |
| prompt: | | |
| Test API authentication for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-api-authentication-document | |
| prompt: | | |
| Document API authentication for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-audit-logging-design | |
| prompt: | | |
| Design audit logging for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-audit-logging-implement | |
| prompt: | | |
| Implement audit logging for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-audit-logging-test | |
| prompt: | | |
| Test audit logging for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-audit-logging-document | |
| prompt: | | |
| Document audit logging for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-secrets-management-design | |
| prompt: | | |
| Design secrets management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-secrets-management-implement | |
| prompt: | | |
| Implement secrets management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-secrets-management-test | |
| prompt: | | |
| Test secrets management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-secrets-management-document | |
| prompt: | | |
| Document secrets management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-network-segmentation-design | |
| prompt: | | |
| Design network segmentation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-network-segmentation-implement | |
| prompt: | | |
| Implement network segmentation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-network-segmentation-test | |
| prompt: | | |
| Test network segmentation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-network-segmentation-document | |
| prompt: | | |
| Document network segmentation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-least-privilege-design | |
| prompt: | | |
| Design least-privilege access for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-least-privilege-implement | |
| prompt: | | |
| Implement least-privilege access for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-least-privilege-test | |
| prompt: | | |
| Test least-privilege access for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-least-privilege-document | |
| prompt: | | |
| Document least-privilege access for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-usage-quotas-design | |
| prompt: | | |
| Design usage quotas for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-usage-quotas-implement | |
| prompt: | | |
| Implement usage quotas for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-usage-quotas-test | |
| prompt: | | |
| Test usage quotas for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-usage-quotas-document | |
| prompt: | | |
| Document usage quotas for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-data-access-policies-design | |
| prompt: | | |
| Design data access policies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-data-access-policies-implement | |
| prompt: | | |
| Implement data access policies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-data-access-policies-test | |
| prompt: | | |
| Test data access policies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-data-access-policies-document | |
| prompt: | | |
| Document data access policies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-per-tenant-encryption-design | |
| prompt: | | |
| Design per-tenant encryption for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-per-tenant-encryption-implement | |
| prompt: | | |
| Implement per-tenant encryption for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-per-tenant-encryption-test | |
| prompt: | | |
| Test per-tenant encryption for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-per-tenant-encryption-document | |
| prompt: | | |
| Document per-tenant encryption for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-vulnerability-scanning-design | |
| prompt: | | |
| Design vulnerability scanning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-vulnerability-scanning-implement | |
| prompt: | | |
| Implement vulnerability scanning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: security-vulnerability-scanning-test | |
| prompt: | | |
| Test vulnerability scanning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: security-vulnerability-scanning-document | |
| prompt: | | |
| Document vulnerability scanning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| # Performance Kernels & Memory - 40 tasks | |
| - name: performance-kernel-auto-tuning-design | |
| prompt: | | |
| Design kernel auto-tuning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-kernel-auto-tuning-implement | |
| prompt: | | |
| Implement kernel auto-tuning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-kernel-auto-tuning-test | |
| prompt: | | |
| Test kernel auto-tuning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-kernel-auto-tuning-document | |
| prompt: | | |
| Document kernel auto-tuning for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-mixed-precision-policies-design | |
| prompt: | | |
| Design mixed precision policies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-mixed-precision-policies-implement | |
| prompt: | | |
| Implement mixed precision policies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-mixed-precision-policies-test | |
| prompt: | | |
| Test mixed precision policies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-mixed-precision-policies-document | |
| prompt: | | |
| Document mixed precision policies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-fused-kernels-design | |
| prompt: | | |
| Design fused kernels for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-fused-kernels-implement | |
| prompt: | | |
| Implement fused kernels for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-fused-kernels-test | |
| prompt: | | |
| Test fused kernels for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-fused-kernels-document | |
| prompt: | | |
| Document fused kernels for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-kernel-launch-overhead-design | |
| prompt: | | |
| Design kernel launch overhead reduction for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-kernel-launch-overhead-implement | |
| prompt: | | |
| Implement kernel launch overhead reduction for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-kernel-launch-overhead-test | |
| prompt: | | |
| Test kernel launch overhead reduction for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-kernel-launch-overhead-document | |
| prompt: | | |
| Document kernel launch overhead reduction for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-memory-pool-allocator-design | |
| prompt: | | |
| Design memory pool allocator for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-memory-pool-allocator-implement | |
| prompt: | | |
| Implement memory pool allocator for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-memory-pool-allocator-test | |
| prompt: | | |
| Test memory pool allocator for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-memory-pool-allocator-document | |
| prompt: | | |
| Document memory pool allocator for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-unified-memory-vs-pinned-design | |
| prompt: | | |
| Design a unified memory vs. pinned memory evaluation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-unified-memory-vs-pinned-implement | |
| prompt: | | |
| Implement the unified memory vs. pinned memory evaluation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-unified-memory-vs-pinned-test | |
| prompt: | | |
| Test the unified memory vs. pinned memory evaluation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-unified-memory-vs-pinned-document | |
| prompt: | | |
| Document the unified memory vs. pinned memory evaluation for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-overlap-compute-and-transfer-design | |
| prompt: | | |
| Design compute and transfer overlap for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-overlap-compute-and-transfer-implement | |
| prompt: | | |
| Implement overlap compute and transfer for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-overlap-compute-and-transfer-test | |
| prompt: | | |
| Test overlap compute and transfer for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-overlap-compute-and-transfer-document | |
| prompt: | | |
| Document overlap compute and transfer for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-vectorization-strategies-design | |
| prompt: | | |
| Design vectorization strategies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-vectorization-strategies-implement | |
| prompt: | | |
| Implement vectorization strategies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-vectorization-strategies-test | |
| prompt: | | |
| Test vectorization strategies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-vectorization-strategies-document | |
| prompt: | | |
| Document vectorization strategies for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-kernel-profiling-harness-design | |
| prompt: | | |
| Design kernel profiling harness for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-kernel-profiling-harness-implement | |
| prompt: | | |
| Implement kernel profiling harness for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-kernel-profiling-harness-test | |
| prompt: | | |
| Test kernel profiling harness for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-kernel-profiling-harness-document | |
| prompt: | | |
| Document kernel profiling harness for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-microbenchmark-suite-design | |
| prompt: | | |
| Design microbenchmark suite for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-microbenchmark-suite-implement | |
| prompt: | | |
| Implement microbenchmark suite for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: performance-microbenchmark-suite-test | |
| prompt: | | |
| Test microbenchmark suite for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: performance-microbenchmark-suite-document | |
| prompt: | | |
| Document microbenchmark suite for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| # Model Serving & Batching - 40 tasks | |
| - name: serving-dynamic-batching-design | |
| prompt: | | |
| Design dynamic batching for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-dynamic-batching-implement | |
| prompt: | | |
| Implement dynamic batching for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-dynamic-batching-test | |
| prompt: | | |
| Test dynamic batching for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-dynamic-batching-document | |
| prompt: | | |
| Document dynamic batching for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-request-prioritization-design | |
| prompt: | | |
| Design request prioritization for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-request-prioritization-implement | |
| prompt: | | |
| Implement request prioritization for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-request-prioritization-test | |
| prompt: | | |
| Test request prioritization for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-request-prioritization-document | |
| prompt: | | |
| Document request prioritization for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-warmup-caches-design | |
| prompt: | | |
| Design warmup caches for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-warmup-caches-implement | |
| prompt: | | |
| Implement warmup caches for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-warmup-caches-test | |
| prompt: | | |
| Test warmup caches for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-warmup-caches-document | |
| prompt: | | |
| Document warmup caches for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-model-sharding-design | |
| prompt: | | |
| Design model sharding for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-model-sharding-implement | |
| prompt: | | |
| Implement model sharding for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-model-sharding-test | |
| prompt: | | |
| Test model sharding for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-model-sharding-document | |
| prompt: | | |
| Document model sharding for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-multi-model-routing-design | |
| prompt: | | |
| Design multi-model routing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-multi-model-routing-implement | |
| prompt: | | |
| Implement multi-model routing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-multi-model-routing-test | |
| prompt: | | |
| Test multi-model routing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-multi-model-routing-document | |
| prompt: | | |
| Document multi-model routing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-concurrency-limits-design | |
| prompt: | | |
| Design concurrency limits for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-concurrency-limits-implement | |
| prompt: | | |
| Implement concurrency limits for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-concurrency-limits-test | |
| prompt: | | |
| Test concurrency limits for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-concurrency-limits-document | |
| prompt: | | |
| Document concurrency limits for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-kv-cache-management-design | |
| prompt: | | |
| Design KV cache management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-kv-cache-management-implement | |
| prompt: | | |
| Implement KV cache management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-kv-cache-management-test | |
| prompt: | | |
| Test KV cache management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-kv-cache-management-document | |
| prompt: | | |
| Document KV cache management for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-quantization-pipeline-design | |
| prompt: | | |
| Design quantization pipeline for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-quantization-pipeline-implement | |
| prompt: | | |
| Implement quantization pipeline for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-quantization-pipeline-test | |
| prompt: | | |
| Test quantization pipeline for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-quantization-pipeline-document | |
| prompt: | | |
| Document quantization pipeline for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-a-b-deployment-design | |
| prompt: | | |
| Design A/B deployment for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-a-b-deployment-implement | |
| prompt: | | |
| Implement A/B deployment for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-a-b-deployment-test | |
| prompt: | | |
| Test A/B deployment for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-a-b-deployment-document | |
| prompt: | | |
| Document A/B deployment for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-sla-aware-scheduling-design | |
| prompt: | | |
| Design SLA-aware scheduling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-sla-aware-scheduling-implement | |
| prompt: | | |
| Implement SLA-aware scheduling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: serving-sla-aware-scheduling-test | |
| prompt: | | |
| Test SLA-aware scheduling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: serving-sla-aware-scheduling-document | |
| prompt: | | |
| Document SLA-aware scheduling for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| # Testing & Validation - 40 tasks | |
| - name: validation-load-testing-design | |
| prompt: | | |
| Design load testing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-load-testing-implement | |
| prompt: | | |
| Implement load testing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-load-testing-test | |
| prompt: | | |
| Validate load testing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-load-testing-document | |
| prompt: | | |
| Document load testing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-soak-testing-design | |
| prompt: | | |
| Design soak testing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-soak-testing-implement | |
| prompt: | | |
| Implement soak testing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-soak-testing-test | |
| prompt: | | |
| Validate soak testing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-soak-testing-document | |
| prompt: | | |
| Document soak testing for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-chaos-kill-agents-design | |
| prompt: | | |
| Design chaos kill agents for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-chaos-kill-agents-implement | |
| prompt: | | |
| Implement chaos kill agents for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-chaos-kill-agents-test | |
| prompt: | | |
| Test chaos kill agents for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-chaos-kill-agents-document | |
| prompt: | | |
| Document chaos kill agents for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-network-partition-tests-design | |
| prompt: | | |
| Design network partition tests for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-network-partition-tests-implement | |
| prompt: | | |
| Implement network partition tests for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-network-partition-tests-test | |
| prompt: | | |
| Validate network partition tests for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-network-partition-tests-document | |
| prompt: | | |
| Document network partition tests for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-disk-latency-injection-design | |
| prompt: | | |
| Design disk latency injection for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-disk-latency-injection-implement | |
| prompt: | | |
| Implement disk latency injection for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-disk-latency-injection-test | |
| prompt: | | |
| Test disk latency injection for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-disk-latency-injection-document | |
| prompt: | | |
| Document disk latency injection for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-gpu-fault-injection-design | |
| prompt: | | |
| Design GPU fault injection for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-gpu-fault-injection-implement | |
| prompt: | | |
| Implement GPU fault injection for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-gpu-fault-injection-test | |
| prompt: | | |
| Test GPU fault injection for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-gpu-fault-injection-document | |
| prompt: | | |
| Document GPU fault injection for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-config-regression-tests-design | |
| prompt: | | |
| Design config regression tests for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-config-regression-tests-implement | |
| prompt: | | |
| Implement config regression tests for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-config-regression-tests-test | |
| prompt: | | |
| Validate config regression tests for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-config-regression-tests-document | |
| prompt: | | |
| Document config regression tests for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-security-regression-tests-design | |
| prompt: | | |
| Design security regression tests for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-security-regression-tests-implement | |
| prompt: | | |
| Implement security regression tests for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-security-regression-tests-test | |
| prompt: | | |
| Validate security regression tests for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-security-regression-tests-document | |
| prompt: | | |
| Document security regression tests for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-performance-regression-suite-design | |
| prompt: | | |
| Design performance regression suite for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-performance-regression-suite-implement | |
| prompt: | | |
| Implement performance regression suite for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-performance-regression-suite-test | |
| prompt: | | |
| Test performance regression suite for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-performance-regression-suite-document | |
| prompt: | | |
| Document performance regression suite for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-reproducibility-checks-design | |
| prompt: | | |
| Design reproducibility checks for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-reproducibility-checks-implement | |
| prompt: | | |
| Implement reproducibility checks for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P1 | |
| dependencies: [] | |
| - name: validation-reproducibility-checks-test | |
| prompt: | | |
| Test reproducibility checks for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |
| - name: validation-reproducibility-checks-document | |
| prompt: | | |
| Document reproducibility checks for a multi-node P100 facility. | |
| Requirements: | |
| - Multi-tenant safe defaults | |
| - Clear operational runbook steps | |
| - Metrics for success and rollback | |
| priority: P2 | |
| dependencies: [] | |