Results

Submissions: 46 (46% of accepted papers)

Evaluation Results:
Artifacts Available: 45
Artifacts Functional: 43
Results Reproduced: 35

Evaluated Papers:

ASTERINAS: A Linux ABI-Compatible, Rust-Based Framekernel OS with a Small and Sound TCB
Burst Computing: Quick, Sudden, Massively Parallel Processing on Serverless Resources
Chitu: Avoiding Unnecessary Fallback in Byzantine Consensus
CLONE: Customizing LLMs for Efficient Latency-Aware Inference at the Edge
Colocating ML Inference and Training with Fast GPU Memory Handover
CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training
DSA-2LM: A CPU-Free Tiered Memory Architecture with Intel DSA
Fast Distributed Transactions for RDMA-based Disaggregated Memory
FlexPipe: Maximizing Training Efficiency for Transformer-based Models with Variable-Length Inputs
GeneralSparse: Bridging the Gap in SpMM for Pruned Large Language Model Inference on GPUs
GMI-DRL: Empowering Multi-GPU DRL with Adaptive-Grained Parallelism
GPREEMPT: GPU Preemptive Scheduling Made General and Efficient
GREYHOUND: Hunting Fail-Slows in Hybrid-Parallel Training at Scale
HotRAP: Hot Record Retention and Promotion for LSM-trees with Tiered Storage
IRHash: Efficient Multi-Language Compiler Caching by IR-Level Hashing
JENGA: Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity
Katz: Efficient Workflow Serving for Diffusion Models with Many Adapters
LEOCraft: Towards Designing Performant LEO Networks
LITESHIELD: Secure Containers via Lightweight, Composable Userspace μKernel Services
Mitigating Resource Usage Dependency in Sorting-based KV Stores on Hybrid Storage Devices via Operation Decoupling
mTuner: Accelerating Parameter-Efficient Fine-Tuning on Multi-GPU Servers with Elastic Tensor
On-Demand Container Partitioning for Distributed ML
Para-ksm: Parallelized Memory Deduplication with Data Streaming Accelerator
PathWeaver: A High-Throughput Multi-GPU System for Graph-Based Approximate Nearest Neighbor Search
Poby: SmartNIC-accelerated Image Provisioning for Coldstart in Clouds
PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism
QFactory: Accelerating Quantized Large Language Model Serving with Qtile Graphs
Resource Multiplexing in Tuning and Serving Large Language Models
Revealing Floating-Point Accumulation Orders in Software/Hardware Implementations
Rex: Closing the language-verifier gap with safe and usable kernel extensions
SAVE: Software-Implemented Fault Tolerance for Model Inference against GPU Memory Bit Flips
Separate but Together: Integrating Remote Attestation into TLS
ShieldReduce: Fine-Grained Shielded Data Reduction
SpaceExit: Enabling Efficient Adaptive Computing in Space with Early Exits
SwCC: Software-Programmable and Per-Packet Congestion Control in RDMA Engine
The Koala Benchmarks for the Shell: Characterization and Implications
Toppings: CPU-Assisted, Rank-Aware Adapter Serving for LLM Inference
Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient Inference
Turbocharge ANNS on Real Processing-in-Memory by Enabling Fine-Grained Per-PIM-Core Scheduling
Understanding and Detecting Fail-Slow Hardware Failure Bugs in Cloud Systems
Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism
Unveiling Compiler Faults via Attribute-Guided Compilation Space Exploration
Voltrix: Sparse Matrix-Matrix Multiplication on Tensor Cores with Asynchronous and Balanced Kernel Optimization
Weaver: Efficient Multi-LLM Serving with Attention Offloading
XRT: An Accelerator-Aware Runtime for Accelerated Chip Multiprocessors
μEFI: A Microkernel-Style UEFI with Isolation and Transparency