Projects
Deep-dive work focused on GPU-accelerated inference, distributed serving architectures, and performance-critical AI systems.
ML Systems & GPU Inference
Hands-on system design and performance analysis of modern LLM inference pipelines — focusing on latency, throughput, memory behavior, and scaling limits.
TensorRT-LLM Explainer Notebook
A deep dive into why LLM decode becomes memory-bound and how TensorRT-LLM achieves 2–4× acceleration through kernel fusion, paged KV cache, Tensor Core kernels, and FP8 execution.
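To make the memory-bound claim concrete, here is a minimal roofline-style sketch in Python. The peak-FLOPs and bandwidth figures are illustrative assumptions, not measurements from the notebook: the point is only that at batch size 1 each weight byte supports a couple of FLOPs, far below a modern GPU's compute-to-bandwidth ratio.

```python
# Illustrative roofline check for LLM decode; hardware numbers are assumptions.

def arithmetic_intensity(batch: int, bytes_per_weight: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for one decode step.

    Each weight element contributes ~2 FLOPs (multiply + add) per sequence in
    the batch and is read once per step, so intensity ~= 2 * batch / bytes.
    """
    return 2.0 * batch / bytes_per_weight

# Hypothetical accelerator balance point: peak FLOPs divided by HBM bandwidth.
PEAK_TFLOPS = 990        # assumed FP16 Tensor Core peak, order of magnitude
BANDWIDTH_TBPS = 3.35    # assumed HBM bandwidth
RIDGE = PEAK_TFLOPS / BANDWIDTH_TBPS   # ~295 FLOPs per byte

for b in (1, 8, 64, 512):
    ai = arithmetic_intensity(b)
    bound = "memory-bound" if ai < RIDGE else "compute-bound"
    print(f"batch={b:<4} intensity={ai:6.1f} FLOPs/B  ridge={RIDGE:.0f}  -> {bound}")
```

Batching, kernel fusion, paged KV cache, and FP8 all attack the same gap from different sides: more useful FLOPs per byte moved, or fewer bytes moved in the first place.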
Distributed Inference with NVIDIA Triton
A Triton-inspired system design and simulation exploring how inference platforms scale horizontally under high concurrency while maintaining strict p95 / p99 latency SLAs.
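For flavor, a toy least-loaded-routing model (not the actual Triton scheduler; arrival and service rates are made up) is enough to see how tail latency falls as replicas are added:

```python
# Toy horizontal-scaling simulation: Poisson arrivals routed to the
# least-loaded of N replicas, reading off p95/p99 latency per replica count.
# All rates are illustrative assumptions.
import random
import statistics

def simulate(replicas: int, arrival_rate: float, service_rate: float,
             n_requests: int = 50_000, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    free_at = [0.0] * replicas                # time each replica becomes idle
    now, latencies = 0.0, []
    for _ in range(n_requests):
        now += rng.expovariate(arrival_rate)  # next arrival time
        worker = min(range(replicas), key=lambda i: free_at[i])
        start = max(now, free_at[worker])     # queue behind in-flight work
        finish = start + rng.expovariate(service_rate)
        free_at[worker] = finish
        latencies.append(finish - now)
    q = statistics.quantiles(latencies, n=100)
    return q[94], q[98]                        # p95, p99

for n in (6, 8, 12, 16):
    p95, p99 = simulate(replicas=n, arrival_rate=100.0, service_rate=20.0)
    print(f"{n:>2} replicas: p95={p95 * 1e3:6.1f} ms  p99={p99 * 1e3:6.1f} ms")
```

Even this crude model shows the core tension: near full utilization the p99 blows up, and each added replica buys tail headroom at the cost of idle capacity.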
Open-Source & Production Contributions
Production deployments, reference architectures, and open-source work supporting large-scale GenAI and cloud workloads.
Meta Llama on AWS
Production-grade deployment and inference patterns for Meta Llama models, covering GPU-backed serving architectures, performance tradeoffs, and real-world GenAI workloads.
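A representative serving call, reduced to a sketch. The endpoint name and request schema are assumptions based on common TGI/LMI-hosted deployments, not taken from the repository:

```python
# Hedged sketch of invoking a GPU-backed Llama endpoint on SageMaker.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def generate(prompt: str, max_new_tokens: int = 256):
    response = runtime.invoke_endpoint(
        EndpointName="llama-3-8b-instruct",   # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({
            "inputs": prompt,                  # common TGI/LMI payload shape
            "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.2},
        }),
    )
    # Response shape depends on the serving container.
    return json.loads(response["Body"].read())

print(generate("Summarize the tradeoffs of tensor parallelism in two sentences."))
```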
Amazon Bedrock Industry Use Cases
Pre-built examples of generative AI agents using Amazon Bedrock across multiple industries — healthcare, manufacturing, travel, and more. (AWS sample repo)
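The samples build their agents on top of Bedrock model calls; a minimal sketch of that building block using the Converse API, with a placeholder model ID, region, and prompt:

```python
# Hedged sketch of a single foundation-model call through Amazon Bedrock.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # placeholder model ID
    messages=[{
        "role": "user",
        "content": [{"text": "Draft a triage summary for a patient intake note."}],
    }],
    inferenceConfig={"maxTokens": 300, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
```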