marcopunio.ai

Projects

Deep-dive work focused on GPU-accelerated inference, distributed serving architectures, and performance-critical AI systems.

ML Systems & GPU Inference

Hands-on system design and performance analysis of modern LLM inference pipelines, with a focus on latency, throughput, memory behavior, and scaling limits.

TensorRT-LLM Explainer Notebook

A deep dive into why LLM decode becomes memory-bound and how TensorRT-LLM achieves 2–4× acceleration through kernel fusion, paged KV cache, Tensor Core kernels, and FP8 execution.
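As a rough illustration of the memory-bound claim (a back-of-the-envelope sketch, not code from the notebook), the arithmetic intensity of a single linear layer can be compared against a GPU's ridge point, which is peak FLOP/s divided by memory bandwidth. The function names, the 4096-wide layer, and the hardware figures below are all assumptions chosen as round numbers; they do not describe any specific GPU or model.

```python
def gemm_arithmetic_intensity(batch_tokens: int, d_in: int, d_out: int,
                              bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one linear layer (hypothetical helper).

    Assumes weights and activations are each read or written once from
    device memory, with 2-byte (FP16/BF16) elements by default.
    """
    flops = 2 * batch_tokens * d_in * d_out                 # multiply + add
    weight_bytes = d_in * d_out * bytes_per_elem            # weights read once
    act_bytes = batch_tokens * (d_in + d_out) * bytes_per_elem  # inputs + outputs
    return flops / (weight_bytes + act_bytes)


def is_memory_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> bool:
    """A kernel is memory-bound when its intensity is below the ridge point."""
    return intensity < peak_flops / mem_bandwidth


# Illustrative round numbers only (assumptions, not a real GPU spec).
PEAK_FLOPS = 1.0e15      # 1 PFLOP/s of dense low-precision compute
MEM_BANDWIDTH = 3.0e12   # 3 TB/s of device memory bandwidth

decode = gemm_arithmetic_intensity(batch_tokens=1, d_in=4096, d_out=4096)
prefill = gemm_arithmetic_intensity(batch_tokens=2048, d_in=4096, d_out=4096)

print(f"decode  intensity: {decode:.2f} FLOP/byte, "
      f"memory-bound: {is_memory_bound(decode, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"prefill intensity: {prefill:.2f} FLOP/byte, "
      f"memory-bound: {is_memory_bound(prefill, PEAK_FLOPS, MEM_BANDWIDTH)}")
```

With one token per step, each weight byte supports roughly one FLOP, far below the assumed ridge point of about 333 FLOP/byte. Processing many tokens at once reuses the same weights and pushes the layer toward the compute-bound side.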

Distributed Inference with NVIDIA Triton

A Triton-inspired system design and simulation exploring how inference platforms scale horizontally under high concurrency while maintaining strict p95 / p99 latency SLAs.
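For readers unfamiliar with tail-latency targets, here is a minimal sketch of how a p95 / p99 check might be computed from recorded request latencies. It uses the nearest-rank percentile definition; the function names, sample data, and thresholds are hypothetical and are not taken from the project or from Triton itself.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_sla(samples: Sequence[float], p95_ms: float, p99_ms: float) -> bool:
    """True when both tail percentiles are within their (hypothetical) budgets."""
    return percentile(samples, 95) <= p95_ms and percentile(samples, 99) <= p99_ms


# Synthetic latencies of 1..100 ms, purely for illustration.
latencies_ms = list(range(1, 101))

print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
print("within 95/99 ms budget:", meets_sla(latencies_ms, p95_ms=95, p99_ms=99))
```

Tail percentiles are used instead of the mean because a small fraction of slow requests can violate an SLA even when average latency looks healthy.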

Open-Source & Production Contributions

Production deployments, reference architectures, and open-source work supporting large-scale GenAI and cloud workloads.

Meta Llama on AWS

Production-grade deployment and inference patterns for Meta Llama models, covering GPU-backed serving architectures, performance tradeoffs, and real-world GenAI workloads.

Amazon Bedrock Industry Use Cases

Pre-built examples of generative AI agents using Amazon Bedrock across multiple industries, including healthcare, manufacturing, and travel. (AWS sample repository)

© 2026 marcopunio.ai

Built with Next.js & Tailwind CSS.