About Me

Hey, my name is Marco Punio and I'm a Cloud AI Architect focused on model optimization, distributed inference, and GPU performance. I work on high-performance LLM serving, fine-tuning pipelines, and CUDA-accelerated workflows across AWS, Azure, and personal GPU projects.

I spend most of my time thinking about throughput, latency, and efficiency, from understanding why inference becomes memory-bound to tuning NCCL behavior and DeepSpeed configurations in real systems.

I also enjoy presenting and teaching. A big part of my work is breaking down complex AI systems in a way that helps teams reason about performance, tradeoffs, and how to build scalable multimodal pipelines end-to-end.