
LLM Inference Engineer

HPC AI TECHNOLOGY PTE. LTD.

Full Time | D04 Harbourfront, Telok Blangah, Sentosa Island | $6000 - $14000

Posted: February 24, 2026

Job Description

Location:

Singapore

Onsite Interview:

Required (Singapore or Beijing)

Level:

Early Career / High-Potential Engineers

We are building high-performance large model inference systems that push GPUs to their limits.

We are looking for exceptional engineers to design and optimize production-grade LLM inference infrastructure, achieving:

  • Extreme performance

  • Ultra-low latency

  • Maximum GPU utilization

  • Lowest cost per token

This is a core role that directly impacts our company’s technical competitiveness.

What You Will Do

Production LLM Inference Systems

Build and optimize high-performance inference services based on:

  • vLLM

  • TensorRT-LLM

  • SGLang

  • FasterTransformer

  • TGI (Text Generation Inference)

Deploy production-grade inference systems serving real workloads.

Inference Performance Optimization

Optimize:

  • Latency

  • Throughput

  • Cost per token

Using techniques such as:

  • KV cache optimization

  • Continuous batching

  • Paged attention

  • Speculative decoding

  • Prefix caching

  • Quantization (FP8 / INT8 / INT4)
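To make the KV cache item above concrete, here is a minimal sketch of the paged KV cache idea used by systems like vLLM: KV memory is carved into fixed-size blocks, and each sequence keeps a block table mapping token positions to physical blocks. All names and sizes are illustrative, not vLLM's actual API.

```python
# Minimal paged KV cache sketch. A real implementation stores K/V tensors
# per block on the GPU; this only models the block-table bookkeeping.

BLOCK_SIZE = 16  # tokens per physical block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids
        self.lengths: dict[int, int] = {}             # seq_id -> tokens written

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a (block, offset) slot for one new token of seq_id."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:            # current block full, or none yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt a sequence")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // BLOCK_SIZE], length % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because sequences only hold whole blocks they actually use, memory fragmentation stays bounded at under one block per sequence, which is what lets continuous batching pack many requests onto one GPU.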

GPU-Level Optimization

Improve GPU efficiency by optimizing:

  • Memory bandwidth utilization

  • Tensor Core utilization

  • Kernel launch efficiency

Work involving:

  • CUDA

  • Triton kernels

  • FlashAttention

  • Custom CUDA kernels
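The starting point for this kind of GPU work is usually a roofline estimate: compare a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's ridge point to decide whether it is memory- or compute-bound. The sketch below uses illustrative, roughly A100-class peak numbers, not exact vendor figures.

```python
# Back-of-envelope roofline check: is a kernel memory- or compute-bound?
# Hardware peaks below are illustrative (roughly A100-class).

PEAK_FLOPS = 312e12   # FP16 Tensor Core peak, FLOP/s (assumed)
PEAK_BW = 1.555e12    # HBM bandwidth, bytes/s (assumed)

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def is_compute_bound(flops: float, bytes_moved: float) -> bool:
    # The ridge point is where compute time equals memory time:
    # about 200 FLOP/byte on these assumed peaks.
    ridge = PEAK_FLOPS / PEAK_BW
    return arithmetic_intensity(flops, bytes_moved) > ridge

# Single-token decode GEMV: M=1, K=N=4096, FP16 weights.
# FLOPs = 2*M*N*K; dominant traffic is the weight matrix (N*K*2 bytes).
gemv_flops = 2 * 1 * 4096 * 4096
gemv_bytes = 4096 * 4096 * 2
```

An intensity of 1 FLOP/byte against a ~200 FLOP/byte ridge is why decode is bandwidth-bound, and why batching, quantization, and fused kernels pay off so heavily in inference.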

Distributed Inference

Design and implement:

  • Tensor parallelism

  • Pipeline parallelism

  • Expert parallelism (MoE)

  • Multi-node inference

Using:

  • NCCL

  • CUDA

  • RDMA
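The core math of tensor parallelism can be shown without a GPU: shard a weight matrix's columns across ranks, let each rank compute its output slice, then reassemble with an all-gather. The pure-Python sketch below stands in for what NCCL does across devices; function names are illustrative.

```python
# Column-parallel tensor parallelism, simulated in-process. Each "rank"
# owns a contiguous slice of W's columns; concatenating the per-rank
# outputs plays the role of an NCCL all-gather.

def matmul(x, w):
    """x, w as nested lists of rows -> x @ w."""
    cols = len(w[0])
    return [[sum(xi[k] * w[k][j] for k in range(len(w))) for j in range(cols)]
            for xi in x]

def shard_columns(w, num_ranks):
    """Split w's columns into num_ranks contiguous shards."""
    per = len(w[0]) // num_ranks
    return [[row[r * per:(r + 1) * per] for row in w] for r in range(num_ranks)]

def column_parallel_matmul(x, w, num_ranks):
    shards = shard_columns(w, num_ranks)
    partials = [matmul(x, shard) for shard in shards]   # one per "rank"
    # All-gather: concatenate each rank's output columns back together.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]
```

Row-parallel layers work dually: each rank computes a partial sum over its input slice and an all-reduce adds them, which is why Megatron-style transformer blocks need only two collectives per layer.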

Large-Scale Inference Platform

Build large-scale inference platforms including:

  • Inference scheduler

  • Load balancer

  • Multi-tenant inference system

Supporting:

  • Thousands of GPUs

  • Billions of tokens per day
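A minimal version of the scheduler/load-balancer piece is least-loaded routing: send each request to the replica with the fewest queued tokens, a rough proxy for time-to-first-token. Production schedulers also weigh prefix-cache hits, tenant quotas, and replica health; everything here is an illustrative sketch.

```python
# Least-loaded request router over inference replicas, using a min-heap
# with lazy deletion to keep routing O(log n) per request.

import heapq

class LeastLoadedScheduler:
    def __init__(self, replica_ids):
        self.load = {r: 0 for r in replica_ids}       # replica -> queued tokens
        self.heap = [(0, r) for r in replica_ids]     # (queued_tokens, replica)
        heapq.heapify(self.heap)

    def route(self, request_tokens: int) -> str:
        # Lazy deletion: discard heap entries that no longer match self.load.
        while True:
            tokens, replica = heapq.heappop(self.heap)
            if tokens == self.load[replica]:
                break
        self.load[replica] += request_tokens
        heapq.heappush(self.heap, (self.load[replica], replica))
        return replica

    def complete(self, replica: str, request_tokens: int) -> None:
        """Credit back a finished request's tokens."""
        self.load[replica] -= request_tokens
        heapq.heappush(self.heap, (self.load[replica], replica))
```

The stale-entry trick avoids rebuilding the heap on every completion, which matters once the fleet is thousands of replicas rather than two.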

Cost Optimization

Reduce cost per token through:

  • Advanced batching strategies

  • GPU memory optimization

  • Cluster scheduling
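The cost-per-token objective reduces to simple arithmetic: amortize a GPU's hourly price over the tokens it generates. The numbers in the sketch below are illustrative assumptions, not vendor quotes.

```python
# Back-of-envelope cost model: USD per 1M generated tokens for one GPU.
# gpu_hourly_usd and tokens_per_second are whatever your fleet measures.

def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Hourly GPU cost amortized over effective token throughput."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000
```

The model makes the levers explicit: batching that doubles sustained throughput halves cost per token, and idle capacity (utilization below 1.0) inflates it proportionally.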

Technical Requirements

Strong experience or project exposure in several of the following areas:

GPU & Low-Level Optimization

  • CUDA / CUDA Kernel development

  • GPU performance tuning

  • Kernel / Operator optimization

  • Triton / TVM

  • TensorRT acceleration

Large Model & Inference

  • Megatron-LM

  • DeepSpeed

  • Colossal-AI

  • vLLM / SGLang

  • Large model inference optimization

  • Quantization / KV cache optimization (plus)

Distributed & Systems

  • Distributed systems

  • PyTorch Distributed

  • NCCL

  • HPC (High Performance Computing)

  • AI Infrastructure / ML Infra

  • Multi-GPU / multi-node training systems

Preferred Background

  • Bachelor’s degree from a strong university (CS/EE/AI); Master’s degree preferred

  • Strong foundation in:

    • Computer Systems

    • Operating Systems

    • Parallel Computing

    • Distributed Systems

    • Linear Algebra & ML fundamentals

  • Competitive programming / ACM / research experience is a plus

  • Publications or open-source contributions are a plus

How to Apply

Please click the "Apply Now" button below to submit your application on the employer's website.

