
LLM Inference Engineer

HPC AI TECHNOLOGY PTE. LTD.

Full Time | D04 Harbourfront, Telok Blangah, Sentosa Island | $6000 - $14000

Posted: February 24, 2026

Job Description

Location:

Singapore

Onsite Interview:

Required (Singapore or Beijing)

Level:

Early Career / High-Potential Engineers

We are building high-performance large model inference systems that push GPUs to their limits.

We are looking for exceptional engineers to design and optimize production-grade LLM inference infrastructure, achieving:

  • Extreme performance

  • Ultra-low latency

  • Maximum GPU utilization

  • Lowest cost per token

This is a core role that directly impacts our company’s technical competitiveness.

What You Will Do

Production LLM Inference Systems

Build and optimize high-performance inference services based on:

  • vLLM

  • TensorRT-LLM

  • SGLang

  • FasterTransformer

  • TGI (Text Generation Inference)

Deploy production-grade inference systems serving real workloads.

Inference Performance Optimization

Optimize:

  • Latency

  • Throughput

  • Cost per token

Using techniques such as:

  • KV cache optimization

  • Continuous batching

  • Paged attention

  • Speculative decoding

  • Prefix caching

  • Quantization (FP8 / INT8 / INT4)
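To make the KV cache item above concrete, here is a minimal sketch of the paged KV cache idea used by systems like vLLM: KV memory is carved into fixed-size blocks, and each sequence keeps a block table mapping token positions to physical blocks. All names and sizes are illustrative, not vLLM's actual API.

```python
# Minimal paged KV cache sketch. A real implementation stores K/V tensors
# per block on the GPU; this only models the block-table bookkeeping.

BLOCK_SIZE = 16  # tokens per physical block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids
        self.lengths: dict[int, int] = {}             # seq_id -> tokens written

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a (block, offset) slot for one new token of seq_id."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:            # current block full, or none yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt a sequence")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // BLOCK_SIZE], length % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because sequences only hold whole blocks they actually use, memory fragmentation stays bounded at under one block per sequence, which is what lets continuous batching pack many requests onto one GPU.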

GPU-Level Optimization

Improve GPU efficiency by optimizing:

  • Memory bandwidth utilization

  • Tensor Core utilization

  • Kernel launch efficiency

Work involving:

  • CUDA

  • Triton kernels

  • FlashAttention

  • Custom CUDA kernels
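The starting point for this kind of GPU work is usually a roofline estimate: compare a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's ridge point to decide whether it is memory- or compute-bound. The sketch below uses illustrative, roughly A100-class peak numbers, not exact vendor figures.

```python
# Back-of-envelope roofline check: is a kernel memory- or compute-bound?
# Hardware peaks below are illustrative (roughly A100-class).

PEAK_FLOPS = 312e12   # FP16 Tensor Core peak, FLOP/s (assumed)
PEAK_BW = 1.555e12    # HBM bandwidth, bytes/s (assumed)

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def is_compute_bound(flops: float, bytes_moved: float) -> bool:
    # The ridge point is where compute time equals memory time:
    # about 200 FLOP/byte on these assumed peaks.
    ridge = PEAK_FLOPS / PEAK_BW
    return arithmetic_intensity(flops, bytes_moved) > ridge

# Single-token decode GEMV: M=1, K=N=4096, FP16 weights.
# FLOPs = 2*M*N*K; dominant traffic is the weight matrix (N*K*2 bytes).
gemv_flops = 2 * 1 * 4096 * 4096
gemv_bytes = 4096 * 4096 * 2
```

An intensity of 1 FLOP/byte against a ~200 FLOP/byte ridge is why decode is bandwidth-bound, and why batching, quantization, and fused kernels pay off so heavily in inference.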

Distributed Inference

Design and implement:

  • Tensor parallelism

  • Pipeline parallelism

  • Expert parallelism (MoE)

  • Multi-node inference

Using:

  • NCCL

  • CUDA

  • RDMA
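The core math of tensor parallelism can be shown without a GPU: shard a weight matrix's columns across ranks, let each rank compute its output slice, then reassemble with an all-gather. The pure-Python sketch below stands in for what NCCL does across devices; function names are illustrative.

```python
# Column-parallel tensor parallelism, simulated in-process. Each "rank"
# owns a contiguous slice of W's columns; concatenating the per-rank
# outputs plays the role of an NCCL all-gather.

def matmul(x, w):
    """x, w as nested lists of rows -> x @ w."""
    cols = len(w[0])
    return [[sum(xi[k] * w[k][j] for k in range(len(w))) for j in range(cols)]
            for xi in x]

def shard_columns(w, num_ranks):
    """Split w's columns into num_ranks contiguous shards."""
    per = len(w[0]) // num_ranks
    return [[row[r * per:(r + 1) * per] for row in w] for r in range(num_ranks)]

def column_parallel_matmul(x, w, num_ranks):
    shards = shard_columns(w, num_ranks)
    partials = [matmul(x, shard) for shard in shards]   # one per "rank"
    # All-gather: concatenate each rank's output columns back together.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]
```

Row-parallel layers work dually: each rank computes a partial sum over its input slice and an all-reduce adds them, which is why Megatron-style transformer blocks need only two collectives per layer.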

Large-Scale Inference Platform

Build large-scale inference platforms including:

  • Inference scheduler

  • Load balancer

  • Multi-tenant inference system

Supporting:

  • Thousands of GPUs

  • Billions of tokens per day
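A minimal version of the scheduler/load-balancer piece is least-loaded routing: send each request to the replica with the fewest queued tokens, a rough proxy for time-to-first-token. Production schedulers also weigh prefix-cache hits, tenant quotas, and replica health; everything here is an illustrative sketch.

```python
# Least-loaded request router over inference replicas, using a min-heap
# with lazy deletion to keep routing O(log n) per request.

import heapq

class LeastLoadedScheduler:
    def __init__(self, replica_ids):
        self.load = {r: 0 for r in replica_ids}       # replica -> queued tokens
        self.heap = [(0, r) for r in replica_ids]     # (queued_tokens, replica)
        heapq.heapify(self.heap)

    def route(self, request_tokens: int) -> str:
        # Lazy deletion: discard heap entries that no longer match self.load.
        while True:
            tokens, replica = heapq.heappop(self.heap)
            if tokens == self.load[replica]:
                break
        self.load[replica] += request_tokens
        heapq.heappush(self.heap, (self.load[replica], replica))
        return replica

    def complete(self, replica: str, request_tokens: int) -> None:
        """Credit back a finished request's tokens."""
        self.load[replica] -= request_tokens
        heapq.heappush(self.heap, (self.load[replica], replica))
```

The stale-entry trick avoids rebuilding the heap on every completion, which matters once the fleet is thousands of replicas rather than two.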

Cost Optimization

Reduce cost per token through:

  • Advanced batching strategies

  • GPU memory optimization

  • Cluster scheduling
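The cost-per-token objective reduces to simple arithmetic: amortize a GPU's hourly price over the tokens it generates. The numbers in the sketch below are illustrative assumptions, not vendor quotes.

```python
# Back-of-envelope cost model: USD per 1M generated tokens for one GPU.
# gpu_hourly_usd and tokens_per_second are whatever your fleet measures.

def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Hourly GPU cost amortized over effective token throughput."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000
```

The model makes the levers explicit: batching that doubles sustained throughput halves cost per token, and idle capacity (utilization below 1.0) inflates it proportionally.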

Technical Requirements

Strong experience or project exposure in several of the following areas:

GPU & Low-Level Optimization

  • CUDA / CUDA Kernel development

  • GPU performance tuning

  • Kernel / Operator optimization

  • Triton / TVM

  • TensorRT acceleration

Large Model & Inference

  • Megatron-LM

  • DeepSpeed

  • Colossal-AI

  • vLLM / SGLang

  • Large model inference optimization

  • Quantization / KV cache optimization (plus)

Distributed & Systems

  • Distributed systems

  • PyTorch Distributed

  • NCCL

  • HPC (High Performance Computing)

  • AI Infrastructure / ML Infra

  • Multi-GPU / multi-node training systems

Preferred Background

  • Bachelor’s degree from a strong university (CS/EE/AI); Master’s degree preferred

  • Strong foundation in:

    • Computer Systems

    • Operating Systems

    • Parallel Computing

    • Distributed Systems

    • Linear Algebra & ML fundamentals

  • Competitive programming / ACM / research experience is a plus

  • Publications or open-source contributions are a plus

How to Apply

Please click the "Apply Now" button below to submit your application on the employer's website.

