Course Outline

Tencent Hunyuan Production Fundamentals

  • Overview of Tencent Hunyuan model serving scenarios.
  • Production characteristics of large and MoE models.
  • Common latency, throughput, and cost bottlenecks.
  • Defining service-level objectives for inference workloads.

Deployment Architecture and Serving Flow

  • Core components of a production inference stack.
  • Choosing between containerised, on-premise, and cloud deployment models.
  • Model loading, request routing, and GPU allocation basics.
  • Designing for reliability and operational simplicity.

Latency Optimisation in Practice

  • Using optimised inference engines such as TensorRT where applicable.
  • KV-cache concepts and practical cache tuning.
  • Reducing startup, warmup, and response overhead.
  • Measuring time to first token and token generation speed.
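
As an illustration of the last topic, here is a minimal sketch of how time to first token (TTFT) and decode speed might be measured around any streaming response. The function name `measure_stream` and its `clock` parameter are assumptions made for this example, not part of any Tencent Hunyuan SDK; the stream is treated as a plain Python iterable of tokens.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report TTFT and decode throughput.

    `tokens` is any iterable that yields tokens as they arrive
    (hypothetical stand-in for a streaming inference client).
    """
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1

    if count == 0:
        return {"tokens": 0, "ttft_s": None, "decode_tokens_per_s": None}

    # Decode speed excludes the first token, whose latency is
    # dominated by queueing and prompt processing.
    decode_time = last - first
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"tokens": count, "ttft_s": first - start, "decode_tokens_per_s": rate}
```

Passing the client's streaming iterator to `measure_stream` gives one sample per request. In practice these samples would be aggregated into percentiles rather than averages, since tail latency is usually what a service-level objective constrains.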

Throughput, Batching, and GPU Efficiency

  • Continuous batching and request batching strategies.
  • Managing concurrency and queue behaviour.
  • Improving GPU utilisation without compromising user experience.
  • Handling long-context and mixed-workload requests.

Quantisation and Cost Control

  • Why quantisation matters for production serving.
  • Practical trade-offs of FP16, INT8, and other common precision options.
  • Balancing model quality, latency, and infrastructure cost.
  • Building a simple cost optimisation checklist.
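
To make the precision trade-off concrete, the sketch below estimates the memory needed for model weights alone at different precisions. The 7-billion-parameter figure is a hypothetical example, not the size of any specific Hunyuan model, and the helper name `weight_memory_gb` is an assumption for illustration. The estimate ignores the KV cache, activations, and runtime overhead, which all add to the real footprint.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    params = 7e9  # hypothetical model size
    for precision in ("fp16", "int8", "int4"):
        print(f"{precision}: {weight_memory_gb(params, precision):.1f} GB")
```

Halving the bytes per parameter halves the weight footprint, which can allow a smaller or cheaper GPU, but the effect on output quality has to be measured for each model and workload.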

Operations, Monitoring, and Readiness Review

  • Autoscaling triggers for inference services.
  • Monitoring latency, throughput, cache usage, and GPU health.
  • Logging, alerting, and incident response basics.
  • Reviewing a reference deployment and creating an improvement plan.

Requirements

  • Basic understanding of large language model deployment and inference workflows.
  • Experience with containers, cloud or on-premise infrastructure, and API-based services.
  • Working knowledge of Python or systems engineering.

Audience

  • ML engineers deploying LLMs into production.
  • Platform engineers responsible for GPU-based inference services.
  • Solution architects designing scalable AI serving platforms.

Duration: 14 hours

Custom Corporate Training

Training solutions designed exclusively for businesses.

  • Customised Content: We adapt the syllabus and practical exercises to your project's actual goals and needs.
  • Flexible Schedule: Dates and times adapted to your team's agenda.
  • Format: Online (live), In-company (at your offices), or Hybrid.

Investment

Price per private group, online live training, starting from 2600 € + VAT*

Contact us for an exact quote and to hear about our latest promotions.
