Use OpenVINO GenAI in Chat Scenario
For chat applications, OpenVINO GenAI provides special optimizations to maintain conversation context and improve performance using the KV-cache.
Refer to the How It Works section for more information about the KV-cache.
tip
Use start_chat() and finish_chat() to properly manage the chat session's KV-cache. This improves performance by reusing context between messages.
info
Chat mode is supported for both LLMPipeline and VLMPipeline; a minimal VLMPipeline sketch follows the samples below.
A simple chat example (with grouped beam search decoding):
Python:

```python
import openvino_genai as ov_genai

# model_path is the directory with the OpenVINO model exported via optimum-cli.
pipe = ov_genai.LLMPipeline(model_path, 'CPU')

# Configure grouped beam search decoding for all subsequent generate() calls.
config = {'max_new_tokens': 100, 'num_beam_groups': 3, 'num_beams': 15, 'diversity_penalty': 1.5}
pipe.set_generation_config(config)

pipe.start_chat()
while True:
    try:
        prompt = input('question:\n')
    except EOFError:
        break
    answer = pipe.generate(prompt)
    print('answer:\n')
    print(answer)
    print('\n----------\n')
pipe.finish_chat()
```
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string prompt;
std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
ov::genai::GenerationConfig config;
config.max_new_tokens = 100;
config.num_beam_groups = 3;
config.num_beams = 15;
config.diversity_penalty = 1.0f;
pipe.start_chat();
std::cout << "question:\n";
while (std::getline(std::cin, prompt)) {
std::cout << "answer:\n";
auto answer = pipe.generate(prompt, config);
std::cout << answer << std::endl;
std::cout << "\n----------\n"
"question:\n";
}
pipe.finish_chat();
}
JavaScript:

```js
import { LLMPipeline } from 'openvino-genai-node';
import readline from 'readline';

// model_path is the directory with the OpenVINO model exported via optimum-cli.
const pipe = await LLMPipeline(model_path, 'CPU');

// Grouped beam search decoding parameters passed to each generate() call.
const config = {
    max_new_tokens: 100,
    num_beam_groups: 3,
    num_beams: 15,
    diversity_penalty: 1.5
};

await pipe.startChat();

const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
});

console.log('question:');
rl.on('line', async (prompt) => {
    console.log('answer:');
    const answer = await pipe.generate(prompt, config);
    console.log(answer);
    console.log('\n----------\nquestion:');
});
rl.on('close', async () => {
    await pipe.finishChat();
    process.exit(0);
});
```
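Since chat mode also works with VLMPipeline, below is a minimal Python sketch of a multi-turn conversation about an image. It assumes the VLMPipeline chat flow mirrors the LLMPipeline one above, that generate() accepts a single image as an openvino.Tensor via the image keyword along with generation parameters as keyword arguments, and that model_path and the image file are placeholders; refer to the VLM chat sample for the exact API.

```python
import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image

# Assumption: model_path points to a vision-language model exported for OpenVINO GenAI.
pipe = ov_genai.VLMPipeline(model_path, 'CPU')

# Wrap the image as an ov.Tensor with shape [height, width, channels] (uint8).
image = ov.Tensor(np.array(Image.open('sample.png').convert('RGB')))

pipe.start_chat()
# The image is attached to the first message; follow-up questions reuse the
# conversation context kept in the KV-cache.
print(pipe.generate('What is in this image?', image=image, max_new_tokens=100))
print(pipe.generate('Describe it in one sentence.', max_new_tokens=100))
pipe.finish_chat()
```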
info
For more information, refer to the Python, C++, and JavaScript chat samples.