Use OpenVINO GenAI in Chat Scenario
For chat applications, OpenVINO GenAI provides special optimizations to maintain conversation context and improve performance using the KV-cache.
Refer to the How It Works section for more information about the KV-cache.
tip
Use start_chat() and finish_chat() to properly manage the chat session's KV-cache. This improves performance by reusing context between messages.
info
Chat mode is supported for both LLMPipeline and VLMPipeline; a minimal VLMPipeline sketch follows the samples below.
A simple chat example (with grouped beam search decoding):
Python:

```python
import openvino_genai as ov_genai

# model_path is the directory with the OpenVINO model exported via optimum-cli.
pipe = ov_genai.LLMPipeline(model_path, 'CPU')

# Configure grouped beam search decoding for all subsequent generate() calls.
config = {'max_new_tokens': 100, 'num_beam_groups': 3, 'num_beams': 15, 'diversity_penalty': 1.5}
pipe.set_generation_config(config)

pipe.start_chat()
while True:
    try:
        prompt = input('question:\n')
    except EOFError:
        break
    answer = pipe.generate(prompt)
    print('answer:\n')
    print(answer)
    print('\n----------\n')
pipe.finish_chat()
```
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string prompt;
std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
ov::genai::GenerationConfig config;
config.max_new_tokens = 100;
config.num_beam_groups = 3;
config.num_beams = 15;
config.diversity_penalty = 1.0f;
pipe.start_chat();
std::cout << "question:\n";
while (std::getline(std::cin, prompt)) {
std::cout << "answer:\n";
auto answer = pipe.generate(prompt, config);
std::cout << answer << std::endl;
std::cout << "\n----------\n"
"question:\n";
}
pipe.finish_chat();
}
JavaScript:

```js
import { LLMPipeline } from 'openvino-genai-node';
import readline from 'readline';

// model_path is the directory with the OpenVINO model exported via optimum-cli.
const pipe = await LLMPipeline(model_path, 'CPU');

// Grouped beam search decoding parameters passed to each generate() call.
const config = {
    max_new_tokens: 100,
    num_beam_groups: 3,
    num_beams: 15,
    diversity_penalty: 1.5
};

await pipe.startChat();

const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
});

console.log('question:');
rl.on('line', async (prompt) => {
    console.log('answer:');
    const answer = await pipe.generate(prompt, config);
    console.log(answer);
    console.log('\n----------\nquestion:');
});
rl.on('close', async () => {
    await pipe.finishChat();
    process.exit(0);
});
```
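Since chat mode also works with VLMPipeline, below is a minimal Python sketch of a multi-turn conversation about an image. It assumes the VLMPipeline chat flow mirrors the LLMPipeline one above, that generate() accepts a single image as an openvino.Tensor via the image keyword along with generation parameters as keyword arguments, and that model_path and the image file are placeholders; refer to the VLM chat sample for the exact API.

```python
import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image

# Assumption: model_path points to a vision-language model exported for OpenVINO GenAI.
pipe = ov_genai.VLMPipeline(model_path, 'CPU')

# Wrap the image as an ov.Tensor with shape [height, width, channels] (uint8).
image = ov.Tensor(np.array(Image.open('sample.png').convert('RGB')))

pipe.start_chat()
# The image is attached to the first message; follow-up questions reuse the
# conversation context kept in the KV-cache.
print(pipe.generate('What is in this image?', image=image, max_new_tokens=100))
print(pipe.generate('Describe it in one sentence.', max_new_tokens=100))
pipe.finish_chat()
```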
info
For more information, refer to the Python, C++, and JavaScript chat samples.