Skip to content

Latest commit

 

History

History
128 lines (96 loc) · 6.19 KB

File metadata and controls

128 lines (96 loc) · 6.19 KB

Chat Feature Integration — Final Summary

PR: bernardladenthin/java-llama.cpp#61

Origin

Based on a large patch by @vaiju1981 that proposed OpenAI-compatible chat completions and JSON-in/JSON-out endpoints for the java-llama.cpp project. The patch was reimplemented from scratch against the current codebase (llama.cpp b8611) with significant improvements.

CI Status: All 16/16 jobs green

macOS 14 (Metal), macOS 15 (Metal + no-Metal), Ubuntu, Windows (x86 + x86_64), Android, Linux aarch64, manylinux, CUDA — all passing.


What was implemented (14 commits)

Phase 1-2: Chat Completions (core feature)

Method Description
chatComplete(InferenceParameters) Blocking OpenAI-compatible chat completion with automatic template application
generateChat(InferenceParameters) Streaming chat completion via LlamaIterator
handleChatCompletions(String) Native JSON-in/JSON-out chat endpoint
requestChatCompletion(String) Native streaming chat (returns task ID)

Phase 3: JNI Simplification

Change Description
receiveCompletionJson Returns JSON string instead of constructing LlamaOutput via JNI
handleRerank Returns JSON instead of JNI HashMap/LlamaOutput
Removed 5 JNI refs c_output, cc_output, c_llama_iterator, f_task_id, f_iter_has_next
LlamaOutput.fromJson() JSON parsing moved to Java — simpler, less fragile

Phase 4: Robustness Improvements

Change Description
loadModel Explicit ThrowNew on allocation failure and parse failure
delete Null-pointer guard, proper cleanup
setLogger format_log_as_json helper, always-on trampoline for JSON mode

Phase 5: JSON-in/JSON-out Endpoints

Method Description
handleCompletions(String) Blocking raw completion, JSON-in/JSON-out
handleCompletionsOai(String) OAI-compat /v1/completions format
handleInfill(String) Explicit infill with FIM token validation
handleEmbeddings(String, boolean) JSON embeddings with optional OAI-compat format
handleTokenize(String, boolean, boolean) Tokenize with optional piece information
handleDetokenize(int[]) Detokenize to JSON {"content": "..."}

Phase 6: Server Management

Method Description
getMetrics() Slot info, idle/processing counts, performance metrics
eraseSlot(int) Clear KV cache for a slot
saveSlot(int, String) / restoreSlot(int, String) Persist/restore slot state
configureParallelInference(String) Runtime config for similarity, threads

Bonus: Infrastructure Fixes

Fix Description
Thread join Replaced detached thread with joinable + ready barrier — eliminates flaky SIGABRT
DetachCurrentThread Worker thread detaches from JVM before exit — prevents "Corrupted channel"
jllama_context wrapper Proper ownership of server_context + std::thread + vocab_only flag
chat_template_kwargs Custom Jinja template variables for reasoning models

Comparison: Original patch by @vaiju1981 vs Final implementation

Patch Feature Final Implementation Status
handleChatCompletions handleChatCompletions + requestChatCompletion Improved — both blocking and streaming
handleCompletions handleCompletions Equivalent
handleCompletionsOai handleCompletionsOai Equivalent
handleInfill handleInfill Improved — FIM token validation
handleEmbeddings handleEmbeddings Equivalent
handleRerank handleRerank Improved — proper task cleanup
handleTokenize / handleDetokenize handleTokenize / handleDetokenize Equivalent
getNextStreamResult (polling) receiveCompletionJson (iterator) Improved — Java Iterator pattern
handleSlotAction handleSlotAction + typed Java wrappers ImprovedgetMetrics(), eraseSlot(), etc.
handleKVCacheAction Merged into handleSlotAction Simpler — KV cache is per-slot
configureParallelInference configureParallelInference Equivalent
JNI cleanup (remove refs) Done + jllama_context wrapper Improved — proper memory management
loadModel error handling Done Equivalent
delete cleanup Thread join + ready barrier Much improved — fixes flaky crash
setLogger JSON formatting format_log_as_json + always-on trampoline Equivalent
parse_jstring rewrite Skipped (cosmetic) N/A
chat_template_kwargs Not in patch — added New feature

Features the patch had that are now obsolete

  • All raw JNI object construction (c_output, cc_output, HashMap building) — replaced by JSON returns
  • getNextStreamResult polling pattern — replaced by LlamaIterator reuse
  • Separate handleKVCacheAction — merged into handleSlotAction

Features we added beyond the patch

  • chatComplete() / generateChat() Java convenience API
  • LlamaOutput.fromJson() / getContentFromJson() — JSON parsing in Java
  • jllama_context wrapper with joinable thread — fixes pre-existing flaky SIGABRT
  • chat_template_kwargs support — enables reasoning/thinking models
  • 20+ new tests covering all endpoints and edge cases

Upstream Compatibility (llama.cpp b8611)

Verified against ggml-org/llama.cpp master:

Feature Status
common_chat_templates_inputs — all 15 fields populated Correct
oaicompat_parser_options struct Matches upstream
oaicompat_chat_params_parse — message/tool/reasoning parsing Complete
chat_template_kwargs — custom Jinja variables Supported
Multimodal content (images/audio) Supported via upstream
Tool calling / function calling Supported via upstream
Reasoning format (DeepSeek, o1-style) Supported

The original patch by @vaiju1981 is now fully obsolete. All functionality has been reimplemented with improvements, comprehensive tests, and proper thread safety.