R Bindings for BebeLM CPU Inference • Rbebelm

Rbebelm provides experimental R bindings for upstream maximecb/bebelm, a pure-Rust CPU-only implementation of Liquid AI LFM2.5-8B-A1B inference. The R package uses savvy for the R/Rust boundary and a runtime backend layout for portable SIMD dispatch.

The package is designed for interactive LLM use: generation streams tokens to the R console as soon as they are decoded, while the function still returns the final text, token ids, stop reason, and timing statistics.

Installation

Install from R-universe:

install.packages(
  "Rbebelm",
  repos = c("https://sounkou-bioinfo.r-universe.dev", "https://cloud.r-project.org")
)

R-universe can also publish Linux binaries for this universe. To prefer those binaries on Linux, use the universe binary repository pattern:

options(repos = c(
  Rbebelm = sprintf(
    "https://sounkou-bioinfo.r-universe.dev/bin/linux/noble-%s/%s/",
    R.version$arch,
    substr(getRversion(), 1, 3)
  ),
  CRAN = sprintf(
    "https://cran.r-universe.dev/bin/linux/noble-%s/%s/",
    R.version$arch,
    substr(getRversion(), 1, 3)
  )
))
install.packages("Rbebelm")
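The sprintf() calls above assemble the binary repository URL from the distribution codename, the CPU architecture, and the R major.minor version. As a minimal sketch, the same pattern can be wrapped in a small helper. The name universe_bin_repo() is hypothetical and not part of Rbebelm. It splits the version string on dots instead of taking the first three characters, so a two-digit minor version would still be formatted correctly.

```r
# Hypothetical helper (not part of Rbebelm): build an R-universe Linux
# binary repository URL following the pattern shown above.
universe_bin_repo <- function(universe,
                              distro = "noble",
                              arch = R.version$arch,
                              r_version = getRversion()) {
  # Keep only "major.minor", e.g. "4.5.1" -> "4.5".
  parts <- strsplit(as.character(r_version), ".", fixed = TRUE)[[1]]
  r_minor <- paste(parts[1:2], collapse = ".")
  sprintf("https://%s.r-universe.dev/bin/linux/%s-%s/%s/",
          universe, distro, arch, r_minor)
}

universe_bin_repo("sounkou-bioinfo", arch = "x86_64", r_version = "4.5.1")
#> [1] "https://sounkou-bioinfo.r-universe.dev/bin/linux/noble-x86_64/4.5/"
```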

Source installs require Cargo/rustc and GNU make. On Linux, macOS, and Windows, Rbebelm builds separate Rust backend libraries when possible: scalar, AVX2, and AVX-512 on x86_64; scalar and NEON on arm64. The dispatcher selects the best installed backend supported by the current CPU/runtime before loading model code.

The model weights are not bundled with the R package. Download the GGUF weights from the upstream model source documented by bebelm, then pass the local path to bebel_model_load().
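Because the weights are downloaded separately, it can help to check the file before handing it to the loader. The sketch below relies on one fact about the format: GGUF files begin with the four ASCII bytes "GGUF". The helper name is_gguf_file() is hypothetical and not part of Rbebelm. It does not validate the rest of the file.

```r
# Hypothetical helper (not part of Rbebelm): check that a path points to a
# regular file that starts with the 4-byte GGUF magic "GGUF".
is_gguf_file <- function(path) {
  if (!is.character(path) || length(path) != 1L || is.na(path)) {
    return(FALSE)
  }
  if (!file.exists(path) || dir.exists(path)) {
    return(FALSE)
  }
  magic <- readBin(path, what = "raw", n = 4L)
  identical(magic, charToRaw("GGUF"))
}

candidate <- "LFM2.5-8B-A1B-Q4_K_M.gguf"
if (!is_gguf_file(candidate)) {
  message("No GGUF file found at: ", candidate)
}
```

A check like this only catches a missing, truncated, or mistyped download early; a file that passes can still fail to load.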

Quick start

Set BEBELM_WEIGHTS_FILE to the local GGUF path, or replace weights with an explicit file path. The README examples are evaluated when a local model file is available during rendering.

library(Rbebelm)

weights <- Sys.getenv("BEBELM_WEIGHTS_FILE", "LFM2.5-8B-A1B-Q4_K_M.gguf")
model <- bebel_model_load(weights, num_threads = 2)

# Agent-first API: one loaded model can back several conversations.
agent <- bebel_agent(model, greedy = TRUE, max_gen = 48, max_think = 16)

bebel_append_user(agent, "What is the capital of Mali? Answer briefly.")
turn1 <- bebel_assistant_turn(agent, on_event = NULL)

bebel_append_user(agent, "What about Italy?")
turn2 <- bebel_assistant_turn(agent, on_event = NULL)

turn1
#> <BebeLM assistant turn>
#>   stop: eos 
#>   tokens: 27 generated; 19 prompt
#>   prefill: 9.3 tok/s 
#>   decode: 9.01 tok/s 
#>   text:
#> <think>
#> The user asks: "What is the capital of Mali? Answer briefly."</think>
#> The capital of Mali is Bamako.
turn2
#> <BebeLM assistant turn>
#>   stop: eos 
#>   tokens: 26 generated; 14 prompt
#>   prefill: 8.7 tok/s 
#>   decode: 8.62 tok/s 
#>   text:
#> <think>
#> The user asks: "What about Italy? Answer briefly." Likely they</think>
#> The capital of Italy is Rome.
bebel_agent_info(agent)[c("history_tokens", "processed_tokens", "kv_tokens")]
#> $history_tokens
#> [1] 88
#> 
#> $processed_tokens
#> [1] 86
#> 
#> $kv_tokens
#> [1] 86

A BebelAgent owns the token transcript and decode caches while sharing the loaded model weights. Later turns only prefill newly appended tokens. The direct methods agent$history(), agent$transcript(), and agent$clear() expose the same operations as bebel_history(agent), bebel_transcript(agent), and bebel_clear(agent).

length(agent$history())
#> [1] 88
substr(agent$transcript(), 1, 80)
#> [1] "<|startoftext|><|im_start|>user\nWhat is the capital of Mali? Answer briefly.<|im"
identical(agent$history(), bebel_history(agent))
#> [1] TRUE

reset_info <- agent$clear()
reset_info[c("history_tokens", "processed_tokens", "kv_tokens")]
#> $history_tokens
#> [1] 0
#> 
#> $processed_tokens
#> [1] 0
#> 
#> $kv_tokens
#> [1] 0
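The counters returned by bebel_agent_info() make the incremental prefill visible. As a rough sketch, the gap between history_tokens and processed_tokens can be read as the number of transcript tokens that have not yet been run through the model. That interpretation is an assumption drawn from the field names, and pending_prefill_tokens() is a hypothetical helper, not part of Rbebelm.

```r
# Hypothetical helper (not part of Rbebelm). Assumes the info list carries
# the fields history_tokens and processed_tokens shown in the output above.
pending_prefill_tokens <- function(info) {
  max(0L, as.integer(info$history_tokens) - as.integer(info$processed_tokens))
}

# Values copied from the bebel_agent_info() output after the second turn.
info <- list(history_tokens = 88L, processed_tokens = 86L, kv_tokens = 86L)
pending_prefill_tokens(info)
#> [1] 2
```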

For an interactive terminal loop, call bebel_live_console(agent) or bebel_live_console(model) in an R session.

chat <- bebel_agent(model, max_gen = 256, max_think = 64)
bebel_live_console(chat)
#> ╔══════════════════════════════════════════════════════╗
#> ║  Entering BebeLM live console.                     ║
#> ║  Type /quit or /exit to return to R.               ║
#> ╚══════════════════════════════════════════════════════╝
#> >>> What is BebeLM?
#> <think>
#> The user asks for a brief explanation of BebeLM.
#> </think>
#> BebeLM is a small, CPU-focused local language model runtime written in Rust.
#>
#> >>> Why use the R agent API?
#> The R agent keeps the transcript and decode caches alive across turns while
#> sharing the loaded GGUF weights.
#>
#> >>> /quit

You can also create the console directly from a model:

bebel_live_console(model, max_gen = 256, max_think = 64)

Convenience helpers are still available for simple calls. bebel_chat() wraps a single ChatML user/assistant turn:

# on_event defaults to bebel_console_event(): thinking and text print live.
result <- bebel_chat(
  model,
  "In one concise sentence, what does runtime SIMD dispatch do?",
  greedy = TRUE,
  max_gen = 48,
  max_think = 16,
  on_event = bebel_console_event(),
  check_interrupt = TRUE
)
#> <think>
#> The user asks: "In one concise sentence, what does runtime SIMD</think>
#> Runtime SIMD dispatch dynamically selects and executes the most efficient instruction variant for the current hardware at execution time, allowing programs to adapt to varying processor

result
#> <BebeLM chat result>
#>   stop: max_new 
#>   tokens: 48 generated; 22 prompt
#>   prefill: 8.7 tok/s 
#>   decode: 8.92 tok/s 
#>   text:
#> <think>
#> The user asks: "In one concise sentence, what does runtime SIMD</think>
#> Runtime SIMD dispatch dynamically selects and executes the most efficient instruction variant for the current hardware at execution time, allowing programs to adapt to varying processor

For plain text completion, use bebel_generate():

raw_result <- bebel_generate(
  model,
  "Runtime SIMD dispatch is useful because",
  greedy = TRUE,
  max_gen = 24,
  max_think = 16,
  on_event = bebel_console_event(),
  check_interrupt = TRUE
)
#>  it allows the compiler to generate code that is specific to the target processor architecture, which can lead to better performance. However
raw_result
#> <BebeLM generation result>
#>   stop: max_new 
#>   tokens: 24 generated; 8 prompt
#>   prefill: 8.7 tok/s 
#>   decode: 9.20 tok/s 
#>   text:
#>  it allows the compiler to generate code that is specific to the target processor architecture, which can lead to better performance. However
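The prefill and decode rates printed with each result give a rough way to budget max_gen on a CPU. The sketch below assumes the two rates are plain tokens per second for their respective phases and ignores any fixed overhead. estimate_seconds() is a hypothetical helper, not part of Rbebelm.

```r
# Hypothetical helper (not part of Rbebelm): estimate seconds for one call
# from token counts and the tok/s rates printed with a result.
estimate_seconds <- function(prompt_tokens, max_gen, prefill_rate, decode_rate) {
  stopifnot(prefill_rate > 0, decode_rate > 0)
  prompt_tokens / prefill_rate + max_gen / decode_rate
}

# Numbers taken from the bebel_generate() result above:
# 8 prompt tokens at 8.7 tok/s, 24 generated tokens at 9.20 tok/s.
round(estimate_seconds(8, 24, prefill_rate = 8.7, decode_rate = 9.20), 1)
#> [1] 3.5
```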

Use bebel_append_system() for a ChatML system-role instruction. Raw appends add no ChatML framing, so the low-level bebel_append() call below spells out the role markers itself and yields an identical transcript.

system_agent <- bebel_agent(model)
bebel_append_system(system_agent, "You are concise.")
bebel_transcript(system_agent)
#> [1] "<|startoftext|><|im_start|>system\nYou are concise.<|im_end|>\n"

raw_system_agent <- bebel_agent(model)
bebel_append(raw_system_agent, "<|im_start|>system\nYou are concise.<|im_end|>\n")
identical(bebel_transcript(system_agent), bebel_transcript(raw_system_agent))
#> [1] TRUE
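The framing in these transcripts follows a fixed pattern: an <|im_start|> marker, the role, a newline, the content, then <|im_end|> and a newline. A minimal sketch of a helper that builds such a block for a raw append is shown here. chatml_block() is hypothetical and not part of Rbebelm, and the "assistant" role is assumed from general ChatML usage rather than from the output above.

```r
# Hypothetical helper (not part of Rbebelm): wrap content in the ChatML role
# markers seen in the transcripts above.
chatml_block <- function(role, content) {
  role <- match.arg(role, c("system", "user", "assistant"))
  paste0("<|im_start|>", role, "\n", content, "<|im_end|>\n")
}

chatml_block("system", "You are concise.")
#> [1] "<|im_start|>system\nYou are concise.<|im_end|>\n"
```

The returned string could then be passed to bebel_append() in place of a hand-written literal.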

Agents can also be driven at the lower level with raw text or token ids.

raw_agent <- bebel_agent(model, greedy = TRUE, max_gen = 16, max_think = 0)
bebel_append(raw_agent, "The capital of Mali is")
raw_turn <- bebel_agent_generate(raw_agent, on_event = NULL)

ids <- bebel_tokenize(model, " and its airport code is", add_bos = FALSE)
bebel_append_tokens(raw_agent, ids)
bebel_history(raw_agent)[1:8]
#> [1] 124894    597   5205    302  46628    355  50593   6261
bebel_token_ids()[c("TOKEN_THINK", "TOKEN_TOOL_CALL_START", "TOKEN_TOOL_CALL_END")]
#>           TOKEN_THINK TOKEN_TOOL_CALL_START   TOKEN_TOOL_CALL_END 
#>                124901                124905                124906
raw_turn$text
#> [1] " Bamako. city of ... ... ... ... ... ... ... ... ... ... ..."

Tools can be orchestrated with an Agent run loop. The context object is private to R tools and hooks; it is not sent to the model. A tool is dispatched only when the model emits a BebeLM tool-call block, so prompts should describe the available tools and the expected call format. The prompt below asks directly for the tool-call form so the example exercises the dispatch path.

ctx <- new.env(parent = emptyenv())
ctx$thread_id <- "thread-001"
ctx$log <- character()

tools <- list(
  lookup_capital = bebel_tool(
    "lookup_capital",
    function(args, context, call) {
      context$log <- c(context$log, paste("tool", call$name, args$country))
      c(Mali = "Bamako", Italy = "Rome")[[args$country]]
    },
    description = "Return a capital city for a country."
  )
)

hooks <- list(
  tool_request = function(call, context, ...) {
    context$log <- c(context$log, paste("request", call$name))
  },
  tool_result = function(call, result, context, ...) {
    context$log <- c(context$log, paste("result", call$name, result))
  }
)

tool_prompt <- paste(
  "Return exactly this tool call and no other text:",
  "lookup_capital({\"country\":\"Italy\"})"
)

agent <- bebel_agent(model, greedy = TRUE, max_gen = 64, max_think = 0)
bebel_append_user(agent, tool_prompt)
run <- bebel_agent_run(agent, tools = tools, context = ctx, hooks = hooks, max_steps = 2)
run
#> <bebelAgentRun>
#>   turns: 2 
#>   tool calls: 1 
#> <BebeLM assistant turn>
#>   stop: eos 
#>   tokens: 7 generated; 31 prompt
#>   prefill: 8.8 tok/s 
#>   decode: 8.65 tok/s 
#>   text:
#> The capital of Italy is Rome.
ctx$log
#> [1] "request lookup_capital"     "tool lookup_capital Italy" 
#> [3] "result lookup_capital Rome"
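The tool function above indexes a named vector with [[, which raises a subscript error if the model asks for a country that is not in the table. Below is a sketch of a more defensive variant with the same (args, context, call) signature, exercised directly without a model. The name lookup_capital_fn and the choice to return a plain message for unknown input are assumptions for illustration. How Rbebelm reports tool errors back to the model is not covered here.

```r
# Sketch of a tool function with the (args, context, call) signature used
# above. Unknown or malformed input returns a message instead of an error.
capitals <- c(Mali = "Bamako", Italy = "Rome")

lookup_capital_fn <- function(args, context, call) {
  country <- args$country
  # The context is an environment, so this log update persists after return.
  context$log <- c(context$log,
                   paste(c("tool", call$name, country), collapse = " "))
  valid <- is.character(country) && length(country) == 1L &&
    country %in% names(capitals)
  if (!valid) {
    return(paste0("Unknown country: ", paste(country, collapse = ", ")))
  }
  capitals[[country]]
}

# Call the function directly to check both paths.
test_ctx <- new.env(parent = emptyenv())
test_ctx$log <- character()
fake_call <- list(name = "lookup_capital")

lookup_capital_fn(list(country = "Italy"), test_ctx, fake_call)
#> [1] "Rome"
lookup_capital_fn(list(country = "Atlantis"), test_ctx, fake_call)
#> [1] "Unknown country: Atlantis"
length(test_ctx$log)
#> [1] 2
```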

R-native agent layer

bebel_r_agent() adds a small Corteza-inspired harness on top of the core model bindings. One session object owns a BebeLM agent, a tool catalog, and a private context environment, and can be driven either from an R console or from a small JSON-RPC SDK surface.

r_agent <- bebel_r_agent(
  model,
  allow_eval = FALSE,
  greedy = TRUE,
  max_gen = 96,
  max_think = 16
)
bebel_agent_tool_catalog(r_agent$tools)
#>         name                                       description
#> 1  r_objects     List objects in the configured R environment.
#> 2     r_help Read R help for a topic, optionally in a package.
#> 3 list_files                     List files under a directory.
#> 4  read_file                                 Read a text file.
#> 5 grep_files                       Search text files by regex.

Interactive console. The /r command is a direct R escape hatch into the same environment used by the agent’s R tools; for example /r x <- mtcars creates an object that r_objects() can later see. Visible /r output is capped so large objects do not flood the chat prompt; assign objects or use summaries such as /r str(x) for inspection. For plots in an Rscript terminal, use /rplot, e.g. /rplot plot(mpg ~ cyl, mtcars), which saves a PNG under rbebelm-plots/. The r_eval and r_plot tools are only advertised to the model when allow_eval = TRUE.

bebel_r_agent_console(r_agent)

For a one-call launcher from R, use bebel_r_agent_start(). It keeps the loaded BebeLM model object local to the launcher while still sharing .GlobalEnv with /r and the agent’s R tools. The console prints a compact stats line after each user turn.

bebel_r_agent_start(Sys.getenv("BEBELM_WEIGHTS_FILE", "LFM2.5-8B-A1B-Q4_K_M.gguf"))

The package also installs a small script, rbebelm-agent, under bin/ (inst/bin in the source tree):

agent_bin <- system.file("bin/rbebelm-agent", package = "Rbebelm")
system2(agent_bin, "--help")

From a shell, after installation:

"$(Rscript -e 'cat(system.file("bin/rbebelm-agent", package = "Rbebelm"))')" \
  --weights /path/to/LFM2.5-8B-A1B-Q4_K_M.gguf

An optional RPC server is also available; it uses nanonext and jsonlite only when requested:

server <- bebel_r_agent_rpc_server(r_agent, url = "http://127.0.0.1:8080")
server$start()
# ... handle requests ...
server$close()

The RPC endpoint accepts POST /rpc JSON-RPC calls such as tools/list, session/info, session/transcript, session/clear, and turn.
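As a rough illustration of what a client might send, the sketch below builds a request body for a method that takes no parameters. The "jsonrpc": "2.0" envelope and the integer id follow the JSON-RPC 2.0 specification and are assumptions here, not something confirmed by Rbebelm's documentation. jsonrpc_body() is a hypothetical helper, and the parameters expected by methods such as turn are not shown.

```r
# Sketch: build a JSON-RPC request body for a parameterless method.
# The envelope fields are assumed from the JSON-RPC 2.0 specification.
jsonrpc_body <- function(method, id = 1L) {
  stopifnot(is.character(method), length(method) == 1L)
  sprintf('{"jsonrpc":"2.0","id":%d,"method":"%s"}', as.integer(id), method)
}

# Endpoint assembled from the server URL above and the POST /rpc route.
endpoint <- paste0("http://127.0.0.1:8080", "/rpc")
body <- jsonrpc_body("tools/list")
writeLines(body)
#> {"jsonrpc":"2.0","id":1,"method":"tools/list"}
```

Any HTTP client that can POST a JSON body to the endpoint should be able to send it.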

The same event stream can be consumed programmatically. For example, collect only answer-text deltas while suppressing console output:

deltas <- character()
invisible(bebel_generate(
  model,
  "A text delta callback can",
  greedy = TRUE,
  max_gen = 12,
  max_think = 16,
  on_event = bebel_event_handler(
    text_delta = function(event) deltas <<- c(deltas, event$delta)
  )
))
paste0(deltas, collapse = "")
#> [1] " be used to update a text field in a UI component."
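The example above accumulates deltas in a variable in the calling environment via <<-. One alternative, sketched here, keeps that state inside a closure so several collectors can coexist. make_delta_collector() is a hypothetical helper, not part of Rbebelm, and it relies only on the delta field of text-delta events used above. The events below are simulated as plain lists so the sketch runs without a model.

```r
# Hypothetical helper (not part of Rbebelm): collect text deltas in a closure.
make_delta_collector <- function() {
  chunks <- character()
  list(
    on_delta = function(event) {
      # Appends to `chunks` in the enclosing function environment.
      chunks <<- c(chunks, event$delta)
      invisible(NULL)
    },
    text = function() paste0(chunks, collapse = "")
  )
}

collector <- make_delta_collector()
# Simulated events; with a loaded model, collector$on_delta would be passed
# as the text_delta handler instead.
collector$on_delta(list(delta = " be"))
collector$on_delta(list(delta = " used"))
collector$text()
#> [1] " be used"
```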

You can also pass a named list of event-specific handlers directly:

counts <- c(text_delta = 0L, thinking_delta = 0L, done = 0L)
invisible(bebel_generate(
  model,
  "An event handler list can",
  greedy = TRUE,
  max_gen = 4,
  max_think = 16,
  on_event = list(
    text_delta = function(event) counts["text_delta"] <<- counts[["text_delta"]] + 1L,
    thinking_delta = function(event) counts["thinking_delta"] <<- counts[["thinking_delta"]] + 1L,
    done = function(event) counts["done"] <<- counts[["done"]] + 1L
  )
))
counts
#>     text_delta thinking_delta           done 
#>              4              0              1

Interrupts and streaming

Generation calls R_CheckUserInterrupt() during prompt prefill and before every decoded token, wrapped in savvy’s unwind protection so that Ctrl-C does not longjmp through Rust frames. Streaming is event-based and uses a finite event protocol; bebel_event_types() reports the event enum for this build, which currently covers stream lifecycle, thinking blocks, answer text blocks, tool-list blocks, tool-call blocks, and done. Delta events contain delta, id, and index; control start/end events include the delimiter token id and marker; end events contain the accumulated content. Console printing is just the default event handler; use on_event = NULL for silent batch generation.

webR / wasm

Rbebelm builds the real Rust/savvy backend for webR as a static wasm_simd128 backend. The wasm build uses a patched local copy of upstream BebeLM that avoids native-only mmap and Rayon imports on Emscripten: GGUF files are read from the webR filesystem into memory and matmul runs serially. If you mount or download a GGUF into the webR virtual filesystem, bebel_model_load() will attempt to load it. Very large models can still exhaust browser/webR memory.

Runtime backend dispatch

Rbebelm installs one small R shared library plus separate Rust backend libraries. The R shared library owns registration and dispatch; model code lives in the selected backend library. The dispatcher checks CPU/OS support before loading SIMD backends, so a portable binary can avoid executing unsupported instructions.

Backend selection happens once per R process. If you need to benchmark or debug a specific backend, call rbebelm_set_backend() before the first native Rbebelm call in a fresh Rscript process:

rbebelm_set_backend("auto")

Inspect the current CPU/runtime and selected backend:

rbebelm_cpuid_info()
#> <Rbebelm CPU features>
#>   x86_64-v3: yes 
#>   x86_64-v4: no 
#>   NEON: no 
#>   ARM dotprod: no 
#>   wasm simd128: no
rbebelm_backend_features()
#> <Rbebelm backend features>
#>   backend: avx2 
#>   target: x86_64-linux 
#>   Rust crate: rbebelm_backend 0.1.0 
#>   native SIMD feature: yes 
#>   compiled features:
#>     AVX2: yes 
#>     AVX-512F: no 
#>     NEON: no 
#>     ARM dotprod: no 
#>     wasm simd128: no
rbebelm_backend_info()
#> <Rbebelm backend dispatch>
#>   mode: dynamic 
#>   requested: auto 
#>   selected: avx2 
#>   loaded: yes 
#>   installed: scalar,avx2,avx512 
#>   supported: scalar,avx2
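The selection rule described in this section can be pictured as picking the most capable backend that is both installed and supported by the CPU. The toy function below models that rule in plain R. It is not the package's dispatcher, which lives in native code, and the preference order is an assumption.

```r
# Toy model of the selection rule described above. The real dispatcher is
# native code and may use a different preference order.
pick_backend <- function(installed, supported,
                         preference = c("avx512", "avx2", "neon", "scalar")) {
  usable <- intersect(installed, supported)
  ranked <- preference[preference %in% usable]
  if (length(ranked) == 0L) stop("no usable backend installed")
  ranked[[1]]
}

# Values from the rbebelm_backend_info() output above.
pick_backend(
  installed = c("scalar", "avx2", "avx512"),
  supported = c("scalar", "avx2")
)
#> [1] "avx2"
```

With AVX-512 installed but not supported by the CPU, the sketch falls back to avx2, matching the selected backend reported above.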

Development

Run the common development commands below from the repository root. The make vignettes target uses rawvignette; install it with remotes::install_github("matthewkling/rawvignette") when editing vignettes-raw/.

make rd           # regenerate savvy wrappers, dispatch init, NAMESPACE, and man/*.Rd
make rdm          # regenerate README.md from evaluated README.Rmd
make dev-install  # install the package locally from source
make test         # run tinytest tests
make check        # build and run R CMD check --no-manual
make vignettes    # precompile vignettes-raw/ into vignettes/ with rawvignette
make site         # build the pkgdown site
make clean        # remove generated build artifacts