Rsassy provides R bindings to sassy, an approximate string matcher written in Rust. It searches for short patterns in DNA, IUPAC, or ASCII text.

Install

install.packages(
  "Rsassy",
  repos = c("https://sounkou-bioinfo.r-universe.dev", "https://cloud.r-project.org")
)
library(Rsassy)

sassy_search(list("ATCGATCG"), list("GGGGATCGATCGTTTT"), k = 1, alphabet = "dna")
#> <sassy_matches> 3 matches
#> pattern_idx text_idx text_start text_end pattern_start pattern_end cost strand  cigar
#>           0        0          2       10             0           8    1      -   7=1X
#>           0        0          4       12             0           8    0      +     8=
#>           0        0          6       14             0           8    1      - 1=1X6=

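The result is a sassy_matches data frame with one row per match. Coordinates are 0-based and half-open: the second row above, with text_start 4 and text_end 12, covers text positions 4 through 11.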
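
Because R indexes strings from 1 and substr() includes both ends, a 0-based half-open interval needs only its start shifted by one. The following is a minimal base R sketch of that conversion, using the text and the exact-match coordinates from the output above; it does not call Rsassy.

```r
# Text and coordinates copied from the exact match in the example above.
text <- "GGGGATCGATCGTTTT"
text_start <- 4L   # 0-based, inclusive
text_end <- 12L    # 0-based, exclusive

# substr() is 1-based and inclusive at both ends:
# add 1 to the start and use the end unchanged.
matched <- substr(text, text_start + 1L, text_end)
matched
#> [1] "ATCGATCG"

# For half-open intervals the match length is end minus start.
match_len <- text_end - text_start
match_len
#> [1] 8
```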

Reuse a searcher

searcher <- sassy_searcher("dna", rc = TRUE)
sassy_searcher_search(searcher, list("ATCGATCG"), list("GGGGATCGATCGTTTT"), k = 1)
#> <sassy_matches> 3 matches
#> pattern_idx text_idx text_start text_end pattern_start pattern_end cost strand  cigar
#>           0        0          2       10             0           8    1      -   7=1X
#>           0        0          4       12             0           8    0      +     8=
#>           0        0          6       14             0           8    1      - 1=1X6=

Multiple patterns or texts

When patterns and texts are supplied as lists, every pattern is searched against every text. Each list element may be a raw vector or a non-missing character scalar, so callers can mix byte strings, regular strings, and ALTREP-backed raw elements. pattern_idx and text_idx give the 0-based positions of the pattern and the text in their input lists.

sassy_search(
  list("ATG", charToRaw("TTT")),
  list("CCCCATGCCCCTTT"),
  k = 1,
  alphabet = "iupac",
  rc = FALSE,
  strategy = "encoded_patterns"
)
#> <sassy_matches> 2 matches
#> pattern_idx text_idx text_start text_end pattern_start pattern_end cost strand cigar
#>           0        0          4        7             0           3    0      +    3=
#>           1        0         11       14             0           3    0      +    3=
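
Since pattern_idx and text_idx are 0-based, add 1 before using them to index the R input lists. The sketch below illustrates this in base R only. The hits data frame is a hand-written stand-in that copies four columns from the output above (it is not a real sassy_matches object), and as_string() is a hypothetical helper that is not part of Rsassy.

```r
patterns <- list("ATG", charToRaw("TTT"))
texts <- list("CCCCATGCCCCTTT")

# Stand-in for the result printed above: a plain data frame with a subset
# of the columns, typed in by hand for illustration.
hits <- data.frame(
  pattern_idx = c(0L, 1L),
  text_idx    = c(0L, 0L),
  text_start  = c(4L, 11L),
  text_end    = c(7L, 14L)
)

# Hypothetical helper: inputs may be raw vectors or character scalars.
as_string <- function(x) if (is.raw(x)) rawToChar(x) else x

# Shift the 0-based indices by one to look up the original inputs.
hits$pattern <- vapply(patterns[hits$pattern_idx + 1L], as_string, character(1))
hit_texts <- vapply(texts[hits$text_idx + 1L], as_string, character(1))

# Extract the matched text from the 0-based half-open coordinates.
hits$matched <- substr(hit_texts, hits$text_start + 1L, hits$text_end)

hits[, c("pattern_idx", "pattern", "matched")]
```

Assuming a real result exposes these columns the same way, the same lookup should apply to it directly.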

FASTA/FASTQ batches

sassy_fastx_iter() and sassy_fastx_next() parse FASTA/FASTQ files in batches bounded by record count (batch_records). Each batch exposes record IDs (batch$id) as an ALTREP character vector and sequences (batch$seq) as a list of raw ALTREP slices over immutable native batch buffers.

fq <- tempfile(fileext = ".fastq")
writeLines(c("@r1", "ACGT", "+", "!!!!"), fq, useBytes = TRUE)
it <- sassy_fastx_iter(fq, batch_records = 1)
batch <- sassy_fastx_next(it)
sassy_search(list("ACG"), batch$seq, k = 0, alphabet = "dna", rc = FALSE, text_id = batch$id)
#> <sassy_matches> 1 match
#> pattern_idx text_idx text_id text_start text_end pattern_start pattern_end cost strand cigar
#>           0        0      r1          0        3             0           3    0      +    3=

See vignette("fastx-iteration", package = "Rsassy") for performance and validation details.