Regression-based Z-score for trisomy prediction
nipter_regression.RdImplements NIPTeR's stepwise linear regression approach. Builds multiple regression models that predict the focus chromosome's fraction from other chromosomal fractions, then computes Z-scores based on the prediction residuals.
Usage
nipter_regression(
sample,
control_group,
chromo_focus,
n_models = 4L,
n_predictors = 4L,
exclude_chromosomes = c(13L, 18L, 21L),
include_chromosomes = NULL,
train_fraction = 0.6,
overdispersion_rate = 1.15,
force_practical_cv = FALSE,
seed = NULL
)Arguments
- sample
A
NIPTeRSampleobject.- control_group
A
NIPTeRControlGroupobject.- chromo_focus
Integer; the target chromosome (1-22).
- n_models
Number of regression models to build (default 4).
- n_predictors
Maximum number of predictor chromosomes per model (default 4).
- exclude_chromosomes
Integer vector of chromosomes to exclude from the predictor pool (default
c(13, 18, 21)).- include_chromosomes
Integer vector of chromosomes to force-include.
- train_fraction
Fraction of control samples used for training (default 0.6).
- overdispersion_rate
Overdispersion multiplier for theoretical CV (default 1.15).
- force_practical_cv
Logical; always use practical CV regardless of theoretical CV? Default
FALSE.- seed
Random seed for the train/test split. Default
NULL(no seed set).
Value
A list of class "NIPTeRRegression" with elements:
- models
A list of
n_modelssublists, each containing:z_score(numeric),cv(selected CV),cv_type("practical"or"theoretical"),predictors(character vector of predictor chromosomes),shapiro_p_value,control_z_scores(named numeric vector).- focus_chromosome
Character.
- correction_status
Character vector.
- sample_name
Character.
Details
For SeparatedStrands samples the predictor pool is doubled (each
autosomal chromosome contributes both a forward and reverse fraction),
and complementary-strand exclusion within each model prevents selecting
both strands of the same chromosome as predictors in the same model.
Across models, only the exact predictor string is excluded; the
complementary strand remains available.
The algorithm:
Split the control group 60/40 into train and test sets.
Compute chromosomal fractions for all samples.
For each of
n_modelsmodels, greedily selectn_predictorschromosomes by maximising adjusted R-squared (forward stepwise selection). Predictors already used in previous models are excluded.For each model, fit
lm(focus ~ predictors)on the training set, predict on the test set, and compute:Practical CV:
sd(obs/pred) / mean(obs/pred)on the test set.Theoretical CV:
overdispersion_rate / sqrt(total_reads_focus)for the test sample.The larger CV is used (unless
force_practical_cv = TRUE).Sample Z-score:
(sample_ratio - 1) / cv.