ClonalSim is an R/Bioconductor package for simulating tumor clonal evolution with realistic sequencing noise. It generates mutational profiles of heterogeneous tumor samples with hierarchical clonal structure, making it ideal for:
ClonalSim implements a two-stage realistic noise model:
The main function is simulateTumor(), which generates a
complete simulation:
# Set seed for reproducibility
set.seed(123)
sim <- simulateTumor(
subclone_freqs = c(0.15, 0.25, 0.30, 0.30),
n_mut_per_clone = c(20, 25, 30, 15),
n_mut_founder = 10
)
# Display summary
sim
#> ClonalSimData object
#> ==========================================
#> Number of mutations: 135
#> Number of clones: 4
#> Clone frequencies: 0.15, 0.25, 0.3, 0.3
#> Tumor purity: 1
#>
#> Mutation types:
#>
#> founder private shared
#> 10 90 35
#>
#> Sequencing depth:
#> Mean: 98.29
#> Range: 50 - 173
#>
#> VAF summary (observed):
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.04545 0.21603 0.33871 0.42137 0.58705 1.00000
#>
#> Metadata:
#> Created: 2026-04-28 17:42:54.928876
#> Package version: 1.0.0# Get mutation data
mutations <- getMutations(sim)
head(mutations)
#> Mutation Chromosome Position Ref Alt True_VAF VAF Depth Alt_reads
#> 1 Founder_1 chr2 3901294 T G 0.99 0.9591837 98 94
#> 2 Founder_2 chr12 156958367 A A 0.99 1.0000000 70 70
#> 3 Founder_3 chr15 113008658 T T 0.99 1.0000000 105 105
#> 4 Founder_4 chr17 55250578 G T 0.99 1.0000000 77 77
#> 5 Founder_5 chr3 191364048 G A 0.99 0.9871795 78 77
#> 6 Founder_6 chr7 110583980 G T 0.99 0.9901961 102 101
#> Clone Type Clone_IDs
#> 1 Founder founder 1,2,3,4
#> 2 Founder founder 1,2,3,4
#> 3 Founder founder 1,2,3,4
#> 4 Founder founder 1,2,3,4
#> 5 Founder founder 1,2,3,4
#> 6 Founder founder 1,2,3,4
# Get simulation parameters
params <- getSimParams(sim)
params$subclone_freqs
#> Clone1 Clone2 Clone3 Clone4
#> 0.15 0.25 0.30 0.30
# Get true vs observed VAF
true_vaf <- getTrueVAF(sim)
obs_vaf <- getObservedVAF(sim)
# Compare
summary(true_vaf)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.08452 0.24178 0.32869 0.42351 0.59163 0.99000
summary(obs_vaf)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.04545 0.21603 0.33871 0.42137 0.58705 1.00000ClonalSim provides built-in plotting functions.
Note: Make sure ggplot2 is loaded:
The density plot shows the distribution of variant allele frequencies as they would appear in real sequencing data. Expected peaks correspond to clone frequencies.
Intra-tumor heterogeneity is modeled using a Beta distribution, which is more appropriate than Gaussian for frequency data (bounded 0-1).
# High heterogeneity (low concentration)
set.seed(123)
sim_high_het <- simulateTumor(
subclone_freqs = c(0.3, 0.4, 0.3),
n_mut_per_clone = c(30, 40, 30),
biological_noise = list(enabled = TRUE, concentration = 20)
)
# Low heterogeneity (high concentration)
set.seed(123)
sim_low_het <- simulateTumor(
subclone_freqs = c(0.3, 0.4, 0.3),
n_mut_per_clone = c(30, 40, 30),
biological_noise = list(enabled = TRUE, concentration = 100)
)
# Compare VAF spread
par(mfrow = c(1, 2))
hist(getTrueVAF(sim_high_het), main = "High Heterogeneity\n(concentration = 20)",
xlab = "True VAF", breaks = 30, col = "skyblue")
hist(getTrueVAF(sim_low_het), main = "Low Heterogeneity\n(concentration = 100)",
xlab = "True VAF", breaks = 30, col = "lightcoral")Real NGS data shows overdispersion (variance > mean) due to GC bias, mappability, and PCR artifacts. ClonalSim uses negative binomial distribution:
# Realistic (negative binomial)
set.seed(123)
sim_nb <- simulateTumor(
sequencing_noise = list(
mean_depth = 100,
depth_variation = "negative_binomial",
depth_dispersion = 20
)
)
# Unrealistic (Poisson)
set.seed(124)
sim_poisson <- simulateTumor(
sequencing_noise = list(
mean_depth = 100,
depth_variation = "poisson"
)
)
par(mfrow = c(1, 2))
hist(getMutations(sim_nb)$Depth, main = "Negative Binomial\n(realistic)",
xlab = "Depth", breaks = 30, col = "skyblue")
hist(getMutations(sim_poisson)$Depth, main = "Poisson\n(unrealistic)",
xlab = "Depth", breaks = 30, col = "lightcoral")Each sequencing read is independently sampled, introducing stochastic variation. This is especially important at low depth or low VAF:
# With binomial sampling (realistic)
set.seed(123)
sim_binom <- simulateTumor(
sequencing_noise = list(binomial_sampling = TRUE)
)
# Without binomial sampling (deterministic)
set.seed(123)
sim_det <- simulateTumor(
sequencing_noise = list(binomial_sampling = FALSE)
)
# Compare True_VAF vs observed VAF
par(mfrow = c(1, 2))
with(getMutations(sim_binom), {
plot(True_VAF, VAF, main = "With Binomial Sampling",
xlab = "True VAF", ylab = "Observed VAF", pch = 16, col = rgb(0,0,1,0.3))
abline(0, 1, col = "red", lty = 2)
})
with(getMutations(sim_det), {
plot(True_VAF, VAF, main = "Without Binomial Sampling",
xlab = "True VAF", ylab = "Observed VAF", pch = 16, col = rgb(0,0,1,0.3))
abline(0, 1, col = "red", lty = 2)
})Many clinical samples have significant normal cell contamination:
set.seed(123)
sim_low_purity <- simulateTumor(
subclone_freqs = c(0.05, 0.10, 0.15), # Sum = 0.30 (30% tumor)
n_mut_per_clone = c(20, 25, 30)
)
cat("Tumor purity:", sum(getSimParams(sim_low_purity)$subclone_freqs), "\n")
#> Tumor purity: 0.3
plot(sim_low_purity, type = "vaf_density")Deep targeted sequencing with uniform coverage:
set.seed(123)
sim_deep <- simulateTumor(
sequencing_noise = list(
enabled = TRUE,
mean_depth = 500,
depth_dispersion = 100, # More uniform
error_rate = 0.0005
)
)
summary(getMutations(sim_deep)$Depth)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 393.0 469.5 502.0 501.2 532.0 656.0
plot(sim_deep, type = "depth_histogram")Exome sequencing with variable coverage:
For testing algorithms without noise:
set.seed(123)
sim_ideal <- simulateTumor(
biological_noise = list(enabled = FALSE),
sequencing_noise = list(enabled = FALSE)
)
# VAF exactly matches expected frequencies
head(getMutations(sim_ideal)[, c("Clone", "True_VAF", "VAF")])
#> Clone True_VAF VAF
#> 1 Founder 1 1
#> 2 Founder 1 1
#> 3 Founder 1 1
#> 4 Founder 1 1
#> 5 Founder 1 1
#> 6 Founder 1 1Real tumor sequencing data contains both somatic and germline variants. Germline variants (heterozygous diploid) appear at VAF ~0.5 regardless of tumor purity because they are heterozygous in both tumor and normal cells:
# 70% purity tumor with germline variants
set.seed(789)
sim_germline <- simulateTumor(
subclone_freqs = c(0.3, 0.4), # 70% tumor purity
n_mut_per_clone = c(40, 50),
germline_variants = list(
enabled = TRUE,
n_variants = 150, # Number of germline SNPs
vaf_expected = 0.5 # Heterozygous diploid
)
)
# Check mutation types
table(getMutations(sim_germline)$Type)
#>
#> founder germline private shared
#> 10 150 90 8
# Compare VAFs
germline_vafs <- getMutations(sim_germline)[getMutations(sim_germline)$Type == "germline", "VAF"]
somatic_vafs <- getMutations(sim_germline)[getMutations(sim_germline)$Type != "germline", "VAF"]
cat("Germline VAF (expected ~0.5):", round(mean(germline_vafs), 3), "\n")
#> Germline VAF (expected ~0.5): 0.51
cat("Somatic VAF mean:", round(mean(somatic_vafs), 3), "\n")
#> Somatic VAF mean: 0.419
# Plot showing germline peak
plot(sim_germline, type = "vaf_density")Key biological insight: Unlike somatic mutations (whose VAF depends on tumor purity), germline variants maintain VAF ~0.5 because:
This creates a characteristic peak at VAF = 0.5 in real sequencing data that is independent of tumor cellularity.
library(GenomicRanges)
gr <- toGRanges(sim)
gr
#> GRanges object with 135 ranges and 10 metadata columns:
#> seqnames ranges strand | Mutation Ref Alt
#> <Rle> <IRanges> <Rle> | <character> <character> <character>
#> [1] chr2 3901294 * | Founder_1 T G
#> [2] chr12 156958367 * | Founder_2 A A
#> [3] chr15 113008658 * | Founder_3 T T
#> [4] chr17 55250578 * | Founder_4 G T
#> [5] chr3 191364048 * | Founder_5 G A
#> ... ... ... ... . ... ... ...
#> [131] chr14 151818959 * | Clone4_mut11 A A
#> [132] chr21 30935466 * | Clone4_mut12 T T
#> [133] chr17 116637069 * | Clone4_mut13 C C
#> [134] chr7 177080573 * | Clone4_mut14 G C
#> [135] chr7 75923037 * | Clone4_mut15 A C
#> True_VAF VAF Depth Alt_reads Clone Type
#> <numeric> <numeric> <integer> <integer> <character> <character>
#> [1] 0.99 0.959184 98 94 Founder founder
#> [2] 0.99 1.000000 70 70 Founder founder
#> [3] 0.99 1.000000 105 105 Founder founder
#> [4] 0.99 1.000000 77 77 Founder founder
#> [5] 0.99 0.987179 78 77 Founder founder
#> ... ... ... ... ... ... ...
#> [131] 0.278236 0.295082 122 36 Clone4 private
#> [132] 0.305871 0.302326 86 26 Clone4 private
#> [133] 0.378219 0.451613 93 42 Clone4 private
#> [134] 0.244493 0.236220 127 30 Clone4 private
#> [135] 0.324152 0.308824 136 42 Clone4 private
#> Clone_IDs
#> <character>
#> [1] 1,2,3,4
#> [2] 1,2,3,4
#> [3] 1,2,3,4
#> [4] 1,2,3,4
#> [5] 1,2,3,4
#> ... ...
#> [131] 4
#> [132] 4
#> [133] 4
#> [134] 4
#> [135] 4
#> -------
#> seqinfo: 22 sequences from an unspecified genome; no seqlengths
# Access metadata
mcols(gr)[1:5, ]
#> DataFrame with 5 rows and 10 columns
#> Mutation Ref Alt True_VAF VAF Depth Alt_reads
#> <character> <character> <character> <numeric> <numeric> <integer> <integer>
#> 1 Founder_1 T G 0.99 0.959184 98 94
#> 2 Founder_2 A A 0.99 1.000000 70 70
#> 3 Founder_3 T T 0.99 1.000000 105 105
#> 4 Founder_4 G T 0.99 1.000000 77 77
#> 5 Founder_5 G A 0.99 0.987179 78 77
#> Clone Type Clone_IDs
#> <character> <character> <character>
#> 1 Founder founder 1,2,3,4
#> 2 Founder founder 1,2,3,4
#> 3 Founder founder 1,2,3,4
#> 4 Founder founder 1,2,3,4
#> 5 Founder founder 1,2,3,4
# Subset by chromosome
gr_chr1 <- gr[seqnames(gr) == "chr1"]
length(gr_chr1)
#> [1] 4# Get VRanges object
vr <- suppressWarnings(toVCF(sim, sample_name = "TumorSample"))
vr
#> VRanges object with 99 ranges and 5 metadata columns:
#> seqnames ranges strand ref alt totalDepth
#> <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
#> [1] chr2 3901294 * T G 98
#> [2] chr17 55250578 * G T 77
#> [3] chr3 191364048 * G A 78
#> [4] chr7 110583980 * G T 102
#> [5] chr13 93127085 * C A 99
#> ... ... ... ... ... ... ...
#> [95] chr15 173679600 * C G 52
#> [96] chr14 162276325 * C G 79
#> [97] chr1 117412631 * C T 96
#> [98] chr7 177080573 * G C 127
#> [99] chr7 75923037 * A C 136
#> refDepth altDepth sampleNames softFilterMatrix | TRUE_VAF
#> <integerOrRle> <integerOrRle> <factorOrRle> <matrix> | <numeric>
#> [1] 4 94 TumorSample | 0.99
#> [2] 0 77 TumorSample | 0.99
#> [3] 1 77 TumorSample | 0.99
#> [4] 1 101 TumorSample | 0.99
#> [5] 2 97 TumorSample | 0.99
#> ... ... ... ... ... . ...
#> [95] 43 9 TumorSample | 0.216110
#> [96] 62 17 TumorSample | 0.307477
#> [97] 58 38 TumorSample | 0.345484
#> [98] 97 30 TumorSample | 0.244493
#> [99] 94 42 TumorSample | 0.324152
#> VAF CLONE TYPE CLONE_IDS
#> <numeric> <character> <character> <character>
#> [1] 0.959184 Founder founder 1,2,3,4
#> [2] 1.000000 Founder founder 1,2,3,4
#> [3] 0.987179 Founder founder 1,2,3,4
#> [4] 0.990196 Founder founder 1,2,3,4
#> [5] 0.979798 Founder founder 1,2,3,4
#> ... ... ... ... ...
#> [95] 0.173077 Clone4 private 4
#> [96] 0.215190 Clone4 private 4
#> [97] 0.395833 Clone4 private 4
#> [98] 0.236220 Clone4 private 4
#> [99] 0.308824 Clone4 private 4
#> -------
#> seqinfo: 22 sequences from an unspecified genome; no seqlengths
#> hardFilters: NULL
# Write to VCF file (using tempdir to avoid polluting user's filesystem)
vcf_file <- tempfile(fileext = ".vcf")
suppressWarnings(toVCF(sim, output_file = vcf_file))
file.exists(vcf_file)
#> [1] TRUEpyclone_file <- tempfile(fileext = ".tsv")
toPyClone(sim, file = pyclone_file, sample_id = "sample1")
head(read.table(pyclone_file, header = TRUE, sep = "\t"))
#> mutation_id ref_counts var_counts normal_cn minor_cn major_cn sample_id
#> 1 Founder_1 4 94 2 0 2 sample1
#> 2 Founder_2 0 70 2 0 2 sample1
#> 3 Founder_3 0 105 2 0 2 sample1
#> 4 Founder_4 0 77 2 0 2 sample1
#> 5 Founder_5 1 77 2 0 2 sample1
#> 6 Founder_6 1 101 2 0 2 sample1sciclone_file <- tempfile(fileext = ".tsv")
toSciClone(sim, file = sciclone_file)
head(read.table(sciclone_file, header = TRUE, sep = "\t"))
#> chr pos ref_reads var_reads vaf
#> 1 chr2 3901294 4 94 95.91837
#> 2 chr12 156958367 0 70 100.00000
#> 3 chr15 113008658 0 105 100.00000
#> 4 chr17 55250578 0 77 100.00000
#> 5 chr3 191364048 1 77 98.71795
#> 6 chr7 110583980 1 101 99.01961df <- toDataFrame(sim)
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
head(read.csv(csv_file))
#> Mutation Chromosome Position Ref Alt True_VAF VAF Depth Alt_reads
#> 1 Founder_1 chr2 3901294 T G 0.99 0.9591837 98 94
#> 2 Founder_2 chr12 156958367 A A 0.99 1.0000000 70 70
#> 3 Founder_3 chr15 113008658 T T 0.99 1.0000000 105 105
#> 4 Founder_4 chr17 55250578 G T 0.99 1.0000000 77 77
#> 5 Founder_5 chr3 191364048 G A 0.99 0.9871795 78 77
#> 6 Founder_6 chr7 110583980 G T 0.99 0.9901961 102 101
#> Clone Type Clone_IDs
#> 1 Founder founder 1,2,3,4
#> 2 Founder founder 1,2,3,4
#> 3 Founder founder 1,2,3,4
#> 4 Founder founder 1,2,3,4
#> 5 Founder founder 1,2,3,4
#> 6 Founder founder 1,2,3,4Here’s a typical workflow for benchmarking a variant caller:
# 1. Generate ground truth
set.seed(42)
sim <- simulateTumor(
subclone_freqs = c(0.2, 0.3, 0.5),
n_mut_per_clone = c(50, 75, 50)
)
# 2. Export to VCF (ground truth)
toVCF(sim, output_file = "ground_truth.vcf")
# 3. Get true positive set
true_mutations <- getMutations(sim)
# 4. Run your variant caller on simulated data
# (your variant caller code here)
# 5. Compare results
# - Sensitivity: TP / (TP + FN)
# - Precision: TP / (TP + FP)
# - F1 score: 2 * (Precision * Sensitivity) / (Precision + Sensitivity)
# 6. Evaluate VAF estimation accuracy
# cor(true_mutations$VAF, called_vaf)set.seed(123)
sim_complex <- simulateTumor(
subclone_freqs = c(0.1, 0.15, 0.2, 0.25, 0.3),
n_mut_per_clone = c(30, 40, 50, 40, 30),
n_mut_shared = list(
"2 3 4 5" = 20, # Shared by clones 2,3,4,5
"3 4 5" = 15, # Shared by clones 3,4,5
"4 5" = 10, # Shared by clones 4,5
"1 2" = 8 # Shared by clones 1,2
)
)
plot(sim_complex, type = "vaf_density")sessionInfo()
#> R version 4.6.0 Patched (2026-04-24 r89963)
#> Platform: aarch64-apple-darwin23
#> Running under: macOS Tahoe 26.3.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicRanges_1.64.0 Seqinfo_1.2.0 IRanges_2.46.0
#> [4] S4Vectors_0.50.0 BiocGenerics_0.58.0 generics_0.1.4
#> [7] ClonalSim_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.52.0 SummarizedExperiment_1.42.0
#> [3] gtable_0.3.6 rjson_0.2.23
#> [5] xfun_0.57 bslib_0.10.0
#> [7] ggplot2_4.0.3 Biobase_2.72.0
#> [9] lattice_0.22-9 vctrs_0.7.3
#> [11] tools_4.6.0 bitops_1.0-9
#> [13] curl_7.1.0 parallel_4.6.0
#> [15] tibble_3.3.1 AnnotationDbi_1.74.0
#> [17] RSQLite_2.4.6 blob_1.3.0
#> [19] pkgconfig_2.0.3 Matrix_1.7-5
#> [21] BSgenome_1.80.0 RColorBrewer_1.1-3
#> [23] cigarillo_1.2.0 S7_0.2.2
#> [25] lifecycle_1.0.5 compiler_4.6.0
#> [27] farver_2.1.2 Rsamtools_2.28.0
#> [29] Biostrings_2.80.0 codetools_0.2-20
#> [31] htmltools_0.5.9 sass_0.4.10
#> [33] RCurl_1.98-1.18 yaml_2.3.12
#> [35] tidyr_1.3.2 pillar_1.11.1
#> [37] crayon_1.5.3 jquerylib_0.1.4
#> [39] BiocParallel_1.46.0 DelayedArray_0.38.0
#> [41] cachem_1.1.0 abind_1.4-8
#> [43] tidyselect_1.2.1 digest_0.6.39
#> [45] purrr_1.2.2 restfulr_0.0.16
#> [47] dplyr_1.2.1 VariantAnnotation_1.58.0
#> [49] labeling_0.4.3 fastmap_1.2.0
#> [51] grid_4.6.0 cli_3.6.6
#> [53] SparseArray_1.12.0 magrittr_2.0.5
#> [55] S4Arrays_1.12.0 GenomicFeatures_1.64.0
#> [57] XML_3.99-0.23 dichromat_2.0-0.1
#> [59] withr_3.0.2 scales_1.4.0
#> [61] bit64_4.8.0 rmarkdown_2.31
#> [63] XVector_0.52.0 httr_1.4.8
#> [65] matrixStats_1.5.0 bit_4.6.0
#> [67] otel_0.2.0 png_0.1-9
#> [69] memoise_2.0.1 evaluate_1.0.5
#> [71] knitr_1.51 BiocIO_1.22.0
#> [73] rtracklayer_1.72.0 rlang_1.2.0
#> [75] glue_1.8.1 DBI_1.3.0
#> [77] jsonlite_2.0.0 R6_2.6.1
#> [79] GenomicAlignments_1.48.0 MatrixGenerics_1.24.0