Nextflow Modules
Showing module(s) with keyword "orf"
| Module | Keywords | Description |
|---|---|---|
| nf-core/custom/orfmerge | orf ribo-seq catalogue merge clustering | Cluster normalised per-sample, per-caller ORF predictions into a single cohort-level catalogue. Pair with `custom/orfnormalise` upstream and (typically) `bedtools/getfasta` + `seqkit/translate` downstream to obtain the AA FASTA. Strategy is class-aware (operating on the harmonised `orf_class` written by `custom/orfnormalise`): - canonical_cds: collapse by (transcript_id, strand). One canonical CDS per transcript by definition. - uORF, dORF, other: collapse by (transcript_id, strand, start, end). A single transcript can host multiple distinct uORFs / dORFs / internal ORFs, so keying on the outer span keeps them in separate clusters while still merging cross-caller calls that agree on coordinates. - novel_u, smORF: greedy reciprocal-overlap clustering on the outer genomic span at `--reciprocal-overlap` (default 0.8). Catches fuzzy cross-caller matches and exact-coordinate collapses in one pass. Order-dependent at the boundary: a chain A-B-C where A-B and B-C overlap at ~0.85 but A-C only at ~0.75 may cluster as {A,B,C} or {A,B}+{C} depending on iteration order. Rare in practice at 0.8. Cross-caller consensus is recorded in two column families on the catalogue TSV: - `called_by_<caller>`: 0/1 indicator per supported caller (ribotish, ribocode, ribotricer, rpbp, price). - `score_<caller>`: best score from that caller within the cluster. Score direction is per-caller (p-values are minimised; Bayes factors / phase scores are maximised). Cross-sample recurrence is recorded in two further columns: - `n_samples`: number of distinct samples contributing to the cluster (a cohort recurrence metric). - `samples`: sorted, comma-separated list of those sample ids. Emits a small MultiQC custom-content TSV (per-class counts) for inclusion in downstream MultiQC reports. |
| nf-core/custom/orfnormalise | orf ribo-seq normalisation bed12 translation | Convert one ORF caller's per-sample output table into a unified BED12 plus a sidecar metadata TSV, ready for cross-caller merging. An "ORF caller" is a tool that scans ribosome-profiling (Ribo-seq) data and predicts which open reading frames are being translated. Each caller writes its own table format and uses its own location encoding, classification vocabulary, and confidence score. This module reconciles five callers into one harmonised schema. The `caller` val input selects the parser; supported values: - ribocode (RiboCode predicted ORF table; transcript-coord input, lifted to genomic blocks against the GTF) - ribotish (Ribo-TISH predict output; GenomePos + optional Blocks) - ribotricer (Ribotricer detect-orfs translating ORFs TSV; ORF span parsed from ORF_ID, multi-exon blocks recovered by intersecting with host-transcript exon structure from the GTF) - rpbp (Rp-Bp predicted-orfs BED12 with extra columns) - price (PRICE orfs.tsv; Gedi-style Location field, already genomic). Output BED12 column order: chrom, start, end, name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts. The BED `name` column carries `<caller>\|<caller-native-id>`. The BED `score` column is the caller's native score rescaled to 0-1000 (higher == more confident regardless of native direction). Output sidecar TSV columns: orf_id, caller, sample_id, chrom, start, end, strand, gene_id, transcript_id, orf_class, aa_length, score. Harmonised `orf_class` vocabulary written into the sidecar TSV: - canonical_cds: ORF maps to an annotated CDS (including truncated / extended variants of one). - uORF: upstream ORF (5'UTR-resident). - dORF: downstream ORF (3'UTR-resident). - novel_u: novel / intergenic ORF not assigned to an annotated CDS. - smORF: small ORF (aa_length <= 100); promoted regardless of location-based class so downstream tools can treat smORFs uniformly. - other: internal / overlap / frame variants and anything else. Per-caller mapping notes (lossy collapses): - PRICE `iORF` (internal ORF), `intronic`, and `orphan` map to `other`. Cross-caller catalogue tracking still flags these via `called_by_price`, but the specific PRICE sub-type is not preserved. - Rp-Bp's predicted-orfs BED carries no ORF-type column; this module defaults every Rp-Bp call to `canonical_cds` (the post-selectfinalpredictionset curated set is dominated by canonical CDSs). uORF/dORF/novel calls present in Rp-Bp's separate `.tab.gz` / `extracted-orfs.bed.gz` files are not propagated here. Each caller's native confidence score has a "direction": some are lower-is-better (p-values), some are higher-is-better (Bayes factors, phase scores): ribocode: min (combined p-value); ribotish: min (combined p-value); ribotricer: max (phase_score); rpbp: max (Bayes factor mean); price: min (p-value). Downstream merging uses this to pick the best per-ORF call. |
| nf-core/gedi/indexgenome | riboseq index genome gedi price orf | Build a GEDI genome index from a FASTA and GTF for downstream PRICE ORF prediction |
| nf-core/gedi/price | riboseq orf price gedi translation | Identify translated ORFs from Ribo-seq BAMs using the PRICE algorithm |
| nf-core/ribotricer/detectorfs | riboseq orf genomics | Accurate detection of short and long active ORFs using Ribo-seq data |
| nf-core/ribotricer/prepareorfs | riboseq orf genomics | Accurate detection of short and long active ORFs using Ribo-seq data |
| nf-core/rpbp/estimatemetagenebayesfactors | rpbp metagene bayes orf riboseq | Score how strongly each per-read-length metagene profile shows the 3-nucleotide periodicity expected of actively translating ribosomes. For each candidate (read length, P-site offset) pair, Rp-Bp fits two competing Bayesian models to the count window around annotated start codons: a "periodic" model whose signal repeats every three nucleotides, and a "non-periodic" background model. The Bayes factor (ratio of the two marginal likelihoods) quantifies how much the data prefer the periodic explanation. Returns one row per (length, offset) pair with the mean and variance of the log Bayes factor across MCMC samples. Downstream, `rpbp/selectperiodicoffsets` picks the best offset per length from this table, and `rpbp/getperiodiclengthsoffsets` filters to the high-confidence pairs that drive ORF-level scoring. Uses the Stan models bundled inside the rpbp Python package. |
| nf-core/rpbp/estimateorfbayesfactors | rpbp orf bayes translation riboseq | Score every candidate ORF for evidence of active translation. For each ORF, Rp-Bp fits two competing Bayesian models to its per-codon P-site count vector: a "translated" model that expects P-site density to concentrate at codon-start positions (the in-frame signal a translating ribosome produces), and an "untranslated" / noise model for the same data. The Bayes factor (ratio of marginal likelihoods) quantifies how much the data favour the translated hypothesis. Emits a BED-style table with one row per ORF carrying genomic coordinates plus the mean and variance of the log Bayes factor across MCMC samples. Downstream, `rpbp/selectfinalpredictionset` applies Bayes-factor, length and overlap rules to this table to produce the final filtered prediction set. Uses the Stan models bundled inside the rpbp Python package. |
| nf-core/rpbp/extractmetageneprofiles | rpbp metagene orf riboseq | Build per-read-length pileups of Ribo-seq read 5'-ends around annotated start codons - the "metagene profile". For each read length, the profile counts how many reads of that length have their 5' end at each position in a window around every annotated start codon, summed across all transcripts. Looking at the profile across the window reveals whether reads of that length show the 3-nucleotide periodicity characteristic of translating ribosomes. This per-length view matters because different ribosome footprint lengths place the ribosomal P-site (the codon being decoded) at different offsets from the read's 5' end, so each length needs its own offset calibration. Output is consumed by `rpbp/estimatemetagenebayesfactors`, which scores each (length, offset) combination for periodicity. |
| nf-core/rpbp/extractorfprofiles | rpbp orf psite profile riboseq | Build a per-ORF P-site count vector for every candidate open reading frame (ORF) in the catalogue. For each ORF, walks the spliced exons in 3-nucleotide codon steps and counts the P-site positions (read 5'-end coordinate plus the length-specific offset selected upstream) that fall in each codon. Counts are summed across all read lengths that passed the periodicity filter from `rpbp/getperiodiclengthsoffsets`. The resulting per-ORF vectors are the input to Bayesian translation scoring in `rpbp/estimateorfbayesfactors`: a translated ORF should show P-site density concentrated at codon-start positions, while a non-translated region should look flat or noisy. Emitted as a sparse matrix (one row per ORF, columns indexed by codon position). |
| nf-core/rpbp/preparegenome | rpbp orf prepare genome bed riboseq | Build the per-ORF reference files that Rp-Bp's downstream scoring needs, starting from a genome FASTA and an annotation GTF. Enumerates every candidate open reading frame (ORF) in the annotation (annotated CDSs plus alternative start codons within transcript exons), records their genomic and per-exon coordinates, and labels them with the transcript and gene they belong to. Invokes Rp-Bp's `get_orfs` Python function directly, chaining the upstream helpers `gtf-to-bed12`, `extract-bed-sequences`, `extract-orf-coordinates`, `split-bed12-blocks` and `label-orfs`. Bypasses Rp-Bp's `prepare-rpbp-genome` umbrella script, which would also build `bowtie2` (rRNA filtering) and `STAR` (alignment) indices - neither is consumed by the Rp-Bp tools wrapped here, since alignment is supplied externally as a BAM. A minimal `chrName.txt` (one contig name per line) is seeded from the FASTA headers because `gtf-to-bed12` reads it via `--chr-name-file` to control output sort order. Note: emits the `*.annotated.bed.gz` filenames produced by `get_orfs` directly, rather than the `*.bed.gz`-renamed forms that the upstrea |