Nextflow Registry | Nextflow Modules

Showing module(s) with keyword "catalogue"

Module	Keywords	Description
nf-core/custom/orfcollapse	orf ribo-seq catalogue smorf deduplication	Collapse small ORFs that share an amino-acid sequence cluster into a single catalogue entry. Pair with `custom/orfmerge` (coordinate-based catalogue), `bedtools/getfasta` + `seqkit/translate` (AA FASTA keyed by orf_id), and `mmseqs/easycluster` (AA clusters) upstream. The coordinate-based merge in `custom/orfmerge` only groups ORFs that overlap on the genome, so the same micropeptide encoded at several distinct, non-overlapping loci (typically repetitive regions) survives as separate rows. This module adopts the peptide-level deduplication and 0.9 amino-acid-similarity threshold of the GENCODE Ribo-seq ORF consolidation (Mudge et al. 2022, Nat Biotechnol, doi:10.1038/s41587-022-01369-0; gencode-riboseqORFs collapse_cutoff 0.9), implemented here with MMseqs2 sequence-identity clustering rather than that tool's longest-shared-string / P-site-overlap metric. Small ORFs (orf_class "smORF", i.e. aa_length <= 100) are clustered by amino-acid identity upstream, and this module folds each multi-member cluster down to one representative. Only smORF rows are collapsed; larger ORFs and transcript-anchored classes are passed through untouched. Among the smORF members of a cluster, the representative is chosen by longest aa_length (ties broken by orf_id), so the result does not depend on which sequence MMseqs2 labelled the cluster representative. Catalogue row order is preserved; dropped members fold their `called_by_<caller>` / `score_<caller>` evidence, `n_samples` / `samples` recurrence and gene mappings into the survivor.
nf-core/custom/orfmerge	orf ribo-seq catalogue merge clustering	Cluster normalised per-sample, per-caller ORF predictions into a single cohort-level catalogue. Pair with `custom/orfnormalise` upstream and (typically) `bedtools/getfasta` + `seqkit/translate` downstream to obtain the AA FASTA. The strategy is class-aware, operating on the harmonised `orf_class` written by `custom/orfnormalise`: (1) canonical_cds: collapse by (transcript_id, strand), since there is one canonical CDS per transcript by definition; (2) uORF, dORF, other: collapse by (transcript_id, strand, start, end), because a single transcript can host multiple distinct uORFs / dORFs / internal ORFs, so keying on the outer span keeps them in separate clusters while still merging cross-caller calls that agree on coordinates; (3) novel_u, smORF: greedy reciprocal-overlap clustering on the outer genomic span at `--reciprocal-overlap` (default 0.8), which catches fuzzy cross-caller matches and exact-coordinate collapses in one pass. The greedy clustering is order-dependent at the boundary: a chain A-B-C where A-B and B-C overlap at ~0.85 but A-C only at ~0.75 may cluster as {A,B,C} or {A,B}+{C} depending on iteration order; this is rare in practice at 0.8. Cross-caller consensus is recorded in two column families on the catalogue TSV: `called_by_<caller>`, a 0/1 indicator per supported caller (ribotish, ribocode, ribotricer, rpbp, price), and `score_<caller>`, the best score from that caller within the cluster, where score direction is per-caller (p-values are minimised; Bayes factors / phase scores are maximised). Cross-sample recurrence is recorded in two further columns: `n_samples`, the number of distinct samples contributing to the cluster (a cohort recurrence metric), and `samples`, a sorted, comma-separated list of those sample ids. Emits a small MultiQC custom-content TSV (per-class counts) for inclusion in downstream MultiQC reports. Alongside the full catalogue, emits a consensus view (`.consensus.`) filtered to ORFs supported by at least `--min-callers` distinct callers and recurring in at least `--min-samples` samples (both default to 1, i.e. no filtering, so the consensus view equals the full catalogue). Raising either threshold yields a higher-confidence catalogue without altering the full one.
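
The order dependence described for `custom/orfmerge` can be illustrated with a minimal Python sketch. It is not the module's code. It assumes half-open intervals on one chromosome and strand, takes reciprocal overlap as the smaller of the two overlap fractions, and assumes each new ORF is compared only against the first member (seed) of each existing cluster. The names `reciprocal_overlap` and `greedy_cluster` are hypothetical.

```python
from typing import List, Tuple

# (orf_id, start, end); half-open coordinates, same chromosome and strand assumed.
Interval = Tuple[str, int, int]


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of the two overlap fractions (overlap / own length)."""
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def greedy_cluster(orfs: List[Interval], threshold: float = 0.8) -> List[List[str]]:
    """Assign each ORF to the first cluster whose seed it overlaps reciprocally.

    Assumption: comparison is against the cluster seed (its first member) only.
    """
    clusters: List[List[Interval]] = []
    for orf in orfs:
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], orf) >= threshold:
                cluster.append(orf)
                break
        else:
            # No existing seed matched, so this ORF seeds a new cluster.
            clusters.append([orf])
    return [sorted(member[0] for member in cluster) for cluster in clusters]


# A-B and B-C overlap at 0.88, A-C only at 0.76.
A, B, C = ("A", 0, 100), ("B", 12, 112), ("C", 24, 124)
print(greedy_cluster([A, B, C]))  # A seeds the cluster, C misses it
print(greedy_cluster([B, A, C]))  # B seeds the cluster, both neighbours join
```

Under these assumptions the same three intervals give {A,B}+{C} when A is visited first and {A,B,C} when B is visited first, which is the boundary effect the description mentions.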
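
The representative selection described for `custom/orfcollapse` can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's implementation: rows are plain dicts, `cluster_of` is a hypothetical orf_id-to-cluster mapping derived from the upstream clustering, ties on aa_length go to the lexicographically smallest orf_id, and the survivor keeps its own row position. Folding of `score_<caller>` and gene mappings is left out.

```python
from collections import defaultdict
from typing import Dict, List


def collapse_smorfs(rows: List[dict], cluster_of: Dict[str, str]) -> List[dict]:
    """Collapse multi-member smORF clusters to one row, preserving row order."""
    # Group smORF rows by cluster id; other classes are never grouped.
    members = defaultdict(list)
    for row in rows:
        if row["orf_class"] == "smORF" and row["orf_id"] in cluster_of:
            members[cluster_of[row["orf_id"]]].append(row)

    survivors = {}   # survivor orf_id -> merged row
    dropped = set()  # orf_ids folded into a survivor
    for group in members.values():
        if len(group) < 2:
            continue
        # Longest aa_length wins; ties go to the smallest orf_id (assumption).
        rep = min(group, key=lambda r: (-r["aa_length"], r["orf_id"]))
        merged = dict(rep)
        samples = set()
        for r in group:
            samples.update(r["samples"].split(","))
            for key, value in r.items():
                if key.startswith("called_by_"):
                    merged[key] = max(merged.get(key, 0), value)
            if r is not rep:
                dropped.add(r["orf_id"])
        merged["samples"] = ",".join(sorted(samples))
        merged["n_samples"] = len(samples)
        survivors[rep["orf_id"]] = merged

    return [survivors.get(r["orf_id"], r) for r in rows if r["orf_id"] not in dropped]
```

Because the representative is picked from aa_length and orf_id alone, the outcome of this sketch does not depend on which member the clustering tool reported as the cluster representative.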
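
The consensus view of `custom/orfmerge` amounts to a row filter over the catalogue. A minimal Python sketch, assuming rows are dicts carrying the `called_by_<caller>` indicators and `n_samples` described above; the function name `consensus_view` and its keyword arguments are hypothetical stand-ins for `--min-callers` and `--min-samples`:

```python
from typing import List


def consensus_view(rows: List[dict], min_callers: int = 1, min_samples: int = 1) -> List[dict]:
    """Keep rows supported by enough distinct callers and enough samples."""
    kept = []
    for row in rows:
        # Each called_by_<caller> column is a 0/1 indicator, so the sum counts callers.
        n_callers = sum(v for k, v in row.items() if k.startswith("called_by_"))
        if n_callers >= min_callers and row["n_samples"] >= min_samples:
            kept.append(row)
    return kept
```

With both thresholds at 1, every row that has at least one caller and one sample passes, so the filtered list matches the input; raising either value shrinks the view and leaves the input list unchanged.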