cellgeni/fetch10xmeta @ 0.0.3

Fetches and parses metadata for public 10x datasets from GEO (GSE*), ArrayExpress (E-MTAB*), or BioProject (PRJ*). Downloads raw metadata from SRA/ENA/BioStudies, resolves sample-to-run mappings, classifies each run by download type, and produces a merged links file.

Latest version: 0.0.4
Total downloads: 0
Authors: @aljes
Maintainers: @aljes

cellgeni/fetch10xmeta

Summary

Fetches and parses metadata for public 10x datasets from GEO (GSE*), ArrayExpress (E-MTAB*), or BioProject/ENA (PRJ*). For each dataset it:

  1. Downloads raw metadata from NCBI SRA, EBI ENA, or BioStudies depending on accession type.
  2. Resolves sample accessions to experiment and run IDs, building an accessions map.
  3. Classifies each run by download type (paired-end FASTQs, 10x BAM, or SRA) via parse_ena_metadata.sh / parse_sra_metadata.sh.
  4. Merges the per-run classification with sample IDs into links.tsv via add_samples.awk.
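The accession prefix decides which repository family a dataset belongs to. As a minimal sketch in Python (not the module's actual code, which is shell-based), the dispatch could look like the following. The function name `accession_source` is hypothetical, and only the three prefixes named in the summary are covered.

```python
def accession_source(accession: str) -> str:
    """Map a dataset accession to the repository family named in the summary.

    Hypothetical helper for illustration; the module itself does this in shell.
    """
    prefixes = (
        ("GSE", "GEO"),
        ("E-MTAB", "ArrayExpress"),
        ("PRJ", "BioProject"),
    )
    for prefix, source in prefixes:
        if accession.startswith(prefix):
            return source
    # Anything else is outside the documented set of supported prefixes.
    raise ValueError(f"unsupported accession prefix: {accession!r}")


print(accession_source("GSE230685"))
print(accession_source("E-MTAB-9221"))
print(accession_source("PRJDB14428"))
```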

For GEO datasets the module falls back through project IDs → sub-series project IDs → BioSample IDs if earlier ENA/SRA metadata downloads fail.

Inputs

| Name | Type | Description |
|------|------|-------------|
| meta.id | string | Dataset accession. Supported prefixes: GSE* (GEO), E-MTAB* (ArrayExpress), PRJ* (BioProject). |
| sample_ids | list or string | Sample accessions to restrict processing to (e.g. GSM*, ERS*, DRS*). Accepts a list or comma-separated string. Pass empty/null to process all samples in the dataset. |
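Because sample_ids may arrive as a list, a comma-separated string, or empty/null, it helps to think of it as being normalized to a single list form first. The following Python sketch shows one way that normalization could work. The helper `normalize_sample_ids` is hypothetical, and treating an empty list as "no filter" is an assumption based on the description above.

```python
def normalize_sample_ids(sample_ids):
    """Return a clean list of sample accessions; an empty list means 'no filter'."""
    if sample_ids is None:
        return []
    if isinstance(sample_ids, str):
        # Comma-separated form, e.g. from a params file or TSV.
        sample_ids = sample_ids.split(",")
    # Drop blanks and surrounding whitespace.
    return [s.strip() for s in sample_ids if s and s.strip()]


print(normalize_sample_ids(["GSM7232572", "GSM7232573"]))
print(normalize_sample_ids("DRS408305,DRS408306"))
print(normalize_sample_ids(None))
```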

Outputs

| Name | File(s) | Description |
|------|---------|-------------|
| links | links.tsv | Per-run metadata with an appended sample_id column mapping each run back to its source sample. |
| list | *.list | Accession list files: run list (*.run.list), sample list (*.sample.list), project list (*.project.list), etc. |
| tsv | *.tsv | TSV files from the collection pipeline: raw SRA/ENA metadata, accession mapping (*.accessions.tsv), sample-run mapping (*.sample_x_run.tsv), and parsed run classification (*.parsed.tsv). |
| txt | *.txt | Optional SDRF/IDF plain-text files, present for ArrayExpress (E-MTAB*) datasets. |
| soft | *_family.soft | Optional GEO SOFT family file, present for GEO (GSE*) datasets. |
| versions | versions.yml | Pipeline version record. |
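A common downstream step is grouping runs by sample using the sample_id column of links.tsv. The Python sketch below illustrates this under two assumptions that the page does not confirm: that links.tsv has a header row, and that the run identifier lives in a column called `run` (a hypothetical name, passed as a parameter so it can be changed). The accessions in the example text are made up.

```python
import csv
import io
from collections import defaultdict


def runs_by_sample(links_tsv_text: str, run_column: str = "run") -> dict:
    """Group run identifiers by the sample_id column of a links.tsv-like table.

    Assumes a header row; `run_column` is a hypothetical column name.
    """
    reader = csv.DictReader(io.StringIO(links_tsv_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["sample_id"]].append(row[run_column])
    return dict(grouped)


# Made-up rows for illustration only.
example = (
    "run\tsample_id\n"
    "RUN0001\tSAMPLE_A\n"
    "RUN0002\tSAMPLE_A\n"
    "RUN0003\tSAMPLE_B\n"
)
grouped_runs = runs_by_sample(example)
print(grouped_runs)
```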

Usage

include { FETCH10XMETA } from './modules/cellgeni/fetch10xmeta'

// With a list of sample IDs
FETCH10XMETA(
    channel.of([[id: 'GSE230685'], ['GSM7232572', 'GSM7232573']])
)

// With a comma-separated string (e.g. from a params file or TSV)
FETCH10XMETA(
    channel.of([[id: 'PRJDB14428'], 'DRS408305,DRS408306'])
)

// All samples in an ArrayExpress dataset (no sample ID filter)
FETCH10XMETA(
    channel.of([[id: 'E-MTAB-9221'], null])
)

License

MIT

Input 1 channel
#1 tuple
meta map

Map with dataset-level metadata. Must contain key 'id' with the dataset accession. e.g. [ id:'GSE230685' ]

sample_ids string

Sample accession IDs to restrict processing to, given as a comma-separated string (e.g. 'GSM7232572,GSM7232573', 'DRS408305,DRS408306'); per the Inputs section and usage examples, a list is also accepted. If empty or null, all samples in the dataset are used.

Output 6 channels
#1 tsv tuple
meta map

Map with dataset-level metadata, passed through from input. e.g. [ id:'GSE230685' ]

*.tsv file

TSV files from the collection pipeline: raw SRA/ENA metadata, accession mapping (*.accessions.tsv), sample-run mapping (*.sample_x_run.tsv), and parsed run classification (*.parsed.tsv).

*.tsv
#2 txt tuple
meta map

Map with dataset-level metadata, passed through from input. e.g. [ id:'GSE230685' ]

*.txt file

Optional SDRF/IDF plain-text files, present for ArrayExpress (E-MTAB*) datasets.

*.txt
#3 list tuple
meta map

Map with dataset-level metadata, passed through from input. e.g. [ id:'GSE230685' ]

*.list file

Accession list files produced during metadata collection (*.run.list, *.sample.list, *.project.list, etc.).

*.list
#4 soft tuple
meta map

Map with dataset-level metadata, passed through from input. e.g. [ id:'GSE230685' ]

*_family.soft file

Optional GEO SOFT family file, present for GEO (GSE*) datasets.

*_family.soft
#5 links tuple
meta map

Map with dataset-level metadata, passed through from input. e.g. [ id:'GSE230685' ]

links.tsv file

TSV file with per-run metadata and an appended sample_id column mapping each run back to its source sample accession.

links.tsv
#6 versions
| Tool | Description | Homepage |
|------|-------------|----------|
| collect_metadata.sh | Downloads and parses GEO/ArrayExpress/BioProject metadata, resolving sample-to-run mappings. | https://github.com/cellgeni/nf-reprocessing-public-10x |
Version 0.0.3
Release Date 15 May 2026 15:17:25 (UTC)
Download URL https://registry.nextflow.io/api/v1/modules/cellgeni%2Ffetch10xmeta/0.0.3/download
OCI Store URL https://public.cr.seqera.io/v2/nextflow/plugin/modules/cellgeni/fetch10xmeta/blobs/sha256:a0ff95e8407aa7481ddc48d0fb72558abd2956609e5ecb9a733028fc6eabbacc
Size 8.2 KB
Checksum sha256:a0ff95e8407aa7481ddc48d0fb72558abd2956609e5ecb9a733028fc6eabbacc
Downloads 0
| Version | Date | Status | Downloads | Size |
|---------|------|--------|-----------|------|
| 0.0.4 | 15 May 2026 15:53:27 (UTC) | | 0 | 8.1 KB |