cellgeni/fetch10xmeta @ 0.0.3

Fetches and parses metadata for public 10x datasets from GEO (GSE*), ArrayExpress (E-MTAB*), or BioProject (PRJ*). Downloads raw metadata from SRA/ENA/BioStudies, resolves sample-to-run mappings, classifies each run by download type, and produces a merged links file.

Keywords: metadata GEO SRA ENA ArrayExpress BioProject 10x single-cell public data

Latest version: 0.0.4

Total downloads: 0

Authors: @aljes

Maintainers: @aljes


Summary

Fetches and parses metadata for public 10x datasets from GEO (GSE*), ArrayExpress (E-MTAB*), or BioProject/ENA (PRJ*). For each dataset it:

1. Downloads raw metadata from NCBI SRA, EBI ENA, or BioStudies, depending on accession type.
2. Resolves sample accessions to experiment and run IDs, building an accessions map.
3. Classifies each run by download type (paired-end FASTQs, 10x BAM, or SRA) via parse_ena_metadata.sh / parse_sra_metadata.sh.
4. Merges the per-run classification with sample IDs into links.tsv via add_samples.awk.

For GEO datasets the module falls back through project IDs → sub-series project IDs → BioSample IDs if earlier ENA/SRA metadata downloads fail.

Inputs

Name	Type	Description
`meta.id`	string	Dataset accession. Supported prefixes: `GSE*` (GEO), `E-MTAB*` (ArrayExpress), `PRJ*` (BioProject).
`sample_ids`	list or string	Sample accessions to restrict processing to (e.g. `GSM*`, `ERS*`, `DRS*`). Accepts a list or a comma-separated string. Pass empty/null to process all samples in the dataset.

Outputs

Name	File(s)	Description
`links`	`links.tsv`	Per-run metadata with an appended `sample_id` column mapping each run back to its source sample.
`list`	`*.list`	Accession list files: run list (`*.run.list`), sample list (`*.sample.list`), project list (`*.project.list`), etc.
`tsv`	`*.tsv`	TSV files from the collection pipeline: raw SRA/ENA metadata, accession mapping (`*.accessions.tsv`), sample-run mapping (`*.sample_x_run.tsv`), and parsed run classification (`*.parsed.tsv`).
`txt`	`*.txt`	Optional SDRF/IDF plain-text files, present for ArrayExpress (`E-MTAB*`) datasets.
`soft`	`*_family.soft`	Optional GEO SOFT family file, present for GEO (`GSE*`) datasets.
`versions`	`versions.yml`	Pipeline version record.
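
As a rough illustration of how the outputs above might be consumed downstream, the sketch below reads `links.tsv` from the `links` channel and lists the distinct sample IDs per dataset. It assumes the outputs are exposed under the names in the table (`FETCH10XMETA.out.links`), that `links.tsv` has no header row, and that the appended `sample_id` is the last column. None of these is confirmed here, so check the file before relying on it.

```nextflow
include { FETCH10XMETA } from './modules/cellgeni/fetch10xmeta'

workflow {
    FETCH10XMETA(
        channel.of([[id: 'GSE230685'], ['GSM7232572', 'GSM7232573']])
    )

    // `links` emits tuple(meta, links.tsv). splitCsv splits the file element
    // and emits one tuple(meta, row) per line, where row is a list of columns.
    FETCH10XMETA.out.links
        .splitCsv(sep: '\t')
        // ASSUMPTION: sample_id is the last column and there is no header row.
        .map { meta, row -> [meta.id, row[-1]] }
        .unique()
        .view()
}
```

If the file turns out to carry a header row, passing `header: true` to `splitCsv` and reading the column by name would be the safer variant.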

Usage

```nextflow
include { FETCH10XMETA } from './modules/cellgeni/fetch10xmeta'

// With a list of sample IDs
FETCH10XMETA(
    channel.of([[id: 'GSE230685'], ['GSM7232572', 'GSM7232573']])
)

// With a comma-separated string (e.g. from a params file or TSV)
FETCH10XMETA(
    channel.of([[id: 'PRJDB14428'], 'DRS408305,DRS408306'])
)

// All samples in an ArrayExpress dataset (no sample ID filter)
FETCH10XMETA(
    channel.of([[id: 'E-MTAB-9221'], null])
)
```
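
For the TSV case mentioned above, one possible way to build the input channel is sketched here. The samplesheet layout (two tab-separated columns, dataset accession then comma-separated sample IDs, no header) and the `params.samplesheet` parameter are hypothetical and not part of the module.

```nextflow
include { FETCH10XMETA } from './modules/cellgeni/fetch10xmeta'

workflow {
    // Hypothetical samplesheet, one dataset per line:
    //   GSE230685<TAB>GSM7232572,GSM7232573
    //   E-MTAB-9221
    datasets = channel
        .fromPath(params.samplesheet)   // params.samplesheet is an assumed parameter
        .splitCsv(sep: '\t')
        .map { row ->
            // A missing or empty second column means "no sample filter" (null).
            def ids = (row.size() > 1 && row[1]) ? row[1] : null
            [[id: row[0]], ids]
        }

    FETCH10XMETA(datasets)
}
```

Each emitted item has the same `[meta, sample_ids]` shape as the inline examples, so the module call itself does not change.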

License

MIT

Input 1 channel

#1 tuple

`meta` map	Map with dataset-level metadata. Must contain key 'id' with the dataset accession. e.g. [ id:'GSE230685' ]
`sample_ids` list or string	Sample accession IDs to restrict processing to, given as a list or a comma-separated string (e.g. 'GSM7232572,GSM7232573', 'DRS408305,DRS408306'). If empty or null, all samples in the dataset are used.

Output 6 channels

#1 tsv tuple

`meta` map	Map with dataset-level metadata, passed through from input. e.g. [ id:'GSE230685' ]
`*.tsv` file	TSV files from the collection pipeline: raw SRA/ENA metadata, accession mapping (.accessions.tsv), sample-run mapping (.sample_x_run.tsv), and parsed run classification (.parsed.tsv). Pattern: `*.tsv`

#2 txt tuple

`meta` map	Map with dataset-level metadata, passed through from input. e.g. [ id:'GSE230685' ]
`*.txt` file	Optional SDRF/IDF plain-text files, present for ArrayExpress (E-MTAB) datasets. Pattern: `*.txt`

#3 list tuple

`meta` map	Map with dataset-level metadata, passed through from input. e.g. [ id:'GSE230685' ]
`*.list` file	Accession list files produced during metadata collection (.run.list, .sample.list, .project.list, etc.). Pattern: `*.list`

#4 soft tuple

`meta` map	Map with dataset-level metadata, passed through from input. e.g. [ id:'GSE230685' ]
`*_family.soft` file	Optional GEO SOFT family file, present for GEO (GSE) datasets. Pattern: `*_family.soft`

#5 links tuple

`meta` map	Map with dataset-level metadata, passed through from input. e.g. [ id:'GSE230685' ]
`links.tsv` file	TSV file with per-run metadata and an appended sample_id column mapping each run back to its source sample accession. Pattern: `links.tsv`

#6 versions

Tool	Description	Homepage
collect_metadata.sh	Downloads and parses GEO/ArrayExpress/BioProject metadata, resolving sample-to-run mappings.	https://github.com/cellgeni/nf-reprocessing-public-10x

Version	0.0.3
Release Date	15 May 2026 15:17:25 (UTC)
Download URL	https://registry.nextflow.io/api/v1/modules/cellgeni%2Ffetch10xmeta/0.0.3/download
OCI Store URL	https://public.cr.seqera.io/v2/nextflow/plugin/modules/cellgeni/fetch10xmeta/blobs/sha256:a0ff95e8407aa7481ddc48d0fb72558abd2956609e5ecb9a733028fc6eabbacc
Size	8.2 KB
Checksum	sha256:a0ff95e8407aa7481ddc48d0fb72558abd2956609e5ecb9a733028fc6eabbacc
Downloads	0

Version	Date	Status	Downloads	Size
0.0.4	15 May 2026 15:53:27 (UTC)		0	8.1 KB