There are two sets of datasets: TRAINING and CHALLENGE. Each set comprises 4 different communities with 4 samples each:

  1. “bio” refers to the real metagenomic dataset generated from mouse fecal samples
  2. “sim_low” refers to the simulated low complexity dataset
  3. “sim_med” refers to the simulated medium complexity dataset
  4. “sim_high” refers to the simulated high complexity dataset

Learn more about the datasets here.

Upon joining the challenge, three workspaces will be made available to you, with the first two containing the datasets for the TRAINING and the CHALLENGE:

  • “Strains Training Assets” contains all the files for the TRAINING DATASETS (raw reads, reference genomes, profiling truths, etc).
  • “Strains Challenge Assets” contains ONLY the raw reads for the CHALLENGE DATASETS.
  • “Strains Challenge Workspace for <username>” is the workspace where you can do your work for the challenge. Any challenge-related costs incurred by running jobs or storing data in this workspace are covered by the Strains Challenge sponsors. To protect your privacy, only you have access to this workspace. You can, however, share it with your collaborators, or with the Mosaic team if you need technical support. If you prefer to perform the analysis outside Mosaic, you can download individual files from the Assets workspace or a compressed file containing all datasets.

You can choose to work on the challenge in two ways:

  • Option A
    • Create a Mosaic app: see tutorial
    • Run app on datasets
  • Option B
    • Download the entire Training dataset as a tarball
    • Download the entire Challenge dataset (only the reads) as a tarball
    • Run your method in your own system

  1. The accepted format for results depends on the method category as follows:
    • Assembly

      A multi-FASTA file of contigs, which will be processed by MetaQUAST. There should be 1 file per dataset (bio, sim_low, sim_med, sim_high), i.e. 4 in total for the Training datasets and another 4 for the Challenge datasets. You must submit all 4 at the same time for the system to accept the submission to the Evaluation App.

    • Binning

      A tarball containing multiple multi-FASTA files of bins. There should be 1 tarball per dataset (bio, sim_low, sim_med, sim_high), i.e. 4 in total for the Training datasets and 4 for the Challenge datasets.

    • Profiling

      A tab-delimited file with 5 columns (1 + the number of samples, which is 4 in this case). The first column is a comma-separated string of NCBI taxonomic ids going down to the species rank, followed by the strain_index, e.g. <superkingdom_id>,<phylum>,<class>,<order>,<family>,<genus>,<species>,<strain_index>. As we are only interested in the number of strains per species for the time being, the strain_index is simply an enumeration of the strains for that species, so a species with 2 strains appears on 2 rows that differ only in their strain_index. The 4 subsequent columns contain the relative abundances of the taxa, one column per sample.

  2. Submit results for the training datasets through Submit Results for Evaluation. You may submit your results anonymously, although we strongly discourage this in the spirit of open science.
  3. Review results for the training datasets using the Testing Ground.
  4. If you are not satisfied with your performance, refine your method (locally or on Mosaic) and resubmit to see whether the results improve.
  5. When you are satisfied with the performance of your method, run your method on the challenge datasets and submit the results through Challenge Entry Submission.
  6. [OPTIONAL] Decide if you want to make an app related to this submission publicly available after the conclusion of the challenge. At the end of the challenge, the version of the App that you provided will become available for all Mosaic users to use. You are also free to publish your App at any time.
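
As an illustration of the Profiling format described in step 1, the following Python sketch writes a tab-delimited profile for 4 samples. It is only a sketch under stated assumptions: the helper name `write_profile`, the taxonomic ids, and the abundances are hypothetical placeholders, not challenge values. Starting the strain_index at 1, omitting spaces after the commas, and writing no header row are also assumptions, since the format description does not specify them.

```python
import io

# Hypothetical rows. Each lineage holds placeholder NCBI taxonomic ids from
# superkingdom down to species, followed by a strain_index (assumed to start at 1).
# The first species has 2 strains, so it appears on 2 rows.
ROWS = [
    ((2, 1001, 1002, 1003, 1004, 1005, 1006, 1), [0.40, 0.35, 0.30, 0.25]),
    ((2, 1001, 1002, 1003, 1004, 1005, 1006, 2), [0.10, 0.15, 0.20, 0.25]),
    ((2, 2001, 2002, 2003, 2004, 2005, 2006, 1), [0.50, 0.50, 0.50, 0.50]),
]


def write_profile(rows, handle, n_samples=4):
    """Write one line per strain: the lineage string, then one abundance per sample."""
    for lineage, abundances in rows:
        if len(lineage) != 8:
            raise ValueError("expected 7 taxonomic ids plus a strain_index")
        if len(abundances) != n_samples:
            raise ValueError("expected one abundance per sample")
        lineage_field = ",".join(str(item) for item in lineage)
        fields = [lineage_field] + [f"{value:.6f}" for value in abundances]
        handle.write("\t".join(fields) + "\n")


# Write to an in-memory buffer; pass an open file handle to write to disk instead.
buffer = io.StringIO()
write_profile(ROWS, buffer)
profile_text = buffer.getvalue()
print(profile_text, end="")
```

Each output line has 5 tab-separated fields, matching the "1 + number of samples" layout described above.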

The Testing Ground provides a space for you to evaluate the performance of your analyses before submitting entries to the Challenge. Additionally, in this space you will be able to visualize your Training analysis submissions and compare them with those of other users.

You can submit the results for the Challenge datasets analyses at the Challenge submission page. Here are some considerations to keep in mind:

  • Submission works exactly as in the Testing Ground, i.e. it is specific to Profiling, Assembly and Binning.
  • Expected submission file formats are the same as in the Testing Ground.
  • Unlike in the Testing Ground, after you submit to the Challenge through this page you will not be able to see the performance of your submission until the Strains #1 challenge has ended.
  • You will, however, have access to some submission details, such as:
    • Submission name
    • Submission ID
    • Type (Profiling, Assembly, or Binning)
    • Submitted As
    • Submitted Date
    • Status
    • Names and IDs of submitted files
  • If the evaluation of your submission fails, a Mosaic administrator will contact you to provide more details and assist you in resolving the issue.

After the Challenge ends on March 30, 2018, everyone who submitted to the Challenge will get access to the evaluation metrics of their submissions. These metrics will not include rankings. The official results of the Challenge will be announced and made publicly available in April 2018.

If you have any questions, please feel free to contact us.

Dataset Name: Training Datasets
Files:
  • (bio) 4 raw FASTQs from 4 mouse fecal samples
  • (sim_low) 4 simulated FASTQs with low complexity
  • (sim_med) 4 simulated FASTQs with medium complexity
  • (sim_high) 4 simulated FASTQs with high complexity
Purpose: These are the raw sequencing data on which you can run your tools to benchmark and refine them.

Dataset Name: Truth for Training Datasets
Files:
  • 4 tables with abundances for each of the 4 samples of the provided datasets
  • A compressed file containing the reference genome sequences for each dataset (4 files total)
  • 4 multi-FASTA files (`_binning-ref.fasta`) containing the reference genomes for binning evaluation
Purpose:
  • These files contain the gold standards (for the biological samples) and absolute truths (for the simulated data). The genome sequences for the organisms used in creating the gold standards and the simulated data are also provided.
  • The reference genomes are used to evaluate the submitted Assemblies.

Dataset Name: Challenge Datasets
Files:
  • (bio) 4 raw FASTQs from 4 mouse fecal samples
  • (sim_low) 4 simulated FASTQs with low complexity
  • (sim_med) 4 simulated FASTQs with medium complexity
  • (sim_high) 4 simulated FASTQs with high complexity
Purpose: These are the raw sequencing data on which you run your tools to produce your Challenge submission. Results on these datasets are the final results.

Dataset Name: Other
Files:
  • ete3_sqlite_dir_15Sep2017.tar.gz
Purpose: This file contains the compressed `.etetoolkit/` directory, which is used by the ete3 tool during the conversion of NCBI ids to taxonomy names.

  • “Bio” DATASETS


    Two microbial cocktails, i.e. synthetic communities (each with < 30 organisms), were developed from novel strains isolated from human fecal samples. Hence these strains do not exist in any public reference database and are not held by anyone beyond the organizers. The communities were inoculated into 2 gnotobiotic mice. The identity and genomes of the isolates were also established using MALDI-TOF and whole genome sequencing.

    Four fecal samples were collected from each mouse after its diet was changed to perturb the gut microbiome. These samples were then sent for shotgun sequencing.


    The samples were sequenced on a HiSeq4000 with 150bp paired-end reads. HiSeq4000 reads tend to deteriorate in quality near the ends, especially in the second read. Participants should perform quality trimming and filtering appropriate for their methods. We have mapped the reads back to the isolate genomes and determined that there is sufficient depth and coverage for most of the organisms that were expected to colonize their hosts.

  • “Sim” DATASETS

    Simulated genomes and communities

    A library of human gut species was collated and used to generate simulated strains for each species. For each species, all available genome assemblies were downloaded from NCBI. One assembly was randomly picked as the parent strain, and the rest were used as donor strains to feed into sgEvolver [A.E. Darling et al., 2010], which generated a library of simulated strains for that species.

    Three simulated communities (sim-low, sim-med and sim-high) were generated for each of the training and challenge datasets. Using the StrainMetaSim pipeline developed by Christopher Quince at the University of Warwick, species for the communities were picked randomly from the library, and then between 1 and 5 strains were picked randomly for each species. The pipeline also generated coverage across the samples, following a lognormal distribution at the species level and a Dirichlet distribution at the strain level.

    Metagenomic Reads

    HiSeq2500 150bp paired-end reads for the samples were simulated with the ART simulator through the StrainMetaSim pipeline to match the coverage specified in the samples.
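
To make the coverage model described for the simulated communities more concrete, here is a minimal Python sketch that draws a lognormal coverage per species and splits it among 1 to 5 strains with a Dirichlet draw. This is not the StrainMetaSim code. The function names, the parameters `mu`, `sigma` and `alpha`, their default values, and the choice to draw independently for every sample are all assumptions made for illustration.

```python
import random


def dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [draw / total for draw in draws]


def simulate_coverage(n_species, n_samples=4, mu=1.0, sigma=1.0, alpha=1.0, seed=0):
    """Return community[species][sample] = list of per-strain coverages.

    All parameter values are hypothetical; they are not the challenge settings.
    """
    rng = random.Random(seed)
    community = []
    for _ in range(n_species):
        n_strains = rng.randint(1, 5)  # between 1 and 5 strains, inclusive
        per_sample = []
        for _ in range(n_samples):
            # Species-level coverage follows a lognormal distribution.
            species_coverage = rng.lognormvariate(mu, sigma)
            # Strain-level proportions follow a Dirichlet distribution.
            proportions = dirichlet([alpha] * n_strains, rng)
            per_sample.append([species_coverage * p for p in proportions])
        community.append(per_sample)
    return community


community = simulate_coverage(n_species=3)
for index, per_sample in enumerate(community):
    print(f"species {index}: {len(per_sample[0])} strain(s), {len(per_sample)} samples")
```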

The results of analyses of the Training Dataset and the Challenge Datasets will be evaluated by the Evaluator App. The Evaluator compares the submitted results with the appropriate reference datasets. In the case of the Training Dataset, these evaluations are real-time and become available at the successful completion of the evaluation. For the Challenge Dataset, the results will become available after the end of the challenge.

If you would prefer to use the Evaluator without submitting to the Testing Ground, you may run the Evaluator independently and then view the raw results. Since the Challenge Truth is hidden, there's no option to run the Evaluator for Challenge Results.

For the Early Access program, the evaluator app may analyze results for only a subset of the challenge, e.g. only for Assembly. The evaluator app will be updated as new parts of the Challenge go live.


To evaluate Profiling submissions, the number of True Positives, False Positives and False Negatives for each dataset analysis will be determined. All non-zero abundance strains across the 4 samples (e.g. the 4 samples of the simulated low complexity datasets) are used in this calculation for each dataset. Using these values Precision, Recall and the F1 score will be calculated. To calculate each submission's score, a weighted average of the 4 datasets will be calculated using the following weights:

Dataset Weight
Biological 0.4
Simulated High Complexity 0.3
Simulated Medium Complexity 0.2
Simulated Low Complexity 0.1
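
The calculation above can be sketched in Python as follows. This is an illustrative sketch, not the Evaluator App's code. It assumes that strains are compared as set members (for example, a species id paired with a strain_index) and that the F1 score is the per-dataset value being averaged; neither detail is stated explicitly here. The function names are hypothetical. Only the weights come from the table above.

```python
# Weights taken from the table above, keyed by dataset name.
WEIGHTS = {"bio": 0.4, "sim_high": 0.3, "sim_med": 0.2, "sim_low": 0.1}


def precision_recall_f1(predicted, truth):
    """Compute Precision, Recall and F1 from two collections of strain identifiers."""
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)
    false_pos = len(predicted - truth)
    false_neg = len(truth - predicted)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def weighted_score(per_dataset_metric):
    """Weighted average of one metric value per dataset."""
    return sum(WEIGHTS[name] * per_dataset_metric[name] for name in WEIGHTS)


# Hypothetical strains with non-zero abundance in at least one of the 4 samples.
predicted = {("species_a", 1), ("species_a", 2), ("species_b", 1)}
truth = {("species_a", 1), ("species_b", 1), ("species_c", 1)}
print(precision_recall_f1(predicted, truth))

# Hypothetical per-dataset F1 scores.
print(weighted_score({"bio": 1.0, "sim_high": 0.5, "sim_med": 0.5, "sim_low": 0.0}))
```

The same `weighted_score` helper would apply to any per-dataset metric that uses these weights.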


Assembly is evaluated using MetaQUAST v4.5. The following metrics are used for the leaderboard:

  1. Genome Fraction (%)
  2. Misassemblies (count)
  3. The sum of Indels and Mismatches (per 100kb)

To calculate each submission's score, a weighted average of the Genome Fraction scores of the 4 datasets will be calculated using the same weights used for Profiling (above).


Binning submissions will be evaluated by first mapping the assembled contigs back to the challenge reference genomes using nucmer with very stringent cutoffs to establish a 1:1 mapping between assembled contigs and genomes. Next, the contigs in each bin are checked to determine whether they all come from the same genome, and a confusion matrix is built to calculate the following metrics:

  1. Precision
  2. Recall
  3. Adjusted Rand Index

The score for each submission is the weighted average of the Adjusted Rand Index using the same weights as in Profiling (above).
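
For reference, the Adjusted Rand Index can be computed from the bin and genome assignment of each contig using the standard pair-counting formula, as in the Python sketch below. This is not the Evaluator App's implementation. The function name is hypothetical, and counting every contig equally (rather than, say, weighting by contig length) is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(bins, genomes):
    """Adjusted Rand Index between two labelings of the same contigs.

    bins[i] is the bin assigned to contig i; genomes[i] is the genome it maps to.
    """
    if len(bins) != len(genomes):
        raise ValueError("both labelings must cover the same contigs")
    total_pairs = comb(len(bins), 2)
    if total_pairs == 0:
        return 1.0  # fewer than 2 contigs: the labelings trivially agree

    # Contingency (confusion) matrix cells and its row and column totals.
    cell_counts = Counter(zip(bins, genomes))
    bin_sizes = Counter(bins)
    genome_sizes = Counter(genomes)

    sum_cells = sum(comb(count, 2) for count in cell_counts.values())
    sum_bins = sum(comb(count, 2) for count in bin_sizes.values())
    sum_genomes = sum(comb(count, 2) for count in genome_sizes.values())

    expected = sum_bins * sum_genomes / total_pairs
    maximum = (sum_bins + sum_genomes) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every contig in a single group
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical contig assignments: a perfect binning and a fully mixed one.
perfect = adjusted_rand_index(["b1", "b1", "b2", "b2"], ["g1", "g1", "g2", "g2"])
mixed = adjusted_rand_index(["b1", "b1", "b2", "b2"], ["g1", "g2", "g1", "g2"])
print(perfect, mixed)
```

A value of 1 means the bins match the genomes exactly, values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.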