Preprocessing
Assemble datasets
First, assemble your set of reads into contigs. For this purpose, you can use metaSPAdes, SGA, or metaFlye.
metaSPAdes
SPAdes is a short-read assembler based on the de Bruijn graph approach, and metaSPAdes is its dedicated metagenomic mode. Use metaSPAdes to assemble short reads into contigs. A sample command is given below.
spades.py --meta -1 Reads_1.fastq -2 Reads_2.fastq -o /path/output_folder -t 16
SGA
SGA (String Graph Assembler) is a short-read assembler based on the overlap-layout-consensus (more recently, string graph) approach. Use SGA to assemble short reads into contigs. Sample commands are given below; you may change the parameters to suit your datasets.
sga preprocess -o reads.fastq --pe-mode 1 Reads_1.fastq Reads_2.fastq
sga index -a ropebwt -t 16 --no-reverse reads.fastq
sga correct -k 41 --learn -t 16 -o reads.k41.fastq reads.fastq
sga index -a ropebwt -t 16 reads.k41.fastq
sga filter -x 2 -t 16 reads.k41.fastq
sga fm-merge -m 45 -t 16 reads.k41.filter.pass.fa
sga index -t 16 reads.k41.filter.pass.merged.fa
sga overlap -m 55 -t 16 reads.k41.filter.pass.merged.fa
sga assemble -m 95 reads.k41.filter.pass.merged.asqg.gz
metaFlye
Flye is a long-read assembler based on the repeat graph approach, and metaFlye is its dedicated metagenomic mode. Use metaFlye to assemble long reads into contigs. A sample command is given below.
flye --meta --pacbio-raw reads.fasta --genome-size estimated_metagenome_size --out-dir /path/output_folder --threads 16
Bin contigs
Next, bin the resulting contigs using an existing contig-binning tool. We used the following tools and commands in our experiments.
MaxBin2
perl MaxBin-2.2.5/run_MaxBin.pl -contig contigs.fasta -abund abundance.abund -thread 8 -out /path/output_folder
SolidBin
python scripts/gen_kmer.py /path/to/data/contig.fasta 1000 4
sh gen_cov.sh
python SolidBin.py --contig_file /path/to/contigs.fasta --composition_profiles /path/to/kmer_4.csv --coverage_profiles /path/to/cov_inputtableR.tsv --output /output/result.tsv --log /output/log.txt --use_sfs
Prepare binning results
The binning output file should contain delimiter-separated (e.g., comma-separated) values (contig_identifier, bin_number), one line per contig, and should look similar to the examples given below. Contigs are named according to their original identifiers, and bin numbering starts from 1. You can use the prepResult command to format an initial binning result into the .csv format with contig identifiers and bin IDs. Further details can be found here and on the next page.
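To make the target format concrete, here is a minimal Python sketch that builds such a file from a folder containing one FASTA file per bin. It is an illustration only and not the prepResult implementation: the function names, the `*.fasta` file pattern, and the rule that bins are numbered in sorted file-name order are all assumptions made for this example.

```python
import csv
from pathlib import Path


def read_contig_ids(fasta_path):
    """Yield the identifier (first token after '>') of each FASTA record."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def write_binning_csv(bin_dir, output_csv, pattern="*.fasta"):
    """Write contig_identifier,bin_number rows and return the row count.

    Assumption: each file matching `pattern` in `bin_dir` is one bin, and
    bins are numbered from 1 in sorted file-name order.
    """
    bin_files = sorted(Path(bin_dir).glob(pattern))
    rows = 0
    with open(output_csv, "w", newline="") as out:
        writer = csv.writer(out, lineterminator="\n")
        for bin_number, bin_file in enumerate(bin_files, start=1):
            for contig_id in read_contig_ids(bin_file):
                writer.writerow([contig_id, bin_number])
                rows += 1
    return rows
```

Calling `write_binning_csv("/path/to/bins", "initial_binning.csv")` (both paths are placeholders) would write one `contig_identifier,bin_number` line per contig found in the bin files.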
Example binned inputs
Example metaSPAdes binned input
NODE_1_length_507141_cov_16.465306,1
NODE_2_length_487410_cov_94.354557,1
NODE_3_length_483145_cov_59.410818,1
NODE_4_length_468490_cov_20.967912,2
NODE_5_length_459607_cov_59.128379,2
...
Example SGA binned input
contig-0,1
contig-1,2
contig-2,1
contig-3,1
contig-4,2
...
Example Flye binned input
contig_1,1
contig_2,2
contig_3,1
contig_4,1
contig_5,2
...
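Before running GraphBin2, it can help to check that a binning file follows the format shown in the examples above. The following Python sketch is one hypothetical way to do so; `check_binning_csv` is a name invented here, and the rule that each contig is listed only once is an assumption based on the examples rather than a documented requirement.

```python
import csv


def check_binning_csv(path, delimiter=","):
    """Return a list of problems found in a contig_identifier,bin_number file."""
    problems = []
    seen = set()
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # ignore blank lines
            if len(row) != 2:
                problems.append(
                    f"line {line_no}: expected 2 fields, found {len(row)}"
                )
                continue
            contig_id, bin_field = row[0].strip(), row[1].strip()
            # Bin numbers must be integers starting from 1.
            try:
                bin_number = int(bin_field)
            except ValueError:
                bin_number = 0
            if bin_number < 1:
                problems.append(
                    f"line {line_no}: bin number {bin_field!r} is not an integer >= 1"
                )
            # Assumption: a contig should be assigned to a bin only once.
            if contig_id in seen:
                problems.append(
                    f"line {line_no}: contig {contig_id!r} listed more than once"
                )
            seen.add(contig_id)
    return problems
```

An empty list means no problems were detected by these checks; it does not guarantee that the contig identifiers match those in the assembly output.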
Obtain the coverage of contigs (abundance.tsv)
You can use CoverM to get the coverage of contigs. You can run the following commands to get the abundance.tsv file. Please make sure that there are no headers in the abundance.tsv file.
coverm contig -1 reads_1.fastq -2 reads_2.fastq -r contigs.fasta -o abundance.tsv -t 8
sed -i '1d' abundance.tsv # remove the header of the file
The resulting abundance.tsv file can be directly used in GraphBin2.
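If you are unsure whether the header has already been removed, a quick check along the following lines may help. This is a minimal sketch that assumes abundance.tsv is tab-separated with the contig identifier in the first column and a numeric coverage value in the second; the function name `first_line_is_header` is made up for this example.

```python
def first_line_is_header(path):
    """Return True if the first line of a coverage table looks like a header.

    Assumption: columns are tab-separated, with the contig identifier first
    and a numeric coverage value second. A non-numeric second field on the
    first line is treated as a header.
    """
    with open(path) as handle:
        fields = handle.readline().rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least two tab-separated columns")
    try:
        float(fields[1])
    except ValueError:
        return True
    return False
```

If the function returns True for your abundance.tsv, the header line is still present and should be removed before the file is given to GraphBin2.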
Once you have obtained the assembly output, binning results and the coverage information file, you can run GraphBin2.