Report on prokaryotic genome assembly, Functional Genomics¶

~6.9k tokens

This post walks through an end to end Nanopore assembly of a Bacillus subtilis sized bacterial genome: read QC, reference mapping sanity checks, de novo assembly with Flye, assembly graph inspection in Bandage, quality assessment with QUAST and CheckM, and final annotation with Prokka. The emphasis is on showing why each step exists and how to interpret the key outputs, not on chasing maximum performance.

Table of sequencing technologies¶

1. To better learn the existing sequencing technologies, please fill in the table¶

Technique	Illumina	PacBio	Nanopore
Method	Sequencing by synthesis (SBS), fluorescence readout during nucleotide incorporation, clonal clusters on a flow-cell.	SMRT (Single Molecule Real-Time) in ZMW, single molecule, signal readout during synthesis. Modes: CLR and HiFi (CCS).	A single molecule passes through a nanopore, measurement of current changes and basecalling.
Read length	Short, usually 50–300 bp (often 2*150 bp).	Long. CLR usually 10–30 kb (sometimes more). HiFi usually 10–25 kb.	Very long. Typically 10 - 10 kb. Possible ultralong >100 kb (even hundreds of kb with an excellent sample).
Read error rate	Low, usually about 0.1 - 1% per base (*Q30+ common).	HiFi very low: about 0.1 - 0.5% (often Q30–Q40). CLR higher, several to a dozen % raw.	Variable. Raw often about 2 - 1 depending on chemistry and basecaller. In consensus much better.
Advantages	Very high accuracy, high throughput, low cost per base, good pipelines, good for SNV variants and RNA-seq.	Long reads and high accuracy in HiFi, good for de novo assembly, SV, haplotyping, repeated regions.	Longest reads, portable hardware, flexible scaling (from small to large runs), real-time results, good detection, possible detection of modifications (e.g. methylation).
Disadvantages	Short reads make genome assembly difficult, problems in repeats and complex rearrangements, worse for large SV.	More expensive than Illumina per sample, requires good quality and quantity of DNA, instruments are expensive. CLR has higher raw error.	Lower raw accuracy than Illumina and PacBio HiFi, greater dependence on DNA quality and basecalling, specific errors in homopolymers and some motifs.
Run time / cost per 1 Mbp	Time from hours to a few days depending on platform. Cost per Mbp usually the lowest.	Time from a few hours to a dozen. Cost per Mbp higher than Illumina.	Time from a few hours to 1–3. Cost per Mbp often competitive, depends on the flow cell and output, usually between Illumina and PacBio, or similar to Illumina with good utilization.

Q30 is a measure of read quality in DNA sequencing (most often Illumina). It is the so-called Phred quality score, which tells how large the probability is of an incorrect base call (A/C/G/T).

\(Q = -10 \times \log_{10}P(error)\)

What Q30 means:

HiFi in PacBio is the mode of “very accurate long reads”. Q30 means:

\(P(error) = 10^{-3} = 0.001 = 0\%\text{ per base}\)

So on average 1 error per 1000 bases.

Data analysis¶

The longest sequence was 122,674 bp and the smallest was 140 bp. The number of these sequences is 26,277.

2. Do you think the read set is sufficient to assemble the genome of a bacterium of length X?¶

Result of seqkit:

seqkit stats b_subtilis_reads.fastq.gz > seqkit_stats.txt

It is here:

file format type num_seqs
sum_len min_len avg_len max_len
b_subtilis_reads.fastq.gz FASTQ DNA 26,277
178,233,472 140 6,782.9 122,674

Then there is NanoStat analysis:

NanoStat --fastq b_subtilis_reads.fastq.gz > nanostat.txt

General summary:

Mean read length: 6,782.9
Mean read quality: 8.1
Median read length: 4,608.0
Median read quality: 8.2
Number of reads: 26,277.0
Read length N50: 9,122.0
STDEV read length: 8,593.0
Total bases: 178,233,472.0

Number, percentage and megabases of reads above quality cutoffs
>Q10: 3 (0.0%) 0.0Mb
>Q15: 0 (0.0%) 0.0Mb
>Q20: 0 (0.0%) 0.0Mb
>Q25: 0 (0.0%) 0.0Mb
>Q30: 0 (0.0%) 0.0Mb

Top 5 highest mean basecall quality scores and their read lengths
1: 10.4 (316)
2: 10.1 (937)
3: 10.1 (1761)
4: 10.0 (189)
5: 9.9 (3480)

Top 5 longest reads and their mean basecall quality score
1: 122674 (8.8)
2: 115971 (8.1)
3: 112796 (8.8)
4: 110874 (7.7)
5: 106437 (7.7)

Plots from the report created by NanoPlot:

NanoPlot --fastq b_subtilis_reads.fastq.gz -o nanoplot --threads 4

(ml) konrad_guest@konrad-hp:~/ExternalDisk/functional_genomics/gorzelanczyk_asemblacja/nanoplot$ ls
LengthvsQualityScatterPlot_dot.html
LengthvsQualityScatterPlot_dot.png
LengthvsQualityScatterPlot_kde.html
LengthvsQualityScatterPlot_kde.png
NanoPlot_20251229_2115.log
NanoPlot-report.html
NanoStats.txt
Non_weightedHistogramReadlength.html
Non_weightedHistogramReadlength.png
Non_weightedLogTransformed_HistogramReadlength.html
Non_weightedLogTransformed_HistogramReadlength.png
WeightedHistogramReadlength.html
WeightedHistogramReadlength.png
WeightedLogTransformed_HistogramReadlength.html
WeightedLogTransformed_HistogramReadlength.png
Yield_By_Length.html
Yield_By_Length.png

Next we do alignment, where map-ont is a preset specifically for Nanopore and the result goes directly to BAM.

We sorted and created the index.

Now there is mapping control via samtools:

minimap2 -ax map-ont -t 4 \
b_subtilis_reference.fasta \
b_subtilis_reads.fastq.gz \
| samtools view -b -o aln.bam -

samtools sort -@ 4 -o aln.sorted.bam aln.bam
samtools index aln.sorted.bam

cat flagstat.txt

27503 + 0 in total (QC-passed reads + QC-failed reads)
26277 + 0 primary
1034 + 0 secondary
192 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
27251 + 0 mapped (99.08% : N/A)
26025 + 0 primary mapped (99.04% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

From this we know how many reads are mapped. Next we check coverage:

mosdepth -t 4 coverage aln.sorted.bam
cat coverage.mosdepth.summary.txt

chrom length bases mean min max
BS.pilon.polished.v3.ST170922 4045677 166159099 41.07 3 130
total 4045677 166159099 41.07 3 130

And now the depth distribution:

head coverage.mosdepth.global.dist.txt

BS.pilon.polished.v3.ST170922 123 0.00
BS.pilon.polished.v3.ST170922 122 0.00
BS.pilon.polished.v3.ST170922 121 0.00
BS.pilon.polished.v3.ST170922 120 0.00
BS.pilon.polished.v3.ST170922 119 0.00
BS.pilon.polished.v3.ST170922 118 0.00
BS.pilon.polished.v3.ST170922 117 0.00
BS.pilon.polished.v3.ST170922 116 0.00
BS.pilon.polished.v3.ST170922 115 0.00
BS.pilon.polished.v3.ST170922 114 0.00

If \(>99\%\) mapped, it is very good.

If \(>70\%\) , something is wrong (bad data, wrong reference, bad basecalling).

Is it enough? Yes, the Nanopore read set for Bacillus subtilis provides high and even coverage of the reference genome of length about \(~4.05\) Mb (on average \(~41*\), \(>99\%\) of reads mapped), and with N50 length \(~9.1\) kb and the presence of very long reads (>100 kb) it is sufficient to perform de novo assembly of a bacterial genome, despite moderate quality of single reads.

There are 178,233,472 bp of data, which with a genome of 4,045,677 bp gives theoretically \(~44\times\), and from mapping the mean coverage is \(~41\times\).

This is a level sufficient for de novo assembly of bacteria.

Known from flagstat.

Graphs¶

draft assembly, de Bruijn graphs

“overlap-layout-consensus” (OLC)

This is a directed weighted graph from graph theory. In bioinformatics, such an object is often called a read overlap graph (overlap graph) and it fits the OLC approach (Overlap, Layout, Consensus). It is not a de Bruijn graph because the vertices are not k-mers, but whole reads.

3. Please describe such a basic graph from graph theory and show an example graph (not necessarily) de Bruijn and write the elements of the structure of such a graph¶

Vertices¶

A vertex is a single DNA read. Here there are 6 vertices and they are exactly these sequences:

GTCAGCCCACACAGAT
AGTTTTGC
TTGCGGATATAGCGAC
ATCCCGACGGAAACGA
ATACGACTTT
CGACGTTATACCCCCC

Mapping is almost complete: 99.08% of reads map to the reference, so the data really come from this organism and are not dominated by contamination. Known from nanostat.

Edges¶

An edge is directed from read A to read B when the end (suffix) of A overlaps the start (prefix) of B. Direction means a potential order of assembly.

In this graph there are the following edges:

GTCAGCCCACACAGAT → AGTTTTGC (weight 12)
GTCAGCCCACACAGAT → TTGCGGATATAGCGAC (weight 14)
AGTTTTGC → TTGCGGATATAGCGAC (weight 2)
TTGCGGATATAGCGAC → ATCCCGACGGAAACGA (weight 8)
ATCCCGACGGAAACGA → ATACGACTTT (weight 10)
ATCCCGACGGAAACGA → CGACGTTATACCCCCC (weight 13)
ATACGACTTT → CGACGTTATACCCCCC (weight 3)

Edge weights¶

A weight is a number describing the strength of the relation, most often the overlap length or an alignment score. The larger the weight, the more reliable the connection.

In this graph it is clearly visible:

The connection AGTTTTGC → TTGCGGATATAGCGAC has weight 2, so it is very weak (this weight is arbitrarily assumed in the code og.add_edge(reads[1], reads[2], weight = 2)).

Strong candidates for sequence skeletonization are edges 14, 13, 12, 10, 8.

Interpretation¶

This is a graph with source and sink nodes that in practice correspond to the beginnings and ends of the assembled sequence.

The node ATCCCGACGGAAACGA is a branching node because it has two outgoing edges (to ATACGACTTT and to CGACGTTATACCCCCC).

A path with the highest weights suggests a natural order of linking reads, for example:

GTCAGCCCACACAGAT → TTGCGGATATAGCGAC → ATCCCGACGGAAACGA → CGACGTTATACCCCCC

with weights 14, 8, 13.

Genome assembly¶

I chose flye because it is easy to install via microconda. There is no other special reason. I want to use it first rather than search for the best option out of the three given.

If it would not manage, then we have shown here how much we need to take to 30x.

4. Please assemble the genome of the selected bacterium using one of the assemblers (to your taste)¶

genome_len=$(seqkit fx2tab -l b_subtilis_reference.fasta | awk '{s+=$3} END{print s}')
echo "Genome length (X) = $genome_len bp"

Genome length (X) = 4045677 bp

Now make a sample (I give a fixed seed -s so that the result is reproducible).

Practical note: this is an approximation of 30× computed from the total number of bases in the FASTQ. It is sufficient for the purpose of the task.

Assembly of the genome. Results:

total_bases=$(seqkit stats -T b_subtilis_reads.fastq.gz | awk 'NR==2{print $5}')
echo "Total bases in reads = $total_bases"

Total bases in reads = 178233472

p=$(awk -v g="$genome_len" -v b="$total_bases" 'BEGIN{print (30*g)/b}')
echo "Sampling proportion p = $p"

seqkit sample -p "$p" -s 13 b_subtilis_reads.fastq.gz -o b_subtilis_reads_30x.fastq.gz
seqkit stats b_subtilis_reads_30x.fastq.gz

Sampling proportion p = 0.680962

Run flye:

flye --nano-raw b_subtilis_reads.fastq.gz \
--genome-size 4.1m \
--threads 4 \
--out-dir flye_out

Flye log (excerpt):

[2025-12-29 22:24:11] INFO: Starting Flye 2.9.6-b1802
[2025-12-29 22:24:13] INFO: Total read length: 178233472
[2025-12-29 22:24:13] INFO: Estimated coverage: 43
[2025-12-29 22:24:13] INFO: Reads N50/N90: 9122 / 3373
[2025-12-29 22:27:49] INFO: Generated 1 contigs
[2025-12-29 22:32:09] INFO: Assembly statistics:
Total length: 4032848
Fragments: 1
Fragments N50: 4032848
Largest frg: 4032848
Scaffolds: 0
Mean coverage: 43
[2025-12-29 22:32:09] INFO: Final assembly:
.../flye_out/assembly.fasta

It produced one contig, and in the seqkit statistics you can see very clearly that the length is almost 4.1 Mbp.

Showing the output:

assembly=flye_out/assembly.fasta
grep -c '^>' "$assembly"
seqkit stats "$assembly"

1
file format type num_seqs sum_len min_len avg_len max_len
flye_out/assembly.fasta FASTA DNA 1 4,032,848 4,032,848 4,032,848 4,032,848

Showing the graph with Bandage¶

Flye creates flye_out as output. It is enough to use Bandage and obtain an image. Because Flye computed from the data and generated 1 contig of length ~4.03 Mb, the graph in Bandage should show:

One main connected component and in practice one dominant path without major branching. This is a visual proof that the assembly is closed or almost closed.

If the genome is circular (typical for bacteria), the graph often looks like a single loop or a single path that in Bandage can look like a closed circuit.

If, however, we see short branches or bubbles, these are usually repeats (e.g. rRNA operons) or places where the data do not resolve connections unambiguously. However, with 1 contig there should not be such things, or there should be few.

We use the program:

5. Using Bandage show the graph¶

cd ~/ExternalDisk/functional_genomics/gorzelanczyk_asemblacja
ls flye_out | grep -E '\.gfa$'
Bandage

And in Bandage we select the file.

The assembly graph (GFA) is interesting and matches what was predicted, because with one contig it creates in practice a single dominant structure without branching, which suggests that repeats were effectively resolved and an almost complete chromosome sequence was obtained.

The problem turned out to be trivial and practically model to present as an example.

What does the BLAST panel in Bandage mean?¶

I was a bit curious what this BLAST on the left side at the bottom means. From what I read, in Bandage this BLAST panel is used so that I can attach biological information to the assembly graph. I provide a query sequence (e.g. 16S rRNA, a resistance gene, a plasmid marker). Bandage runs BLAST locally and then: - highlights graph nodes that contain hits, - allows filtering hits by E-value, % identity, alignment length, - allows describing the graph with labels like “here is 16S”, “here is blaTEM”.

But in this case I have a graph with 1 node and one loop, so practically a complete chromosome.

In such a graph BLAST will be less visually impressive, because whatever I find will land on the same node.

It is still useful for checks such as:

is this really B. subtilis,
do I have 16S (The 16S rRNA gene is almost always present in them. It is used for identification of microorganisms and studying microbiome composition without cultivation.),
are there genes X.

I took the first 1500 bp from our fasta and in the alignment you can see that this element repeats through some places.

With 1 node the visualization is not spectacular, because everything is on the same loop anyway.

In a graph with many nodes BLAST is much more interesting, because it shows which nodes belong to which genes, plasmids, contamination.

Detecting repeats: 16S rRNA (or other rRNA)¶

The most interesting in a bacterial genome are rRNA operons, because they are multi-copy and often cause problems in graphs.

I will use barrnap for this, a tool for detecting rRNA.

https://github.com/tseemann/barrnap

Results:

barrnap --kingdom bac --outseq rrna_hits.fasta flye_out/assembly.fasta > rrna_hits.gff

(barrnap log excerpt is shown in the PDF.)

Here we have written how many 16S it detected. This is an interesting situation, because 16S is multi-copy and in Bandage BLAST should give many hits in different places of the contig. Hits have high sequence agreement (often ~95 - 100% identity), and differences (mismatches, gap opens) indicate variants between 16S copies and possible small consensus errors after Nanopore reads.

grep -c "16S" rrna_hits.gff

And here are their headers.

We extract 16S into a separate FASTA file (to provide query to BLAST).

There are many in our genome, as can be seen from BLAST results:

grep '^>' rrna_hits.fasta | head -n 20

>16S_rRNA::contig_1:3583978-3585517(-)
>16S_rRNA::contig_1:1235367-1236905(+)
>16S_rRNA::contig_1:144158-145691(-)
>16S_rRNA::contig_1:13727-15265(-)
>16S_rRNA::contig_1:164697-166232(-)
>16S_rRNA::contig_1:3292164-3293701(-)
>16S_rRNA::contig_1:83984-85523(-)
>16S_rRNA::contig_1:8150-9687(-)
>16S_rRNA::contig_1:78107-79640(-)
>16S_rRNA::contig_1:3184-4721(-)
>23S_rRNA::contig_1:140904-143809(-)
>23S_rRNA::contig_1:161445-164350(-)
>23S_rRNA::contig_1:3580907-3583809(-)
>23S_rRNA::contig_1:1237076-1239979(+)
>23S_rRNA::contig_1:75034-77937(-)
>23S_rRNA::contig_1:3289089-3291993(-)
>23S_rRNA::contig_1:10653-13557(-)
>23S_rRNA::contig_1:80909-83815(-)
>23S_rRNA::contig_1:5084-7979(-)
>23S_rRNA::contig_1:121-3013(-)

We can notice that most are at the beginning. Many hits in different places of the contig usually mean several rRNA operons. (One hit means one copy or too strict filters.) The fact that one 16S query gives many hits in different positions shows that 16S is a repeated (multi-copy) region and constitutes a typical source of ambiguity in assembly.

Completeness assessment of the assembled genome¶

I will do this in two ways (because I liked this task and I did not want to wait for an answer. If something is off, I will correct it). There are two sensible options.

The first is fast (QUAST with reference).
The second is more “quality genomics” via CheckM (completeness and contamination on markers).

QUAST¶

QUAST is a tool for assessing assembly quality, which in reference mode (-r) aligns contigs to a known genome and splits the problem into two levels:

structural correctness, meaning whether contigs are assembled in correct order and orientation relative to the reference
correctness at the single-base level, meaning how many nucleotide differences there are after alignment

Run:

quast.py flye_out/assembly.fasta \
-r b_subtilis_reference.fasta \
-o quast_out \
-t 4

Here is the result.

The result shows that the assembly is very consistent.

1 contig of length 4,032,848 bp was obtained, which covers 99,998% of the reference genome (4,045,677 bp) and shows misassemblies = 0.

At the same time, base-level quality is typical for raw Nanopore data, because it reports 216.53 mismatch/100 kbp and 364.84 indel/100 kbp.

CheckM¶

CheckM is an entire linear workflow for assessing the quality of bins (MAGs). It first guesses what taxonomic lineage the bin belongs to, then selects for that lineage a set of marker genes, and finally calculates completeness and contamination based on those markers.

CheckM in lineage_wf works as a quality controller for a genomic bin, based on marker genes. First it scans the bin (b_subtilis) with HMM profiles to find conserved genes that are almost always present (markers). Then based on the set of found markers and their sequences it places the bin in a reference genome tree (pplacer), thanks to which it selects the correct taxonomic lineage and a dedicated set of markers for that lineage (g__Bacillus (UID864)).

infographic - gemini, nano banana pro In the end it calculates metrics: - completeness is the fraction of expected markers that it found
(example: if for Bacillus it expects 711 markers, and it effectively detects about 81%, then Completeness is 81.18, which suggests the bin is not a full genome, only a fairly good draft) - contamination is a signal that some markers occur multiple times, which usually means admixture of another organism or incorrect binning
(if most markers are in one copy, and only a few in two copies, contamination will be low like here 0.99, and if there were many markers in 2–5 copies, contamination would jump to several to a dozen percent and the bin would be suspicious)

In the table the lines “0,1,2,3,4,5+” say how many markers occur in 0 copies (missing), 1 copy (ideal), 2 copies (suspicion of duplication/contamination), etc. So here it is a simple situation: many markers in 1 copy (579), some missing (124), and very few duplications (8), which fits an incomplete but clean bin.

Setting the database root. I tried to run it but I did not have enough RAM:

checkm data setRoot /home/konrad_guest/ExternalDisk/micromamba/envs/ml/checkm_data
checkm lineage_wf -t 4 -x fasta bins checkm_out

OOM error.

CheckM in the pplacer step was interrupted by the OOM-killer due to lack of RAM, therefore the --reduced_tree variant was used.

--reduced_tree uses a smaller reference tree, usually faster and uses less RAM. This is often key, because the pplacer stage can consume memory.

Result:

Out of memory: Killed process 8684 (pplacer) total-vm:24667464k

checkm lineage_wf --reduced_tree -t 1 -x fasta bins checkm_out

Bin Id Marker lineage # genomes #markers # marker sets 0 1 2 3 4 5+ Completeness Contamination Strain heterogeneity
b_subtilis g__Bacillus (UID864) 93 711 241 124 579 8 0 0 0 81.18 0.99 0.00

Strain heterogeneity is an indicator of whether extra copies of marker genes in the bin look like very similar variants of the same species (mixture of strains), or rather something more distant (typical contamination by another organism). CheckM looks at markers that came out in more than one copy, compares their sequences with each other and assesses whether these copies are almost identical. If yes, it suggests that the bin contains several strains of the same species and strain heterogeneity increases. If the copies are more different, it fits admixture of another taxon and then contamination itself increases more.

Annotation of protein-coding genes¶

From what I searched, the simplest way is to do it through Prokka (Rapid prokaryotic genome annotation). It works locally and gives a complete set of files for further analysis.

Prokka is an automatic pipeline for annotating a prokaryotic genome.

It takes contigs from assembly.fasta and first finds potential genes and other functional elements, meaning it predicts protein-coding loci and selected RNAs.

Then it translates predicted CDS into protein sequences and assigns functions to them through matching to built-in databases and profiles, meaning it searches for similarity to known proteins and conserved domains, and compiles the results into a consistent annotation.

The parameters --kingdom Bacteria narrow models and databases to bacteria, and --genus Bacillus --species subtilis help select the most accurate references and naming, which usually improves the quality of function labels.

At the end Prokka writes a complete set of output files in prokka_out with prefix b_subtilis, usually in formats used later in analyses, such as an annotation file (GFF/GBK), a list of genes and proteins (FASTA), and summary tables.

infographic - gemini, nano banana pro

Run:

prokka flye_out/assembly.fasta \
--outdir prokka_out \
--prefix b_subtilis \
--cpus 4 \
--kingdom Bacteria \
--genus Bacillus \
--species subtilis

It found, and we have:

grep -c '^>' prokka_out/b_subtilis.faa

CDS 8189
rRNA 29
tRNA 82
tmRNA 1

cat prokka_out/b_subtilis.txt

organism: Bacillus subtilis strain
contigs: 1
bases: 4032848
CDS: 8189
rRNA: 29
repeat_region: 1

All files at the end of the project¶

b_subtilis.gff  - main annotation file (gene positions, feature types, products)
b_subtilis.faa  - protein sequences (predicted CDS)
b_subtilis.ffn  - nucleotide sequences of genes (CDS)
b_subtilis.gbk  - GenBank
b_subtilis.tsv  - annotation table
b_subtilis.tbl  - format for NCBI submission

Presentation¶

I presented the work at the Poznan Univeristy Of Technology for a functional genomics course. Here they are: Download Presentation