HUMAnN Workshop on Genomics 2023

Welcome to the HUMAnN 3.6 tutorial

HUMAnN (the HMP Unified Metabolic Analysis Network) is a method for efficiently and accurately profiling the abundance of microbial metabolic pathways and other molecular functions from metagenomic or metatranscriptomic sequencing data. It is appropriate for any type of microbial community, not just the human microbiome (the name "HUMAnN" is a historical product of the method's origins in the Human Microbiome Project).

HUMAnN 3.0 can be installed 1) from PyPI, 2) as source code from the HUMAnN Github repository, or 3) using our quickstart documentation.

Support for HUMAnN is available via the HUMAnN channel of the bioBakery Support Forum.

Overview

1. Setup

1.1 Requirements

Python (>= 3.7)
MetaPhlAn (version >= 3.0)
Bowtie2 (version >= 2.2)
DIAMOND (=0.9.36)
MinPath

NOTE: Bowtie2, DIAMOND, and MinPath are automatically installed when installing HUMAnN. If these dependencies do not appear to be installed after installing HUMAnN with pip, it might be that your environment is setup to use wheels instead of installing from source. HUMAnN must be installed from source for it to also be able to install dependencies. To force pip to install HUMAnN from source, add one of the following options to your install command, --no-use-wheel or --no-binary :all:.

1.2 Installation

NOTE: If you are using a machine image as a part of the course, you do not need to install HUMAnN because the software and its dependencies are already available and you can move directly to the next section

1.2.1 Install HUMAnN

The easiest way to install HUMAnN is with pip, Conda, or Docker. For other options, please refer to the HUMAnN User Manual.

To install with pip:

pip install humann

You can also install via Conda, which will additionally install MetaPhlAn and utility dependencies. Before doing so, make sure to configure your channels so that biobakery is at the top of the list:

conda install -c biobakery humann

By default, a pip (or other) HUMAnN install does not include the full-scale nucleotide and protein sequence databases needed for real-world metagenome or metatranscriptome profiling, as these are quite large (several GBs). The humann_databases script can be used to fetch these from the web and connect them to your HUMAnN installation. For tutorial purposes, we'll be using (small) demo databases, which are also fetchable with humann_databases (if not already available).

humann_databases --download chocophlan DEMO humann_dbs
humann_databases --download uniref DEMO_diamond humann_dbs

The commands above download the demo-scale databases to a folder named humann_dbs in the current working directory. Please note that you can download these (and full-scale) databases to any location on your machine: HUMAnN will track where you have placed the databases and update its configuration file.

After installation from pip or another source, you may optionally test your local HUMAnN environment (this takes ~1 minute):

humann_test

Which yields (highly abbreviated):

test_calculate_percent_identity (basic_tests_nucleotide_search.TestBasicHumann2NucleotideSearchFunctions) ... ok
test_calculate_percent_identity_insert_delete (basic_tests_nucleotide_search.TestBasicHumann2NucleotideSearchFunctions) ... ok
test_calculate_percent_identity_missing_cigar_identifier (basic_tests_nucleotide_search.TestBasicHumann2NucleotideSearchFunctions) ... ok
test_calculate_percent_identity_missing_md_field (basic_tests_nucleotide_search.TestBasicHumann2NucleotideSearchFunctions) ... ok
test_calculate_percent_identity_multiple_M_cigar_fields (basic_tests_nucleotide_search.TestBasicHumann2NucleotideSearchFunctions) ... ok
...

1.2.2 Install MetaPhlAn

HUMAnN provides organism-specific community functional profiles, and to do so it first detects the organisms present in a community using MetaPhlAn. As MetaPhlAn is thus a prerequisite for running the HUMAnN software, please ensure that you've installed MetaPhlAn before running HUMAnN. You can test if MetaPhlAn is available in your system path by executing metaphlan --version.

2. Metagenome functional profiling

2.1 HUMAnN input formats

HUMAnN can start from a few different types of input data each in a few different types of formats:

Quality-controlled shotgun sequencing reads
- This is the most common starting point
- A metagenome (DNA reads) or metatranscriptome (RNA reads)
- Formats: fastq, fastq.gz, fasta, or fasta.gz
Pre-computed mappings of reads to database sequences
- Formats: sam, bam, or blastm8
Pre-computed (typically gene) abundance tables
- Formats: tsv or biom

This tutorial will use demo input files distributed with the HUMAnN installation under the examples directory. If you installed with pip or conda and don't have these handy, you can obtain copies by right-clicking these links and selecting "save link as":

2.2 Running HUMAnN: the basics

If you are running this tutorial as part of a short course, all input files have been pre-downloaded into the cloud instances. To run this tutorial from the correct location, enter:

cd ~/workshop_materials/humann/HUMAnN_tutorial

A canonical HUMAnN run uses nucleotide mapping and translated search to provide organism-specific gene and pathway abundance profiles from a single metagenome. By default, gene families are annotated using UniRef90 definitions and pathways using MetaCyc definitions. From the examples folder (or the location where you saved demo.fastq.gz above) execute the following:

conda activate humann36

humann --input demo.fastq.gz --output demo_fastq

(Note1: If you have more than 1 CPU available for computation, you can use N of them by adding --threads N to the above command.)

This command yields (blank lines removed):

Creating output directory: demo_fastq
Output files will be written to: demo_fastq
Running metaphlan ........
Found g__Bacteroides.s__Bacteroides_dorei : 57.96% of mapped reads
Found g__Bacteroides.s__Bacteroides_vulgatus : 42.04% of mapped reads
Total species selected from prescreen: 2
Selected species explain 100.00% of predicted community composition
Creating custom ChocoPhlAn database ........
Running bowtie2-build ........
Running bowtie2 ........
Total bugs from nucleotide alignment: 2
g__Bacteroides.s__Bacteroides_vulgatus: 1195 hits
g__Bacteroides.s__Bacteroides_dorei: 1260 hits
Total gene families from nucleotide alignment: 545
Unaligned reads after nucleotide alignment: 88.3095238095 %
Running diamond ........
Aligning to reference database: uniref90_demo_prots_v201901b.dmnd
Total bugs after translated alignment: 3
g__Bacteroides.s__Bacteroides_vulgatus: 1195 hits
g__Bacteroides.s__Bacteroides_dorei: 1260 hits
unclassified: 252 hits
Total gene families after translated alignment: 559
Unaligned reads after translated alignment: 87.2000000000 %
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created: 
demo_fastq/demo_genefamilies.tsv
demo_fastq/demo_pathabundance.tsv
demo_fastq/demo_pathcoverage.tsv

Note the three phases of HUMAnN's tiered search in the above log: 1) identifying community species (with MetaPhlAn), 2) mapping reads to community pangenomes (with bowtie2), and 3) aligning unmapped reads to a protein database (with DIAMOND).

Why is nucleotide-level search generally faster than translated search?
What makes nucleotide-level search extra fast in HUMAnN?

2.3 HUMAnN default outputs

(Note: The outputs in this tutorial were generated using HUMAnN 3.0.0 with its corresponding demo databases. If you are running a different version of HUMAnN or are using a different set of databases, the outputs might differ.)

When HUMAnN is run from any input type, three main output files will be created:

$OUTPUT_DIR/$SAMPLE_genefamilies.tsv
$OUTPUT_DIR/$SAMPLE_pathabundance.tsv
$OUTPUT_DIR/$SAMPLE_pathcoverage.tsv

For the demo.fastq.gz file analyzed above, these look like:

demo_fastq/demo_genefamilies.tsv
demo_fastq/demo_pathabundance.tsv
demo_fastq/demo_pathcoverage.tsv

Let's inspect the contents of the individual files:

demo_genefamilies.tsv

This file contains the abundances of each gene family in the community in reads per kilobase (RPK) units. Each gene family's total abundance is broken down into the contributions from individual organisms when possible (delineated by the pipe character). For example:

UniRef90_G1UL42|g__Bacteroides.s__Bacteroides_dorei

Read mappings generated during (organism-agnostic) translated search are assigned to an unclassified bin. We refer to this format as "stratified output" and the individual per-organism and unclassified contributions are referred to as "stratifications". By default, gene families are identified by their unique UniRef90 identifiers (e.g. UniRef90_G1UL42 in the example above). While these are convenient for computing, they are not very easy to interpret by eye. We'll see how to attach human-readable names to UniRef90 (and other) identifiers in a subsequent section.

You can also learn more about a given UniRef family by looking up its representative protein on UniProt thusly (note the removal of the UniRef90_ prefix):

https://www.uniprot.org/uniprot/G1UL42

Reads that map to gene families that lack UniRef90 identifiers are represented as UniRef90_unknown while unmapped reads are collected as UNMAPPED. The total coverage depth (RPK) of the sample is thus represented by summing either 1) UNMAPPED, UniRef90_unknown, and all UniRef IDs prior to organism-specific decomposition, or 2) UNMAPPED, UniRef90_unknown|<organism>, and all other UniRef_*|<organism> IDs. The sum of either set of values can then be normalized to CPMs (Copies Per Million) or relative abundances.

To inspect the demo_genefamilies.tsv file, use:

column -t -s $'\t' demo_fastq/demo_genefamilies.tsv | less -S

Which yields:

# Gene Family                                               demo_Abundance-RPKs
UNMAPPED                                                    18312.0000000000
UniRef90_G1UL42                                             333.3333333333
UniRef90_G1UL42|g__Bacteroides.s__Bacteroides_dorei         333.3333333333
UniRef90_I9QXW8                                             333.3333333333
UniRef90_I9QXW8|g__Bacteroides.s__Bacteroides_dorei         333.3333333333
UniRef90_A0A174QBF2                                         200.0000000000
UniRef90_A0A174QBF2|g__Bacteroides.s__Bacteroides_vulgatus  200.0000000000
UniRef90_A0A078RDY6                                         166.6666666667
UniRef90_A0A078RDY6|g__Bacteroides.s__Bacteroides_vulgatus  166.6666666667
UniRef90_G1ULL9                                             166.6666666667
UniRef90_G1ULL9|g__Bacteroides.s__Bacteroides_vulgatus      166.6666666667
UniRef90_A0A015QIN1                                         66.6666666667
UniRef90_A0A015QIN1|g__Bacteroides.s__Bacteroides_vulgatus  66.6666666667

Let's find some of the "unclassified" stratifications in this file using the grep command:

grep unclassified demo_fastq/demo_genefamilies.tsv | less -S

Which yields:

UniRef90_Q9ZUH4|unclassified  18.3746362785
UniRef90_W8YTG4|unclassified  13.9024544125
UniRef90_Q6XMI3|unclassified  13.5681161633
UniRef90_U5FT06|unclassified  13.4157878980
UniRef90_Z9K4C5|unclassified  12.7418650404
UniRef90_U3KI10|unclassified  12.3842423992
UniRef90_B8RBV1|unclassified  12.1442525829
UniRef90_U3KP22|unclassified  11.6963164253
UniRef90_Z9JLB1|unclassified  11.5045104368
UniRef90_Q8GT21|unclassified  11.2241858840

HUMAnN can quantify protein families at 90 or 50% identity (via mapping to UniRef90 and UniRef50 representatives, respectively). What are the advantages of quantifying UniRef90 gene families?
UniRef50 families?

demo_pathabundance.tsv

This file lists the abundances of each pathway in the community, also in RPK units as described for gene families. For details on the calculation of pathway abundances from constituent gene family abundances, see the HUMAnN User Manual. Like gene families, pathways are broken down into per-organism contributions when possible (again delineated by the | character), with the total read abundance consisting of pathways assigned to organisms, UNMAPPED reads, and UNINTEGRATED reads that are attributable to specific gene families (typically also within specific organisms), but for gene families that are not annotated to any pathway in the corresponding database (e.g. MetaCyc).

As with gene families, pathway RPKs can be normalized to copies per million or relative abundances by summing either 1) UNMAPPED reads, UNINTEGRATED organism-specific features, and organism-specific (i.e. pipe-containing) pathways, or 2) UNMAPPED reads, total UNINTEGRATED features, and pathways that are not broken down per organism.

To inspect the demo_pathabundance.tsv file, use:

column -t -s $'\t' demo_fastq/demo_pathabundance.tsv | less -S

Which yields:

# Pathway                                                                   demo_Abundance
UNMAPPED                                                                    9520.1365299060
UNINTEGRATED                                                                4236.5362970855
UNINTEGRATED|unclassified                                                   54.1292645042
PWY-5423: oleoresin monoterpene volatiles biosynthesis                      18.3746362785
PWY-5423: oleoresin monoterpene volatiles biosynthesis|unclassified         18.3746362785
PWY-4203: volatile benzenoid biosynthesis I (ester formation)               13.3772872602
PWY-4203: volatile benzenoid biosynthesis I (ester formation)|unclassified  13.3772872602

demo_pathcoverage.tsv

(Note to instructors: Pathway coverage output is being revised; you may skip reviewing it during tutorials pending these updates.)

Unlike gene family abundance, stratified pathway abundance does not necessarily sum to yield the community total pathway abundance. Why not? (Hint: HUMAnN estimates the number of complete pathways in the community or a given stratification.)

2.4 Running HUMAnN: pre-mapped SAM/BAM input

The raw results of mapping metagenome/metatranscriptome reads against a gene family nucleotide database can be saved as a SAM/BAM file and used to run HUMAnN after the mapping (e.g. bowtie2) step. The only requirement is that gene family names provided in such a file be annotated using either 1) identifiers compatible with HUMAnN's default mapping to UniRef, or 2) UniRef identifiers directly. Unmapped reads in such an input file are searched against a non-organism-specific protein database as are unmapped reads in a "full" HUMAnN run. Inspect the demo sam file:

less -S demo.sam

We can transpose and view a single row (per-read alignment) as followed:

head -1 demo.sam | tr "\t" "\n"

The head -1 command extracts the first line of the file, while the tr command replaces tabs with newlines:

s__Bacteroides_dorei_read000001
16
357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_R7NX76|UniRef50_R5GDB8|927
423
11
100M
*
0
0
TTTCGTATTAGTGGAGTATGGCATTGAAAGTACGGATGATGACACGTTGCGCAGGATAAACCGGGGACATACTTTTGCCGTTTCTGCAGAGGCTGTTCGG
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
AS:i:-12
XS:i:-30
XN:i:0
XM:i:2
XO:i:0
XG:i:0
NM:i:2
MD:Z:3T38A57
YT:Z:UU

We can provide this sam file as input to HUMAnN using a command similar to the one used above for the fastq file:

humann --input demo.sam --output demo_sam

which yields (blank lines removed):

Creating output directory: demo_sam
Output files will be written to: demo_sam
Process the sam mapping results ...
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created: 
demo_sam/demo_genefamilies.tsv
demo_sam/demo_pathabundance.tsv
demo_sam/demo_pathcoverage.tsv

Note the creation of the three expected output files.

Gap filling allows us to quantify a pathway even when a small number of its reactions are conspicuously absent. What are some potential pros and cons of this approach in real data?

2.5 Running HUMAnN: pre-computed protein blastx M8 input

HUMAnN typically searches metagenome/metatranscriptome reads against two databases: first a nucleotide database using rapid mapping, then a broader protein database using more computationally expensive translated search. One can execute HUMAnN using either search alone, however; using nucleotides only is more computationally efficient but less sensitive, and using translated search alone can emulate HUMAnN 1.0 output or be carried out even for very novel or unusual microbial communities (i.e. those with few reference genomes available).

If BLAST tab-delimited M8 format search results have already been generated for HUMAnN-compatible protein IDs (UniRef90 by default), they can be run by providing the appropriate input format. Let's examine such a file:

less -S demo.m8

Which yields:

s__Bacteroides_vulgatus_read000635|100  UniRef90_X8JRH1|978     57.9    19      8       0       95      39      189     207     2.0e-01 21.6
unclassified_read000001|100  UniRef90_W8YTG4|1614  90.6  32  3   0  98   3   382  413  9.0e-15  65.9
unclassified_read000002|100  UniRef90_U5FT06|1224  90.9  33  3   0  100  2   127  159  2.1e-14  64.7
unclassified_read000003|100  UniRef90_V5V2L3|2169  90.6  32  3   0  3    98  388  419  4.5e-14  63.5
unclassified_read000004|100  UniRef90_U5FT06|1224  90.6  32  3   0  99   4   148  179  4.6e-14  63.5
unclassified_read000005|100  UniRef90_X8FHU5|1080  90.6  32  3   0  98   3   240  271  3.1e-15  67.4
unclassified_read000006|100  UniRef90_Z9JXD8|1428  90.6  32  3   0  98   3   22   53   2.6e-14  64.3
unclassified_read000007|100  UniRef90_Q6XMI3|1137  90.6  32  3   0  3    98  155  186  2.0e-14  64.7
unclassified_read000007|100  UniRef90_Q9SPV4|1077  65.6  32  11  0  3    98  129  160  1.5e-09  48.5
unclassified_read000008|100  UniRef90_Z9JUT2|1308  90.6  32  3   0  98   3   222  253  2.4e-15  67.8
unclassified_read000009|100  UniRef90_X5M7Z0|1134  93.9  33  2   0  1    99  202  234  2.9e-16  70.9

To run HUMAnN using this file as input, use:

humann --input demo.m8 --output demo_m8

This again produces the three expected outputs (blank lines removed):

Creating output directory: /demo_m8
Output files will be written to: /demo_m8
Process the blastm8 mapping results ...
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created: 
/demo_m8/demo_genefamilies.tsv
/demo_m8/demo_pathabundance.tsv
/demo_m8/demo_pathcoverage.tsv

The BLAST m8 output fields allow us to compute the coverage of the database sequence by a single sequencing read (subject coverage) and the coverage of the read by the database sequence (query coverage). Why is subject coverage almost always smaller than query coverage?

2.6 Running HUMAnN: skipping translated search

Conversely, to run HUMAnN without any translated protein search (i.e. using rapid nucleotide mapping alone), one can provide the --bypass-translated-search flag. This applies to FASTA, FASTQ, SAM/BAM, or other raw nucleotide inputs:

humann --input demo.fastq.gz --output demo_fastq_fast --bypass-translated-search

Note the absence of a diamond (translated search) step in this run compared to the demo_fastq analysis above:

Creating output directory: demo_fastq_fast
Output files will be written to: demo_fastq_fast
Running metaphlan ........
Found g__Bacteroides.s__Bacteroides_dorei : 58.10% of mapped reads
Found g__Bacteroides.s__Bacteroides_vulgatus : 41.90% of mapped reads
Total species selected from prescreen: 2
Selected species explain 100.00% of predicted community composition
Creating custom ChocoPhlAn database ........
Running bowtie2-build ........
Running bowtie2 ........
Total bugs from nucleotide alignment: 2
g__Bacteroides.s__Bacteroides_vulgatus: 1195 hits
g__Bacteroides.s__Bacteroides_dorei: 1260 hits
Total gene families from nucleotide alignment: 545
Unaligned reads after nucleotide alignment: 88.3095238095 %
Bypass translated search
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:
demo_fastq_fast/demo_genefamilies.tsv
demo_fastq_fast/demo_pathabundance.tsv
demo_fastq_fast/demo_pathcoverage.tsv

It is informative to compare the results of this demo file with and without translated search; even in a small, subsampled input, including amino acid search is more computationally costly but recovers additional gene families. To count unique gene families in the original file, execute:

cut -f1 demo_fastq/demo_genefamilies.tsv | tail -n +3 | grep -v "|" | sort -u | wc -l

What is this command doing?

cut -f1 extracts the first column of the file, i.e. gene identifiers;

tail -n +3 removes the first two lines of the file, i.e. the header and UNMAPPED;

grep -v "|" removes the stratified headers;

sort -u isolates the unique headers;

wc -l counts those headers.

The resulting answer is 559 (unique gene families). The same command executed on the file demo_fastq_fast/demo_genefamilies.tsv yields 545 unique gene families. We infer that 559 - 545 = 14 additional gene families were uncovered using translated search. (On a non-demo metagenome, the relative number of additional gene families uncovered by translated search would be far larger!)

In general, any HUMAnN intermediate file can be saved and used to resume processing from an intermediate point in the analysis pipeline, and several steps including taxonomic pre-screening, nucleotide search, or amino acid search can be independently skipped. For more information, see the relevant sections of the HUMAnN User Manual.

Under what circumstances would skipping translated search be reasonable?
Under what circumstances would skipping translated search be a bad idea?

3. Manipulating HUMAnN output tables

3.1 Normalizing RPKs to relative abundance

To facilitate comparisons between samples with different sequencing depths, it is important to normalize HUMAnN's default RPK values prior to performing statistical analyses. Sum-normalization to relative abundance or "copies per million" (CPM) units are common approaches for this; these methods are implemented in the humann_renorm_table script (with special considerations for HUMAnN's output formats, e.g. feature stratification). Let's normalize our gene abundances to CPM units:

humann_renorm_table --input demo_fastq/demo_genefamilies.tsv \
    --output demo_fastq/demo_genefamilies-cpm.tsv --units cpm --update-snames

Inspect the output file demo_fastq/demo_genefamilies-cpm.tsv using the less command as we did above:

# Gene Family                                               demo_Abundance-CPM
UNMAPPED                                                    690444
UniRef90_G1UL42                                             12568.2
UniRef90_G1UL42|g__Bacteroides.s__Bacteroides_dorei         12568.2
UniRef90_I9QXW8                                             12568.2
UniRef90_I9QXW8|g__Bacteroides.s__Bacteroides_dorei         12568.2
UniRef90_A0A174QBF2                                         7540.89
UniRef90_A0A174QBF2|g__Bacteroides.s__Bacteroides_vulgatus  7540.89
UniRef90_A0A078RDY6                                         6284.08
UniRef90_A0A078RDY6|g__Bacteroides.s__Bacteroides_vulgatus  6284.08
UniRef90_G1ULL9                                             6284.08
UniRef90_G1ULL9|g__Bacteroides.s__Bacteroides_vulgatus      6284.08
UniRef90_A0A015QIN1                                         2513.63
UniRef90_A0A015QIN1|g__Bacteroides.s__Bacteroides_vulgatus  2513.63

Note the difference in scale: units tend to be larger than in the original file as a result of being scaled to sum to 1 million. Also note that the sample header has been updated from demo_Abundance-RPK to demo_Abundance-CPM courtesy of the --update-snames flag.

By default, humann_renorm_table 1) forces the community total abundances to sum to 1 million and then 2) adjusts the stratifications so that they continue to behave as fractions of community total abundances (see humann_renorm_table -h for additional options). Let's verify that the unstratified abundance values sum up to 1 million:

grep -v "#" demo_fastq/demo_genefamilies-cpm.tsv | grep -v "|" | cut -f2 | paste -s -d+ - | bc

What is this command doing?

grep -v "#": excludes lines containing "#", in this case the file's header;

grep -v "|": excludes lines containing "|", which mark stratified abundances;

cut -f2: extracts the second column of the file containing the abundances;

paste -s -d+ -: converts the column of numbers to a summation (1 + 2 + 3 + ...);

bc: sums the numbers.

The final answer should be very close to 1 million (it may be slightly off due to rounding error). A small modification of the above command can be used to find the sum of the stratified values (this sum will be < 1,000,000 unless you also account for the UNMAPPED value).

HUMAnN natively normalizes gene family abundance by gene length. Why is this step important? (In other words, why are RPK units more interpretable than raw read counts?)

3.2 Regrouping genes to other functional categories

HUMAnN's default "units" of microbial function are UniRef gene families (which we use internally to compute reaction abundances, and from there pathway abundances). However, starting from gene family abundance, it is possible to reconstruct the abundance of other functional categories in a microbiome using the humann_regroup_table script. Inspect the regrouping options using the humann_regroup_table -h command.

Let's regroup our CPM-normalized gene family abundance values to MetaCyc reaction (RXN) abundances, which are bundled with the default HUMAnN installation:

humann_regroup_table --input demo_fastq/demo_genefamilies-cpm.tsv \
    --output demo_fastq/rxn-cpm.tsv --groups uniref90_rxn

We are given the output message:

Original Feature Count: 560; Grouped 1+ times: 60 (10.7%); Grouped 2+ times: 13 (2.3%)

Indicating that the original file contained 560 features (UniRef90 gene families and UNMAPPED). Of these, 60 were annotated to a RXN, and 13 were annotated to more than one RXN. Inspect the output file demo_fastq/rxn-cpm.tsv using the less command:

# Gene Family                                        demo_Abundance-CPM
UNMAPPED                                             690444.0
UNGROUPED                                            290292.34000000026
UNGROUPED|g__Bacteroides.s__Bacteroides_dorei        139621.93409999993
UNGROUPED|g__Bacteroides.s__Bacteroides_vulgatus     148318.30399999986
UNGROUPED|unclassified                               2352.097
2.5.1.19-RXN                                         233.716
2.5.1.19-RXN|g__Bacteroides.s__Bacteroides_dorei     133.704
2.5.1.19-RXN|g__Bacteroides.s__Bacteroides_vulgatus  100.012
2.7.13.3-RXN                                         236.729
2.7.13.3-RXN|g__Bacteroides.s__Bacteroides_dorei     172.166
2.7.13.3-RXN|g__Bacteroides.s__Bacteroides_vulgatus  64.5624
2.7.7.13-RXN                                         398.989
2.7.7.13-RXN|g__Bacteroides.s__Bacteroides_dorei     119.697
2.7.7.13-RXN|g__Bacteroides.s__Bacteroides_vulgatus  279.292

We see that UniRef90 features have been regrouped as RXN features, such as 2.5.1.19-RXN. A new stratified feature also appears in the output: UNGROUPED. This feature reports the total abundance of features that were not otherwise included in a group (similar to the UNINTEGRATED feature of the pathways output). If each UniRef90 gene family were regrouped into at most one RXN, the total community abundance of features in the regrouped file (including UNGROUPED and UNMAPPED) would still be 1 million.

The actual sum here (1,024,360) is slightly higher than 1 million due to the 13 UniRef90 features whose abundances were counted in two (or more) different RXN features. You can verify this by modifying the summation command we used above:

grep -v "#" demo_fastq/rxn-cpm.tsv | grep -v "|" | cut -f2 | paste -s -d+ - | bc

You could force the RXN abundances themselves to sum to 1 million by running humann_renorm_table on the above table, or by regrouping UniRef90s to RXNs before renormalizing their abundances (and then normalizing at the RXN stage). The results will be the same! (That said, in practice, we typically normalize only once at the gene family level to adjust for initial differences in sample sequencing depth.)

When regrouping genes to RXNs (or ECs or KOs), we sum over the gene sequences belonging to a particular group. Under what circumstance would it make sense to average over the members of a group rather than sum? (Hint: the answer is related to the difference between the "is_a" and "part_of" relationships used in the Gene Ontology.)

3.3 Attaching names to features

So far we've been looking at a bunch of UniRef90 codes and RXN codes. While HUMAnN is happy to work with these all day, it's typically helpful to attach some human-readable descriptions of these IDs to facilitate biological interpretation. This can be accomplished with the humann_rename_table script. Let's attach names to the RXN table we generated above:

humann_rename_table --input demo_fastq/rxn-cpm.tsv \
    --output demo_fastq/rxn-cpm-named.tsv --names metacyc-rxn

We are told that the script Renamed 116 of 118 entries (98.31%), suggesting that most features were successfully renamed (indeed, the two features that did not receive new names were the special UNMAPPED and UNGROUPED features). Let's confirm this by inspecting the new output table with less:

# Gene Family                                                                                                          demo_Abundance-CPM
UNMAPPED                                                                                                               690444.0
UNGROUPED                                                                                                              290292.34000000026
UNGROUPED|g__Bacteroides.s__Bacteroides_dorei                                                                          139621.93409999993
UNGROUPED|g__Bacteroides.s__Bacteroides_vulgatus                                                                       148318.30399999986
UNGROUPED|unclassified                                                                                                 2352.097
2.5.1.19-RXN: (expasy) 3-phosphoshikimate 1-carboxyvinyltransferase [2.5.1.19]                                         233.716
2.5.1.19-RXN: (expasy) 3-phosphoshikimate 1-carboxyvinyltransferase [2.5.1.19]|g__Bacteroides.s__Bacteroides_dorei     133.704
2.5.1.19-RXN: (expasy) 3-phosphoshikimate 1-carboxyvinyltransferase [2.5.1.19]|g__Bacteroides.s__Bacteroides_vulgatus  100.012
2.7.13.3-RXN: (expasy) Histidine kinase [2.7.13.3]                                                                     236.729
2.7.13.3-RXN: (expasy) Histidine kinase [2.7.13.3]|g__Bacteroides.s__Bacteroides_dorei                                 172.166
2.7.13.3-RXN: (expasy) Histidine kinase [2.7.13.3]|g__Bacteroides.s__Bacteroides_vulgatus                              64.5624
2.7.7.13-RXN: (expasy) Mannose-1-phosphate guanylyltransferase [2.7.7.13]                                              398.989
2.7.7.13-RXN: (expasy) Mannose-1-phosphate guanylyltransferase [2.7.7.13]|g__Bacteroides.s__Bacteroides_dorei          119.697
2.7.7.13-RXN: (expasy) Mannose-1-phosphate guanylyltransferase [2.7.7.13]|g__Bacteroides.s__Bacteroides_vulgatus       279.292

It turns out that our friend 2.5.1.19-RXN is actually 3-phosphoshikimate 1-carboxyvinyltransferase, though you probably already knew that... In the event that a name can't be found for a non-special feature (e.g. UNMAPPED), the script will assign it the name NO_NAME for consistency.

The regrouping and renaming options used above are limited for demo purposes. Just as can install full-scale nucleotide and protein databases for real-world HUMAnN use, you can install a full-scale "utility mapping" database to help out with your downstream analyses. See the humann_databases script for acquiring these databases.

From a statistical testing perspective, what could be an advantage of regrouping raw gene abundance to a smaller list of gene groups?
What is a more practical advantage of regrouping?

4. Advanced HUMAnN topics

4.1 HUMAnN intermediate files

It is in some cases useful to understand the collection of intermediate files generated by the individual steps of a typical HUMAnN run:

ls -lh demo_fastq/demo_humann_temp/

Which yields:

-rw-rw-r-- 1  15M May 20 14:26 demo_bowtie2_aligned.sam
-rw-rw-r-- 1 3.9M May 20 14:26 demo_bowtie2_aligned.tsv
-rw-rw-r-- 1  17M May 20 14:26 demo_bowtie2_index.1.bt2
-rw-rw-r-- 1 5.4M May 20 14:26 demo_bowtie2_index.2.bt2
-rw-rw-r-- 1 225K May 20 14:26 demo_bowtie2_index.3.bt2
-rw-rw-r-- 1 5.4M May 20 14:26 demo_bowtie2_index.4.bt2
-rw-rw-r-- 1  17M May 20 14:26 demo_bowtie2_index.rev.1.bt2
-rw-rw-r-- 1 5.4M May 20 14:26 demo_bowtie2_index.rev.2.bt2
-rw-rw-r-- 1 2.5M May 20 14:26 demo_bowtie2_unaligned.fa
-rw-rw-r-- 1  27M May 20 14:26 demo_custom_chocophlan_database.ffn
-rw-rw-r-- 1  88K May 20 14:26 demo_diamond_aligned.tsv
-rw-rw-r-- 1 2.4M May 20 14:26 demo_diamond_unaligned.fa
-rw-rw-r-- 1  24K May 20 14:27 demo.log
-rw-rw-r-- 1  19K May 20 14:25 demo_metaphlan_bowtie2.txt
-rw-rw-r-- 1 2.3K May 20 14:26 demo_metaphlan_bugs_list.tsv

These include SAM files containing the alignment of input reads against the default nucleotide database, which are first reduced by counting up and tabulating the hits to each annotated gene family. It can be useful to reprocess either the reads unaligned after nucleotide mapping (bowtie2_unaligned.fa) or, more likely, the reads still unaligned after amino acid search. Finally, the MetaPhlAn2 bowtie2.txt intermediate output is itself saved. For more information, see the HUMAnN User Manual's output file section.

4.2 HUMAnN for multiple samples

It is typical to execute HUMAnN on multiple individual samples, then merge the resulting gene families or pathways into a single tab-delimited multi-sample table. When combining or comparing any HUMAnN output, it is critical to normalize the default RPK abundance units (which will be sensitive to the sequencing depth of each sample) into CPM units (which will not be).

We can demonstrate this by downloading a collection of six subsampled input files from the Human Microbiome Project. The original files, and many others, can be downloaded from the HMP DACC.

Click on each individual subsampled file to download. Open a terminal and change directories into the folder where you have downloaded the files.

You will see something like the following contents if you run ls -lh:

-rw-r--r-- 1  2.7M Jun  4 11:33 763577454-SRS014459-Stool.fasta
-rw-r--r-- 1  2.6M Jun  4 11:33 763577454-SRS014464-Anterior_nares.fasta
-rw-r--r-- 1  2.7M Jun  4 11:33 763577454-SRS014470-Tongue_dorsum.fasta
-rw-r--r-- 1  2.7M Jun  4 11:34 763577454-SRS014472-Buccal_mucosa.fasta
-rw-r--r-- 1  2.7M Jun  4 11:34 763577454-SRS014476-Supragingival_plaque.fasta
-rw-r--r-- 1  2.7M Jun  4 11:34 763577454-SRS014494-Posterior_fornix.fasta

Process each input file into a single output directory:

humann -i 763577454-SRS014459-Stool.fasta -o hmp_subset
humann -i 763577454-SRS014464-Anterior_nares.fasta -o hmp_subset
humann -i 763577454-SRS014470-Tongue_dorsum.fasta -o hmp_subset
humann -i 763577454-SRS014472-Buccal_mucosa.fasta -o hmp_subset
humann -i 763577454-SRS014476-Supragingival_plaque.fasta -o hmp_subset
humann -i 763577454-SRS014494-Posterior_fornix.fasta -o hmp_subset

Alternatively, if you're comfortable with shell scripting, you can use the following command to process all six at once:

for f in *.fasta; do humann -i $f -o hmp_subset; done

Regardless of how you run HUMAnN, this will result in six output directories each containing the three main analysis types (gene families, pathway abundances, and pathway coverages) along with intermediate files. Let's merge the 6 per-sample gene family abundance outputs into a single table for this dataset using the script humann_join_tables:

humann_join_tables -i hmp_subset -o hmp_subset_genefamilies.tsv --file_name genefamilies

Unfortunately the resulting file is not very interesting: because we're working with small demo databases, we don't identify much of anything as known from these subsampled HMP metagenomes. The same would not be true if we were using full-scale HUMAnN databases, which recruit the vast majority of reads from metagenomes of the human microbiome. Note also that, in a real-world application, it would be particularly important to normalize our merged file to adjust for differences in sequencing depth across the samples.

humann_renorm_table -i hmp_subset_genefamilies.tsv -o hmp_subset_genefamilies-cpm.tsv --units cpm

For more information, see the HUMAnN User Manual's description of a standard analysis workflow.

When analyzing less-well-characterized metagenomes, we recommend using a UniRef50-based HUMAnN workflow rather than the default UniRef90-based workflow. Why might that be the case, and how would you expect mapping rates in HUMAnN's three search tiers to change in that context?

5. Plotting stratified functions

For this section, we'll operate on a table of pathway abundances from ~400 samples from the Human Microbiome Project (as originally profiled by HUMAnN 2.0):

hmp_pathabund.pcl

Download or copy (if in VM: cp input/hmp_pathabund.pcl .) this file to your working directory and examine it using less:

less -S hmp_pathabund.pcl

Which yields:

FEATURE \ SAMPLE                                                                                      SRS011084    SRS011086      SRS011090
STSite                                                                                                Stool        Tongue_dorsum  Buccal_mucosa
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis                                                  0.000498359  0.000628096    0.000304951
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Acidovorax.s__Acidovorax_ebreus               0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Bacteroides.s__Bacteroides_finegoldii         0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Haemophilus.s__Haemophilus_influenzae         0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Klebsiella.s__Klebsiella_pneumoniae           0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Staphylococcus.s__Staphylococcus_epidermidis  0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Veillonella.s__Veillonella_sp_oral_taxon_158  0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|unclassified                                     7.89502e-06  1.60101e-05    0

We can see three samples from three different body sites (via the STSite metadatum). A single pathway, N10-formyl-tetrahydrofolate biosynthesis (1CMET2-PWY), and its stratifications are visible. In these three samples, the pathway can only be assigned to the unclassified stratum, and with very low abundance. In a third sample (SRS011090) the pathway is only observed at the community level (how?).

Using MaAsLin 2.0 (or another method for making statistical associations with microbiome data), we can compare community total abundances from HUMAnN tables with sample metadata. In this case, a model relating community-total pathway abundance to sample body site reveals a strong enrichment for L-homoserine and L-methionine biosynthesis (METSYN-PWY) at the three oral sites (buccal mucosa, tongue, and plaque).

Now that we have identified this enrichment, use the grep command to explore the stratifications of this pathway from the hmp_pathabund.pcl file.

grep 'METSYN-PWY' hmp_pathabund.pcl | less -S

Do the major contributing species seem reasonable for an orally-enriched pathway?

HUMAnN includes a utility humann_barplot to assist with visualizing stratified outputs. Let's use this tool to make a plot of the differentially abundant pathway we identified above:

humann_barplot --input hmp_pathabund.pcl --focal-metadata STSite --last-metadata STSite \
    --output plot1.png --focal-feature METSYN-PWY

Use see plot1.png to open the resulting figure if working in a graphical environment.)

Not terribly informative... Let's try adding a "sort" on the summed stratified abundance:

humann_barplot --input hmp_pathabund.pcl --focal-metadata STSite --last-metadata STSite \
    --output plot2.png --focal-feature METSYN-PWY --sort sum

A pattern has started to emerge: we can clearly see that the oral body sites are enriched on the left (high) end of the plot. We can continue this line of analysis by adding an additional sort on body site (metadata) and "logstack" scaling (i.e. scaling the tops of the stacked bars in proportion to the log of community abundance, while scaling species contributions linearly within total bar height):

humann_barplot --input hmp_pathabund.pcl --focal-metadata STSite --last-metadata STSite \
    --output plot3.png --focal-feature METSYN-PWY --sort sum metadata --scaling logstack

We see now that the pathway is most strongly enriched in the buccal mucosa, where it is contributed almost exclusively by Streptococcus species. The pathway's relative abundance in the other two oral sites is generally lower (by about an order of magnitude, as emphasized on the log scale), and is more often "unclassified" at the tongue body site.

How could you find other pathways with a similar abundance distribution in this functional profile?
What might be causing the "unclassified" contributors of homoserine/methionine synthesis in tongue dorsum communities?

Let's examine one additional pathway, coenzyme A biosynthesis (COA-PWY), which is more broadly conserved across body sites:

humann_barplot --input hmp_pathabund.pcl --focal-metadata STSite --last-metadata STSite \
    --output plot4.png --focal-feature COA-PWY --sort sum

Now sorting by total abundance is less informative. To remedy that, let's try some additional plotting options: 1) regrouping species to genera (--as-genera), 2) removing samples that lack the coenzyme A biosynthesis pathway (--remove-zeros), 3) and sorting on ecological similarity (--sort braycurtis):

humann_barplot --input hmp_pathabund.pcl --focal-metadata STSite --last-metadata STSite \
    --output plot5.png --focal-feature COA-PWY --sort braycurtis --scaling logstack --as-genera --remove-zeros

Notably, sorting by the similarity of pathway contributions has tended to group samples from the same body site even without the explicit metadata sort. This is because, for broadly distributed pathways, organismal contributions tend to follow overall sample ecology (which tends to be quite different from one body site to the next). This plot also includes an appearance of the dark gray "other" bin, which in this case represents the sum of contributions from named genera that were outside of the top N we selected for coloring (default N=18).

Is it surprising to see coenzyme A biosynthesis present in diverse human body site microbiomes?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HUMAnN Workshop on Genomics 2023

Welcome to the HUMAnN 3.6 tutorial

Table of Contents

Overview

1. Setup

1.1 Requirements

1.2 Installation

1.2.1 Install HUMAnN

1.2.2 Install MetaPhlAn

2. Metagenome functional profiling

2.1 HUMAnN input formats

2.2 Running HUMAnN: the basics

2.3 HUMAnN default outputs

demo_genefamilies.tsv

demo_pathabundance.tsv

demo_pathcoverage.tsv

2.4 Running HUMAnN: pre-mapped SAM/BAM input

2.5 Running HUMAnN: pre-computed protein blastx M8 input

2.6 Running HUMAnN: skipping translated search

3. Manipulating HUMAnN output tables

3.1 Normalizing RPKs to relative abundance

3.2 Regrouping genes to other functional categories

3.3 Attaching names to features

4. Advanced HUMAnN topics

4.1 HUMAnN intermediate files

4.2 HUMAnN for multiple samples

5. Plotting stratified functions

MetaPhlAn documentation

StrainPhlAn documentation

Workshop on Genomics 2023

Workshop at World of Microbiome 2023

Clone this wiki locally