Identifying the Source of Virulence | Arwin Lashawn

Goal: To identify genes or genomic features of the virulent Escherichia coli O104:H4 strain TY-2482 that could have been responsible for its extreme virulence during the 2011 outbreak in Germany and France.

Introduction

Tool chosen

The Comprehensive Antibiotic Resistance Database/The Resistance Gene Identifier (CARD/RGI)

Tool description

The CARD is a rigorously curated database of characterized, peer-reviewed resistance determinants and associated antibiotics, organized by the Antibiotic Resistance Ontology (ARO) and antimicrobial resistance (AMR) gene detection models. The CARD provides tools such as the RGI software, which predicts resistomes from protein or nucleotide data based on homology and SNP models, using reference data from the CARD.

How CARD/RGI works

For the purpose of this class, Dr. Runcie has already installed the tool on FARM, so only the few commands below are needed before running any RGI commands in Git Bash.

# to request the required resources
srun -p bit150 -t 4:00:00 --mem=16000 -n 2 -c 4 --pty bash -l

# to load conda3 module
module load conda3

# to activate the CARD environment
source activate CARD

Accepted input types: nucleotide or protein sequences
Accepted formats: FASTA and gzip
RGI processes inputs differently depending on whether they are nucleotide or protein data.

Option 1: Nucleotide sequence as input

RGI first predicts complete open reading frames (ORFs) using Prodigal (ignoring ORFs shorter than 30 bp) and then analyzes the predicted protein sequences.

This includes a secondary correction by RGI if Prodigal undercalls the correct start codon, to ensure complete AMR genes are predicted.

If Prodigal fails to predict an AMR ORF, RGI returns a false negative result.

Note: Prodigal is an unsupervised machine learning algorithm that provides fast, accurate protein-coding gene predictions in GFF3, GenBank, or Sequin table format.

Option 2: Protein sequence as input

RGI skips ORF prediction and directly uses the protein sequence
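
The two options correspond to the value passed to --input_type (the Methods section uses "contig"; "protein" is the other value). As a rough sketch only, the helper below guesses which value applies by checking whether the sequence lines contain anything other than nucleotide letters. The function name guess_input_type and the demo files are hypothetical and not part of RGI, and nucleotide files that use other IUPAC ambiguity codes would be misclassified as protein.

```shell
# Hypothetical helper (not part of RGI): guess whether a FASTA file holds
# nucleotide or protein sequences by looking at its non-header lines.
guess_input_type() {
    # Drop header lines, then look for any character that is not one of
    # the nucleotide letters A, C, G, T, U or N (case-insensitive).
    if grep -v '^>' "$1" | grep -qi '[^ACGTUN]'; then
        echo protein
    else
        echo contig
    fi
}

# Demo on two tiny made-up FASTA files.
demo_dir=$(mktemp -d)
printf '>contig_1\nATGACCGGTNNACGT\n' > "$demo_dir/nucl.fa"
printf '>protein_1\nMSIQHFRVALIPFFA\n' > "$demo_dir/prot.fa"

for f in "$demo_dir/nucl.fa" "$demo_dir/prot.fa"; do
    # Print the command instead of running it, so this works without RGI.
    echo "rgi main --input_sequence $f --output_file out --input_type $(guess_input_type "$f") --clean"
done
```

The commands are only printed, so the sketch can be tried on a machine without RGI installed.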

Currently, RGI supports four of CARD’s models. The models together with their functions are listed below:

protein homolog models: detect functional homologs of AMR genes
protein variant models: accurately differentiate between susceptible intrinsic genes and intrinsic genes that have acquired mutations conferring AMR
rRNA mutation models: detect drug-resistant rRNA target sequences
protein over-expression models: detect efflux subunits associated with AMR, and also highlight mutations conferring over-expression when present

Three terms appear repeatedly in this report: Perfect, Strict and Loose. The RGI README explains them as follows:

Perfect: often applied to clinical surveillance as it detects perfect matches to the curated reference sequences and mutations in the CARD
Strict: detects previously unknown variants of known AMR genes, including secondary screen for key mutations, using detection models with CARD’s curated similarity cut-offs to ensure the detected variant is likely a functional AMR gene
Loose: works outside of the detection model cut-offs to provide detection of new, emergent threats and more distant homologs of AMR genes, but will also catalog homologous sequences and spurious partial hits that may not have a role in AMR. Combined with phenotypic screening, the Loose algorithm allows researchers to hone in on new AMR genes

Methods

The steps I took in running CARD/RGI are described below in chronological order:

# to create a directory called "Project_3"
mkdir Project_3

# to change current directory to "Project_3"
cd Project_3

# to request the required resources
srun -p bit150 -t 4:00:00 --mem=16000 -n 2 -c 4 --pty bash -l

# to link the fasta files to current directory
# these fasta files will be used as the input files
ln -s /group/bit150/Project_3/Ec_55989.contigs.fa .
ln -s /group/bit150/Project_3/Ec_TY2482.contigs.fa .

# to load conda3 module
module load conda3

# to activate the CARD environment
source activate CARD

# to run RGI main to generate Perfect or Strict hits
# for Ec_55989 (avirulent)
rgi main --input_sequence Ec_55989.contigs.fa --output_file Ec_55989 --input_type contig --clean

# for Ec_TY2482 (virulent)
rgi main --input_sequence Ec_TY2482.contigs.fa --output_file Ec_TY2482 --input_type contig --clean


# to run RGI main to generate Perfect or Strict hits, INCLUDING Loose hits
# for Ec_55989 (avirulent)
rgi main --input_sequence Ec_55989.contigs.fa --output_file Ec_55989_loose --input_type contig --include_loose --clean

# for Ec_TY2482 (virulent)
rgi main --input_sequence Ec_TY2482.contigs.fa --output_file Ec_TY2482_loose --input_type contig --include_loose --clean

# now it is time to make some heatmap visualizations
# .json files are required to create the heatmap visualizations
# to create two directories called "jsonfiles" and "jsonfiles_loose"
# "jsonfiles" directory is for .json files WITHOUT Loose hits (DEFAULT)
# "jsonfiles_loose" directory is for .json files WITH Loose hits (LOOSE)
mkdir jsonfiles
mkdir jsonfiles_loose

# to move DEFAULT .json files to the "jsonfiles" directory
mv Ec_55989.json ./jsonfiles
mv Ec_TY2482.json ./jsonfiles

# to move LOOSE .json files to the "jsonfiles_loose" directory
mv Ec_55989_loose.json ./jsonfiles_loose
mv Ec_TY2482_loose.json ./jsonfiles_loose

# now that the .json files are in their respective directories, the heatmap command can be run
# NOTE: yellow for Perfect hit, teal for Strict hit, purple for No hit

# to generate a heat map from pre-compiled RGI main DEFAULT .json files
# samples and AMR genes organized alphabetically
rgi heatmap --input ./jsonfiles --output ./heatmap

# to generate a heat map from pre-compiled RGI main LOOSE .json files
# samples and AMR genes organized alphabetically
rgi heatmap --input ./jsonfiles_loose --output ./heatmap_loose
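
The four rgi main calls above differ only in the strain name and in whether --include_loose is present, so they can be generated in a loop. The sketch below is one possible way to do that, not the method used in this report. It uses only the flags and file names already shown, and it prints the commands rather than running them. The function name print_rgi_commands is hypothetical.

```shell
# Print (rather than run) the four `rgi main` commands used above.
# Pipe the output to `sh` on a machine where RGI is installed to run them.
print_rgi_commands() {
    for strain in Ec_55989 Ec_TY2482; do
        # DEFAULT run: Perfect and Strict hits only
        echo "rgi main --input_sequence $strain.contigs.fa --output_file $strain --input_type contig --clean"
        # LOOSE run: same input, with Loose hits included
        echo "rgi main --input_sequence $strain.contigs.fa --output_file ${strain}_loose --input_type contig --include_loose --clean"
    done
}

print_rgi_commands
```

Printing first makes it easy to check the output file names before committing cluster time to the runs.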

In addition to running RGI commands in the terminal, I uploaded all four .json files to the CARD website to obtain more visualizations that would aid my assessment of the genes responsible for virulence.

When trying to obtain visualizations for the LOOSE .json files, I encountered an issue, which is explained in the Results section. As a result, I used RStudio to count the number of Loose, Perfect and Strict hits in the LOOSE outputs.

R code:

# to import the output .txt file for the avirulent strain
Ec_55989_loose <- read.delim("~/Desktop/Ec_55989_loose.txt")

# to count the number of Loose, Perfect and Strict hits
# for avirulent strain
# (table() counts each category even when Cut_Off is read as text, the default since R 4.0)
table(Ec_55989_loose$Cut_Off)

# to import the output .txt file for the virulent strain
Ec_TY2482_loose <- read.delim("~/Desktop/Ec_TY2482_loose.txt")

# to count the number of Loose, Perfect and Strict hits
# for virulent strain
table(Ec_TY2482_loose$Cut_Off)
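
The same tally can also be made in the shell, without copying the .txt files to a desktop. The sketch below assumes the RGI output is tab-delimited with a header row containing a Cut_Off column, as the R code above implies. The function name count_cutoffs and the demo table, including its gene names, are made up for illustration.

```shell
# Tally the values of the Cut_Off column in an RGI tab-delimited output file.
# The column is located by its header name rather than by position.
count_cutoffs() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Cut_Off") col = i; next }
        col     { n[$col]++ }
        END     { for (k in n) print k "\t" n[k] }
    ' "$1" | sort
}

# Demo on a small made-up table (three columns only; real RGI output has more).
demo=$(mktemp)
printf 'ORF_ID\tCut_Off\tBest_Hit_ARO\n'  > "$demo"
printf 'orf_1\tPerfect\tgeneA\n'         >> "$demo"
printf 'orf_2\tStrict\tgeneB\n'          >> "$demo"
printf 'orf_3\tLoose\tgeneC\n'           >> "$demo"
printf 'orf_4\tLoose\tgeneD\n'           >> "$demo"

count_cutoffs "$demo"
```

On the real files this would be called as count_cutoffs Ec_TY2482_loose.txt.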

Results

Diagram 1: Visualization of AMR genes in avirulent Ec_55989 (DEFAULT). A total of 56 hits were found. As expected, no Loose hits are displayed.

Diagram 2: Visualization of AMR genes in virulent Ec_TY2482 (DEFAULT). A total of 63 hits were found. As expected, no Loose hits are displayed.

I attempted to upload the LOOSE .json files to obtain the same visualizations for them. However, the error message below was displayed because each of the files exceeded 20 MB.

Diagram 3: The heatmap output obtained by using the heatmap command and DEFAULT .json files. Yellow represents Perfect hit, teal represents Strict hit and purple represents No hit. The genes that appear to be responsible for virulence are CTX-M-15 and TEM-1.

Diagram 4: The heatmap output obtained by using the heatmap command and LOOSE .json files. The output appears to be exactly the same as the one produced using the DEFAULT .json files.

Using RStudio to count the number of Loose, Perfect and Strict hits led to the following results for the LOOSE outputs:

Diagram 5: Hit counts for Ec_55989 according to whether they are Loose, Perfect or Strict

Diagram 6: Hit counts for Ec_TY2482 according to whether they are Loose, Perfect or Strict. Note: Some of the Loose hits were “nudged” into Strict hits, as the number of Strict hits increased to 53, compared to 49 in the DEFAULT output.

Discussion

Pattern Observed

One interesting pattern in the results is that the virulent strain contains more AMR genes: 63, compared to 59 in the avirulent strain. In addition, when Loose hits are included in the output, 4 of the Loose hits for the virulent strain are nudged to Strict hits, while none of the Loose hits for the avirulent strain are. On this point, the RGI manual mentions that by default, all Loose hits of 95% identity or better are automatically listed as Strict, regardless of alignment length. To keep Loose hits from being nudged to Strict hits, I would need to include “--exclude_nudge” in the RGI main command. Another pattern worth noting is that more Loose hits in total are obtained for the virulent strain than for the avirulent strain.
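
The 95% rule can be checked against an output table. The sketch below counts Loose hits whose identity is at or above a threshold. Run on a table produced with the nudge disabled, that would be the number of hits the default setting relabels as Strict. The column name Best_Identities is an assumption about the RGI header and should be checked against the actual file. The function name count_nudge_candidates and the demo values are made up.

```shell
# Count Loose hits whose identity to the reference is at or above a threshold
# (default 95), i.e. the hits the nudge rule would relabel as Strict.
# Column names Cut_Off and Best_Identities are assumed; check your header.
count_nudge_candidates() {
    awk -F '\t' -v min="${2:-95}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Cut_Off") c = i
                if ($i == "Best_Identities") p = i
            }
            next
        }
        c && p && $c == "Loose" && $p + 0 >= min { n++ }
        END { print n + 0 }
    ' "$1"
}

# Demo on a small made-up table.
nudge_demo=$(mktemp)
printf 'ORF_ID\tCut_Off\tBest_Identities\n'  > "$nudge_demo"
printf 'orf_1\tLoose\t96.5\n'               >> "$nudge_demo"
printf 'orf_2\tLoose\t40.2\n'               >> "$nudge_demo"
printf 'orf_3\tStrict\t99.1\n'              >> "$nudge_demo"
printf 'orf_4\tLoose\t95\n'                 >> "$nudge_demo"

count_nudge_candidates "$nudge_demo"
```

An optional second argument changes the threshold, for example count_nudge_candidates file.txt 90.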

Genes Identified

Diagram 7: Screenshots of the heatmap sections for the identified genes, showing Perfect hits for the virulent strain (left) but No hit for the avirulent strain (right).

Based on the heatmap visualizations, the genes most likely responsible for virulence are CTX-M-15 and TEM-1. I chose those genes because Perfect hits were found for them in the virulent strain, while No hits were found for them in the avirulent strain. Other genes, less likely to cause virulence, are vgaC, tetC, dfrA7, sul1, sul2, APH(6)-Id and APH(3'')-Ib. Those genes are mentioned because Strict hits were found for them in the virulent strain, while No hits were found for them in the avirulent strain. A total of 7 genes show this pattern.
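
The comparison read off the heatmap (genes hit in the virulent strain but not in the avirulent one) can also be made directly from the two .txt tables. The sketch below is one way to do it and is not the method used in this report. It assumes the gene name is in a column called Best_Hit_ARO, which should be checked against the actual header. The function names and the demo gene names are made up.

```shell
# List the Best_Hit_ARO values of an RGI table, one per line, sorted and unique.
# Best_Hit_ARO is an assumed column name; check the header of your file.
list_genes() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Best_Hit_ARO") col = i; next }
        col     { print $col }
    ' "$1" | sort -u
}

# Print the genes reported in the first table but not in the second.
genes_only_in() {
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    list_genes "$1" > "$tmp_a"
    list_genes "$2" > "$tmp_b"
    comm -23 "$tmp_a" "$tmp_b"   # lines unique to the first sorted list
    rm -f "$tmp_a" "$tmp_b"
}

# Demo on two small made-up tables.
vir=$(mktemp)
avir=$(mktemp)
printf 'ORF_ID\tCut_Off\tBest_Hit_ARO\n'  > "$vir"
printf 'orf_1\tPerfect\tgeneB\n'         >> "$vir"
printf 'orf_2\tStrict\tgeneA\n'          >> "$vir"
printf 'orf_3\tStrict\tgeneC\n'          >> "$vir"
printf 'ORF_ID\tCut_Off\tBest_Hit_ARO\n'  > "$avir"
printf 'orf_1\tStrict\tgeneA\n'          >> "$avir"

genes_only_in "$vir" "$avir"
```

On the DEFAULT outputs this would be called as genes_only_in Ec_TY2482.txt Ec_55989.txt.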

Result Comparison with Literature

The article “Open-Source Genomic Analysis of Shiga-Toxin–Producing E. coli O104:H4” mentions that a large plasmid found in the outbreak strain carries the genes I identified above:

The largest plasmid, pESBL TY2482, was an IncI plasmid similar to pEC_Bactec, which was found in an E. coli strain isolated from the joint of a horse with arthritis. The pESBL TY2482 plasmid encodes a CTX-M-15 ESBL, as well as a beta-lactamase from the TEM class.

However, the paper mentions TEM only as a class and not as a specific gene. I then looked at another article to confirm whether TEM-1 is indeed the gene responsible for virulence. The article “Epidemic Profile of Shiga-Toxin–Producing Escherichia coli O104:H4 Outbreak in Germany” mentions that:

Other typical Shiga-toxin–producing E. coli genes such as stx1, eae, and ehx are missing. All isolates classified as the outbreak strain are resistant to beta-lactam antibiotics (e.g., ampicillin) and third-generation cephalosporins and are partially resistant to fluoroquinolones (nalidixic acid). The strain is sensitive to carbapenems and ciprofloxacin. The outbreak strain produces an ESBL complex (CTX-M15) and beta-lactamase TEM-1.

The excerpts above confirm that the TY2482 strain carries CTX-M-15 and TEM-1, the genes I identified as the likely cause of its virulence. This makes sense, since those genes are linked to antibiotic resistance. It suggests that the virulent strain TY2482 is able to destroy the beta-lactam ring of these antibiotics, which explains its resistance.

Output Difference (or not)

One change I made to the RGI main command was to add “--include_loose”. As expected, the output files were bigger because Loose hits were included. However, I did not expect so many Loose hits: nearly 5000 were added when the option was used. When I attempted to create a heatmap visualization that takes the Loose hits into account, RGI automatically removed them before producing the heatmap. This resulted in exactly the same heatmap output despite using the LOOSE .json files as input. I am not aware of a way to include Loose hits in the heatmap, but it makes sense for RGI to leave them out, as the heatmap would be too big to assess properly if they were included.

Exploring Other Options

I have yet to explore what I can possibly do with the .eps files I obtained as outputs of the RGI heatmap commands. A Google search revealed that .eps (Encapsulated PostScript) is a vector graphics file format used, for example, in Adobe Illustrator. Apparently it usually also contains a bitmap version of the image for simpler viewing, in addition to the vector instructions to draw the image.