Protein Sequences
Lab 3-Mutations and Antimicrobial Resistance
Introduction
As sequencing becomes more accessible and the amount of sequence data continues to grow substantially, modern biologists should be equipped with the knowledge to manage and use this data. Whether a conservation biologist fighting species extinction, a biochemist creating life-saving vaccines, or a virologist tracking the mutations of a virus, all biologists can use sequencing data to aid in their research. Students will understand where to search for sequence data, how to manipulate sequences, and how mutations lead to antimicrobial resistance by simulating mutations of a sequence in MATLAB.
Learning Objectives
Students understand the probability concepts of "experiments" and "Independently and Identically Distributed" events.
Students understand probability in terms of the count of desired events divided by the count of all possible events.
Students can compute probabilities for "and" and "or" combinations and for sequences of events.
Students understand how random numbers are generated and used by computers.
Students can generate random numbers (real numbers) in any range.
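The last few objectives can be previewed with a short MATLAB sketch. It is only an illustration: the range limits, the sample size, and the variable names are arbitrary choices made here, and the seed is fixed only so that the run is repeatable.

```matlab
rng(1);                           % fix the seed so repeated runs give the same numbers

% Uniform real numbers in the range (a, b): rescale rand, which returns values in (0, 1)
a = -2.5;                         % illustrative lower limit
b = 7;                            % illustrative upper limit
n = 10000;                        % number of random draws
x = a + (b - a) * rand(1, n);

% Probability as the count of desired events divided by the count of all events:
% draw n bases uniformly at random and count how many are purines (A or G)
bases = 'ATGC';
draws = bases(randi(4, 1, n));
pPurine = sum(draws == 'A' | draws == 'G') / n;   % expected to be close to 2/4

fprintf('min %.3f, max %.3f, estimated P(purine) %.3f\n', min(x), max(x), pPurine);
```

Each draw here is independent and comes from the same distribution, which is what "Independently and Identically Distributed" means in practice.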
I Do
***
We Do
***
You Do
Simulate different mutations in a DNA sequence to analyze the effect on protein production.
Protein Production
Central Dogma
Before we can use sequence data, we have to understand what it is. The 'Central Dogma' describes how a protein product is made from a DNA sequence: information flows from DNA to RNA to protein. The first step is transcription, in which one strand of the double-stranded DNA (composed of adenine, thymine, guanine, and cytosine) serves as the template for a complementary single-stranded mRNA composed of adenine, uracil, guanine, and cytosine. The mRNA is then processed for the next step of the central dogma: translation of RNA to protein. The RNA sequence of A, U, G, and C is translated into amino acids, which make up a protein. A sequence of three nucleotides, called a codon, codes for one amino acid, and each amino acid is represented by a single letter. The codon table is redundant: there are 64 possible codons, but 61 of them code for only 20 amino acids and the remaining 3 are stop codons, so several codons code for the same amino acid.
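The two steps can be sketched in base MATLAB without any toolbox. This is a minimal illustration under stated assumptions: the input is taken to be the coding strand in uppercase (so the mRNA is the same sequence with U in place of T), the example sequence and variable names are made up for this sketch, and the amino acid string is the standard genetic code listed in TCAG order with '*' marking stop codons.

```matlab
% Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
bases      = 'TCAG';
aminoAcids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

dna = 'ATGGCCTGGTAA';                       % illustrative coding-strand sequence

% Transcription (coding-strand shortcut): replace every T with U
mrna = strrep(dna, 'T', 'U');

% Translation: split into codons, then look each codon up in the table
nCodons  = floor(length(dna) / 3);          % ignore any incomplete trailing codon
codons   = reshape(dna(1:3*nCodons), 3, [])';   % one codon per row
[~, idx] = ismember(codons, bases);         % position of each base in 'TCAG' (1 to 4)
tableRow = 16*(idx(:,1) - 1) + 4*(idx(:,2) - 1) + idx(:,3);
protein  = aminoAcids(tableRow.');

fprintf('mRNA:    %s\nProtein: %s\n', mrna, protein);
```

For this input the codons are ATG, GCC, TGG, and TAA, so the translated string is MAW followed by the stop symbol.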
Sequence data is usually accessed in the form of DNA made up of A, T, G, and C; RNA made up of A, U, G, and C; or proteins made up of the 20 amino acids. An example of each type of sequence is shown below.
DNA:
CGATCCAGTTTATCTCACGAAACTATAGTC
RNA:
UGCUGAACACCCAGCAUAACUAGGUACGCU
Protein:
VYLAFWDVWTWTGLRMIFHYEWHSTMFRNC
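A rough way to tell the three sequence types apart is to check which alphabet a sequence uses. The sketch below is a heuristic written for this handout, not a standard routine: it tests the alphabets in a fixed order, so a short protein that happens to contain only A, C, G, and T would be labelled as DNA. The variable names are illustrative.

```matlab
% The three example sequences from above
seqs = {'CGATCCAGTTTATCTCACGAAACTATAGTC', ...
        'UGCUGAACACCCAGCAUAACUAGGUACGCU', ...
        'VYLAFWDVWTWTGLRMIFHYEWHSTMFRNC'};

types = cell(size(seqs));
for k = 1:numel(seqs)
    s = upper(seqs{k});
    if all(ismember(s, 'ACGT'))
        types{k} = 'DNA';
    elseif all(ismember(s, 'ACGU'))
        types{k} = 'RNA';
    elseif all(ismember(s, 'ACDEFGHIKLMNPQRSTVWY'))   % the 20 amino acid letters
        types{k} = 'protein';
    else
        types{k} = 'unknown';
    end
    fprintf('%s -> %s\n', seqs{k}, types{k});
end
```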
Mutation types
A mutation is a change in the sequence of nucleotides in DNA. There are three main types of mutations: base substitutions, insertions, and deletions.
Base substitutions are point mutations where one nucleotide is substituted by another. Adenine and guanine are purines and thymine and cytosine are pyrimidines; the bases within each group have similar structures and properties.
Transitions occur when a purine is replaced with the other purine, or a pyrimidine is replaced with the other pyrimidine.
Transversions occur when a pyrimidine is replaced with a purine or vice versa.
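The transition and transversion rule can be checked in a few lines of MATLAB. This is a small sketch with made-up example substitutions; it assumes each pair holds two different uppercase DNA bases, and the variable names are illustrative.

```matlab
purines = 'AG';                        % adenine and guanine; C and T are pyrimidines

% Each entry holds the original base followed by the replacement base
pairs = {'AG', 'CT', 'AC', 'GT'};
kinds = cell(size(pairs));

for k = 1:numel(pairs)
    fromBase = pairs{k}(1);
    toBase   = pairs{k}(2);
    % Same chemical class before and after the substitution -> transition
    sameClass = any(fromBase == purines) == any(toBase == purines);
    if sameClass
        kinds{k} = 'transition';
    else
        kinds{k} = 'transversion';
    end
    fprintf('%s -> %s: %s\n', fromBase, toBase, kinds{k});
end
```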
Base substitutions can be silent, missense, or nonsense.
Silent mutations result in the same amino acid, so there is no change to the protein that is produced.
Missense mutations form different amino acids, so the effect varies depending on whether the substituted amino acid has similar properties.
Nonsense mutations lead to the formation of a stop codon that prevents the rest of the sequence from being translated.
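The three outcomes can be told apart by translating the codon before and after the substitution. The sketch below assumes uppercase DNA codons and the standard genetic code written in TCAG order; the helper names codonIndex and translateCodon and the example codons are hypothetical choices for this illustration.

```matlab
bases      = 'TCAG';
aminoAcids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

% Hypothetical helpers: position of a codon in the table, and its amino acid
codonIndex     = @(c) 16*(find(bases == c(1)) - 1) + 4*(find(bases == c(2)) - 1) + find(bases == c(3));
translateCodon = @(c) aminoAcids(codonIndex(c));

original = 'TAC';                      % codes for tyrosine (Y)
mutants  = {'TAT', 'TCC', 'TAA'};      % three single-base substitutions of TAC
effects  = cell(size(mutants));

for k = 1:numel(mutants)
    before = translateCodon(original);
    after  = translateCodon(mutants{k});
    if after == before
        effects{k} = 'silent';         % same amino acid
    elseif after == '*'
        effects{k} = 'nonsense';       % new stop codon
    else
        effects{k} = 'missense';       % different amino acid
    end
    fprintf('%s (%s) -> %s (%s): %s\n', original, before, mutants{k}, after, effects{k});
end
```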
Insertions add one or more nucleotides to the sequence. When the number added is not a multiple of three, such as a single nucleotide, the insertion is a frameshift mutation.
Because nucleotides are read three at a time as codons, inserting one nucleotide changes how the rest of the sequence is read. This leads to a greater chance of a faulty protein being formed from the sequence.
Deletions remove one or more nucleotides from the sequence. When the number removed is not a multiple of three, such as a single nucleotide, the deletion is a frameshift mutation.
Similar to insertions, the deletion of one nucleotide results in the rest of the sequence being read differently.
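The effect of a frameshift is easiest to see by translating a sequence before and after a one-nucleotide insertion or deletion. The sketch below is an illustration only: the sequence is made up, the helper translate is a hypothetical anonymous function written for this example, and it assumes uppercase input containing only A, C, G, and T.

```matlab
aminoAcids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

% Lookup table from character code to base number (T=0, C=1, A=2, G=3)
lut = zeros(1, 128);
lut(double('TCAG')) = 0:3;

% Hypothetical helper: translate whole codons, dropping any incomplete trailing codon
translate = @(s) aminoAcids([16 4 1] * reshape(lut(double(s(1:3*floor(numel(s)/3)))), 3, []) + 1);

original = 'ATGGCCTGGAAATAA';                       % ATG GCC TGG AAA TAA
inserted = [original(1:3), 'C', original(4:end)];   % insert one C after position 3
deleted  = original([1:3, 5:end]);                  % delete the nucleotide at position 4

proteinOriginal = translate(original);
proteinInserted = translate(inserted);
proteinDeleted  = translate(deleted);

fprintf('original:  %s\ninsertion: %s\ndeletion:  %s\n', ...
        proteinOriginal, proteinInserted, proteinDeleted);
```

Every codon after the first one changes in both mutants, and the original stop codon is no longer read in frame.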
Antimicrobial Resistance
Mutations can have no effect on protein production, be harmful if they reduce an organism's ability to survive, or give an organism an evolutionary advantage when they are beneficial. In bacteria, mutations can cause resistance to antibiotics. Resistant bacteria are more likely to survive antibiotic treatment and reproduce, which makes antimicrobial resistance a growing concern.
Biological Sequence Databases
There are over 100 databases of biological sequence data, so it is important for biologists to know where to start searching.
UniProtKB is one of the most commonly used sources; it stores protein sequences and functional information. It is split into two sections.
Swiss-Prot is the reviewed section, which is manually annotated.
TrEMBL is the unreviewed section, which is automatically annotated and not as thorough.
GenBank stores publicly available nucleotide sequences and their protein translations.
The Protein Data Bank is a database that stores annotated data on the 3D structure of proteins.
Sequences can be downloaded from each database in FASTA format for analysis. A FASTA record has one header line, beginning with ">", that describes the sequence, followed by one or more lines of nucleotide or amino acid sequence.
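To make the layout concrete, the sketch below builds a small FASTA record as text and separates the header from the sequence using only base MATLAB. The record name and description are invented for this example, it handles a single record only, and the variable names are illustrative.

```matlab
% A made-up single-record FASTA text: one '>' header line, then sequence lines
fastaText = sprintf('>example_seq hypothetical record\nCGATCCAGTTTATCT\nCACGAAACTATAGTC\n');

% Split the text into lines on the newline character, char(10)
textLines = strsplit(strtrim(fastaText), char(10));

header   = textLines{1}(2:end);        % drop the leading '>'
sequence = [textLines{2:end}];         % join the sequence lines into one string

fprintf('Header:   %s\nSequence: %s (%d letters)\n', header, sequence, length(sequence));
```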
Using MATLAB for Biological Data
fastaread(fastafile)
Use this function, which is part of MATLAB's Bioinformatics Toolbox, to extract the header and sequence from a FASTA file.
Try downloading a FASTA file from one of the biological databases and use fastaread('filename') to extract the sequence. Make sure the downloaded file is in the current working directory, or include the file path. The file for the Mycobacterium phage named "Happiness" is shown as an example below.