Question

1 Overview and Background

Many of the assignments in this course will introduce you to topics in computational biology. You do not need to know anything about biology to do these assignments other than what is contained in the description itself. The objective of each assignment is for you to acquire certain particular skills or knowledge, and the choice of topic is independent of that objective. Sometimes the topics will be related to computational problems in biology, chemistry, or physics, and sometimes not.

This particular assignment is an exercise in extracting information from files that are too big for mere mortals to process manually. The real power of computers is that they can do simple things extremely quickly, meaning millions, perhaps billions of times per second, much faster than people can.

There is a kind of file called a PDB file that contains structural information about proteins, nucleic acids, and other macromolecules. A macromolecule is just a big molecule. Macro means big. PDB is an acronym for the Protein Data Bank. PDB files can be downloaded from the Protein Data Bank at http://www.rcsb.org/pdb/home/home.do. A PDB file contains information obtained experimentally, usually by either X-ray crystallography, NMR spectroscopy, or cryo-electron microscopy. (You do not need to know this to do the assignment, but it is important for those who intend to pursue a bioinformatics concentration.) These files completely characterize the molecule, providing, for example, the three-dimensional positions of every single atom in the file, where the bonds are, which amino acids it contains if it is a protein (or nucleotides if DNA or RNA), and much more. The information is not necessarily exact. Associated with some of this information are confidence values that indicate how accurate it is. A PDB file is a plain text file; you can view its contents in any text editor, such as gedit or nedit, or with commands such as cat, more, and less.
Each line in a PDB file begins with a word that characterizes what type of line it is. These individual lines are called records. For example, some lines start with the word REMARK, which means they are comments about the file itself, or about the experiment through which the data was collected. Some lines start with SOURCE, and they have information about the source of the data in the file. Some lines start with words such as MODEL, CONECT, ATOM, and HETATM. Each has a different meaning in the file. Take a look at some of the PDB files in the directory /data/biocs/b/student.accounts/cs132/data/pdb.files before you read any further, so that you can see what they contain. I suggest picking files that are small, meaning smaller than a megabyte in size.

Proteins are chains of amino acids. Amino acids are the building blocks of proteins, which may contain many thousands of them. Amino acids are organic compounds that carry out many important bodily functions, such as giving cells their structure. They are also instrumental in the transport and the storage of nutrients, and in the functioning of organs, glands, tendons and arteries. Amino acids have names such as alanine, glycine, tyrosine, and tryptophan. They are also known more succinctly by unique three-letter codes. The table below lists the twenty standard amino acids with their three-letter codes.

For a summary of what the experimental methods (X-ray crystallography, NMR spectroscopy, cryo-electron microscopy) are, see http://www.pdb.org/pdb/static.do?p=education_discussion/Looking-at-Structures/methods.html
Tyrosine    Tyr
Valine      Val

Each line that starts with the word ATOM represents a unique atom in the protein. So do the lines that start with HETATM, but these are atoms in water molecules surrounding the particular protein when it was crystallized, and we want to ignore them for now. Lines that start with ATOM contain the three-letter code for the amino acid of which that atom is a part. For example, an atom line for an atom in a phenylalanine molecule looks like this:

ATOM   3814  N   PHE J  24     -17.763  -7.816 -12.014  1.00  0.00

The three-letter code is always in uppercase. The exact form of a PDB file is standardized. The standard is revised every few years. The most recent standard that describes the format can be found at:

https://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html

On that page you can scroll down to the Coordinate Section and find the link to the ATOM record format. There you will see that the line for an atom is defined by the following table:

COLUMNS   DATA TYPE      FIELD        DEFINITION
 1 -  6   Record name    "ATOM  "
 7 - 11   Integer        serial       Atom serial number.
13 - 16   Atom           name         Atom name.
17        Character      altLoc       Alternate location indicator.
18 - 20   Residue name   resName      Residue name.
22        Character      chainID      Chain identifier.
23 - 26   Integer        resSeq       Residue sequence number.
27        AChar          iCode        Code for insertion of residues.
31 - 38   Real(8.3)      x            Orthogonal coordinates for X in Angstroms.
39 - 46   Real(8.3)      y            Orthogonal coordinates for Y in Angstroms.
47 - 54   Real(8.3)      z            Orthogonal coordinates for Z in Angstroms.
55 - 60   Real(6.2)      occupancy    Occupancy.
61 - 66   Real(6.2)      tempFactor   Temperature factor.
77 - 78   LString(2)     element      Element symbol, right-justified.
79 - 80   LString(2)     charge       Charge on the atom.
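Because the ATOM record is laid out in fixed columns, the residue name can be taken by character position rather than by splitting on whitespace, which is unreliable when neighboring fields run together. The following is a minimal sketch under that assumption; residue_counts is a hypothetical helper name, not something the assignment defines, and the file name in the usage note is a placeholder.

```shell
#!/bin/bash
# residue_counts (hypothetical helper): tally ATOM records per residue name.
# Assumes the fixed-column layout of the ATOM record:
# record name in columns 1-6, residue name in columns 18-20.
residue_counts() {
    grep '^ATOM  ' "$1" |   # keep ATOM records only; HETATM lines do not match
        cut -c18-20 |       # residue name field
        sort |
        uniq -c             # one "count NAME" line per residue
}
```

Calling residue_counts on a PDB file (for example, residue_counts some_protein.pdb, where the file name is made up) prints one line per amino acid code with the number of atoms that belong to it.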
2. A DNA string is a sequence of the letters a, c, g, and t in any order, whose length is a multiple of 3. For example, aacgtttgtaaccagaactgt is a DNA string of length 21. Each letter is called a base, and a sequence of three consecutive letters is called a codon. For example, in the preceding string, the codons are aac, gtt, tgt, aac, cag, aac, and tgt. A DNA string can be hundreds of thousands of codons long, even millions of codons long, so it is hard to count them by hand. It would be useful to have a simple utility script that could count the number of occurrences of a specific codon in such a string. For instance, in the example string above, aac occurs three times and tgt occurs twice. For simplicity, we always assume that we look for codons at positions that are multiples of three in the file, i.e., starting at positions 0, 3, 6, 9, 12, and so on.

Write a script named countcodon that is given two arguments on the command line. The first is a lowercase three-letter codon string such as aaa or cgt. The second is the name of a file containing a DNA string with no newline characters or white space characters of any kind except at the end after the sequence of bases; it is just a sequence of the letters a, c, g, and t. The script will output a single number, which is the number of occurrences of the given codon in the given file. It should output nothing but that number. If it finds no occurrences, it should output 0. For example, if the above string is in a file named dnafile, then it should work like this:

$ countcodon ttt dnafile
0
$ countcodon aac dnafile
3
$ countcodon ccc dnafile
0

The script should check that it has two arguments and exit with a usage message if it does not. It should make sure that it can open the file for reading and print a usage statement if it cannot.
It does not have to check that the string is actually a codon, but it should check that the file contains nothing but the bases and a possible terminating newline character.

Hint: You will not be able to solve this problem using grep alone. There are a number of commands that might be useful, such as sort, cut, fold, and uniq. One of these makes it very easy. Find the right one.
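The hint rules out grep alone because grep -o reports a codon wherever it occurs, not only at positions that are multiples of three. The sketch below shows the difference on the example string; it uses fold, one of the commands the hint lists, to cut the sequence into three-letter lines first. The variable names are illustrative only.

```shell
#!/bin/bash
# The 21-base example string from the assignment text.
dna=aacgtttgtaaccagaactgt

# Unaligned search: grep -o finds "ttt" inside "gtttgt",
# even though that match straddles the codons gtt and tgt.
unaligned=$(printf '%s\n' "$dna" | grep -o ttt | grep -c .) || true

# Aligned search: fold -w 3 emits one codon per line (aac, gtt, tgt, ...),
# and grep -cx counts only the lines that equal the codon exactly.
# grep exits non-zero when the count is 0, hence the "|| true".
aligned=$(printf '%s\n' "$dna" | fold -w 3 | grep -cx ttt) || true

echo "unaligned=$unaligned aligned=$aligned"
```

The two counts differ for ttt, which is why the codon boundaries have to be fixed before any matching is done.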
Answer #1
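#!/bin/bash
# countcodon <codon> <dnafile>: count codons at positions 0, 3, 6, ...
codon=$1
dnafile=$2

# Require exactly two arguments and a readable file
if [ $# -ne 2 ] || [ ! -r "$dnafile" ]; then
    echo "Usage: countcodon <codon> <dnafile>" >&2
    exit 1
fi
# Reject anything other than the four bases (the trailing newline is allowed)
if tr -d '\n' < "$dnafile" | grep -q '[^acgt]'; then
    echo "countcodon: $dnafile must contain only a, c, g, t" >&2
    exit 1
fi

# fold -w 3 puts one codon per line; grep -cxF counts exact matches
count=$(tr -d '\n' < "$dnafile" | fold -w 3 | grep -cxF -- "$codon")
echo "${count}"

OUTPUT :

$ ./countcodon ttt dnafile
0
$ ./countcodon aac dnafile
3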

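(Alternative pipeline that tallies every codon in the reading frame at once, for a FASTA-style file named input.fa; lines containing ">" are treated as headers and dropped)

SHELL PIPELINE (tr, sed, awk) :-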

grep -v ">"  < input.fa |
tr -d '\n' |
sed 's/\([ATGCatgcNn]\{3,3\}\)/\1#/g' |
tr "#" "\n" |
awk '(length($1)==3)' |
sort |
uniq -c
Similar Homework Help Questions
  • INSTRUCTIONS You may print out this assignment and fill it in by hand. We suggest using...

    INSTRUCTIONS You may print out this assignment and fill it in by hand. We suggest using pencil in case you make mistakes!! Submit your Assignment as a single doc on Canvas. ASSIGNMENT 1) For the DNA sequence given below, write the complementary DNA sequence that would complete the double-strand. DNA 3-A TTGCT TACTTGCA T-5° DNA 5 2) Does it matter which strand is the 'code strand'? The following two sequences look identical, except one runs 3-5' and the other 5'-3'....

  • Question #2 A DNA molecule can be specified using a string of the characters , ‘g...

    Question #2 A DNA molecule can be specified using a string of the characters , ‘g 'a', 't'. Each of these characters represents one of the four nucleobases Cytosine, Guanine, Adenine, and Thymine. Consult the wikipedia page on DNA for more details. A codon consists of a sequence of three DNA nucle- obases and can be represented by a string of length 3 consisting of characters from the set ‘c', ‘g', ‘a", ‘t'. So "cga" and "ttg" are examples of...

  • Chapter 15: 1. What is the significance of the fact that many synonymous codons differ in...

    Chapter 15: 1. What is the significance of the fact that many synonymous codons differ in the third nucleotide position? 2. Define the following terms as they apply to the genetic code: a. Reading frame b. Overlapping code C. Nonoverlapping code d. Initiation codon e. Termination codon f. Sense codon 8. Nonsense codon h. Universal code i. Nonuniversal code 3. What role do the initiation factors play in protein synthesis? 4. Compare and contrast the process of protein synthesis in...

  • Please develop a Java program to read in a piece of DNA sequence from a FASTA format sequence fil...

    Please develop a Java program to read in a piece of DNA sequence from a FASTA format sequence file (alternatively you can use the getRandomSeq(long) method of the RandomSeq class to generate a piece of DNA sequence), and then print out all the codons in three forward reading frames. Design a method called codon() that can be used to find all the codons from three reading frames. The method will take in an argument, the reading frame (1, 2, or...

  • C++ Programming help, please include comments to help me understand the code. Thank you for helping....

    C++ Programming help, please include comments to help me understand the code. Thank you for helping. Task C: Substitution and Hamming Distance For this task, we will explore mutations that occur by substitution. Your task is to write a program called hamming.cpp that calculates the Hamming distance between two strings. Given two strings of equal length, the Hamming distance is the number of positions at which the two strings differ. e. g.: Hamming("aactgc", "atcaga") would output 3. Notice that certain...

  • C++ Help Task B: Translation While a nucleotide is the basic unit of information, three nucleotid...

    C++ Help Task B: Translation While a nucleotide is the basic unit of information, three nucleotides, or codon, is the basic unit of storage. The reason for this is that each gene codes for a protein, and all proteins are made from 20 amino acids. Recall that there are 4 different bases that make up dna. Thus, three bases can encode for 4x4x4 = 64 different symbols. Two base pairs can only encode for 4x4 = 16 symbols, which is...

  • Genetics! help please May S Tae suumissis hot be accepted. 1. (10 points) A series of...

    Genetics! help please May S Tae suumissis hot be accepted. 1. (10 points) A series of tRNAs have the anticodon sequences shown below Considering wobble, use Figure 13.12 to determine the possible codons with which each tRNA could pair Posible codons (Indicate the s end of each codon) Anticodon sequence 5-ACG-3 5'-xm UmGG-3 5'-IGA-3' Indicate which amino acid would be covalently bonded to tRNAs with the anticodon sequences given above. Use Table 13.1 to help you with your answer. Anticodon...

  • C++: Translating mRNA sequence help Homework Description Codon 1 You are working in a bioinformatics lab...

    C++: Translating mRNA sequence help Homework Description Codon 1 You are working in a bioinformatics lab studying messenger RNA (mRNA) sequences. mRNA is a sequence of the nucleotide bases (Adenine, Cytosine, Guanine, and Uracil) that conveys information stored in DNA to Ribosomes for translation into proteins. The bases in the sequences are denoted by the first letters of the nucleotide bases (e.g. A, C, G, and U). A sequence of mRNA is made up of hundres to thousands of nucleotide...

  • table for reference for question 4 table h ell-free synoms produce Golshopecule consisting of a string...

    table for reference for question 4 table h ell-free synoms produce Golshopecule consisting of a string of urucille RNA free systems produced the polypeptide poly string of phenylalanine amino acids phenylalanine ed entymes to produkce RNA polymers with they leotide. These polymers allowed them to inter to signal "start" and is therefore the art codon. In this case the as a dual function because also encodes the mind methionine (Met). You can see that 61 codons are more than enough...

  • Background Information How can we predict where a coding gene will be in bacteria? And can...

    Background Information How can we predict where a coding gene will be in bacteria? And can we then predict what protein will be produced? Take the DNA sequence below, for example. tcaggctttaattcatccgtgatctttgacgacggtaaatacgatgcagatataatacgatgaccgatgccaatcgaccgatcaaggaggcaccgaatggcgatgatggcgatgattgcgattaacgaagtggaacgcattatggcgggcattaacgaagatacccatgcgaccggcgaaaacgaaaccatttgcagctgcgcgaactttgaagaactgacccatgcgaccggccgcgaagcgacctaaaagtcgtaattacgtatcaagtcatgggccgcgggcgcccggcccactgactagactagggccgggcgcccgcggcccaccatataaataaaaaaaaaaaaaacgaggctatagctcatcaatgacct If you were a bacterial RNA polymerase, what sequence(s) should there be in this DNA for you to bind and begin transcribing? And if you found such sequence(s), where would you begin transcription? As a human being looking at this fragment of DNA, what type of consensus sequence(s)...
