1 Overview and Background Many of the assignments in this course will introduce you to topics in ...

Question

1 Overview and Background

Many of the assignments in this course will introduce you to topics in computational biology. You do not need to know anything about biology to do these assignments other than what is contained in the description itself. The objective of each assignment is for you to acquire certain particular skills or knowledge, and the choice of topic is independent of that objective. Sometimes the topics will be related to computational problems in biology, chemistry, or physics, and sometimes not.

This particular assignment is an exercise in extracting information from files that are too big for mere mortals to process manually. The real power of computers is that they can do simple things extremely quickly, meaning millions, perhaps billions, of times per second, much faster than people can.

There is a kind of file called a PDB file that contains structural information about proteins, nucleic acids, and other macromolecules. A macromolecule is just a big molecule. Macro means big. PDB is an acronym for the Protein Data Bank. PDB files can be downloaded from the Protein Data Bank at http://www.rcsb.org/pdb/home/home.do. A PDB file contains information obtained experimentally, usually by either X-ray crystallography, NMR spectroscopy, or cryo-electron microscopy. (You do not need to know this to do the assignment, but it is important for those who intend to pursue a bioinformatics concentration.) These files completely characterize the molecule, providing, for example, the three-dimensional positions of every single atom in the file, where the bonds are, which amino acids it contains if it is a protein (or nucleotides if DNA or RNA), and much more. The information is not necessarily exact. Associated with some of this information are confidence values that indicate how accurate it is. A PDB file is a plain text file; you can view its contents in any text editor, such as gedit or nedit, or with commands such as cat, more, and less.
(For a summary of the experimental methods named earlier, X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy, see http://www.pdb.org/pdb/static.do?p=education_discussion/Looking-at-Structures/methods.html)

Each line in a PDB file begins with a word that characterizes what type of line it is. These individual lines are called records. For example, some lines start with the word REMARK, which means they are comments about the file itself, or about the experiment through which the data was collected. Some lines start with SOURCE, and they have information about the source of the data in the file. Some lines start with words such as MODEL, CONECT, ATOM, and HETATM. Each has a different meaning in the file. Take a look at some of the PDB files in the directory /data/biocs/b/student.accounts/cs132/data/pdb.files before you read any further, so that you can see what they contain. I suggest picking files that are small, meaning smaller than a megabyte in size.

Proteins are chains of amino acids. (Amino acids are the building blocks of proteins, which may contain many thousands of them.) Amino acids are organic compounds that carry out many important bodily functions, such as giving cells their structure. They are also instrumental in the transport and the storage of nutrients, and in the functioning of organs, glands, tendons, and arteries. Amino acids have names such as alanine, glycine, tyrosine, and tryptophan. They are also known more succinctly by unique three-letter codes. The table below lists the twenty standard amino acids with their three-letter codes.
Amino acid      Code
Alanine         Ala
Arginine        Arg
Asparagine      Asn
Aspartic acid   Asp
Cysteine        Cys
Glutamic acid   Glu
Glutamine       Gln
Glycine         Gly
Histidine       His
Isoleucine      Ile
Leucine         Leu
Lysine          Lys
Methionine      Met
Phenylalanine   Phe
Proline         Pro
Serine          Ser
Threonine       Thr
Tryptophan      Trp
Tyrosine        Tyr
Valine          Val

Each line that starts with the word ATOM represents a unique atom in the protein. So do the lines that start with HETATM, but these are atoms in water molecules surrounding the particular protein when it was crystallized, and we want to ignore them for now. Lines that start with ATOM contain the three-letter code for the amino acid of which that atom is a part. For example, an atom line for an atom in a phenylalanine molecule looks like this:

ATOM   3814  N   PHE J  24     -17.763  -7.816 -12.014  1.00  0.00

The three-letter code is always in uppercase.

The exact form of a PDB file is standardized. The standard is revised every few years. The most recent standard that describes the format can be found at:

https://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html

On that page you can scroll down to the Coordinate Section and find the link to the ATOM record format. There you will see that the line for an atom is defined by the following table:

COLUMNS   DATA TYPE      FIELD        DEFINITION
 1 -  6   Record name    "ATOM  "
 7 - 11   Integer        serial       Atom serial number.
13 - 16   Atom           name         Atom name.
17        Character      altLoc       Alternate location indicator.
18 - 20   Residue name   resName      Residue name.
22        Character      chainID      Chain identifier.
23 - 26   Integer        resSeq       Residue sequence number.
27        AChar          iCode        Code for insertion of residues.
31 - 38   Real(8.3)      x            Orthogonal coordinates for X in Angstroms.
39 - 46   Real(8.3)      y            Orthogonal coordinates for Y in Angstroms.
47 - 54   Real(8.3)      z            Orthogonal coordinates for Z in Angstroms.
55 - 60   Real(6.2)      occupancy    Occupancy.
61 - 66   Real(6.2)      tempFactor   Temperature factor.
77 - 78   LString(2)     element      Element symbol, right-justified.
79 - 80   LString(2)     charge       Charge on the atom.
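Because the ATOM record uses fixed columns, a field is best extracted by character position; splitting on whitespace can fail, since adjacent fields may abut with no space between them. The following is a minimal sketch under stated assumptions, not part of the assignment. The sample file is made up for illustration (only its first line comes from the text above), and `residue_counts` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: tally residue names from ATOM records by fixed column position.
# The sample data and the name residue_counts are hypothetical.

sample=$(mktemp)
cat > "$sample" <<'EOF'
ATOM   3814  N   PHE J  24     -17.763  -7.816 -12.014  1.00  0.00
ATOM   3815  CA  PHE J  24     -16.512  -7.104 -11.800  1.00  0.00
ATOM   3816  N   GLY J  25     -15.000  -6.000 -10.000  1.00  0.00
HETATM 9001  O   HOH J 101      10.000  10.000  10.000  1.00  0.00
EOF

# Print "count residue" for every residue name found in ATOM records.
residue_counts() {
    grep '^ATOM  ' "$1" |   # keep ATOM records, skip HETATM and others
        cut -c18-20 |       # columns 18-20 hold resName
        sort | uniq -c |
        awk '{print $1, $2}'
}

residue_counts "$sample"
```

For this sample the function prints `1 GLY` and `2 PHE`; the HETATM line is ignored because only records beginning with `ATOM` are kept.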
2. A DNA string is a sequence of the letters a, c, g, and t in any order, whose length is a multiple of 3. For example, aacgtttgtaaccagaactgt is a DNA string of length 21. Each letter is called a base, and a sequence of three consecutive letters is called a codon. For example, in the preceding string, the codons are aac, gtt, tgt, aac, cag, aac, and tgt. A DNA string can be hundreds of thousands of codons long, even millions of codons long, so it is hard to count them by hand. It would be useful to have a simple utility script that could count the number of occurrences of a specific codon in such a string. For instance, in the example string above, aac occurs three times and tgt occurs twice. For simplicity, we always assume that we look for codons at positions that are multiples of three in the file, i.e., starting at positions 0, 3, 6, 9, 12, and so on.

Write a script named countcodon that is given two arguments on the command line. The first is a lowercase three-letter codon string such as aaa or cgt. The second is the name of a file containing a DNA string with no newline characters or white space characters of any kind except at the end after the sequence of bases; it is just a sequence of the letters a, c, g, and t. The script will output a single number, which is the number of occurrences of the given codon in the given file. It should output nothing but that number. If it finds no occurrences, it should output 0. For example, if the above string is in a file named dnafile, then it should work like this:

$ countcodon ttt dnafile
0
$ countcodon aac dnafile
3
$ countcodon ccc dnafile
0

The script should check that it has two arguments and exit with a usage message if it does not. It should make sure that it can open the file for reading and print a usage statement if it cannot. It does not have to check that the string is actually a codon, but it should check that the file contains nothing but the bases and a possible terminating newline character.

Hint: You will not be able to solve this problem using grep alone. There are a number of commands that might be useful, such as sort, cut, fold, and uniq. One of these makes it very easy. Find the right one.
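The difference between matching anywhere and matching only at codon boundaries can be seen on the example string. This is a hedged sketch rather than the required solution: it assumes `fold -w 3` is the command the hint has in mind, and the temporary file and both function names are made up for illustration.

```shell
#!/bin/bash
# Sketch: in-frame vs. any-position codon counts for the example string.
# The temporary file and both function names are hypothetical.

dnafile=$(mktemp)
printf 'aacgtttgtaaccagaactgt\n' > "$dnafile"

# Count matches at any offset (what grep -o alone reports).
count_anywhere() {
    grep -o -- "$1" "$2" | wc -l | tr -d ' '
}

# Count only codons that start at positions 0, 3, 6, ...
count_in_frame() {
    # fold prints one codon per line; -x demands a whole-line match.
    # grep exits with status 1 when the count is 0, hence "|| true".
    fold -w 3 "$2" | grep -c -x -- "$1" || true
}

count_anywhere ttt "$dnafile"   # 1: ttt straddles the codons gtt and tgt
count_in_frame ttt "$dnafile"   # 0: no codon equals ttt
count_in_frame aac "$dnafile"   # 3: aac is the 1st, 4th, and 6th codon
```

The three calls print 1, 0, and 3, which is why a plain pattern search over-counts ttt for this input.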



Answer 1


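#!/bin/bash
# countcodon CODON FILE: count in-frame occurrences of CODON in FILE.
if [ $# -ne 2 ] || [ ! -r "$2" ]; then
    echo "usage: $(basename "$0") codon dnafile" >&2
    exit 1
fi
codon=$1
file=$2

# The file may contain only the bases and a terminating newline.
if tr -d 'acgt\n' < "$file" | grep -q .; then
    echo "$file: not a DNA string" >&2
    exit 1
fi

# One codon per line, then count exact matches; plain grep -o would also
# count matches that straddle two codons.
fold -w 3 "$file" | grep -c -x -- "$codon"

OUTPUT (with the example string from the question stored in dnafile):

$ ./countcodon ttt dnafile
0
$ ./countcodon aac dnafile
3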

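(Alternative pipeline that tallies every codon in a FASTA file, here input.fa, instead of counting a single codon)

SHELL PIPELINE (sed AND awk) :-

grep -v ">"  < input.fa |                  # drop FASTA header lines
tr -d '\n' |                               # join the sequence into one line
sed 's/\([ATGCatgcNn]\{3,3\}\)/\1#/g' |    # mark the end of each codon with #
tr "#" "\n" |                              # put one codon on each line
awk '(length($1)==3)' |                    # keep only complete three-letter codons
sort |
uniq -c                                    # print each distinct codon with its count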
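A shorter variant of the same tally is possible if `fold` does the splitting and awk does the counting. This is only a sketch under assumptions: the FASTA sample is made up for illustration, `tally_codons` is a hypothetical name, and the sequence is assumed to contain nothing but bases.

```shell
#!/bin/bash
# Sketch: tally every in-frame codon of a FASTA file with fold and awk.
# The sample file and the function name tally_codons are hypothetical.

fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 made-up header for illustration
aacgtttgtaac
cagaactgt
EOF

tally_codons() {
    grep -v '>' "$1" |   # drop header lines
        tr -d '\n' |     # join the sequence lines
        fold -w 3 |      # one codon per line
        awk 'length($0) == 3 { n[$0]++ }
             END { for (c in n) print n[c], c }' |
        sort -k 2        # order by codon; awk's for-in order is unspecified
}

tally_codons "$fasta"
```

For the sample above this prints `3 aac`, `1 cag`, `1 gtt`, and `2 tgt`, one pair per line.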
