Description
Files you need to read:
ecoli.fasta : The data file for your submission.
short.fasta : A data file to debug your code. Do not use for submission.
Files you need to submit:
Your filenames need to start with ‘problem1’ or ‘problem2’. For each problem, submit:
- A properly commented and PEP8-compliant Python file for the module.
- A properly commented and PEP8-compliant Python file for the script.
- A PDF of the terminal output using the print icon in the lower left corner in PyCharm.
- One screenshot of your terminal showing the git commands you typed.
- One screenshot of your internet browser showing your Python files in GitHub.
Problem 1. Count the Bases (50 points)
Write a module with a function that takes a filename and tests whether the file extension is .fasta. If it is, the function prints the number of As, Cs, Gs, and Ts found in the sequence. If it is not, the function raises an exception. Print each upper-case letter followed by a single space and its count.
Write a script that uses this module and shows the number of bases in ecoli.fasta in the terminal. If you process short.fasta instead, you should see the following in your terminal:
A 47
C 27
G 31
T 49
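The steps in Problem 1 can be sketched as follows. This is a minimal illustration, not a reference solution: it assumes that header lines start with ">", that any character other than A, C, G, or T is ignored, and that ValueError is an acceptable exception for a wrong extension. The names count_bases and print_base_counts are hypothetical.

```python
"""Sketch of a base-counting module (all names are hypothetical)."""
import os


def count_bases(filename):
    """Return a dict mapping A, C, G and T to their counts in a fasta file.

    Raises ValueError if the file extension is not .fasta
    (the choice of ValueError is an assumption).
    """
    extension = os.path.splitext(filename)[1]
    if extension != ".fasta":
        raise ValueError(f"Expected a .fasta file, got: {filename}")

    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    with open(filename) as fasta_file:
        for line in fasta_file:
            # Header lines start with ">" and are not part of the sequence.
            if not line.startswith(">"):
                for base in line.strip().upper():
                    if base in counts:
                        counts[base] += 1
    return counts


def print_base_counts(filename):
    """Print one line per base: the letter, a single space, the count."""
    for base, count in sorted(count_bases(filename).items()):
        print(base, count)
```

A script would import the module and call print_base_counts("ecoli.fasta"). Because print separates its arguments with one space by default, each line has the form shown in the sample output above.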
Problem 2. Count Bigrams (50 points)
Without reusing the code you wrote for Problem 1, write a new module and a new script. Your code needs to show the number of pairs of DNA bases in ecoli.fasta in the terminal using the function printDigrams provided in the lecture slides. You might want to practice using short.fasta, but your final submission needs to use ecoli.fasta. The second line of the file is
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
which has the pairs AG, GC, CT, TT, TT, TT, TC in order, and so on.
Create a dictionary that maps each two-letter string to the number of times it appears. For example, AA appears 338006 times in the file ecoli.fasta.
If you count the pairs of bases in the sample file short.fasta and print the result, AA appears 18 times, AG appears 9 times, and so on, as in the table below.
In your terminal, include a short description with each output, e.g. “The result is:”. The output should not exceed 100 lines.
   A  G  C  T
A 18  9  8 12
G  7  6  9  9
C  7  3  3 13
T 14 13  7 15
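The dictionary described above can be built as sketched below. This is one possible approach under stated assumptions, not the method from the lecture: header lines are assumed to start with ">", and pairs are counted across line breaks by joining all sequence lines first. That choice is consistent with the sample table, which totals 153 pairs for the 154 bases in short.fasta. The names read_sequence and count_pairs are hypothetical, and printDigrams is not shown because its signature is defined in the lecture slides.

```python
"""Sketch of a pair-counting module (all names are hypothetical)."""


def read_sequence(filename):
    """Return the sequence in a fasta file as one upper-case string."""
    with open(filename) as fasta_file:
        # Skip header lines, which start with ">", and join the rest.
        return "".join(
            line.strip().upper()
            for line in fasta_file
            if not line.startswith(">")
        )


def count_pairs(sequence):
    """Return a dict mapping each two-letter string to its count."""
    pair_counts = {}
    for index in range(len(sequence) - 1):
        # Overlapping window of two bases starting at this index.
        pair = sequence[index:index + 2]
        pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return pair_counts
```

For the first eight bases of the line quoted above, count_pairs("AGCTTTTC") counts AG, GC, CT, and TC once each and TT three times.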
Rubric (50 points per problem):
– (10 points) Your submission includes all files listed under ‘Files you need to submit’. Files have meaningful names and the content matches the filename.
– (10 points) The code reads the correct data, not the sample data, and generates the correct results in the terminal.
– (10 points) Exceptions and explicit error messages are used to cover common error cases, e.g. trying to read a file that doesn’t exist or trying to read a file with the wrong extension.
– (10 points) Code is commented and PEP8 compliant. Variable names are meaningful. Every module and function has a meaningful docstring.
– (10 points) Concepts covered in the last lecture are used. Unnecessary structures, global variables, hard-coded values, break, and continue are not used. Code, results, and test results are easy to read.
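The error cases named in the rubric can be handled along the following lines. This is a hedged sketch only: the function name describe_read_error, the wording of the messages, and the use of ValueError for a wrong extension are all assumptions, not requirements of the assignment.

```python
"""Sketch of explicit error messages for common file errors."""


def describe_read_error(filename):
    """Return an explicit error message, or None if the file can be read.

    The name and the message wording are hypothetical.
    """
    try:
        if not filename.endswith(".fasta"):
            raise ValueError(f"{filename} does not have the .fasta extension")
        with open(filename) as fasta_file:
            # Reading one line is enough to confirm the file opens.
            fasta_file.readline()
    except FileNotFoundError:
        return f"Error: the file {filename} does not exist."
    except ValueError as error:
        return f"Error: {error}."
    return None
```

Catching specific exception types, rather than a bare except, lets each error case produce its own message.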