Sequence alignment

OBJECTIVE: Given two DNA files, one (view file) contains 100,000 short sequence reads (30~50bp) from a single anonymous donor, the other (view file) has a template DNA sequence for human chromosome 13. Identify the best match along chromosome 13 for each short read sequence.

STRATEGY:

Summary: Double hashing. List all occurrences of each 13-letter word for chrom13 sequence. Process one short sequence at a time using hash search and smith waterman alignment algorithm.

Detailed explanation:

1. Initialization and indexing. Count number of characters in chrom13 file using charCount. Determine a good hash table size using hashCal. For chrom13, the two numbers are 114,142,980 and 201,326,611, respectively. Create and initialize four arrays (these count for most of the memory usage):

 

char charTable[num of chars for chrom13]

int intTable[num of chars for chrom13]

int hashTable[hash table size]

int locTable[hash table size]

 

charTable stores each character in the DNA sequence in uppercase. Each character is converted to a small number using encode and stored in intTable. Calculate hash value for each 13-letter word using hashFunc and insert into hashTable using insert. When a value is inserted, location of the first character in charTable is inserted into locTable. A maximum of 50 (COLLISION_THRESHOLD) collisions is allowed if repeat value occurs.

2. Short sequence searching. Create and initialize three arrays:

char estName[MAX_EST_NAME_LENGTH] // store the name of the EST

char *estCharTable = new char[MAX_EST_LENGTH] // store forward EST sequence

char *rc_estCharTable = new char[MAX_EST_LENGTH] // store reverse complement EST sequence

 

Go through the short sequence file and count the number of short sequences using estCount. Both forward and reverse complement short sequences will be examined.

EST searching and imperfect matches handling:

First search 1~13 characters in forward and reverse complement direction, if found then perform smith-waterman local pairwise alignment between short sequence and its corresponding chrom13 sequence, if percentage of matches is bigger than 95 (MATCH_PERCENT_THRESHOLD) and gap number is smaller than 3 (GAP_NUM_THRESHOLD) then this counts for a ‘found’ or ‘mapped’, then alignment and mapped location are printed out. When printing, reverse complement sequence is indicated as “*rc_EST”. Mismatches are in lower case. hashSearch also handles repeats under COLLISION_THRESHOLD.

If first attempt failed, proceed with 2~14 characters and 3~15, 4~16… until found. If not found then just print out short sequence name and length.

code snippets1

 

*3. Calculate and display top repeats. This function (and four other auxiliary functions) is for calculating top repeats by double hashing for displaying some statistics of genome information. The number of repeats is defined by REPEATNUM.

However, the top repeats information is independent and not implemented in indexing or short sequence finding.

Top repeats

RESULT:

Output files (reads1000.txt), (reads100000.txt) and code (sequence.cpp) are available at http://www-personal.umich.edu/~yuliang/biostat615/ .

check function evaluation times (number of probes) : 1,936,052,712. Indexing time is 156.46 seconds (CPU: 2.4G Intel Core 2, RAM: 4GB 1067 MHz DDR3).

96% short sequences are found and mapped to at least one location. 35,000 short sequence reads per minute are processed on average.

Drawbacks and further improvements:

  1. Computationally expensive. Needs about 2GB of memory.
  2. Does not “learn” from the template file. Can calculate top “repeat words” but does not implement that information.
  3. Missed some short sequence reads. Possibly because of setting a collision threshold (COLLISION_THRESHOLD).
  4. Only tracks a maximum of 100 (2*COLLISION_THRESHOLD) repeats for each short sequence. Also due to collision threshold.
  5. Short sequence mapping efficiency is low (35,000 reads/min).

Sample output:

  1. Perfect matches (forward and reverse complement at two locations)

  1. Imperfect matches (with gaps and mismatch)

 

List of functions:

code snippets2

 

 

 


  Technology used: