October 1996 Programmer's Challenge
DNA Match
Mail solutions to:
progchallenge@mactech.com
Due Date: 1 October 1996
This month's Challenge is based on a suggestion submitted by
Vicente Giles of the Universidad de Málaga. Vincente faces a
real-world problem to look for all the genomic sequences that match
certain criteria, given a DNA database sequence and a problem
sequence. A DNA sequence is a string of the four different
nucleotides involved in the genetic code, denoted 'A', 'C', 'G', and
'U', which stand for adenine, cytosine, guanine, and uracil. The
problem is to find all possible matches of the problem sequence in
the database sequence, allowing a specified number of differences.
The prototype for the code you should write is:
long FindMatch(
char *alphabet, /* legal characters for database and fragment strings */
char *database, /* reference string to compare fragments against */
void *storage, /* storage preallocated for your use */
char *fragment, /* string to match against database */
long diffsAllowed, /* differences allowed between fragment and database */
long matchPosition[] /* return match positions in this array*/
);
void InitMatch(
char *alphabet, /* legal characters for database and fragment strings */
char *database, /* reference string to compare fragments against */
void *storage /* storage preallocated for your use */
);
Because we would like our DNA-matching algorithm to be useful even
if scientists discover an extraterrestrial genetic code based on
other nucleotides, the algorithm accepts the genetic
alphabet as a parameter. In the problem posed by Vincente,
this would be the string "ACGU", but in our Challenge it might
include any of the characters 'a'..'z' or 'A'..'Z'. (Extraterrestrial
DNA is case sensitive.) The null-terminated reference string
contained in the database parameter can be up to 1000000000
(10**9) characters long. The fragment that you are to match
is also null-terminated, but will be significantly shorter on average
(up to 10000 characters) than the database string. You
should compare the input fragment against database,
finding all occurrences of fragment that differ in no more
than diffsAllowed positions from a substring of
database. Your code should populate one entry in the
preallocated matchPosition array for each match found,
storing the offset of the character in database that
corresponds to the first character of fragment. The
FindMatch function should return the number of matches
found.
As an example, given the following input ...
alphabet: ACGU
database: ACGTACGTACGTAAAAAATACGTACGTATA
fragment: ACGTACGTAC
diffsAllowed: 5
... your code should find 7 matches and store the following values
in matchPosition:
-4 0 4 8 15 19 23
Notice that partial matches can occur at the beginning or the end
of database, and as a result, the offsets returned in
matchPosition can be negative or greater than
strlen(database) - strlen(fragment).
To allow you to do some preprocessing, your InitMatch routine will
be called once before a sequence of calls to FindMatch.
InitMatch will be called with the same alphabet and
database parameters provided to subsequent
FindMatch calls. Both routines will also be given the same
storage parameter that points to at least 1MB of memory
allocated and initialized to zero by the calling routine.
FindMatch will be called between 100 and 1000 times, on
average, for each call to InitMatch. The winning solution will be the
one with the fastest execution time, including the execution time for
both InitMatch and FindMatch.
Other fine print: The alphabet characters will be
provided in increasing ASCII order. The offsets you store in
matchPosition need not be in any particular order. The value
for diffsAllowed will typically be smaller than 50% of
strlen(fragment). And finally, you should not allocate any
dynamic storage in your solution beyond that provided in the
storage parameter.
This will be a native PowerPC Challenge, using the latest Symantec
environment. Solutions may be coded in C or C++.
|