The journal of Macintosh technology

Magazine In Print

About MacTech

Home Page

Archives DVD

Submit News

MacTech Forums

Get a copy of MacTech RISK FREE

MacTech Only Search:
Community Search:

MacTech Central

by Category

by Company

by Product

MacTech News

MacTech News

Previous News

MacTech RSS

Article Archives

Show Indices

by Volume

by Author

Source Code FTP

Inside MacTech

Writer's Kit

Editorial Staff

Editorial Calendar

Back Issues

Advertising

Contact Us

Customer Service

MacTech Store

Legal/Disclaimers

Webmaster Feedback

Get Netflix

October 1996 Programmer's Challenge

DNA Match

Mail solutions to: progchallenge@mactech.com

Due Date: 1 October 1996

This month's Challenge is based on a suggestion submitted by Vicente Giles of the Universidad de Málaga. Vincente faces a real-world problem to look for all the genomic sequences that match certain criteria, given a DNA database sequence and a problem sequence. A DNA sequence is a string of the four different nucleotides involved in the genetic code, denoted 'A', 'C', 'G', and 'U', which stand for adenine, cytosine, guanine, and uracil. The problem is to find all possible matches of the problem sequence in the database sequence, allowing a specified number of differences.

The prototype for the code you should write is:

long FindMatch(
  char *alphabet,       /* legal characters for database and fragment strings */
  char *database,       /* reference string to compare fragments against */
  void *storage,        /* storage preallocated for your use */
  char *fragment,       /* string to match against database */
  long diffsAllowed,    /* differences allowed between fragment and database */
  long matchPosition[]  /* return match positions in this array*/
);
 void InitMatch(
  char *alphabet,        /* legal characters for database and fragment strings */
  char *database,        /* reference string to compare fragments against */
  void *storage          /* storage preallocated for your use */
);

Because we would like our DNA-matching algorithm to be useful even if scientists discover an extraterrestrial genetic code based on other nucleotides, the algorithm accepts the genetic alphabet as a parameter. In the problem posed by Vincente, this would be the string "ACGU", but in our Challenge it might include any of the characters 'a'..'z' or 'A'..'Z'. (Extraterrestrial DNA is case sensitive.) The null-terminated reference string contained in the database parameter can be up to 1000000000 (10**9) characters long. The fragment that you are to match is also null-terminated, but will be significantly shorter on average (up to 10000 characters) than the database string. You should compare the input fragment against database, finding all occurrences of fragment that differ in no more than diffsAllowed positions from a substring of database. Your code should populate one entry in the preallocated matchPosition array for each match found, storing the offset of the character in database that corresponds to the first character of fragment. The FindMatch function should return the number of matches found.

As an example, given the following input ...

  alphabet:    ACGU
  database:    ACGTACGTACGTAAAAAATACGTACGTATA
  fragment:    ACGTACGTAC
  diffsAllowed:  5

... your code should find 7 matches and store the following values in matchPosition:

  -4 0 4 8 15 19 23

Notice that partial matches can occur at the beginning or the end of database, and as a result, the offsets returned in matchPosition can be negative or greater than strlen(database) - strlen(fragment).

To allow you to do some preprocessing, your InitMatch routine will be called once before a sequence of calls to FindMatch. InitMatch will be called with the same alphabet and database parameters provided to subsequent FindMatch calls. Both routines will also be given the same storage parameter that points to at least 1MB of memory allocated and initialized to zero by the calling routine. FindMatch will be called between 100 and 1000 times, on average, for each call to InitMatch. The winning solution will be the one with the fastest execution time, including the execution time for both InitMatch and FindMatch.

Other fine print: The alphabet characters will be provided in increasing ASCII order. The offsets you store in matchPosition need not be in any particular order. The value for diffsAllowed will typically be smaller than 50% of strlen(fragment). And finally, you should not allocate any dynamic storage in your solution beyond that provided in the storage parameter.

This will be a native PowerPC Challenge, using the latest Symantec environment. Solutions may be coded in C or C++.

Spread the word. Add to:

Slashdot

Digg

del.icio.us

newsvine

Y! MyWeb

Click on the cover to see this month's issue!

TRIAL SUBSCRIPTION Get a RISK-FREE subscription to the only technical Mac magazine!

Today's Deal

Apple Special Order Snow Leopard, Mac Box Set, Family Pack, and Snow Leopard Server at a discount.

MacTech is a registered trademark of Xplain Corporation. Xplain, Video Depot, Movie Depot, Palm OS Depot, Explain It, MacDev, MacDev-1, THINK Reference, NetProfessional, NetProLive, JavaTech, WebTech, BeTech, LinuxTech, Apple Expo, MacTech Central and the MacTutorMan are trademarks or service marks of Xplain Corporation. Sprocket is a registered trademark of eSprocket Corporation. Other trademarks and copyrights appearing in this printing or software remain the property of their respective holders.

October 1996 Programmer's Challenge

DNA Match

Today's Deal

Apple Special