Item – Theses Canada

OCLC number
1335713641
Link(s) to full text
LAC copy
Author
Steeg, Evan W.
Title
Automated motif discovery in protein structure prediction.
Degree
Ph.D. -- University of Toronto, 1997.
Publisher
[Toronto, Ontario] : University of Toronto, 1997
Description
1 online resource
Abstract
The protein structure prediction problem (PSP) is one of the central problems in molecular and structural biology. A computational method that could produce a correct detailed three-dimensional structural model for a protein, given its linear sequence of amino acids, would greatly accelerate progress in the biomedical sciences and industries. This thesis presents PSP as a combinatorial optimization problem, the most straightforward formulations of which require search of an exponentially-large conformation space and are known to be NP-Hard. This otherwise intractable search can in practice be reduced or eliminated through the discovery and use of motifs. Motifs are abstractions of observed patterns that encode structurally important relationships among constituent parts of a complex object like a protein tertiary structure. Motif discovery is accomplished by particular combinatorial search and statistical estimation methods. This thesis explores in detail two particular motif discovery subproblems, and discusses how their solutions can be applied to the overall structure prediction problem: (1) For a complex multi-stage prediction task, what makes a good intermediate representation language? We address this question by presenting and analyzing methods for the discovery of protein secondary structure classes that are more predictable from amino acid sequence than the standard classes of $\alpha$-helix, $\beta$-sheet, and "random coil". (2) Given a database of M objects, each characterized by values $a\sb{ij}\in {\cal A}\sb{j}$ for each of N discrete variables $\{c\sb{j}\}\sbsp{j=1}{N},$ return the list of "most interesting" higher-order features $\gamma\sb{l},$ i.e., sets of $k\sb{l}$ variables with highest estimated correlation, for any $2 \le k\sb{l} \le N$. In the PSP context, the problem is the detection of correlations between amino acid residues in an aligned set of evolutionarily-related protein sequences. We present and analyze a fast procedure, based on multinomial sampling and a novel coding scheme, that avoids the exhaustive search, prior limits on the order k, and exponentially large parameter space of other methods. The focus of this thesis is PSP, but the techniques and analysis are also aimed at wider application to other hard, multi-stage prediction problems.
Other link(s)
hdl.handle.net
www.collectionscanada.ca
tspace.library.utoronto.ca