Jinyan Li, PhD

Associate Professor, School of Computer Engineering, Nanyang Technological University

Nanyang Avenue, Singapore 639798

**Tel:** (65) 67906253 (office)

**Office room:** Academic North N4, #02a-30

**Email: **jyli@ntu.edu.sg

## Biclique Subgraphs and Applications in Bioinformatics and Finance

In this presentation, i will give a definition for biclique subgraphs. Then, I move to talk about how this type of classical subgraph concept can be used to deal with bioinformatics and finance problems. One problem discussed here is the important problem of modeling binding hot spots in protein interactions; the other problem is a problem of stock portfolio construction.

A bipartite graph *G *is a graph whose vertices can be divided into two disjoint sets *V*1 and *V*2 such that no two vertices within *V*1 or within *V*2 are adjacent [1]. It can be denoted by *G *= (*V*1*; V*2*;E*), where *E*is the set of edges of *G*. For a bipartite graph *G *= (*V*1*; V*2*;E*), we are interested in the complete set of maximal biclique subgraphs from *G*. A biclique subgraph *H *of *G *is a graph consisting of two sets of vertices *X*1 *µ V*1 and *X*2 *µ V*2 such that every vertex in *Xi *is adjacent to all vertices in *Xj *where *j 6*= *i*. A biclique subgraph *H *is maximal in *G *if there is no other biclique in *G *that contains *H*. Maximal biclique subgraphs are identi¯ed through our earlier LCM-MBC algorithm [8]. The LCM-MBC algorithm has two input parameters *p *and *q *that can control the minimum number of vertices in each side of a maximal biclique.

A protein binding interface composes of two relatively large, spatially close protein surfaces with good geometric shape and chemical complementarity. The formation of protein chain interfaces is driven by various natural forces such as van der Waals contacts and electrostatic interactions, resulting in the removal of water molecules from the binding sites [7, 10]. This water-removal-then-binding has been well characterized by the influential `O'-ring theory [2, 11, 5, 3, 9], which is also called `water exclusion' hypothesis. It highlights that the stability of a complex is determined by only a small number of energetically outstanding residues; It also reveals that these `hot spot' residues are usually located at the center of the interface and surrounded by energetically less important residues that shape like an O-ring to occlude bulk water molecules from the hot spot. A lab-verified example of the O-ring theory was reported in the pioneering work [4] in which alanine-scanning mutagenesis was used to determine the binding hot spots between human-growth hormones and human-growth hormone-binding proteins. The O-ring theory is profounding. However, the organizational topology of the ring-inside, energetically more important hot residues is uncertain and not specified by the O-ring theory.

To refine this long-standing theory, we suggest a `double water exclusion 'hypothesis to characterize the topological organization of residues in a hot spot and their neighboring residues. On one side, we agree with the O-ring theory that there should exist a ring of residues surrounding the hot spot for avoiding the invasion of water molecules after the complex formation; On the other hand, we suppose that the hot spot itself is water-free. The water-free assumption may be too strict, so, our another hypothesis is whether the amount of water molecules inside a hot spot is proportional to its phylogenetic evolution progression towards to the perfect water-free binding. In this work, we focus on the investigation of the `double water exclusion' hypothesis only, while the latter (and bigger) hypothesis is left as an open question for interested readers and ourselves in future research.

We propose to use contact residues that are *densely *interacted in a compact region to model a hot spot. (Here, a pair of residues contact to each other if there exists a pair of atoms whose distance is below the sum of their corresponding van der Waals radii plus the diameter (2.75ºA) of a water molecule.) We take the classical graph term `biclique' [6] to denote this molecular topology, and call it `biclique pattern'. Formally, a biclique pattern between two chains in a protein complex consists of two *maximal *groups of residues (one from each chain) such that every residue contacts with all residues in the other group. Thus, if a biclique pattern is observed from a protein complex, then all possible pairs of the residues from the two sides are spatially so close that there is no sufficient room between them to accommodate any water molecule. That is, this is an interface structure exhibiting a zero-water tolerance property and forming a collective force of multiple, dense atom-atom pairs. Such a structure is therefore useful for lowering the local dielectric constant and enhancing specific electrostatic and hydrogen bond interactions to strengthen the stability of the binding. On the other hand, by the maximality requirement of biclique pattern, any neighbor residue at one side of the pattern disassociates with at least one residue at the other side. This makes the neighbor residues flexible to form a fence preventing the biclique pattern residues from solvent partially or entirely. Thus, if every residue in a biclique pattern has a small SASA in the complex,

then this biclique pattern reflects best the spirit of `double water exclusion'-all the inner hot residues are organized as a biclique without holding any water molecule, while there exist a ring of residues blocking the solvent accessibility to the hot spot.

We found a total of 1293 biclique patterns which have a non-redundant occurrence of at least 5, and which each have a minimum 2 and 4 residues at the two sides. Through extensive queries to the HotSprint and ASEdb databases, we verified that biclique patterns are rich of true hot residues. Our algorithm and results provide a new way to identify hot spots by examining proteins' structural data.

Our second application of the biclique idea is for the co-clustering of financial ratios with financial stocks in the aim to a balanced portfolio construction. This is an unsupervised process to co-cluster groups of stocks and

financial ratios. Investors can gain more insight on how they are correlated.

The traditional approach of value investment is of linear query-style. A number of financial ratios are preselected, and some thresholds are subjectively set for these financial ratio values. Then stocks satisfying these conditions are grouped together for further analysis. However, this process has a weakness; it is hard to know which financial ratios should be selected and what thresholds should be set to find correlated stocks. Exhaustive combinations of different financial ratios and their thresholds make this decision process a daunting task for investors.

Our unsupervised graph-based data mining method is to co-cluster stocks and financial ratios. The stocks are clustered based on their similarities in financial ratios and concurrently, these financial ratios are implicitly clustered according to their occurrences in the stocks. Our co-clustering patterns can be interpreted by investors to understand which clusters of financial ratios and at what levels of thresholds they are correlated to clusters of stocks.

We model stocks and their financial ratio values by a bipartite graph. We use one side of the vertices in the bipartite graph to represent the stocks, the other set to represent the financial ratio values. As the financial ratio values are continuous, we use a discretization technique to divide financial ratio values into intervals, and then each interval is represented by a vertex. An edge exists between a stock vertex s and a financial ratio vertex f, if the financial ratio value for the stock s falls in the interval f of this financial ratio. Maximal quasi-biclique subgraphs, our newly defined graph concept, are then used to co-cluster stocks and financial ratios from the bipartite graph.

We managed to access to 12 financial ratios belonging to 470 stocks from S&P 500, for the year 2001. An interesting co-cluster discovered by our algorithm is about stocks AVP, CLX, MKC, PG and financial ratios ROA (7.99, 8.27), Dividend Yield (1.45, 2.48), PE (19.69, 21.7), Price/Sales (0.16, 2.4). We found that all the price performances of the stocks increased for the year 2002, unlike the S&P 500 Index. From this co-cluster, investors gained two types of information; firstly, investors can buy this cluster of stocks to diversify theirportfolios, secondly, this cluster of financial ratios can be considered as an useful pattern to find good performing stocks.

### Short Bio

Dr. Li Jinyan is an associate professor at School of Computer Engineering, Nanyang Technological University. His research interests include pattern mining, graph mining, machine learning, and computational biology. He is known for his pioneering work on emerging patterns, as well as the development of several innovative classification methods (PCL, CS4, and DeEPs). He was a co-winner of the 2003 Asia Innovation Award for his contributions in developing a single-test platform for accurate prediction of outcome in pediatric acute lymphoblastic leukemia. He serves on the editorial board of the International Journal of Data Mining and Bioinformatics (IJDMB), and on the PC of many data mining conferences (ICML, KDD, SDM, etc).