What can you do with StatEpigen



1. Database Provision and Querying

- Introduction

- Simple and Conditional Molecular Events

- Querying StatEpigen

2. Data Analysis

- Data visualisation and integration tools

- Graphical Data visualisation using Cytoscape

3. Future Work

Database Provision and Querying


The data are stored in a MySQL database, following a relational schema generated according to an expert assessment of the data structures encountered in the literature. Efforts are made to support extension and refinement of the resource, to encompass additional details as new data are published. The goal of this project is an integrative one: to provide high-quality identification and storage of pathological molecular determinants by bringing together, for common analysis, data obtained on different platforms (epigenetic, genetic, mutation and others). The database is therefore designed to store a variety of distinct data types and can evolve to store new types of data as they emerge. This is achieved through StatEpigen's own internal classification system, which facilitates both curation and querying. Cross-references with GO, ENSEMBL and PubMed have been established. Work is in progress to connect StatEpigen to other relevant resources.

The core of the data model is the molecular event (epigenetic or genetic). We are interested in how epigenetic events (such as CpG island hyper- and hypomethylation, various histone modifications, loss of heterozygosity and others) are correlated with:

- each other.

- other molecular events such as gene expression, various types of mutations and polymorphisms.

- more complex molecular signatures such as MSI (Microsatellite Instability), CIMP (CpG Island Methylator Phenotype) and others.

For this reason, the user can mainly query the database for two types of records:

- Simple Frequencies of Molecular Events (or Simple Molecular Events).

- Conditional Frequencies of Molecular Events (or Conditional Molecular Events).

Simple and Conditional Molecular Events


The events can form associations or correlations with various pathologies (such as cancers and intermediary stages of cancers). These associations form the basis of the two main types of results that one can retrieve using StatEpigen.

- Simple Frequencies of Molecular Events: such a record gives the incidence of a molecular event given a phenotype.

- Conditional Frequencies of Molecular Events: such a record gives the incidence of a molecular event (event 2) given a phenotype and the fact that another molecular event (event 1) takes place in all analysed samples.

A Simple Molecular Event record contains information about the molecular event itself: its name, the name of the gene (if the event is related to a gene) and its specification, which gives more details about the event. The record also gives statistical information about the event, such as the number of samples analysed, the incidence of the event in the samples, any qualitative or quantitative information available on the event, and phenotype-related information. The status of the event refers to the qualitative information available on the event and is determined by the assay described in the reference. For example, a paper may only verify the presence or absence of the event, or it may give qualitative details on whether its intensity is high or low. Other papers go further and give quantitative details on the analysed molecular events. This is handled by the fields "Quantified" and "Units", which give the quantification of the event and its units (which depend on the assay). Finally, the record contains fields describing the phenotype of the samples and the PubMed ID of the reference.

Consider a simple example: APC methylation in sporadic carcinoma. Go to the "Advanced Search" option, choose "sporadic" in the "Origin" tab and "carcinoma" in the "Histology" tab. Assume we want to include all subhistologies and are not interested in a study according to clinicopathological factors. We therefore skip tabs 3 and 4 and go directly to tab 5. Here we choose the APC gene, tick the box "CpG island Hypermethylation" and click on the "Results" tab. First look at the Simple Molecular Events retrieved by the system. If APC gene promoter methylation was observed in 50% of 62 carcinoma samples, a Simple Molecular Event record contains the following main fields (first record from the table):

From the point of view of conditional probabilities, this can be written as P(APC promoter methylation = YES | Phenotype = carcinoma), where APC promoter methylation and phenotype are variables which can take different values. The status of the molecular event is "YES" because the promoter methylation is only known to be present; the authors have not used an assay to verify to what extent the promoter is methylated. For a highly or totally methylated promoter, the status of the event is "HIGH" or "TOTAL", respectively. The results table also contains two records where a quantitative assessment of the methylation has been performed. In one of these two cases, all samples are methylated, but the quantitative assay shows that the average percentage of methylation is very low. We do not know how many of the samples are methylated highly enough to be included in further analysis. This is why the box for this particular record is not ticked.
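As a rough illustration of the fields described above, a Simple Molecular Event record could be modelled as follows. This is only a sketch: the class and field names are hypothetical and do not reflect StatEpigen's actual schema, and the PubMed ID is left empty rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleEventRecord:
    """Hypothetical in-memory view of one Simple Molecular Event record."""
    event_name: str                      # e.g. "CpG island hypermethylation"
    gene: Optional[str]                  # None when the event is not tied to a gene
    specification: str                   # further detail about the event
    status: str                          # qualitative status: "YES", "HIGH", "TOTAL", ...
    n_samples: int                       # number of samples analysed
    n_with_event: int                    # samples in which the event was observed
    phenotype: str                       # phenotype of the samples
    pubmed_id: Optional[str] = None      # reference (left empty in this sketch)
    quantified: Optional[float] = None   # "Quantified" field, if the assay gives one
    units: Optional[str] = None          # "Units" field, depends on the assay

    @property
    def frequency(self) -> float:
        """Incidence of the event in the analysed samples."""
        return self.n_with_event / self.n_samples


# The APC example: methylation observed in 50% of 62 carcinoma samples.
record = SimpleEventRecord(
    event_name="CpG island hypermethylation",
    gene="APC",
    specification="promoter",
    status="YES",
    n_samples=62,
    n_with_event=31,
    phenotype="carcinoma",
)
print(f"{record.gene} {record.event_name}: {record.frequency:.0%} of {record.n_samples}")
```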

A Conditional Molecular Event record is very similar to a Simple Molecular Event record, but it describes only a subpopulation of the population described by a Simple Molecular Event record. Like a Simple Molecular Event record, a Conditional Molecular Event record gives information about the event (Event 2), qualitative and/or quantitative data about it, the number of samples analysed, the frequency of the event (Event 2), information about the phenotype of the samples and the PubMed ID of the reference. The only difference is that the record also gives information about Event 1, which is known to be present in all analysed samples. For example, assume that we want to find out whether the frequency of APC promoter hypermethylation is biased by other molecular signatures in the cell. For this we go back to tab 5, choose the APC gene and include all molecular events in the study. To include all cases in the search, do not tick any of them. Go to tab 6 (Results), click on Conditional Molecular Events and sort by Event 2 (shift-click on all 4 columns describing Event 2). In the results, we can see that colon carcinoma samples have been tested for APC promoter hypermethylation according to LOH on the APC gene. Of all samples characterised by LOH of 5q, 36.3% are found to be hypermethylated at the APC promoter. Here, the event which takes place in all samples is APC LOH (Event 1) and the event whose frequency is analysed is APC promoter hypermethylation (Event 2):

This can be written as P(APC promoter methylation = YES | APC LOH = YES and Phenotype = colon carcinoma). The variable phenotype can take other values such as colon adenoma, colon polyps, normal colon and others.
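To make the difference between the two probabilities concrete, the sketch below estimates both from a small per-sample table. The sample data are invented purely for illustration (StatEpigen stores frequencies reported in publications, not per-sample tables), and the function names are hypothetical.

```python
from typing import Dict, List

# Invented per-sample observations for a single phenotype (illustrative only).
samples: List[Dict[str, bool]] = [
    {"apc_loh": True,  "apc_methylation": True},
    {"apc_loh": True,  "apc_methylation": False},
    {"apc_loh": True,  "apc_methylation": True},
    {"apc_loh": False, "apc_methylation": False},
    {"apc_loh": False, "apc_methylation": True},
    {"apc_loh": False, "apc_methylation": False},
    {"apc_loh": True,  "apc_methylation": False},
    {"apc_loh": False, "apc_methylation": False},
]


def simple_frequency(rows, event):
    """Estimate P(event = YES | phenotype): share of all samples with the event."""
    return sum(r[event] for r in rows) / len(rows)


def conditional_frequency(rows, event_1, event_2):
    """Estimate P(event_2 = YES | event_1 = YES, phenotype).

    Only samples in which event_1 is present are counted.
    """
    subset = [r for r in rows if r[event_1]]
    if not subset:
        raise ValueError("event_1 was not observed in any sample")
    return sum(r[event_2] for r in subset) / len(subset)


p_simple = simple_frequency(samples, "apc_methylation")
p_cond = conditional_frequency(samples, "apc_loh", "apc_methylation")
print(f"Simple: {p_simple:.3f}, conditional on LOH: {p_cond:.3f}")
```

The Simple frequency is computed over all samples of the phenotype, whereas the Conditional frequency is computed only over the subpopulation in which Event 1 is present.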

Does LOH on 5q influence APC promoter hypermethylation in any way? Do the results on Simple Events and Conditional Events suggest a conclusion? Statistical tests can be applied to these results to answer these questions. StatEpigen offers integration tools to manage multiple records (see Data visualisation and integration tools); statistical tools to analyse the results directly from the screen are currently in development.

Querying StatEpigen


A number of querying facilities are available in StatEpigen. You can query StatEpigen in a gene/molecular event-centred manner, in a phenotype-centred manner (see the left vertical menu) or both (option "Advanced Search").

There are a few different options for performing phenotype-centred StatEpigen querying.

- According to most frequent histologies and subhistologies.
- According to clinicopathological factors.
- According to cell lines.

StatEpigen is a phenotype-focused resource, hence genetic and epigenetic data are available for a large range of phenotypes. The option "Advanced Search" covers all possibilities. The option "By Histology and Subhistology" gives a selection of the most frequent phenotypes, to speed up phenotype-focused searches. When a phenotype-centred query according to clinicopathological factors is performed, data which do not contain clinicopathological factors are excluded, to prevent confusion. Data on more than 100 colon cancer cell lines are available in StatEpigen. Each cell line is treated as a distinct phenotype.

In order to submit a query, click the button "Select Filter Value". The interface will display the number of retrievable records for both Simple and Conditional Events.

The option "Advanced Search" is an interface allowing selection among all phenotypes (except specific cell lines) as well as among genes and events. The tabs offer easy navigation among the different filter possibilities. To submit an "Advanced Search" query, click the tab "Results". The interface will display the number of records for both Simple and Conditional Events. If more filtering is of interest, go back to the option tabs to make additional choices or to modify the initial choices. Clicking the "Results" tab again will display the number of available records corresponding to the newest filtering choice.

Data Analysis

Data visualisation and integration tools

After submitting a query and reading the number of available records, one can choose to visualise them. The buttons "Result Set I" and "Result Set II" serve this purpose. Once a results table is displayed, it can be sorted by one column or by multiple columns. Say one would like to see the results sorted by gene: simply click on the header of the "Gene" column and the table will be sorted by gene, in ascending or descending order. To also see the molecular events available for each gene in alphabetical order, sort by gene and then by event name: sort the genes first and then, keeping the Shift key pressed, click on the header of the column "Name Event". One can continue in this manner to sort by Specification, Status etc.
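The shift-click behaviour corresponds to sorting on several keys in order of priority. A minimal sketch of the same idea, using invented rows and the two column headers named above:

```python
# Invented rows; only the column headers "Gene" and "Name Event" come from the text.
rows = [
    {"Gene": "MLH1", "Name Event": "CpG island hypermethylation"},
    {"Gene": "APC", "Name Event": "LOH"},
    {"Gene": "APC", "Name Event": "CpG island hypermethylation"},
    {"Gene": "MLH1", "Name Event": "Expression"},
]

# Sort by gene first, then by event name within each gene.
sorted_rows = sorted(rows, key=lambda r: (r["Gene"], r["Name Event"]))

for r in sorted_rows:
    print(r["Gene"], "|", r["Name Event"])
```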

In some cases, it may be useful to discard some of the retrieved records from further analyses (see the example from Simple and Conditional Molecular Events). This can be achieved by unselecting the undesired rows, leaving only the rows of interest selected. To save your selection, click on "Confirm Selected Records". All subsequent manipulations of the data will use the selected records only.

For example, the selected records in both the Simple and Conditional Events tables can be downloaded by clicking on the "Download Conformed Rows" button. This button is available only after a row selection has been confirmed.

The two types of molecular events held in StatEpigen have been presented in Section 1. Returning to the example from Section 1 (Simple and Conditional Molecular Events), there is usually more than one reference reporting the frequency of the same molecular event in the context of the same phenotype. By combining the frequencies given by several references (where it is relevant to do so, that is, if the chosen records are homogeneous), it is possible to refine the probability characterising a particular molecular event in a particular phenotype. Records giving the intensities of molecular events quantitatively are discretised if they are amenable to it.
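One simple way to combine homogeneous records is to pool the counts, summing the event-positive samples and the analysed samples across references. The sketch below assumes this pooling rule for illustration; it is not confirmed to be the exact method used by StatEpigen. The first count pair is the APC example (31 of 62); the others are invented.

```python
from typing import List, Tuple


def pool_frequencies(records: List[Tuple[int, int]]) -> Tuple[int, int, float]:
    """Pool (n_with_event, n_samples) pairs from several references.

    Returns total positives, total samples and the pooled frequency.
    Pooling only makes sense when the records are homogeneous
    (same event, same phenotype, comparable assays).
    """
    total_pos = sum(pos for pos, _ in records)
    total_n = sum(n for _, n in records)
    if total_n == 0:
        raise ValueError("no samples to pool")
    return total_pos, total_n, total_pos / total_n


# (positives, samples) per reference; the second and third pairs are invented.
records = [(31, 62), (12, 40), (9, 18)]
pos, n, freq = pool_frequencies(records)
print(f"pooled: {pos}/{n} = {freq:.3f}")
```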

Conditional Molecular Events can likewise be reported in more than one reference, so the statistics on each Conditional Event can be refined in the same way. The integrated information on a given database search can be accessed by clicking on the button "View Statistics". The page which opens displays a table with all distinct phenotypes in the selected records:

To visualise the integrated molecular event information available for the displayed phenotypes, click directly on the button "Print Statistics". One can decide that two or more distinct phenotypes from the records should be treated as a single phenotype, and the data integrated accordingly. This can be done by selecting these phenotypes from the phenotypes table and clicking on "Confirm Phenotypes to Merge". In this example, we decide that all carcinomas, independently of differentiation, will be treated as a single phenotype:

To compute the integrated data table, corresponding to the new phenotype definition, click again the button "Print Statistics".
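Conceptually, merging phenotypes amounts to relabelling the selected phenotypes and then integrating the counts per label. The following sketch uses invented records and a hypothetical mapping; it illustrates the idea rather than StatEpigen's internal procedure.

```python
from collections import defaultdict

# Invented records: (phenotype, n_with_event, n_samples).
records = [
    ("well differentiated carcinoma", 10, 25),
    ("poorly differentiated carcinoma", 6, 15),
    ("adenoma", 4, 20),
]

# User-chosen mapping from original phenotypes to a merged label.
merge_map = {
    "well differentiated carcinoma": "carcinoma",
    "poorly differentiated carcinoma": "carcinoma",
}

totals = defaultdict(lambda: [0, 0])  # label -> [positives, samples]
for phenotype, pos, n in records:
    label = merge_map.get(phenotype, phenotype)  # unmerged phenotypes keep their name
    totals[label][0] += pos
    totals[label][1] += n

integrated = {label: pos / n for label, (pos, n) in totals.items()}
print(integrated)
```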

Two tables will be displayed: one corresponding to Simple, and the second to Conditional Events. The following figure shows an extract from the integrated information table on the selected Simple Events characterising the APC gene.

The next table shows the integrated information for the Conditional Events characterising the APC gene. This table helps answer the questions asked earlier. For example, have a look at the first 2 rows. You can see that APC LOH appears more frequently with APC hypermethylation than by itself. Look at rows 7 and 8. APC methylation without APC LOH appears more rarely than APC methylation with APC LOH. This suggests a clear correlation between APC methylation and transcriptional loss. Going back to the two references involved, you will find that the conclusion of the authors is indeed that APC hypermethylation contributes to the loss of APC expression in colorectal cancers with allelic loss on 5q, as a second hit of expression loss after LOH. The tumours without APC LOH may be cancers with biallelic methylation [16336454].

The p-value displayed in the table is calculated by comparing the frequency of the Conditional Event to the frequency of the Simple Event. A low p-value, highlighted in red, means that the frequency of the Conditional Event is significantly different from the frequency of its Simple counterpart in the previous table. The test used to compare the proportions is the chi-squared test with 1 degree of freedom. This test assumes independent populations. We therefore first check whether the population representing the Conditional Event is part of the population giving the Simple Event.

- If the Conditional Event is a subsample of the Simple Event, and the Simple Event is composed of only one sample (from which the subsample used to test the condition came), the test does not satisfy the sample independence condition. The "not indep" message is displayed in the p-value column.

- If the Simple Event is composed of many samples (a1, a2, ..., an), and the condition was tested using a small subsample of one of them (e.g. a subsample of a1), then the two populations are approximated as independent owing to the presence of the other samples (a2, ..., an). The test is performed, but it has to be kept in mind that this is an approximation.

- If there is no Simple counterpart to an existing Conditional Event, the test cannot be performed. The "na" (not available) message is displayed in the p-value column.

- If the Conditional Event is not a subsample of the Simple Event sample, the test is performed.
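The rules above can be sketched together with the test itself. The code below is an illustration under stated assumptions: it uses a Pearson chi-squared test on a 2x2 table without continuity correction (the text does not say whether a correction is applied), the function and parameter names are hypothetical, and the example counts are invented.

```python
import math
from typing import Optional, Tuple


def chi2_two_proportions(pos_a: int, n_a: int, pos_b: int, n_b: int) -> Tuple[float, float]:
    """Pearson chi-squared test (1 d.f., no continuity correction)
    comparing two independent proportions pos_a/n_a and pos_b/n_b.

    Returns (chi2 statistic, p-value).
    """
    table = [[pos_a, n_a - pos_a], [pos_b, n_b - pos_b]]
    row_sums = [n_a, n_b]
    col_sums = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    total = n_a + n_b
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            if expected == 0:
                raise ValueError("an expected count is zero; the test is undefined")
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-squared with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def p_value_cell(simple: Optional[Tuple[int, int]],
                 conditional: Tuple[int, int],
                 cond_is_subsample: bool,
                 n_simple_sources: int) -> str:
    """Content of the p-value column for one Conditional Event.

    `simple` and `conditional` are (positives, samples) pairs;
    `simple` is None when there is no Simple counterpart.
    """
    if simple is None:
        return "na"
    if cond_is_subsample and n_simple_sources == 1:
        return "not indep"
    # Either the populations are disjoint, or the Simple Event pools several
    # samples and independence is accepted as an approximation.
    _, p = chi2_two_proportions(*conditional, *simple)
    return f"{p:.4f}"


print(p_value_cell(None, (8, 22), False, 0))
print(p_value_cell((31, 62), (8, 22), True, 1))
print(p_value_cell((20, 40), (10, 40), False, 2))
```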

Graphical data visualisation using Cytoscape

A key endpoint for this platform is the provision of suitable data analysis and data visualisation methods for modelling the dynamics of the epigenetic-genetic transformations that lead to cell abnormalities in cancer. Stochastic networks not only facilitate knowledge extraction from the data, but also provide a sound method for data visualisation. Some advantages of networks include:

- Pattern / signature generation and improved visualisation of these aspects.
- Zooming features allow analysis at different levels.
- Ability to merge separate networks from multiple data sources into one network, enhancing understanding of information connections.

In the previous section we showed how the data integration tool can be used to compute the refined frequencies of the molecular events and their associations. The following parallel can then be made:

- A Simple Molecular Event, together with its statistics derived from all available references, can be defined as a node.
- A Conditional Molecular Event, together with its statistics derived from all available references, can be defined as an edge connecting two nodes, directed from Event 1 to Event 2.

The nodes and the edges, with their statistics, can be represented graphically as networks. While the final data files can be downloaded to the user's workstation and visualised with the user's own software, at this stage StatEpigen supports downloading the data in Cytoscape format. Cytoscape is free software that is very powerful for constructing biology-related networks. It is available here. To represent networks, Cytoscape needs a number of specific files. StatEpigen creates these files, which are available for download with the option "See Statistics/Download Cytoscape files". This facility can be used only if the selected data all relate to a single phenotype or to closely related phenotypes. If the dataset covers a number of distant phenotypes, the networks built will be meaningless. For example, a network is meaningless if the underlying data relate to both colon carcinoma and healthy tissue samples. Conversely, if all the data are, for example, on cell line DLD1, on colon polyps only, or on early-stage tumours such as polyps and adenomas, these data can subsequently be used to build networks in Cytoscape.

On a primary network obtained using Cytoscape, the first meaning of an edge is that the underlying correlation has been tested by researchers and reported in the literature. The weight of the edge shows its incidence (e.g. the probability of the gene being expressed, depending on the intensity of its epigenetic regulation). If the frequency of an edge is low, this can mean that the events at the nodes rarely occur in the same samples, and hence represent independent pathways to the phenotype in which they appear. By adding information from more references to the data pool, both node and edge probabilities can be refined.
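To illustrate the node and edge mapping, the sketch below writes a tiny network in Cytoscape's simple interaction format (SIF), in which each line holds a source node, an interaction type and a target node separated by tabs. The exact files generated by StatEpigen may differ; the function names, the interaction label "conditional" and the frequency of APC LOH are assumptions made for the example.

```python
from typing import Dict, Tuple

# Nodes: Simple Molecular Events with their integrated frequency.
# The APC LOH value is invented; 0.50 is the APC example frequency.
nodes: Dict[str, float] = {
    "APC LOH": 0.30,
    "APC promoter hypermethylation": 0.50,
}

# Edges: Conditional Molecular Events, directed from Event 1 to Event 2.
# The value is the frequency of Event 2 given Event 1 (36.3% in the example).
edges: Dict[Tuple[str, str], float] = {
    ("APC LOH", "APC promoter hypermethylation"): 0.363,
}


def to_sif(edge_map, interaction="conditional"):
    """Render edges in SIF layout: source<TAB>interaction<TAB>target."""
    return "\n".join(f"{src}\t{interaction}\t{dst}" for (src, dst) in edge_map)


def edge_table(edge_map):
    """Tab-separated edge attribute table to import alongside the SIF file."""
    lines = ["source\ttarget\tfrequency"]
    lines += [f"{src}\t{dst}\t{freq}" for (src, dst), freq in edge_map.items()]
    return "\n".join(lines)


def node_table(node_map):
    """Tab-separated node attribute table (node name and frequency)."""
    lines = ["name\tfrequency"]
    lines += [f"{name}\t{freq}" for name, freq in node_map.items()]
    return "\n".join(lines)


print(to_sif(edges))
print(edge_table(edges))
print(node_table(nodes))
```

Tabs are used as delimiters so that node names containing spaces are kept intact.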

You can see a network built in Cytoscape here and read how to build it from StatEpigen data here.

Future Work

- Curate and annotate more data on colon cancer.
- Extend StatEpigen to other pathologies: gastric cancer and lymphomas.
- Publish our annotated dataset of correlations between molecular events and environmental factors.
- Connect StatEpigen to other relevant resources.
- Incorporate automatic statistical analysis for the data on screen.