About StatEpigen

The human genetic code is capable of moderating most external environmental factors (with some notable exceptions, e.g. exposure to massive radiation doses). Nevertheless, the expression of individual genes within a sequence can be permanently or semi-permanently altered by seemingly innocuous influences, such as diet, human interaction or exchange, and stressful environments. At the cellular level, input signals may cause significant changes at the structural level of the DNA, called epigenetic changes. You can read more about epigenetic changes here.

When our team at the Sci-Sym Research Centre at DCU became interested in developing probabilistic models of epigenetic processes, we needed reliable quantitative data sets to drive and validate the models. We found that the published scientific literature contains a large amount of cancer epigenetic data, but in this dispersed and heterogeneous form it was not suitable for bioinformatics analysis, modelling and profile searching.

The StatEpigen project aims to generate cancer epigenetic data in a computational format. The project is based on expert manual curation and annotation. StatEpigen does not replace the PubMeth resource, but complements it by focusing on the correlations among cellular genetic and epigenetic events.

At present, many papers on the epigenetic determinants of cancer include sections that examine correlations or associations between a particular epigenetic event, found in a particular cancer phenotype, and other molecular events that are already known cancer signatures. We consider these correlations important because they give a more general picture of the molecular determinants of the phenotype and allow conclusions to be drawn about the pathway leading to it. The final platform is intended to assist pathology research and diagnostics.

Scientific literature curation

Manual annotation is currently recognised as the gold standard for biological annotation. High-quality, manually annotated, non-redundant data will be curated for (i) colon cancer and (ii) other types of cancer (for comparison). Manual annotation consists of:

- Analysis of pathology phenotypes, focusing on pathology initiation and dynamics, to generate a set of high-quality annotation targets.

- Determination of a clear phenotype classification.

- Retrieval of epigenetic events from the literature and their association with the phenotype classification.

- Retrieval of information on external factors correlated with certain epigenetic and genetic events.

Manual annotation was chosen over automatic annotation because of the complexity of the data types involved, on the one hand, and the limitations of text-mining algorithms, on the other. Expert analysis, comparison and annotation are needed to extract the biological information from the peer-reviewed literature and to fit it to the relational data schema generated for this purpose.
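To illustrate what fitting curated information to a relational data schema might involve, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and the helper functions `build_db` and `add_observation` are hypothetical and are not the actual StatEpigen schema; the values in the usage example are placeholders, not curated data.

```python
import sqlite3

# Hypothetical schema for illustration only; NOT the actual StatEpigen schema.
SCHEMA = """
CREATE TABLE phenotype (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE molecular_event (
    id          INTEGER PRIMARY KEY,
    gene        TEXT NOT NULL,
    kind        TEXT NOT NULL CHECK (kind IN ('epigenetic', 'genetic')),
    description TEXT NOT NULL
);
CREATE TABLE observation (
    id               INTEGER PRIMARY KEY,
    phenotype_id     INTEGER NOT NULL REFERENCES phenotype(id),
    event_id         INTEGER NOT NULL REFERENCES molecular_event(id),
    source_ref       TEXT NOT NULL,   -- identifier of the curated paper
    samples_total    INTEGER,         -- NULL when the paper is qualitative only
    samples_positive INTEGER
);
"""


def build_db():
    """Create an in-memory database holding the illustrative schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_observation(conn, phenotype, gene, kind, description,
                    source_ref, samples_total, samples_positive):
    """Store one curated observation and return its row id."""
    # Reuse the phenotype row if this phenotype was already recorded.
    conn.execute("INSERT OR IGNORE INTO phenotype (name) VALUES (?)",
                 (phenotype,))
    phenotype_id = conn.execute(
        "SELECT id FROM phenotype WHERE name = ?", (phenotype,)
    ).fetchone()[0]
    event_id = conn.execute(
        "INSERT INTO molecular_event (gene, kind, description) "
        "VALUES (?, ?, ?)",
        (gene, kind, description),
    ).lastrowid
    obs_id = conn.execute(
        "INSERT INTO observation (phenotype_id, event_id, source_ref, "
        "samples_total, samples_positive) VALUES (?, ?, ?, ?, ?)",
        (phenotype_id, event_id, source_ref, samples_total, samples_positive),
    ).lastrowid
    conn.commit()
    return obs_id


if __name__ == "__main__":
    db = build_db()
    # Placeholder values only.
    add_observation(db, "colon cancer", "GENE_A", "epigenetic",
                    "promoter hypermethylation", "placeholder-ref-1", 40, 12)
    print(db.execute("SELECT COUNT(*) FROM observation").fetchone()[0])
```

Keeping phenotypes, molecular events and observations in separate tables is one way to avoid redundancy: a phenotype is stored once and referenced by every observation reported for it.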

Articles whose abstracts indicate that they contain relevant epigenetic information on the pathologies of interest are selected for annotation. After the full text has been examined, the sample-related, epigenetic and other data are extracted. Papers containing incomplete data (e.g. epigenetic events that are reported but not fully described) or data of insufficient quality (e.g. qualitative only) are curated but flagged as incomplete.
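The flagging rule described above could be sketched as follows. This is an illustration under stated assumptions, not the curators' actual procedure: the `ExtractedRecord` fields and the `flag_incomplete` function are hypothetical, "not fully described" is approximated by a missing event description, and "qualitative only" is approximated by missing sample counts.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractedRecord:
    """Hypothetical record of data extracted from one paper."""
    source_ref: str
    phenotype: str
    event_description: Optional[str]   # None if reported but not described
    samples_total: Optional[int]       # None if the paper is qualitative only
    samples_positive: Optional[int]
    flags: List[str] = field(default_factory=list)


def flag_incomplete(record: ExtractedRecord) -> ExtractedRecord:
    """Attach flags; the record is kept (curated) whatever the outcome."""
    flags = []
    if not record.event_description:
        flags.append("event not fully described")
    if record.samples_total is None or record.samples_positive is None:
        flags.append("qualitative only")
    record.flags = flags
    return record


def is_incomplete(record: ExtractedRecord) -> bool:
    """A record counts as incomplete if it carries at least one flag."""
    return bool(record.flags)


if __name__ == "__main__":
    # Placeholder values only.
    rec = flag_incomplete(ExtractedRecord(
        "placeholder-ref-2", "colon cancer",
        "promoter hypermethylation", None, None))
    print(rec.flags, is_incomplete(rec))
```

Flagged records stay in the collection rather than being discarded, which matches the text: incomplete papers are curated but marked, so that later analyses can include or exclude them.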