Holiday Hopes: Defeat Cancer
The non-profit Canary Foundation, which is dedicated to the early detection of cancer, and the National Cancer Institute’s Early Detection Research Network (EDRN) have commissioned what is known as a translational research informatics platform, to be provided by GenoLogics Life Sciences Software and Jet Propulsion Laboratory (JPL). The platform is being created to integrate lung cancer patient information from a number of hospital source systems with rich, complex biological data sets, in order to discover biomarkers that can be linked to the early detection of the disease.
The five-year project involves 60 principal investigators in the EDRN, and data initially from lung cancer patients at the British Columbia Cancer Agency and the University of Texas Southwestern Medical Center. That data includes clinical information from sources such as electronic medical records and hospital information systems – in a potpourri of formats that presents an information integration challenge not just between hospitals but even within a single facility’s diverse source systems. Also featured in the project is biological data drawn from tissue and serum samples and cell lines, involving sites that include Johns Hopkins and a next-generation genome sequencing facility in Shanghai.
Phase one of the project involves about 150 patients, with about eight data samples taken from each one, says James DeGreef, VP of market strategy for GenoLogics. Some of the data the project will encompass is very messy, rich and raw – a large mass spectrometer data file might be 50 gigabytes, for instance, and a next-generation genome sequencer run on one human sample can create 7 terabytes of raw data. DeGreef expects the project will accumulate a couple of hundred terabytes of raw data in the first year, though the semantic web data sets culled from all this (such as the analysis of mass spectrometer results) will be significantly smaller.
“With the semantic web it’s really about the knowledge and turning data into knowledge and doing it in a flexible way,” he says, and in a future-proof way as well. “The data is quite complex and involves some crazy relationships.” Getting the relevant data to a point where it can be applied to increase knowledge around cancer is a huge integration nightmare, he says.
Traditional data warehousing approaches don’t work as well for this kind of project as they do for less complex efforts, such as reporting for hospital billing systems and financial metrics. “Classic data warehousing doesn’t lend itself to domains focused on health research, partly because it’s evolving very rapidly,” he says. What was once considered one cancer is now understood to be 14 different varieties, and the differences among patients’ genetics, lifestyles, and environments make each patient practically a research project in their own right, he says. “Doing early detection of cancer research across diverse individuals is a challenge and the classic data warehouse waterfall approach doesn’t work well,” DeGreef says.
Conquering Cancer
The answer isn’t just not to smoke (though that’s a smart first step), as the disease isn’t always caused by smoking and doesn’t manifest in a single form. As it happens, DeGreef says, about 20 percent of lung cancers occur in lifetime non-smokers, and today in the U.S. there’s a larger population of former smokers than current ones. So it’s important to understand the genetic factors that may contribute to the formation of the disease and to identify those markers sooner rather than later.
NASA’s JPL brings to the project a great deal of expertise in handling large data sets, including in RDF formats, from its work around planetary systems. Amazon’s EC2 and S3 cloud services, used for application hosting and data storage, give those involved an efficient means of collaborating over the web, DeGreef says.
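To make the RDF idea concrete, here is a minimal sketch of how records from two hospital systems might be mapped onto a shared vocabulary and stored as subject-predicate-object triples. It illustrates the general technique only, not the project’s actual design: the hospital names, field names, identifiers, and values are all hypothetical, and plain Python tuples stand in for a real RDF store.

```python
# All field names, identifiers, and values below are hypothetical,
# invented for illustration; they do not come from the project.

# Maps each source system's own field names onto shared common terms.
FIELD_MAP = {
    "hospital_a": {"pt_id": "patient_id", "smoker": "smoking_status"},
    "hospital_b": {"PatientNo": "patient_id", "TobaccoUse": "smoking_status"},
}


def to_triples(source, record):
    """Convert one source record into (subject, predicate, object) triples
    expressed in the shared vocabulary defined by FIELD_MAP."""
    mapping = FIELD_MAP[source]
    # Keep only the fields we know how to translate, renamed to common terms.
    common = {mapping[k]: v for k, v in record.items() if k in mapping}
    subject = f"patient:{common.pop('patient_id')}"
    return [(subject, predicate, value) for predicate, value in common.items()]


def query(triples, predicate):
    """Return (subject, object) pairs for every triple with this predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]


triples = []
triples += to_triples("hospital_a", {"pt_id": "A-001", "smoker": "never"})
triples += to_triples("hospital_b", {"PatientNo": "B-042", "TobaccoUse": "former"})

# A new kind of fact is just another triple; no table schema has to change.
triples.append(("patient:A-001", "has_sample", "sample:serum-17"))

print(query(triples, "smoking_status"))
```

Once both records are expressed in the same terms, a single query spans both hospitals, and new relationships can be added as the research evolves without redesigning a fixed schema, which is the flexibility DeGreef contrasts with the classic data warehouse.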
He adds that the project is now at the stage, about two months in, where the involved parties are working with researchers to define common data elements, terms and hierarchies across different hospital systems and biology domains for data harmonization, and that the work is progressing quickly. In the next year or two the hope is to leverage these efforts and apply them to other cancers: pancreatic, prostate, and ovarian, for example. “One thing with ovarian cancer is that it almost always is detected in stage four and then is generally fatal,” DeGreef says. “If we can detect it in stage one you can save almost everybody.” Future efforts also could focus on other diseases, such as heart disease and diabetes.
DeGreef says that if semantic web technologies can prove out on this project, the door is wide open to putting them to use in other ways in a world where it soon will be cheap enough for every American to have their genome sequenced. The costs have fallen from something like $6 billion to sequence the first human genome in 2002 to about $60,000 per human genome today, he says, and they are falling further ridiculously fast. “This opens up huge possibilities for health research, but it’s all a massive data integration and data mining exercise to associate all genome sequence data, make sense of it and associate it to clinical information,” DeGreef says, especially in the U.S., where different hospital systems use different software vendors and keep data in proprietary formats and schemas. “So hopefully this project can prove out semantic web technology as a key enabler for this because it’s going to need some type of technology approach for what’s coming down the pipe in the next few years.”