GenomeQuest on Cloud Mine for Next-Generation Sequencing Data




By Kevin Davies

July 29, 2009 | GenomeQuest today announced the launch of GenomeQuest 6.0Beta, a new sequence data management solution that provides a web-accessible, cloud computing environment for researchers to “align and mine” next-generation sequencing data.  

“There’s a lot of interest in the cloud,” says president/CEO Ron Ranauro. “In a sense, GenomeQuest has built the first commercial application-specific cloud for biocomputing.”

Users can access the cloud from any internet-connected client. “Sitting behind all this is a 500-CPU compute farm for processing that’s purpose-built for processing volumes of sequence data.”

“We’ve always had a platform technology, but when I got to the company in 2002, the market wasn’t really ready for another platform,” Ranauro told Bio-IT World.  “The Human Genome Project had crested by then, a lot of pharma, biotech and major academic labs had defined their platform over the preceding five years. What we did, which was a good strategy, was focus our application on … mining genetic sequence data for IP.”

The strategy netted more than 100 customers, including 16 of 20 big pharmas and several big agricultural science customers, who recognized GenomeQuest as a powerful search engine. “The strategy has always been to create an enterprise sequence data management platform. The question we’d been facing is when would the market be ready?” The launch of the next-gen machines from 454, Illumina and Applied Biosystems in 2006-07 marked what Ranauro calls the “catalytic event for causing the enterprise and academic markets to rethink the way they’re managing sequence data.”

Easy Button

GenomeQuest 6.0Beta is the culmination of extending GenomeQuest’s core technology into a broad platform capable of managing sequence data from raw FASTA to high-level pathway information. Ranauro explains: “Sequence data is (sic) not structured data, so it doesn’t lend itself to data management strategies that are organized to handle structured data very well. From the beginning, we took a distributed computing dataflow model for managing the unstructured sequence data. That gives us the scalability.”

Using the GenomeQuest Engine to provide scalability, GenomeQuest 6.0 addresses the needs of three key constituencies -- the researcher, the bioinformatician, and the IT manager.

Researchers “don’t care so much about bioinformatics,” says Ranauro. “The early visionary market for next-gen sequencing wants to do everything, but the mainstream market wants ‘the Easy button.’ They also want some flexibility to tune parameters. They’re not interested in managing data but want common workflows.”

GenomeQuest delivers the two largest production workflows for gene expression and variant (SNP) discovery. “Any researcher can self register and upload a file, or use the sample file and start getting results very quickly.”

Bioinformaticians, on the other hand, “have to be able to access the data models and the algorithms through an open API. We’ve put a tremendous investment in exposing the application programming interface at multiple levels. Since it’s a web application, there’s a URL API used to script and access any data or workflow or database in the system. There’s a scripted command line API which most bioinformatics developers will prefer, which also has this very nice property of providing access to data, workflows, results and analytics while hiding the details of the computing and the reference data itself. A bioinformatics [specialist] can use the command line API to focus on the task at hand, and not the specifics of the IT.”

And from the perspective of the IT manager, scalability is critical. “The volumes of these next-gen machines just continues to escalate,” says Ranauro. “A system that won’t scale is going to be a difficult investment to justify.”

Web Gem

Ranauro half-jokingly says GenomeQuest is becoming a web company. Normally, offering researchers a demo requires multiple steps involving a salesman, a web demo, and registering for an account. “Now the researchers can come to the site, self register, use a sample data set or upload their own, run workflows and mine the results.” The available sample data was donated by Illumina, Life Technologies and 454, and covers metagenomic pathogen data (454) as well as variant detection workflows and gene expression data (Illumina, Life Technologies).

GenomeQuest 6.0 fits into the next-gen workflow from the generation of the raw data. Ranauro describes the pipeline: “We would pick it up from the raw FASTA files, post image processing – it’s the read and an ID... That file can be uploaded. A multigigabyte file can take half a day. If it’s an even bigger file, they can sneakernet it to us.” (GenomeQuest is currently using “fairly rudimentary” compression, but Ranauro acknowledges “there are better ways of doing it,” and is open to leveraging data-transfer services from companies such as Aspera.)
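For readers unfamiliar with the format, the raw FASTA input Ranauro mentions pairs each read with an identifier line that begins with ">". The Python sketch below shows one way such a file could be read into (ID, sequence) pairs. It is an illustration only: the function name and the sample reads are made up for this example and are not taken from GenomeQuest’s software.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) pairs from FASTA-formatted lines.

    Hypothetical helper for illustration; not part of any GenomeQuest API.
    """
    read_id = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if read_id is not None:
                yield read_id, "".join(chunks)
            header = line[1:].split()
            # The ID is the first whitespace-delimited token after '>'.
            read_id = header[0] if header else ""
            chunks = []
        elif read_id is not None:
            # Sequence text may be wrapped across several lines.
            chunks.append(line)
    if read_id is not None:
        yield read_id, "".join(chunks)


# Made-up sample reads, used only to exercise the parser.
example = """>read_1 sample
ACGTAC
GTTA
>read_2
TTGACA
""".splitlines()

records = list(parse_fasta(example))
for rid, seq in records:
    print(rid, len(seq))
```

Because the function is a generator, a multigigabyte file could be streamed record by record (for example, by passing an open file handle) rather than loaded into memory at once.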

“The end user is presented with a simple web application where (s)he can select the reference genome… They can also select how much extended annotation they want. Do they want to know if the variants found are novel related to dbSNP? Are the variants falling inside coding regions…? The result file is a sequence database of the assembly which can be mined according to those properties. You might say, give me only the novel SNPs in coding regions of very high quality.

“Being able to mine and filter the results is the secret sauce of the scalable engine. Now the biologists can do this work without needing to be a programmer, through a very simple web application. That’s the contribution we’re making – allowing a broader, mainstream audience to participate fully in next-generation sequencing.”
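The query Ranauro gives as an example (only the novel SNPs in coding regions of very high quality) amounts to filtering annotated variants on three properties. A rough Python sketch of that logic follows. The record fields, the quality threshold of 30, and the sample variants are all assumptions chosen for illustration; the article does not describe how GenomeQuest stores or scores variants.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Variant:
    """Hypothetical annotated variant record (fields are illustrative)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    quality: float          # assumed numeric quality score
    in_dbsnp: bool          # True if the variant is already known in dbSNP
    in_coding_region: bool  # True if the position falls inside a coding region


def novel_coding_snps(variants: Iterable[Variant],
                      min_quality: float = 30.0) -> List[Variant]:
    """Keep single-base substitutions that are novel, coding, and high quality."""
    return [
        v for v in variants
        if len(v.ref) == 1 and len(v.alt) == 1   # SNP, not an indel
        and not v.in_dbsnp                        # novel relative to dbSNP
        and v.in_coding_region                    # inside a coding region
        and v.quality >= min_quality              # assumed quality cutoff
    ]


# Made-up variants, used only to exercise the filter.
calls = [
    Variant("chr1", 1000, "A", "G", 55.0, False, True),   # kept
    Variant("chr1", 2000, "C", "T", 60.0, True, True),    # known in dbSNP
    Variant("chr2", 3000, "G", "A", 12.0, False, True),   # low quality
    Variant("chr2", 4000, "T", "C", 48.0, False, False),  # non-coding
    Variant("chr3", 5000, "AT", "A", 70.0, False, True),  # indel, not a SNP
]

selected = novel_coding_snps(calls)
for v in selected:
    print(v.chrom, v.pos, v.ref, v.alt)
```

In the product as described, this kind of selection is made through the web application rather than in code; the sketch only makes the filtering criteria explicit.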

Biologists can select and create custom views of the appropriate reference sequence or subsets thereof. “It’s providing data management, but data isn’t really moving around or up and down from the server to the PC. All the manipulation is happening in the cloud but the user is able to manipulate [it].”

The web architecture enables everything to be shared, including workflow, result databases, and selected views on results. “Those can be used as hypothesis drivers for the next set of experiments,” he says.

Upside and Roadmaps

While Ranauro has his sights set on mainstream users, he sees upside elsewhere. “In the fullness of time, a genome center is going to want to get onto the cloud, because they have to lower their costs, just the same as anybody else, to get to the $1000 genome. It might be that GenomeQuest’s platform provides a smoother path onto the cloud than taking all the in-house infrastructure and trying to recreate it on Amazon… We see ourselves providing the on-ramp to the cloud.”

While the GenomeQuest platform currently runs on a homegrown datacenter cloud, Ranauro says, “we’re actively looking at scaling options that might include Amazon. Hosting this on Amazon is a very real possibility, but it’s not currently on our roadmap.”

De novo assembly functionality is on the roadmap, however, for the second half of 2009. “We’ll provide the computational and alignment engine but we’ll rely on the industry for the assembly. There are important assemblers, such as 454 Newbler, today. For short reads, later this year – there we’ll rely probably on Velvet or ABySS.”

Ranauro also sees a rich environment for next-gen software companies such as CLC bio and DNAStar to add value. “Those tools have a very rich feature set. There’s always going to be a researcher that can benefit. The problem we’re solving is, having that data on the PC is having it siloed again and the industry goes back to where it was ten years ago, with silos of data.”

Ranauro says he’s actively looking for feedback from early users. That will go a long way to determining how long the ‘beta’ designation lasts, but he says early users “are loving it.” He continues: “We’re actually giving a very powerful sequence data management capability away for free. You don’t have to do next-gen sequencing to get high value from this web site!”

“This is the only product that can process the data and then mine it using an easy-to-use web-based platform,” says Ranauro. “There’s a reason why the IT industry went from client-server to web-based. It provides centralized management, local control, more of a tractable knowledge engineering environment for an enterprise. We don’t see our customers wanting to move data up and down between PCs and servers or across networks. They really want to have it stored centrally but be able to manipulate it easily. We’re really the only company offering that.

“The ability to align the data and mine the alignments using sequence analysis and annotation simultaneously in a scalable way -- no-one has that!”


For reprints and/or copyright permission, please contact The YGS Group, 3650 West Market Street, York, PA;

(717) 505-9701 ext. 125, or via email to Ashley.Zander@theYGSgroup.com.