BioJava
An Open-Source Java Library for Bioinformatics
(taken from a presentation by M. Pocock, BioJava Consulting LTD)
What is BioJava?
● Java code (requires Java 2, i.e. version 1.2 or higher)
● Open-Source
● Bioinformatics
● Library for building applications
● Sequence-centric
● Part of the Open Bioinformatics Foundation (OBF)
Where is BioJava?
● http://www.biojava.org
● mailto:biojava-l@biojava.org
● #biojava on irc.openprojects.net
Who is BioJava?
● 35+ developers across most continents and time zones
● Core team of more than 5 individuals
A look at some API Stuff
● BioJava is a collection of ~40 packages organized in 9 main categories
● Each category contains classes and interfaces devoted to particular tasks, such as:
– sequence handling
– running external programs
– utilities
– graphical interfaces
– sequence alignment
What’s Been There for a While?
● Sequences with hierarchical features
● Sequence databases
● Sequence IO
– Various sequence formats (EMBL, GenBank, GFF, SwissProt…)
– Object model can be bypassed for high-performance scanning
● Probability distributions over symbols and dynamic programming toolkit
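The idea of bypassing the object model for scanning can be sketched as an event-driven reader: the parser pushes records to a callback instead of building sequence objects. The `FastaScan` class and its `Listener` interface below are hypothetical names invented for this illustration and are not BioJava's own API; the sketch assumes simple FASTA input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class FastaScan {
    /** Hypothetical callback interface for this sketch; not part of BioJava. */
    public interface Listener {
        void startRecord(String header);
        void residues(String chunk);
    }

    /** Streams FASTA lines to the listener without building any sequence objects. */
    public static void scan(BufferedReader in, Listener listener) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.length() == 0) {
                continue; // skip blank lines
            }
            if (line.charAt(0) == '>') {
                listener.startRecord(line.substring(1).trim());
            } else {
                listener.residues(line.trim());
            }
        }
    }

    /** Counts G and C residues over all records using only the event stream. */
    public static int countGC(String fasta) {
        final int[] gc = {0};
        try {
            scan(new BufferedReader(new StringReader(fasta)), new Listener() {
                public void startRecord(String header) {
                    // headers are ignored for this statistic
                }
                public void residues(String chunk) {
                    for (int i = 0; i < chunk.length(); i++) {
                        char c = Character.toLowerCase(chunk.charAt(i));
                        if (c == 'g' || c == 'c') {
                            gc[0]++;
                        }
                    }
                }
            });
        } catch (IOException e) {
            throw new IllegalStateException("Unexpected I/O failure on in-memory input", e);
        }
        return gc[0];
    }

    public static void main(String[] args) {
        System.out.println(countGC(">seq1\nATGC\nGGCC\n>seq2\nAATT\n")); // prints 6
    }
}
```

Because the listener only sees headers and residue chunks, memory use stays constant regardless of file size, which is the usual motivation for this style of scanning.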
What’s Reasonably New?
● TagValue parser API
● Sequence Search APIs
– Interoperable with BioJava XML-based parsers for many common sequence search algorithms
● Pure-Java SSAHA implementation
● Bit-packed sequence storage
● Taxonomies
● Literature References
● Phred
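To illustrate what bit-packed sequence storage means in general, here is a minimal sketch that stores unambiguous DNA at 2 bits per base, 32 bases per `long`. The `PackedDna` class is a hypothetical example written for this note; it does not reproduce BioJava's packing classes, and it assumes the input contains only a, c, g and t.

```java
public class PackedDna {
    private static final String ALPHABET = "acgt"; // 2-bit codes 0..3
    private final long[] words;
    private final int length;

    public PackedDna(String seq) {
        length = seq.length();
        words = new long[(length + 31) / 32]; // 32 bases fit in one 64-bit word
        for (int i = 0; i < length; i++) {
            int code = ALPHABET.indexOf(Character.toLowerCase(seq.charAt(i)));
            if (code < 0) {
                throw new IllegalArgumentException("Not an unambiguous base: " + seq.charAt(i));
            }
            words[i / 32] |= ((long) code) << ((i % 32) * 2);
        }
    }

    /** Returns the base at a 0-based index, decoded from its 2-bit code. */
    public char charAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        int code = (int) ((words[i / 32] >>> ((i % 32) * 2)) & 3L);
        return ALPHABET.charAt(code);
    }

    public int length() {
        return length;
    }

    /** Number of 64-bit words used for storage. */
    public int wordCount() {
        return words.length;
    }

    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PackedDna packed = new PackedDna("ACGTTGCA");
        System.out.println(packed + " stored in " + packed.wordCount() + " word(s)");
    }
}
```

Under these assumptions a sequence needs a quarter of the space of one byte per base; ambiguity codes and gaps would need a wider encoding.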
What’s Recently Improved?
● Gap handling
– Consistent algebra for representing ambiguities (e.g. n), compound symbols (e.g. codons) and gaps
● DAS Client is now very robust
– Distributed sequence API allows DAS-like distributed sequence databases to be easily built and implemented
● More ‘framey’ annotation bundles
● Sequence Rendering
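One way to picture a single algebra covering ambiguities, compound symbols and gaps is to treat every symbol as a set of atomic bases. The sketch below uses bit masks for those sets; the `AmbiguityMask` class, the choice of the empty set for a gap, and the limited IUPAC subset are all assumptions made for illustration rather than a description of BioJava's actual classes.

```java
public class AmbiguityMask {
    // Each atomic base is one bit; a symbol is the set of bases it can stand for.
    public static final int A = 1, C = 2, G = 4, T = 8;
    public static final int N = A | C | G | T; // matches every base
    public static final int GAP = 0;           // assumed here: matches no base

    /** Maps a small subset of IUPAC codes (plus '-') to a set of bases. */
    public static int fromIupac(char c) {
        switch (Character.toLowerCase(c)) {
            case 'a': return A;
            case 'c': return C;
            case 'g': return G;
            case 't': return T;
            case 'r': return A | G; // purine
            case 'y': return C | T; // pyrimidine
            case 'n': return N;
            case '-': return GAP;
            default:
                throw new IllegalArgumentException("Unsupported symbol: " + c);
        }
    }

    /** True if the two symbols share at least one atomic base. */
    public static boolean matches(int first, int second) {
        return (first & second) != 0;
    }

    /** A compound symbol (e.g. a codon) is one base set per position. */
    public static int[] compound(String symbols) {
        int[] masks = new int[symbols.length()];
        for (int i = 0; i < masks.length; i++) {
            masks[i] = fromIupac(symbols.charAt(i));
        }
        return masks;
    }

    /** Number of fully specified tuples a compound symbol can expand to. */
    public static int expansions(int[] compound) {
        int n = 1;
        for (int i = 0; i < compound.length; i++) {
            n *= Integer.bitCount(compound[i]);
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(matches(fromIupac('n'), G));  // n can stand for g
        System.out.println(expansions(compound("atr"))); // atr covers ata and atg
    }
}
```

With this encoding the same set operations answer "does n match g?", "how many codons does atr stand for?" and "does a gap match anything?" without special cases.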
Java 1.4-reliant Source
● Java 1.4 offers APIs that are really useful for Bioinformatics
– Logging
– NIO interfaces for fast IO and raw data access
– Regular expressions
– Cascading Exceptions
● BioJava code that relies on 1.4 APIs is built conditionally
– SSAHA implementation
– Some parsers and handlers for TagValue
– Restriction enzyme digests
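As a rough illustration of why these APIs matter, the sketch below combines three of them: `java.util.regex` to locate a restriction site, `java.util.logging` for diagnostics, and exception chaining to keep the original cause of a failure. The `Java14Sketch` class and its methods are hypothetical and are not taken from BioJava's restriction-digest code; the example uses the EcoRI recognition sequence GAATTC and is written in current Java syntax (generics postdate 1.4).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Java14Sketch {
    private static final Logger LOG = Logger.getLogger(Java14Sketch.class.getName());

    // EcoRI recognition sequence, matched case-insensitively.
    private static final Pattern ECORI = Pattern.compile("gaattc", Pattern.CASE_INSENSITIVE);

    /** Returns the 0-based start offsets of every EcoRI site in the sequence. */
    public static List<Integer> siteStarts(String dna) {
        List<Integer> starts = new ArrayList<Integer>();
        Matcher m = ECORI.matcher(dna);
        while (m.find()) {
            starts.add(Integer.valueOf(m.start()));
        }
        LOG.fine("Found " + starts.size() + " site(s)");
        return starts;
    }

    /** Parses a numeric field and chains the low-level failure as the cause. */
    public static int parseLength(String field) {
        try {
            return Integer.parseInt(field.trim());
        } catch (NumberFormatException e) {
            // The original exception travels with the new one ("cascading" exceptions).
            throw new IllegalArgumentException("Bad length field: " + field, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(siteStarts("ccGAATTCggaattc")); // prints [2, 9]
        try {
            parseLength("12x");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage() + " <- caused by " + e.getCause().getClass().getName());
        }
    }
}
```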
OBDA Support
● OBDA is a joint project between the various Open-Bio projects that aims to establish a unified access route for sequences in local and remote sequence databases.
● BioCORBA – CORBA sequence interfaces
● BioSQL – relational tables and standard semantics for storing sequences
● BioFetch – cgi-bin-based sequence fetching
● XEMBL – XML-based sequence fetching
● Bio Directories – configuration file for resolving
Things We’d Like To Do in the Near Future
● Support non-DNA areas of Bioinformatics
– Cladistics, evolutionary trees, clusters
– Expression data
– Proteomics
– Networks/pathways
– Biochemical reactions
● Integrate pre- and post-1.4 exception systems
● Modify the change notification system
– Better synchronization and transaction support
– Easier to optimize events that don’t have listeners
– More robust handling of event cascades
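The point about optimizing events that have no listeners can be sketched generically: if the event is described lazily, a source with no listeners never pays for constructing it. The `ChangeHub` class and its nested interfaces below are hypothetical names for this illustration and do not represent BioJava's change notification API.

```java
import java.util.ArrayList;
import java.util.List;

public class ChangeHub {
    /** Hypothetical listener interface for this sketch. */
    public interface Listener {
        void changed(String description);
    }

    /** Builds the event description only when somebody will receive it. */
    public interface EventSource {
        String describe();
    }

    private final List<Listener> listeners = new ArrayList<Listener>();

    public synchronized void addListener(Listener listener) {
        listeners.add(listener);
    }

    public void fire(EventSource source) {
        Listener[] snapshot;
        synchronized (this) {
            if (listeners.isEmpty()) {
                return; // fast path: nothing is built when nobody listens
            }
            snapshot = listeners.toArray(new Listener[0]);
        }
        // Build the event once, outside the lock, then deliver to the snapshot.
        String description = source.describe();
        for (int i = 0; i < snapshot.length; i++) {
            snapshot[i].changed(description);
        }
    }

    public static void main(String[] args) {
        ChangeHub hub = new ChangeHub();
        hub.fire(() -> "never built"); // no listeners yet, so describe() is not called
        hub.addListener(d -> System.out.println("changed: " + d));
        hub.fire(() -> "feature added");
    }
}
```

Delivering to a snapshot taken under the lock is one simple way to keep listener callbacks from running while the listener list is locked; real transaction support and cascade handling would need more than this.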
What Will We See in BioJava 2?
● Pervasive use of Ontologies
– Storing annotation data
– Definition of processing pipelines (e.g. customizing parsers)
– Bindings between BioJava interfaces and external data sources (DAS, BioSQL, BioCORBA)
– Pervasive querying, making any BioJava application an Object Data Store with easy routes for data providers to optimize searches
● Much more code generation
– Push most repetitive code into code generators
– Auto-generate much of the event notification web
And the Biggest Change of All?
● Make the library accessible to casual developers writing throw-away scripts as well as to system architects
– Documentation
– Tutorials
– Training
– Utility classes (e.g. SeqIOTools)