Parsing Emboss Water output with Ruby

By kenglish

First, you will need to install the emboss suite on your computer:

sudo apt-get install emboss emboss-lib

If you don’t already have BioRuby installed, you will need that too:

sudo gem install bio --no-ri --no-rdoc

Your first Ruby script calling Emboss Water:

require 'rubygems'
require 'bio'
test_filename = ARGV.shift
target_filename = ARGV.shift
result = Bio::EMBOSS.run('water', '-asequence', test_filename, '-bsequence', target_filename)
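
If you would rather not depend on BioRuby just to launch the program, here is a minimal sketch that calls water directly with Ruby's standard Open3 library. The helper names water_command and run_water are made up for this example, and the gap penalties shown are water's usual defaults passed explicitly so that the program does not stop to prompt for them. It assumes the water executable is on your PATH.

```ruby
require 'open3'

# Build the argument list for a non-interactive water run.
# (water_command is a hypothetical helper, not part of BioRuby or EMBOSS.)
def water_command(a_file, b_file, gapopen: 10.0, gapextend: 0.5)
  ['water',
   '-asequence', a_file,
   '-bsequence', b_file,
   '-gapopen', gapopen.to_s,
   '-gapextend', gapextend.to_s,
   '-outfile', 'stdout', # send the report to standard output
   '-auto']              # never prompt for missing values
end

# Run water and return its report as a string.
# Raises if the program exits with a non-zero status.
def run_water(a_file, b_file)
  stdout, stderr, status = Open3.capture3(*water_command(a_file, b_file))
  raise "water failed: #{stderr}" unless status.success?
  stdout
end
```

Passing the arguments as an array, rather than as one shell string, means file names with spaces do not need any quoting.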

Unfortunately, there is no ready-made report class in BioRuby for Emboss Water, so you will have to parse the output yourself. Here’s an example script that finds the percent similarity:

require 'rubygems'
require 'bio'
test_filename = ARGV.shift
target_filename = ARGV.shift
result = Bio::EMBOSS.run('water', '-asequence', test_filename, '-bsequence', target_filename)
# result now has the text output of water...
# Here's an example of looping through each line of the result to get the similarity:
 
test_seq = ""
target_seq = ""
similarity = ''
 
 
result.split("\n").each do |line|
  # This line marks the start of a new alignment, so print the previous
  # one (if there was one) and reset the sequence names
  if line =~ /^# Aligned_sequences/
    puts "Seq '#{test_seq}' has similarity to Seq '#{target_seq}' of #{similarity}" unless (test_seq == "") && (target_seq == "")
    test_seq = ""
    target_seq = ""
  end
  # Get sequence numbers (this only matches sequences whose names are numeric)
  if line =~ /^# (\d+): (\d+)/
    test_seq = $2 if $1 == '1'
    target_seq = $2 if $1 == '2'
  end
  # Parse similarity: capture the percentage inside the parentheses
  if line =~ /^# Similarity:.*\((.*)%\)/
    similarity = $1
  end
end
 
puts "Seq '#{test_seq}' has similarity to Seq '#{target_seq}' of #{similarity}"
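
If you want the results as data instead of printed lines, the same line-by-line approach can collect one hash per alignment. This is only a sketch: parse_water_report is a name made up for this example, and the sample report is a hand-written fragment modeled on the header lines that the script above matches, not real water output. Unlike the script above, it accepts any non-blank sequence name, not just numeric ones.

```ruby
# Parse the text of a water report into an array of hashes,
# one per alignment. (parse_water_report is a hypothetical helper.)
def parse_water_report(text)
  records = []
  current = nil
  text.each_line do |line|
    case line
    when /^# Aligned_sequences/
      # A new alignment block starts here
      current = {}
      records << current
    when /^# ([12]): (\S+)/
      # "# 1: name" is the first sequence, "# 2: name" the second
      next unless current
      current[$1 == '1' ? :test_seq : :target_seq] = $2
    when /^# Similarity:\s*\d+\/\d+\s*\(\s*([\d.]+)%\)/
      # Keep the percentage as a Float so it can be sorted or compared
      next unless current
      current[:similarity] = $1.to_f
    end
  end
  records
end

# Hand-written sample for illustration only
sample = <<~REPORT
  # Aligned_sequences: 2
  # 1: 1
  # 2: 2
  # Length: 29
  # Identity:      28/29 (96.6%)
  # Similarity:    28/29 (96.6%)
  # Gaps:           0/29 ( 0.0%)
REPORT

parse_water_report(sample).each do |r|
  puts "Seq '#{r[:test_seq]}' has similarity to Seq '#{r[:target_seq]}' of #{r[:similarity]}"
end
```

Storing the similarity as a number makes it easy to filter afterwards, for example with select { |r| r[:similarity] > 90 }.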

Place this in a file called water.rb and run it with frags1.fasta and frags.fasta as arguments. The script will print output like this:

$ ruby water.rb fastas/frags1.fasta frags.fasta 
Seq '1' has similarity to Seq '1' of 100.0
Seq '1' has similarity to Seq '2' of 96.6
Seq '1' has similarity to Seq '3' of 64.3
Seq '1' has similarity to Seq '4' of 97.9
Seq '1' has similarity to Seq '5' of 96.9
Seq '1' has similarity to Seq '6' of 94.1
Seq '1' has similarity to Seq '7' of 62.5
Seq '1' has similarity to Seq '8' of 61.1
Seq '1' has similarity to Seq '9' of 62.5
Seq '1' has similarity to Seq '10' of 57.1
Seq '1' has similarity to Seq '11' of 57.4
Seq '1' has similarity to Seq '12' of 97.8
Seq '1' has similarity to Seq '13' of 50.0
Seq '1' has similarity to Seq '14' of 62.5
Seq '1' has similarity to Seq '15' of 97.9
Seq '1' has similarity to Seq '16' of 62.5
Seq '1' has similarity to Seq '17' of 59.1
Seq '1' has similarity to Seq '18' of 55.9
Seq '1' has similarity to Seq '19' of 61.9
Seq '1' has similarity to Seq '20' of 60.0
Seq '1' has similarity to Seq '21' of 56.4
Seq '1' has similarity to Seq '22' of 56.2

Water is the worst name for a program, EVER, because it is impossible to Google…



Categories: Bioinformatics, Programming. Posted November 20th, 2009.
