-- tuple_plot README --

Copyright (C) 2006
Karol Szafranski & Niels Jahn
Genome Analysis Group, Leibniz Institute for Age Research - Fritz Lipmann
  Institute, Jena (Germany)


PURPOSE

  Tuple_plot identifies and visualizes local similarities between two genomic
  sequences, typically 100 kb or longer, by applying the well-known dotplot
  principle.  The implemented scoring scheme results in a high signal-to-noise
  ratio.


INSTALLATION

  The software is known to build and run properly on various Linux/UNIX
  platforms and under MacOS.  It should compile and run on any system
  providing an ANSI/ISO-conforming C++ compiler (C++98), preferably GNU g++.

  Tuple_plot requires the following shared libraries to be installed on your
  system:

    GD (http://www.boutell.com/gd/)
      which itself depends on
      libjpeg (http://www.ijg.org/)
      libpng (http://www.libpng.org/pub/png/libpng.html)
      libttf or freetype (http://www.freetype.org)

  Installation under MacOS first requires installation of the Xcode developer
  tools (which provide the compiler) as well as X11 (both available on the OS
  disc).  For installing the GD library on MacOS via the terminal, some useful
  descriptions are available on the web (e.g.
  http://www.paginar.net/matias/articles/gd_x_howto.html).  You may also
  follow the protocol file macos-install-gd.rtf included in this package.
  DarwinPorts offers another way to do the installation
  (http://homepage.mac.com/duling/halfdozen/GD-Howto.html).

  On Linux systems, the required libraries are usually easy to install from
  rpm files, as provided by the system distributor or available on the web.
  Note that you need to install both the shared library and the development
  version of each library package, at least for GD.

  To compile the tuple_plot program, run the makefile included in this package
  with the commands:

    cd install_dir
    make

  If you successfully installed the program on a new platform, or you
  encounter any problems during installation, contact the authors through
  the distribution web site http://genome.fli-leibniz.de/software.html .


PROGRAM DESCRIPTION

  This section focuses on the command line interface of tuple_plot.  The
  implemented algorithm and the general procedural scheme have been published
  (see HOW TO CITE below).

  The minimal program call requires at least two statements: (i) the path(s)
  of the two input sequences and (ii) a directive describing what type of
  output is desired.  The input sequences must be provided in fasta format,
  either together in a single file or separately in two files.  The output
  may be either a PNG image file alone (directive -o) or that image file
  wrapped by an HTML document (directive -H).

    tuple_plot -o tplot.png seq1.fa [seq2.fa]
    tuple_plot -H tplot seq1.fa [seq2.fa]
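
  When many comparisons are run from a script, the two calls above can be
  assembled programmatically.  The following is a minimal Python sketch, not
  part of the package; the helper names format_fasta and build_command are
  hypothetical, and only the directives -o and -H shown above are used.

```python
import shlex


def format_fasta(records, width=60):
    """Format (identifier, sequence) pairs as fasta text, wrapped at `width`."""
    lines = []
    for seq_id, seq in records:
        lines.append(">" + seq_id)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def build_command(out_stump, seq_files, html=True):
    """Return the argument list for one of the two documented call forms.

    seq_files holds one fasta file (containing both sequences) or two files.
    """
    if not 1 <= len(seq_files) <= 2:
        raise ValueError("expected one or two fasta files")
    if html:
        # HTML mode: the argument is the output file stump.
        return ["tuple_plot", "-H", out_stump, *seq_files]
    # Image-only mode: the argument is the PNG file name.
    return ["tuple_plot", "-o", out_stump + ".png", *seq_files]


# Print the command line instead of executing it (needs Python 3.8+).
print(shlex.join(build_command("tplot", ["seq1.fa", "seq2.fa"])))
```

  The returned list can be passed to subprocess.run() once tuple_plot is
  installed and found on the search path.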

  The latter mode is highly recommended since the HTML document (file
  ofile_stump.html) verbosely describes all program settings used for
  computation as well as the steps of the computational process.  This
  information documents the sequence comparison results in detail for later
  inspection, and it helps you develop an effective strategy to optimize the
  comparison task, if necessary.

  tuple_plot dynamically adjusts several parameters of the comparison
  procedure.  In most cases, this self-parametrization gives satisfactory
  results.  However, several program options can be used to obtain optimal
  results, and these are described in the following.  A complete listing of
  the command line options can be obtained by calling the built-in usage
  help:

    tuple_plot -h

  First, to better understand the available command line options, it is useful
  to know about the basic structure of the program's work flow.  It is
  organized in three sections:

    A. prelude
       - analysis of sequence composition
       - suggestion of optimal word size used for local sequence comparison
       - construction of word frequency and word instance dictionaries
       - masking of overrepresented words
       - ranking of words
    B. actual performance of the sequence comparison
       - word hits are sampled, scored, and transferred to the dot matrix
    C. dotplot presentation
       - application of display thresholds to values of the dot matrix
       - preparation of the dotplot image
       - merging of supplied annotation data
       - finishing of output files
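
  The core of sections A and B can be illustrated by a small Python sketch.
  It is a simplified illustration of the dotplot principle, not the program's
  C++ implementation: the function names are hypothetical, every hit gets the
  same weight of 1.0 (the program scores hits instead), and masking and
  ranking of words are left out.

```python
from collections import defaultdict


def word_positions(seq, k):
    """Map every k-letter word of seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def dot_matrix(seq_x, seq_y, k, size_x, size_y):
    """Accumulate word hits in a matrix of size_y rows and size_x columns."""
    index_x = word_positions(seq_x, k)
    matrix = [[0.0] * size_x for _ in range(size_y)]
    for j in range(len(seq_y) - k + 1):
        # Every occurrence of the same word in seq_x is a hit at (i, j).
        for i in index_x.get(seq_y[j:j + k], ()):
            col = i * size_x // len(seq_x)   # scale sequence position to column
            row = j * size_y // len(seq_y)   # scale sequence position to row
            matrix[row][col] += 1.0          # assumption: unweighted hit
    return matrix


print(dot_matrix("ACGTACGT", "ACGTACGT", 4, 2, 2))
```

  Because many sequence positions map to one matrix cell, each cell collects
  the hits of a whole sequence window; this is why the matrix values need
  thresholding before they are drawn (section C).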

  Options that affect the sensitivity and specificity of the dotplot approach
  will modulate steps in sections A and C, as indicated by headlines in the
  command line help (section A: "options affecting the sequence comparison";
  section C: "thresholding hit display" and "options affecting the dotplot
  image").

  The user can choose whether the sequence comparison is performed both
  co-directionally and counter-directionally (forward/forward as well as
  reverse/forward; this is the default behavior), co-directionally only
  (option -f), or counter-directionally only (option -r).  Co-directional and
  counter-directional hits are computed as independent layers of the dotplot
  and are displayed in different colors (co-directional black,
  counter-directional red).
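
  The difference between the two layers can be sketched in Python as follows.
  This is an illustration only; the function names are hypothetical, and
  which of the two sequences the program reverses internally is an assumption
  made here (words of the second sequence are reverse-complemented).

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def word_hits(seq_x, seq_y, k, counter=False):
    """Return (x, y) start positions of k-letter words shared by both sequences.

    counter=False gives the co-directional layer, counter=True the
    counter-directional layer.
    """
    index_x = {}
    for i in range(len(seq_x) - k + 1):
        index_x.setdefault(seq_x[i:i + k], []).append(i)
    hits = []
    for j in range(len(seq_y) - k + 1):
        word = seq_y[j:j + k]
        if counter:
            # Assumption: the word of seq_y is matched on the opposite strand.
            word = reverse_complement(word)
        hits.extend((i, j) for i in index_x.get(word, ()))
    return hits


seq_x = "AAACCC"
seq_y = reverse_complement(seq_x)           # "GGGTTT"
print(word_hits(seq_x, seq_y, 3))                # co-directional layer
print(word_hits(seq_x, seq_y, 3, counter=True))  # counter-directional layer
```

  In this toy example the co-directional layer is empty, while the
  counter-directional hits line up on the anti-diagonal, as expected for an
  inverted copy.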

  A word (tuple) size optimal for comparison is automatically suggested by the
  program, dynamically adapted to the length of the input sequences.  Forced
  settings (option -t) will have little effect on the dotplot results unless
  extreme values are applied.  Note that increasing the word size will cause
  longer computation time and increased memory requirements.  However,
  neither effect is an issue with sequence sizes below 1 Mb.
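
  This README does not state the rule by which the program suggests the word
  size.  The Python sketch below therefore uses a common rule of thumb as a
  stand-in, explicitly an assumption and not the program's method: choose the
  smallest word size k for which the number of possible words, 4^k, reaches
  the length of the longer sequence.  It also shows why large k costs memory,
  since a complete word dictionary has 4^k entries.

```python
def suggest_word_size(len_x, len_y, min_k=4):
    """Rule-of-thumb word size: smallest k with 4**k >= longer sequence.

    Assumption for illustration; tuple_plot may use a different rule.
    """
    target = max(len_x, len_y)
    k = min_k
    while 4 ** k < target:
        k += 1
    return k


def dictionary_entries(k):
    """Number of distinct nucleotide words of length k."""
    return 4 ** k


for length in (100_000, 1_000_000):
    k = suggest_word_size(length, length)
    print(length, k, dictionary_entries(k))
```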

  A stochastic scoring scheme is the outstanding feature of tuple_plot; it
  results in a high signal-to-noise ratio.  First, words are completely
  ignored if their overall frequency exceeds x-fold the frequency expected
  under a homogeneous distribution of words (option -i).  Second, the
  expected frequency of random hits is used to counter-correct the observed
  hits (default -s1, switched off by -s0).  A second correction scheme
  (option -s2), additional to the one described in the publication, uses
  squared correction weights and gives slightly different results.  However,
  since the latter is less well founded theoretically, we recommend the
  default correction scheme.  Reports that allow monitoring of word exclusion
  and word/hit scoring can be invoked using options -n and -m, possibly in
  combination with option -M.
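
  The first step, exclusion of overrepresented words, can be sketched in
  Python as follows.  This is a minimal illustration under stated
  assumptions: the function name is hypothetical, words are counted over all
  supplied sequences, the expected frequency under a homogeneous distribution
  is taken as (number of word positions) / 4^k, and a word is masked when its
  count is strictly greater than max_fold times that expectation.  The hit
  score correction of the second step is described in the publication and is
  not reproduced here.

```python
from collections import Counter


def overrepresented_words(seqs, k, max_fold):
    """Return the set of k-letter words to be ignored in the comparison."""
    counts = Counter()
    n_positions = 0
    for seq in seqs:
        n = len(seq) - k + 1
        if n <= 0:
            continue                      # sequence shorter than one word
        n_positions += n
        counts.update(seq[i:i + k] for i in range(n))
    # Assumption: homogeneous distribution over all 4**k possible words.
    expected = n_positions / 4 ** k
    return {word for word, c in counts.items() if c > max_fold * expected}


# 9 x "AA", 1 x "AC", 1 x "CG"; expectation is 11 / 16 per word.
print(overrepresented_words(["AAAAAAAAAACG"], 2, max_fold=5))
```

  Masking words such as those from simple repeats removes a large share of
  the background hits before any scoring takes place.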

  After scored hits have been sampled to the dotplot matrix (work flow
  section B), the next subtask is to transfer the matrix data to a graphical
  representation, i.e. the dotplot image.  Parametrization of this subtask
  applies to the fraction of the dotplot image pixels that shall be colored
  to indicate hit state.  The default behavior refers to the expectation
  that the dotplot will show a perfect solid diagonal, composed of
  
    2 * min(size_x,size_y)

  pixels.  With default settings (corresponding to option -A 1.0), the program
  determines this number of highest-scoring matrix values and transfers these
  as colored pixels to the dotplot image.  If you expect (or observe) much
  background signal that scatters outside the expected match diagonal, it is
  reasonable to raise the sensitivity of the sequence comparison by
  increasing the value given with option -A.  Option -a similarly scales the
  signal of the dotplot image, directly specifying the fraction of colored
  pixels.  Option -c directly sets the score threshold that is applied during
  transfer of dotplot matrix values to the dotplot image.  Option -A is
  recommended over -a or -c because it behaves most robustly with varying
  settings of image size and other parameters that influence the
  sensitivity/specificity of the comparison.
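
  The selection of colored pixels can be sketched in Python as follows.  The
  function names are hypothetical, and it is an assumption of this sketch
  that the value of option -A acts as a plain multiplier of the pixel number
  2 * min(size_x,size_y).

```python
def display_threshold(matrix, scale=1.0):
    """Score of the n-th highest matrix value, n = scale * 2 * min(x, y).

    Returns None if the matrix holds no positive score.
    """
    size_y, size_x = len(matrix), len(matrix[0])
    n_pixels = int(scale * 2 * min(size_x, size_y))
    scores = sorted((v for row in matrix for v in row if v > 0), reverse=True)
    if not scores:
        return None
    n_pixels = max(1, min(n_pixels, len(scores)))
    return scores[n_pixels - 1]


def colored_pixels(matrix, scale=1.0):
    """Return (x, y) coordinates of all cells at or above the threshold.

    Tied scores at the threshold may yield slightly more than n pixels.
    """
    threshold = display_threshold(matrix, scale)
    if threshold is None:
        return []
    return [(x, y)
            for y, row in enumerate(matrix)
            for x, value in enumerate(row)
            if value >= threshold]


matrix = [[5, 1, 0],
          [2, 4, 3]]
print(colored_pixels(matrix))             # 2 * min(3, 2) = 4 pixels
print(colored_pixels(matrix, scale=0.5))  # 2 pixels
```

  Since the pixel number is derived from the image size, the same scale
  value remains meaningful when the image dimensions change, which matches
  the robustness argument given above for option -A.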
  
  Finally, a set of options influences the shape of the dotplot image.  Options
  -x and -y set the image dimensions.  By default, the maximum edge size is
  500 pixels and the ratio of horizontal (x) and vertical (y) dimensions is
  proportional to the sizes of the two input sequences.  Option -q turns off
  the proportional scaling and forces a square shape.  Option -g provides an
  interface to user-supplied GFF-formatted annotations that will be merged
  into the dotplot image, using colors specified in the feature field (field
  #3 according to the GFF definition, cf.
  http://www.sanger.ac.uk/Software/formats/GFF/) in the hexadecimal RGB color
  format defined by the HTML standard (e.g. "#C8E2C8").  Use sequence IDs
  "seq1"/"seq2" or "seqA"/"seqB" in the GFF sequence field (field #1) to refer
  to one of the input sequences.
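
  An annotation file for option -g can be generated with a few lines of
  Python.  The sketch below writes tab-separated GFF lines with the color in
  the feature field (field #3) and the sequence ID in field #1, as described
  above.  The function name gff_line and the source value "manual" are
  hypothetical, and filling the score, strand, and frame fields with "." is
  an assumption based on the general GFF convention for empty fields.

```python
import re

_HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")
_SEQ_IDS = ("seq1", "seq2", "seqA", "seqB")


def gff_line(seq_id, start, end, color, source="manual"):
    """Return one GFF line that marks start..end of a sequence in `color`."""
    if seq_id not in _SEQ_IDS:
        raise ValueError("sequence ID must be one of " + ", ".join(_SEQ_IDS))
    if not _HEX_COLOR.fullmatch(color):
        raise ValueError("color must look like #C8E2C8")
    if not 1 <= start <= end:
        raise ValueError("need 1 <= start <= end")
    # Fields: seqname, source, feature (color), start, end, score, strand, frame
    fields = [seq_id, source, color, str(start), str(end), ".", ".", "."]
    return "\t".join(fields)


print(gff_line("seq1", 1000, 2500, "#C8E2C8"))
print(gff_line("seq2", 400, 900, "#E2C8C8"))
```

  The printed lines can be redirected to a file, which is then supplied to
  tuple_plot with option -g.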


HOW TO CITE

  The program tuple_plot and its underlying algorithm are described in the
  following publication:

  Szafranski K, Jahn N, Platzer M. tuple_plot: fast pairwise nucleotide
  sequence comparison with noise suppression. Bioinformatics 22, 1917-1918
  (2006).


THANKS TO

  We thank Christoph Grunau for documentation material concerning gdlib
  installation under MacOS, and Klaus Huse for extensive beta testing.


LICENSE

  This program is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation; either version 2 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program; if not, write to the Free Software
  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA