for Molecular Biology

Background: The optimal score for ungapped local alignments of infinitely long random sequences is known to follow a Gumbel extreme value distribution. Less is known about the important case, where gaps are allowed. For this case, the distribution is only known empirically in the high-probability region, which is biologically less relevant.
Results: We provide a method to obtain numerically the biologically relevant rare-event tail of the distribution. The method, which has been outlined in an earlier work, is based on generating the sequences with a parametrized probability distribution, which is biased with respect to the original biological one, in the framework of Metropolis Coupled Markov Chain Monte Carlo. Here, we first present the approach in detail and evaluate the convergence of the algorithm by considering a simple test case. In the earlier work, the method was just applied to one single example case. Therefore, we consider here a large set of parameters: We study the distributions for protein alignment with different substitution matrices (BLOSUM62 and PAM250) and affine gap costs with different parameter values. In the logarithmic phase (large gap costs) it was previously assumed that the Gumbel form still holds, hence the Gumbel distribution is usually used when evaluating p-values in databases. Here we show that for all cases, provided that the sequences are not too long (L > 400), a "modified" Gumbel

The purpose of this series of articles will be to review computer programs that are currently available for use by biologists, particularly in the field of molecular genetics. The nature of the data, comprising encoded forms of the genetic information in living organisms, is such that computerized manipulation and analysis become essential for all but trivial tasks.
A wide range of problems may be tackled. Indeed, most things which can be done by hand and which are repetitious or tedious may be programmed to advantage on a computer. This includes not only numerical work such as statistical analysis, but also a range of non-numerical problems requiring macromolecular sequence comparisons. A common application is the search for special signals in nucleic acid sequences such as ribosome binding sites, promoter sequences and splice junctions. The recent discovery of the cellular gene analogues of viral oncogenes depended upon computer searching methods. Comparative studies of sequences and the inference of molecular phylogenetic trees are other areas which provoke considerable interest.
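As an illustration of what a search for a special signal involves, the following minimal sketch scans a nucleic acid sequence for a pattern. The pattern is an assumption chosen purely for illustration (a purine-rich AGGAGG stretch followed, 5 to 10 bases later, by an ATG start codon, loosely resembling a bacterial ribosome binding site), and the names `SIGNAL` and `find_signals` are hypothetical. It is not the method of any program reviewed here.

```python
import re

# Illustrative pattern only (an assumption, not taken from any reviewed program):
# AGGAGG, then 5 to 10 bases of any kind, then an ATG start codon.
# The lookahead lets overlapping occurrences be reported as well.
SIGNAL = re.compile(r"(?=(AGGAGG[ACGT]{5,10}ATG))")


def find_signals(sequence):
    """Return a list of (0-based position, matched text) for each occurrence."""
    seq = sequence.upper()
    return [(m.start(), m.group(1)) for m in SIGNAL.finditer(seq)]


example = "ttgacaAGGAGGtaacctATGgctaaa"
hits = find_signals(example)
for position, text in hits:
    print(position, text)
```

Real signals are seldom matched exactly, so a practical program would probably score each candidate site against a table of base frequencies rather than rely on a fixed pattern.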
This article describes the availability of molecular sequence databases and of programs to search them. The second article in the series will be concerned with software tools to aid the experimentalist: programs for entering data from restriction mapping and DNA sequencing gels and for managing DNA sequencing projects. A third article will deal with programs for the day-to-day manipulation and analysis of molecular sequences.
To the biologist, computer terminology may prove decidedly inhibitory. In the fast-developing area of computers (and particularly microcomputers) the most economical and effective way of proceeding may be none too clear even to the professional computer scientist. By 'hardware' we refer to the physical equipment upon which computation is performed, and to the peripheral devices such as terminals, printers and magnetic-disc stores which are connected to the computer. By 'software' we mean the stored programs consisting of machine instructions which are executed and the data which are manipulated when a program runs in the computer. We shall attempt to provide an impartial assessment of the quality of the software available, subject to the limitations of hardware. Hardware will often constrain the practicality of certain tasks: searching a large sequence database will be impractical if magnetic-disc storage capacity is limited; alignment of two nucleic acid or polypeptide sequences for maximum match will be a frustratingly slow task on a machine with a slow processor. But it is in software that the success or failure of a system will usually lie. Be warned: commercial software, particularly for microcomputers, may be of poor quality and may contain bugs which prevent its correct operation. Some manufacturers lose interest once a product has been sold, and further support may be poor or nonexistent. It is essential to have an efficient editor for the entry and manipulation of text, and a good compiler or interpreter to convert the text of a program in a chosen programming language into instructions which can be executed on the machine.
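To see why alignment for maximum match taxes a slow processor, consider the following minimal sketch of a global alignment score computed by dynamic programming. The scoring values (match +1, mismatch -1, gap -1) and the name `max_match_score` are assumptions made for illustration only. The point is that every residue of one sequence is compared with every residue of the other, so two sequences of 1,000 residues each require a million cell evaluations.

```python
def max_match_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b.

    The scoring values are illustrative assumptions. Only two rows of the
    table are kept in memory, but len(a) * len(b) cells are still evaluated.
    """
    # Row 0: aligning an empty prefix of a against prefixes of b costs gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


print(max_match_score("ACGT", "AGT"))
```

Here the best alignment pairs three identical bases and inserts one gap. Recovering the alignment itself, rather than only its score, would require keeping the whole table, which is where limited memory becomes a further constraint.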
Assuming that one has been wise or fortunate in the choice of a machine, and that a reasonable operating system, text editor, and compilers are available, what advice can be given about the choice of suitable software for molecular biology? This depends upon one's own expertise, and the extent to which help is available to implement programs. Involvement may range from the acquisition of executable code ready to run on the machine to the alteration and adaptation of a program in some standard programming language to produce a version of the program which runs, for the first time, on your particular machine. The latter course is not to be undertaken lightly, because experience has shown that it may be a very time-consuming process; it is sometimes quicker to rewrite the program from scratch. Though guidelines for software portability are fairly widely disseminated in the literature of computer science (an introductory guide is given by Wallis¹), the message has often failed to reach writers of software for molecular biology. This is partly through

SOFTWARE CLUB
ignorance and partly through the failure to design portable software in the first instance, but primarily because manufacturers proffer the carrot of 'extra facilities' to augment programming languages which may or may not be appropriate to the chosen task.
The major difficulties are certain to occur in the area of input and output of data: communication with the outside world. Though it is possible to standardize algorithms (particular methods of computing) it is less likely that the models of file storage will be the same between different operating systems; only occasionally will the input and output of data to and from its external repository be compatible on different machines. Production of graphical output is even more of a problem, though there is some hope of rationalization with the introduction of GKS (Graphics Kernel System), the first graphics standard.
In assessing software, it is essential to try out the product. There is little to be gained from reviewing commercial descriptions of software; indeed, this approach has proven to be sadly misleading. In order to gain experience of new software, BioEssays would welcome the receipt of new products for review in this column.
Turning to our main subject, there are two recommended choices of databanks of nucleic acid sequences, the EMBL Nucleotide Sequence Data Library compiled in Europe and the GenBank Genetic Sequence Data Bank compiled in the USA. These have similar formats, and there has recently been some discussion between the two organizations with the intention of making the GenBank database more consistent with the EMBL library.
The EMBL library is compiled at the European Molecular Biology Laboratory at Heidelberg. The data library manager is Gregg H. Hamm. The library is available on a magnetic tape free of charge [1]. The latest release is version number 3.0 of 15 December 1983. Almost 1,500 sequences are included in the library, which is made up as shown in Table I. The sequences are supplied as a large single file or as 1481 small files. In addition, there is a user manual and indexes by author, journal, keyword and species. The

SOURCES OF PROGRAMS AND DATABASES
[1] EMBL Nucleotide Sequence Data Library, European Molecular Biology Laboratory, Postfach 10.2209 Heidelberg, W. Germany.

Hardware
The physical computer and its peripheral devices.

Software
The programs which the computer executes. Programs consist of a series of instructions which act on inputs to produce outputs.

Processing unit
Part of the hardware of a computer which can execute instructions. The main processor is called the Central Processing Unit (CPU). There may be others as well: disc drives will have processing units which are responsible for input and output to and from the discs; there may also be special processing units responsible for input and output to and from the terminals connected to the computer.

Operating system
A special computer program which controls what the computer does. Other programs execute under the supervision of the operating system, being allocated resources (execution time, internal memory, access to peripheral devices such as disc drives) according to rules contained within the operating system, the availability of the resources, and the importance of the task being carried out by the program.

Text editor
A computer program which allows the user to input and change text, usually by using a terminal or visual display unit (VDU).

Programming language
The instructions which the computer can carry out are extremely simple and a complex computer program may contain many thousands of these instructions. A programming language is a method of allowing the computer programmer to use much more complex instructions than the computer can execute, a compiler being used to convert the complex instructions which the programming language offers into the simple instructions which the computer can deal with.
Programming languages offer a set of instructions. While the way in which these instructions can be used is subject to stringent rules of syntax, the computer programmer has great freedom in determining what he builds from the instructions.

Compiler
A complex computer program which is capable of converting other computer programs in their textual form (so-called source code) into a series of instructions which can be executed by the computer (so-called object code). The text of the computer program must obey the rules imposed by the programming language.
Because different manufacturers' computers will often have different instructions which the computer can execute, compilers will differ from computer to computer because the object code will differ. However, because the rules which the computer programmer has to obey can be kept constant from computer to computer, it is possible to compile a computer program in its textual form for use on different machines, simply by using the appropriate compiler. This leads to the idea of 'portable' computer programs: programs which can be carried from one computer to another without changing their content.

Portable
A computer program is said to be portable if it is not dependent on the facilities of a particular computer. Portability is extremely important to those people who would change their computer without having to change existing computer programs for use on the new machine.

Algorithm
A set of rules devised to solve a particular problem.

Fragile Sites Provide a New Look at Human Chromosome Structure and One Form of X-linked Mental Retardation
Herbert A. Lubs
Considerable excitement is engendered when clinical observation and basic investigation interact rapidly and effectively. Testimony to this lies in the January 1984 issue of the American Journal of Medical Genetics, which was devoted almost entirely to current studies of a fragile site on the distal long arm of the X chromosome, fra(X)(q27).
This fragile site, also called the marker X (Fig. 1), is closely associated with mental retardation in males. Although it is probably a manifestation of the gene causing mental retardation, close linkage has not been ruled out. The gene is probably second only to trisomy 21 as a genetic cause of mental retardation in males.
Only recently has the importance of fragile sites in human chromosomes been recognized as a means of studying chromosome structure and function as well as human disease. The initial family with this disorder was reported in 1969.¹ Four males were observed to have both a fragile X and mental retardation, and the mode of inheritance was clearly X-linked. TC 199 was employed in the lymphocyte culture. Even in this family, the clinical manifestations were quite variable: one male had successfully completed a tour through the Army (estimated IQ 80) and had a nearly normal appearance, while another had moderately severe mental retardation and a highly abnormal facial appearance.
Although it was known through the early suggestion of Penrose,² and the then-current work of Lehrke,³ that mental retardation was clearly more frequent in males, there was great resistance to the idea that this excess might be due largely to X-linked genes. It is now accepted that there is an excess of at least 25% of retarded males over females and that the roughly 50 X-linked disorders which include mental retardation contribute significantly to this excess.
Progress in this area, however, was delayed until Sutherland found he was unable to repeat earlier observations of fragile sites. This led to carefully defined studies of the various media involved in
[6] D. J. Lipman and W. J. Wilbur, Mathematical Research Branch, N.I.A.D.D.K., National Institutes of Health, Bethesda, MD 20205, USA.
MARTIN J. BISHOP is in the Department of Zoology, University of Cambridge,

Proc. Natl. Acad. Sci. USA 80, 726-730.
GenBank Genetic Sequence Data Bank is established by Bolt Beranek and Newman Inc. and Los Alamos National Laboratory under contract with the National Institutes of Health. It is available on a magnetic tape for $65 [2]. The latest release is version number 18.0 of 12 March 1984. There are ten sequence files on the tape; these are shown in Table II. There are two databases of peptide sequences available. The most complete is the Protein Sequence Database of the
2 WILBUR, W. J. & LIPMAN, D. J. (1983).

TABLE I.
EMBL Nucleotide Sequence Library

TABLE II.
GenBank Genetic Sequence Data