WWC snapshot of http://www.ncbi.nlm.nih.gov/NCBI/progsfs.html taken on Fri May 5 15:43:06 1995

NCBI Programs and Activities Fact Sheet

Introduction

Understanding nature's mute but elegant language of living cells is the quest of modern molecular biology. From an alphabet of only four letters, representing the chemical subunits of DNA, emerges a syntax of life processes whose most complex expression is man. The unraveling and use of this "alphabet" to form new "words and phrases" is a central focus of the field of molecular biology. The staggering volume of molecular data and its cryptic and subtle patterns have led to an absolute requirement for computerized databases and analysis tools. The challenge is in finding new approaches to deal with the volume and complexity of data and in providing researchers with better access to analysis and computing tools in order to advance understanding of our genetic legacy and its role in health and disease.

Creating the National Center

The late Senator Claude Pepper recognized the importance of computerized information processing methods for the conduct of biomedical research and sponsored legislation that established in November, 1988, the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM). The NLM was chosen because it had experience in creating and maintaining biomedical databases and because, as part of the National Institutes of Health (NIH), it could establish an intramural research program in computational molecular biology. NCBI's mission is to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. Its mandate includes four major tasks:

The NCBI has implemented its four goals through the following programs.

Basic Research

From the inception of the NCBI it was considered essential to have a multi-disciplinary group of intramural investigators concentrated on basic research in computational molecular biology. The expectation is that these investigators will not only make important contributions to basic science but will also serve as a wellspring of new methods for applied research activities. A research group composed of computer scientists, molecular biologists, mathematicians, biochemists, research physicians, and structural biologists is studying fundamental biomedical problems at the molecular level using mathematical and computational methods. These problems include gene organization, sequence analysis, and structure prediction. The theory and application of the methods are themselves also the object of research. A sampling of research projects includes: detection and characterization of repeating sequence patterns, domain organization and classification of protein structural elements, mathematical modeling of the kinetics of HIV infection, analysis of the effects of sequencing errors on database searching, development of new algorithms for database searching and multiple sequence alignment, construction of non-redundant sequence databases, mathematical models for estimation of the statistical significance of sequence similarity, and vector models for text retrieval. NCBI investigators maintain ongoing collaborations with several Institutes within the NIH and also with numerous academic and government research laboratories.

Databases and Software

Diverse molecular biology databases presently serve the scientific community. These databases are created and curated by experts actively engaged in research, and are generally constructed independently on different computer systems with few data or conceptual links. Thus, they are most useful for the specialized communities for which they were designed, even though the data may have broader implications and applications. Because the connections between these different databases are crucial, NCBI's most vigorous database activities have been to develop methods for structuring new databases that combine or enhance the existing ones, and to develop links between the various new and existing databases.

NCBI assumed responsibility for the GenBank DNA sequence database in October, 1992. At the NLM, specially trained indexers with graduate-level biology experience create sequence data records from the scientific literature. This information is augmented by data submissions directly from authors. Collaborations and exchange of data exist with the international nucleotide databases maintained by the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ). Arrangements with the National Agricultural Library and the U.S. Patent and Trademark Office enable the incorporation of plant and patent sequence data.

Database efforts are employing the ISO-standard data description language, ASN.1, in order to represent scientific data in a general form and to build tools which can be based directly on ASN.1 specifications and independent of particular software environments. Software tools for managing ASN.1 specifications have been built and are being distributed for a variety of software platforms. Through a 'software toolkit' approach, a set of computer programs is being assembled that is modular and readily adaptable to other systems. The toolkit will allow for end-user customization, accommodate contributions from outside sources, and make it possible for commercial firms to develop value-added components and interfaces.

The increasing availability and bandwidth of national research networks and the participation of the NLM in the Federal High Performance Computing Initiative have motivated the development of network-based programs which can run on local workstations and, transparently to the user, connect to high-performance central servers. The servers can search databases for sequence comparisons or document retrieval and rapidly return the results of the search to the user's workstation over the network. A program for sequence similarity searching developed at NCBI, called BLAST, is being used by over 80 major sequencing centers to execute network sequence searches against the entire DNA database in less than 15 seconds. Complementing the network services, databases and software on CD-ROMs permit standalone access to subsets of databases on investigators' own desktop workstations. A CD-ROM, Entrez: Sequences, is being distributed for end-user access to an integrated view of sequence and bibliographic data. NCBI e-mail servers provide an alternative way to retrieve sequences by similarity or full-text searching.
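As a rough illustration of what a sequence similarity search involves, the sketch below ranks database entries by the number of short words they share with a query. It is a deliberately simplified toy and not the BLAST algorithm: the function names, the word length of 4, and the three sequences are all invented for this example.

```python
from collections import defaultdict


def build_kmer_index(database, k=4):
    """Map each k-letter word to the set of entry names that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def rank_by_shared_kmers(query, index, k=4):
    """Count distinct query words found in each entry; best match first."""
    query_words = {query[i:i + k] for i in range(len(query) - k + 1)}
    counts = defaultdict(int)
    for word in query_words:
        for name in index.get(word, ()):
            counts[name] += 1
    # Sort by descending count, then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy database: names and sequences are made up.
toy_database = {
    "seq_a": "ACGTACGTGGCC",
    "seq_b": "TTTTGGGGCCCC",
    "seq_c": "ACGTTTTTACGT",
}

index = build_kmer_index(toy_database)
hits = rank_by_shared_kmers("ACGTACGT", index)
for name, shared in hits:
    print(name, shared)
```

Building the word index once and reusing it for every query is the kind of work a central server can do on behalf of many workstations. A production search tool would also score and extend matches and estimate their statistical significance, none of which is attempted here.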

In conjunction with the NLM's Library Operations Division, monthly literature surveillance is provided to research databanks such as PIR and the Genome Data Bank. The NCBI assists in the cross-indexing of related molecular biology databases in MEDLINE by providing accession numbers from the databases to link to the MEDLINE bibliographic records. The NLM has supported Dr. Victor McKusick's text, Mendelian Inheritance in Man, and developed the text retrieval software (IRX) for the online version, OMIM. IRX also supports access intramurally at NIH by several hundred users to 15 sequence and factual databases.
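The cross-indexing described above amounts to a two-way lookup between bibliographic records and database accession numbers. The following is a minimal sketch of that idea only; the record layout, the function name, and every identifier in the sample data are hypothetical placeholders, not real MEDLINE or database entries.

```python
from collections import defaultdict


def build_cross_index(links):
    """Build lookups in both directions from (citation, database, accession) triples."""
    by_citation = defaultdict(set)
    by_accession = defaultdict(set)
    for citation_id, database, accession in links:
        by_citation[citation_id].add((database, accession))
        by_accession[(database, accession)].add(citation_id)
    return by_citation, by_accession


# Hypothetical links: identifiers are invented for illustration.
sample_links = [
    ("CIT-0001", "GenBank", "X00001"),
    ("CIT-0001", "PIR", "A00001"),
    ("CIT-0002", "GenBank", "X00001"),
]

by_citation, by_accession = build_cross_index(sample_links)

# Which database records does one citation point to?
print(sorted(by_citation["CIT-0001"]))
# Which citations mention one sequence record?
print(sorted(by_accession[("GenBank", "X00001")]))
```

Keying the reverse lookup on the database name together with the accession number avoids collisions when two databases happen to issue the same accession string.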

Education and Training

The Center fosters scientific communication in the area of computers as applied to molecular biology and genetics by sponsoring meetings, workshops, and lecture series. NCBI staff offer a lecture series on computational molecular biology to NIH investigators. A Scientific Visitors Program has been established to foster collaborations with extramural scientists and to provide on-site training in the use of NCBI programs. Scientific Visitors may receive grants for travel and living costs for periods from a week to several months. A network electronic bulletin board (BITS) has been established for exchange of information on biomolecular software.

Extramural Programs

All extramural funding in the area of computer analysis of molecular biology data is the responsibility of a separate division at the NLM, the Extramural Program Office. The scope of projects it funds in this area is broad and is intended to foster the development of new computer-based analysis methods for the interpretation of molecular biological data.