Distributed System

WWC snapshot of http://www.nbs.gov/nbii/distributed.html taken on Mon May 29 0:13:07 1995

Network-based Analysis on the National Biological Information Infrastructure and other Distributed Data Systems

The NBII will be a network of many distributed data bases, information sources, and applications that users may discover, browse, and download over the Internet. To deal effectively with data -- in contrast to information -- the NBII and its sister NII data systems need technologies that will allow users to run applications and perform analyses right on the network, rather than downloading data sets to their own sites for conventional, standalone analyses. The National Biological Service is investigating the creation of technologies for performing network-based analyses and distributed computing on the National Biological Information Infrastructure (NBII) and on other NII distributed data systems. The technologies would provide the following capabilities:

Network-based analysis
Distributed computing
Collaborative computing
Data interoperability
Metadata creation
Transaction lineage tracking (and documenting)
Use-profile generation
Capture of user feedback, including failures
Use of a virtual repository
Use of fee-processing procedures

Users would make use of these tools through predefined, menu-accessible applications available on the network. Examples of applications include waterfowl production models and associated wetlands habitat management (the suggested testbed for this project), air quality permitting, ecosystem risk assessment, gene flow in populations of endangered species, landscape evaluation, and grazing lands simulation. The applications may utilize GIS, remote sensing, visualization, statistical analyses, expert and rule-based systems, or other processes that yield a desired result. The universe of applications is limited only by their availability and their potential to be put up on the network. Some may be as simple as recalling a map of a given area; others may be the result of a complex interaction of many variables. In time, users will have a menu of many applications from which to choose.

The processes may be run singly, in combination with one another, or in some complex iterative feedback stream, in which results generate subsequent results. A high-speed network will connect the sites, creating a virtual analysis space in which distributed tools process distributed data on distributed machines, returning desired results to users wherever they may be.
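
The three ways of running processes described above (singly, in combination, and in an iterative feedback stream) can be illustrated with a minimal Python sketch. It is not taken from the project; the function names and the use of a single numeric value as the "result" are assumptions made only for illustration.

```python
from typing import Callable, List

# A "process" is modeled here as any function from one result to the next.
# This is an illustrative assumption, not a design from the source.
Process = Callable[[float], float]


def run_in_combination(processes: List[Process], value: float) -> float:
    """Feed the output of each process into the next one."""
    for process in processes:
        value = process(value)
    return value


def run_iteratively(process: Process, value: float,
                    tolerance: float = 1e-6, max_rounds: int = 100) -> float:
    """Re-run a process on its own output until the result stops changing."""
    for _ in range(max_rounds):
        next_value = process(value)
        if abs(next_value - value) < tolerance:
            return next_value
        value = next_value
    return value
```

A single process is simply called directly; `run_in_combination` chains several, and `run_iteratively` models results that generate subsequent results until they settle.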

Although the needs of the NBII are the specific impetus for this initiative, the tools that will be developed will be extensible and applicable to data systems other than the NBII.

Use of distributed computing technologies will allow users discretion in the choice of computing environments for conducting their analyses. Costs, location, capabilities, and speed are characteristics that users may consider in choosing their (virtual or actual) compute platform. Dynamic services advertisement allows users to learn about and select from among various compute-service providers. Distributed computing enormously increases computational horsepower by partitioning and distributing computational tasks over many machines on the network. Thus, processes that might otherwise take hours or days may be done in minutes or seconds. The performance advantages of distributed computing can be all the more enhanced through the use of supercomputing capabilities, which will also be an option.
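
As a rough sketch of partitioning a computational task, the following Python fragment splits a workload into chunks and processes the chunks concurrently. Local threads stand in for machines on the network, and `analyze_chunk` is a hypothetical placeholder for a real analysis step; neither comes from the source.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def partition(items: Sequence[float], parts: int) -> List[Sequence[float]]:
    """Split items into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def analyze_chunk(chunk: Sequence[float]) -> float:
    """Hypothetical per-chunk analysis; here it is just a sum."""
    return sum(chunk)


def distributed_total(items: Sequence[float], workers: int = 4) -> float:
    """Process each chunk on its own worker, then combine partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(analyze_chunk, partition(items, workers)))
    return sum(partials)
```

The gain described above depends on the analysis being separable into independent pieces whose partial results can be combined afterward, as the sum is here.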

Data interoperability tools attempt to "normalize" data structures and formats for integrated analyses. Examples include interoperating with data from disparate GIS or relational databases. Metadata creators will automatically help create metadata elements for generated analytical products. These elements are essential for advertising new products to subsequent analysts and users.
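
One way to picture "normalizing" data from disparate sources is to map each source's rows into a single shared record type and then derive metadata from the normalized product. The sketch below assumes two hypothetical source layouts (a GIS export with areas in acres and a relational table with areas in hectares); all field names are invented for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List

ACRES_TO_HECTARES = 0.40468564224  # exact conversion for the international acre


@dataclass(frozen=True)
class WetlandRecord:
    """Common structure that every source is normalized into."""
    basin_id: str
    area_hectares: float
    zone: str


def from_gis_export(row: Dict[str, Any]) -> WetlandRecord:
    # Hypothetical GIS layout: upper-case field names, area in acres.
    return WetlandRecord(str(row["BASIN"]),
                         float(row["ACRES"]) * ACRES_TO_HECTARES,
                         row["ZONE"].strip().lower())


def from_relational_row(row: Dict[str, Any]) -> WetlandRecord:
    # Hypothetical relational layout: area already in hectares.
    return WetlandRecord(str(row["basin_id"]),
                         float(row["area_ha"]),
                         row["zone_name"].strip().lower())


def describe_product(title: str, sources: List[str],
                     records: List[WetlandRecord]) -> Dict[str, Any]:
    """Generate basic metadata elements for an analytical product."""
    return {
        "title": title,
        "sources": sorted(sources),
        "record_count": len(records),
        "fields": [f.name for f in fields(WetlandRecord)],
    }
```

Once both sources yield `WetlandRecord` values, an integrated analysis no longer needs to know where a record came from, and the metadata can be produced without manual entry.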

Transaction lineage trackers keep records of analytical processes, so that users may know how results were generated. This is vital for understanding how any given result was arrived at. It may also be used to keep a global record of the kinds of analyses that users are performing. The use-profile generator keeps records on and generates reports about the uses of the system. The virtual repository of generated products lets users access and benefit from the results of others' analyses. Fee-processing procedures will allow users to purchase and pay for products and services over the network.
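
A transaction lineage tracker and a use-profile generator could be sketched along the following lines. This is an illustration under assumptions, not the project's design: the class names, the record fields, and the choice of a SHA-256 fingerprint are all hypothetical.

```python
import hashlib
import json
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class TransactionRecord:
    """One analytical step: which model ran on which inputs to produce what."""
    model: str
    parameters: Dict[str, float]
    inputs: List[str]
    output: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def fingerprint(self) -> str:
        # Hash the content (not the timestamp) so identical runs match.
        payload = json.dumps({"model": self.model,
                              "parameters": self.parameters,
                              "inputs": sorted(self.inputs),
                              "output": self.output}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class LineageLog:
    def __init__(self) -> None:
        self._records: List[TransactionRecord] = []

    def record(self, rec: TransactionRecord) -> None:
        self._records.append(rec)

    def lineage_of(self, output: str) -> List[TransactionRecord]:
        """Trace a result back through every run that contributed to it."""
        by_output = {r.output: r for r in self._records}
        chain, pending, seen = [], [output], set()
        while pending:
            name = pending.pop()
            rec = by_output.get(name)
            if rec is None or name in seen:
                continue  # a raw input dataset, or already visited
            seen.add(name)
            chain.append(rec)
            pending.extend(rec.inputs)
        return chain

    def use_profile(self) -> Counter:
        """Count how often each model has been run."""
        return Counter(r.model for r in self._records)
```

The same log serves both purposes named above: `lineage_of` answers how a given result was produced, while `use_profile` summarizes what kinds of analyses users are performing.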

Finally, a simple, intuitive, graphical interface will provide users with access to all of these capabilities and services. Users will be able to access, select, and apply tools of analysis to data right on the network. The interface will allow people to marry data and tools on the network easily and simply. Through simple point-and-click operations, users will be able to find data and identify the analytical products they want (through the application of appropriate tools).

Testbed Description

The testbed for the project will be a waterfowl population and habitat management model for the Northern Great Plains. The model has been in use for nearly a decade by the U.S. Fish and Wildlife Service to predict populations of waterfowl and to help determine appropriate habitat management practices. The model will be migrated to the testbed, where it will form the basis for the creation, testing, and implementation of the capabilities listed above.

Potential Resource and User Nodes

The following resource nodes, which will provide data, information, computational tools, analysis tools, or staff, are suggested for this activity:

Fish and Wildlife Service (FWS) sites located in FWS Region 3 (Fergus Falls, MN) and in FWS Region 6 (Bismarck, ND). These nodes are the home of HAPET (Habitat and Populations Evaluation Teams), whose job it is to manage data, information, and models pertinent to the study of waterfowl populations.
The NBS Northern Prairie Science Center (NPSC) located in Jamestown, ND. Models and data bases maintained here are frequently required by the HAPET researchers.
National Biological Service (NBS) (Lakewood, CO) will provide project management.
EOSAT (Lanham, MD) will provide access to its index of remote sensing products, and will provide access to selected high resolution multispectral images.
U.S. Geological Survey (location to be determined) will provide data and technical expertise.

The two HAPET sites as well as the NPSC, where the actual waterfowl population research and management will take place, will serve as the initial user nodes. Users collocated at the other resource nodes may also participate. In the future, additional ecosystem management user nodes may be added to support "back-end" users.

Data Requirements

The following datasets or databases will support the project:

Wetlands data bases. National Wetland Inventory (NWI) digital data base for entire area. Located at FWS HAPET offices in Bismarck, North Dakota, and Fergus Falls, Minnesota, and Northern Prairie Science Center in Jamestown, North Dakota. Also accessible through the Internet from NWI. Includes information for wetland zones (e.g., semi-permanent, seasonal, wet meadow) within basins.
Consolidated wetland basins. Contiguous wetland zones from NWI mapping, consolidated according to basin. Partly done for North Dakota; expected to be completed by December 1994. South Dakota completed. North Dakota information available at Bismarck HAPET office. South Dakota data may be available from South Dakota Cooperative Research Unit and EPA-Corvallis.
Current wetland conditions. Airborne videography indicating the fraction of each wetland basin that was inundated during late spring each year, 1987-present. Based on more than 500 sample units, each four square miles in size, in a stratified random sample from North Dakota, South Dakota, western Minnesota, and eastern Montana. Available at HAPET offices.
Upland landuse information. Sample of more than 500 four-square-mile plots mentioned above. Available at HAPET offices.
Nest success rates for waterfowl. Average nest success of common duck species in Prairie Pothole Region, according to species, habitat, FWS wetland management district, and time period. Available at Northern Prairie Science Center.
Waterfowl counts conducted annually on more than 1000 wetland basins in North Dakota, South Dakota, western Minnesota, and eastern Montana. Based on on-the-ground surveys of wetland basins. Available at Northern Prairie Science Center and HAPET offices.
Breeding bird distributions. Range-wide contour maps of relative abundance of bird species, based on Breeding Bird Survey data. Available at Northern Prairie Science Center.
Carnivore surveys. Information on distribution and relative abundance of coyote, red fox, raccoon, and striped skunk; data for other species are available but are less reliable. Surveys conducted by the Minnesota Department of Natural Resources; spatial smoothing done at Northern Prairie Science Center. Available for Minnesota only.
Contour maps. Available at Northern Prairie Science Center.

Models and Analytic Tools

The following models and analytic tools are currently in use by waterfowl researchers. They will form the basis for the distributed computing and analysis to be done under this project.

Waterfowl population estimation models. Regression models that predict abundance of most common duck species from information on wetland basin class and area. Uses wetland data base, consolidated wetland basins, and current wetland conditions data bases described above. Estimates of model parameters vary regionally and annually and are based on the waterfowl counts database mentioned above. Available at HAPET offices and Northern Prairie Science Center.
Waterfowl production model. Individual-based simulation model that predicts production of mallards and four other species of dabbling ducks. Uses waterfowl counts, upland land use information, current wetland conditions, and nest success rates for waterfowl. Available at Northern Prairie Science Center and HAPET offices.
Wetlands bird abundance models. Regression models to predict the presence/absence or abundance of selected species of non-game wetland birds (e.g., coots, rails, marsh wrens) will be developed in the near future. Preliminary versions could be made available for use with the testbed. Uses wetlands database, consolidated wetland basins, and current wetland conditions. Available at Northern Prairie Science Center.
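
To make the idea of a regression model driven by basin class and area concrete, here is a deliberately simplified Python sketch. The source does not give the functional form or any parameter values, so the linear form, the use of inundated fraction as a multiplier, and every coefficient below are placeholders invented for illustration; they are not estimates from the actual models.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Basin:
    basin_class: str      # e.g. "seasonal" or "semi-permanent"
    area_ha: float        # basin area in hectares
    fraction_wet: float   # share of the basin inundated, 0.0 to 1.0


# PLACEHOLDER values only: invented for this example, not real estimates.
# The real parameters are said to vary regionally and annually.
PLACEHOLDER_COEFFICIENTS: Dict[str, float] = {
    "seasonal": 0.5,
    "semi-permanent": 1.0,
}


def expected_pairs(basin: Basin, coefficients: Dict[str, float]) -> float:
    """Assumed linear form: coefficient for the class times wet area."""
    if not 0.0 <= basin.fraction_wet <= 1.0:
        raise ValueError("fraction_wet must lie between 0 and 1")
    return coefficients[basin.basin_class] * basin.area_ha * basin.fraction_wet


def regional_estimate(basins: Iterable[Basin],
                      coefficients: Dict[str, float]) -> float:
    """Sum the per-basin predictions over a region of interest."""
    return sum(expected_pairs(b, coefficients) for b in basins)
```

Passing the coefficients in as an argument, rather than hard-coding them, reflects the statement that parameter estimates differ by region and year.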

Possible Operational Scenario

A waterfowl researcher and resource manager wish to run a population model to predict the population of a given species of waterfowl over a particular region for the next growing season. The model and the required input datasets are distributed across various resource nodes.
Using a graphical interface, the users may collaboratively generate a query against a directory of models. The server returns a list of relevant waterfowl population models that match the users' criteria. The users select the model of interest and receive a description of the model, its operating assumptions and methodologies, instructions for running it, and a list of input data requirements.
The users specify a region and time of interest and requests are automatically generated for each of the input datasets (through a link to a catalog system). The users are notified if any input dataset is not referenced in the directory or is not available for online access.
Resource nodes respond to the request by extracting, transforming (if required), and placing the data in an accessible online location. Information about the location and condition of staged data is sent back to the application.
The users may optionally select one or more available compute servers and a mode of computation (i.e., standalone vs. distributed) that the model will utilize. The system informs the users about the estimated costs and compute times for the various modes of computation. In the absence of a selection, the system will default to the least-cost method.
Before execution of the model, the system provides to the users a cost estimate based on anticipated resource utilization. Utilization data will be collected from all resource nodes for a more definitive post-run accounting.
The users interactively define an appropriate set of model parameters and initiate a model run. Input datasets (and perhaps the model itself) are read in situ or ported to the selected compute server, and the model is executed. Status information is displayed at the users' workstations.
The results of the analysis are displayed simultaneously at the users' workstations and at the workstations of colleagues, if requested. Several colleagues may collaborate in the analysis of the model run. Input parameters may be adjusted and the model may be executed multiple times.
Each time a model is executed, a transaction record is generated that links together the model, the input parameter set, and the output file. This record serves as the foundation of the "results metadata" and may be used to initiate model runs in the iterative process.
The results of the iterative runs of the model are stored in a temporary workspace. Should the users decide to keep the results, standard metadata are created, and the results, along with their metadata, are shipped to an appropriate metadata server, where others may access and use them. The metadata are periodically indexed and linked to other relevant metadata servers (i.e., the software directory), so that subsequent users, who may not have the technical depth of the process originators, may discover, understand, and use the results themselves.
Users may bring correlative data to bear on the results of previous analyses by querying the network for available correlative data sets. The query is distributed across the network to all relevant directory and/or inventory servers, and browse products may be viewed simultaneously by all users. Then the entire foregoing process may be repeated, using appropriate models or display software, as required. This may be done as many times as desired, blending results with new data as software and data permit. For every permanent result, new transaction records and metadata are automatically created and stored in a public "kiosk".
Before finishing their analytical session, users are queried with respect to their use of, and satisfaction with, the system. This feedback will allow system managers to keep current with users' needs, and it will also inform them about the effectiveness and utility of the system they are providing and funding. The system will also keep the user informed, both during the session and at its end, of incurred costs and where and to whom the invoice for services or products will be sent.
Preliminary findings from the analysis are written up and placed into an online kiosk where state and local ecosystem resource managers can easily access the information. Links are established to appropriate metadata servers that relate the value-added results back to the original analysis. This last point is very important, since the scientific end-user (i.e., the waterfowl researcher) is not necessarily the same person as the decision-making end-user. Thus, the application also must include some mechanism for quickly and easily disseminating the value-added results to the decision-making resource managers.
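
The early steps of the scenario above (querying a model directory, checking input datasets against a catalog, and choosing a compute server with a least-cost default) can be sketched in Python as follows. All class names, field names, and the keyword-matching rule are assumptions for illustration; the source does not specify how the directory or catalog would be implemented.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ModelEntry:
    name: str
    keywords: FrozenSet[str]          # lower-case search terms
    required_inputs: Tuple[str, ...]  # names of input datasets


@dataclass(frozen=True)
class ComputeOffer:
    server: str
    mode: str                 # "standalone" or "distributed"
    estimated_cost: float
    estimated_minutes: float


def query_directory(directory: Iterable[ModelEntry],
                    terms: Iterable[str]) -> List[ModelEntry]:
    """Return the models whose keywords include every search term."""
    wanted = {t.lower() for t in terms}
    return [m for m in directory if wanted <= m.keywords]


def check_inputs(model: ModelEntry,
                 catalog: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Split inputs into available ones and ones to warn the users about.

    `catalog` maps a dataset name to whether it is available online;
    a dataset absent from the catalog counts as unavailable.
    """
    available = [d for d in model.required_inputs if catalog.get(d, False)]
    missing = [d for d in model.required_inputs if not catalog.get(d, False)]
    return available, missing


def choose_offer(offers: List[ComputeOffer],
                 server: Optional[str] = None) -> ComputeOffer:
    """Honor the users' server choice, else default to the least-cost offer."""
    if server is not None:
        for offer in offers:
            if offer.server == server:
                return offer
        raise ValueError(f"no offer from server {server!r}")
    return min(offers, key=lambda o: o.estimated_cost)
```

The chosen offer's `estimated_cost` and `estimated_minutes` are what the system would report to the users before the run; post-run accounting would replace these estimates with measured utilization.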

http://www.nbs.gov/nbii/distributed.html
Last Updated 5/16/95
