Database Driven Scheduling for Batch Systems

Paper: 207
Session: C (talk)
Speaker: Bird, Ian G., SURA/Jefferson Lab, Newport News
Keywords: databases, data management, parallelization, mass storage



Database Driven Scheduling for Batch Systems

Ian Bird (igb@jlab.org, 757-269-7105)
Rita Chambers (chambers@jlab.org, 757-269-7514)
Mark E. Davis (davis@jlab.org, 757-269-7027)
Andy Kowalski (kowalski@jlab.org, 757-269-6224)
Sandy Philpott (philpott@jlab.org, 757-269-7152)
Dave Rackley (rackley@jlab.org, 757-269-7041)
Roy Whitney (whitney@jlab.org, 757-269-7536)


SURA/Jefferson Lab
12000 Jefferson Avenue
Newport News, VA 23606
USA
Fax: 757-269-7053


Abstract:

By late 1997, the Thomas Jefferson National Accelerator Facility
("Jefferson Lab") will provide a data reconstruction capability for
experimental data from three concurrently operating Experimental Halls. With
an estimated rate of over 150 TB per year of new raw data and 2-3 passes
through each raw data set to produce reconstructed data sets, a batch CPU
farm providing 300 SPECint95 will be required for off-line data
reconstruction to keep pace with the creation of new data. The Jefferson Lab
Off-line Batch Scheduling System (JOBS) is currently under development to
provide an automated facility for scheduling batch-mode jobs on the central
CPU farm and for correlating run sets with raw data files, data summary
tapes, calibration and other auxiliary files, and metadata related to both
experimental runs and phases of the analysis. The INGRES relational database
will serve as the repository of run and job metadata and will manage the
dynamic status of the batch farm. The system must also coordinate the
retrieval of experimental data files from, and their storage to, the central
mass storage library, a StorageTek silo using Redwood tape transports.
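
As a rough illustration of the kind of run-to-file correlation JOBS must
maintain, the sketch below shows run metadata linked to its raw data,
summary, and calibration files. The class, field, and path names are
hypothetical and do not reflect the actual JOBS or INGRES schema.

    // Illustrative only: hypothetical names, not the actual JOBS/INGRES schema.
    import java.util.ArrayList;
    import java.util.List;

    /** One experimental run and the files JOBS must correlate with it. */
    class RunRecord {
        final int runNumber;
        final String hall;                                        // e.g. "A", "B", or "C"
        final String analysisPhase;                               // metadata on the current pass
        final List<String> rawDataFiles     = new ArrayList<>();  // tape-resident raw data
        final List<String> summaryTapes     = new ArrayList<>();  // data summary tapes (DSTs)
        final List<String> calibrationFiles = new ArrayList<>();  // calibration / auxiliary files

        RunRecord(int runNumber, String hall, String analysisPhase) {
            this.runNumber = runNumber;
            this.hall = hall;
            this.analysisPhase = analysisPhase;
        }
    }

    public class RunMetadataDemo {
        public static void main(String[] args) {
            RunRecord run = new RunRecord(12345, "B", "first-pass reconstruction");
            run.rawDataFiles.add("/mss/hallb/raw/run12345_000.dat");
            run.calibrationFiles.add("/mss/hallb/calib/run12345.cal");
            System.out.println("Run " + run.runNumber + ": "
                    + run.rawDataFiles.size() + " raw file(s), phase "
                    + run.analysisPhase);
        }
    }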

This paper will focus on the problems inherent in scheduling batch jobs that
rely on data retrieved from near-line storage, and on how a database-driven
application is used in JOBS to implement the job scheduler. The JOBS
scheduler prepares jobs for submission to batch queues that are configured
with the commercial batch management software, Load Sharing Facility
(Platform Computing). To maximize the utilization of high-performance and
costly tape drives, the scheduling algorithm must balance the availability
of on-line storage against the stream of new, running, and completed jobs.
It will schedule jobs when all required resources are available and will
ensure tape streaming by co-locating related data sets on physical tapes and
by matching logical batch jobs to physical tape loads. At full production,
the CEBAF Large Acceptance Spectrometer may generate well over 500 2-GB
files per day. Consequently, JOBS must provide the capability to handle
reconstruction jobs for multiple runs with a single job submission, and
should optimize data retrieval so that researchers may keep selected data
sets on-line for extended periods.
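
One minimal sketch of such a tape-aware dispatch decision is given below:
pending jobs are grouped by the tape volume holding their input, and a job
is released to the batch queues only while its tape is the one currently
mounted and enough staging disk remains, so that co-located data sets are
read in a single streaming pass. All class, method, and threshold names here
are hypothetical illustrations, not the JOBS scheduling algorithm itself.

    // Illustrative sketch only: hypothetical names and sizes, not the JOBS scheduler.
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    class PendingJob {
        final String jobId;
        final String tapeVolume;   // physical tape holding the job's input files
        final long stagingBytes;   // on-line disk needed to stage those files

        PendingJob(String jobId, String tapeVolume, long stagingBytes) {
            this.jobId = jobId;
            this.tapeVolume = tapeVolume;
            this.stagingBytes = stagingBytes;
        }
    }

    public class TapeAwareDispatch {
        /**
         * Release jobs whose input lives on the currently mounted tape, as long
         * as staging disk remains; this keeps the drive streaming through
         * co-located data sets instead of remounting a tape for every job.
         */
        static List<PendingJob> selectRunnable(List<PendingJob> pending,
                                               String mountedTape,
                                               long freeStagingBytes) {
            List<PendingJob> runnable = new ArrayList<>();
            for (PendingJob job : pending) {
                if (job.tapeVolume.equals(mountedTape)
                        && job.stagingBytes <= freeStagingBytes) {
                    runnable.add(job);
                    freeStagingBytes -= job.stagingBytes;
                }
            }
            return runnable;
        }

        public static void main(String[] args) {
            List<PendingJob> pending = Arrays.asList(
                    new PendingJob("recon-001", "VOL1234", 4L << 30),
                    new PendingJob("recon-002", "VOL1234", 4L << 30),
                    new PendingJob("recon-003", "VOL9999", 4L << 30));
            // With VOL1234 mounted and 10 GB of staging disk free, only the
            // first two jobs are dispatched; the third waits for its tape.
            for (PendingJob job : selectRunnable(pending, "VOL1234", 10L << 30)) {
                System.out.println("dispatch " + job.jobId);
            }
        }
    }

In the full system the scheduler must also decide when to request the next
tape mount; the fragment above shows only the per-mount selection step.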

The initial release of JOBS is scheduled for Spring 1997 and should begin
early production runs on a central batch CPU farm consisting of HP, IBM, and
Sun workstations providing 150 SPECint95 of processing power. We will
provide an overview of the high-level design of JOBS and the critical
decision points affecting that design. The paper will discuss how the system
is being implemented in Java using its remote object manipulation
capabilities, and how an object view of the relational database is
maintained with the Java Database Connectivity (JDBC) API.
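
As a rough sketch of how JDBC can provide such an object view, the fragment
below loads one row of a hypothetical batch-job table into a Java object.
The connection URL, table name, and column names are placeholders and do not
describe the actual JOBS database.

    // Illustrative JDBC sketch: URL, table, and column names are hypothetical.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    /** A Java-side view of one row of a hypothetical batch-job table. */
    class BatchJob {
        int jobId;
        int runNumber;
        String status;      // e.g. PENDING, STAGING, RUNNING, DONE
        String tapeVolume;  // tape holding the job's input files
    }

    public class JobMetadataDao {
        /** Load a single job's metadata into an object via JDBC. */
        static BatchJob findJob(Connection conn, int jobId) throws SQLException {
            String sql = "SELECT job_id, run_number, status, tape_volume "
                       + "FROM batch_jobs WHERE job_id = ?";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setInt(1, jobId);
                try (ResultSet rs = stmt.executeQuery()) {
                    if (!rs.next()) return null;
                    BatchJob job = new BatchJob();
                    job.jobId = rs.getInt("job_id");
                    job.runNumber = rs.getInt("run_number");
                    job.status = rs.getString("status");
                    job.tapeVolume = rs.getString("tape_volume");
                    return job;
                }
            }
        }

        public static void main(String[] args) throws SQLException {
            // Placeholder JDBC URL; a real deployment supplies its own driver and URL.
            try (Connection conn =
                     DriverManager.getConnection("jdbc:placeholder://dbhost/jobsdb")) {
                BatchJob job = findJob(conn, 42);
                System.out.println(job == null ? "no such job"
                                               : "job 42 status: " + job.status);
            }
        }
    }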