WWW Oriented Remote Job Submission, Monitoring and Management over Internet

Paper: 396
Session: D (talk)
Speaker: Alves, Gilvan, LAFEX/CBPF, Rio de Janeiro
Keywords: configuration management, networking, wide-area networking, world-wide collaboration, WWW applications


M. Miranda, G.A. Alves, M. Joffily, A. Santoro, M.H.G. Souza

LAFEX/CBPF R. Dr. Xavier Sigaud, 150 Rio de Janeiro, RJ, Brazil

The growing need for computational resources in big collaborations
like D0 leads to the use of resources located at different sites. We have
developed a system that allows one to submit and monitor a job over the
Internet using a WWW interface. The system assumes there is a centralized
server which receives job requests and sends them over the Internet to a
production site. The centralized server also functions as a temporary
archive for the input and output data.

The system is designed for any kind of job with the following
characteristics:

- CPU intensive
- event oriented, with independent events
- 1 input and 1 output event file (possibly very large), besides
calibration and control files

It is also assumed that the production site has been properly
configured with calibration, control and other data files, as well as any
environment parameters. The executable program must also be available at
the production site.

Some customization may be required for some well-defined modules,
such as the module that deals with event formats. A pilot system has been
developed for D0's Monte Carlo, which was run on a RISC farm with 30 working
nodes at LAFEX/CBPF in Rio de Janeiro.


The system is composed of two main parts: the submission and
monitoring system, presented in this paper, and the core system, which
implements the reliable data transfer and job control. The core system is
presented in the paper "Client/Server OOP Implementation of Remote Job
Parallel Execution with Reliable Data Transfer over the Internet", submitted
to this conference.

Both the submission and monitoring system and the management
interface are based on a WWW interface.

The user submits a job by connecting to the WWW server on the
Central Server. Information such as the input data file, job parameters,
user identification and priority is sent to the server, which generates an
entry in a request file. A daemon continually examines the requests
whenever a production system is available. Policies such as priority and
user privileges can be customized. For such customization, a routine has
to be written. A template with a FIFO policy is distributed with the
system.
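
The request selection described above can be sketched as follows. This is
a minimal illustration in Python, not the routine distributed with the
system: the JobRequest fields, the function names fifo_policy,
priority_policy and select_next, and the in-memory list standing in for the
request file are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class JobRequest:
    # Hypothetical request-file entry; the field names are assumptions.
    job_id: int
    user: str
    input_file: str
    priority: int
    submitted_at: float  # submission time in seconds


# A policy receives the pending requests and returns the one to run next.
Policy = Callable[[List[JobRequest]], Optional[JobRequest]]


def fifo_policy(pending: List[JobRequest]) -> Optional[JobRequest]:
    """Pick the oldest request, ignoring priority (first in, first out)."""
    if not pending:
        return None
    return min(pending, key=lambda r: r.submitted_at)


def priority_policy(pending: List[JobRequest]) -> Optional[JobRequest]:
    """Example customization: highest priority first, oldest on ties."""
    if not pending:
        return None
    return min(pending, key=lambda r: (-r.priority, r.submitted_at))


def select_next(pending: List[JobRequest],
                site_available: bool,
                policy: Policy = fifo_policy) -> Optional[JobRequest]:
    """One daemon pass: dequeue a request only if a site is available."""
    if not site_available:
        return None
    chosen = policy(pending)
    if chosen is not None:
        pending.remove(chosen)
    return chosen
```

In this sketch, replacing the policy amounts to passing a different
function to select_next, which mirrors the idea of writing one routine to
customize the selection.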

Once a job is selected for execution, the submission system starts
the core system which in turn starts the job on the designated site. At the
end of the job the user is notified by e-mail.

Every submitted job receives a JobId, which is used for monitoring the
job. To monitor the job, the user connects to the status page of the WWW
server and enters the JobId. The monitoring system reports whether the job
is waiting, running or finished. If the job is running, further information
can be provided, such as the number of events already processed and the
status of the farm, among others.
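
A status lookup of this kind might look like the sketch below. The three
states come from the text; the record layout (a dictionary keyed by JobId
with "state", "events_done" and "events_total" entries) and the function
name status_report are assumptions for illustration only.

```python
from enum import Enum


class JobState(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    FINISHED = "finished"


def status_report(jobs, job_id):
    """Return a one-line status for job_id, or a notice if it is unknown."""
    record = jobs.get(job_id)
    if record is None:
        return f"Job {job_id}: unknown JobId"
    state = record["state"]
    line = f"Job {job_id}: {state.value}"
    # Extra detail is only meaningful while the job is running.
    if state is JobState.RUNNING:
        line += (f", {record['events_done']} of "
                 f"{record['events_total']} events processed")
    return line
```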

The core system allows reconfiguration of the production site
without having to stop any running job. Using the management interface
the manager of the site can add or remove nodes on the fly.
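
One way such on-the-fly reconfiguration could work is sketched below. It
assumes that events are dispatched to nodes one at a time, which the
independence of events makes plausible but which is not stated here; the
actual mechanism belongs to the core system. The class name NodePool and
its methods are hypothetical.

```python
import threading


class NodePool:
    """Set of worker nodes that can change while a job is running."""

    def __init__(self, nodes=()):
        self._lock = threading.Lock()
        self._active = set(nodes)  # nodes eligible for new events
        self._busy = set()         # nodes currently processing an event

    def add(self, node):
        """Make a node eligible for work from the next event onwards."""
        with self._lock:
            self._active.add(node)

    def remove(self, node):
        """Stop giving new events to a node; a running event may finish."""
        with self._lock:
            self._active.discard(node)

    def acquire(self):
        """Return an idle active node for the next event, or None."""
        with self._lock:
            idle = sorted(self._active - self._busy)
            if not idle:
                return None
            self._busy.add(idle[0])
            return idle[0]

    def release(self, node):
        """Mark a node as idle once its current event is done."""
        with self._lock:
            self._busy.discard(node)
```

Because a removed node is only excluded from future acquire calls, the job
itself never has to stop: the event in progress on that node completes and
is released normally.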

It is worth pointing out that the Central Manager and the Production
site can be any UNIX machines connected through the Internet. This means
that one can build one's own local system using the available workstations.
If the program priorities are set low, one can build a production system
using the unused cycles of the workstations at one's institution. The nodes
of the system can also be spread over the world, making a "Virtual Farm".
Because the programs are assumed to be CPU intensive, little load should be
put on the network.