SysManager

A system for monitoring many networked machines from a single interface.

Authors: Jason Carlyle and Phil White


Technical Details of the SysManager System

The SysManager system is composed of three main parts, as described in the User's Manual. The justification for a three-part design is as follows:

The system obviously requires some sort of user interface that can be started and stopped at any time. This is the first component of the system, the SysManager User Interface.

When the user interface is started, information about all known machines must appear instantly. The interface cannot be expected to gather this information from all the machines itself in a reasonable amount of time. This requires a second part to the system, a central 'database' of all known machines that must stay running at all times (which is the only way to guarantee that the information is ready at any time). This component is known as 'Collectord'.

Since this database runs on a single machine, it cannot gather detailed information about every other machine by itself. This necessitates the third component, known as 'Sysmanagerd', which runs continuously on each machine to be monitored.

Following are detailed technical descriptions of each of the three components of the SysManager System:

Sysmanagerd Technical Description

Purpose:

Sysmanagerd is responsible for the following functions:


Registration procedure

Upon startup, and any time Collectord has not contacted Sysmanagerd within a specified amount of time, Sysmanagerd goes into this procedure.
  1. Sysmanagerd opens a connection to the machine that was specified as Collectord on the command line at startup.
  2. Once the connection is open, Sysmanagerd sends the NEWCLIENT command.
  3. Sysmanagerd then sends the QUIT command to let Collectord know that it can close the connection.
  4. Sysmanagerd then closes the connection and considers itself registered.
  5. Since it believes itself to be registered, it moves to the Connection Wait stage.
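The registration steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the manual names the NEWCLIENT and QUIT commands but not their wire format, so the newline terminator, the port argument, and the function name register are hypothetical.

```python
import socket


def register(collectord_host: str, port: int, timeout_s: float = 5.0) -> None:
    """Announce this machine to Collectord, then disconnect.

    Assumption: commands are plain ASCII words terminated by a newline.
    """
    # Step 1: open a connection to the Collectord machine.
    with socket.create_connection((collectord_host, port), timeout=timeout_s) as conn:
        # Step 2: announce ourselves as a new client.
        conn.sendall(b"NEWCLIENT\n")
        # Step 3: tell Collectord it may close its side of the connection.
        conn.sendall(b"QUIT\n")
    # Step 4: leaving the with-block closes the connection; the caller now
    # treats the machine as registered and moves to Connection Wait.
```

A caller would run register(host, port) once at startup and again whenever Connection Wait times out.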

Connection Wait procedure

The purpose of Connection Wait is, as its name implies, to simply wait on Collectord to contact Sysmanagerd and request information. If contact is not made within a certain time, Connection Wait times out and Sysmanagerd is sent back into the Registration procedure. However, if a connection is made, the timeout counter is reset and Sysmanagerd is sent into Command Mode.
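As a rough sketch of this stage, the wait can be expressed as an accept call with a timeout. The stage names returned here, the function name connection_wait, and the use of a listening TCP socket are assumptions made for illustration; the manual does not say how Sysmanagerd implements the timeout.

```python
import socket
from typing import Optional, Tuple


def connection_wait(
    listener: socket.socket, timeout_s: float
) -> Tuple[str, Optional[socket.socket]]:
    """Wait for Collectord to connect and report which stage comes next."""
    # Setting the timeout on every call restarts the timeout counter.
    listener.settimeout(timeout_s)
    try:
        conn, _addr = listener.accept()
    except socket.timeout:
        # No contact within the allowed time: go back and register again.
        return "REGISTRATION", None
    # Collectord connected: hand the open connection to Command Mode.
    return "COMMAND_MODE", conn
```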

Command Mode procedure

Once in Command Mode, there are four possible ways to exit it.

Collectord Technical Description

Purpose:

Collectord is responsible for the following functions:


Main Data Structures Used:

Machine information for each registered machine is stored in a dynamically allocated list. Each entry contains the machine's IP address and the string of information that was returned by the RETRIEVEINFO command.
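A minimal Python sketch of such a list is shown below. The class and method names are hypothetical, and ignoring a repeated registration of the same address is an assumption; the manual specifies only the two stored fields.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MachineRecord:
    ip_address: str
    info: str = ""  # raw string returned by the RETRIEVEINFO command


class MachineList:
    """Growable list of registered machines (names are illustrative)."""

    def __init__(self) -> None:
        self.records: List[MachineRecord] = []

    def find(self, ip_address: str) -> Optional[MachineRecord]:
        for record in self.records:
            if record.ip_address == ip_address:
                return record
        return None

    def register(self, ip_address: str) -> MachineRecord:
        # Assumption: registering a known address reuses the existing entry.
        record = self.find(ip_address)
        if record is None:
            record = MachineRecord(ip_address)
            self.records.append(record)
        return record

    def store_info(self, ip_address: str, info: str) -> None:
        self.register(ip_address).info = info
```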

Startup Process

On startup, the list of machines is initialized and Collectord is immediately sent into its Mainloop.

Mainloop procedure

Collectord consists of three major pieces: Checking for incoming requests, Gathering machine information, and Checking the Message Queue. The Mainloop consists of the first two of these three pieces plus a wait period, so that machines are not polled continually. Checking the Message Queue is done whenever a SIGUSR1 is raised.

Checking for incoming requests

Collectord checks for incoming requests until there is at least one registered machine to gather information from. Once at least one machine is found, control continues down the loop to gather information for that machine. If a request is received by Collectord while in this state, Collectord is sent into a command mode to handle the request. Once in command mode, the following four commands are valid:

Gathering Machine Information

To allow for scalability, the information gathering phase of Collectord runs concurrently. To gather a machine's information, a child process is forked, which then follows this procedure:
  1. The child opens a TCP communication channel to the specified Sysmanagerd.
  2. Once the channel is open, the child sends the RETRIEVEINFO command to gather the machine's information.
  3. This information is packaged up in a message and sent to the parent.
  4. If there are communication problems, a special message is sent to the parent process instead of the informational message.
  5. The child then raises a SIGUSR1 in the parent to let it know that a message is ready.
  6. The child process then sets the polling interval on the Sysmanagerd to which it is connected.
  7. As a final step, the child closes the communication channel and exits.
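
The fork-and-notify pattern in this procedure can be sketched in Python as follows. Several parts are assumptions rather than the actual Collectord code: a pipe stands in for the message queue, the fetch_info callable stands in for the TCP exchange with Sysmanagerd, the "INFO" and "ERROR" message prefixes are invented, and the step that sets the polling interval is omitted.

```python
import os
import signal
import time

_message_ready = False


def _on_sigusr1(signum, frame):
    """Signal handler: only record that a message is waiting."""
    global _message_ready
    _message_ready = True


def poll_machine(ip_address, fetch_info, wait_s=5.0):
    """Fork a child to gather one machine's information.

    Returns the message read by the parent, or None if no signal arrived.
    """
    global _message_ready
    _message_ready = False
    # Install the handler before forking so the signal is never fatal.
    signal.signal(signal.SIGUSR1, _on_sigusr1)
    read_fd, write_fd = os.pipe()  # stand-in for the message queue
    pid = os.fork()
    if pid == 0:  # child process
        try:
            os.close(read_fd)
            try:
                message = "INFO " + ip_address + " " + fetch_info(ip_address)
            except OSError:
                # Communication problem: send the special message instead.
                message = "ERROR " + ip_address
            os.write(write_fd, message.encode())
            os.close(write_fd)
            # Tell the parent that a message is ready.
            os.kill(os.getppid(), signal.SIGUSR1)
        finally:
            os._exit(0)  # never fall through into the parent's code
    # Parent process: wait for the notification, then read the message.
    os.close(write_fd)
    deadline = time.monotonic() + wait_s
    while not _message_ready and time.monotonic() < deadline:
        time.sleep(0.01)
    message = os.read(read_fd, 65536).decode() if _message_ready else None
    os.close(read_fd)
    os.waitpid(pid, 0)  # reap the child
    return message
```

The handler does nothing except set a flag, which keeps the work done at signal time minimal; the parent reads the message from its normal control flow.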

Checking the Message Queue

Since Collectord is forked to gather information for each machine, there needs to be some method to communicate gathered information back to the parent process. This was accomplished through the use of message queues. Once the child process gathers information and puts it into a message, it raises a SIGUSR1 to let the parent know to check the message queue. There are two types of messages that the child could be sending:

SysManager Tcl/Tk Interface Technical Description

Main Data Structures Used:

In order to store the various pieces of information about several machines, the interface uses an array of lists known as machines. Each element in the array represents all of the data relevant to one machine. Two of these arrays are needed to implement warning messages: one to store the current state of all machines, and one to store the previous state of all machines.

Startup Process

When the Interface is launched, the following defaults are set within the interface:

The interface is then constructed, and the message "SysManager Interface Started" is printed in the message display. The command line is then parsed for the server name, and the procedure mainloop is called, followed by the procedure cycle.

Mainloop procedure

The mainloop procedure is responsible for keeping the information stored by the interface up-to-date. Here is the logic used by mainloop to control the interface:
  1. Check to see if a machine is currently selected in the machine display. If so:
    1. Clear the information display.
    2. Update each field of the information display with the respective information for the selected machine.
  2. Contact the server:
    1. Open a socket to the server. (Print an error if the connection fails.)
    2. Send the SPEWFORTHALL command.
    3. Receive the information string from the server using non-blocking I/O.
    4. Send the current polling interval to the server using the POLL command.
    5. Close the connection to the server.
    6. Parse the information received from the server into a machines array.
  3. Compare the machines array just created with the previous machines array.
    1. If a machine is in the previous array, but not in the current array, remove it from the display, and print a message indicating lost contact.
    2. If a machine is in the current array, but not in the previous array, add a new icon in the machine display, and print a message indicating a new machine.
    3. Check the differences in all statistics against those specified in the preferences, and print any appropriate warnings regarding individual machines.
  4. Update the polling interval to whichever value is greater:
    • The number of machines
    • The polling interval specified in the preferences
  5. Schedule mainloop to be executed again one polling interval in the future.
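
Steps 3 and 4 of this logic can be illustrated with a short Python sketch. The interface itself is written in Tcl/Tk, so this is not its code; the dictionaries keyed by machine name and both function names are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def diff_machines(
    previous: Dict[str, list], current: Dict[str, list]
) -> Tuple[List[str], List[str]]:
    """Return (lost, new) machine names between two snapshots."""
    # In the previous array but not the current one: contact was lost.
    lost = sorted(set(previous) - set(current))
    # In the current array but not the previous one: a new machine.
    new = sorted(set(current) - set(previous))
    return lost, new


def next_polling_interval(machine_count: int, preferred_interval: int) -> int:
    """Use whichever is greater: the machine count or the preference."""
    return max(machine_count, preferred_interval)
```

Taking the larger of the two values means the interval grows with the number of machines, so a large installation is never polled faster than one interval unit per machine.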

Cycle procedure

This procedure is responsible for selecting different machines in the machine display automatically. The cycle procedure uses the following logic:
  1. If cycling is turned on and there are machines in the machine display, set the current machine to the next machine in the array (modulo the length of the array).
  2. Update the information display with the new machine's information.
  3. Schedule the cycle procedure to be executed again one cycle interval (as specified by the preferences) in the future.
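
The wrap-around selection in step 1 amounts to a single modulo operation, shown here as a Python sketch. The function name and the choice to return None for an empty display are assumptions.

```python
from typing import Optional


def next_machine_index(current_index: int, machine_count: int) -> Optional[int]:
    """Advance to the next machine, wrapping to the first after the last."""
    if machine_count == 0:
        return None  # nothing to cycle through
    return (current_index + 1) % machine_count
```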

Miscellaneous Events/Procedures

'Show All Warnings' button

When the Show All Warnings button is activated, the current machines array is examined for conditions that require a warning (as specified in the preferences). Any relevant warnings are then printed, with the exception of rapidly changing disk and swap space (these warnings depend on rates, not instantaneous conditions).

'Clear Messages' button

When the Clear Messages button is activated, the message display is cleared, and a message indicating this is then printed.

Button-1 Click on a machine icon

This event is bound to a procedure which updates the currently selected machine in the machine display, and displays the appropriate statistics in the information display.

Button-1 Motion on a machine icon

This event causes the currently selected icon to be moved by the same amount as the mouse.

Button-1 Double Click on a machine icon

This event causes SysManager to look up the IP address of the machine referred to by the selected icon, and spawn an xterm with the command "telnet x" (with x replaced by the IP address).
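
For illustration, the spawned command could be built as below. This Python sketch is not the interface's Tcl code; the function names are hypothetical, and it assumes xterm and telnet are on the PATH and that xterm's -e option is used to run the command.

```python
import subprocess
from typing import List


def telnet_command(ip_address: str) -> List[str]:
    """Build the argument list for an xterm that runs 'telnet <ip>'."""
    return ["xterm", "-e", "telnet", ip_address]


def open_telnet(ip_address: str) -> subprocess.Popen:
    """Spawn the terminal without waiting for it to exit."""
    return subprocess.Popen(telnet_command(ip_address))
```

Passing the address as a separate list element, rather than building a shell string, avoids any shell interpretation of the looked-up value.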

Preferences item in the settings menu

This event causes a new window to be created with several scale widgets that manipulate various settings. These settings are described in depth in the Preferences Panel subsection of the SysManager User Interface section of this manual. The OK button at the bottom of the window causes the window and the widgets inside it to be destroyed.

Quit item in the file menu

This event causes all widgets to be destroyed, and exits wish.