cacct
cacct is a CLI client that can be used instead of Grafana when operators
cannot or do not wish to maintain a Grafana instance. This CLI client communicates
with both the CEEMS API server and the TSDB server to fetch energy, usage, and
performance metrics for a given compute unit, project, and/or user. It has been largely
inspired by SLURM's sacct tool, and its interface resembles that of sacct.
cacct identifies the current user from their Linux UID. Thus, for cacct to work
correctly, the user's UID must be the same on the machine where cacct is
executed and in the CEEMS API server database.
This tool has been specifically designed for HPC platforms where there is a common
login node that users can access via SSH. The tool must be installed on such login
nodes along with its configuration file. The cacct configuration file contains the
HTTP client configuration details needed to connect to the CEEMS API and TSDB servers.
Consequently, this configuration file might contain secrets for communicating with these
servers, making it crucial to protect this file on a multi-tenant system like HPC login
nodes. This will be discussed further in the following sections. First, let's examine
the available configuration sections for cacct:
# cacct configuration skeleton
ceems_api_server: <CEEMS API SERVER CONFIG>
tsdb: <TSDB CONFIG>
cacct always looks for its configuration file at /etc/ceems/config.yml or
/etc/ceems/config.yaml. Therefore, the configuration file must be installed in
one of these locations.
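As a quick sanity check after installation, an operator could confirm that one of the two expected files is present. The snippet below is only a sketch: the helper name find_cacct_config is hypothetical, and the order in which the two file names are tried is an assumption, since the lookup order used by cacct itself is not stated here.

```shell
# Hypothetical helper: print the first cacct configuration file found in a
# directory (defaults to /etc/ceems). The .yml-before-.yaml order is an
# assumption made for this sketch.
find_cacct_config() {
    dir="${1:-/etc/ceems}"
    for name in config.yml config.yaml; do
        if [ -f "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

find_cacct_config || echo "no cacct configuration file found in /etc/ceems" >&2
```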
A sample configuration file with only the CEEMS API Server configuration is presented below:
ceems_api_server:
  cluster_id: slurm-0
  user_header_name: X-Grafana-User
  web:
    url: http://ceems-api-server:9020
    basic_auth:
      username: ceems
      password: supersecretpassword
The above configuration assumes that the target cluster has slurm-0 as its cluster
ID, as configured in the CEEMS API server configuration.
By default, the CEEMS API server expects the username in the X-Grafana-User header,
so cacct sets the value of this header to the username making the request.
Finally, the web section contains the HTTP client configuration for the CEEMS API
server. In this example, the CEEMS API server is reachable at host ceems-api-server
on port 9020, and basic authentication is configured.
cacct can pull time series data from the TSDB server for the requested compute units.
This is possible only when the tsdb section is configured. A sample configuration file
including both CEEMS API server and TSDB server configurations is shown below:
ceems_api_server:
  cluster_id: slurm-0
  user_header_name: X-Grafana-User
  web:
    url: http://ceems-api-server:9020
    basic_auth:
      username: ceems
      password: supersecretpassword
tsdb:
  web:
    url: http://tsdb:9090
    basic_auth:
      username: prometheus
      password: anothersupersecretpassword
  queries:
    # CPU utilization
    cpu_usage: uuid:ceems_cpu_usage:ratio_irate{uuid=~"%s"}
    # CPU Memory utilization
    cpu_mem_usage: uuid:ceems_cpu_memory_usage:ratio{uuid=~"%s"}
    # Host power usage in Watts
    host_power_usage: uuid:ceems_host_power_watts:pue{uuid=~"%s"}
    # Host emissions in g/s
    host_emissions: uuid:ceems_host_emissions_g_s:pue{uuid=~"%s"}
    # GPU utilization
    avg_gpu_usage: uuid:ceems_gpu_usage:ratio{uuid=~"%s"}
    # GPU memory utilization
    avg_gpu_mem_usage: uuid:ceems_gpu_memory_usage:ratio{uuid=~"%s"}
    # GPU power usage in Watts
    gpu_power_usage: uuid:ceems_gpu_power_watts:pue{uuid=~"%s"}
    # GPU emissions in g/s
    gpu_emissions: uuid:ceems_gpu_emissions_g_s:pue{uuid=~"%s"}
    # Read IO bytes/s
    io_read_bytes: irate(ceems_ebpf_read_bytes_total{uuid=~"%s"}[1m])
    # Write IO bytes/s
    io_write_bytes: irate(ceems_ebpf_write_bytes_total{uuid=~"%s"}[1m])
Similar to the CEEMS API server configuration, this example assumes the TSDB server is
reachable at tsdb:9090 and basic authentication is configured on the HTTP server. The
tsdb.queries section is where operators configure the queries to pull time series data
for each metric. If operators used ceems_tool to generate
recording rules for the TSDB, the queries in the sample configuration above will work
out-of-the-box. The keys in the queries object can be chosen freely; they are provided
for configuration file maintainability. The placeholder %s will be replaced by the compute
unit UUIDs at runtime before executing the queries on the TSDB server.
There is no risk of injection here, as the UUID values provided by the end-user are first sanitized and then verified with the CEEMS API server to check if the user is the owner of the compute unit before passing them to the TSDB server.
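To make the placeholder substitution concrete, the sketch below renders a query template the way it might look after the UUIDs are filled in. It is an illustration only: render_query is a hypothetical helper, and joining several UUIDs with "|" (a regex alternation, matching the =~ matcher in the templates) is an assumption about how cacct combines them. The sketch performs no sanitization or ownership check; cacct does those steps itself as described above.

```shell
# Hypothetical illustration of how a query template could be rendered.
# Usage: render_query TEMPLATE UUID [UUID...]
render_query() {
    template="$1"
    shift
    # Join the remaining arguments with "|" (assumed separator) so they form
    # a regex alternation for the uuid=~"..." matcher.
    uuids=$(IFS='|'; printf '%s' "$*")
    # The template is used as the printf format so that %s is replaced.
    printf "$template\n" "$uuids"
}

render_query 'uuid:ceems_cpu_usage:ratio_irate{uuid=~"%s"}' 1234 5678
# prints: uuid:ceems_cpu_usage:ratio_irate{uuid=~"1234|5678"}
```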
A complete reference can be found in the Reference section. A valid sample configuration file can be found in the repository.
Securing configuration file
As evident from the previous section, the cacct configuration file contains secrets that
should not be accessible to end-users. At the same time, the cacct executable must be
accessible to end-users so they can fetch their usage statistics. This means cacct must
be able to read the configuration file at runtime, but the user executing it should not.
This can be achieved using the SETUID or SETGID bit.
When the SETUID or SETGID bit is set on an executable, the process runs with the privileges of the user or
group that owns the file. Thus, a SETUID cacct binary owned by ceems can read a config file owned by ceems.
Once the config file has been read, cacct drops these privileges and executes the rest of its code as the
user who invoked it. This way, the elevated privileges are held only for the minimal time needed to read
the config file. The SETUID bit can be set on cacct as follows:
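# Make the ceems user and group the owner of the cacct binary
chown ceems:ceems /usr/local/bin/cacct
# Set the SETUID bit so cacct runs with the privileges of the ceems user
chmod u+s /usr/local/bin/cacct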
# Ensure others can execute cacct
chmod o+x /usr/local/bin/cacct
# Use the same user/group as owner:group for the cacct configuration file
chown ceems:ceems /etc/ceems/config.yml
# Revoke all permissions for others
chmod o-rwx /etc/ceems/config.yml
Now, every time cacct is invoked, it runs with the privileges of the ceems user, reads
/etc/ceems/config.yml, and then drops privileges to those of the user who invoked the program.
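After applying the commands above, the resulting permissions can be double-checked. The following is a minimal sketch, assuming the SETUID setup shown in this section; check_cacct_perms is a hypothetical helper, not part of CEEMS.

```shell
# Hypothetical check: succeed only if the binary has the SETUID bit set and
# the configuration file grants no permissions to "others".
check_cacct_perms() {
    bin="$1"
    cfg="$2"
    # [ -u FILE ] is true when FILE exists and has its set-user-ID bit set.
    [ -u "$bin" ] || { echo "$bin: SETUID bit not set" >&2; return 1; }
    # Characters 8-10 of the ls -l mode string are the permissions for others.
    others=$(ls -ld "$cfg" | cut -c8-10)
    [ "$others" = "---" ] || { echo "$cfg: accessible to others ($others)" >&2; return 1; }
}

check_cacct_perms /usr/local/bin/cacct /etc/ceems/config.yml \
    || echo "cacct permissions need attention" >&2
```

As a further manual test, running cat /etc/ceems/config.yml as an ordinary user should fail with a permission error, while cacct itself should still work for that user.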
When cacct is installed using the RPM/DEB file provided by the
CEEMS Releases, cacct is already installed with
the required permission bit set. Operators only need to populate the configuration file at /etc/ceems/config.yml.