Changelog
0.11.0 / 2025-09-02
Breaking Changes
CEEMS Exporter
- Collector rapl is now disabled by default. To enable it, add --collector.rapl to the CLI arguments.
- Collector ipmi_dcmi has been renamed to ipmi, as more functionality beyond DCMI has been added to the collector.
- The following metrics have been renamed to be more consistent with the Prometheus naming convention:
  - ceems_ipmi_dcmi_current_watts -> ceems_ipmi_dcmi_power_current_watts
  - ceems_ipmi_dcmi_min_watts -> ceems_ipmi_dcmi_power_min_watts
  - ceems_ipmi_dcmi_max_watts -> ceems_ipmi_dcmi_power_max_watts
  - ceems_ipmi_dcmi_avg_watts -> ceems_ipmi_dcmi_power_avg_watts
  - ceems_redfish_current_watts -> ceems_redfish_power_current_watts
  - ceems_redfish_min_watts -> ceems_redfish_power_min_watts
  - ceems_redfish_max_watts -> ceems_redfish_power_max_watts
  - ceems_redfish_avg_watts -> ceems_redfish_power_avg_watts
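Existing PromQL queries, alert rules and dashboards that reference the old power metric names need updating. The following is a minimal sketch of one way to do that, not part of CEEMS itself: the helper name migrate_query is hypothetical, and only the old and new metric names come from the list above.

```python
import re

# Old -> new metric names, copied from the rename list above.
RENAMED_METRICS = {
    "ceems_ipmi_dcmi_current_watts": "ceems_ipmi_dcmi_power_current_watts",
    "ceems_ipmi_dcmi_min_watts": "ceems_ipmi_dcmi_power_min_watts",
    "ceems_ipmi_dcmi_max_watts": "ceems_ipmi_dcmi_power_max_watts",
    "ceems_ipmi_dcmi_avg_watts": "ceems_ipmi_dcmi_power_avg_watts",
    "ceems_redfish_current_watts": "ceems_redfish_power_current_watts",
    "ceems_redfish_min_watts": "ceems_redfish_power_min_watts",
    "ceems_redfish_max_watts": "ceems_redfish_power_max_watts",
    "ceems_redfish_avg_watts": "ceems_redfish_power_avg_watts",
}

# Match whole metric names only, so a longer identifier that merely
# contains an old name is left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMED_METRICS)) + r")\b")


def migrate_query(query: str) -> str:
    """Return the query with every old metric name replaced by its new name."""
    return _PATTERN.sub(lambda m: RENAMED_METRICS[m.group(1)], query)


# Hypothetical query used only for illustration.
print(migrate_query("avg_over_time(ceems_redfish_avg_watts[5m])"))
```

Because none of the old names is a substring of a new name, running the helper twice on the same text leaves already migrated queries unchanged.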
CEEMS tool
- The relabel configs generated by the subcommand create-relabel-configs are obsolete, as the relabelling of metrics is now handled directly inside the recording rules. Please regenerate the recording rules with the new version and remove the existing relabel configs on the Prometheus server.
- Several minor bugs in the recording rules have been fixed. Please regenerate the recording rules with the new version of ceems_tool.
- GPU profiling metrics have been renamed to have prof in the metric name. For instance, uuid:ceems_gpu_sm_active:ratio became uuid:ceems_gpu_prof_sm_active:ratio.
- The suffix of NVIDIA profiling metrics has been corrected to use sum instead of ratio for NVLink and PCIe traffic metrics. Thus, the metrics have been renamed as follows:
  - uuid:ceems_gpu_pcie_tx_bytes:ratio -> uuid:ceems_gpu_prof_pcie_tx_bytes:sum
  - uuid:ceems_gpu_pcie_rx_bytes:ratio -> uuid:ceems_gpu_prof_pcie_rx_bytes:sum
  - uuid:ceems_gpu_nvlink_tx_bytes:ratio -> uuid:ceems_gpu_prof_nvlink_tx_bytes:sum
  - uuid:ceems_gpu_nvlink_rx_bytes:ratio -> uuid:ceems_gpu_prof_nvlink_rx_bytes:sum
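After regenerating the recording rules, dashboards or alerts may still reference the old GPU profiling rule names. The following is a rough sketch of a checker that reports such leftovers, assuming the text to scan is available as a string. The function name find_stale_names and the sample text are hypothetical; the table contains only the renames explicitly named above, so other profiling metrics would need to be added by hand.

```python
import re

# Old -> new recording rule names, copied from the entries above.
RENAMED_GPU_RULES = {
    "uuid:ceems_gpu_sm_active:ratio": "uuid:ceems_gpu_prof_sm_active:ratio",
    "uuid:ceems_gpu_pcie_tx_bytes:ratio": "uuid:ceems_gpu_prof_pcie_tx_bytes:sum",
    "uuid:ceems_gpu_pcie_rx_bytes:ratio": "uuid:ceems_gpu_prof_pcie_rx_bytes:sum",
    "uuid:ceems_gpu_nvlink_tx_bytes:ratio": "uuid:ceems_gpu_prof_nvlink_tx_bytes:sum",
    "uuid:ceems_gpu_nvlink_rx_bytes:ratio": "uuid:ceems_gpu_prof_nvlink_rx_bytes:sum",
}

# Metric names may contain letters, digits, underscores and colons, so a name
# only counts as a match when it is not embedded in a longer identifier.
_OLD_NAME = re.compile(
    r"(?<![\w:])(" + "|".join(map(re.escape, RENAMED_GPU_RULES)) + r")(?![\w:])"
)


def find_stale_names(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, old name, new name) for each obsolete name in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _OLD_NAME.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, RENAMED_GPU_RULES[old]))
    return hits


# Hypothetical dashboard queries used only for illustration.
sample = "sum(uuid:ceems_gpu_sm_active:ratio)\nuuid:ceems_gpu_prof_pcie_tx_bytes:sum"
for lineno, old, new in find_stale_names(sample):
    print(f"line {lineno}: {old} -> {new}")
```

Reporting rather than rewriting keeps the change reviewable, which matters here because the NVLink and PCIe metrics changed suffix from ratio to sum and any expression built on them may need a second look.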
List of PRs
- [FEAT] Add rules for IO and network metrics #406 (@mahendrapaipuri)
- [FEAT] Support runtime XML directory for libvirt collector #404 (@mahendrapaipuri)
- [MAINT] Bump golanglint-ci to 2.4 #399 (@mahendrapaipuri)
- [BREAKING] Updates and fixes to recording rules subcommand of ceems_tool #397 (@mahendrapaipuri)
- [BREAKING] Support exporting metrics of IPMI sensors #395 (@mahendrapaipuri)
- [MAINT] Bump dependencies #394, #398, #400, #405, #407, #408 (@dependabot)
0.10.2 / 2025-08-07​
- [BUGFIX] Fix bpf code to work with LLVM 20 #393 (@mahendrapaipuri)
- [BUGFIX] Fix k8s resource manager #392 (@mahendrapaipuri)
- [MAINT] Bump dependencies #389, #390, #387 (@dependabot)
0.10.1 / 2025-07-22​
- [BUGFIX] Fix parsing nvidia-smi XML output #388 (@mahendrapaipuri)
- [MAINT] Bump dependencies #387 (@dependabot)
0.10.0 / 2025-07-20​
- [CI] Free up disk space for crossbuild jobs #386 (@mahendrapaipuri)
- [DOCS] Add CONTRIBUTING.md file #385 (@mahendrapaipuri)
- [FEAT] Migrate repo to ceems-dev org #384 (@mahendrapaipuri)
- [FEAT] Filter SLURM cgroups to remove stale ones #382 (@mahendrapaipuri)
- [FEAT] K8s support for CEEMS API server #381 (@mahendrapaipuri)
- [FEAT] Add systemd-less mode for Libvirt collector #377 (@wtripp180901)
- [MAINT] Bump dependencies #375, #376, #378, #383 (@dependabot)
0.9.1 / 2025-07-02​
- [FEAT] Support gzip compression #374 (@mahendrapaipuri)
- [MAINT] Bump dependencies #372, #373 (@dependabot)
0.9.0 / 2025-06-27
Breaking Changes
CEEMS LB
- Undocumented Resource-based LB strategy has been removed. Deployments using this strategy must use Prometheus' remote read feature to achieve the same functionality.
CEEMS Exporter
- The configuration of the Redfish collector must be under the section redfish_collector instead of redfish_web. More details in docs.
- CLI flag --collector.redfish.web-config has been deprecated in favour of --collector.redfish.config.file.
- CLI flag --collector.k8s.kube-config-file has been deprecated in favour of --collector.k8s.kubeconfig.file.
- CLI flag --collector.k8s.kubelet-socket-file has been deprecated in favour of --collector.k8s.kubelet-podresources-socket.file.
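Service files and launch scripts that still pass the deprecated exporter flags can be updated mechanically. The following is a small sketch of such a rewrite, assuming the arguments are available as a list of strings. The helper name migrate_args and the file paths are hypothetical; only the flag names are taken from the entries above.

```python
# Deprecated flag -> replacement flag, copied from the entries above.
DEPRECATED_FLAGS = {
    "--collector.redfish.web-config": "--collector.redfish.config.file",
    "--collector.k8s.kube-config-file": "--collector.k8s.kubeconfig.file",
    "--collector.k8s.kubelet-socket-file": "--collector.k8s.kubelet-podresources-socket.file",
}


def migrate_args(args: list[str]) -> list[str]:
    """Return a copy of args with deprecated flag names replaced.

    Handles both the "--flag=value" and the "--flag value" forms; values
    and unknown flags are passed through unchanged.
    """
    migrated = []
    for arg in args:
        # Split off an inline value, if any, so only the flag name is looked up.
        name, sep, value = arg.partition("=")
        migrated.append(DEPRECATED_FLAGS.get(name, name) + sep + value)
    return migrated


# Hypothetical arguments used only for illustration.
old_args = [
    "--collector.redfish.web-config=/etc/ceems/redfish.yml",
    "--collector.k8s.kube-config-file",
    "/etc/ceems/kubeconfig",
]
print(migrate_args(old_args))
```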
Redfish Proxy
- The configuration of the Redfish proxy must be under redfish_proxy instead of redfish_proxy.web. More details in docs.
List of PRs
- [FEAT] Support env vars in config files #369 (@mahendrapaipuri)
- [FEAT] Add k8s admission controller #367 (@mahendrapaipuri)
- [MAINT] refactor: Rename config section names to be consistent across package #364 (@mahendrapaipuri)
- [BREAKING] breaking: Remove resource-based LB strategy #361 (@mahendrapaipuri)
- [FEAT] Native eBPF profiler #360 (@mahendrapaipuri)
- [MAINT] Bump dependencies #359, #362, #365, #366, #368, #371 (@dependabot)
0.8.0 / 2025-05-20​
- [FEAT] Harden redfish proxy app #357 (@mahendrapaipuri)
- [MAINT] Several maintenance changes #354 (@mahendrapaipuri)
- [FEAT] Add k8s collector in the exporter #349 (@mahendrapaipuri)
- [MAINT] Bump dependencies #345, #346, #347, #348, #351, #353, #355, #356, #358 (@dependabot)
0.7.2 / 2025-04-19​
- [FEAT] Make redfish timeout a configurable value #342 (@mahendrapaipuri)
- [DOCS] docs: fix typos and improve consistency #339 (@ncreddine)
- [MAINT] Better usage of bpf LRU hash maps #335 (@mahendrapaipuri)
- [MAINT] Bump dependencies #331, #332, #334, #336, #337, #338, #340, #343, #344 (@dependabot)
0.7.1 / 2025-03-25​
- [MAINT] Minor improvements in power usage collectors #330 (@mahendrapaipuri)
- [DOCS] Update docusaurus.config.ts #329 (@ncreddine)
- [MAINT] Bump dependencies #328, #331 (@dependabot)
0.7.0 / 2025-03-16​
- [FEAT] Add Watttime emission factor #327 (@mahendrapaipuri)
- [FEAT] cacct client tool #321 (@mahendrapaipuri)
- [FEAT] feat: Add netdev and IB collectors #310 (@mahendrapaipuri)
- [FEAT] Add hwmon collector #309 (@mahendrapaipuri)
- [MAINT] Bump dependencies #320, #322, #323, #324, #325, #326 (@dependabot)
0.6.0 / 2025-02-24​
- [FEAT] Enhancements for CEEMS API server #304 (@mahendrapaipuri)
- [FEAT] Support label filtering in CEEMS LB responses #303 (@mahendrapaipuri)
- [DOCS] Add CLI section in docs #296 (@mahendrapaipuri)
- [DOCS] Deployment guide and minor improvements #294 (@mahendrapaipuri)
- [FEAT] Support SLURM multiple daemons #289 (@mahendrapaipuri)
- [FEAT] Ceems Tooling support #288 (@mahendrapaipuri)
- [MAINT] Bump dependencies #283, #285, #286, #287, #290, #291, #292, #300, #301, #305, #306, #307, #308 (@dependabot)
0.5.3 / 2025-01-24​
- [BUGFIX] Minor corrections in SLURM fetcher and TSDB updater #280 (@mahendrapaipuri)
- [MAINT] Set MIG instance in a separate label, when present #279 (@mahendrapaipuri)
- [MAINT] More configurability on tsdb updater's query batching #277 (@mahendrapaipuri)
- [BUGFIX] Handle running query parameter correctly #271 (@mahendrapaipuri)
- [BUGFIX] TSDB retention period estimation #270 (@mahendrapaipuri)
- [MAINT] Bump dependencies #273, #274, #276, #278 (@dependabot)
0.5.2 / 2025-01-17​
- [BUGFIX] Re-establish session when token invalidates for Redfish collector #268 (@mahendrapaipuri)
- [FEATURE] TSDB estimate batch size dynamically and update OWID data #262 (@mahendrapaipuri)
- [MAINT] Bump dependencies #264, #265, #269 (@dependabot)
0.5.1 / 2025-01-08​
- [FEATURE] Add Cray's pm_counters collector #261 (@mahendrapaipuri)
- [BUGFIX] Use total swap as limit when cgroup sets it as max #260 (@mahendrapaipuri)
- [FEATURE] Configurable Timezone for CEEMS DB #253 (@mahendrapaipuri)
- [FEATURE] Support for Pyroscope servers for CEEMS LB #252 (@mahendrapaipuri)
- [MAINT] Bump dependencies #246, #247, #248, #249, #250, #251, #254, #255, #256 (@dependabot)
0.5.0 / 2024-12-12​
- [BUGFIX] Support IPMI package on 32/64 bit platforms #245 (@mahendrapaipuri)
- [MAINT] Upgrade Go to 1.23.x #244 (@mahendrapaipuri)
- [MAINT] Update dockerfile to include redfish_proxy #243 (@mahendrapaipuri)
- [FEATURE] Add Redfish Collector #240 (@mahendrapaipuri)
- [FEATURE] Pure go IPMI implementation using OpenIPMI interface #238 (@mahendrapaipuri)
- [DOCS] Embed demo Grafana in iframe in documentation welcome page #233 (@mahendrapaipuri)
- [FEATURE] Report usage statistics by taking running units into account #232 (@mahendrapaipuri)
- [FEATURE] Support automatic token rotation for Openstack #227 (@mahendrapaipuri)
- [BUGFIX] Prioritize SLURM_JOB_GPUS env for GPU mapping #221 (@mahendrapaipuri)
- [FEATURE] Migrate to slog logging #211 (@mahendrapaipuri)
- [FEATURE] Implement correct scaling of perf hardware counters #210 (@mahendrapaipuri)
- [MAINT] Bump dependencies #212, #213, #215, #222, #225, #226, #228, #229, #236 (@dependabot), #237, #241 (@dependabot), #242 (@dependabot)
0.5.0-rc.2 / 2024-10-31​
- [BUGFIX] Scale perf counters based on times enabled and ran #209 (@mahendrapaipuri)
0.5.0-rc.1 / 2024-10-29​
- [MAINT] Major refactor to improve performance of exporter #204 (@mahendrapaipuri)
- [MAINT] Bump dependencies #205, #206, #207 (@dependabot)
0.4.1 / 2024-10-25​
- [FEATURE] Use custom header to find target cluster #203 (@mahendrapaipuri)
0.4.0 / 2024-10-23​
- [FEATURE] Add support for HTTP alloy discovery #198 (@mahendrapaipuri)
- [FEATURE] Add openstack resource manager support to API server #196 (@mahendrapaipuri)
- [FEATURE] Add support for MIG and vGPUs in exporter #193 (@mahendrapaipuri)
- [FEATURE] Export power limit from RAPL counters #189 (@mahendrapaipuri)
- [FEATURE] Add libvirt collector #186 (@mahendrapaipuri)
- [FEATURE] Add RDMA collector #182 (@mahendrapaipuri)
- [BUGFIX] Fix cmd execution mode detection #181 (@mahendrapaipuri)
- [BUGFIX] Hide test related CLI flags #180 (@mahendrapaipuri)
- [FEATURE] Add ebpf support for mips,ppc and risc archs #179 (@mahendrapaipuri)
- [MAINT] Bump dependencies #183, #184, #185, #192, #194, #199, #200, #201, #202 (@dependabot)
0.3.1 / 2024-10-03​
- [BUGFIX] Fix cmd execution mode detection #181 (@mahendrapaipuri)
- [BUGFIX] Hide test related CLI flags #180 (@mahendrapaipuri)
- [FEAT] Add ebpf support for mips,ppc and risc archs #179 (@mahendrapaipuri)
0.3.0 / 2024-09-28​
- [CI] Move docs workflow to separate file #178 (@mahendrapaipuri)
- [BUGFIX] Verify TSDB actual retention period #177 (@mahendrapaipuri)
- [FEATURE] Make CEEMS apps capability aware #176 (@mahendrapaipuri)
- [MAINT] Remove unnecessary log lines #167 (@mahendrapaipuri)
- [MAINT] Refactor slurm collector organization #155 (@mahendrapaipuri)
- [MAINT] Graceful exporter shutdown and misc fixes #153 (@mahendrapaipuri)
- [FEATURE] Use consistent CLI flags for exporter #144 (@mahendrapaipuri)
- [FEATURE] Add perf collector that exports perf metrics #137 (@mahendrapaipuri)
- [MAINT] Bump dependencies #138, #139, #140, #141, #142, #143, #145, #146, #147, #148, #149, #150, #151, #152, #154, #157, #158, #159, #160, #161, #162, #163, #164, #168, #169, #171, #172, #173, #174, #175 (@dependabot)
0.2.1 / 2024-08-17​
- [BUGFIX] Fix setting sysprocattr correctly based on command #136 (@mahendrapaipuri)
0.2.0 / 2024-08-11​
- [FEATURE] Pass context to downstream functions #133 (@mahendrapaipuri)
- [MAINT] Enable more linters #132 (@mahendrapaipuri)
- [MAINT] General maintenance #129 (@mahendrapaipuri)
- [FEATURE] Use native JSON functions in aggregate query #128 (@mahendrapaipuri)
- [FEATURE] Stats API endpoint #127 (@mahendrapaipuri)
- [FEATURE] Cache current usage query result #122 (@mahendrapaipuri)
0.1.1 / 2024-07-24​
- [MAINT] DB query performance improvements #113 (@mahendrapaipuri)
- [BUGFIX] Fix metric aggregation #112 (@mahendrapaipuri)
- [FEATURE] Incremental improvements on API server #111 (@mahendrapaipuri)
- [BUGFIX] Dont cache failed requests for emissions #110 (@mahendrapaipuri)
- [MAINT] Upgrade to Go 1.22.x #109 (@mahendrapaipuri)
- [TEST] Migrate to testify for unit tests #108 (@mahendrapaipuri)
0.1.0 / 2024-07-06​
- [BUGFIX] Build swag using native arch in cross build #107 (@mahendrapaipuri)
- [CI] Avoid building test bins for release workflows #106 (@mahendrapaipuri)
- [BUGFIX] Fix tsdb updater #104 (@mahendrapaipuri)
- [DOCS] Store metrics as map in DB #102 (@mahendrapaipuri)
- [DOCS] Improve docs on Slurm collector #101 (@mahendrapaipuri)
- [CI] Test DEB packages in CI #100 (@mahendrapaipuri)
- [CI] Extract go code for CodeQL analysis #99 (@mahendrapaipuri)
- [FEATURE] Enforce rules on cluster and updater IDs #98 (@mahendrapaipuri)
- [DOCS] Update Docs #97 (@mahendrapaipuri)
- [CI] Add CodeQL workflow #96 (@mahendrapaipuri)
- [FEATURE] Add user and project tables to DB #95 (@mahendrapaipuri)
- [FEATURE] Multicluster support #94 (@mahendrapaipuri)
- [MAINT] General maintenance and enhancements #92 (@mahendrapaipuri)
- [DOCS] Add swagger docs #90 (@mahendrapaipuri)
- [DOCS] Setup docs website #88 (@mahendrapaipuri)
- [DOCS] Publish README to registries #87 (@mahendrapaipuri)
- [FEATURE] Use weighted mean for agg stats #86 (@mahendrapaipuri)
- [CI] Make and publish container images #85 (@mahendrapaipuri)
- [FEATURE] Add demo end points #84 (@mahendrapaipuri)
- [FEATURE] Support DB and API modes for access control #83 (@mahendrapaipuri)
- [FEATURE] Enhancement api server #78 (@mahendrapaipuri)
- [FEATURE] Add cpu_per_core_count metric to CPU collector #76 (@mahendrapaipuri)
- [FEATURE] Add last_updated_at col in usage table #75 (@mahendrapaipuri)
- [REFACTOR] Use auth middleware for LB #74 (@mahendrapaipuri)
- [FEATURE] Add recording rules for Prometheus #67 (@mahendrapaipuri)
- [BUGFIX] Ensure non-negative values in agg metrics #66 (@mahendrapaipuri)
0.1.0-rc.6 / 2024-04-04​
- [REFACTOR] Use generic name in metric names #65 (@mahendrapaipuri)
- [FEATURE] Use custom float64 type #62 (@mahendrapaipuri)
- [FEATURE] Configurable TSDB updater queries and DB migrations #64 (@mahendrapaipuri)
- [TEST] Add unit tests #61 (@mahendrapaipuri)
- [CI] Fix go coverage badge in README #60 (@mahendrapaipuri)
- [CI] Add coverage badge to README #59 (@mahendrapaipuri)
- [FEATURE] Debian and RPM packaging #58 (@mahendrapaipuri)
- [FEATURE] Add a default resource manager #57 (@mahendrapaipuri)
- [FEATURE] Auto detect IPMI command and add support for capmc #56 (@mahendrapaipuri)
- [FEATURE] chore: Several enhancements for CEEMS LB #54 (@mahendrapaipuri)
- [FEATURE] Incremental metrics aggregation #53 (@mahendrapaipuri)
- [MAINT] Backend Auth for CEEMS LB #52 (@mahendrapaipuri)
0.1.0-rc.5 / 2024-03-02​
- [FEATURE] feat: Support RDMA stats in exporter #45 (@mahendrapaipuri)
- [MAINT] Rename stats pkg to api #44 (@mahendrapaipuri)
- [FEATURE] TSDB Load Balancer #43 (@mahendrapaipuri)
- [FEATURE] DB migrations support #42 (@mahendrapaipuri)
- [MAINT] Refactor DB schema #41 (@mahendrapaipuri)
0.1.0-rc.4 / 2024-02-18​
- [BUGFIX] Misc bugfixes #40 (@mahendrapaipuri)
- [FEATURE] Support different IPMI implementations #39 (@mahendrapaipuri)
- [REFACTOR] Rename pkg to ceems #38 (@mahendrapaipuri)
- [FEATURE] Cache job props for SLURM collector #37 (@mahendrapaipuri)
- [FEATURE] Extend DB schema to add new fields #36 (@mahendrapaipuri)
- [FEATURE] Backup DB at configured interval #35 (@mahendrapaipuri)
0.1.0-rc.3 / 2024-01-22​
- [REFACTOR] refactor: Remove support for job steps #34 (@mahendrapaipuri)
- [FEATURE] Fetch admin users from grafana #33 (@mahendrapaipuri)
- [REFACTOR] Rename pkg #32 (@mahendrapaipuri)
- [FEATURE] Enhancements in collector #31 (@mahendrapaipuri)
- [BUGFIX] Fix tsdb cleanup #30 (@mahendrapaipuri)
- [REFACTOR] Split node metrics into separate collectors #29 (@mahendrapaipuri)
- [FEATURE] Add total procs cputime metric #28 (@mahendrapaipuri)
- [FEATURE] Add support for TSDB vacuuming #27 (@mahendrapaipuri)
- [FEATURE] Use a separate time series for each job for mapping GPU #26 (@mahendrapaipuri)
- [FEATURE] Use query builder #25 (@mahendrapaipuri)
- [FEATURE] Job stats server enhancements #24 (@mahendrapaipuri)
- [REFACTOR] Use cgroups v2 pkg #23 (@mahendrapaipuri)
- [REFACTOR] Rename emissions factory from source to provider #22 (@mahendrapaipuri)
- [FEATURE] Export min and max power readings from ipmi #21 (@mahendrapaipuri)
- [FEATURE] Add hostname label to exporter metrics #20 (@mahendrapaipuri)
- [BUGFIX] Correct env var name for getting gpu index #19 (@mahendrapaipuri)
0.1.0-rc.2 / 2023-12-26​
- [REFACTOR] Refactor jobstats pkg #18 (@mahendrapaipuri)
- [REFACTOR] Use default http client for requests for emissions collector #16 (@mahendrapaipuri)
- [REFACTOR] Refactor emissions pkg #16 (@mahendrapaipuri)
- [BUGFIX] bugfix: Correctly parse SLURM nodelist range string #15 (@mahendrapaipuri)
0.1.0-rc.1 / 2023-12-20​
- [FEATURE] Bug fixes and refactoring #14 (@mahendrapaipuri)
- [FEATURE] Misc improvements #13 (@mahendrapaipuri)
- [FEATURE] Merge job stats DB and server commands #12 (@mahendrapaipuri)
- [FEATURE] Support GPU jobID map from /proc #11 (@mahendrapaipuri)
- [FEATURE] Add Runtime pkg #10 (@mahendrapaipuri)
- [FEATURE] Misc features #9 (@mahendrapaipuri)
- [FEATURE] Add API server to serve job stats #8 (@mahendrapaipuri)
- [FEATURE] Add jobstats pkg #7 (@mahendrapaipuri)
- [FEATURE] Use pkg structure #6 (@mahendrapaipuri)
- [FEATURE] Use UID and GID to job labels #5 (@mahendrapaipuri)
- [FEATURE] Reorganise repo #4 (@mahendrapaipuri)
- [FEATURE] Add unique jobid label for SLURM jobs #3 (@mahendrapaipuri)
- [FEATURE] Add Emission collector #2 (@mahendrapaipuri)
- [FEATURE] CircleCI setup #1 (@mahendrapaipuri)