One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency.
Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs.
While it is easy to utilize statistics to rank jobs based on the utilization of compute, storage, and network, it is tricky to find patterns among hundreds of thousands of jobs, e.g., whether there is a class of jobs that are not performing well.
When support staff investigate a single job, it is relevant to identify related jobs in order to better understand how common the exhibited behavior is and to assess the optimization potential.
In this paper, we describe a methodology to rank the similarity of all jobs to a reference job based on their temporal IO behavior.
Practically, we apply several previously developed time-series-based algorithms and also utilize the Kolmogorov-Smirnov test to compare the distributions of the statistics.
We conduct a study to explore the effectiveness of the approach, which starts from three reference jobs and investigates related jobs.
Firstly, they provide a service that enables users to execute their applications conveniently.
Secondly, they aim to improve the efficiency of all workflows -- represented as batch jobs -- in order to allow the data center to serve more workloads.
In order to optimize a single job, its behavior and resource utilization must be monitored and then assessed.
Users rarely liaise with staff to explicitly request a performance analysis and optimization.
Therefore, data centers deploy monitoring systems, and staff must proactively identify candidates for optimization.
Monitoring tools such as Grafana~\cite{Grafana} and XDMoD~\cite{XDMod} provide various statistics and time series data for the job execution.
The support staff should focus on workloads for which optimization is beneficial. For instance, analyzing a job that is executed once on a medium number of nodes costs human resources and is not a good return on investment.
By ranking jobs based on the statistics, it is not difficult to find a job that exhibits extensive usage of compute, network, and IO resources.
However, would it be beneficial to investigate this workload in detail and potentially optimize it?
A pattern that can be observed in many jobs bears potential, as the blueprint for optimizing one job may be applied to other jobs as well.
This is particularly true when running one application with similar inputs, but different applications may also lead to similar behavior.
Therefore, it is useful for support staff who investigate a resource-hungry job to identify similar jobs that are executed on the supercomputer.
In our previous paper \cite{XXX}, we developed several distance metrics and algorithms for the clustering of jobs based on the time series of their IO behavior.
The distance metrics can be applied to jobs with different runtimes and numbers of nodes utilized but differ in the way they define similarity.
We showed that the metrics can be used to cluster jobs; however, it remained unclear whether the method can be used by data center staff to effectively explore jobs similar to a reference job.
In \Cref{sec:evaluation}, we perform a study by applying the methodology to three jobs with different behavior, thereby assessing the effectiveness of the approach to identify similar jobs.
The purpose of the methodology is to allow users and support staff to explore all executed jobs on a supercomputer in the order of their similarity to the reference job.
Therefore, we first define the job data, then describe the algorithms used to compute the similarity, and, finally, describe the methodology to investigate jobs.
On the Mistral supercomputer at DKRZ, the monitoring system gathers nine IO metrics for the two Lustre file systems on all nodes in 10\,s intervals, together with general job metadata from the SLURM workload manager.
The distance metrics should handle jobs of different length and node count.
In \cite{TODOPaper}, we discussed in detail a variety of options for comparing time series data, from 1D job profiles to data reductions, as well as the general workflow and pre-processing.
In a nutshell, we partition each job executed on Mistral into 10-minute segments, compute the arithmetic mean of each metric per segment, and categorize the value as non-IO (0), HighIO (1), or CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above, respectively.
After the data is reduced across nodes, we quantize the timelines using either a binary or a hexadecimal representation, which is then ready for similarity analysis.
By pre-filtering jobs with no I/O activity, i.e., jobs whose sum across all dimensions and time series is equal to zero, we reduce the dataset from about 1 million jobs to about 580k jobs.
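The segmentation, categorization, and pre-filtering steps can be illustrated with a minimal sketch. It is not the implementation used on Mistral: the function names, the assumption that the percentile thresholds \texttt{q99} and \texttt{q999} are already known for a metric, and the treatment of values exactly on a threshold are all choices made for this example.

```python
from statistics import mean


def segment_means(samples, samples_per_segment=60):
    """Average consecutive samples; 60 samples at 10 s form one 10-minute segment."""
    return [
        mean(samples[i:i + samples_per_segment])
        for i in range(0, len(samples), samples_per_segment)
    ]


def categorize(value, q99, q999):
    """Map a segment mean to 0 (non-IO), 1 (HighIO), or 4 (CriticalIO).

    Assumption: values exactly at q99 or q999 count as HighIO.
    """
    if value < q99:
        return 0
    if value <= q999:
        return 1
    return 4


def code_timeline(samples, q99, q999, samples_per_segment=60):
    """Turn the raw samples of one metric into a coded segment timeline."""
    return [
        categorize(v, q99, q999)
        for v in segment_means(samples, samples_per_segment)
    ]


def has_io(coded_timelines):
    """Pre-filter: a job is kept only if any coded value is non-zero."""
    return any(sum(timeline) > 0 for timeline in coded_timelines)


# Three 10-minute segments with low, elevated, and extreme activity.
raw = [0] * 60 + [10] * 60 + [100] * 60
print(code_timeline(raw, q99=5, q999=50))
```

With the hypothetical thresholds above, the three segments fall into the categories 0, 1, and 4; a job whose timelines contain only zeros would be dropped by \texttt{has\_io}.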
\subsection{Algorithms for Computing Similarity}
In this paper, we reuse the algorithms developed in \cite{TODO}: bin\_all, bin\_aggzeros, hex\_native, hex\_lev, and hex\_quant.
They differ in the way data similarity is defined: either the binary or the hexadecimal coding is used, and the distance metric is mostly the Euclidean distance or the Levenshtein distance.
For jobs of different lengths, we apply a sliding-window approach that finds the location of the shorter job in the longer job with the highest similarity.
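The sliding-window idea can be sketched as follows. The per-window score used here (the fraction of matching segments) is only a stand-in chosen for brevity; the algorithms in this paper use the Euclidean or Levenshtein distance instead, and the function names are hypothetical.

```python
def window_similarity(a, b):
    """Stand-in score: fraction of positions where two equally long timelines agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def sliding_similarity(job_a, job_b):
    """Slide the shorter timeline over the longer one and keep the best score.

    Returns (best similarity, offset of the best window in the longer job).
    """
    shorter, longer = sorted((job_a, job_b), key=len)
    if not shorter:
        raise ValueError("timelines must contain at least one segment")
    best, best_offset = 0.0, 0
    for offset in range(len(longer) - len(shorter) + 1):
        window = longer[offset:offset + len(shorter)]
        score = window_similarity(shorter, window)
        if score > best:
            best, best_offset = score, offset
    return best, best_offset


print(sliding_similarity([1, 4], [0, 0, 1, 4, 0]))
```

In this example, the short job matches the long job exactly when it is placed at the third segment, so the best window is found at offset 2.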
In this paper, we add a new similarity definition based on the Kolmogorov-Smirnov test, which compares the probability distributions of the observed values; we describe it in the following.
The reduction of the file system dimension by the mean function ensures that the time series values stay in the range between 0 and 4, independently of how many file systems are present on an HPC system.
The fixed interval of 10 minutes also ensures the portability of the approach to other HPC systems.
Unlike the previous similarity definitions, the concatenation of time series on the node dimension preserves the individual I/O information of all nodes while it allows comparison of jobs with different number of nodes.
The similarity function in \Cref{eq:ks_similarity} calculates the mean inverse of the reject probability $p_{\text{reject}}$ computed with the KS test across all metrics $m$.
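A possible reading of this definition is sketched below, under two assumptions that are not confirmed by the text: the ``inverse'' is taken as $1 - p_{\text{reject}}$, and the two-sample KS statistic (the largest gap between the two empirical distribution functions) is used as a stand-in for $p_{\text{reject}}$. The function names and the dictionary layout (metric name to concatenated values) are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    return max(
        abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
        for p in points
    )


def ks_similarity(job_a, job_b):
    """Mean of (1 - reject score) over the metrics present in both jobs.

    Assumption: the KS statistic stands in for p_reject.
    """
    metrics = sorted(set(job_a) & set(job_b))
    scores = [1.0 - ks_statistic(job_a[m], job_b[m]) for m in metrics]
    return sum(scores) / len(scores)


job_a = {"read_bytes": [0, 0, 1], "write_bytes": [0, 0]}
job_b = {"read_bytes": [0, 0, 1], "write_bytes": [4, 4]}
print(ks_similarity(job_a, job_b))
```

Because only the distribution of values is compared, the two jobs do not need the same number of segments or nodes; in the example, one metric is distributed identically and the other not at all, which yields a similarity of 0.5.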
\eb{The information on why support staff should search for similar jobs is still missing here. As I understand it, if a job causes problems, then similar jobs may cause similar problems as well.}
The user can decide on the criterion for when to stop inspecting jobs: based on the similarity, the number of investigated jobs, or the distribution of the job similarity.
For the latter, it is interesting to investigate clusters of similar jobs, e.g., if there are many jobs between 80-90\% similarity but few between 70-80\%.
For each reference job and algorithm, we created a CSV file with the computed similarity to all other jobs.
Next, we analyzed the performance of the algorithms, then the quantitative behavior and the correlation between chosen similarity and number of found jobs, and, finally, the quality of the 100 most similar jobs.
\item Job-S: performs post-processing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so-called CMORization). The post-processing is IO intensive.
The segmented timelines of the jobs are visualized in \Cref{fig:refJobs}.
This coding is also used for the HEX class of algorithms (BIN algorithms merge all timelines together as described in \jk{TODO}).
The figures show the values of active metrics ($\neq0$) only; if few are active then they are shown in one timeline, otherwise they are rendered individually to provide a better overview.
For example, we can see in \Cref{fig:job-S} that several metrics increase in Segment\,6.
The metrics of Job-L are not shown as they have only a handful of instances where the value is not 0, except for write\_bytes: the first process is writing out at a low rate.
Interestingly, the aggregated pattern of Job-L in \Cref{fig:job-L} sums up to some activity at the first segment for three other metrics.
The runtime is normalized to 100k jobs, i.e., bin\_all takes about 41\,s to process 100k of the 500k jobs in total that this algorithm processes.
In the quantitative analysis, we explore for the different algorithms how the similarity of our pool of jobs to our three reference jobs (Job-S, Job-M, and Job-L) is distributed.
The cumulative distribution of similarity to the reference jobs is shown in \Cref{fig:ecdf}.
For example, in \Cref{fig:ecdf-job-S}, we see that about 70\% of jobs have a similarity of less than 10\% to Job-S for HEX\_native.
BIN\_aggzeros shows some steep increases, e.g., more than 75\% of jobs have the same low similarity below 2\%.
The different algorithms lead to different curves for our reference jobs, e.g., for Job-S, HEX\_phases bundles more jobs with low similarity than the other algorithms; for Job-L, its curve rises the slowest.
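Statements such as ``about 70\% of jobs have a similarity of less than 10\%'' are single points on the empirical cumulative distribution. A minimal sketch of how such a point can be read off a list of similarity values is given below; the function name and the toy data are illustrative only.

```python
from bisect import bisect_left


def fraction_below(similarities, threshold):
    """Share of jobs whose similarity is strictly below the threshold."""
    ordered = sorted(similarities)
    return bisect_left(ordered, threshold) / len(ordered)


# Toy data: similarity of four jobs to a reference job, in the range 0..1.
print(fraction_below([0.05, 0.08, 0.5, 0.9], 0.10))
```

Evaluating this function for a grid of thresholds reproduces one curve of the cumulative distribution; a steep step in the curve corresponds to many jobs sharing nearly the same similarity.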
% This indicates that the algorithms
The support team in a data center may have time to investigate the most similar jobs.
Time for the analysis is typically bounded; for instance, the team may analyze the 100 most similar jobs. We refer to them as the Top\,100 jobs, and Rank\,i refers to the job that has the i-th highest similarity to the reference job. Sometimes these values can be rather close together, as we see in the following histogram.
As we focus on a feasible number of jobs, the diagram should be read from right (100\% similarity) to left; for each bin, we show at most 100 jobs (the total number is still given).
Especially for Job-S, we can see clusters with jobs of higher similarity (e.g., for hex\_lev at SIM=75\%), while for Job-M, the growth in the relevant section is steadier.
For Job-L, we find hardly any similar jobs, except when using the HEX\_phases and ks algorithms.
HEX\_phases finds 393 jobs that have a similarity of 100\%, thus they are indistinguishable, while ks identifies 6880 jobs with a similarity of at least 97.5\%.
Practically, the support team would start with Rank\,1 (most similar job, presumably, the reference job itself) and walk down until the jobs look different, or until a cluster is analyzed.
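This walk down the ranking can be sketched as follows. The stopping rule used here, a drop in similarity of more than \texttt{max\_gap} between consecutive ranks, is just one hypothetical way to detect the end of a cluster; the job identifiers and parameter values are invented for the example.

```python
def rank_jobs(similarity_by_job):
    """Return (job_id, similarity) pairs, most similar first; ties ordered by job id."""
    return sorted(similarity_by_job.items(), key=lambda kv: (-kv[1], kv[0]))


def walk_until_gap(ranked, max_jobs=100, max_gap=0.05):
    """Collect jobs from Rank 1 downwards until the similarity drops by more than max_gap."""
    selected = []
    for job_id, similarity in ranked[:max_jobs]:
        if selected and selected[-1][1] - similarity > max_gap:
            break  # assumed cluster boundary: the next job is clearly less similar
        selected.append((job_id, similarity))
    return selected


ranked = rank_jobs({"a": 1.0, "b": 0.98, "c": 0.97, "d": 0.70})
print([job_id for job_id, _ in walk_until_gap(ranked)])
```

Here the first three jobs form a cluster of nearly identical similarity, and the walk stops before the fourth job.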
When analyzing the overall population of jobs executed on a system, we expect that some workloads are executed several times (with different inputs but with the same configuration) or are executed with slightly different configurations (e.g., node counts, timesteps).
Thus, potentially our similarity analysis of the job population may just identify the re-execution of the same workload.
\Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs.
For Job-S, we can see that about 70--80\% of jobs stem from one user. For the hex\_lev and hex\_native algorithms, the other jobs stem from a second user, while bin includes jobs from additional users (5 in total).
For Job-M, jobs from more users are included (13), and about 25\% of jobs stem from the same user; here, hex\_lev, hex\_native, and ks include more users (29, 33, and 37, respectively) than the other three algorithms.
For Job-L, the two hex algorithms include a slightly more diverse user community (12 and 13 users) than the bin algorithms (9), but hex\_phases covers 35 users.
We did not include the group analysis in the figure as the user count and the group count are proportional; at most, the number of users is twice the number of groups.
Thus, a user is likely from the same group and the number of groups is similar to the number of unique users.
\paragraph{Node distribution.}
All algorithms reduce over the node dimension; therefore, we naturally expect jobs from a broad range of node counts to be included, as long as the average I/O behavior of the jobs is similar.
As post-processing jobs typically use one node and make up a high proportion of all jobs, it appears natural that all Top\,100 jobs are from this class, which is confirmed by investigating the job metadata.
The boxplots have different shapes, which indicates that the different algorithms identify different sets of jobs; we will analyze this further later.
While all algorithms can compute the similarity between jobs of different length, the bin algorithms and hex\_native penalize jobs of different length, preferring jobs of very similar length.
For Job-L, the job itself is not included in the chosen Top\,100 (see \Cref{fig:hist-job-L}; 393 jobs have a similarity of 100\%), which is the reason why the job runtime is not shown in the figure itself.
To verify that the different algorithms behave differently, the intersection of the Top\,100 is computed for all combinations of algorithms and visualized in \Cref{fig:heatmap-job}.
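The pairwise intersection behind such a heatmap can be computed with a few lines. The sketch below assumes one mapping from job identifier to similarity per algorithm; the function names and the tiny example sets are hypothetical.

```python
from itertools import combinations


def top_n(similarity_by_job, n=100):
    """Set of the n most similar job ids; ties are broken by job id."""
    ranked = sorted(similarity_by_job.items(), key=lambda kv: (-kv[1], kv[0]))
    return {job_id for job_id, _ in ranked[:n]}


def pairwise_overlap(top_sets):
    """Number of shared jobs for every pair of algorithms."""
    return {
        (a, b): len(top_sets[a] & top_sets[b])
        for a, b in combinations(sorted(top_sets), 2)
    }


top_sets = {"bin_all": {1, 2, 3}, "hex_lev": {3, 4}, "ks": {5}}
print(pairwise_overlap(top_sets))
```

Each value in the result corresponds to one off-diagonal cell of the heatmap; a count close to the size of the sets means that two algorithms select nearly the same jobs.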
From this analysis, we conclude that one representative from binary quantization is sufficient as it generates very similar results while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be analyzed individually.
The bin and ks algorithms identify one job whose name does not include “cmor”.
All other algorithms identify only “cmor” jobs, and 26--38 of these jobs are applied to “control” (see \Cref{tbl:control-jobs}); only the ks algorithm does not identify any job with control.
A selection of job timelines is given in \Cref{fig:job-S-hex-lev}; all of these jobs are jobs on control variables.
The single non-cmor job and a high-ranked non-control cmor job are shown in \Cref{fig:job-S-bin-agg}.
While we cannot visually see many differences between these two jobs and the cmor jobs processing the control variables, the algorithms indicate that jobs processing the control variables must be more similar, as they appear much more frequently in the Top\,100 jobs than in all jobs labeled with “cmor”.
Inspecting the Top\,100 for this reference job highlights the differences between the algorithms.
All algorithms identify a diverse range of job names for this reference job in the Top\,100.
Firstly, the name of the reference job appears 30 times in the whole dataset, so this job type is not necessarily executed frequently and, therefore, our Top\,100 is expected to contain other names.
Some applications are more prominent in these sets, e.g., for bin\_aggzeros, 32\,jobs contain WRF (a model) in the name.
The number of unique names is 19, 38, 49, and 51 for bin\_aggzeros, hex\_phases, hex\_native, and hex\_lev, respectively.
The jobs that are similar according to the bin algorithms differ from our expectation.
For the bin algorithms, the inspection of job names (14 unique names) leads to two prominent applications: bash and xmessy with 45 and 48 instances, respectively.
One consideration could be to identify jobs that are found by all algorithms, i.e., jobs that meet a certain (rank) threshold for different algorithms.
That would increase the likelihood that these jobs are very similar and are what the user is looking for.
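Such a consensus selection could look like the following sketch, which keeps only jobs that reach a given rank in every algorithm. It is an illustration of the idea rather than an evaluated method; the function names, the data layout, and the threshold are assumptions.

```python
def ranks(similarity_by_job):
    """Map each job id to its rank (1 = most similar); ties are broken by job id."""
    ordered = sorted(similarity_by_job.items(), key=lambda kv: (-kv[1], kv[0]))
    return {job_id: pos for pos, (job_id, _) in enumerate(ordered, start=1)}


def consensus_jobs(similarities_by_algorithm, max_rank=100):
    """Jobs whose rank is at most max_rank for every algorithm."""
    per_algorithm = [
        {job for job, rank in ranks(sims).items() if rank <= max_rank}
        for sims in similarities_by_algorithm.values()
    ]
    return set.intersection(*per_algorithm) if per_algorithm else set()


sims = {
    "bin_all": {"a": 0.9, "b": 0.8, "c": 0.1},
    "hex_lev": {"a": 0.7, "b": 0.2, "c": 0.6},
}
print(consensus_jobs(sims, max_rank=2))
```

In the example, only job \texttt{a} is among the two most similar jobs for both algorithms. A stricter or looser \texttt{max\_rank} trades the size of the consensus set against the agreement required.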