Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs.
While it is easy to utilize statistics to rank jobs based on the utilization of computing, storage, and network, it is tricky to find patterns in 100.000 jobs, i.e., is there a class of jobs that aren't performing well.
Similarly, when support staff investigates a specific job in detail, e.g., because it is inefficient or highly efficient, it is relevant to identify related jobs to such a blueprint.
%In this paper, a methodology to rank the similarity of all jobs to a reference job based on their temporal I/O behavior is described.
In this paper, we describe a methodology to process efficiently a large set of jobs and find a class with a high temporal I/O similarity to a reference job.
Practically, we apply several previously developed time series algorithms and also utilize the Kolmogorov-Smirnov-Test to compare the distribution of the metrics.
A study is conducted to explore the effectiveness of the approach by investigating related jobs for three reference jobs.
The data stems from DKRZ's supercomputer Mistral and includes more than 500.000 jobs that have been executed for more than 6 months of operation. Our analysis shows that the strategy and algorithms are effective to identify similar jobs and revealed interesting patterns in the data.
It also shows the need for the community to jointly define the semantics of similarity depending on the analysis purpose.
Firstly, they provide a service to users to enable them the convenient execution of their applications.
Secondly, they aim to improve the efficiency of all workflows -- represented as batch jobs -- in order to allow the data center to serve more workloads.
In order to optimize a single job, its behavior and resource utilization must be monitored and then assessed.
Rarely, users will liaise with staff and request a performance analysis and optimization explicitly.
Therefore, data centers deploy monitoring systems and staff must pro-actively identify candidates for optimization.
Monitoring tools such as TACC Stats \cite{evans2014comprehensive}, Grafana \cite{chan2019resource}, and XDMod \cite{simakov2018workload} provide various statistics and time-series data for job execution.
The support staff should focus on workloads for which optimization is beneficial, for instance, the analysis of a job that is executed once on 20 nodes may not be a good return of investment.
Knowing details about a problematic or interesting job may be transferred to similar jobs.
Therefore, it is useful for support staff (or a user) that investigates a resource-hungry job to identify similar jobs that are executed on the supercomputer.
It is non-trivial to identify jobs with similar behavior from the pool of executed jobs.
Re-executing the same job will lead to slightly different behavior, a program may be executed with different inputs or using a different configuration (e.g., number of nodes).
Job names are defined by users; while a similar name may hint to be a similar workload finding other applications with the same I/O behavior would not be possible.
In the paper \cite{Eugen20HPS}, the authors developed several distance measures and algorithms for the clustering of jobs based on the time series and their I/O behavior.
These distance measures can be applied to jobs with different runtime and number of nodes utilized but differ in the way they define similarity.
They showed that the metrics can be used to cluster jobs, however, it remained unclear if the method can be used by data center staff to explore similar jobs effectively.
In this paper, we refine these algorithms slightly, also include another algorithm and apply them to rank jobs based on their temporal I/O similarity to a reference job.
In \Cref{sec:methodology}, we describe briefly the data reduction and the algorithms for similarity analysis.
We also utilize the Kolmogorov-Smirnov-Test to illustrate the benefit and drawbacks of the different methods.
Then, we perform our study by applying the methodology to three reference jobs with different behavior, therewith, assessing the effectiveness of the approach to identify similar jobs.
Related work can be classified into distance measures, analysis of HPC application performance, inter-comparison of jobs in HPC, and I/O-specific tools.
Levenshtein (Edit) distance is a widely used distance metrics indicating the number of edits needed to convert one string to another \cite{navarro2001guided}.
The performance of applications can be analyzed using one of many tracing tools such as Vampir \cite{weber2017visual} that record the behavior of an application explicitly or implicitly by collecting information about the resource usage with a monitoring system.
Monitoring systems that record statistics about hardware usage are widely deployed in data centers to record system utilization by applications.
For Vampir, a popular tool for trace file analysis, in \cite{weber2017visual} the Comparison View is introduced that allows them to manually compare traces of application runs, e.g., to compare optimized with original code.
Vampir generally supports the clustering of process timelines of a single job allowing to focus on relevant code sections and processes when investigating a large number of processes.
Chameleon \cite{bahmani2018chameleon} extends ScalaTrace for recording MPI traces but reduces the overhead by clustering processes and collecting information from one representative of each cluster.
For the clustering, a signature is created for each process that includes the call-graph.
In \cite{halawa2020unsupervised}, 11 performance metrics including CPU and network are utilized for agglomerative clustering of jobs showing the general effectivity of the approach.
Many approaches for clustering applications operate on profiles for compute, network, and I/O \cite{emeras2015evalix,liu2020characterization,bang2020hpc}.
For example, Evalix \cite{emeras2015evalix} monitors system statistics (from proc) in 1-minute intervals but for the analysis, they are converted to a profile removing the time dimension, i.e., compute the average CPU, memory, and I/O over the job runtime.
In \cite{white2018automatic}, a heuristic classifier is developed that analyzes the I/O read/write throughput time series to extract the periodicity of the jobs -- similar to Fourier analysis.
The LASSi tool \cite{AOPIUOTUNS19} periodically monitors Lustre I/O statistics and computes a "risk" factor to identify I/O patterns that stress the file system.
In contrast to existing work, our approach allows a user to identify similar activities based on the temporal I/O behavior recorded by a data center-wide deployed monitoring system.
The purpose of the methodology is to allow users and support staff to explore all executed jobs on a supercomputer in order of their similarity to the reference job.
Therefore, we first need to define how a job's data is represented, then describe the algorithms used to compute the similarity, and, the methodology to investigate jobs.
On the Mistral supercomputer at DKRZ, the monitoring system \cite{betke20} gathers in ten seconds intervals on all nodes nine I/O metrics for the two Lustre file systems together with general job metadata from the SLURM workload manager.
In \cite{Eugen20HPS}, the authors discussed a variety of options from 1D job-profiles to data reductions to compare time series data and the general workflow and pre-processing in detail. We are using their data.
In a nutshell, for each job executed on Mistral, they partitioned it into 10 minutes segments and compute the arithmetic mean of each metric, categorize the value into NonIO (0), HighIO (1), and CriticalIO (4) for values below 99-percentile, up to 99.9-percentile, and above, respectively.
After the mean value across nodes is computed for a segment, the resulting numeric value is encoded either using binary (I/O activity on the segment: yes/no) or hexadecimal representation (quantizing the numerical performance value into 0-15) which is then ready for similarity analysis.
By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero, dataset is reduced from 1 million jobs to about 580k jobs.
They differ in the way data similarity is defined; either the time series is encoded in binary or hexadecimal quantization, the distance measure is the Euclidean distance or the Levenshtein-distance.
For jobs with different lengths, a sliding-windows approach is applied which finds the location for the shorter job in the long job with the highest similarity.
In this paper, we add a similarity definition based on Kolmogorov-Smirnov-Test that compares the probability distribution of the observed values which we describe in the following.
%In brief, KS concatenates individual node data and computes similarity be means of Kolmogorov-Smirnov-Test.
Dimension reduction by computing means across the two file systems and by concatenating the time series data of the individual nodes (instead of averaging them).
The reduction of the file system dimension by the mean function ensures the time series values stay in the range between 0 and 4, independently how many file systems are present on an HPC system.
Unlike the previous similarity definitions, the concatenation of time series on the node dimension preserves the individual I/O information of all nodes while it still allows comparison of jobs with a different number of nodes.
The similarity function calculates the mean inverse of reject probability $p_{\text{reject}}$ computed with the ks-test across all metrics $M$: $sim =\frac{\sum_m 1- p_{\text{reject}(m)}}{|M|}$.
\item A user\footnote{This can be support staff or a data center user that was executing the job.} provides a reference job ID and selects a similarity algorithm.
The user can decide about the criterion when to stop inspecting jobs; based on the similarity, the number of investigated jobs, or the distribution of the job similarity.
For the latter, it is interesting to investigate clusters of similar jobs, e.g., if there are many jobs between 80-90\% similarity but few between 70-80\%.
\item Job-S: performs post-processing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so-called CMORization). The post-processing is I/O intensive.
This coding is also used for the Q algorithms, thus this representation is what the algorithms will analyze; B algorithms merge all timelines together as described in~\cite{Eugen20HPS}.
The figures show the values of active metrics ($\neq0$); if few are active, then they are shown in one timeline, otherwise, they are rendered individually to provide a better overview.
The metrics at Job-L are not shown as they have only a handful of instances where the value is not 0, except for write\_bytes: the first process is writing out at a low rate.
In the following, we assume a reference job is given (we use Job-S, Job-M, and Job-L) and we aim to identify similar jobs.
For each reference job and algorithm, we created CSV files with the computed similarity to all other jobs from our job pool (worth 203 days of production of Mistral).
To measure the performance for computing the similarity to the reference jobs, the algorithms are executed 10 times on a compute node at DKRZ which is equipped with two Intel Xeon E5-2680v3 @2.50GHz and 64GB DDR4 RAM.
The runtime is normalized for 100k jobs, i.e., for B-all it takes about 41\,s to process 100k jobs out of the 500k total jobs that this algorithm will process.
%The cumulative distribution of similarity to a reference job is shown in %\Cref{fig:ecdf}.
%For example, in \Cref{fig:ecdf-job-S}, we see that about 70\% have a similarity of less than 10\% to Job-S for Q-native.
%B-aggz shows some steep increases, e.g., more than 75\% of jobs have the same low similarity below 2\%.
%The different algorithms lead to different curves for our reference jobs, e.g., for Job-S, Q-phases bundles more jobs with low similarity compared to the other jobs; in Job-L, it is the slowest.
Time for the analysis is typically bound, for instance, the team may analyze the 100 most similar jobs and rank them; we refer to them as the Top\,100 jobs, and \textit{Rank\,i} refers to the job that has the i-th highest similarity to the reference job -- sometimes these values can be rather close together as we see in the histogram in
\Cref{fig:hist} for the actual number of jobs with a given similarity.
As we focus on a feasible number of jobs, we crop it at 100 jobs (total number of jobs is still given).
It turns out that both B algorithms produce nearly identical histograms and we omit one of them.
Especially for Job-S, we can see clusters with jobs of higher similarity (e.g., at Q-lev at SIM=75\%) while for Job-M, the growth in the relevant section is more steady.
For Job-L, we find barely similar jobs, except when using the Q-phases and KS algorithms.
Q-phases find 393 jobs that have a similarity of 100\%, thus they are indistinguishable, while KS identifies 6880 jobs with a similarity of at least 97.5\%.
Practically, the support team would start with Rank\,1 (most similar job, e.g., the reference job) and walk down until the jobs look different, or until a cluster of jobs with close similarity is analyzed.
When analyzing the overall population of jobs executed on a system, we expect that some workloads are executed several times (with different inputs but with the same configuration) or are executed with slightly different configurations (e.g., node counts, timesteps).
To confirm the hypotheses presented, we analyzed the job metadata comparing job names which validate our quantitative results discussed in the following.
\Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs.
For Job-S, we can see that about 70-80\% of jobs stem from one user, for the Q-lev and Q-native algorithms, the other jobs stem from a second user while B algorithms include jobs from additional users (5 in total).
For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user; here, Q-lev, Q-native, and KS is including more users (29, 33, and 37, respectively) than the other three algorithms.
All algorithms reduce over the node dimensions, therefore, we naturally expect a big inclusion across node range -- as long as the average I/O behavior of the jobs is similar.
As post-processing jobs use typically one node and the number of postprocessing jobs is a high proportion, it appears natural that all Top\,100 are from this class of jobs, which is confirmed by investigating the job metadata.
The boxplots have different shapes which is an indication, that the different algorithms identify a different set of jobs -- we will analyze this later further.
While all algorithms can compute the similarity between jobs of different length, the B algorithms and Q-native penalize jobs of different length preferring jobs of very similar length.
For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:hist-job-L}, 393 jobs have a similarity of 100\%) which is the reason why the job runtime isn't shown in the figure itself.
To verify that the different algorithms behave differently, the intersection for the Top\,100 is computed for all combinations of algorithms and visualized in \Cref{fig:heatmap-job}.
From this analysis, we conclude that one representative from B is sufficient as it generates very similar results while the other algorithms identify mostly disjoint behavioral aspects. % and, therefore, should be analyzed individually
All other algorithms identify only “cmor” jobs and 26-38 of these jobs are applied to “control” (see \Cref{tbl:control-jobs}) -- only the KS algorithm doesn't identify any job with control.
While we cannot visually see much differences between these two jobs compared to the the control job, the algorithms indicate that jobs processing the control variables are more similar as they are more frequent in the Top\,100 jobs.
The KS algorithm working on the histograms ranks the jobs correctly on the similarity of their histograms.
However, as it does not deal with the length of the jobs, it may identify jobs of very different length.
In \Cref{fig:job-M-ks}, we see the 3rd ranked job, which profile is indeed quite similar but the time series differs but it is just running for 10min (1 segment) on 10\,nodes.
Remember, for the KS algorithm, we concatenate the metrics of all nodes together instead of averaging it in order to explore if node-specific information helps the similarity.
The B algorithms find a low similarity (best 2nd ranked job is 17\% similar), the inspection of job names (14 unique names) leads to two prominent applications: bash and xmessy with 45 and 48 instances, respectively.
The Q-phases algorithm finds 85 unique names but as there is only one short I/O phase in the reference job, it finds many (short) jobs with 100\% similarity as seen in \Cref{fig:job-L-hex-phases}.
The KS algorithm is even more inclusive having 1285 jobs with 100\% similarity; the 100 selected ones contain 71 jobs ending with t127, which is a typical model configuration.
As expected, the histograms mimics the profile of the reference job, and thus, the algorithm does what it is expected to do.
Therefore, we applied six different algorithmic strategies developed before and included this time as well a distance metric based on the Kolmogorov-Smirnov-Test.
The quantitative analysis shows that a diverse set of results can be found and that only a tiny subset of the 500k jobs is very similar to each of the three reference jobs.
For the small post-processing job, which is executed many times, all algorithms produce suitable results.
For Job-M, the algorithms exhibit a different behavior.
Job-L is tricky to analyze, because it is compute intense with only a single I/O phase at the beginning.
Generally, the KS algorithm finds jobs with similar histograms which are not necessarily what we subjectively are looking for.
We found that the approach to compute similarity of a reference jobs to all jobs and ranking these was successful to find related jobs that we were interested in.
Another consideration could be to identify jobs that are found by all algorithms, i.e., jobs that meet a certain (rank) threshold for different algorithms.