One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency.
Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs.
While it is easy to utilize statistics to rank jobs based on the utilization of computing, storage, and network, it is tricky to find patterns in 100,000 jobs, i.e., is there a class of jobs that aren't performing well.
Similarly, when support staff investigates a specific job in detail, e.g., because it is inefficient or highly efficient, it is relevant to identify related jobs to such a blueprint.
This allows staff to understand the usage of the exhibited behavior better and to assess the optimization potential.
Firstly, they provide a service to users to enable them the convenient execution of their applications.
Secondly, they aim to improve the efficiency of all workflows -- represented as batch jobs -- in order to allow the data center to serve more workloads.
In order to optimize a single job, its behavior and resource utilization must be monitored and then assessed.
Rarely, users will liaise with staff and request a performance analysis and optimization explicitly.
Therefore, data centers deploy monitoring systems and staff must pro-actively identify candidates for optimization.
Monitoring and analysis tools such as TACC Stats \cite{evans2014comprehensive}, Grafana \cite{chan2019resource}, and XDMod \cite{simakov2018workload} provide various statistics and time-series data for job execution.
The support staff should focus on workloads for which optimization is beneficial, for instance, the analysis of a job that is executed once on 20 nodes may not be a good return of investment.
By ranking jobs based on their utilization, it is easy to find a job that exhibits extensive usage of computing, network, and I/O resources.
However, would it be beneficial to investigate this workload in detail and potentially optimize it?
For instance, a pattern that is observed in many jobs bears potential as the blueprint for optimizing one job may be applied to other jobs as well.
This is particularly true when running one application with similar inputs but also different applications may lead to similar behavior.
Knowing details about a problematic or interesting job may be transferred to similar jobs.
Therefore, it is useful for support staff (or a user) that investigates a resource-hungry job to identify similar jobs that are executed on the supercomputer.
It is non-trivial to identify jobs with similar behavior from the pool of executed jobs.
Re-executing the same job will lead to slightly different behavior, a program may be executed with different inputs or using a different configuration (e.g., number of nodes).
Job names are defined by users; while a similar name may hint to be a similar workload, finding other applications with the same I/O behavior would not be possible.
In the paper \cite{Eugen20HPS}, the authors developed several distance measures and algorithms for the clustering of jobs based on the time series and their I/O behavior.
These distance measures can be applied to jobs with different runtime and number of nodes utilized but differ in the way they define similarity.
They showed that the metrics can be used to cluster jobs, however, it remained unclear if the method can be used by data center staff to explore similar jobs effectively.
In this paper, we refine these algorithms slightly, include another algorithm, and apply them to rank jobs based on their temporal similarity to a reference job.
We start by introducing related work in \Cref{sec:relwork}.
In \Cref{sec:methodology}, we describe briefly the data reduction and the algorithms for similarity analysis.
Then, we perform our study by applying the methodology to a reference job, therewith, providing an indicator for the effectiveness of the approach to identify similar jobs.
In \Cref{sec:evaluation}, the reference job is introduced and quantitative analysis of the job pool is made based on job similarity.
In \Cref{sec:timelines}, the 100 most similar jobs are investigated in more detail, and selected timelines are presented.
The paper is concluded in \Cref{sec:summary}.
\section{Related Work}
\label{sec:relwork}
Related work can be classified into distance measures, analysis of HPC application performance, inter-comparison of jobs in HPC, and I/O-specific tools.
%% DISTANCE MEASURES
The ranking of similar jobs performed in this article is related to clustering strategies.
Levenshtein (Edit) distance is a widely used distance metric indicating the number of edits needed to convert one string to another \cite{navarro2001guided}.
The comparison of the time series using various metrics has been extensively investigated.
In \cite{khotanlou2018empirical}, an empirical comparison of distance measures for the clustering of multivariate time series is performed.
14 similarity measures are applied to 23 data sets.
It shows that no similarity measure produces statistically significant better results than another.
However, the Swale scoring model \cite{morse2007efficient} produced the most disjoint clusters.
%In this model, gaps imply a cost.
% Lock-Step Measures and Elastic Measures
% Analysis of HPC application performance
The performance of applications can be analyzed using one of many tracing tools such as Vampir \cite{weber2017visual} that record the behavior of an application explicitly or implicitly by collecting information about the resource usage with a monitoring system.
Monitoring systems that record statistics about hardware usage are widely deployed in data centers to record system utilization by applications.
There are various tools for analyzing the I/O behavior of an application \cite{TFAPIKBBCF19}.
% time series analysis for inter-comparison of processes or jobs in HPC
For Vampir, a popular tool for trace file analysis, in \cite{weber2017visual} the Comparison View is introduced that allows them to manually compare traces of application runs, e.g., to compare optimized with original code.
Vampir generally supports the clustering of process timelines of a single job, allowing to focus on relevant code sections and processes when investigating many processes.
%Chameleon \cite{bahmani2018chameleon} extends ScalaTrace for recording MPI traces but reduces the overhead by clustering processes and collecting information from one representative of each cluster.
%For the clustering, a signature is created for each process that includes the call-graph.
In \cite{halawa2020unsupervised}, 11 performance metrics including CPU and network are utilized for agglomerative clustering of jobs showing the general effectiveness of the approach.
In \cite{rodrigo2018towards}, a characterization of the NERSC workload is performed based on job scheduler information (profiles).
Profiles that include the MPI activities have shown effective to identify the code that is executed \cite{demasi2013identifying}.
Many approaches for clustering applications operate on profiles for compute, network, and I/O \cite{emeras2015evalix,liu2020characterization,bang2020hpc}.
For example, Evalix \cite{emeras2015evalix} monitors system statistics (from proc) in 1-minute intervals but for the analysis, they are converted to a profile removing the time dimension, i.e., compute the average CPU, memory, and I/O over the job runtime.
% I/O-specific tools
PAS2P \cite{mendez2012new} extracts the I/O patterns from application traces and then allows users to manually compare them.
In \cite{white2018automatic}, a heuristic classifier is developed that analyzes the I/O read/write throughput time series to extract the periodicity of the jobs -- similar to Fourier analysis.
The LASSi tool \cite{AOPIUOTUNS19} periodically monitors Lustre I/O statistics and computes a "risk" factor to identify I/O patterns that stress the file system.
In contrast to existing work, our approach allows a user to identify similar activities based on the temporal I/O behavior recorded by a data center-wide deployed monitoring system.
\section{Methodology}
\label{sec:methodology}
The purpose of the methodology is to allow users and support staff to explore all executed jobs on a supercomputer in order of their similarity to the reference job.
Therefore, we first need to define how a job's data is represented, then describe the algorithms used to compute the similarity, and, the methodology to investigate jobs.
\subsection{Job Data}
On the Mistral supercomputer at DKRZ, the monitoring system \cite{betke20} gathers in ten seconds intervals on all nodes nine I/O metrics for the two Lustre file systems together with general job metadata from the SLURM workload manager.
The results are 4D data (time, nodes, metrics, file system) per job.
The distance measures should handle jobs of different lengths and node count.
In the open access article \cite{Eugen20HPS}\footnote{\url{https://zenodo.org/record/4478960/files/jhps-incubator-06-temporal-29-jan.pdf}}, we discussed a variety of options from 1D job-profiles to data reductions to compare time series data and the general workflow and pre-processing in detail.
We will be using this representation.
In a nutshell, for each job executed on Mistral, they partitioned it into 10 minutes segments\footnote{We found in preliminary experiments that 10 minutes provide sufficient resolution while it reduces noise, i.e., the variation of the statistics when re-running the same job.} and compute the arithmetic mean of each metric, categorize the value into NonIO (0), HighIO (1), and CriticalIO (4) for values below 99-percentile, up to 99.9-percentile, and above, respectively.
The values are chosen to be 0, 1, and 4 because we arithmetically derive metrics: naturally the value of 0 will indicate that no I/O issue appears; we weight critical I/O to be 4x as important as high I/O.
This strategy ensures that the same approach can be applied to other HPC systems regardless of the actual distribution of these statistics on that data center.
After the mean value across nodes is computed for a segment, the resulting numeric value is encoded either using binary (I/O activity on the segment: yes/no) or hexadecimal representation (quantizing the numerical performance value into 0-15) which is then ready for similarity analysis.
By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero, the dataset is reduced from 1 million jobs to about 580k jobs.
\subsection{Algorithms for Computing Similarity}
We reuse the B and Q algorithms developed in~\cite{Eugen20HPS}: B-all, B-aggz(eros), Q-native, Q-lev, and Q-phases.
They differ in the way data similarity is defined; either the time series is encoded in binary or hexadecimal quantization, the distance measure is the Euclidean distance or the Levenshtein-distance.
B-all determines similarity between binary codings by means of Levenshtein distance.
B-aggz is similar to B-all, but computes similarity on binary codings where subsequent segments of zero activities are replaced by just one zero.
Q-lev determines similarity between quantized codings by using Levenshtein distance.
Q-native uses a performance-aware similarity function, i.e., the distance between two jobs for a metric is $\frac{|m_{job1}- m_{job2}|}{16}$.
There are various options for how a longer job is embedded in a shorter job, for example, a larger input file may stretch the length of the I/O and compute phases; another option can be that more (model) time is simulated. In this article, we consider these different behavioral patterns and attempt to identify situations where the I/O pattern of a long job is contained in a shorter job. Therefore, for jobs with different lengths, a sliding-windows approach is applied which finds the location for the shorter job in the long job with the highest similarity.
Q-phases extract phase information and performs a phase-aware and performance-aware similarity computation.
The Q-phases algorithm extracts I/O phases from our 10-minute segments and computes the similarity between the most similar I/O phases of both jobs.
\subsection{Methodology}
Our strategy for localizing similar jobs works as follows:
\begin{itemize}
\item A user\footnote{This can be support staff or a data center user that was executing the job.} provides a reference job ID and selects a similarity algorithm.
\item The system iterates over all jobs of the job pool computing the similarity to the reference job using the specified algorithm.
\item It sorts the jobs based on the similarity to the reference job.
\item It visualizes the cumulative job similarity allowing the user to understand how job similarity is distributed.
\item The user starts the inspection by looking at the most similar jobs first.
\end{itemize}
The user can decide about the criterion when to stop inspecting jobs; based on the similarity, the number of investigated jobs, or the distribution of the job similarity.
For the latter, it is interesting to investigate clusters of similar jobs, e.g., if there are many jobs between 80-90\% similarity but few between 70-80\%.
For the inspection of the jobs, a user may explore the job metadata, searching for similarities, and explore the time series of a job's I/O metrics.
For this study, we chose the reference job called Job-M: a typical MPI parallel 8-hour compute job on 128 nodes which write time series data after some spin up. %CHE.ws12
The segmented timelines of the job are visualized in \Cref{fig:refJobs} -- remember that the mean value is computed across all nodes on which the job ran.
This coding is also used for the Q algorithms, thus this representation is what the algorithms will analyze; B algorithms merge all timelines together as described in~\cite{Eugen20HPS}.
The figures show the values of active metrics ($\neq0$); if few are active, then they are shown in one timeline, otherwise, they are rendered individually to provide a better overview.
We can also see an interesting result of our categorized coding, the write bytes are bigger than 0 while write calls are 0\footnote{The reason is that a few write calls transfer many bytes; less than our 90\%-quantile, therefore, write calls will be set to 0.}.
In the following, we assume the reference job (Job-M) is given and we aim to identify similar jobs.
For the reference job and each algorithm, we created CSV files with the computed similarity to all other jobs from our job pool (worth 203 days of production of Mistral).
During this process, the runtime of the algorithm is recorded.
Then we inspect the correlation between the similarity and number of found jobs.
Finally, the quantitative behavior of the 100 most similar jobs is investigated.
\subsection{Performance}
To measure the performance for computing the similarity to the reference jobs, the algorithms are executed 10 times on a compute node at DKRZ which is equipped with two Intel Xeon E5-2680v3 @2.50GHz and 64GB DDR4 RAM.
A boxplot for the runtimes is shown in \Cref{fig:performance}.
The runtime is normalized for 100k jobs, i.e., for B-all it takes about 41\,s to process 100k jobs out of the 500k total jobs that this algorithm will process.
Generally, the B algorithms are fastest, while the Q algorithms often take 4-5x as long.
In the quantitative analysis, we explore the different algorithms how the similarity of our pool of jobs behaves to our reference jobs.
The support team in a data center may have time to investigate the most similar jobs.
Time for the analysis is typically bound, for instance, the team may analyze the 100 most similar jobs and rank them; we refer to them as the Top\,100 jobs, and \textit{Rank\,i} refers to the job that has the i-th highest similarity to the reference job -- sometimes these values can be rather close together as we see in the histogram in
\Cref{fig:hist} for the actual number of jobs with a given similarity.
As we focus on a feasible number of jobs, we crop it at 100 jobs (total number of jobs is still given).
It turns out that both B algorithms produce nearly identical histograms, and we omit one of them.
In the figures, we can see again a different behavior of the algorithms depending on the reference job.
Practically, the support team would start with Rank\,1 (most similar job, e.g., the reference job) and walk down until the jobs look different, or until a cluster of jobs with close similarity is analyzed.
\caption{Histogram for the number of jobs (bin width: 2.5\%, numbers are the actual job counts). B-aggz is nearly identical to B-all and therefore omitted.}%
\label{fig:hist}
\end{figure}
\subsubsection{Inclusivity and Specificity}
When analyzing the overall population of jobs executed on a system, we expect that some workloads are executed several times (with different inputs but with the same configuration) or are executed with slightly different configurations (e.g., node counts, timesteps).
Thus, potentially our similarity analysis of the job population may just identify the re-execution of the same workload.
To understand if the analysis is inclusive and identifies different applications, we use two approaches with our Top\,100 jobs:
We explore the distribution of users (and groups), runtime, and node count across jobs.
The algorithms should include different users, node counts, and across runtime.
To confirm the hypotheses presented, we analyzed the job metadata comparing job names which validate our quantitative results discussed in the following.
\paragraph{User distribution.}
To understand how the Top\,100 are distributed across users, the data is grouped by userid and counted.
\Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs.
Jobs from 13 users are included; about 25\% of jobs stem from the same user; Q-lev, and Q-native include more users (29, 33, and 37, respectively) than the other three algorithms.
We didn't include the group analysis in the figure as user count and group id is proportional, at most the number of users is 2x the number of groups.
Thus, a user is likely from the same group and the number of groups is similar to the number of unique users.
\paragraph{Node distribution.}
\Cref{fig:nodes-job} shows a boxplot for the node counts in the Top\,100 -- the red line marks the reference job.
All algorithms reduce over the node dimensions, therefore, we naturally expect a big inclusion across the node range as long as the average I/O behavior of the jobs is similar.
The job runtime of the Top\,100 jobs is shown using boxplots in \Cref{fig:runtime-job}.
While all algorithms can compute the similarity between jobs of different length, the B algorithms and Q-native penalize jobs of different length preferring jobs of very similar length.
To verify the suitability of the similarity metrics, for each algorithm, we carefully investigated the timelines of each of the jobs in the Top\,100.
We subjectively found that the approach works very well and identifies suitable similar jobs.
To demonstrate this, we include a selection of job timelines and selected interesting job profiles.
These can be visually and subjectively compared to our reference jobs shown in \Cref{fig:refJobs}.
For space reasons, the included images will be scaled down making it difficult to read the text.
However, we believe that they are still well suited for a visual inspection and comparison.
\subsection{Job-M}
Inspecting the Top\,100 for this reference job is highlighting the differences between the algorithms.
All algorithms identify a diverse range of job names for this reference job in the Top\,100.
Firstly, the same name of the reference job appears 30 times in the whole dataset.
Additional 932 jobs have a slightly modified name.
So this job type isn't necessarily executed frequently and, therefore, our Top\,100 is expected to contain other names.
All algorithms identify only the reference job but none of the other jobs with the identical name but 1 (KS), 2 (B-* and Q-native) to 3 (Q-lev and Q-phases) jobs with slightly modified names.
Some applications are more prominent in these sets, e.g., for B-aggzero, 32~jobs contain WRF (a model) in the name.
The number of unique names is 19, 38, 49, and 51 for B-aggzero, Q-phases, Q-native and Q-lev, respectively.
When inspecting their timelines, the jobs that are similar according to the B algorithms (see \Cref{fig:job-M-bin-aggzero}) subjectively appear to us to be different.
The reason lies in the definition of the B-* similarity which aggregate all I/O statistics into one timeline.
The other algorithms like Q-lev (\Cref{fig:job-M-hex-lev}) and Q-native (\Cref{fig:job-M-hex-native}) seem to work as intended:
While jobs exhibit short bursts of other active metrics even for low similarity, we can eyeball a relevant similarity particularly for Rank\,2 and Rank\,3 which have the high similarity of 90+\%. For Rank\,15 to Rank\,100, with around 70\% similarity, a partial match of the metrics is still given.
% \caption{Job-M with Q-phases, selection of similar jobs}
% \label{fig:job-M-hex-phases}
% \end{figure}
\section{Conclusion}%
\label{sec:summary}
We conducted a study to identify similar jobs based on timelines of nine I/O statistics.
The quantitative analysis shows that a diverse set of results can be found and that only a tiny subset of the 500k jobs is very similar to each of the three reference jobs.
For the small post-processing job, which is executed many times, all algorithms produce suitable results.
For Job-M, the algorithms exhibit a different behavior.
Job-L is tricky to analyze, because it is compute-intense with only a single I/O phase at the beginning.
We found that the approach to compute similarity of reference jobs to all jobs and ranking these was successful to find related jobs that we were interested in.
The Q-lev and Q-native work best according to our subjective qualitative analysis.
Typically, a related job stems from the same user/group and may have a related job name, but the approach was able to find other jobs as well.
The pre-processing of the algorithms and distance metrics differ leading to a different definition of similarity.
The data center support/user must define how to define similarity to select the algorithm that suits best.
Another consideration could be to identify jobs that are found by all algorithms, i.e., jobs that meet a certain (rank) threshold for different algorithms.
That would increase the likelihood that these jobs are very similar and what the user is looking for.
Our next step is to foster a discussion in the community to identify and define suitable similarity metrics for the different analysis purposes.
\subsection*{Acknowledgment}%% Remove this section if not needed