Julian M. Kunkel 2020-10-22 17:28:02 +01:00
parent 4ab98b256c
commit 240398942e
1 changed files with 87 additions and 76 deletions

@ -118,7 +118,16 @@ By ranking jobs based on the statistics, it isn't difficult to find a job that e
However, would it be beneficial to investigate this workload in detail and potentially optimize it? However, would it be beneficial to investigate this workload in detail and potentially optimize it?
A pattern that can be observed in many jobs bears potential as the blueprint for optimizing one job may be applied to other jobs as well. A pattern that can be observed in many jobs bears potential as the blueprint for optimizing one job may be applied to other jobs as well.
This is particularly true when running one application with similar inputs but also different applications may lead to similar behavior. This is particularly true when running one application with similar inputs but also different applications may lead to similar behavior.
Therefore, it is useful for support staff that investigates a resource-hungry job to identify similar jobs that are executed on the supercomputer. Knowledge about a problematic or interesting job may be transferred to similar jobs.
Therefore, it is useful for support staff (or a user) investigating a resource-hungry job to identify similar jobs that are executed on the supercomputer.
It is non-trivial to identify jobs with similar behavior from the pool of executed jobs.
Re-executing the same job will lead to slightly different behavior; moreover, a program may be executed with different inputs or using a different configuration (e.g., number of nodes).
Job names are defined by users; while a similar name may hint at a similar workload, finding other applications with the same IO behavior by name alone would not be possible.
\jk{Hope this explains it}
\eb{The information on why the support staff should search for similar jobs is still missing here. As I understand it, if a job causes problems, then similar jobs may cause similar problems as well.}
\eb{The benefit for the user is not entirely clear. Why should a user search for similar jobs?}
In our previous paper \cite{XXX}, we developed several distance metrics and algorithms for the clustering of jobs based on the time series of their IO behavior. In our previous paper \cite{XXX}, we developed several distance metrics and algorithms for the clustering of jobs based on the time series of their IO behavior.
The distance metrics can be applied to jobs with different runtime and number of nodes utilized but differ in the way they define similarity. The distance metrics can be applied to jobs with different runtime and number of nodes utilized but differ in the way they define similarity.
@ -139,11 +148,15 @@ Finally, we conclude our paper in \Cref{sec:summary}.
\section{Related Work} \section{Related Work}
\label{sec:relwork} \label{sec:relwork}
Clustering of jobs based on their names
Vampir clustering of timelines of a single job.
\section{Methodology} \section{Methodology}
\label{sec:methodology} \label{sec:methodology}
The purpose of the methodology is to allow users and support staff to explore all executed jobs on a supercomputer in order of their similarity to the reference job. The purpose of the methodology is to allow users and support staff to explore all executed jobs on a supercomputer in order of their similarity to the reference job.
Therefore, we first need to define the job data, then describe the algorithms used to compute the similarity, and, finally, the methodology to investigate jobs is described. Therefore, we first define how a job's data is represented, then describe the algorithms used to compute the similarity, and, finally, describe the methodology to investigate jobs.
\subsection{Job Data} \subsection{Job Data}
On the Mistral supercomputer at DKRZ, the monitoring system gathers nine IO metrics for the two Lustre file systems on all nodes in 10\,s intervals, together with general job metadata from the SLURM workload manager. On the Mistral supercomputer at DKRZ, the monitoring system gathers nine IO metrics for the two Lustre file systems on all nodes in 10\,s intervals, together with general job metadata from the SLURM workload manager.
@ -155,13 +168,13 @@ After data is reduced across nodes, we quantize the timelines either using binar
By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero -- we reduce the dataset from about 1 million jobs to about 580k jobs. By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero -- we reduce the dataset from about 1 million jobs to about 580k jobs.
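As a sketch of this pre-filter, assume (in our own notation, which is not taken from the paper) that $a^{(j)}_{n,m,t} \ge 0$ denotes the quantized activity of job $j$ on node $n$ for metric $m$ in segment $t$, with any further dimensions summed in the same way. A job is then kept only if
\[
  \sum_{n} \sum_{m} \sum_{t} a^{(j)}_{n,m,t} > 0 ,
\]
which, with the approximate numbers given above, removes roughly $1 - 580\mathrm{k}/1000\mathrm{k} \approx 42\%$ of the jobs.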
\subsection{Algorithms for Computing Similarity} \subsection{Algorithms for Computing Similarity}
In this paper, we reuse the algorithms developed in \cite{TODO}: bin\_all, bin\_aggzeros, hex\_native, hex\_lev, and hex\_quant. We reuse the algorithms developed in \cite{TODO}: BIN\_all, BIN\_aggzeros, HEX\_native, HEX\_lev, and HEX\_quant.
They differ in the way data similarity is defined: either the binary or the hexadecimal coding is used, and the distance metric is mostly the Euclidean distance or the Levenshtein distance. They differ in the way data similarity is defined: either the binary or the hexadecimal coding is used, and the distance metric is mostly the Euclidean distance or the Levenshtein distance.
For jobs with different lengths, we apply a sliding-window approach, which finds the location for the shorter job in the longer job with the highest similarity. For jobs with different lengths, we apply a sliding-window approach, which finds the location for the shorter job in the longer job with the highest similarity.
The hex\_quant algorithm extracts I/O phases and computes the similarity between the most similar I/O phases of both jobs. The HEX\_quant algorithm extracts I/O phases and computes the similarity between the most similar I/O phases of both jobs.
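The sliding-window comparison described above can be written compactly as follows. This is only a sketch in our own notation: $s(\cdot,\cdot)$ stands for the similarity of two equally long codings under the chosen algorithm, $a$ is the coding of the shorter job with $|a|$ segments, and $b_{o..o+|a|-1}$ is the window of the longer job $b$ that starts at offset $o$:
\[
  \mathrm{sim}(a, b) = \max_{0 \le o \le |b| - |a|} s\bigl(a,\, b_{o..o+|a|-1}\bigr) .
\]
Any additional penalty that an algorithm applies for differing job lengths is not captured by this sketch.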
In this paper, we add a new similarity definition based on the Kolmogorov-Smirnov test, which compares the probability distributions of the observed values; we describe it in the following.
\paragraph{Kolmogorov-Smirnov (KS) algorithm} \paragraph{Kolmogorov-Smirnov (KS) algorithm}
In this paper, we add a new similarity definition based on Kolmogorov-Smirnov-Test that compares the probability distribution of the observed values which we describe in the following.
% Summary % Summary
For the analysis, we perform two preparation steps. For the analysis, we perform two preparation steps.
Dimension reduction by computing means across the two file systems and by concatenating the time series data of the individual nodes. Dimension reduction by computing means across the two file systems and by concatenating the time series data of the individual nodes.
@ -170,7 +183,7 @@ This reduces the four-dimensional dataset to two dimensions (time, metrics).
% Aggregation % Aggregation
The reduction of the file system dimension by the mean function ensures the time series values stay in the range between 0 and 4, independently of how many file systems are present on an HPC system. The reduction of the file system dimension by the mean function ensures the time series values stay in the range between 0 and 4, independently of how many file systems are present on an HPC system.
The fixed interval of 10 minutes also ensures the portability of the approach to other HPC systems. The fixed interval of 10 minutes also ensures the portability of the approach to other HPC systems.
Unlike the previous similarity definitions, the concatenation of time series on the node dimension preserves the individual I/O information of all nodes while it allows comparison of jobs with a different number of nodes. Unlike the previous similarity definitions, the concatenation of time series on the node dimension preserves the individual I/O information of all nodes while it still allows comparison of jobs with a different number of nodes.
We apply no aggregation function to the metric dimension. We apply no aggregation function to the metric dimension.
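For reference, the standard two-sample Kolmogorov-Smirnov statistic compares the empirical distribution functions $F_{1}$ and $F_{2}$ of two samples. The symbols are our own notation; how the test outcome is mapped to a similarity value is defined in \Cref{eq:ks_similarity} and is not restated here:
\[
  D = \sup_{x} \bigl| F_{1}(x) - F_{2}(x) \bigr| .
\]
A small $D$ indicates that the observed values of a metric are distributed similarly in both jobs.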
% Filtering % Filtering
@ -189,32 +202,23 @@ The similarity function \Cref{eq:ks_similarity} calculates the mean inverse of r
\subsection{Methodology} \subsection{Methodology}
\eb{The paragraph does not read smoothly}
Our strategy for localizing similar jobs works as follows: Our strategy for localizing similar jobs works as follows:
The user/support staff provides a reference job ID and selects a similarity definition. \begin{itemize}
The system iterates over all jobs and computes the distance to the reference job using the algorithm. \item A user\footnote{This can be support staff or a data center user that was executing the job.} provides a reference job ID and selects a similarity algorithm.
Next, sort the jobs based on the distance to the reference job. \item The system iterates over all jobs of the job pool computing the distance to the reference job using the specified algorithm.
Visualize the cumulative job distance. \item It sorts the jobs based on the distance to the reference job.
Start the inspection of the jobs looking at the most similar jobs first. \item It visualizes the cumulative job distance allowing the user to understand how job similarity is distributed.
\eb{The information on why the support staff should search for similar jobs is still missing here. As I understand it, if a job causes problems, then similar jobs may cause similar problems as well.} \item The user starts the inspection by looking at the most similar jobs first.
\end{itemize}
\eb{The benefit for the user is not entirely clear. Why should a user search for similar jobs?}
The user decides on the criterion for when to stop inspecting jobs: it may be based on the similarity, the number of investigated jobs, or the distribution of the job similarity. The user decides on the criterion for when to stop inspecting jobs: it may be based on the similarity, the number of investigated jobs, or the distribution of the job similarity.
For the latter, it is interesting to investigate clusters of similar jobs, e.g., if there are many jobs between 80-90\% similarity but few between 70-80\%. For the latter, it is interesting to investigate clusters of similar jobs, e.g., if there are many jobs between 80-90\% similarity but few between 70-80\%.
For the inspection of the jobs, a user may explore the job metadata, searching for similarities, and explore the time series of a job's IO metrics. For the inspection of the jobs, a user may explore the job metadata, searching for similarities, and explore the time series of a job's IO metrics.
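The ranking step of this strategy can be summarized as follows (a sketch in our own notation, assuming that a higher value means more similar). Let $r$ be the reference job, $J$ the job pool, and $s_j = \mathrm{sim}(r, j)$ the similarity of job $j \in J$ under the selected algorithm. The jobs are ordered such that
\[
  s_{(1)} \ge s_{(2)} \ge \dots \ge s_{(|J|)} ,
\]
so that the $i$-th inspected job is the one with similarity $s_{(i)}$, and the cumulative distribution shown to the user is
\[
  F(x) = \frac{\bigl| \{\, j \in J : s_j \le x \,\} \bigr|}{|J|} .
\]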
\section{Evaluation} \section{Reference Jobs}
\label{sec:evaluation} \label{sec:refjobs}
For each reference job and algorithm, we created CSV files with the computed similarity for all other jobs. For this study, we chose several reference jobs with different compute and IO characteristics:
Next, we analyzed the performance of the algorithm.
Then the quantitative behavior and the correlation between chosen similarity and number of found jobs, and, finally, the quality of the 100 most similar jobs.
\subsection{Reference Jobs}
In the following, we assume a job is given and we aim to identify similar jobs.
We chose several reference jobs with different compute and IO characteristics:
\begin{itemize} \begin{itemize}
\item Job-S: performs post-processing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so-called CMORization). The post-processing is IO intensive. \item Job-S: performs post-processing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so-called CMORization). The post-processing is IO intensive.
\item Job-M: a typical MPI parallel 8-hour compute job on 128 nodes which writes time series data after some spin up. %CHE.ws12 \item Job-M: a typical MPI parallel 8-hour compute job on 128 nodes which writes time series data after some spin up. %CHE.ws12
@ -223,21 +227,21 @@ We chose several reference jobs with different compute and IO characteristics:
Then only a single master node constantly writes a small volume of data; in fact, the generated data is too small to be categorized as IO relevant. Then only a single master node constantly writes a small volume of data; in fact, the generated data is too small to be categorized as IO relevant.
\end{itemize} \end{itemize}
The segmented timelines of the jobs are visualized in \Cref{fig:refJobs}. The segmented timelines of the jobs are visualized in \Cref{fig:refJobs} -- remember that the mean value is computed across all nodes.
This coding is also used for the HEX class of algorithms (BIN algorithms merge all timelines together as described in \jk{TODO}. This coding is also used for the HEX class of algorithms, thus this representation is what the algorithms will analyze; BIN algorithms merge all timelines together as described in \cite{TODO}.
The figures show the values of active metrics ($\neq 0$) only; if few are active then they are shown in one timeline, otherwise, they are rendered individually to provide a better overview. The figures show the values of active metrics ($\neq 0$); if few are active then they are shown in one timeline, otherwise, they are rendered individually to provide a better overview.
For example, we can see in \Cref{fig:job-S} that several metrics increase in Segment\,6. For example, we can see in \Cref{fig:job-S} that several metrics increase in Segment\,6.
In \Cref{fig:refJobsHist}, the histograms of all job metrics are shown. In \Cref{fig:refJobsHist}, the histograms of all job metrics are shown.
A histogram contains the activities of each node and timestep without being averaged across the nodes. A histogram contains the activities of each node and timestep without being averaged across the nodes.
This data is used to compare jobs using Kolmogorov-Smirnov-Test. This data is used to compare jobs using Kolmogorov-Smirnov-Test.
The metrics at Job-L are not shown as they have only a handful of instances where the value is not 0, except for write\_bytes: the first process is writing out at a low rate. The metrics at Job-L are not shown as they have only a handful of instances where the value is not 0, except for write\_bytes: the first process is writing out at a low rate.
Interestingly, the aggregated pattern of Job-L in \Cref{fig:job-L} sums up to some activity at the first segment for three other metrics. In \Cref{fig:job-L}, the mean value is mostly rounded down to 0 except for the first segment as primarily Rank\,0 is doing IO.
\begin{figure} \begin{figure}
\begin{subfigure}{0.8\textwidth} \begin{subfigure}{0.8\textwidth}
\centering \centering
\includegraphics[width=\textwidth]{job-ks-0timeseries4296426} \includegraphics[width=\textwidth]{job-timeseries4296426}
\caption{Job-S (runtime=15,551\,s, segments=25)} \label{fig:job-S} \caption{Job-S (runtime=15,551\,s, segments=25)} \label{fig:job-S}
\end{subfigure} \end{subfigure}
\centering \centering
@ -260,7 +264,7 @@ Interestingly, the aggregated pattern of Job-L in \Cref{fig:job-L} sums up to so
\begin{subfigure}{0.8\textwidth} \begin{subfigure}{0.8\textwidth}
\centering \centering
\includegraphics[width=\textwidth]{job-ks-2timeseries7488914-30} \includegraphics[width=\textwidth]{job-timeseries7488914-30}
\caption{Job-L (first 30 segments of 400; remaining segments are similar)} \caption{Job-L (first 30 segments of 400; remaining segments are similar)}
\label{fig:job-L} \label{fig:job-L}
\end{subfigure} \end{subfigure}
@ -303,24 +307,32 @@ Interestingly, the aggregated pattern of Job-L in \Cref{fig:job-L} sums up to so
\section{Evaluation}
\label{sec:evaluation}
In the following, we assume a reference job is given (we use Job-S, Job-M, and Job-L) and we aim to identify similar jobs.
For each reference job and algorithm, we created CSV files with the computed similarity to all other jobs from our job pool (covering 203 days of production on Mistral).
During this process, the runtime of the algorithm is recorded.
Then we inspect the correlation between the similarity and the number of found jobs.
Finally, the quantitative behavior of the 100 most similar jobs is investigated.
\subsection{Performance} \subsection{Performance}
\jk{Describe System at DKRZ from old paper} \jk{Eugen: pls describe node where the performance is measured on.}
To measure the performance for computing the similarity to the reference jobs, the algorithms are executed 10 times on a compute node at DKRZ. To measure the performance for computing the similarity to the reference jobs, the algorithms are executed 10 times on a compute node at DKRZ.
A boxplot for the runtimes is shown in \Cref{fig:performance}. A boxplot for the runtimes is shown in \Cref{fig:performance}.
The runtime is normalized for 100k jobs, i.e., for bin\_all it takes about 41\,s to process 100k jobs out of the 500k total jobs that this algorithm will process. The runtime is normalized for 100k jobs, i.e., for BIN\_all it takes about 41\,s to process 100k jobs out of the 500k total jobs that this algorithm will process.
Generally, the bin algorithms are fastest, while the hex algorithms often take 4-5x as long. Generally, the bin algorithms are fastest, while the hex algorithms often take 4-5x as long.
HEX\_phases is slow for Job-S and Job-M while it is fast for Job-L; the reason is that just one phase is extracted for Job-L. HEX\_phases is slow for Job-S and Job-M while it is fast for Job-L; the reason is that just one phase is extracted for Job-L.
The Levenshtein-based algorithms take longer for longer jobs -- proportional to the job length, as they apply a sliding window. The Levenshtein-based algorithms take longer for longer jobs -- proportional to the job length, as they apply a sliding window.
Note that the current algorithms are sequential and executed on just one core. Note that the current algorithms are sequential and executed on just one core.
For computing the similarity to one reference job (or a small set of reference jobs), they could easily be parallelized. For computing the similarity to one reference job (or a small set of reference jobs), they could easily be parallelized.
We believe this will then allow a near-online analysis of a job. We believe this will then allow a near-online analysis of a job.
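As a worked example of the normalization used above (assuming that the runtime scales linearly with the number of processed jobs, which is our simplification), let $t$ be the measured runtime for $N$ jobs:
\[
  t_{100\mathrm{k}} = t \cdot \frac{100{,}000}{N} .
\]
With the reported $t_{100\mathrm{k}} \approx 41\,\mathrm{s}$ for BIN\_all and $N \approx 500\mathrm{k}$ jobs, a full sequential pass takes about $5 \cdot 41\,\mathrm{s} = 205\,\mathrm{s}$.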
\jk{To analyze KS jobs}
\jk{To update the figure to use KS and (maybe to aggregate job profiles)? Problem old files are gone}
\eb{old files recovered + KS numbers are in Git-Repo}
\begin{figure} \begin{figure}
\centering \centering
\begin{subfigure}{0.31\textwidth} \begin{subfigure}{0.31\textwidth}
@ -346,24 +358,24 @@ We believe this will then allow a near-online analysis of a job.
\subsection{Quantitative Analysis} \subsection{Quantitative Analysis}
In the quantitative analysis, we explore the different algorithms how the similarity of our pool of jobs behaves to our three reference jobs (Job-S, Job-M, and Job-L). In the quantitative analysis, we explore how the similarity of our pool of jobs to our reference jobs behaves for the different algorithms.
The cumulative distribution of similarity to the reference jobs is shown in \Cref{fig:ecdf}. The cumulative distribution of similarity to a reference job is shown in \Cref{fig:ecdf}.
For example, in \Cref{fig:ecdf-job-S}, we see that about 70\% of the jobs have a similarity of less than 10\% to Job-S for HEX\_native. For example, in \Cref{fig:ecdf-job-S}, we see that about 70\% of the jobs have a similarity of less than 10\% to Job-S for HEX\_native.
BIN\_aggzeros shows some steep increases, e.g., more than 75\% of jobs have the same low similarity below 2\%. BIN\_aggzeros shows some steep increases, e.g., more than 75\% of jobs have the same low similarity below 2\%.
The different algorithms lead to different curves for our reference jobs, e.g., for Job-S, HEX\_phases bundles more jobs with low similarity compared to the other jobs; in Job-L, it is the slowest. The different algorithms lead to different curves for our reference jobs, e.g., for Job-S, HEX\_phases bundles more jobs with low similarity compared to the other jobs; in Job-L, it is the slowest.
% This indicates that the algorithms % This indicates that the algorithms
The support team in a data center may have time to investigate the most similar jobs. The support team in a data center may have time to investigate the most similar jobs.
Time for the analysis is typically bound, for instance, the team may analyze the 100 most similar ranked jobs; we refer to them as the Top\,100 jobs, and Rank\,i refers to the job that has the i-th highest similarity to the reference job -- sometimes these values can be rather close together as we see in the following histogram. Time for the analysis is typically bounded; for instance, the team may analyze the 100 most similar jobs; we refer to them as the Top\,100 jobs, and \textit{Rank\,i} refers to the job that has the i-th highest similarity to the reference job -- sometimes these values can be rather close together as we see in the following histogram.
In \Cref{fig:hist}, the histograms with the actual number of jobs for a given similarity are shown. In \Cref{fig:hist}, the histograms with the actual number of jobs for a given similarity are shown.
As we focus on a feasible number of jobs, the diagram should be read from the right (100\% similarity) to left; and for a bin we show at most 100 jobs (total number is still given). As we focus on a feasible number of jobs, the diagram should be read from the right (100\% similarity) to left; and for a bin we show at most 100 jobs (total number is still given).
It turns out that both BIN algorithms produce nearly identical histograms and we omit one of them. It turns out that both BIN algorithms produce nearly identical histograms and we omit one of them.
In the figures, we can see again a different behavior of the algorithms depending on the reference job. In the figures, we can see again a different behavior of the algorithms depending on the reference job.
Especially for Job-S, we can see clusters with jobs of higher similarity (e.g., at hex\_lev at SIM=75\%) while for Job-M, the growth in the relevant section is more steady. Especially for Job-S, we can see clusters with jobs of higher similarity (e.g., at HEX\_lev at SIM=75\%) while for Job-M, the growth in the relevant section is more steady.
For Job-L, we find barely similar jobs, except when using the HEX\_phases and ks algorithms. For Job-L, we barely find similar jobs, except when using the HEX\_phases and KS algorithms.
HEX\_phases find 393 jobs that have a similarity of 100\%, thus they are indistinguishable, while ks identifies 6880 jobs with a similarity of at least 97.5\%. HEX\_phases finds 393 jobs that have a similarity of 100\%, thus they are indistinguishable, while KS identifies 6880 jobs with a similarity of at least 97.5\%.
Practically, the support team would start with Rank\,1 (most similar job, presumably, the reference job itself) and walk down until the jobs look different, or until a cluster is analyzed. Practically, the support team would start with Rank\,1 (most similar job, presumably, the reference job itself) and walk down until the jobs look different, or until a cluster of jobs with close similarity is analyzed.
\begin{figure} \begin{figure}
@ -435,10 +447,9 @@ To confirm the hypotheses presented, we analyzed the job metadata comparing job
\paragraph{User distribution.} \paragraph{User distribution.}
To understand how the Top\,100 are distributed across users, the data is grouped by userid and counted. To understand how the Top\,100 are distributed across users, the data is grouped by userid and counted.
\Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs. \Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs.
For Job-S, we can see that about 70-80\% of jobs stem from one user, for the hex\_lev and hex\_native algorithms, the other jobs stem from a second user while bin includes jobs from additional users (5 in total). For Job-S, we can see that about 70-80\% of jobs stem from one user; for the HEX\_lev and HEX\_native algorithms, the other jobs stem from a second user, while bin includes jobs from additional users (5 in total).
For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user, here, hex\_lev, hex\_native, and ks is including more users (29, 33, and 37, respectively) than the other three algorithms. For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user; here, HEX\_lev, HEX\_native, and KS include more users (29, 33, and 37, respectively) than the other three algorithms.
For Job-L, the two hex algorithms include with (12 and 13) a bit more diverse user community than the bin algorithms (9) but hex\_phases cover 35 users. For Job-L, the two hex algorithms include a slightly more diverse user community (12 and 13 users) than the bin algorithms (9), but HEX\_phases covers 35 users.
We didn't include the group analysis in the figure as the user count and the group count are proportional; at most, the number of users is 2x the number of groups. We didn't include the group analysis in the figure as the user count and the group count are proportional; at most, the number of users is 2x the number of groups.
Thus, a user is likely from the same group and the number of groups is similar to the number of unique users. Thus, a user is likely from the same group and the number of groups is similar to the number of unique users.
@ -447,13 +458,13 @@ All algorithms reduce over the node dimensions, therefore, we naturally expect a
\Cref{fig:nodes-job} shows a boxplot for the node counts in the Top\,100 -- the red line marks the reference job. \Cref{fig:nodes-job} shows a boxplot for the node counts in the Top\,100 -- the red line marks the reference job.
For Job-M and Job-L, we can observe that indeed the range of similar nodes is between 1 and 128. For Job-M and Job-L, we can observe that indeed the range of similar nodes is between 1 and 128.
For Job-S, all 100 top-ranked jobs use one node. For Job-S, all 100 top-ranked jobs use one node.
As post-processing jobs use typically one node and the number of postprocessing jobs is a high proportion, it appears natural that all Top\,100 are from this class of jobs which is confirmed by investigating the job metadata. As post-processing jobs typically use one node and make up a high proportion of all jobs, it appears natural that all Top\,100 are from this class of jobs, which is confirmed by investigating the job metadata.
The boxplots have different shapes, which indicates that the different algorithms identify different sets of jobs -- we analyze this further later. The boxplots have different shapes, which indicates that the different algorithms identify different sets of jobs -- we analyze this further later.
\paragraph{Runtime distribution.} \paragraph{Runtime distribution.}
The job runtime of the Top\,100 jobs is shown using boxplots in \Cref{fig:runtime-job}. The job runtime of the Top\,100 jobs is shown using boxplots in \Cref{fig:runtime-job}.
While all algorithms can compute the similarity between jobs of different length, the bin algorithms and hex\_native penalize jobs of different length preferring jobs of very similar length. While all algorithms can compute the similarity between jobs of different length, the bin algorithms and HEX\_native penalize jobs of different length preferring jobs of very similar length.
For Job-M and Job-L, hex\_phases and ks are able to identify much shorter or longer jobs. For Job-M and Job-L, HEX\_phases and KS are able to identify much shorter or longer jobs.
For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:hist-job-L}, 393 jobs have a similarity of 100\%) which is the reason why the job runtime isn't shown in the figure itself. For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:hist-job-L}, 393 jobs have a similarity of 100\%) which is the reason why the job runtime isn't shown in the figure itself.
\begin{figure} \begin{figure}
@ -522,11 +533,11 @@ For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:h
\subsubsection{Algorithmic differences} \subsubsection{Algorithmic differences}
To verify that the different algorithms behave differently, the intersection for the Top\,100 is computed for all combinations of algorithms and visualized in \Cref{fig:heatmap-job}. To verify that the different algorithms behave differently, the intersection for the Top\,100 is computed for all combinations of algorithms and visualized in \Cref{fig:heatmap-job}.
Bin\_all and bin\_aggzeros overlap with at least 99 ranks for all three jobs. BIN\_all and BIN\_aggzeros overlap in at least 99 ranks for all three jobs.
While there is some reordering, both algorithms lead to a comparable set. While there is some reordering, both algorithms lead to a comparable set.
All algorithms have a significant overlap for Job-S. All algorithms have a significant overlap for Job-S.
For Job\-M, however, they lead to a different ranking, and Top\,100, particularly ks determines a different set. For Job-M, however, they lead to a different ranking and Top\,100; particularly KS determines a different set.
Generally, hex\_lev and Hex\_native are generating more similar results than other algorithms. Generally, HEX\_lev and HEX\_native generate more similar results than the other algorithms.
From this analysis, we conclude that one representative from binary quantization is sufficient as it generates very similar results while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be analyzed individually. From this analysis, we conclude that one representative from binary quantization is sufficient as it generates very similar results while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be analyzed individually.
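The overlap used in this comparison can be expressed as follows (our notation, given only as a sketch): with $T_{A}$ the set of Top\,100 jobs found by algorithm $A$,
\[
  O(A, B) = \bigl| T_{A} \cap T_{B} \bigr| , \qquad 0 \le O(A, B) \le 100 .
\]
For example, $O(\mathrm{BIN\_all}, \mathrm{BIN\_aggzeros}) \ge 99$ for all three reference jobs, whereas the ordering within the two sets may still differ.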
@ -568,8 +579,8 @@ It is executed for different simulations and variables across timesteps.
The job name of Job-S suggests that it is applied to the control variable. The job name of Job-S suggests that it is applied to the control variable.
In the metadata, we found 22,580 jobs with “cmor” in the name of which 367 jobs mention “control”. In the metadata, we found 22,580 jobs with “cmor” in the name of which 367 jobs mention “control”.
The bin and ks algorithms identify one job which name doesn't include “cmor”, The bin and KS algorithms identify one job whose name doesn't include “cmor”.
All other algorithms identify only “cmor” jobs and 26-38 of these jobs are applied to “control” (see \Cref{tbl:control-jobs}) -- only the ks algorithm doesn't identify any job with control. All other algorithms identify only “cmor” jobs and 26-38 of these jobs are applied to “control” (see \Cref{tbl:control-jobs}) -- only the KS algorithm doesn't identify any job with control.
A selection of job timelines is given in \Cref{fig:job-S-hex-lev}; all of these jobs are jobs on control variables. A selection of job timelines is given in \Cref{fig:job-S-hex-lev}; all of these jobs are jobs on control variables.
The single non-cmor job and a high-ranked non-control cmor job are shown in \Cref{fig:job-S-bin-agg}. The single non-cmor job and a high-ranked non-control cmor job are shown in \Cref{fig:job-S-bin-agg}.
While we cannot visually see many differences between these two jobs and the cmor jobs processing the control variables, the algorithms indicate that jobs processing the control variables must be more similar as they appear much more frequently in the Top\,100 jobs than in all jobs labeled with “cmor”. While we cannot visually see many differences between these two jobs and the cmor jobs processing the control variables, the algorithms indicate that jobs processing the control variables must be more similar as they appear much more frequently in the Top\,100 jobs than in all jobs labeled with “cmor”.
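The last argument can be made concrete with the numbers above. Among all jobs with “cmor” in the name, the share that also mentions “control” is
\[
  \frac{367}{22{,}580} \approx 1.6\% ,
\]
whereas, except for KS, 26 to 38 of the Top\,100 jobs (see \Cref{tbl:control-jobs}) mention “control”, i.e., 26\% to 38\%. Under the simplifying assumption that the Top\,100 would otherwise resemble a random sample of cmor jobs, control jobs are thus over-represented by a factor of roughly 16 to 23.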
@ -579,18 +590,18 @@ For Job-S, we found that all algorithms work well and, therefore, omit further t
\begin{table}
\centering
\begin{tabular}{r|r|r|r|r|r}
BIN\_aggzeros & BIN\_all & HEX\_lev & HEX\_native & HEX\_phases & KS\\ \hline
38 & 38 & 33 & 26 & 33 & 0
\end{tabular}
%\begin{tabular}{r|r}
% Algorithm & Jobs \\ \hline
% BIN\_aggzeros & 38 \\
% BIN\_all & 38 \\
% HEX\_lev & 33 \\
% HEX\_native & 26 \\
% HEX\_phases & 33 \\
% KS & 0
%\end{tabular}
\caption{Job-S: number of jobs with “control” in their name in the Top\,100}
\label{tbl:control-jobs}
@ -609,7 +620,7 @@ For Job-S, we found that all algorithms work well and, therefore, omit further t
\caption{Non-control job: Rank\,4, SIM=81\%}
\end{subfigure}
\caption{Job-S: jobs with different job names when using BIN\_aggzeros}
\label{fig:job-S-bin-agg}
\end{figure}
@ -674,7 +685,7 @@ For Job-S, we found that all algorithms work well and, therefore, omit further t
% \includegraphics[width=\textwidth]{job_similarities_4296426-out/bin_aggzeros-0.6923--99timeseries4687419}
% \caption{Rank\,100, SIM=}
% \end{subfigure}
% \caption{Job-S with BIN\_aggzeros, selection of similar jobs}
% \label{fig:job-S-bin-aggzeros}
% \end{figure}
@ -685,8 +696,8 @@ Inspecting the Top\,100 for this reference job is highlighting the differences b
All algorithms identify a diverse range of job names for this reference job in the Top\,100.
Firstly, the name of the reference job appears 30 times in the whole dataset, so this job type isn't necessarily executed frequently and our Top\,100 is therefore expected to contain other names.
Some applications are more prominent in these sets, e.g., for BIN\_aggzeros, 32\,jobs contain WRF (a model) in the name.
The number of unique names is 19, 38, 49, and 51 for BIN\_aggzeros, HEX\_phases, HEX\_native, and HEX\_lev, respectively.
The jobs that are similar according to the bin algorithms differ from our expectations.
@ -736,7 +747,7 @@ The jobs that are similar according to the bin algorithms differ from our expect
\caption{Rank\,100, SIM=70\%}
\end{subfigure}
\caption{Job-M with HEX\_lev, selection of similar jobs}
\label{fig:job-M-hex-lev}
\end{figure}
@ -763,7 +774,7 @@ The jobs that are similar according to the bin algorithms differ from our expect
\caption{Rank\,100, SIM=88\%}
\end{subfigure}
\caption{Job-M with HEX\_native, selection of similar jobs}
\label{fig:job-M-hex-native}
\end{figure}
@ -789,15 +800,15 @@ The jobs that are similar according to the bin algorithms differ from our expect
\caption{Rank\,100, SIM=24\%}
\end{subfigure}
\caption{Job-M with HEX\_phases, selection of similar jobs}
\label{fig:job-M-hex-phases}
\end{figure}
\subsection{Job-L}
For the bin algorithms, the inspection of job names (14 unique names) reveals two prominent applications: bash and xmessy with 45 and 48 instances, respectively.
The hex algorithms identify a more diverse set of applications (18 unique names and no xmessy job), and the HEX\_phases algorithm yields 85 unique names.
The KS algorithm finds 71 jobs whose names end with t127, which is a typical model configuration.
\begin{figure}
\begin{subfigure}{0.3\textwidth}
@ -820,7 +831,7 @@ The ks algorithm finds 71 jobs ending with t127, which is a typical model config
\caption{Rank\,100, SIM=11\%}
\end{subfigure}
\caption{Job-L with BIN\_aggzeros, selection of similar jobs}
\label{fig:job-L-bin-aggzero}
\end{figure}
@ -846,7 +857,7 @@ The ks algorithm finds 71 jobs ending with t127, which is a typical model config
\caption{Rank\,100, SIM=17\%}
\end{subfigure}
\caption{Job-L with HEX\_lev, selection of similar jobs}
\label{fig:job-L-hex-lev}
\end{figure}
@ -872,7 +883,7 @@ The ks algorithm finds 71 jobs ending with t127, which is a typical model config
\caption{Rank\,100, SIM=17\%}
\end{subfigure}
\caption{Job-L with HEX\_native, selection of similar jobs}
\label{fig:job-L-hex-native}
\end{figure}
@ -897,7 +908,7 @@ The ks algorithm finds 71 jobs ending with t127, which is a typical model config
\caption{Rank\,100, SIM=100\%}
\end{subfigure}
\caption{Job-L with HEX\_phases, selection of similar jobs}
\label{fig:job-L-hex-phases}
\end{figure}
@ -910,7 +921,7 @@ The ks algorithm finds 71 jobs ending with t127, which is a typical model config
One consideration could be to identify jobs that are found by all algorithms, i.e., jobs that meet a certain (rank) threshold for different algorithms.
That would increase the likelihood that these jobs are very similar and what the user is looking for.
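As a minimal sketch of this idea, using notation that is introduced here only for illustration: let $A$ denote the set of algorithms, $\mathrm{rank}_a(j)$ the rank of job $j$ in the similarity ordering that algorithm $a \in A$ produces for the reference job, and $k$ the chosen rank threshold (e.g., $k = 100$ for the Top\,100).
The jobs found by all algorithms would then be
\[
  C_k = \bigcap_{a \in A} \{\, j \mid \mathrm{rank}_a(j) \le k \,\}.
\]
Restricting the intersection to a subset $A' \subseteq A$ can only enlarge $C_k$, so the choice of algorithms trades the size of the result set against the strength of the agreement.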
The ks algorithm finds jobs with similar histograms which are not necessarily what we are looking for. The KS algorithm finds jobs with similar histograms which are not necessarily what we are looking for.
%\printbibliography %\printbibliography
\end{document} \end{document}