diff --git a/lncs-from-jhps/assets/fig/progress_5024292-out-boxplot.png b/lncs-from-jhps/assets/fig/progress_5024292-out-boxplot.png index 8a9716e..bf4fd8e 100644 Binary files a/lncs-from-jhps/assets/fig/progress_5024292-out-boxplot.png and b/lncs-from-jhps/assets/fig/progress_5024292-out-boxplot.png differ diff --git a/lncs-from-jhps/assets/job_similarities_5024292-out/hist-sim.png b/lncs-from-jhps/assets/job_similarities_5024292-out/hist-sim.png index bf26ff3..08ad619 100644 Binary files a/lncs-from-jhps/assets/job_similarities_5024292-out/hist-sim.png and b/lncs-from-jhps/assets/job_similarities_5024292-out/hist-sim.png differ diff --git a/lncs-from-jhps/assets/job_similarities_5024292-out/jobs-elapsed.png b/lncs-from-jhps/assets/job_similarities_5024292-out/jobs-elapsed.png index 4acaa06..1796425 100644 Binary files a/lncs-from-jhps/assets/job_similarities_5024292-out/jobs-elapsed.png and b/lncs-from-jhps/assets/job_similarities_5024292-out/jobs-elapsed.png differ diff --git a/lncs-from-jhps/assets/job_similarities_5024292-out/jobs-nodes.png b/lncs-from-jhps/assets/job_similarities_5024292-out/jobs-nodes.png index 4dd83e2..9c266b3 100644 Binary files a/lncs-from-jhps/assets/job_similarities_5024292-out/jobs-nodes.png and b/lncs-from-jhps/assets/job_similarities_5024292-out/jobs-nodes.png differ diff --git a/lncs-from-jhps/assets/job_similarities_5024292-out/user-ids.png b/lncs-from-jhps/assets/job_similarities_5024292-out/user-ids.png index 8ffcd52..059a672 100644 Binary files a/lncs-from-jhps/assets/job_similarities_5024292-out/user-ids.png and b/lncs-from-jhps/assets/job_similarities_5024292-out/user-ids.png differ diff --git a/lncs-from-jhps/main.tex b/lncs-from-jhps/main.tex index 6131211..ff88c71 100644 --- a/lncs-from-jhps/main.tex +++ b/lncs-from-jhps/main.tex @@ -35,7 +35,7 @@ \begin{document} -\title{A Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis} +\title{Toward a Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis} \author{Julian Kunkel\inst{1} \and Eugen Betke\inst{2}} \institute{\vspace*{-1cm}} @@ -59,12 +59,12 @@ This allows staff to understand the usage of the exhibited behavior better and t \medskip -In this article, our goal is to identify jobs similar to an arbitrary reference job. In particular, we describe a methodology that utilizes temporal I/O similarity to identify jobs related to the reference job. -Practically, we apply several previously developed time series algorithms and also utilize the Kolmogorov-Smirnov-Test to compare the distribution of the metrics. -A study is conducted to explore the effectiveness of the approach by investigating related jobs for three reference jobs. -The data stem from DKRZ's supercomputer Mistral and include more than 500,000 jobs that have been executed for more than 6 months of operation. Our analysis shows that the strategy and algorithms are effective to identify similar jobs and reveal interesting patterns in the data. -It also shows the need for the community to jointly define the semantics of similarity depending on the analysis purpose. -%203 days. +In this article, our goal is to identify jobs similar to an arbitrary reference job. +In particular, we sketch a methodology that utilizes temporal I/O similarity to identify jobs related to the reference job. +Practically, we apply several previously developed time series algorithms. +A study is conducted to explore the effectiveness of the approach by investigating related jobs for a reference job. 
+The data stem from DKRZ's supercomputer Mistral and include more than 500,000 jobs that have been executed for more than 6 months of operation.
+Our analysis indicates that the strategy and algorithms have the potential to identify similar jobs, but more testing is necessary.
 \end{abstract}
@@ -99,9 +99,8 @@ In this paper, we refine these algorithms slightly, include another algorithm, a
 We start by introducing related work in \Cref{sec:relwork}.
 In \Cref{sec:methodology}, we describe briefly the data reduction and the algorithms for similarity analysis.
-We also utilize the Kolmogorov-Smirnov-Test to illustrate the benefits and drawbacks of the different methods.
-Then, we perform our study by applying the methodology to three reference jobs with different behavior, therewith, assessing the effectiveness of the approach to identify similar jobs.
-In \Cref{sec:evaluation}, the reference jobs are introduced and quantitative analysis of the job pool is made based on job similarity.
+Then, we perform our study by applying the methodology to a reference job, thereby providing an indicator of the effectiveness of the approach to identify similar jobs.
+In \Cref{sec:evaluation}, the reference job is introduced and a quantitative analysis of the job pool is made based on job similarity.
 In \Cref{sec:timelines}, the 100 most similar jobs are investigated in more detail, and selected timelines are presented.
 The paper is concluded in \Cref{sec:summary}.
@@ -131,10 +130,9 @@ There are various tools for analyzing the I/O behavior of an application \cite{T
 For Vampir, a popular tool for trace file analysis, in \cite{weber2017visual} the Comparison View is introduced that allows them to manually compare traces of application runs, e.g., to compare optimized with original code.
 Vampir generally supports the clustering of process timelines of a single job, allowing to focus on relevant code sections and processes when investigating many processes.
-Chameleon \cite{bahmani2018chameleon} extends ScalaTrace for recording MPI traces but reduces the overhead by clustering processes and collecting information from one representative of each cluster.
-For the clustering, a signature is created for each process that includes the call-graph.
+%Chameleon \cite{bahmani2018chameleon} extends ScalaTrace for recording MPI traces but reduces the overhead by clustering processes and collecting information from one representative of each cluster.
+%For the clustering, a signature is created for each process that includes the call-graph.
 In \cite{halawa2020unsupervised}, 11 performance metrics including CPU and network are utilized for agglomerative clustering of jobs showing the general effectiveness of the approach.
-
 In \cite{rodrigo2018towards}, a characterization of the NERSC workload is performed based on job scheduler information (profiles).
 Profiles that include the MPI activities have shown effective to identify the code that is executed \cite{demasi2013identifying}.
 Many approaches for clustering applications operate on profiles for compute, network, and I/O \cite{emeras2015evalix,liu2020characterization,bang2020hpc}.
@@ -165,37 +163,6 @@ This strategy ensures that the same approach can be applied to other HPC systems
 After the mean value across nodes is computed for a segment, the resulting numeric value is encoded either using binary (I/O activity on the segment: yes/no) or hexadecimal representation (quantizing the numerical performance value into 0-15) which is then ready for similarity analysis.
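+For illustration, the following minimal Python sketch of our own shows the two encodings for one metric, assuming per-segment mean values in the range between 0 and 4 (the function names are illustrative and not part of the monitoring stack):
+
+\begin{lstlisting}
+def encode_hex(seg_means):
+    # quantize a mean value in [0,4] into 16 levels: 0=[0,0.25), ..., F=[3.75,4]
+    return [format(min(int(v * 4), 15), 'X') for v in seg_means]
+
+def encode_binary(seg_means):
+    # 1 if the segment shows any I/O activity, else 0
+    return [1 if v > 0 else 0 for v in seg_means]
+
+encode_hex([0.0, 0.5, 2.0, 3.9, 4.0])     # ['0', '2', '8', 'F', 'F']
+encode_binary([0.0, 0.5, 2.0, 3.9, 4.0])  # [0, 1, 1, 1, 1]
+\end{lstlisting}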
By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero, the dataset is reduced from 1 million jobs to about 580k jobs. -\paragraph{Binary coding} -Binary coding is a data reduction technique that represents monitoring data as a sequence of numbers, where each number stands for a specific I/O behavior. -It is applied on the hypercube, produced by the previous pre-processing steps (segmentation and categorization). -This hypercube has four dimensions (nodes, metrics, file systems and time) and is filled with categories. -The data is reduced in two steps. -First, reduction of two dimensions (nodes and file systems) and aggregation by the sum() function creates a new two-dimensional data structure (metrics and time) filled with integer values. -Second, reduction of a dimension (metrics) and mapping of integer values to a unique number (behavior mapping) creates a one dimensional data structure (time) filled with I/O behavior. -Behavior mapping is a function that maps 9 metrics to 9-bit numbers where each bit represents activity of a metric. -If the value of a metric is LowIO then the bit is set to 0 (compute intense state), else (for values HighIO and CriticalIO) it is set to 1 (IO intense state). -The following example shows a 15 segment long binary coding: - -\begin{lstlisting} -jobA (after coding): [1:5:0:0:0:0:0:0:96:96:96:96:96:96:96], 'length':15 -\end{lstlisting} - -\paragraph{Hexadecimal coding} -Hexadecimal coding is a data reduction technique, that can be applied to the hypercube to obtain further data reduction. -After the first two pre-processing steps (segmentation and categorization), a four dimensional hypercube (nodes, metrics, file systems and time) stores performance categories. -Hexadecimal coding operates in two steps. -First, it aggregates two dimensions (nodes and file systems) by the mean() function. -What remains are two dimensions (metrics and time) and the mean values between 0 an 4. -Second, the mean values are quantized to 16 levels -- 0 = [0,0.25), 1 = [0.25,0.5), $ \ldots $ , F = [3.75, 4]. -The following example shows a five segment long hexadecimal coding: - -\begin{lstlisting} -jobB: 'length': 6, 'coding': - 'metric_read' : [0:2:2:2:9], - 'metric_write' : [0:0:0:0:0], - ..., - 'metric_md_other': [0:0:0:F:F] -\end{lstlisting} \subsection{Algorithms for Computing Similarity} We reuse the B and Q algorithms developed in~\cite{Eugen20HPS}: B-all, B-aggz(eros), Q-native, Q-lev, and Q-phases. @@ -207,30 +174,6 @@ Q-native uses a performance-aware similarity function, i.e., the distance betwee There are various options for how a longer job is embedded in a shorter job, for example, a larger input file may stretch the length of the I/O and compute phases; another option can be that more (model) time is simulated. In this article, we consider these different behavioral patterns and attempt to identify situations where the I/O pattern of a long job is contained in a shorter job. Therefore, for jobs with different lengths, a sliding-windows approach is applied which finds the location for the shorter job in the long job with the highest similarity. Q-phases extract phase information and performs a phase-aware and performance-aware similarity computation. The Q-phases algorithm extracts I/O phases from our 10-minute segments and computes the similarity between the most similar I/O phases of both jobs. 
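+To illustrate how a shorter job is embedded in a longer one, the following sketch shows the sliding-window principle; this is our own simplified Python illustration, where \texttt{segment\_sim} stands for any per-segment similarity function (e.g., the performance-aware one of Q-native) and not for the exact implementation of~\cite{Eugen20HPS}:
+
+\begin{lstlisting}
+def sliding_similarity(a, b, segment_sim):
+    # slide the shorter coding over the longer one and keep the best match
+    short, long = (a, b) if len(a) <= len(b) else (b, a)
+    n = len(short)
+    return max(segment_sim(short, long[i:i + n])
+               for i in range(len(long) - n + 1))
+\end{lstlisting}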
-In this paper, we add a similarity definition based on Kolmogorov-Smirnov-Test that compares the probability distribution of the observed values which we describe in the following.
-
-\paragraph{Kolmogorov-Smirnov (KS) algorithm.}
-% Summary
-For the analysis, we perform two preparation steps.
-Dimension reduction by computing means across the two file systems and by concatenating the time series data of the individual nodes (instead of averaging them).
-This reduces the four-dimensional dataset to two dimensions (time, metrics).
-% Aggregation
-The reduction of the file system dimension by the mean function ensures the time series values stay in the range between 0 and 4, independently of how many file systems are present on an HPC system.
-Unlike the previous similarity definitions, the concatenation of time series on the node dimension preserves the individual I/O information of all nodes while it still allows comparison of jobs with a different number of nodes.
-%No aggregation is performed on the metric dimension.
-
-% Filtering
-%Zero-jobs are jobs with no sign of significant I/O load and are of little interest in the analysis.
-%Their sum across all dimensions and time series is equal to zero.
-%Furthermore, we filter those jobs whose time series have less than 8 values.
-
-% Similarity
-For the analysis we use the kolmogorov-smirnov-test 1.1.0 Rust library from the official Rust Package Registry ``cargo.io''.
-The similarity function calculates the mean inverse of reject probability $p_{\text{reject}}$ computed with the ks-test across all metrics $M$: $sim = \frac{\sum_m 1 - p_{\text{reject}(m)}}{|M|}$.
-
-%\begin{equation}\label{eq:ks_similarity}
-
-
 \subsection{Methodology}
@@ -247,139 +190,48 @@ For the latter, it is interesting to investigate clusters of similar jobs, e.g.,
 For the inspection of the jobs, a user may explore the job metadata, searching for similarities, and explore the time series of a job's I/O metrics.

-\section{Reference Jobs}%
+\section{Reference Job}%
 \label{sec:refjobs}

-For this study, we chose several reference jobs with different compute and I/O characteristics:
-\begin{itemize}
-	\item Job-S: performs post-processing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so-called CMORization). The post-processing is I/O intensive.
-	\item Job-M: a typical MPI parallel 8-hour compute job on 128 nodes which write time series data after some spin up. %CHE.ws12
-	\item Job-L: a 66-hour 20-node job.
-	The initialization data is read at the beginning.
-	Then only a single master node writes constantly a small volume of data; in fact, the generated data is too small to be categorized as I/O relevant.
-\end{itemize}
+For this study, we chose the reference job called Job-M: a typical MPI parallel 8-hour compute job on 128 nodes which writes time series data after some spin-up. %CHE.ws12

-The segmented timelines of the jobs are visualized in \Cref{fig:refJobs} -- remember that the mean value is computed across all nodes on which the job ran.
+The segmented timelines of the job are visualized in \Cref{fig:refJobs} -- remember that the mean value is computed across all nodes on which the job ran.
 This coding is also used for the Q algorithms, thus this representation is what the algorithms will analyze; B algorithms merge all timelines together as described in~\cite{Eugen20HPS}.
-The figures show the values of active metrics ($\neq 0$); if few are active, then they are shown in one timeline, otherwise, they are rendered individually to provide a better overview.
+The figure shows the values of active metrics ($\neq 0$); if few are active, then they are shown in one timeline, otherwise, they are rendered individually to provide a better overview.
-For example, we can see in \Cref{fig:job-S}, that several metrics increase in Segment\,6.
+For example, we can see that several metrics increase in Segment\,12.
 We can also see an interesting result of our categorized coding, the write bytes are bigger than 0 while write calls are 0\footnote{The reason is that a few write calls transfer many bytes; less than our 90\%-quantile, therefore, write calls will be set to 0.}.
-\Cref{fig:refJobsHist} summarizes hexadecimal codings of Job-S and Job-M to histograms:
-A specific histogram contains the metric of each node at every timestep -- without being averaged across the nodes.
-Essentially, this data is used to compare jobs using Kolmogorov-Smirnov-Test.
-The metrics at Job-L are not shown as they have only a handful of instances where the value is not 0, except for write\_bytes: the first process is writing out at a low rate.
-In \Cref{fig:job-L}, the mean value is mostly rounded down to 0 except for the first segment as primarily Rank\,0 is doing I/O.

 \begin{figure}
-\begin{subfigure}{0.8\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job-timeseries4296426}
-\caption{Job-S (runtime=15,551\,s, segments=25)}\label{fig:job-S}
-\end{subfigure}
-\centering
-
-
-\begin{subfigure}{0.8\textwidth}
-\centering
 \includegraphics[width=\textwidth]{job-timeseries5024292}
-\caption{Job-M (runtime=28,828\,s, segments=48)}\label{fig:job-M}
-\end{subfigure}
-\centering
-
-
-\caption{Reference jobs: segmented timelines of mean I/O activity}%
+\caption{Segmented timelines of Job-M (runtime=28,828\,s, segments=48)}\label{fig:job-M}
 \label{fig:refJobs}
 \end{figure}

-\begin{figure}\ContinuedFloat%
-
-\begin{subfigure}{0.8\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job-timeseries7488914-30}
-\caption{Job-L (first 30 segments of 400; remaining segments are zero)}%
-\label{fig:job-L}
-\end{subfigure}
-\centering
-\caption{Reference jobs: segmented timelines of mean I/O activity}
-\end{figure}
-
-
-\begin{figure}
-\begin{subfigure}{0.49\textwidth} % TODO war 0.8
-\centering
-\includegraphics[width=\textwidth]{job-ks-0hist4296426}
-\caption{Job-S}\label{fig:job-S-hist}
-\end{subfigure}
-\centering
-\begin{subfigure}{0.49\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job-ks-1hist5024292}
-\caption{Job-M}\label{fig:job-M-hist}
-\end{subfigure}
-\centering
-
-
-\caption{Reference jobs: histogram of I/O activities}%
-\label{fig:refJobsHist}
-\end{figure}
-
-%\begin{figure}\ContinuedFloat
-%\begin{subfigure}{0.8\textwidth}
-%\centering
-%\includegraphics[width=\textwidth]{job-ks-2hist7488914}
-%\caption{Job-L}
-%\label{fig:job-L}
-%\end{subfigure}
-%\centering
-%\caption{Reference jobs: histogram of I/O activities}
-%\end{figure}
-
-
-
 \section{Evaluation}%
 \label{sec:evaluation}

-In the following, we assume a reference job is given (we use Job-S, Job-M, and Job-L) and we aim to identify similar jobs.
-For each reference job and algorithm, we created CSV files with the computed similarity to all other jobs from our job pool (worth 203 days of production of Mistral).
+In the following, we assume the reference job (Job-M) is given and we aim to identify similar jobs.
+For the reference job and each algorithm, we created CSV files with the computed similarity to all other jobs from our job pool (worth 203 days of production of Mistral).
 During this process, the runtime of the algorithm is recorded.
 Then we inspect the correlation between the similarity and number of found jobs.
 Finally, the quantitative behavior of the 100 most similar jobs is investigated.
-
 \subsection{Performance}

-To measure the performance for computing the similarity to the reference jobs, the algorithms are executed 10 times on a compute node at DKRZ which is equipped with two Intel Xeon E5-2680v3 @2.50GHz and 64GB DDR4 RAM.
+To measure the performance for computing the similarity to the reference job, the algorithms are executed 10 times on a compute node at DKRZ which is equipped with two Intel Xeon E5-2680v3 @2.50GHz and 64GB DDR4 RAM.
 A boxplot for the runtimes is shown in \Cref{fig:performance}.
 The runtime is normalized for 100k jobs, i.e., for B-all it takes about 41\,s to process 100k jobs out of the 500k total jobs that this algorithm will process.
 Generally, the B algorithms are fastest, while the Q algorithms often take 4-5x as long.
-Q\_phases is slow for Job-S and Job-M while it is fast for Job-L.
-The reason is that just one phase is extracted for Job-L.
-The Levenshtein based algorithms take longer for longer jobs -- proportional to the job length as it applies a sliding window.
-The KS algorithm is faster than the others by 10x, but it operates on the statistics of the time series.
+Q-phases and the Levenshtein-based algorithm (Q-lev) are significantly slower.
 Note that the current algorithms are sequential and executed on just one core.
 They could easily be parallelized which would then allow for an online analysis.

 \begin{figure}
 \centering
-	\begin{subfigure}{0.48\textwidth}
-	\centering
-	\includegraphics[width=\textwidth]{progress_4296426-out-boxplot}
-	\caption{Job-S (segments=25)}\label{fig:perf-job-S}
-	\end{subfigure}
-	\begin{subfigure}{0.48\textwidth}
-	\centering
-	\includegraphics[width=\textwidth]{progress_5024292-out-boxplot}
-	\caption{Job-M (segments=48)}\label{fig:perf-job-M}
-	\end{subfigure}
-	\begin{subfigure}{0.48\textwidth}
-	\centering
-	\includegraphics[width=\textwidth]{progress_7488914-out-boxplot}
-	\caption{Job-L (segments=400)}\label{fig:perf-job-L}
-	\end{subfigure}
-
-	\caption{Runtime of the algorithms to compute the similarity to reference jobs}%
+	\includegraphics[width=0.5\textwidth]{progress_5024292-out-boxplot}
+	\caption{Runtime of the algorithms to compute the similarity to our reference job}%
 \label{fig:performance}
 \end{figure}

@@ -387,72 +239,21 @@

 \subsection{Quantitative Analysis}
-In the quantitative analysis, we explore the different algorithms how the similarity of our pool of jobs behaves to our reference jobs.
+In the quantitative analysis, we explore how the similarity between our pool of jobs and the reference job behaves for the different algorithms.
-% TODO full paper
-%The cumulative distribution of similarity to a reference job is shown in %\Cref{fig:ecdf}.
-%For example, in \Cref{fig:ecdf-job-S}, we see that about 70\% have a similarity of less than 10\% to Job-S for Q-native.
-%B-aggz shows some steep increases, e.g., more than 75\% of jobs have the same low similarity below 2\%.
-%The different algorithms lead to different curves for our reference jobs, e.g., for Job-S, Q-phases bundles more jobs with low similarity compared to the other jobs; in Job-L, it is the slowest.
-% This indicates that the algorithms
 The support team in a data center may have time to investigate the most similar jobs.
 Time for the analysis is typically bound, for instance, the team may analyze the 100 most similar jobs and rank them; we refer to them as the Top\,100 jobs, and \textit{Rank\,i} refers to the job that has the i-th highest similarity to the reference job -- sometimes these values can be rather close together as we see in the histogram in \Cref{fig:hist} for the actual number of jobs with a given similarity.
 As we focus on a feasible number of jobs, we crop it at 100 jobs (total number of jobs is still given).
 It turns out that both B algorithms produce nearly identical histograms, and we omit one of them.
-In the figures, we can see again a different behavior of the algorithms depending on the reference job.
-Especially for Job-S, we can see clusters with jobs of higher similarity (e.g., at Q-lev at SIM=75\%) while for Job-M, the growth in the relevant section is more steady.
-For Job-L, we find barely similar jobs, except when using the Q-phases and KS algorithms.
-Q-phases find 393 jobs that have a similarity of 100\%, thus they are indistinguishable, while KS identifies 6880 jobs with a similarity of at least 97.5\%.
+In the figure, we can see that the algorithms behave differently.
+For B-all and Q-native, a cluster of jobs with higher similarity is visible at a similarity of 75\%.
+Generally, the growth in the relevant section is steady.
 Practically, the support team would start with Rank\,1 (most similar job, e.g., the reference job) and walk down until the jobs look different, or until a cluster of jobs with close similarity is analyzed.

-% TODO full paper?
-% \begin{figure}
-%
-% \begin{subfigure}{0.8\textwidth}
-% \centering
-% \includegraphics[width=\textwidth]{job_similarities_4296426-out/ecdf}
-% \caption{Job-S} \label{fig:ecdf-job-S}
-% \end{subfigure}
-% \centering
-%
-% \begin{subfigure}{0.8\textwidth}
-% \centering
-% \includegraphics[width=\textwidth]{job_similarities_5024292-out/ecdf}
-% \caption{Job-M} \label{fig:ecdf-job-M}
-% \end{subfigure}
-% \centering
-%
-% \begin{subfigure}{0.8\textwidth}
-% \centering
-% \includegraphics[width=\textwidth]{job_similarities_7488914-out/ecdf}
-% \caption{Job-L} \label{fig:ecdf-job-L}
-% \end{subfigure}
-% \centering
-% \caption{Quantitative job similarity -- empirical cumulative density function}
-% \label{fig:ecdf}
-% \end{figure}
-
-
 \begin{figure}
 \centering
-\begin{subfigure}{0.67\textwidth}
-\centering
-\includegraphics[width=\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_4296426-out/hist-sim}
-\caption{Job-S}\label{fig:hist-job-S}
-\end{subfigure}
+\includegraphics[width=0.9\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_5024292-out/hist-sim}

-\begin{subfigure}{0.67\textwidth}
-\centering
-\includegraphics[width=\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_5024292-out/hist-sim}
-\caption{Job-M}\label{fig:hist-job-M}
-\end{subfigure}

-\begin{subfigure}{0.67\textwidth}
-\centering
-\includegraphics[width=\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_7488914-out/hist-sim}
-\caption{Job-L}\label{fig:hist-job-L}
-\end{subfigure}
-\centering

 \caption{Histogram for the number of jobs (bin width: 2.5\%, numbers are the actual job counts).
B-aggz is nearly identical to B-all and therefore omitted.}%
 \label{fig:hist}
 \end{figure}

@@ -461,10 +262,7 @@
 When analyzing the overall population of jobs executed on a system, we expect that some workloads are executed several times (with different inputs but with the same configuration) or are executed with slightly different configurations (e.g., node counts, timesteps).
 Thus, potentially our similarity analysis of the job population may just identify the re-execution of the same workload.
-Typically, the support staff would identify the re-execution of jobs by inspecting job names which are user-defined generic strings%\footnote{%
-%As they can contain confidential data, it is difficult to anonymize them without perturbing the meaning.
-%Therefore, they are not published in our data repository.
-%}
+Typically, the support staff would identify the re-execution of jobs by inspecting job names which are user-defined generic strings.

 To understand if the analysis is inclusive and identifies different applications, we use two approaches with our Top\,100 jobs:
 We explore the distribution of users (and groups), runtime, and node count across jobs.
@@ -475,120 +273,42 @@ To confirm the hypotheses presented, we analyzed the job metadata comparing job
 \paragraph{User distribution.}
 To understand how the Top\,100 are distributed across users, the data is grouped by userid and counted.
 \Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs.
-For Job-S, we can see that about 70-80\% of jobs stem from one user, for the Q-lev and Q-native algorithms, the other jobs stem from a second user while B algorithms include jobs from additional users (5 in total).
-For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user; here, Q-lev, Q-native, and KS include more users (29, 33, and 37, respectively) than the other three algorithms.
-For Job-L, the two Q algorithms include (12 and 13) a bit more diverse user community than the B algorithms (9) but Q-phases cover 35 users.
+For the B algorithms, jobs from 13 users are included; about 25\% of jobs stem from the same user; Q-lev and Q-native include more users (29 and 33, respectively) than the other three algorithms.
 We didn't include the group analysis in the figure as user count and group id is proportional, at most the number of users is 2x the number of groups.
 Thus, a user is likely from the same group and the number of groups is similar to the number of unique users.

 \paragraph{Node distribution.}
 \Cref{fig:nodes-job} shows a boxplot for the node counts in the Top\,100 -- the red line marks the reference job.
 All algorithms reduce over the node dimensions, therefore, we naturally expect a big inclusion across the node range as long as the average I/O behavior of the jobs is similar.
-For Job-M and Job-L, we can observe that indeed the range of similar nodes is between 1 and 128.
-For Job-S, all 100 top-ranked jobs use one node.
-As post-processing jobs use typically one node and the number of post-processing jobs is a high proportion, it appears natural that all Top\,100 are from this class of jobs, which is confirmed by investigating the job metadata.
-The boxplots have different shapes which is an indication that the different algorithms identify a different set of jobs -- we will analyze this later further.
+We can observe that the range of nodes for similar jobs is between 1 and 128. \paragraph{Runtime distribution.} The job runtime of the Top\,100 jobs is shown using boxplots in \Cref{fig:runtime-job}. While all algorithms can compute the similarity between jobs of different length, the B algorithms and Q-native penalize jobs of different length preferring jobs of very similar length. -For Job-M and Job-L, Q-phases and KS are able to identify much shorter or longer jobs. -For Job-L, for Q-phases and KS, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:hist-job-L}, 393 jobs have a similarity of 100\%) which is the reason why the job runtime isn't shown in the figure itself. -Also, as there are only few jobs of similar lengths to Job-L and the B-* algorithms penalize job-length differences, the Top\,100 similar jobs have a significant difference in job length. +Q-phases is able to identify much shorter or longer jobs. \begin{figure}[bt] -\begin{subfigure}{0.48\textwidth} \centering -\includegraphics[width=\textwidth]{job_similarities_4296426-out/user-ids} -\caption{Job-S}\label{fig:users-job-S} -\end{subfigure} -\begin{subfigure}{0.48\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_5024292-out/user-ids} -\caption{Job-M}\label{fig:users-job-M} -\end{subfigure} -\begin{subfigure}{0.48\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_7488914-out/user-ids} -\caption{Job-L}\label{fig:users-job-L} -\end{subfigure} - +\includegraphics[width=0.6\textwidth]{job_similarities_5024292-out/user-ids} \caption{User information for all 100 top-ranked jobs. Each color represents a specific user for the given data.} \label{fig:userids} \end{figure} \begin{figure} -%\begin{subfigure}{0.31\textwidth} -%\centering -%\includegraphics[width=\textwidth]{job_similarities_4296426-out/jobs-nodes} -%\caption{Job-S} \label{fig:nodes-job-S} -%\end{subfigure} -\begin{subfigure}{0.48\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_5024292-out/jobs-nodes} -\caption{Job-M (ref. 
job runs on 128 nodes)}\label{fig:nodes-job-M}
-\end{subfigure}
-\begin{subfigure}{0.48\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job_similarities_7488914-out/jobs-nodes}
-\caption{Job-L (reference job runs on 20 nodes)}\label{fig:nodes-job-L}
-\end{subfigure}
-\centering
-\caption{Distribution of node counts for Top 100 (for Job-S always nodes=1)}%
-\label{fig:nodes-job}
-\end{figure}

-\begin{figure}
-\begin{subfigure}{0.7\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_4296426-out/jobs-elapsed}
-\caption{Job-S ($job=15,551s$)}\label{fig:runtime-job-S}
+\begin{subfigure}{0.47\textwidth}
+	\includegraphics[width=\textwidth]{job_similarities_5024292-out/jobs-nodes}
+	\caption{Node counts (reference job: 128 nodes)}%
+	\label{fig:nodes-job}
 \end{subfigure}
-\begin{subfigure}{0.7\textwidth}
-\centering
+\begin{subfigure}{0.47\textwidth}
 \includegraphics[width=\textwidth]{job_similarities_5024292-out/jobs-elapsed}
-\caption{Job-M ($job=28,828s$)}\label{fig:runtime-job-M}
-\end{subfigure}
-\begin{subfigure}{0.7\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job_similarities_7488914-out/jobs-elapsed}
-\caption{Job-L ($job=240ks$)}\label{fig:runtime-job-L}
-\end{subfigure}
-\centering
-\caption{Distribution of runtime for all 100 top-ranked jobs}%
+\caption{Runtime (reference job: 28,828\,s)}%
 \label{fig:runtime-job}
-\end{figure}
-
-\subsubsection{Algorithmic differences}
-To verify that the different algorithms behave differently, the intersection for the Top\,100 is computed for all combinations of algorithms and visualized in \Cref{fig:heatmap-job}.
-B-all and B-aggz overlap with at least 99 ranks for all three jobs.
-While there is some reordering, both algorithms lead to a comparable set.
-All algorithms have a significant overlap for Job-S.
-For Job-M, however, they lead to a different ranking, and Top\,100, particularly KS determines a different set.
-Generally, Q-lev and Q-native are generating more similar results than other algorithms.
-From this analysis, we conclude that one representative from B is sufficient as it generates very similar results while the other algorithms identify mostly disjoint behavioral aspects. % and, therefore, should be analyzed individually
-
-
-\begin{figure}[t]
-\begin{subfigure}{0.48\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job_similarities_4296426-out/intersection-heatmap}
-\caption{Job-S}\label{fig:heatmap-job-S}
-\end{subfigure}
-\begin{subfigure}{0.48\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job_similarities_5024292-out/intersection-heatmap}
-\caption{Job-M}\label{fig:heatmap-job-M} %,trim={2.5cm 0 0 0},clip
-\end{subfigure}
-\begin{subfigure}{0.52\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job_similarities_7488914-out/intersection-heatmap}
-\caption{Job-L}\label{fig:heatmap-job-L}
 \end{subfigure}
-\centering
-\caption{Intersection of the 100 top-ranked jobs for different algorithms}%
-\label{fig:heatmap-job}
+\caption{Distribution of node counts and runtime for the 100 top-ranked jobs}
 \end{figure}

 %%%%%%%%%%% %%%%%%%%%%% %%%%%%%%%%% %%%%%%%%%%% %%%%%%%%%%% %%%%%%%%%%% %%%%%%%%%%% %%%%%%%%%%%

@@ -602,114 +322,6 @@ These can be visually and subjectively compared to our reference jobs shown in \
 For space reasons, the included images will be scaled down making it difficult to read the text.
 However, we believe that they are still well suited for a visual inspection and comparison.

-\subsection{Job-S}
-
-This job represents post-processing (CMORization) which is a typical step.
-It is executed for different simulations and variables across timesteps. -The job name suggests that it is applied to the control variable. -In the metadata, we found 22,580 jobs with “cmor” in the name of which 367 jobs mention “control”. - -The B and KS algorithms identify one job whose name doesn't include “cmor”. -All other algorithms identify only “cmor” jobs and 26-38 of these jobs are applied to “control” (see \Cref{tbl:control-jobs}) -- only the KS algorithm doesn't identify any job with control. -A selection of job timelines on control variables is given in \Cref{fig:job-S-hex-lev}. -The single non-cmor job and a high-ranked non-control cmor job is shown in \Cref{fig:job-S-bin-agg}. -While we cannot visually see many differences between these two jobs compared to the control job, the algorithms indicate that jobs processing the control variables are more similar as they are more frequent in the Top\,100 jobs. -For Job-S, we found that all algorithms work well and, therefore, omit further timelines. - -\begin{table}[bt] -\centering -\begin{tabular}{r|r|r|r|r|r} - B-aggz & B-all & Q-lev & Q-native & Q-phases & KS\\ \hline - 38 & 38 & 33 & 26 & 33 & 0 -\end{tabular} - - \caption{Job-S: number of jobs with “control” in their name in the Top-100}% - \label{tbl:control-jobs} -\end{table} - -\begin{figure}[bt] -\centering -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_4296426-out/bin_aggzeros-0.6923--76timeseries4235560} -\caption{Non-cmor job: Rank\,76, SIM=69\%} -\end{subfigure} -\qquad -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_4296426-out/bin_aggzeros-0.8077--4timeseries4483904} -\caption{Non-control job: Rank\,4, SIM=81\%} -\end{subfigure} - -\caption{Job-S: jobs with different job names when using B-aggz}% -\label{fig:job-S-bin-agg} -\end{figure} - - -\begin{figure}[bt] -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_lev-0.9615--1timeseries4296288} -\caption{Rank 2, SIM=96\%} -\end{subfigure} -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_lev-0.9012--15timeseries4296277} -\caption{Rank 15, SIM=90\%} -\end{subfigure} -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_lev-0.7901--99timeseries4297842} -\caption{Rank\,100, SIM=79\%} -\end{subfigure} - -\caption{Job-S with Q-Lev, selection of similar jobs}% -\label{fig:job-S-hex-lev} -\end{figure} - -% \begin{figure} -% \begin{subfigure}{0.3\textwidth} -% \centering -% \includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_native-0.9808--1timeseries4296288} -% \caption{Rank 2, SIM=} -% \end{subfigure} -% \begin{subfigure}{0.3\textwidth} -% \centering -% \includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_native-0.9375--15timeseries4564296} -% \caption{Rank 15, SIM=} -% \end{subfigure} -% \begin{subfigure}{0.3\textwidth} -% \centering -% \includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_native-0.8915--99timeseries4296785} -% \caption{Rank\,100, SIM=} -% \end{subfigure} -% \caption{Job-S with Hex-Native, selection of similar jobs} -% \label{fig:job-S-hex-native} -% \end{figure} -% -% \ContinuedFloat - -% -% \begin{figure} -% \begin{subfigure}{0.3\textwidth} -% \centering -% \includegraphics[width=\textwidth]{job_similarities_4296426-out/bin_aggzeros-0.8462--1timeseries4296280} -% 
\caption{Rank 2, SIM=} -% \end{subfigure} -% \begin{subfigure}{0.3\textwidth} -% \centering -% \includegraphics[width=\textwidth]{job_similarities_4296426-out/bin_aggzeros-0.7778--14timeseries4555405} -% \caption{Rank 15, SIM=} -% \end{subfigure} -% \begin{subfigure}{0.3\textwidth} -% \centering -% \includegraphics[width=\textwidth]{job_similarities_4296426-out/bin_aggzeros-0.6923--99timeseries4687419} -% \caption{Rank\,100, SIM=} -% \end{subfigure} -% \caption{Job-S with B-aggzero, selection of similar jobs} -% \label{fig:job-S-bin-aggzeros} -% \end{figure} - \subsection{Job-M} @@ -726,28 +338,6 @@ When inspecting their timelines, the jobs that are similar according to the B al The reason lies in the definition of the B-* similarity which aggregate all I/O statistics into one timeline. The other algorithms like Q-lev (\Cref{fig:job-M-hex-lev}) and Q-native (\Cref{fig:job-M-hex-native}) seem to work as intended: While jobs exhibit short bursts of other active metrics even for low similarity, we can eyeball a relevant similarity particularly for Rank\,2 and Rank\,3 which have the high similarity of 90+\%. For Rank\,15 to Rank\,100, with around 70\% similarity, a partial match of the metrics is still given. -The KS algorithm working on the histograms ranks the jobs correctly on the similarity of their histograms. -However, as it does not deal with the length of the jobs, it may identify jobs of very different length. -In \Cref{fig:job-M-ks}, we see the 3rd ranked job whose profile is indeed quite similar but the time series differs but it is just running for 10min (1 segment) on 10\,nodes. -Remember, for the KS algorithm, we concatenate the metrics of all nodes together instead of averaging it in order to explore if node-specific information helps the similarity. - -\begin{figure}[bt] -\begin{subfigure}{0.5\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_5024292-out/ks-0.7863--ks-2hist7827264} -\caption{Histogram} -\end{subfigure} -\qquad -\begin{subfigure}{0.36\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_5024292-out/ks-0.7863--ks-2timeseries7827264} -\caption{Concatenated time series} -\end{subfigure} - -\caption{Job-M with KS, for Rank\,3, SIM=78\%}% -\label{fig:job-M-ks} -\end{figure} - @@ -853,133 +443,14 @@ Remember, for the KS algorithm, we concatenate the metrics of all nodes together % \label{fig:job-M-hex-phases} % \end{figure} -\subsection{Job-L} - -The B algorithms find a low similarity (the best 2nd ranked job is 17\% similar), the inspection of job names (14 unique names) leads to two prominent applications: bash and xmessy with 45 and 48 instances, respectively. -In \Cref{fig:job-L-bin-aggzero}, it can be seen that the found jobs have little in common with the reference job. - -The Q-lev and Q-native algorithms identify a more diverse set of applications (18 unique names and no xmessy job). -Q-native \Cref{fig:job-L-hex-native} finds long jobs with only little activity which, therefore, is similar to our reference job. -The Q-phases algorithm finds 85 unique names but as there is only one short I/O phase in the reference job, it finds many (short) jobs with 100\% similarity as seen in \Cref{fig:job-L-hex-phases}. -The KS algorithm is even more inclusive having 1285 jobs with 100\% similarity; the 100 selected ones contain 71 jobs ending with t127, which is a typical model configuration. -As expected, the histograms mimic the profile of the reference job, and thus, the algorithm does what it is expected to do. 
- - -\begin{figure}[bt] -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_7488914-out/bin_aggzeros-0.1671--1timeseries7869050} -\caption{Rank 2, SIM=17\%} -\end{subfigure} -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_7488914-out/bin_aggzeros-0.1671--2timeseries7990497} -\caption{Rank 3, SIM=17\%} -\end{subfigure} -\begin{subfigure}{0.47\textwidth} -\includegraphics[width=\textwidth]{job_similarities_7488914-out/bin_aggzeros-0.1521--14timeseries8363584} -\caption{Rank 15, SIM=15\%} -\end{subfigure} -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_7488914-out/bin_aggzeros-0.1097--97timeseries4262983} -\caption{Rank 100, SIM=11\%} -\end{subfigure} - -\caption{Job-L with B-aggzero, selection of similar jobs}% -\label{fig:job-L-bin-aggzero} -\end{figure} - -% -% \begin{figure}[bt] -% \begin{subfigure}{0.3\textwidth} -% \centering -% \includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_lev-0.9386--1timeseries7266845} -% \caption{Rank 2, SIM=94\%} -% \end{subfigure} -% \begin{subfigure}{0.3\textwidth} -% \centering -% \includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_lev-0.9375--2timeseries7214657} -% \caption{Rank 3, SIM=94\%} -% \end{subfigure} -% \begin{subfigure}{0.3\textwidth} -% \includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_lev-0.7251--14timeseries4341304} -% \caption{Rank 15, SIM=73\%} -% \end{subfigure} -% % \begin{subfigure}{0.3\textwidth} -% % \centering -% % \includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_lev-0.1657--99timeseries8036223} -% % \caption{Rank 100, SIM=17\%} -% % \end{subfigure} -% -% \caption{Job-L with Q-lev, selection of similar jobs} -% \label{fig:job-L-hex-lev} -% \end{figure} - - -\begin{figure}[bt] -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_native-0.9390--1timeseries7266845} -\caption{Rank 2, SIM=94\%} -\end{subfigure} -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_native-0.9333--2timeseries7214657} -\caption{Rank 3, SIM=93\%} -\end{subfigure} -\begin{subfigure}{0.47\textwidth} -\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_native-0.8708--14timeseries4936553} -\caption{Rank 15, SIM=87\%} -\end{subfigure} -% \begin{subfigure}{0.3\textwidth} -% \centering -% \includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_native-0.1695--99timeseries7942052} -% \caption{Rank 100, SIM=17\%} -% \end{subfigure} - -\caption{Job-L with Q-native, selection of similar jobs}% -\label{fig:job-L-hex-native} -\end{figure} - -\begin{figure}[bt] -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_phases-1.0000--14timeseries4577917} -\caption{Rank 2, SIM=100\%} -\end{subfigure} -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_phases-1.0000--1timeseries4405671} -\caption{Rank 3, SIM=100\%} -\end{subfigure} -% \begin{subfigure}{0.3\textwidth} -% \includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_phases-1.0000--2timeseries4621422} -% \caption{Rank 15, SIM=100\%} -% \end{subfigure} -\begin{subfigure}{0.47\textwidth} -\centering -\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_phases-1.0000--99timeseries4232293} -\caption{Rank 100, 
SIM=100\%}
-\end{subfigure}
-
-\caption{Job-L with Q-phases, selection of similar jobs}%
-\label{fig:job-L-hex-phases}
-\end{figure}
-
-
-
-
 \section{Conclusion}%
 \label{sec:summary}

 We conducted a study to identify similar jobs based on timelines of nine I/O statistics.
-Therefore, we applied six different algorithmic strategies developed before and included this time as well a distance metric based on the Kolmogorov-Smirnov-Test.
+Therefore, we applied five different algorithmic strategies developed in previous work.
-The quantitative analysis shows that a diverse set of results can be found and that only a tiny subset of the 500k jobs is very similar to each of the three reference jobs.
-For the small post-processing job, which is executed many times, all algorithms produce suitable results.
-For Job-M, the algorithms exhibit a different behavior.
-Job-L is tricky to analyze, because it is compute-intense with only a single I/O phase at the beginning.
+The quantitative analysis shows that a diverse set of results can be found and that only a tiny subset of the 500k jobs is very similar to the reference job.
+For our reference job, the algorithms exhibit a different behavior and identify partly different sets of similar jobs.
-Generally, the KS algorithm finds jobs with similar histograms which are not necessarily what we subjectively are looking for.
-We found that the approach to compute similarity of reference jobs to all jobs and ranking these was successful to find related jobs that we were interested in.
+We found that the approach of computing the similarity of a reference job to all jobs and ranking these was successful in finding related jobs that we were interested in.
-The Q-lev and Q-native work best according to our subjective qualitative analysis.
+The Q-lev and Q-native algorithms work best according to our subjective qualitative analysis.
 Typically, a related job stems from the same user/group and may have a related job name, but the approach was able to find other jobs as well.