Paper reduction (2)
commit 94e13ed4f3
parent 7b0a4b693a

paper/main.tex | 146

@@ -72,7 +72,7 @@
 \title{A Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis}
 %\author{Julian Kunkel\inst{2} \and Eugen Betke\inst{1}}
 \author{}
-\institute{}
+\institute{\vspace*{-1cm}}
 
 %\institute{
 %University of Reading--%
@@ -96,8 +96,8 @@ This allows staff to understand the usage of the exhibited behavior better and t
 \medskip
 
 In this paper, a methodology to rank the similarity of all jobs to a reference job based on their temporal I/O behavior is described.
-Practically, we apply several previously developed time series algorithms and also utilize Kolmogorov-Smirnov-Test to compare the distribution of the statistics.
+Practically, we apply several previously developed time series algorithms and also utilize the Kolmogorov-Smirnov test to compare the distribution of the metrics.
-A study is conducted to explore the effectiveness of the approach which starts from three reference jobs and investigates related jobs.
+A study is conducted to explore the effectiveness of the approach by investigating related jobs for three reference jobs.
 The data stems from DKRZ's supercomputer Mistral and includes more than 500,000 jobs that were executed during more than 6 months of operation. Our analysis shows that the strategy and algorithms are effective at identifying similar jobs and reveal interesting patterns in the data.
 It also shows the need for the community to jointly define the semantics of similarity depending on the analysis purpose.
 %203 days.
@@ -116,33 +116,30 @@ Rarely, users will liaise with staff and request a performance analysis and opti
 Therefore, data centers deploy monitoring systems and staff must pro-actively identify candidates for optimization.
 Monitoring tools such as TACC Stats \cite{evans2014comprehensive}, Grafana \cite{chan2019resource}, and XDMod \cite{simakov2018workload} provide various statistics and time-series data for job execution.
 
-The support staff should focus on workloads for which optimization is beneficial, for instance, the analysis of a job that is executed once on a medium number of nodes costs human resources and is not a good return of investment.
+The support staff should focus on workloads for which optimization is beneficial; for instance, the analysis of a job that is executed once on 20 nodes may not be a good return on investment.
-By ranking jobs based on the statistics, it isn't difficult to find a job that exhibits extensive usage of computing, network, and IO resources.
+By ranking jobs based on their utilization, it is not difficult to find a job that exhibits extensive usage of computing, network, and IO resources.
 However, would it be beneficial to investigate this workload in detail and potentially optimize it?
-A pattern that can be observed in many jobs bears potential as the blueprint for optimizing one job may be applied to other jobs as well.
+However, a pattern that is observed in many jobs bears potential, as the blueprint for optimizing one job may be applied to other jobs as well.
 This is particularly true when running one application with similar inputs, but different applications may also lead to similar behavior.
 Knowledge about a problematic or interesting job may be transferred to similar jobs.
 Therefore, it is useful for support staff (or a user) who investigates a resource-hungry job to identify similar jobs that are executed on the supercomputer.
 
 It is non-trivial to identify jobs with similar behavior from the pool of executed jobs.
 Re-executing the same job will lead to slightly different behavior; a program may be executed with different inputs or using a different configuration (e.g., number of nodes).
-Job names are defined by users; while a similar name may hint to be a similar workload, finding other applications with the same IO behavior is would not be possible.
+Job names are defined by users; while a similar name may hint at a similar workload, finding other applications with the same IO behavior would not be possible.
 
 In the paper \cite{Eugen20HPS}, the authors developed several distance measures and algorithms for the clustering of jobs based on the time series of their IO behavior.
-The distance measures can be applied to jobs with different runtime and number of nodes utilized but differ in the way they define similarity.
+These distance measures can be applied to jobs with different runtimes and numbers of nodes but differ in the way they define similarity.
-They showed that the metrics can be used to cluster jobs, however, it remained unclear if the method can be used by data center staff to explore jobs of a reference job effectively.
+They showed that the metrics can be used to cluster jobs; however, it remained unclear whether the method can be used by data center staff to explore similar jobs effectively.
-In this article, we refine these distance measures slightly and apply them to rank jobs based on their similarity to a reference job.
+In this paper, we refine these algorithms slightly, include an additional algorithm, and apply them to rank jobs based on their similarity to a reference job.
-Therefore, we perform a study on three reference jobs with a different character.
-We also utilize Kolmogorov-Smirnov-Test to illustrate the benefit and drawbacks of the different methods.
 
-This paper is structured as follows.
 We start by introducing related work in \Cref{sec:relwork}.
-%Then, in TODO we introduce the DKRZ monitoring systems and explain how I/O metrics are captured by the collectors.
+In \Cref{sec:methodology}, we briefly describe the data reduction and the algorithms for similarity analysis.
-In \Cref{sec:methodology} we describe briefly the data reduction and the algorithms for similarity analysis.
+We also utilize the Kolmogorov-Smirnov test to illustrate the benefits and drawbacks of the different methods.
-Then, we perform our study by applying the methodology to three jobs with different behavior, therewith, assessing the effectiveness of the approach to identify similar jobs.
+Then, we perform our study by applying the methodology to three reference jobs with different behavior, thereby assessing the effectiveness of the approach to identify similar jobs.
 In \Cref{sec:evaluation}, the reference jobs are introduced and a quantitative analysis of the job pool is made based on job similarity.
 In \Cref{sec:timelines}, the 100 most similar jobs are investigated in more detail, and selected timelines are presented.
-Finally, we conclude our paper in \Cref{sec:summary}.
+The paper is concluded in \Cref{sec:summary}.
 
 \section{Related Work}
 \label{sec:relwork}
@@ -151,13 +148,14 @@ Related work can be classified into distance measures, analysis of HPC applicati
 
 %% DISTANCE MEASURES
 The ranking of similar jobs performed in this article is related to clustering strategies.
+Levenshtein (Edit) distance is a widely used distance metric indicating the number of edits needed to convert one string to another \cite{navarro2001guided}.
 The comparison of time series using various metrics has been extensively investigated.
 In \cite{khotanlou2018empirical}, an empirical comparison of distance measures for the clustering of multivariate time series is performed.
 14 similarity measures are applied to 23 data sets.
 It shows that no similarity measure produces statistically significantly better results than another.
 However, the Swale scoring model \cite{morse2007efficient} produced the most disjoint clusters.
-In this model, gaps imply a cost.
+%In this model, gaps imply a cost.
-Levenshtein distance is often referred to as Edit Distance (ED) \cite{navarro2001guided}.
 % Lock-Step Measures and Elastic Measures
 
 % Analysis of HPC application performance
@@ -182,14 +180,14 @@ For example, Evalix \cite{emeras2015evalix} monitors system statistics (from pro
 PAS2P \cite{mendez2012new} extracts the IO patterns from application traces and then allows users to manually compare them.
 In \cite{white2018automatic}, a heuristic classifier is developed that analyzes the I/O read/write throughput time series to extract the periodicity of the jobs -- similar to Fourier analysis.
 The LASSi tool \cite{AOPIUOTUNS19} periodically monitors Lustre I/O statistics and computes a ``risk'' factor to identify IO patterns that stress the file system.
-In contrast to existing work, our approach allows a user to identify similar activities based on the temporal I/O behavior recorded with a data center-wide deployed monitoring system.
+In contrast to existing work, our approach allows a user to identify similar activities based on the temporal I/O behavior recorded by a monitoring system deployed data center-wide.
 
 
 \section{Methodology}
 \label{sec:methodology}
 
 The purpose of the methodology is to allow users and support staff to explore all executed jobs on a supercomputer in order of their similarity to the reference job.
-Therefore, we first need to define how a job's data is represented, then describe the algorithms used to compute the similarity, and, finally, the methodology to investigate jobs is described.
+Therefore, we first define how a job's data is represented, then describe the algorithms used to compute the similarity, and finally the methodology to investigate jobs.
 
 \subsection{Job Data}
 On the Mistral supercomputer at DKRZ, the monitoring system \cite{betke20} gathers, in 10s intervals and on all nodes, nine IO metrics for the two Lustre file systems together with general job metadata from the SLURM workload manager.
@@ -199,14 +197,14 @@ In \cite{Eugen20HPS}, the authors discussed a variety of options from 1D job-pro
 In a nutshell, for each job executed on Mistral, they partitioned it into 10-minute segments, computed the arithmetic mean of each metric, and categorized the value into non-IO (0), HighIO (1), and CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above it, respectively.
 The fixed interval of 10 minutes ensures the portability of the approach to other HPC systems.
 After the mean value across nodes is computed for a segment, the resulting numeric value is encoded either using a binary (IO activity on the segment: yes/no) or a hexadecimal representation (quantizing the numerical performance value into 0-15), which is then ready for similarity analysis.
-By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero, we are reducing the dataset from about 1 million jobs to about 580k jobs.
+By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero -- the dataset is reduced from about 1 million jobs to about 580k jobs.
 
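To make the coding step concrete, the following minimal Python sketch illustrates the segmentation, categorization, and binary/hexadecimal encoding described above; the exact hexadecimal mapping, the percentile handling, and all names are assumptions for illustration, not the implementation used on Mistral.

import numpy as np

def encode_metric(samples, p99, p999, seg_len=60, hexadecimal=False):
    # samples: 1D array of 10s monitoring values for one metric, averaged across nodes
    # p99, p999: 99th and 99.9th percentile of this metric over the job pool (assumed given)
    # seg_len: samples per segment; 60 samples x 10s = 10 minutes
    n_seg = max(1, len(samples) // seg_len)
    codes = []
    for s in range(n_seg):
        mean = float(np.mean(samples[s * seg_len:(s + 1) * seg_len]))
        # categorize the segment mean: non-IO (0), HighIO (1), CriticalIO (4)
        cat = 0 if mean < p99 else (1 if mean < p999 else 4)
        if hexadecimal:
            # assumed mapping: scale the segment mean into 0..15 relative to the 99.9th percentile
            codes.append(format(min(15, int(15 * mean / max(p999, 1e-9))), "x"))
        else:
            codes.append("1" if cat > 0 else "0")  # binary: any IO activity in this segment?
    return "".join(codes)

def has_io_activity(job):
    # pre-filter: keep only jobs whose coded time series are not all zero
    return any(c != "0" for coding in job.values() for c in coding)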
 \subsection{Algorithms for Computing Similarity}
 We reuse the algorithms developed in \cite{Eugen20HPS}: B-all, B-aggz(eros), Q-native, Q-lev, and Q-phases.
 They differ in the way data similarity is defined: the time series is encoded either in binary or in hexadecimal quantization, and the distance measure is either the Euclidean distance or the Levenshtein distance.
 B-all determines similarity between binary codings by means of the Levenshtein distance.
 B-aggz is similar to B-all, but computes similarity on binary codings where consecutive segments of zero activity are replaced by a single zero.
-Q-lev determines similarity between quantized codings by using Levensthein distance.
+Q-lev determines similarity between quantized codings by using the Levenshtein distance.
 Q-native uses a performance-aware similarity function, i.e., the distance between two jobs for a metric is $\frac{|m_{job1} - m_{job2}|}{16}$.
 For jobs with different lengths, a sliding-window approach is applied which finds the location of the shorter job within the longer job that yields the highest similarity.
 Q-phases extracts phase information and performs a phase-aware and performance-aware similarity computation.
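As a rough illustration of these distance computations, the sketch below implements the Levenshtein distance on coded strings and a sliding-window Q-native comparison; the normalization to a similarity in [0, 1] and the single-metric handling are simplifying assumptions, only the $\frac{|m_{job1} - m_{job2}|}{16}$ term and the sliding window follow the description above.

def levenshtein(a: str, b: str) -> int:
    # classic edit distance between two coded strings (used by B-all, B-aggz, Q-lev)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def qnative_distance(seg_a: str, seg_b: str) -> float:
    # per-segment Q-native distance: mean of |a - b| / 16 over aligned hex digits
    n = min(len(seg_a), len(seg_b))
    if n == 0:
        return 0.0
    return sum(abs(int(x, 16) - int(y, 16)) / 16 for x, y in zip(seg_a[:n], seg_b[:n])) / n

def qnative_similarity(short: str, long: str) -> float:
    # slide the shorter coding over the longer one and keep the best match
    # (conversion of the distance into a 0..1 similarity is an assumption)
    if len(short) > len(long):
        short, long = long, short
    best = min(qnative_distance(short, long[off:off + len(short)])
               for off in range(len(long) - len(short) + 1))
    return 1.0 - best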
@@ -219,7 +217,6 @@ In this paper, we add a similarity definition based on Kolmogorov-Smirnov-Test t
 For the analysis, we perform two preparation steps.
 Dimension reduction is done by computing means across the two file systems and by concatenating the time series data of the individual nodes (instead of averaging them).
 This reduces the four-dimensional dataset to two dimensions (time, metrics).
 
 % Aggregation
 The reduction of the file system dimension by the mean function ensures that the time series values stay in the range between 0 and 4, independently of how many file systems are present on an HPC system.
 Unlike the previous similarity definitions, the concatenation of time series on the node dimension preserves the individual I/O information of all nodes while it still allows the comparison of jobs with a different number of nodes.
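A compact sketch of these two preparation steps, under the assumption that a job is available as a (node, file system, metric, segment) array of the categorized values; the array layout and names are illustrative only.

import numpy as np

def prepare_job(job):
    # job: array of shape (n_nodes, n_filesystems, n_metrics, n_segments) with values 0/1/4
    # 1) average across the file system dimension -> values stay within 0..4
    fs_mean = job.mean(axis=1)                      # (n_nodes, n_metrics, n_segments)
    # 2) concatenate the per-node series along the time axis instead of averaging them
    return np.concatenate([fs_mean[n] for n in range(fs_mean.shape[0])], axis=1)

# usage example: 4 nodes, 2 file systems, 9 metrics, 30 segments -> shape (9, 120)
example = np.random.choice([0, 1, 4], size=(4, 2, 9, 30))
print(prepare_job(example).shape)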
@@ -232,7 +229,7 @@ No aggregation is performed on the metric dimension.
 
 % Similarity
 For the analysis, we use the kolmogorov-smirnov-test 1.1.0 Rust library from the official Rust package registry ``crates.io''.
-The similarity function $sim = \frac{\sum_m 1 - p_{\text{reject}(m)}}{|M|}, \text{with } m \in \text{metrics}$, calculates the mean inverse of reject probability $p_{\text{reject}}$ computed with the ks-test across all metrics $m$.
+The similarity function calculates the mean inverse of the reject probability $p_{\text{reject}}$ computed with the KS-test across all metrics $M$: $sim = \frac{\sum_{m \in M} \left(1 - p_{\text{reject}}(m)\right)}{|M|}$.
 
 %\begin{equation}\label{eq:ks_similarity}
 
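The following Python sketch mirrors this similarity function using SciPy's two-sample KS test rather than the authors' Rust library; treating $p_{\text{reject}}$ as one minus the KS p-value is an assumption made for illustration.

import numpy as np
from scipy import stats

def ks_similarity(job_a, job_b):
    # job_a, job_b: dicts mapping metric name -> 1D array of per-node, per-segment values
    # sim = mean over metrics of (1 - p_reject(m)); p_reject is assumed to be 1 - p-value
    metrics = sorted(set(job_a) & set(job_b))
    sims = []
    for m in metrics:
        statistic, p_value = stats.ks_2samp(np.asarray(job_a[m]), np.asarray(job_b[m]))
        p_reject = 1.0 - p_value
        sims.append(1.0 - p_reject)
    return float(np.mean(sims)) if sims else 0.0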
@@ -269,10 +266,9 @@ The segmented timelines of the jobs are visualized in \Cref{fig:refJobs} -- reme
 This coding is also used for the Q class of algorithms, thus this representation is what the algorithms will analyze; B algorithms merge all timelines together as described in \cite{Eugen20HPS}.
 The figures show the values of active metrics ($\neq 0$); if few are active, then they are shown in one timeline, otherwise they are rendered individually to provide a better overview.
 For example, we can see in \Cref{fig:job-S} that several metrics increase in Segment\,6.
 
 In \Cref{fig:refJobsHist}, the histograms of the job metrics are shown in Q coding (16 steps).
 The histogram contains the activities of each node at every timestep -- without being averaged across the nodes.
-This data is used to compare jobs using Kolmogorov-Smirnov-Test.
+This data is used to compare jobs using the Kolmogorov-Smirnov test.
 The metrics at Job-L are not shown as they have only a handful of instances where the value is not 0, except for write\_bytes: the first process is writing out at a low rate.
 In \Cref{fig:job-L}, the mean value is mostly rounded down to 0 except for the first segment as primarily Rank\,0 is doing IO.
@@ -312,15 +308,13 @@ In \Cref{fig:job-L}, the mean value is mostly rounded down to 0 except for the f
 
 
 \begin{figure}
-\begin{subfigure}{0.8\textwidth}
+\begin{subfigure}{0.49\textwidth} % TODO was 0.8
 \centering
 \includegraphics[width=\textwidth]{job-ks-0hist4296426}
 \caption{Job-S} \label{fig:job-S-hist}
 \end{subfigure}
 \centering
+\begin{subfigure}{0.49\textwidth}
 
-\begin{subfigure}{0.8\textwidth}
 \centering
 \includegraphics[width=\textwidth]{job-ks-1hist5024292}
 \caption{Job-M} \label{fig:job-M-hist}
@@ -350,7 +344,7 @@ In \Cref{fig:job-L}, the mean value is mostly rounded down to 0 except for the f
 
 In the following, we assume a reference job is given (we use Job-S, Job-M, and Job-L) and we aim to identify similar jobs.
 For each reference job and algorithm, we created CSV files with the computed similarity to all other jobs from our job pool (worth 203 days of production of Mistral).
-During this process the runtime of the algorithm is recorded.
+During this process, the runtime of the algorithm is recorded.
 Then we inspect the correlation between the similarity and the number of found jobs.
 Finally, the quantitative behavior of the 100 most similar jobs is investigated.
 
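A minimal sketch of this ranking step, assuming a similarity(reference, job) function such as the ones sketched above; the CSV layout and all names are illustrative, not the tooling used in the study.

import csv, time

def rank_against_reference(reference, jobs, similarity, out_path, top_n=100):
    # score every job against the reference, persist all scores, return the Top-N
    start = time.time()
    scores = [(job_id, similarity(reference, job)) for job_id, job in jobs]
    runtime = time.time() - start          # runtime of the algorithm is recorded
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["job_id", "similarity"])
        writer.writerows(scores)
    top = sorted(scores, key=lambda x: x[1], reverse=True)[:top_n]
    return top, runtime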
@@ -366,8 +360,7 @@ Q\_phases is slow for Job-S and Job-M while it is fast for Job-L, the reason is
 The Levenshtein-based algorithms take longer for longer jobs -- proportional to the job length, as they apply a sliding window.
 The KS algorithm is faster than the others by 10x, but it operates on the statistics of the time series.
 Note that the current algorithms are sequential and executed on just one core.
-For computing the similarity to one (or a small set of reference jobs), they could easily be parallelized.
+They could easily be parallelized, which would then allow for an online analysis.
-We believe this will then allow an online analysis.
 
 \begin{figure}
 \centering
@@ -395,49 +388,49 @@
 \subsection{Quantitative Analysis}
 
 In the quantitative analysis, we explore for the different algorithms how similar the jobs of our pool are to our reference jobs.
-The cumulative distribution of similarity to a reference job is shown in \Cref{fig:ecdf}.
+% TODO full paper
-For example, in \Cref{fig:ecdf-job-S}, we see that about 70\% have a similarity of less than 10\% to Job-S for Q-native.
+%The cumulative distribution of similarity to a reference job is shown in %\Cref{fig:ecdf}.
-B-aggz shows some steep increases, e.g., more than 75\% of jobs have the same low similarity below 2\%.
+%For example, in \Cref{fig:ecdf-job-S}, we see that about 70\% have a similarity of less than 10\% to Job-S for Q-native.
-The different algorithms lead to different curves for our reference jobs, e.g., for Job-S, Q-phases bundles more jobs with low similarity compared to the other jobs; in Job-L, it is the slowest.
+%B-aggz shows some steep increases, e.g., more than 75\% of jobs have the same low similarity below 2\%.
+%The different algorithms lead to different curves for our reference jobs, e.g., for Job-S, Q-phases bundles more jobs with low similarity compared to the other jobs; in Job-L, it is the slowest.
 % This indicates that the algorithms
 
 The support team in a data center may have time to investigate the most similar jobs.
-Time for the analysis is typically bound, for instance, the team may analyze the 100 most similar ranked jobs; we refer to them as the Top\,100 jobs, and \textit{Rank\,i} refers to the job that has the i-th highest similarity to the reference job -- sometimes these values can be rather close together as we see in the following histogram.
+Time for the analysis is typically bound; for instance, the team may analyze the 100 most similar jobs and rank them. We refer to them as the Top\,100 jobs, and \textit{Rank\,i} refers to the job that has the i-th highest similarity to the reference job -- sometimes these values can be rather close together, as we see in the histogram in
-In \Cref{fig:hist}, the histograms with the actual number of jobs for a given similarity are shown.
+\Cref{fig:hist} for the actual number of jobs with a given similarity.
-As we focus on a feasible number of jobs, the diagram should be read from the right (100\% similarity) to left; and for a bin we show at most 100 jobs (total number is still given).
+As we focus on a feasible number of jobs, we crop it at 100 jobs (the total number of jobs is still given).
-It turns out that both BIN algorithms produce nearly identical histograms and we omit one of them.
+It turns out that both B algorithms produce nearly identical histograms, and we omit one of them.
 In the figures, we can see again a different behavior of the algorithms depending on the reference job.
 Especially for Job-S, we can see clusters with jobs of higher similarity (e.g., at Q-lev at SIM=75\%) while for Job-M, the growth in the relevant section is more steady.
 For Job-L, we find barely similar jobs, except when using the Q-phases and KS algorithms.
 Q-phases finds 393 jobs that have a similarity of 100\%, thus they are indistinguishable, while KS identifies 6880 jobs with a similarity of at least 97.5\%.
+Practically, the support team would start with Rank\,1 (most similar job, e.g., the reference job) and walk down until the jobs look different, or until a cluster of jobs with close similarity is analyzed.
 
-Practically, the support team would start with Rank\,1 (most similar job, presumably, the reference job itself) and walk down until the jobs look different, or until a cluster of jobs with close similarity is analyzed.
+% TODO full paper?
+% \begin{figure}
-\begin{figure}
+%
+% \begin{subfigure}{0.8\textwidth}
-\begin{subfigure}{0.8\textwidth}
+% \centering
-\centering
+% \includegraphics[width=\textwidth]{job_similarities_4296426-out/ecdf}
-\includegraphics[width=\textwidth]{job_similarities_4296426-out/ecdf}
+% \caption{Job-S} \label{fig:ecdf-job-S}
-\caption{Job-S} \label{fig:ecdf-job-S}
+% \end{subfigure}
-\end{subfigure}
+% \centering
-\centering
+%
+% \begin{subfigure}{0.8\textwidth}
-\begin{subfigure}{0.8\textwidth}
+% \centering
-\centering
+% \includegraphics[width=\textwidth]{job_similarities_5024292-out/ecdf}
-\includegraphics[width=\textwidth]{job_similarities_5024292-out/ecdf}
+% \caption{Job-M} \label{fig:ecdf-job-M}
-\caption{Job-M} \label{fig:ecdf-job-M}
+% \end{subfigure}
-\end{subfigure}
+% \centering
-\centering
+%
+% \begin{subfigure}{0.8\textwidth}
-\begin{subfigure}{0.8\textwidth}
+% \centering
-\centering
+% \includegraphics[width=\textwidth]{job_similarities_7488914-out/ecdf}
-\includegraphics[width=\textwidth]{job_similarities_7488914-out/ecdf}
+% \caption{Job-L} \label{fig:ecdf-job-L}
-\caption{Job-L} \label{fig:ecdf-job-L}
+% \end{subfigure}
-\end{subfigure}
+% \centering
-\centering
+% \caption{Quantitative job similarity -- empirical cumulative density function}
-\caption{Quantitative job similarity -- empirical cumulative density function}
+% \label{fig:ecdf}
-\label{fig:ecdf}
+% \end{figure}
-\end{figure}
 
 
 \begin{figure}
@@ -469,10 +462,11 @@ Practically, the support team would start with Rank\,1 (most similar job, presum
 
 When analyzing the overall population of jobs executed on a system, we expect that some workloads are executed several times (with different inputs but with the same configuration) or are executed with slightly different configurations (e.g., node counts, timesteps).
 Thus, potentially our similarity analysis of the job population may just identify the re-execution of the same workload.
-Typically, the support staff would identify the re-execution of jobs by inspecting job names which are user-defined generic strings\footnote{%
+Typically, the support staff would identify the re-execution of jobs by inspecting job names, which are user-defined generic strings.%\footnote{%
-As they can contain confidential data, it is difficult to anonymize them without perturbing the meaning.
+%As they can contain confidential data, it is difficult to anonymize them without perturbing the meaning.
-Therefore, they are not published in our data repository.
+%Therefore, they are not published in our data repository.
-}
+%}
+% TODO ANONY
 
 To understand if the analysis is inclusive and identifies different applications, we use two approaches with our Top\,100 jobs:
 We explore the distribution of users (and groups), runtime, and node count across jobs.
@@ -574,10 +568,10 @@ While there is some reordering, both algorithms lead to a comparable set.
 All algorithms have a significant overlap for Job-S.
 For Job-M, however, they lead to a different ranking and Top\,100; particularly, KS determines a different set.
 Generally, Q-lev and Q\_native generate more similar results than the other algorithms.
-From this analysis, we conclude that one representative from binarization is sufficient as it generates very similar results while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be analyzed individually.
+From this analysis, we conclude that one representative from B is sufficient as it generates very similar results, while the other algorithms identify mostly disjoint behavioral aspects. % and, therefore, should be analyzed individually
 
 
-\begin{figure}
+\begin{figure}[t]
 \begin{subfigure}{0.31\textwidth}
 \centering
 \includegraphics[width=\textwidth]{job_similarities_4296426-out/intersection-heatmap}