Eugen Betke 2020-10-20 13:00:03 +02:00
parent e9e9143250
commit 754c8b89ab
1 changed file with 19 additions and 17 deletions


@@ -112,7 +112,7 @@ Rarely, users will liaise with staff and request a performance analysis and opti
Therefore, data centers deploy monitoring systems and staff must pro-actively identify candidates for optimization.
Monitoring tools such as \cite{Grafana} and \cite{XDMod} provide various statistics and time series data for the job execution.
The support staff should focus on workloads for which optimization is beneficial, for instance, analyzing a job that is executed once on a medium number of nodes is only costing human resources and not a good return of investment.
The support staff should focus on workloads for which optimization is beneficial; for instance, \ebrep{analysis of}{analyzing} a job that is executed once on a medium number of nodes \ebrep{costs}{is only costing} human resources and \ebadd{is} not a good return on investment.
By ranking jobs based on the statistics, it isn't difficult to find a job that exhibits extensive usage of compute, network, and IO resources.
However, would it be beneficial to investigate this workload in detail and potentially optimize it?
A pattern that can be observed in many jobs bears potential, as the blueprint for optimizing one job may be applied to other jobs as well.
@@ -120,16 +120,16 @@ This is particularly true when running one application with similar inputs but a
Therefore, it is useful for support staff who investigate a resource-hungry job to identify similar jobs that are executed on the supercomputer.
In our previous paper \cite{XXX}, we developed several distance metrics and algorithms for the clustering of jobs based on the time series of their IO behavior.
The distance metrics can be applied to jobs of with different runtime and number of nodes utilized but differ in the way the define similarity.
We showed that the metrics can be used to cluster jobs, however, it remained unclear if the method can be used by data center staff to explore jobs of a reference job effectively.
The distance metrics can be applied to jobs \ebdel{of }with different runtime and number of nodes utilized but differ in the way they define similarity.
We showed that the metrics can be used to cluster jobs; however, it remain\ebrep{s}{ed} unclear if the method can be used by data center staff to explore jobs similar to a reference job effectively.
In this article, we refine these distance metrics slightly and apply them to rank jobs based on their similarity to a reference job.
To this end, we perform a study on three reference jobs with different characteristics.
We also utilize Kolmogorov-Smirnov to illustrate the benefit and drawbacks of the different methods.
We also utilize the Kolmogorov-Smirnov\ebadd{ test} to illustrate the benefits and drawbacks of the different methods.
This paper is structured as follows.
We start by introducing related work in \Cref{sec:relwork}.
%Then, in TODO we introduce the DKRZ monitoring systems and explain how I/O metrics are captured by the collectors.
In \Cref{sec:methodology} we describe briefly describe the data reduction and the machine learning approaches.
In \Cref{sec:methodology}, we briefly describe the data reduction and the machine learning approaches.
In \Cref{sec:evaluation}, we perform a study by applying the methodology to three jobs with different behavior, thereby assessing the effectiveness of the approach to identify similar jobs.
Finally, we conclude our paper in \Cref{sec:summary}.
@@ -140,14 +140,14 @@ Finally, we conclude our paper in \Cref{sec:summary}.
\label{sec:methodology}
The purpose of the methodology is to allow users and support staff to explore all executed jobs on a supercomputer in the order of their similarity to the reference job.
Therefore, we first need to define the job data, then describe the algorithms used to compute the similarity, finally the methodology to investigate jobs is described.
Therefore, we first define the job data, then describe the algorithms used to compute the similarity, and finally describe the methodology for investigating jobs.
\subsection{Job Data}
On the Mistral supercomputer at DKRZ, the monitoring system gathers nine IO metrics for the two Lustre file systems in 10s intervals on all nodes, together with general job metadata from the SLURM workload manager.
The results in 4D data (time, nodes, metrics, file system) per job.
This results in 4D data (time, nodes, metrics, file system) per job.
The distance metrics should handle jobs of different length and node count.
In \cite{TODOPaper}, we discussed a variety of options, from 1D job profiles to data reductions, to compare time series data, and described the general workflow and pre-processing in detail.
In a nutshell, for each job executed on Mistral, we partition it into 10 minute segments and compute the arithmetic mean of each metric, categorize the value into non-IO (0), HighIO (1) and CriticalIO (4) for values below 99-percentile, up to 99\.9-percentile, and above, respectively.
In a nutshell, we partition each job executed on Mistral into 10-minute segments, compute the arithmetic mean of each metric, and categorize the value into non-IO (0), HighIO (1), and CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above, respectively.
After the data is reduced across nodes, we quantize the timelines using either a binary or a hexadecimal representation, which is then ready for similarity analysis.
By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero -- we reduce the dataset from about 1 million jobs to about 580k jobs.
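To make the data reduction concrete, the following sketch illustrates the quantization of one metric of one node; it is an illustration only, not the production monitoring code, and the function name as well as the pre-computed percentile thresholds th99 and th999 are placeholders.
\begin{verbatim}
import numpy as np

def quantize_metric(samples, th99, th999, seg_len=60):
    """Reduce one metric of one node (10s samples) to 10-minute segment
    codes: non-IO (0), HighIO (1), CriticalIO (4).
    th99/th999 are the system-wide 99/99.9 percentiles (assumed given)."""
    n_seg = len(samples) // seg_len              # 60 samples * 10s = 10 min
    seg_means = np.asarray(samples[:n_seg * seg_len], dtype=float)
    seg_means = seg_means.reshape(n_seg, seg_len).mean(axis=1)
    codes = np.zeros(n_seg, dtype=np.int8)       # below 99th percentile -> 0
    codes[seg_means > th99] = 1                  # up to 99.9th percentile -> 1
    codes[seg_means > th999] = 4                 # above 99.9th percentile -> 4
    return codes
\end{verbatim}
Jobs whose codes sum to zero over all metrics, nodes, and segments are exactly the jobs removed by the pre-filtering step.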
@@ -155,10 +155,10 @@ By pre-filtering jobs with no I/O activity -- their sum across all dimensions an
In this paper, we reuse the algorithms developed in \cite{TODO}: bin\_all, bin\_aggzeros, hex\_native, hex\_lev, and hex\_quant.
They differ in the way data similarity is defined: either the binary or the hexadecimal coding is used, and the distance metric is mostly the Euclidean distance or the Levenshtein distance.
For jobs of different length, we apply a sliding-window approach which finds the location in the longer job where the shorter job matches with the highest similarity.
The hex\_quant algorithm extracts phases and matches
The hex\_quant algorithm extracts I/O phases and \ebrep{computes similarity between the most similar I/O phases of both jobs}{matches}.
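To illustrate the sliding-window idea only (the actual algorithms differ in the coding and in the per-segment distance, e.g., Euclidean or Levenshtein; \texttt{seg\_distance} is a placeholder), a sketch could look as follows.
\begin{verbatim}
def sliding_window_distance(short_seq, long_seq, seg_distance):
    """Slide the shorter sequence of segment codes over the longer one and
    return the smallest total distance, i.e., the best-matching location
    of the shorter job inside the longer job."""
    best = float("inf")
    for offset in range(len(long_seq) - len(short_seq) + 1):
        window = long_seq[offset:offset + len(short_seq)]
        d = sum(seg_distance(a, b) for a, b in zip(short_seq, window))
        best = min(best, d)
    return best

# e.g., absolute difference of the 0/1/4 codes as a simple per-segment distance:
# sliding_window_distance([0, 1, 4], [0, 0, 1, 4, 1], lambda a, b: abs(a - b))
\end{verbatim}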
\paragraph{Kolmogorov-Smirnov (ks) algorithm}
In this paper, we add a Kolmogorov-Smirnov algorithm that compares the probability distribution of the observed values which we describe in the following.
In this paper, we add a \ebrep{new similarity definition based on the Kolmogorov-Smirnov test}{Kolmogorov-Smirnov algorithm} that compares the probability distributions of the observed values, which we describe in the following.
% Summary
For the analysis, we perform two preparation steps: dimension reduction by computing the mean across the two file systems, and concatenation of the time series data of the individual nodes.
@@ -166,8 +166,8 @@ This reduces the four dimensional dataset to two dimensions (time, metrics).
% Aggregation
The reduction of the file system dimension by the mean function ensures that the time series values stay in the range between 0 and 4, independent of how many file systems are present on an HPC system.
The fixed interval of 10 min also ensure the portability of the approach to other HPC systems.
The concatenation of time series on the node dimension preserves the individual I/O information of all nodes while it allows comparison of jobs with different number of nodes.
The fixed interval of 10 \ebrep{minutes}{min} also ensures the portability of the approach to other HPC systems.
\ebrep{Unlike the previous similarity definitions, the}{The} concatenation of time series on the node dimension preserves the individual I/O information of all nodes while allowing the comparison of jobs with different numbers of nodes.
We apply no aggregation function to the metric dimension.
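A minimal sketch of this comparison using SciPy's two-sample test is shown below; mapping the test statistic to a similarity via $1 - D$ is our simplified reading for illustration and may differ from the exact definition in \Cref{eq:ks_similarity}.
\begin{verbatim}
import numpy as np
from scipy import stats

def ks_similarity(job_a, job_b):
    """Compare two jobs metric by metric with the two-sample KS test.
    job_a/job_b map a metric name to its 1D series after averaging over
    the file systems and concatenating the per-node timelines."""
    sims = []
    for metric in job_a:
        res = stats.ks_2samp(job_a[metric], job_b[metric])
        # 1 - D as a simple stand-in for distribution similarity
        sims.append(1.0 - res.statistic)
    return float(np.mean(sims))
\end{verbatim}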
% Filtering
@@ -187,20 +187,21 @@ The similarity function \Cref{eq:ks_similarity} calculates the mean inverse of r
\subsection{Methodology}
\eb{The paragraph does not read fluently}
Our strategy for locating similar jobs works as follows:
The user/support staff provides a reference job ID and algorithm to use for the similarity.
The system iterate over all jobs and computes the distance to the reference job using the algorithm.
The user/support staff provides a reference job ID and \ebrep{selects a similarity definition}{algorithm to use for the similarity}.
The system iterates over all jobs and computes the distance to the reference job using the algorithm.
Next, the system sorts the jobs based on their distance to the reference job and visualizes the cumulative job distance.
The inspection of the jobs then starts with the most similar jobs first.
\eb{The information on why support staff should search for similar jobs is still missing here. As I understand it, if a job causes problems, then similar jobs may cause similar problems as well.}
\eb{The benefit for the user is not entirely clear. Why should a user search for similar jobs?}
The user can decide on the criterion for when to stop inspecting jobs: based on the similarity, the number of investigated jobs, or the distribution of the job similarity.
For the latter, it is interesting to investigate clusters of similar jobs, e.g., if there are many jobs with 80-90\% similarity but few with 70-80\%.
For the inspection of the jobs, a user may explore the job metadata, searching for similarities, and examine the time series of a job's IO metrics.
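The workflow can be summarized by the following sketch; the function and variable names are placeholders for illustration and not the actual implementation.
\begin{verbatim}
def rank_similar_jobs(reference_id, jobs, similarity):
    """Rank all jobs by their similarity to the reference job.
    jobs maps a job ID to its pre-processed data; similarity is one of
    the similarity definitions described above (placeholder interface)."""
    ref = jobs[reference_id]
    ranking = [(job_id, similarity(ref, job))
               for job_id, job in jobs.items() if job_id != reference_id]
    # most similar jobs first; the user inspects from the top and stops
    # based on similarity, number of jobs, or the score distribution
    ranking.sort(key=lambda item: item[1], reverse=True)
    return ranking
\end{verbatim}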
\section{Evaluation}
\label{sec:evaluation}
@@ -227,7 +228,7 @@ For example, we can see in \Cref{fig:job-S}, that several metrics increase in Se
In \Cref{fig:refJobsHist}, the histograms of all job metrics are shown.
A histogram contains the activities of each node and timestep without being averaged across the nodes.
This data is used to compare jobs using Kolmogorov-Smirnov.
This data is used to compare jobs using the Kolmogorov-Smirnov\ebadd{ test}.
The metrics of Job-L are not shown as they have only a handful of instances where the value is not 0, except for write\_bytes: the first process is writing out at a low rate.
Interestingly, the aggregated pattern of Job-L in \Cref{fig:job-L} sums up to some activity at the first segment for three other metrics.
@@ -317,6 +318,7 @@ We believe this will then allow a near-online analysis of a job.
\jk{To update the figure to use KS (and maybe to aggregate job profiles)? Problem: the old files are gone}
\eb{old files recovered + KS numbers are in Git-Repo}
\begin{figure}
\centering
\begin{subfigure}{0.31\textwidth}