Anonymization ^-^

This commit is contained in:
Julian M. Kunkel 2020-12-04 16:10:54 +00:00
parent 7be00c5a3b
commit fed2f1aa47
1 changed file with 16 additions and 15 deletions

@@ -70,16 +70,17 @@
 \crefname{codecount}{Code}{Codes}
 \title{A Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis}
-\author{Julian Kunkel\inst{2} \and Eugen Betke\inst{1}}
+%\author{Julian Kunkel\inst{2} \and Eugen Betke\inst{1}}
-\institute{
-University of Reading--%
-\email{j.m.kunkel@reading.ac.uk}%
-\and
-DKRZ --
-\email{betke@dkrz.de}%
-}
+%\institute{
+%University of Reading--%
+%\email{j.m.kunkel@reading.ac.uk}%
+%\and
+%DKRZ --
+%\email{betke@dkrz.de}%
+%}
 \begin{document}
 \maketitle
@@ -126,10 +127,10 @@ It is non-trivial to identify jobs with similar behavior from the pool of execut
 Re-executing the same job will lead to slightly different behavior; a program may be executed with different inputs or using a different configuration (e.g., number of nodes).
 Job names are defined by users; while a similar name may hint at a similar workload, finding other applications with the same I/O behavior would not be possible.
-In our previous paper \cite{Eugen20HPS}, we developed several distance measures and algorithms for the clustering of jobs based on the time series of their I/O behavior.
+In the paper \cite{Eugen20HPS}, the authors developed several distance measures and algorithms for the clustering of jobs based on the time series of their I/O behavior.
 The distance measures can be applied to jobs with different runtimes and node counts but differ in the way they define similarity.
-We showed that the metrics can be used to cluster jobs; however, it remains unclear if the method can be used by data center staff to explore jobs similar to a reference job effectively.
-In this article, we refined these distance measures slightly and apply them to rank jobs based on their similarity to a reference job.
+They showed that the metrics can be used to cluster jobs; however, it remained unclear if the method can be used by data center staff to explore jobs similar to a reference job effectively.
+In this article, we refine these distance measures slightly and apply them to rank jobs based on their similarity to a reference job.
 Therefore, we perform a study on three reference jobs with different characteristics.
 We also utilize the Kolmogorov-Smirnov test to illustrate the benefits and drawbacks of the different methods.
@@ -193,8 +194,8 @@ Therefore, we first need to define how a job's data is represented, then describ
 On the Mistral supercomputer at DKRZ, the monitoring system \cite{betke20} gathers nine I/O metrics for the two Lustre file systems in 10s intervals on all nodes, together with general job metadata from the SLURM workload manager.
 The result is 4D data (time, nodes, metrics, file system) per job.
 The distance measures should handle jobs of different lengths and node counts.
-In \cite{Eugen20HPS}, we discussed a variety of options, from 1D job profiles to data reductions, to compare time series data, and described the general workflow and pre-processing in detail.
-In a nutshell, for each job executed on Mistral, we partition it into 10-minute segments, compute the arithmetic mean of each metric, and categorize the value into non-IO (0), HighIO (1), and CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above, respectively.
+In \cite{Eugen20HPS}, the authors discussed a variety of options, from 1D job profiles to data reductions, to compare time series data, and described the general workflow and pre-processing in detail. We are using their data.
+In a nutshell, for each job executed on Mistral, they partitioned it into 10-minute segments, computed the arithmetic mean of each metric, and categorized the value into non-IO (0), HighIO (1), and CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above, respectively.
 The fixed interval of 10 minutes ensures the portability of the approach to other HPC systems.
 After the mean value across nodes is computed for a segment, the resulting numeric value is encoded either using a binary (I/O activity on the segment: yes/no) or hexadecimal representation (quantizing the numerical performance value into 0-15), which is then ready for similarity analysis.
 By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero -- we reduce the dataset from about 1 million jobs to about 580k jobs.
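The categorization and binary coding described in this hunk can be sketched as follows; the percentile thresholds below are hypothetical example values, not the actual per-metric percentiles computed on Mistral:

```python
def categorize(mean_value, p99, p999):
    """Map a segment's mean metric value to non-IO (0), HighIO (1), or CriticalIO (4)."""
    if mean_value < p99:
        return 0   # non-IO: below the 99th percentile
    if mean_value <= p999:
        return 1   # HighIO: between the 99th and 99.9th percentile
    return 4       # CriticalIO: above the 99.9th percentile

# Hypothetical segment means for one metric and illustrative thresholds
segment_means = [0.1, 5.2, 9.8, 0.0]
p99, p999 = 5.0, 9.0

codes = [categorize(v, p99, p999) for v in segment_means]
binary_coding = "".join("1" if c > 0 else "0" for c in codes)  # I/O activity: yes/no
```

The binary coding keeps only activity/no-activity per segment, while the quantized (hexadecimal) coding retains the performance level for the performance-aware measures.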
@@ -206,7 +207,7 @@ B-all determines similarity between binary codings by means of Levenshtein dista
 B-aggz is similar to B-all, but computes similarity on binary codings where subsequent segments of zero activity are replaced by just one zero.
 Q-lev determines similarity between quantized codings by using the Levenshtein distance.
 Q-native uses a performance-aware similarity function, i.e., the distance between two jobs for a metric is $\frac{|m_{job1} - m_{job2}|}{16}$.
-For jobs with different lengths, we apply a sliding-window approach which finds the location for the shorter job in the longer job with the highest similarity.
+For jobs with different lengths, a sliding-window approach is applied which finds the location for the shorter job in the longer job with the highest similarity.
 Q-phases extracts phase information and performs a phase-aware and performance-aware similarity computation.
 The Q-phases algorithm extracts I/O phases and computes the similarity between the most similar I/O phases of both jobs.
 In this paper, we add a similarity definition based on the Kolmogorov-Smirnov test that compares the probability distributions of the observed values, which we describe in the following.
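The Q-native measure and the sliding-window alignment can be illustrated with a minimal single-metric sketch; this is a simplified assumption of the approach, not the authors' implementation, which handles all metrics and nodes:

```python
def qnative_distance(a, b):
    """Q-native distance between two quantized segment values (0-15): |a - b| / 16."""
    return abs(a - b) / 16

def sliding_window_distance(short, long):
    """Slide the shorter coding over the longer one and keep the smallest
    mean per-segment distance, i.e., the location with the highest similarity."""
    assert len(short) <= len(long)
    best = float("inf")
    for offset in range(len(long) - len(short) + 1):
        d = sum(qnative_distance(s, long[offset + i])
                for i, s in enumerate(short)) / len(short)
        best = min(best, d)
    return best

# A short job whose coding appears verbatim inside a longer job has distance 0
d = sliding_window_distance([4, 15], [0, 4, 15, 0])
```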
@@ -221,7 +222,7 @@ This reduces the four-dimensional dataset to two dimensions (time, metrics).
 % Aggregation
 The reduction of the file system dimension by the mean function ensures the time series values stay in the range between 0 and 4, independently of how many file systems are present on an HPC system.
 Unlike the previous similarity definitions, the concatenation of time series on the node dimension preserves the individual I/O information of all nodes while still allowing comparison of jobs with a different number of nodes.
-We apply no aggregation function to the metric dimension.
+No aggregation is performed on the metric dimension.
 % Filtering
 %Zero-jobs are jobs with no sign of significant I/O load and are of little interest in the analysis.
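At the core of the Kolmogorov-Smirnov-based definition is the two-sample KS statistic over the observed segment values. A minimal pure-Python sketch, assuming similarity is taken as 1 minus the statistic (the paper's exact scoring may differ):

```python
def ecdf(sample, t):
    """Empirical CDF of a sample, evaluated at t."""
    return sum(1 for v in sample if v <= t) / len(sample)

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical CDFs."""
    points = sorted(set(xs) | set(ys))
    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in points)

# Hypothetical segment values (0-4 range after file-system averaging) of two jobs
job_a = [0, 0, 1, 4, 4, 0]
job_b = [0, 1, 1, 4, 0, 0]
similarity = 1 - ks_statistic(job_a, job_b)
```

Because the statistic compares value distributions rather than aligned time series, it is insensitive to the ordering of segments, which is both its main benefit and its main drawback relative to the Levenshtein-based measures.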