Continue ^-^

Julian M. Kunkel 2020-11-19 12:41:58 +00:00
parent 966715a082
commit 554944e4a0
2 changed files with 64 additions and 43 deletions


@ -186,12 +186,29 @@
}
@article{betke20,
title={The Importance of Temporal Behavior when Classifying Job IO Patterns Using Machine Learning Techniques},
author={Betke, Eugen and Kunkel, Julian}
@inproceedings{betke20,
author = {Eugen Betke and Julian Kunkel},
title = {{The Importance of Temporal Behavior when Classifying Job IO Patterns Using Machine Learning Techniques}},
year = {2020},
month = {06},
booktitle = {{High Performance Computing: ISC High Performance 2020 International Workshops, Revised Selected Papers}},
editor = {Heike Jagode and Hartwig Anzt and Guido Juckeland and Hatem Ltaief},
publisher = {Springer},
series = {Lecture Notes in Computer Science},
number = {12151},
pages = {191--205},
conference = {ISC HPC},
location = {Frankfurt, Germany},
isbn = {978-3-030-59851-8},
issn = {1611-3349},
doi = {https://doi.org/10.1007/978-3-030-59851-8_12},
abstract = {Every day, supercomputers execute 1000s of jobs with different characteristics. Data centers monitor the behavior of jobs to support the users and improve the infrastructure, for instance, by optimizing jobs or by determining guidelines for the next procurement. The classification of jobs into groups that express similar run-time behavior aids this analysis as it reduces the number of representative jobs to look into. It is state of the practice to investigate job similarity by looking into job profiles that summarize the dynamics of job execution into one dimension of statistics and neglect the temporal behavior. In this work, we utilize machine learning techniques to cluster and classify parallel jobs based on the similarity in their temporal IO behavior to highlight the importance of temporal behavior when comparing jobs. Our contribution is the qualitative and quantitative evaluation of different IO characterizations and similarity measurements that work toward the development of a suitable clustering algorithm. We explore IO characteristics from monitoring data of one million parallel jobs and cluster them into groups of similar jobs. Therefore, the time series of various IO statistics is converted into features using different similarity metrics that customize the classification. We discuss conventional ML techniques that are applied to job profiles and contrast this with the analysis of time series data where we apply the Levenshtein distance as a distance metric. While the employed Levenshtein algorithms aren't yet optimal, the results suggest that temporal behavior is key to identifying related patterns.},
}
@article{Eugen20HPS,
title={TODO JHPS version},
author={Betke, Eugen and Kunkel, Julian}
title={{Classifying Temporal Characteristics of Job I/O}},
author={Betke, Eugen and Kunkel, Julian},
journal={Journal of High Performance Storage},
issue={1},
date={2020}
}


@ -69,7 +69,7 @@
\usepackage{cleveref}
\crefname{codecount}{Code}{Codes}
\title{A Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analyzing}
\title{A Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis}
\author{Julian Kunkel\inst{2} \and Eugen Betke\inst{1}}
@ -95,7 +95,7 @@ This allows staff to understand the usage of the exhibited behavior better and t
In this paper, a methodology to rank the similarity of all jobs to a reference job based on their temporal IO behavior is described.
Practically, we apply several previously developed time series algorithms and also utilize Kolmogorov-Smirnov to compare the distribution of the statistics.
A study is conducted to explore the effectiveness of the approach, which starts from three reference jobs and investigates related jobs.
The data stems from DKRZ's supercomputer Mistral and includes more than 500,000 jobs that have been executed for more than 6 months of operation %203 days.
The data stems from DKRZ's supercomputer Mistral and includes more than 500,000 jobs that have been executed for more than 6 months of operation. %203 days.
%Problem with the definition of similarity.
Our analysis shows that the strategy and algorithms are effective at identifying similar jobs and reveals interesting patterns in the data.
\end{abstract}
@ -125,10 +125,6 @@ It is non-trivial to identify jobs with similar behavior from the pool of execut
Re-executing the same job will lead to slightly different behavior; a program may be executed with different inputs or using a different configuration (e.g., number of nodes).
Job names are defined by users; while a similar name may hint at a similar workload, finding other applications with the same IO behavior would not be possible this way.
\jk{Hope this explains it}
\eb{The information on why the support staff should search for similar jobs is still missing here. As I understand it, if a job causes problems, then similar jobs can cause similar problems as well.}
\eb{The benefit for the user is not quite clear. Why should a user search for similar jobs?}
In our previous paper \cite{Eugen20HPS}, we developed several distance measures and algorithms for the clustering of jobs based on the time series of their IO behavior.
The distance measures can be applied to jobs with different runtime and number of nodes utilized but differ in the way they define similarity.
We showed that the metrics can be used to cluster jobs; however, it remains unclear whether data center staff can use the method to effectively explore jobs similar to a reference job.
@ -355,19 +351,16 @@ Finally, the quantitative behavior of the 100 most similar jobs is investigated.
\subsection{Performance}
\jk{Eugen: pls describe node where the performance is measured on.}
To measure the performance for computing the similarity to the reference jobs, the algorithms are executed 10 times on a compute node at DKRZ.
To measure the performance for computing the similarity to the reference jobs, the algorithms are executed 10 times on a compute node at DKRZ which is equipped with two Intel Xeon E5-2680v3 @2.50GHz and 64GB DDR4 RAM.
A boxplot for the runtimes is shown in \Cref{fig:performance}.
The runtime is normalized for 100k jobs, i.e., for BIN\_all it takes about 41\,s to process 100k jobs out of the 500k total jobs that this algorithm will process.
Generally, the bin algorithms are fastest, while the hex algorithms often take 4-5x as long.
HEX\_phases is slow for Job-S and Job-M but fast for Job-L because just one phase is extracted for Job-L.
The Levenshtein-based algorithms take longer for longer jobs, proportional to the job length, as they apply a sliding window.
The KS algorithm is about 10x faster than the others, but it operates on the statistics of the time series.
Note that the current algorithms are sequential and executed on just one core.
For computing the similarity to one reference job (or a small set of reference jobs), they could easily be parallelized.
We believe this will then allow a near-online analysis of a job.
We believe this will then allow an online analysis.
\begin{figure}
\centering
@ -443,19 +436,19 @@ Practically, the support team would start with Rank\,1 (most similar job, presum
\begin{figure}
\centering
\begin{subfigure}{0.75\textwidth}
\begin{subfigure}{0.7\textwidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_4296426-out/hist-sim}
\caption{Job-S} \label{fig:hist-job-S}
\end{subfigure}
\begin{subfigure}{0.75\textwidth}
\begin{subfigure}{0.7\textwidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_5024292-out/hist-sim}
\caption{Job-M} \label{fig:hist-job-M}
\end{subfigure}
\begin{subfigure}{0.75\textwidth}
\begin{subfigure}{0.7\textwidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_7488914-out/hist-sim}
\caption{Job-L} \label{fig:hist-job-L}
@ -603,10 +596,12 @@ From this analysis, we conclude that one representative from binary quantization
\section{Assessing Timelines for Similar Jobs}
\label{sec:timelines}
To verify the suitability of the similarity metrics, for each algorithm, we investigated the timelines of all Top\,100 jobs.
To verify the suitability of the similarity metrics, for each algorithm, we carefully investigated the timelines of each of the jobs in the Top\,100.
We subjectively found that the approach works very well and identifies suitable similar jobs.
To demonstrate this, we include a selection of job timelines -- typically Rank\,2, Rank\,15, and Rank\,100, and selected interesting job profiles.
To demonstrate this, we include a selection of job timelines and selected interesting job profiles.
These can be visually and subjectively compared to our reference jobs shown in \Cref{fig:refJobs}.
For space reasons, the included images are scaled down, making it difficult to read the text.
However, we believe that they are still well suited for a visual inspection and comparison.
\subsection{Job-S}
@ -623,7 +618,7 @@ While we cannot visually see much differences between these two jobs compared to
For Job-S, we found that all algorithms work well and, therefore, omit further timelines.
\begin{table}
\begin{table}[bt]
\centering
\begin{tabular}{r|r|r|r|r|r}
BIN\_aggzeros & BIN\_all & HEX\_lev & HEX\_native & HEX\_phases & KS\\ \hline
@ -643,13 +638,14 @@ For Job-S, we found that all algorithms work well and, therefore, omit further t
\label{tbl:control-jobs}
\end{table}
\begin{figure}
\begin{figure}[bt]
\centering
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/bin_aggzeros-0.6923--76timeseries4235560}
\caption{Non-cmor job: Rank\,76, SIM=69\%}
\end{subfigure}
\qquad
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/bin_aggzeros-0.8077--4timeseries4483904}
@ -661,7 +657,7 @@ For Job-S, we found that all algorithms work well and, therefore, omit further t
\end{figure}
\begin{figure}
\begin{figure}[bt]
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_lev-0.9615--1timeseries4296288}
@ -729,17 +725,18 @@ For Job-S, we found that all algorithms work well and, therefore, omit further t
\subsection{Job-M}
Inspecting the Top\,100 for this reference job highlights the differences between the algorithms.
All algorithms identify a diverse range of job names for this reference job in the Top\,100.
Firstly, the name of the reference job appears 30 times in the whole dataset so this job type isn't necessarily executed frequently and, therefore, our Top\,100 is expected to contain other names.
Some applications are more prominent in these sets, e.g., for BIN\_aggzeros, 32\,jobs contain WRF (a model) in the name.
The number of unique names is 19, 38, 49 to 51 for BIN\_aggzeros, HEX\_phases, HEX\_native and HEX\_lev, respectively.
Firstly, the name of the reference job appears 30 times in the whole dataset.
So this job type isn't necessarily executed frequently and, therefore, our Top\,100 is expected to contain other names.
Some applications are more prominent in these sets, e.g., for BIN\_aggzeros, 32~jobs contain WRF (a model) in the name.
The number of unique names is 19, 38, 49, and 51 for BIN\_aggzeros, HEX\_phases, HEX\_native and HEX\_lev, respectively.
The jobs that are similar according to the bin algorithms differ from our expectations.
The jobs that are similar according to the bin algorithms (see \Cref{fig:job-M-bin-aggzero}) differ from our expectations.
The other algorithms like HEX\_lev (\Cref{fig:job-M-hex-lev}) and HEX\_native (\Cref{fig:job-M-hex-native}) seem to work as intended:
While jobs exhibit short bursts of other active metrics, even for low similarity values we can visually recognize a relevant similarity.
\begin{figure}
\begin{figure}[bt]
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/bin_aggzeros-0.7755--1timeseries8010306}
@ -762,9 +759,10 @@ The jobs that are similar according to the bin algorithms differ from our expect
\begin{figure}
\begin{figure}[bt]
\begin{subfigure}{0.3\textwidth}
\centering
\vspace*{-2cm}
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_lev-0.9546--1timeseries7826634}
\caption{Rank\,2, SIM=95\%}
\end{subfigure}
@ -777,6 +775,8 @@ The jobs that are similar according to the bin algorithms differ from our expect
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_lev-0.7392--15timeseries7651420}
\caption{Rank\,15, SIM=74\%}
\end{subfigure}
\vspace*{-1.7cm}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_lev-0.7007--99timeseries8201967}
@ -789,18 +789,14 @@ The jobs that are similar according to the bin algorithms differ from our expect
\begin{figure}
\begin{figure}[bt]
\begin{subfigure}{0.3\textwidth}
\centering
\vspace*{-1.6cm}
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_native-0.9878--1timeseries5240733}
\caption{Rank 2, SIM=99\%}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_native-0.9651--2timeseries7826634}
\caption{Rank 3, SIM=97\%}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_native-0.9084--14timeseries8037817}
\caption{Rank 15, SIM=91\%}
\end{subfigure}
@ -810,12 +806,20 @@ The jobs that are similar according to the bin algorithms differ from our expect
\caption{Rank 100, SIM=88\%}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\vspace*{-1.5cm}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_native-0.9651--2timeseries7826634}
\caption{Rank 3, SIM=97\%}
\end{subfigure}
\caption{Job-M with HEX\_native, selection of similar jobs}
\label{fig:job-M-hex-native}
\end{figure}
\begin{figure}
\begin{figure}[bt]
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_phases-0.8831--1timeseries7826634}
@ -846,7 +850,7 @@ For the bin algorithms, the inspection of job names (14 unique names) leads to t
The hex algorithms identify a more diverse set of applications (18 unique names and no xmessy job), and the HEX\_phases algorithm has 85 unique names.
The KS algorithm finds 71 jobs ending with t127, which is a typical model configuration.
\begin{figure}
\begin{figure}[bt]
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/bin_aggzeros-0.1671--1timeseries7869050}
@ -872,7 +876,7 @@ The KS algorithm finds 71 jobs ending with t127, which is a typical model config
\end{figure}
\begin{figure}
\begin{figure}[bt]
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_lev-0.9386--1timeseries7266845}
@ -898,7 +902,7 @@ The KS algorithm finds 71 jobs ending with t127, which is a typical model config
\end{figure}
\begin{figure}
\begin{figure}[bt]
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_native-0.9390--1timeseries7266845}
@ -923,7 +927,7 @@ The KS algorithm finds 71 jobs ending with t127, which is a typical model config
\label{fig:job-L-hex-native}
\end{figure}
\begin{figure}
\begin{figure}[bt]
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_phases-1.0000--14timeseries4577917}