This commit is contained in:
Julian M. Kunkel 2020-10-04 17:11:35 +01:00
parent d2d5970a4c
commit d78e511676
1 changed files with 38 additions and 20 deletions


@ -84,35 +84,53 @@ DKRZ --
\maketitle
\begin{abstract}
Supercomputers execute thousands of jobs every day.
Support staff at a data center have two goals.
Firstly, they provide a service to users to enable the execution of their applications.
Secondly, they aim to improve the efficiency of the workflows in order to allow the data center to serve more workloads.
In order to optimize an application, its behavior and resource utilization must be monitored and then assessed.
Only rarely do users liaise with staff and explicitly request a performance analysis and optimization.
Therefore, the data center must deploy monitoring systems, and staff must proactively identify candidates for optimization.
While it is easy to utilize statistics to rank applications based on the utilization of compute, storage, and network, it is tricky to find patterns among 100,000 jobs, i.e., to determine whether there is a class of jobs that is not performing well.
When support staff investigate a single job, the question might be: are there other jobs like this one?
In this paper, we describe a methodology to rank all jobs according to their similarity to a reference job based on their temporal IO behavior.
Practically, we apply several previously developed time series based algorithms and also utilize the Kolmogorov-Smirnov test to compare the distributions of the statistics.
A study is conducted to explore the effectiveness of the approach; it starts from three reference jobs and investigates related jobs.
The data stem from DKRZ's supercomputer Mistral and include more than 500,000 jobs that have been executed over several months.
%Problem with definition of similarity.
Our analysis shows that the strategy and algorithms are effective in identifying similar jobs and reveal some interesting patterns in the data.
\end{abstract}
\section{Introduction}
%This paper is structured as follows.
%We start with the related work in \Cref{sec:relwork}.
Supercomputers execute thousands of jobs every day.
Support staff at a data center have two goals.
Firstly, they provide a service to users to enable the convenient execution of their applications.
Secondly, they aim to improve the efficiency of all workflows -- represented as batch jobs -- in order to allow the data center to serve more workloads.
In order to optimize a single job, its behavior and resource utilization must be monitored and then assessed.
Only rarely do users liaise with staff and explicitly request a performance analysis and optimization.
Therefore, data centers deploy monitoring systems, and staff must proactively identify candidates for optimization.
Monitoring tools such as Grafana~\cite{Grafana} and XDMoD~\cite{XDMod} provide various statistics and time series data for the job execution.
Support staff should focus on workloads for which optimization is beneficial; for instance, analyzing a job that is executed only once on a medium number of nodes costs human resources without yielding a good return on investment.
By ranking jobs based on these statistics, it is not difficult to find a job that exhibits extensive usage of compute, network, and IO resources.
However, would it be beneficial to investigate this workload in detail and potentially optimize it?
A pattern that can be observed in many jobs bears potential, as the blueprint for optimizing one job may be applied to other jobs as well.
This is particularly true when the same application is run with similar inputs, but different applications may also lead to similar behavior.
Therefore, it is useful for support staff investigating a resource-hungry job to identify similar jobs that are executed on the supercomputer.
In our previous paper \cite{XXX}, we developed several distance metrics and algorithms for the clustering of jobs based on the time series of their IO behavior.
The distance metrics can be applied to jobs with different runtimes and numbers of utilized nodes, but they differ in the way they define similarity.
We showed that the metrics can be used to cluster jobs; however, it remained unclear whether the method can be used by data center staff to explore jobs similar to a reference job effectively.
In this article, we refine these distance metrics slightly and apply them to rank jobs based on their similarity to a reference job.
To that end, we perform a study on three reference jobs with different characteristics.
We also utilize the Kolmogorov-Smirnov test to illustrate the benefits and drawbacks of the different methods.
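To give a flavor of the ranking step, the following minimal sketch ranks jobs by the two-sample Kolmogorov-Smirnov statistic computed on the samples of a single IO metric per job; the metric name, the data layout, and the function names are illustrative assumptions and are not taken from our implementation.
\begin{verbatim}
# Illustrative sketch only: rank jobs by similarity to a reference job
# using the two-sample Kolmogorov-Smirnov statistic on one IO metric.
# The metric name and data layout are assumptions for this example.
from scipy.stats import ks_2samp

def ks_similarity(reference_samples, candidate_samples):
    # The KS statistic is 0 for identical empirical distributions;
    # map it to a similarity score in [0, 1].
    result = ks_2samp(reference_samples, candidate_samples)
    return 1.0 - result.statistic

def rank_jobs(reference_job, jobs, metric="write_bytes"):
    # jobs: dict mapping job id -> dict of metric name -> list of samples
    scores = [(job_id, ks_similarity(reference_job[metric], data[metric]))
              for job_id, data in jobs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
\end{verbatim}
Note that such a distribution-based comparison ignores the temporal ordering of the samples, which is precisely where the time series based algorithms differ.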
This paper is structured as follows.
We start by introducing related work in \Cref{sec:relwork}.
%Then, in TODO we introduce the DKRZ monitoring systems and explain how I/O metrics are captured by the collectors.
%In \Cref{sec:methodology} we describe the data reduction and the machine learning approaches and do an experiment in \Cref{sec:data,sec:evaluation}.
%Finally, we finalize our paper with a summary in \Cref{sec:summary}.
In \Cref{sec:methodology}, we briefly describe the data reduction and the machine learning approaches.
In \Cref{sec:evaluation}, we perform a study by applying the methodology to three jobs with different behavior, thereby assessing the effectiveness of the approach to identify similar jobs.
Finally, we conclude our paper in \Cref{sec:summary}.
\section{Related Work}
\label{sec:relwork}