Julian M. Kunkel 2020-11-20 10:08:16 +00:00
One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency.
Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs.
While it is easy to utilize statistics to rank jobs based on the utilization of computing, storage, and network, it is tricky to find patterns among 100,000 jobs, e.g., to spot a class of jobs that is not performing well.
Similarly, when support staff investigates a specific job in detail, e.g., because it is inefficient or highly efficient, it is relevant to identify jobs related to such a blueprint.
This allows staff to understand the usage of the exhibited behavior better and to assess the optimization potential.
\medskip
In this paper, a methodology to rank the similarity of all jobs to a reference job based on their temporal I/O behavior is described.
Practically, we apply several previously developed time series algorithms and also utilize the Kolmogorov-Smirnov test to compare the distributions of the statistics.
A study is conducted to explore the effectiveness of the approach, starting from three reference jobs and investigating related jobs.
The data stems from DKRZ's supercomputer Mistral and includes more than 500,000 jobs executed during more than six months of operation. Our analysis shows that the strategy and algorithms are effective in identifying similar jobs and reveal interesting patterns in the data.
It also shows the need for the community to jointly define the semantics of similarity depending on the analysis purpose.
%203 days.
\end{abstract}
PAS2P \cite{mendez2012new} extracts the I/O patterns from application traces and then allows users to compare them manually.
In \cite{white2018automatic}, a heuristic classifier is developed that analyzes the I/O read/write throughput time series to extract the periodicity of the jobs -- similar to Fourier analysis.
The LASSi tool \cite{AOPIUOTUNS19} periodically monitors Lustre I/O statistics and computes a ``risk'' factor to identify I/O patterns that stress the file system.
In contrast to existing work, our approach allows a user to identify similar activities based on the temporal I/O behavior recorded with a data center-wide deployed monitoring system.
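As an illustrative sketch (not part of the paper's implementation), the Kolmogorov-Smirnov comparison used to rank candidate jobs against a reference job could look as follows. The job names and metric values are invented for illustration; a dependency-free two-sample KS statistic stands in for a library routine such as SciPy's:

```python
# Hypothetical sketch: rank jobs by similarity of their metric
# distributions to a reference job using the two-sample
# Kolmogorov-Smirnov statistic. All job names/values are invented.

def ks_statistic(a, b):
    """Two-sample KS statistic: maximum distance between the ECDFs."""
    values = sorted(set(a) | set(b))
    d = 0.0
    for v in values:
        # Fraction of samples <= v in each sample (the empirical CDFs).
        fa = sum(1 for x in a if x <= v) / len(a)
        fb = sum(1 for x in b if x <= v) / len(b)
        d = max(d, abs(fa - fb))
    return d

# Reference job vs. two candidates (e.g., I/O throughput per segment).
reference = [10, 12, 11, 50, 52, 49]
candidates = {
    "job_a": [11, 13, 10, 51, 50, 48],  # similar bimodal pattern
    "job_b": [30, 31, 29, 30, 32, 31],  # different, unimodal pattern
}

# Lower KS distance = more similar distribution; rank ascending.
ranking = sorted(candidates, key=lambda j: ks_statistic(reference, candidates[j]))
print(ranking)  # ['job_a', 'job_b']
```

Because the KS statistic compares whole distributions rather than pointwise time series, this kind of ranking is insensitive to temporal shifts, which is one reason it complements the time series algorithms mentioned above.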