Julian M. Kunkel 2020-12-08 12:09:02 +00:00
parent 1cf57a036c
commit db3e4a8eb0
1 changed file with 4 additions and 7 deletions


@@ -97,8 +97,7 @@ This allows staff to understand the usage of the exhibited behavior better and t
\medskip
%In this paper, a methodology to rank the similarity of all jobs to a reference job based on their temporal I/O behavior is described.
In this paper, we describe a methodology to efficiently process a large set of jobs and find a class of jobs with a high temporal I/O similarity to a reference job.
Practically, we apply several previously developed time series algorithms and also utilize the Kolmogorov-Smirnov test to compare the distributions of the metrics (a sketch follows below).
A study is conducted to explore the effectiveness of the approach by investigating related jobs for three reference jobs.
The data stems from DKRZ's supercomputer Mistral and includes more than 500,000 jobs executed during more than 6 months of operation. Our analysis shows that the strategy and algorithms are effective in identifying similar jobs and that they reveal interesting patterns in the data.
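The abstract above refers to the two-sample Kolmogorov-Smirnov test for comparing the distributions of the monitored metrics. The following is a minimal sketch of such a comparison, assuming each job is reduced to a 1D array of per-segment values for one metric; the helper name, the derived similarity score, and the synthetic gamma-distributed data are illustrative assumptions rather than the paper's implementation.

\begin{verbatim}
# Minimal sketch: compare the distribution of one I/O metric between a
# reference job and a candidate job with the two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

def ks_similarity(reference_segments, candidate_segments):
    # The KS statistic is the maximum distance between the two empirical
    # CDFs; 1 - statistic serves here as a simple similarity score in [0, 1]
    # (an assumption, not necessarily the paper's exact definition).
    result = ks_2samp(reference_segments, candidate_segments)
    return 1.0 - result.statistic

# Synthetic per-segment metric values (e.g., mean read bandwidth per segment).
rng = np.random.default_rng(0)
reference = rng.gamma(shape=2.0, scale=10.0, size=40)  # 40 ten-minute segments
candidate = rng.gamma(shape=2.1, scale=9.5, size=55)   # different job length is fine
print(f"similarity = {ks_similarity(reference, candidate):.3f}")
\end{verbatim}

Note that the test only compares value distributions and deliberately ignores the temporal order of the segments, which is why it complements the time series algorithms.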
@@ -117,8 +116,7 @@ Secondly, they aim to improve the efficiency of all workflows -- represented as
In order to optimize a single job, its behavior and resource utilization must be monitored and then assessed.
Rarely, users will liaise with staff and request a performance analysis and optimization explicitly.
Therefore, data centers deploy monitoring systems and staff must pro-actively identify candidates for optimization.
Monitoring and analysis tools such as TACC Stats \cite{evans2014comprehensive}, Grafana \cite{chan2019resource}, and XDMod \cite{simakov2018workload} provide various statistics and time-series data for job execution.
\eb{Grafana is purely a visualization tool}
The support staff should focus on workloads for which optimization is beneficial; for instance, the analysis of a job that is executed once on 20 nodes may not be a good return on investment.
By ranking jobs based on their utilization, it is not difficult to find a job that exhibits extensive usage of computing, network, and I/O resources.
@@ -135,7 +133,7 @@ Job names are defined by users; while a similar name may hint to be a similar wo
In the paper \cite{Eugen20HPS}, the authors developed several distance measures and algorithms for the clustering of jobs based on the time series of their I/O behavior.
These distance measures can be applied to jobs with different runtimes and node counts but differ in the way they define similarity.
They showed that the metrics can be used to cluster jobs; however, it remained unclear whether the method can be used by data center staff to explore similar jobs effectively.
In this paper, we refine these algorithms slightly, include another algorithm, and apply them to rank jobs based on their temporal similarity to a reference job (a ranking sketch is given below).
We start by introducing related work in \Cref{sec:relwork}. We start by introducing related work in \Cref{sec:relwork}.
In \Cref{sec:methodology}, we briefly describe the data reduction and the algorithms for similarity analysis.
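The introduction above states that the algorithms are applied to rank jobs by their temporal similarity to a reference job. A minimal, hypothetical sketch of such a ranking follows: compute one score per candidate job and sort. The two-sample Kolmogorov-Smirnov statistic is used only as a stand-in scoring function; the paper's own distance measures would be plugged in instead, and all job IDs and data are made up.

\begin{verbatim}
import numpy as np
from scipy.stats import ks_2samp

def similarity(a, b):
    # Stand-in score in [0, 1]; replace with any of the paper's algorithms.
    return 1.0 - ks_2samp(a, b).statistic

def rank_jobs(reference, candidates):
    # Return (job_id, score) pairs sorted from most to least similar.
    scored = [(job_id, similarity(reference, series))
              for job_id, series in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical reference job and candidate jobs, one metric each.
rng = np.random.default_rng(2)
reference_job = rng.gamma(2.0, 10.0, size=40)
candidate_jobs = {f"job-{i}": rng.gamma(2.0 + 0.2 * i, 10.0, size=40)
                  for i in range(5)}
for job_id, score in rank_jobs(reference_job, candidate_jobs):
    print(job_id, round(score, 3))
\end{verbatim}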
@@ -199,8 +197,7 @@ The results are 4D data (time, nodes, metrics, file system) per job.
The distance measures should handle jobs of different lengths and node counts.
In \cite{Eugen20HPS}, the authors discussed a variety of options, from 1D job profiles to data reductions, for comparing time series data, and described the general workflow and pre-processing in detail. We are using their data.
In a nutshell, they partitioned each job executed on Mistral into 10-minute segments, computed the arithmetic mean of each metric per segment, and categorized the value as NonIO (0), HighIO (1), or CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above, respectively (sketched below).
This strategy ensures that the same approach can be applied to other HPC systems.
\eb{Portability still needs to be clarified}
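The segmentation and categorization described above can be illustrated with a short sketch. It assumes raw samples of one metric on one node arrive at a fixed interval and that the 99th and 99.9th percentile thresholds have been computed beforehand over the system-wide data; all names, the data layout, and the example thresholds are assumptions, not the authors' code.

\begin{verbatim}
import numpy as np

SEGMENT_SECONDS = 600  # fixed 10-minute segments

def categorize_segments(samples, sample_interval_s, p99, p999):
    # Partition one metric's time series into 10-minute segments, take the
    # arithmetic mean per segment, and map it to NonIO (0), HighIO (1), or
    # CriticalIO (4) using the supplied percentile thresholds.
    per_segment = SEGMENT_SECONDS // sample_interval_s
    n_segments = len(samples) // per_segment
    means = samples[:n_segments * per_segment].reshape(n_segments, per_segment).mean(axis=1)
    categories = np.zeros(n_segments, dtype=int)  # NonIO: below 99th percentile
    categories[means >= p99] = 1                  # HighIO: up to 99.9th percentile
    categories[means >= p999] = 4                 # CriticalIO: above
    return categories

# Example: 4 hours of one metric sampled every 60 seconds on one node.
rng = np.random.default_rng(1)
raw = rng.exponential(scale=5.0, size=240)
print(categorize_segments(raw, sample_interval_s=60, p99=20.0, p999=40.0))
\end{verbatim}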
After the mean value across nodes is computed for a segment, the resulting numeric value is encoded using either a binary representation (I/O activity in the segment: yes/no) or a hexadecimal representation (quantizing the numerical performance value into 0-15), which is then ready for similarity analysis.
By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero -- the dataset is reduced from 1 million jobs to about 580k jobs.
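A minimal sketch of the encoding and the pre-filter described above is given below, assuming the per-segment values have already been averaged across nodes; the function names, the quantization against a fixed maximum value, and the example data are assumptions for illustration rather than the original implementation.

\begin{verbatim}
import numpy as np

def encode_binary(segment_values):
    # '1' if the segment shows any I/O activity, '0' otherwise.
    return "".join("1" if value > 0 else "0" for value in segment_values)

def encode_hex(segment_values, max_value):
    # Quantize each per-segment value into 16 levels (0-15), emit hex digits.
    levels = np.clip(segment_values / max_value * 15, 0, 15).astype(int)
    return "".join(format(int(level), "x") for level in levels)

def has_io_activity(segment_values):
    # Pre-filter: keep only jobs whose summed activity is non-zero.
    return float(np.sum(segment_values)) > 0.0

values = np.array([0.0, 3.2, 14.9, 0.4, 0.0])  # already averaged across nodes
print(encode_binary(values))                   # -> 01110
print(encode_hex(values, max_value=15.0))      # -> 03e00
print(has_io_activity(values))                 # -> True
\end{verbatim}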