diff --git a/paper/main.tex b/paper/main.tex
index 9152f01..361cc9b 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -92,7 +92,8 @@ When support staff investigates a single job, it is relevant to identify related
 In this paper, a methodology to rank the similarity of all jobs to a reference job based on their temporal IO behavior is described.
 Practically, we apply several of previously developed time series based algorithms and also utilize Kolmogorov-Smirnov to compare the distribution of the statistics.
 A study is conducted to explore the effectivity of the approach which starts starts from three reference jobs and investigates related jobs.
-The data stems from DKRZ's supercomputer Mistral and include more than 500.000 jobs that have been executed during several months.
+The data stems from DKRZ's supercomputer Mistral and includes more than 500,000 jobs that have been executed during several months of operation.
+\jk{How long was that?}
 %Problem with definition of similarity.
 Our analysis shows that the strategy and algorithms are effective to identify similar jobs and revealed some interesting patterns on the data.
@@ -123,13 +124,13 @@ The distance metrics can be applied to jobs of with different runtime and number
 We showed that the metrics can be used to cluster jobs, however, it remained unclear if the method can be used by data center staff to explore jobs of a reference job effectively.
 In this article, we refined these distance metrics slightly and apply them to rank jobs based on their similarity to a reference job.
 Therefore, we perform a study on three reference jobs with different character.
-We also utilize Kolmogorov-Smirnov to illustrate the benefit and drawbacks of the differents methods.
+We also utilize Kolmogorov-Smirnov to illustrate the benefits and drawbacks of the different methods.
 This paper is structured as follows.
 We start by introducing related work in \Cref{sec:relwork}.
 %Then, in TODO we introduce the DKRZ monitoring systems and explain how I/O metrics are captured by the collectors.
 In \Cref{sec:methodology} we describe briefly describe the data reduction and the machine learning approaches.
-In \Cref{sec:evaluation}, we perform a study by applying the methodology on three jobs with different behavior, therewith, assessing the effectivness of the approach to identify similar jobs.
+In \Cref{sec:evaluation}, we perform a study by applying the methodology to three jobs with different behavior, thereby assessing the effectiveness of the approach to identify similar jobs.
 Finally, we conclude our paper in \Cref{sec:summary}.
 \section{Related Work}
@@ -137,60 +138,68 @@ Finally, we conclude our paper in \Cref{sec:summary}.
 \section{Methodology}
 \label{sec:methodology}
-\ebadd{
+
+The purpose of the methodology is to allow users and support staff to explore all executed jobs on a supercomputer in order of their similarity to a given reference job.
+Therefore, we first define the job data, then describe the algorithms used to compute the similarity, and finally describe the strategy to investigate related jobs.
+
+\subsection{Job Data}
+On the Mistral supercomputer at DKRZ, the monitoring system gathers nine IO metrics for the two Lustre file systems on all nodes in 10s intervals, together with general job metadata from the SLURM workload manager.
+This results in 4D data (time, nodes, metrics, file system) per job.
+The distance metrics should handle jobs of different length and node count.
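+
+To illustrate the shape of this data, the following Rust sketch shows one possible in-memory representation of a single job; the type and field names are illustrative only and not taken from our implementation.
+\begin{verbatim}
+/// Monitoring data of one job: a sample for every combination of
+/// (time, node, metric, file system), collected in 10s intervals.
+struct JobData {
+    job_id: u64,
+    metric_names: Vec<String>,      // the nine IO metrics
+    // 4D samples laid out as [time][node][metric][file_system];
+    // the innermost array holds the two Lustre file systems.
+    samples: Vec<Vec<Vec<[f32; 2]>>>,
+}
+\end{verbatim}
+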
+In \cite{TODOPaper}, we discussed in detail a variety of options, ranging from 1D job-profiles to data reductions for comparing time series data, as well as the general workflow and pre-processing.
+In a nutshell, for each job executed on Mistral, we partition it into 10-minute segments, compute the arithmetic mean of each metric, and categorize the value into non-IO (0), HighIO (1), and CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above, respectively.
+After the data is reduced across nodes, we quantize the timelines using either a binary or a hexadecimal representation, which is then ready for similarity analysis.
+By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero -- we reduce the dataset from about 1 million jobs to about 580k jobs.
+
+\subsection{Algorithms for Computing Similarity}
+In this paper, we reuse the algorithms developed in \cite{TODO}: bin\_all, bin\_aggzeros, hex\_native, hex\_lev, and hex\_quant.
+They differ in the way data similarity is defined: either the binary or the hexadecimal coding is used, and the distance metric is mostly the Euclidean distance or the Levenshtein distance.
+For jobs of different length, we apply a sliding-window approach which finds the location in the longer job where the shorter job matches with the highest similarity.
+The hex\_quant algorithm extracts I/O phases and matches the phases between the two jobs.
+
+\paragraph{Kolmogorov-Smirnov (ks) algorithm}
+In this paper, we add a Kolmogorov-Smirnov algorithm that compares the probability distributions of the observed values, which we describe in the following.
 % Summary
-For the analysis of the Kolmogorov-Smirnov-based similarity we perform two preparation steps.
-Dimension reduction by mean and concatenation functions allow us to reduce the four dimensional dataset to two dimensions.
-Pre-filtering omits irrelevant jobs in term of performance and reduces the dataset any further.
+For the analysis, we perform two preparation steps.
+We reduce the dimensionality by computing the mean across the two file systems and by concatenating the time series data of the individual nodes.
+This reduces the four-dimensional dataset to two dimensions (time, metrics).
 % Aggregation
 The reduction of the file system dimension by the mean function ensures the time series values stay in the range between 0 and 4, independently how many file systems are present on an HPC system.
-A fixed interval also ensure the portability of the approach to other HPC systems.
-The concatenation of time series on the node dimension preserves I/O information of all nodes.
+The fixed interval of 10 min also ensures the portability of the approach to other HPC systems.
+The concatenation of time series on the node dimension preserves the individual I/O information of all nodes while allowing the comparison of jobs with a different number of nodes.
 We apply no aggregation function to the metric dimension.
 % Filtering
-Zero-jobs are jobs with no sign of significant I/O load are of little interest in the analysis.
-Their sum across all dimensions and time series is equal to zero.
-Furthermore, we filter those jobs whose time series have less than 8 values.
+%Zero-jobs are jobs with no sign of significant I/O load are of little interest in the analysis.
+%Their sum across all dimensions and time series is equal to zero.
+%Furthermore, we filter those jobs whose time series have less than 8 values.
+% Described above
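+
+As an illustration of these two preparation steps, the following Rust sketch averages over the file system dimension and concatenates the per-node time series into 2D data; the nested-vector layout and names are assumptions for presentation only, not our actual implementation.
+\begin{verbatim}
+/// Segmented data of one job: [node][time][metric][file_system],
+/// where each value is the category mean in the range [0, 4].
+type Segments = Vec<Vec<Vec<[f32; 2]>>>;
+
+/// Average over the two file systems and concatenate the time series
+/// of all nodes, yielding 2D data: [time][metric].
+fn reduce_to_2d(job: &Segments) -> Vec<Vec<f32>> {
+    let mut result = Vec::new();
+    for node in job {
+        for segment in node {
+            // The mean across file systems keeps values within [0, 4].
+            let metrics: Vec<f32> = segment
+                .iter()
+                .map(|fs| (fs[0] + fs[1]) / 2.0)
+                .collect();
+            result.push(metrics);
+        }
+    }
+    result
+}
+\end{verbatim}
+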
 % Similarity
 For the analysis we use the kolmogorov-smirnov-test 1.1.0 Rust library from the official Rust Package Registry ``cargo.io''.
-The similarity function \Cref{eq:ks_similarity} calculates the inverse of reject probability $p_{\text{reject}}$.
-}
+The similarity function in \Cref{eq:ks_similarity} calculates the mean of the inverse rejection probability $p_{\text{reject}}(m)$ obtained from the KS-test for each metric $m$ of the metric set $M$.
+
 \begin{equation}\label{eq:ks_similarity}
-  similarity = 1 - p_{\text{reject}}
+  similarity = \frac{\sum_{m \in M} \left( 1 - p_{\text{reject}}(m) \right)}{|M|}
 \end{equation}
-Given: the reference job ID.
-Create from 4D time series data (number of nodes, per file systems, 9 metrics, time) a feature set.
+\subsection{Methodology}
-Adapt the algorithms:
-\begin{itemize}
- \item iterate for all jobs
- \begin{itemize}
- \item compute distance to reference job
- \end{itemize}
- \item sort the jobs based on the distance to ref job
- \item create cumulative job distribution based on distance for visualization, allow users to output jobs with a given distance
-\end{itemize}
+Our strategy for localizing similar jobs works as follows:
+The user or support staff provides a reference job ID and the algorithm to use for computing the similarity.
+The system iterates over all jobs and computes the distance to the reference job using the selected algorithm.
+Next, the jobs are sorted based on their distance to the reference job.
+Then, the cumulative job distance is visualized.
+Finally, the user starts the inspection of the jobs, looking at the most similar jobs first.
-A user might be interested to explore say closest 10 or 50 jobs.
+The user can decide on the criterion for when to stop inspecting jobs: based on the similarity, the number of investigated jobs, or the distribution of the job similarity.
+For the latter, it is interesting to investigate clusters of similar jobs, e.g., if there are many jobs with 80--90\% similarity but few with 70--80\%.
-Algorithms:
-Profile algorithm: job-profiles (job-duration, job-metrics, combine both)
-$\rightarrow$ just compute geom-mean distance between profile
+For the inspection of the jobs, a user may explore the job metadata, search for commonalities, and explore the time series of a job's IO metrics.
-Check time series algorithms:
-
-\begin{itemize}
- \item bin
- \item hex\_native
- \item hex\_lev
- \item hex\_quant
-\end{itemize}
 \section{Evaluation}
 \label{sec:evaluation}
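+As a compact recap of the ranking workflow from \Cref{sec:methodology}, the following Rust sketch iterates over all jobs, computes their similarity to the reference job, sorts them, and inspects the most similar ones first; the similarity function is a toy stand-in for the algorithms above, and all names and values are illustrative only.
+\begin{verbatim}
+/// Kolmogorov-Smirnov based similarity (cf. the equation above):
+/// mean of (1 - p_reject) across all metrics; the per-metric rejection
+/// probabilities are assumed to come from a KS-test library.
+fn ks_similarity(p_reject: &[f64]) -> f64 {
+    p_reject.iter().map(|p| 1.0 - p).sum::<f64>() / p_reject.len() as f64
+}
+
+/// Toy stand-in for the similarity algorithms (bin_all, hex_lev, ...):
+/// fraction of overlapping segments with identical categories.
+fn similarity(job: &[Vec<f32>], reference: &[Vec<f32>]) -> f64 {
+    let overlap = job.len().min(reference.len()).max(1);
+    let equal = job.iter().zip(reference).filter(|(a, b)| a == b).count();
+    equal as f64 / overlap as f64
+}
+
+fn main() {
+    // Illustrative jobs: (job_id, 2D time series [time][metric]).
+    let jobs: Vec<(u64, Vec<Vec<f32>>)> = vec![
+        (4711, vec![vec![0.0, 1.0], vec![4.0, 1.0]]),
+        (4712, vec![vec![0.0, 1.0], vec![0.0, 0.0]]),
+        (4713, vec![vec![4.0, 4.0], vec![4.0, 4.0]]),
+    ];
+    let reference = jobs[0].1.clone();
+
+    // Rank all jobs by their similarity to the reference job.
+    let mut ranked: Vec<(u64, f64)> = jobs
+        .iter()
+        .map(|(id, data)| (*id, similarity(data, &reference)))
+        .collect();
+    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
+
+    // Inspect the most similar jobs first, e.g., the top 10.
+    for (id, sim) in ranked.iter().take(10) {
+        println!("job {id}: similarity {sim:.2}");
+    }
+
+    // Example of the ks-based similarity for three metrics.
+    println!("ks example: {:.2}", ks_similarity(&[0.05, 0.20, 0.10]));
+}
+\end{verbatim}
+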