diff --git a/paper/main.tex b/paper/main.tex
index 30cd238..0bbe14a 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -195,23 +195,22 @@
 The results are 4D data (time, nodes, metrics, file system) per job.
 The distance measures should handle jobs of different lengths and node count.
 In \cite{Eugen20HPS}, we discussed a variety of options from 1D job-profiles to data reductions to compare time series data and the general workflow and pre-processing in detail.
 In a nutshell, for each job executed on Mistral, we partition it into 10-minute segments and compute the arithmetic mean of each metric, categorize the value into non-IO (0), HighIO (1), and CriticalIO (4) for values below 99-percentile, up to 99.9-percentile, and above, respectively.
-After data is reduced across nodes, we quantize the timelines either using binary or quantum hexadecimal representation which is then ready for similarity analysis.
+The fixed interval of 10 minutes ensures the portability of the approach to other HPC systems.
+After the mean value across nodes is computed for a segment, the resulting numeric value is encoded using either a binary representation (I/O activity in the segment: yes/no) or a hexadecimal representation (quantizing the numerical performance value into 0--15); the encoded time series is then ready for similarity analysis.
 By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero, we are reducing the dataset from about 1 million jobs to about 580k jobs.
 
 \subsection{Algorithms for Computing Similarity}
 We reuse the algorithms developed in \cite{Eugen20HPS}: B-all, B-aggz(eros), Q-native, Q-lev, and Q-phases.
-They differ in the way data similarity is defined; either the binary or hexadecimal coding is used, the distance measure is mostly the Euclidean distance or the Levenshtein-distance.
-For jobs with different lengths, we apply a sliding-windows approach which finds the location for the shorter job in the long job with the highest similarity.
-
+They differ in the way data similarity is defined: the time series is encoded using either the binary or the hexadecimal quantization, and the distance measure is either the Euclidean distance or the Levenshtein distance.
 B-all determines similarity between binary codings by means of Levenshtein distance.
 B-aggz is similar to B-all, but computes similarity on binary codings where subsequent segments of zero activities are replaced by just one zero.
 Q-lev determines similarity between quantized codings by using Levenshtein distance.
-Q-native uses instead of Levenshtein distance a performance-aware similarity function.
+Q-native uses a performance-aware similarity function, i.e., the distance for a metric is $\frac{|m_{job1} - m_{job2}|}{16}$.
+For jobs with different lengths, we apply a sliding-window approach which finds the location in the longer job where the shorter job matches with the highest similarity.
 Q-phases extracts phase information and performs a phase-aware and performance-aware similarity computation.
-KS concatenates individual node data (instead of averaging) and computes similarity be means of Kolmogorov-Smirnov-Test.
-
 The Q-phases algorithm extracts I/O phases and computes the similarity between the most similar I/O phases of both jobs.
 In this paper, we add a new similarity definition based on the Kolmogorov-Smirnov test that compares the probability distribution of the observed values, which we describe in the following.
+In brief, KS concatenates individual node data (instead of averaging) and computes similarity by means of the Kolmogorov-Smirnov test.
 \paragraph{Kolmogorov-Smirnov (KS) algorithm}
 % Summary
@@ -221,7 +220,6 @@
 This reduces the four-dimensional dataset to two dimensions (time, metrics).
 
 % Aggregation
 The reduction of the file system dimension by the mean function ensures the time series values stay in the range between 0 and 4, independently of how many file systems are present on an HPC system.
-The fixed interval of 10 minutes also ensures the portability of the approach to other HPC systems.
 Unlike the previous similarity definitions, the concatenation of time series on the node dimension preserves the individual I/O information of all nodes while it still allows comparison of jobs with a different number of nodes.
 We apply no aggregation function to the metric dimension.
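For illustration, a minimal Python sketch of how such a KS-based comparison could look. This is not the paper's implementation: it assumes SciPy's two-sample ks_2samp, a per-job input of metric -> (nodes, segments) arrays with the file-system dimension already reduced by the mean, a simple 1 - statistic mapping of the KS statistic to a similarity, and an average over metrics to obtain a single score; the last two choices are assumptions, not taken from the text.

import numpy as np
from scipy import stats

def ks_similarity(job_a, job_b):
    """Hypothetical KS-based job similarity sketch.

    job_a, job_b: dict mapping metric name -> 2D array of shape (nodes, segments)
    with quantized segment values in [0, 4] (file-system dimension already
    reduced by the mean). Returns a similarity in [0, 1].
    """
    sims = []
    for metric in sorted(job_a.keys() & job_b.keys()):
        # Concatenate the per-node time series instead of averaging across nodes,
        # so jobs with different node counts remain comparable.
        sample_a = np.asarray(job_a[metric]).ravel()
        sample_b = np.asarray(job_b[metric]).ravel()
        # Two-sample Kolmogorov-Smirnov test on the observed value distributions.
        result = stats.ks_2samp(sample_a, sample_b)
        # Assumption: map the KS statistic (max CDF distance, in [0, 1]) to a
        # similarity; the paper may use a different mapping.
        sims.append(1.0 - result.statistic)
    # Assumption: no aggregation is applied over the metric dimension before the
    # test; the per-metric similarities are simply averaged into one score.
    return float(np.mean(sims))

A usage example would pass two such dicts, e.g. ks_similarity({"md_read": a1, "write_bytes": a2}, {"md_read": b1, "write_bytes": b2}); the metric names here are placeholders, not the monitoring system's actual metric identifiers.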