diff --git a/paper/main.tex b/paper/main.tex
index e8c65e1..30cd238 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -94,7 +94,7 @@ This allows staff to understand the usage of the exhibited behavior better and t
 \medskip
 In this paper, a methodology to rank the similarity of all jobs to a reference job based on their temporal I/O behavior is described.
-Practically, we apply several previously developed time series algorithms and also utilize Kolmogorov-Smirnov to compare the distribution of the statistics.
+Practically, we apply several previously developed time series algorithms and also utilize the Kolmogorov-Smirnov test to compare the distributions of the statistics.
 A study is conducted to explore the effectiveness of the approach which starts from three reference jobs and investigates related jobs.
-The data stems from DKRZ's supercomputer Mistral and includes more than 500.000 jobs that have been executed for more than 6 months of operation.
+The data stems from DKRZ's supercomputer Mistral and includes more than 500,000 jobs that have been executed during more than 6 months of operation.
-Our analysis shows that the strategy and algorithms are effective to identify similar jobs and revealed interesting patterns in the data.
+Our analysis shows that the strategy and algorithms are effective in identifying similar jobs and reveals interesting patterns in the data.
 It also shows the need for the community to jointly define the semantics of similarity depending on the analysis purpose.
@@ -202,12 +202,18 @@ By pre-filtering jobs with no I/O activity -- their sum across all dimensions an
 We reuse the algorithms developed in \cite{Eugen20HPS}: B-all, B-aggz(eros), Q-native, Q-lev, and Q-phases.
 They differ in the way data similarity is defined; either the binary or hexadecimal coding is used, the distance measure is mostly the Euclidean distance or the Levenshtein-distance.
 For jobs with different lengths, we apply a sliding-windows approach which finds the location for the shorter job in the long job with the highest similarity.
-\todo{evtl. describe each algorithm with one sentence?}
+
+B-all determines the similarity between binary codings by means of the Levenshtein distance.
+B-aggz is similar to B-all but computes the similarity on binary codings in which consecutive segments of zero activity are replaced by a single zero.
+Q-lev determines the similarity between quantized codings using the Levenshtein distance.
+Q-native uses a performance-aware similarity function instead of the Levenshtein distance.
+Q-phases extracts phase information and performs a phase-aware and performance-aware similarity computation.
+KS concatenates the individual node data (instead of averaging it) and computes the similarity by means of the Kolmogorov-Smirnov test.
+
 The Q-phases algorithm extracts I/O phases and computes the similarity between the most similar I/O phases of both jobs.
-In this paper, we add a new similarity definition based on Kolmogorov-Smirnov-Test that compares the probability distribution of the observed values which we describe in the following.
+In this paper, we add a new similarity definition based on the Kolmogorov-Smirnov test that compares the probability distributions of the observed values, which we describe in the following.
-
-\paragraph{Kolmogorov-Smirnov (kv) algorithm}
+\paragraph{Kolmogorov-Smirnov (KS) algorithm}
 % Summary
 For the analysis, we perform two preparation steps.
 Dimension reduction by computing means across the two file systems and by concatenating the time series data of the individual nodes.
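For illustration, the sketch below shows how a Levenshtein-based similarity of the B-all/Q-lev kind can be computed on such codings. The normalization by the longer coding's length and the function names are assumptions made for this sketch, not the paper's verified definitions.

```python
# A minimal sketch (not the paper's exact implementation) of a
# Levenshtein-based similarity on codings, as used by B-all and Q-lev.
# The normalization by the longer coding's length is an assumption made
# for illustration; function names are hypothetical.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two codings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def lev_similarity(coding_a: str, coding_b: str) -> float:
    """Map the edit distance to [0, 1]; 1.0 means identical codings."""
    longest = max(len(coding_a), len(coding_b), 1)
    return 1.0 - levenshtein(coding_a, coding_b) / longest

# Two hexadecimal codings of a segmented timeline, differing in one segment:
print(lev_similarity("0023fa00", "0023fb00"))  # -> 0.875
```

For jobs of different lengths, the same score would be evaluated at every offset of the shorter coding within the longer one and the best match kept, mirroring the sliding-window approach described in the hunk above.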
@@ -229,7 +235,7 @@ For the analysis we use the kolmogorov-smirnov-test 1.1.0 Rust library from the
-The similarity function \Cref{eq:ks_similarity} calculates the mean inverse of reject probability $p_{\text{reject}}$ computed with the ks-test across all metrics $m$.
+The similarity function in \Cref{eq:ks_similarity} calculates the mean complement of the reject probability $p_{\text{reject}}$ computed with the KS test across all metrics $m$.
 \begin{equation}\label{eq:ks_similarity}
-  similarity = \frac{\sum_m 1 - p_{\text{reject}(m)}}{m}
+  similarity = \frac{\sum_{m \in M} \left(1 - p_{\text{reject}}(m)\right)}{|M|}, \quad \text{where } M \text{ is the set of metrics}
 \end{equation}
@@ -261,7 +267,7 @@ For this study, we chose several reference jobs with different compute and IO ch
 \end{itemize}
 The segmented timelines of the jobs are visualized in \Cref{fig:refJobs} -- remember that the mean value is computed across all nodes.
-This coding is also used for the Q class of algorithms, thus this representation is what the algorithms will analyze; BIN algorithms merge all timelines together as described in \cite{Eugen20HPS}.
+This coding is also used for the Q class of algorithms, thus this representation is what the algorithms will analyze; the B algorithms merge all timelines together as described in \cite{Eugen20HPS}.
 The figures show the values of active metrics ($\neq 0$); if few are active then they are shown in one timeline, otherwise, they are rendered individually to provide a better overview.
-For example, we can see in \Cref{fig:job-S}, that several metrics increase in Segment\,6.
+For example, we can see in \Cref{fig:job-S} that several metrics increase in Segment\,6.
@@ -359,7 +365,7 @@ The runtime is normalized for 100k jobs, i.e., for B-all it takes about 41\,s to
-Generally, the B algorithms are fastest, while the Q algorithms take often 4-5x as long.
+Generally, the B algorithms are fastest, while the Q algorithms often take 4-5x as long.
-Q\_phases is slow for Job-S and Job-M while it is fast for Job-L, the reason is that just one phase is extracted for Job-L.
+Q-phases is slow for Job-S and Job-M while it is fast for Job-L; the reason is that just one phase is extracted for Job-L.
-The Levenshtein based algorithms take longer for longer jobs -- proportional to the job length as it applies a sliding window.
+The Levenshtein-based algorithms take longer for longer jobs -- proportional to the job length as they apply a sliding window.
-The KS algorithm is faster than the others by 10x but it operates on the statistics of the time series.
+The KS algorithm is faster than the others by 10x, but it operates on the statistics of the time series.
 Note that the current algorithms are sequential and executed on just one core.
 For computing the similarity to one (or a small set of reference jobs), they could easily be parallelized.
 We believe this will then allow an online analysis.
@@ -478,7 +484,7 @@ To confirm the hypotheses presented, we analyzed the job metadata comparing job
 \paragraph{User distribution.}
 To understand how the Top\,100 are distributed across users, the data is grouped by userid and counted.
 \Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs.
-For Job-S, we can see that about 70-80\% of jobs stem from one user, for the Q-lev and Q-native algorithms, the other jobs stem from a second user while bin includes jobs from additional users (5 in total).
+For Job-S, we can see that about 70-80\% of jobs stem from one user; for the Q-lev and Q-native algorithms, the other jobs stem from a second user, while the B algorithms include jobs from additional users (5 in total).
-For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user; here, Q-lev, Q-native, and KS is including more users (29, 33, and 37, respectively) than the other three algorithms.
+For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user; here, Q-lev, Q-native, and KS include more users (29, 33, and 37, respectively) than the other three algorithms.
-For Job-L, the two Q algorithms include with (12 and 13) a bit more diverse user community than the B algorithms (9) but Q-phases cover 35 users.
+For Job-L, the two Q algorithms include a slightly more diverse user community (12 and 13 users) than the B algorithms (9), but Q-phases covers 35 users.
-We didn't include the group analysis in the figure as user count and group id is proportional, at most the number of users is 2x the number of groups.
+We didn't include the group analysis in the figure as the user and group counts are proportional; at most, the number of users is 2x the number of groups.
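To make the KS definition concrete, here is a minimal sketch of \Cref{eq:ks_similarity} from the hunk above. It assumes that $p_{\text{reject}}(m)$ can be read as the complement of the two-sample KS test's p-value (one plausible reading, not verified against the paper), and it substitutes `scipy.stats.ks_2samp` for the Rust library named in the paper; the `ks_similarity` name and the dict-based data layout are illustrative.

```python
# A sketch of the KS similarity from eq:ks_similarity. It assumes that
# p_reject(m) is the complement of the two-sample KS test's p-value --
# one plausible reading, not a verified one -- and uses scipy's ks_2samp
# in place of the Rust library mentioned in the paper.
import numpy as np
from scipy.stats import ks_2samp

def ks_similarity(job_a: dict, job_b: dict) -> float:
    """job_*: metric name -> time series of all nodes concatenated (1-D array)."""
    metrics = sorted(job_a.keys() & job_b.keys())   # the set M of shared metrics
    total = 0.0
    for m in metrics:
        result = ks_2samp(job_a[m], job_b[m])       # compare value distributions
        p_reject = 1.0 - result.pvalue              # assumed reading of p_reject(m)
        total += 1.0 - p_reject
    return total / len(metrics)                     # divide by |M|

# Example with two synthetic metrics per job:
rng = np.random.default_rng(0)
a = {"md_read": rng.normal(size=500), "bytes_read": rng.normal(size=500)}
b = {"md_read": rng.normal(size=500), "bytes_read": rng.normal(1.0, size=500)}
print(ks_similarity(a, b))  # high p-value for md_read, low for bytes_read
```

Because the test only compares value distributions, this computation is independent of segment ordering, which is consistent with the observation above that KS operates on the statistics of the time series rather than their temporal shape.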
@@ -567,9 +573,9 @@ To verify that the different algorithms behave differently, the intersection
-for Bin\_all and B-aggz overlap with at least 99 ranks for all three jobs.
+for B-all and B-aggz overlaps with at least 99 ranks for all three jobs.
 While there is some reordering, both algorithms lead to a comparable set.
 All algorithms have a significant overlap for Job-S.
-For Job\-M, however, they lead to a different ranking, and Top\,100, particularly KS determines a different set.
+For Job-M, however, they lead to a different ranking and Top\,100; particularly, KS determines a different set.
-Generally, Q-lev and Q\_native are generating more similar results than other algorithms.
+Generally, Q-lev and Q-native generate more similar results than the other algorithms.
-From this analysis, we conclude that one representative from binary quantization is sufficient as it generates very similar results while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be analyzed individually.
+From this analysis, we conclude that one representative of the binarization algorithms is sufficient as they generate very similar results, while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be analyzed individually.
 \begin{figure}
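For reference, the overlap computation behind this intersection analysis amounts to a set intersection of the Top-100 lists; the per-algorithm data layout in the sketch below (lists of job IDs ordered by rank) is a hypothetical one chosen for illustration.

```python
# A sketch of the Top-100 overlap check between two algorithms' rankings.
# The data layout (per-algorithm lists of job IDs ordered by rank) is an
# assumption for illustration.

def top100_overlap(ranking_a: list, ranking_b: list) -> int:
    """Number of job IDs that appear in both algorithms' Top-100."""
    return len(set(ranking_a[:100]) & set(ranking_b[:100]))

# Toy example with three-job rankings:
rankings = {
    "B-all":  ["job1", "job2", "job3"],
    "B-aggz": ["job2", "job1", "job4"],
}
print(top100_overlap(rankings["B-all"], rankings["B-aggz"]))  # -> 2
```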