Rename algorithms
This commit is contained in:
parent
06a7e2441f
commit
922969dcad
|
@ -200,10 +200,10 @@ After data is reduced across nodes, we quantize the timelines either using binar
|
|||
By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero -- we reduce the dataset from about 1 million jobs to about 580k jobs.
|
||||
|
||||
\subsection{Algorithms for Computing Similarity}
|
||||
We reuse the algorithms developed in \cite{Eugen20HPS}: BIN\_all, BIN\_aggzeros, HEX\_native, HEX\_lev, and HEX\_quant.
|
||||
We reuse the algorithms developed in \cite{Eugen20HPS}: B-all, B-aggzeros, Q-native, Q-lev, and Q-quant.
|
||||
They differ in the way data similarity is defined: either the binary or the hexadecimal coding is used, and the distance measure is mostly the Euclidean distance or the Levenshtein distance.
|
||||
For jobs with different lengths, we apply a sliding-window approach that finds the location of the shorter job within the longer job that yields the highest similarity.
|
||||
The HEX\_quant algorithm extracts I/O phases and computes the similarity between the most similar I/O phases of both jobs.
|
||||
The Q-quant algorithm extracts I/O phases and computes the similarity between the most similar I/O phases of both jobs.
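The sliding-window approach for jobs of different lengths can be written compactly. As an illustrative sketch (the symbols $a$, $b$, $o$, and $\mathit{sim}$ are notation introduced for this example and are not taken from \cite{Eugen20HPS}), let $a$ be the coding of the shorter job with $|a|$ segments and $b$ the coding of the longer job with $|b|$ segments:
\begin{equation}
  \mathit{SIM}(a, b) = \max_{0 \le o \le |b| - |a|} \mathit{sim}\left(a,\, (b_{o+1}, \ldots, b_{o+|a|})\right)
\end{equation}
Here, $\mathit{sim}$ denotes the similarity of two equally long codings under the chosen distance measure, and $o$ is the offset of the window within the longer job.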
|
||||
In this paper, we add a new similarity definition based on the Kolmogorov-Smirnov test, which compares the probability distributions of the observed values; we describe it in the following.
|
||||
|
||||
\paragraph{Kolmogorov-Smirnov (KS) algorithm}
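For orientation, the two-sample Kolmogorov-Smirnov statistic compares the empirical distribution functions $F_a$ and $F_b$ of the observed values of two jobs $a$ and $b$. The symbols and the mapping to a similarity value below are assumptions made for this illustration and not necessarily the exact definition used by the implementation:
\begin{equation}
  D_{a,b} = \sup_x \left| F_a(x) - F_b(x) \right|, \qquad \mathit{SIM}_{a,b} = 1 - D_{a,b}
\end{equation}
Since $0 \le D_{a,b} \le 1$, identical histograms yield $D_{a,b} = 0$ and hence, under this assumed mapping, a similarity of 100\%.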
|
||||
|
@ -260,11 +260,11 @@ For this study, we chose several reference jobs with different compute and IO ch
|
|||
\end{itemize}
|
||||
|
||||
The segmented timelines of the jobs are visualized in \Cref{fig:refJobs} -- remember that the mean value is computed across all nodes.
|
||||
This coding is also used for the HEX class of algorithms, thus this representation is what the algorithms will analyze; BIN algorithms merge all timelines together as described in \cite{Eugen20HPS}.
|
||||
This coding is also used for the Q class of algorithms, thus this representation is what the algorithms will analyze; BIN algorithms merge all timelines together as described in \cite{Eugen20HPS}.
|
||||
The figures show the values of active metrics ($\neq 0$); if few are active then they are shown in one timeline, otherwise, they are rendered individually to provide a better overview.
|
||||
For example, we can see in \Cref{fig:job-S}, that several metrics increase in Segment\,6.
|
||||
|
||||
In \Cref{fig:refJobsHist}, the histograms of the job metrics are shown in HEX coding (16 steps).
|
||||
In \Cref{fig:refJobsHist}, the histograms of the job metrics are shown in Q coding (16 steps).
|
||||
The histogram contains the activities of each node at every timestep -- without being averaged across the nodes.
|
||||
This data is used to compare jobs using the Kolmogorov-Smirnov test.
|
||||
The metrics at Job-L are not shown as they have only a handful of instances where the value is not 0, except for write\_bytes: the first process is writing out at a low rate.
|
||||
|
@ -354,7 +354,7 @@ Finally, the quantitative behavior of the 100 most similar jobs is investigated.
|
|||
|
||||
To measure the performance for computing the similarity to the reference jobs, the algorithms are executed 10 times on a compute node at DKRZ which is equipped with two Intel Xeon E5-2680v3 @2.50GHz and 64GB DDR4 RAM.
|
||||
A boxplot for the runtimes is shown in \Cref{fig:performance}.
|
||||
The runtime is normalized for 100k jobs, i.e., for BIN\_all it takes about 41\,s to process 100k jobs out of the 500k total jobs that this algorithm will process.
|
||||
The runtime is normalized for 100k jobs, i.e., for B-all it takes about 41\,s to process 100k jobs out of the 500k total jobs that this algorithm will process.
|
||||
Generally, the bin algorithms are fastest, while the hex algorithms often take 4-5x as long.
|
||||
Hex\_phases is slow for Job-S and Job-M while it is fast for Job-L; the reason is that just one phase is extracted for Job-L.
|
||||
The Levenshtein-based algorithms take longer for longer jobs -- proportional to the job length, as they apply a sliding window.
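The normalization of the runtime can be stated explicitly. As a sketch, with $t$ denoting the measured runtime of an algorithm and $N$ the number of jobs it processed (both symbols are introduced here for illustration only):
\begin{equation}
  t_{100\mathrm{k}} = t \cdot \frac{100{,}000}{N}
\end{equation}
For instance, a normalized runtime of 41\,s for $N = 500{,}000$ jobs corresponds to a measured total runtime of about $5 \cdot 41\,\mathrm{s} = 205\,\mathrm{s}$.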
|
||||
|
@ -390,9 +390,9 @@ We believe this will then allow an online analysis.
|
|||
|
||||
In the quantitative analysis, we explore how, for the different algorithms, the similarity of our pool of jobs to our reference jobs behaves.
|
||||
The cumulative distribution of similarity to a reference job is shown in \Cref{fig:ecdf}.
|
||||
For example, in \Cref{fig:ecdf-job-S}, we see that about 70\% have a similarity of less than 10\% to Job-S for HEX\_native.
|
||||
BIN\_aggzeros shows some steep increases, e.g., more than 75\% of jobs have the same low similarity below 2\%.
|
||||
The different algorithms lead to different curves for our reference jobs, e.g., for Job-S, HEX\_phases bundles more jobs with low similarity compared to the other jobs; in Job-L, it is the slowest.
|
||||
For example, in \Cref{fig:ecdf-job-S}, we see that about 70\% have a similarity of less than 10\% to Job-S for Q-native.
|
||||
B-aggzeros shows some steep increases, e.g., more than 75\% of jobs have the same low similarity below 2\%.
|
||||
The different algorithms lead to different curves for our reference jobs, e.g., for Job-S, Q-phases bundles more jobs with low similarity compared to the other jobs; in Job-L, it is the slowest.
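The curves can be read as empirical cumulative distribution functions. As a sketch, with $N$ jobs in the pool and $\mathit{SIM}_j$ the similarity of job $j$ to the reference job (notation introduced here for illustration):
\begin{equation}
  \hat{F}(s) = \frac{1}{N} \left| \{\, j : \mathit{SIM}_j \le s \,\} \right|
\end{equation}
In this notation, the Q-native observation for Job-S reads $\hat{F}(10\%) \approx 0.7$.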
|
||||
% This indicates that the algorithms
|
||||
|
||||
The support team in a data center may have time to investigate the most similar jobs.
|
||||
|
@ -401,9 +401,9 @@ In \Cref{fig:hist}, the histograms with the actual number of jobs for a given si
|
|||
As we focus on a feasible number of jobs, the diagram should be read from the right (100\% similarity) to left; and for a bin we show at most 100 jobs (total number is still given).
|
||||
It turns out that both BIN algorithms produce nearly identical histograms and we omit one of them.
|
||||
In the figures, we can see again a different behavior of the algorithms depending on the reference job.
|
||||
Especially for Job-S, we can see clusters with jobs of higher similarity (e.g., at HEX\_lev at SIM=75\%) while for Job-M, the growth in the relevant section is more steady.
|
||||
For Job-L, we find barely similar jobs, except when using the HEX\_phases and KS algorithms.
|
||||
HEX\_phases find 393 jobs that have a similarity of 100\%, thus they are indistinguishable, while KS identifies 6880 jobs with a similarity of at least 97.5\%.
|
||||
Especially for Job-S, we can see clusters with jobs of higher similarity (e.g., at Q-lev at SIM=75\%) while for Job-M, the growth in the relevant section is more steady.
|
||||
For Job-L, we find barely similar jobs, except when using the Q-phases and KS algorithms.
|
||||
Q-phases finds 393 jobs that have a similarity of 100\%, thus they are indistinguishable, while KS identifies 6880 jobs with a similarity of at least 97.5\%.
|
||||
|
||||
Practically, the support team would start with Rank\,1 (most similar job, presumably, the reference job itself) and walk down until the jobs look different, or until a cluster of jobs with close similarity is analyzed.
|
||||
|
||||
|
@ -455,7 +455,7 @@ Practically, the support team would start with Rank\,1 (most similar job, presum
|
|||
\caption{Job-L} \label{fig:hist-job-L}
|
||||
\end{subfigure}
|
||||
\centering
|
||||
\caption{Histogram for the number of jobs (bin width: 2.5\%, numbers are the actual job counts). BIN\_aggzeros is nearly identical to BIN\_all.}
|
||||
\caption{Histogram for the number of jobs (bin width: 2.5\%, numbers are the actual job counts). B-aggzeros is nearly identical to B-all.}
|
||||
\label{fig:hist}
|
||||
\end{figure}
|
||||
|
||||
|
@ -477,9 +477,9 @@ To confirm the hypotheses presented, we analyzed the job metadata comparing job
|
|||
\paragraph{User distribution.}
|
||||
To understand how the Top\,100 are distributed across users, the data is grouped by userid and counted.
|
||||
\Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs.
|
||||
For Job-S, we can see that about 70-80\% of jobs stem from one user, for the HEX\_lev and HEX\_native algorithms, the other jobs stem from a second user while bin includes jobs from additional users (5 in total).
|
||||
For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user; here, HEX\_lev, HEX\_native, and KS is including more users (29, 33, and 37, respectively) than the other three algorithms.
|
||||
For Job-L, the two hex algorithms include with (12 and 13) a bit more diverse user community than the bin algorithms (9) but HEX\_phases cover 35 users.
|
||||
For Job-S, we can see that about 70-80\% of jobs stem from one user; for the Q-lev and Q-native algorithms, the other jobs stem from a second user, while bin includes jobs from additional users (5 in total).
|
||||
For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user; here, Q-lev, Q-native, and KS include more users (29, 33, and 37, respectively) than the other three algorithms.
|
||||
For Job-L, the two hex algorithms include a slightly more diverse user community (12 and 13 users) than the bin algorithms (9), but Q-phases covers 35 users.
|
||||
We did not include the group analysis in the figure as the user count and the group count are proportional; the number of users is at most 2x the number of groups.
|
||||
Thus, a user is likely from the same group and the number of groups is similar to the number of unique users.
|
||||
|
||||
|
@ -493,8 +493,8 @@ The boxplots have different shapes which is an indication, that the different al
|
|||
|
||||
\paragraph{Runtime distribution.}
|
||||
The job runtime of the Top\,100 jobs is shown using boxplots in \Cref{fig:runtime-job}.
|
||||
While all algorithms can compute the similarity between jobs of different length, the bin algorithms and HEX\_native penalize jobs of different length preferring jobs of very similar length.
|
||||
For Job-M and Job-L, HEX\_phases and KS are able to identify much shorter or longer jobs.
|
||||
While all algorithms can compute the similarity between jobs of different length, the bin algorithms and Q-native penalize jobs of different length preferring jobs of very similar length.
|
||||
For Job-M and Job-L, Q-phases and KS are able to identify much shorter or longer jobs.
|
||||
For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:hist-job-L}, 393 jobs have a similarity of 100\%) which is the reason why the job runtime isn't shown in the figure itself.
|
||||
|
||||
\begin{figure}
|
||||
|
@ -536,7 +536,7 @@ For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:h
|
|||
\caption{Job-L (reference job runs on 20 nodes)} \label{fig:nodes-job-L}
|
||||
\end{subfigure}
|
||||
\centering
|
||||
\caption{Distribution of node counts (for Job-S nodes=1 in all cases))}
|
||||
\caption{Distribution of node counts for Top 100 (for Job-S always nodes=1)}
|
||||
\label{fig:nodes-job}
|
||||
\end{figure}
|
||||
|
||||
|
@ -563,11 +563,11 @@ For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:h
|
|||
|
||||
\subsubsection{Algorithmic differences}
|
||||
To verify that the different algorithms behave differently, the intersection for the Top\,100 is computed for all combinations of algorithms and visualized in \Cref{fig:heatmap-job}.
|
||||
Bin\_all and BIN\_aggzeros overlap with at least 99 ranks for all three jobs.
|
||||
B-all and B-aggzeros overlap with at least 99 ranks for all three jobs.
|
||||
While there is some reordering, both algorithms lead to a comparable set.
|
||||
All algorithms have a significant overlap for Job-S.
|
||||
For Job-M, however, they lead to a different ranking and Top\,100; in particular, KS determines a different set.
|
||||
Generally, HEX\_lev and Hex\_native are generating more similar results than other algorithms.
|
||||
Generally, Q-lev and Q-native generate more similar results than the other algorithms.
|
||||
From this analysis, we conclude that one representative from binary quantization is sufficient as it generates very similar results while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be analyzed individually.
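The overlap shown in the heatmap can be formalized as follows. As a sketch, let $T_A$ and $T_B$ denote the sets of the Top\,100 jobs found by algorithms $A$ and $B$ for the same reference job (symbols introduced here for illustration):
\begin{equation}
  \mathit{overlap}(A, B) = \left| T_A \cap T_B \right|, \qquad 0 \le \mathit{overlap}(A, B) \le 100
\end{equation}
An overlap of at least 99 thus means that at most one job of either Top\,100 set is missing from the other; the order within the sets is not considered.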
|
||||
|
||||
|
||||
|
@ -622,17 +622,17 @@ For Job-S, we found that all algorithms work well and, therefore, omit further t
|
|||
\begin{table}[bt]
|
||||
\centering
|
||||
\begin{tabular}{r|r|r|r|r|r}
|
||||
BIN\_aggzeros & BIN\_all & HEX\_lev & HEX\_native & HEX\_phases & KS\\ \hline
|
||||
B-aggzeros & B-all & Q-lev & Q-native & Q-phases & KS\\ \hline
|
||||
38 & 38 & 33 & 26 & 33 & 0
|
||||
\end{tabular}
|
||||
|
||||
%\begin{tabular}{r|r}
|
||||
% Algorithm & Jobs \\ \hline
|
||||
% BIN\_aggzeros & 38 \\
|
||||
% BIN\_all & 38 \\
|
||||
% HEX\_lev & 33 \\
|
||||
% HEX\_native & 26 \\
|
||||
% HEX\_phases & 33 \\
|
||||
% B-aggzeros & 38 \\
|
||||
% B-all & 38 \\
|
||||
% Q-lev & 33 \\
|
||||
% Q-native & 26 \\
|
||||
% Q-phases & 33 \\
|
||||
% KS & 0
|
||||
%\end{tabular}
|
||||
\caption{Job-S: number of jobs with “control” in their name in the Top-100}
|
||||
|
@ -653,7 +653,7 @@ For Job-S, we found that all algorithms work well and, therefore, omit further t
|
|||
\caption{Non-control job: Rank\,4, SIM=81\%}
|
||||
\end{subfigure}
|
||||
|
||||
\caption{Job-S: jobs with different job names when using BIN\_aggzeros}
|
||||
\caption{Job-S: jobs with different job names when using B-aggzeros}
|
||||
\label{fig:job-S-bin-agg}
|
||||
\end{figure}
|
||||
|
||||
|
@ -675,7 +675,7 @@ For Job-S, we found that all algorithms work well and, therefore, omit further t
|
|||
\caption{Rank\,100, SIM=79\%}
|
||||
\end{subfigure}
|
||||
|
||||
\caption{Job-S with Hex-Lev, selection of similar jobs}
|
||||
\caption{Job-S with Q-Lev, selection of similar jobs}
|
||||
\label{fig:job-S-hex-lev}
|
||||
\end{figure}
|
||||
|
||||
|
@ -718,7 +718,7 @@ For Job-S, we found that all algorithms work well and, therefore, omit further t
|
|||
% \includegraphics[width=\textwidth]{job_similarities_4296426-out/bin_aggzeros-0.6923--99timeseries4687419}
|
||||
% \caption{Rank\,100, SIM=}
|
||||
% \end{subfigure}
|
||||
% \caption{Job-S with BIN\_aggzero, selection of similar jobs}
|
||||
% \caption{Job-S with B-aggzero, selection of similar jobs}
|
||||
% \label{fig:job-S-bin-aggzeros}
|
||||
% \end{figure}
|
||||
|
||||
|
@ -729,11 +729,11 @@ Inspecting the Top\,100 for this reference job is highlighting the differences b
|
|||
All algorithms identify a diverse range of job names for this reference job in the Top\,100.
|
||||
Firstly, the name of the reference job appears 30 times in the whole dataset.
|
||||
So this job type isn't necessarily executed frequently and, therefore, our Top\,100 is expected to contain other names.
|
||||
Some applications are more prominent in these sets, e.g., for BIN\_aggzero, 32~jobs contain WRF (a model) in the name.
|
||||
The number of unique names is 19, 38, 49, and 51 for BIN\_aggzero, HEX\_phases, HEX\_native and HEX\_lev, respectively.
|
||||
Some applications are more prominent in these sets, e.g., for B-aggzero, 32~jobs contain WRF (a model) in the name.
|
||||
The number of unique names is 19, 38, 49, and 51 for B-aggzero, Q-phases, Q-native and Q-lev, respectively.
|
||||
|
||||
The jobs that are similar according to the bin algorithms (see \Cref{fig:job-M-bin-aggzero}) differ from our expectations.
|
||||
The other algorithms like HEX\_lev (\Cref{fig:job-M-hex-lev}) and HEX\_native (\Cref{fig:job-M-hex-native}) seem to work as intended:
|
||||
The other algorithms like Q-lev (\Cref{fig:job-M-hex-lev}) and Q-native (\Cref{fig:job-M-hex-native}) seem to work as intended:
|
||||
While jobs exhibit short bursts of other active metrics, even for low similarity we can eyeball a relevant similarity.
|
||||
The KS algorithm working on the histograms ranks the jobs correctly on the similarity of their histograms.
|
||||
However, as it does not deal with the length of the jobs, it may identify jobs of very different length.
|
||||
|
@ -807,7 +807,7 @@ Remember, for the KS algorithm, we concatenate the metrics of all nodes together
|
|||
\caption{Rank\,100, SIM=70\%}
|
||||
\end{subfigure}
|
||||
|
||||
\caption{Job-M with HEX\_lev, selection of similar jobs}
|
||||
\caption{Job-M with Q-lev, selection of similar jobs}
|
||||
\label{fig:job-M-hex-lev}
|
||||
\end{figure}
|
||||
|
||||
|
@ -838,7 +838,7 @@ Remember, for the KS algorithm, we concatenate the metrics of all nodes together
|
|||
\caption{Rank 3, SIM=97\%}
|
||||
\end{subfigure}
|
||||
|
||||
\caption{Job-M with HEX\_native, selection of similar jobs}
|
||||
\caption{Job-M with Q-native, selection of similar jobs}
|
||||
\label{fig:job-M-hex-native}
|
||||
\end{figure}
|
||||
|
||||
|
@ -864,7 +864,7 @@ Remember, for the KS algorithm, we concatenate the metrics of all nodes together
|
|||
% \caption{Rank 100, SIM=24\%}
|
||||
% \end{subfigure}
|
||||
%
|
||||
% \caption{Job-M with HEX\_phases, selection of similar jobs}
|
||||
% \caption{Job-M with Q-phases, selection of similar jobs}
|
||||
% \label{fig:job-M-hex-phases}
|
||||
% \end{figure}
|
||||
|
||||
|
@ -873,9 +873,9 @@ Remember, for the KS algorithm, we concatenate the metrics of all nodes together
|
|||
The bin algorithms find a low similarity (best 2nd ranked job is 17\% similar); the inspection of job names (14 unique names) reveals two prominent applications: bash and xmessy with 45 and 48 instances, respectively.
|
||||
In \Cref{fig:job-L-bin-aggzero}, it can be seen that the found jobs have little in common with the reference job.
|
||||
|
||||
The HEX\_lev and HEX\_native algorithms identify a more diverse set of applications (18 unique names and no xmessy job).
|
||||
HEX\_native \Cref{fig:job-L-hex-native} finds long jobs where the only few activity as our reference job.
|
||||
The HEX\_phases algorithm finds 85 unique names but as there is only one short IO phase in the reference job, it finds many (short) jobs with 100\% similarity as seen in \Cref{fig:job-L-hex-phases}.
|
||||
The Q-lev and Q-native algorithms identify a more diverse set of applications (18 unique names and no xmessy job).
|
||||
Q-native (\Cref{fig:job-L-hex-native}) finds long jobs with only little activity, like our reference job.
|
||||
The Q-phases algorithm finds 85 unique names but as there is only one short IO phase in the reference job, it finds many (short) jobs with 100\% similarity as seen in \Cref{fig:job-L-hex-phases}.
|
||||
The KS algorithm is even more inclusive having 1285 jobs with 100\% similarity; the 100 selected ones contain 71 jobs ending with t127, which is a typical model configuration.
|
||||
As expected, the histograms mimic the profile of the reference job, and thus, the algorithm does what it is expected to do.
|
||||
|
||||
|
@ -901,7 +901,7 @@ As expected, the histograms mimics the profile of the reference job, and thus, t
|
|||
\caption{Rank 100, SIM=11\%}
|
||||
\end{subfigure}
|
||||
|
||||
\caption{Job-L with BIN\_aggzero, selection of similar jobs}
|
||||
\caption{Job-L with B-aggzero, selection of similar jobs}
|
||||
\label{fig:job-L-bin-aggzero}
|
||||
\end{figure}
|
||||
|
||||
|
@ -927,7 +927,7 @@ As expected, the histograms mimics the profile of the reference job, and thus, t
|
|||
% % \caption{Rank 100, SIM=17\%}
|
||||
% % \end{subfigure}
|
||||
%
|
||||
% \caption{Job-L with HEX\_lev, selection of similar jobs}
|
||||
% \caption{Job-L with Q-lev, selection of similar jobs}
|
||||
% \label{fig:job-L-hex-lev}
|
||||
% \end{figure}
|
||||
|
||||
|
@ -953,7 +953,7 @@ As expected, the histograms mimics the profile of the reference job, and thus, t
|
|||
% \caption{Rank 100, SIM=17\%}
|
||||
% \end{subfigure}
|
||||
|
||||
\caption{Job-L with HEX\_native, selection of similar jobs}
|
||||
\caption{Job-L with Q-native, selection of similar jobs}
|
||||
\label{fig:job-L-hex-native}
|
||||
\end{figure}
|
||||
|
||||
|
@ -978,7 +978,7 @@ As expected, the histograms mimics the profile of the reference job, and thus, t
|
|||
\caption{Rank 100, SIM=100\%}
|
||||
\end{subfigure}
|
||||
|
||||
\caption{Job-L with HEX\_phases, selection of similar jobs}
|
||||
\caption{Job-L with Q-phases, selection of similar jobs}
|
||||
\label{fig:job-L-hex-phases}
|
||||
\end{figure}
|
||||
|
||||
|
@ -997,7 +997,7 @@ Job-L is tricky to analyze, because it is compute intense with only a single I/O
|
|||
Generally, the KS algorithm finds jobs with similar histograms which are not necessarily what we subjectively are looking for.
|
||||
|
||||
We found that the approach of computing the similarity of a reference job to all jobs and ranking them by their similarity was successful in finding related jobs that we were interested in.
|
||||
The HEX\_lev and HEX\_native work best according to our subjective qualitative analysis.
|
||||
The Q-lev and Q-native algorithms work best according to our subjective qualitative analysis.
|
||||
Typically, a related job stems from the same user/group and may have a related job name but the approach was inclusive.
|
||||
However, all algorithms perform their task as intended.
|
||||
The pre-processing of the algorithms and distance metrics differ leading to a different definition of similarity.
|
||||
|
|