Description

This commit is contained in:
Julian M. Kunkel 2020-08-26 15:09:14 +01:00
parent c2274b6c79
commit 4778135ccd
2 changed files with 43 additions and 17 deletions

View File

@ -162,7 +162,7 @@ For example, we can see in \Cref{fig:job-S}, that several metrics increase in Se
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries4296426}
\caption{Job-S} \label{fig:job-S}
\caption{Job-S (runtime=15,551\,s, segments=25)} \label{fig:job-S}
\end{subfigure}
\centering
@ -170,7 +170,7 @@ For example, we can see in \Cref{fig:job-S}, that several metrics increase in Se
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries5024292}
\caption{Job-M} \label{fig:job-M}
\caption{Job-M (runtime=28,828\,s, segments=48)} \label{fig:job-M}
\end{subfigure}
\centering
@ -213,17 +213,17 @@ We believe this will then allow a near-online analysis of a job.
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{progress_4296426-out-boxplot}
\caption{Job-S (runtime=15,551\,s, segments=25)} \label{fig:perf-job-S}
\caption{Job-S (segments=25)} \label{fig:perf-job-S}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{progress_5024292-out-boxplot}
\caption{Job-M (runtime=28,828\,s, segments=48)} \label{fig:perf-job-M}
\caption{Job-M (segments=48)} \label{fig:perf-job-M}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{progress_7488914-out-boxplot}
\caption{Job-L} \label{fig:perf-job-L}
\caption{Job-L (segments=400)} \label{fig:perf-job-L}
\end{subfigure}
\caption{Runtime of the algorithms to compute the similarity to reference jobs}
@ -241,13 +241,14 @@ The different algorithms lead to different curves for our reference jobs, e.g.,
% This indicates that the algorithms
The support team in a data center may have time to investigate the most similar jobs.
Time for the analysis is typically bound, for instance, the team may analyze the 100 most similar ranked jobs (the Top\,100).
Time for the analysis is typically bound, for instance, the team may analyze the 100 most similar ranked jobs; we refer to them as the Top\,100 jobs, and Rank\,i refers to the job that has the i-th highest similarity to the reference job -- sometimes these values can be rather close together as we see in the following histogram.
In \Cref{fig:hist}, the histograms with the actual number of jobs for a given similarity are shown.
As we focus on a feasible number of jobs, the diagram should be read from right (100\% similarity) to left; and for a bin we show at most 100 jobs (total number is still given).
It turns out that both BIN algorithms produce nearly identical histograms and we omit one of them.
In the figures, we can see again a different behavior of the algorithms depending on the reference job.
Especially for Job-S, we can see clusters with jobs of higher similarity (e.g., at hex\_lev at SIM=75\%) while for Job-M, the growth in the relevant section is more steady.
For Job-L, we find barely similar jobs, except when using the HEX\_phases algorithm.
This algorithm finds 393 jobs that have a similarity of 100\%, thus they are indistinguishable to the algorithm.
Practically, the support team would start with Rank\,1 (most similar job, presumably, the reference job itself) and walk down until the jobs look different, or until a cluster is analyzed.
@ -305,16 +306,38 @@ Practically, the support team would start with Rank\,1 (most similar job, presum
\subsubsection{Inclusivity and Specificity}
When analyzing the overall population of jobs executed on a system, we expect that some workloads are executed several times (with different inputs but with the same configuration) or are executed with slightly different configurations (e.g., node counts, timesteps).
Thus, potentially our similarity analysis of the job population may just identify the re-execution of the same workload.
User count and group id is the same, meaning that a user is likely from the same group and the number of groups is identical to the number of users (unique), for Job-L user id and group count differ a bit, for Job-M a bit more.
Up to about 2x users than groups.
To understand if the analysis is inclusive and identifies different applications, we use two approaches with our Top\,100 jobs:
We explore the distribution of users (and groups), runtime, and node count across jobs.
The algorithms should include different users, node counts, and across runtime.
To confirm hypotheses presented, we analyzed the job metadata comparing job names which validates our quantitative results discussed in the following.
\paragraph{User distribution.}
To understand how the Top\,100 are distributed across users, the data is grouped by userid and counted.
\Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the top most user in the stack has the smallest number of jobs.
For Job-S, we can see that about 70-80\% of jobs stem from one user, for the hex\_lev and hex\_native algorithms, the other jobs stem from a second user while bin includes jobs from additional users (5 in total).
For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user, here, hex\_lev and hex\_native is including more users (30 and 33, respectively) than the other three algorithms.
For Job-L, the two hex algorithms include with (12 and 13) a bit more diverse user community than the bin algorithms (9) but hex\_phases covers 35 users.
We didn't include the group analysis in the figure as user count and group id is proportional, at most the number of users is 2x the number of groups.
Thus, a user is likely from the same group and the number of groups is similar to the number of unique users.
\paragraph{Node distribution.}
All algorithms reduce over the node dimensions, therefore, we naturally expect a big inclusion across node range -- as long as the average I/O behavior of the jobs are similar.
\Cref{fig:nodes-job} shows a boxplot for the node counts in the Top\,100.
For Job-M and Job-L, we can observe that indeed the range of similar nodes is between 1 and 128.
For Job-S, all 100 most similar jobs use one node.
As post-processing jobs use typically one node and the number of postprocessing jobs is a high proportion, it appears natural that all Top\,100 are from this class of jobs which is confirmed by investigating the job metadata.
The boxplots have different shapes which is an indication, that the different algorithms identify a different set of jobs -- we will analyze this later further.
\paragraph{Runtime distribution.}
The runtime of the Top\,100 jobs is shown using boxplots in \Cref{fig:runtime-job}.
While all algorithms can compute the similarity between jobs of different length, the bin algorithms and hex\_native penalize jobs of different length leading to a narrow profile.
For Job-M and Job-L, hex\_phases is able to identify much shorter or longer jobs.
For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:hist-job-L}, 393 jobs have a similarity of 100\%) which is the reason why the job runtime isn't shown in the figure itself.
\begin{figure}
\begin{subfigure}{0.31\textwidth}
\centering
@ -362,32 +385,31 @@ For Job-L, the two hex algorithms include with (12 and 13) a bit more diverse us
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/jobs-elapsed}
\caption{Job-S ($job=10^{4.19}$)} \label{fig:runtime-job-S}
\caption{Job-S ($job=10^{4.19}s$)} \label{fig:runtime-job-S}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/jobs-elapsed}
\caption{Job-M ($job=10^{4.46}$)} \label{fig:runtime-job-M}
\caption{Job-M ($job=10^{4.46}s$)} \label{fig:runtime-job-M}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/jobs-elapsed}
\caption{Job-L ($job=10^{5.3}$)} \label{fig:runtime-job-L}
\caption{Job-L ($job=10^{5.3}s$)} \label{fig:runtime-job-L}
\end{subfigure}
\centering
\caption{Distribution of runtime for all 100 top ranked jobs}
\label{fig:runtime-job}
\end{figure}
To see how different the algorithms behave, the intersection of two algorithms is computed for the 100 jobs with the highest similarity and visualized in \Cref{fig:heatmap-job}.
As expected, we can observe that bin\_all and bin\_aggzeros is very similar for all three jobs.
\subsubsection{Algorithmic differences}
To verify that the different algorithms behave differently, the intersection for the Top\,100 is computed for all combination of algorithms and visualized in \Cref{fig:heatmap-job}.
As expected we can observe that bin\_all and bin\_aggzeros is very similar for all three jobs.
While there is some reordering, both algorithms lead to a comparable order.
The hex\_lev and hex\_native algorithms are also exhibiting some overlap particularly for Job-S and Job-L.
For Job\-M, however, they lead to a different ranking and Top\,100.
From the analysis, we conclude that one representative from binary quantization is sufficient while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be considered together.
One consideration is to identify jobs that meet a rank threshold for all different algorithms.
\jk{TODO}
\begin{figure}
\begin{subfigure}{0.31\textwidth}
@ -707,5 +729,8 @@ Bin aggzero liefert Mist zurück.
\section{Conclusion}
\label{sec:summary}
One consideration could be to identify jobs that are found by all algorithms, i.e., jobs that meet a certain (rank) threshold for different algorithms.
That would increase the likelihood that these jobs are very similar and what the user is looking for.
%\printbibliography
\end{document}

View File

@ -4,6 +4,7 @@ library(ggplot2)
library(dplyr)
require(scales)
# Turn to TRUE to print indivdiual job images
plotjobs = FALSE
# Color scheme
@ -123,10 +124,10 @@ for (alg_name in levels(data$alg_name)){
res.jobs = rbind(res.jobs, cbind(alg_name, metadata[metadata$jobid %in% result[, alg_name],]))
}
ggplot(res.jobs, aes(alg_name, total_nodes, fill=alg_name)) + geom_boxplot() + scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log2", function(x) 2^x), labels = trans_format("log2", math_format(2^.x))) + theme(legend.position = "none") + xlab("Algorithm")
ggplot(res.jobs, aes(alg_name, total_nodes, fill=alg_name)) + geom_boxplot() + scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log2", function(x) 2^x), labels = trans_format("log2", math_format(2^.x))) + theme(legend.position = "none") + xlab("Algorithm") + xlab("Job node count")
ggsave("jobs-nodes.png", width=6, height=4)
ggplot(res.jobs, aes(alg_name, elapsed, fill=alg_name)) + geom_boxplot() + scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) + ylab("Runtime in s") + xlab("Algorithm") + theme(legend.position = "none")
ggplot(res.jobs, aes(alg_name, elapsed, fill=alg_name)) + geom_boxplot() + ylab("Job runtime in s") + xlab("Algorithm") + theme(legend.position = "none") # scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
ggsave("jobs-elapsed.png", width=6, height=4)