This commit is contained in:
parent c2274b6c79
commit 4778135ccd
@@ -162,7 +162,7 @@ For example, we can see in \Cref{fig:job-S}, that several metrics increase in Se
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries4296426}
-\caption{Job-S} \label{fig:job-S}
+\caption{Job-S (runtime=15,551\,s, segments=25)} \label{fig:job-S}
\end{subfigure}
\centering
@@ -170,7 +170,7 @@ For example, we can see in \Cref{fig:job-S}, that several metrics increase in Se
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries5024292}
-\caption{Job-M} \label{fig:job-M}
+\caption{Job-M (runtime=28,828\,s, segments=48)} \label{fig:job-M}
\end{subfigure}
\centering
@@ -213,17 +213,17 @@ We believe this will then allow a near-online analysis of a job.
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{progress_4296426-out-boxplot}
-\caption{Job-S (runtime=15,551\,s, segments=25)} \label{fig:perf-job-S}
+\caption{Job-S (segments=25)} \label{fig:perf-job-S}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{progress_5024292-out-boxplot}
-\caption{Job-M (runtime=28,828\,s, segments=48)} \label{fig:perf-job-M}
+\caption{Job-M (segments=48)} \label{fig:perf-job-M}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{progress_7488914-out-boxplot}
-\caption{Job-L} \label{fig:perf-job-L}
+\caption{Job-L (segments=400)} \label{fig:perf-job-L}
\end{subfigure}

\caption{Runtime of the algorithms to compute the similarity to reference jobs}
@@ -241,13 +241,14 @@ The different algorithms lead to different curves for our reference jobs, e.g.,
% This indicates that the algorithms

The support team in a data center may have time to investigate the most similar jobs.
-Time for the analysis is typically bound, for instance, the team may analyze the 100 most similar ranked jobs (the Top\,100).
+Time for the analysis is typically bounded; for instance, the team may analyze the 100 most similar jobs. We refer to these as the Top\,100 jobs, and Rank\,i refers to the job with the i-th highest similarity to the reference job -- sometimes these similarity values lie close together, as we see in the following histograms.
In \Cref{fig:hist}, histograms of the actual number of jobs for a given similarity are shown.
As we focus on a feasible number of jobs, the diagrams should be read from right (100\% similarity) to left; for each bin, we show at most 100 jobs (the total number is still given).
It turns out that both BIN algorithms produce nearly identical histograms, so we omit one of them.
In the figures, we can again see that the algorithms behave differently depending on the reference job.
Especially for Job-S, we can see clusters of jobs with higher similarity (e.g., for hex\_lev at SIM=75\%), while for Job-M, the growth in the relevant section is steadier.
For Job-L, we find barely any similar jobs, except when using the HEX\_phases algorithm.
This algorithm finds 393 jobs with a similarity of 100\%; thus, they are indistinguishable to the algorithm.

Practically, the support team would start with Rank\,1 (the most similar job, presumably the reference job itself) and walk down until the jobs look different, or until a cluster is analyzed.
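The Top\,100/Rank\,i scheme described above amounts to sorting jobs by similarity and truncating. A minimal sketch in Python; the function name `top_k` and the sample similarity scores are ours, not from the paper's implementation:

```python
from typing import Dict, List, Tuple

def top_k(similarities: Dict[str, float], k: int = 100) -> List[Tuple[str, float]]:
    """Return the k highest-similarity jobs, most similar first.

    Rank 1 is the most similar job -- typically the reference job itself
    at 100% similarity.
    """
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

# Hypothetical similarity scores of the reference job against the population.
similarities = {"job-ref": 1.00, "job-a": 0.93, "job-b": 0.75, "job-c": 0.40}
top = top_k(similarities, k=3)
# top[0] is Rank 1: the reference job itself.
```

In a real setting the scores would come from one of the similarity algorithms (hex\_lev, hex\_native, hex\_phases, bin\_all, bin\_aggzeros) applied to each job in the population.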
@@ -305,16 +306,38 @@ Practically, the support team would start with Rank\,1 (most similar job, presum
\subsubsection{Inclusivity and Specificity}

When analyzing the overall population of jobs executed on a system, we expect that some workloads are executed several times (with different inputs but with the same configuration) or are executed with slightly different configurations (e.g., node counts, timesteps).
Thus, our similarity analysis of the job population may potentially just identify re-executions of the same workload.

-User count and group id is the same, meaning that a user is likely from the same group and the number of groups is identical to the number of users (unique), for Job-L user id and group count differ a bit, for Job-M a bit more.
-Up to about 2x users than groups.
To understand if the analysis is inclusive and identifies different applications, we use two approaches with our Top\,100 jobs:
we explore the distribution of users (and groups), runtime, and node count across jobs.
The algorithms should include different users and node counts, and cover a range of runtimes.
To confirm the hypotheses presented, we analyzed the job metadata, comparing job names; this validates the quantitative results discussed in the following.

\paragraph{User distribution.}
To understand how the Top\,100 are distributed across users, the data is grouped by user id and counted.
\Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs.
For Job-S, about 70-80\% of jobs stem from one user; for the hex\_lev and hex\_native algorithms, the remaining jobs stem from a second user, while bin includes jobs from additional users (5 in total).
For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user; here, hex\_lev and hex\_native include more users (30 and 33, respectively) than the other three algorithms.
For Job-L, the two hex algorithms include (with 12 and 13 users) a somewhat more diverse user community than the bin algorithms (9), but hex\_phases covers 35 users.

We did not include the group analysis in the figure because user and group counts are proportional; at most, the number of users is 2x the number of groups.
Thus, a user is likely from the same group, and the number of groups is similar to the number of unique users.

\paragraph{Node distribution.}
All algorithms reduce over the node dimension; therefore, we naturally expect broad inclusion across the node range -- as long as the average I/O behavior of the jobs is similar.
\Cref{fig:nodes-job} shows a boxplot for the node counts in the Top\,100.
For Job-M and Job-L, we can observe that the node counts of similar jobs indeed range between 1 and 128.
For Job-S, all 100 most similar jobs use one node.
As post-processing jobs typically use one node and make up a high proportion of all jobs, it appears natural that all Top\,100 jobs are from this class, which is confirmed by investigating the job metadata.
The boxplots have different shapes, which is an indication that the different algorithms identify different sets of jobs -- we will analyze this further later.

\paragraph{Runtime distribution.}
The runtime of the Top\,100 jobs is shown using boxplots in \Cref{fig:runtime-job}.
While all algorithms can compute the similarity between jobs of different length, the bin algorithms and hex\_native penalize jobs of different length, leading to a narrow profile.
For Job-M and Job-L, hex\_phases is able to identify much shorter or longer jobs.
For Job-L, the job itself is not included in the chosen Top\,100 (see \Cref{fig:hist-job-L}; 393 jobs have a similarity of 100\%), which is why its runtime does not appear in the figure.

\begin{figure}
\begin{subfigure}{0.31\textwidth}
\centering
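The user-distribution step above (group the Top\,100 by user id, count, then stack largest first) can be sketched as follows; a minimal illustration, with the name `user_distribution` and the sample data being ours:

```python
from collections import Counter
from typing import Iterable, List, Tuple

def user_distribution(top_jobs: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Count Top-100 jobs per user, largest count first.

    The first entry corresponds to the lowest stack in the stacked plot
    (the user owning the most jobs).
    """
    counts = Counter(user for _job_id, user in top_jobs)
    return counts.most_common()

# Hypothetical (job id, user id) pairs from one algorithm's Top-100.
top_jobs = [("j1", "u1"), ("j2", "u1"), ("j3", "u2"), ("j4", "u1")]
dist = user_distribution(top_jobs)
```

The group analysis would work identically with (job id, group id) pairs.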
@@ -362,32 +385,31 @@ For Job-L, the two hex algorithms include with (12 and 13) a bit more diverse us
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/jobs-elapsed}
-\caption{Job-S ($job=10^{4.19}$)} \label{fig:runtime-job-S}
+\caption{Job-S ($job=10^{4.19}s$)} \label{fig:runtime-job-S}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/jobs-elapsed}
-\caption{Job-M ($job=10^{4.46}$)} \label{fig:runtime-job-M}
+\caption{Job-M ($job=10^{4.46}s$)} \label{fig:runtime-job-M}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/jobs-elapsed}
-\caption{Job-L ($job=10^{5.3}$)} \label{fig:runtime-job-L}
+\caption{Job-L ($job=10^{5.3}s$)} \label{fig:runtime-job-L}
\end{subfigure}
\centering
\caption{Distribution of runtime for all 100 top-ranked jobs}
\label{fig:runtime-job}
\end{figure}

-To see how different the algorithms behave, the intersection of two algorithms is computed for the 100 jobs with the highest similarity and visualized in \Cref{fig:heatmap-job}.
-As expected, we can observe that bin\_all and bin\_aggzeros is very similar for all three jobs.
+\subsubsection{Algorithmic differences}
+To verify that the different algorithms behave differently, the intersection of the Top\,100 is computed for all combinations of algorithms and visualized in \Cref{fig:heatmap-job}.
+As expected, we can observe that bin\_all and bin\_aggzeros are very similar for all three jobs.
While there is some reordering, both algorithms lead to a comparable order.
The hex\_lev and hex\_native algorithms also exhibit some overlap, particularly for Job-S and Job-L.
For Job-M, however, they lead to a different ranking and Top\,100.
From the analysis, we conclude that one representative from binary quantization is sufficient, while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be considered together.

One consideration is to identify jobs that meet a rank threshold for all different algorithms.
\jk{TODO}

\begin{figure}
\begin{subfigure}{0.31\textwidth}
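The pairwise comparison behind the heatmap reduces to the size of the intersection of two algorithms' Top\,100 job-id sets. A minimal sketch, with the name `top100_overlap` and the shortened example lists being ours:

```python
from typing import Iterable

def top100_overlap(a: Iterable[str], b: Iterable[str]) -> int:
    """Number of job ids shared by two algorithms' Top-100 lists.

    One cell of the intersection heatmap; identical lists give 100,
    fully disjoint lists give 0.
    """
    return len(set(a) & set(b))

# Hypothetical (shortened) Top lists for two algorithms.
bin_all = ["j1", "j2", "j3"]
bin_aggzeros = ["j2", "j3", "j4"]
overlap = top100_overlap(bin_all, bin_aggzeros)
```

A large overlap (as observed for bin\_all vs. bin\_aggzeros) indicates redundant algorithms; small overlaps indicate that the algorithms capture disjoint behavioral aspects.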
@@ -707,5 +729,8 @@ Bin aggzero returns garbage.
\section{Conclusion}
\label{sec:summary}

+One consideration could be to identify jobs that are found by all algorithms, i.e., jobs that meet a certain (rank) threshold for different algorithms.
+That would increase the likelihood that these jobs are very similar and match what the user is looking for.

%\printbibliography
\end{document}
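The rank-threshold idea from the conclusion (keep only jobs that every algorithm ranks within the threshold) can be sketched as a set intersection; the name `jobs_within_rank` and the sample rankings are ours, not the paper's:

```python
from typing import Dict, List, Set

def jobs_within_rank(rankings: Dict[str, List[str]], threshold: int) -> Set[str]:
    """Job ids ranked within `threshold` by every algorithm.

    `rankings` maps each algorithm name to its job-id list,
    most similar first.
    """
    sets = [set(ranked[:threshold]) for ranked in rankings.values()]
    return set.intersection(*sets)

# Hypothetical ranked job-id lists per algorithm.
rankings = {
    "bin_all": ["j1", "j2", "j3", "j4"],
    "hex_lev": ["j2", "j1", "j5", "j3"],
    "hex_phases": ["j9", "j2", "j1", "j7"],
}
common = jobs_within_rank(rankings, threshold=3)
```

Jobs surviving this filter are similar under every notion of similarity at once, which supports the claim that they are likely what the user is looking for.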
@@ -4,6 +4,7 @@ library(ggplot2)
library(dplyr)
+require(scales)

# Turn to TRUE to print individual job images
plotjobs = FALSE

# Color scheme
@@ -123,10 +124,10 @@ for (alg_name in levels(data$alg_name)){
res.jobs = rbind(res.jobs, cbind(alg_name, metadata[metadata$jobid %in% result[, alg_name],]))
}

-ggplot(res.jobs, aes(alg_name, total_nodes, fill=alg_name)) + geom_boxplot() + scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log2", function(x) 2^x), labels = trans_format("log2", math_format(2^.x))) + theme(legend.position = "none") + xlab("Algorithm")
+ggplot(res.jobs, aes(alg_name, total_nodes, fill=alg_name)) + geom_boxplot() + scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log2", function(x) 2^x), labels = trans_format("log2", math_format(2^.x))) + theme(legend.position = "none") + xlab("Algorithm") + ylab("Job node count")
ggsave("jobs-nodes.png", width=6, height=4)

-ggplot(res.jobs, aes(alg_name, elapsed, fill=alg_name)) + geom_boxplot() + scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) + ylab("Runtime in s") + xlab("Algorithm") + theme(legend.position = "none")
+ggplot(res.jobs, aes(alg_name, elapsed, fill=alg_name)) + geom_boxplot() + ylab("Job runtime in s") + xlab("Algorithm") + theme(legend.position = "none") # scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
ggsave("jobs-elapsed.png", width=6, height=4)