diff --git a/paper/main.tex b/paper/main.tex
index 69cea91..d819344 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -69,7 +69,7 @@
 \usepackage{cleveref}
 \crefname{codecount}{Code}{Codes}
-\title{Using Machine Learning to Identify Similar Jobs Based on their IO Behavior}
+\title{A Workflow for Identifying Jobs with Similar I/O Behavior by Analyzing the Timeseries}
 \author{Julian Kunkel\inst{2} \and Eugen Betke\inst{1}}
@@ -82,9 +82,6 @@ DKRZ -- }
 \begin{document}
 \maketitle
-\eb{The title is not accurate. No clustering is involved here, and therefore no machine learning either. Nothing is actually learned. It is more like:
-``A Workflow for Identification of Similar Job by Means of Timeseries-based I/O Similarity Functions''
-}
 \begin{abstract}
@@ -318,12 +315,10 @@ Practically, the support team would start with Rank\,1 (most similar job, presum
 When analyzing the overall population of jobs executed on a system, we expect that some workloads are executed several times (with different inputs but with the same configuration) or are executed with slightly different configurations (e.g., node counts, timesteps).
 Thus, potentially our similarity analysis of the job population may just identify the re-execution of the same workload.
-\ebadd{%
-In the most cases, the support staff would identify the re-execution of jobs simply by job names.
-The job names are often user defined generic strings and can contain confidential data.
-It is quite difficult to anonymize them and keep the meaning unchanged.
-Therefore, they are not available for this analysis.
-}%
+Typically, the support staff would identify the re-execution of jobs by inspecting job names, which are user-defined generic strings\footnote{%
+As they can contain confidential data, it is difficult to anonymize them without perturbing the meaning.
+Therefore, they are not published in our data repository.
+}
 To understand if the analysis is inclusive and identifies different applications, we use two approaches with our Top\,100 jobs:
 We explore the distribution of users (and groups), runtime, and node count across jobs.
@@ -351,8 +346,7 @@ The boxplots have different shapes which is an indication, that the different al
 \paragraph{Runtime distribution.}
 The \added{job} runtime of the Top\,100 jobs is shown using boxplots in \Cref{fig:runtime-job}.
-While all algorithms can compute the similarity between jobs of different length, the bin algorithms and hex\_native penalize jobs of different length leading to a narrow profile.
-\eb{The ``narrow profiles'' are somehow not visible in the figures. (Or I did not understand it. My understanding is that a ``narrow profile'' is a set of jobs with similar runtime.)}
+While all algorithms can compute the similarity between jobs of different length, the bin algorithms and hex\_native penalize jobs of different length and thus prefer jobs of very similar length.
 For Job-M and Job-L, hex\_phases is able to identify much shorter or longer jobs.
 For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:hist-job-L}, 393 jobs have a similarity of 100\%) which is the reason why the job runtime isn't shown in the figure itself.
@@ -374,7 +368,7 @@ For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:h
 \end{subfigure}
-\caption{User information for all 100 top ranked jobs}
+\caption{User information for all 100 top-ranked jobs}
 \label{fig:userids}
 \end{figure}
@@ -395,32 +389,30 @@ For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:h
 \caption{Job-L (reference job runs on 20 nodes)} \label{fig:nodes-job-L}
 \end{subfigure}
 \centering
 \caption{Distribution of node counts (for Job-S nodes=1 in all cases)}
 \label{fig:nodes-job}
 \end{figure}
-\eb{In \Cref{fig:nodes-job}, the node count of the investigated jobs could perhaps also be shown.}
 \begin{figure}
 \begin{subfigure}{0.31\textwidth}
 \centering
 \includegraphics[width=\textwidth]{job_similarities_4296426-out/jobs-elapsed}
-\caption{Job-S ($job=10^{4.19}s$)} \label{fig:runtime-job-S}
+\caption{Job-S ($job=15,551s$)} \label{fig:runtime-job-S}
 \end{subfigure}
 \begin{subfigure}{0.31\textwidth}
 \centering
 \includegraphics[width=\textwidth]{job_similarities_5024292-out/jobs-elapsed}
-\caption{Job-M ($job=10^{4.46}s$)} \label{fig:runtime-job-M}
+\caption{Job-M ($job=28,828s$)} \label{fig:runtime-job-M}
 \end{subfigure}
 \begin{subfigure}{0.31\textwidth}
 \centering
 \includegraphics[width=\textwidth]{job_similarities_7488914-out/jobs-elapsed}
-\caption{Job-L ($job=10^{5.3}s$)} \label{fig:runtime-job-L}
+\caption{Job-L ($job=240ks$)} \label{fig:runtime-job-L}
 \end{subfigure}
 \centering
-\caption{Distribution of runtime for all 100 top ranked jobs}
+\caption{Distribution of runtime for all 100 top-ranked jobs}
 \label{fig:runtime-job}
 \end{figure}
-\eb{In \Cref{fig:runtime-job}, the runtime of the investigated jobs could perhaps also be shown.}
 \subsubsection{Algorithmic differences}
 To verify that the different algorithms behave differently, the intersection for the Top\,100 is computed for all combination of algorithms and visualized in \Cref{fig:heatmap-job}.
@@ -428,8 +420,9 @@ As expected we can observe that bin\_all and bin\_aggzeros is very similar for a
 While there is some reordering, both algorithms lead to a comparable order.
 The hex\_lev and hex\_native algorithms are also exhibiting some overlap particularly for Job-S and Job-L.
 For Job\-M, however, they lead to a different ranking and Top\,100.
-From the analysis, we conclude that one representative from binary quantization is sufficient while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be considered together.
+From this analysis, we conclude that one representative from binary quantization is sufficient, as the binary quantization algorithms generate very similar results, while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be analyzed individually.
 \eb{Is this a general statement: ``one representative from binary quantization is sufficient''? If so, it is very vague. It could be coincidence.}
+\jk{I rewrote this a bit. Sure, yes. It is just very similar.}
@@ -451,10 +444,9 @@ From the analysis, we conclude that one representative from binary quantization
 \end{subfigure}
 \centering
-\caption{Intersection of the 100 top ranked jobs for different algorithms}
+\caption{Intersection of the 100 top-ranked jobs for different algorithms}
 \label{fig:heatmap-job}
 \end{figure}
-\eb{In \Cref{fig:heatmap-job}, the color palette needs to be fixed. Nothing is visible on blue.}
 \section{Assessing Timelines for Similar Jobs}
diff --git a/scripts/plot.R b/scripts/plot.R
index 6cbd2b0..8f8ecb1 100755
--- a/scripts/plot.R
+++ b/scripts/plot.R
@@ -3,17 +3,20 @@
 library(ggplot2)
 library(dplyr)
 require(scales)
+library(stringi)
+library(stringr)
 # Turn to TRUE to print indivdiual job images
 plotjobs = FALSE
 # Color scheme
-plotcolors <- c("#CC0000", "#FFA500", "#FFFF00", "#008000", "#9999ff", "#000066")
+plotcolors <- c("#CC0000", "#FFA500", "#FFFF00", "#008000", "#9999ff", "#000099")
 # Parse job from command line
 args = commandArgs(trailingOnly = TRUE)
 file = "job_similarities_5024292.csv" # for manual execution
 file = args[1]
+jobID = str_extract(file, regex("[0-9]+"))
 data = read.csv(file)
 # Columns are: jobid alg_id alg_name similarity
@@ -124,10 +127,15 @@ for (alg_name in levels(data$alg_name)){
 res.jobs = rbind(res.jobs, cbind(alg_name, metadata[metadata$jobid %in% result[, alg_name],]))
 }
-ggplot(res.jobs, aes(alg_name, total_nodes, fill=alg_name)) + geom_boxplot() + scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log2", function(x) 2^x), labels = trans_format("log2", math_format(2^.x))) + theme(legend.position = "none") + xlab("Algorithm") + xlab("Job node count")
+# Plot boxplot of nodes per algorithm; the dashed line marks the reference job
+jobRef = metadata[metadata$jobid == jobID,]$total_nodes
+ggplot(res.jobs, aes(alg_name, total_nodes, fill=alg_name)) + geom_boxplot() + scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log2", function(x) 2^x), labels = trans_format("log2", math_format(2^.x))) + theme(legend.position = "none") + xlab("Algorithm") + ylab("Job node count") + geom_hline(yintercept= jobRef, linetype="dashed", color = "red", size=0.5)
 ggsave("jobs-nodes.png", width=6, height=4)
-ggplot(res.jobs, aes(alg_name, elapsed, fill=alg_name)) + geom_boxplot() + ylab("Job runtime in s") + xlab("Algorithm") + theme(legend.position = "none") # scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
+# Plot boxplot of elapsed time per algorithm; the dashed line marks the reference job
+jobRef = metadata[metadata$jobid == jobID,]$elapsed
+ggplot(res.jobs, aes(alg_name, elapsed, fill=alg_name)) + geom_boxplot() + ylab("Job runtime in s") + xlab("Algorithm") + theme(legend.position = "none") + ylim(0, max(res.jobs$elapsed)) + geom_hline(yintercept= jobRef, linetype="dashed", color = "red", size=0.5)
+# scale_y_continuous(trans = log2_trans(), breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))
 ggsave("jobs-elapsed.png", width=6, height=4)