diff --git a/fig/job-timeseries4296426.pdf b/fig/job-timeseries4296426.pdf
index 1fa9c71..a43a026 100644
Binary files a/fig/job-timeseries4296426.pdf and b/fig/job-timeseries4296426.pdf differ
diff --git a/paper/main.tex b/paper/main.tex
index e8aac59..a5d0feb 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -137,17 +137,71 @@ Check time series algorithms:
 \section{Evaluation}
 \label{sec:evaluation}
 
+For each reference job and algorithm, we created a CSV file with the computed similarity for all other jobs.
+Next, we analyzed the performance of the algorithms.
+Then, we examined the quantitative behavior and the correlation between the chosen similarity and the number of found jobs and, finally, the quality of the 100 most similar jobs.
+
+\subsection{Reference Jobs}
+
 In the following, we assume a job is given and we aim to identify similar jobs.
-We chose several reference jobs with different compute and IO characteristics visualized in \Cref{fig:refJobs}:
+We chose several reference jobs with different compute and IO characteristics:
 \begin{itemize}
-	\item Job-S: performs postprocessing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so called CMORization). The post-processing is IO intensive.
+	\item Job-S: performs post-processing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so-called CMORization). The post-processing is IO intensive.
 	\item Job-M: a typical MPI parallel 8-hour compute job on 128 nodes which writes time series data after some spin up. %CHE.ws12
 	\item Job-L: a 66-hour 20-node job. The initialization data is read at the beginning. Then only a single master node writes constantly a small volume of data; in fact, the generated data is too small to be categorized as IO relevant.
 \end{itemize}
-For each reference job and algorithm, we created a CSV files with the computed similarity for all other jobs.
+The segmented timelines of the jobs are visualized in \Cref{fig:refJobs}.
+This coding is also used for the HEX class of algorithms (BIN algorithms merge all timelines together as described in \jk{TODO}).
+The figures show the values of active metrics ($\neq 0$) only; if only a few are active, they are shown in one timeline, otherwise they are rendered individually to provide a better overview.
+For example, we can see in \Cref{fig:job-S} that several metrics increase in Segment\,6.
+
+\begin{figure}
+\begin{subfigure}{0.8\textwidth}
+\centering
+\includegraphics[width=\textwidth]{job-timeseries4296426}
+\caption{Job-S} \label{fig:job-S}
+\end{subfigure}
+\centering
+
+
+\begin{subfigure}{0.8\textwidth}
+\centering
+\includegraphics[width=\textwidth]{job-timeseries5024292}
+\caption{Job-M} \label{fig:job-M}
+\end{subfigure}
+\centering
+
+
+\caption{Reference jobs: segmented timelines of mean IO activity}
+\label{fig:refJobs}
+\end{figure}
+
+
+\begin{figure}\ContinuedFloat
+
+\begin{subfigure}{0.8\textwidth}
+\centering
+\includegraphics[width=\textwidth]{job-timeseries7488914-30}
+\caption{Job-L (first 30 segments of 400; remaining segments are similar)}
+\label{fig:job-L}
+\end{subfigure}
+\centering
+\caption{Reference jobs: segmented timelines of mean IO activity}
+\end{figure}
+
+
+
+\subsection{Performance}
+
+\jk{Describe System at DKRZ from old paper}
+
+The runtime for computing the similarity of relevant IO jobs (580,000 and 440,000 for the BIN and HEX algorithms, respectively) is shown in \Cref{fig:performance}.
+
+\jk{TO FIX, This is for clustering algorithm, not for computing SIM, which is what we do here.}
+
 \begin{figure}
 \centering
@@ -168,93 +222,73 @@ For each reference job and algorithm, we created a CSV files with the computed s
 \end{figure}
 
-Create histograms + cumulative job distribution for all algorithms.
-Insert job profiles for closest 10 jobs.
-
-Potentially, analyze how the rankings of different similarities look like.
-
-
-\begin{figure}
-\begin{subfigure}{0.8\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job-timeseries4296426}
-\caption{Job-S} \label{fig:job-S}
-\end{subfigure}
-\centering
-
-\caption{Reference jobs: timeline of mean IO activity}
-\label{fig:refJobs}
-\end{figure}
-
-
-\begin{figure}\ContinuedFloat
-
-\begin{subfigure}{0.8\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job-timeseries5024292}
-\caption{Job-M} \label{fig:job-M}
-\end{subfigure}
-\centering
-
-\begin{subfigure}{0.8\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job-timeseries7488914-30.pdf}
-\caption{Job-L (first 30 segments of 400; remaining segments are similar)}
-\label{fig:job-L}
-\end{subfigure}
-\centering
-\caption{Reference jobs: timeline of mean IO activity; non-shown timelines are 0}
-\end{figure}
+\subsection{Quantitative Analysis}
+In the quantitative analysis, we explore, for the different algorithms, how similar the jobs in our pool are to the three reference jobs (Job-S, Job-M, and Job-L).
+The cumulative distribution of the similarity to the reference jobs is shown in \Cref{fig:ecdf}.
+For example, in \Cref{fig:ecdf-job-S}, we see that for HEX\_native about 70\% of the jobs have a similarity of less than 10\% to Job-S.
+BIN\_aggzeros shows some steep increases; e.g., more than 75\% of the jobs share the same low similarity below 2\%.
+The different algorithms lead to different curves for our reference jobs; e.g., for Job-S, HEX\_phases bundles more jobs with low similarity than the other algorithms, while for Job-L its curve rises the slowest.
+% This indicates that the algorithms
+The support team in a data center may have time to investigate the most similar jobs.
+The time for the analysis is typically limited; for instance, the team may analyze the 100 most similar jobs.
+In \Cref{fig:hist}, the histograms with the actual number of jobs for a given similarity are shown.
+As we focus on a feasible number of jobs, the diagram should be read from right (100\% similarity) to left, and for each bin we show at most 100 jobs (the total number is still given).
+It turns out that both BIN algorithms produce nearly identical histograms, so we omit one of them.
+In the figures, we can again see a different behavior of the algorithms depending on the reference job.
+Especially for Job-S, we can see clusters of jobs with higher similarity, while for Job-M the growth in the relevant section is more steady.
+For Job-L, we find barely any similar jobs, except when using the HEX\_phases algorithm.
 
 \begin{figure}
 \begin{subfigure}{0.8\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_4296426-out/ecdf.png}
+\includegraphics[width=\textwidth]{job_similarities_4296426-out/ecdf}
 \caption{Job-S} \label{fig:ecdf-job-S}
 \end{subfigure}
 \centering
 
 \begin{subfigure}{0.8\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_5024292-out/ecdf.png}
+\includegraphics[width=\textwidth]{job_similarities_5024292-out/ecdf}
 \caption{Job-M} \label{fig:ecdf-job-M}
 \end{subfigure}
 \centering
 
 \begin{subfigure}{0.8\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_7488914-out/ecdf.png}
+\includegraphics[width=\textwidth]{job_similarities_7488914-out/ecdf}
 \caption{Job-L} \label{fig:ecdf-job-L}
 \end{subfigure}
 \centering
-\caption{Empirical cumulative density function}
+\caption{Quantitative job similarity -- empirical cumulative distribution function}
 \label{fig:ecdf}
 \end{figure}
 
 \begin{figure}
-
-\begin{subfigure}{0.5\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_4296426-out/hist-sim}
+
+\begin{subfigure}{0.75\textwidth}
+\centering
+\includegraphics[width=\textwidth,trim={0 0 0 2.2cm},clip]{job_similarities_4296426-out/hist-sim}
 \caption{Job-S} \label{fig:hist-job-S}
 \end{subfigure}
-\begin{subfigure}{0.5\textwidth}
+
+\begin{subfigure}{0.75\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_5024292-out/hist-sim}
+\includegraphics[width=\textwidth,trim={0 0 0 2.2cm},clip]{job_similarities_5024292-out/hist-sim}
 \caption{Job-M} \label{fig:hist-job-M}
 \end{subfigure}
 
-\begin{subfigure}{0.5\textwidth}
+\begin{subfigure}{0.75\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_7488914-out/hist-sim}
+\includegraphics[width=\textwidth,trim={0 0 0 2.2cm},clip]{job_similarities_7488914-out/hist-sim}
 \caption{Job-L} \label{fig:hist-job-L}
 \end{subfigure}
 \centering
-\caption{Histogram for the number of jobs (bin width: 2.5\%, numbers are the actual job counts)}
+\caption{Histogram for the number of jobs (bin width: 2.5\%; numbers are the actual job counts). BIN\_aggzeros is nearly identical to BIN\_all.}
 \label{fig:hist}
 \end{figure}
@@ -415,7 +449,7 @@ One consideration is to identify jobs that meet a rank threshold for all differe
 % \ContinuedFloat
 Hex phases very similar to hex native.
 
-Komischer JOB zu inspizieren: \verb|job_similarities_4296426-out/hex_phases-0.7429--93timeseries4237860.png|
+Strange job to inspect: \verb|job_similarities_4296426-out/hex_phases-0.7429--93timeseries4237860|
 
 Bin aggzeros works quite well here too. The jobs are a bit more diverse.
@@ -602,7 +636,7 @@ Bin aggzero liefert Mist zurück.
 \end{subfigure}
 
 \caption{Job-L with hex\_lev, selection of similar jobs}
-\label{fig:job-L-hex-phases}
+\label{fig:job-L-hex-lev}
 \end{figure}
diff --git a/scripts/analyse-all.sh b/scripts/analyse-all.sh
index 1ff9a7e..d7b968c 100755
--- a/scripts/analyse-all.sh
+++ b/scripts/analyse-all.sh
@@ -4,6 +4,8 @@
 
 echo "This script performs the complete analysis steps"
 
+CLEAN=0 # Set to 0 to keep existing output directories and only refresh the plots
+
 function prepare(){
 	pushd datasets
 	./decompress.sh
@@ -21,6 +23,9 @@ for I in job_similarities_*.csv ; do
 	./scripts/plot.R $I > description.txt
 	OUT=${I%%.csv}-out
 	mkdir $OUT
-	rm $OUT/*
-	mv *.png *.pdf description.txt $OUT
+	if [[ $CLEAN != "0" ]] ; then
+		rm $OUT/*
+		mv description.txt $OUT
+	fi
+	mv *.png *.pdf $OUT
 done
diff --git a/scripts/plot-single-job.py b/scripts/plot-single-job.py
index 8849d7c..e9f6392 100755
--- a/scripts/plot-single-job.py
+++ b/scripts/plot-single-job.py
@@ -10,7 +10,7 @@ import matplotlib.cm as cm
 jobs = sys.argv[1].split(",")
 prefix = sys.argv[2].split(",")
 
-fileformat = ".png"
+fileformat = ".pdf"
 
 print("Plotting the job: " + str(sys.argv[1]))
 print("Plotting with prefix: " + str(sys.argv[2]))
@@ -78,12 +78,16 @@ def plot(prefix, header, row):
     colors = []
     style = []
     for name, group in groups:
-        metrics[name] = [x[2] for x in group.values]
-        labels.append(name)
         style.append(linestyleMap[name] + markerMap[name])
         colors.append(colorMap[name])
+        if name == "md_file_delete":
+            name = "file_delete"
+        if name == "md_file_create":
+            name = "file_create"
+        metrics[name] = [x[2] for x in group.values]
+        labels.append(name)
 
-    fsize = (8, 1 + 1.5 * len(labels))
+    fsize = (8, 1 + 1.1 * len(labels))
     fsizeFixed = (8, 2)
 
     pyplot.close('all')
@@ -97,7 +101,7 @@ def plot(prefix, header, row):
         ax[i].set_ylabel(l)
 
     pyplot.xlabel("Segment number")
-    pyplot.savefig(prefix + "timeseries" + jobid + fileformat, bbox_inches='tight')
+    pyplot.savefig(prefix + "timeseries" + jobid + fileformat, bbox_inches='tight', dpi=150)
 
     # Plot first 30 segments
     if len(timeseries) <= 50:
@@ -113,7 +117,7 @@ def plot(prefix, header, row):
        ax[i].set_ylabel(l)
 
    pyplot.xlabel("Segment number")
-   pyplot.savefig(prefix + "timeseries" + jobid + "-30" + fileformat, bbox_inches='tight')
+   pyplot.savefig(prefix + "timeseries" + jobid + "-30" + fileformat, bbox_inches='tight', dpi=150)
 
 ### end plotting function
diff --git a/scripts/plot.R b/scripts/plot.R
index 642c61b..c8ff172 100755
--- a/scripts/plot.R
+++ b/scripts/plot.R
@@ -4,7 +4,7 @@ library(ggplot2)
 library(dplyr)
 require(scales)
 
-plotjobs = TRUE
+plotjobs = FALSE
 
 # Color scheme
 plotcolors <- c("#CC0000", "#FFA500", "#FFFF00", "#008000", "#9999ff", "#000066")
@@ -22,19 +22,20 @@ cat("Job count:")
 cat(nrow(data))
 
 # empirical cumulative density function (ECDF)
-ggplot(data, aes(similarity, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("SIM") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4)) + scale_color_brewer(palette = "Set2")
-ggsave("ecdf.png", width=8, height=3)
+data$sim = data$similarity*100
+ggplot(data, aes(sim, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("Similarity in %") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4)) + scale_color_brewer(palette = "Set2") + scale_x_log10()
+ggsave("ecdf.png", width=8, height=2.5)
 
-ggplot(data, aes(similarity, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("SIM") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4)) + scale_color_brewer(palette = "Set2") + xlim(0.5, 1.0)
-ggsave("ecdf-0.5.png", width=8, height=3)
+# histogram for the jobs
+ggplot(data, aes(sim), group=alg_name) + geom_histogram(color="black", binwidth=2.5) + aes(fill = alg_name) + facet_grid(alg_name ~ ., switch = 'y') + xlab("Similarity in %") + scale_y_continuous(limits=c(0, 100), oob=squish) + scale_color_brewer(palette = "Set2") + ylab("Count (cropped at 100)") + theme(legend.position = "none") + stat_bin(binwidth=2.5, geom="text", adj=1.0, angle = 90, colour="black", size=3, aes(label=..count.., y=0*(..count..)+95))
+ggsave("hist-sim.png", width=6, height=4.5)
+
+#ggplot(data, aes(similarity, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("SIM") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4)) + scale_color_brewer(palette = "Set2") + xlim(0.5, 1.0)
+#ggsave("ecdf-0.5.png", width=8, height=3)
 
 e = data %>% filter(similarity >= 0.5)
 print(summary(e))
 
-# histogram for the jobs
-ggplot(data, aes(similarity), group=alg_name) + geom_histogram(color="black", binwidth=0.025) + aes(fill = alg_name) + facet_grid(alg_name ~ ., switch = 'y') + scale_y_continuous(limits=c(0, 100), oob=squish) + scale_color_brewer(palette = "Set2") + ylab("Count (cropped at 100)") + theme(legend.position = "none") + stat_bin(binwidth=0.025, geom="text", angle = 90, colour="black", size=3, aes(label=..count.., y=0*(..count..)+20))
-ggsave("hist-sim.png")
-
 # load job information, i.e., the time series per job
 jobData = read.csv("job-io-datasets/datasets/job_codings.csv")
 metadata = read.csv("job-io-datasets/datasets/job_metadata.csv")
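The two views the patched plot.R produces — an ECDF over similarity rescaled to percent, and a histogram with 2.5%-wide bins whose displayed counts are cropped at 100 while labels keep the true count — can be sketched in Python for clarity. This is an illustrative sketch, not code from the repository; the `similarities` input and the `ecdf`/`hist_counts` helpers are hypothetical names.

```python
# Illustrative sketch (not part of this repository) of the two plot.R views:
# an ECDF over similarity in percent and a cropped-count histogram.
from collections import Counter

def ecdf(similarities):
    """(x, y) pairs: fraction of jobs with similarity <= x, x in percent."""
    xs = sorted(s * 100 for s in similarities)  # plot.R: data$sim = data$similarity*100
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def hist_counts(similarities, width=2.5, crop=100):
    """Map each bin's lower edge (in %) to (actual count, cropped bar height)."""
    bins = Counter(int((s * 100) // width) for s in similarities)
    return {b * width: (count, min(count, crop)) for b, count in sorted(bins.items())}
```

As in plot.R, the cropped value only limits the drawn bar height; the actual count is kept so each bin can still be labeled with the real number of jobs.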