2020-08-21 19:12:33 +01:00 · 7dc7328e34
commit 7dc7328e34
parent 60116fc7a6
5 changed files with 115 additions and 71 deletions
--- a/fig/job-timeseries4296426.pdf
+++ b/fig/job-timeseries4296426.pdf
--- a/paper/main.tex
+++ b/paper/main.tex
@ -137,17 +137,71 @@ Check time series algorithms:
 \section{Evaluation}
 \label{sec:evaluation}

+For each reference job and algorithm, we created a CSV file with the computed similarity to all other jobs.
+Next, we analyzed the performance of the algorithm.
+Then, we examined the quantitative behavior and the correlation between the chosen similarity and the number of found jobs and, finally, the quality of the 100 most similar jobs.
+
+\subsection{Reference Jobs}
+
 In the following, we assume a job is given and we aim to identify similar jobs.
-We chose several reference jobs with different compute and IO characteristics visualized in \Cref{fig:refJobs}:
+We chose several reference jobs with different compute and IO characteristics:
 \begin{itemize}
-	\item Job-S: performs postprocessing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so called CMORization). The post-processing is IO intensive.
+	\item Job-S: performs post-processing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so-called CMORization). The post-processing is IO intensive.
  \item Job-M: a typical MPI parallel 8-hour compute job on 128 nodes which writes time series data after some spin up.   %CHE.ws12
 	\item Job-L: a 66-hour 20-node job.
  The initialization data is read at the beginning.
  Then only a single master node writes constantly a small volume of data; in fact, the generated data is too small to be categorized as IO relevant.
 \end{itemize}

-For each reference job and algorithm, we created a CSV files with the computed similarity for all other jobs.
+The segmented timelines of the jobs are visualized in \Cref{fig:refJobs}.
+This coding is also used for the HEX class of algorithms (BIN algorithms merge all timelines together as described in \jk{TODO}).
+The figures show the values of active metrics ($\neq 0$) only; if few are active, they are shown in one timeline, otherwise they are rendered individually to provide a better overview.
+For example, we can see in \Cref{fig:job-S} that several metrics increase in Segment\,6.
+
+\begin{figure}
+\begin{subfigure}{0.8\textwidth}
+\centering
+\includegraphics[width=\textwidth]{job-timeseries4296426}
+\caption{Job-S} \label{fig:job-S}
+\end{subfigure}
+\centering
+
+
+\begin{subfigure}{0.8\textwidth}
+\centering
+\includegraphics[width=\textwidth]{job-timeseries5024292}
+\caption{Job-M} \label{fig:job-M}
+\end{subfigure}
+\centering
+
+
+\caption{Reference jobs: segmented timelines of mean IO activity}
+\label{fig:refJobs}
+\end{figure}
+
+
+\begin{figure}\ContinuedFloat
+
+\begin{subfigure}{0.8\textwidth}
+\centering
+\includegraphics[width=\textwidth]{job-timeseries7488914-30}
+\caption{Job-L (first 30 segments of 400; remaining segments are similar)}
+\label{fig:job-L}
+\end{subfigure}
+\centering
+\caption{Reference jobs: segmented timelines of mean IO activity}
+\end{figure}
+
+
+
+\subsection{Performance}
+
+\jk{Describe System at DKRZ from old paper}
+
+The runtime for computing the similarity of relevant IO jobs (580,000 and 440,000 for BIN and HEX algorithms, respectively) is shown in \Cref{fig:performance}.
+
+\jk{TO FIX, This is for clustering algorithm, not for computing SIM, which is what we do here.}
+

 \begin{figure}
 \centering
@ -168,93 +222,73 @@ For each reference job and algorithm, we created a CSV files with the computed s
 \end{figure}


-Create histograms + cumulative job distribution for all algorithms.
-Insert job profiles for closest 10 jobs.
-
-Potentially, analyze how the rankings of different similarities look like.
-
-
-\begin{figure}
-\begin{subfigure}{0.8\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job-timeseries4296426}
-\caption{Job-S} \label{fig:job-S}
-\end{subfigure}
-\centering
-
-\caption{Reference jobs: timeline of mean IO activity}
-\label{fig:refJobs}
-\end{figure}
-
-
-\begin{figure}\ContinuedFloat
-
-\begin{subfigure}{0.8\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job-timeseries5024292}
-\caption{Job-M} \label{fig:job-M}
-\end{subfigure}
-\centering
-
-\begin{subfigure}{0.8\textwidth}
-\centering
-\includegraphics[width=\textwidth]{job-timeseries7488914-30.pdf}
-\caption{Job-L (first 30 segments of 400; remaining segments are similar)}
-\label{fig:job-L}
-\end{subfigure}
-\centering
-\caption{Reference jobs: timeline of mean IO activity; non-shown timelines are 0}
-\end{figure}
+\subsection{Quantitative Analysis}

+In the quantitative analysis, we explore, for the different algorithms, how the similarity of our pool of jobs to our three reference jobs (Job-S, Job-M, and Job-L) is distributed.
+The cumulative distribution of similarity to the reference jobs is shown in \Cref{fig:ecdf}.
+For example, in \Cref{fig:ecdf-job-S}, we see that about 70\% of the jobs have a similarity of less than 10\% to Job-S for HEX\_native.
+BIN\_aggzeros shows some steep increases, e.g., more than 75\% of jobs have the same low similarity below 2\%.
+The different algorithms lead to different curves for our reference jobs, e.g., for Job-S, HEX\_phases bundles more jobs with low similarity compared to the other algorithms; for Job-L, its curve rises most slowly.
+% This indicates that the algorithms

+The support team in a data center may have time to investigate the most similar jobs.
+Time for the analysis is typically bounded; for instance, the team may analyze the 100 most similar jobs.
+In \Cref{fig:hist}, the histograms with the actual number of jobs for a given similarity are shown.
+As we focus on a feasible number of jobs, the diagram should be read from right (100\% similarity) to left; for each bin, we show at most 100 jobs (the total number is still given).
+It turns out that both BIN algorithms produce nearly identical histograms, so we omit one of them.
+In the figures, we again see that the algorithms behave differently depending on the reference job.
+Especially for Job-S, we can see clusters of jobs with higher similarity, while for Job-M, the growth in the relevant section is steadier.
+For Job-L, we find hardly any similar jobs, except when using the HEX\_phases algorithm.

 \begin{figure}

 \begin{subfigure}{0.8\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_4296426-out/ecdf.png}
+\includegraphics[width=\textwidth]{job_similarities_4296426-out/ecdf}
 \caption{Job-S} \label{fig:ecdf-job-S}
 \end{subfigure}
 \centering

 \begin{subfigure}{0.8\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_5024292-out/ecdf.png}
+\includegraphics[width=\textwidth]{job_similarities_5024292-out/ecdf}
 \caption{Job-M} \label{fig:ecdf-job-M}
 \end{subfigure}
 \centering

 \begin{subfigure}{0.8\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_7488914-out/ecdf.png}
+\includegraphics[width=\textwidth]{job_similarities_7488914-out/ecdf}
 \caption{Job-L} \label{fig:ecdf-job-L}
 \end{subfigure}
 \centering
-\caption{Empirical cumulative density function}
+\caption{Quantitative job similarity -- empirical cumulative distribution function}
 \label{fig:ecdf}
 \end{figure}


 \begin{figure}
-
-\begin{subfigure}{0.5\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_4296426-out/hist-sim}
+
+\begin{subfigure}{0.75\textwidth}
+\centering
+\includegraphics[width=\textwidth,trim={0 0 0 2.2cm},clip]{job_similarities_4296426-out/hist-sim}
 \caption{Job-S} \label{fig:hist-job-S}
 \end{subfigure}
-\begin{subfigure}{0.5\textwidth}
+
+\begin{subfigure}{0.75\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_5024292-out/hist-sim}
+\includegraphics[width=\textwidth,trim={0 0 0 2.2cm},clip]{job_similarities_5024292-out/hist-sim}
 \caption{Job-M} \label{fig:hist-job-M}
 \end{subfigure}

-\begin{subfigure}{0.5\textwidth}
+\begin{subfigure}{0.75\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job_similarities_7488914-out/hist-sim}
+\includegraphics[width=\textwidth,trim={0 0 0 2.2cm},clip]{job_similarities_7488914-out/hist-sim}
 \caption{Job-L} \label{fig:hist-job-L}
 \end{subfigure}
 \centering
-\caption{Histogram for the number of jobs (bin width: 2.5\%, numbers are the actual job counts)}
+\caption{Histogram for the number of jobs (bin width: 2.5\%, numbers are the actual job counts). BIN\_aggzeros is nearly identical to BIN\_all.}
 \label{fig:hist}
 \end{figure}

@ -415,7 +449,7 @@ One consideration is to identify jobs that meet a rank threshold for all differe
 % \ContinuedFloat

 Hex phases very similar to hex native.
-Strange job to inspect: \verb|job_similarities_4296426-out/hex_phases-0.7429--93timeseries4237860.png|
+Strange job to inspect: \verb|job_similarities_4296426-out/hex_phases-0.7429--93timeseries4237860|


 Bin aggzeros works quite well here too. The jobs are a bit more diverse.
@ -602,7 +636,7 @@ Bin aggzero returns garbage.
 \end{subfigure}

 \caption{Job-L with hex\_lev, selection of similar jobs}
-\label{fig:job-L-hex-phases}
+\label{fig:job-L-hex-lev}
 \end{figure}


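Note on the Quantitative Analysis text above: the two readings it relies on (the fraction of jobs below a similarity threshold, and per-bin counts cropped at 100 for display) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the paper: the function names `fraction_below` and `cropped_histogram` are hypothetical, the input is a plain list of similarity values rather than the `job_similarities_*.csv` files, and the bins here start at 0, which may differ from how ggplot2 aligns its bins in `scripts/plot.R`.

```python
from bisect import bisect_left
from collections import Counter


def fraction_below(similarities, threshold):
    """ECDF-style reading: fraction of jobs with similarity strictly below threshold."""
    ordered = sorted(similarities)
    return bisect_left(ordered, threshold) / len(ordered)


def cropped_histogram(similarities_pct, bin_width=2.5, crop=100):
    """Map each bin start (in %) to (true count, displayed height cropped at `crop`)."""
    last_bin = int(100 / bin_width) - 1
    counts = Counter()
    for s in similarities_pct:
        # A similarity of exactly 100% is placed into the last bin.
        idx = min(int(s // bin_width), last_bin)
        counts[idx * bin_width] += 1
    return {start: (n, min(n, crop)) for start, n in sorted(counts.items())}


if __name__ == "__main__":
    # Synthetic similarities in [0, 1]; these are not values from the paper.
    sims = [0.01, 0.02, 0.05, 0.08, 0.2, 0.5, 0.9, 1.0, 0.03, 0.6]
    print(fraction_below(sims, 0.10))
    print(cropped_histogram([s * 100 for s in sims]))
```

Reading the histogram from right to left then corresponds to iterating over the bins in descending order until the desired number of jobs (for instance 100) has been collected.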
--- a/scripts/analyse-all.sh
+++ b/scripts/analyse-all.sh
@ -4,6 +4,8 @@

 echo "This script performs the complete analysis steps"

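+CLEAN=0 # 0: update in place (keep existing files in the output directory); any other value: empty the output directory first and move description.txt into it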
+
 function prepare(){
  pushd datasets
  ./decompress.sh
@ -21,6 +23,9 @@ for I in job_similarities_*.csv ; do
  ./scripts/plot.R $I > description.txt
  OUT=${I%%.csv}-out
  mkdir $OUT
-  rm $OUT/*
-  mv *.png *.pdf description.txt $OUT
+  if [[ $CLEAN != "0" ]] ; then
+    rm $OUT/*
+    mv description.txt $OUT
+  fi
+  mv *.png *.pdf $OUT
 done
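Note on the loop above: with `CLEAN=0` the output directory is kept and only new plots are moved into it, while `description.txt` stays in the working directory; with any other value the directory is emptied first and `description.txt` is moved as well. The same control flow, written as a minimal Python sketch for illustration only. The function name `collect_outputs` is hypothetical, and unlike the shell loop (which calls plain `mkdir`) the sketch tolerates an already existing output directory.

```python
import shutil
from pathlib import Path


def collect_outputs(workdir: Path, out: Path, clean: bool) -> None:
    """Move generated plots from `workdir` into `out`, optionally wiping `out` first."""
    out.mkdir(exist_ok=True)
    if clean:
        # Equivalent of `rm $OUT/*`: remove plain files, leave subdirectories alone.
        for old in out.iterdir():
            if old.is_file():
                old.unlink()
        desc = workdir / "description.txt"
        if desc.exists():
            shutil.move(str(desc), str(out / desc.name))
    # Plots are moved in both modes.
    for pattern in ("*.png", "*.pdf"):
        for plot in workdir.glob(pattern):
            shutil.move(str(plot), str(out / plot.name))
```

Calling `collect_outputs(Path("."), Path("job_similarities_4296426-out"), clean=False)` would correspond to the update mode selected by `CLEAN=0`.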
--- a/scripts/plot-single-job.py
+++ b/scripts/plot-single-job.py
@ -10,7 +10,7 @@ import matplotlib.cm as cm
 jobs = sys.argv[1].split(",")
 prefix = sys.argv[2].split(",")

-fileformat = ".png"
+fileformat = ".pdf"

 print("Plotting the job: " + str(sys.argv[1]))
 print("Plotting with prefix: " + str(sys.argv[2]))
@ -78,12 +78,16 @@ def plot(prefix, header, row):
  colors = []
  style = []
  for name, group in groups:
-    metrics[name] = [x[2] for x in group.values]
-    labels.append(name)
    style.append(linestyleMap[name] + markerMap[name])
    colors.append(colorMap[name])
+    if name == "md_file_delete":
+      name = "file_delete"
+    if name == "md_file_create":
+      name = "file_create"
+    metrics[name] = [x[2] for x in group.values]
+    labels.append(name)

-  fsize = (8, 1 + 1.5 * len(labels))
+  fsize = (8, 1 + 1.1 * len(labels))
  fsizeFixed = (8, 2)

  pyplot.close('all')
@ -97,7 +101,7 @@ def plot(prefix, header, row):
      ax[i].set_ylabel(l)

  pyplot.xlabel("Segment number")
-  pyplot.savefig(prefix + "timeseries" + jobid + fileformat, bbox_inches='tight')
+  pyplot.savefig(prefix + "timeseries" + jobid + fileformat, bbox_inches='tight', dpi=150)

  # Plot first 30 segments
  if len(timeseries) <= 50:
@ -113,7 +117,7 @@ def plot(prefix, header, row):
      ax[i].set_ylabel(l)

  pyplot.xlabel("Segment number")
-  pyplot.savefig(prefix + "timeseries" + jobid + "-30" + fileformat, bbox_inches='tight')
+  pyplot.savefig(prefix + "timeseries" + jobid + "-30" + fileformat, bbox_inches='tight', dpi=150)

 ### end plotting function

--- a/scripts/plot.R
+++ b/scripts/plot.R
@ -4,7 +4,7 @@ library(ggplot2)
 library(dplyr)
 require(scales)

-plotjobs = TRUE
+plotjobs = FALSE

 # Color scheme
 plotcolors <- c("#CC0000", "#FFA500", "#FFFF00", "#008000", "#9999ff", "#000066")
@ -22,19 +22,20 @@ cat("Job count:")
 cat(nrow(data))

 # empirical cumulative density function (ECDF)
-ggplot(data, aes(similarity, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("SIM") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4)) + scale_color_brewer(palette = "Set2")
-ggsave("ecdf.png", width=8, height=3)
+data$sim = data$similarity*100
+ggplot(data, aes(sim, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("Similarity in %") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4)) + scale_color_brewer(palette = "Set2") + scale_x_log10()
+ggsave("ecdf.png", width=8, height=2.5)

-ggplot(data, aes(similarity, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("SIM") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4))  + scale_color_brewer(palette = "Set2") + xlim(0.5, 1.0)
-ggsave("ecdf-0.5.png", width=8, height=3)
+# histogram for the jobs
+ggplot(data, aes(sim), group=alg_name) + geom_histogram(color="black", binwidth=2.5) + aes(fill = alg_name) + facet_grid(alg_name ~ ., switch = 'y') + xlab("Similarity in %") + scale_y_continuous(limits=c(0, 100), oob=squish)  +   scale_color_brewer(palette = "Set2") + ylab("Count (cropped at 100)") + theme(legend.position = "none") + stat_bin(binwidth=2.5, geom="text", adj=1.0, angle = 90, colour="black", size=3, aes(label=..count.., y=0*(..count..)+95))
+ggsave("hist-sim.png", width=6, height=4.5)
+
+#ggplot(data, aes(similarity, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("SIM") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4))  + scale_color_brewer(palette = "Set2") + xlim(0.5, 1.0)
+#ggsave("ecdf-0.5.png", width=8, height=3)

 e = data %>% filter(similarity >= 0.5)
 print(summary(e))

-# histogram for the jobs
-ggplot(data, aes(similarity), group=alg_name) + geom_histogram(color="black", binwidth=0.025) + aes(fill = alg_name) + facet_grid(alg_name ~ ., switch = 'y') + scale_y_continuous(limits=c(0, 100), oob=squish)  +   scale_color_brewer(palette = "Set2") + ylab("Count (cropped at 100)") + theme(legend.position = "none") + stat_bin(binwidth=0.025, geom="text", angle = 90, colour="black", size=3, aes(label=..count.., y=0*(..count..)+20))
-ggsave("hist-sim.png")
-
 # load job information, i.e., the time series per job
 jobData = read.csv("job-io-datasets/datasets/job_codings.csv")
 metadata = read.csv("job-io-datasets/datasets/job_metadata.csv")