This commit is contained in:
Julian M. Kunkel 2020-08-21 19:12:33 +01:00
parent 60116fc7a6
commit 7dc7328e34
5 changed files with 115 additions and 71 deletions


@ -137,17 +137,71 @@ Check time series algorithms:
\section{Evaluation}
\label{sec:evaluation}
For each reference job and algorithm, we created a CSV file with the computed similarity to all other jobs.
Next, we analyze the performance of the algorithms, then the quantitative behavior and the correlation between the chosen similarity and the number of jobs found, and, finally, the quality of the 100 most similar jobs.
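For illustration, selecting the most similar jobs from such a per-reference CSV can be sketched as follows; the column names (`jobid`, `alg_name`, `similarity`) are assumptions for this sketch, not the actual file layout:

```python
import pandas as pd

def top_similar(df: pd.DataFrame, algorithm: str, n: int = 100) -> pd.DataFrame:
    """Return the n jobs ranked most similar to the reference job for one algorithm."""
    sub = df[df["alg_name"] == algorithm]
    return sub.nlargest(n, "similarity")  # highest similarity first

# Toy stand-in for one similarity CSV (hypothetical column names and values)
df = pd.DataFrame({
    "jobid": [1, 2, 3, 4],
    "alg_name": ["hex_native"] * 4,
    "similarity": [0.9, 0.2, 0.75, 0.5],
})
print(top_similar(df, "hex_native", n=2)["jobid"].tolist())  # [1, 3]
```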
\subsection{Reference Jobs}
In the following, we assume a job is given and we aim to identify similar jobs.
We chose several reference jobs with different compute and IO characteristics:
\begin{itemize}
\item Job-S: performs post-processing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so-called CMORization). The post-processing is IO intensive.
\item Job-M: a typical MPI parallel 8-hour compute job on 128 nodes which writes time series data after some spin up. %CHE.ws12
\item Job-L: a 66-hour 20-node job.
The initialization data is read at the beginning.
Then only a single master node constantly writes a small volume of data; in fact, the generated data is too small to be categorized as IO relevant.
\end{itemize}
The segmented timelines of the jobs are visualized in \Cref{fig:refJobs}.
This coding is also used for the HEX class of algorithms (the BIN algorithms merge all timelines together as described in \jk{TODO}).
The figures show the values of active metrics ($\neq 0$) only; if only a few are active, they are shown in one timeline, otherwise they are rendered individually to provide a better overview.
For example, we can see in \Cref{fig:job-S} that several metrics increase in Segment\,6.
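The active-metric filter described above can be sketched as follows (a minimal illustration; the metric names and values are made up):

```python
# Keep only metric timelines that are active (non-zero) in at least one segment.
def active_metrics(timelines: dict) -> dict:
    return {name: values for name, values in timelines.items()
            if any(v != 0 for v in values)}

example = {
    "md_file_create": [1, 0, 0, 2],
    "read_bytes": [0, 0, 3, 1],
    "write_bytes": [0, 0, 0, 0],  # inactive everywhere, dropped
}
print(sorted(active_metrics(example)))  # ['md_file_create', 'read_bytes']
```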
\begin{figure}
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries4296426}
\caption{Job-S} \label{fig:job-S}
\end{subfigure}
\centering
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries5024292}
\caption{Job-M} \label{fig:job-M}
\end{subfigure}
\centering
\caption{Reference jobs: segmented timelines of mean IO activity}
\label{fig:refJobs}
\end{figure}
\begin{figure}\ContinuedFloat
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries7488914-30}
\caption{Job-L (first 30 segments of 400; remaining segments are similar)}
\label{fig:job-L}
\end{subfigure}
\centering
\caption{Reference jobs: segmented timelines of mean IO activity}
\end{figure}
\subsection{Performance}
\jk{Describe System at DKRZ from old paper}
The runtime for computing the similarity to all relevant IO jobs (580,000 jobs for the BIN algorithms and 440,000 for the HEX algorithms) is shown in \Cref{fig:performance}.
\jk{TO FIX, This is for clustering algorithm, not for computing SIM, which is what we do here.}
\begin{figure}
\centering
@ -168,93 +222,73 @@ For each reference job and algorithm, we created a CSV files with the computed s
\end{figure}
\subsection{Quantitative Analysis}
In the quantitative analysis, we explore, for the different algorithms, how similar the jobs in our pool are to the three reference jobs (Job-S, Job-M, and Job-L).
The cumulative distribution of similarity to the reference jobs is shown in \Cref{fig:ecdf}.
For example, in \Cref{fig:ecdf-job-S}, we see that about 70\% of jobs have a similarity of less than 10\% to Job-S for HEX\_native.
BIN\_aggzeros shows some steep increases; e.g., more than 75\% of jobs share the same low similarity of below 2\%.
The different algorithms lead to different curves for our reference jobs; e.g., for Job-S, HEX\_phases bundles more jobs with low similarity compared to the other algorithms, while for Job-L its curve grows the slowest.
% This indicates that the algorithms
The support team in a data center has only limited time to investigate the most similar jobs.
Since the time for analysis is typically bounded, the team may, for instance, analyze the 100 most similar jobs.
In \Cref{fig:hist}, the histograms with the actual number of jobs for a given similarity are shown.
As we focus on a feasible number of jobs, the diagrams should be read from right (100\% similarity) to left; for each bin, we show at most 100 jobs (the total number is still given).
It turns out that both BIN algorithms produce nearly identical histograms, so we omit one of them.
The figures again show that the behavior of the algorithms differs depending on the reference job.
Especially for Job-S, we can see clusters of jobs with higher similarity, while for Job-M, the growth in the relevant section is steadier.
For Job-L, we find barely any similar jobs, except when using the HEX\_phases algorithm.
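As a minimal illustration of how the cumulative distribution in \Cref{fig:ecdf} is read, the fraction of jobs below a similarity threshold can be computed as follows (illustrative values, not the paper's data):

```python
def ecdf_at(similarities, threshold):
    """Fraction of jobs whose similarity lies below the given threshold."""
    return sum(1 for s in similarities if s < threshold) / len(similarities)

# Hypothetical similarities of five jobs to a reference job
sims = [0.05, 0.08, 0.09, 0.3, 0.95]
print(ecdf_at(sims, 0.10))  # 0.6 -> 60% of these jobs are less than 10% similar
```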
\begin{figure}
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/ecdf}
\caption{Job-S} \label{fig:ecdf-job-S}
\end{subfigure}
\centering
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/ecdf}
\caption{Job-M} \label{fig:ecdf-job-M}
\end{subfigure}
\centering
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/ecdf}
\caption{Job-L} \label{fig:ecdf-job-L}
\end{subfigure}
\centering
\caption{Quantitative job similarity -- empirical cumulative density function}
\label{fig:ecdf}
\end{figure}
\begin{figure}
\centering
\begin{subfigure}{0.75\textwidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 2.2cm},clip]{job_similarities_4296426-out/hist-sim}
\caption{Job-S} \label{fig:hist-job-S}
\end{subfigure}
\begin{subfigure}{0.75\textwidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 2.2cm},clip]{job_similarities_5024292-out/hist-sim}
\caption{Job-M} \label{fig:hist-job-M}
\end{subfigure}
\begin{subfigure}{0.75\textwidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 2.2cm},clip]{job_similarities_7488914-out/hist-sim}
\caption{Job-L} \label{fig:hist-job-L}
\end{subfigure}
\centering
\caption{Histogram for the number of jobs (bin width: 2.5\%, numbers are the actual job counts). BIN\_aggzeros is nearly identical to BIN\_all.}
\label{fig:hist}
\end{figure}
@ -415,7 +449,7 @@ One consideration is to identify jobs that meet a rank threshold for all differe
% \ContinuedFloat
Hex phases very similar to hex native.
Strange job to inspect: \verb|job_similarities_4296426-out/hex_phases-0.7429--93timeseries4237860|
Bin aggzeros works quite well here too. The jobs are a bit more diverse.
@ -602,7 +636,7 @@ Bin aggzero returns garbage.
\end{subfigure}
\caption{Job-L with hex\_lev, selection of similar jobs}
\label{fig:job-L-hex-lev}
\end{figure}


@ -4,6 +4,8 @@
echo "This script performs the complete analysis steps"
CLEAN=0 # Set to 0 to update outputs without wiping the output directories
function prepare(){
	pushd datasets
	./decompress.sh
@ -21,6 +23,9 @@ for I in job_similarities_*.csv ; do
	./scripts/plot.R $I > description.txt
	OUT=${I%%.csv}-out
	mkdir $OUT
	if [[ $CLEAN != "0" ]] ; then
		rm $OUT/*
		mv description.txt $OUT
	fi
	mv *.png *.pdf $OUT
done


@ -10,7 +10,7 @@ import matplotlib.cm as cm
jobs = sys.argv[1].split(",")
prefix = sys.argv[2].split(",")
fileformat = ".pdf"

print("Plotting the job: " + str(sys.argv[1]))
print("Plotting with prefix: " + str(sys.argv[2]))
@ -78,12 +78,16 @@ def plot(prefix, header, row):
colors = []
style = []
for name, group in groups:
    style.append(linestyleMap[name] + markerMap[name])
    colors.append(colorMap[name])
    if name == "md_file_delete":
        name = "file_delete"
    if name == "md_file_create":
        name = "file_create"
    metrics[name] = [x[2] for x in group.values]
    labels.append(name)
fsize = (8, 1 + 1.1 * len(labels))
fsizeFixed = (8, 2)
pyplot.close('all')
@ -97,7 +101,7 @@ def plot(prefix, header, row):
ax[i].set_ylabel(l)
pyplot.xlabel("Segment number")
pyplot.savefig(prefix + "timeseries" + jobid + fileformat, bbox_inches='tight', dpi=150)
# Plot first 30 segments
if len(timeseries) <= 50:
@ -113,7 +117,7 @@ def plot(prefix, header, row):
ax[i].set_ylabel(l)
pyplot.xlabel("Segment number")
pyplot.savefig(prefix + "timeseries" + jobid + "-30" + fileformat, bbox_inches='tight', dpi=150)
### end plotting function


@ -4,7 +4,7 @@ library(ggplot2)
library(dplyr)
require(scales)

plotjobs = FALSE

# Color scheme
plotcolors <- c("#CC0000", "#FFA500", "#FFFF00", "#008000", "#9999ff", "#000066")
@ -22,19 +22,20 @@ cat("Job count:")
cat(nrow(data))

# empirical cumulative density function (ECDF)
data$sim = data$similarity*100
ggplot(data, aes(sim, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("Similarity in %") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4)) + scale_color_brewer(palette = "Set2") + scale_x_log10()
ggsave("ecdf.png", width=8, height=2.5)

# histogram for the jobs
ggplot(data, aes(sim), group=alg_name) + geom_histogram(color="black", binwidth=2.5) + aes(fill = alg_name) + facet_grid(alg_name ~ ., switch = 'y') + xlab("Similarity in %") + scale_y_continuous(limits=c(0, 100), oob=squish) + scale_color_brewer(palette = "Set2") + ylab("Count (cropped at 100)") + theme(legend.position = "none") + stat_bin(binwidth=2.5, geom="text", adj=1.0, angle = 90, colour="black", size=3, aes(label=..count.., y=0*(..count..)+95))
ggsave("hist-sim.png", width=6, height=4.5)

#ggplot(data, aes(similarity, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("SIM") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4)) + scale_color_brewer(palette = "Set2") + xlim(0.5, 1.0)
#ggsave("ecdf-0.5.png", width=8, height=3)

e = data %>% filter(similarity >= 0.5)
print(summary(e))
# load job information, i.e., the time series per job
jobData = read.csv("job-io-datasets/datasets/job_codings.csv")
metadata = read.csv("job-io-datasets/datasets/job_metadata.csv")