diff --git a/datasets/job_assessment.csv.tar.xz b/datasets/job_assessment.csv.tar.xz
deleted file mode 100644
index 8922e04..0000000
Binary files a/datasets/job_assessment.csv.tar.xz and /dev/null differ
diff --git a/datasets/job_codings_v3.csv.tar.xz b/datasets/job_codings_v3.csv.tar.xz
deleted file mode 100644
index 570ea29..0000000
Binary files a/datasets/job_codings_v3.csv.tar.xz and /dev/null differ
diff --git a/datasets/job_codings_v3_confidential.csv.tar.xz b/datasets/job_codings_v3_confidential.csv.tar.xz
deleted file mode 100644
index ef97290..0000000
Binary files a/datasets/job_codings_v3_confidential.csv.tar.xz and /dev/null differ
diff --git a/datasets/job_codings_v4.csv.tar.xz b/datasets/job_codings_v4.csv.tar.xz
deleted file mode 100644
index fdf82e6..0000000
Binary files a/datasets/job_codings_v4.csv.tar.xz and /dev/null differ
diff --git a/datasets/job_codings_v4_confidential.csv.tar.xz b/datasets/job_codings_v4_confidential.csv.tar.xz
deleted file mode 100644
index 30a4b67..0000000
Binary files a/datasets/job_codings_v4_confidential.csv.tar.xz and /dev/null differ
diff --git a/datasets/job_metadata.csv.tar.xz b/datasets/job_metadata.csv.tar.xz
deleted file mode 100644
index 176772b..0000000
Binary files a/datasets/job_metadata.csv.tar.xz and /dev/null differ
diff --git a/datasets/job_metadata_confidential.csv.tar.xz b/datasets/job_metadata_confidential.csv.tar.xz
deleted file mode 100644
index 268266f..0000000
Binary files a/datasets/job_metadata_confidential.csv.tar.xz and /dev/null differ
diff --git a/fig/progress_4296426-out-boxplot.png b/fig/progress_4296426-out-boxplot.png
index 8ad3765..debe9fc 100644
Binary files a/fig/progress_4296426-out-boxplot.png and b/fig/progress_4296426-out-boxplot.png differ
diff --git a/fig/progress_4296426-out-cummulative.png b/fig/progress_4296426-out-cummulative.png
index d4f4d42..7ddf8b2 100644
Binary files a/fig/progress_4296426-out-cummulative.png and b/fig/progress_4296426-out-cummulative.png differ
diff --git a/fig/progress_5024292-out-boxplot.png b/fig/progress_5024292-out-boxplot.png
index 3e3a9d3..8a9716e 100644
Binary files a/fig/progress_5024292-out-boxplot.png and b/fig/progress_5024292-out-boxplot.png differ
diff --git a/fig/progress_5024292-out-cummulative.png b/fig/progress_5024292-out-cummulative.png
index 52fe3d1..1af1fa2 100644
Binary files a/fig/progress_5024292-out-cummulative.png and b/fig/progress_5024292-out-cummulative.png differ
diff --git a/fig/progress_7488914-out-boxplot.png b/fig/progress_7488914-out-boxplot.png
index d565916..fdc2272 100644
Binary files a/fig/progress_7488914-out-boxplot.png and b/fig/progress_7488914-out-boxplot.png differ
diff --git a/fig/progress_7488914-out-cummulative.png b/fig/progress_7488914-out-cummulative.png
index e259836..5480b22 100644
Binary files a/fig/progress_7488914-out-cummulative.png and b/fig/progress_7488914-out-cummulative.png differ
diff --git a/paper/main.tex b/paper/main.tex
index 87f872b..acf32e2 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -85,7 +85,6 @@ DKRZ --
 
 
 \begin{abstract}
-\todo{Rename algorithm according to JHPS paper, evtl. describe each algorithm with one sentence?}
 One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency.
 Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs.
 While it is easy to utilize statistics to rank jobs based on the utilization of computing, storage, and network, it is tricky to find patterns in 100.000 jobs, i.e., is there a class of jobs that aren't performing well.
@@ -200,12 +199,14 @@ After data is reduced across nodes, we quantize the timelines either using binar
 By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero, we are reducing the dataset from about 1 million jobs to about 580k jobs.
 
 \subsection{Algorithms for Computing Similarity}
-We reuse the algorithms developed in \cite{Eugen20HPS}: B-all, B-aggzeros, Q-native, Q-lev, and Q-quant.
+We reuse the algorithms developed in \cite{Eugen20HPS}: B-all, B-aggz(eros), Q-native, Q-lev, and Q-phases.
 They differ in the way data similarity is defined; either the binary or hexadecimal coding is used, the distance measure is mostly the Euclidean distance or the Levenshtein-distance.
 For jobs with different lengths, we apply a sliding-windows approach which finds the location for the shorter job in the long job with the highest similarity.
-The Q-quant algorithm extracts I/O phases and computes the similarity between the most similar I/O phases of both jobs.
+\todo{evtl. describe each algorithm with one sentence?}
+The Q-phases algorithm extracts I/O phases and computes the similarity between the most similar I/O phases of both jobs.
 In this paper, we add a new similarity definition based on Kolmogorov-Smirnov-Test that compares the probability distribution of the observed values which we describe in the following.
+
 
 \paragraph{Kolmogorov-Smirnov (kv) algorithm}
 % Summary
 For the analysis, we perform two preparation steps.
@@ -391,7 +392,7 @@ We believe this will then allow an online analysis.
 In the quantitative analysis, we explore the different algorithms how the similarity of our pool of jobs behaves to our reference jobs.
 The cumulative distribution of similarity to a reference job is shown in \Cref{fig:ecdf}.
 For example, in \Cref{fig:ecdf-job-S}, we see that about 70\% have a similarity of less than 10\% to Job-S for Q-native.
-B-aggzeros shows some steep increases, e.g., more than 75\% of jobs have the same low similarity below 2\%.
+B-aggz shows some steep increases, e.g., more than 75\% of jobs have the same low similarity below 2\%.
 The different algorithms lead to different curves for our reference jobs, e.g., for Job-S, Q-phases bundles more jobs with low similarity compared to the other jobs; in Job-L, it is the slowest.
 % This indicates that the algorithms
 
@@ -455,7 +456,7 @@ Practically, the support team would start with Rank\,1 (most similar job, presum
 \caption{Job-L} \label{fig:hist-job-L}
 \end{subfigure}
 \centering
-\caption{Histogram for the number of jobs (bin width: 2.5\%, numbers are the actual job counts). B-aggzeros is nearly identical to B-all.}
+\caption{Histogram for the number of jobs (bin width: 2.5\%, numbers are the actual job counts). B-aggz is nearly identical to B-all.}
 \label{fig:hist}
 \end{figure}
 
@@ -563,7 +564,7 @@ For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:h
 
 \subsubsection{Algorithmic differences}
 To verify that the different algorithms behave differently, the intersection for the Top\,100 is computed for all combinations of algorithms and visualized in \Cref{fig:heatmap-job}.
-Bin\_all and B-aggzeros overlap with at least 99 ranks for all three jobs.
+Bin\_all and B-aggz overlap with at least 99 ranks for all three jobs.
 While there is some reordering, both algorithms lead to a comparable set.
 All algorithms have a significant overlap for Job-S.
 For Job\-M, however, they lead to a different ranking, and Top\,100, particularly KS determines a different set.
@@ -622,13 +623,13 @@ For Job-S, we found that all algorithms work well and, therefore, omit further t
 \begin{table}[bt]
 \centering
 \begin{tabular}{r|r|r|r|r|r}
- B-aggzeros & B-all & Q-lev & Q-native & Q-phases & KS\\ \hline
+ B-aggz & B-all & Q-lev & Q-native & Q-phases & KS\\ \hline
 38 & 38 & 33 & 26 & 33 & 0
 \end{tabular}
 
 %\begin{tabular}{r|r}
 % Algorithm & Jobs \\ \hline
-% B-aggzeros & 38 \\
+% B-aggz & 38 \\
 % B-all & 38 \\
 % Q-lev & 33 \\
 % Q-native & 26 \\
@@ -653,7 +654,7 @@ For Job-S, we found that all algorithms work well and, therefore, omit further t
 \caption{Non-control job: Rank\,4, SIM=81\%}
 \end{subfigure}
 
-\caption{Job-S: jobs with different job names when using B-aggzeros}
+\caption{Job-S: jobs with different job names when using B-aggz}
 \label{fig:job-S-bin-agg}
 \end{figure}
 
diff --git a/scripts/plot-performance.R b/scripts/plot-performance.R
index 6186095..ed94285 100755
--- a/scripts/plot-performance.R
+++ b/scripts/plot-performance.R
@@ -11,9 +11,9 @@ prefix = args[2]
 
 # Plot the performance numbers of the analysis
 data = read.csv(file)
-levels(data$alg_name)[levels(data$alg_name) == "bin_aggzeros"] = "bin_aggz"
-levels(data$alg_name)[levels(data$alg_name) == "hex_native"] = "hex_nat"
-levels(data$alg_name)[levels(data$alg_name) == "hex_phases"] = "hex_phas"
+levels(data$alg_name)[levels(data$alg_name) == "B-aggzeros"] = "B-aggz"
+levels(data$alg_name)[levels(data$alg_name) == "Q-native"] = "Q-nat"
+levels(data$alg_name)[levels(data$alg_name) == "Q-phases"] = "Q-phas"
 
 e = data %>% filter(jobs_done >= (jobs_total - 9998))
 e$time_per_100k = e$elapsed / (e$jobs_done / 100000)
diff --git a/scripts/plot.R b/scripts/plot.R
index f9658c2..4d527b2 100755
--- a/scripts/plot.R
+++ b/scripts/plot.R
@@ -11,7 +11,7 @@ library(stringi)
 library(stringr)
 
 # Turn to TRUE to print indivdiual job images
-plotjobs = TRUE
+plotjobs = FALSE
 
 # Color scheme
 plotcolors <- c("#CC0000", "#FFA500", "#FFFF00", "#008000", "#9999ff", "#000099")