diff --git a/fig/progress_4296426-out-boxplot.png b/fig/progress_4296426-out-boxplot.png
new file mode 100644
index 0000000..e4df5dc
Binary files /dev/null and b/fig/progress_4296426-out-boxplot.png differ
diff --git a/fig/progress_4296426-out-cummulative.png b/fig/progress_4296426-out-cummulative.png
new file mode 100644
index 0000000..06e39d8
Binary files /dev/null and b/fig/progress_4296426-out-cummulative.png differ
diff --git a/fig/progress_5024292-out-boxplot.png b/fig/progress_5024292-out-boxplot.png
new file mode 100644
index 0000000..852a8e2
Binary files /dev/null and b/fig/progress_5024292-out-boxplot.png differ
diff --git a/fig/progress_5024292-out-cummulative.png b/fig/progress_5024292-out-cummulative.png
new file mode 100644
index 0000000..6359ae2
Binary files /dev/null and b/fig/progress_5024292-out-cummulative.png differ
diff --git a/fig/progress_7488914-out-boxplot.png b/fig/progress_7488914-out-boxplot.png
new file mode 100644
index 0000000..3481a06
Binary files /dev/null and b/fig/progress_7488914-out-boxplot.png differ
diff --git a/fig/progress_7488914-out-cummulative.png b/fig/progress_7488914-out-cummulative.png
new file mode 100644
index 0000000..16af8f8
Binary files /dev/null and b/fig/progress_7488914-out-cummulative.png differ
diff --git a/paper/main.tex b/paper/main.tex
index a5d0feb..41498b3 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -198,26 +198,35 @@ For example, we can see in \Cref{fig:job-S}, that several metrics increase in Se
 
 \jk{Describe System at DKRZ from old paper}
 
-The runtime for computing the similarity of relevant IO jobs (580,000 and 440,000 for BIN and HEX algorithms, respectively) is shown in \Cref{fig:performance}.
-
-\jk{TO FIX, This is for clustering algorithm, not for computing SIM, which is what we do here.}
-
+To measure the performance of computing the similarity to the reference jobs, the algorithms are executed 10 times on a compute node at DKRZ.
+A boxplot of the runtimes is shown in \Cref{fig:performance}.
+The runtime is normalized to the time per 100k jobs, i.e., bin\_all takes about 41\,s to process 100k jobs out of the 500k total jobs that this algorithm processes.
+Generally, the bin algorithms are the fastest, while the hex algorithms often take 4-5x as long.
+Hex\_phases is slow for Job-S and Job-M but fast for Job-L; the reason is that only one phase is extracted for Job-L.
+The Levenshtein-based algorithms take longer for longer jobs, roughly proportional to the job length, as they apply a sliding window.
+Note that the current algorithms are sequential and executed on just one core.
+For computing the similarity to one (or a small set of) reference jobs, they could easily be parallelized.
+We believe this would then allow a near-online analysis of a job.
 \begin{figure}
 \centering
-    \begin{subfigure}{0.8\textwidth}
+    \begin{subfigure}{0.31\textwidth}
     \centering
-    \includegraphics[width=\textwidth]{runtime-overview}
-    \caption{Overview to process all jobs} \label{fig:runtime-overview}
+    \includegraphics[width=\textwidth]{progress_4296426-out-boxplot}
+    \caption{Job-S} \label{fig:perf-job-S}
+    \end{subfigure}
+    \begin{subfigure}{0.31\textwidth}
+    \centering
+    \includegraphics[width=\textwidth]{progress_5024292-out-boxplot}
+    \caption{Job-M} \label{fig:perf-job-M}
+    \end{subfigure}
+    \begin{subfigure}{0.31\textwidth}
+    \centering
+    \includegraphics[width=\textwidth]{progress_7488914-out-boxplot}
+    \caption{Job-L} \label{fig:perf-job-L}
     \end{subfigure}
-    \begin{subfigure}{0.8\textwidth}
-    \centering
-    \includegraphics[width=\textwidth]{runtime-cummulative}
-    \caption{Cumulative} \label{fig:runtime-cummulative}
-    \end{subfigure}
-
-    \caption{Performance of the algorithms}
+    \caption{Runtime overview for all algorithms and jobs}
 \label{fig:performance}
 \end{figure}
@@ -232,14 +241,16 @@ The different algorithms lead to different curves for our reference jobs, e.g.,
 % This indicates that the algorithms
 The support team in a data center may have time to investigate the most similar jobs.
-Time for the analysis is typically bound, for instance, the team may analyze the 100 most similar jobs.
+Time for the analysis is typically bounded; for instance, the team may analyze the 100 highest-ranked jobs (the Top\,100).
 In \Cref{fig:hist}, the histograms with the actual number of jobs for a given similarity are shown.
-As we focus on a feasible number of jobs, the diagram should be read from right (100\% similarity) to left and for a bin we show at most 100 jobs (total number is still given).
+As we focus on a feasible number of jobs, the diagram should be read from right (100\% similarity) to left; for a bin we show at most 100 jobs (the total number is still given).
 It turns out that both BIN algorithms produce nearly identical histograms and we omit one of them.
 In the figures, we can see again a different behavior of the algorithms depending on the reference job.
-Especially for Job-S, we can see clusters with jobs of higher similarity while for Job-M, the growth in the relevant section is more steady.
+Especially for Job-S, we can see clusters of jobs with higher similarity (e.g., for hex\_lev at SIM=75\%), while for Job-M, the growth in the relevant section is steadier.
 For Job-L, we find barely similar jobs, except when using the HEX\_phases algorithm.
+Practically, the support team would start with Rank\,1 (the most similar job, presumably the reference job itself) and walk down the ranking until the jobs look different or until a cluster has been analyzed.
+
 \begin{figure}
 \begin{subfigure}{0.8\textwidth}

 \label{fig:hist}
 \end{figure}
 
-\subsection{Quantitative Analysis of Selected Jobs}
+\subsubsection{Inclusivity and Specificity}
 
 User count and group id is the same, meaning that a user is likely from the same group and the number of groups is identical to the number of users (unique), for Job-L user id and group count differ a bit, for Job-M a bit more.
 Up to about 2x users than groups.
 
-To understand how the Top\,100 jobs are distributed across users, the data is grouped by userid and counted.
+To understand how the Top\,100 are distributed across users, the data is grouped by userid and counted.
 \Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the top most user in the stack has the smallest number of jobs.
 For Job-S, we can see that about 70-80\% of jobs stem from one user, for the hex\_lev and hex\_native algorithms, the other jobs stem from a second user while bin includes jobs from additional users (5 in total).
 For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user, here, hex\_lev and hex\_native is including more users (30 and 33, respectively) than the other three algorithms.
@@ -388,15 +399,16 @@ One consideration is to identify jobs that meet a rank threshold for all differe
 \begin{subfigure}{0.31\textwidth}
 \centering
 \includegraphics[width=\textwidth]{job_similarities_5024292-out/intersection-heatmap}
-\caption{Job-M} \label{fig:heatmap-job-M}
+\caption{Job-M} \label{fig:heatmap-job-M} %,trim={2.5cm 0 0 0},clip
 \end{subfigure}
 \begin{subfigure}{0.31\textwidth}
 \centering
 \includegraphics[width=\textwidth]{job_similarities_7488914-out/intersection-heatmap}
 \caption{Job-L} \label{fig:heatmap-job-L}
 \end{subfigure}
+
 \centering
-\caption{Intersection of the top 100 jobs for the different algorithms}
+\caption{Intersection of the 100 top-ranked jobs for the different algorithms}
 \label{fig:heatmap-job}
 \end{figure}
diff --git a/scripts/analyse-all.sh b/scripts/analyse-all.sh
index d7b968c..26767dd 100755
--- a/scripts/analyse-all.sh
+++ b/scripts/analyse-all.sh
@@ -29,3 +29,10 @@ for I in job_similarities_*.csv ; do
 	fi
 	mv *.png *.pdf $OUT
 done
+
+# analyze performance data
+
+for I in datasets/progress_*.csv ; do
+	OUT=fig/$(basename ${I%%.csv}-out)
+	./scripts/plot-performance.R $I $OUT
+done
diff --git a/scripts/plot-performance-clustering.R b/scripts/plot-performance-clustering.R
new file mode 100755
index 0000000..9315e33
--- /dev/null
+++ b/scripts/plot-performance-clustering.R
@@ -0,0 +1,21 @@
+#!/usr/bin/env Rscript
+library(ggplot2)
+library(dplyr)
+require(scales)
+
+# Plot the performance numbers of the clustering
+
+data = read.csv("datasets/clustering_progress.csv")
+
+e = data %>% filter(min_sim %in% c(0.1, 0.5, 0.99))
+e$percent = paste("SIM =", as.factor(round(e$min_sim*100,0)), " %")
+
+# Development when adding more jobs
+ggplot(e, aes(x=jobs_done, y=elapsed, color=alg_name)) + geom_point() + facet_grid(percent ~ .) + ylab("Cummulative runtime in s") + xlab("Jobs processed") + scale_y_log10() + theme(legend.position = "bottom")
+ggsave("fig/runtime-cummulative.png", width=6, height=4.5)
+
+# Bar chart for the maximum
+e = data %>% filter(jobs_done >= (jobs_total - 9998))
+e$percent = as.factor(round(e$min_sim*100,0))
+ggplot(e, aes(y=elapsed, x=percent, fill=alg_name)) + geom_bar(stat="identity") + facet_grid(. ~ alg_name, switch = 'y') + scale_y_log10() + theme(legend.position = "none") + ylab("Runtime in s") + xlab("Minimum similarity in %") + geom_text(aes(label = round(elapsed,0), angle = 90, y=0*(elapsed)+20))
+ggsave("fig/runtime-overview.png", width=7, height=2)
diff --git a/scripts/plot-performance.R b/scripts/plot-performance.R
index 9315e33..86d2678 100755
--- a/scripts/plot-performance.R
+++ b/scripts/plot-performance.R
@@ -3,19 +3,20 @@ library(ggplot2)
 library(dplyr)
 require(scales)
 
-# Plot the performance numbers of the clustering
+args = commandArgs(trailingOnly = TRUE)
+file = "datasets/progress_4296426.csv" # for manual execution
+file = args[1]
+prefix = args[2]
 
-data = read.csv("datasets/clustering_progress.csv")
-
-e = data %>% filter(min_sim %in% c(0.1, 0.5, 0.99))
-e$percent = paste("SIM =", as.factor(round(e$min_sim*100,0)), " %")
+# Plot the performance numbers of the analysis
+data = read.csv(file)
+
+e = data %>% filter(jobs_done >= (jobs_total - 9998))
+e$time_per_100k = e$elapsed / (e$jobs_done / 100000)
+ggplot(e, aes(alg_name, time_per_100k, fill=alg_name)) + geom_boxplot() + theme(legend.position=c(0.2, 0.7)) + xlab("Algorithm") + ylab("Runtime in s per 100k jobs") + stat_summary(aes(label=round(..y..,0)), position = position_nudge(x = 0, y = 250), fun=mean, geom="text", size=4)
+ggsave(paste(prefix, "-boxplot.png", sep=""), width=5, height=4)
 
 # Development when adding more jobs
-ggplot(e, aes(x=jobs_done, y=elapsed, color=alg_name)) + geom_point() + facet_grid(percent ~ .) + ylab("Cummulative runtime in s") + xlab("Jobs processed") + scale_y_log10() + theme(legend.position = "bottom")
-ggsave("fig/runtime-cummulative.png", width=6, height=4.5)
-
-# Bar chart for the maximum
-e = data %>% filter(jobs_done >= (jobs_total - 9998))
-e$percent = as.factor(round(e$min_sim*100,0))
-ggplot(e, aes(y=elapsed, x=percent, fill=alg_name)) + geom_bar(stat="identity") + facet_grid(. ~ alg_name, switch = 'y') + scale_y_log10() + theme(legend.position = "none") + ylab("Runtime in s") + xlab("Minimum similarity in %") + geom_text(aes(label = round(elapsed,0), angle = 90, y=0*(elapsed)+20))
-ggsave("fig/runtime-overview.png", width=7, height=2)
+ggplot(data, aes(x=jobs_done, y=elapsed, color=alg_name)) + geom_point() + ylab("Cumulative runtime in s") + xlab("Jobs processed") + theme(legend.position = "bottom") #+ scale_x_log10() + scale_y_log10()
+ggsave(paste(prefix, "-cummulative.png", sep=""), width=6, height=4.5)
diff --git a/scripts/plot.R b/scripts/plot.R
index c8ff172..e66316c 100755
--- a/scripts/plot.R
+++ b/scripts/plot.R
@@ -113,8 +113,8 @@ for (l1 in levels(data$alg_name)){
 print(res.intersect)
 
 # Plot heatmap about intersection
-ggplot(tbl.intersect, aes(first, second, fill=intersect)) + geom_tile() + geom_text(aes(label = round(intersect, 1))) + scale_fill_gradientn(colours = rev(plotcolors)) + xlab("") + ylab("")
-ggsave("intersection-heatmap.png", width=6, height=5)
+ggplot(tbl.intersect, aes(first, second, fill=intersect)) + geom_tile() + geom_text(aes(label = round(intersect, 1))) + scale_fill_gradientn(colours = rev(plotcolors)) + xlab("") + ylab("") + theme(legend.position = "bottom", legend.title = element_blank())
+ggsave("intersection-heatmap.png", width=4.5, height=4.5)
 
 # Collect the metadata of all jobs in a new table
 res.jobs = tibble()
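
The revised paragraph in main.tex states that the similarity computation is currently sequential per reference job but could easily be parallelized, enabling near-online analysis. The following is a minimal R sketch of that idea and is not part of this patch: similarity() is a hypothetical stand-in for the bin_*/hex_* kernels, and the job codings are synthetic.

#!/usr/bin/env Rscript
library(parallel)

# Hypothetical stand-in for the bin_*/hex_* similarity kernels:
# fraction of matching positions between two equal-length hexadecimal codings.
similarity = function(job, ref) {
  mean(strsplit(job, "")[[1]] == strsplit(ref, "")[[1]])
}

# Synthetic 16-digit hexadecimal codings; in practice these would come from the job database.
jobs = replicate(10000, paste(sample(c(0:9, letters[1:6]), 16, replace = TRUE), collapse = ""))
ref  = jobs[[1]]

# Score all candidate jobs against one reference job on several cores (Unix-like systems only).
scores = unlist(mclapply(jobs, similarity, ref = ref, mc.cores = 4))

# Keep the 100 highest-ranked jobs (the Top 100) for the support team to inspect.
top100 = order(scores, decreasing = TRUE)[1:100]

Since each candidate job is scored independently of the others, the scoring is embarrassingly parallel; the same structure would apply when distributing the work across nodes.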