Added figures.

This commit is contained in:
Julian M. Kunkel 2020-08-25 18:00:28 +01:00
parent 153c6a440e
commit cc929d7db1
11 changed files with 76 additions and 35 deletions

Six binary image files added (not shown): 72 KiB, 150 KiB, 76 KiB, 189 KiB, 76 KiB, and 127 KiB.
@ -198,26 +198,35 @@ For example, we can see in \Cref{fig:job-S}, that several metrics increase in Se
\jk{Describe System at DKRZ from old paper}
The runtime for computing the similarity of relevant IO jobs (580,000 and 440,000 for BIN and HEX algorithms, respectively) is shown in \Cref{fig:performance}.
\jk{TO FIX, This is for clustering algorithm, not for computing SIM, which is what we do here.}
To measure the performance for computing the similarity to the reference jobs, the algorithms are executed 10 times on a compute node at DKRZ.
A boxplot for the runtimes is shown in \Cref{fig:performance}.
The runtime is normalized to 100k jobs, i.e., bin\_all takes about 41\,s to process 100k jobs out of the 500k total jobs that this algorithm will process.
Generally, the BIN algorithms are fastest, while the HEX algorithms often take 4-5x as long.
Hex\_phases is slow for Job-S and Job-M while it is fast for Job-L; the reason is that just one phase is extracted for Job-L.
The Levenshtein-based algorithms take longer for longer jobs -- proportional to the job length, as they apply a sliding window.
Note that the current algorithms are sequential and executed on just one core.
For computing the similarity to one (or a small set of) reference jobs, they could easily be parallelized.
We believe this will then allow a near-online analysis of a job.
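The per-job similarity scan is embarrassingly parallel, since each job is compared to the reference independently. A minimal sketch (illustrative only: the paper's tooling is in R, and the similarity function below is a toy stand-in for the BIN/HEX/Levenshtein variants):

```python
from concurrent.futures import ThreadPoolExecutor
import random

def similarity(job, ref):
    # Toy kernel: fraction of matching positions between two job codings;
    # stands in for the BIN/HEX/Levenshtein similarity of the paper.
    m = min(len(job), len(ref))
    return sum(a == b for a, b in zip(job, ref)) / m

random.seed(0)
ref = [random.randrange(16) for _ in range(100)]
jobs = [[random.randrange(16) for _ in range(100)] for _ in range(1000)]

# Each comparison is independent, so the scan can be distributed over
# workers (processes for CPU-bound kernels) without any coordination.
with ThreadPoolExecutor(max_workers=4) as pool:
    sims = list(pool.map(lambda j: similarity(j, ref), jobs))
```

With one similarity value per job, ranking the most similar jobs is then a simple sort.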
\begin{figure}
\centering
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{progress_4296426-out-boxplot}
\caption{Job-S} \label{fig:perf-job-S}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{progress_5024292-out-boxplot}
\caption{Job-M} \label{fig:perf-job-M}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{progress_7488914-out-boxplot}
\caption{Job-L} \label{fig:perf-job-L}
\end{subfigure}
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{runtime-cummulative}
\caption{Cumulative} \label{fig:runtime-cummulative}
\end{subfigure}
\caption{Runtime overview for all algorithms and jobs}
\label{fig:performance}
\end{figure}
@ -232,14 +241,16 @@ The different algorithms lead to different curves for our reference jobs, e.g.,
% This indicates that the algorithms
The support team in a data center may have time to investigate the most similar jobs.
Time for the analysis is typically bounded; for instance, the team may analyze the 100 most similar ranked jobs (the Top\,100).
In \Cref{fig:hist}, the histograms with the actual number of jobs for a given similarity are shown.
As we focus on a feasible number of jobs, the diagram should be read from right (100\% similarity) to left; for a bin, we show at most 100 jobs (the total number is still given).
It turns out that both BIN algorithms produce nearly identical histograms, so we omit one of them.
In the figures, we can see again a different behavior of the algorithms depending on the reference job.
Especially for Job-S, we can see clusters with jobs of higher similarity (e.g., for hex\_lev at SIM=75\%), while for Job-M, the growth in the relevant section is more steady.
For Job-L, we barely find similar jobs, except when using the hex\_phases algorithm.
Practically, the support team would start with Rank\,1 (the most similar job, presumably the reference job itself) and walk down until the jobs look different, or until a cluster is analyzed.
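This walk-down over the ranked list can be sketched as follows (an illustrative thresholding heuristic with hypothetical data, not the paper's method):

```python
# Ranked (job_id, similarity) list, most similar first; Rank 1 is
# presumably the reference job itself. Hypothetical values.
ranked = [("ref", 1.0), ("a", 0.97), ("b", 0.96), ("c", 0.71), ("d", 0.55)]

def walk_down(ranked, drop=0.2):
    """Collect jobs until similarity falls sharply (a cluster boundary)."""
    selected = [ranked[0]]
    for prev, cur in zip(ranked, ranked[1:]):
        if prev[1] - cur[1] > drop:
            break  # jobs start to "look different" here
        selected.append(cur)
    return selected

cluster = walk_down(ranked)
```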
\begin{figure}
\begin{subfigure}{0.8\textwidth}
@ -292,13 +303,13 @@ For Job-L, we find barely similar jobs, except when using the HEX\_phases algori
\label{fig:hist}
\end{figure}
\subsection{Quantitative Analysis of Selected Jobs}
\subsubsection{Inclusivity and Specificity}
For Job-S, user count and group count are identical, i.e., each user is likely from a distinct group and the number of unique groups equals the number of unique users; for Job-L, user and group counts differ slightly, and for Job-M a bit more -- with up to about 2x more users than groups.
To understand how the Top\,100 are distributed across users, the data is grouped by userid and counted.
\Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs.
For Job-S, we can see that about 70-80\% of jobs stem from one user; for the hex\_lev and hex\_native algorithms, the other jobs stem from a second user, while bin includes jobs from additional users (5 in total).
For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user; here, hex\_lev and hex\_native include more users (30 and 33, respectively) than the other three algorithms.
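The grouping described above can be sketched as follows (hypothetical user IDs and counts; the actual analysis is done in R):

```python
from collections import Counter

# Hypothetical Top-100 result for one algorithm: the userid of each ranked job.
top100_userids = ["u1"] * 78 + ["u2"] * 12 + ["u3"] * 6 + ["u4"] * 4

# Group by userid and count; the stacked plot puts the user with the most
# jobs at the bottom, so sort descending by count.
jobs_per_user = Counter(top100_userids).most_common()
```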
@ -388,15 +399,16 @@ One consideration is to identify jobs that meet a rank threshold for all differe
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/intersection-heatmap}
\caption{Job-M} \label{fig:heatmap-job-M} %,trim={2.5cm 0 0 0},clip
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/intersection-heatmap}
\caption{Job-L} \label{fig:heatmap-job-L}
\end{subfigure}
\centering
\caption{Intersection of the 100 top ranked jobs for different algorithms}
\label{fig:heatmap-job}
\end{figure}

@ -29,3 +29,10 @@ for I in job_similarities_*.csv ; do
fi
mv *.png *.pdf $OUT
done
# analyze performance data
for I in datasets/progress_*.csv ; do
OUT=fig/$(basename ${I%%.csv}-out)
./scripts/plot-performance.R $I $OUT
done

@ -0,0 +1,21 @@
#!/usr/bin/env Rscript
library(ggplot2)
library(dplyr)
require(scales)
# Plot the performance numbers of the clustering
data = read.csv("datasets/clustering_progress.csv")
e = data %>% filter(min_sim %in% c(0.1, 0.5, 0.99))
e$percent = paste("SIM =", as.factor(round(e$min_sim*100,0)), " %")
# Development when adding more jobs
ggplot(e, aes(x=jobs_done, y=elapsed, color=alg_name)) + geom_point() + facet_grid(percent ~ .) + ylab("Cumulative runtime in s") + xlab("Jobs processed") + scale_y_log10() + theme(legend.position = "bottom")
ggsave("fig/runtime-cummulative.png", width=6, height=4.5)
# Bar chart for the maximum
e = data %>% filter(jobs_done >= (jobs_total - 9998))
e$percent = as.factor(round(e$min_sim*100,0))
ggplot(e, aes(y=elapsed, x=percent, fill=alg_name)) + geom_bar(stat="identity") + facet_grid(. ~ alg_name, switch = 'y') + scale_y_log10() + theme(legend.position = "none") + ylab("Runtime in s") + xlab("Minimum similarity in %") + geom_text(aes(label = round(elapsed,0), angle = 90, y=0*(elapsed)+20))
ggsave("fig/runtime-overview.png", width=7, height=2)

@ -3,19 +3,20 @@ library(ggplot2)
library(dplyr)
require(scales)
args = commandArgs(trailingOnly = TRUE)
file = "datasets/progress_4296426.csv" # for manual execution
file = args[1]
prefix = args[2]
# Plot the performance numbers of the analysis
data = read.csv(file)
e = data %>% filter(jobs_done >= (jobs_total - 9998))
e$time_per_100k = e$elapsed / (e$jobs_done / 100000)
ggplot(e, aes(alg_name, time_per_100k, fill=alg_name)) + geom_boxplot() + theme(legend.position=c(0.2, 0.7)) + xlab("Algorithm") + ylab("Runtime in s per 100k jobs") + stat_summary(aes(label=round(..y..,0)), position = position_nudge(x = 0, y = 250), fun=mean, geom="text", size=4)
ggsave(paste(prefix, "-boxplot.png", sep=""), width=5, height=4)
# Development when adding more jobs
ggplot(data, aes(x=jobs_done, y=elapsed, color=alg_name)) + geom_point() + ylab("Cumulative runtime in s") + xlab("Jobs processed") + theme(legend.position = "bottom") #+ scale_x_log10() + scale_y_log10()
ggsave(paste(prefix, "-cummulative.png", sep=""), width=6, height=4.5)

@ -113,8 +113,8 @@ for (l1 in levels(data$alg_name)){
print(res.intersect)
# Plot heatmap about intersection
ggplot(tbl.intersect, aes(first, second, fill=intersect)) + geom_tile() + geom_text(aes(label = round(intersect, 1))) + scale_fill_gradientn(colours = rev(plotcolors)) + xlab("") + ylab("") + theme(legend.position = "bottom", legend.title = element_blank())
ggsave("intersection-heatmap.png", width=4.5, height=4.5)
# Collect the metadata of all jobs in a new table
res.jobs = tibble()