Added figures.
parent 153c6a440e
commit cc929d7db1
(Six figure images added as binary files: 72, 150, 76, 189, 76, and 127 KiB.)
@@ -198,26 +198,35 @@ For example, we can see in \Cref{fig:job-S}, that several metrics increase in Se
\jk{Describe System at DKRZ from old paper}

The runtime for computing the similarity of the relevant IO jobs (580,000 and 440,000 jobs for the BIN and HEX algorithms, respectively) is shown in \Cref{fig:performance}.
\jk{TO FIX: This is for the clustering algorithm, not for computing SIM, which is what we do here.}

To measure the performance of computing the similarity to the reference jobs, the algorithms are executed 10 times on a compute node at DKRZ.
A boxplot of the runtimes is shown in \Cref{fig:performance}.
The runtime is normalized to 100k jobs, i.e., bin\_all takes about 41\,s to process 100k of the 500k jobs that this algorithm processes in total.
Generally, the BIN algorithms are the fastest, while the HEX algorithms often take 4-5x as long.
Hex\_phases is slow for Job-S and Job-M but fast for Job-L; the reason is that only one phase is extracted for Job-L.
The Levenshtein-based algorithms take longer for longer jobs, roughly proportional to the job length, as they apply a sliding window.
Note that the current algorithms are sequential and executed on just one core.
For computing the similarity to one (or a small set of) reference jobs, they could easily be parallelized.
We believe this would then allow a near-online analysis of a job.
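To illustrate this, the per-job similarity computation could be distributed over the available cores; the following is only a sketch (not part of the current implementation), where \texttt{similarity()} stands for any of the presented algorithms and \texttt{jobs} for the list of job codings:
\begin{verbatim}
library(parallel)
# Sketch: score every job against one reference job using all cores.
# similarity() is a placeholder for one of the BIN/HEX/Levenshtein algorithms.
scores <- mclapply(jobs, function(j) similarity(j, ref),
                   mc.cores = detectCores())
\end{verbatim}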

\begin{figure}
\centering
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{progress_4296426-out-boxplot}
\caption{Job-S} \label{fig:perf-job-S}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{progress_5024292-out-boxplot}
\caption{Job-M} \label{fig:perf-job-M}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{progress_7488914-out-boxplot}
\caption{Job-L} \label{fig:perf-job-L}
\end{subfigure}

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{runtime-cummulative}
\caption{Cumulative} \label{fig:runtime-cummulative}
\end{subfigure}

\caption{Runtime overview for all algorithms and jobs}
\label{fig:performance}
\end{figure}

@@ -232,14 +241,16 @@ The different algorithms lead to different curves for our reference jobs, e.g.,
% This indicates that the algorithms
The support team in a data center may have time to investigate the most similar jobs.
The time for the analysis is typically limited; for instance, the team may analyze the 100 top-ranked jobs (the Top\,100).
In \Cref{fig:hist}, the histograms with the actual number of jobs for a given similarity are shown.
As we focus on a feasible number of jobs, the diagrams should be read from right (100\% similarity) to left; for each bin, we show at most 100 jobs (the total number is still given).
It turns out that both BIN algorithms produce nearly identical histograms, so we omit one of them.
In the figures, we can again see a different behavior of the algorithms depending on the reference job.
Especially for Job-S, we can see clusters of jobs with higher similarity (e.g., for hex\_lev at SIM=75\%), while for Job-M, the growth in the relevant section is steadier.
For Job-L, we find barely any similar jobs, except when using the hex\_phases algorithm.

Practically, the support team would start with Rank\,1 (the most similar job, presumably the reference job itself) and walk down the ranking until the jobs look different, or until a cluster has been analyzed.

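Selecting the Top\,100 for one algorithm is a plain ranking step; the following is only a sketch in R, assuming a data frame \texttt{sim} with the columns \texttt{jobid} and \texttt{similarity} for one algorithm:
\begin{verbatim}
library(dplyr)
# Sketch: rank all jobs by similarity and keep the 100 top-ranked ones.
top100 <- sim %>%
  arrange(desc(similarity)) %>%
  slice_head(n = 100)
\end{verbatim}
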
\begin{figure}
\begin{subfigure}{0.8\textwidth}

@@ -292,13 +303,13 @@ For Job-L, we find barely similar jobs, except when using the HEX\_phases algori

\label{fig:hist}
\end{figure}

\subsection{Quantitative Analysis of Selected Jobs}
\subsubsection{Inclusivity and Specificity}

The user count and group count are the same, meaning that a user is likely always from the same group and the number of unique groups is identical to the number of unique users; for Job-L, user and group counts differ a bit, for Job-M a bit more.
There are up to about 2x more users than groups.

To understand how the Top\,100 jobs are distributed across users, the data is grouped by userid and counted.
\Cref{fig:userids} shows the stacked user information, where the lowest stack segment is the user with the most jobs and the topmost user in the stack has the fewest jobs.
For Job-S, we can see that about 70-80\% of the jobs stem from one user; for the hex\_lev and hex\_native algorithms, the remaining jobs stem from a second user, while bin includes jobs from additional users (5 in total).
For Job-M, jobs from more users are included (13); about 25\% of the jobs stem from the same user. Here, hex\_lev and hex\_native include more users (30 and 33, respectively) than the other three algorithms.

@@ -388,15 +399,16 @@ One consideration is to identify jobs that meet a rank threshold for all differe
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/intersection-heatmap}
\caption{Job-M} \label{fig:heatmap-job-M} %,trim={2.5cm 0 0 0},clip
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/intersection-heatmap}
\caption{Job-L} \label{fig:heatmap-job-L}
\end{subfigure}

\centering
\caption{Intersection of the 100 top-ranked jobs for the different algorithms}
\label{fig:heatmap-job}
\end{figure}


@@ -29,3 +29,10 @@ for I in job_similarities_*.csv ; do
fi
mv *.png *.pdf $OUT
done

# analyze performance data
for I in datasets/progress_*.csv ; do
OUT=fig/$(basename ${I%%.csv}-out)
./scripts/plot-performance.R $I $OUT
done


@@ -0,0 +1,21 @@
#!/usr/bin/env Rscript
library(ggplot2)
library(dplyr)
require(scales)

# Plot the performance numbers of the clustering.
# Expects datasets/clustering_progress.csv with at least the columns:
# alg_name, min_sim, jobs_done, jobs_total, elapsed.
data = read.csv("datasets/clustering_progress.csv")

e = data %>% filter(min_sim %in% c(0.1, 0.5, 0.99))
e$percent = paste("SIM =", as.factor(round(e$min_sim*100,0)), " %")

# Development of the cumulative runtime when adding more jobs
ggplot(e, aes(x=jobs_done, y=elapsed, color=alg_name)) + geom_point() + facet_grid(percent ~ .) + ylab("Cumulative runtime in s") + xlab("Jobs processed") + scale_y_log10() + theme(legend.position = "bottom")
ggsave("fig/runtime-cummulative.png", width=6, height=4.5)

# Bar chart for the maximum (measurements with jobs_done within ~10k of jobs_total)
e = data %>% filter(jobs_done >= (jobs_total - 9998))
e$percent = as.factor(round(e$min_sim*100,0))
ggplot(e, aes(y=elapsed, x=percent, fill=alg_name)) + geom_bar(stat="identity") + facet_grid(. ~ alg_name, switch = 'y') + scale_y_log10() + theme(legend.position = "none") + ylab("Runtime in s") + xlab("Minimum similarity in %") + geom_text(aes(label = round(elapsed,0), angle = 90, y=0*(elapsed)+20))
ggsave("fig/runtime-overview.png", width=7, height=2)


@@ -3,19 +3,20 @@ library(ggplot2)
library(dplyr)
require(scales)

args = commandArgs(trailingOnly = TRUE)
file = "datasets/progress_4296426.csv" # for manual execution
file = args[1]
prefix = args[2]

# Plot the performance numbers of the analysis
data = read.csv(file)

# Keep only the measurements taken near completion (jobs_done within ~10k of
# jobs_total) and normalize the cumulative runtime to 100k processed jobs.
e = data %>% filter(jobs_done >= (jobs_total - 9998))
e$time_per_100k = e$elapsed / (e$jobs_done / 100000)
ggplot(e, aes(alg_name, time_per_100k, fill=alg_name)) + geom_boxplot() + theme(legend.position=c(0.2, 0.7)) + xlab("Algorithm") + ylab("Runtime in s per 100k jobs") + stat_summary(aes(label=round(..y..,0)), position = position_nudge(x = 0, y = 250), fun=mean, geom="text", size=4)
ggsave(paste(prefix, "-boxplot.png", sep=""), width=5, height=4)

# Development of the cumulative runtime when adding more jobs
ggplot(data, aes(x=jobs_done, y=elapsed, color=alg_name)) + geom_point() + ylab("Cumulative runtime in s") + xlab("Jobs processed") + theme(legend.position = "bottom") #+ scale_x_log10() + scale_y_log10()
ggsave(paste(prefix, "-cummulative.png", sep=""), width=6, height=4.5)

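# Example invocation (this script is presumably scripts/plot-performance.R,
# matching the shell loop shown above):
#   ./scripts/plot-performance.R datasets/progress_4296426.csv fig/progress_4296426-out
# which writes fig/progress_4296426-out-boxplot.png and
# fig/progress_4296426-out-cummulative.png, i.e., the boxplots included in the
# performance figure of the paper.
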

@@ -113,8 +113,8 @@ for (l1 in levels(data$alg_name)){
print(res.intersect)

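# Note (illustrative, not part of this script): tbl.intersect used below holds,
# for every pair of algorithms, the number of jobs shared by their Top 100 lists;
# it is presumably filled in the loop over alg_name above. Conceptually:
#   pairs = expand.grid(first = names(top100), second = names(top100))
#   pairs$intersect = mapply(function(a, b) length(intersect(top100[[a]], top100[[b]])),
#                            pairs$first, pairs$second)
# where top100 would be a named list of the Top 100 job IDs per algorithm.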
# Plot heatmap about intersection
ggplot(tbl.intersect, aes(first, second, fill=intersect)) + geom_tile() + geom_text(aes(label = round(intersect, 1))) + scale_fill_gradientn(colours = rev(plotcolors)) + xlab("") + ylab("") + theme(legend.position = "bottom", legend.title = element_blank())
ggsave("intersection-heatmap.png", width=4.5, height=4.5)

# Collect the metadata of all jobs in a new table
res.jobs = tibble()