master
Julian M. Kunkel 2020-10-04 15:11:46 +01:00
parent e6e45b6a75
commit d2d5970a4c
3 changed files with 40 additions and 41 deletions

View File

@ -85,20 +85,27 @@ DKRZ --
\begin{abstract}
Supercomputers execute thousands of jobs every day.
Support staff at a data center have two goals.
Firstly, they provide a service to users, enabling them to execute their applications.
Secondly, they aim to improve the efficiency of the workflows in order to allow the data center to serve more workloads.
In order to optimize an application, its behavior and resource utilization must be monitored and then assessed.
Only rarely will users liaise with staff and explicitly request a performance analysis and optimization.
Therefore, the data center must deploy monitoring systems and staff must pro-actively identify candidates for optimization.
While it is easy to utilize statistics to rank applications based on their utilization of compute, storage, and network, it is tricky to find patterns in hundreds of thousands of jobs, i.e., to determine whether there is a class of jobs that is not performing well.
When support staff investigate a single job, the question might be: are there other jobs like this one?
In this paper, a methodology and algorithms to identify similar jobs based on their temporal IO behavior are described.
A study is conducted that investigates jobs similar to three reference jobs, bearing in mind the question of whether this approach is effective at finding similar jobs.
The data stems from DKRZ's supercomputer Mistral and includes more than 500,000 jobs that were executed over several months.
%Problem with definition of similarity.
Our analysis shows that this strategy is effective in identifying similar jobs and reveals some interesting patterns in the data.
\end{abstract}
\section{Introduction}
%This paper is structured as follows.
@ -359,19 +366,19 @@ Practically, the support team would start with Rank\,1 (most similar job, presum
\begin{subfigure}{0.75\textwidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_4296426-out/hist-sim}
\caption{Job-S} \label{fig:hist-job-S}
\end{subfigure}
\begin{subfigure}{0.75\textwidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_5024292-out/hist-sim}
\caption{Job-M} \label{fig:hist-job-M}
\end{subfigure}
\begin{subfigure}{0.75\textwidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_7488914-out/hist-sim}
\caption{Job-L} \label{fig:hist-job-L}
\end{subfigure}
\centering
@ -484,7 +491,7 @@ For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:h
\subsubsection{Algorithmic differences}
To verify that the different algorithms behave differently, the intersection of the Top\,100 lists is computed for all combinations of algorithms and visualized in \Cref{fig:heatmap-job}.
Bin\_all and bin\_aggzeros overlap with at least 99 ranks for all three jobs.
While there is some reordering, both algorithms lead to a comparable set.
All algorithms have significant overlap for Job-S.
For Job-M, however, they lead to a different ranking and Top\,100; in particular, ks determines a different set.
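A minimal sketch of this pairwise comparison, assuming each algorithm merely yields a ranked list of job IDs (the function name and toy data below are hypothetical, not the study's tooling):

from itertools import combinations

def top_n_overlap(rankings, n=100):
    # rankings: algorithm name -> job IDs sorted by similarity (best first)
    top = {alg: set(jobs[:n]) for alg, jobs in rankings.items()}
    return {(a, b): len(top[a] & top[b]) for a, b in combinations(sorted(top), 2)}

# Toy example: overlap counts for made-up job IDs
rankings = {
    "bin_all":      ["j1", "j2", "j3", "j4"],
    "bin_aggzeros": ["j1", "j2", "j4", "j5"],
    "hex_lev":      ["j9", "j1", "j3", "j7"],
}
print(top_n_overlap(rankings, n=4))  # e.g. {('bin_aggzeros', 'bin_all'): 3, ...}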
@ -530,24 +537,30 @@ It is executed for different simulations and variables across timesteps.
The job name of Job-S suggests that it is applied to the control variable.
In the metadata, we found 22,580 jobs with “cmor” in the name, of which 367 jobs mention “control”.
The bin and ks algorithms identify one job whose name doesn't include “cmor”;
all other algorithms identify only “cmor” jobs, and 26--38 of these jobs are applied to “control” (see \Cref{tbl:control-jobs}) -- only the ks algorithm doesn't identify any job with “control”.
A selection of job timelines is given in \Cref{fig:job-S-hex-lev}; all of these jobs operate on control variables.
The single non-cmor job and a high-ranked non-control cmor job are shown in \Cref{fig:job-S-bin-agg}.
While we cannot visually see many differences between these two jobs and the cmor jobs processing the control variables, the algorithms indicate that jobs processing the control variables must be more similar, as they appear much more frequently in the Top\,100 than among all jobs labeled with “cmor”.
For Job-S, we found that all algorithms work well and, therefore, omit further timelines.
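A minimal sketch of this name-based check, assuming the Top\,100 job names per algorithm are available as plain strings (all names below are made up):

def count_keyword(top100_names, keyword):
    # top100_names: algorithm name -> list of job names in its Top-100
    return {alg: sum(keyword in name for name in names)
            for alg, names in top100_names.items()}

# Toy example mirroring the per-algorithm “control” count
top100_names = {
    "bin_all": ["cmor_control_r1", "cmor_hist_r2", "cmor_control_r3"],
    "ks":      ["cmor_hist_r2", "post_t127_r4"],
}
print(count_keyword(top100_names, "control"))  # {'bin_all': 2, 'ks': 0}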
\begin{table}
\centering
\begin{tabular}{r|r|r|r|r|r}
bin\_aggzeros & bin\_all & hex\_lev & hex\_native & hex\_phases & ks\\ \hline
38 & 38 & 33 & 26 & 33 & 0
\end{tabular}
%\begin{tabular}{r|r}
% Algorithm & Jobs \\ \hline
% bin\_aggzeros & 38 \\
% bin\_all & 38 \\
% hex\_lev & 33 \\
% hex\_native & 26 \\
% hex\_phases & 33 \\
% ks & 0
%\end{tabular}
\caption{Job-S: number of jobs with “control” in their name in the Top\,100}
\label{tbl:control-jobs}
\end{table}
@ -752,7 +765,8 @@ The jobs that are similar according to the bin algorithms differ from our expect
\subsection{Job-L}
For the bin algorithms, the inspection of job names (14 unique names) leads to two prominent applications: bash and xmessy with 45 and 48 instances, respectively.
The hex algorithms identify a more diverse set of applications (18 unique names and no xmessy job), and the hex\_phases algorithm has 85 unique names.
The ks algorithm finds 71 jobs ending with t127, which is a typical model configuration.
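A minimal sketch of how such a name summary could be computed, assuming only the list of job names behind one algorithm's Top\,100 is available (the names below are invented):

from collections import Counter

def summarize_names(names, suffix):
    counts = Counter(names)                        # instances per job/application name
    unique = len(counts)                           # number of unique names
    with_suffix = sum(n.endswith(suffix) for n in names)
    return unique, counts.most_common(2), with_suffix

# Toy example; the real names come from the job metadata
names = ["bash", "xmessy", "xmessy", "echam_t127", "post_t127"]
print(summarize_names(names, "t127"))  # (4, [('xmessy', 2), ('bash', 1)], 2)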
\begin{figure}
\begin{subfigure}{0.3\textwidth}
@ -865,5 +879,7 @@ The hex algorithms identify a more diverse set of applications (18 unique names)
One consideration could be to identify jobs that are found by all algorithms, i.e., jobs that meet a certain (rank) threshold for different algorithms.
That would increase the likelihood that these jobs are very similar and are what the user is looking for.
The ks algorithm finds jobs with similar histograms, which is not necessarily what we are looking for.
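A minimal sketch of this combination, again assuming one ranked job list per algorithm; the rank threshold and all IDs are hypothetical:

def consensus_jobs(rankings, threshold=100):
    # Keep only jobs that appear within the first `threshold` ranks of every algorithm
    sets = [set(jobs[:threshold]) for jobs in rankings.values()]
    return set.intersection(*sets) if sets else set()

rankings = {
    "bin_all": ["j1", "j2", "j3"],
    "hex_lev": ["j2", "j1", "j9"],
    "ks":      ["j2", "j7", "j1"],
}
print(consensus_jobs(rankings, threshold=3))  # {'j1', 'j2'}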
%\printbibliography
\end{document}

View File

@ -8,20 +8,6 @@ CLEAN=0 # Set to 0 to make some update
./scripts/plot-job-timelines-ks.py 4296426,5024292,7488914 fig/job,fig/job,fig/job
function prepare(){
    # Decompress the datasets and symlink each CSV into the working directory
    pushd datasets
    ./decompress.sh
    popd
    for I in datasets/*.csv ; do
        if [ ! -e $(basename $I) ]; then
            echo "Creating symlink $(basename $I)"
            ln -s $I
        fi
    done
}
prepare
for I in datasets/job_similarities_*.csv ; do
    rm *.png *.pdf

View File

@ -124,9 +124,6 @@ tbl.intersect$intersect = 0
for (l1 in levels(data$alg_name)){
  for (l2 in levels(data$alg_name)){
    if(l1 == "bin_aggzeros" || l2 == "bin_aggzeros"){
      next;
    }
    # Number of jobs shared between the result sets (Top-100) of the two algorithms
    res = length(intersect(result[,l1], result[,l2]))
    res.intersect[l1,l2] = res
    tbl.intersect[tbl.intersect$first == l1 & tbl.intersect$second == l2, ]$intersect = res
@ -137,7 +134,7 @@ print(res.intersect)
# Plot heatmap about intersection
ggplot(tbl.intersect, aes(first, second, fill=intersect)) + geom_tile() + geom_text(aes(label = round(intersect, 1))) + scale_fill_gradientn(colours = rev(plotcolors)) + xlab("") + ylab("") + theme(legend.position = "bottom", legend.title = element_blank())
ggsave("intersection-heatmap.png", width=4.5, height=4.5)
ggsave("intersection-heatmap.png", width=5, height=5)
# Collect the metadata of all jobs in a new table
res.jobs = tibble()