Julian M. Kunkel 2020-10-01 17:10:27 +01:00
parent ee1ab64914
commit e6e45b6a75
18 changed files with 70 additions and 8745 deletions

File diff suppressed because it is too large (4 files)

View File

@@ -191,10 +191,16 @@ This coding is also used for the HEX class of algorithms (BIN algorithms merge a
 The figures show the values of active metrics ($\neq 0$) only; if few are active then they are shown in one timeline, otherwise they are rendered individually to provide a better overview.
 For example, we can see in \Cref{fig:job-S}, that several metrics increase in Segment\,6.
+In \Cref{fig:refJobsHist}, the histograms of all job metrics are shown.
+A histogram contains the activities of each node and timestep without being averaged across the nodes.
+This data is used to compare jobs using the Kolmogorov-Smirnov test.
+The metrics of Job-L are not shown as they have only a handful of instances where the value is not 0, except for write\_bytes: the first process is writing out at a low rate.
+Interestingly, the aggregated pattern of Job-L in \Cref{fig:job-L} sums up to some activity in the first segment for three other metrics.
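The comparison described above can be pictured in a few lines. The following is a hedged, pure-Python sketch (not the paper's implementation) of turning the two-sample Kolmogorov-Smirnov statistic over two jobs' activity samples into a similarity score; the sample data is made up for illustration:

```python
# Sketch: Kolmogorov-Smirnov-based similarity between two jobs' activity
# samples (per node and timestep, not averaged). Hypothetical data.
def ks_statistic(a, b):
    """Maximum distance between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    values = sorted(set(a) | set(b))
    d = 0.0
    for v in values:
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def ks_similarity(a, b):
    # Map the KS distance (0 = identical, 1 = disjoint) to a similarity.
    return 1.0 - ks_statistic(a, b)

print(ks_similarity([0, 1, 2, 3], [0, 1, 2, 3]))  # → 1.0 (identical samples)
print(ks_similarity([0, 0, 0], [5, 5, 5]))        # → 0.0 (disjoint samples)
```

Because the statistic only compares distributions, this measure is insensitive to the temporal order of segments, unlike the timeline-based algorithms.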
 \begin{figure}
 \begin{subfigure}{0.8\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job-timeseries4296426}
+\includegraphics[width=\textwidth]{job-ks-0timeseries4296426}
 \caption{Job-S (runtime=15,551\,s, segments=25)} \label{fig:job-S}
 \end{subfigure}
 \centering
@@ -217,7 +223,7 @@ For example, we can see in \Cref{fig:job-S}, that several metrics increase in Se
 \begin{subfigure}{0.8\textwidth}
 \centering
-\includegraphics[width=\textwidth]{job-timeseries7488914-30}
+\includegraphics[width=\textwidth]{job-ks-2timeseries7488914-30}
 \caption{Job-L (first 30 segments of 400; remaining segments are similar)}
 \label{fig:job-L}
 \end{subfigure}
@@ -226,6 +232,40 @@ For example, we can see in \Cref{fig:job-S}, that several metrics increase in Se
 \end{figure}
+\begin{figure}
+\begin{subfigure}{0.8\textwidth}
+\centering
+\includegraphics[width=\textwidth]{job-ks-0hist4296426}
+\caption{Job-S} \label{fig:job-S-hist}
+\end{subfigure}
+\centering
+\begin{subfigure}{0.8\textwidth}
+\centering
+\includegraphics[width=\textwidth]{job-ks-1hist5024292}
+\caption{Job-M} \label{fig:job-M-hist}
+\end{subfigure}
+\centering
+\caption{Reference jobs: histograms of I/O activities}
+\label{fig:refJobsHist}
+\end{figure}
+%\begin{figure}\ContinuedFloat
+%\begin{subfigure}{0.8\textwidth}
+%\centering
+%\includegraphics[width=\textwidth]{job-ks-2hist7488914}
+%\caption{Job-L}
+%\label{fig:job-L}
+%\end{subfigure}
+%\centering
+%\caption{Reference jobs: histograms of I/O activities}
+%\end{figure}
 \subsection{Performance}
@@ -241,6 +281,8 @@ Note that the current algorithms are sequential and executed on just one core.
 For computing the similarity to one (or a small set of reference jobs), they could easily be parallelized.
 We believe this will then allow a near-online analysis of a job.
+\jk{To update the figure to use KS and (maybe to aggregate job profiles)? Problem: the old files are gone}
 \begin{figure}
 \centering
 \begin{subfigure}{0.31\textwidth}
@@ -280,8 +322,8 @@ As we focus on a feasible number of jobs, the diagram should be read from right
 It turns out that both BIN algorithms produce nearly identical histograms and we omit one of them.
 In the figures, we can see again a different behavior of the algorithms depending on the reference job.
 Especially for Job-S, we can see clusters with jobs of higher similarity (e.g., at hex\_lev at SIM=75\%) while for Job-M, the growth in the relevant section is more steady.
-For Job-L, we find barely similar jobs, except when using the HEX\_phases algorithm.
-This algorithm finds 393 jobs that have a similarity of 100\%, thus they are indistinguishable to the algorithm.
+For Job-L, we find barely any similar jobs, except when using the HEX\_phases and ks algorithms.
+HEX\_phases finds 393 jobs with a similarity of 100\%, i.e., they are indistinguishable to it, while ks identifies 6,880 jobs with a similarity of at least 97.5\%.
 Practically, the support team would start with Rank\,1 (most similar job, presumably the reference job itself) and walk down until the jobs look different, or until a cluster is analyzed.
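The walk-down over ranks can be pictured as a plain sort over similarity scores followed by a Top-100 cut; the job IDs and scores below are made up for illustration:

```python
# Hypothetical similarity scores of candidate jobs to one reference job.
sims = {"job%03d" % i: round(1.0 - i * 0.007, 3) for i in range(150)}

# Rank by similarity (descending) and keep the Top-100 for inspection.
ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
top100 = ranked[:100]

print(top100[0])  # → ('job000', 1.0): Rank 1, presumably the reference job itself
# Size of the SIM >= 75% cluster the support team would inspect first.
print(sum(1 for _, s in top100 if s >= 0.75))  # → 36
```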
@@ -356,7 +398,7 @@ To confirm hypotheses presented, we analyzed the job metadata comparing job name
 To understand how the Top\,100 are distributed across users, the data is grouped by userid and counted.
 \Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the top most user in the stack has the smallest number of jobs.
 For Job-S, we can see that about 70-80\% of jobs stem from one user, for the hex\_lev and hex\_native algorithms, the other jobs stem from a second user while bin includes jobs from additional users (5 in total).
-For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user, here, hex\_lev and hex\_native is including more users (30 and 33, respectively) than the other three algorithms.
+For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user; here, hex\_lev, hex\_native, and ks include more users (29, 33, and 37, respectively) than the other three algorithms.
 For Job-L, the two hex algorithms include with (12 and 13) a bit more diverse user community than the bin algorithms (9) but hex\_phases covers 35 users.
 We didn't include the group analysis in the figure as user count and group id is proportional, at most the number of users is 2x the number of groups.
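The group-by-userid-and-count step is straightforward; here is a small sketch with made-up userids whose concentration mirrors the roughly 70-80\% single-user share observed for Job-S:

```python
from collections import Counter

# Hypothetical Top-100 result: one userid per ranked job.
top100_userids = ["u1"] * 72 + ["u2"] * 20 + ["u3"] * 5 + ["u4"] * 2 + ["u5"] * 1

# Group by userid and count, largest stack first (the bottom of the stacked bar).
counts = Counter(top100_userids).most_common()
print(counts[0])    # → ('u1', 72): the dominant user forms the lowest stack
print(len(counts))  # → 5 distinct users contribute to this Top-100
```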
@@ -373,7 +415,7 @@ The boxplots have different shapes which is an indication, that the different al
 \paragraph{Runtime distribution.}
 The job runtime of the Top\,100 jobs is shown using boxplots in \Cref{fig:runtime-job}.
 While all algorithms can compute the similarity between jobs of different length, the bin algorithms and hex\_native penalize jobs of different length preferring jobs of very similar length.
-For Job-M and Job-L, hex\_phases is able to identify much shorter or longer jobs.
+For Job-M and Job-L, hex\_phases and ks are able to identify much shorter or longer jobs.
 For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:hist-job-L}, 393 jobs have a similarity of 100\%) which is the reason why the job runtime isn't shown in the figure itself.
 \begin{figure}
@@ -442,14 +484,12 @@ For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:h
 \subsubsection{Algorithmic differences}
 To verify that the different algorithms behave differently, the intersection for the Top\,100 is computed for all combination of algorithms and visualized in \Cref{fig:heatmap-job}.
-As expected we can observe that bin\_all and bin\_aggzeros is very similar for all three jobs.
+Bin\_all and bin\_aggzeros overlap with at least 99 ranks for all three jobs (we therefore exclude bin\_aggzeros from the figure).
-While there is some reordering, both algorithms lead to a comparable order.
+While there is some reordering, both algorithms lead to a comparable set.
-The hex\_lev and hex\_native algorithms are also exhibiting some overlap particularly for Job-S and Job-L.
+All algorithms have significant overlap for Job-S.
-For Job\-M, however, they lead to a different ranking and Top\,100.
+For Job-M, however, they lead to a different ranking and Top\,100; in particular, ks determines a different set.
-Generally, hex\_lev and Hex\_native are generating more similar results than other algorithms.
 From this analysis, we conclude that one representative from binary quantization is sufficient as it generates very similar results while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be analyzed individually.
+\eb{Is this a general statement: ``one representative from binary quantization is sufficient``? If so, it is very vague. It could be coincidence.}
+\jk{Rewrote this a bit. Certainly yes. They are simply very similar.}
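The pairwise Top-100 intersection behind the heatmap can be sketched as follows; the algorithm names come from the paper, while the ranked job IDs are hypothetical:

```python
# Sketch: pairwise overlap of the Top-100 job IDs selected by each algorithm,
# the quantity visualized in the intersection heatmap. Job IDs are made up.
def top100_overlap(ranked):
    algs = list(ranked)
    return {(a, b): len(set(ranked[a][:100]) & set(ranked[b][:100]))
            for a in algs for b in algs}

ranked = {
    "bin_all":      list(range(0, 100)),
    "bin_aggzeros": list(range(1, 101)),   # nearly the same set
    "hex_lev":      list(range(50, 150)),  # partial overlap
}
ov = top100_overlap(ranked)
print(ov[("bin_all", "bin_aggzeros")])  # → 99 shared ranks
print(ov[("bin_all", "hex_lev")])       # → 50 shared ranks
```

An overlap of 99 or 100 for a pair, as observed for the two bin algorithms, means the pair selects (nearly) the same Top-100 set, which justifies dropping one of them from the figure.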
 \begin{figure}

View File

@@ -6,6 +6,8 @@ echo "This script performs the complete analysis steps"
 CLEAN=0 # Set to 0 to make some update
+./scripts/plot-job-timelines-ks.py 4296426,5024292,7488914 fig/job,fig/job,fig/job
 function prepare(){
 pushd datasets
 ./decompress.sh
@@ -21,12 +23,13 @@ function prepare(){
 prepare
-for I in job_similarities_*.csv ; do
+for I in datasets/job_similarities_*.csv ; do
 rm *.png *.pdf
 echo "processing $I"
 set -x
 ./scripts/plot.R $I > description.txt 2>&1
 set +x
+I=${I##datasets/}
 OUT=${I%%.csv}-out
 mkdir $OUT
 if [[ $CLEAN != "0" ]] ; then

View File

@@ -12,7 +12,7 @@ import matplotlib.cm as cm
 jobs = sys.argv[1].split(",")
 prefix = sys.argv[2].split(",")
-fileformat = ".png"
+fileformat = ".pdf"
 print("Plotting the job: " + str(sys.argv[1]))
 print("Plotting with prefix: " + str(sys.argv[2]))

View File

@@ -10,7 +10,7 @@ import matplotlib.cm as cm
 jobs = sys.argv[1].split(",")
 prefix = sys.argv[2].split(",")
-fileformat = ".png"
+fileformat = ".pdf"
 print("Plotting the job: " + str(sys.argv[1]))
 print("Plotting with prefix: " + str(sys.argv[2]))

View File

@@ -2,7 +2,7 @@
 # Parse job from command line
 args = commandArgs(trailingOnly = TRUE)
-file = args[1]
+filename = args[1]
 library(ggplot2)
 library(dplyr)
@@ -16,13 +16,14 @@ plotjobs = TRUE
 # Color scheme
 plotcolors <- c("#CC0000", "#FFA500", "#FFFF00", "#008000", "#9999ff", "#000099")
-if (! exists("file")){
-file = "job_similarities_5024292.csv" # for manual execution
+if (! exists("filename")){
+filename = "./datasets/job_similarities_5024292.csv"
+filename = "./datasets/job_similarities_7488914.csv" # for manual execution
 }
-print(file)
-jobID = str_extract(file, regex("[0-9]+"))
-data = read.csv(file)
+print(filename)
+jobID = str_extract(filename, regex("[0-9]+"))
+data = read.csv(filename)
 # Columns are: jobid alg_id alg_name similarity
 #data$alg_id = as.factor(data$alg_id) # EB: wrong column?
@@ -42,6 +43,7 @@ ggsave("hist-sim.png", width=6, height=5)
 #ggplot(data, aes(similarity, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("SIM") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4)) + scale_color_brewer(palette = "Set2") + xlim(0.5, 1.0)
 #ggsave("ecdf-0.5.png", width=8, height=3)
+print("Similarity > 0.5")
 e = data %>% filter(similarity >= 0.5)
 print(summary(e))
@@ -122,6 +124,9 @@ tbl.intersect$intersect = 0
 for (l1 in levels(data$alg_name)){
 for (l2 in levels(data$alg_name)){
+if(l1 == "bin_aggzeros" || l2 == "bin_aggzeros"){
+next;
+}
 res = length(intersect(result[,l1], result[,l2]))
 res.intersect[l1,l2] = res
 tbl.intersect[tbl.intersect$first == l1 & tbl.intersect$second == l2, ]$intersect = res