Merge branch 'master' of http://git.hps.vi4io.org/eugen.betke/mistral-io-datasets into master
commit 74275fcfa6
paper/main.tex (237 lines changed)
@@ -84,28 +84,54 @@ DKRZ --
\maketitle

\begin{abstract}
One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency.
Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs.
While it is easy to utilize statistics to rank jobs based on the utilization of compute, storage, and network, it is tricky to find patterns in hundreds of thousands of jobs, i.e., to determine whether there is a class of jobs that isn't performing well.
When support staff investigate a single job, it is relevant to identify related jobs in order to understand the usage of the exhibited behavior better and assess the optimization potential.

Support staff.
Problem: a particular job is found that isn't performing well.
Now how can we find similar jobs?
In this paper, a methodology to rank the similarity of all jobs to a reference job based on their temporal IO behavior is described.
Practically, we apply several previously developed time series based algorithms and also utilize the Kolmogorov-Smirnov test to compare the distributions of the statistics.
A study is conducted to explore the effectiveness of the approach, which starts from three reference jobs and investigates related jobs.
The data stems from DKRZ's supercomputer Mistral and includes more than 500,000 jobs that have been executed during several months of operation.
\jk{How long was that?}

Problem with the definition of similarity.

In this paper, a methodology and algorithms to identify similar jobs based on profiles and time series are illustrated.
Similar to a study.

Research question: is this effective to find similar jobs?

The contribution of this paper...
%Problem with definition of similarity.
Our analysis shows that the strategy and algorithms are effective in identifying similar jobs and revealed some interesting patterns in the data.
\end{abstract}

\section{Introduction}

%This paper is structured as follows.
%We start with the related work in \Cref{sec:relwork}.
Supercomputers execute thousands of jobs every day.
Support staff at a data center have two goals.
Firstly, they provide a service to users to enable the convenient execution of their applications.
Secondly, they aim to improve the efficiency of all workflows -- represented as batch jobs -- in order to allow the data center to serve more workloads.

In order to optimize a single job, its behavior and resource utilization must be monitored and then assessed.
Only rarely will users liaise with staff and explicitly request a performance analysis and optimization.
Therefore, data centers deploy monitoring systems and staff must pro-actively identify candidates for optimization.
Monitoring tools such as \cite{Grafana} and \cite{XDMod} provide various statistics and time series data for the job execution.

The support staff should focus on workloads for which optimization is beneficial; for instance, analyzing a job that is executed only once on a moderate number of nodes costs human resources without a good return on investment.
By ranking jobs based on the statistics, it isn't difficult to find a job that exhibits extensive usage of compute, network, and IO resources.
However, would it be beneficial to investigate this workload in detail and potentially optimize it?
A pattern that can be observed in many jobs bears potential, as the blueprint for optimizing one job may be applied to other jobs as well.
This is particularly true when running one application with similar inputs, but different applications may also lead to similar behavior.
Therefore, it is useful for support staff investigating a resource-hungry job to identify similar jobs that are executed on the supercomputer.

In our previous paper \cite{XXX}, we developed several distance metrics and algorithms for the clustering of jobs based on the time series of their IO behavior.
The distance metrics can be applied to jobs with different runtimes and numbers of utilized nodes, but they differ in the way they define similarity.
We showed that the metrics can be used to cluster jobs; however, it remained unclear if the method can be used by data center staff to effectively explore jobs similar to a reference job.
In this article, we refine these distance metrics slightly and apply them to rank jobs based on their similarity to a reference job.
Therefore, we perform a study on three reference jobs with different characteristics.
We also utilize the Kolmogorov-Smirnov test to illustrate the benefits and drawbacks of the different methods.

This paper is structured as follows.
We start by introducing related work in \Cref{sec:relwork}.
%Then, in TODO we introduce the DKRZ monitoring systems and explain how I/O metrics are captured by the collectors.
%In \Cref{sec:methodology} we describe the data reduction and the machine learning approaches and do an experiment in \Cref{sec:data,sec:evaluation}.
%Finally, we finalize our paper with a summary in \Cref{sec:summary}.
In \Cref{sec:methodology}, we briefly describe the data reduction and the machine learning approaches.
In \Cref{sec:evaluation}, we perform a study by applying the methodology to three jobs with different behavior, thereby assessing the effectiveness of the approach to identify similar jobs.
Finally, we conclude our paper in \Cref{sec:summary}.

\section{Related Work}
\label{sec:relwork}

@@ -113,33 +139,67 @@ The contribution of this paper...
\section{Methodology}
\label{sec:methodology}

Given: the reference job ID.
From the 4D time series data (number of nodes, file systems, 9 metrics, time), create a feature set.
The purpose of the methodology is to allow users and support staff to explore all executed jobs on a supercomputer in the order of their similarity to the reference job.
Therefore, we first define the job data, then describe the algorithms used to compute the similarity, and finally describe the methodology to investigate jobs.

Adapt the algorithms (see the sketch below):
\begin{itemize}
\item iterate over all jobs
\begin{itemize}
\item compute the distance to the reference job
\end{itemize}
\item sort the jobs based on the distance to the reference job
\item create a cumulative job distribution based on the distance for visualization; allow users to output jobs within a given distance
\end{itemize}
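The ranking step itself is conceptually simple; the following is a minimal sketch in Python (not the actual implementation; it assumes an in-memory mapping of job IDs to their codings and a hypothetical \texttt{distance()} function provided by one of the algorithms):
\begin{verbatim}
# Sketch of the ranking step (assumptions: `jobs` maps job IDs to their coded
# time series and `distance(a, b)` is one of the distance algorithms).
def rank_jobs(reference_id, jobs, distance):
    reference = jobs[reference_id]
    # Compute the distance of every job to the reference job.
    distances = {job_id: distance(reference, coding)
                 for job_id, coding in jobs.items()}
    # Sort ascending: the most similar jobs (smallest distance) come first.
    return sorted(distances.items(), key=lambda item: item[1])

# e.g., inspect the closest 50 jobs:
# top50 = rank_jobs("4296426", jobs, distance)[:50]
\end{verbatim}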
\subsection{Job Data}
On the Mistral supercomputer at DKRZ, the monitoring system gathers nine IO metrics for the two Lustre file systems in 10\,s intervals on all nodes, together with general job metadata from the SLURM workload manager.
This results in 4D data (time, nodes, metrics, file system) per job.
The distance metrics should handle jobs of different length and node count.
In \cite{TODOPaper}, we discussed a variety of options from 1D job-profiles to data reductions to compare time series data, and described the general workflow and pre-processing in detail.
In a nutshell, for each job executed on Mistral, we partition it into 10 minute segments and compute the arithmetic mean of each metric, then categorize the value into non-IO (0), HighIO (1), and CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above, respectively.
After the data is reduced across nodes, we quantize the timelines using either a binary or hexadecimal representation, which is then ready for similarity analysis.
By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero -- we reduce the dataset from about 1 million jobs to about 580k jobs.
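As an illustration of this reduction, a hedged sketch in Python (assuming per-node samples of one metric at the 10\,s rate and hypothetical pre-computed percentile thresholds \texttt{p99} and \texttt{p999}; not the actual implementation):
\begin{verbatim}
import numpy as np

# Sketch of the per-metric reduction (assumptions: `samples` holds one metric of
# one node at a 10 s rate; `p99`/`p999` are pre-computed percentile thresholds).
def categorize_segments(samples, p99, p999, samples_per_segment=60):
    # 60 samples of 10 s each form one 10 minute segment.
    n_segments = len(samples) // samples_per_segment
    categories = []
    for i in range(n_segments):
        segment = samples[i * samples_per_segment:(i + 1) * samples_per_segment]
        mean = np.mean(segment)
        if mean < p99:
            categories.append(0)   # non-IO
        elif mean <= p999:
            categories.append(1)   # HighIO
        else:
            categories.append(4)   # CriticalIO
    return categories
\end{verbatim}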
A user might be interested to explore, say, the closest 10 or 50 jobs.

\subsection{Algorithms for Computing Similarity}
In this paper, we reuse the algorithms developed in \cite{TODO}: bin\_all, bin\_aggzeros, hex\_native, hex\_lev, and hex\_quant.
They differ in the way data similarity is defined; either the binary or the hexadecimal coding is used, and the distance metric is mostly the Euclidean distance or the Levenshtein distance.
For jobs of different length, we apply a sliding-window approach which finds the location in the longer job at which the shorter job matches with the highest similarity.
The hex\_quant algorithm extracts phases and matches them.
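A minimal sketch of the sliding-window comparison in Python (assuming a hypothetical \texttt{similarity()} function that scores two equal-length codings):
\begin{verbatim}
# Sketch of the sliding-window comparison for jobs of different length
# (assumption: `similarity(a, b)` scores two equal-length codings in [0, 1]).
def sliding_window_similarity(short, long, similarity):
    best = 0.0
    # Slide the shorter coding over the longer one and keep the best match.
    for offset in range(len(long) - len(short) + 1):
        window = long[offset:offset + len(short)]
        best = max(best, similarity(short, window))
    return best
\end{verbatim}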
Algorithms:
Profile algorithm: job-profiles (job-duration, job-metrics, combine both)
$\rightarrow$ just compute the geometric mean distance between profiles
\paragraph{Kolmogorov-Smirnov (ks) algorithm}
In this paper, we add a Kolmogorov-Smirnov algorithm that compares the probability distributions of the observed values, which we describe in the following.
% Summary
For the analysis, we perform two preparation steps:
dimension reduction by computing the mean across the two file systems, and concatenation of the time series data of the individual nodes.
This reduces the four-dimensional dataset to two dimensions (time, metrics).

Check time series algorithms:
% Aggregation
The reduction of the file system dimension by the mean function ensures that the time series values stay in the range between 0 and 4, independently of how many file systems are present on an HPC system.
The fixed interval of 10 min also ensures the portability of the approach to other HPC systems.
The concatenation of time series along the node dimension preserves the individual I/O information of all nodes while allowing the comparison of jobs with different numbers of nodes.
We apply no aggregation function to the metric dimension.

% Filtering
%Zero-jobs are jobs with no sign of significant I/O load and are of little interest in the analysis.
%Their sum across all dimensions and time series is equal to zero.
%Furthermore, we filter those jobs whose time series have less than 8 values.
% Described above

% Similarity
For the analysis we use the kolmogorov-smirnov-test 1.1.0 Rust library from the official Rust package registry ``crates.io''.
The similarity function in \Cref{eq:ks_similarity} calculates the mean inverse of the reject probability $p_{\text{reject}}$ computed with the KS test across all metrics $m$ of the metric set $M$.

\begin{equation}\label{eq:ks_similarity}
similarity = \frac{\sum_{m \in M} \left(1 - p_{\text{reject}}(m)\right)}{|M|}
\end{equation}

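For illustration, the same computation can be sketched in Python with SciPy's two-sample KS test (an assumption of this sketch is that the reject probability corresponds to one minus the p-value of the test; the actual analysis uses the Rust crate instead):
\begin{verbatim}
import numpy as np
from scipy import stats

# Sketch of the KS-based similarity (assumption: the reject probability
# corresponds to 1 - p-value of the two-sample KS test).
def ks_similarity(reference_values, candidate_values):
    """Both arguments map metric names to raw per-node, per-timestep values."""
    similarities = []
    for metric, ref in reference_values.items():
        cand = candidate_values[metric]
        statistic, p_value = stats.ks_2samp(ref, cand)
        p_reject = 1.0 - p_value
        similarities.append(1.0 - p_reject)  # equals the p-value
    return float(np.mean(similarities))
\end{verbatim}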
\subsection{Methodology}

Our strategy for localizing similar jobs works as follows:
the user or support staff provides a reference job ID and the algorithm to use for the similarity.
The system iterates over all jobs and computes the distance to the reference job using the algorithm.
Next, it sorts the jobs based on the distance to the reference job.
It visualizes the cumulative job distance.
The inspection of the jobs starts with the most similar jobs first.

The user can decide about the criterion when to stop inspecting jobs; based on the similarity, the number of investigated jobs, or the distribution of the job similarity.
For the latter, it is interesting to investigate clusters of similar jobs, e.g., if there are many jobs between 80-90\% similarity but few between 70-80\%.

For the inspection of the jobs, a user may explore the job metadata, searching for similarities, and explore the time series of a job's IO metrics.
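To make the distribution-based stopping criterion concrete, a small sketch in Python (assuming a list of pre-computed similarities in $[0,1]$) that counts jobs per 10\% similarity bin:
\begin{verbatim}
from collections import Counter

# Sketch: count jobs per 10 % similarity bin to spot clusters such as
# "many jobs between 80-90 % but few between 70-80 %"
# (assumption: `similarities` holds pre-computed values in [0, 1]).
def similarity_histogram(similarities):
    bins = Counter()
    for sim in similarities:
        lower = min(int(sim * 10) * 10, 90)  # 0-10 %, 10-20 %, ..., 90-100 %
        bins[lower] += 1
    return dict(sorted(bins.items(), reverse=True))
\end{verbatim}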
\begin{itemize}
\item bin
\item hex\_native
\item hex\_lev
\item hex\_quant
\end{itemize}

\section{Evaluation}
\label{sec:evaluation}

@@ -165,10 +225,16 @@ This coding is also used for the HEX class of algorithms (BIN algorithms merge a
The figures show the values of active metrics ($\neq 0$) only; if few are active, then they are shown in one timeline, otherwise they are rendered individually to provide a better overview.
For example, we can see in \Cref{fig:job-S} that several metrics increase in Segment\,6.

In \Cref{fig:refJobsHist}, the histograms of all job metrics are shown.
A histogram contains the activities of each node and timestep without being averaged across the nodes.
This data is used to compare jobs using Kolmogorov-Smirnov.
The metrics of Job-L are not shown as they have only a handful of instances where the value is not 0, except for write\_bytes: the first process is writing out at a low rate.
Interestingly, the aggregated pattern of Job-L in \Cref{fig:job-L} sums up to some activity at the first segment for three other metrics.

\begin{figure}
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries4296426}
\includegraphics[width=\textwidth]{job-ks-0timeseries4296426}
\caption{Job-S (runtime=15,551\,s, segments=25)} \label{fig:job-S}
\end{subfigure}
\centering

@@ -191,7 +257,7 @@ For example, we can see in \Cref{fig:job-S}, that several metrics increase in Se

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries7488914-30}
\includegraphics[width=\textwidth]{job-ks-2timeseries7488914-30}
\caption{Job-L (first 30 segments of 400; remaining segments are similar)}
\label{fig:job-L}
\end{subfigure}

@@ -200,6 +266,40 @@ For example, we can see in \Cref{fig:job-S}, that several metrics increase in Se
\end{figure}


\begin{figure}
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-ks-0hist4296426}
\caption{Job-S} \label{fig:job-S-hist}
\end{subfigure}
\centering

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-ks-1hist5024292}
\caption{Job-M} \label{fig:job-M-hist}
\end{subfigure}
\centering

\caption{Reference jobs: histogram of IO activities}
\label{fig:refJobsHist}
\end{figure}

%\begin{figure}\ContinuedFloat
%\begin{subfigure}{0.8\textwidth}
%\centering
%\includegraphics[width=\textwidth]{job-ks-2hist7488914}
%\caption{Job-L}
%\label{fig:job-L}
%\end{subfigure}
%\centering
%\caption{Reference jobs: histogram of IO activities}
%\end{figure}


\subsection{Performance}

@@ -215,6 +315,8 @@ Note that the current algorithms are sequential and executed on just one core.
For computing the similarity to one (or a small set of) reference jobs, they could easily be parallelized.
We believe this will then allow a near-online analysis of a job.

\jk{To update the figure to use KS (and maybe to aggregate job profiles)? Problem: the old files are gone}
\begin{figure}
\centering
\begin{subfigure}{0.31\textwidth}

@@ -254,8 +356,8 @@ As we focus on a feasible number of jobs, the diagram should be read from right
It turns out that both BIN algorithms produce nearly identical histograms and we omit one of them.
In the figures, we can see again a different behavior of the algorithms depending on the reference job.
Especially for Job-S, we can see clusters with jobs of higher similarity (e.g., at hex\_lev at SIM=75\%) while for Job-M, the growth in the relevant section is more steady.
For Job-L, we find barely similar jobs, except when using the HEX\_phases algorithm.
This algorithm finds 393 jobs that have a similarity of 100\%, thus they are indistinguishable to the algorithm.
For Job-L, we find barely similar jobs, except when using the HEX\_phases and ks algorithms.
HEX\_phases finds 393 jobs that have a similarity of 100\%, thus they are indistinguishable, while ks identifies 6880 jobs with a similarity of at least 97.5\%.

Practically, the support team would start with Rank\,1 (the most similar job, presumably the reference job itself) and walk down until the jobs look different, or until a cluster is analyzed.

@@ -291,19 +393,19 @@ Practically, the support team would start with Rank\,1 (most similar job, presum

\begin{subfigure}{0.75\textwidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 2.2cm},clip]{job_similarities_4296426-out/hist-sim}
\includegraphics[width=\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_4296426-out/hist-sim}
\caption{Job-S} \label{fig:hist-job-S}
\end{subfigure}

\begin{subfigure}{0.75\textwidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 2.2cm},clip]{job_similarities_5024292-out/hist-sim}
\includegraphics[width=\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_5024292-out/hist-sim}
\caption{Job-M} \label{fig:hist-job-M}
\end{subfigure}

\begin{subfigure}{0.75\textwidth}
\centering
\includegraphics[width=\textwidth,trim={0 0 0 2.2cm},clip]{job_similarities_7488914-out/hist-sim}
\includegraphics[width=\textwidth,trim={0 0 0 2.0cm},clip]{job_similarities_7488914-out/hist-sim}
\caption{Job-L} \label{fig:hist-job-L}
\end{subfigure}
\centering

@@ -330,7 +432,7 @@ To confirm hypotheses presented, we analyzed the job metadata comparing job name
To understand how the Top\,100 are distributed across users, the data is grouped by userid and counted.
\Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs.
For Job-S, we can see that about 70-80\% of jobs stem from one user; for the hex\_lev and hex\_native algorithms, the other jobs stem from a second user, while bin includes jobs from additional users (5 in total).
For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user; here, hex\_lev and hex\_native include more users (30 and 33, respectively) than the other three algorithms.
For Job-M, jobs from more users are included (13); about 25\% of jobs stem from the same user; here, hex\_lev, hex\_native, and ks include more users (29, 33, and 37, respectively) than the other three algorithms.
For Job-L, the two hex algorithms include (with 12 and 13 users) a slightly more diverse user community than the bin algorithms (9), but hex\_phases covers 35 users.

We didn't include the group analysis in the figure as the user count and group count are proportional; at most, the number of users is 2x the number of groups.
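The grouping itself can be reproduced with a few lines of pandas; a hedged sketch (assuming a hypothetical DataFrame \texttt{top100} with one row per selected job and a \texttt{user\_id} column):
\begin{verbatim}
import pandas as pd

# Sketch: count how the Top-100 jobs of one algorithm are distributed over users
# (assumption: `top100` has one row per selected job and a user_id column).
def user_distribution(top100: pd.DataFrame) -> pd.Series:
    counts = top100.groupby("user_id").size().sort_values(ascending=False)
    return counts  # lowest stack in the plot = user with the most jobs
\end{verbatim}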
@@ -347,7 +449,7 @@ The boxplots have different shapes which is an indication, that the different al
\paragraph{Runtime distribution.}
The job runtime of the Top\,100 jobs is shown using boxplots in \Cref{fig:runtime-job}.
While all algorithms can compute the similarity between jobs of different length, the bin algorithms and hex\_native penalize jobs of different length, preferring jobs of very similar length.
For Job-M and Job-L, hex\_phases is able to identify much shorter or longer jobs.
For Job-M and Job-L, hex\_phases and ks are able to identify much shorter or longer jobs.
For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:hist-job-L}; 393 jobs have a similarity of 100\%), which is the reason why the job runtime isn't shown in the figure itself.

\begin{figure}

@@ -416,14 +518,12 @@ For Job-L, the job itself isn't included in the chosen Top\,100 (see \Cref{fig:h

\subsubsection{Algorithmic differences}
To verify that the different algorithms behave differently, the intersection of the Top\,100 is computed for all combinations of algorithms and visualized in \Cref{fig:heatmap-job}.
As expected, we can observe that bin\_all and bin\_aggzeros are very similar for all three jobs.
While there is some reordering, both algorithms lead to a comparable order.
The hex\_lev and hex\_native algorithms also exhibit some overlap, particularly for Job-S and Job-L.
For Job-M, however, they lead to a different ranking and Top\,100.
Bin\_all and bin\_aggzeros overlap with at least 99 ranks for all three jobs.
While there is some reordering, both algorithms lead to a comparable set.
All algorithms have significant overlap for Job-S.
For Job-M, however, they lead to a different ranking and Top\,100; particularly, ks determines a different set.
Generally, hex\_lev and hex\_native generate more similar results than the other algorithms.
From this analysis, we conclude that one representative from binary quantization is sufficient as it generates very similar results, while the other algorithms identify mostly disjoint behavioral aspects and, therefore, should be analyzed individually.
\eb{Is this a general statement: ``one representative from binary quantization is sufficient''? If so, it is very vague. Could be coincidence.}
\jk{I rewrote it a bit. Certainly yes. It is just very similar.}
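The pairwise overlap is straightforward to compute; a sketch in Python (assuming a hypothetical dict \texttt{top100} that maps each algorithm name to its list of Top\,100 job IDs):
\begin{verbatim}
from itertools import combinations

# Sketch: pairwise Top-100 intersection between algorithms
# (assumption: `top100` maps algorithm names to lists of the 100 most
# similar job IDs).
def top100_intersections(top100):
    overlap = {}
    for a, b in combinations(sorted(top100), 2):
        overlap[(a, b)] = len(set(top100[a]) & set(top100[b]))
    return overlap
\end{verbatim}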



\begin{figure}

@@ -464,24 +564,30 @@ It is executed for different simulations and variables across timesteps.
The job name of Job-S suggests that it is applied to the control variable.
In the metadata, we found 22,580 jobs with “cmor” in the name, of which 367 jobs mention “control”.

The bin algorithms identify one job whose name doesn't include “cmor”.
All other algorithms identify only “cmor” jobs, and 26-38 of these jobs are applied to “control” (see \Cref{tbl:control-jobs}).
The bin and ks algorithms identify one job whose name doesn't include “cmor”.
All other algorithms identify only “cmor” jobs, and 26-38 of these jobs are applied to “control” (see \Cref{tbl:control-jobs}) -- only the ks algorithm doesn't identify any job with control.
A selection of job timelines is given in \Cref{fig:job-S-hex-lev}; all of these jobs are jobs on control variables.
The single non-cmor job and a high-ranked non-control cmor job are shown in \Cref{fig:job-S-bin-agg}.
While we cannot visually see much difference between these two jobs compared to the cmor jobs processing the control variables, the algorithms indicate that jobs processing the control variables must be more similar, as they appear much more frequently in the Top\,100 jobs than in all jobs labeled with “cmor”.

For Job-S, we found that all algorithms work similarly well and, therefore, omit further timelines.
For Job-S, we found that all algorithms work well and, therefore, omit further timelines.
\begin{table}
\centering
\begin{tabular}{r|r}
Algorithm & Jobs \\ \hline
bin\_aggzeros & 38 \\
bin\_all & 38 \\
hex\_lev & 33 \\
hex\_native & 26 \\
hex\_phases & 33
\begin{tabular}{r|r|r|r|r|r}
bin\_aggzeros & bin\_all & hex\_lev & hex\_native & hex\_phases & ks\\ \hline
38 & 38 & 33 & 26 & 33 & 0
\end{tabular}

%\begin{tabular}{r|r}
% Algorithm & Jobs \\ \hline
% bin\_aggzeros & 38 \\
% bin\_all & 38 \\
% hex\_lev & 33 \\
% hex\_native & 26 \\
% hex\_phases & 33 \\
% ks & 0
%\end{tabular}
\caption{Job-S: number of jobs with “control” in their name in the Top-100}
\label{tbl:control-jobs}
\end{table}

@@ -686,7 +792,8 @@ The jobs that are similar according to the bin algorithms differ from our expect
\subsection{Job-L}

For the bin algorithms, the inspection of job names (14 unique names) leads to two prominent applications: bash and xmessy with 45 and 48 instances, respectively.
The hex algorithms identify a more diverse set of applications (18 unique names), with no xmessy job, and the hex\_phases algorithm has 85 unique names.
The hex algorithms identify a more diverse set of applications (18 unique names and no xmessy job), and the hex\_phases algorithm has 85 unique names.
The ks algorithm finds 71 jobs ending with t127, which is a typical model configuration.

\begin{figure}
\begin{subfigure}{0.3\textwidth}

@@ -799,5 +906,7 @@ The hex algorithms identify a more diverse set of applications (18 unique names)
One consideration could be to identify jobs that are found by all algorithms, i.e., jobs that meet a certain (rank) threshold for different algorithms.
That would increase the likelihood that these jobs are very similar and are what the user is looking for.

The ks algorithm finds jobs with similar histograms, which is not necessarily what we are looking for.

%\printbibliography
\end{document}

@@ -6,31 +6,22 @@ echo "This script performs the complete analysis steps"

CLEAN=0 # Set to 0 to update the existing output instead of cleaning it first

function prepare(){
    pushd datasets
    ./decompress.sh
    popd
    ./scripts/plot-job-timelines-ks.py 4296426,5024292,7488914 fig/job,fig/job,fig/job

    for I in datasets/*.csv ; do
        if [ ! -e $(basename $I) ]; then
            echo "Creating symlink $(basename $I)"
            ln -s $I
        fi
    done
}

prepare

for I in job_similarities_*.csv ; do
for I in datasets/job_similarities_*.csv ; do
    rm *.png *.pdf
    ./scripts/plot.R $I > description.txt
    echo "processing $I"
    set -x
    ./scripts/plot.R $I > description.txt 2>&1
    set +x
    I=${I##datasets/}
    OUT=${I%%.csv}-out
    mkdir $OUT
    if [[ $CLEAN != "0" ]] ; then
        rm $OUT/*
        mv description.txt $OUT
    fi
    mv *.png *.pdf jobs-*.txt $OUT
    mv description.txt *.png *.pdf jobs-*.txt $OUT
done

# analyze performance data
@@ -4,7 +4,8 @@

mkdir fig
for job in 5024292 4296426 7488914 ; do
    ./scripts/plot-single-job.py $job "fig/job-"
    #./scripts/plot-single-job.py $job "fig/job-"
    ./scripts/plot-single-ks-jobs.py $job "fig/job-"
done

# Remove whitespace around jobs
@@ -0,0 +1,154 @@
#!/usr/bin/env python3

import csv
import sys
import pandas as pd
from pandas import DataFrame
from pandas import Grouper
import seaborn as sns
from matplotlib import pyplot
import matplotlib.cm as cm

jobs = sys.argv[1].split(",")
prefix = sys.argv[2].split(",")

fileformat = ".pdf"

print("Plotting the job: " + str(sys.argv[1]))
print("Plotting with prefix: " + str(sys.argv[2]))


# Color map
colorMap = { "md_file_create": cm.tab10(0),
             "md_file_delete": cm.tab10(1),
             "md_mod": cm.tab10(2),
             "md_other": cm.tab10(3),
             "md_read": cm.tab10(4),
             "read_bytes": cm.tab10(5),
             "read_calls": cm.tab10(6),
             "write_bytes": cm.tab10(7),
             "write_calls": cm.tab10(8)
             }

markerMap = { "md_file_create": "^",
              "md_file_delete": "v",
              "md_other": ".",
              "md_mod": "<",
              "md_read": ">",
              "read_bytes": "h",
              "read_calls": "H",
              "write_bytes": "D",
              "write_calls": "d"
              }

linestyleMap = { "md_file_create": ":",
                 "md_file_delete": ":",
                 "md_mod": ":",
                 "md_other": ":",
                 "md_read": ":",
                 "read_bytes": "--",
                 "read_calls": "--",
                 "write_bytes": "-.",
                 "write_calls": "-."
                 }

# Plot the timeseries
def plot(prefix, header, row):
    x = { h : d for (h, d) in zip(header, row)}
    jobid = x["jobid"]
    del x["jobid"]
    result = []
    for k in x:
        timeseries = x[k].split(":")
        timeseries = [ float(x) for x in timeseries]
        if sum(timeseries) == 0:
            continue
        timeseries = [ [k, x, s] for (s,x) in zip(timeseries, range(0, len(timeseries))) ]
        result.extend(timeseries)

    if len(result) == 0:
        print("Empty job! Cannot plot!")
        return

    data = DataFrame(result, columns=["metrics", "segment", "value"])
    groups = data.groupby(["metrics"])
    metrics = DataFrame()
    labels = []
    colors = []
    style = []
    for name, group in groups:
        style.append(linestyleMap[name] + markerMap[name])
        colors.append(colorMap[name])
        if name == "md_file_delete":
            name = "file_delete"
        if name == "md_file_create":
            name = "file_create"
        try:
            metrics[name] = pd.Series([x[2] for x in group.values])
        except:
            print("Error processing %s with" % jobid)
            print(group.values)
            return

        labels.append(name)

    fsize = (8, 1 + 1.1 * len(labels))
    fsizeFixed = (8, 2)
    fsizeHist = (8, 6.5)

    pyplot.close('all')

    if len(labels) < 4 :
        ax = metrics.plot(legend=True, sharex=True, grid = True, sharey=True, markersize=10, figsize=fsizeFixed, color=colors, style=style)
        ax.set_ylabel("Value")
    else:
        ax = metrics.plot(subplots=True, legend=False, sharex=True, grid = True, sharey=True, markersize=10, figsize=fsize, color=colors, style=style)
        for (i, l) in zip(range(0, len(labels)), labels):
            ax[i].set_ylabel(l)

    pyplot.xlabel("Segment number")
    pyplot.savefig(prefix + "timeseries" + jobid + fileformat, bbox_inches='tight', dpi=150)

    # Create a facetted grid
    #g = sns.FacetGrid(tips, col="time", margin_titles=True)
    #bins = np.linspace(0, 60, 13)
    #g.map(plt.hist, "total_bill", color="steelblue", bins=bins)
    ax = metrics.hist(grid = True, sharey=True, figsize=fsizeHist, bins=15, range=(0, 15))
    pyplot.xlim(0, 15)
    pyplot.savefig(prefix + "hist" + jobid + fileformat, bbox_inches='tight', dpi=150)


    # Plot first 30 segments
    if len(timeseries) <= 50:
        return

    if len(labels) < 4 :
        ax = metrics.plot(legend=True, xlim=(0,30), sharex=True, grid = True, sharey=True, markersize=10, figsize=fsizeFixed, color=colors, style=style)
        ax.set_ylabel("Value")
    else:
        ax = metrics.plot(subplots=True, xlim=(0,30), legend=False, sharex=True, grid = True, sharey=True, markersize=10, figsize=fsize, color=colors, style=style)
        for (i, l) in zip(range(0, len(labels)), labels):
            ax[i].set_ylabel(l)

    pyplot.xlabel("Segment number")
    pyplot.savefig(prefix + "timeseries" + jobid + "-30" + fileformat, bbox_inches='tight', dpi=150)

### end plotting function

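# Input format (inferred from plot() above): the codings CSV is expected to provide one
# row per job with a "jobid" column followed by one column per metric, each holding the
# per-segment values joined by ":". Only the jobs passed on the command line are plotted.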
#with open('job-io-datasets/datasets/job_codings.csv') as csv_file: # EB: old codings
with open('./datasets/job_codings_v4.csv') as csv_file: # EB: v3 codings moved to this repo
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            header = row
            line_count += 1
            continue
        job = row[0].strip()
        if not job in jobs:
            continue
        else:
            index = jobs.index(job)
            plot(prefix[index] + "-ks-" + str(index), header, row)
@@ -10,7 +10,7 @@ import matplotlib.cm as cm

jobs = sys.argv[1].split(",")
prefix = sys.argv[2].split(",")

fileformat = ".png"
fileformat = ".pdf"

print("Plotting the job: " + str(sys.argv[1]))
print("Plotting with prefix: " + str(sys.argv[2]))
@@ -1,5 +1,9 @@
#!/usr/bin/env Rscript

# Parse job from command line
args = commandArgs(trailingOnly = TRUE)
filename = args[1]

library(ggplot2)
library(dplyr)
require(scales)
@@ -7,18 +11,19 @@ library(stringi)
library(stringr)

# Turn to TRUE to print individual job images
plotjobs = FALSE
plotjobs = TRUE

# Color scheme
plotcolors <- c("#CC0000", "#FFA500", "#FFFF00", "#008000", "#9999ff", "#000099")

# Parse job from command line
args = commandArgs(trailingOnly = TRUE)
file = "job_similarities_5024292.csv" # for manual execution
file = args[1]
jobID = str_extract(file, regex("[0-9]+"))
if (! exists("filename")){
  filename = "./datasets/job_similarities_5024292.csv"
  filename = "./datasets/job_similarities_7488914.csv" # for manual execution
}
print(filename)
jobID = str_extract(filename, regex("[0-9]+"))

data = read.csv(file)
data = read.csv(filename)
# Columns are: jobid alg_id alg_name similarity

#data$alg_id = as.factor(data$alg_id) # EB: wrong column?
@@ -28,16 +33,17 @@ cat(nrow(data))

# empirical cumulative density function (ECDF)
data$sim = data$similarity*100
ggplot(data, aes(sim, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("Similarity in %") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4)) + scale_color_brewer(palette = "Set2") + scale_x_log10()
ggplot(data, aes(sim, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("Similarity in %") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.5), legend.title = element_blank()) + scale_color_brewer(palette = "Set2") # + scale_x_log10() +
ggsave("ecdf.png", width=8, height=2.5)

# histogram for the jobs
ggplot(data, aes(sim), group=alg_name) + geom_histogram(color="black", binwidth=2.5) + aes(fill = alg_name) + facet_grid(alg_name ~ ., switch = 'y') + xlab("Similarity in %") + scale_y_continuous(limits=c(0, 100), oob=squish) + scale_color_brewer(palette = "Set2") + ylab("Count (cropped at 100)") + theme(legend.position = "none") + stat_bin(binwidth=2.5, geom="text", adj=1.0, angle = 90, colour="black", size=3, aes(label=..count.., y=0*(..count..)+95))
ggsave("hist-sim.png", width=6, height=4.5)
ggsave("hist-sim.png", width=6, height=5)

#ggplot(data, aes(similarity, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("SIM") + ylab("Fraction of jobs") + theme(legend.position=c(0.9, 0.4)) + scale_color_brewer(palette = "Set2") + xlim(0.5, 1.0)
#ggsave("ecdf-0.5.png", width=8, height=3)

print("Similarity > 0.5")
e = data %>% filter(similarity >= 0.5)
print(summary(e))

@@ -47,13 +53,20 @@ metadata = read.csv("./datasets/job_metadata.csv") # EB: is also in the repo
metadata$user_id = as.factor(metadata$user_id)
metadata$group_id = as.factor(metadata$group_id)

plotJobs = function(jobs){
plotJobs = function(algorithm, jobs){
  # print the job timelines
  r = e[ordered, ]

  if (plotjobs) {
    if(algorithm == "ks"){
      script = "./scripts/plot-job-timelines-ks.py"
    }else{
      script = "./scripts/plot-job-timelines.py"
    }
    prefix = do.call("sprintf", list("%s-%.4f-", level, r$similarity))
    system(sprintf("./scripts/plot-single-job.py %s %s", paste(r$jobid, collapse=","), paste(prefix, collapse=",")))
    call = sprintf("%s %s %s", script, paste(r$jobid, collapse=","), paste(prefix, collapse=","))
    print(call)
    system(call)
  }

  system(sprintf("./scripts/extract-conf-data.sh %s > jobs-%s.txt", paste(r$jobid, collapse=" "), level))
@@ -88,7 +101,7 @@ for (level in levels(data$alg_name)){
  userprofile$userrank = 1:nrow(userprofile)
  result.userid = rbind(result.userid, cbind(level, userprofile))

  plotJobs(jobs)
  plotJobs(level, jobs)
}

colnames(result.userid) = c("alg_name", "user_id", "count", "userrank")
@@ -121,7 +134,7 @@ print(res.intersect)

# Plot heatmap about intersection
ggplot(tbl.intersect, aes(first, second, fill=intersect)) + geom_tile() + geom_text(aes(label = round(intersect, 1))) + scale_fill_gradientn(colours = rev(plotcolors)) + xlab("") + ylab("") + theme(legend.position = "bottom", legend.title = element_blank())
ggsave("intersection-heatmap.png", width=4.5, height=4.5)
ggsave("intersection-heatmap.png", width=5, height=5)

# Collect the metadata of all jobs in a new table
res.jobs = tibble()