
\let\accentvec\vec
\documentclass[]{llncs}

\usepackage{todonotes}
\newcommand{\eb}[1]{\todo[inline]{(EB): #1}}
\newcommand{\jk}[1]{\todo[inline]{JK: #1}}

\usepackage{silence}
\WarningFilter{biblatex}{Using}
\WarningFilter{latex}{Float too large}
\WarningFilter{caption}{Unsupported}
\WarningFilter{caption}{Unknown document}

\let\spvec\vec
\let\vec\accentvec
\usepackage{amsmath}
\let\vec\spvec

\usepackage{array}
\usepackage{xcolor}
\usepackage{color}
\usepackage{colortbl}
\usepackage{subcaption}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{lstautogobble}
\usepackage[listings,skins,breakable,raster,most]{tcolorbox}
\usepackage{caption}


\lstset{
	numberbychapter=false,
	belowskip=-10pt,
	aboveskip=-10pt,
}

\lstdefinestyle{lstcodebox} {
	basicstyle=\scriptsize\ttfamily,
	autogobble=true,
	tabsize=2,
	captionpos=b,
	float,
}

\usepackage{graphicx}
\graphicspath{
	{./pictures/},
  {../fig/},
  {../}
}

\usepackage[backend=bibtex, style=numeric]{biblatex}
\addbibresource{bibliography.bib}


\usepackage{enumitem}
\setitemize{noitemsep,topsep=0pt,parsep=0pt,partopsep=0pt}

\definecolor{darkgreen}{rgb}{0,0.5,0}
\definecolor{darkyellow}{rgb}{0.7,0.7,0}


\usepackage{cleveref}
\crefname{codecount}{Code}{Codes}

\title{Using Machine Learning to Identify Similar Jobs Based on their IO Behavior}
\author{Julian Kunkel\inst{1} \and Eugen Betke\inst{2}}

\institute{
University of Reading --
\email{j.m.kunkel@reading.ac.uk}%
\and
DKRZ --
\email{betke@dkrz.de}%
}
\begin{document}
\maketitle

\begin{abstract}

Support staff face the following problem: a particular job is found that is not performing well.
How can we then find similar jobs?

A difficulty lies in the definition of similarity.

In this paper, we illustrate a methodology and algorithms to identify similar jobs based on profiles and time series.
Similar to a study.

Research question: is this approach effective for finding similar jobs?

The contribution of this paper...
\end{abstract}

\section{Introduction}

%This paper is structured as follows.
%We start with the related work in \Cref{sec:relwork}.
%Then, in TODO we introduce the DKRZ monitoring systems and explain how I/O metrics are captured by the collectors.
%In \Cref{sec:methodology} we describe the data reduction and the machine learning approaches and do an experiment in \Cref{sec:data,sec:evaluation}.
%Finally, we finalize our paper with a summary in \Cref{sec:summary}.

\section{Related Work}
\label{sec:relwork}

\section{Methodology}
\label{sec:methodology}

Given the ID of a reference job, a feature set is created from the 4D time series data, whose dimensions are node, file system, metric (nine metrics), and time.

The algorithms are adapted as follows:
\begin{itemize}
	\item iterate over all jobs
		\begin{itemize}
			\item compute the distance to the reference job
		\end{itemize}
	\item sort the jobs by their distance to the reference job
	\item create the cumulative job distribution over the distance for visualization, and allow users to output the jobs with a given distance
\end{itemize}
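
The steps above can be written down compactly. The following is only a sketch; the symbols $J$ (set of all jobs), $r$ (reference job), $d$ (the distance computed by the chosen algorithm), and $F_r$ are notation assumed here for illustration:
\begin{align*}
	d_j &= d(j, r) \quad \text{for each } j \in J \setminus \{r\}, \\
	F_r(x) &= \frac{\lvert \{\, j \in J \setminus \{r\} : d_j \le x \,\} \rvert}{\lvert J \setminus \{r\} \rvert}.
\end{align*}
Sorting the jobs by ascending $d_j$ yields the ranking, $F_r$ is the cumulative job distribution used for visualization, and the jobs reported for a distance bound $x$ would be $\{\, j : d_j \le x \,\}$.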

A user might be interested in exploring, say, the closest 10 or 50 jobs.

Algorithms:
The profile algorithm operates on job profiles (job duration, job metrics, or a combination of both)
and simply computes the geometric-mean distance between the profiles.

The following time series algorithms are checked:

\begin{itemize}
	\item bin
	\item hex\_native
  \item hex\_lev
	\item hex\_quant
\end{itemize}
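
The name hex\_lev suggests a comparison based on the Levenshtein distance. As a reminder, and assuming (this is not stated above) that each job is coded as a sequence of symbols, the Levenshtein distance between two sequences $a$ and $b$ of lengths $m$ and $n$ is $\operatorname{lev}(m,n)$, defined by
\begin{align*}
	\operatorname{lev}(i,0) &= i, \qquad \operatorname{lev}(0,j) = j, \\
	\operatorname{lev}(i,j) &= \min \bigl\{ \operatorname{lev}(i-1,j) + 1,\ \operatorname{lev}(i,j-1) + 1,\ \operatorname{lev}(i-1,j-1) + [a_i \ne b_j] \bigr\},
\end{align*}
where $[a_i \ne b_j]$ is 1 if the symbols differ and 0 otherwise. For non-empty sequences, a similarity in $[0,1]$ could then be obtained, for instance, as
\[
	\operatorname{sim}(a,b) = 1 - \frac{\operatorname{lev}(m,n)}{\max(m,n)},
\]
so that identical sequences have similarity 1. This normalization is one common choice and is an assumption here, not necessarily the one used by hex\_lev.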

\section{Evaluation}
\label{sec:evaluation}

In the following, we assume a job is given and we aim to identify similar jobs.
We chose several reference jobs with different compute and IO characteristics visualized in \Cref{fig:refJobs}:
\begin{itemize}
	\item Job-S: performs post-processing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so-called CMORization). The post-processing is IO intensive.
  \item Job-M: a typical MPI-parallel 8-hour compute job on 128 nodes, which writes time series data after some spin-up.   %CHE.ws12
	\item Job-L: a 66-hour 20-node job.
  The initialization data is read at the beginning.
  Then only a single master node constantly writes a small volume of data; in fact, the generated data is too small to be categorized as IO relevant.
\end{itemize}

For each reference job and algorithm, we created a CSV file with the computed similarity for all other jobs.

\begin{figure}
\centering
  \begin{subfigure}{0.8\textwidth}
  \centering
  \includegraphics[width=\textwidth]{runtime-overview}
  \caption{Overview of the runtime to process all jobs} \label{fig:runtime-overview}
  \end{subfigure}

  \begin{subfigure}{0.8\textwidth}
  \centering
  \includegraphics[width=\textwidth]{runtime-cummulative}
  \caption{Cumulative} \label{fig:runtime-cummulative}
  \end{subfigure}

  \caption{Performance of the algorithms}
  \label{fig:performance}
\end{figure}


Create histograms + cumulative job distribution for all algorithms.
Insert job profiles for closest 10 jobs.

Potentially, analyze what the rankings of the different similarities look like.


\begin{figure}
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries4296426}
\caption{Job-S} \label{fig:job-S}
\end{subfigure}
\centering

\caption{Reference jobs: timeline of mean IO activity}
\label{fig:refJobs}
\end{figure}


\begin{figure}\ContinuedFloat

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries5024292}
\caption{Job-M} \label{fig:job-M}
\end{subfigure}
\centering

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries7488914-30.pdf}
\caption{Job-L (first 30 segments of 400; remaining segments are similar)}
\label{fig:job-L}
\end{subfigure}
\centering
\caption{Reference jobs: timeline of mean IO activity; non-shown timelines are 0}
\end{figure}


\begin{figure}

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/ecdf.png}
\caption{Job-S} \label{fig:ecdf-job-S}
\end{subfigure}
\centering

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/ecdf.png}
\caption{Job-M} \label{fig:ecdf-job-M}
\end{subfigure}
\centering

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/ecdf.png}
\caption{Job-L} \label{fig:ecdf-job-L}
\end{subfigure}
\centering
\caption{Empirical cumulative distribution function}
\label{fig:ecdf}
\end{figure}


\begin{figure}

\begin{subfigure}{0.5\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/hist-sim}
\caption{Job-S} \label{fig:hist-job-S}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hist-sim}
\caption{Job-M} \label{fig:hist-job-M}
\end{subfigure}

\begin{subfigure}{0.5\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hist-sim}
\caption{Job-L} \label{fig:hist-job-L}
\end{subfigure}
\centering
\caption{Histogram for the number of jobs (bin width: 2.5\%, numbers are the actual job counts)}
\label{fig:hist}
\end{figure}

\subsection{Quantitative Analysis of Selected Jobs}


The counts of unique user IDs and unique group IDs are mostly identical, meaning that the users likely belong to distinct groups; for Job-L, the user and group counts differ slightly, and for Job-M a bit more.
There are up to about twice as many users as groups.

To understand how the Top\,100 jobs are distributed across users, the data is grouped by user ID and counted.
\Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs.
For Job-S, we can see that about 70--80\% of the jobs stem from one user; for the hex\_lev and hex\_native algorithms, the other jobs stem from a second user, while bin includes jobs from additional users (5 in total).
For Job-M, jobs from more users are included (13), and about 25\% of the jobs stem from the same user; here, hex\_lev and hex\_native include more users (30 and 33, respectively) than the other three algorithms.
For Job-L, the two hex algorithms include a slightly more diverse user community (12 and 13 users) than the bin algorithms (9), but hex\_phases covers 35 users.
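
The grouping can be expressed as follows, with notation assumed here for illustration: let $T$ be the Top\,100 jobs of one algorithm and $\operatorname{uid}(j)$ the user ID of job $j$. The number of jobs per user is
\[
	n_u = \lvert \{\, j \in T : \operatorname{uid}(j) = u \,\} \rvert, \qquad \sum_u n_u = \lvert T \rvert = 100,
\]
and the stack is ordered such that $n_{u_1} \ge n_{u_2} \ge \dots$ from bottom to top.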

\begin{figure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/user-ids}
\caption{Job-S} \label{fig:users-job-S}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/user-ids}
\caption{Job-M} \label{fig:users-job-M}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/user-ids}
\caption{Job-L} \label{fig:users-job-L}
\end{subfigure}


\caption{User information for each job}
\label{fig:userids}
\end{figure}

\begin{figure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/jobs-nodes}
\caption{Job-S} \label{fig:nodes-job-S}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/jobs-nodes}
\caption{Job-M} \label{fig:nodes-job-M}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/jobs-nodes}
\caption{Job-L} \label{fig:nodes-job-L}
\end{subfigure}
\centering
\caption{Distribution of node counts}
\label{fig:nodes-job}
\end{figure}


\begin{figure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/jobs-elapsed}
\caption{Job-S} \label{fig:runtime-job-S}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/jobs-elapsed}
\caption{Job-M} \label{fig:runtime-job-M}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/jobs-elapsed}
\caption{Job-L} \label{fig:runtime-job-L}
\end{subfigure}
\centering
\caption{Distribution of elapsed runtime}
\label{fig:runtime-job}
\end{figure}

To see how differently the algorithms behave, the intersection of the results of each pair of algorithms is computed for the 100 jobs with the highest similarity and visualized in \Cref{fig:heatmap-job}.
As expected, we can observe that bin\_all and bin\_aggzeros are very similar for all three jobs.
While there is some reordering, both algorithms lead to a comparable order.
The hex\_lev and hex\_native algorithms also exhibit some overlap, particularly for Job-S and Job-L.
For Job-M, however, they lead to a different ranking and Top\,100.
From the analysis, we conclude that one representative of binary quantization is sufficient, while the other algorithms identify mostly disjoint behavioral aspects and should, therefore, be considered together.
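
The intersection used above can be stated as follows; the notation is introduced here for illustration only. Let $T_A$ denote the set of the 100 jobs with the highest similarity to the reference job under algorithm $A$. For two algorithms $A$ and $B$,
\[
	I(A,B) = \lvert T_A \cap T_B \rvert, \qquad 0 \le I(A,B) \le 100.
\]
$I$ is symmetric and $I(A,A) = 100$; since each set contains 100 jobs, the count can be read directly as a percentage. Note that $I$ only compares set membership, so a reordering within the Top\,100 does not change its value.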

One consideration is to identify jobs that meet a rank threshold for all different algorithms.
\jk{TODO}
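
Such a criterion could be formalized as follows (a sketch; the symbols are assumptions and not part of the current analysis). Let $\mathcal{A}$ be the set of algorithms, $\operatorname{rank}_A(j)$ the position of job $j$ when the jobs are sorted by descending similarity under algorithm $A$, and $t$ the rank threshold. The selected jobs would be
\[
	S_t = \{\, j : \max_{A \in \mathcal{A}} \operatorname{rank}_A(j) \le t \,\} = \bigcap_{A \in \mathcal{A}} \{\, j : \operatorname{rank}_A(j) \le t \,\}.
\]
Because the top lists of the algorithms overlap only partially, $S_t$ may be small or even empty for small $t$.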

\begin{figure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/intersection-heatmap}
\caption{Job-S} \label{fig:heatmap-job-S}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/intersection-heatmap}
\caption{Job-M} \label{fig:heatmap-job-M}
\end{subfigure}
\begin{subfigure}{0.31\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/intersection-heatmap}
\caption{Job-L} \label{fig:heatmap-job-L}
\end{subfigure}
\centering
\caption{Intersection of the top 100 jobs for the different algorithms}
\label{fig:heatmap-job}
\end{figure}

\section{Assessing Timelines for Similar Jobs}

\subsection{Job-S}

\begin{figure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_lev-0.9615--1timeseries4296288}
\caption{Rank 2, SIM=0.9615}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_lev-0.9012--15timeseries4296277}
\caption{Rank 15, SIM=0.9012}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_lev-0.7901--99timeseries4297842}
\caption{Rank\,100, SIM=0.790}
\end{subfigure}

\caption{Job-S with hex\_lev, selection of similar jobs}
\label{fig:job-S-hex-lev}
\end{figure}

\begin{figure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_native-0.9808--1timeseries4296288}
\caption{Rank 2, SIM=}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_native-0.9375--15timeseries4564296}
\caption{Rank 15, SIM=}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/hex_native-0.8915--99timeseries4296785}
\caption{Rank\,100, SIM=}
\end{subfigure}

\caption{Job-S with hex\_native, selection of similar jobs}
\label{fig:job-S-hex-native}
\end{figure}

% \ContinuedFloat

The results of hex\_phases are very similar to those of hex\_native.
An odd job worth inspecting: \verb|job_similarities_4296426-out/hex_phases-0.7429--93timeseries4237860.png|


The bin\_aggzeros algorithm works quite well here, too. The jobs are a bit more diverse.


\begin{figure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/bin_aggzeros-0.8462--1timeseries4296280}
\caption{Rank 2, SIM=}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/bin_aggzeros-0.7778--14timeseries4555405}
\caption{Rank 15, SIM=}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/bin_aggzeros-0.6923--99timeseries4687419}
\caption{Rank\,100, SIM=}
\end{subfigure}

\caption{Job-S with bin\_aggzeros, selection of similar jobs}
\label{fig:job-S-bin-aggzeros}
\end{figure}


\subsection{Job-M}

The bin\_aggzeros algorithm returns poor results here.


\begin{figure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/bin_aggzeros-0.7755--1timeseries8010306}
\caption{Rank 2, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/bin_aggzeros-0.7347--14timeseries4498983}
\caption{$SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/bin_aggzeros-0.5102--99timeseries5120077}
\caption{$SIM=$ }
\end{subfigure}

\caption{Job-M with bin\_aggzeros, selection of similar jobs}
\label{fig:job-M-bin-aggzero}
\end{figure}


\begin{figure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_lev-0.9546--1timeseries7826634}
\caption{Rank 2, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_lev-0.9365--2timeseries5240733}
\caption{Rank 3, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_lev-0.7392--15timeseries7651420}
\caption{$SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_lev-0.7007--99timeseries8201967}
\caption{$SIM=$ }
\end{subfigure}

\caption{Job-M with hex\_lev, selection of similar jobs}
\label{fig:job-M-hex-lev}
\end{figure}


\begin{figure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_native-0.9878--1timeseries5240733}
\caption{Rank 2, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_native-0.9651--2timeseries7826634}
\caption{Rank 3, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_native-0.9084--14timeseries8037817}
\caption{$SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_native-0.8838--99timeseries7571967}
\caption{$SIM=$ }
\end{subfigure}

\caption{Job-M with hex\_native, selection of similar jobs}
\label{fig:job-M-hex-native}
\end{figure}


\begin{figure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_phases-0.8831--1timeseries7826634}
\caption{Rank 2, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_phases-0.7963--2timeseries5240733}
\caption{Rank 3, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_phases-0.4583--14timeseries4244400}
\caption{$SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hex_phases-0.2397--99timeseries7644009}
\caption{$SIM=$ }
\end{subfigure}

\caption{Job-M with hex\_phases, selection of similar jobs}
\label{fig:job-M-hex-phases}
\end{figure}

\subsection{Job-L}


\begin{figure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/bin_aggzeros-0.1671--1timeseries7869050}
\caption{Rank 2, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/bin_aggzeros-0.1671--2timeseries7990497}
\caption{Rank 3, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\includegraphics[width=\textwidth]{job_similarities_7488914-out/bin_aggzeros-0.1521--14timeseries8363584}
\caption{$SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/bin_aggzeros-0.1097--97timeseries4262983}
\caption{$SIM=$ }
\end{subfigure}

\caption{Job-L with bin\_aggzeros, selection of similar jobs}
\label{fig:job-L-bin-aggzero}
\end{figure}


\begin{figure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_lev-0.9386--1timeseries7266845}
\caption{Rank 2, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_lev-0.9375--2timeseries7214657}
\caption{Rank 3, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_lev-0.7251--14timeseries4341304}
\caption{$SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_lev-0.1657--99timeseries8036223}
\caption{$SIM=$ (30s)}
\end{subfigure}

\caption{Job-L with hex\_lev, selection of similar jobs}
\label{fig:job-L-hex-lev}
\end{figure}


\begin{figure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_native-0.9390--1timeseries7266845}
\caption{Rank 2, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_native-0.9333--2timeseries7214657}
\caption{Rank 3, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_native-0.8708--14timeseries4936553}
\caption{$SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_native-0.1695--99timeseries7942052}
\caption{$SIM=$ }
\end{subfigure}

\caption{Job-L with hex\_native, selection of similar jobs}
\label{fig:job-L-hex-native}
\end{figure}

\begin{figure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_phases-1.0000--14timeseries4577917}
\caption{Rank 2, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_phases-1.0000--1timeseries4405671}
\caption{Rank 3, $SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_phases-1.0000--2timeseries4621422}
\caption{$SIM=$}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hex_phases-1.0000--99timeseries4232293}
\caption{$SIM=$ }
\end{subfigure}

\caption{Job-L with hex\_phases, selection of similar jobs}
\label{fig:job-L-hex-phases}
\end{figure}


\section{Conclusion}
\label{sec:summary}

%\printbibliography
\end{document}