
\let\accentvec\vec
\documentclass[]{llncs}

\usepackage{todonotes}
\newcommand{\eb}[1]{\todo[inline]{(EB): #1}}
\newcommand{\jk}[1]{\todo[inline]{JK: #1}}

\usepackage{silence}
\WarningFilter{biblatex}{Using}
\WarningFilter{latex}{Float too large}
\WarningFilter{caption}{Unsupported}
\WarningFilter{caption}{Unknown document}

\let\spvec\vec
\let\vec\accentvec
\usepackage{amsmath}
\let\vec\spvec

\usepackage{array}
\usepackage{xcolor}
\usepackage{color}
\usepackage{colortbl}
\usepackage{subcaption}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{lstautogobble}
\usepackage[listings,skins,breakable,raster,most]{tcolorbox}
\usepackage{caption}


\lstset{
	numberbychapter=false,
	belowskip=-10pt,
	aboveskip=-10pt,
}

\lstdefinestyle{lstcodebox} {
	basicstyle=\scriptsize\ttfamily,
	autogobble=true,
	tabsize=2,
	captionpos=b,
	float,
}

\usepackage{graphicx}
\graphicspath{
	{./pictures/},
  {../fig/},
  {../}
}

\usepackage[backend=bibtex, style=numeric]{biblatex}
\addbibresource{bibliography.bib}


\usepackage{enumitem}
\setitemize{noitemsep,topsep=0pt,parsep=0pt,partopsep=0pt}

\definecolor{darkgreen}{rgb}{0,0.5,0}
\definecolor{darkyellow}{rgb}{0.7,0.7,0}


\usepackage{cleveref}
\crefname{codecount}{Code}{Codes}

\title{Using Machine Learning to Identify Similar Jobs Based on their IO Behavior}
\author{Julian Kunkel\inst{1} \and Eugen Betke\inst{2}}

\institute{
University of Reading--%
\email{j.m.kunkel@reading.ac.uk}%
\and
DKRZ --
\email{betke@dkrz.de}%
}
\begin{document}
\maketitle

\begin{abstract}

Support staff may find that a particular job is not performing well.
How can we then find similar jobs?

A difficulty lies in the definition of similarity.

In this paper, we illustrate a methodology and algorithms to identify similar jobs based on profiles and time series.
Similar to a study.

Research question: is this approach effective for finding similar jobs?

The contribution of this paper...
\end{abstract}

\section{Introduction}

%This paper is structured as follows.
%We start with the related work in \Cref{sec:relwork}.
%Then, in TODO we introduce the DKRZ monitoring systems and explain how I/O metrics are captured by the collectors.
%In \Cref{sec:methodology} we describe the data reduction and the machine learning approaches and do an experiment in \Cref{sec:data,sec:evaluation}.
%Finally, we finalize our paper with a summary in \Cref{sec:summary}.

\section{Related Work}
\label{sec:relwork}

\section{Methodology}
\label{sec:methodology}

Given the ID of the reference job, we create a feature set from the 4D time series data (nodes, file systems, 9 metrics, time).

Adapt the algorithms:
\begin{itemize}
	\item iterate over all jobs
		\begin{itemize}
			\item compute the distance to the reference job
		\end{itemize}
	\item sort the jobs by their distance to the reference job
	\item create the cumulative job distribution based on distance for visualization, and allow users to output the jobs within a given distance
\end{itemize}

A user might be interested in exploring, say, the closest 10 or 50 jobs.
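
The steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used for this paper: the names \texttt{rank\_jobs}, \texttt{closest}, and \texttt{within} are hypothetical, each job is assumed to be reduced to a flat feature vector, and the Euclidean distance is only a stand-in for the algorithms discussed in this section.

\begin{lstlisting}[language=Python, basicstyle=\scriptsize\ttfamily]
from typing import Callable, Dict, List, Sequence, Tuple

Features = Sequence[float]
Ranking = List[Tuple[str, float]]


def euclidean(a: Features, b: Features) -> float:
    """Stand-in distance (assumption); any job distance can be used."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def rank_jobs(reference_id: str,
              jobs: Dict[str, Features],
              distance: Callable[[Features, Features], float] = euclidean
              ) -> Ranking:
    """Compute the distance of every other job to the reference job
    and return (job_id, distance) pairs sorted by ascending distance."""
    ref = jobs[reference_id]
    ranked = [(job_id, distance(ref, features))
              for job_id, features in jobs.items()
              if job_id != reference_id]
    ranked.sort(key=lambda pair: pair[1])
    return ranked


def closest(ranked: Ranking, k: int) -> Ranking:
    """Return the k jobs closest to the reference job."""
    return ranked[:k]


def within(ranked: Ranking, max_distance: float) -> Ranking:
    """Return all jobs whose distance does not exceed max_distance."""
    return [pair for pair in ranked if pair[1] <= max_distance]
\end{lstlisting}

Passing a different \texttt{distance} function swaps in another algorithm without changing the ranking logic.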

Algorithms:
Profile algorithm: job-profiles (job-duration, job-metrics, or both combined)
$\rightarrow$ compute the geom-mean distance between the profiles.

Check time series algorithms:

\begin{itemize}
	\item bin
	\item hex\_native
  \item hex\_lev
	\item hex\_quant
\end{itemize}
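
The algorithm names above are not defined further in this draft. Assuming, purely from its name, that hex\_lev compares two jobs coded as strings via the Levenshtein edit distance, a similarity in $[0,1]$ could be sketched as follows; the string coding and the normalization by the longer string are assumptions made for illustration.

\begin{lstlisting}[language=Python, basicstyle=\scriptsize\ttfamily]
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical codings,
    0.0 if every position has to be edited."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
\end{lstlisting}

Normalizing by the longer string keeps the value comparable between jobs of different length.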

\section{Evaluation}
\label{sec:evaluation}

In the following, we assume a job is given and we aim to identify similar jobs.
We chose several reference jobs with different compute and IO characteristics visualized in \Cref{fig:refJobs}:
\begin{itemize}
	\item Job-S: performs post-processing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so-called CMORization). The post-processing is IO-intensive.
	\item Job-M: a typical MPI-parallel 8-hour compute job on 128 nodes that writes time series data after some spin-up.   %CHE.ws12
	\item Job-L: a 66-hour 20-node job.
	The initialization data is read at the beginning.
	Then only a single master node constantly writes a small volume of data; in fact, the generated data is too small to be categorized as IO-relevant.
\end{itemize}

For each reference job and algorithm, we created a CSV file with the computed similarity for all other jobs.


Should we say something about the runtime of the algorithms? I think having data would be useful.

Create histograms + cumulative job distribution for all algorithms.
Insert job profiles for closest 10 jobs.
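
For reference, the cumulative job distribution mentioned above can be written as an empirical cumulative distribution function. The symbols are introduced here for illustration only: let $s_1, \dots, s_n$ denote the computed similarities of the $n$ other jobs to the reference job. Then
\[
	\hat{F}_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[ s_i \le s \right],
\]
where $\mathbf{1}[\cdot]$ is $1$ if the condition holds and $0$ otherwise. $\hat{F}_n(s)$ is thus the fraction of jobs whose similarity to the reference job is at most $s$.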

Potentially, analyze what the rankings of the different similarities look like.


\begin{figure}
\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries4296426}
\caption{Job-S} \label{fig:job-S}
\end{subfigure}
\centering

\caption{Reference jobs: timeline of mean IO activity}
\label{fig:refJobs}
\end{figure}


\begin{figure}\ContinuedFloat

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries5024292}
\caption{Job-M} \label{fig:job-M}
\end{subfigure}
\centering

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job-timeseries7488914-30.pdf}
\caption{Job-L (first 30 segments of 400; remaining segments are similar)}
\label{fig:job-L}
\end{subfigure}
\centering
\caption{Reference jobs: timeline of mean IO activity; timelines not shown are 0}
\end{figure}


\begin{figure}

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/ecdf.png}
\caption{Job-S} \label{fig:ecdf-job-S}
\end{subfigure}
\centering

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/ecdf.png}
\caption{Job-M} \label{fig:ecdf-job-M}
\end{subfigure}
\centering

\begin{subfigure}{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/ecdf.png}
\caption{Job-L} \label{fig:ecdf-job-L}
\end{subfigure}
\centering
\caption{Empirical cumulative distribution function}
\label{fig:ecdf}
\end{figure}


\begin{figure}

\begin{subfigure}{0.5\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_4296426-out/hist-sim}
\caption{Job-S} \label{fig:hist-job-S}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_5024292-out/hist-sim}
\caption{Job-M} \label{fig:hist-job-M}
\end{subfigure}

\begin{subfigure}{0.5\textwidth}
\centering
\includegraphics[width=\textwidth]{job_similarities_7488914-out/hist-sim}
\caption{Job-L} \label{fig:hist-job-L}
\end{subfigure}
\centering
\caption{Histogram for the number of jobs (bin width: 2.5\%, numbers are the actual job counts)}
\label{fig:hist}
\end{figure}


\section{Summary and Conclusion}
\label{sec:summary}

%\printbibliography
\end{document}