\let \accentvec \vec
\documentclass [] { llncs}
\usepackage { todonotes}
\newcommand { \eb } [1]{ \todo [inline] { (EB): #1} }
\newcommand { \jk } [1]{ \todo [inline] { JK: #1} }
\usepackage { silence}
\WarningFilter { biblatex} { Using}
\WarningFilter { latex} { Float too large}
\WarningFilter { caption} { Unsupported}
\WarningFilter { caption} { Unknown document}
\let \spvec \vec
\let \vec \accentvec
\usepackage { amsmath}
\let \vec \spvec
\usepackage { array}
\usepackage { xcolor}
\usepackage { color}
\usepackage { colortbl}
\usepackage { subcaption}
\usepackage { hyperref}
\usepackage { listings}
\usepackage { lstautogobble}
\usepackage [listings,skins,breakable,raster,most] { tcolorbox}
\usepackage { caption}
\lstset {
numberbychapter=false,
belowskip=-10pt,
aboveskip=-10pt,
}
\lstdefinestyle { lstcodebox} {
basicstyle=\scriptsize \ttfamily ,
autogobble=true,
tabsize=2,
captionpos=b,
float,
}
\usepackage { graphicx}
\graphicspath {
{ ./pictures/} ,
{ ../fig/}
}
\usepackage [backend=bibtex, style=numeric] { biblatex}
\addbibresource { bibliography.bib}
\usepackage { enumitem}
\setitemize { noitemsep,topsep=0pt,parsep=0pt,partopsep=0pt}
\definecolor { darkgreen} { rgb} { 0,0.5,0}
\definecolor { darkyellow} { rgb} { 0.7,0.7,0}
\usepackage { cleveref}
\crefname { codecount} { Code} { Codes}
\title { Using Machine Learning to Identify Similar Jobs Based on their IO Behavior}
\author { Julian Kunkel\inst { 1} \and Eugen Betke\inst { 2} }
\institute {
University of Reading--%
\email { j.m.kunkel@reading.ac.uk} %
\and
DKRZ --
\email { betke@dkrz.de} %
}
\begin { document}
\maketitle
\begin { abstract}
Support staff may encounter a particular job that is not performing well.
How can we then find similar jobs?
A difficulty lies in the definition of similarity.
In this paper, a methodology and algorithms to identify similar jobs based on profiles and time series are illustrated.
Similar to a study.
Research question: is this approach effective for finding similar jobs?
The contribution of this paper...
\end { abstract}
\section { Introduction}
%This paper is structured as follows.
%We start with the related work in \Cref{sec:relwork}.
%Then, in TODO we introduce the DKRZ monitoring systems and explain how I/O metrics are captured by the collectors.
%In \Cref{sec:methodology} we describe the data reduction and the machine learning approaches and do an experiment in \Cref{sec:data,sec:evaluation}.
%Finally, we finalize our paper with a summary in \Cref{sec:summary}.
\section { Related Work}
\label { sec:relwork}
\section { Methodology}
\label { sec:methodology}
Given: the ID of the reference job.
Create a feature set from the 4D time series data (nodes, file systems, 9 metrics, time).
Adapt the algorithms as follows:
\begin { itemize}
\item iterate for all jobs
\begin { itemize}
\item compute distance to reference job
\end { itemize}
\item sort the jobs by their distance to the reference job
\item create a cumulative job distribution over the distance for visualization, and allow users to output the jobs within a given distance
\end { itemize}
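The steps above can be sketched as follows. This is a minimal illustration, not the implementation used for this paper: the names \texttt{rank\_by\_distance}, \texttt{cumulative\_counts} and \texttt{closest} are hypothetical, the feature vectors are invented, and the Euclidean distance is only a placeholder for whichever distance the chosen algorithm defines.

\begin{lstlisting}[language=Python, basicstyle=\scriptsize\ttfamily]
import bisect
import math


def euclidean(a, b):
    """Placeholder metric (assumption); swap in the algorithm's own distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(features, ref_id, distance=euclidean):
    """Return (distance, job_id) pairs for all other jobs, closest first."""
    ref = features[ref_id]
    ranked = [(distance(ref, vec), job_id)
              for job_id, vec in features.items() if job_id != ref_id]
    ranked.sort()
    return ranked


def cumulative_counts(ranked, thresholds):
    """Number of jobs whose distance is <= each threshold."""
    dists = [d for d, _ in ranked]  # already sorted ascending
    return [bisect.bisect_right(dists, t) for t in thresholds]


def closest(ranked, k):
    """IDs of the k jobs closest to the reference job."""
    return [job_id for _, job_id in ranked[:k]]


# Invented feature vectors, for illustration only.
features = {
    "ref": [0.0, 0.0],
    "a": [3.0, 4.0],
    "b": [1.0, 0.0],
    "c": [0.0, 2.0],
}
ranked = rank_by_distance(features, "ref")
print(closest(ranked, 2))
print(cumulative_counts(ranked, [1.0, 2.0, 5.0]))
\end{lstlisting}

Keeping the ranked list sorted means that both the closest-$k$ query and the cumulative distribution can be read off the same structure without recomputing any distances.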
A user might be interested in exploring, say, the closest 10 or 50 jobs.
Algorithms:
Profile algorithm: job-profiles (job-duration, job-metrics, combine both)
$ \rightarrow $ just compute the geom-mean distance between profiles
Check time series algorithms:
\begin { itemize}
\item bin
\item hex\_native
\item hex\_lev
\item hex\_quant
\end { itemize}
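The algorithms in this list are not yet specified in this draft. As one possible reading, and purely as an assumption, the name hex\_lev suggests an edit distance over jobs that have been encoded as strings. The sketch below shows a length-normalized Levenshtein similarity between two such strings; the encoding, the function names and the normalization are all hypothetical and may differ from the actual algorithm.

\begin{lstlisting}[language=Python, basicstyle=\scriptsize\ttfamily]
def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Map the edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Invented job codings, for illustration only.
print(similarity("0012", "0112"))
\end{lstlisting}

Normalizing by the longer string keeps the score comparable between jobs of different duration, which is one reasonable design choice among several.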
\section { Evaluation}
\label { sec:evaluation}
In the following, we assume a job is given and we aim to identify similar jobs.
We chose several reference jobs with different compute and IO characteristics, visualized in \Cref{fig:refJobs}:
\begin { itemize}
\item Job-S: performs post-processing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so-called CMORization). The post-processing is IO intensive.
\item Job-M: a typical MPI parallel 8-hour compute job on 128 nodes that writes time series data after some spin-up. %CHE.ws12
\item Job-L: a 66-hour 20-node job.
The initialization data is read at the beginning.
Then only a single master node constantly writes a small volume of data; in fact, the generated data is too small to be categorized as IO relevant.
\end { itemize}
For each reference job and algorithm, we created a CSV file with the computed similarity to all other jobs.
Should we say something about the runtime of the algorithms? Having data on this would probably be useful.
Create histograms and cumulative job distributions for all algorithms.
Insert job profiles for the closest 10 jobs.
Potentially, analyze what the rankings produced by the different similarity measures look like.
\begin { figure}
\begin { subfigure} { 0.8\textwidth }
\includegraphics [width=\textwidth] { job-timeseries4296426}
\caption { Job-S} \label { fig:job-S}
\end { subfigure}
\caption { Reference jobs: timeline of mean IO activity}
\label { fig:refJobs}
\end { figure}
\begin { figure} \ContinuedFloat
\begin { subfigure} { 0.8\textwidth }
\includegraphics [width=\textwidth] { job-timeseries5024292}
\caption { Job-M} \label { fig:job-M}
\end { subfigure}
\begin { subfigure} { 0.8\textwidth }
\includegraphics [width=\textwidth] { job-timeseries7488914-30.pdf}
\caption { Job-L (first 30 segments of 400; remaining segments are similar)}
\label { fig:job-L}
\end { subfigure}
\caption { Reference jobs: timeline of mean IO activity; timelines that are not shown are zero}
\end { figure}
\section { Summary and Conclusion}
\label { sec:summary}
%\printbibliography
\end { document}