@ -118,7 +118,6 @@ Related work can be classified into distance measures, analysis of HPC applicati

%% DISTANCE MEASURES

The ranking of similar jobs performed in this article is related to clustering strategies.

Levenshtein (Edit) distance is a widely used distance metric that gives the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one string into another \cite{navarro2001guided}.
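
To make the term concrete, the standard textbook recurrence for the distance between the prefixes $a_{1..i}$ and $b_{1..j}$ of two strings $a$ and $b$ is sketched below; the notation $\mathrm{lev}$ is introduced here for illustration only and is not taken from the cited work.
\[
\mathrm{lev}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0,\\
\min\bigl\{\mathrm{lev}(i-1,j)+1,\; \mathrm{lev}(i,j-1)+1,\; \mathrm{lev}(i-1,j-1)+[a_i \neq b_j]\bigr\} & \text{otherwise,}
\end{cases}
\]
where $[a_i \neq b_j]$ is 1 if the two characters differ and 0 otherwise. For example, converting ``kitten'' into ``sitting'' takes three edits (two substitutions and one insertion).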

\eb{What does ``Edit'' mean?}

The comparison of time series using various metrics has been extensively investigated.

In \cite{khotanlou2018empirical}, an empirical comparison of distance measures for the clustering of multivariate time series is performed.

Fourteen similarity measures are applied to 23 data sets.

@ -162,11 +161,9 @@ Therefore, we first need to define how a job's data is represented, then describ

On the Mistral supercomputer at DKRZ, the monitoring system \cite{betke20} gathers nine I/O metrics for the two Lustre file systems on all nodes at ten-second intervals, together with general job metadata from the SLURM workload manager.

The result is 4D data (time, nodes, metrics, file system) per job.

The distance measures should handle jobs of different lengths and node counts.

In the open-access article \cite{Eugen20HPS}\footnote{\scriptsize\url{https://zenodo.org/record/4478960/files/jhps-incubator-06-temporal-29-jan.pdf}}, we discussed a variety of options from 1D job-profiles to data reductions to compare time series data and the general workflow and pre-processing in detail.

\eb{The duplicate reference (in the footnote and in the bibliography) looks like a mathematical equation.}

In the open-access article \cite{Eugen20HPS}, we discussed a variety of options from 1D job-profiles to data reductions to compare time series data and the general workflow and pre-processing in detail.

We will be using this representation.

In a nutshell, for each job executed on Mistral, we partition it into 10-minute segments\footnote{We found in preliminary experiments that 10 minutes reduces noise, i.e., the variation of the statistics when re-running the same job.}, compute the arithmetic mean of each metric, and categorize the value into NonIO (0), HighIO (1), and CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above, respectively.

\eb{Noise is not quite correct. The problem is rather the amount of data, because it is not easy to process.}

In a nutshell, for each job executed on Mistral, we partition it into 10-minute segments\footnote{We found in preliminary experiments that 10 minutes reduces compute time and noise, i.e., the variation of the statistics when re-running the same job.}, compute the arithmetic mean of each metric, and categorize the value into NonIO (0), HighIO (1), and CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above, respectively.

The values are chosen to be 0, 1, and 4 because we arithmetically derive metrics: naturally, the value of 0 will indicate that no I/O issue appears; we weight critical I/O to be 4x as important as high I/O.

This strategy ensures that the same approach can be applied to other HPC systems regardless of the actual distribution of these statistics at the respective data center.

After the mean value across nodes is computed for a segment, the resulting numeric value is encoded either using binary (I/O activity on the segment: yes/no) or hexadecimal representation (quantizing the numerical performance value into 0-15) which is then ready for similarity analysis.
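
As an illustration with hypothetical values, consider one segment of a job running on four nodes whose categories for a single metric are 0, 1, 4, and 0. The mean across nodes is then
\[
\bar{c} = \frac{0 + 1 + 4 + 0}{4} = 1.25 .
\]
Assuming that any non-zero mean counts as I/O activity, the binary coding would mark this segment as active; how $\bar{c}$ is mapped onto the 16 hexadecimal levels is not restated here.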

@ -182,7 +179,7 @@ Q-lev determines the similarity between quantized codings by using Levenshtein d

Q-native uses a performance-aware similarity function, i.e., the distance between two jobs for a metric is $\frac{|m_{\text{job1}}- m_{\text{job2}}|}{16}$.
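
For instance, assuming that $m$ denotes a quantized value in the range 0--15 and taking the hypothetical values $m_{\text{job1}} = 12$ and $m_{\text{job2}} = 4$, the distance is
\[
\frac{|12 - 4|}{16} = 0.5 ,
\]
and, under this assumption, the largest possible distance is $\frac{15}{16}$.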

%There are various options for how a longer job is embedded in a shorter job, for example, a larger input file may stretch the length of the I/O and compute phases; another option can be that more (model) time is simulated.

One of our basic considerations is that a short job may run longer, e.g., when restarted with a larger input file (which can stretch the length of the I/O and compute phases) or when run with more simulation steps.

\eb{The sentence above was rewritten. Check whether it fits.}

There are further ways in which a longer job may be related to a shorter job, but we do not consider them for now.

In this article, we consider these different behavioral patterns and attempt to identify situations where the I/O pattern of a shorter job is contained in a longer job.

Therefore, for jobs of different lengths, a sliding-window approach is applied, which finds the location in the longer job at which the shorter job matches with the highest similarity.
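
One possible formalization, with notation introduced here for illustration only: let $s$ be the coding of the shorter job with $n$ segments, $l$ the coding of the longer job with $m \geq n$ segments, and $\mathrm{sim}$ the similarity of two codings of equal length. The similarity of the two jobs is then
\[
\mathrm{sim}_{\text{sw}}(s, l) = \max_{0 \leq k \leq m-n} \mathrm{sim}\bigl(s,\; l_{k+1..k+n}\bigr),
\]
where $l_{k+1..k+n}$ denotes the window of $n$ consecutive segments of $l$ starting at offset $k$.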

Q-phases extracts phase information and performs a phase-aware and performance-aware similarity computation.

@ -435,8 +432,6 @@ This was the first exploration of this methodology.

In the future, we will expand the study by comparing more jobs to assess the suitability of the methodology.

\eb{Is it actually permissible to place a figure in the middle of the bibliography? If not, one could set a boundary with a FloatBarrier (see code). However, the paper would then be 13 pages long.}