@@ -118,7 +118,6 @@ Related work can be classified into distance measures, analysis of HPC applicati
%% DISTANCE MEASURES
The ranking of similar jobs performed in this article is related to clustering strategies.
Levenshtein (edit) distance is a widely used distance metric indicating the minimum number of single-character edits (insertions, deletions, and substitutions) needed to convert one string into another \cite{navarro2001guided}.
\eb{What does ``Edit'' mean?}
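For illustration, a minimal Python sketch of this metric (our own illustration, not code from the cited survey):
\begin{verbatim}
def levenshtein(a: str, b: str) -> int:
    # Minimum number of single-character insertions, deletions, and
    # substitutions needed to turn a into b (dynamic programming,
    # keeping only the previous row of the DP table).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# levenshtein("kitten", "sitting") == 3
\end{verbatim}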
The comparison of time series using various metrics has been investigated extensively.
In \cite{khotanlou2018empirical}, an empirical comparison of distance measures for the clustering of multivariate time series is performed.
Fourteen similarity measures are applied to 23 data sets.
@@ -162,11 +161,9 @@ Therefore, we first need to define how a job's data is represented, then describ
On the Mistral supercomputer at DKRZ, the monitoring system \cite{betke20} gathers nine I/O metrics for the two Lustre file systems at ten-second intervals on all nodes, together with general job metadata from the SLURM workload manager.
The result is 4D data (time, nodes, metrics, file system) per job.
The distance measures should handle jobs of different lengths and node counts.
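As a hypothetical illustration of this representation (array shape and names are our assumption, not the monitoring system's actual layout):
\begin{verbatim}
import numpy as np

# Hypothetical per-job record, indexed as
# (time segment, node, metric, file system).
n_segments, n_nodes, n_metrics, n_fs = 42, 8, 9, 2
job_data = np.zeros((n_segments, n_nodes, n_metrics, n_fs))

# Example: mean of metric 0 on file system 0,
# averaged across nodes, per time segment.
mean_per_segment = job_data[:, :, 0, 0].mean(axis=1)
\end{verbatim}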
In the open-access article \cite{Eugen20HPS}\footnote{\scriptsize \url{https://zenodo.org/record/4478960/files/jhps-incubator-06-temporal-29-jan.pdf}}, we discussed a variety of options, from 1D job profiles to data reductions, for comparing time series data, and described the general workflow and pre-processing in detail.
\eb{The duplicated reference (in the footnote and in the bibliography) looks like a mathematical equation.}
In the open-access article \cite{Eugen20HPS}, we discussed a variety of options, from 1D job profiles to data reductions, for comparing time series data, and described the general workflow and pre-processing in detail.
We will be using this representation.
In a nutshell, each job executed on Mistral is partitioned into 10-minute segments\footnote{We found in preliminary experiments that 10 minutes reduces noise, i.e., the variation of the statistics when re-running the same job.}, the arithmetic mean of each metric is computed, and the value is categorized into NonIO (0), HighIO (1), and CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above, respectively.
\eb{``Noise'' is not quite correct. The problem is rather the amount of data, because it is not easy to process.}
In a nutshell, each job executed on Mistral is partitioned into 10-minute segments\footnote{We found in preliminary experiments that 10 minutes reduces compute time and noise, i.e., the variation of the statistics when re-running the same job.}, the arithmetic mean of each metric is computed, and the value is categorized into NonIO (0), HighIO (1), and CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above, respectively.
The values are chosen to be 0, 1, and 4 because we derive metrics arithmetically: a value of 0 naturally indicates that no I/O issue appears, while critical I/O is weighted four times as heavily as high I/O.
This strategy ensures that the same approach can be applied to other HPC systems regardless of the actual distribution of these statistics at that data center.
After the mean value across nodes is computed for a segment, the resulting numeric value is encoded using either a binary representation (I/O activity in the segment: yes/no) or a hexadecimal representation (quantizing the numerical performance value into 0--15), which is then ready for similarity analysis.
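A minimal Python sketch of this categorization and coding step (the thresholds are the per-metric system-wide percentiles; the rounding and function names are our assumptions):
\begin{verbatim}
def categorize(value: float, p99: float, p999: float) -> int:
    # NonIO (0) below the 99th percentile, HighIO (1) up to the
    # 99.9th percentile, CriticalIO (4) above it.
    if value < p99:
        return 0
    return 1 if value <= p999 else 4

def hex_code(node_categories: list[int]) -> str:
    # Mean category across nodes (in [0, 4]) for one segment and
    # metric, quantized into a single hexadecimal digit 0-15.
    mean = sum(node_categories) / len(node_categories)
    return format(round(mean * 15 / 4), "x")

def binary_code(node_categories: list[int]) -> str:
    # Binary coding: any I/O activity in the segment at all?
    return "1" if any(node_categories) else "0"
\end{verbatim}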
@@ -182,7 +179,7 @@ Q-lev determines the similarity between quantized codings by using Levenshtein d
Q-native uses a performance-aware similarity function, i.e., the distance between two jobs for a metric is $\frac{|m_{\text{job1}} - m_{\text{job2}}|}{16}$.
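Concretely, for two hexadecimal codings this per-metric distance could be computed as follows (a sketch under our assumptions, averaging over segments):
\begin{verbatim}
def q_native_distance(code1: str, code2: str) -> float:
    # Performance-aware distance between two equally long hexadecimal
    # codings of one metric: mean over segments of |m_job1 - m_job2| / 16,
    # where m is the quantized value 0-15 of a segment.
    assert len(code1) == len(code2)
    diffs = [abs(int(a, 16) - int(b, 16)) / 16
             for a, b in zip(code1, code2)]
    return sum(diffs) / len(diffs)

# q_native_distance("04f", "03f") == (1/16) / 3
\end{verbatim}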
%There are various options for how a longer job is embedded in a shorter job, for example, a larger input file may stretch the length of the I/O and compute phases; another option can be that more (model) time is simulated.
One of our basic considerations is that a short job may run longer, e.g., when restarted with a larger input file (which can stretch the length of the I/O and compute phases) or when run with more simulation steps.
\eb{The sentence above was rewritten. Check whether it still fits.}
There are further alternatives for how a longer job may relate to a shorter job, but we do not consider them for now.
In this article, we consider these different behavioral patterns and attempt to identify situations where the I/O pattern of a shorter job is contained in a longer job.
Therefore, for jobs of different lengths, a sliding-window approach is applied that finds the location in the longer job with the highest similarity to the shorter job.
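A sketch of such a sliding-window search, reusing the hypothetical q_native_distance above (similarity taken as 1 minus distance):
\begin{verbatim}
def best_window(short_code: str, long_code: str) -> tuple[int, float]:
    # Slide the shorter coding over the longer one and return the
    # offset with the highest similarity, together with that value.
    n = len(short_code)
    best_offset, best_sim = 0, -1.0
    for off in range(len(long_code) - n + 1):
        window = long_code[off:off + n]
        sim = 1.0 - q_native_distance(short_code, window)
        if sim > best_sim:
            best_offset, best_sim = off, sim
    return best_offset, best_sim
\end{verbatim}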
Q-phases extracts phase information and performs a phase-aware and performance-aware similarity computation.
@@ -435,8 +432,6 @@ This was the first exploration of this methodology.
In the future, we will expand the study by comparing more jobs in order to assess the suitability of the methodology.
\eb{May a figure actually be placed in the middle of the bibliography? If not, a boundary could be set with a FloatBarrier (see code). However, the paper would then be 13 pages.}
%\FloatBarrier
\printbibliography%