diff --git a/lncs-from-jhps/main.tex b/lncs-from-jhps/main.tex
index 1fb37f2..51e20c4 100644
--- a/lncs-from-jhps/main.tex
+++ b/lncs-from-jhps/main.tex
@@ -64,7 +64,7 @@ In particular, we sketch a methodology that utilizes temporal I/O similarity to
 Practically, we apply several previously developed time series algorithms. A study is conducted to explore the effectiveness of the approach by investigating related jobs for a reference job. The data stem from DKRZ's supercomputer Mistral and include more than 500,000 jobs that have been executed for more than 6 months of operation.
-Our analysis shows that the strategy and algorithms bear potential to identify similar jobs but more testing is necessary.
+Our analysis shows that the strategy and algorithms bear the potential to identify similar jobs, but more testing is necessary.
 \end{abstract}
@@ -84,7 +84,7 @@ The support staff should focus on workloads for which optimization is beneficial
 By ranking jobs based on their utilization, it is easy to find a job that exhibits extensive usage of computing, network, and I/O resources. However, would it be beneficial to investigate this workload in detail and potentially optimize it? For instance, a pattern that is observed in many jobs bears potential as the blueprint for optimizing one job may be applied to other jobs as well.
-This is particularly true when running one application with similar inputs but also different applications may lead to similar behavior.
+This is particularly true when running one application with similar inputs, but also different applications may lead to similar behavior.
 Knowing details about a problematic or interesting job may be transferred to similar jobs. Therefore, it is useful for support staff (or a user) that investigates a resource-hungry job to identify similar jobs that are executed on the supercomputer.
@@ -93,7 +93,7 @@ Re-executing the same job will lead to slightly different behavior, a program ma
 Job names are defined by users; while a similar name may hint to be a similar workload, finding other applications with the same I/O behavior would not be possible. In the paper \cite{Eugen20HPS}, we developed several distance measures and algorithms for the clustering of jobs based on the time series and their I/O behavior.
-These distance measures can be applied to jobs with different runtime and number of nodes utilized but differ in the way they define similarity.
+These distance measures can be applied to jobs with different runtimes and the number of nodes utilized, but differ in the way they define similarity.
 They showed that the metrics can be used to cluster jobs, however, it remained unclear if the method can be used by data center staff to explore similar jobs effectively. In this paper, we refine these algorithms slightly, include another algorithm, and apply them to rank jobs based on their temporal similarity to a reference job.
@@ -132,7 +132,7 @@ Vampir generally supports the clustering of process timelines of a single job, a
 %Chameleon \cite{bahmani2018chameleon} extends ScalaTrace for recording MPI traces but reduces the overhead by clustering processes and collecting information from one representative of each cluster.
 %For the clustering, a signature is created for each process that includes the call-graph.
-In \cite{halawa2020unsupervised}, 11 performance metrics including CPU and network are utilized for agglomerative clustering of jobs showing the general effectiveness of the approach.
+In \cite{halawa2020unsupervised}, 11 performance metrics including CPU and network are utilized for agglomerative clustering of jobs, showing the general effectiveness of the approach.
 In \cite{rodrigo2018towards}, a characterization of the NERSC workload is performed based on job scheduler information (profiles).
 Profiles that include the MPI activities have shown effective to identify the code that is executed \cite{demasi2013identifying}. Many approaches for clustering applications operate on profiles for compute, network, and I/O \cite{emeras2015evalix,liu2020characterization,bang2020hpc}.
@@ -155,10 +155,10 @@ Therefore, we first need to define how a job's data is represented, then describ
 On the Mistral supercomputer at DKRZ, the monitoring system \cite{betke20} gathers in ten seconds intervals on all nodes nine I/O metrics for the two Lustre file systems together with general job metadata from the SLURM workload manager. The results are 4D data (time, nodes, metrics, file system) per job. The distance measures should handle jobs of different lengths and node count.
-In the open access article \cite{Eugen20HPS}\footnote{\scriptsize \url{https://zenodo.org/record/4478960/files/jhps-incubator-06-temporal-29-jan.pdf}}, we discussed a variety of options from 1D job-profiles to data reductions to compare time series data and the general workflow and pre-processing in detail.
+In the open-access article \cite{Eugen20HPS}\footnote{\scriptsize \url{https://zenodo.org/record/4478960/files/jhps-incubator-06-temporal-29-jan.pdf}}, we discussed a variety of options from 1D job-profiles to data reductions to compare time series data and the general workflow and pre-processing in detail.
 We will be using this representation. In a nutshell, for each job executed on Mistral, they partitioned it into 10 minutes segments\footnote{We found in preliminary experiments that 10 minutes reduces noise, i.e., the variation of the statistics when re-running the same job.} and compute the arithmetic mean of each metric, categorize the value into NonIO (0), HighIO (1), and CriticalIO (4) for values below 99-percentile, up to 99.9-percentile, and above, respectively.
-The values are chosen to be 0, 1, and 4 because we arithmetically derive metrics: naturally the value of 0 will indicate that no I/O issue appears; we weight critical I/O to be 4x as important as high I/O.
+The values are chosen to be 0, 1, and 4 because we arithmetically derive metrics: naturally, the value of 0 will indicate that no I/O issue appears; we weight critical I/O to be 4x as important as high I/O.
 This strategy ensures that the same approach can be applied to other HPC systems regardless of the actual distribution of these statistics on that data center. After the mean value across nodes is computed for a segment, the resulting numeric value is encoded either using binary (I/O activity on the segment: yes/no) or hexadecimal representation (quantizing the numerical performance value into 0-15) which is then ready for similarity analysis. By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero, the dataset is reduced from 1 million jobs to about 580k jobs.
@@ -167,12 +167,12 @@ By pre-filtering jobs with no I/O activity -- their sum across all dimensions an
 \subsection{Algorithms for Computing Similarity}
 We reuse the B and Q algorithms developed in~\cite{Eugen20HPS}: B-all, B-aggz(eros), Q-native, Q-lev, and Q-phases. They differ in the way data similarity is defined; either the time series is encoded in binary or hexadecimal quantization, the distance measure is the Euclidean distance or the Levenshtein-distance.
-B-all determines similarity between binary codings by means of Levenshtein distance.
+B-all determines the similarity between binary codings by means of Levenshtein distance.
 B-aggz is similar to B-all, but computes similarity on binary codings where subsequent segments of zero activities are replaced by just one zero.
-Q-lev determines similarity between quantized codings by using Levenshtein distance.
+Q-lev determines the similarity between quantized codings by using Levenshtein distance.
 Q-native uses a performance-aware similarity function, i.e., the distance between two jobs for a metric is $\frac{|m_{job1} - m_{job2}|}{16}$. There are various options for how a longer job is embedded in a shorter job, for example, a larger input file may stretch the length of the I/O and compute phases; another option can be that more (model) time is simulated. In this article, we consider these different behavioral patterns and attempt to identify situations where the I/O pattern of a long job is contained in a shorter job. Therefore, for jobs with different lengths, a sliding-windows approach is applied which finds the location for the shorter job in the long job with the highest similarity.
-Q-phases extract phase information and performs a phase-aware and performance-aware similarity computation.
+Q-phases extracts phase information and performs a phase-aware and performance-aware similarity computation.
 The Q-phases algorithm extracts I/O phases from our 10-minute segments and computes the similarity between the most similar I/O phases of both jobs.
@@ -180,7 +180,7 @@ The Q-phases algorithm extracts I/O phases from our 10-minute segments and compu
 Our strategy for localizing similar jobs works as follows:
 \begin{itemize}
  \item A user\footnote{This can be support staff or a data center user that was executing the job.} provides a reference job ID and selects a similarity algorithm.
- \item The system iterates over all jobs of the job pool computing the similarity to the reference job using the specified algorithm.
+ \item The system iterates over all jobs of the job pool, computing the similarity to the reference job using the specified algorithm.
  \item It sorts the jobs based on the similarity to the reference job.
  \item It visualizes the cumulative job similarity allowing the user to understand how job similarity is distributed.
  \item The user starts the inspection by looking at the most similar jobs first.
@@ -188,12 +188,12 @@ Our strategy for localizing similar jobs works as follows:
 The user can decide about the criterion when to stop inspecting jobs; based on the similarity, the number of investigated jobs, or the distribution of the job similarity. For the latter, it is interesting to investigate clusters of similar jobs, e.g., if there are many jobs between 80-90\% similarity but few between 70-80\%.
-For the inspection of the jobs, a user may explore the job metadata, searching for similarities, and explore the time series of a job's I/O metrics.
+For the inspection of the jobs, a user may explore the job metadata, search for similarities, and explore the time series of a job's I/O metrics.
 \section{Reference Job}%
 \label{sec:refjobs}
-For this study, we chose the reference job called Job-M: a typical MPI parallel 8-hour compute job on 128 nodes which write time series data after some spin up. %CHE.ws12
+For this study, we chose the reference job called Job-M: a typical MPI parallel 8-hour compute job on 128 nodes that write time series data after some spin up. %CHE.ws12
 The segmented timelines of the job are visualized in \Cref{fig:refJobs} -- remember that the mean value is computed across all nodes on which the job ran. This coding is also used for the Q algorithms, thus this representation is what the algorithms will analyze; B algorithms merge all timelines together as described in~\cite{Eugen20HPS}. The figures show the values of active metrics ($\neq 0$); if few are active, then they are shown in one timeline, otherwise, they are rendered individually to provide a better overview.
@@ -222,10 +222,10 @@ Finally, the quantitative behavior of the 100 most similar jobs is investigated.
 To measure the performance for computing the similarity to the reference job, the algorithms are executed 10 times on a compute node at DKRZ which is equipped with two Intel Xeon E5-2680v3 @2.50GHz and 64GB DDR4 RAM. A boxplot for the runtimes is shown in \Cref{fig:performance}. The runtime is normalized for 100k jobs, i.e., for B-all it takes about 41\,s to process 100k jobs out of the 500k total jobs that this algorithm will process.
-Generally, the B algorithms are fastest, while the Q algorithms often take 4-5x as long.
-Q\_phases and Levenshtein based algorithm are significantly slower.
+Generally, the B algorithms are the fastest, while the Q algorithms often take 4-5x as long.
+Q\_phases and Levenshtein-based algorithms are significantly slower.
 Note that the current algorithms are sequential and executed on just one core.
-They could easily be parallelized which would then allow for an online analysis.
+They could easily be parallelized, which would then allow an online analysis.
 \begin{figure}
@@ -253,7 +253,7 @@ In the quantitative analysis, we explore the different algorithms how the simila
 The support team in a data center may have time to investigate the most similar jobs. Time for the analysis is typically bound, for instance, the team may analyze the 100 most similar jobs and rank them; we refer to them as the Top\,100 jobs, and \textit{Rank\,i} refers to the job that has the i-th highest similarity to the reference job -- sometimes these values can be rather close together as we see in the histogram in \Cref{fig:hist} for the actual number of jobs with a given similarity.
-As we focus on a feasible number of jobs, we crop it at 100 jobs (total number of jobs is still given).
+As we focus on a feasible number of jobs, we crop it at 100 jobs (the total number of jobs is still given).
 It turns out that both B algorithms produce nearly identical histograms, and we omit one of them.
 In the figures, we can see again a different behavior of the algorithms depending on the reference job. We can see a cluster with jobs of higher similarity (for B-all and Q-native at a similarity of 75\%).
@@ -273,7 +273,7 @@ Practically, the support team would start with Rank\,1 (most similar job, e.g.,
 When analyzing the overall population of jobs executed on a system, we expect that some workloads are executed several times (with different inputs but with the same configuration) or are executed with slightly different configurations (e.g., node counts, timesteps). Thus, potentially our similarity analysis of the job population may just identify the re-execution of the same workload.
-Typically, the support staff would identify the re-execution of jobs by inspecting job names which are user-defined generic strings.
+Typically, the support staff would identify the re-execution of jobs by inspecting job names, which are user-defined generic strings.
 To understand if the analysis is inclusive and identifies different applications, we use two approaches with our Top\,100 jobs: We explore the distribution of users (and groups), runtime, and node count across jobs.
@@ -284,8 +284,8 @@ To confirm the hypotheses presented, we analyzed the job metadata comparing job
 \paragraph{User distribution.}
 To understand how the Top\,100 are distributed across users, the data is grouped by userid and counted. \Cref{fig:userids} shows the stacked user information, where the lowest stack is the user with the most jobs and the topmost user in the stack has the smallest number of jobs.
-Jobs from 13 users are included; about 25\% of jobs stem from the same user; Q-lev, and Q-native include more users (29, 33, and 37, respectively) than the other three algorithms.
-We didn't include the group analysis in the figure as user count and group id is proportional, at most the number of users is 2x the number of groups.
+Jobs from 13 users are included; about 25\% of jobs stem from the same user; Q-lev and Q-native include more users (29, 33, and 37, respectively) than the other three algorithms.
+We didn't include the group analysis in the figure as user count and group id are proportional, at most the number of users is 2x the number of groups.
 Thus, a user is likely from the same group and the number of groups is similar to the number of unique users.
 \paragraph{Node distribution.}
@@ -295,7 +295,7 @@ We can observe that the range of nodes for similar jobs is between 1 and 128.
 \paragraph{Runtime distribution.}
 The job runtime of the Top\,100 jobs is shown using boxplots in \Cref{fig:runtime-job}.
-While all algorithms can compute the similarity between jobs of different length, the B algorithms and Q-native penalize jobs of different length preferring jobs of very similar length.
+While all algorithms can compute the similarity between jobs of different lengths, the B algorithms and Q-native penalize jobs of different lengths, preferring jobs of very similar lengths.
 Q-phases is able to identify much shorter or longer jobs.
 \begin{figure}
@@ -325,10 +325,10 @@ We subjectively found that the approach works very well and identifies suitable
 To demonstrate this, we include a selection of job timelines and selected interesting job profiles. Inspecting the Top\,100 is highlighting the differences between the algorithms. All algorithms identify a diverse range of job names for this reference job in the Top\,100.
-The number of unique names is 19, 38, 49, and 51 for B-aggzero, Q-phases, Q-native and Q-lev, respectively.
+The number of unique names is 19, 38, 49, and 51 for B-aggzero, Q-phases, Q-native, and Q-lev, respectively.
 When inspecting their timelines, the jobs that are similar according to the B algorithms (see \Cref{fig:job-M-bin-aggzero}) subjectively appear to us to be different.
-The reason lies in the definition of the B-* similarity which aggregate all I/O statistics into one timeline.
+The reason lies in the definition of the B-* similarity, which aggregates all I/O statistics into one timeline.
 The other algorithms like Q-lev (\Cref{fig:job-M-hex-lev}) and Q-native (\Cref{fig:job-M-hex-native}) seem to work as intended: While jobs exhibit short bursts of other active metrics even for low similarity, we can eyeball a relevant similarity particularly for Rank\,2 and Rank\,3 which have the high similarity of 90+\%. For Rank\,15 to Rank\,100, with around 70\% similarity, a partial match of the metrics is still given.
@@ -417,9 +417,9 @@ While jobs exhibit short bursts of other active metrics even for low similarity,
 We introduced a methodology to identify similar jobs based on timelines of nine I/O statistics. The quantitative analysis shows that a diverse set of results can be found and that only a tiny subset of the 500k jobs is very similar to our reference job representing a typical HPC activity. The Q-lev and Q-native work best according to our subjective qualitative analysis.
-Related jobs stems from the same user/group and may have a related job name, but the approach was able to find other jobs as well.
-This was a first exploration of this methodology.
-In the future, we will expand the study comparing more jobs in order to identify the suitability of the methodology.
+Related jobs stem from the same user/group and may have a related job name, but the approach was able to find other jobs as well.
+This was the first exploration of this methodology.
+In the future, we will expand the study by comparing more jobs in order to identify the suitability of the methodology.
 \printbibliography%
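
Note on the text touched by this diff: the context lines describe Levenshtein-based similarity on segment codings (B-all, Q-lev), zero-run aggregation (B-aggz), a sliding-window placement of the shorter job inside the longer one (Q-native), and ranking all jobs against a reference job. The following is a minimal Python sketch of those ideas, not the authors' implementation. All function names are hypothetical. The normalization `1 - distance / longer_length`, the averaging of the per-segment distance `|a - b| / 16`, and the absence of a length penalty are assumptions made only for illustration; the paper's actual definitions are in the cited article.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]


def lev_similarity(a: Sequence, b: Sequence) -> float:
    """ASSUMPTION: similarity = 1 - edit distance / length of the longer coding."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def aggregate_zeros(coding: Sequence[int]) -> List[int]:
    """Replace each run of consecutive zero segments by one zero (B-aggz idea)."""
    out: List[int] = []
    for value in coding:
        if value == 0 and out and out[-1] == 0:
            continue
        out.append(value)
    return out


def sliding_window_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Best placement of the shorter coding inside the longer one.

    Per-segment distance is |x - y| / 16 as stated in the text; taking the
    mean over segments and applying no length penalty are ASSUMPTIONS.
    """
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    best = 0.0
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        dist = sum(abs(x - y) / 16 for x, y in zip(short, window)) / len(short)
        best = max(best, 1.0 - dist)
    return best


def rank_jobs(reference: Sequence[int],
              pool: Dict[str, Sequence[int]],
              similarity: Callable[[Sequence[int], Sequence[int]], float],
              ) -> List[Tuple[str, float]]:
    """Score every job in the pool against the reference, most similar first."""
    scored = [(job_id, similarity(reference, coding))
              for job_id, coding in pool.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For example, `rank_jobs(ref_coding, pool, lev_similarity)` yields a list whose first entry corresponds to Rank 1 in the terminology of the paper; swapping in `sliding_window_similarity` changes only the scoring function, which mirrors the "user selects a similarity algorithm" step.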