Anonymization ^-^
This commit is contained in:
parent 7be00c5a3b
commit fed2f1aa47
@@ -70,16 +70,17 @@
 \crefname{codecount}{Code}{Codes}

 \title{A Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis}
-\author{Julian Kunkel\inst{2} \and Eugen Betke\inst{1}}
+%\author{Julian Kunkel\inst{2} \and Eugen Betke\inst{1}}


-\institute{
-University of Reading--%
-\email{j.m.kunkel@reading.ac.uk}%
-\and
-DKRZ --
-\email{betke@dkrz.de}%
-}
+%\institute{
+%University of Reading--%
+%\email{j.m.kunkel@reading.ac.uk}%
+%\and
+%DKRZ --
+%\email{betke@dkrz.de}%
+%}

 \begin{document}
 \maketitle

@@ -126,10 +127,10 @@ It is non-trivial to identify jobs with similar behavior from the pool of execut
 Re-executing the same job will lead to slightly different behavior, a program may be executed with different inputs or using a different configuration (e.g., number of nodes).
 Job names are defined by users; while a similar name may hint at a similar workload, finding other applications with the same IO behavior would not be possible.

-In our previous paper \cite{Eugen20HPS}, we developed several distance measures and algorithms for the clustering of jobs based on the time series of their IO behavior.
+In the paper \cite{Eugen20HPS}, the authors developed several distance measures and algorithms for the clustering of jobs based on the time series of their IO behavior.
 The distance measures can be applied to jobs with different runtimes and numbers of nodes utilized but differ in the way they define similarity.
-We showed that the metrics can be used to cluster jobs, however, it remains unclear if the method can be used by data center staff to explore jobs of a reference job effectively.
-In this article, we refined these distance measures slightly and apply them to rank jobs based on their similarity to a reference job.
+They showed that the metrics can be used to cluster jobs; however, it remained unclear whether data center staff can use the method to effectively explore jobs similar to a reference job.
+In this article, we refine these distance measures slightly and apply them to rank jobs based on their similarity to a reference job.
 Therefore, we perform a study on three reference jobs, each with a different character.
 We also utilize the Kolmogorov-Smirnov test to illustrate the benefits and drawbacks of the different methods.

@@ -193,8 +194,8 @@ Therefore, we first need to define how a job's data is represented, then describ
 On the Mistral supercomputer at DKRZ, the monitoring system \cite{betke20} gathers, in 10s intervals on all nodes, nine IO metrics for the two Lustre file systems together with general job metadata from the SLURM workload manager.
 The result is 4D data (time, nodes, metrics, file system) per job.
 The distance measures should handle jobs of different lengths and node counts.
-In \cite{Eugen20HPS}, we discussed a variety of options from 1D job-profiles to data reductions to compare time series data and the general workflow and pre-processing in detail.
-In a nutshell, for each job executed on Mistral, we partition it into 10-minute segments and compute the arithmetic mean of each metric, categorize the value into non-IO (0), HighIO (1), and CriticalIO (4) for values below 99-percentile, up to 99.9-percentile, and above, respectively.
+In \cite{Eugen20HPS}, the authors discussed a variety of options, from 1D job-profiles to data reductions, for comparing time series data, and described the general workflow and pre-processing in detail. We use their data.
+In a nutshell, for each job executed on Mistral, they partitioned it into 10-minute segments, computed the arithmetic mean of each metric, and categorized the value into non-IO (0), HighIO (1), and CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above, respectively.
 The fixed interval of 10 minutes ensures the portability of the approach to other HPC systems.
 After the mean value across nodes is computed for a segment, the resulting numeric value is encoded either using a binary (IO activity on the segment: yes/no) or hexadecimal representation (quantizing the numerical performance value into 0-15), which is then ready for similarity analysis.
 By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is zero -- we reduce the dataset from about 1 million jobs to about 580k jobs.
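The segment coding summarized in the hunk above can be sketched as follows. This is a minimal stdlib illustration with our own names (`code_segments`, `SEGMENT_LEN`), not the authors' code; in the real workflow, the percentile thresholds are computed per metric across the whole system.

```python
SEGMENT_LEN = 60  # one 10-minute segment at the 10 s monitoring interval

def code_segments(series, p99, p999):
    """Map each complete 10-minute segment of a metric's time series to a
    category code: non-IO (0) below the 99th percentile, HighIO (1) up to
    the 99.9th percentile, CriticalIO (4) above it."""
    codes = []
    for i in range(len(series) // SEGMENT_LEN):
        segment = series[i * SEGMENT_LEN:(i + 1) * SEGMENT_LEN]
        mean = sum(segment) / SEGMENT_LEN  # arithmetic mean per segment
        if mean < p99:
            codes.append(0)   # non-IO
        elif mean <= p999:
            codes.append(1)   # HighIO
        else:
            codes.append(4)   # CriticalIO
    return codes
```

With illustrative thresholds `p99=5.0` and `p999=50.0`, a 30-minute series of idle, moderate, and heavy activity yields the codes `[0, 1, 4]`.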
@@ -206,7 +207,7 @@ B-all determines similarity between binary codings by means of Levenshtein dista
 B-aggz is similar to B-all, but computes similarity on binary codings where subsequent segments of zero activity are replaced by just one zero.
 Q-lev determines similarity between quantized codings by using Levenshtein distance.
 Q-native uses a performance-aware similarity function, i.e., the distance between two jobs for a metric is $\frac{|m_{job1} - m_{job2}|}{16}$.
-For jobs with different lengths, we apply a sliding-windows approach which finds the location for the shorter job in the long job with the highest similarity.
+For jobs with different lengths, a sliding-window approach is applied which finds the location in the longer job where the shorter job matches with the highest similarity.
 Q-phases extracts phase information and performs a phase-aware and performance-aware similarity computation.
 The Q-phases algorithm extracts I/O phases and computes the similarity between the most similar I/O phases of both jobs.
 In this paper, we add a similarity definition based on the Kolmogorov-Smirnov test that compares the probability distributions of the observed values, which we describe in the following.
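Two of the building blocks named above can be sketched in a few lines: the Levenshtein (edit) distance used by B-all, B-aggz, and Q-lev on the coded strings, and the per-metric Q-native term $\frac{|m_{job1} - m_{job2}|}{16}$. The function names are ours and this is an illustrative sketch, not the authors' implementation.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two codings,
    keeping only the previous row to stay O(min-memory)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def q_native_distance(m1, m2):
    """Per-metric Q-native distance for quantized values in 0..15."""
    return abs(m1 - m2) / 16
```

For example, the hexadecimal codings `"0101"` and `"0111"` have edit distance 1, and the maximal Q-native distance between two quantized values is 15/16.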
@@ -221,7 +222,7 @@ This reduces the four-dimensional dataset to two dimensions (time, metrics).
 % Aggregation
 The reduction of the file system dimension by the mean function ensures the time series values stay in the range between 0 and 4, independently of how many file systems are present on an HPC system.
 Unlike the previous similarity definitions, the concatenation of time series on the node dimension preserves the individual I/O information of all nodes while still allowing comparison of jobs with a different number of nodes.
-We apply no aggregation function to the metric dimension.
+No aggregation is performed on the metric dimension.

 % Filtering
 %Zero-jobs are jobs with no sign of significant I/O load and are of little interest in the analysis.

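The dimension reduction described in the last hunk can be sketched as follows, assuming a nested 4D structure indexed as (time, nodes, metrics, file_systems); the names are ours, not the paper's. Averaging over file systems keeps each value in the 0-4 category range, and concatenating the per-node time series yields the 2D (time * nodes, metrics) representation while preserving node-level I/O detail.

```python
def reduce_job(job):
    """job[t][n][m][f] -> list of (time * nodes) rows, one column per metric:
    mean over file systems, then concatenation of the node time series."""
    times = len(job)
    nodes = len(job[0])
    metrics = len(job[0][0])
    rows = []
    for n in range(nodes):              # concatenate node time series
        for t in range(times):
            row = []
            for m in range(metrics):
                fs_vals = job[t][n][m]
                row.append(sum(fs_vals) / len(fs_vals))  # mean over file systems
            rows.append(row)
    return rows
```

A job with 2 time steps, 2 nodes, 1 metric, and 2 file systems thus reduces to 4 rows of 1 value each, all still within the original category range.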