Bugfix bib.

Julian M. Kunkel 2020-10-23 18:33:30 +01:00
parent cceff91731
commit 966715a082
2 changed files with 31 additions and 9 deletions

View File

@@ -135,6 +135,7 @@
year={2013}
}
@inproceedings{evans2014comprehensive,
title={{Comprehensive resource use monitoring for HPC systems with TACC stats}},
author={Evans, Todd and Barth, William L and Browne, James C and DeLeon, Robert L and Furlani, Thomas R and Gallo, Steven M and Jones, Matthew D and Patra, Abani K},
@@ -160,9 +161,11 @@
year={2020}
}
@article{betke20,
title={The Importance of Temporal Behavior when Classifying Job IO Patterns Using Machine Learning Techniques},
author={Betke, Eugen and Kunkel, Julian}
@article{simakov2018workload,
title={{A Workload Analysis of NSF's Innovative HPC Resources Using XDMoD}},
author={Simakov, Nikolay A and White, Joseph P and DeLeon, Robert L and Gallo, Steven M and Jones, Matthew D and Palmer, Jeffrey T and Plessinger, Benjamin and Furlani, Thomas R},
journal={arXiv preprint arXiv:1801.04306},
year={2018}
}
@@ -173,3 +176,22 @@
pages={1--8},
year={2018}
}
@incollection{chan2019resource,
title={{A Resource Utilization Analytics Platform Using Grafana and Telegraf for the Savio Supercluster}},
author={Chan, Nicolas},
booktitle={Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning)},
pages={1--6},
year={2019}
}
@article{betke20,
title={The Importance of Temporal Behavior when Classifying Job IO Patterns Using Machine Learning Techniques},
author={Betke, Eugen and Kunkel, Julian}
}
@article{Eugen20HPS,
title={TODO JHPS version},
author={Betke, Eugen and Kunkel, Julian}
}

View File

@@ -111,7 +111,7 @@ Secondly, they aim to improve the efficiency of all workflows -- represented as
In order to optimize a single job, its behavior and resource utilization must be monitored and then assessed.
Only rarely will users liaise with staff and explicitly request a performance analysis and optimization.
Therefore, data centers deploy monitoring systems, and staff must proactively identify candidates for optimization.
Monitoring tools such as \cite{Grafana} and \cite{XDMod} provide various statistics and time-series data for job execution.
Monitoring tools such as TACC Stats \cite{evans2014comprehensive}, Grafana \cite{chan2019resource}, and XDMod \cite{simakov2018workload} provide various statistics and time-series data for job execution.
The support staff should focus on workloads for which optimization is beneficial; for instance, analyzing a job that is executed only once on a medium number of nodes costs human resources and yields a poor return on investment.
By ranking jobs based on these statistics, staff can easily find a job that exhibits extensive usage of computing, network, and IO resources.
@@ -129,7 +129,7 @@ Job names are defined by users; while a similar name may hint to be a similar wo
\eb{The information about why support staff should look for similar jobs is still missing here. As I understand it, if a job causes problems, then similar jobs may cause similar problems.}
\eb{The benefit for the user is not entirely clear. Why should a user search for similar jobs?}
In our previous paper \cite{XXX}, we developed several distance measures and algorithms for the clustering of jobs based on the time series of their IO behavior.
In our previous paper \cite{Eugen20HPS}, we developed several distance measures and algorithms for the clustering of jobs based on the time series of their IO behavior.
The distance measures can be applied to jobs with different runtimes and node counts, but they differ in the way they define similarity.
We showed that the metrics can be used to cluster jobs; however, it remained unclear whether data center staff can use the method to effectively explore jobs similar to a reference job.
In this article, we refine these distance measures slightly and apply them to rank jobs based on their similarity to a reference job.
@@ -197,13 +197,13 @@ Therefore, we first need to define how a job's data is represented, then describ
On the Mistral supercomputer at DKRZ, the monitoring system \cite{betke20} gathers nine IO metrics for the two Lustre file systems at 10s intervals on all nodes, together with general job metadata from the SLURM workload manager.
The results are 4D data (time, nodes, metrics, file system) per job.
The distance measures should handle jobs of different lengths and node counts.
In \cite{TODOPaper}, we discussed a variety of options from 1D job-profiles to data reductions to compare time series data and the general workflow and pre-processing in detail.
In \cite{Eugen20HPS}, we discussed a variety of options from 1D job-profiles to data reductions to compare time series data and the general workflow and pre-processing in detail.
In a nutshell, each job executed on Mistral is partitioned into 10-minute segments; for each segment, we compute the arithmetic mean of each metric and categorize the value as non-IO (0), HighIO (1), or CriticalIO (4) for values below the 99th percentile, up to the 99.9th percentile, and above it, respectively.
After the data is reduced across nodes, we quantize the timelines using either a binary or a hexadecimal representation, which is then ready for similarity analysis.
By pre-filtering jobs with no I/O activity -- their sum across all dimensions and time series is equal to zero -- we reduce the dataset from about 1 million jobs to about 580k jobs.
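As an illustration of this pre-processing, the following minimal sketch shows the segmentation and categorization step for a single metric of one job (assuming NumPy; the function name, variable names, and the precomputed population-wide percentile thresholds are illustrative and not taken from the monitoring pipeline):
\begin{verbatim}
import numpy as np

SEGMENT_LEN = 60  # one 10-minute segment = 60 samples at 10s intervals

def categorize_segments(values, p99, p999):
    # Mean per 10-minute segment, mapped to non-IO (0), HighIO (1),
    # or CriticalIO (4) relative to the population-wide percentiles.
    n_seg = len(values) // SEGMENT_LEN
    segments = values[:n_seg * SEGMENT_LEN].reshape(n_seg, SEGMENT_LEN)
    means = segments.mean(axis=1)
    cats = np.zeros(n_seg, dtype=int)
    cats[means > p99] = 1    # HighIO; overridden below if also > p999
    cats[means > p999] = 4   # CriticalIO
    return cats
\end{verbatim}
A job whose categorized values sum to zero across all metrics, nodes, and file systems would then be dropped by the pre-filter described above.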
\subsection{Algorithms for Computing Similarity}
We reuse the algorithms developed in \cite{TODO}: BIN\_all, BIN\_aggzeros, HEX\_native, HEX\_lev, and HEX\_quant.
We reuse the algorithms developed in \cite{Eugen20HPS}: BIN\_all, BIN\_aggzeros, HEX\_native, HEX\_lev, and HEX\_quant.
They differ in the way data similarity is defined: either the binary or the hexadecimal coding is used, and the distance measure is, in most cases, the Euclidean distance or the Levenshtein distance.
For jobs of different lengths, we apply a sliding-window approach that finds the location in the longer job where the shorter job matches with the highest similarity.
The HEX\_quant algorithm extracts I/O phases and computes the similarity between the most similar I/O phases of both jobs.
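To make the sliding-window idea concrete, a minimal sketch follows (illustrative only; it assumes each job is already reduced to a non-empty sequence of per-segment codes, and the helper names are ours, not those of the actual implementation):
\begin{verbatim}
def levenshtein(a, b):
    # Plain dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def sliding_window_similarity(short_seq, long_seq):
    # Best similarity of the shorter job against any equally long
    # window of the longer job; 1.0 means an identical window exists.
    # Assumes len(long_seq) >= len(short_seq) > 0.
    best = 0.0
    for start in range(len(long_seq) - len(short_seq) + 1):
        window = long_seq[start:start + len(short_seq)]
        best = max(best, 1.0 - levenshtein(short_seq, window) / len(short_seq))
    return best
\end{verbatim}
A Euclidean-distance variant would swap the edit distance for a per-segment difference while keeping the same window search.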
@@ -263,7 +263,7 @@ For this study, we chose several reference jobs with different compute and IO ch
\end{itemize}
The segmented timelines of the jobs are visualized in \Cref{fig:refJobs} -- remember that the mean value is computed across all nodes.
This coding is also used for the HEX class of algorithms, thus this representation is what the algorithms will analyze; BIN algorithms merge all timelines together as described in \cite{TODO}.
This coding is also used for the HEX class of algorithms, thus this representation is what the algorithms will analyze; BIN algorithms merge all timelines together as described in \cite{Eugen20HPS}.
The figures show the values of the active metrics ($\neq 0$); if only a few are active, they are shown in one timeline; otherwise, they are rendered individually to provide a better overview.
For example, we can see in \Cref{fig:job-S} that several metrics increase in Segment\,6.
@@ -959,5 +959,5 @@ That would increase the likelihood that these jobs are very similar and what the
The KS algorithm finds jobs with similar histograms, which are not necessarily what we are looking for.
%\printbibliography
\printbibliography
\end{document}