KS dimension reduction and filtering

This commit is contained in:
Eugen Betke 2020-09-03 18:32:22 +02:00
parent ea893d76f0
commit e4dd65c064
1 changed file with 26 additions and 0 deletions

@@ -112,6 +112,32 @@ The contribution of this paper...
\section{Methodology}
\label{sec:methodology}
\ebadd{
% Summary
For the analysis of the Kolmogorov-Smirnov-based similarity we perform two preparation steps.
Dimension reduction by mean and concatenation functions allows us to reduce the four-dimensional dataset to two dimensions.
Pre-filtering omits jobs that are irrelevant in terms of performance and reduces the dataset further.
% Aggregation
The reduction of the file system dimension by the mean function ensures that the time series values stay in the range between 0 and 4, independently of how many file systems are present on an HPC system.
A fixed interval also ensures the portability of the approach to other HPC systems.
The concatenation of the time series along the node dimension preserves the I/O information of all nodes.
We apply no aggregation function to the metric dimension.
% Filtering
Zero-jobs, i.e., jobs with no sign of significant I/O load, are of little interest in the analysis.
Their sum across all dimensions and time series is equal to zero.
Furthermore, we filter out those jobs whose time series have fewer than 8 values; both filters are illustrated in the sketch at the end of this section.
% Similarity
For the analysis we use the kolmogorov-smirnov-test 1.1.0 Rust library from the official Rust Package Registry ``crates.io''.
The similarity function in \Cref{eq:ks_similarity} calculates the complement of the rejection probability $p_{\text{reject}}$.
}
\begin{equation}\label{eq:ks_similarity}
\text{similarity} = 1 - p_{\text{reject}}
\end{equation}
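The listing below is a minimal sketch of this computation.
It assumes the library is imported as \texttt{kolmogorov\_smirnov} and exposes a \texttt{test\_f64} function whose result carries a \texttt{reject\_probability} field, as the crate published on ``crates.io'' does; the confidence level of 0.95 is an illustrative choice, not a value prescribed by our method.
\begin{verbatim}
// Sketch: similarity as the complement of the KS rejection
// probability. The 0.95 confidence level is illustrative.
use kolmogorov_smirnov as ks;

fn similarity(reference: &[f64], candidate: &[f64]) -> f64 {
    let result = ks::test_f64(reference, candidate, 0.95);
    1.0 - result.reject_probability // similarity = 1 - p_reject
}
\end{verbatim}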
Given the reference job ID, we create a feature set from the 4D time series data (number of nodes, file systems, 9 metrics, time).
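The following listing sketches the two preparation steps on such data.
It is illustrative only and assumes the raw data is stored as nested vectors indexed as \texttt{data[node][file\_system][metric][time]}; all names are hypothetical.
\begin{verbatim}
/// 2D feature set: one concatenated time series per metric.
type FeatureSet = Vec<Vec<f64>>;

/// Reduce 4D data (node x file system x metric x time) to 2D:
/// mean over file systems, concatenation over nodes.
fn reduce(data: &Vec<Vec<Vec<Vec<f64>>>>) -> FeatureSet {
    let n_metrics = data[0][0].len();
    let mut features: FeatureSet = vec![Vec::new(); n_metrics];
    for node in data {
        let n_fs = node.len() as f64;
        for m in 0..n_metrics {
            for t in 0..node[0][m].len() {
                // Mean over the file system dimension keeps the
                // values in the fixed interval [0, 4].
                let mean = node.iter().map(|fs| fs[m][t]).sum::<f64>() / n_fs;
                // Concatenation along the node dimension preserves
                // the I/O information of every node.
                features[m].push(mean);
            }
        }
    }
    features
}

/// Pre-filter: drop zero-jobs (sum over all values equals zero)
/// and jobs whose time series have fewer than 8 values.
fn keep(features: &FeatureSet) -> bool {
    let total: f64 = features.iter().flatten().sum();
    total != 0.0 && features.iter().all(|ts| ts.len() >= 8)
}
\end{verbatim}
The per-metric series produced this way are the inputs to the similarity function sketched above.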