commit 75fe1a9952
eugen.betke, 2020-08-20 14:17:03 +02:00

8 changed files with 130 additions and 33 deletions

4 binary files changed (not shown)
(LaTeX paper source; file name not shown)

@@ -44,7 +44,8 @@
 \usepackage{graphicx}
 \graphicspath{
     {./pictures/}
+    {../fig/}
 }
 \usepackage[backend=bibtex, style=numeric]{biblatex}
@@ -127,30 +128,62 @@ Check time series algorithms:
 \begin{itemize}
 \item bin
-\item hex\_native/hex\_lev
-\item pm\_quant
+\item hex\_native
+\item hex\_lev
+\item hex\_quant
 \end{itemize}
 \section{Evaluation}
 \label{sec:evaluation}
-Two study examples (two reference jobs):
+In the following, we assume a job is given and we aim to identify similar jobs.
+We chose several reference jobs with different compute and IO characteristics, visualized in \Cref{fig:refJobs}:
 \begin{itemize}
-\item jobA: shorter length, e.g., 5-10, with a little IO in at least two metadata metrics (the more, the better).
-\item jobB: a very IO-intensive longer job, e.g., of length $>$ 20, with IO read or write and maybe one other metric.
+\item Job-S: performs post-processing on a single node. This is a typical process in climate science where data products are reformatted and annotated with metadata to a standard representation (so-called CMORization). The post-processing is IO intensive.
+\item Job-M: a typical MPI-parallel 8-hour compute job on 128 nodes which writes time series data after some spin-up. %CHE.ws12
+\item Job-L: a 66-hour 20-node job.
+The initialization data is read at the beginning.
+Then only a single master node constantly writes a small volume of data; in fact, the generated data is too small to be categorized as IO relevant.
 \end{itemize}
-For each reference job: create a CSV file which contains all jobs with:
-\begin{itemize}
-\item JOB ID, and for each algorithm the coding and the computed ranking $\rightarrow$ thus one long row.
-\end{itemize}
-Alternatively, there could be one CSV per algorithm that contains JOB ID, coding + rank.
+For each reference job and algorithm, we created a CSV file with the computed similarity for all other jobs.
+Should we say something about the runtime of the algorithms? I think having that data would be useful.
 Create histograms + cumulative job distributions for all algorithms.
 Insert job profiles for the 10 closest jobs.
 Potentially, analyze what the rankings of the different similarity metrics look like.
+\begin{figure}
+\begin{subfigure}{0.8\textwidth}
+\includegraphics[width=\textwidth]{job-timeseries4296426}
+\caption{Job-S} \label{fig:job-S}
+\end{subfigure}
+\caption{Reference jobs: timeline of mean IO activity}
+\label{fig:refJobs}
+\end{figure}
+
+\begin{figure}\ContinuedFloat
+\begin{subfigure}{0.8\textwidth}
+\includegraphics[width=\textwidth]{job-timeseries5024292}
+\caption{Job-M} \label{fig:job-M}
+\end{subfigure}
+\begin{subfigure}{0.8\textwidth}
+\includegraphics[width=\textwidth]{job-timeseries7488914-30}
+\caption{Job-L (first 30 segments of 400; the remaining segments are similar)}
+\label{fig:job-L}
+\end{subfigure}
+\caption{Reference jobs: timeline of mean IO activity; timelines that are not shown are constantly zero}
+\end{figure}
 \section{Summary and Conclusion}
 \label{sec:summary}
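A note on the CSV format described above: the R script further below documents the columns as jobid, alg_id, alg_name, similarity. A minimal Python sketch of how the 10 closest jobs per algorithm could be read off such a file (the file name job_similarities.csv is a placeholder, not from this commit):

    import csv
    from collections import defaultdict

    # Collect (similarity, jobid) pairs per algorithm; columns as documented
    # in the R script: jobid, alg_id, alg_name, similarity
    ranking = defaultdict(list)
    with open('job_similarities.csv') as f:  # placeholder file name
        for row in csv.DictReader(f):
            ranking[row['alg_name']].append((float(row['similarity']), row['jobid']))

    for alg, pairs in ranking.items():
        top10 = sorted(pairs, reverse=True)[:10]  # most similar first
        print(alg, [jobid for (_, jobid) in top10])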

scripts/create-paper-vis.sh (new executable file)

@@ -0,0 +1,14 @@
+#!/bin/bash
+# This script calls all other scripts to re-create the figures for the paper
+mkdir -p fig
+for job in 5024292 4296426 7488914 ; do
+  ./scripts/plot-single-job.py $job "fig/job-"
+done
+
+# Remove whitespace around the figures
+# for file in fig/*.pdf ; do
+#   pdfcrop "$file" output.pdf
+#   mv output.pdf "$file"
+# done
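A side note on the interface: plot-single-job.py (next file) now splits both of its arguments on commas, and a token without commas simply yields a one-element list, so the single-job calls in the loop above keep working unchanged:

    jobs = "5024292".split(",")            # -> ["5024292"] (single-job call)
    jobs = "5024292,4296426".split(",")    # -> ["5024292", "4296426"]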

scripts/plot-single-job.py

@@ -5,12 +5,47 @@ import sys
 from pandas import DataFrame
 from pandas import Grouper
 from matplotlib import pyplot
+import matplotlib.cm as cm

-jobs = [sys.argv[1]]
-prefix = sys.argv[2]
+# Job ids and output prefixes now arrive as comma-separated lists (one prefix per job)
+jobs = sys.argv[1].split(",")
+prefix = sys.argv[2].split(",")
 print("Plotting the job: " + str(jobs))
+# Fixed color per metric so that all plots are directly comparable
+colorMap = { "md_file_create": cm.tab10(0),
+    "md_file_delete": cm.tab10(1),
+    "md_mod": cm.tab10(2),
+    "md_other": cm.tab10(3),
+    "md_read": cm.tab10(4),
+    "read_bytes": cm.tab10(5),
+    "read_calls": cm.tab10(6),
+    "write_bytes": cm.tab10(7),
+    "write_calls": cm.tab10(8)
+    }
+
+# Fixed marker per metric
+markerMap = { "md_file_create": "^",
+    "md_file_delete": "v",
+    "md_other": ".",
+    "md_mod": "<",
+    "md_read": ">",
+    "read_bytes": "h",
+    "read_calls": "H",
+    "write_bytes": "D",
+    "write_calls": "d"
+    }
+
+# Linestyle per metric class: metadata dotted, reads dashed, writes dash-dotted
+linestyleMap = { "md_file_create": ":",
+    "md_file_delete": ":",
+    "md_mod": ":",
+    "md_other": ":",
+    "md_read": ":",
+    "read_bytes": "--",
+    "read_calls": "--",
+    "write_bytes": "-.",
+    "write_calls": "-."
+    }
 # Plot the timeseries
 def plot(prefix, header, row):
     x = { h : d for (h, d) in zip(header, row)}
@@ -36,27 +71,45 @@ def plot(prefix, header, row):
     groups = data.groupby(["metrics"])
     metrics = DataFrame()
     labels = []
+    colors = []
+    style = []

     for name, group in groups:
         metrics[name] = [x[2] for x in group.values]
         labels.append(name)
+        style.append(linestyleMap[name] + markerMap[name])
+        colors.append(colorMap[name])

-    ax = metrics.plot(subplots=True, legend=False, sharex=True, grid=True, sharey=True, colormap='jet', marker='.', markersize=10, figsize=(8, 2 + 2 * len(labels)))
-    for (i, l) in zip(range(0, len(labels)), labels):
-        ax[i].set_ylabel(l)
+    fsize = (8, 1 + 1.5 * len(labels))
+    fsizeFixed = (8, 2)
+    pyplot.close('all')
+
+    # With few metrics, draw them into one shared plot; otherwise one subplot per metric
+    if len(labels) < 4:
+        ax = metrics.plot(legend=True, sharex=True, grid=True, sharey=True, markersize=10, figsize=fsizeFixed, color=colors, style=style)
+        ax.set_ylabel("Value")
+    else:
+        ax = metrics.plot(subplots=True, legend=False, sharex=True, grid=True, sharey=True, markersize=10, figsize=fsize, color=colors, style=style)
+        for (i, l) in zip(range(0, len(labels)), labels):
+            ax[i].set_ylabel(l)

     pyplot.xlabel("Segment number")
-    pyplot.savefig(prefix + "timeseries" + jobid + ".png")
+    pyplot.savefig(prefix + "timeseries" + jobid + ".pdf", bbox_inches='tight')

     # Plot first 30 segments
     if len(timeseries) <= 50:
         return

-    ax = metrics.plot(subplots=True, legend=False, sharex=True, grid=True, sharey=True, colormap='jet', marker='.', markersize=10, xlim=(0,30))
-    for (i, l) in zip(range(0, len(labels)), labels):
-        ax[i].set_ylabel(l)
+    if len(labels) < 4:
+        ax = metrics.plot(legend=True, xlim=(0,30), sharex=True, grid=True, sharey=True, markersize=10, figsize=fsizeFixed, color=colors, style=style)
+        ax.set_ylabel("Value")
+    else:
+        ax = metrics.plot(subplots=True, xlim=(0,30), legend=False, sharex=True, grid=True, sharey=True, markersize=10, figsize=fsize, color=colors, style=style)
+        for (i, l) in zip(range(0, len(labels)), labels):
+            ax[i].set_ylabel(l)

     pyplot.xlabel("Segment number")
-    pyplot.savefig(prefix + "timeseries" + jobid + "-30.png")
+    pyplot.savefig(prefix + "timeseries" + jobid + "-30.pdf", bbox_inches='tight')

 ### end plotting function
@@ -65,6 +118,7 @@ def plot(prefix, header, row):
 with open('job-io-datasets/datasets/job_codings.csv') as csv_file:
     csv_reader = csv.reader(csv_file, delimiter=',')
     line_count = 0
+    job = 0  # index of the next prefix to use
     for row in csv_reader:
         if line_count == 0:
             header = row
@ -74,4 +128,5 @@ with open('job-io-datasets/datasets/job_codings.csv') as csv_file:
if not row[0].strip() in jobs: if not row[0].strip() in jobs:
continue continue
else: else:
plot(prefix, header, row) plot(prefix[job], header, row)
job += 1
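One caveat with prefix[job]: it assumes exactly one prefix per matched job and pairs them in CSV encounter order, not in argv order; with fewer prefixes than jobs it raises an IndexError. A more defensive pairing, as a sketch with hypothetical argument values (not part of this commit):

    import sys

    jobs = "5024292,4296426".split(",")    # as parsed by the script
    prefix = "fig/a-,fig/b-".split(",")    # hypothetical prefixes
    if len(prefix) != len(jobs):
        sys.exit("expected exactly one prefix per job id")
    # pair prefixes with job ids explicitly instead of relying on CSV row order:
    prefixByJob = dict(zip(jobs, prefix))
    # in the reader loop, one would then call:
    #   plot(prefixByJob[row[0].strip()], header, row)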

(R analysis script; file name not shown)

@@ -19,10 +19,8 @@ data = read.csv(file)
 # Columns are: jobid alg_id alg_name similarity
 data$alg_id = as.factor(data$alg_id)

-print(nrow(data))
-
-# FILTER, TODO
-data = data %>% filter(similarity <= 1.0)
+cat("Job count:")
+cat(nrow(data))

 # empirical cumulative distribution function (ECDF)
 ggplot(data, aes(similarity, color=alg_name, group=alg_name)) + stat_ecdf(geom = "step") + xlab("SIM") + ylab("Fraction of jobs") + theme(legend.position="bottom") + scale_color_brewer(palette = "Set2")
@@ -34,7 +32,7 @@ print(summary(e))
 ggsave("ecdf-0.5.png")

 # histogram for the jobs
-ggplot(data, aes(similarity), group=alg_name) + geom_histogram(color="black", binwidth=0.025) + aes(fill = alg_name) + facet_grid(alg_name ~ ., switch = 'y') + scale_y_continuous(limits=c(0, 100), oob=squish) + scale_color_brewer(palette = "Set2") + ylab("Count (cropped at 100)")
+ggplot(data, aes(similarity), group=alg_name) + geom_histogram(color="black", binwidth=0.025) + aes(fill = alg_name) + facet_grid(alg_name ~ ., switch = 'y') + scale_y_continuous(limits=c(0, 100), oob=squish) + scale_color_brewer(palette = "Set2") + ylab("Count (cropped at 100)") + theme(legend.position = "none")
 ggsave("hist-sim.png")

 # load job information, i.e., the time series per job
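The ECDF plotted above maps each similarity value s to the fraction of jobs with similarity at most s. A minimal Python analogue of the stat_ecdf step, for the similarities of one algorithm:

    import numpy as np

    def ecdf(similarities, s):
        # fraction of jobs whose similarity is <= s
        a = np.asarray(similarities, dtype=float)
        return float((a <= s).mean())

    # e.g., ecdf([0.1, 0.5, 0.9], 0.5) -> 0.666...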
@@ -51,13 +49,10 @@ plotJobs = function(jobs){
   md = metadata[metadata$jobid %in% jobs,]
   print(summary(md))

-  # print the job timeline
+  # plot the job timelines in one batched call
   r = e[ordered, ]
-  for (row in 1:length(jobs)) {
-    prefix = sprintf("%s-%f-%.0f-", level, r[row, "similarity"], row)
-    job = r[row, "jobid"]
-    system(sprintf("scripts/plot-single-job.py %s %s", job, prefix))
-  }
+  prefix = do.call("sprintf", list("%s-%.0f-", level, r$similarity))
+  system(sprintf("scripts/plot-single-job.py %s %s", paste(r$jobid, collapse=","), paste(prefix, collapse=",")))
 }

 # Store the job ids in a table, each column is one algorithm
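The rewritten plotJobs above replaces the per-job loop with a single batched call to plot-single-job.py, joining the job ids and the per-job prefixes with commas, which the Python side splits again. A Python sketch of the command that gets assembled, with hypothetical job ids and level:

    jobids   = ["4296426", "7488914"]          # hypothetical ranking result
    prefixes = ["hex_lev-1-", "hex_lev-0-"]    # level plus rounded similarity
    cmd = "scripts/plot-single-job.py %s %s" % (",".join(jobids), ",".join(prefixes))
    # -> scripts/plot-single-job.py 4296426,7488914 hex_lev-1-,hex_lev-0-
    print(cmd)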