Commit e6082458 authored by Ilya Mironov, committed by Karmel Allison

Two new plot-generating scripts and changes in support of the ICLR poster. (#4092)

* + plot_partition.py (budget allocation with smooth sensitivity).
+ utility_queries_answered.py (Figure 1, left).
+ several changes to the scripts required for the ICLR poster.

* Adding Ananth Raghunathan (pseudorandom@) to the list of owners of differential_privacy.

* Fixing the owners list of differential_privacy.
parent db778817
CODEOWNERS:

@@ -11,7 +11,7 @@
 /research/compression/ @nmjohn
 /research/deeplab/ @aquariusjay @yknzhu @gpapan
 /research/delf/ @andrefaraujo
-/research/differential_privacy/ @panyx0718 @mironov
+/research/differential_privacy/ @ilyamironov @ananthr
 /research/domain_adaptation/ @bousmalis @dmrd
 /research/gan/ @joel-shor
 /research/im2txt/ @cshallue
...
README.md:

@@ -11,8 +11,8 @@ Erlingsson (ICLR 2018, https://arxiv.org/abs/1802.08908).
 * numpy
 * scipy
 * sympy (for smooth sensitivity analysis)
-* write access to current directory (otherwise, output directories in download.py and *.sh scripts
-  must be changed)
+* write access to the current directory (otherwise, output directories in download.py and *.sh
+  scripts must be changed)
 
 ## Reproducing Figures 1 and 5, and Table 2
@@ -27,30 +27,33 @@ For Table 2 run (may take several hours)\
 `$ sh generate_table.sh`\
 The output is written to the console.
 
-For data-independent bounds (for comparing with Table 2), run\
+For data-independent bounds (for comparison with Table 2), run\
 `$ sh generate_table_data_independent.sh`\
 The output is written to the console.
 
 ## Files in this directory
 
-* generate_figures.sh --- Master script for generating Figures 1 and 5.
-* generate_table.sh --- Master script for generating Table 2.
-* generate_table_data_independent.sh --- Master script for computing data-independent
-  bounds.
-* rdp_bucketized.py --- Script for producing Figures 1 (right) and 5 (right).
-* rdp_cumulative.py --- Script for producing Figure 1 (left, middle), Figure 5
-  (left), and partition.pdf (a detailed breakdown of privacy costs per
-  source).
-* smooth_sensitivity_table.py --- Script for generating Table 2.
+* generate_figures.sh — Master script for generating Figures 1 and 5.
+* generate_table.sh — Master script for generating Table 2.
+* generate_table_data_independent.sh — Master script for computing data-independent
+  bounds.
+* rdp_bucketized.py — Script for producing Figure 1 (right) and Figure 5 (right).
+* rdp_cumulative.py — Script for producing Figure 1 (middle) and Figure 5 (left).
+* smooth_sensitivity_table.py — Script for generating Table 2.
+* utility_queries_answered.py — Script for producing Figure 1 (left).
+* plot_partition.py — Script for producing partition.pdf, a detailed breakdown of privacy
+  costs for Confident-GNMax with smooth sensitivity analysis (takes ~50 hours).
 * rdp_flow.py and plot_ls_q.py are currently not used.
-* download.py --- Utility script for populating the data/ directory.
+* download.py — Utility script for populating the data/ directory.
 
 All Python files take flags. Run script_name.py --help for help on flags.
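All of these analyses ultimately convert a cumulative Renyi-DP (RDP) curve into a single (ε, δ) guarantee (`pate.compute_eps_from_delta` in the parent directory). A minimal self-contained sketch of that standard conversion; `eps_from_rdp` is a hypothetical stand-in name, and the α/σ² per-query RDP rate for GNMax is used only as an illustrative input:

```python
import numpy as np


def eps_from_rdp(orders, rdp, delta):
  """Converts an RDP curve to an (eps, delta) guarantee.

  Standard conversion: eps(delta) = min over orders alpha of
  RDP(alpha) + log(1/delta) / (alpha - 1).
  """
  orders = np.asarray(orders, dtype=float)
  rdp = np.asarray(rdp, dtype=float)
  eps = rdp + np.log(1.0 / delta) / (orders - 1)
  idx = int(np.argmin(eps))
  return float(eps[idx]), float(orders[idx])


# Illustration: the data-independent RDP of GNMax with sigma = 40 is roughly
# alpha / sigma^2 per query, so after 1000 answered queries:
orders = np.logspace(np.log10(1.5), np.log10(500), num=100)
rdp_total = 1000 * orders / 40.0 ** 2
eps, order_opt = eps_from_rdp(orders, rdp_total, delta=1e-8)
```

The minimizing order shifts toward larger α as more queries accumulate, which is why the scripts track a whole grid of orders rather than a single one.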
generate_figures.sh:

@@ -17,8 +17,6 @@
 counts_file="data/glyph_5000_teachers.npy"
 output_dir="figures/"
 
-executable1="python rdp_bucketized.py"
-executable2="python rdp_cumulative.py"
-
 mkdir -p $output_dir
@@ -27,18 +25,19 @@ if [ ! -d "$output_dir" ]; then
   exit 1
 fi
 
-$executable1 \
+python rdp_bucketized.py \
   --plot=small \
   --counts_file=$counts_file \
   --plot_file=$output_dir"noisy_thresholding_check_perf.pdf"
 
-$executable1 \
+python rdp_bucketized.py \
   --plot=large \
   --counts_file=$counts_file \
   --plot_file=$output_dir"noisy_thresholding_check_perf_details.pdf"
 
-$executable2 \
+python rdp_cumulative.py \
   --cache=False \
   --counts_file=$counts_file \
   --figures_dir=$output_dir
+
+python utility_queries_answered.py --plot_file=$output_dir"utility_queries_answered.pdf"
\ No newline at end of file
"""Produces two plots. One compares aggregators and their analyses. The other
illustrates sources of privacy loss for Confident-GNMax.
A script in support of the paper "Scalable Private Learning with PATE" by
Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar,
Ulfar Erlingsson (https://arxiv.org/abs/1802.08908).
The input is a file containing a numpy array of votes, one query per row, one
class per column. Ex:
43, 1821, ..., 3
31, 16, ..., 0
...
0, 86, ..., 438
The output is written to a specified directory and consists of two files.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math
import os
import pickle
import sys

sys.path.append('..')  # Main modules reside in the parent directory.

from absl import app
from absl import flags
from collections import namedtuple
import matplotlib

matplotlib.use('TkAgg')
import matplotlib.pyplot as plt  # pylint: disable=g-import-not-at-top
import numpy as np

import core as pate
import smooth_sensitivity as pate_ss

plt.style.use('ggplot')

FLAGS = flags.FLAGS
flags.DEFINE_boolean('cache', False,
                     'Read results of privacy analysis from cache.')
flags.DEFINE_string('counts_file', None, 'Counts file.')
flags.DEFINE_string('figures_dir', '', 'Path where figures are written to.')
flags.DEFINE_float('threshold', None, 'Threshold for step 1 (selection).')
flags.DEFINE_float('sigma1', None, 'Sigma for step 1 (selection).')
flags.DEFINE_float('sigma2', None, 'Sigma for step 2 (argmax).')
flags.DEFINE_integer('queries', None, 'Number of queries made by the student.')
flags.DEFINE_float('delta', 1e-8, 'Target delta.')

flags.mark_flag_as_required('counts_file')
flags.mark_flag_as_required('threshold')
flags.mark_flag_as_required('sigma1')
flags.mark_flag_as_required('sigma2')

Partition = namedtuple('Partition', ['step1', 'step2', 'ss', 'delta'])
def analyze_gnmax_conf_data_ind(votes, threshold, sigma1, sigma2, delta):
  orders = np.logspace(np.log10(1.5), np.log10(500), num=100)
  n = votes.shape[0]

  rdp_total = np.zeros(len(orders))
  answered_total = 0
  answered = np.zeros(n)
  eps_cum = np.full(n, None, dtype=float)

  for i in range(n):
    v = votes[i,]
    if threshold is not None and sigma1 is not None:
      q_step1 = np.exp(pate.compute_logpr_answered(threshold, sigma1, v))
      rdp_total += pate.rdp_data_independent_gaussian(sigma1, orders)
    else:
      q_step1 = 1.  # always answer

    answered_total += q_step1
    answered[i] = answered_total

    rdp_total += q_step1 * pate.rdp_data_independent_gaussian(sigma2, orders)

    eps_cum[i], order_opt = pate.compute_eps_from_delta(orders, rdp_total,
                                                        delta)

    if i > 0 and (i + 1) % 1000 == 0:
      print('queries = {}, E[answered] = {:.2f}, E[eps] = {:.3f} '
            'at order = {:.2f}.'.format(
                i + 1,
                answered[i],
                eps_cum[i],
                order_opt))
      sys.stdout.flush()

  return eps_cum, answered
def analyze_gnmax_conf_data_dep(votes, threshold, sigma1, sigma2, delta):
  # Short list of orders.
  # orders = np.round(np.logspace(np.log10(20), np.log10(200), num=20))

  # Long list of orders.
  orders = np.concatenate((np.arange(20, 40, .2),
                           np.arange(40, 75, .5),
                           np.logspace(np.log10(75), np.log10(200), num=20)))

  n = votes.shape[0]
  num_classes = votes.shape[1]
  num_teachers = int(sum(votes[0,]))

  if threshold is not None and sigma1 is not None:
    is_data_ind_step1 = pate.is_data_independent_always_opt_gaussian(
        num_teachers, num_classes, sigma1, orders)
  else:
    is_data_ind_step1 = [True] * len(orders)

  is_data_ind_step2 = pate.is_data_independent_always_opt_gaussian(
      num_teachers, num_classes, sigma2, orders)

  eps_partitioned = np.full(n, None, dtype=Partition)
  order_opt = np.full(n, None, dtype=float)
  ss_std_opt = np.full(n, None, dtype=float)
  answered = np.zeros(n)

  rdp_step1_total = np.zeros(len(orders))
  rdp_step2_total = np.zeros(len(orders))

  ls_total = np.zeros((len(orders), num_teachers))
  answered_total = 0

  for i in range(n):
    v = votes[i,]

    if threshold is not None and sigma1 is not None:
      logq_step1 = pate.compute_logpr_answered(threshold, sigma1, v)
      rdp_step1_total += pate.compute_rdp_threshold(logq_step1, sigma1, orders)
    else:
      logq_step1 = 0.  # always answer

    pr_answered = np.exp(logq_step1)
    logq_step2 = pate.compute_logq_gaussian(v, sigma2)
    rdp_step2_total += pr_answered * pate.rdp_gaussian(logq_step2, sigma2,
                                                       orders)

    answered_total += pr_answered

    rdp_ss = np.zeros(len(orders))
    ss_std = np.zeros(len(orders))

    for j, order in enumerate(orders):
      if not is_data_ind_step1[j]:
        ls_step1 = pate_ss.compute_local_sensitivity_bounds_threshold(
            v, num_teachers, threshold, sigma1, order)
      else:
        ls_step1 = np.full(num_teachers, 0, dtype=float)

      if not is_data_ind_step2[j]:
        ls_step2 = pate_ss.compute_local_sensitivity_bounds_gnmax(
            v, num_teachers, sigma2, order)
      else:
        ls_step2 = np.full(num_teachers, 0, dtype=float)

      ls_total[j,] += ls_step1 + pr_answered * ls_step2

      beta_ss = .49 / order

      ss = pate_ss.compute_discounted_max(beta_ss, ls_total[j,])
      sigma_ss = ((order * math.exp(2 * beta_ss)) / ss) ** (1 / 3)
      rdp_ss[j] = pate_ss.compute_rdp_of_smooth_sensitivity_gaussian(
          beta_ss, sigma_ss, order)
      ss_std[j] = ss * sigma_ss

    rdp_total = rdp_step1_total + rdp_step2_total + rdp_ss

    answered[i] = answered_total
    _, order_opt[i] = pate.compute_eps_from_delta(orders, rdp_total, delta)
    order_idx = np.searchsorted(orders, order_opt[i])

    # Since optimal orders are always non-increasing, shrink orders array
    # and all cumulative arrays to speed up computation.
    if order_idx < len(orders):
      orders = orders[:order_idx + 1]
      rdp_step1_total = rdp_step1_total[:order_idx + 1]
      rdp_step2_total = rdp_step2_total[:order_idx + 1]

    eps_partitioned[i] = Partition(step1=rdp_step1_total[order_idx],
                                   step2=rdp_step2_total[order_idx],
                                   ss=rdp_ss[order_idx],
                                   delta=-math.log(delta) / (order_opt[i] - 1))
    ss_std_opt[i] = ss_std[order_idx]

    if i > 0 and (i + 1) % 10 == 0:
      print('queries = {}, E[answered] = {:.2f}, E[eps] = {:.3f} +/- {:.3f} '
            'at order = {:.2f}. Contributions: delta = {:.3f}, step1 = {:.3f}, '
            'step2 = {:.3f}, ss = {:.3f}'.format(
                i + 1,
                answered[i],
                sum(eps_partitioned[i]),
                ss_std_opt[i],
                order_opt[i],
                eps_partitioned[i].delta,
                eps_partitioned[i].step1,
                eps_partitioned[i].step2,
                eps_partitioned[i].ss))
      sys.stdout.flush()

  return eps_partitioned, answered, ss_std_opt, order_opt
def plot_comparison(figures_dir, simple_ind, conf_ind, simple_dep, conf_dep):
  """Plots variants of the GNMax algorithm and their analyses."""

  def pivot(x_axis, eps, answered):
    y = np.full(len(x_axis), None, dtype=float)
    for i, x in enumerate(x_axis):
      idx = np.searchsorted(answered, x)
      if idx < len(eps):
        y[i] = eps[idx]
    return y

  def pivot_dep(x_axis, data_dep):
    eps_partitioned, answered, _, _ = data_dep
    eps = [sum(p) for p in eps_partitioned]  # Flatten eps.
    return pivot(x_axis, eps, answered)

  xlim = 10000
  x_axis = range(0, xlim, 10)

  y_simple_ind = pivot(x_axis, *simple_ind)
  y_conf_ind = pivot(x_axis, *conf_ind)
  y_simple_dep = pivot_dep(x_axis, simple_dep)
  y_conf_dep = pivot_dep(x_axis, conf_dep)

  # plt.close('all')
  fig, ax = plt.subplots()
  fig.set_figheight(4.5)
  fig.set_figwidth(4.7)

  ax.plot(x_axis, y_simple_ind, ls='--', color='r', lw=3,
          label=r'Simple GNMax, data-ind analysis')
  ax.plot(x_axis, y_conf_ind, ls='--', color='b', lw=3,
          label=r'Confident GNMax, data-ind analysis')
  ax.plot(x_axis, y_simple_dep, ls='-', color='r', lw=3,
          label=r'Simple GNMax, data-dep analysis')
  ax.plot(x_axis, y_conf_dep, ls='-', color='b', lw=3,
          label=r'Confident GNMax, data-dep analysis')

  plt.xticks(np.arange(0, xlim + 1000, 2000))
  plt.xlim([0, xlim])
  plt.ylim(bottom=0)
  plt.legend(fontsize=16)
  ax.set_xlabel('Number of queries answered', fontsize=16)
  ax.set_ylabel(r'Privacy cost $\varepsilon$ at $\delta=10^{-8}$', fontsize=16)
  ax.tick_params(labelsize=14)

  plot_filename = os.path.join(figures_dir, 'comparison.pdf')
  print('Saving the graph to ' + plot_filename)
  fig.savefig(plot_filename, bbox_inches='tight')
  plt.show()
def plot_partition(figures_dir, gnmax_conf, print_order):
  """Plots an expert version of the privacy-per-answered-query graph.

  Args:
    figures_dir: A name of the directory where to save the plot.
    gnmax_conf: The tuple (eps_partitioned, answered, ss_std_opt, order_opt)
      returned by analyze_gnmax_conf_data_dep.
    print_order: If True, the optimal Renyi order is plotted on a second axis.
  """
  eps_partitioned, answered, ss_std_opt, order_opt = gnmax_conf

  xlim = 10000
  x = range(0, int(xlim), 10)
  lenx = len(x)
  y0 = np.full(lenx, np.nan, dtype=float)  # delta
  y1 = np.full(lenx, np.nan, dtype=float)  # delta + step1
  y2 = np.full(lenx, np.nan, dtype=float)  # delta + step1 + step2
  y3 = np.full(lenx, np.nan, dtype=float)  # delta + step1 + step2 + ss
  noise_std = np.full(lenx, np.nan, dtype=float)

  y_right = np.full(lenx, np.nan, dtype=float)

  for i in range(lenx):
    idx = np.searchsorted(answered, x[i])
    if idx < len(eps_partitioned):
      y0[i] = eps_partitioned[idx].delta
      y1[i] = y0[i] + eps_partitioned[idx].step1
      y2[i] = y1[i] + eps_partitioned[idx].step2
      y3[i] = y2[i] + eps_partitioned[idx].ss
      noise_std[i] = ss_std_opt[idx]
      y_right[i] = order_opt[idx]

  # plt.close('all')
  fig, ax = plt.subplots()
  fig.set_figheight(4.5)
  fig.set_figwidth(4.7)
  l1 = ax.plot(
      x, y3, color='b', ls='-', label=r'Total privacy cost', linewidth=1).pop()
  for y in (y0, y1, y2):
    ax.plot(x, y, color='b', ls='-', label=r'_nolegend_', alpha=.5, linewidth=1)
  ax.fill_between(x, [0] * lenx, y0.tolist(), facecolor='b', alpha=.5)
  ax.fill_between(x, y0.tolist(), y1.tolist(), facecolor='b', alpha=.4)
  ax.fill_between(x, y1.tolist(), y2.tolist(), facecolor='b', alpha=.3)
  ax.fill_between(x, y2.tolist(), y3.tolist(), facecolor='b', alpha=.2)
  ax.fill_between(x, (y3 - noise_std).tolist(), (y3 + noise_std).tolist(),
                  facecolor='r', alpha=.5)

  plt.xticks(np.arange(0, xlim + 1000, 2000))
  plt.xlim([0, xlim])
  ax.set_ylim([0, 3.])

  ax.set_xlabel('Number of queries answered', fontsize=10)
  ax.set_ylabel(r'Privacy cost $\varepsilon$ at $\delta=10^{-8}$', fontsize=10)

  if print_order:
    # Merging legends.
    ax2 = ax.twinx()
    l2 = ax2.plot(
        x, y_right, 'r', ls='-', label=r'Optimal order', linewidth=5,
        alpha=.5).pop()
    ax2.grid(False)
    ax2.set_ylabel(r'Optimal Renyi order', fontsize=16)
    ax2.set_ylim([0, 200.])
    ax.legend((l1, l2), (l1.get_label(), l2.get_label()), loc=0, fontsize=13)

  ax.tick_params(labelsize=10)
  plot_filename = os.path.join(figures_dir, 'partition.pdf')
  print('Saving the graph to ' + plot_filename)
  fig.savefig(plot_filename, bbox_inches='tight', dpi=800)
  plt.show()
def run_all_analyses(votes, threshold, sigma1, sigma2, delta):
  simple_ind = analyze_gnmax_conf_data_ind(votes, None, None, sigma2,
                                           delta)

  conf_ind = analyze_gnmax_conf_data_ind(votes, threshold, sigma1, sigma2,
                                         delta)

  simple_dep = analyze_gnmax_conf_data_dep(votes, None, None, sigma2,
                                           delta)

  conf_dep = analyze_gnmax_conf_data_dep(votes, threshold, sigma1, sigma2,
                                         delta)

  return (simple_ind, conf_ind, simple_dep, conf_dep)


def run_or_load_all_analyses():
  temp_filename = os.path.expanduser('~/tmp/partition_cached.pkl')

  if FLAGS.cache and os.path.isfile(temp_filename):
    print('Reading from cache ' + temp_filename)
    with open(temp_filename, 'rb') as f:
      all_analyses = pickle.load(f)
  else:
    fin_name = os.path.expanduser(FLAGS.counts_file)
    print('Reading raw votes from ' + fin_name)
    sys.stdout.flush()

    votes = np.load(fin_name)

    if FLAGS.queries is not None:
      if votes.shape[0] < FLAGS.queries:
        raise ValueError('Expect {} rows, got {} in {}'.format(
            FLAGS.queries, votes.shape[0], fin_name))
      # Truncate the votes matrix to the number of queries made.
      votes = votes[:FLAGS.queries, ]

    all_analyses = run_all_analyses(votes, FLAGS.threshold, FLAGS.sigma1,
                                    FLAGS.sigma2, FLAGS.delta)

    print('Writing to cache ' + temp_filename)
    with open(temp_filename, 'wb') as f:
      pickle.dump(all_analyses, f)

  return all_analyses


def main(argv):
  del argv  # Unused.

  simple_ind, conf_ind, simple_dep, conf_dep = run_or_load_all_analyses()

  figures_dir = os.path.expanduser(FLAGS.figures_dir)

  plot_comparison(figures_dir, simple_ind, conf_ind, simple_dep, conf_dep)
  plot_partition(figures_dir, conf_dep, False)
  plt.close('all')


if __name__ == '__main__':
  app.run(main)
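The smooth-sensitivity term that plot_partition.py accumulates (`pate_ss.compute_discounted_max` with beta = .49 / order) is a β-smooth upper bound on sensitivity: the maximum of the local-sensitivity bounds, discounted exponentially in their distance from the actual input. A minimal sketch of that definition; `discounted_max` is a hypothetical stand-in for the library routine:

```python
import numpy as np


def discounted_max(beta, ls_bounds):
  """beta-smooth bound: max over k of exp(-beta * k) * ls_bounds[k].

  ls_bounds[k] upper-bounds the local sensitivity at distance k from the
  actual vote matrix (i.e., with k teachers' votes replaced).
  """
  ls_bounds = np.asarray(ls_bounds, dtype=float)
  k = np.arange(len(ls_bounds))
  return float(np.max(np.exp(-beta * k) * ls_bounds))
```

With beta = 0 this degenerates to a plain maximum; larger beta lets distant (and typically larger) local-sensitivity bounds be discounted away, which is what makes the data-dependent noise calibration in the script pay off.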
rdp_bucketized.py:

@@ -51,9 +51,11 @@ plt.style.use('ggplot')
 FLAGS = flags.FLAGS
 flags.DEFINE_enum('plot', 'small', ['small', 'large'], 'Selects which of'
                   'the two plots is produced.')
-flags.DEFINE_string('counts_file', '', 'Counts file.')
+flags.DEFINE_string('counts_file', None, 'Counts file.')
 flags.DEFINE_string('plot_file', '', 'Plot file to write.')
 
+flags.mark_flag_as_required('counts_file')
+
 
 def compute_count_per_bin(bin_num, votes):
   """Tabulates number of examples in each bin.
@@ -164,6 +166,8 @@ def main(argv):
     m_check = compute_expected_answered_per_bin(bin_num, votes, 3500, 1500)
     a_check = compute_expected_answered_per_bin(bin_num, votes, 5000, 1500)
     eps = compute_privacy_cost_per_bins(bin_num, votes, 100, 50)
+  else:
+    raise ValueError('--plot flag must be one of ["small", "large"]')
 
   counts = compute_count_per_bin(bin_num, votes)
   bins = np.linspace(0, 100, num=bin_num, endpoint=False)
@@ -171,14 +175,14 @@ def main(argv):
   plt.close('all')
   fig, ax = plt.subplots()
   if FLAGS.plot == 'small':
-    fig.set_figheight(4.7)
+    fig.set_figheight(5)
     fig.set_figwidth(5)
     ax.bar(
         bins,
         counts,
         20,
         color='orangered',
-        linestyle='dashed',
+        linestyle='dotted',
         linewidth=5,
         edgecolor='red',
         fill=False,
@@ -189,7 +193,7 @@ def main(argv):
         bins,
         m_check,
         20,
-        color='b',
+        color='g',
         alpha=.5,
         linewidth=0,
         edgecolor='g',
@@ -238,12 +242,13 @@ def main(argv):
   ax2.set_ylabel(r'Per query privacy cost $\varepsilon$', fontsize=16)
 
   plt.xlim([0, 100])
-  ax.set_ylim([0, 2500])
   # ax.set_yscale('log')
   ax.set_xlabel('Percentage of teachers that agree', fontsize=16)
   ax.set_ylabel('Number of queries answered', fontsize=16)
   vals = ax.get_xticks()
   ax.set_xticklabels([str(int(x)) + '%' for x in vals])
-  ax.tick_params(labelsize=14)
+  ax.tick_params(labelsize=14, bottom=True, top=True, left=True, right=True)
   ax.legend(loc=2, prop={'size': 16})
 
   # simple: 'figures/noisy_thresholding_check_perf.pdf')
...
rdp_cumulative.py:

@@ -41,6 +41,7 @@ sys.path.append('..')  # Main modules reside in the parent directory.
 from absl import app
 from absl import flags
 import matplotlib
+
 matplotlib.use('TkAgg')
 import matplotlib.pyplot as plt  # pylint: disable=g-import-not-at-top
 import numpy as np
@@ -51,9 +52,10 @@ plt.style.use('ggplot')
 FLAGS = flags.FLAGS
 flags.DEFINE_boolean('cache', False,
                      'Read results of privacy analysis from cache.')
-flags.DEFINE_string('counts_file', '', 'Counts file.')
+flags.DEFINE_string('counts_file', None, 'Counts file.')
 flags.DEFINE_string('figures_dir', '', 'Path where figures are written to.')
 
+flags.mark_flag_as_required('counts_file')
+
 
 def run_analysis(votes, mechanism, noise_scale, params):
   """Computes data-dependent privacy.
@@ -90,26 +92,26 @@ def run_analysis(votes, mechanism, noise_scale, params):
   n = votes.shape[0]
   eps_total = np.zeros(n)
-  partition = np.full(n, None, dtype=object)
-  order_opt = np.full(n, None, dtype=float)
-  answered = np.zeros(n)
+  partition = [None] * n
+  order_opt = np.full(n, np.nan, dtype=float)
+  answered = np.zeros(n, dtype=float)
   rdp_cum = np.zeros(len(orders))
   rdp_sqrd_cum = np.zeros(len(orders))
   rdp_select_cum = np.zeros(len(orders))
   answered_sum = 0
 
-  for i in xrange(n):
+  for i in range(n):
     v = votes[i,]
     if mechanism == 'lnmax':
       logq_lnmax = pate.compute_logq_laplace(v, noise_scale)
       rdp_query = pate.rdp_pure_eps(logq_lnmax, 2. / noise_scale, orders)
-      rdp_sqrd = rdp_query**2
+      rdp_sqrd = rdp_query ** 2
       pr_answered = 1
     elif mechanism == 'gnmax':
       logq_gmax = pate.compute_logq_gaussian(v, noise_scale)
       rdp_query = pate.rdp_gaussian(logq_gmax, noise_scale, orders)
-      rdp_sqrd = rdp_query**2
+      rdp_sqrd = rdp_query ** 2
       pr_answered = 1
     elif mechanism == 'gnmax_conf':
       logq_step1 = pate.compute_logpr_answered(params['t'], params['sigma1'], v)
@@ -117,16 +119,19 @@ def run_analysis(votes, mechanism, noise_scale, params):
       q_step1 = np.exp(logq_step1)
       logq_step1_min = min(logq_step1, math.log1p(-q_step1))
       rdp_gnmax_step1 = pate.rdp_gaussian(logq_step1_min,
-                                          2**.5 * params['sigma1'], orders)
+                                          2 ** .5 * params['sigma1'], orders)
       rdp_gnmax_step2 = pate.rdp_gaussian(logq_step2, noise_scale, orders)
       rdp_query = rdp_gnmax_step1 + q_step1 * rdp_gnmax_step2
       # The expression below evaluates
       # E[(cost_of_step_1 + Bernoulli(pr_of_step_2) * cost_of_step_2)^2]
       rdp_sqrd = (
-          rdp_gnmax_step1**2 + 2 * rdp_gnmax_step1 * q_step1 * rdp_gnmax_step2 +
-          q_step1 * rdp_gnmax_step2**2)
+          rdp_gnmax_step1 ** 2 + 2 * rdp_gnmax_step1 * q_step1 * rdp_gnmax_step2
+          + q_step1 * rdp_gnmax_step2 ** 2)
       rdp_select_cum += rdp_gnmax_step1
       pr_answered = q_step1
+    else:
+      raise ValueError(
+          'Mechanism must be one of ["lnmax", "gnmax", "gnmax_conf"]')
 
     rdp_cum += rdp_query
     rdp_sqrd_cum += rdp_sqrd
@@ -139,9 +144,9 @@ def run_analysis(votes, mechanism, noise_scale, params):
     if i > 0 and (i + 1) % 1000 == 0:
       rdp_var = rdp_sqrd_cum / i - (
-          rdp_cum / i)**2  # Ignore Bessel's correction.
+          rdp_cum / i) ** 2  # Ignore Bessel's correction.
       order_opt_idx = np.searchsorted(orders, order_opt[i])
-      eps_std = ((i + 1) * rdp_var[order_opt_idx])**.5  # Std of the sum.
+      eps_std = ((i + 1) * rdp_var[order_opt_idx]) ** .5  # Std of the sum.
       print(
           'queries = {}, E[answered] = {:.2f}, E[eps] = {:.3f} (std = {:.5f}) '
           'at order = {:.2f} (contribution from delta = {:.3f})'.format(
@@ -163,10 +168,10 @@ def print_plot_small(figures_dir, eps_lap, eps_gnmax, answered_gnmax):
   """
   xlim = 6000
   x_axis = range(0, int(xlim), 10)
-  y_lap = np.zeros(len(x_axis))
-  y_gnmax = np.full(len(x_axis), None, dtype=float)
+  y_lap = np.zeros(len(x_axis), dtype=float)
+  y_gnmax = np.full(len(x_axis), np.nan, dtype=float)
 
-  for i in xrange(len(x_axis)):
+  for i in range(len(x_axis)):
     x = x_axis[i]
     y_lap[i] = eps_lap[x]
     idx = np.searchsorted(answered_gnmax, x)
@@ -220,7 +225,7 @@ def print_plot_large(figures_dir, eps_lap, eps_gnmax1, answered_gnmax1,
   y_gnmax2 = np.full(lenx, np.nan, dtype=float)
   y1_gnmax2 = np.full(lenx, np.nan, dtype=float)
 
-  for i in xrange(lenx):
+  for i in range(lenx):
     x = x_axis[i]
     y_lap[i] = eps_lap[x]
     idx1 = np.searchsorted(answered_gnmax1, x)
@@ -289,86 +294,6 @@ def print_plot_large(figures_dir, eps_lap, eps_gnmax1, answered_gnmax1,
   plt.show()
 
 
-def print_plot_partition(figures_dir, eps, partition, answered, order_opt):
-  """Plots an expert version of the privacy-per-answered-query graph.
-
-  Args:
-    figures_dir: A name of the directory where to save the plot.
-    eps: The cumulative privacy cost.
-    partition: Allocation of the privacy cost.
-    answered: Cumulative number of queries answered.
-    order_opt: The list of optimal orders.
-  """
-  xlim = 6000
-  x = range(0, int(xlim), 10)
-  lenx = len(x)
-  y = np.full(lenx, np.nan, dtype=float)
-  y1 = np.full(lenx, np.nan, dtype=float)
-  y2 = np.full(lenx, np.nan, dtype=float)
-  y_right = np.full(lenx, np.nan, dtype=float)
-
-  for i in xrange(lenx):
-    idx = np.searchsorted(answered, x[i])
-    if idx < len(eps):
-      y[i] = eps[idx]
-      fraction_step1, _, fraction_delta = partition[idx]
-      y1[i] = fraction_delta * y[i]
-      y2[i] = (fraction_delta + fraction_step1) * y[i]
-      y_right[i] = order_opt[idx]
-
-  # plt.close('all')
-  fig, ax = plt.subplots()
-  fig.set_figheight(4.5)
-  fig.set_figwidth(4.7)
-  l1 = ax.plot(
-      x, y, color='b', ls='-', label=r'Total privacy cost', linewidth=5).pop()
-  ax.plot(x, y1, color='b', ls='-', label=r'_nolegend_', alpha=.5, linewidth=1)
-  ax.plot(x, y2, color='b', ls='-', label=r'_nolegend_', alpha=.5, linewidth=1)
-  ax.fill_between(x, [0] * lenx, y1.tolist(), facecolor='b', alpha=.1)
-  ax.fill_between(x, y1.tolist(), y2.tolist(), facecolor='b', alpha=.2)
-  ax.fill_between(x, y2.tolist(), y.tolist(), facecolor='b', alpha=.3)
-
-  t1 = 300
-  ax.text(x[t1], y1[t1] * .35, r'due to $\delta$', alpha=.5, fontsize=18)
-  ax.text(
-      x[t1],
-      y1[t1] + (y2[t1] - y1[t1]) * .6,
-      r'selection ($\sigma_1$)',
-      alpha=.5,
-      fontsize=18)
-  t2 = 550
-  ax.annotate(
-      r'answering ($\sigma_2$)',
-      xy=(x[t2], (y[t2] + y2[t2]) / 2),
-      xytext=(x[200], y[t2] * 1.1),
-      arrowprops=dict(facecolor='black', shrink=0.005, alpha=.5),
-      fontsize=18,
-      alpha=.5)
-
-  ax2 = ax.twinx()
-  l2 = ax2.plot(
-      x, y_right, 'r', ls='-', label=r'Optimal order', linewidth=5,
-      alpha=.5).pop()
-  ax2.grid(False)
-  ax2.set_ylabel(r'Optimal Renyi order', fontsize=16)
-
-  plt.xticks(np.arange(0, 7000, 1000))
-  plt.xlim([0, xlim])
-  ax.set_ylim([0, 1.])
-  ax2.set_ylim([0, 200.])
-  ax.set_xlabel('Number of queries answered', fontsize=16)
-  ax.set_ylabel(r'Privacy cost $\varepsilon$ at $\delta=10^{-8}$', fontsize=16)
-
-  # Merging legends.
-  ax.legend((l1, l2), (l1.get_label(), l2.get_label()), loc=0, fontsize=13)
-  ax.tick_params(labelsize=14)
-
-  fout_name = os.path.join(figures_dir, 'partition.pdf')
-  print('Saving the graph to ' + fout_name)
-  fig.savefig(fout_name, bbox_inches='tight')
-  plt.show()
-
-
 def run_all_analyses(votes, lambda_laplace, gnmax_parameters, sigma2):
   """Sequentially runs all analyses.
@@ -446,8 +371,6 @@ def main(argv):
   print_plot_small(figures_dir, eps_lap, eps_gnmax[0], answered_gnmax[0])
   print_plot_large(figures_dir, eps_lap, eps_gnmax[1], answered_gnmax[1],
                    eps_gnmax[2], partition_gmax[2], answered_gnmax[2])
-  print_plot_partition(figures_dir, eps_gnmax[2], partition_gmax[2],
-                       answered_gnmax[2], orders_opt_gnmax[2])
   plt.close('all')
...
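The variance accumulator in run_analysis estimates the spread of the cumulative privacy cost from two running sums, using Var[X] = E[X^2] - (E[X])^2 per query and scaling by the number of queries. A small self-contained sketch of the same estimate; `std_of_total_eps` is a hypothetical helper name:

```python
import numpy as np


def std_of_total_eps(rdp_costs):
  """Std of the summed per-query RDP costs, treated as independent draws.

  Tracks a sum and a sum of squares (as run_analysis does), then applies
  Var[X] = E[X^2] - (E[X])^2; Bessel's correction is ignored, as in the
  script.
  """
  rdp_costs = np.asarray(rdp_costs, dtype=float)
  n = len(rdp_costs)
  var_per_query = np.sum(rdp_costs ** 2) / n - (np.sum(rdp_costs) / n) ** 2
  return float(np.sqrt(n * var_per_query))  # Std of the sum of n costs.
```

Keeping only the two running sums lets the script report this estimate at any point of the pass over the votes without storing per-query costs.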
utility_queries_answered.py (new file):

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

from absl import app
from absl import flags
import matplotlib

matplotlib.use('TkAgg')
import matplotlib.pyplot as plt  # pylint: disable=g-import-not-at-top

plt.style.use('ggplot')

FLAGS = flags.FLAGS
flags.DEFINE_string('plot_file', '', 'Output file name.')

# list() keeps the concatenation valid under Python 3, where range is lazy.
qa_lnmax = [500, 750] + list(range(1000, 12500, 500))

acc_lnmax = [43.3, 52.3, 59.8, 66.7, 68.8, 70.5, 71.6, 72.3, 72.6, 72.9, 73.4,
             73.4, 73.7, 73.9, 74.2, 74.4, 74.5, 74.7, 74.8, 75, 75.1, 75.1,
             75.4, 75.4, 75.4]

qa_gnmax = [456, 683, 908, 1353, 1818, 2260, 2702, 3153, 3602, 4055, 4511, 4964,
            5422, 5875, 6332, 6792, 7244, 7696, 8146, 8599, 9041, 9496, 9945,
            10390, 10842]

acc_gnmax = [39.6, 52.2, 59.6, 66.6, 69.6, 70.5, 71.8, 72, 72.7, 72.9, 73.3,
             73.4, 73.4, 73.8, 74, 74.2, 74.4, 74.5, 74.5, 74.7, 74.8, 75, 75.1,
             75.1, 75.4]

qa_gnmax_aggressive = [167, 258, 322, 485, 647, 800, 967, 1133, 1282, 1430,
                       1573, 1728, 1889, 2028, 2190, 2348, 2510, 2668, 2950,
                       3098, 3265, 3413, 3581, 3730]

acc_gnmax_aggressive = [17.8, 26.8, 39.3, 48, 55.7, 61, 62.8, 64.8, 65.4, 66.7,
                        66.2, 68.3, 68.3, 68.7, 69.1, 70, 70.2, 70.5, 70.9,
                        70.7, 71.3, 71.3, 71.3, 71.8]


def main(argv):
  del argv  # Unused.

  plt.close('all')
  fig, ax = plt.subplots()
  fig.set_figheight(4.7)
  fig.set_figwidth(5)
  ax.plot(qa_lnmax, acc_lnmax, color='r', ls='--', linewidth=5., marker='o',
          alpha=.5, label='LNMax')
  ax.plot(qa_gnmax, acc_gnmax, color='g', ls='-', linewidth=5., marker='o',
          alpha=.5, label='Confident-GNMax')
  # ax.plot(qa_gnmax_aggressive, acc_gnmax_aggressive, color='b', ls='-',
  #         marker='o', alpha=.5, label='Confident-GNMax (aggressive)')

  plt.xticks([0, 2000, 4000, 6000])
  plt.xlim([0, 6000])
  # ax.set_yscale('log')
  plt.ylim([65, 76])
  ax.tick_params(labelsize=14)
  plt.xlabel('Number of queries answered', fontsize=16)
  plt.ylabel('Student test accuracy (%)', fontsize=16)
  plt.legend(loc=2, prop={'size': 16})

  x = [400, 2116, 4600, 4680]
  y = [69.5, 68.5, 74, 72.5]
  annotations = [0.76, 2.89, 1.42, 5.76]
  color_annotations = ['g', 'r', 'g', 'r']
  for i, txt in enumerate(annotations):
    ax.annotate(r'${\varepsilon=}$' + str(txt), (x[i], y[i]), fontsize=16,
                color=color_annotations[i])

  plot_filename = os.path.expanduser(FLAGS.plot_file)
  plt.savefig(plot_filename, bbox_inches='tight')
  plt.show()


if __name__ == '__main__':
  app.run(main)
@@ -119,7 +119,7 @@ def compute_logq0_gnmax(sigma, order):
       pate.rdp_data_independent_gaussian(sigma, order))
 
   # Natural upper bounds on q0.
-  logub = min(-(1 + 1. / sigma)**2, -((order - 1) / sigma)**2, -1 / sigma**2)
+  logub = min(-(1 + 1. / sigma)**2, -((order - .99) / sigma)**2, -1 / sigma**2)
   assert _check_validity_conditions(logub)
 
   # If data-dependent bound is already better, we are done already.
...