Commit e6082458 authored by Ilya Mironov, committed by Karmel Allison

Two new plot-generating scripts and changes in support of the ICLR poster. (#4092)

* + plot_partition.py (budget allocation with smooth sensitivity).
+ utility_queries_answered.py (Figure 1, left).
+ several changes to the scripts required for the ICLR poster.

* Adding Ananth Raghunathan (pseudorandom@) to the list of owners of differential_privacy.

* Fixing the owners list of differential_privacy.
parent db778817
......@@ -11,7 +11,7 @@
/research/compression/ @nmjohn
/research/deeplab/ @aquariusjay @yknzhu @gpapan
/research/delf/ @andrefaraujo
/research/differential_privacy/ @panyx0718 @mironov
/research/differential_privacy/ @ilyamironov @ananthr
/research/domain_adaptation/ @bousmalis @dmrd
/research/gan/ @joel-shor
/research/im2txt/ @cshallue
......
......@@ -11,8 +11,8 @@ Erlingsson (ICLR 2018, https://arxiv.org/abs/1802.08908).
* numpy
* scipy
* sympy (for smooth sensitivity analysis)
* write access to current directory (otherwise, output directories in download.py and *.sh scripts
must be changed)
* write access to the current directory (otherwise, output directories in download.py and *.sh
scripts must be changed)
## Reproducing Figures 1 and 5, and Table 2
......@@ -27,30 +27,33 @@ For Table 2 run (may take several hours)\
`$ sh generate_table.sh`\
The output is written to the console.
For data-independent bounds (for comparing with Table 2), run\
For data-independent bounds (for comparison with Table 2), run\
`$ sh generate_table_data_independent.sh`\
The output is written to the console.
## Files in this directory
* generate_figures.sh --- Master script for generating Figures 1 and 5.
* generate_figures.sh — Master script for generating Figures 1 and 5.
* generate_table.sh --- Master script for generating Table 2.
* generate_table.sh — Master script for generating Table 2.
* generate_table_data_independent.sh --- Master script for computing data-independent
* generate_table_data_independent.sh — Master script for computing data-independent
bounds.
* rdp_bucketized.py --- Script for producing Figures 1 (right) and 5 (right).
* rdp_bucketized.py — Script for producing Figure 1 (right) and Figure 5 (right).
* rdp_cumulative.py --- Script for producing Figure 1 (left, middle), Figure 5
(left), and partition.pdf (a detailed breakdown of privacy costs per
source).
* rdp_cumulative.py — Script for producing Figure 1 (middle) and Figure 5 (left).
* smooth_sensitivity_table.py --- Script for generating Table 2.
* smooth_sensitivity_table.py — Script for generating Table 2.
* utility_queries_answered.py — Script for producing Figure 1 (left).
* plot_partition.py — Script for producing partition.pdf, a detailed breakdown of privacy
costs for Confident-GNMax with smooth sensitivity analysis (takes ~50 hours).
* rdp_flow.py and plot_ls_q.py are currently not used.
* download.py --- Utility script for populating the data/ directory.
* download.py — Utility script for populating the data/ directory.
All Python files take flags. Run script_name.py --help for help on flags.
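For example, a hypothetical invocation of plot_partition.py (the flag values below are illustrative, mirroring the Confident-GNMax parameters used elsewhere in these scripts):\
`$ python plot_partition.py --counts_file=data/glyph_5000_teachers.npy --threshold=5000 --sigma1=1500 --sigma2=100 --figures_dir=figures/`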
......@@ -17,8 +17,6 @@
counts_file="data/glyph_5000_teachers.npy"
output_dir="figures/"
executable1="python rdp_bucketized.py"
executable2="python rdp_cumulative.py"
mkdir -p $output_dir
......@@ -27,18 +25,19 @@ if [ ! -d "$output_dir" ]; then
exit 1
fi
$executable1 \
python rdp_bucketized.py \
--plot=small \
--counts_file=$counts_file \
--plot_file=$output_dir"noisy_thresholding_check_perf.pdf"
$executable1 \
python rdp_bucketized.py \
--plot=large \
--counts_file=$counts_file \
--plot_file=$output_dir"noisy_thresholding_check_perf_details.pdf"
$executable2 \
python rdp_cumulative.py \
--cache=False \
--counts_file=$counts_file \
--figures_dir=$output_dir
python utility_queries_answered.py --plot_file=$output_dir"utility_queries_answered.pdf"
\ No newline at end of file
"""Produces two plots. One compares aggregators and their analyses. The other
illustrates sources of privacy loss for Confident-GNMax.
A script in support of the paper "Scalable Private Learning with PATE" by
Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar,
Ulfar Erlingsson (https://arxiv.org/abs/1802.08908).
The input is a file containing a numpy array of votes, one query per row, one
class per column. Ex:
43, 1821, ..., 3
31, 16, ..., 0
...
0, 86, ..., 438
The output is written to a specified directory and consists of two files.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import math
import os
import pickle
import sys
sys.path.append('..') # Main modules reside in the parent directory.
from absl import app
from absl import flags
from collections import namedtuple
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt # pylint: disable=g-import-not-at-top
import numpy as np
import core as pate
import smooth_sensitivity as pate_ss
plt.style.use('ggplot')
FLAGS = flags.FLAGS
flags.DEFINE_boolean('cache', False,
'Read results of privacy analysis from cache.')
flags.DEFINE_string('counts_file', None, 'Counts file.')
flags.DEFINE_string('figures_dir', '', 'Path where figures are written to.')
flags.DEFINE_float('threshold', None, 'Threshold for step 1 (selection).')
flags.DEFINE_float('sigma1', None, 'Sigma for step 1 (selection).')
flags.DEFINE_float('sigma2', None, 'Sigma for step 2 (argmax).')
flags.DEFINE_integer('queries', None, 'Number of queries made by the student.')
flags.DEFINE_float('delta', 1e-8, 'Target delta.')
flags.mark_flag_as_required('counts_file')
flags.mark_flag_as_required('threshold')
flags.mark_flag_as_required('sigma1')
flags.mark_flag_as_required('sigma2')
Partition = namedtuple('Partition', ['step1', 'step2', 'ss', 'delta'])
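# Breakdown of the cumulative eps at the optimal order: the selection step
# (step1), the argmax step (step2), the smooth-sensitivity noise (ss), and
# the -log(delta) / (order - 1) conversion term (delta).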
def analyze_gnmax_conf_data_ind(votes, threshold, sigma1, sigma2, delta):
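"""Computes data-independent RDP bounds for (Confident-)GNMax.

Passing threshold=None and sigma1=None analyzes the simple (non-confident)
aggregator. Returns the cumulative eps and the expected number of answered
queries after each query.
"""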
orders = np.logspace(np.log10(1.5), np.log10(500), num=100)
n = votes.shape[0]
rdp_total = np.zeros(len(orders))
answered_total = 0
answered = np.zeros(n)
eps_cum = np.full(n, None, dtype=float)
for i in range(n):
v = votes[i,]
if threshold is not None and sigma1 is not None:
q_step1 = np.exp(pate.compute_logpr_answered(threshold, sigma1, v))
rdp_total += pate.rdp_data_independent_gaussian(sigma1, orders)
else:
q_step1 = 1. # always answer
answered_total += q_step1
answered[i] = answered_total
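# RDP composes additively; the argmax step is charged in expectation,
# weighted by the probability q_step1 that the query is answered.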
rdp_total += q_step1 * pate.rdp_data_independent_gaussian(sigma2, orders)
eps_cum[i], order_opt = pate.compute_eps_from_delta(orders, rdp_total,
delta)
if i > 0 and (i + 1) % 1000 == 0:
print('queries = {}, E[answered] = {:.2f}, E[eps] = {:.3f} '
'at order = {:.2f}.'.format(
i + 1,
answered[i],
eps_cum[i],
order_opt))
sys.stdout.flush()
return eps_cum, answered
def analyze_gnmax_conf_data_dep(votes, threshold, sigma1, sigma2, delta):
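"""Computes data-dependent RDP bounds for Confident-GNMax with smooth
sensitivity analysis.

Returns per-query partitioned eps (Partition tuples), cumulative expected
numbers of answered queries, standard deviations of the smooth-sensitivity
noise, and optimal Renyi orders.
"""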
# Short list of orders.
# orders = np.round(np.logspace(np.log10(20), np.log10(200), num=20))
# Long list of orders.
orders = np.concatenate((np.arange(20, 40, .2),
np.arange(40, 75, .5),
np.logspace(np.log10(75), np.log10(200), num=20)))
n = votes.shape[0]
num_classes = votes.shape[1]
num_teachers = int(sum(votes[0,]))
if threshold is not None and sigma1 is not None:
is_data_ind_step1 = pate.is_data_independent_always_opt_gaussian(
num_teachers, num_classes, sigma1, orders)
else:
is_data_ind_step1 = [True] * len(orders)
is_data_ind_step2 = pate.is_data_independent_always_opt_gaussian(
num_teachers, num_classes, sigma2, orders)
eps_partitioned = np.full(n, None, dtype=Partition)
order_opt = np.full(n, None, dtype=float)
ss_std_opt = np.full(n, None, dtype=float)
answered = np.zeros(n)
rdp_step1_total = np.zeros(len(orders))
rdp_step2_total = np.zeros(len(orders))
ls_total = np.zeros((len(orders), num_teachers))
answered_total = 0
for i in range(n):
v = votes[i,]
if threshold is not None and sigma1 is not None:
logq_step1 = pate.compute_logpr_answered(threshold, sigma1, v)
rdp_step1_total += pate.compute_rdp_threshold(logq_step1, sigma1, orders)
else:
logq_step1 = 0. # always answer
pr_answered = np.exp(logq_step1)
logq_step2 = pate.compute_logq_gaussian(v, sigma2)
rdp_step2_total += pr_answered * pate.rdp_gaussian(logq_step2, sigma2,
orders)
answered_total += pr_answered
rdp_ss = np.zeros(len(orders))
ss_std = np.zeros(len(orders))
for j, order in enumerate(orders):
if not is_data_ind_step1[j]:
ls_step1 = pate_ss.compute_local_sensitivity_bounds_threshold(v,
num_teachers, threshold, sigma1, order)
else:
ls_step1 = np.full(num_teachers, 0, dtype=float)
if not is_data_ind_step2[j]:
ls_step2 = pate_ss.compute_local_sensitivity_bounds_gnmax(
v, num_teachers, sigma2, order)
else:
ls_step2 = np.full(num_teachers, 0, dtype=float)
ls_total[j,] += ls_step1 + pr_answered * ls_step2
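# Smoothing parameter for smooth sensitivity, apparently chosen so that
# beta_ss * order stays just under 1/2; sigma_ss scales the Gaussian noise
# in proportion to the computed smooth sensitivity ss.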
beta_ss = .49 / order
ss = pate_ss.compute_discounted_max(beta_ss, ls_total[j,])
sigma_ss = ((order * math.exp(2 * beta_ss)) / ss) ** (1 / 3)
rdp_ss[j] = pate_ss.compute_rdp_of_smooth_sensitivity_gaussian(
beta_ss, sigma_ss, order)
ss_std[j] = ss * sigma_ss
rdp_total = rdp_step1_total + rdp_step2_total + rdp_ss
answered[i] = answered_total
_, order_opt[i] = pate.compute_eps_from_delta(orders, rdp_total, delta)
order_idx = np.searchsorted(orders, order_opt[i])
# Since optimal orders are always non-increasing, shrink orders array
# and all cumulative arrays to speed up computation.
if order_idx < len(orders):
orders = orders[:order_idx + 1]
rdp_step1_total = rdp_step1_total[:order_idx + 1]
rdp_step2_total = rdp_step2_total[:order_idx + 1]
eps_partitioned[i] = Partition(step1=rdp_step1_total[order_idx],
step2=rdp_step2_total[order_idx],
ss=rdp_ss[order_idx],
delta=-math.log(delta) / (order_opt[i] - 1))
ss_std_opt[i] = ss_std[order_idx]
if i > 0 and (i + 1) % 10 == 0:
print('queries = {}, E[answered] = {:.2f}, E[eps] = {:.3f} +/- {:.3f} '
'at order = {:.2f}. Contributions: delta = {:.3f}, step1 = {:.3f}, '
'step2 = {:.3f}, ss = {:.3f}'.format(
i + 1,
answered[i],
sum(eps_partitioned[i]),
ss_std_opt[i],
order_opt[i],
eps_partitioned[i].delta,
eps_partitioned[i].step1,
eps_partitioned[i].step2,
eps_partitioned[i].ss))
sys.stdout.flush()
return eps_partitioned, answered, ss_std_opt, order_opt
def plot_comparison(figures_dir, simple_ind, conf_ind, simple_dep, conf_dep):
"""Plots variants of GNMax algorithm and their analyses.
"""
def pivot(x_axis, eps, answered):
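# Resamples the cumulative eps curve onto a uniform grid of (expected)
# numbers of answered queries.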
y = np.full(len(x_axis), None, dtype=float)  # eps at x answered queries
for i, x in enumerate(x_axis):
idx = np.searchsorted(answered, x)
if idx < len(eps):
y[i] = eps[idx]
return y
def pivot_dep(x_axis, data_dep):
eps_partitioned, answered, _, _ = data_dep
eps = [sum(p) for p in eps_partitioned] # Flatten eps
return pivot(x_axis, eps, answered)
xlim = 10000
x_axis = range(0, xlim, 10)
y_simple_ind = pivot(x_axis, *simple_ind)
y_conf_ind = pivot(x_axis, *conf_ind)
y_simple_dep = pivot_dep(x_axis, simple_dep)
y_conf_dep = pivot_dep(x_axis, conf_dep)
# plt.close('all')
fig, ax = plt.subplots()
fig.set_figheight(4.5)
fig.set_figwidth(4.7)
ax.plot(x_axis, y_simple_ind, ls='--', color='r', lw=3, label=r'Simple GNMax, data-ind analysis')
ax.plot(x_axis, y_conf_ind, ls='--', color='b', lw=3, label=r'Confident GNMax, data-ind analysis')
ax.plot(x_axis, y_simple_dep, ls='-', color='r', lw=3, label=r'Simple GNMax, data-dep analysis')
ax.plot(x_axis, y_conf_dep, ls='-', color='b', lw=3, label=r'Confident GNMax, data-dep analysis')
plt.xticks(np.arange(0, xlim + 1000, 2000))
plt.xlim([0, xlim])
plt.ylim(bottom=0)
plt.legend(fontsize=16)
ax.set_xlabel('Number of queries answered', fontsize=16)
ax.set_ylabel(r'Privacy cost $\varepsilon$ at $\delta=10^{-8}$', fontsize=16)
ax.tick_params(labelsize=14)
plot_filename = os.path.join(figures_dir, 'comparison.pdf')
print('Saving the graph to ' + plot_filename)
fig.savefig(plot_filename, bbox_inches='tight')
plt.show()
def plot_partition(figures_dir, gnmax_conf, print_order):
"""Plots an expert version of the privacy-per-answered-query graph.
Args:
figures_dir: Name of the directory where the plot is saved.
gnmax_conf: The (eps_partitioned, answered, ss_std_opt, order_opt) tuple
returned by analyze_gnmax_conf_data_dep.
print_order: If True, the optimal Renyi order is plotted on a secondary
axis.
"""
eps_partitioned, answered, ss_std_opt, order_opt = gnmax_conf
xlim = 10000
x = range(0, int(xlim), 10)
lenx = len(x)
y0 = np.full(lenx, np.nan, dtype=float) # delta
y1 = np.full(lenx, np.nan, dtype=float) # delta + step1
y2 = np.full(lenx, np.nan, dtype=float) # delta + step1 + step2
y3 = np.full(lenx, np.nan, dtype=float) # delta + step1 + step2 + ss
noise_std = np.full(lenx, np.nan, dtype=float)
y_right = np.full(lenx, np.nan, dtype=float)
for i in range(lenx):
idx = np.searchsorted(answered, x[i])
if idx < len(eps_partitioned):
y0[i] = eps_partitioned[idx].delta
y1[i] = y0[i] + eps_partitioned[idx].step1
y2[i] = y1[i] + eps_partitioned[idx].step2
y3[i] = y2[i] + eps_partitioned[idx].ss
noise_std[i] = ss_std_opt[idx]
y_right[i] = order_opt[idx]
# plt.close('all')
fig, ax = plt.subplots()
fig.set_figheight(4.5)
fig.set_figwidth(4.7)
l1 = ax.plot(
x, y3, color='b', ls='-', label=r'Total privacy cost', linewidth=1).pop()
for y in (y0, y1, y2):
ax.plot(x, y, color='b', ls='-', label=r'_nolegend_', alpha=.5, linewidth=1)
ax.fill_between(x, [0] * lenx, y0.tolist(), facecolor='b', alpha=.5)
ax.fill_between(x, y0.tolist(), y1.tolist(), facecolor='b', alpha=.4)
ax.fill_between(x, y1.tolist(), y2.tolist(), facecolor='b', alpha=.3)
ax.fill_between(x, y2.tolist(), y3.tolist(), facecolor='b', alpha=.2)
ax.fill_between(x, (y3 - noise_std).tolist(), (y3 + noise_std).tolist(),
facecolor='r', alpha=.5)
plt.xticks(np.arange(0, xlim + 1000, 2000))
plt.xlim([0, xlim])
ax.set_ylim([0, 3.])
ax.set_xlabel('Number of queries answered', fontsize=10)
ax.set_ylabel(r'Privacy cost $\varepsilon$ at $\delta=10^{-8}$', fontsize=10)
# Merging legends.
if print_order:
ax2 = ax.twinx()
l2 = ax2.plot(
x, y_right, 'r', ls='-', label=r'Optimal order', linewidth=5,
alpha=.5).pop()
ax2.grid(False)
ax2.set_ylabel(r'Optimal Renyi order', fontsize=16)
ax2.set_ylim([0, 200.])
ax.legend((l1, l2), (l1.get_label(), l2.get_label()), loc=0, fontsize=13)
ax.tick_params(labelsize=10)
plot_filename = os.path.join(figures_dir, 'partition.pdf')
print('Saving the graph to ' + plot_filename)
fig.savefig(plot_filename, bbox_inches='tight', dpi=800)
plt.show()
def run_all_analyses(votes, threshold, sigma1, sigma2, delta):
simple_ind = analyze_gnmax_conf_data_ind(votes, None, None, sigma2,
delta)
conf_ind = analyze_gnmax_conf_data_ind(votes, threshold, sigma1, sigma2,
delta)
simple_dep = analyze_gnmax_conf_data_dep(votes, None, None, sigma2,
delta)
conf_dep = analyze_gnmax_conf_data_dep(votes, threshold, sigma1, sigma2,
delta)
return (simple_ind, conf_ind, simple_dep, conf_dep)
def run_or_load_all_analyses():
temp_filename = os.path.expanduser('~/tmp/partition_cached.pkl')
if FLAGS.cache and os.path.isfile(temp_filename):
print('Reading from cache ' + temp_filename)
with open(temp_filename, 'rb') as f:
all_analyses = pickle.load(f)
else:
fin_name = os.path.expanduser(FLAGS.counts_file)
print('Reading raw votes from ' + fin_name)
sys.stdout.flush()
votes = np.load(fin_name)
if FLAGS.queries is not None:
if votes.shape[0] < FLAGS.queries:
raise ValueError('Expected at least {} rows, got {} in {}'.format(
FLAGS.queries, votes.shape[0], fin_name))
# Truncate the votes matrix to the number of queries made.
votes = votes[:FLAGS.queries, ]
all_analyses = run_all_analyses(votes, FLAGS.threshold, FLAGS.sigma1,
FLAGS.sigma2, FLAGS.delta)
print('Writing to cache ' + temp_filename)
with open(temp_filename, 'wb') as f:
pickle.dump(all_analyses, f)
return all_analyses
def main(argv):
del argv # Unused.
simple_ind, conf_ind, simple_dep, conf_dep = run_or_load_all_analyses()
figures_dir = os.path.expanduser(FLAGS.figures_dir)
plot_comparison(figures_dir, simple_ind, conf_ind, simple_dep, conf_dep)
plot_partition(figures_dir, conf_dep, False)
plt.close('all')
if __name__ == '__main__':
app.run(main)
......@@ -51,9 +51,11 @@ plt.style.use('ggplot')
FLAGS = flags.FLAGS
flags.DEFINE_enum('plot', 'small', ['small', 'large'], 'Selects which of '
'the two plots is produced.')
flags.DEFINE_string('counts_file', '', 'Counts file.')
flags.DEFINE_string('counts_file', None, 'Counts file.')
flags.DEFINE_string('plot_file', '', 'Plot file to write.')
flags.mark_flag_as_required('counts_file')
def compute_count_per_bin(bin_num, votes):
"""Tabulates number of examples in each bin.
......@@ -164,6 +166,8 @@ def main(argv):
m_check = compute_expected_answered_per_bin(bin_num, votes, 3500, 1500)
a_check = compute_expected_answered_per_bin(bin_num, votes, 5000, 1500)
eps = compute_privacy_cost_per_bins(bin_num, votes, 100, 50)
else:
raise ValueError('--plot flag must be one of ["small", "large"]')
counts = compute_count_per_bin(bin_num, votes)
bins = np.linspace(0, 100, num=bin_num, endpoint=False)
......@@ -171,14 +175,14 @@ def main(argv):
plt.close('all')
fig, ax = plt.subplots()
if FLAGS.plot == 'small':
fig.set_figheight(4.7)
fig.set_figheight(5)
fig.set_figwidth(5)
ax.bar(
bins,
counts,
20,
color='orangered',
linestyle='dashed',
linestyle='dotted',
linewidth=5,
edgecolor='red',
fill=False,
......@@ -189,7 +193,7 @@ def main(argv):
bins,
m_check,
20,
color='b',
color='g',
alpha=.5,
linewidth=0,
edgecolor='g',
......@@ -238,12 +242,13 @@ def main(argv):
ax2.set_ylabel(r'Per query privacy cost $\varepsilon$', fontsize=16)
plt.xlim([0, 100])
ax.set_ylim([0, 2500])
# ax.set_yscale('log')
ax.set_xlabel('Percentage of teachers that agree', fontsize=16)
ax.set_ylabel('Number of queries answered', fontsize=16)
vals = ax.get_xticks()
ax.set_xticklabels([str(int(x)) + '%' for x in vals])
ax.tick_params(labelsize=14)
ax.tick_params(labelsize=14, bottom=True, top=True, left=True, right=True)
ax.legend(loc=2, prop={'size': 16})
# simple: 'figures/noisy_thresholding_check_perf.pdf')
......
......@@ -41,6 +41,7 @@ sys.path.append('..') # Main modules reside in the parent directory.
from absl import app
from absl import flags
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt # pylint: disable=g-import-not-at-top
import numpy as np
......@@ -51,9 +52,10 @@ plt.style.use('ggplot')
FLAGS = flags.FLAGS
flags.DEFINE_boolean('cache', False,
'Read results of privacy analysis from cache.')
flags.DEFINE_string('counts_file', '', 'Counts file.')
flags.DEFINE_string('counts_file', None, 'Counts file.')
flags.DEFINE_string('figures_dir', '', 'Path where figures are written to.')
flags.mark_flag_as_required('counts_file')
def run_analysis(votes, mechanism, noise_scale, params):
"""Computes data-dependent privacy.
......@@ -90,26 +92,26 @@ def run_analysis(votes, mechanism, noise_scale, params):
n = votes.shape[0]
eps_total = np.zeros(n)
partition = np.full(n, None, dtype=object)
order_opt = np.full(n, None, dtype=float)
answered = np.zeros(n)
partition = [None] * n
order_opt = np.full(n, np.nan, dtype=float)
answered = np.zeros(n, dtype=float)
rdp_cum = np.zeros(len(orders))
rdp_sqrd_cum = np.zeros(len(orders))
rdp_select_cum = np.zeros(len(orders))
answered_sum = 0
for i in xrange(n):
for i in range(n):
v = votes[i,]
if mechanism == 'lnmax':
logq_lnmax = pate.compute_logq_laplace(v, noise_scale)
rdp_query = pate.rdp_pure_eps(logq_lnmax, 2. / noise_scale, orders)
rdp_sqrd = rdp_query**2
rdp_sqrd = rdp_query ** 2
pr_answered = 1
elif mechanism == 'gnmax':
logq_gmax = pate.compute_logq_gaussian(v, noise_scale)
rdp_query = pate.rdp_gaussian(logq_gmax, noise_scale, orders)
rdp_sqrd = rdp_query**2
rdp_sqrd = rdp_query ** 2
pr_answered = 1
elif mechanism == 'gnmax_conf':
logq_step1 = pate.compute_logpr_answered(params['t'], params['sigma1'], v)
......@@ -117,16 +119,19 @@ def run_analysis(votes, mechanism, noise_scale, params):
q_step1 = np.exp(logq_step1)
logq_step1_min = min(logq_step1, math.log1p(-q_step1))
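# The answer/abstain outcome of the noisy threshold check is binary, so the
# data-dependent bound is evaluated at the smaller of q and 1 - q.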
rdp_gnmax_step1 = pate.rdp_gaussian(logq_step1_min,
2**.5 * params['sigma1'], orders)
2 ** .5 * params['sigma1'], orders)
rdp_gnmax_step2 = pate.rdp_gaussian(logq_step2, noise_scale, orders)
rdp_query = rdp_gnmax_step1 + q_step1 * rdp_gnmax_step2
# The expression below evaluates
# E[(cost_of_step_1 + Bernoulli(pr_of_step_2) * cost_of_step_2)^2]
rdp_sqrd = (
rdp_gnmax_step1**2 + 2 * rdp_gnmax_step1 * q_step1 * rdp_gnmax_step2 +
q_step1 * rdp_gnmax_step2**2)
rdp_gnmax_step1 ** 2 + 2 * rdp_gnmax_step1 * q_step1 * rdp_gnmax_step2
+ q_step1 * rdp_gnmax_step2 ** 2)
rdp_select_cum += rdp_gnmax_step1
pr_answered = q_step1
else:
raise ValueError(
'Mechanism must be one of ["lnmax", "gnmax", "gnmax_conf"]')
rdp_cum += rdp_query
rdp_sqrd_cum += rdp_sqrd
......@@ -139,9 +144,9 @@ def run_analysis(votes, mechanism, noise_scale, params):
if i > 0 and (i + 1) % 1000 == 0:
rdp_var = rdp_sqrd_cum / i - (
rdp_cum / i)**2 # Ignore Bessel's correction.
rdp_cum / i) ** 2 # Ignore Bessel's correction.
order_opt_idx = np.searchsorted(orders, order_opt[i])
eps_std = ((i + 1) * rdp_var[order_opt_idx])**.5 # Std of the sum.
eps_std = ((i + 1) * rdp_var[order_opt_idx]) ** .5 # Std of the sum.
print(
'queries = {}, E[answered] = {:.2f}, E[eps] = {:.3f} (std = {:.5f}) '
'at order = {:.2f} (contribution from delta = {:.3f})'.format(
......@@ -163,10 +168,10 @@ def print_plot_small(figures_dir, eps_lap, eps_gnmax, answered_gnmax):
"""
xlim = 6000
x_axis = range(0, int(xlim), 10)
y_lap = np.zeros(len(x_axis))
y_gnmax = np.full(len(x_axis), None, dtype=float)
y_lap = np.zeros(len(x_axis), dtype=float)
y_gnmax = np.full(len(x_axis), np.nan, dtype=float)
for i in xrange(len(x_axis)):
for i in range(len(x_axis)):
x = x_axis[i]
y_lap[i] = eps_lap[x]
idx = np.searchsorted(answered_gnmax, x)
......@@ -200,7 +205,7 @@ def print_plot_small(figures_dir, eps_lap, eps_gnmax, answered_gnmax):
def print_plot_large(figures_dir, eps_lap, eps_gnmax1, answered_gnmax1,
eps_gnmax2, partition_gnmax2, answered_gnmax2):
eps_gnmax2, partition_gnmax2, answered_gnmax2):
"""Plots a graph of LNMax vs GNMax with two parameters.
Args:
......@@ -220,7 +225,7 @@ def print_plot_large(figures_dir, eps_lap, eps_gnmax1, answered_gnmax1,
y_gnmax2 = np.full(lenx, np.nan, dtype=float)
y1_gnmax2 = np.full(lenx, np.nan, dtype=float)
for i in xrange(lenx):
for i in range(lenx):
x = x_axis[i]
y_lap[i] = eps_lap[x]
idx1 = np.searchsorted(answered_gnmax1, x)
......@@ -289,86 +294,6 @@ def print_plot_large(figures_dir, eps_lap, eps_gnmax1, answered_gnmax1,
plt.show()
def print_plot_partition(figures_dir, eps, partition, answered, order_opt):
"""Plots an expert version of the privacy-per-answered-query graph.
Args:
figures_dir: A name of the directory where to save the plot.
eps: The cumulative privacy cost.
partition: Allocation of the privacy cost.
answered: Cumulative number of queries answered.
order_opt: The list of optimal orders.
"""
xlim = 6000
x = range(0, int(xlim), 10)
lenx = len(x)
y = np.full(lenx, np.nan, dtype=float)
y1 = np.full(lenx, np.nan, dtype=float)
y2 = np.full(lenx, np.nan, dtype=float)
y_right = np.full(lenx, np.nan, dtype=float)
for i in xrange(lenx):
idx = np.searchsorted(answered, x[i])
if idx < len(eps):
y[i] = eps[idx]
fraction_step1, _, fraction_delta = partition[idx]
y1[i] = fraction_delta * y[i]
y2[i] = (fraction_delta + fraction_step1) * y[i]
y_right[i] = order_opt[idx]
# plt.close('all')
fig, ax = plt.subplots()
fig.set_figheight(4.5)
fig.set_figwidth(4.7)
l1 = ax.plot(
x, y, color='b', ls='-', label=r'Total privacy cost', linewidth=5).pop()
ax.plot(x, y1, color='b', ls='-', label=r'_nolegend_', alpha=.5, linewidth=1)
ax.plot(x, y2, color='b', ls='-', label=r'_nolegend_', alpha=.5, linewidth=1)
ax.fill_between(x, [0] * lenx, y1.tolist(), facecolor='b', alpha=.1)
ax.fill_between(x, y1.tolist(), y2.tolist(), facecolor='b', alpha=.2)
ax.fill_between(x, y2.tolist(), y.tolist(), facecolor='b', alpha=.3)
t1 = 300
ax.text(x[t1], y1[t1] * .35, r'due to $\delta$', alpha=.5, fontsize=18)
ax.text(
x[t1],
y1[t1] + (y2[t1] - y1[t1]) * .6,
r'selection ($\sigma_1$)',
alpha=.5,
fontsize=18)
t2 = 550
ax.annotate(
r'answering ($\sigma_2$)',
xy=(x[t2], (y[t2] + y2[t2]) / 2),
xytext=(x[200], y[t2] * 1.1),
arrowprops=dict(facecolor='black', shrink=0.005, alpha=.5),
fontsize=18,
alpha=.5)
ax2 = ax.twinx()
l2 = ax2.plot(
x, y_right, 'r', ls='-', label=r'Optimal order', linewidth=5,
alpha=.5).pop()
ax2.grid(False)
ax2.set_ylabel(r'Optimal Renyi order', fontsize=16)
plt.xticks(np.arange(0, 7000, 1000))
plt.xlim([0, xlim])
ax.set_ylim([0, 1.])
ax2.set_ylim([0, 200.])
ax.set_xlabel('Number of queries answered', fontsize=16)
ax.set_ylabel(r'Privacy cost $\varepsilon$ at $\delta=10^{-8}$', fontsize=16)
# Merging legends.
ax.legend((l1, l2), (l1.get_label(), l2.get_label()), loc=0, fontsize=13)
ax.tick_params(labelsize=14)
fout_name = os.path.join(figures_dir, 'partition.pdf')
print('Saving the graph to ' + fout_name)
fig.savefig(fout_name, bbox_inches='tight')
plt.show()
def run_all_analyses(votes, lambda_laplace, gnmax_parameters, sigma2):
"""Sequentially runs all analyses.
......@@ -408,15 +333,15 @@ def main(argv):
# Parameters of the GNMax
gnmax_parameters = ({
't': 1000,
'sigma1': 500
}, {
't': 3500,
'sigma1': 1500
}, {
't': 5000,
'sigma1': 1500
})
't': 1000,
'sigma1': 500
}, {
't': 3500,
'sigma1': 1500
}, {
't': 5000,
'sigma1': 1500
})
sigma2 = 100 # GNMax parameters differ only in Step 1 (selection).
ftemp_name = '/tmp/precomputed.pkl'
......@@ -436,7 +361,7 @@ def main(argv):
(eps_lap, eps_gnmax, partition_gmax,
answered_gnmax, orders_opt_gnmax) = run_all_analyses(
votes, lambda_laplace, gnmax_parameters, sigma2)
votes, lambda_laplace, gnmax_parameters, sigma2)
print('Writing to cache ' + ftemp_name)
with open(ftemp_name, 'wb') as f:
......@@ -446,8 +371,6 @@ def main(argv):
print_plot_small(figures_dir, eps_lap, eps_gnmax[0], answered_gnmax[0])
print_plot_large(figures_dir, eps_lap, eps_gnmax[1], answered_gnmax[1],
eps_gnmax[2], partition_gmax[2], answered_gnmax[2])
print_plot_partition(figures_dir, eps_gnmax[2], partition_gmax[2],
answered_gnmax[2], orders_opt_gnmax[2])
plt.close('all')
......
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from absl import app
from absl import flags
import matplotlib
import os
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
plt.style.use('ggplot')
FLAGS = flags.FLAGS
flags.DEFINE_string('plot_file', '', 'Output file name.')
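# Hard-coded utility measurements: numbers of queries answered and the
# corresponding student test accuracies for LNMax, Confident-GNMax, and an
# aggressive Confident-GNMax variant (plotted only if uncommented in main).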
qa_lnmax = [500, 750] + list(range(1000, 12500, 500))
acc_lnmax = [43.3, 52.3, 59.8, 66.7, 68.8, 70.5, 71.6, 72.3, 72.6, 72.9, 73.4,
73.4, 73.7, 73.9, 74.2, 74.4, 74.5, 74.7, 74.8, 75, 75.1, 75.1,
75.4, 75.4, 75.4]
qa_gnmax = [456, 683, 908, 1353, 1818, 2260, 2702, 3153, 3602, 4055, 4511, 4964,
5422, 5875, 6332, 6792, 7244, 7696, 8146, 8599, 9041, 9496, 9945,
10390, 10842]
acc_gnmax = [39.6, 52.2, 59.6, 66.6, 69.6, 70.5, 71.8, 72, 72.7, 72.9, 73.3,
73.4, 73.4, 73.8, 74, 74.2, 74.4, 74.5, 74.5, 74.7, 74.8, 75, 75.1,
75.1, 75.4]
qa_gnmax_aggressive = [167, 258, 322, 485, 647, 800, 967, 1133, 1282, 1430,
1573, 1728, 1889, 2028, 2190, 2348, 2510, 2668, 2950,
3098, 3265, 3413, 3581, 3730]
acc_gnmax_aggressive = [17.8, 26.8, 39.3, 48, 55.7, 61, 62.8, 64.8, 65.4, 66.7,
66.2, 68.3, 68.3, 68.7, 69.1, 70, 70.2, 70.5, 70.9,
70.7, 71.3, 71.3, 71.3, 71.8]
def main(argv):
del argv # Unused.
plt.close('all')
fig, ax = plt.subplots()
fig.set_figheight(4.7)
fig.set_figwidth(5)
ax.plot(qa_lnmax, acc_lnmax, color='r', ls='--', linewidth=5., marker='o',
alpha=.5, label='LNMax')
ax.plot(qa_gnmax, acc_gnmax, color='g', ls='-', linewidth=5., marker='o',
alpha=.5, label='Confident-GNMax')
# ax.plot(qa_gnmax_aggressive, acc_gnmax_aggressive, color='b', ls='-', marker='o', alpha=.5, label='Confident-GNMax (aggressive)')
plt.xticks([0, 2000, 4000, 6000])
plt.xlim([0, 6000])
# ax.set_yscale('log')
plt.ylim([65, 76])
ax.tick_params(labelsize=14)
plt.xlabel('Number of queries answered', fontsize=16)
plt.ylabel('Student test accuracy (%)', fontsize=16)
plt.legend(loc=2, prop={'size': 16})
x = [400, 2116, 4600, 4680]
y = [69.5, 68.5, 74, 72.5]
annotations = [0.76, 2.89, 1.42, 5.76]
color_annotations = ['g', 'r', 'g', 'r']
for i, txt in enumerate(annotations):
ax.annotate(r'${\varepsilon=}$' + str(txt), (x[i], y[i]), fontsize=16,
color=color_annotations[i])
plot_filename = os.path.expanduser(FLAGS.plot_file)
plt.savefig(plot_filename, bbox_inches='tight')
plt.show()
if __name__ == '__main__':
app.run(main)
......@@ -119,7 +119,7 @@ def compute_logq0_gnmax(sigma, order):
pate.rdp_data_independent_gaussian(sigma, order))
# Natural upper bounds on q0.
logub = min(-(1 + 1. / sigma)**2, -((order - 1) / sigma)**2, -1 / sigma**2)
logub = min(-(1 + 1. / sigma)**2, -((order - .99) / sigma)**2, -1 / sigma**2)
assert _check_validity_conditions(logub)
# If data-dependent bound is already better, we are done already.
......