From cfd9b2b9db287b81ccd33b28788bd6583567d4dd Mon Sep 17 00:00:00 2001
From: David Bau
Date: Sat, 2 Apr 2022 18:16:08 -0400
Subject: [PATCH] Update readme.

---
 README.md | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/README.md b/README.md
index b15e788..a57c85c 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,110 @@
 A kit of David's secret tools to help with productive research prototyping
 with pytorch. Includes:
 
+ * Methods for tracing and editing internal activations in a network.
  * Interactive UI widgets for quick data exploration in a notebook.
  * Online algorithms for computing running stats in pytorch.
  * Fast and feature-rich data set objects for images and text.
  * Utilities for simplifying the task of running many batch jobs.
+
+## Trace library
+
+`Trace`, `TraceDict`, `subsequence`, `replace_module`; these simplify
+the work of analyzing and altering internal computations of deep
+networks. A short example of tracing a specific layer in `net`:
+
+```
+from baukit import Trace
+with Trace(net, 'layer.name') as ret:
+    _ = net(inp)
+    representation = ret.output
+```
+
+## Widget library
+
+`show` is a vastly improved alternative to Jupyter notebook `display`;
+it allows for quickly producing HTML layouts by arranging data and
+images in nested python arrays. And HTML elements, attributes, and
+CSS styles can be controlled with functions like `show.style(color='red')`.
+
+```
+from baukit import show
+show([[show.style(color=c), c] for c in ['red', 'green', 'blue']])
+```
+
+`show` works with a set of `Widget` subclasses such as `Textbox`,
+`Numberbox`, `Range`, `Menu`, `PlotWidget`, `PaintWidget` that provide
+data-bound reactive objects for quickly making interactive
+HTML visualizations that work in a Jupyter or Colab notebook.
+For example, instead of using `matplotlib` directly to just draw a picture
+of a plot, you can lay out an interactive widget:
+
+```
+from baukit import PlotWidget, Range, show
+import numpy
+def how_to_draw_my_plot(fig, amp=1.0, freq=1.0):
+    [ax] = fig.axes
+    ax.clear()
+    x = numpy.linspace(0, 5, 100)
+    ax.plot(x, amp * numpy.sin(freq * x))
+
+plot = PlotWidget(how_to_draw_my_plot, figsize=(5, 5))
+ra = Range(min=0.0, max=2.0, step=0.1, value=plot.prop('amp'))
+rf = Range(min=0.1, max=20.0, step=0.1, value=plot.prop('freq'))
+show([plot, [show.style(textAlign='right'), 'Amp', ra,
+             show.style(textAlign='right'), 'Freq', rf]])
+```
+
+This code shows the plot in a layout with two sliders. If you later
+execute the code `plot.freq = 5.0`, the plot will update live, in-place,
+to show the new curve, and the freq slider will also move to 5. And
+of course, dragging the slider will also change the values live.
+
+## Online statistics library
+
+`Covariance`, `Mean`, `Quantile`, `TopK`, and other data summarization
+methods are provided as online, gpu-optimized algorithms.
+
+```
+from baukit import Quantile, TopK, CombinedStat, tally
+cs = CombinedStat(
+    qc=Quantile(),
+    tk=TopK(),
+)
+ds = MyDataset()
+# Loads from my_stats.npz if already computed.
+for [batch] in tally(cs, ds, cache='my_stats.npz', batch_size=50):
+    batch = batch.cuda()
+    # Assumes dim=0 is the sampling axis; stats are per dim=1 feature.
+    cs.add(batch)
+cs.to_('cpu')
+median = cs.qc.quantile(0.5)
+top_values, top_indexes = cs.tk.topk(10)
+```
+
+## Improved basic dataset objects
+
+`ParallelImageFolder` is faster and provides more features than
+pytorch `ImageFolder` including the ability to gather multiple
+streams of parallel data tensors (such as segmentations and images).
+
+`TokenizedDataset` tokenizes text through a provided tokenizer,
+producing dictionaries designed to feed directly into `huggingface`
+language models.
+It works with `length_collation` for creating uniform-length batches for fast training and inference.
+
+## Batch job utilities
+
+`pbar` is a more readable progress bar utility wrapper around `tqdm`
+that simplifies the display of progress status strings during a
+long-running operation; it also provides a way for a caller to
+silence progress output.
+
+`reserve_dir` reserves a directory for results of a job and grabs a lock
+so that other processes running `reserve_dir` will not do the same job.
+This allows very simple batch parallelism: just run many processes
+that run all the jobs, and each job will only be done once.
+
+`WorkerPool` simplifies creation of worker threads for consuming output
+data; this can dramatically speed up writing of many output files
+and is the output analogue of the torch DataLoader utility for inputs.
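
A note on the `reserve_dir` paragraph above: the "run every job in every process" pattern depends on each job being claimed atomically. The following is a minimal sketch of that idea using only the Python standard library. It is not baukit's implementation, and `try_reserve`, `run_all_jobs`, the `lockfile` name, and `result.txt` are hypothetical names chosen for illustration.

```python
import os
import tempfile


def try_reserve(results_dir):
    """Return True if this caller claimed results_dir, False if it was already claimed.

    Hypothetical helper for illustration only; not baukit's `reserve_dir`.
    """
    os.makedirs(results_dir, exist_ok=True)
    lock_path = os.path.join(results_dir, "lockfile")
    try:
        # O_CREAT | O_EXCL makes file creation atomic: exactly one caller succeeds.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # Record the owner for debugging.
    return True


def run_all_jobs(root, job_names):
    """Attempt every job; run only those not yet claimed. Returns the jobs run here."""
    done = []
    for name in job_names:
        job_dir = os.path.join(root, name)
        if not try_reserve(job_dir):
            continue  # Another caller has, or had, this job.
        with open(os.path.join(job_dir, "result.txt"), "w") as f:
            f.write(f"output of {name}\n")
        done.append(name)
    return done


root = tempfile.mkdtemp()
jobs = ["job_a", "job_b", "job_c"]
first = run_all_jobs(root, jobs)   # Claims and runs every job.
second = run_all_jobs(root, jobs)  # Finds every job already claimed.
```

Here `first` lists all three jobs and `second` is empty, because each job directory was already claimed. A real utility would also need to deal with stale locks left behind by crashed processes, which this sketch does not attempt.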