# baukit

Install using `pip install git+https://github.com/davidbau/baukit`.

Provides the `baukit` package, a kit of David's secret tools to help
with productive research prototyping with pytorch.

Includes:
 * Methods for tracing and editing internal activations in a network.
 * Interactive UI widgets for quick data exploration in a notebook.
 * Online algorithms for computing running stats in pytorch.
 * Fast and feature-rich data set objects for images and text.
 * Utilities for simplifying the task of running many batch jobs.

Full details can be found by reading the code.
Here is a partial overview:

## Trace library

`Trace`, `TraceDict`, `subsequence`, `replace_module`; these simplify
the work of analyzing and altering internal computations of deep
networks.  A short example of tracing a specific layer in `net`:

```
from baukit import Trace
with Trace(net, 'layer.name') as ret:
    _ = net(inp)
    representation = ret.output
```

Read the [nethook Trace source code](https://github.com/davidbau/baukit/blob/main/baukit/nethook.py) for more information.

## Widget library

`show` is a feature-rich alternative to Jupyter notebook `display`;
it allows for quickly producing HTML layouts by arranging data and
images in nested python arrays, and it knows how to directly display
PIL images, matplotlib figure objects, and interactive widgets.
HTML elements, attributes, and CSS styles can be controlled with
functions like `show.style(color='red')`.

```
from baukit import show
show([[show.style(color=c), c] for c in ['red', 'green', 'blue']])
```

There is a [notebook here](https://github.com/davidbau/baukit/blob/main/notebooks/using_show_and_widgets.ipynb) that shows off ways to use `show()`.

`show` works with a set of `Widget` subclasses such as, `Textbox`,
`Numberbox`, `Range`, `Menu`, `PlotWidget`, `PaintWidget` that provide
data-bound reactive objects for quickly making interactive
HTML visualizations that work in a Jupyter or Colab notebook.  For
example, instad of using `matplotlib` directly to just draw a picture
of a plot, you can lay out interactive widget:

```
from baukit import PlotWidget, Range, show
import numpy
def how_to_draw_my_plot(fig, amp=1.0, freq=1.0):
    [ax] = fig.axes
    ax.clear()
    x = numpy.linspace(0, 5, 100)
    ax.plot(x, amp * numpy.sin(freq * x))
						   
plot = PlotWidget(how_to_draw_my_plot, figsize=(5, 5))
ra = Range(min=0.0, max=2.0, step=0.1, value=plot.prop('amp'))
rf = Range(min=0.1, max=20.0, step=0.1, value=plot.prop('freq'))
show([plot, [show.style(textAlign='right'), 'Amp', ra,
             show.style(textAlign='right'),  'Freq', rf]])
```

This code shows the plot in a layout with two sliders.  If you later
execute the code `plot.freq = 5.0`, the plot will update live, in-place,
to show the new curve, and the freq slider will also move to 5.  And
of course, dragging the slider will also change the values live.

The [labwidget source code](https://github.com/davidbau/baukit/blob/main/baukit/labwidget.py) has much more detail.

## Online statistics library

`Covariance`, `Mean`, `Quantile`, `TopK`, and other data summarization
methods are provided as online, gpu-optimized algorithms.

```
from baukit import Quantile, Topk, CombinedStat, tally
cs = CombinedStat(
    qc=Quantile(),
    tk=TopK(),
)
ds = MyDataset()
# Loads from my_stats.npz if already computed.
for [batch] in tally(cs, ds, cache='my_stats.npz', batch_size=50):
    batch.cuda()
    # Assumes dim=0 is the sampling axis; stats are per dim=1 feature.
    stat.add(batch)
cs.to_('cpu')
median = cs.qc.quantile(0.5)
top_values, top_indexes = cs.tk.topk(10)
```

The [runningstats source code](https://github.com/davidbau/baukit/blob/main/baukit/runningstats.py) shows other things you can do.

## Improved basic dataset objects

`ImageFolderSet` is faster and provides more features than
pytorch `ImageFolder` including the ability to gather multiple
streams of parallel data tensors (such as segmentations and images).

`TokenizedDataset` tokenizes text through a provided tokenizer,
producing dictionaries designed to feed directly into `huggingface`
language models.  It works with `length_collation` for creating
uniform-length batches for fast training and inference.

## Batch job utilities

`pbar` is a more readable progress bar utility wrapper around `tqdm`
that simplifies the display of progress status strings during a
long progress operation; it also provides a way for a caller to
slience progress output.

`reserve_dir` reserves a directory for results of a job and grabs a lock
so that other proceses running `reserve_dir` will not do the same job.
This allows very simple batch parallelism: just run many processes
that run all the jobs, and each job will only be done once.

`WorkerPool` simplifies creation of worker threads for consuming output
data; this can dramatically speed up writing of many output files
and is the output analogue of the torch DataLoader utility for inputs.