mirror of
https://github.com/wassname/baukit.git
synced 2026-06-27 16:14:33 +08:00
129 lines
4.9 KiB
Markdown
# baukit

Install using `pip install git+https://github.com/davidbau/baukit`.

Provides the `baukit` package, a kit of David's secret tools to help
with productive research prototyping with pytorch.

Includes:
 * Methods for tracing and editing internal activations in a network.
 * Interactive UI widgets for quick data exploration in a notebook.
 * Online algorithms for computing running stats in pytorch.
 * Fast and feature-rich data set objects for images and text.
 * Utilities for simplifying the task of running many batch jobs.

Full details can be found by reading the code.
Here is a partial overview:

## Trace library

`Trace`, `TraceDict`, `subsequence`, `replace_module`; these simplify
the work of analyzing and altering internal computations of deep
networks. A short example of tracing a specific layer in `net`:

```
from baukit import Trace
with Trace(net, 'layer.name') as ret:
    _ = net(inp)
    representation = ret.output
```

Read the [nethook Trace source code](https://github.com/davidbau/baukit/blob/main/baukit/nethook.py) for more information.

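The general idea of tracing can be sketched without PyTorch. The following is a minimal illustration, not baukit's implementation: a hypothetical `record_output` context manager temporarily wraps a named sub-callable, records what it returns, and restores the original on exit. The names `record_output`, `Recorded`, and `TinyNet` are invented for this example.

```python
from contextlib import contextmanager


class Recorded:
    """Holds the most recent output of the traced callable."""
    output = None


@contextmanager
def record_output(owner, name):
    # Hypothetical helper: temporarily replace owner.<name> with a wrapper
    # that stores the return value, then restore the original on exit.
    original = getattr(owner, name)
    rec = Recorded()

    def wrapper(*args, **kwargs):
        rec.output = original(*args, **kwargs)
        return rec.output

    setattr(owner, name, wrapper)
    try:
        yield rec
    finally:
        setattr(owner, name, original)


class TinyNet:
    """Stand-in for a network with two 'layers' (plain functions here)."""
    def __init__(self):
        self.layer1 = lambda x: x * 2
        self.layer2 = lambda x: x + 1

    def __call__(self, x):
        return self.layer2(self.layer1(x))


net = TinyNet()
with record_output(net, 'layer1') as ret:
    result = net(3)
hidden = ret.output  # The intermediate value produced by layer1.
```

The recorded value stays available after the `with` block ends, while the wrapped attribute is put back so later calls to `net` run untouched.
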
## Widget library

`show` is a feature-rich alternative to Jupyter notebook `display`;
it allows for quickly producing HTML layouts by arranging data and
images in nested python arrays, and it knows how to directly display
PIL images, matplotlib figure objects, and interactive widgets.
HTML elements, attributes, and CSS styles can be controlled with
functions like `show.style(color='red')`.

```
from baukit import show
show([[show.style(color=c), c] for c in ['red', 'green', 'blue']])
```

There is a [notebook here](https://github.com/davidbau/baukit/blob/main/notebooks/using_show_and_widgets.ipynb) that shows off ways to use `show()`.

`show` works with a set of `Widget` subclasses, such as `Textbox`,
`Numberbox`, `Range`, `Menu`, `PlotWidget`, and `PaintWidget`, that provide
data-bound reactive objects for quickly making interactive
HTML visualizations that work in a Jupyter or Colab notebook. For
example, instead of using `matplotlib` directly to just draw a picture
of a plot, you can lay out an interactive widget:

```
from baukit import PlotWidget, Range, show
import numpy
def how_to_draw_my_plot(fig, amp=1.0, freq=1.0):
    [ax] = fig.axes
    ax.clear()
    x = numpy.linspace(0, 5, 100)
    ax.plot(x, amp * numpy.sin(freq * x))

plot = PlotWidget(how_to_draw_my_plot, figsize=(5, 5))
ra = Range(min=0.0, max=2.0, step=0.1, value=plot.prop('amp'))
rf = Range(min=0.1, max=20.0, step=0.1, value=plot.prop('freq'))
show([plot, [show.style(textAlign='right'), 'Amp', ra,
             show.style(textAlign='right'), 'Freq', rf]])
```

This code shows the plot in a layout with two sliders. If you later
execute the code `plot.freq = 5.0`, the plot will update live, in-place,
to show the new curve, and the freq slider will also move to 5. And
of course, dragging the slider will also change the values live.

The [labwidget source code](https://github.com/davidbau/baukit/blob/main/baukit/labwidget.py) has much more detail.

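The two-way binding described in this section (setting `plot.freq` moves the slider, and dragging the slider redraws the plot) is an instance of the observer pattern. A minimal pure-Python sketch of that pattern follows; `Property`, `bind`, and `on_change` are hypothetical names for illustration and are not the labwidget API.

```python
class Property:
    """A value that notifies listeners whenever it changes."""
    def __init__(self, value):
        self._value = value
        self._listeners = []

    def get(self):
        return self._value

    def set(self, value):
        if value == self._value:
            return  # No change: this stops update loops between bound properties.
        self._value = value
        for listener in list(self._listeners):
            listener(value)

    def on_change(self, listener):
        self._listeners.append(listener)


def bind(a, b):
    """Keep two properties in sync in both directions."""
    b.set(a.get())
    a.on_change(b.set)
    b.on_change(a.set)


plot_freq = Property(1.0)     # Stands in for the plot's 'freq' parameter.
slider_value = Property(0.0)  # Stands in for the slider position.
redraws = []
plot_freq.on_change(lambda v: redraws.append(v))  # Stands in for a redraw.

bind(plot_freq, slider_value)
plot_freq.set(5.0)     # A programmatic update moves the slider too.
slider_value.set(2.5)  # Dragging the slider updates the plot too.
```

The equality check in `set` is what keeps the two bound properties from notifying each other forever.
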
## Online statistics library

`Covariance`, `Mean`, `Quantile`, `TopK`, and other data summarization
methods are provided as online, GPU-optimized algorithms.

```
from baukit import Quantile, TopK, CombinedStat, tally
cs = CombinedStat(
    qc=Quantile(),
    tk=TopK(),
)
ds = MyDataset()  # Placeholder for your own dataset.
# Loads from my_stats.npz if already computed.
for [batch] in tally(cs, ds, cache='my_stats.npz', batch_size=50):
    batch = batch.cuda()  # .cuda() returns a copy, so keep the result.
    # Assumes dim=0 is the sampling axis; stats are per dim=1 feature.
    cs.add(batch)
cs.to_('cpu')
median = cs.qc.quantile(0.5)
top_values, top_indexes = cs.tk.topk(10)
```

The [runningstats source code](https://github.com/davidbau/baukit/blob/main/baukit/runningstats.py) shows other things you can do.

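To illustrate what "online" means here, the sketch below keeps a running mean and variance with Welford's algorithm, updating one sample at a time without storing the data. It is plain Python for illustration only; the class name `RunningMeanVar` is hypothetical, and baukit's own classes operate on batched tensors and may use different update rules.

```python
class RunningMeanVar:
    """Running mean and variance via Welford's one-pass update."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # Sum of squared deviations from the current mean.

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        # Uses the old and the updated mean; this keeps the sum exact.
        self._m2 += delta * (x - self.mean)

    def variance(self):
        # Unbiased sample variance; zero until there are two samples.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


stat = RunningMeanVar()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stat.add(x)
```

Memory use is constant regardless of how many samples are added, which is the property that makes such statistics practical over large datasets.
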
## Improved basic dataset objects

`ImageFolderSet` is faster and provides more features than
pytorch `ImageFolder` including the ability to gather multiple
streams of parallel data tensors (such as segmentations and images).

`TokenizedDataset` tokenizes text through a provided tokenizer,
producing dictionaries designed to feed directly into `huggingface`
language models. It works with `length_collation` for creating
uniform-length batches for fast training and inference.

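The idea of length-based collation can be sketched in plain Python: sort example indices by token count, then cut the sorted order into batches so each batch holds sequences of similar length and needs little padding. The functions `batches_by_length` and `pad_batch` below are hypothetical illustrations and are not baukit's `length_collation`, whose interface may differ.

```python
def batches_by_length(sequences, batch_size):
    """Group indices so each batch holds sequences of similar length."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def pad_batch(sequences, indexes, pad_id=0):
    """Right-pad the chosen sequences to the longest one in the batch."""
    longest = max(len(sequences[i]) for i in indexes)
    return [sequences[i] + [pad_id] * (longest - len(sequences[i]))
            for i in indexes]


# Toy token id lists of lengths 4, 1, 2, 3, 1.
token_ids = [[5, 6, 7, 8], [1], [2, 3], [9, 9, 9], [4]]
batches = batches_by_length(token_ids, batch_size=2)
padded = [pad_batch(token_ids, b) for b in batches]
```

Here only one padding token is needed in total, whereas batching in the original order would pad the single-token sequences up to length four.
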
## Batch job utilities

`pbar` is a more readable progress bar utility wrapper around `tqdm`
that simplifies the display of progress status strings during a
long-running operation; it also provides a way for a caller to
silence progress output.

`reserve_dir` reserves a directory for results of a job and grabs a lock
so that other processes running `reserve_dir` will not do the same job.
This allows very simple batch parallelism: just run many processes
that run all the jobs, and each job will only be done once.

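The locking idea can be illustrated with the standard library: `os.mkdir` either creates a directory or fails if it already exists, so exactly one process wins each job. The sketch below is a simplified stand-in, not baukit's `reserve_dir`; the function names `try_claim` and `run_all_jobs` and the directory layout are assumptions made for the example.

```python
import os
import tempfile


def try_claim(results_root, job_name):
    """Return the job directory if this caller claimed it, else None."""
    job_dir = os.path.join(results_root, job_name)
    try:
        os.mkdir(job_dir)  # Fails if another worker got there first.
    except FileExistsError:
        return None
    return job_dir


def run_all_jobs(results_root, job_names):
    """Attempt every job; do only the ones nobody else has claimed."""
    done = []
    for name in job_names:
        job_dir = try_claim(results_root, name)
        if job_dir is None:
            continue
        with open(os.path.join(job_dir, 'result.txt'), 'w') as f:
            f.write(f'finished {name}\n')
        done.append(name)
    return done


root = tempfile.mkdtemp()
jobs = ['job_a', 'job_b', 'job_c']
first = run_all_jobs(root, jobs)   # Claims all three jobs.
second = run_all_jobs(root, jobs)  # Finds them all taken and does nothing.
```

This simplified version never releases a claim, so a job whose worker crashes stays reserved until its directory is removed by hand.
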
`WorkerPool` simplifies creation of worker threads for consuming output
data; this can dramatically speed up writing of many output files
and is the output analogue of the torch DataLoader utility for inputs.

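The pattern behind this can be sketched with the standard library: the main loop puts finished results on a queue, and background threads take them off and write them out, so computation does not wait on disk I/O. This is an illustration of the pattern only; `save_worker`, the `None` sentinel protocol, and the file names are assumptions and do not reflect the `WorkerPool` interface.

```python
import os
import queue
import tempfile
import threading


def save_worker(tasks):
    """Write (path, text) items from the queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        path, text = item
        with open(path, 'w') as f:
            f.write(text)


out_dir = tempfile.mkdtemp()
tasks = queue.Queue(maxsize=8)  # Bounded so the producer cannot run far ahead.
workers = [threading.Thread(target=save_worker, args=(tasks,))
           for _ in range(4)]
for w in workers:
    w.start()

# The main loop only enqueues results; the threads do the writing.
for i in range(20):
    tasks.put((os.path.join(out_dir, f'out_{i:02d}.txt'), f'result {i}\n'))

for _ in workers:
    tasks.put(None)  # One sentinel per worker shuts the pool down.
for w in workers:
    w.join()

written = sorted(os.listdir(out_dir))
```

Joining the threads before reading the output directory guarantees every queued file has been written.
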