mirror of
https://github.com/wassname/baukit.git
synced 2026-06-27 19:46:31 +08:00
Update readme.
A kit of David's secret tools to help with productive research prototyping
with PyTorch.

Includes:

* Methods for tracing and editing internal activations in a network.
* Interactive UI widgets for quick data exploration in a notebook.
* Online algorithms for computing running stats in PyTorch.
* Fast and feature-rich dataset objects for images and text.
* Utilities for simplifying the task of running many batch jobs.

## Trace library

`Trace`, `TraceDict`, `subsequence`, and `replace_module` simplify
the work of analyzing and altering internal computations of deep
networks. A short example of tracing a specific layer in `net`:

```
from baukit import Trace
with Trace(net, 'layer.name') as ret:
    _ = net(inp)
    representation = ret.output
```

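The pattern behind a tracing context manager can be illustrated without PyTorch. The following is a toy sketch in plain Python, not baukit's implementation: `ToyTrace`, `ToyNet`, and `encode` are hypothetical names, and the sketch assumes the traced name is a method defined on the class. It wraps one named method for the duration of a `with` block, records what that method returned, and removes the wrapper on exit.

```python
class ToyTrace:
    """Toy analogue of a tracing context manager (hypothetical, not baukit's code)."""

    def __init__(self, obj, name):
        self.obj = obj
        self.name = name
        self.output = None

    def __enter__(self):
        original = getattr(self.obj, self.name)

        def wrapper(*args, **kwargs):
            # Call the real method and remember what it returned.
            self.output = original(*args, **kwargs)
            return self.output

        # Shadow the class method with an instance attribute inside the block.
        setattr(self.obj, self.name, wrapper)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Drop the instance-level wrapper so the class method is visible again.
        delattr(self.obj, self.name)
        return False


class ToyNet:
    def encode(self, x):
        return [2 * v for v in x]

    def __call__(self, x):
        return sum(self.encode(x))


net = ToyNet()
with ToyTrace(net, "encode") as ret:
    result = net([1, 2, 3])
representation = ret.output
```

Because the wrapper is removed in `__exit__`, the object behaves as before once the block ends, even if the traced call raised an exception.
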
## Widget library

`show` is a vastly improved alternative to Jupyter notebook `display`;
it allows for quickly producing HTML layouts by arranging data and
images in nested Python lists. HTML elements, attributes, and
CSS styles can be controlled with functions like `show.style(color='red')`.

```
from baukit import show
show([[show.style(color=c), c] for c in ['red', 'green', 'blue']])
```

`show` works with a set of `Widget` subclasses, such as `Textbox`,
`Numberbox`, `Range`, `Menu`, `PlotWidget`, and `PaintWidget`, that provide
data-bound reactive objects for quickly making interactive
HTML visualizations that work in a Jupyter or Colab notebook. For
example, instead of using `matplotlib` directly to just draw a picture
of a plot, you can lay out an interactive widget:

```
from baukit import PlotWidget, Range, show
import numpy

def how_to_draw_my_plot(fig, amp=1.0, freq=1.0):
    [ax] = fig.axes
    ax.clear()
    x = numpy.linspace(0, 5, 100)
    ax.plot(x, amp * numpy.sin(freq * x))

plot = PlotWidget(how_to_draw_my_plot, figsize=(5, 5))
# Bind each slider's value to the matching plot property.
ra = Range(min=0.0, max=2.0, step=0.1, value=plot.prop('amp'))
rf = Range(min=0.1, max=20.0, step=0.1, value=plot.prop('freq'))
show([plot, [show.style(textAlign='right'), 'Amp', ra,
             show.style(textAlign='right'), 'Freq', rf]])
```

This code shows the plot in a layout with two sliders. If you later
execute the code `plot.freq = 5.0`, the plot will update live, in place,
to show the new curve, and the freq slider will also move to 5.
Dragging either slider will also change the values live.

## Online statistics library

`Covariance`, `Mean`, `Quantile`, `TopK`, and other data summarization
methods are provided as online, GPU-optimized algorithms.

```
from baukit import Quantile, TopK, CombinedStat, tally
cs = CombinedStat(
    qc=Quantile(),
    tk=TopK(),
)
ds = MyDataset()  # placeholder for your own dataset
# Loads from my_stats.npz if already computed.
for [batch] in tally(cs, ds, cache='my_stats.npz', batch_size=50):
    batch = batch.cuda()
    # Assumes dim=0 is the sampling axis; stats are per dim=1 feature.
    cs.add(batch)
cs.to_('cpu')
median = cs.qc.quantile(0.5)
top_values, top_indexes = cs.tk.topk(10)
```

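"Online" here means that a statistic is updated one batch at a time, so the full dataset never has to be held in memory. As a minimal sketch of that idea, and not baukit's implementation, the plain-Python class below keeps a running mean and variance of scalars by merging per-batch summaries into the running state. The name `RunningMoments` is hypothetical; the library's own classes operate on tensors.

```python
class RunningMoments:
    """Running mean and variance of scalars (hypothetical sketch, not baukit's code)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, batch):
        # Summarize the batch on its own, then merge it into the running state.
        n = len(batch)
        if n == 0:
            return
        batch_mean = sum(batch) / n
        batch_m2 = sum((v - batch_mean) ** 2 for v in batch)
        total = self.count + n
        delta = batch_mean - self.mean
        self.mean += delta * n / total
        self.m2 += batch_m2 + delta * delta * self.count * n / total
        self.count = total

    def variance(self):
        # Unbiased sample variance; requires at least two samples.
        return self.m2 / (self.count - 1)


stat = RunningMoments()
for batch in ([1.0, 2.0, 3.0], [4.0, 5.0], [6.0]):
    stat.add(batch)
```

Only three numbers are stored no matter how many batches are added, which is what makes this kind of statistic practical over large datasets.
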
## Improved basic dataset objects

`ParallelImageFolder` is faster and provides more features than
PyTorch `ImageFolder`, including the ability to gather multiple
streams of parallel data tensors (such as segmentations and images).

`TokenizedDataset` tokenizes text through a provided tokenizer,
producing dictionaries designed to feed directly into `huggingface`
language models. It works with `length_collation` for creating
uniform-length batches for fast training and inference.

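The idea behind uniform-length batching is to group sequences of similar length so that little padding is needed. The sketch below shows that idea in plain Python under stated assumptions: `length_batches` and `pad_batch` are hypothetical helpers, not the `length_collation` API, whose signature is not shown here.

```python
def length_batches(sequences, batch_size):
    """Group sequence indexes into batches of similar length (hypothetical helper)."""
    # Sort indexes by sequence length so neighbors need little or no padding.
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def pad_batch(sequences, indexes, pad_id=0):
    """Pad the chosen sequences to the longest length within the batch."""
    width = max(len(sequences[i]) for i in indexes)
    return [sequences[i] + [pad_id] * (width - len(sequences[i])) for i in indexes]


token_ids = [[5, 6, 7, 8], [1], [2, 3], [9, 9, 9], [4]]
batches = length_batches(token_ids, batch_size=2)
padded = [pad_batch(token_ids, b) for b in batches]
```

With this grouping, only one padding token is added across all five sequences; batching them in their original order would pad more.
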
## Batch job utilities

`pbar` is a more readable progress bar utility wrapper around `tqdm`
that simplifies the display of progress status strings during a
long-running operation; it also provides a way for a caller to
silence progress output.

`reserve_dir` reserves a directory for results of a job and grabs a lock
so that other processes running `reserve_dir` will not do the same job.
This allows very simple batch parallelism: just run many processes
that each loop over all the jobs, and each job will only be done once.

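The locking idea can be sketched with the standard library alone. This is an illustration of the general technique, not the behavior of `reserve_dir` itself: `try_reserve` and the `lockfile.pid` filename are hypothetical, and the sketch never releases the lock or records completed jobs. It relies on exclusive file creation, which succeeds for only one claimant.

```python
import os
import tempfile


def try_reserve(directory):
    """Return True if this caller claimed `directory`, False if already claimed."""
    os.makedirs(directory, exist_ok=True)
    lock_path = os.path.join(directory, "lockfile.pid")
    try:
        # O_EXCL makes creation fail if the file already exists, so only one
        # claimant can succeed.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


root = tempfile.mkdtemp()
jobs = [os.path.join(root, f"job{i}") for i in range(3)]
# Two passes stand in for two processes walking the same job list.
first_pass = [try_reserve(j) for j in jobs]
second_pass = [try_reserve(j) for j in jobs]
```

Every job is claimed on the first pass and skipped on the second, which is the property that lets many identical processes share one job list.
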
`WorkerPool` simplifies creation of worker threads for consuming output
data; this can dramatically speed up writing of many output files
and is the output analogue of the torch `DataLoader` utility for inputs.

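The output-side pattern can be sketched with `threading` and `queue` from the standard library. This is a minimal illustration under stated assumptions, not the `WorkerPool` API: `write_worker` and the file names are hypothetical. The main thread only enqueues results while a few worker threads perform the slow file writes.

```python
import os
import queue
import tempfile
import threading


def write_worker(tasks):
    """Consume (path, text) pairs from the queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        path, text = item
        with open(path, "w") as f:
            f.write(text)


out_dir = tempfile.mkdtemp()
tasks = queue.Queue(maxsize=8)  # bounded so the producer cannot run far ahead
workers = [threading.Thread(target=write_worker, args=(tasks,)) for _ in range(4)]
for w in workers:
    w.start()

# The main thread only enqueues results; the workers do the file writes.
for i in range(20):
    tasks.put((os.path.join(out_dir, f"result_{i}.txt"), f"value {i}"))

for _ in workers:
    tasks.put(None)  # one sentinel per worker
for w in workers:
    w.join()

written = sorted(os.listdir(out_dir))
```

The bounded queue applies back-pressure: if the writers fall behind, `put` blocks instead of letting pending results pile up in memory.
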