From 329a3bfd7a5a28f8ff4b185b548132cd6a360062 Mon Sep 17 00:00:00 2001
From: Yannic Kilcher
Date: Mon, 2 Jan 2023 23:45:06 +0100
Subject: [PATCH] added schemas for ranking and generation task

---
 docs/data_schemas.md | 108 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/docs/data_schemas.md b/docs/data_schemas.md
index b1fe16c1..0bb9a96c 100644
--- a/docs/data_schemas.md
+++ b/docs/data_schemas.md
@@ -92,3 +92,111 @@ conversations, we should use a row-major format. Avro files and protobufs.
 
 Keep in mind that column-major files are better for reading, filtering, and
 aggregating, but row-major files are better for writing.
+
+# Task-Specific Data Schemas
+
+The main tasks are a) generation of response text and b) ranking of responses.
+The following sections describe the data schemas for each of these tasks. Both
+should be implementable in parquet files.
+
+## Common Data Structures
+
+```python
+
+class Message:
+    text: str  # The text of the message
+    role: Literal['prompter', 'assistant']  # Whether the message is a user prompt/follow-up or an assistant response
+
+class Thread:
+    messages: list[Message]  # The messages in the conversation
+
+```
+
+The corresponding parquet schemas are:
+
+```parquet
+message Message {
+  required binary text (UTF8);
+  required binary role (UTF8);
+}
+
+message Thread {
+  required group messages (LIST) {
+    repeated group list {
+      required group element {
+        required binary text (UTF8);
+        required binary role (UTF8);
+      }
+    }
+  }
+}
+
+```
+
+## Generation
+
+```python
+
+class GenerationExample:
+    thread: Thread  # The conversation thread before the message to be generated
+    message: Message  # The message to be generated
+
+```
+
+The corresponding parquet schema is:
+
+```parquet
+message GenerationExample {
+  required group thread (LIST) {
+    repeated group list {
+      required group element {
+        required binary text (UTF8);
+        required binary role (UTF8);
+      }
+    }
+  }
+  required group message (LIST) {
+    repeated group list {
+      required group element {
+        required binary text (UTF8);
+        required binary role (UTF8);
+      }
+    }
+  }
+}
+
+```
+
+## Ranking
+
+```python
+
+class RankingExample:
+    thread: Thread  # The conversation thread before the message to be ranked
+    messages: list[Message]  # The messages to be ranked, in order of decreasing preference
+
+```
+
+The corresponding parquet schema is:
+
+```parquet
+message RankingExample {
+  required group thread (LIST) {
+    repeated group list {
+      required group element {
+        required binary text (UTF8);
+        required binary role (UTF8);
+      }
+    }
+  }
+  required group messages (LIST) {
+    repeated group list {
+      required group element {
+        required binary text (UTF8);
+        required binary role (UTF8);
+      }
+    }
+  }
+}
+
+```
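
To make the mapping between the Python classes and the Parquet layouts concrete, here is a minimal sketch using only the standard library. It is not part of the patch. The `@dataclass` decorators, the runtime role check, and the helper names `generation_to_row` and `ranking_to_row` are assumptions made for illustration. Each helper flattens an example into plain dicts and lists, the nested shape a Parquet writer would typically accept. One point worth checking against the schema in the patch: the sketch keeps `message` as a single struct, following the Python class, whereas the `GenerationExample` Parquet schema declares it as a `LIST` group.

```python
from dataclasses import asdict, dataclass
from typing import Literal

ROLES = ("prompter", "assistant")


@dataclass
class Message:
    text: str
    role: Literal["prompter", "assistant"]

    def __post_init__(self) -> None:
        # Literal is not enforced at runtime, so check the role explicitly.
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role!r}")


@dataclass
class Thread:
    messages: list[Message]


@dataclass
class GenerationExample:
    thread: Thread    # conversation before the message to be generated
    message: Message  # the message to be generated


@dataclass
class RankingExample:
    thread: Thread           # conversation before the messages to be ranked
    messages: list[Message]  # in order of decreasing preference


def generation_to_row(example: GenerationExample) -> dict:
    """Hypothetical helper: `thread` becomes a list of {text, role} structs."""
    return {
        "thread": [asdict(m) for m in example.thread.messages],
        "message": asdict(example.message),  # single struct (assumption)
    }


def ranking_to_row(example: RankingExample) -> dict:
    """Hypothetical helper: both columns become lists of {text, role} structs."""
    return {
        "thread": [asdict(m) for m in example.thread.messages],
        "messages": [asdict(m) for m in example.messages],
    }


thread = Thread([Message("What is Parquet?", "prompter")])
ranking_row = ranking_to_row(
    RankingExample(
        thread=thread,
        messages=[
            Message("A columnar file format.", "assistant"),
            Message("No idea.", "assistant"),
        ],
    )
)
generation_row = generation_to_row(
    GenerationExample(thread, Message("A columnar file format.", "assistant"))
)
```

The order of `messages` in a ranking row carries the preference signal, so any writer or reader built on this shape would need to preserve list order.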