added schemas for ranking and generation task

This commit is contained in:
Yannic Kilcher
2023-01-02 23:45:06 +01:00
parent 2df4e7a735
commit 329a3bfd7a
+108
@@ -92,3 +92,111 @@ conversations, we should use a row-major format.
Avro files and protobufs. Keep in mind that column-major files are better for
reading, filtering, and aggregating, but row-major files are better for
writing.
# Task-Specific Data Schemas
The two main tasks are (a) generating response text and (b) ranking candidate
responses. The following sections describe the data schema for each task. Both
schemas should be implementable in parquet files.
## Common Data Structures
```python
from typing import Literal

class Message:
    text: str  # The text of the message
    role: Literal['prompter', 'assistant']  # Whether the message is a user prompt/follow-up or an assistant response

class Thread:
    messages: list[Message]  # The messages in the conversation
```
The corresponding parquet schemas are:
```parquet
message Message {
  required binary text (UTF8);
  required binary role (UTF8);
}
message Thread {
  required group messages (LIST) {
    repeated group list {
      required group element {
        required binary text (UTF8);
        required binary role (UTF8);
      }
    }
  }
}
```
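As a rough illustration of how the Python classes map onto the nested layout above, the sketch below turns a `Thread` into a plain nested dict with the same shape as the `Thread` schema (a `messages` list whose elements carry `text` and `role`). It restates the classes as dataclasses so that it is self-contained; `thread_to_row` is a hypothetical helper introduced here for illustration, and writing the row to an actual parquet file is left out.

```python
from dataclasses import asdict, dataclass
from typing import Literal


@dataclass
class Message:
    text: str
    role: Literal['prompter', 'assistant']


@dataclass
class Thread:
    messages: list[Message]


def thread_to_row(thread: Thread) -> dict:
    """Hypothetical helper: convert one Thread into one row-shaped dict."""
    # asdict recurses into nested dataclasses, including those inside lists.
    return asdict(thread)


thread = Thread(messages=[
    Message(text="What is Parquet?", role="prompter"),
    Message(text="A columnar file format.", role="assistant"),
])
row = thread_to_row(thread)
```

Note that `Literal` is only a type hint: nothing here rejects an unknown `role` at runtime, so any validation would have to be added separately.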
## Generation
```python
class GenerationExample:
    thread: Thread  # The conversation thread before the message to be generated
    message: Message  # The message to be generated
```
The corresponding parquet schema is:
```parquet
message GenerationExample {
  required group thread (LIST) {
    repeated group list {
      required group element {
        required binary text (UTF8);
        required binary role (UTF8);
      }
    }
  }
  required group message {
    required binary text (UTF8);
    required binary role (UTF8);
  }
}
```
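One plausible way to obtain generation examples is to split an existing conversation at each assistant message, using everything before it as the context thread. The sketch below does that under this assumption; `generation_examples` is a hypothetical helper, and the source does not state how examples are actually derived.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Message:
    text: str
    role: Literal['prompter', 'assistant']


@dataclass
class Thread:
    messages: list[Message]


@dataclass
class GenerationExample:
    thread: Thread    # Context before the target message
    message: Message  # The message to be generated


def generation_examples(thread: Thread) -> list[GenerationExample]:
    """Hypothetical helper: one example per assistant message in the thread."""
    examples = []
    for i, msg in enumerate(thread.messages):
        if msg.role == 'assistant':
            # The context is every message that precedes the target.
            context = Thread(messages=thread.messages[:i])
            examples.append(GenerationExample(thread=context, message=msg))
    return examples


conversation = Thread(messages=[
    Message("Hi!", "prompter"),
    Message("Hello, how can I help?", "assistant"),
    Message("Explain Parquet.", "prompter"),
    Message("Parquet is a columnar file format.", "assistant"),
])
examples = generation_examples(conversation)
```

With this conversation, the helper yields two examples: the first has a one-message context, the second a three-message context.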
## Ranking
```python
class RankingExample:
    thread: Thread  # The conversation thread before the message to be ranked
    messages: list[Message]  # The messages to be ranked, in order of decreasing preference
```
The corresponding parquet schema is:
```parquet
message RankingExample {
  required group thread (LIST) {
    repeated group list {
      required group element {
        required binary text (UTF8);
        required binary role (UTF8);
      }
    }
  }
  required group messages (LIST) {
    repeated group list {
      required group element {
        required binary text (UTF8);
        required binary role (UTF8);
      }
    }
  }
}
```
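Because `messages` is ordered by decreasing preference, every earlier message is implicitly preferred over every later one. A common way to consume such a ranking is to expand it into (preferred, rejected) pairs; the sketch below shows this under the assumption that pairwise comparisons are wanted. `preference_pairs` is a hypothetical helper, not something the schema itself prescribes.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Literal


@dataclass
class Message:
    text: str
    role: Literal['prompter', 'assistant']


@dataclass
class Thread:
    messages: list[Message]


@dataclass
class RankingExample:
    thread: Thread           # Context before the ranked messages
    messages: list[Message]  # Candidates, in order of decreasing preference


def preference_pairs(example: RankingExample) -> list[tuple[Message, Message]]:
    """Hypothetical helper: all (preferred, rejected) pairs implied by the ranking."""
    # combinations keeps input order, so the first item of each pair ranks higher.
    return list(combinations(example.messages, 2))


best = Message("Parquet is a columnar file format.", "assistant")
middle = Message("It is a file format.", "assistant")
worst = Message("I don't know.", "assistant")
ranking = RankingExample(
    thread=Thread(messages=[Message("What is Parquet?", "prompter")]),
    messages=[best, middle, worst],
)
pairs = preference_pairs(ranking)
```

A ranking of n messages expands to n(n-1)/2 pairs, so the three candidates here give three pairs.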