Commit 5a60136d authored by Kreinsen, Moritz

Update

parent f26cb55d
%% Cell type:markdown id:6ec9d7f2-3a2c-4912-9b86-67e62961df9e tags:
# Getting Started with Bloom
%% Cell type:markdown id:33a29634-7abc-4d2d-bcc7-5ec45d89b5c1 tags:
* Reference: https://towardsdatascience.com/getting-started-with-bloom-9e3295459b65
* Model: https://huggingface.co/malteos/bloom-1b5-clp-german
%% Cell type:markdown id:cc157d45-ca29-4777-a2d5-f1852570a936 tags:
%% Cell type:code id:e8602d3c-e0b0-4ff7-94ca-e93ad845153f tags:
``` python
import transformers
from transformers import BloomForCausalLM
from transformers import BloomTokenizerFast
import torch
import time
```
%% Output
2023-02-02 07:34:46.434678: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-02 07:34:47.218496: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-02-02 07:34:47.218726: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-02-02 07:34:47.218734: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
%% Cell type:code id:2b5224d5-3b48-4f8a-9e8b-26499dfb7944 tags:
``` python
model = BloomForCausalLM.from_pretrained("malteos/bloom-1b5-clp-german")
tokenizer = BloomTokenizerFast.from_pretrained("malteos/bloom-1b5-clp-german")
```
%% Output
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'.
The class this function is called from is 'BloomTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
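The tokenizer warning above comes from the checkpoint declaring a `GPT2Tokenizer` while the code loads it through `BloomTokenizerFast`. A minimal alternative sketch, assuming the checkpoint's own tokenizer configuration is preferable, is to let `AutoTokenizer` resolve the class; whether the resolved GPT2-style tokenizer produces identical token ids for German text has not been checked here.

``` python
# Alternative sketch: let AutoTokenizer resolve the tokenizer class that the
# checkpoint itself declares, which avoids the class-mismatch warning above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("malteos/bloom-1b5-clp-german")
```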
%% Cell type:code id:8f004740-04a1-45da-a5f8-6432eefb0fbf tags:
``` python
print("Gebe hier deinen Text ein:")
prompt = input("")
result_length = 50
inputs = tokenizer(prompt, return_tensors="pt")
```
%% Output
Gebe hier deinen Text ein:
Was ist ein Seepferdchen?
%% Cell type:code id:f999d2a5-941d-4c33-a3e7-e9073497de4c tags:
``` python
# Greedy Search
start = time.time()
print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length)[0]))
ende = time.time()
print('{:5.3f}s'.format(ende - start))
```
%% Output
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Was ist ein Seepferdchen? Ein Seepferdchen ist ein kleines, weißes Tierchen, das im Wasser lebt. Es ist ein sehr schönes Tier. Es ist sehr klein. Es ist sehr klein. Es ist sehr klein. Es ist sehr klein
99.676s
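The two warnings above can be avoided by passing the attention mask returned by the tokenizer and an explicit `pad_token_id`. A minimal sketch; for this single, unpadded prompt the generated text itself should not change.

``` python
# Sketch: pass the attention mask and an explicit pad_token_id so generate()
# does not have to guess them; this silences the two warnings above.
output_ids = model.generate(inputs["input_ids"],
                            attention_mask=inputs["attention_mask"],
                            pad_token_id=tokenizer.eos_token_id,
                            max_length=result_length)
print(tokenizer.decode(output_ids[0]))
```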
%% Cell type:code id:e452cadb-431e-48d2-a3f5-ce1a113ab558 tags:
``` python
# Beam Search
start = time.time()
print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length,
                                      num_beams=2,
                                      no_repeat_ngram_size=2,
                                      early_stopping=True)[0]))
ende = time.time()
print('{:5.3f}s'.format(ende - start))
```
%% Output
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Was ist ein Seepferdchen? Wie sieht es aus? Was kann man mit ihm machen? Und was hat es mit der Seefahrt zu tun?
Die Kinder lernen die Tiere kennen, die im Wasser leben. Sie erfahren, wie
158.992s
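Beam search keeps several candidate continuations in parallel; a hedged variant of the call above also returns more than one of them. The beam count of 4 is an illustrative choice, and `num_return_sequences` must not exceed `num_beams`.

``` python
# Sketch: return several beam hypotheses at once instead of only the best one;
# num_return_sequences must not exceed num_beams.
beams = model.generate(inputs["input_ids"],
                       max_length=result_length,
                       num_beams=4,
                       num_return_sequences=2,
                       no_repeat_ngram_size=2,
                       early_stopping=True)
for b in beams:
    print(tokenizer.decode(b))
```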
%% Cell type:code id:255e1194-9b21-4b1f-bb13-5574d7e0d500 tags:
``` python
# Sampling Top-k + Top-p
start = time.time()
print(tokenizer.decode(model.generate(inputs["input_ids"],
                                      max_length=result_length,
                                      do_sample=True,
                                      top_k=50,
                                      top_p=0.9)[0]))
ende = time.time()
print('{:5.3f}s'.format(ende - start))
```
%% Output
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Was ist ein Seepferdchen? Ein Seepferdchen? Das sieht aus wie ein kleines, grünes Seepferdchen. Es hat ein breites Maul und einen kleinen Kopf.
Wer war schon mal in einem See? Das sind Tiere, die an großen
83.627s
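Sampling is nondeterministic, so every run of the cell above can produce a different continuation. A sketch that fixes the random seed and adds a `temperature`; the seed and temperature values are illustrative choices, not taken from the reference post.

``` python
# Sketch: fix the RNG seed so top-k/top-p sampling becomes repeatable across runs;
# temperature below 1.0 makes the sampling distribution slightly more peaked.
torch.manual_seed(0)
sampled = model.generate(inputs["input_ids"],
                         max_length=result_length,
                         do_sample=True,
                         top_k=50,
                         top_p=0.9,
                         temperature=0.8)
print(tokenizer.decode(sampled[0]))
```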
%% Cell type:code id:bc6f1454-bc5e-4361-8413-84fa5397f6d5 tags:
``` python
```
%% Cell type:markdown id:0f901b18 tags:
# Next-Token-Prediction
This is based on the following blog post:
* Predicting Next Word — NLP & Deep Learning: https://medium.com/@vijay2340025/predicting-next-word-nlp-deep-learning-85010d966671
%% Cell type:code id:b2471e25 tags:
``` python
import nltk
import pandas as pd
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
%% Cell type:code id:c0febc54 tags:
``` python
# Download the Punkt tokenizer models used by nltk.word_tokenize below
nltk.download('punkt')
```
%% Cell type:code id:8a781aa1 tags:
``` python
dataset = """
Is Antwerp a city?,
Is Antwerp a municipality?,
Is Antwerp in Belgium?,
What is Antwerp?,
What is the population of the city of Antwerp?,
Where is the city of Antwerp?,
Why is Antwerp important to fashion?,
Antwerp is to the east of what river?,
How many municipalities does Antwerp have?,
"""
```
%% Cell type:code id:39255d82 tags:
``` python
def get_all_possible_sequences(text):
    """Return every (context words, next word) pair from the contiguous windows of the text."""
    seq = []
    words = nltk.word_tokenize(text)
    total_words = len(words)
    for i in range(1, total_words):
        for j in range(1, len(words) - i + 1):
            arr = words[j - 1:j + i]
            seq.append((arr[:-1], arr[-1]))
    return seq


def build_vocabulary(docs):
    """Collect every distinct token across the documents, plus an 'UNK' padding token."""
    vocabulary = []
    for doc in docs:
        for w in nltk.word_tokenize(doc):
            if w not in vocabulary:
                vocabulary.append(w)
    vocabulary.append('UNK')
    return vocabulary
```
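To see what these helpers produce, a small illustrative check (the exact pairs depend on nltk's tokenization, which treats punctuation as its own token):

``` python
# Illustrative check: every contiguous window of the question is split into
# (context words, next word).
for context, target in get_all_possible_sequences("is antwerp a city?"):
    print(context, "->", target)
```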
%% Cell type:code id:edb54d0d tags:
``` python
docs = []
for row in dataset.split(","):
    docs.append(row.lower())

lst = []
for doc in docs:
    tmp_lst = get_all_possible_sequences(doc)
    lst = lst + tmp_lst

vocabulary = build_vocabulary(docs)
id2word = {idx: w for (idx, w) in enumerate(vocabulary)}
word2id = {w: idx for (idx, w) in enumerate(vocabulary)}


def seq2id(arr):
    """Map a list of words to a tensor of vocabulary ids."""
    return torch.tensor([word2id[i] for i in arr])


def get_max_seq():
    """Length of the longest context among the training pairs."""
    return max(len(i[0]) for i in lst)


MAX_SEQ_LEN = get_max_seq()


def get_padded_x(data):
    """Right-pad a context id tensor to MAX_SEQ_LEN with the 'UNK' id."""
    new_data = F.pad(input=data.view(1, -1), pad=(0, MAX_SEQ_LEN - data.shape[0], 0, 0),
                     mode='constant', value=word2id['UNK'])
    return new_data


def get_xy_vector(arr):
    """Split a (context, target) pair into id tensors."""
    x = seq2id(arr[0])
    y = seq2id([arr[1]])
    return x, y
```
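A quick illustrative check of the id conversion and padding for a single training pair:

``` python
# Illustrative check: one (context, target) pair as padded id tensors.
sample = lst[0]                 # e.g. (['is'], 'antwerp')
x, y = get_xy_vector(sample)    # word ids for context and target
x = get_padded_x(x)             # right-padded to MAX_SEQ_LEN with the 'UNK' id
print(sample, x.shape, y)       # x has shape (1, MAX_SEQ_LEN)
```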
%% Cell type:code id:431fd558 tags:
``` python
class NextWordModel(nn.Module):
    """Predict the next word from a context padded to MAX_SEQ_LEN tokens."""

    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        super(NextWordModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The padded context is flattened into a single GRU input step.
        self.gru = nn.GRU(embedding_dim * MAX_SEQ_LEN, hidden_dim)
        self.linear = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        gru_out, _ = self.gru(embeds.view(1, 1, -1))
        x = self.linear(gru_out.view(1, -1))
        return x
```
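A hedged sanity check with untrained weights: one padded context goes in, one score per vocabulary word comes out.

``` python
# Illustrative shape check (untrained weights): the model maps one padded
# context to a score for every word in the vocabulary.
_check = NextWordModel(10, len(vocabulary), len(vocabulary))
_x, _ = get_xy_vector(lst[0])
print(_check(get_padded_x(_x)).shape)   # expected: torch.Size([1, len(vocabulary)])
```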
%% Cell type:code id:d891422e tags:
``` python
# Use a GPU if one is available, otherwise fall back to the CPU.
if torch.cuda.is_available():
    dev = "cuda:0"
else:
    dev = "cpu"
print(f'Running on {dev}')
device = torch.device(dev)
```
%% Cell type:code id:7c231c92 tags:
``` python
EMBEDDING_DIM = 10
NO_OF_EPOCHS = 300
HIDDEN_DIM = len(vocabulary)

model = NextWordModel(EMBEDDING_DIM, HIDDEN_DIM, len(vocabulary))
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
model.to(device)

for epoch in range(NO_OF_EPOCHS):
    running_loss = 0.0
    i = 0
    for data in lst:
        model.zero_grad()
        x, y = get_xy_vector(data)
        # convert the context to MAX_SEQ_LEN with padding
        x = get_padded_x(x)
        x = x.to(device)
        y = y.to(device)
        predicted = model(x)
        loss = loss_function(predicted, y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()   # .item() keeps the running loss off the autograd graph
        i += 1
        if i % 100 == 0:
            # print(f'Loss at iteration {i} and epoch {epoch} is {running_loss / 100}')
            running_loss = 0
print('Finished')
```
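Training loops over the toy data for 300 epochs, which takes a while; a sketch for persisting the learned weights afterwards (the file name is an arbitrary choice, not part of the original notebook):

``` python
# Sketch: persist the trained weights; "next_word_model.pt" is an arbitrary name.
torch.save(model.state_dict(), "next_word_model.pt")

# To reuse them later, rebuild a model with the same dimensions and reload:
# restored = NextWordModel(EMBEDDING_DIM, HIDDEN_DIM, len(vocabulary))
# restored.load_state_dict(torch.load("next_word_model.pt"))
```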
%% Cell type:code id:9d0ff01d tags:
``` python
with torch.no_grad():
    print('Type something here . . .')
    while True:
        inp = input("")
        inp = inp.strip()
        if inp == "q":
            break
        # Note: words outside the toy vocabulary raise a KeyError in seq2id.
        tokens = nltk.word_tokenize(inp.lower())
        x = seq2id(tokens)
        x = get_padded_x(x)
        x = x.to(device)
        predicted = model(x).to(device)
        predicted = predicted[0].cpu().numpy()
        print(f'Answer: {inp} {id2word[np.argmax(predicted)]} ')
```
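Instead of printing only the single highest-scoring word, a hedged variant lists a few ranked candidates; the query string is illustrative and must only contain words from the toy vocabulary.

``` python
# Sketch: show the three highest-scoring next-word candidates for one query.
with torch.no_grad():
    x = get_padded_x(seq2id(nltk.word_tokenize("is antwerp"))).to(device)
    scores = model(x)[0].cpu().numpy()
    top3 = np.argsort(scores)[::-1][:3]
    print([id2word[int(i)] for i in top3])
```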
%% Cell type:code id:c1e828f0 tags:
``` python
```
nltk
pandas
torch
numpy
transformers
torch