Task: Text-Generation
Total training time: 1.8 hours
Inputs: text
Outputs: text
Params: 2,604,210
Final Loss: 2.37
Important Benchmark Scores:
1. ARC Easy - 32.11%
2. BLiMP - 65.33%
3. HellaSwag - 27.03%
Framework: PyTorch, transformers
Authors: Paul Courneya, Jonathan LY
Syn is a Tiny Language Model (TLM) trained on 2.7 billion tokens of synthetic data. The name Syn is an abbreviation of synthetic, reflecting the type of data the model was trained on.
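To put the headline numbers in proportion, the following back-of-envelope sketch derives two ratios from figures quoted on this card: tokens seen per parameter and average training throughput. It assumes the 1.8 hours of training time covers exactly the 2.7 billion tokens, which the card does not state explicitly; the function names are illustrative only.

```python
# Figures quoted on this card.
TRAINING_TOKENS = 2.7e9   # "2.7 billion tokens"
PARAMS = 2_604_210        # "Params: 2,604,210"
TRAINING_HOURS = 1.8      # "Total training time: 1.8 hours"


def tokens_per_parameter(tokens: float, params: int) -> float:
    """Number of training tokens seen per model parameter."""
    return tokens / params


def tokens_per_second(tokens: float, hours: float) -> float:
    """Average throughput, ASSUMING the whole run took `hours` hours."""
    return tokens / (hours * 3600.0)


if __name__ == "__main__":
    tpp = tokens_per_parameter(TRAINING_TOKENS, PARAMS)
    tps = tokens_per_second(TRAINING_TOKENS, TRAINING_HOURS)
    print(f"Tokens per parameter : {tpp:,.0f}")
    print(f"Tokens per second    : {tps:,.0f}")
```

Both values are rough, since the token count and the training time are themselves rounded.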
max-autotune-no-cudagraphs, (0.9, 0.95), float16

| Dataset | Bytes | Size | Share |
|---|---|---|---|
| FinePhrase | 8,000,000,000 | 8.000 GB | 61.61% |
| Tiny-Strange-Textbooks | 4,000,000,000 | 4.000 GB | 30.81% |
| TinyStoriesv2 | 700,000,000 | 0.700 GB | 5.39% |
| LongPage | 284,000,000 | 0.284 GB | 2.19% |
Note: the byte counts are rounded.
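The Share column can be reproduced from the byte counts alone. The sketch below is a minimal check, assuming each share is that dataset's bytes divided by the total across all four datasets and that 1 GB means 10^9 bytes, as the Size column suggests. The dictionary and function names are illustrative.

```python
# Rounded byte counts copied from the table above.
DATASET_BYTES = {
    "FinePhrase": 8_000_000_000,
    "Tiny-Strange-Textbooks": 4_000_000_000,
    "TinyStoriesv2": 700_000_000,
    "LongPage": 284_000_000,
}


def mixture_shares(sizes: dict) -> dict:
    """Return each dataset's share of the total byte count, in percent."""
    total = sum(sizes.values())
    return {name: 100.0 * n / total for name, n in sizes.items()}


if __name__ == "__main__":
    total_bytes = sum(DATASET_BYTES.values())
    print(f"Total: {total_bytes / 1e9:.3f} GB")
    for name, share in mixture_shares(DATASET_BYTES).items():
        print(f"{name:<24} {share:6.2f}%")
```

The total comes to 12.984 GB, and the computed shares agree with the table to two decimal places.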
| Task | Value |
|---|---|
| BLiMP | 65.33% |
| ARC Easy | 32.11% |
| ARC Challenge | 20.39% |
| HellaSwag | 27.03% |
| SWAG | 33.38% |
| PiQA | 53.48% |
ArithMark-2.0:
| Ops = 1 | Ops = 2 | Ops = 3 | Avg |
|---|---|---|---|
| 25.04% | 30.13% | 24.60% | 26.48% |
For a comparison with other small language models like this one, go here.
Prompt : 'Artificial intelligence is'
------------------------------------------------------------
Generated:
a form of artificial intelligence that involves creating an attractive and reliable source of information. This involves using advanced technology to create interactive, user-friendly platforms for users who are looking to use them in their daily lives.
## II. Why Use Applications?
To enhance the user experience, it's important to have the opportunity to learn how to write and understand your content properly. To ensure that you are using new tools, take some time to read, then let go of yourself or others through conversations with the user as they look at them. Additionally, by practicing self-assessment, users can gain more control over their own ideas.
### IV. Conclusion
In this lesson, we learned about the different types of apps used for frames, their importance, and its applications. We also discussed the practical aspects of the application and practical applications of Frameworks. By understanding these concepts, readers can apply these skills to various scenarios in their careers.
Before using, distributing, selling, or modifying this software, you must read the license here.
#!/usr/bin/env python3
MODEL_DIR = "fromziro/Syn-2.6M"
TOKENIZER_PATH = MODEL_DIR
PROMPT = "Artificial intelligence is"
MAX_NEW_TOKENS = 256
TEMPERATURE = 0.7
TOP_P = 0.95
TOP_K = 30
REPETITION_PENALTY = 1.2
DO_SAMPLE = True
import torch
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizerFast
device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")
def load_tokenizer(path_or_repo: str):
    # Accept either a local tokenizer.json file or a Hub repo / local directory.
    p = Path(path_or_repo)
    if p.exists() and p.is_file() and p.suffix.lower() == ".json":
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p.resolve()))
    else:
        tok = AutoTokenizer.from_pretrained(path_or_repo, use_fast=True)
    # Fill in any special tokens the tokenizer does not already define.
    if tok.bos_token is None:
        tok.add_special_tokens({"bos_token": "<|bos|>"})
    if tok.eos_token is None:
        tok.add_special_tokens({"eos_token": "<|eos|>"})
    if tok.unk_token is None:
        tok.add_special_tokens({"unk_token": "<|unk|>"})
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token if tok.eos_token is not None else "<|pad|>"
    # Left padding keeps the prompt adjacent to the generated tokens.
    tok.padding_side = "left"
    return tok
print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f" Vocab size : {len(tokenizer)}")
print(f" BOS : {tokenizer.bos_token!r}")
print(f" EOS : {tokenizer.eos_token!r}")
print(f" PAD : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")
print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)
model.eval()
model.to(device)
model.config.use_cache = False
if hasattr(model, "generation_config") and model.generation_config is not None:
    model.generation_config.use_cache = False
total_params = sum(p.numel() for p in model.parameters())
print(f" Parameters : {total_params:,}")
def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
) -> str:
    # Prepend BOS manually, since special tokens are not added by the tokenizer call.
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt
    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    inputs.pop("token_type_ids", None)
    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=False,
    )
    # Sampling parameters are only passed when sampling is enabled.
    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k
    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)
    # Drop the prompt tokens and decode only the newly generated ones.
    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)
if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)
    output = generate(PROMPT)
    print("Generated:")
    print(output)
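The script sets TEMPERATURE, TOP_K and TOP_P but leaves the actual filtering to `model.generate`. To illustrate what these three knobs conventionally do, here is a minimal pure-Python sketch that applies temperature scaling, then top-k, then top-p (nucleus) filtering to a list of logits. It is an illustration under stated assumptions, not the transformers implementation: the function names `filter_distribution` and `sample_token` are hypothetical, and the repetition penalty is not modelled.

```python
import math
import random


def filter_distribution(logits, temperature=0.7, top_k=30, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


def sample_token(logits, rng=None, **kwargs):
    """Draw one token id from the filtered distribution."""
    rng = rng or random.Random()
    dist = filter_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
    print(filter_distribution(toy_logits, temperature=1.0, top_k=2, top_p=1.0))
    print(sample_token(toy_logits, rng=random.Random(0)))
```

Lowering `temperature`, `top_k` or `top_p` all narrow the set of tokens that can be drawn, which for a model of this size tends to trade variety for coherence.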
Copyright (c) 2026 FromZero
Copyright (c) 2026 Paul Courneya
Copyright (c) 2026 Jonathan LY
@misc{syn2.6m,
title = {Syn-2.6M: A Tiny Language Model trained on 2.7B tokens of Synthetic Data},
author = {FromZero},
year = {2026},
url = {https://huggingface.co/fromziro/Syn-2.6M}
}