InfScale is a Python package that provides a unified and practical framework for applying inference-time scaling to large language models (LLMs).
At its core, InfScale moves beyond single-shot generation. It produces multiple candidate responses—either complete or partial—from one or more LLMs, and then intelligently selects or aggregates them to yield a higher-quality final output.
This strategy allows developers to enhance performance across a variety of tasks—including text summarization, question answering, information extraction, and paraphrasing—without the need for model fine-tuning or retraining.
By optimizing at inference time, InfScale offers a scalable and cost-efficient alternative to training-heavy approaches. We view this paradigm as a practical, high-impact way to improve LLM applications in real-world settings: it leverages model diversity and response sampling to increase the probability of obtaining a higher-quality output.
The current architecture of InfScale is shown below:
Pipeline overview:
- Multiple LLMs generate candidate responses
- Each model can generate N samples
- All responses are collected
- A scoring mechanism selects the best candidate
- Model Loading: The library loads one model from a set of predefined models for the given task (text summarization, question answering, information extraction, etc.). Currently we support generating multiple response samples from a single model; we plan to add a model-selector layer that generates and blends responses from different models.
- Generate Multiple Responses: Each model generates N candidate responses for the same input. The library is designed to accept a batch of queries / input pieces of text.
- Compute Semantic Similarity: All responses are evaluated with an evaluation metric that approximates human judgment as closely as possible.
- Generate the Final Response: The response with the highest score is selected as the final output (a minimal sketch of this end-to-end flow follows below). In the future we will implement approaches that blend the top-K responses from one or several models. Stay tuned!
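To make these steps concrete, here is a minimal, self-contained sketch of the best-of-N idea with a single summarization model. It is illustrative only and not the InfScale API: the helper name best_of_n_summary is hypothetical, and it assumes the transformers and sentence-transformers packages are available.

# Illustrative best-of-N sketch (not the InfScale API): sample N candidate
# summaries from one model, score each against the source text with a
# reference-free cosine similarity, and keep the highest-scoring one.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

def best_of_n_summary(text, n=3):
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # Step 1: generate N candidate summaries for the same input (sampling enabled).
    candidates = [
        summarizer(text, do_sample=True, truncation=True)[0]["summary_text"]
        for _ in range(n)
    ]

    # Step 2: score each candidate against the source text (reference-free).
    source_emb = embedder.encode(text, convert_to_tensor=True)
    candidate_embs = embedder.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(source_emb, candidate_embs)[0]

    # Step 3: select the candidate most similar to the source.
    return candidates[int(scores.argmax())]

InfScale applies the same pattern, but works on batches of queries and supports more than one scoring metric (see below).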
Currently, InfScale supports the following task:
- Text Summarization
We plan to support the following tasks in upcoming releases:
- Question-Answering
- Information Extraction
- Paraphrasing
The models we currently support (from the pool of models on Hugging Face) are facebook/bart-large-cnn and sshleifer/distilbart-cnn-12-6. Why? If you look at the list of Hugging Face summarization models, you will find that these are the most liked ones. We know this might be a naive approach for selecting which models to support; in the future we plan more rigorous benchmarking.
As InfScale is designed to be "scalable", we focus, for now, on reference-free metrics for automated summarization evaluation. You can check this simple article for more information. We support two metrics, both sketched briefly after the list below:
- Cosine Similarity: We simply embed the query and the resulting summary using all-MiniLM-L6-v2, then calculate Cos(embedding(query), embedding(response)).
- ROUGE: This is one of the most classical metrics for text-summarization tasks. We apply ROUGE in an unsupervised (reference-free) fashion: we calculate ROUGE using the input query as the reference text and the response as the predicted text. We use the rouge-score library in our current implementation; in the future we will create our own to control the calculation process.
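As a quick illustration of the cosine-similarity metric (a sketch assuming sentence-transformers is installed; the exact call sequence inside InfScale may differ):

from sentence_transformers import SentenceTransformer, util

# Embed the input text and a candidate summary with all-MiniLM-L6-v2,
# then score the pair with cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
query = "The city council approved the new park budget after a long debate."
response = "The council approved the park budget."

query_emb = model.encode(query, convert_to_tensor=True)
response_emb = model.encode(response, convert_to_tensor=True)
print(util.cos_sim(query_emb, response_emb).item())  # value in [-1, 1]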
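And the reference-free ROUGE computation can be approximated like this (a sketch using the rouge-score package; the ROUGE variants shown are illustrative):

from rouge_score import rouge_scorer

# Reference-free usage: the input query acts as the reference text and the
# candidate summary is scored against it as the prediction.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
query = "The city council approved the new park budget after a long debate."
response = "The council approved the park budget."

scores = scorer.score(target=query, prediction=response)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)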
Install InfScale and the dependencies used in the example below:
pip install InfScale datasets sentence-transformers rich
import json
from InfScale.best_of_n import BestOfNSampler
from datasets import load_dataset
from rich import print_json
if __name__ == "__main__":
    # Candidate models
    model_names = [
        "Sachin21112004/distilbart-news-summarizer",
        "google/pegasus-xsum"
    ]

    # Initialize Best-of-N sampler
    bon = BestOfNSampler(models_names=model_names)

    # Load dataset
    dataset = load_dataset("cnn_dailymail", "3.0.0")

    # Example queries
    queries = [
        dataset["train"][0]["article"],
        dataset["train"][1]["article"],
        dataset["train"][2]["article"]
    ]

    # Generate responses
    results = bon.generate(queries=queries, n=3)

    # Pretty print results
    print_json(json.dumps(results, indent=4))
If you are interested in the details of development and the changes in each version, check the CHANGELOG.
