Introducing open Deep-Research by Hugging Face!
OpenAI's latest agentic app Deep Research seems really good... But it's closed, as usual.
⏱️ So with a team of cracked colleagues, we set ourselves a 24-hour deadline to replicate and open-source Deep Research! ⏱️
➡️ We built open-Deep-Research, an entirely open agent that can: navigate the web autonomously, scroll and search through pages, download and manipulate files, run calculations on data...
We aimed for the best performance: are the agent's answers really rigorous?
On the GAIA benchmark, Deep Research had 67% accuracy on the validation set. ➡️ open Deep Research is at 55% (powered by o1), which makes it:
- the best pass@1 solution submitted
- the best open solution
And it's only getting started! Please jump in, drop PRs, and let's bring it to the top!
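For readers new to the metric: pass@1 means every question gets a single attempt, and the score is the share of correct answers. Here is a minimal sketch of that computation. The exact-match normalization and the toy answers are assumptions for illustration; GAIA's official scorer has its own matching rules.

```python
def pass_at_1(predictions, references):
    """Fraction of questions answered correctly with one attempt each."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one question")

    # Hypothetical normalization: trim whitespace and ignore case.
    def normalize(answer):
        return answer.strip().lower()

    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Toy data, not GAIA questions.
preds = ["Paris", "42", "blue", "7"]
refs = ["paris", "42", "red", "8"]
score = pass_at_1(preds, refs)
print(f"pass@1 accuracy: {score:.0%}")
```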
Now you can launch a code agent directly from your terminal! ✨ smolagent "Your task" directly launches a CodeAgent ▶️ This also works with web agents (replace smolagent with webagent) thanks to @merve!
Another treat from smolagents release 1.7.0: agents now have a memory mechanism, enabling many possibilities like replaying the last run with agent.replay(), thank you @clefourrier!
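To make the idea of a step memory concrete, here is a toy sketch of recording steps during a run and replaying them afterwards. The classes `Step` and `ToyMemory` are hypothetical and written only for this illustration; they are not the smolagents implementation, whose internals may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded step of a run (hypothetical structure)."""
    number: int
    action: str
    observation: str


@dataclass
class ToyMemory:
    """Keeps every step of a run so it can be inspected later."""
    steps: list = field(default_factory=list)

    def record(self, action, observation):
        # Steps are numbered in the order they are recorded, starting at 1.
        self.steps.append(Step(len(self.steps) + 1, action, observation))

    def replay(self):
        """Return the run as printable lines, one per recorded step."""
        return [f"Step {s.number}: {s.action} -> {s.observation}" for s in self.steps]


memory = ToyMemory()
memory.record("search('GAIA benchmark')", "found 3 results")
memory.record("open(result[0])", "page loaded")
for line in memory.replay():
    print(line)
```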
If you are using AWS, give this a read. It is a running document to showcase how to deploy and fine-tune DeepSeek R1 models with Hugging Face on AWS.
We're working hard to enable all the scenarios, whether you want to deploy to Inference Endpoints, SageMaker or EC2; with GPUs or with Trainium & Inferentia.
We have full support for the distilled models, and DeepSeek-R1 support is coming soon! I'll keep you posted.
Hosting our own inference was not enough: now the Hub has 4 new inference providers: fal, Replicate, SambaNova Systems, & Together AI.
Check model cards on the Hub: you can now, in 1 click, use inference from various providers (cf video demo)
Their inference can also be used through our Inference API client. There, you can use either your custom provider key or your HF token; with the HF token, billing is handled directly on your HF account, as a way to centralize all expenses.
Also, PRO users get $2 of inference credits per month!
Today we make the biggest release in smolagents so far: we enable vision models, which allows you to build powerful web browsing agents! 🥳
Our agents can now casually open up a web browser, and navigate on it by scrolling, clicking elements on the webpage, going back, just like a user would.
The demo below shows Claude-3.5-Sonnet browsing GitHub for the task: "Find how many commits the author of the current top trending repo did over last year." Hi @mlabonne!
Go try it out, it's the most cracked agentic stuff I've seen in a while 🤯 (well, along with OpenAI's Operator, which beat us by one day)
With the big hype around AI agents these days, I couldn't stop thinking about how AI agents could truly enhance real-world activities. What sort of applications could we build with those AI agents: agentic RAG? self-correcting text-to-sql? Nah, boring…
Passionate about the outdoors, I've always dreamed of a tool that could simplify planning mountain trips while accounting for all potential risks. That's why I built Alpine Agent, a smart assistant designed to help you plan safe and enjoyable itineraries in the French Alps and Pyrenees.
Built using Hugging Face's smolagents library, Alpine Agent combines the power of AI with trusted resources like Skitour.fr (https://skitour.fr/) and METEO FRANCE. Whether it's suggesting a route with moderate difficulty or analyzing avalanche risks and weather conditions, this agent dynamically integrates data to deliver personalized recommendations.
In my latest blog post, I share how I developed this project, from defining tools and integrating APIs to selecting the best LLMs like Qwen2.5-Coder-32B-Instruct, Llama-3.3-70B-Instruct, or GPT-4.
⛷️ Curious how AI can enhance adventure planning? Try the app and share your thoughts: florentgbelidji/alpine-agent
Want to build your own agents? Whether for cooking, sports training, or other passions, the possibilities are endless. Check out the blog post to learn more: https://huggingface.co/blog/florentgbelidji/alpine-agent
Many thanks to @m-ric for helping build this tool with smolagents!
This work from Chinese startup @MiniMax-AI introduces a novel architecture that achieves state-of-the-art performance while handling context windows up to 4 million tokens - roughly 20x longer than current models. The key was combining lightning attention, mixture of experts (MoE), and a careful hybrid approach.
Key insights:
MoE with novel hybrid attention:
‣ Mixture of Experts with 456B total parameters (45.9B activated per token)
‣ Combines Lightning attention (linear complexity) for most layers and traditional softmax attention every 8 layers
Outperforms leading models across benchmarks while offering vastly longer context:
‣ Competitive with GPT-4/Claude-3.5-Sonnet on most tasks
‣ Can efficiently handle 4M token contexts (vs 256K for most other LLMs)
Technical innovations enable efficient scaling:
‣ Novel expert parallel and tensor parallel strategies cut communication overhead in half
‣ Improved linear attention sequence parallelism, multi-level padding and other optimizations achieve 75% GPU utilization (that's really high, generally utilization is around 50%)
Thorough training strategy:
‣ Careful data curation and quality control by using a smaller preliminary version of their LLM as a judge!
Overall, not only is the model impressive, but the technical paper is also really interesting! It has lots of insights, including a great comparison showing how a MoE with 2B activated parameters (24B total) far outperforms a 7B dense model for the same amount of FLOPs.
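To picture the hybrid layout described above, here is a small sketch that labels each layer as lightning or softmax attention, and computes the share of parameters active per token from the quoted figures. The 16-layer depth is a toy value, and placing the softmax layer last in each group of 8 is an assumption, not something stated in the post.

```python
def attention_layout(num_layers, softmax_every=8):
    """Label layers: softmax on every `softmax_every`-th layer, lightning otherwise."""
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "lightning"
        for i in range(num_layers)
    ]


layout = attention_layout(16)  # 16 layers is an arbitrary toy depth
print(layout.count("lightning"), "lightning layers,", layout.count("softmax"), "softmax layers")

# Share of parameters active per token, using the numbers quoted in the post.
total_params_b = 456.0
active_params_b = 45.9
active_share = active_params_b / total_params_b
print(f"active share per token: {active_share:.1%}")
```

So only about a tenth of the weights do work on any given token, which is where the MoE efficiency comes from.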
We've just released smolagents v1.3.0, and it comes with a major feature: you can now log agent runs using OpenTelemetry to inspect them afterwards!
This interactive format is IMO much easier to inspect than endless console logs, especially for big multi-step runs.
Microsoft's rStar-Math paper claims that ~7B models can match the math skills of o1 using clever train- and test-time techniques. You can now download their prompt templates from Hugging Face!
- The paper introduces rStar-Math, which claims to rival OpenAI o1's math reasoning capabilities by integrating Monte Carlo Tree Search (MCTS) with step-by-step verified reasoning trajectories.
- A Process Preference Model (PPM) enables fine-grained evaluation of intermediate steps, improving training data quality.
- The system underwent four rounds of self-evolution, progressively refining both the policy and reward models to tackle Olympiad-level math problems, without GPT-4-based data distillation.
- While we wait for the release of code and datasets, you can already download the prompts they used from the HF Hub!
Details and links here:
Prompt-templates docs: https://moritzlaurer.github.io/prompt_templates/
Templates on the hub: MoritzLaurer/rstar-math-prompts
Prompt-templates collection: MoritzLaurer/prompt-templates-6776aa0b0b8a923957920bb4
Paper: https://arxiv.org/pdf/2501.04519
The main bottleneck in building GUI agents is finding training data. GUI agent trajectories are not easy to come by. Crowdsourcing trajectories, then manually annotating them, could be an option, but it's hard to do at scale.
You could use synthetic data generation (ask 1000s of small existing GUI agents to solve tasks, keep only successful runs). But then it's hard to come up with many high-level tasks.
➡️ Well, a novel technique was just published that creates a promising new paradigm for synthetic data generation: Shanghai AI Lab researchers propose OS-Genesis, a novel way to create training data for GUI agents that flips the traditional approach on its head. Instead of starting with predefined tasks and having humans or machines execute them, OS-Genesis first explores the interface naturally, then derives meaningful tasks from those interactions.
Exploration-driven vs task-driven approach:
‣ Instead of starting with tasks, OS-Genesis first explores GUIs by clicking and interacting
‣ It then reverse-engineers high-level tasks from successful interaction patterns
‣ This leads to more natural and diverse training data than predefined tasks
Novel reward model for trajectory quality:
‣ Rather than discarding incomplete trajectories, OS-Genesis scores them based on coherence and completion
‣ This preserves valuable partial successes that would otherwise be wasted
Superior results across environments:
‣ Nearly doubles performance on AndroidWorld (9.8% → 17.4%)
By the way, this field of GUI agents is still in its infancy, so you can still make a difference with "low-cost" setups: their paper gets SOTA results with only 8xA100!
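The "score instead of discard" idea above can be sketched in a few lines: rather than a keep/drop filter, each trajectory gets a score, and training samples are drawn in proportion to it. The trajectory names and scores below are made up, and proportional sampling is my assumption about one reasonable way to use such scores, not a claim about the paper's exact procedure.

```python
import random


def sampling_weights(scored_trajectories):
    """Turn per-trajectory scores into sampling probabilities (no hard filter)."""
    total = sum(score for _, score in scored_trajectories)
    if total <= 0:
        raise ValueError("at least one trajectory needs a positive score")
    return {name: score / total for name, score in scored_trajectories}


# Hypothetical scores (higher = more coherent and complete), as a reward model might assign.
scored = [("complete_run", 5), ("partial_run", 3), ("incoherent_run", 1)]
weights = sampling_weights(scored)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
batch = rng.choices(list(weights), weights=list(weights.values()), k=5)
print(weights)
print(batch)
```

Partial runs still show up in the batch, just less often than complete ones.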
FACTS is a great paper from @GoogleDeepMind on measuring the factuality of LLM outputs. You can now download their prompt templates from @huggingface to improve LLM-based fact-checking yourself!
- The paper introduces the FACTS Grounding benchmark for evaluating the factuality of LLM outputs.
- Fact-checking is automated by an ensemble of LLM judges that verify if a response is fully grounded in a factual reference document.
- The authors tested different prompt templates on held-out data to ensure their generalization.
- It's highly educational to read these templates to learn how frontier labs design prompts and understand their limitations.
- You can now download and reuse these prompt templates via the prompt-templates library!
- The library simplifies sharing prompt templates on the HF hub or locally via standardized YAML files. Let's make LLM work more transparent and reproducible by sharing more templates like this!
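To illustrate what an ensemble of judges buys you, here is a toy sketch that combines grounded/not-grounded verdicts from several judges into one score. The judge names and verdicts are invented, and averaging per-judge scores is an assumption about the aggregation; check the paper for the exact procedure.

```python
from statistics import mean


def ensemble_score(verdicts_per_judge):
    """Average, over judges, the fraction of responses each judge marks as grounded."""
    per_judge = [
        mean(1.0 if verdict else 0.0 for verdict in verdicts)
        for verdicts in verdicts_per_judge.values()
    ]
    return mean(per_judge)


# Hypothetical verdicts from three judges on the same four responses.
verdicts = {
    "judge_a": [True, True, False, True],
    "judge_b": [True, False, False, True],
    "judge_c": [True, True, False, True],
}
final = ensemble_score(verdicts)
print(f"ensemble factuality score: {final:.3f}")
```

Using several judges dampens the bias any single judge model may have, for example toward its own outputs.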
The TRL v0.13 release is 🔥! My highlights are the new process reward trainer, to train models similar to o1, and tool call support:
- Process reward trainer: Enables training of Process-supervised Reward Models (PRMs), which reward the quality of intermediate steps, promoting structured reasoning. Perfect for tasks like stepwise reasoning.
- Model merging: A new callback leverages mergekit to merge models during training, improving performance by blending reference and policy models, optionally pushing merged models to the Hugging Face Hub.
- Tool call support: TRL preprocessing now supports tool integration, laying the groundwork for agent fine-tuning with examples like dynamic temperature fetching in prompts.
- Mixture of judges: The new AllTrueJudge combines decisions from multiple binary judges for more nuanced evaluation.
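The mixture-of-judges idea is simple enough to sketch in plain Python: a completion passes only if every binary judge accepts it. This is an illustration of the concept, not TRL's AllTrueJudge class or its API; the two judges below are hypothetical stand-ins.

```python
def all_true(judges, prompt, completion):
    """Return 1 only if every binary judge returns 1, otherwise 0."""
    for judge in judges:
        if judge(prompt, completion) != 1:
            return 0
    return 1


# Hypothetical binary judges: each returns 1 (pass) or 0 (fail).
def is_non_empty(prompt, completion):
    return int(bool(completion.strip()))


def is_short(prompt, completion):
    return int(len(completion) <= 40)


judges = [is_non_empty, is_short]
ok = all_true(judges, "What is 2 + 2?", "4")
too_long = all_true(judges, "What is 2 + 2?", "four " * 20)
print(ok, too_long)
```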
Supercharge your LLM apps with Langfuse on Hugging Face Spaces!
Langfuse brings end-to-end observability and tooling to accelerate your dev workflow from experiments through production.
Now available as a Docker Space directly on the HF Hub!
- Trace everything: monitor LLM calls, retrieval, and agent actions with popular frameworks
- One-click deployment: on Spaces with persistent storage and integrated OAuth
- Simple Prompt Management: version, edit, and update without redeployment
- Intuitive Evals: collect user feedback, run model/prompt evaluations, and improve quality
- Dataset Creation: build datasets directly from production data to enhance future performance
Kudos to the Langfuse team for this collab and the awesome, open-first product they're building! @marcklingen @Clemo @MJannik
Since I published it on GitHub a few days ago, Hugging Face's new agentic library smolagents has gathered nearly 4k stars 🤯
➡️ But we are just getting started on agents: so we are hiring an ML Engineer to join me and double down on this effort!
The plan is to build GUI agents: agents that can act on your computer with mouse & keyboard, like Claude Computer Use.
OpenAI is losing money on the $200/month subscription 🤯. It's crazy how expensive it is to run the largest LLMs:
- ChatGPT Pro costs $200/month ($2,400/year) and is still unprofitable for OpenAI due to higher-than-expected usage.
- OpenAI reportedly expected losses of about $5 billion on revenue of $3.7 billion last year, with ChatGPT alone once costing an estimated $700,000 per day to operate.
- They build strong models and do great research. Whether this business model will work in the long run is one of the biggest questions in the AI economy today.
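To put those figures side by side, here is a quick back-of-the-envelope calculation. It assumes the $700,000 daily cost holds for a full 365-day year and that spend equals revenue plus losses; both are simplifications of reported estimates, not audited numbers.

```python
# Subscription price, from the figures quoted above.
monthly_price = 200
yearly_price = monthly_price * 12

# Assumption: the estimated daily operating cost stays constant all year.
daily_cost = 700_000
yearly_chatgpt_cost = daily_cost * 365

# Assumption: total spend is simply revenue plus the reported expected loss.
revenue = 3_700_000_000
losses = 5_000_000_000
implied_spend = revenue + losses

# How many Pro subscriptions would cover the ChatGPT operating estimate alone.
subs_to_cover = yearly_chatgpt_cost / yearly_price

print(f"Pro per year: ${yearly_price:,}")
print(f"ChatGPT operating estimate per year: ${yearly_chatgpt_cost:,}")
print(f"Implied total spend: ${implied_spend:,}")
print(f"Pro subscriptions needed to cover operations: {subs_to_cover:,.0f}")
```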