Introducing 𝗼𝗽𝗲𝗻 𝗗𝗲𝗲𝗽-𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵 by Hugging Face! 🔥
OpenAI's latest agentic app Deep Research seems really good... But it's closed, as usual.
⏱️ So with a team of cracked colleagues, we set ourselves a 24-hour deadline to replicate and open-source Deep Research! ⏱️
➡️ We built open-Deep-Research, an entirely open agent that can: navigate the web autonomously, scroll and search through pages, download and manipulate files, run calculations on data...
We aimed for the best performance: are the agent's answers truly rigorous?
On the GAIA benchmark, OpenAI's Deep Research scored 67% accuracy on the validation set. ➡️ open Deep Research is at 55% (powered by o1), which makes it:
- the best pass@1 solution submitted
- the best open solution 💪💪
And it's only getting started! Please jump in, drop PRs, and let's bring it to the top!
Now you can launch a code agent directly from your terminal! ✨ 𝚜𝚖𝚘𝚕𝚊𝚐𝚎𝚗𝚝 "𝚢𝚘𝚞𝚛 𝚝𝚊𝚜𝚔" directly launches a CodeAgent ▶️ This also works with web agents (replace 𝚜𝚖𝚘𝚕𝚊𝚐𝚎𝚗𝚝 with 𝚠𝚎𝚋𝚊𝚐𝚎𝚗𝚝) thanks to @merve!
💾 Another treat from the smolagents 1.7.0 release: agents now have a memory mechanism, enabling many possibilities like replaying the last run with 𝚊𝚐𝚎𝚗𝚝.𝚛𝚎𝚙𝚕𝚊𝚢(). Thank you @clefourrier!
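For reference, here's what this looks like from Python: a minimal sketch assuming smolagents >= 1.7.0, with an illustrative model and task.

```python
# Minimal sketch, assuming smolagents >= 1.7.0; model and task are illustrative.
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(tools=[], model=HfApiModel(), add_base_tools=True)
agent.run("How many seconds would it take a leopard at full speed to cross Pont des Arts?")

# New in 1.7.0: replay the last run from the agent's memory
agent.replay()
```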
Hosting our own inference was not enough: the Hub now has 4 new inference providers: fal, Replicate, SambaNova Systems, & Together AI.
Check model cards on the Hub: you can now, in 1 click, use inference from various providers (see the video demo).
Their inference can also be used through our Inference API client. There, you can use either your own provider key or your HF token; with the HF token, billing is handled directly on your HF account, centralizing all your expenses.
💸 Also, PRO users get $2 of inference credits per month!
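Here's a minimal sketch of provider routing through the client, assuming a recent huggingface_hub; the model and provider are just examples.

```python
from huggingface_hub import InferenceClient

# Authenticate with your HF token to centralize billing on your HF account,
# or pass your own provider key as api_key to be billed by the provider directly.
client = InferenceClient(provider="together", api_key="hf_***")

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(completion.choices[0].message)
```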
We are reproducing the full DeepSeek-R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret, we can do it together in the open!
🧪 Step 1: replicate the R1-Distill models by distilling a high-quality reasoning corpus from DeepSeek-R1.
🧠 Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.
🔥 Step 3: show we can go from base model -> SFT -> RL via multi-stage training.
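To make Step 1 concrete, here's a hypothetical sketch of the distillation loop (not the actual open-r1 code): sample reasoning traces from DeepSeek-R1 and keep only the ones whose final answer verifies.

```python
# Hypothetical sketch of Step 1 (not the actual open-r1 code); helper names,
# the verification rule, and the toy dataset are all illustrative.
from huggingface_hub import InferenceClient

client = InferenceClient(model="deepseek-ai/DeepSeek-R1")
math_problems = [("What is 2 + 2?", "4")]  # toy stand-in for a real math dataset

def distill_trace(problem: str, answer: str) -> str | None:
    """Sample one reasoning trace; keep it only if the final answer checks out."""
    trace = client.chat.completions.create(
        messages=[{"role": "user", "content": problem}],
        max_tokens=8192,
    ).choices[0].message.content
    # Crude check: R1 closes its reasoning with </think>, so verify the part after it
    return trace if answer in trace.split("</think>")[-1] else None

corpus = [t for p, a in math_problems if (t := distill_trace(p, a))]
# `corpus` then becomes the SFT dataset for training an R1-Distill-style model
```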
Today we make the biggest release in smolagents so far: 𝘄𝗲 𝗲𝗻𝗮𝗯𝗹𝗲 𝘃𝗶𝘀𝗶𝗼𝗻 𝗺𝗼𝗱𝗲𝗹𝘀, 𝘄𝗵𝗶𝗰𝗵 𝗹𝗲𝘁𝘀 𝘆𝗼𝘂 𝗯𝘂𝗶𝗹𝗱 𝗽𝗼𝘄𝗲𝗿𝗳𝘂𝗹 𝘄𝗲𝗯 𝗯𝗿𝗼𝘄𝘀𝗶𝗻𝗴 𝗮𝗴𝗲𝗻𝘁𝘀! 🥳
Our agents can now casually open up a web browser and navigate it: scrolling, clicking elements on the webpage, going back, just like a user would.
The demo below shows Claude-3.5-Sonnet browsing GitHub for the task: "Find how many commits the author of the current top trending repo did over last year." Hi @mlabonne!
Go try it out, it's the most cracked agentic stuff I've seen in a while 🤯 (well, along with OpenAI's Operator, which beat us by one day).
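If you want a feel for how the browsing is wired up, here's a simplified sketch based on the helium-powered example in the smolagents repo; the tool set and model choice are illustrative.

```python
# Simplified sketch of a vision web-browsing agent; details differ from the repo example.
from io import BytesIO
from PIL import Image
import helium
from smolagents import CodeAgent, LiteLLMModel, tool

@tool
def go_back() -> None:
    """Goes back to the previous page."""
    helium.get_driver().back()

def save_screenshot(memory_step, agent):
    # After each step, attach a browser screenshot so the VLM can see the page
    png = helium.get_driver().get_screenshot_as_png()
    memory_step.observations_images = [Image.open(BytesIO(png))]

model = LiteLLMModel(model_id="anthropic/claude-3-5-sonnet-latest")
agent = CodeAgent(tools=[go_back], model=model, step_callbacks=[save_screenshot])

helium.start_chrome(headless=False)
agent.run("Find how many commits the author of the current top trending repo did over last year.")
```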
This work from Chinese startup @MiniMax-AI introduces a novel architecture that achieves state-of-the-art performance while handling context windows up to 4 million tokens - roughly 20x longer than current models. The key was combining lightning attention, mixture of experts (MoE), and a careful hybrid approach.
𝗞𝗲𝘆 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀:
🏗️ MoE with novel hybrid attention:
‣ Mixture of Experts with 456B total parameters (45.9B activated per token)
‣ Combines Lightning attention (linear complexity) for most layers and traditional softmax attention every 8 layers (see the sketch after this list)
📊 Outperforms leading models across benchmarks while offering vastly longer context:
‣ Competitive with GPT-4/Claude-3.5-Sonnet on most tasks
‣ Can efficiently handle 4M token contexts (vs 256K for most other LLMs)
🔬 Technical innovations enable efficient scaling:
‣ Novel expert parallel and tensor parallel strategies cut communication overhead in half
‣ Improved linear attention sequence parallelism, multi-level padding and other optimizations achieve 75% GPU utilization (that's really high; utilization is generally around 50%)
🎯 Thorough training strategy:
‣ Careful data curation and quality control, using a smaller preliminary version of their LLM as a judge!
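To make the hybrid pattern concrete, here's a toy sketch (my reading of the paper, not MiniMax's code) of stacking linear attention with full softmax attention every 8th layer; heads and the MoE FFNs are omitted for brevity.

```python
# Toy sketch of the hybrid attention stacking described above (not MiniMax's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Linear-complexity attention: kernelized (elu+1) Q/K, O(n) in sequence length."""
    def __init__(self, d):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)

    def forward(self, x):                                  # x: (batch, n, d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1                  # positive feature map
        kv = torch.einsum("bnd,bne->bde", k, v)            # O(n * d^2), not O(n^2)
        z = 1 / (q @ k.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)
        return torch.einsum("bnd,bde->bne", q, kv) * z

class SoftmaxAttention(nn.Module):
    """Standard full attention, O(n^2), used only on every `period`-th layer."""
    def __init__(self, d):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return F.scaled_dot_product_attention(q, k, v)

def build_stack(d=512, n_layers=16, period=8):
    # Softmax attention every `period`-th layer, linear attention everywhere else
    return nn.ModuleList(
        SoftmaxAttention(d) if (i + 1) % period == 0 else LinearAttention(d)
        for i in range(n_layers)
    )
```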
Overall, not only is the model impressive, but the technical paper is also really interesting! 📖 It has lots of insights, including a great comparison showing how a 2B MoE (24B total parameters) far outperforms a 7B dense model for the same amount of FLOPs.
𝗪𝗲'𝘃𝗲 𝗷𝘂𝘀𝘁 𝗿𝗲𝗹𝗲𝗮𝘀𝗲𝗱 𝘀𝗺𝗼𝗹𝗮𝗴𝗲𝗻𝘁𝘀 𝘃𝟭.𝟯.𝟬 🚀, and it comes with a major feature: you can now log agent runs using OpenTelemetry to inspect them afterwards! 🔎
This interactive format makes it much easier, IMO, to inspect big multi-step runs than endless console logs.
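The setup is just a few lines. This sketch assumes the OpenInference smolagents instrumentor and an OTLP-compatible backend; the endpoint (a local Phoenix server) is an example.

```python
# Instrument smolagents with OpenTelemetry; the endpoint is an example,
# any OTLP-compatible tracing backend works.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.smolagents import SmolagentsInstrumentor

trace_provider = TracerProvider()
trace_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://0.0.0.0:6006/v1/traces"))
)
SmolagentsInstrumentor().instrument(tracer_provider=trace_provider)
# Every agent.run() after this point is traced and inspectable in the UI
```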
The main bottleneck in building GUI agents is finding training data. GUI agent trajectories are not easy to come by. Crowdsourcing trajectories, then manually annotating them, could be an option, but it's hard to do at scale.
You could use synthetic data generation (ask 1000s of small existing GUI agents to solve tasks, keep only the successful runs). But then it's hard to come up with many high-level tasks.
➡️ Well, a novel technique was just published that creates a promising new paradigm for synthetic data generation: Shanghai AI Lab researchers propose OS-Genesis, a new way to create training data for GUI agents that flips the traditional approach on its head. Instead of starting with predefined tasks and having humans or machines execute them, OS-Genesis first explores the interface naturally, then derives meaningful tasks from those interactions (see the sketch after the list below).
🔍 Exploration-driven vs task-driven approach:
‣ Instead of starting with tasks, OS-Genesis first explores GUIs by clicking and interacting
‣ It then reverse-engineers high-level tasks from successful interaction patterns
‣ This leads to more natural and diverse training data than predefined tasks
🎯 Novel reward model for trajectory quality:
‣ Rather than discarding incomplete trajectories, OS-Genesis scores them based on coherence and completion
‣ This preserves valuable partial successes that would otherwise be wasted
📈 Superior results across environments:
‣ Nearly doubles performance on AndroidWorld (9.8% → 17.4%)
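In pseudocode, the paradigm flip looks something like this; env and the three callables are hypothetical placeholders, not the authors' code.

```python
# Hypothetical sketch of the OS-Genesis pipeline; `env`, `sample_action`,
# `derive_task`, and `reward_model` are placeholders, not paper code.
import random

def os_genesis(env, sample_action, derive_task, reward_model,
               n_episodes: int = 1000, threshold: float = 0.5) -> list[dict]:
    dataset = []
    for _ in range(n_episodes):
        # 1) Task-free exploration: interact with the GUI first
        obs, trajectory = env.reset(), []
        for _ in range(random.randint(3, 15)):
            action = sample_action(obs)        # e.g. click a visible UI element
            obs = env.step(action)
            trajectory.append((obs, action))
        # 2) Reverse task synthesis: derive a high-level instruction from the trace
        task = derive_task(trajectory)         # e.g. ask a VLM to label the episode
        # 3) Trajectory reward model: score coherence/completion instead of
        #    discarding partial successes outright
        score = reward_model(task, trajectory)
        if score >= threshold:
            dataset.append({"task": task, "trajectory": trajectory, "score": score})
    return dataset
```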
By the way, this field of GUI agents is still in its infancy, so you can still make a difference with "low-cost" setups: their paper gets SOTA results with only 8 A100s!
Since I published it on GitHub a few days ago, Hugging Face's new agentic library 𝘀𝗺𝗼𝗹𝗮𝗴𝗲𝗻𝘁𝘀 has gathered nearly 4k stars 🤯
➡️ But we are just getting started on agents, so we are hiring an ML Engineer to join me and double down on this effort!
The plan is to build GUI agents: agents that can act on your computer with mouse & keyboard, like Claude Computer Use.
I was initially pretty sceptical about Meta's Coconut paper [1] because the largest perf gains were reported on toy linguistic problems. However, these results on machine translation are pretty impressive!
* Iteratively sample CoTs from the model, using a mix of different search strategies. This gives you something like Stream of Search via prompting.
* Verify the correctness of each CoT using GPT-4o (needed because exact match doesn't work well in medicine, where there are lots of aliases).
* Use GPT-4o to reformat the concatenated CoTs into a single stream that includes smooth transitions like "hmm, wait" etc. that one sees in o1.
* Use the resulting data for SFT & RL.
* Use sparse rewards from GPT-4o to guide RL training. They find RL gives an average ~3 point boost across medical benchmarks, and SFT on this data alone already gives a strong improvement.
Applying this strategy to other domains could be quite promising, provided the training data can be formulated as verifiable problems!
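For the curious, here's a compressed sketch of that data-generation loop; the judge prompt, the sampler, and the strategy names are illustrative, not the paper's code.

```python
# Compressed sketch of the CoT data-generation loop above; `sampler` and the
# prompts are illustrative stand-ins, not the paper's code.
from openai import OpenAI

client = OpenAI()

def judge(question: str, gold: str, cot: str) -> bool:
    """Use GPT-4o as a verifier, since exact match fails on medical aliases."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nReference answer: {gold}\n"
                       f"Candidate reasoning: {cot}\n"
                       "Does the candidate reach the reference answer? Reply yes or no.",
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def build_sft_example(question: str, gold: str, sampler) -> str | None:
    # 1) Sample CoTs with different search strategies (`sampler` is assumed)
    cots = [sampler(question, strategy=s) for s in ("greedy", "backtrack", "explore")]
    # 2) Keep only the CoTs that the judge verifies
    verified = [c for c in cots if judge(question, gold, c)]
    if not verified:
        return None
    # 3) Reformat into one o1-style stream with smooth transitions ("hmm, wait...")
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Merge these reasoning traces into one natural stream "
                              "of thought with transitions like 'hmm, wait':\n\n"
                              + "\n---\n".join(verified)}],
    )
    return resp.choices[0].message.content  # -> used for SFT, then RL with sparse rewards
```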