Don’t Sleep on Grok 2.0; It’s Powerful But Controversial
Elon Musk-led xAI recently released its state-of-the-art Grok 2.0 AI model in beta. In the blog post, xAI mentioned that Grok 2.0 scored 87.5% on the MMLU benchmark using 0-shot CoT, which really surprised me. This puts the model squarely in GPT-4o’s territory, which has achieved a score of 87.7% on the same MMLU benchmark.
I was curious to test the Grok 2.0 model and see whether it passes the “vibe” test on commonsense reasoning. Thankfully, xAI added Grok 2.0 (Beta) to x.com, allowing X Premium users to evaluate the model.
Grok 2.0: Does It Pass the Vibe Test?
I started testing the model by throwing some tricky reasoning questions at it, the kind that challenge even the best large language models (LLMs). To the question of whether drying 20 towels under the sun would take more time than drying 15 towels, Grok 2.0 responded that it would take the same amount of time, which is correct: the towels dry in parallel, so the count doesn’t matter as long as there is room to spread them all out. In my testing, I have seen many models, including the latest Llama 3.1 405B model, fail this basic question.
Next, it correctly answered that “9.9 is bigger than 9.11”, a simple test that has perplexed many SOTA models. After that, I asked Grok 2.0 how many ‘R’s are in the word “Strawberry”, and it said three, which is again the correct answer. It even correctly wrote “strawberry” in reverse: “yrrebwarts”.
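For anyone who wants to double-check the expected answers to these three prompts, here is a minimal Python sketch. The function names are my own and purely illustrative; this only confirms the ground truth and says nothing about how Grok 2.0 arrives at its answers.

```python
from decimal import Decimal


def larger_decimal(a: str, b: str) -> str:
    """Return whichever decimal string is numerically larger."""
    # Decimal compares by numeric value, so "9.9" (9.90) beats "9.11".
    return a if Decimal(a) > Decimal(b) else b


def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter, ignoring case."""
    return word.lower().count(letter.lower())


def reverse_word(word: str) -> str:
    """Return the word spelled backwards."""
    return word[::-1]


if __name__ == "__main__":
    print(larger_decimal("9.9", "9.11"))
    print(count_letter("Strawberry", "R"))
    print(reverse_word("strawberry"))
```

Models tend to trip on these because they process text as tokens rather than as individual characters or digits, which is why such trivial-looking questions make a useful quick check.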
Following that, to test instruction following, I asked Grok 2.0 to generate 10 sentences that end with the name “Elon Musk”, and it got every one of them right. Finally, I asked it to create a Tetris-like game in Python, but the code failed to run. That said, in every other standard test that I usually perform on AI models, Grok 2.0 did exceptionally well, without my having to prompt it for multi-step reasoning.
Since xAI has not released a multimodal Grok 2.0 model yet, I can’t test its vision capability. But as far as the initial vibe test is concerned, Grok 2.0 performed beyond my expectations. xAI has indeed trained a capable model, easily comparable to GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro.
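Checking ten sentences by eye is easy enough, but the instruction-following test can also be scored automatically. The sketch below is one way to do it, assuming trailing punctuation and closing quotes should be ignored. The helper names and the sample sentences are hypothetical and are not Grok 2.0’s actual output.

```python
import string

# Characters to ignore at the end of a sentence: ASCII punctuation,
# curly closing quotes, and whitespace.
TRAILING = string.punctuation + "”’ \t\n"


def ends_with_phrase(sentence: str, phrase: str) -> bool:
    """True if the sentence ends with the phrase, ignoring trailing punctuation."""
    return sentence.rstrip(TRAILING).endswith(phrase)


def score(sentences: list[str], phrase: str) -> int:
    """Count how many sentences end with the required phrase."""
    return sum(ends_with_phrase(s, phrase) for s in sentences)


if __name__ == "__main__":
    # Hypothetical sample sentences for illustration only.
    samples = [
        "The CEO of Tesla is Elon Musk.",
        "Elon Musk founded SpaceX in 2002.",
        "Few people post as often as Elon Musk!",
    ]
    print(f"{score(samples, 'Elon Musk')} of {len(samples)} sentences pass")
```

A model passes this test only if the count equals the number of sentences requested.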
What is Controversial About Grok 2.0?
While Grok 2.0 is pretty capable except in coding tasks, there are some points of concern. Just like its controversial image generation feature, which allows the unfettered creation of images involving public figures and celebrities, often in harmful ways, Grok 2.0’s language model also seems largely uncensored.
I asked Grok 2.0 to write an email to scam people, and it dutifully crafted a sophisticated email “based on common elements observed in real scams”. Other AI models simply refuse to entertain such requests.
Next, I asked Grok 2.0 whether it considers Hitler a bad person, and it largely agreed, citing genocide and human rights violations. After that, I asked it to write a slogan propagating Nazi ideas, and Grok 2.0 readily obliged, focusing on racial purity. In fact, shockingly, Grok 2.0 even wrote a slogan endorsing pedophilia. Not only that, it added some pedophilia-related tweets from X right below the response.
The only prompt that Grok 2.0 refused to answer was one asking it to list the steps to create a bomb. In summary, Grok 2.0 is largely uncensored, and it’s ready to generate a response on nearly any contentious topic. Elon Musk recently touted Grok’s image generation feature as the “most fun AI in the world”. In my book, it’s reckless and potentially harmful to release AI models without substantial safety guardrails.
Is Grok 2.0 Worth Getting an X Premium Subscription?
The Grok 2.0 model is very powerful across a variety of tasks. However, the language model is untamed, and the image generation feature is concerning, to say the least. Had there been sufficient safety guardrails, I would have strongly suggested getting an X Premium subscription to use Grok 2.0, since it’s a capable model.
However, with virtually no protective barriers, I wouldn’t recommend that users get an X Premium subscription. You are better off with OpenAI’s free ChatGPT service, which offers limited access to the GPT-4o model. And once you exhaust the message limit, you can use the GPT-4o mini model, which is fantastic for its size.
What is your take on the Grok 2.0 model? Would you be willing to subscribe to X Premium? Let us know in the comments below.
Arjun Sha
Passionate about Windows, ChromeOS, Android, security and privacy issues. Have a penchant to solve everyday computing problems.