Three weeks ago, when Kimi put on glasses, we introduced the Kimi Math Edition. Using Kimi recently, I noticed that the glasses-wearing student had transformed: its visual identity (VI) went from flat to three-dimensional, with a blue look that exudes wisdom. At that time, the Kimi Math Edition mainly handled text-based queries, giving step-by-step solutions and structured answers. It accepted only questions in LaTeX format, the standard for mathematical typesetting, and had trouble understanding and answering non-LaTeX, geometric, graphical, or handwritten questions.
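For reference, here is a hypothetical example of the kind of "LaTeX-format question" the Math Edition could parse, as opposed to the image-based variants it could not:

```latex
% A question typed in LaTeX markup (hypothetical example):
Solve for $x$: $\frac{x^2 - 1}{x + 1} = 3$
% The model parses this markup directly; the same question
% photographed, drawn as a diagram, or handwritten was out of scope.
```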
When I asked when image-based problem solving would be supported, Kimi's developers assured me it would be "soon." Sure enough, within three weeks came the K1 model, which supports both visual understanding and chain-of-thought (CoT) reasoning. Note: the Kimi Math Edition launched three weeks ago was based on the K0-math model, where K0 is K-zero (零, líng), not K-O, and not the capital O of the o1 model.

After several days of in-depth use, my conclusion: Kimi's visual thinking model is like an ultra-sharp AI detective, able to dissect the information in an image and deduce step by step. It handles objective subjects such as math and chemistry, as well as subjective, everyday questions, with ease, depth, and detail.

Today we won't look at benchmarks (which may or may not be standardized), nor at demos, which are often carefully tuned results that ordinary users can't reproduce. Only by testing the model's generalization, that is, how it performs on new, unseen data, can we see its true ability, and that is something every user can feel for themselves. Here are 10+ hands-on cases.

1) Question: What is the sum of singing, dancing, rapping, and basketball? I gave Kimi this image to solve. Well, you're straight to the point, aren't you?

2) Question: I just got off at Chengdu East Railway Station. Following this guide map, how long will it take me to get to Kuanzhai Alley, and which subway line should I take? Answer: about 32 minutes (including walking time). Checking against a map app: it indeed recommends Line 2, boarding at Chengdu East Railway Station and alighting at People's Park, about 30 minutes.

Let's ask a harder one. Question: I have 6 hours and want to visit Jinli, Chunxi Road, Jianshe Road, and Dongjiao Memory. Help me plan an itinerary starting from Shuangliu Airport. Kimi then thought it through in detail, step by step.
Kimi organized the final result into a table for a more intuitive presentation. I'm sharing that image here as well; everyone is welcome to come visit Chengdu.
Recently I've been planning a trip to Jiuzhaigou and came across this image online. Question: Based on this image, help me organize a detailed Jiuzhaigou travel guide, including lunch time, for a total of 9 hours. Impressively, even the handwritten text was recognized. This is a headline feature of the K1 model: it can accurately read "noisy" images such as dark photos, blurry shots, multi-photo collages, handwriting, and slanted shooting angles.

A while ago I took this photo at Chengdu Software Park to challenge Kimi. Question: Can you guess where in Chengdu this is? Correct: Tianfu Fourth Street in the Chengdu High-Tech Zone. The answer is so accurate it's disarming. "This area indeed has employees working overtime at night." Hahaha, the ByteDance folks are in tears.

Question: Predict BYD's stock price tomorrow. Answer: 280 yuan ≤ BYD's closing price tomorrow ≤ 290 yuan. Kimi gave no specific number, only a range. Personally, I'd put the odds of that range being right at about 90%, corresponding to a price move of roughly -1.7% to +1.7%. Even without a concrete number, the reasoning is worth a look: "These moving averages are close to the current price, indicating that the stock price may stabilize in the short term." "The current stock price is slightly above the moving average, which may be a positive signal." "If market conditions remain stable and there is no significant negative news, BYD's closing price tomorrow may be between 280 and 290 yuan. But remember, this is only an estimate based on current information; actual results may vary." To be clear: the above is for AI testing purposes only and is not investment advice.

Question: Carefully study this image and help me write a prompt for generating it. Using AI to understand AI... you really are something, hahaha. Complex charts can be recognized too.
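As an aside, the moving-average heuristic Kimi cited for the BYD question can be sketched in a few lines of Python. The prices below are made up for illustration; this is a toy version of the signal, not Kimi's actual method:

```python
def sma(closes, window):
    """Simple moving average over the last `window` closing prices."""
    return sum(closes[-window:]) / window

# Hypothetical closing prices in yuan; the last entry is today's close.
closes = [275.0, 278.5, 281.0, 283.2, 284.6]

ma5 = sma(closes, 5)    # 5-day moving average
today = closes[-1]

# In this heuristic, a close slightly above the moving average
# reads as a mildly positive signal; a close below it, the opposite.
signal = "mildly positive" if today > ma5 else "neutral/negative"
print(f"close={today}, MA5={ma5:.2f}, signal={signal}")
```

This mirrors the quoted reasoning line "the current stock price is slightly above the moving average, which may be a positive signal."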
For example, this chart of AI models' performance on a Nobel Prize dataset. Question: What does this chart say? Kimi concludes: GPT-4 is best at distinguishing the originality of Nobel Prize papers from other papers, while the Mixtral model does better on the correlation between originality scores and citation counts. No more dread when reading foreign papers.

Several "computer use" products have appeared recently, so let's test whether Kimi can recognize web pages, including its own. Question: What is this? It accurately identified the Kimi website, which provides intelligent-assistant services, complete with an input box, quick options, topic recommendations, and other features. Follow-up question: How does one use the "Kimi Visual Thinking Edition"? This answer surprised me: Kimi actually tried to visit kimi.moonshot.cn on its own to answer the question. I suspect it won't be long before Kimi launches its own computer-use product; with visual recognition this good, it would be a waste not to.

After this full round of testing, my overall impression of the K1 model is: in fields with unique answers, such as physics and chemistry, K1 reasons logically and gets the questions right; in rich, open-ended everyday scenarios, it can reason deeply. Truly, every pixel is thinking deeply. Better still, it displays the full chain of thought, so users see not just the result but the process.

K1's strength comes from a technical breakthrough. Traditional visual reasoning pipelines usually rely on OCR or another vision model to first convert the image into text and then reason over that text, a process that inevitably loses information. K1, by contrast, is an end-to-end visual reasoning model.
It first obtains a base model through pre-training, then applies reinforcement learning and post-training on top of it, seamlessly integrating visual recognition with reasoning. This avoids information loss while also strengthening reasoning ability.

Kimi started with productivity applications, expanded into life and entertainment scenarios this year, and now takes the lead in learning scenarios.

Experience path:
1. App: type the @ symbol in the dialogue box and select the Kimi Visual Thinking Edition.
2. PC: on the official website's sidebar, click the student wearing glasses.

It has to be said: Kimi, this all-rounder, keeps getting more powerful.
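As a closing footnote, the difference between the traditional OCR pipeline and an end-to-end model described above can be sketched conceptually. Every function here is a stub written for illustration, not Kimi's real API:

```python
# Conceptual sketch only; all functions below are placeholder stubs.

def run_ocr(image):
    # Stub: pretend OCR extracts only the plain text,
    # discarding layout, diagrams, and handwriting nuance.
    return image["text"]

def reason_over_text(text):
    # Stub: reasoning sees nothing but the OCR output.
    return f"answer derived from: {text}"

def vision_language_model(image, question):
    # Stub: an end-to-end model consumes the full image
    # (text, layout, figures) in one pass, losing nothing up front.
    return f"answer using text + layout + figures for: {question}"

image = {"text": "x + 1 = 3", "layout": "two-column", "figure": "triangle"}

# Traditional pipeline: image -> text -> reasoning (lossy hand-off).
lossy = reason_over_text(run_ocr(image))

# End-to-end: the model reasons over the raw image directly.
lossless = vision_language_model(image, "solve for x")
```

The point of the sketch is the hand-off: in the pipeline version, anything `run_ocr` drops is gone before reasoning starts, whereas the end-to-end function never discards the image.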