The white paper is worth a read. A few takeaways:

1. They don't talk about how they get to 10M token context.
2. The 10M context ability wipes out most RAG stack complexity immediately. (I imagine caching abilities are going to be important for a lot of long-context chat features now, though.)
3. This is going to make things much, much simpler for a lot of use cases.
4. They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting.
5. It seems like 1.5 Ultra is going to be highly capable. They are running up against very high scores on many tests, and took a minute to call out some tests where they scored badly as mostly returning false negatives.

Upshot: 1.5 Pro looks like it should set the bar for a bunch of workflow tasks, if we can ever get our hands on it. I've found 1.0 Ultra to be very capable, if a bit slow. Open models downstream should see a significant uptick in quality using it, which is great. Time to dust out my coding test again, I think, which is: "here is a tarball of a repository. …"

I really want to know how they're getting to 10M context, though. There are some intriguing clues in their results that this isn't just a single ultra-long vector. For instance, their audio and video "needle" tests, which just involve inserting an image that says "the magic word is: xxx", or an audio clip that says the same thing, show perfect recall across up to 10M tokens. I'd speculate that this means there is some sort of compression going on: a full video frame with text on it is going to use a lot more tokens than the text needle does. (A toy version of this needle protocol is sketched at the end of this comment.)

Their in-context long-sequence understanding "benchmark" is pretty interesting. There's a language called Kalamang with only about 200 native speakers left. There's a set of grammar books for this language that adds up to ~250K tokens. They set up a test of in-context learning capabilities at long context: they asked three long-context models (GPT-4 Turbo, Claude 2.1, Gemini 1.5) to perform various Kalamang -> English and English -> Kalamang translation tasks. These are done either 0-shot (no prior training data for kgv in the models), half-book (half of the kgv grammar/wordlists - 125k tokens - fed into the model as part of the prompt), or full-book (the whole 250k tokens fed into the model). Finally, they had human raters check these translations. This is a really neat setup: it tests for various things (e.g. did the model really "learn" anything from these massive grammar books?) beyond just synthetic memorize-this-phrase-and-regurgitate-it-later tests. It'd be great to make this and other reasoning-at-long-context benchmarks a standard affair for evaluating context extension. (A sketch of how the three conditions might be assembled is also at the end of this comment.)

I can't tell which of the many context-extension methods (PI, E2 LLM, PoSE, ReRoPE, SelfExtend, ABF, NTK-Aware ABF, NTK-by-parts, Giraffe, YaRN, Entropy ABF, Dynamic YaRN, Dynamic NTK ABF, CoCA, ALiBi, FIRE, T5 Rel-Pos, NoPE, etc etc) is really SoTA, since they all use different benchmarks, meaningless benchmarks, or such drastically different methodologies that there's no fair comparison.
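For what it's worth, here's a minimal sketch of how differently two of those methods modify the same RoPE machinery: Position Interpolation squeezes positions into the trained range, while NTK-aware scaling stretches the frequency base instead. The helper names are mine and the formulas follow the commonly cited PI and NTK-aware write-ups, so treat this as an illustration rather than any particular library's code.

```python
# A toy sketch, assuming scale = target_len / train_len.
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Vanilla RoPE: angle[m, i] = m * base^(-2i/dim)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

def pi_angles(positions, dim, scale, base=10000.0):
    """Position Interpolation: squeeze positions into the trained range."""
    return rope_angles(positions / scale, dim, base)

def ntk_aware_angles(positions, dim, scale, base=10000.0):
    """NTK-aware scaling: raise the base so low frequencies stretch
    while high frequencies stay nearly intact."""
    return rope_angles(positions, dim, base * scale ** (dim / (dim - 2)))

# Extending a 4k-trained model to a 16k window (scale = 4):
positions = np.arange(16_384)
pi = pi_angles(positions, dim=128, scale=4.0)
ntk = ntk_aware_angles(positions, dim=128, scale=4.0)
```

Evaluating both variants on something like the Kalamang task, rather than on long-range perplexity alone, is exactly the kind of apples-to-apples comparison that's currently missing.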
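And here's the promised toy version of the needle protocol. The filler text, needle wording, and scoring comment are all placeholders I made up; the audio and video variants in the paper swap the needle line for a clip or frame carrying the same sentence.

```python
# Rough sketch of the text version of a needle-recall test.
def make_haystack(filler_sentences, needle, depth):
    """Bury the needle a fraction `depth` of the way through the filler."""
    idx = int(len(filler_sentences) * depth)
    return " ".join(filler_sentences[:idx] + [needle] + filler_sentences[idx:])

needle = 'The magic word is: "tangerine".'               # hypothetical secret
filler = ["The sky was clear that morning."] * 100_000   # pad toward the target length
haystack = make_haystack(filler, needle, depth=0.5)
question = "What is the magic word mentioned in the text above?"
# score: does the model answer "tangerine"? Repeat across depths and lengths.
```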
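Finally, a hypothetical harness for the three Kalamang conditions described above. The file name, prompt wording, and `build_prompt` helper are my own placeholders, not the paper's actual setup (which also splits the book by tokens rather than by characters).

```python
# Sketch of the 0-shot / half-book / full-book prompt conditions.
from pathlib import Path

def build_prompt(source_text, direction, book=None):
    """Assemble one translation request, optionally with kgv reference material."""
    parts = []
    if book is not None:  # half-book or full-book condition
        parts.append("Kalamang (kgv) grammar and wordlist:\n" + book + "\n\n")
    parts.append(f"Translate from {direction}:\n{source_text}")
    return "".join(parts)

book = Path("kalamang_grammar.txt").read_text()  # ~250K tokens in the paper
conditions = {
    "0-shot": None,                        # no kgv material at all
    "half-book": book[: len(book) // 2],   # roughly the 125k-token condition
    "full-book": book,                     # the whole ~250k tokens
}
for name, ctx in conditions.items():
    prompt = build_prompt("hello, how are you?", "English to Kalamang", ctx)
    # output = model.generate(prompt)      # then judged by human raters
```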