Gary Marcus Says Effective LLMs Can Be Trained on Open-Source Data. Here Is Why He Is Wrong.


Open-Source Data for LLMs

Gary Marcus advocates for training LLMs on open-source data, but is this the full solution for effective AI?

Wikipedia's Misleading Data

Wikipedia, while vast, can contain inaccuracies, leading LLMs trained on such data to potentially spread misinformation.

The Risk of Misinformation

Relying on Wikipedia risks embedding errors into LLMs, undermining their reliability and credibility.

Incomplete Knowledge Base

Not all topics are comprehensively covered on Wikipedia, presenting a challenge for LLMs to develop a well-rounded understanding.

Varying Article Quality

The quality of Wikipedia articles varies significantly, with some subjects suffering from biases or lack of expert review.

The Need for Diverse Sources

Training LLMs effectively requires a diverse set of high-quality, vetted sources beyond just open-source platforms.
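
To make the idea of a blended, vetted corpus concrete, here is a minimal Python sketch of weighted source sampling. The source names and proportions are invented for illustration; they do not come from Marcus or from any real training pipeline.

import random

# Illustrative weights for blending several vetted sources into one training
# mix instead of drawing from a single open-source corpus. All source names
# and proportions below are hypothetical placeholders, not a real recipe.
SOURCE_WEIGHTS = {
    "wikipedia_dump": 0.25,            # broad coverage, uneven quality
    "peer_reviewed_abstracts": 0.30,   # expert-reviewed material
    "licensed_news_archive": 0.25,     # edited, dated reporting
    "curated_reference_books": 0.20,   # vetted long-form references
}

def sample_source(weights: dict[str, float]) -> str:
    """Pick the source for the next training document, proportional to its weight."""
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]

if __name__ == "__main__":
    # Draw ten documents' worth of source assignments to show the blend.
    print([sample_source(SOURCE_WEIGHTS) for _ in range(10)])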

Verifying Truthfulness

LLMs need mechanisms to verify the truthfulness of data, a challenge when relying on user-generated content.
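
As one illustration of what such a mechanism might look like, below is a minimal Python sketch of a corroboration heuristic that keeps a claim only when several independent references back it. This is not a method described in the article; the function, the reference index, and the example claims are hypothetical, and real fact verification is far harder than counting sources.

# Minimal sketch: a document's claims pass only if each appears in at least
# `min_sources` independent references. Index and claims are invented examples.

def is_corroborated(claims: list[str],
                    reference_index: dict[str, set[str]],
                    min_sources: int = 2) -> bool:
    """Return True if every claim is backed by at least `min_sources` references."""
    return all(len(reference_index.get(claim, set())) >= min_sources
               for claim in claims)

# Toy index mapping claims to the sources that state them.
index = {
    "water boils at 100 C at sea level": {"textbook_a", "encyclopedia_b"},
    "the moon is made of cheese": {"satire_site"},
}

print(is_corroborated(["water boils at 100 C at sea level"], index))  # True
print(is_corroborated(["the moon is made of cheese"], index))         # False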

Ethical AI Development

Ethical AI development demands careful consideration of data sources to prevent the propagation of falsehoods.

Beyond Wikipedia

Looking beyond Wikipedia and other open-source platforms, and incorporating a wider variety of data, can lead to more robust and effective LLMs.

Conclusion

Marcus's position underscores a crucial debate in AI: how to responsibly source data to build LLMs that are both effective and trustworthy.


Produced by: Analytics Drift. Designed by: Prathamesh.