
Data is the new oil, as they say, and perhaps that makes Harvard University the new Exxon. The school announced Thursday the launch of a dataset containing nearly one million public domain books that can be used for training AI models. The project, created under the newly formed Institutional Data Initiative, has received funding from both Microsoft and OpenAI; the dataset comprises books scanned by Google Books that are old enough for their copyright protection to have expired.

Wired, in a piece on the new project, says the dataset includes a wide variety of books, with “classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries.” As a general rule, copyright protection lasts for the lifetime of the author plus an additional 70 years.
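As a rough illustration of that rule (a simplified sketch only: actual copyright terms vary by country, and in the US older works are often governed by publication date rather than the author's death), public domain status under "life plus 70" can be estimated like this:

```python
import datetime

def is_public_domain(author_death_year: int, term_years: int = 70) -> bool:
    """Estimate public domain status under the 'life of the author
    plus 70 years' rule. Illustrative only: real terms vary by
    jurisdiction and, in the US, often depend on publication date
    rather than the author's death year."""
    return datetime.date.today().year > author_death_year + term_years

# Dickens died in 1870, so his works cleared the life-plus-70 bar long ago.
print(is_public_domain(1870))  # True
```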

Foundation language models, such as the ones that power ChatGPT, aim to convincingly imitate a real human and require an immense amount of high-quality text for their training: generally, the more information they ingest, the better they perform at imitating humans and serving up knowledge. But that thirst for data has caused problems, as the likes of OpenAI have hit walls on how much new information they can find, at least without stealing it.

Researchers at the University of Sydney Nano Institute have made a significant advance in the field of molecular robotics by developing custom-designed and programmable nanostructures using DNA origami.

This approach has potential across a range of applications, from targeted drug delivery systems to responsive materials and energy-efficient optical signal processing. The method is called ‘DNA origami’ because it harnesses the natural folding properties of DNA, the building block of life, to create new and useful biological structures.

As a proof of concept, the researchers made more than 50 nanoscale objects, including a ‘nano-dinosaur’, a ‘dancing robot’ and a mini-Australia that is 150 nanometres wide, about a thousand times narrower than a human hair.

Originally published on Towards AI.

In the evolving landscape of artificial intelligence, data remains the fuel that powers innovation. But what happens when acquiring real-world data becomes challenging, expensive, or even impossible?

Enter synthetic data generation: a groundbreaking technique that leverages language models to create high-quality, realistic datasets. Consider training a language model on medical records without breaching privacy laws, developing a customer interaction model without access to private conversation logs, or designing autonomous driving systems where collecting data on rare edge cases is nearly impossible. Synthetic data bridges gaps in data availability while maintaining the realism needed for effective AI training.
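As a minimal sketch of the idea (assuming the OpenAI Python SDK; the prompt, model name, and record schema here are illustrative choices, not a prescribed recipe), a language model can be asked to emit realistic but entirely fictional training records:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Generate {n} fictional customer-support exchanges as a JSON array. "
    "Each item must have 'question' and 'answer' fields. "
    "Invent every name and detail; include no real personal data."
)

def synthesize_examples(n: int = 5, model: str = "gpt-4o-mini") -> list[dict]:
    """Ask the model for synthetic examples and parse its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(n=n)}],
    )
    # NOTE: a production pipeline would validate and repair the JSON
    # the model returns; models sometimes wrap it in extra text.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    for record in synthesize_examples(3):
        print(record["question"], "->", record["answer"])
```

Because every record is invented by the model rather than drawn from real users, the resulting dataset sidesteps the privacy constraints described above, at the cost of needing extra checks for realism and factual drift.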

Robots can convince other robots to do something.



Erbai, a robot built by a Chinese start-up, was filmed in August, in footage released only recently, persuading other robots to leave an exhibition hall and “go home”.




