
Data is the new oil, as they say, and perhaps that makes Harvard University the new Exxon. The school announced Thursday the launch of a dataset containing nearly one million public-domain books that can be used for training AI models. The project, created under the newly formed Institutional Data Initiative, has received funding from both Microsoft and OpenAI; it contains books scanned by Google Books that are old enough that their copyright protection has expired.

In a piece on the new project, Wired says the dataset includes a wide variety of books, with “classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries.” As a general rule, copyright protections last for the lifetime of the author plus an additional 70 years.
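As a back-of-the-envelope illustration of that rule, here is a minimal Python sketch. It assumes the life-plus-70 convention quoted above and the usual end-of-calendar-year rollover; actual terms vary by jurisdiction, and many older works (especially in the U.S.) follow publication-based terms instead, so treat this as arithmetic, not legal advice.

```python
from datetime import date
from typing import Optional

LIFE_PLUS_YEARS = 70  # the "life of the author plus 70 years" rule cited above

def public_domain_year(author_death_year: int) -> int:
    """Year a work enters the public domain under a life-plus-70 term.

    Terms conventionally run to the end of the calendar year, so the work
    becomes free on 1 January of the year after the 70 years elapse.
    """
    return author_death_year + LIFE_PLUS_YEARS + 1

def is_public_domain(author_death_year: int, today: Optional[date] = None) -> bool:
    """True once the life-plus-70 term has run out."""
    today = today or date.today()
    return today.year >= public_domain_year(author_death_year)

# Charles Dickens died in 1870, so his works cleared this bar long ago.
print(public_domain_year(1870))   # 1941
print(is_public_domain(1870))     # True
```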

Foundation language models, like those behind ChatGPT, that aim to convincingly imitate a real human require an immense amount of high-quality text for their training; generally, the more information they ingest, the better the models perform at imitating humans and serving up knowledge. But that thirst for data has caused problems, as the likes of OpenAI have hit walls on how much new information they can find, at least without stealing it.

Researchers at the University of Sydney Nano Institute have made a significant advance in the field of molecular robotics by developing custom-designed and programmable nanostructures using DNA origami.

This innovative approach has potential across a range of applications, from targeted drug delivery systems to responsive materials and energy-efficient optical signal processing. The method is called ‘DNA origami’ because it exploits the natural folding power of DNA, the building block of life, to create new and useful biological structures.

As a proof-of-concept, the researchers made more than 50 nanoscale objects, including a ‘nano-dinosaur’, a ‘dancing robot’ and a mini-Australia that is 150 nanometres wide, a thousand times narrower than a human hair.

Originally published on Towards AI.

In the evolving landscape of artificial intelligence, data remains the fuel that powers innovation. But what happens when acquiring real-world data becomes challenging, expensive, or even impossible?

Enter synthetic data generation — a groundbreaking technique that leverages language models to create high-quality, realistic datasets. Consider training a language model on medical records without breaching privacy laws, or developing a customer interaction model without access to private conversation logs, or designing autonomous driving systems where collecting data on rare edge cases is nearly impossible. Synthetic data bridges gaps in data availability while maintaining the realism needed for effective AI training.
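As a concrete illustration of the idea, here is a minimal sketch of one common pattern: prompting a hosted language model to invent privacy-safe training examples. It assumes the openai Python SDK (v1.x) and an OPENAI_API_KEY in the environment; the model name, prompt, and field names are placeholders, not anything prescribed by the article.

```python
import json
from openai import OpenAI  # assumes the openai>=1.0 Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt: the schema and field names are placeholders.
PROMPT = """Generate {n} fictional customer-support exchanges.
Return a JSON object with one key, "examples", whose value is a list of
objects with the keys "customer_message", "agent_reply", and "intent".
Do not include any real names, emails, or account numbers."""

def generate_synthetic_dialogues(n: int = 5, model: str = "gpt-4o-mini") -> list[dict]:
    """Ask a language model to invent realistic but fictional training examples."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(n=n)}],
        response_format={"type": "json_object"},  # nudges the model toward valid JSON
        temperature=1.0,  # higher temperature -> more varied synthetic samples
    )
    return json.loads(response.choices[0].message.content)["examples"]

if __name__ == "__main__":
    for row in generate_synthetic_dialogues(3):
        print(row["intent"], "->", row["customer_message"][:60])
```

In practice, pipelines like this add validation and deduplication steps on top, since raw model output can drift from the requested schema or repeat itself.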

The latest AI News. Learn about LLMs, Gen AI and get ready for the rollout of AGI. Wes Roth covers the latest happenings in the world of OpenAI, Google, Anthropic, NVIDIA and Open Source AI.

Southeast Asia’s emerging economies are vying to become a top AI hub — a race that has them both coming together and, quietly, battling among themselves.

The Association of Southeast Asian Nations (ASEAN), made up of 10 countries with a combined population of 672 million people, already has some advantages when compared to Europe or the U.S.

With over 200 million people aged 15 to 34, the region’s youthful and largely tech-savvy population makes it adaptable to future technological advances. That, combined with government support for accelerating AI in the region, could deliver substantial rewards for local workers.