A Trillion Pages and a $5K Summer
In 1998 I built a thing called Spark Online. The host died, the site mostly vanished, and for a decade I doubted whether any of it had been real. Then one day I typed the URL into the Wayback Machine and there it was. My twenties, preserved by a nonprofit in San Francisco that decided the web was worth remembering on purpose.
This article is part of the Vancouver AI Meetup #29 recap package: the event archive, the speaker-focused recaps, and the full Michelle Diamond photo gallery.

So when Andrea Mills, Executive Director of Internet Archive Canada, agreed to keynote Vancouver AI Meetup #29 on our “Building the AI Commons” night, I was genuinely fired up. While everyone else races to lock knowledge behind paywalls, Andrea and her team are doing the opposite. They’re archiving the open web and building public infrastructure with data that actually belongs to the public.
Here’s what she told the room, and the fellowship she slipped in at the end like a true Canadian.

Thirty years, 200 petabytes, a trillion pages
Brewster Kahle founded the Internet Archive in 1996, right at the birth of personal web publishing, when for the first time you didn’t have to be a publisher or a big institution to put something online. That moment mattered, Andrea said, because the early web was distributed. Anyone could publish. But there was a problem: the web forgets. It’s ephemeral by design.
A library is the opposite. A library is memory by design. So the Archive set out to be the library of the web. Today it’s over 200 petabytes. And last fall they crossed a number that’s almost impossible to picture:
“We celebrated 1 trillion web pages in the Wayback Machine.”
That’s the intellectual output of millions of people, kept open for researchers, the curious, and anyone who needs a big dataset. And it’s normal operations, not crisis: about 25% of links saved online eventually move or go offline. The Archive is the reason an archived version still exists.
Their preservation model is geographically distributed now: Internet Archive Canada holds a good chunk of the global archive, with Europe, Switzerland, and the UK coming online. Because, as Andrea put it:
“A cloud is just someone else’s computer, and you better really, really trust that person.”

AI changed the threat model
In 2022, ChatGPT shipped and Midjourney took off, and Andrea said it changed the game for the Archive in a specific way. Open archives started getting scraped relentlessly.
“In terms of functionality, it’s a DDoS attack by another name. The ravenous appetite of AI is a real thing.”
The Archive has the resources to bolster its defenses. Smaller archives don’t. And the deeper problem isn’t bandwidth, it’s enclosure. Andrea described libraries and archives that digitized collections, kept them online for years, and then had to take them down because a publisher signed an exclusive deal with an AI company. In some cases the publisher no longer even owns the original. The library took care of it. And now the only digitized copy is the one being pulled.
We’ve seen this movie. Google Books offered free digitization to academic libraries and then those holdings got locked up. If you want the newspaper clipping of your grandparents getting married, it’s on newspapers.com behind a paywall. The shredder, as Andrea called it, has been turned back on for the AI age.
I summarized it back to her on stage and she confirmed it (with very Canadian hedging): you archived their stuff, they lost their copy, they did a deal to let an AI company train on it, then they made you delete yours.
The controversial part: slop belongs in the archive
Then Andrea said the thing that gets her into arguments with other librarians, and I loved it.
She asked an AI tool to generate “what AI slop looks like” and put the result in her own slide. Her take: AI slop, low-quality as it is, should still be archived in some capacity. People are making real life decisions based on this content. It’s the record of the exact moment we’re living through. Excluding it from public archives means erasing the evidence of now.
Which is a fascinating counterpoint to Rachel Horst’s talk an hour earlier, where slop was the enemy of meaning. Both can be true. As a creative practice, fight the slop. As a historical record, keep it. The commons has to hold both.
“We don’t have the concept of done.”
Andrea Mills
That’s the Archive’s whole posture. They go back to 19-year-old books and re-describe them with new tools, including machine learning, which makes them findable in ways they never were. Memory isn’t a snapshot. It’s maintenance.
Public AI, and a sandwich
Andrea’s framing for all of this is public AI: the overlap between thirty years of web archiving and what’s possible now if we build with public data, shared funding, and real rules of engagement instead of scrape-and-lock.
Her metaphor was a sandwich. If I have a whole sandwich and you have none, I should give you half. It’ll be slower to get off the ground than the corporate version. It’ll be worth it. And the question she left hanging:
“Do we really trust a corporation with our collective data more than we trust a nonprofit or a public utility?”
Your homework: the .ca kiosk at VPL
Concrete assignment from Andrea. There’s a new way to experience the .ca web archive, built by Internet Archive Europe and Switzerland and now installed here in Vancouver. It’s on the fourth floor of the Vancouver Public Library central branch downtown, open from the end of this week.
Big screen, a gamepad with a microphone, no keyboard. You hit the button, you talk, and it’s semantically organized by subject. Go find your own old website and giggle. This is what public infrastructure feels like when someone actually designs it for humans.

The AI Builders Fellowship
Here’s the part she slid in at the end.
Internet Archive Canada and BC + AI are running a $5,000 summer fellowship to put one sharp builder to work on the real archive: the AI Builders Fellowship.
“We’re gonna be working this summer, along with Kris and the AI community, on what we’re calling the AI Builders Fellowship, that you’re gonna hear more about at subsequent events.”
Andrea Mills, announcing the fellowship
What the fellow builds: real tools, data visualizations, dashboards, storytelling, useful functional stuff on top of the archive. Not scrapers. As Andrea said, she wants the servers stressed because people are using the collection for research and creativity, not just crawling it.
“I want the servers to be stressed because they’re being used for stuff like that, rather than just getting scraped again.”
Andrea Mills
I’ll say the obvious thing I said on stage: people could offer you $5,000 to work on their bullshit dataset. This is the Internet Archive. It’s the coolest, most legit dataset in the world, and you’d be making it more usable for everyone while documenting what’s broken so their engineers can fix it. That two-way loop, you build something and the archive gets better, is the whole point.
This is the first joint build bridging archive.org and BC + AI, and it’s exactly the “Building the AI Commons” thesis made real: public memory, public tooling, public benefit.
Who this is for
If you’re a data scientist, ML engineer, designer, or data storyteller, and especially if you’re one of the young folks who were sitting in the third row wondering what to do this summer, this is the gig. Gord’s Data for Good crew (1,600 data volunteers) is a natural pool here too.
How to throw your hat in
The full program (timeline, eligibility, what to propose, how selection works) is being finalized with the Internet Archive team. Applications open the week of June 30. Right now, the move is to get on the interest list so you hear the moment it opens.
Want in? Applications open the week of June 30. Follow BC + AI and the May 27 event page for the interest-list link when it goes live.

And keep the commons alive
- Go visit the .ca kiosk at VPL, fourth floor.
- Donate to or volunteer with archive.org. Email [email protected] if you know of a local news site or collection about to go dark; they have a team that beams in to save things.
- Come to Future Proof (October 28-30), where Philippe Pasquier and Erica’s DFR experience with the Metacreation Lab ties into the archive thread.
Thank you, Andrea. You’re generous with your time, you put your money where your mouth is as a BC + AI sponsor, and you just handed our community the most legit summer project in the world. Let’s go find the right person for it.