
633: Thomas Steiner on AI in Chrome and the Web

Download MP3

Thomas Steiner from Project Fugu talks with us about AI in Chrome, the small large language model it uses, how features like this are rolled out, the ethics and concerns around sending and sharing data, on-device vs. web APIs, and ideas for use cases and ways to explore AI on the web.

Tags:

Guests

Thomas Steiner

Developer Relations Engineer at Google, focused on the Web and Project Fugu.

Time Jump Links

  • 01:12 Introducing Thomas Steiner
  • 03:03 AI, API, and Google
  • 12:03 Is this a good tool to test out small large language models?
  • 22:23 Sponsor: Jam.dev
  • 24:20 How much planning goes into a service like this at Google size?
  • 26:52 The ethics and concerns about sending and sharing data
  • 32:06 On device vs API vs web
  • 41:18 Could you build a toxicity report with AI?
  • 46:12 What kind of bandwidth and sizes of models are we working with?
  • 51:32 How is Google exploring use cases for this?
  • 57:25 What's the near future for this kind of tech?

Episode Sponsors 🧡

Transcript

[Banjo music]

MANTRA: Just Build Websites!

Dave Rupert: Hey there, Shop-o-maniacs. You're listening to another episode of the ShopTalk Show. I'm Dave--way too much coffee--Rupert, and with me is Chris--smurfin'--Coyier. Hey, Chris. How are you doing?

Chris Coyier: [buzzing sounds] I've got a big mug of coffee here, too, Dave.

Dave: I had to switch to water--

Chris: Not as much as you.

Dave: --to decaffeinate my veins, so we're good. We're good.

Chris: We're good. We're good.

Dave: We're up there.

Chris: You look good. You look bright, though.

Dave: Feeling pudgy. Yeah.

Chris: Bushy tailed. Well, this is going to be a great episode. We mentioned the fact that browsers (specifically Chrome) are experimenting and playing with putting AI right in the browser. I think that opens a lot of people's eyeballs up. You're like, "Wait. What does that mean? How does that work? What's going on there?"

I think I read one blog post about it and then was just like, rawr-rawr-rawr-rawr, and had a bunch of thoughts. If I said anything wrong, we're going to set all that straight today because we're probably going to spend the entire episode talking about basically AI in Chrome, and we have the perfect person to do that with, Tom Steiner. Hey, Tom, how ya doin'?

Thomas Steiner: Doing good. Doing good.

Chris: Good. Thanks for reaching out for being on the show because this is a great topic to talk about for all sorts of reasons. What is your role at Google and involvement in all of this?

00:01:31

Thomas: Yeah, so I work in Chrome in the dev rel department, developer relations. And particularly, I am busy with anything about AI, WebAssembly, and some of the APIs that we call Project Fugu.

Chris: Fugu! Web!

Dave: Oh, Fugu. Woo...

Thomas: Yeah.

Chris: [Laughter]

Thomas: I was on the show a couple of years ago, I guess.

Dave: Yeah.

Thomas: It was a good episode.

Chris: Yeah.

Dave: I feel like your name is always at the top of the blog posts in my list of tabs of stuff I need to learn. Thank you so much for your contributions in education and everything.

Chris: Yeah. Absolutely. He was thanking Tom, not me. I'm Chris speaking here.

Yeah, high five for Fugu. I think that's really cool. It's like behind the scenes making sure that the Web doesn't lose. That's how I think of it.

The Web needs to be able to make sure that it can do stuff that native platforms can do. And if nobody is fighting for that, the Web will deteriorate to a point that less people choose it as a platform. Thanks to you and everybody else that makes that type of stuff a priority.

Thomas: Yeah. Well, thanks to Google for paying me to do that.

[Laughter]

Chris: Amen. They do have a bunch of people on staff. Sometimes you're like, "How do you--?" The way that they make money for Google seems so abstract that I'm like, "Actually, don't think about it too hard. Just keep paying them, please. That'd be great."

Thomas: Just stay under the radar. Keep having fun.

Chris: Yeah. Okay, so I'll set the stage with what happened last time. I think, like I said, I read one blog post. I was like, "Oh, interesting." I saw murmurs about it. You know you hear on social media, like, "Chrome is putting AI in the browser. Awesome."

You are like, "Really?! That's kind of cool," because you can imagine that normally when you think of AI--at least LLMs, the text-based ones where, for example, you go to openai.com, and you log into their little ChatGPT 4 interface - or whatever it is - and there's a big text area at the bottom--you type some crap, and it gives you usually a pretty half decent answer, I'll say. That's kind of where we're at with AI.

But they also have APIs, and so you can just get an API key, and then you can send in that prompt (as they're often called) and get a response at the API level, meaning that you can integrate it into your app. That's neat and all but, number one, it costs some money. Number two, it has to hit the network. Number three, God knows what happens with that data.

There are all these things that happen, but we're used to APIs. Developers love APIs. Every damn thing we build in the world is using APIs. That's fine. And if that's the way it is, cool.

But then all of a sudden you're like, "Hmm.... What if it was baked into the browser?" Now you're not making a network request. You're not paying anybody, maybe. Theoretically private (as much as we can trust that).

There are all these dominos that start getting knocked over, and it really is kind of an eye-opening thing, like, "What if?" Particularly the speed, I think, is extra interesting, if you're basically getting instant answers from stuff. Is that what we're talking about? Is that happening?

00:04:50

Thomas: That's sort of what's happening, yeah. There's a bunch of absolutely really cool use cases that you can realize once you have an AI built into the browser, ranging from anything offline to more privacy-aware prompting to, yeah, just in general.

Not having to pay money for doing stuff is always a big winner, and all of this right in the browser without any extra downloads, so it's kind of cool. Very surprising.

Chris: What's the 101 use case then? I will also mention this URL at the top that I didn't know existed, but it's just chrome.dev, which is a banger URL that for some reason is just like ten kind of random demos.

Dave: Ten hyperlinks. [Laughter]

[Laughter]

Chris: It's the most random page.

Dave: It's the perfect website. Don't mess with it. It's the perfect website.

Chris: Yeah, never change it. It's amazing.

Thomas: [Laughter]

Chris: Chrome.dev, just go there. There's just a couple of links to some demos on there. One of them is called the Prompt AI Playground. You had to send around kind of a Google Doc explaining how to do this because, at the moment, and I'm sure this will change over time, it's a little hoop-jumpy to get this stuff to work at all.

Thomas: It's a total pain, to be honest. Yes.

Chris: Yeah. [Laughter] So, if you're looking forward to anybody just downloading Chrome Canary and then immediately playing with this, it's not that bad. If you follow the doc, you'll get it. But yeah, it takes a minute. Even then, not all of these worked for me because some of them require, like, you get invited to this program or something.

But this one works for me. The Prompt AI Playground, you can work and, guess what, it's a text area. You type some crap into the text area and hit return, and it spits you some stuff out back.

Here, let me do this. "How tall is Tom Cruise?" Return. "Tom Cruise is 1.72 meters or 5'7" tall." Look at that! That didn't go to the network or anything, and now I know how tall Tom Cruise is. Holy crap. How does that work?
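Chris's playground exchange maps onto a session-style API. Here is a minimal sketch of driving it; the entry point and method names (`createTextSession`, `prompt`, `destroy`) are assumptions based on early Canary builds--the real shape has already changed at least once--so the namespace is passed in rather than hard-coded as `window.ai`:

```javascript
// Sketch of asking a built-in prompt-style API a question. The names
// createTextSession / prompt / destroy are assumptions from early
// experiments, not a stable reference, so the namespace is injected.
async function ask(aiNamespace, question) {
  const session = await aiNamespace.createTextSession();
  try {
    // Resolves with the model's full text response.
    return await session.prompt(question);
  } finally {
    session.destroy?.(); // free the on-device session if supported
  }
}
```

Because the namespace is injected, the same helper works against a stub during development and against whatever real namespace a given Canary build exposes.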

00:06:50

Thomas: Well, you probably want to fact-check that because it might be a hallucination. Who knows?

Chris: Sure. Sure, sure, sure.

Thomas: Yeah, let me get some points across here. First, you said something along the lines of you need to be invited.

We do have what we call the early preview program. The idea there is it's a sign-up program where you just fill out a form. You tell us a little bit of what you want to do with this. Some stats, like basic stuff. Then you're a part of the program. There is no approval process, no anything, so it's not invite-only or anything.

Chris: Oh, really?! I was sitting around waiting for it. I applied and then I was like, "Uh... This is taking them a while, I guess." But just by virtue of filling out that form, I'm already in?

Thomas: [Laughter] You are. The reason this is--

Chris: Oh, that's good.

Thomas: --coming across as invite-only is that it's a really scrappy thing right now. You essentially become part of a Google group. From there on, you will receive messages sent to that group. But there is no backfilling, so you won't see any of the messages that were sent before you joined. You need to manually go to the group and look at the archive. Then you can see what's happening there.

Yeah, I guess that's really just how scrappy a program it is. But yeah, it is something that we are looking to improve, so we can backfill you with all the messages that you might have missed.

Once you are on the program, you get access to a bunch of documents where you can see what we call the API reference of all the various APIs that we've launched. You mentioned the prompt API, but there's also the summarization API. There's a writer and a rewriter API. There's a language detection API. There are a bunch more that we have in the pipeline, like a translate API or translation API.

Essentially, we have a big split between the APIs. The first one is the task-based APIs. This is anything where you have a very clearly defined task like a translation, like a detection of the language a text is written in; writing and rewriting where you have something that exists or needs to be written, and then you refine it.

Chris: Really? Okay, so that's totally different than prompt.

Thomas: That's totally different than prompt, yes, because the prompt API is completely free form. You can tell it to write a poem. You can tell it to summarize. You can tell it to translate. It's, yeah, just completely free-form playground.

00:09:10

Chris: Oh, I see. If it's a Venn diagram, prompt is just free-form. It will happily rewrite something for you, right?

Thomas: Exactly.

Chris: But it's too broad. It's not specifically for that task. Okay.

Thomas: Exactly. The reason we do that is we use the prompt API sort of as a playground, to see what people even use this for. If we see, "Oh, there are a lot of people who use this for task X," then there might be some good motivation for actually creating an API that is just for doing X in the browser.

Chris: Hmm...

Thomas: The reason is, yeah, if you know what the person is up to, what the developer wants to do, you can fine-tune a model accordingly. The problem with the existing Gemini Nano model--now, version 1--you probably will have noticed (once you start doing anything of substance), the quality, yeah, leaves a little to be desired. That's definitely not the quality that we aim for, but it's V1, so that's what we have out there right now.

Chris: Yeah.

Thomas: But once you know what people want to do, you can refine it.

Dave: Is that kind of like the small language model problem? Is that what you would call Gemini Nano is small language model, not the large language model?

Thomas: It's an LLM, but it's paradoxically a small, large language model.

Dave: A small, large language model.

Thomas: [Laughter] It's kind of stupid, yeah.

Dave: So, that's like the number of whatever things it was trained on - or whatever - is smaller? Is that kind of the idea?

Thomas: It's the number of parameters. Essentially, yeah.

Dave: Parameters? Okay.

Thomas: ...what it was trained on, yeah.

Dave: Okay. Yeah.

Chris: You can tell it's dumber. You just can. It's cool that it's there. But yeah, you'd have to be really careful using it. But like you said, it's only because it's so broad right now. It's refined for nothing.

I just wrote, "What is Dave Rupert famous for?" and it said--

Dave: Yes, he is a plant.

[Laughter]

Chris: Dave Rupert is best known for his role as Rupert Van Satan.

Dave: See!

Chris: Sorry, Dave. Yeah.

Dave: People miss that whole... where I was the demon lord Satan. You know?

Chris: [Laughter]

Dave: I was Dracula for, like, 12 years of my life and no one talks about it.

Chris: But it's really broad, so why would it know a whole bunch about Dave? It probably just wasn't trained on a bunch of Dave's stuff - or whatever - missed his Wikipedia entry - or whatever it is. But it's also very broad.

I also wrote in, "Give me the CSS to make a rainbow button." For some reason, my brain always goes to, "Make me a rainbow button." It's like my go-to checker on this thing.

It did okay. It made CSS for a rainbow button. As far as a small language model goes, at least it's got stuff like that in there, which feels a little bit impressive to me. Of course, it generates it in 0.0001 seconds - or whatever it is, which is also impressive.

00:12:01

Dave: I think this is... I don't know. I'm pretty AI skeptic. I'd probably put myself in that camp, just in general. But I kind of love this for the experimentation. I'm not burning an entire ocean just to experiment with this. It's just like, "Okay, cool. The binary has shipped, and I can play with it. I know there's a better version, or I could create a better version or fine-tune."

I'm kind of curious how that fine-tuning kind of works in this context. But then I can experiment and be like, "Cool, there is a better - whatever - AI brain, LLM that I could hook into later. But for right now, I could experiment and see if this text summarize button is cool or not," or something like that. Right? Is that kind of what you're envisioning?

Thomas: Yeah, so definitely. The moment you come to task APIs, like summarization is a good example, actually right now under the hood, it's implemented essentially as just a system prompt on top of the prompt API. It's essentially telling the prompt API under the hood, "Hey, whatever comes next, please summarize it as whatever, bullet points or teaser or give me a headline for the following article or something." Under the hood, that's what it does. But it's fine-tuned a little bit on the particular task.
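Thomas's description of a task API as "a system prompt on top of the prompt API" can be sketched in a few lines. `makeSummarizer` and the `promptFn` signature here are illustrative, not the actual window.ai surface:

```javascript
// Hedged sketch of layering a "task API" on a raw prompt function, as
// described for summarization: prepend a fixed system prompt, then pass
// the user's text through. Names are illustrative only.
function makeSummarizer(promptFn, style = "three bullet points") {
  const systemPrompt =
    `Summarize whatever text comes next as ${style}. ` +
    "Remove filler words and keep only the key points.\n\n";
  return (text) => promptFn(systemPrompt + text);
}
```

A real fine-tuned task API goes further than this (the model itself is adapted to the task), but the control flow--fixed instruction plus user text--is the same idea.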

The reason fine-tuning is happening there is, well, obviously you get better results if you tell the model what it should look for and what it should optimize, like removing filler words and stuff. Fine-tuning is also very complex in the sense that you need to be sort of an expert if you want to do it yourself (in most cases). But most JavaScript developers are not; we tend to not be AI experts. Some are, but the general JavaScript developer is not an AI developer. We think there's a good chance we get better results by just leaving the fine-tuning to the experts delivering those task APIs and then, yeah, launching them in the browser.

Another reason for doing so is also if you have different models that ship in different browsers. That's one of the problems that you touched upon in your last episode. You can get very different results. In Chrome, we have Gemini Nano. The Edge team are starting to experiment with Phi-3-mini, which is a different model that has different behavior.

If your application works with a fine-tuned prompt that is prompt-engineered exactly to the way you know the model behaves, and you switch models, of course then, yeah, you may get very, very unexpected results. In the worst case, you get no results at all, nothing useful. In the best case, you get something that is, yeah, just as usable as before.

But with a fine-tuned model, people can, yeah, use models that are known to work for a certain task. So, if you know summarization is a task that is well studied, as a browser vendor, you can pick one of the off-the-shelf models, ship them in your browser, and be good and call it a day.

Different browsers will have different approaches to that. One browser might only be willing to ship an open-source model. Another browser might only ship proprietary models. Some more, let's say, developer-oriented browsers might allow you to actually plug in your own model via some sort of private browser preference. So, in the end, what we want to do is get some sort of agreement on the APIs, on the interfaces, and then look at the models sort of as a separate task.

When you look at the prompt API in Chrome right now, as I said, what you get is, yes, Gemini Nano. The open-source model is Gemma. That's what I was confusing it with for a second. You get Gemini Nano version 1.

We're working on exposing a new version that will get better results. Yeah, this is definitely one of the big challenges, what you do with different models in different browsers.

00:16:19

Chris: Yeah, agreement was an interesting part there. That was one of my criticisms after reading too little about this. I was like, "Oh, y'all just going to slap it on the window object, huh? Just window.ai. Okay."

I feel like, as Web developers, we need to have some part of our soul needs to be a watch guard for stuff like that. When browsers do that, you have to be like, "Oh, did you run that through the right wringer?" It sounds like you probably have, so we'll hear about all that.

But my point is that you don't want browsers individually just making calls like that. There is a standards process for a reason because it gets dangerous.

We know Apple is marketing the hell right now out of their version, which they won't even say AI. They always call it Apple Intelligence, right? They apparently don't even like the two-letter acronym: AI.

Wouldn't it stand to reason that, in Safari, if they like this and choose to be on board with it--maybe they could even be a blocker for it. I don't know. Maybe they ship something else entirely: navigator.intelligence, or something--that's the API they want.

Well, now as Web developers, we have to be like, "If navigator.intelligence, then else window.ai," and we send the same prompt to it. You're like, "Okay, that's fine. We're used to if/else statements." But that's opening the door to hell as developers again.
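Chris's if/else scenario looks like this in code. Both property names here (`navigator.intelligence` and `window.ai`) are speculative examples from the conversation, not shipped, standardized APIs:

```javascript
// Feature-detection sketch for the hypothetical where browsers expose
// built-in AI under different namespaces. Both names are speculative.
function pickAIEntryPoint(globalObj) {
  if (globalObj.navigator && "intelligence" in globalObj.navigator) {
    return globalObj.navigator.intelligence;
  }
  if ("ai" in globalObj) {
    return globalObj.ai;
  }
  return null; // no built-in AI; fall back to a server-side API
}
```

This is exactly the "door to hell" being described: every divergent namespace adds another branch, which is why agreeing on the interface matters more than agreeing on the model.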

[Laughter]

Chris: We just don't want to go there again. Standards saved us from that, from browsers just doing whatever the hell they wanted in browsers.

It turns out that's not quite the case here, right? There were proposals written up. This is the name space we should use. Getting browsers to kind of agree to that kind of thing. Can you tell me more about that?

00:18:05

Thomas: Yeah, absolutely. First, let's set the record straight here. Nothing is shipping. If you want to get to the window.something API--so the window.ai.* APIs--there's this horrible process right now where you need to flip a flag. You need to wait for some component to download. You actually need to flip two flags now.

For each individual API, you need to flip yet another flag, so it's definitely not in a "this is shipped" state. Absolutely not.

There are a ton of conversations happening in the WICG (the Web Incubator Community Group) where we have started to migrate some of those APIs. There's the translation API proposal. There is the writing assistant API proposal.

We're working on I think the prompt API, but this is a little bit further behind. So, as I said, it's one of the more experimental APIs.

What we're doing here is we want to get some early developer experimentation happening. But then the standardization process is going to take a long time. Most likely, it will launch somewhere in proprietary Chrome extensions namespace before, so you can only use it in extensions to begin with before it hits an actual Chrome version that, yeah, regular users can test.

On Chrome, we have the regular phases where something is behind a flag, which is where we are now. Then there is the origin trial phase where you, as site owners, say, "Hey, I want to register example.com for this origin trial." You get back a special token that you need to add to your site. Then, when the Chrome browser detects the presence of this token, it will enable those APIs, so you can test it with real users.
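The token flow Thomas describes is Chrome's standard origin trial mechanism: you register your origin, get back a token, and serve it either as a meta tag or as an `Origin-Trial` HTTP response header. The token value below is a placeholder:

```html
<!-- Placeholder origin trial token for your registered origin.
     Chrome also accepts the same value in an Origin-Trial response header. -->
<meta http-equiv="origin-trial" content="YOUR_TOKEN_HERE">
```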

Once this step is done, we will proceed with shipping. It is to be said, we don't block on other browsers not shipping. So if, let's say, Safari or Mozilla Firefox disagreed totally with what we're doing or what we are planning to do here, we won't block on that.

But there are a lot of conversations that are happening and that need to happen. So, we go to the WICG, as I said before. We ask the W3C Technical Architecture Group, the TAG.

In September, actually later this month, the W3C TPAC meeting (the Technical Plenary and Advisory Committee meeting) is happening, where we engage with folks active in the Web ML working group to just get opinions on these APIs.

What you are seeing right now, as I said, is far from anything shipping. It's a thing that we're experimenting with. It's behind a flag. People can, if you are a part of the program, link the Google Docs where the reference documentation is happening or you can blog about it. People have done that.

The reason why we encourage people to not do that is because we want to get a backchannel to folks. So, if people are a part of the early preview program, there's a way for us to send you all feedback surveys or that we can even contact you and say, "Hey, there was a breaking change," whereas if someone writes a blog post, which is awesome, and includes documentation for how to enable certain APIs, if this blog post doesn't get updated and someone else finds it, they might be disappointed because the instructions don't work anymore or the API shape has completely changed, which is something that did happen with the prompt API.

Chris: Yeah.

Thomas: Before, it had a different shape than it has now.

Chris: That's been a problem forever. [Laughter] It's tricky. As dev rel, I'm sure it's near and dear to you. Yeah. Anyway, update your blog posts, people.

Thomas: Oh, please do. Yeah.

Chris: Yeah. [Laughter] Or link to somewhere. I don't know. That's the responsible thing to do is be like, "This is early. This stuff changes. Here's a URL that is to the source that might have more up-to-date information than this blog post does," or whatever.

Thomas: Ideally, you link to the sign-up form where people can sign up to the program, or even the article where we tell people, "Hey, what even is this?" because, as I said, there are some misconceptions about this being invite only or something; it's kind of a secret club. It's not, so link to the blog post where we tell people how it works, how to sign up, and stuff.

00:22:26

[Banjo music starts]

Chris: This episode of ShopTalk Show was brought to you in part by jam.dev. You've got to check it out. Go to jam.dev. It's a free browser plugin you can install.

As a developer, you need some kind of tool for capturing and annotating screenshots for bugs or, even better, recording little video screencasts of what the bug is. They communicate the bug so much better.

You've got to have a tool like that. That's what Jam does, but it does so much more than that because it automatically captures a whole bunch of interesting, actionable metadata along with that.

Imagine you've recorded this little screenshot now. It automatically becomes this shareable URL that you can put wherever. There are integrations, too. Send it to Jira. Send it to Notion. Use it in your Slack. Whatever. You can comment on them. Leave text there explaining what's going on.

But it automatically captures the console, everything in the console, what happened. So, if it's a bug and the bug threw, for example, a JavaScript error, you'll be able to find it.

You can integrate it with Sentry, too--we definitely have that at CodePen. Now I take a screenshot of some backend problem, or front-end problem, even. It can compare the timestamps of what got reported in Sentry with what was happening on the website and marry the two, so you can see even more than the console gives you. I think that's amazing. Not to mention, you're looking at the video of the problem - super, super useful.

Then it's got the browser, the platform, the version, when, where it happened, all this stuff. More than just the console to other stuff from the dev tools. Just a tremendous thing.

And it doesn't make any more work. You need a screenshot tool, a screen recording tool anyway. You might as well use this one and get all the free metadata information and debug information there, too. I even see there's a little AI tab, so it's like, "Oh, this is representing a bug." Now with all this information, it'll take a crack at how it thinks you could fix it. Why not? Sometimes you're stuck. Let it help you.

Check this all out at jam.dev. A really cool tool.

[Banjo music stops]

00:24:37

Chris: All right. I think we should cover the APIs some more a bit just to see what the surface is. Do you think, though, before we do that--? It seems like it's pretty well considered, and it seems like there's a lot of thinking and planning and excitement and money and stuff behind this now. Does it feel like the trajectory is this is just going to happen? [Laughter] Or does it feel so wishy-washy at the moment that you wouldn't be surprised if they were like, "Ah, you know what? Forget it"?

Thomas: I think we need to separate a bit into controversial and noncontroversial APIs. Detecting the language of text is something that is a well-studied problem. You don't even need AI for doing so. In the worst case, you can detect if something is written in English by a well-crafted regular expression, if you want. Of course, the quality--

Chris: Hmm... That's one of the demos on chrome.dev. You can go in there and it guesses the language. I typed, "Hola!" and it was like, "I am - whatever - 78% sure this is Spanish."
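Thomas's "worst case" regex aside can be made concrete. This is a crude, regex-only guess at whether text is English, counting very common English function words--real detectors use character n-gram models, and this is not how the language detection API works, just the throwaway baseline he mentions:

```javascript
// Naive English detection by common-word density. A regex-only baseline,
// not a real language detector.
function looksLikeEnglish(text) {
  const commonWords = /\b(the|and|of|to|is|in|that|it|you|for)\b/gi;
  const hits = (text.match(commonWords) || []).length;
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return words > 0 && hits / words > 0.1; // >10% common-word density
}
```

The gap between this baseline and a model that can say "78% sure this is Spanish" is exactly why even a "noncontroversial" task benefits from a proper model under the hood.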

Thomas: Yeah, so this is a noncontroversial API. Translation is probably a noncontroversial API.

If you look at Firefox, they have on-device AI-based translation already. It's just not Web-exposed. I think, for APIs like that, we could probably get agreement.

If you look at Safari, they also have a translate feature. Chrome has had one since forever. If this is happening on-device already, as in the case of Firefox, exposing it as something like - whatever - ai.translator.whatever is probably relatively non-controversial. Even people who are skeptical of the entire AI space like you, Dave, might say, "Well, this is actually a net win," whereas stuff where you can potentially do harm--just general prompting, asking an AI model stuff that is completely free-form--might be a little bit more controversial, and people may have very strong opinions. Sometimes they just hate on AI, for good reasons or for no good reasons.

There is a lot to criticize about AI. Let's get the record straight here.

00:26:52

Dave: I think my thing is there's... I am submitting my prompt to a black box. I think that should raise the hairs on the back of the neck of any developer who's ever submitted code to a black box - or whatever. It's just this kind of like, "Oh, who is reading that data? Who is getting access to that? Who is like, 'Oh, well, I'm storing that prompt for telemetry. Wink.'"

That stuff is not clear, and so that's where I think I'm more excited about these kind of local, small, large language models just on-device.

Apple is making a whole deal about Siri being this private, on-device. I watched an hour-and-a-half presentation about it yesterday. [Laughter] You know what I mean? I think that's very comforting, I guess. I say comforting, but I think that's what's kind of weird about these things.

Then my blog posts are out there. If I say, "Hey, AI, help me fix this sentence or this paragraph or make it make more sense," and then it says, "Great, I will. Also, that blog post is now mine." That makes me feel... You know? It's like, "Oh, I'm an AI and I just stole your text to train myself for the future." That's a weird one, too, for me.

It's just kind of like I think, as a consumer, it's like knowledge is power. I don't think we have the knowledge or the sort of... I don't think we're informed or there's not language or badging or whatever around how do you say what happens to my text that gets typed in here.

But I totally agree with you. I think there are very non-sensitive things. Once the model is built, the summarization is just like nothing. It's just vectors in a database. Once you understand it, it's just like, "I'm just going to take text vectors and make a smaller set of points and spit out text that's similar," or do the opposite. I'm going to use a lot of words.

No one would be like, "Predict-a-text on my phone is ruining... is stealing my text ideas." No one would say that. But equally, those are probably being sent off as telemetry somewhere - or whatever. I don't know.

I don't know. That's where I'm like the skeptic kind of comes in. It's just like what happens to the stuff I type in the box.

00:29:42

Thomas: Yeah, for the local models, you can at least be sure that it won't be uploaded to a server. The developer could still do so. But the way the API works, it's totally local.

What is your stuff trained on? That's an entirely different ethical discussion here.

Chris: That's mine. That's my big one. It was almost definitely trained on lots of stuff that I wrote, and it's not just about me. It's about everybody else, too. We weren't asked and we weren't compensated.

It's one of those things, like, is it just like, "Too bad"? Are we at the "too bad"? It's like, "Get over it or get off the bus." It feels powerless, in a way. It's like the ship has sailed, or whatever. It's like, "Yeah, I guess."

You can be real mad that Uber exists, but then you're stuck at the airport. You can use Uber or you can just walk, I guess. [Laughter] I don't know. That's a bad model, I guess. But sometimes it's like bad stuff that you disagree with happens in the world and you just have to get over it, I guess. I don't know.

I'm wearing a shirt right now. I have no idea if every person involved with the creation of this shirt was paid a living wage or not. But I have to wear a shirt, so here we are, I guess. You know? I don't know. [Laughter]

Thomas: Yeah. I know there are documented ways to opt out your site of being used for training now.

Chris: Mm-hmm.

Thomas: But as I say, most of the training has happened. [Laughter] There are, I guess, some ways to untrain. Let's say you want to... I mostly associate you still with CSS-Tricks, so let's say you want to opt-out CSS-Tricks' content from being used. You can do that now, and there might be, in the future, a way to say, "Hey - whatever - Google, take all the content that you took from CSS-Tricks.com and untrain your model from the knowledge that it has gained from that," which I think there's some research on how this could happen. But it's definitely not something that I think any of the AI companies are seriously considering right now.

Chris: Yeah, and I don't even know what the answer would be like, if I still owned and operated CSS-Tricks. I probably would put the robots.txt on there to do it just because I feel like I wasn't asked to begin with, so I might grant you the access to train on it if you asked. But you didn't ask. Anyway... [Laughter] It's where we are.

We don't need to dwell on that for too much longer. I just wanted to acknowledge that because it does feel like that's my hardest one to get past. But it also feels like we're almost--

I feel like, two years from now, nobody is going to even ask that question anymore. It's just the new reality. Text gets slurped up. Too bad. And then we just get over it and move on because we're getting enough value out of this - or something.

One of my goals was the API thing again. We've talked about prompt a bunch and, Tom, you made sure to let us know that's the really broad one, open-ended, do whatever you want.

We talked about summarization. That one can be quite useful. In fact, I think I pay Google a few bucks or something to get access to the bigger Gemini. By virtue of doing that, my Gmail, now--when I open it on my phone and read an email--has a giant button at the top of it. I actually want it to go away. But for now, it's there.

Thomas: [Laughter]

Chris: It says, "Summarize this email." I click it, and it summarizes that email. Now, it's probably not an on-device model, but it could be in the future maybe.

It's just a different API from the prompt API, right? It's just more specific.

00:33:19

Thomas: Honestly, if you say Gmail, this already is very unclear because you have Gmail on Android. Android does have an on-device model that it ships with.

Chris: Oh...

Thomas: At least on some of the high-end Pixel phones. There is a lot of buzz around the on-device.

Chris: Okay.

Thomas: If they were doing an Android app--

Chris: We're talking about the Web here.

Thomas: Yes.

Chris: But Android--

Thomas: ...how Android uses it, I don't know.

Chris: Okay.

Thomas: If you go to Gmail, you can actually just inspect and see what kind of requests are being made if you hit this "summarize email" button. I think right now it definitely still goes out to the server just because the on-device prompt API is behind a flag, so you can't really use it.

Chris: Okay. Then there are others as well that I understand less. How many are there? Is it worth--? Can you walk through some of the different APIs that exist that are attached to window.ai.*?

00:34:11

Thomas: That's the write API. The core idea there is you have a task in mind. So, "Tell Chris and Dave politely that I can't make it to the podcast today because - whatever - my cat died."

I'm not in the mood of writing an email today, but I know I need to tell you folks, so I can use this API to make it happen. This is one idea.

Then the rewrite API takes existing text and allows you to rewrite it. Let's say I got this email draft, and it sounds kind of very formal. So, I can tell the rewrite API, "Hey, actually, I sort of know Chris and Dave at least from the podcast, so I feel like I know them very well. Make this less formal. Make this shorter." I can then take this text and have it rewritten for me.

You can, of course, sort of get the same behavior with the prompt API if you just tell it. But as I said before, we were looking at what are the tasks that people use the prompt API for. In general, is there anything that we can take away from those and then generalize it?

You will see that the write and the rewrite APIs are part of the broader writing assistance APIs. In there is, as well, the summarization API where you can just summarize long text and make it shorter. This can be anything from a long blog post that you just want a headline for.

Something I always struggled with is how do I title my posts, or I have this very long email thread and I want to get a summary, so I want to get the key points. Sometimes you have a long movie description and you want to write a teaser that will tease the movie for people to watch without really giving away too many spoilers. When it comes to that, we definitely do have some overlap with the APIs.

Let's take this movie example. You have a description of what is in the movie, what's happening in the movie step by step each scene. But there are of course spoilers. So, I can create a teaser for it. But then the teaser will still contain spoilers, and this is definitely then a moment where you need somehow to tell the API, "Hey, actually, don't tell that - whatever - they die at the end - or something. Don't spoil the surprises."

To some extent, you can tell the rewrite API, via the context that you provide it, "Hey, actually, don't tell - whatever - that Jay dies in the end - or something," so there is some way to do that. But when it comes to very, very special use cases where you want to summarize something in a way that makes sense for a technical audience, and this technical audience is very well aware of a certain technical concept, it starts to get very complex where you say, "Hey, summarize this. But actually, by the way, if they talk about P, don't summarize the P concept because people know what P is, so there's no need to summarize this."

What I'm getting at is the moment your summarization gets or has to have some additional instructions, at some point you probably need some sort of a free-form prompt API. But we are still exploring when is that point. To what extent can we get along with just the task-based APIs that do one thing and one thing hopefully well? And where do we need this free-form liberty?
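The task-based shape Thomas describes can be sketched roughly like this. The `ai.summarizer` object, its `create()` options, and the `sharedContext` field follow the public explainers at the time of recording; everything here is experimental and behind a flag, so treat every name as an assumption rather than a stable API.

```javascript
// Hedged sketch of a task API call (summarizer), per the explainers at the
// time of recording. The `ai` object is passed in so this also runs against
// a stub; in Chrome behind the flag it would be the global `window.ai`.
async function summarizeText(ai, text, sharedContext) {
  // Feature-detect: these APIs only exist behind an experimental flag.
  if (!ai || !ai.summarizer) {
    throw new Error("Summarizer API unavailable in this browser");
  }
  const summarizer = await ai.summarizer.create({ sharedContext });
  try {
    return await summarizer.summarize(text);
  } finally {
    // Free the session when done (method name per the explainer).
    if (summarizer.destroy) summarizer.destroy();
  }
}
```

Against the real API, `sharedContext` is where instructions like "don't give away spoilers" would go; whether the model honors them is exactly the open question Thomas raises about where task APIs end and a free-form prompt API begins.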

Chris: Right. Okay, so we've got prompts. We've got write or rewriter. We've got summarize. Those seem a little straightforward.

Thomas: Language detection one.

Chris: The language detection one.

Thomas: Yeah, there's a language detection one.

Chris: Okay.

Dave: Does it route? Can it detect from my prompt, "Oh, he's trying to rewrite something, let's use rewriter instead"? Can it do that or do I as the author have to be like, "Ah-ah-ah. You didn't click the rewrite button"? [Laughter] "I don't know what to do here." How does that work?

00:38:13

Thomas: Yeah, so right now if you give it the wrong task or if you call the wrong API with the wrong task, there's no way that the API will be smart enough to tell you, "Hey, actually, you might be better off with this other API."

Dave: Right.

Thomas: I think, in the end, what we want to do is make those task APIs really clear. So, if you have a task at hand, you know exactly, "Oh, this is something that I need the translate API for" - and that's the one that we didn't talk about. Language detection is a required task for the translation API because typically, once you get some text, you may not even know the source language; you only know that you want to get the result in French, so you need some way to detect what even is the source language.

Some people look at Asian scripts and they don't know: is it Korean, is it Japanese, is it Chinese, is it whatever? If you have some sort of language detection API as an input first step, then it can use that. Yeah, and then the translation API pairs up with the language detection API. I think this is one case where the two can work together.
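The detect-then-translate pairing can be sketched as below. The detector and translator are injected as plain functions, and the result shape (a ranked list with a `detectedLanguage` field) mirrors the explainer at the time; all of it is an assumption, not a shipped API.

```javascript
// Sketch of pairing language detection with translation. `detect` should
// resolve to a ranked list of { detectedLanguage, confidence } candidates
// (the shape proposed in the explainer); `translate` is a stand-in for a
// translator created for a given language pair.
async function translateUnknownText(detect, translate, text, targetLanguage) {
  const candidates = await detect(text);
  const sourceLanguage = candidates[0].detectedLanguage; // best guess wins
  if (sourceLanguage === targetLanguage) return text; // nothing to do
  return translate(text, { sourceLanguage, targetLanguage });
}
```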

There's a bunch of them that we have internally that make use of several APIs. One that I'm working on is what I call lingua franca Mastodon. The idea there is you can subscribe to hashtags on Mastodon like U.S. politics, which is kind of interesting these days.

Dave: You don't say. [Laughter]

Thomas: [Laughter]

Dave: No.

Thomas: Some of these toots might be in written languages that you don't speak, so I can use the translation API to first detect what is the language the toots are written in then translate to a language that you do speak. Then I can use the summarization API to get a feeling of what are the core points that people are talking about. Then finally, you can use the writer or rewriter API to then say, "Hey, I'm a journalist. I want to give people an overview of what people are talking about Mastodon about a certain hashtag," in all languages that the journalist may not even speak.

So, that's a combination of all the different APIs that we have so far. I have to say it sounds promising. It works like crap, though, and this is mostly because toots are very short messages to begin with - a little longer than the traditional Twitter messages, but still relatively short. Toots are also mixed quality; people hashtag a lot, spam tag a lot, so there's a bunch of just unrelated toots.

You could try it with a filtering step where you tell the prompt API, "Hey, actually find out if this toot that is labeled X actually has to do with X or if it's just spammy." You can imagine a pre-filtering step like that. But anyhow, the more steps you build into the pipeline, the less reliable the final outcome is.

The idea is there. I hope, once we have a better model, that the results will be better. But yeah, it's something we play with internally and just see what are the use cases that we even internally can come up with.
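The "lingua franca Mastodon" pipeline is essentially a composition of the steps Thomas lists. A minimal sketch, with every step injected as a placeholder function (none of these names are real APIs; with the built-in APIs they would wrap the detector, translator, and summarizer sessions):

```javascript
// Detect -> translate -> summarize, over a list of toots. Each step is a
// placeholder async function supplied by the caller.
async function linguaFranca({ detect, translate, summarize }, toots, targetLanguage) {
  const translated = [];
  for (const toot of toots) {
    const sourceLanguage = await detect(toot);
    translated.push(
      sourceLanguage === targetLanguage
        ? toot
        : await translate(toot, { sourceLanguage, targetLanguage })
    );
  }
  // One summary across the whole (now single-language) batch.
  return summarize(translated.join("\n"));
}
```

As Thomas notes, each extra stage compounds errors, so a real pipeline would likely want a spam/pre-filtering step before `summarize`.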

00:41:18

Dave: Here's an idea. I've got ideas. I'm an ideas man.

Chris: Yeah. Yeah.

Dave: I think everyone who has listened to the podcast knows that. [Laughter] I've always wanted a weather report for my social media. When I log on, it's like, "What's the temperature here? Is it just a shit show or is it a good time? Sunny days, dark skies?"

Chris: [Laughter]

Dave: I think that would be really cool, and this is exactly probably what I would need to--for tweet in tweets--analyze it and then summarize it and then sum that up and then give me a score on toxicity or something. Then just give me a weather report on, is it acid rain out there? Oh, boy. I'm just going to close this.

That would be fun to me. I don't know. I assume, because it's happening on-device, it can be pretty quick. In that situation, I'm not... Well, A) I can prototype it for zero dollars. That's great for me.

[Laughter]

Dave: But then B) in that situation, I'm just causing harm to myself, I guess, if that makes sense. I think, with any AI or computer, it's not going to understand sarcasm very well. Maybe it does but it doesn't understand other things. And so, I don't know.

Gosh, yeah. If you're like, "Oh, are there toxic trending topics on Twitter? Just delete them. Just delete any tweet that seems stupid." It seems like you could build cool tools for yourself with this stuff. I don't know.

Thomas: You could totally do that. There's toxicity detecting models where you can just say, "Hey, if this is a toxic message, just leave it out; filter it out."

I think something that I always wanted for any kind of social network--Twitter, Facebook, anything--people bring their whole selves to the socials, and that's good. But sometimes I'm in a hurry, so while I usually might enjoy, let's say, Chris's JavaScript and CSS toots, but maybe sometimes his cooking toots, if I'm in a hurry, I might just want to filter out his cooking toots - or something. Or I just only want to get the technical toots - or something.

Something like this could definitely be built into a Web-based Mastodon client. Yeah, most of those, or some of those even, are open-source if you're using Phanpy or Elk.zone or something. There's probably someone somewhere out there who, after listening to this, is in a mood to actually try it and say, "Hey, we have filtering in the Mastodon network by hashtags, but this requires the author to have placed the hashtag." But if there just was a way to say, "Hey, if this toot is about cooking, then just don't show it," I think this is something that you could perfectly build client-side.

It could be an extension. It could be a plugin that you have as a developer that you activate in your Mastodon client or something. Totally doable, and I think we are technically there.

I think always a little bit of a problem would be what if this is overzealous. Let's say you tell it to remove toxic tweets, but then it takes toxicity as also sarcasm, which is something that you might want to enjoy, actually. Fine-tuning this and getting it right for just your taste, I think this is then where something like the fine-tuning and model tuning would come in.
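The client-side muting Dave and Thomas describe reduces to: classify each toot, drop the matches. A sketch, with `classifyTopic` as a placeholder for whatever model call (prompt API, a toxicity model) produces the label:

```javascript
// Filter a timeline by model-assigned topic. `classifyTopic` is a stand-in
// for the actual classification call; it resolves to a single topic label.
async function filterTimeline(classifyTopic, toots, mutedTopics) {
  const muted = new Set(mutedTopics);
  const kept = [];
  for (const toot of toots) {
    const topic = await classifyTopic(toot);
    if (!muted.has(topic)) kept.push(toot);
  }
  return kept;
}
```

The overzealousness problem Thomas mentions lives entirely inside `classifyTopic`: if sarcasm gets labeled as toxicity, it gets muted too, which is where per-user fine-tuning would come in.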

Chris: I don't even know if these... I'm sure, from your perspective, it's like, ah, it's really nice to hear what use cases people have and stuff. But it's kind of like there are zillions of use cases. I don't know that these particular APIs need to be the farming ground for AI use cases.

The whole world of AI is use cases. It's almost like, "Look outside. See what cool things are happening in AI. Then make them possible to do in the browser."

I like Copilot. It's interesting that we got that so early on in the AI days, but it's nice. It does type ahead for code. It does a pretty good job with it. It helps me refactor code and stuff. That should be... If that happened on-device, that would be sick, so do that. You know? [Laughter]

Do it. Yeah. Copilot for free, baby. You know?

Dave: You could put that in... I could think of one online code editor-- [Laughter]

Chris: Yeah.

Dave: --that could really benefit from something like that. Yeah.

Chris: I want to feel like I'm smart for being trailing edge with this stuff. I want to be like, "Oh, we were just waiting for Chrome to finish up with their thing, and now we have it, so good job."

Dave: Yeah.

Chris: And guess what. It cost us nothing.

What about the size? I think people have questions about that. There are some demos, I think people have seen, of AI on the Web. Then it's like, "Visit this website. Wait for an hour. Then you have AI in the browser," because - whatever - in the background, it's downloading some two-gigabyte model.

Dave: Forty-gig model, yeah.

Chris: Yeah, right. Which is possible to do, right? I think you can download enormous things on the Web. That's not going to be the case here, or maybe it is. When is that the case and when is that not the case? Yeah, I guess that.

00:46:45

Thomas: Well, the model needs to download once. But once it's downloaded, you can use it on all the origins. Whereas if you have a demo like Stable Diffusion online or something, or we have seen the cool folks from WebLLM that make you download any of the--

Chris: Where does it go, though? Is it there until I clear local storage or clear site data (as Chrome would say)? Or it's tucked into some secret place?

Thomas: It's sort of tucked into a secret place, but there are people who have reverse-engineered how it works and stuff. I probably shouldn't talk about it, but if you know, you can find it.

It's somewhere deeply hidden in - whatever - the Application Support folder of your Mac, if you're using a Mac - or something. It is downloaded there. But if you take any of the other API demos that need to download the model, then you have the problem of, "Oh, this model is only usable on example.com."

Chris: Hmm...

Thomas: You are bound to one particular origin, so even if you have the same model being used in different origins, you need to still download it again for each origin, which is super annoying, of course, and a huge waste of space.

Chris: Right.

Thomas: There's one way around it if you work with a browser that has the file system access API supported. You can point your browser at a particular file on your disk, and like that, share the same file across different origins (if you want to). I've documented that in an article.

Chris: Fancy.

Thomas: So, if you are really into doing something like that, you can do so.

Chris: It's an executable or something?

00:48:31

Thomas: It's not an executable. It's essentially just: you download the model once, put it somewhere on your hard disk, and then, for your app, you create a file handle that points at this particular file. This file handle can be serialized to IndexedDB, so when you reload the app, the browser can re-establish the connection that it had to the file handle and use the same file.

You can do this from any origin. Of course, then all of those origins make use of the same file. So, it's not for the faint-hearted.

Chris: Yeah.

Thomas: You really want to be sure that you know what you're doing there. But for researchers, for example, this could be very useful.

If you are working on AI research and you have different social experiments with different models in different applications, you could totally use such an approach. But as I said, it's not for end users. Definitely not.

But for end users, what might work is the built-in model, where the model is downloaded only once. The browser takes care of the download: it takes care of interrupting the download whenever my network goes down and resuming whenever the network comes back.

There's this sort of black box - the browser takes care of it, and you don't have to deal with it. On that end, it's a plus.

If you look at the requirements in the documents that we mention, we say you need to have, I think, 20 or 22 gigabytes of free disk space. The browser will remove the model again once you go below ten gigabytes of free disk space, which happened to me this morning, actually, because I was playing with some of the online demos. I was downloading Stable Diffusion and stuff. All of a sudden, my built-in browser model had gone away again.

Chris: Hmm...

Thomas: But the browser will then also take care of downloading again if I make space on the hard disk, so it's, yeah, a bit of--

Chris: Isn't that an image one? We haven't talked about images at all, and Stable Diffusion is an image model, isn't it?

Thomas: Yes.

Chris: Is that--?

Thomas: Stable Diffusion is for creating images. Yeah.

Chris: You were just playing around with it otherwise, or is that also going to be available via Web stuff?

Thomas: Oh, it's no secret that the Gemma and Gemini model family is multimodal.

Chris: Mm-hmm.

Thomas: So, images, text, audio, whatever. We were looking at all those use cases, so I don't announce anything right now, but--

Chris: Oh, I see.

Thomas: If you use the Gemini API, you can see already this is stuff that people do and can do, like transcribe a podcast by sending in the audio and get back--

Chris: Right, right, right.

Thomas: --a vtt file.

Chris: I mean I just did it just now for fun. If you go to Gemini, I can type in, "Make me an image of a monitor with a hand coming out of it," or whatever, and it knows what I mean and it will generate an image of that. That's probably not happening via window.ai.prompt - or whatever it is right now - but it could, right? Reading between the lines, you're saying it's possible that these experimental APIs start returning images as well. Check. Got it.

Thomas: We are exploring use cases here.

Chris: Yes.

Thomas: It's very open-ended at this stage.

00:51:38

Dave: When you say exploring use cases, again black box, are you saying these are people in the group that are like, "Hey, I made this," or is it Google with its little headset radio on saying, "They used it to make this," or which brand of seeing what people are using it for is it?

Thomas: I think it's a mix of everything, so we have people who sign up to the program. They say, "Hey, I want to use this with - whatever - images because I want to generate alt text, alternative text, for this image, and I don't want to go to the cloud because it's expensive - whatever. So, I want to get a suggestion from the model for an image that I have in - whatever - my Mastodon client that I'm programming online. I'm getting a proposal or suggestion. How could the image be alt-texted?"

I can refine that or accept it if it's good. There are use cases like that that people just let us know.

Chris: Hmm...

Thomas: We have a partnerships team that works with big brands in the industry like, in the past, we announced that we were working with Adobe for Photoshop on the Web. So, there are definitely some people who have conversations around use cases.

But then also just random developers. And I don't mean this in a negative sense like random developers who hear this show or who just follow the work we do on GitHub or something. They can come up with use cases and write them down, open a GitHub issue, and say, "Hey, I want to use this API for creating images or transcribing podcasts or summarizing videos."

Dave: The Adobe one is interesting because, in theory, they have their own models for images and stuff like that, right? Firefly, right? Isn't that Adobe's thing?

Chris: Yeah, who they said that they trained it only on images that they own, which is pretty cool. Just saying.

Dave: But is that--? Could they sideload their own model into this Window AI API, so we have a consistent API - or something like that? Is that ever talked about?

Thomas: Not right now. Not in a current implementation. It's something that I teased in the beginning when we talked about this. It could be possible.

In the end, what we want to standardize is the interface, like really just window.ai.create or window.ai.capabilities. If a browser decides to then make the model pluggable, I think Brave lets you use a selector where you can say, "Hey, I want to use Facebook's open-source model, Meta's open-source model."

Dave: Hmm...

Thomas: You could imagine something like this.

Dave: Kind of like if I had a specifically fine-tuned model, I could say, "Use this one," or whatever. Import. Yeah.

Thomas: It would be possible, I guess. It's a use case that we definitely have heard before. If it's a strong use case that you have, you should definitely comment on the spec issue and say, "Hey, let's make the model pluggable. Give us an API for plugging our own models into this so we can use the interface. But at the same time, yeah, don't depend on whatever the browser decides to ship as its model by default," for example.

Dave: See. Yeah, now I ask for it but I don't know if I want it because now it's like, "Oh, bad guy AI says I want to use my model and make it look like the Window AI." I don't know. Anyway--

Chris: Hmm...

Dave: Hmm... But I don't know.

00:55:06

Thomas: One open question also is, should the API tell you what model is being used, or even what model version, because models behave differently. But then, always on the Web, you have this problem of fingerprinting if you know exactly what model version - down to whatever minor semantic something-something 0.x version - is being used.

Chris: Be like, "Ah! That's Thomas! I know he uses that model."

Thomas: Exactly. This is another fingerprinting vector.

Chris: Hmm...

Thomas: We are looking at what is actually needed. Is there a way that this makes sense? Can we make this somehow a reality without compromising the user's privacy?

As I said, right now there are no such plans. We're looking at the higher-level task APIs: translation, language detection, and so on. But the model is not that important. In most cases, the existing models that are fine-tuned for doing translation or summarization or whatever work differently, but well enough for the use case to make sense and be realizable.

Dave: Yeah. Kind of the same. Just a new brain. You're putting in a new brain in the crane. [Laughter] Sorry.

Chris: We obviously need two new APIs. We need... Maybe it's part of write, but the suggest thing, like, "Here is a block of text. My cursor is here. On that same line, what do you suggest?" Like the Copilot thing, an API that's specifically for that would be nice.

Thomas: I guess this would be something optimized for code completion.

Chris: Probably. But it's the same one, like when you're writing an email, too, you know, where you type, "Hi mom," and then it's like, "I sent a package. Be on the lookout for it," or something like that. Suggest seems to happen outside of coding context, too. But if it was coding-specific, I'd be obviously super down with that.

Thomas: Mm-hmm.

Chris: And then, like you mentioned, the alt text thing, which I do think is generally a good use case. Is there one of these APIs already that takes binary data?

Thomas: Not for the built-in APIs.

Chris: Right. But it could... so--

Thomas: It could. Who knows.

Chris: Those are my two votes for now that benefit me particularly the most.

Dave: [Laughter] You're like, "Could I get this one and this one, please?"

Chris: [Laughter] Yeah.

Thomas: [Laughter]

Dave: "By launch day Q4?"

Thomas: [Laughter]

Dave: Well, that's interesting. I mean so I think we're kind of hitting towards time here, but what's kind of the near future of this tech? I guess you were saying looking for feedback, use cases, et cetera, but what's kind of the near future for the AI spec or, I guess, not spec, proposal - whatever level it is?

00:57:53

Thomas: Yeah, so I think the next big step is the WebML working group having a look at this. So, if you are at TPAC by any chance (in Anaheim) - TPAC is happening the week of September 16th, maybe. It's soon. I'm going. [Laughter] I still don't know exactly when it is.

There's a public... Sorry, not public. There is a WebML meetup happening. So, if you are a member of the WebML working group or an invited member, you can get access to this.

Typically, the discussion tends to be sort of archived in a public way, so you can see what was discussed there, what was the standards gurus' opinion of these APIs. On the standardization space, I think this is the big moment that a lot of people are working towards right now.

In the browser implementation area, I think the next big step is some of these APIs hitting origin trials so that people can test them with real users (not just behind a flag in your own room, but with real users who go to your site and then get the prompt API - or whatever).

Yeah, I think these are the two big steps that we're working towards right now.

Dave: Cool. All right. That's good to know. Okay, well, cool.

Well, I guess, for people who aren't following you and giving you money and using your AI tools, how can they do that?

Thomas: People should not be giving me money. My employer already does. If you want to give money, do it for a good cause that you care about.

But I am @tomayac essentially everywhere on the Internet, so tomayac.com is my domain. I'm @tomayac on GitHub and Mastodon, and, if you still feel like you can use Twitter, I still respond to DMs and mentions there. Yeah, do reach out - [email protected] if you fancy email. If you search tomayac, you will find my handles.

Dave: Awesome.

Chris: All right!

Dave: Yeah. Thank you so much for coming on the show. It's been super helpful.

Chris: For sure.

Thomas: Thank you for having me.

Dave: I don't know. It's weird because I'm wholeheartedly a skeptic but then I'm also, "This would be kind of fun to play with," so it's like a cool tool. Yeah, I don't know. Anyway, it's fun.

Anyway, thank you so much for coming on the show and getting our heads straight on this.

Thank you, dear listener, for downloading this in your podcatcher of choice. Be sure to star, heart, favorite it up. That's how people find out about the show.

Follow us on Mastodon. That's the good one. Then join us in the D-d-d-d-discord, patreon.com/shoptalkshow. Chris, do you got anything else you'd like to say?

Chris: Oh... How should I end this podcast? Let's see what it says. [Speaking in a whisper] ShopTalkShow.com.