this post was submitted on 21 Sep 2023

155 points (91.0% liked)


OpenAI’s new AI image generator pushes the limits in detail and prompt fidelity (arstechnica.com)

submitted 2 years ago by Voyager@psychedelia.ink to c/technology@lemmy.world


On Wednesday, OpenAI announced DALL-E 3, the latest version of its AI image synthesis model that features full integration with ChatGPT. DALL-E 3 renders images by closely following complex descriptions and handling in-image text generation (such as labels and signs), which challenged earlier models. Currently in research preview, it will be available to ChatGPT Plus and Enterprise customers in early October.

Like its predecessor, DALL-E 3 is a text-to-image generator that creates novel images based on written descriptions called prompts. Although OpenAI released no technical details about DALL-E 3, the AI model at the heart of previous versions of DALL-E was trained on millions of images created by human artists and photographers, some of them licensed from stock websites like Shutterstock. It's likely DALL-E 3 follows this same formula, but with new training techniques and more computational training time.

Judging by the samples provided by OpenAI on its promotional blog, DALL-E 3 appears to be a radically more capable image synthesis model than anything else available in terms of following prompts. While OpenAI's examples have been cherry-picked for their effectiveness, they appear to follow the prompt instructions faithfully and convincingly render objects with minimal deformations. Compared to DALL-E 2, OpenAI says that DALL-E 3 refines small details like hands more effectively, creating engaging images by default with "no hacks or prompt engineering required."

top 30 comments


[–] simple@lemm.ee 22 points 2 years ago

That avocado image is insane. I've yet to see any image prompt AI get text and composition anywhere near this level. Mindblowing to know there were zero edits. I really want to try this now.

[–] MysticKetchup@lemmy.world 22 points 2 years ago (3 children)

One of the things I've noticed from testing these generators is that they look very good when someone is showing off cherry-picked photos or you are going in with little to no expectation of what you'll get. The more you have a specific image in mind or the more you want certain details the worse it gets. And you quickly realize that when you generate things from similar prompts over and over the model gives you the same results but slightly adjusted.

Obviously art generators look bad for artists right now, but I think once the new toy factor wears off people will realize they aren't as good as they seem. There's a lot of smoke and mirrors involved and once you've seen a good amount of AI photos it gets easier and easier to pick them out of a lineup. They're closer to advanced stock photo generators, in terms of what you can actually get from them. Companies' race to the bottom means that this is going to have effects on artist jobs, but I think the next "revolution" in art is going to be having human art as a selling point the same way stuff like fully orchestrated music or hand-crafted things are.

[–] thehatfox@lemmy.world 12 points 2 years ago (1 children)

I wish more people realised this. It's much harder to create very specific images with the current image generation tools than most people seem to think, which is creating an inaccurate view of the technology in the public eye.

The generator will create something inspired by the prompt it is given, but it can be very hard to make it match the output the prompt writer imagines when writing the prompt. There are various tools that can refine and narrow the generator's output, to try and control things like posing, composition, style etc and to redraw details. But even then it's often pot luck as to the output. The generated images aren't necessarily bad, just not what was wanted.

I think the comparison to stock photo images is apt: current image generators are great for creating themed but somewhat generic images. The tools are going to continue to advance, and they are useful for some applications already. But they are still a long way off from truly replacing human artistry.

[–] lloram239@feddit.de 1 points 2 years ago

The crux with that argument is that the artist is the only one who cares about specific output; the art consumer doesn't. When somebody plays a game or watches a movie, they don't know what to expect, that's part of the fun, they just care about it being good. So as long as the output is good enough for the consumer, whatever the artist thinks about it really doesn't matter, assuming they still have a job to begin with.

[–] GenderNeutralBro@lemmy.sdf.org 1 points 2 years ago

Getting good results usually requires a prompt that looks like the title of a generic product on Amazon: just an endless stream of keywords.

The examples from simple prompts are cherry-picked.

[–] lloram239@feddit.de 1 points 2 years ago* (last edited 2 years ago) (1 children)

And you quickly realize that when you generate things from similar prompts over and over the model gives you the same results but slightly adjusted.

That's quite true, however it's worth keeping in mind that this is largely not due to a limit in the model itself, but a limit in how that model interfaces with the human. Text just isn't very good for getting specific results, especially not when it lacks the incremental refinement that you can do in ChatGPT with follow-up prompts.

On the other side if I take StableDiffusion with ControlNet, instead of just a text prompt, I can generate far more specific results, as I can feed other images and sketches into the generation.

but I think once the new toy factor wears off people will realize they aren’t as good as they seem.

Quite the opposite, there is a ton of hidden potential still left to uncover. We have barely even started to train them on video or 3D data, integration of image models with newer language models is also a work in progress, and integration into old-school image manipulation tools has only just begun as well.

Worth keeping in mind that Dalle-1 isn't even three years old. We are basically still in the Atari2600 days of image generation.

Meanwhile Dalle-3 comes along and can produce this level of quality with a completely generic prompt: "A fan-art of Guardians of the Galaxy Vol. 3" on the first try.

I think the next “revolution” in art is going to be having human art as a selling point

The big problem for artists is that AI art drives the value of art down to zero. It'll be hard to convince anybody to pay hundreds of dollars for something when AI can produce something similar in 30 seconds for free. Worse yet, AI can take any existing image and remix it. The whole idea of a single static image feels quite restrictive once you've played around with AI art for a while, as everything is just a few clicks away from being something different.

I think the idea of AI art as just generators for stock images doesn't capture the magnitude of the changes that are coming. We are straight up heading into Holodeck territory where you tell the computer what you want and you get it. The AI generators won't be a tool for the artists, but go right to the users. There won't be a static image that comes out the other end; the AI will be the medium of media consumption. Just like people today can flip through TikTok, future people will flip through an AI-generated stream of content custom-made for them.

Wanna play some 2D game with snow and ice? Tell the computer, a couple seconds waiting, and boom here it is. First try. Want lava instead? Done. How about an N64 game? How do you compete with that as a human when AI can pull that out of thin air in seconds?

[–] MysticKetchup@lemmy.world 1 points 2 years ago (1 children)

Nah, that Guardians of the Galaxy art is exactly what I'm talking about. It makes basic mistakes even a child could point out and looks more like a knockoff. And refining it is just rolling the dice to get a better result, whereas with an artist you can actually give feedback they can understand.

The game assets look a little better, but if you look carefully you'll notice that they don't tile correctly. It's 90% there but the last 10% is the hardest part, and it's important especially for large projects and not just single static images. Not to mention they look generic as fuck: you're not going to get the next Hollow Knight or Darkest Dungeon with an amazing original style from AI, you're only going to get existing styles mashed together. The more specific the vision for the artstyle, the harder it will be to generate it.
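For anyone wondering what "tile correctly" means in practice: a tile repeats cleanly only if its opposite edges line up where copies meet. A rough sketch of that check, assuming a grayscale tile stored as a list of rows; the function names and the tolerance value are made up for illustration, and comparing edge pixels is only a crude heuristic, not how any particular tool validates assets:

```python
from typing import List

Tile = List[List[int]]  # grayscale pixels, row-major, values 0-255


def edge_mismatch(tile: Tile) -> float:
    """Mean absolute difference between opposite edges of the tile."""
    height, width = len(tile), len(tile[0])
    # Left column vs right column: these touch when the tile repeats horizontally.
    horizontal = sum(abs(row[0] - row[-1]) for row in tile)
    # Top row vs bottom row: these touch when the tile repeats vertically.
    vertical = sum(abs(a - b) for a, b in zip(tile[0], tile[-1]))
    return (horizontal + vertical) / (height + width)


def tiles_seamlessly(tile: Tile, tolerance: float = 8.0) -> bool:
    """Hypothetical threshold check; the tolerance is an arbitrary assumption."""
    return edge_mismatch(tile) <= tolerance


# Opposite edges are identical, so repeated copies meet without a visible seam.
seamless = [
    [10, 50, 10],
    [90, 20, 90],
    [10, 50, 10],
]

# Dark left edge against a bright right edge: a hard seam when repeated sideways.
seam = [
    [0, 0, 255],
    [0, 0, 255],
    [0, 0, 255],
]

print(tiles_seamlessly(seamless), tiles_seamlessly(seam))
```

A generated tile can look fine on its own and still fail a check like this, which is the "last 10%" problem in a nutshell.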

Also the idea of a Tiktok feed of AI generated content is exactly why I hate AI art. Sure, go ahead and use it to help existing artists generate cheap assets that would otherwise be random brush strokes. But replacing them? The idea that AI generated slop will have anything close to the quality and meaning of even cheap art is ridiculous. Why would anyone want that when they could have actual art made by real people, more of which exists today than anyone could go through in their entire life?

[–] lloram239@feddit.de 1 points 2 years ago* (last edited 2 years ago)

The idea that AI generated slop will have anything close to the quality and meaning of even cheap art is ridiculous.

You are missing the bigger picture: This took SECONDS, no effort on my part and it was a first try, using technology that is a little less than three years old at this stage. I can generate new images on any topic I want, instantly. This stuff is already incredible today and is getting better rapidly.

Meanwhile here are examples of glorious human art:

Human art is full of mistakes. The best of the best human art has "quality and meaning", the average not really. Stuff like "Somehow, Palpatine returned" was written by humans. There is a lot of garbage that slips through, even in projects that have so much money that there is really no excuse. I'll take a few additional AI-generated fingers, which are trivial to fix, over that trash.

Here is some of the box art recreated with AI, again zero effort, first try: https://imgur.com/a/kHcwv4j

And you can remix it at will: https://i.imgur.com/y38UPX6.jpg

Also the idea of a Tiktok feed of AI generated content is exactly why I hate AI art.

Netflix is already running personalized thumbnails, not with AI, but that's exactly the kind of stuff I expect AI to be used for real soon, if it isn't already in some capacity.

Why would anyone want that when they could have actual art made by real people, more of which exists today than anyone could go through in their entire life?

Nobody cares about who makes the art outside of some art historians. Every movie, TV show or game has dozens or even hundreds of people involved, you have no idea who was responsible for what or what was going on behind the scenes. All you see is the result and you either like it or you don't. "The Death of the Author" and all that.

[–] PopOfAfrica@lemmy.world 11 points 2 years ago (1 children)

I've already accepted that my graphic design degree is worthless.

[–] 2nsfw2furious@lemmynsfw.com 2 points 2 years ago

This is just automatic Photoshop - if all you were doing with graphic design was pasting blond hair onto a brunette, yes, this has really screwed you (or made your job a lot easier). If you're actually doing any level of design... you're safe for now

[–] PlexSheep@feddit.de 7 points 2 years ago

closedAI still sucks, even if their (closed!!) tools are powerful.

[–] AlmightySnoo@lemmy.world 5 points 2 years ago* (last edited 2 years ago) (4 children)

I will be convinced when they learn to draw hands correctly, which they seem to boast about here.

[–] Devorlon@lemmy.zip 16 points 2 years ago (2 children)

Here's an example image from the article.

https://cdn.arstechnica.net/wp-content/uploads/2023/09/plategirl-980x560.jpg

[–] Chozo@kbin.social 29 points 2 years ago (1 children)

from the article

Well no wonder they couldn't find this example.

[–] fmstrat@lemmy.nowsci.com 8 points 2 years ago

For a system where the intent is to read, learn, or be entertained (and kill time), people seem unwilling to do the first to accomplish the latter.

[–] Quicky@lemm.ee 19 points 2 years ago* (last edited 2 years ago) (2 children)

Was the prompt “Woman from China”?

Edit: I feel like the nuance of this joke may have been lost on some. Whether or not I read the article is irrelevant, since this was not a genuine question, rather a play on words of the double meaning of “china” as in “A woman from (the country) China” and “A woman (emerging) from china (porcelain)”.

I’ll get my coat.

[–] Chariotwheel@kbin.social 12 points 2 years ago (1 children)

The prompt is on the picture in the article:

A DALL-E 3 image provided by OpenAI with the prompt: "A middle-aged woman of Asian descent, her dark hair streaked with silver, appears fractured and splintered, intricately embedded within a sea of broken porcelain. The porcelain glistens with splatter paint patterns in a harmonious blend of glossy and matte blues, greens, oranges, and reds, capturing her dance in a surreal juxtaposition of movement and stillness. Her skin tone, a light hue like the porcelain, adds an almost mystical quality to her form."

Why do we need AI creating text, when nobody is reading?

[–] Quicky@lemm.ee 1 points 2 years ago

Whoosh

[–] Aatube@kbin.social 1 points 2 years ago (1 children)

You might want to put it all lowercase next time

[–] Quicky@lemm.ee 2 points 2 years ago

The next time I make the same joke?

I reckon I’ll just keep it to myself instead. I already feel ridiculous for having to explain it. Lemmy is harder than real life.

[–] ZILtoid1991@kbin.social 4 points 2 years ago

Making the context window larger likely helps with stuff, however it still has the issue of "background breaking".

[–] lloram239@feddit.de 2 points 2 years ago* (last edited 2 years ago) (1 children)

Seems to be about 50/50: quite a few good-looking hands, but still plenty of crooked fingers with some prompts. I think they might need training on video or 3D models, the structure of hands is probably difficult to figure out just from 2D images.

[–] AlmightySnoo@lemmy.world 1 points 2 years ago

Yup, that's the thing with most generative AI models: they have no implicit 3D modelling of the world. So depending on perspective, a real 2D image may give the impression that there are only 2 or 3 fingers, but the model doesn't know that that's just because of perspective.

[–] tonytins@pawb.social 1 points 2 years ago* (last edited 2 years ago) (1 children)

The reason AI struggles with hands is because real artists struggle with them too.

[–] thbb@kbin.social 15 points 2 years ago* (last edited 2 years ago)

While there is some truth in this, humans and AI do not make the same type of mistakes with hands.

Humans will rebuild the topological structure of the hand: 5 fingers protruding from a base, and get the proportions wrong, while the topology is credible.

AI will rebuild the image of a hand from the 2d appearance of a hand: a variable number of flesh colored, parallel stripes, and improvise from that.

While both can get it wrong, the errors are not similar.

[–] autotldr@lemmings.world 4 points 2 years ago

This is the best summary I could come up with:

On Wednesday, OpenAI announced DALL-E 3, the latest version of its AI image synthesis model that features full integration with ChatGPT.

DALL-E 3 renders images by closely following complex descriptions and handling in-image text generation (such as labels and signs), which challenged earlier models.

Judging by the samples provided by OpenAI on its promotional blog, DALL-E 3 appears to be a radically more capable image synthesis model than anything else available in terms of following prompts.

While OpenAI's examples have been cherry-picked for their effectiveness, they appear to follow the prompt instructions faithfully and convincingly render objects with minimal deformations.

DALL-E 3 also appears to handle text within images in a way that its predecessor couldn't (some competing models like Stable Diffusion XL and DeepFloyd are getting better at it).

Microsoft's Bing Chat AI assistant, also built on technology from OpenAI, has been able to generate images in conversation since March.

The original article contains 420 words, the summary contains 151 words. Saved 64%. I'm a bot and I'm open source!
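The "Saved 64%" figure is consistent with the word counts quoted above. A minimal sketch of the arithmetic, assuming the percentage is simply one minus the ratio of summary words to original words, rounded to a whole number (the function name is made up for illustration and is not taken from the bot's source):

```python
def percent_saved(original_words: int, summary_words: int) -> int:
    """Share of words removed by the summary, as a whole-number percentage."""
    if original_words <= 0:
        raise ValueError("original_words must be positive")
    # Assumed formula: fraction of the original that was cut, times 100.
    return round(100 * (1 - summary_words / original_words))


# Word counts taken from the bot's own footer: 420 original, 151 in the summary.
print(percent_saved(420, 151))
```

In other words, 269 of the 420 words were cut, which is about 64.05% before rounding.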

[–] avater@lemmy.world 1 points 2 years ago (1 children)

always wanted to try it out but no way I'm giving my phone number to them, although I understand their approach to reduce bot accounts.

[–] Dkarma@lemmy.world 3 points 2 years ago

Try Stable Diffusion

[–] echodot@feddit.uk 1 points 2 years ago

What a time to be alive!

[–] Voyajer@lemmy.world 0 points 2 years ago

Nice username!