What I Learned by Spending $15 in DALL-E 2 Credits to Create This AI Image | AINOW (AI-specialized news media)


The author, Joy Zhang, who lives in Australia, is the founder of Coder One, a startup that runs AI competitions. In his Medium post, "What I Learned After Spending $15 DALL-E 2 Credits To Make This AI Image," he summarizes DALL-E 2's weaknesses when generating images.
Zhang, who got access to DALL-E 2 at the end of July 2022, found that the model has several weaknesses while generating images on the theme of "a llama dunking a basketball." These weaknesses can be summarized as follows:

Weaknesses when DALL-E 2 generates images

  • Poor framing: for images that contain multiple objects, it sometimes fails to draw the positional relationships between them correctly.
  • Failure to draw animal faces: a restriction prevents photorealistic rendering of human faces, and this restriction appears to apply to animal faces as well.
  • Difficulty specifying angles and shots: even when you enter a keyword such as "long shot" or "extreme long shot," an image with the specified framing is not easily generated.
  • Misspelled words: OpenAI officially acknowledges misspelled text as a limitation.
  • Difficulty with complex prompts: for complex prompt sentences, images are sometimes not generated as expected.

After listing these weaknesses of DALL-E 2, Zhang advises that trial and error with prompts is essential to get the desired image out of the model, so a minimum of 15 credits (or 15 outputs) should be expected. At present, "prompt engineering," the technique of getting the model to output exactly what you want, is not sufficiently mature, and if the desired image is complex, it cannot be obtained easily.

In the translated article below, the original English prompt sentences were given alongside their translations. Entering the original English text into DALL-E 2 reproduces the outputs shown.

The article below was translated after contacting Joy Zhang directly and obtaining permission. The content of the translated article reflects the author's own views; it does not represent any particular country, region, organization, or group, nor the views of the translator or the AINOW editorial department.
In preparing the translation, supplementary wording and clarifications of context have been added to make it read naturally as Japanese text.

A llama playing basketball, generated by the author using DALL-E 2.

Yes, the image above is of a llama dunking a basketball. Here is a summary of the process, limitations, and lessons learned while experimenting with the closed beta of DALL-E 2.

Ever since I first saw the AI-generated image of a "Shiba Inu bento," I've been dying to try DALL-E 2.

Wow, this is disruptive technology.

For those of you who don't know, DALL-E 2 is a system created by OpenAI that can generate original images from text.

It's currently in closed beta; I joined the waitlist in early May and got access at the end of July. During the beta, users receive credits (50 free credits in the first month, 15 credits per month thereafter), consume 1 credit per generation, and get 3-4 images per generation. You can also purchase 115 credits for US$15.
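To put those numbers in perspective, here is a minimal cost sketch based only on the beta-era figures quoted above (1 credit per generation, 3-4 images per generation, and 115 purchased credits for US$15):

```python
# Minimal cost sketch using only the beta-era figures quoted above.
PRICE_USD = 15.0        # price of one purchased credit pack
CREDITS_PER_PACK = 115  # credits in that pack
IMAGES_PER_CREDIT = 4   # each generation costs 1 credit and returns 3-4 images

cost_per_generation = PRICE_USD / CREDITS_PER_PACK        # ~$0.13 per prompt
cost_per_image = cost_per_generation / IMAGES_PER_CREDIT  # ~$0.03 per image, best case

print(f"Cost per generation: ${cost_per_generation:.2f}")
print(f"Cost per image (best case): ${cost_per_image:.3f}")
print(f"100 generations: about ${100 * cost_per_generation:.0f}")  # matches the ~$13 total later in the article
```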

PS: Readers who can't wait to try DALL-E 2 can try DALL-E mini for free. However, the images produced by DALL-E mini are generally of low quality (hence the many DALL-E memes), and it takes about 60 seconds per prompt to generate an image (DALL-E 2 takes about 5 seconds).

(*Translation Note 1) In an article published on the official OpenAI blog on July 20, 2022, OpenAI announced that the DALL-E 2 beta had been released to the public.
(*Translation Note 2) DALL-E mini, published by Hugging Face, which operates an AI community, has been renamed "craiyon" to avoid confusion with the original DALL-E developed by OpenAI.

I'm sure you've seen a curated selection of images online showing what DALL-E 2 can do (with the right, creative prompts). In this article, I'll give you a candid, behind-the-scenes look at what it takes to create an image from scratch. The theme is "a llama playing basketball." If you are thinking of trying DALL-E 2, or want to understand what it can do, read on.

Starting point

Knowing what prompts to give DALL-E 2 is both an art and a science. For example, the results for "llama playing basketball" were as follows:

Image generated by the author by typing “llama playing basketball” into the DALL-E 2 prompt

Why is the DALL-E 2 leaning toward generating cartoon images for this prompt? I suspect this has something to do with the fact that the model didn’t see actual images of llamas playing basketball during its training.

Taking it a step further, here is the result of adding the keyword "realistic photo of":

Image generated by the author by typing “realistic photo of llama playing basketball” into the DALL-E 2 prompt

The llamas look more realistic, but the whole image is starting to look like a failed Photoshop job. In this case, DALL-E 2 clearly needs some help to create a cohesive scene.

Prompt engineering is the art of realizing exactly what the user wants

In the context of DALL-E, prompt engineering refers to the process of designing prompts to achieve the desired results.

The DALL-E 2 Prompt Book is a great resource for this: it provides a detailed list of prompt inspirations, using photographic and artistic terms as keywords.

Why is something like this needed? Because it is difficult to get output from DALL-E 2 that is usable (for business and so on), especially if you don't know what the model is capable of. To save users the time and money of coming up with prompts themselves, a startup was even born that created a marketplace where a single prompt sells for $1.99 (*3).

(*Translation Note 3) According to an article published by TechCrunch on July 30, 2022, a marketplace called "PromptBase" began operating in June 2022. 20% of each transaction goes to PromptBase, which launched the service. However, given the abundance of free material on prompt engineering, the service has also drawn criticism.
In addition, all prompts traded on PromptBase are vetted, so there is no concern that images violating common decency will be output.

One of my personal favorite discoveries from experimenting with prompt engineering was "dramatic backlighting."

The image we're talking about! Image generated by the author by typing "Film still of a llama dunking a basketball, low angle, extreme long shot, indoors, dramatic backlighting" into the DALL-E 2 prompt.

What matters in prompt engineering is telling DALL-E 2 exactly what you want it to output. Apparently, the context given by the prompt does not make it clear (as you can see in the image above) whether the llama should be dressed. However, with "llama wearing a jersey," the model successfully produces the fantastic scene below.

A llama dunking a basketball, this time wearing a jersey. Image generated by the author by typing "Film still of an alpaca wearing a jersey, dunking a basketball, low angle, long shot, indoors, dramatic backlighting, high detail" into the DALL-E 2 prompt.

It doesn't stop there. To add drama to the image and make this llama really fly, special phrases like "dunking a basketball" or "action shot of…" are needed. My favorite of these is "…llama in a jersey dunking a basketball like Michael Jordan."

An image of Michael Jordan in his llama days, according to DALL-E 2. Image generated by the author by typing "Film still of a llama in a jersey dunking a basketball like Michael Jordan, low angle, shot from below, tilted frame, 35°, Dutch angle, extreme long shot, high detail, indoors, dramatic backlighting" into the DALL-E 2 prompt.

Tip: DALL-E 2 only saves the last 50 generations in the history tab. Save your favorite images!
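The author did all of this iteration in the DALL-E 2 web interface during the closed beta. Purely as an illustration of the same subject-plus-modifiers workflow, here is a rough sketch using OpenAI's later public Images API; the `openai` Python package, the `dall-e-2` model name, and the API call itself are assumptions for illustration, not something described in the article.

```python
# Sketch only: the article's experiments used the DALL-E 2 web UI, not this API.
# Shown here is the "base subject + layered modifiers" prompting pattern via the
# public OpenAI Images API that was released later (an assumption on my part).
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

client = OpenAI()

subject = "a llama dunking a basketball"
modifiers = [
    "low angle",              # camera angle
    "extreme long shot",      # framing
    "indoors",
    "dramatic backlighting",  # the author's favorite discovery
    "high detail",
]
prompt = f"Film still of {subject}, " + ", ".join(modifiers)

response = client.images.generate(
    model="dall-e-2",
    prompt=prompt,
    n=4,                # roughly the 3-4 images per credit seen in the beta
    size="1024x1024",
)
for image in response.data:
    print(image.url)    # download and save your favorites; don't rely on the history tab
```

Each such call corresponds to roughly one credit's worth of output under the beta pricing described earlier.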

As you may have noticed, the composition produced by DALL-E 2 is not good

From the context of the phrase "dunking a basketball," you would think it would be obvious where the llama, the ball, and the hoop should be relative to one another. More often than not, however, the llama dunks in the wrong direction, or the ball is placed somewhere it has no chance of going in. Even when the prompt clearly states all the elements that should be generated, DALL-E 2 doesn't really "understand" the positional relationships between them. This article delves deeper into the topic (*Translation Note 4).

Image generated by the author by typing "Film still of a llama in a jersey dunking a basketball like Michael Jordan, low angle, shot from below, tilted frame, 35°, Dutch angle, extreme long shot, high detail, indoors, dramatic backlighting" into the DALL-E 2 prompt.

(*Translation Note 4) According to an article published on August 4, 2022 by the AI-focused media outlet Unite.AI, a research team at Harvard University published a paper on July 29, 2022, "Testing Relational Understanding in Text-Guided Image Generation," which discusses deficiencies in the positional relationships between objects in images produced by DALL-E 2.
According to the paper, for images based on realistic prompts, 87% of 169 human evaluators judged the depiction to be correct, whereas for unrealistic prompts such as "a monkey touching an iguana," only 11% were rated correct.
As a way to improve DALL-E 2's understanding of positional relationships, the paper proposes incorporating CLIPort, an AI model jointly researched by the University of Washington and NVIDIA. The model was developed for robot control and implements spatial understanding in addition to image recognition.

Another flaw caused by DALL-E 2 not "understanding" the scene is the occasional mixing of textures. In the image below, the net is made of fur (with a little thought, a human would realize how morbid that scene is).

Image generated by the author by typing "Expressive photo of a llama wearing a jersey dunking a basketball like Michael Jordan, low angle, extreme wide shot, indoors, dramatic backlighting, high detail" into the DALL-E 2 prompt.

DALL-E 2 struggles to generate realistic faces

According to some reports, the struggle to generate realistic faces is a deliberate measure to prevent deepfakes (*5). You would think this measure would only apply to humans, but apparently it also applies to llamas.

Some of the realistic llama face generation failures were downright creepy.

Image generated by the author by typing "Dramatic photo of a llama wearing a jersey dunking a basketball like Michael Jordan, low angle, wide shot, indoors, dramatic backlighting, high detail" into the DALL-E 2 prompt.

(*Translation Note 5) An article released on July 14, 2022 by IEEE Spectrum, the media outlet operated by the IEEE, discusses the limits of DALL-E 2 from multiple angles. The article reports that the model is not good at drawing multiple people: for example, an image of a single female astronaut is generated just fine, but an image of seven engineers has distorted faces.

Regarding the restriction that prevents DALL-E 2 from generating photorealistic human faces, the OpenAI official blog article mentioned earlier states the following:

Mitigating misuse: To minimize the risk of DALL-E being misused, we reject uploads of images containing realistic faces and attempts to create likenesses of public figures, including celebrities and prominent politicians. We also use advanced techniques to prevent photorealistic generation of real individuals' faces.

Other DALL-E 2 limitations

Below are some other minor issues I’ve encountered.

Angles and shots are interpreted loosely

Even with keywords such as "long shot" and "extreme long shot," it was hard to get an image that fit the entire llama in the frame.

In some cases, framing was completely ignored.

Image generated by the author by typing "Dramatic film still of a llama wearing a jersey dunking a basketball, low angle, shot from below, tilted frame, 35°, Dutch angle, extreme long shot, indoors, dramatic backlighting, high detail" into the DALL-E 2 prompt.

DALL-E 2 can’t spell words

Given that DALL-E 2 struggles to "understand" the positional relationships between elements in an image, its inability to spell words correctly doesn't seem too surprising (*6). However, in the right context, it can generate fully formed characters.

Image generated by the author by typing "Film still of a fluffy llama in a jersey dunking a basketball like Michael Jordan, low angle, shot from below, tilted frame, 35°, Dutch angle, extreme long shot, high detail, indoors, dramatic backlighting" into the DALL-E 2 prompt.

DALL-E 2 can be capricious with complex or poorly worded prompts

Depending on which keywords you add and how you phrase them, you may also get completely different results than you expected.

For example, in the case below, the real subject of the prompt (a llama in a jersey) was completely ignored.

It's certainly an impressive dunk. Image generated by the author by typing "low angle, long shot, indoors, dramatic backlighting, professional photo of a llama wearing a jersey, dunking a basketball" into the DALL-E 2 prompt.

Even adding the word "fluffy" dramatically degraded the results; in several cases DALL-E 2 looked downright broken.

Image generated by the author by typing "Film still of a fluffy llama in a jersey dunking a basketball like Michael Jordan, high detail, indoors, dramatic backlighting" into the DALL-E 2 prompt. (Image deliberately blurred to hide the face.)

When working with DALL-E 2, it is important to be specific about what you are looking for, without overcrowding the prompt or adding redundant words.

DALL-E 2’s ability to transfer styles is impressive

DALL-E 2's style transfer is well worth trying.

Once you have decided on a subject that will serve as a keyword, you can generate images in a surprising number of art styles.
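Before the examples, here is a small sketch of the idea: keep the subject fixed and swap in the style keywords used below. The prompt strings mirror the article's examples; actually rendering them would go through the same assumed Images API as in the earlier sketch, at one credit per prompt.

```python
# Fixed subject, interchangeable style keywords -- the pattern behind the
# style-transfer examples that follow. This only prints the prompts; generating
# them would use the (assumed) Images API shown earlier, one credit each.
subject = "a llama in a jersey dunking a basketball like Michael Jordan"
styles = [
    "abstract painting",
    "vaporwave, vibrant sunset",
    "epic, digital art",
    "screenshot from a Miyazaki anime movie",
]

for style in styles:
    print(f"{subject}, {style}")
```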

“Abstract style…”

Image generated by the author by typing "Abstract painting of a llama in a jersey dunking a basketball like Michael Jordan, shot from below, tilted frame, 35°, Dutch angle, extreme long shot, high detail, dramatic backlighting, indoors. In the background is a stadium full of people" into the DALL-E 2 prompt.

“Vaporwave”

Image generated by the author by typing "Film still of a llama in a jersey dunking a basketball like Michael Jordan, dramatic backlighting, vibrant sunset, vaporwave" into the DALL-E 2 prompt.

“Digital Art”

Image generated by the author by typing "llama in a jersey dunking a basketball like Michael Jordan, shot from below, tilted frame, 35°, Dutch angle, extreme long shot, high detail, dramatic backlighting, epic, digital art" into the DALL-E 2 prompt.

“Screenshots from the Miyazaki animated film”

Image generated by the author by typing "Llama in a jersey dunking a basketball like Michael Jordan, screenshots from the Miyazaki anime movie" into the DALL-E 2 prompt. Thanks to this article for the tip (*7).

(*Translation Note 7) An article posted on May 2, 2022 on the opinion site LessWrong, "What DALL-E 2 can and cannot do," points out that the model is good at generating images related to various pop-culture properties. For example, it can generate images of Marvel heroes and Disney princesses, as shown below. The "screenshots from the Miyazaki anime movie" in this article is another example of a pop-culture-related image.

Image generated by prompting “art nouveau stained glass window depicting Marvel’s Captain America”

Image generated by prompting “Elsa from Frozen, cross-stitched sampler”

Final impressions

After investing more than 100 credits (equivalent to about 13 US dollars) in trial and error, the image below was finished.

The image isn’t perfect, but the DALL-E 2 fulfilled about 80% of my expectations.

I spent most of those credits on getting the style, the face, and the composition right.

OpenAI's DALL-E announcement includes the following description:

"…users have full usage rights to commercialize images created with DALL-E, including the rights to reproduce, sell, and merchandise them." (*Translation Note 8)

(*Translation Note 8) The commercialization rights for images generated with DALL-E 2 are described in the aforementioned OpenAI official blog article announcing the public release of the beta.

This provision is likely to sway many users.

For content creators, DALL-E 2 will be most useful for creating simple illustrations, photos and graphics for blogs and websites. My plan is to use it instead of Unsplash to create unique blog cover images.

For those of you who want to try DALL-E 2 for yourself, here are a few things to know before you start:

  • Check out the DALL-E 2 Prompt Book! (There is also a fan-made prompt engineering sheet.)
  • To get what you want, be prepared for trial and error. 15 free credits may sound like a lot, but it really isn't. Assume you will use at least 15 credits to generate one usable image. DALL-E 2 is by no means cheap.
  • Don’t forget to save your favorite images.

・・・

Thank you for reading. We look forward to your impressions and opinions after you try DALL-E 2.



Original article:
"I spent $15 in DALL·E 2 credits creating this AI image, and here's what I learned"

Author:
Joy Zhang

Translation:
Yuki Yoshimoto (freelance writer, JDLA Deep Learning for GENERAL 2019 #1)

Editing:
Ozaken
