Goading companies into improving image and video generation by showing them how terrible they are is only going to make them go faster, and personally I’d like to enjoy the few moments I have left thinking that maybe something I watch is real.
It will evolve into people hooked into entertainment suits most of the day, where no one has actual relationships or does anything of consequence, like some sad mashup of Wall-E and Ready Player One.
If we’re lucky, some will want to meatspace with augmented reality.
Maybe we’ll get really nice holovisions, where we can chat with virtual celebrities.
Who needs that?
We’re already having to shoot up weight-loss drugs because we binge watch streaming all the time. We’ve all given up, assuming AI will do everything. What good will come from having better technology when technology is already doing harm?
There are ways past this, from religion and Amish-style cultural approaches, to legal prohibition of making and selling and using it, to dictatorial control of the companies which could make it, to individuals being personally immune, to paying people money if they don't use it. Like there are people who avoid alcohol, opioids, heroin, all other wireheading-style drugs and experiences that exist already, and people who do exercise and stay thin in a world of fast food and cars.
A great filter needs to apply to every civilisation imaginable, no exceptions, nerfing billions of species before they get to a higher Kardashev scale, not just something that "could happen" or the latest "Dunning-Kruger" mic-drop in every thread. In the 1960s "the great filter is nuclear war", in 1890 "the great filter is heroin", in 1918 "the great filter is world war, we are destined to destroy ourselves", in 2015 "the great filter is climate change, our emissions will end us like bacteria in a petri dish", in antiquity "the great filter is the punishment for crossing the will of the Gods".
It's got to be something you cannot get around even if you try really really hard and get very very lucky, because there are ~200,000,000,000 stars in the Milky Way and with those numbers there will be some species which lucks its way past almost any candidate, spreads out and in a mere 100k years is all over this galaxy leaving rocket trails and explosion signatures and radio signals and terraforming signs and megastructures.
I am looking at ways to approximate Gaussian splats without having to reinvent the wheel, but I'm a bit out of my depth since I haven't been paying a lot of attention to those in general.
I'm quite delighted that the GIF banding artefacts make it look like the photo of a fire is flickering, and also highly impressed that the AI was able to recognize the fire as a photo within a photo and keep it in 2D.
I just refactored the rendering and resampling approach. Took me a few tries to figure out how to remove the banding masks from the layers, but with more stacked layers and a bit of GPT-foo to figure out the API it sort of works now (updated the GIF)
Keep in mind that this is not Gaussian splat rendering but just a hacked approximation--on my NVIDIA machine that looks way smoother.
Can someone ELI5 what this does? I read the abstract and tried to find differences in the provided examples, but I don't understand (and don't see) what the "photorealistic" part is.
Imagine history documentaries where they take an old photo and free objects from the background and move them round giving the illusion of parallax movement. This software does that in less than a second, creating a 3D model that can be accurately moved (or the camera for that matter) in your video editor. It's not new, but this one is fast and "sharp".
Until your comment I didn't realise I'd also read it wrong (despite getting the gist of it). Attempted rephrase of the original sentence:
Imagine history documentaries where they take an old photo, free objects from the background, and then move them round to give the illusion of parallax.
Apple does something similar right now in their photos app, generating spatial views from 2d photos, where parallax is visible by moving your phone. This paper’s technique seems to produce them faster. They also use this same tech in their Vision Pro headset to generate unique views per eye, likewise on spatialized images from Photos.
Takes a 2D image and allows you to simulate moving the angle of the camera with correct-ish parallax effect and proper subject isolation (seems to be able to handle multiple subjects in the same scene as well)
I guess this is what they use for the portrait mode effects.
It turns a single photo into a rough 3D scene so you can slightly move the camera and see new, realistic views. "Photorealistic" means it preserves real textures and lighting instead of a flat depth effect. Similar behavior can be seen with Apple's Spatial Scene feature in the Photos app: https://files.catbox.moe/93w7rw.mov
Basically depth estimation to split the scene into various planes, then inpainting to fill in the obscured parts of those planes, and then moving them freely to allow for parallax. Think of 2D side-scrolling games that use several background layers at different depths to give the illusion of motion and depth.
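Something like this, as a toy image-space sketch (definitely not the paper's actual pipeline, and it skips the inpainting that would fill the revealed gaps): bin pixels into depth layers, shift near layers more than far ones, composite back to front like a side-scroller.

    import numpy as np

    def parallax_shift(image, depth, camera_shift_px, num_layers=4):
        """image: HxWx3 floats, depth: HxW (larger = farther away)."""
        h, w, _ = image.shape
        out = np.zeros_like(image)
        # Depth-layer boundaries from quantiles of the depth map.
        edges = np.quantile(depth, np.linspace(0, 1, num_layers + 1))
        for i in range(num_layers - 1, -1, -1):            # farthest layer first
            mask = (depth >= edges[i]) & (depth <= edges[i + 1])
            shift = int(round(camera_shift_px / (i + 1)))  # near layers move more
            layer_mask = np.roll(mask, shift, axis=1)
            out[layer_mask] = np.roll(image, shift, axis=1)[layer_mask]
        return out

    # Toy usage: random texture with a left-to-right depth ramp.
    img = np.random.rand(120, 160, 3)
    dep = np.tile(np.linspace(1.0, 10.0, 160), (120, 1))
    novel = parallax_shift(img, dep, camera_shift_px=8)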
From a single picture it infers a hidden 3D representation, from which you can produce photorealistic images from slightly different vantage points (novel views).
I just want to emphasize that this is not a NeRF where the model magically produces an image from an angle and then you ask "ok but how did you get this?" and it throws up its hands and says "I dunno, I ran some math and I got this image" :D.
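To make the contrast concrete: with an explicit 3D representation, a novel view is basically a re-projection of the points through a new camera pose. A rough sketch (the point format, crude z-buffer, and names are mine, not the paper's renderer):

    import numpy as np

    def render_points(points, colors, R, t, fx, fy, cx, cy, h, w):
        """points: Nx3 world coords, colors: Nx3 in [0,1]; returns an HxWx3 image."""
        cam = points @ R.T + t                      # world -> camera coordinates
        z = cam[:, 2]
        valid = z > 1e-6                            # keep points in front of the camera
        u = (fx * cam[valid, 0] / z[valid] + cx).astype(int)
        v = (fy * cam[valid, 1] / z[valid] + cy).astype(int)
        img = np.zeros((h, w, 3))
        depth = np.full((h, w), np.inf)
        for ui, vi, zi, ci in zip(u, v, z[valid], colors[valid]):
            if 0 <= ui < w and 0 <= vi < h and zi < depth[vi, ui]:
                depth[vi, ui] = zi                  # nearest point wins
                img[vi, ui] = ci
        return img

The point being: you can always ask where a given pixel came from, because it maps back to specific 3D points, unlike an implicit NeRF.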
Black Mirror episode portraying what this could do: https://youtu.be/XJIq_Dy--VA?t=14. If Apple ran SHARP on this photo and compared it to the show, that would be incredible.
Agreed, this is a terrible presentation. The paper abstract is bordering on word salad, the demo images are meaningless and don’t show any clear difference to the previous SotA, the introduction talks about “nearby” views while the images appear to show zooming in, etc.
I note the lack of human portraits in the example cases.
My experience with all these solutions to date (including whatever Apple are currently using) is that when viewed stereoscopically the people end up looking like 2D cutouts against the background.
I haven't seen this particular model in use stereoscopically so I can't comment as to its effectiveness, but the lack of a human face in the example set is likely a bit of a tell.
Granted they do call it "Monocular View Synthesis", but I'm unclear as to what its accuracy or real-world use would be if you can't combine 2 views to form a convincing stereo pair.
I'm not sure how the depth estimation alone translates into the view synthesis, but the current on-device implementation is definitely not convincing for literally any portrait photograph I have seen.
True stereoscopic captures are convincing statically, but don't provide the parallax.
Good monocular depth estimation is crucial if you want to make a 3D representation from a single image. Ordinarily you have images from several camera poses and can create the Gaussian splats using triangulation; with a single image you have to guess the z position for them.
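For anyone wanting the mechanics: given a predicted depth map and pinhole intrinsics, placing per-pixel points (or Gaussians) in 3D is just back-projection. The intrinsics below are made up for the example; a real pipeline would estimate them.

    import numpy as np

    def backproject(depth, fx, fy, cx, cy):
        """depth: HxW metric depth map -> HxWx3 points in camera coordinates."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        x = (u - cx) / fx * depth
        y = (v - cy) / fy * depth
        return np.stack([x, y, depth], axis=-1)

    # Toy example: a flat plane 2 m from the camera.
    pts = backproject(np.full((480, 640), 2.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)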
Apple's Spatial Scene in the Photos app shows similar behavior, turning a single photo into a small 3D scene that you can view by tilting the phone. Demo here: https://files.catbox.moe/93w7rw.mov
I could not find any mention of it, but does this use generative AI? I can't imagine it being able to accomplish anything like this without using a large graphical model in the back.
In Chapter D.7 they describe: "The complex reflection in water is interpreted by the network as a distant mountain, therefore the water surface is broken."
This is really interesting to me because the model would have to encode the reflection as both the depth of the reflecting surface (for texture, scattering etc) as well as the "real depth" of the reflected object. The examples in Figure 11 and 12 already look amazing.
This is incredibly cool. It's interesting how it fails in the regions that need to be inpainted. SVC seems to do that better than all the rest, though it's not anywhere close to the photorealism of this model.
Is there a similar flow to transform a video/photo/NeRF of a scene into a tighter, minimal-polygon approximation of it? The reason I ask is that it would make some things really cool. To make my baby monitor mount I had to break out the calipers and measure the pins and this and that, but if I could take a couple of photos and iterate in software, that would be sick.
You'd still need at least one real measurement: this might get proportions right if the background can be clearly separated, but the absolute size of an object can be worlds apart.
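In other words, a single caliper measurement pins down the otherwise ambiguous global scale, roughly like this (names and file are made up for the example):

    import numpy as np

    measured_pin_mm = 12.0            # caliper measurement of one known feature
    model_pin_units = 0.47            # length of the same feature in the reconstruction
    scale = measured_pin_mm / model_pin_units

    vertices = np.load("mount_vertices.npy")   # hypothetical Nx3 vertex array
    vertices_mm = vertices * scale              # now in millimetres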
Have a look through the rest of the images. TMPI has some pretty obvious shortcomings in a lot of them.
1. Sky looks jank
2. Blurry/warped behind the horse
3. The head seems to move a lot more than the body. You could argue that this one is desirable
4. Bit of warping and ghosting around the edges of the flowers. Particularly noticeable towards the top of the image.
5. Very minor but the flowers move as if they aren't attached to the wall
I'm confused, does it actually generate environments from photographs? I can't view the galleries since I didn't sign up for emails but all of the gallery thumbnails are AI, not photos.
Works great. The model file is 2.8 GB; on an M2 rendering took a few seconds. The result is a Gaussian .ply file, but the repo implementation requires a CUDA card to render video, so I used one of the WebGL live renderers linked from here: https://github.com/scier/MetalSplatter?tab=readme-ov-file#re...
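If you just want to sanity-check the .ply on a non-CUDA machine, something like this works with the plyfile package (assuming the output follows the usual 3D Gaussian splatting property names like x/y/z, opacity, scale_*, rot_*; I haven't verified exactly which fields this repo writes):

    from plyfile import PlyData
    import numpy as np

    ply = PlyData.read("output.ply")          # hypothetical output path
    verts = ply["vertex"]
    print(f"{verts.count} gaussians, properties: {[p.name for p in verts.properties]}")

    # Scene bounding box from the splat centres.
    xyz = np.stack([verts["x"], verts["y"], verts["z"]], axis=-1)
    print("scene bounds:", xyz.min(axis=0), xyz.max(axis=0))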
That is really impressive. However, it was a bit confusing at first because in the koala example at the top, the zoomed in area is only slightly bigger than the source area. I wonder why they didn't make it 2-3x as big in both axes like they did with the others.
I understand AI for reasoning, knowledge, etc. I haven't figured out why anyone wants to spend money on this visual and video stuff. It just seems like a bad idea.
Simulation. It takes a lot of effort today to bring up simulations in various fields. 3D programming is very nontrivial and asset development is extremely expensive. If I can take a photo of a workspace and use it to generate a 3D scene, I can then use it in simulations to test ideas out. This is already particularly useful in robotics and industrial automation.
I don't see any examples of 3D scene information usable for simulation. If you want to simulate something hitting a table, you need the whole table (surface) in space, not just some spatial illusion effect extrapolated from an image of a table. I also think modelling the 3D objects for simulation is the least expensive part of a simulation... the simulation is the expensive thing.
I doubt this will be useful for robotics or industrial automation, where you need an actual spatial, or functional understanding of the object/environment.
With research like this you need to start somewhere. The fact we can get 3d information helps. There are people looking into making splats capture collision information [1].
I have worked on simulation and in my day job do a lot of simulation. While physics is often hard and expensive, you only need to write the code once.
Assets? You need to commission 3D artists and then spend hours wrangling file formats. It's extremely tedious. If we could take a photo and extract meshes, I'm sure we'd have a much easier time.
Photo apps on phones (can you still call them cameras?) already have a lot of "AI" to enhance photos and videos taken. Some of it is technological necessity, since you're capturing something through a tiny hole, a lot of it is sexying it up to appeal to people, because hey, people would prefer a cinema-quality depiction of their memories rather than the reality...
This specific paper is pretty different to the kind of photo/video generation that has been hyped up in recent years. In this case, I think this might be what they're using for the iOS spatial wallpaper feature, which is arguably useless but is definitely an aesthetic differentiator to Android devices. So, it's indirectly making money.
Do people not spend on entertainment? Commercials? It's probably less of a bad idea than AI for knowledge. AI giving a bad visual has fewer negatives than giving the wrong knowledge and leading to the wrong decision.
https://en.wikipedia.org/wiki/Great_Filter
Maybe when NASA, ESA, SpaceX, Roscosmos, CNSA, IRSA all collapse because of this effect… look how many countries have a space agency! https://en.wikipedia.org/wiki/List_of_government_space_agenc...
https://m.youtube.com/watch?v=DgPaCWJL7XI&t=1s&pp=2AEBkAIB0g...
https://www.youtube.com/watch?v=X0oSKFUnEXc
https://github.com/rcarmo/ml-sharp (has a little demo GIF)
Gaussian splashing is pretty awesome.
Even using commas, if you leave the ambiguous “free” I suggest you prefix “objects” with “the” or “any”.
(I am oversimplifying).
Or if you prefer Blade Runner: https://youtu.be/qHepKd38pr0?t=107
https://github.com/apple/ml-depth-pro
https://learnopencv.com/depth-pro-monocular-metric-depth/
https://github.com/apple/ml-sharp#rendering-trajectories-cud...
CUDA is needed to render the side-scrolling video, but there are many ways to do other things with the result.
Photoshop's content-aware fill could do this equally well or better many years ago.
Not only do many VR and AR systems acquire stereo, we have historical collections of stereo views in many libraries and museums.
Without that it's hard to tell how cherry-picked the NVS video samples are.
EDIT: I did it myself, if anyone wants to check out the result (caveat, n=1): https://github.com/avaer/ml-sharp-example
Long tail problems indeed.
[0] https://www.spaitial.ai/
It’s a website that collects people’s email addresses
Why no landscape or underwater scenes or something in space, etc.?
I believe this company is doing image (or text) -> off-the-shelf image model to generate more views -> some variant of Gaussian splatting.
So they aren't really "generating" the world as one might imagine.
[1] https://trianglesplatting.github.io/