The model can change nearly anything.
Take a look at this experiment with optimizing Neural Atlases through Stable Diffusion. Omer Bar Tal's video shows that you can change almost anything, and the result looks quite nice. The experiment builds on "Layered Neural Atlases for Consistent Video Editing", a paper that presents a method for decomposing an input video into a set of layered 2D atlases.
"For each pixel in the video, our method estimates its corresponding 2D coordinate in each of the atlases, giving us a consistent parameterization of the video, along with an associated alpha (opacity) value."
Edits applied to a 2D atlas are mapped back to the original video frames preserving occlusions, deformation, and other scene effects such as shadows and reflections.
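The mechanics can be sketched in a few lines: each video pixel carries a 2D atlas coordinate and an alpha value, so an edit painted onto an atlas is sampled back into every frame and alpha-composited. This is only a toy illustration with nearest-neighbour sampling, not the authors' code; all function and variable names here are our own.

```python
import numpy as np

def sample_atlas(atlas, uv):
    """Nearest-neighbour lookup of atlas colors at coordinates (u, v) in [0, 1]^2."""
    h, w, _ = atlas.shape
    ys = np.clip((uv[..., 1] * (h - 1)).round().astype(int), 0, h - 1)
    xs = np.clip((uv[..., 0] * (w - 1)).round().astype(int), 0, w - 1)
    return atlas[ys, xs]

def reconstruct_frame(fg_atlas, bg_atlas, fg_uv, bg_uv, alpha):
    """Composite the foreground atlas layer over the background layer, per pixel."""
    fg = sample_atlas(fg_atlas, fg_uv)
    bg = sample_atlas(bg_atlas, bg_uv)
    return alpha[..., None] * fg + (1 - alpha[..., None]) * bg

# Toy example: a uniformly red foreground atlas over a blue background.
fg_atlas = np.tile([1.0, 0.0, 0.0], (4, 4, 1))
bg_atlas = np.tile([0.0, 0.0, 1.0], (4, 4, 1))
uv = np.random.rand(2, 2, 2)                    # per-pixel atlas coordinates
alpha = np.array([[1.0, 0.0], [0.5, 1.0]])      # per-pixel foreground opacity
frame = reconstruct_frame(fg_atlas, bg_atlas, uv, uv, alpha)
```

Because the same atlas is sampled for every frame, recoloring one atlas pixel propagates consistently through the whole video; the real method uses learned MLP mappings and bilinear sampling rather than this lookup table.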
Omer Bar Tal has worked on a similar project called "Text2LIVE: Text-Driven Layered Image and Video Editing", where the researchers introduced a method for zero-shot, text-driven appearance manipulation in natural images and videos.
The method takes an input image or video and a target text prompt, then edits the appearance of existing objects or augments the scene with new visual effects "in a semantically meaningful manner." The key idea is to generate an edit layer that is composited over the original input; this constrains the generation process and maintains high fidelity to the original input through novel text-driven losses applied directly to the edit layer.
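The edit-layer idea reduces to a standard "over" composite: the model predicts an RGBA layer, and only that layer is blended onto the untouched input, which is what keeps the original content intact. A minimal sketch, assuming a predicted color map and opacity map (the names are ours, not from the paper):

```python
import numpy as np

def composite_edit_layer(image, edit_rgb, edit_alpha):
    """Blend a predicted RGBA edit layer over the input image, pixel-wise.

    Wherever edit_alpha is 0 the original pixel passes through unchanged,
    so the edit can only act where the model explicitly opts in.
    """
    a = edit_alpha[..., None]
    return a * edit_rgb + (1 - a) * image

# Toy example: a half-opaque green edit layer over a black input image.
image = np.zeros((2, 2, 3))
edit_rgb = np.tile([0.0, 1.0, 0.0], (2, 2, 1))
edit_alpha = np.full((2, 2), 0.5)
result = composite_edit_layer(image, edit_rgb, edit_alpha)
```

In Text2LIVE the text-driven losses are applied to `edit_rgb` and `edit_alpha` directly (rather than to the final composite alone), which is how the generation stays constrained to an additive layer.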