Temporal aware text-driven style transfer for motion-based video transformation

dc.contributor.author: De Silva, A.D.D.S.
dc.contributor.author: Siyambalapitiya, R.
dc.date.accessioned: 2025-11-06T03:57:16Z
dc.date.available: 2025-11-06T03:57:16Z
dc.date.issued: 2025-11-07
dc.description.abstract: Video stylisation plays a crucial role in creative domains such as virtual reality, game development, content creation, and filmmaking. However, existing video style transfer methods often produce flickering and popping due to inconsistent frame stylisation. They typically rely on reference style images or computationally heavy modules such as neural atlas layers, making them unsuitable for real-time use. This research introduces a deep learning-based framework for text-guided style transfer on single or multiple objects in videos, focusing on temporal consistency and low computational cost. Two approaches are proposed, both combining You Only Look Once version 8 (YOLOv8) for object detection, Deep Simple Online and Realtime Tracking (DeepSORT) for tracking, and the Segment Anything Model (SAM) for precise segmentation. Stylisation is performed using CLIPStyler, which applies descriptive text prompts as style instructions, making the process fully text-driven and independent of reference images. A custom dataset of elephants was created and annotated for training and evaluation. In the first approach, stylisation was applied only to the segmented object regions, which were then blended with the unaltered background frames. This method is efficient but slightly affects background quality. To address this, the second approach first generated a fully stylised video and then used segmentation masks to isolate the stylised objects, merging them with the original background frames. This preserved background clarity while maintaining object stylisation. Quantitative evaluation produced strong results: Intersection over Union (IoU) = 0.9689, Dice coefficient = 0.9689, F1-score = 0.88, precision = 0.90, and recall = 0.87. A user study with 30 participants, including professional videographers, found that over 90% agreed the style aligned with the text, object shapes were preserved, and background artefacts were minimal. These results demonstrate the framework's effectiveness for object-level stylisation. Future work will explore advanced detection models, improved segmentation with SAM 2, zero-shot object detection, and voice-based style control.
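The core of the second approach described in the abstract (stylise the full frame with CLIPStyler, then use the SAM segmentation mask to composite the stylised object back onto the original background) and the reported mask-overlap metrics can be sketched roughly as below. This is a minimal illustrative sketch only, assuming per-frame binary masks; the function names composite_stylised_object and iou_and_dice are hypothetical and are not the authors' released code.

# Illustrative sketch of mask-based compositing and the IoU/Dice metrics
# used in the evaluation. Helper names are hypothetical placeholders.
import numpy as np

def composite_stylised_object(original_frame: np.ndarray,
                              stylised_frame: np.ndarray,
                              mask: np.ndarray) -> np.ndarray:
    """Keep stylised pixels inside the object mask and original pixels
    everywhere else, preserving background clarity."""
    mask3 = np.repeat(mask[..., None].astype(np.float32), 3, axis=-1)
    blended = mask3 * stylised_frame + (1.0 - mask3) * original_frame
    return blended.astype(original_frame.dtype)

def iou_and_dice(pred_mask: np.ndarray, gt_mask: np.ndarray):
    """Standard Intersection over Union and Dice coefficient for binary masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)

Under these assumptions, a per-frame loop would stylise each frame with CLIPStyler, obtain the tracked object's SAM mask, and call composite_stylised_object before writing the output video; iou_and_dice would then compare predicted and annotated masks to produce scores of the kind reported above.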
dc.identifier.citation: Proceedings of the Postgraduate Institute of Science Research Congress (RESCON)-2025, University of Peradeniya, p. 78
dc.identifier.issn: 3051-4622
dc.identifier.uri: https://ir.lib.pdn.ac.lk/handle/20.500.14444/6013
dc.language.iso: en
dc.publisher: Postgraduate Institute of Science (PGIS), University of Peradeniya, Sri Lanka
dc.relation.ispartofseries: Volume 12
dc.subject: CLIPStyler
dc.subject: Temporal consistency
dc.subject: Text-guided style transfer
dc.subject: Video object segmentation
dc.title: Temporal aware text-driven style transfer for motion-based video transformation
dc.type: Article

Files

Original bundle
Name: 18 RESCON 2025 CMS-30.pdf
Size: 302.8 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed to upon submission
