[{"id":"86","phd":"0","class":"113","name_en":"Yun-Tung Hsieh","name_ch":"謝昀彤","research_en":"","resercher_intro":"","research_ch":"","abstract_en":"","abstract_ch":"","picture":"","personal_page":""},{"id":"87","phd":"0","class":"113","name_en":"Chieh-Ling Lee","name_ch":"李婕綾","research_en":"","resercher_intro":"","research_ch":"","abstract_en":"","abstract_ch":"","picture":"","personal_page":""},{"id":"88","phd":"0","class":"113","name_en":"Chih-Hung Han","name_ch":"韓志鴻","research_en":"","resercher_intro":"","research_ch":"","abstract_en":"","abstract_ch":"","picture":"","personal_page":""},{"id":"89","phd":"0","class":"113","name_en":"Ruo-An Wang","name_ch":"王若安","research_en":"","resercher_intro":"","research_ch":"","abstract_en":"","abstract_ch":"","picture":"","personal_page":""},{"id":"90","phd":"0","class":"113","name_en":"Yu-Chieh Hsiao","name_ch":"蕭育傑","research_en":"","resercher_intro":"","research_ch":"","abstract_en":"","abstract_ch":"","picture":"","personal_page":""},{"id":"81","phd":"0","class":"112","name_en":"Yu-Jie Lin","name_ch":"林語潔","research_en":"","resercher_intro":"","research_ch":"","abstract_en":"","abstract_ch":"","picture":"","personal_page":""},{"id":"82","phd":"0","class":"112","name_en":"Jen-Chueh Hsu","name_ch":"許仁覺","research_en":"","resercher_intro":"","research_ch":"","abstract_en":"","abstract_ch":"","picture":"","personal_page":""},{"id":"83","phd":"0","class":"112","name_en":"Hong-Hui Yu","name_ch":"尤虹惠 \r\n","research_en":"","resercher_intro":"","research_ch":"","abstract_en":"","abstract_ch":"","picture":"","personal_page":""},{"id":"84","phd":"0","class":"112","name_en":"Pei-Ling Lu\r\n","name_ch":"呂珮伶","research_en":"","resercher_intro":"","research_ch":"","abstract_en":"","abstract_ch":"","picture":"","personal_page":""},{"id":"85","phd":"0","class":"112","name_en":"Hsuan-Tung Lin","name_ch":"林宣彤","research_en":"","resercher_intro":"","research_ch":"","abstract_en":"","abstract_ch":"","picture":"","personal_page":""},{"id":"76","phd":"0","class":"111","name_en":"Yu-An Chang","name_ch":"張友安","research_en":"Effective Strategies of Adversarial Signal Embedding for Resisting Deepfakes Images","resercher_intro":"","research_ch":"針對深度偽造生成影像之對抗性擾動訊號嵌入策略","abstract_en":"The technology for deepfakes using generative models is rapidly advancing and becoming increasingly accessible. Potential applications include synthesizing images of individuals that match specific requirements, such as certain expressions and appearances, or converting images into different styles. However, these applications also bring serious concerns. Most generative model outputs contain human faces, but their sources may involve sensitive issues or unauthorized use of individuals’ images. Preventing the misuse of such images is an important issue. One countermeasure against facial generative models is to introduce subtle but imperceptible perturbations into images to disrupt the subsequent operation of generative models. Existing methods, while causing content disruption in the outputs of generative models, often result in noticeable distortions in the images with embedded perturbations, reducing their practical usability. This study proposes a method that combines Just Noticeable Difference (JND) with various adversarial image generation strategies to produce perturbations that are closer to the original image. We also explore different implementation methods to ensure effective disruption of the generative model’s output. 
To validate the adaptability of the perturbations, we test against counter-perturbation attacks, comparing the effectiveness of different adversarial perturbation strategies. Experimental results show that, compared to existing methods that limit the maximum pixel value change, our JND-based approach provides better image quality preservation while ensuring effective disruption of the target generative model.\r\nKeywords – Deepfake, Watson Perceptual Model, GAN, Adversarial Perturbation, Deep Learning.","abstract_ch":"利用生成模型進行深度偽造的技術日益進步且易於使用,可能的應用包括將輸入的人物影像合成符合某種需求如特定表情與外觀的輸出影像,或者是將影像轉換為不同的風格的畫面。此類應用同時也帶來不少潛在隱憂。大多數生成模型影像包含人臉,但其來源可能觸及敏感議題或未經畫面人物的授權使用,如何防範影像的不當使用是值得關注的議題。一種對人臉生成模型的反制方法是在影像中加入微小但不易察覺的擾動,藉此干預後續生成模型的運作。現存方法雖然讓加入擾動訊號的影像在生成模型的產出中產生內容破壞,但嵌入的擾動訊號卻容易造成影像明顯的失真,減少了實際運用的可行性。本研究提出結合視覺感知之最小可覺差(Just Noticeable Difference)與多種對抗性影像生成演算法的方式,產生與原圖更接近的擾動訊號嵌入影像,並探究不同的實作方式以確認對於生成模型的產出進行有效破壞。為了驗證擾動的適應性,我們亦測試反擾動攻擊,藉此比較對抗性擾動策略的優劣。實驗結果顯示,與現有方式限制最大像素值改變的方法相比,在保證對於目標生成模型的破壞效果下,我們基於最小可覺差的方法在影像品質的保持有更佳的表現。\r\n關鍵字 – 深度偽造、視覺感知模型、GAN、對抗型擾動、深度學習","picture":"","personal_page":""},{"id":"77","phd":"0","class":"111","name_en":"Si-Ting Lin","name_ch":"林思婷","research_en":"Registration of Infrared and Visible Images Using Style Transfer-Based Semantic Segmentation","resercher_intro":"","research_ch":"使用基於風格轉換之語意分割實現紅外光與可見光影像融合畫面對齊","abstract_en":"Infrared and visible image fusion aims to integrate the complementary information from both types of sensors to generate a single image that incorporates the features of both. This fusion is intended to better match human visual perception or assist with high-level visual tasks such as semantic segmentation and object detection. Most current fusion algorithms assume that paired infrared and visible images are available. However, different sensor devices often cause misalignment of image content or result in frame drops, leading to temporal misalignment. Recent research addresses slight displacements and distortions between input images under the assumption of the same resolution. However, significant differences in resolution and field of view in actual captured images necessitate more effective alignment methods. Existing image fusion datasets lack object and semantic segmentation annotations, which hampers the training of related models, and the differing content between infrared and visible images across datasets makes traditional feature matching methods less effective.\r\nThis paper proposes a method for creating an infrared and visible image fusion dataset with semantic segmentation information. By applying style transfer to existing semantic segmentation dataset images, we generate corresponding infrared and visible images. These images are then used to retrain semantic segmentation models, resulting in a dataset that matches the application scenario and includes relevant semantic segmentation annotations and masks. Depending on whether the background includes common segmentation classes, we use either semantic segmentation annotations or important object masks. We achieve global spatial alignment by calculating image scaling and translation using logarithmic polar coordinate transformation and Fourier Transforms. We can choose to refine local slight displacements using deep learning methods to achieve more accurate object alignment. 
To address temporal alignment issues, we combine spatial alignment and mask comparison to identify the maximum object overlap and corresponding images between infrared and visible targets, overcoming temporal misalignment caused by frame drops or device settings. Finally, we propose a low-parameter image fusion design to reduce computational resource requirements while enhancing image fusion performance and efficiency.\r\nKeywords - Image fusion, Image alignment, Deep learning, Semantic segmentation, Style transfer.","abstract_ch":"紅外光與可見光影像融合藉由擷取此兩種影像感測器畫面的互補資訊進而生成兼具兩者特徵的單一影像,希望融合畫面更符合人類視覺感知,或協助後續場景語意分割與物件偵測等高階視覺任務。現今的融合演算法多假設可取得成對的紅外光與可見光影像,然而,不同的感測裝置經常造成畫面內容物錯位或是發生掉幀而出現時間域的不對齊。近期研究在輸入影像解析度相同的前提下或能消除存在於兩輸入影像中的輕微位移及變形,但實際拍攝的影像在解析度及拍攝範圍等可能存在甚大差異而需更有效的畫面對齊方式。現有影像融合資料集缺乏物件和語意分割標記而不利相關模型的訓練,且不同資料集的紅外光與可見光內容也讓傳統特徵比對方法難有令人滿意的效果。本論文提出建立具語意分割資訊的紅外光與可見光影像融合資料集方法,將現有語意分割資料集影像經風格轉換生成對應的紅外光與可見光影像,再利用這些影像重新訓練語意分割模型,從而建立符合應用場景情境且包含相對應語意分割標記與遮罩的影像資料集。我們根據背景是否包含經典畫面分割類別而選擇使用語意分割標記或重要物件遮罩,透過對數極座標轉換暨傅立葉轉換於頻域上計算畫面縮放和平移量以達成全局影像空間域對齊。我們可再利用深度學習方法微調局部輕微位移以取得畫面中物件更精確的對齊效果。關於時間域對齊問題,我們結合空間域對齊及遮罩比對逐一檢視紅外光與可見光目標影像以找出最大物件重疊相對應畫面,藉此克服因掉幀或裝置設定所導致的時域錯位。最後,我們提出超低參數量的影像融合設計以降低計算資源需求,同時提升影像融合性能及效率。\r\n關鍵字 – 影像融合、影像對齊、深度學習、語意分割、風格轉換","picture":"","personal_page":""},{"id":"78","phd":"0","class":"111","name_en":"Kai-Ming Chang","name_ch":"張凱名","research_en":"Evaluating Block Consistency by Compression Features for Forgery Detection of Encoded Images and Videos","resercher_intro":"","research_ch":"利用區塊壓縮特徵一致性之編碼影像竄改及視訊偽造偵測","abstract_en":"With the growing accessibility of image editing tools and deep learning-based forgery applications, individuals can easily alter images and videos, disseminating them across social media networks. Such tampered images and forged videos not only create confusion but can also cause irreversible damage to personal reputation and privacy. In response, numerous detection methods for forged images and deepfake videos have been developed in recent years. These methods often rely on training with datasets containing specific tampering techniques to create targeted detection mechanisms. However, as forgery technologies advance, new and unknown tampering methods may emerge. Additionally, tampered images and videos may undergo compression or encoding during dissemination, which can obscure tampering traces, diminishing the effectiveness of current detection methods. This study introduces a deep learning-based forgery detection method that utilizes block consistency to address the challenge of diminished tampering traces in compressed images and encoded videos. By evaluating the similarity of block content within the images or videos, this method determines whether tampering has occurred. Unlike existing approaches that train on the datasets with target tampering operations, our method uses general image data to train the deep learning model, thereby enhancing the model’s generalization capability. The proposed scheme was formed by first developing a feature extractor using convolutional neural networks to identify the source of the images and then employing a Siamese network to classify image compression levels. For image tampering detection, heatmap transformations and foreground extraction were used to pinpoint tampered areas. In deepfake video detection, we concentrated on facial regions, assessing the similarity between consecutive frames to verify the video’s authenticity. 
The effectiveness of this method was validated and tested using publicly available datasets, which include a range of tampered images and outputs from deepfake video models. The strong performance of the proposed block consistency method underscores its potential in enhancing image tampering detection and deepfake video identification. \r\nKeywords – Image tampering, Deepfake video, deep learning, convolution neural network, Siamese network","abstract_ch":"由於影像編輯工具和基於深度學習的偽造生成應用的普及,人們可以輕易地修改影像和視訊並將其散播至社交媒體網路。竄改的影像及偽造的視訊不僅混淆視聽,對於個人名譽或身分隱私更可能造成無法挽回的損害。近年多種影像竄改內容定位技術和相關深偽視訊偵測方法相繼被提出,現有方法通常針對目標竄改手法畫面進行訓練而產生針對性的偵測機制。然而偽造技術與時俱進,竄改內容的方法可能是未知的,且竄改影像及視訊在網路上傳播時又可能經過壓縮編碼等處理致使竄改痕跡消失,讓現有方法的偵測結果無法令人信服。本研究提出使用區塊一致性的深度學習偽造偵測方法,針對壓縮影像和編碼視訊中偽影減少問題,透過評估畫面中區塊內容的相似性來判斷影像或視訊是否受到竄改。所提出的方法不針對各式竄改操作的資料進行訓練,僅使用一般的影像資料訓練深度學習模型以實現相關的偵測辨別,降低模型泛化能力不足的疑慮。我們透過卷積神經網路,設計能夠分辨來源影像的通用特徵提取器,並利用孿生網路進行影像壓縮程度分類。對於影像竄改偵測,我們使用熱力圖轉換和前景提取技術定位竄改區域。而對於視訊偽造偵測,我們針對人臉周圍區域,透過比對前後幀的相似程度來判斷該視訊的真實性。本研究方法在公開的資料集上進行驗證和測試以證明其可行性,這些資料集包含各式竄改影像及深偽視訊模型之輸出,代表不同類型的影像和視訊,藉此顯示此區塊壓縮特徵一致性方法有助於影像竄改偵測與深偽視訊識別。關鍵字 – 深度學習、影像竄改、深度偽造、孿生網路","picture":"","personal_page":""},{"id":"79","phd":"0","class":"111","name_en":"Yi-Han Cheng","name_ch":"鄭伊涵","research_en":"JSN : Design and Analysis of JPEG Steganography Network","resercher_intro":"","research_ch":"JSN : JPEG影像隱寫網路之設計與分析","abstract_en":"Image Steganography is the technique of hiding messages within images for secret communication, using the carrier image as a disguise to avoid detection by outsiders. The recipient can then extract the hidden message from the stego image. To transmit a large amount of information, the hidden content can also be an image, leading to applications where images are hidden within images. Although existing image steganography techniques can embed nearly the same-sized secret image into a carrier image without significant distortion and can extract the complete secret image, they often do not account for necessary lossy compression during transmission, such as when saving and transmitting images in the commonly used JPEG format. Lossy compression can lead to the failure of image steganography.\r\nTo mitigate the impact of JPEG compression on secret messages, we propose an image steganography model called JSN (JPEG Steganography Network) that aligns with JPEG compression. JSN utilizes a reversible neural network as the backbone of the deep learning model, combined with the JPEG encoding process. It applies an 8×8 Discrete Cosine Transform (DCT) and takes account of the quantization steps specified by JPEG, ensuring that the impact of JPEG lossy compression on the stego image be reduced. The use of a reversible neural network allows JSN to use the same architecture and parameters during both the embedding and extraction processes. In addition to maintaining the quality of both the stego and secret images, the additional quantization process after embedding influences both the embedding and extraction network parameters during training.\r\nWe have conducted extensive testing on JSN, and the experimental results confirm that JSN achieves excellent image steganography performance and meets the practical needs of related applications. 
\r\nKeywords – Steganography, Deep Learning, Invertible Neural Network, JPEG, Discrete Cosine Transform","abstract_ch":"影像隱寫(Image Steganography)是將訊息隱藏入於影像中以進行秘密通訊,即利用載體影像做為偽裝來避免外人察覺,接收方則可由該影像中擷取秘密訊息。為了傳遞大量訊息,所隱藏的內容可同為影像,即以圖藏圖的應用。現存的影像隱寫技術雖然可將幾乎同樣大小的秘密影像嵌入於載體影像中而不產生明顯失真,並可擷取出完整秘密影像,但通常未考量傳輸影像時必要的有損壓縮,例如將影像以最常見的JPEG格式儲存及傳輸,失真壓縮可能導致影像隱寫失敗。為了避免JPEG壓縮對於秘密訊息的影響,我們提出一個符合JPEG壓縮邏輯的影像隱寫模型JSN (JPEG Steganography Network)。JSN運用可逆神經網路做為深度學習模型架構主幹,結合JPEG編碼流程,對影像施予8×8離散餘弦轉換並考量JPEG所規範的量化步階,使載體影像在嵌入秘密影像後能有效降低JPEG有損壓縮的影響。可逆神經網路的使用讓JSN在嵌入過程與擷取過程使用共同的架構與參數,除了維持載體影像與秘密影像畫質外,在嵌入後的額外量化程序能在訓練擷取網路參數的同時也能影響嵌入網路參數。我們對JSN進行廣泛測試,實驗結果證實JSN能夠取得良好的影像隱寫效果,並符合相關應用的實際需求。\r\n關鍵字 – 資料隱藏、深度學習、可逆神經網路、JPEG、離散餘弦轉換\r\n","picture":"","personal_page":""},{"id":"80","phd":"0","class":"111","name_en":"Meng-Chieh Lee","name_ch":"李孟潔","research_en":"Scene-Text Segmentation and Recognition via \r\nCharacter Spacing Detection","resercher_intro":"","research_ch":"基於字元間隙偵測之自然場景文字分割與辨識","abstract_en":"Scene text indicates text appearing in street signs, shop signs, notices, and \r\nproduct packaging, etc., and reliably detecting and recognizing scene text is \r\nbeneficial for a variety of potential applications. Text in natural scenes may \r\nappear in complex street views or on uneven backgrounds, and its detection and recognition are easily affected by changes in lighting, reflections, angle distortions, or other obstructions. Nowadays, common research methods adopt \r\ndeep learning models, with words labeled as units to facilitate subsequent word \r\nsegmentation, text detection, and recognition. These methods usually require \r\nmore data and larger deep learning models to handle the diversity of text words. \r\nBesides, multilingual text appears quite often and labeling in a unified manner \r\nis not a trivial task.\r\nConsidering the cost of model training and the detection of multilingual \r\ntext, this study proposes using character gaps or spacings as detection targets \r\nto assist in the segmentation of multilingual characters. By detecting character \r\ngaps to locate character centers, and then using a nearest neighbor algorithm to \r\ndraw character bounding boxes, a lighter model can be used for single-character \r\nrecognition. However, the challenge of detecting character gaps or spacings lies \r\nin the fact that most current datasets are labeled for words, lacking labels for \r\ncharacters or character gaps. We form a synthetic image dataset that mimics \r\nnatural scenes, containing character bounding boxes and character gap boxes. \r\nCombined with weakly supervised learning on real datasets with word labels, \r\nthis approach allows the model to be fine-tuned and iteratively updated to more \r\naccurately locate character gaps. 
Experimental results show that the proposed method is feasible for detecting character gaps or spacings to locate characters in the multilingual datasets.\r\nIndex Terms – Deep learning, semantic segmentation, scene text localization, \r\nmultilingual text localization, character recognition, weakly supervised learning.","abstract_ch":"自然場景文字包含街景路標、商店招牌、告示牌以及商品包裝等,可靠地偵測與辨識這些文字有助於實現多種具潛力的應用。自然場景文字可能出現於複雜街景或非平整背景,易受到光線變化、反光、角度扭曲或其他遮蔽物影響,於自然場景影像中準確偵測與辨識文字並不容易。現今常見的研究方法是利用深度學習模型,並以字詞為單位進行標記以利後續的字詞分割、文字偵測及辨識,通常需要較多資料與較大型的深度學習模型來因應存在於字詞的多樣性。此外,經常出現的不同語種文字會增加標記與辨識的困難。考量模型訓練成本與多語種文字偵測的需求,本研究\r\n提出以字元間隙為標的之文字偵測模型來協助定位自然場景中的多語種字元,透過字元間隙決定字元中心,再使用近鄰演算法畫出字元框區域,可與其中以較輕量的模型進行字元辨識。然而,偵測字元間隙的挑戰在於現今大部分資料集的標記都是針對字詞,在缺乏字元或字元間隙標記的情況下,本研究先產生接近自然場景的人工資料集,該資料集包含字元標記框以及字元間隙標記框,再搭配弱監督式學習以含有字詞標記的真實資料集進行模型調整,使得模型在微調以及迭代更新下能更準確地定位字元間隙,進而找出字元位置。實驗結果顯示,對於包含多國語種的文字\r\n資料集,我們所提出的偵測字元間隙方法以定位字元中心位置是可行的。\r\n關鍵字 – 深度學習、語義分割、自然場景文字定位、多國語言文字定位、字元辨識、弱監督式學習","picture":"","personal_page":""},{"id":"71","phd":"0","class":"110","name_en":"Hsing-Wei Chang","name_ch":"常興唯","research_en":"Establishment and Evaluation of a Semantic Segmentation Dataset for Infrared and Visible Image Fusion","resercher_intro":"","research_ch":"紅外線與可見光影像融合之語意分割資料集建立及其對融合效果的影響評估","abstract_en":"The purpose of image fusion is to integrate different types of input images and generate a more complete image with improved scene representation and visual perception, supporting advanced vision tasks such as object detection and semantic segmentation. Infrared and visible image fusion is a widely studied research area, but training fusion models using deep learning methods often requires a large amount of annotated data. Existing infrared and visible image fusion datasets only provide images without precise object annotations or semantic segmentation, which affects the presentation of fusion results and limits the further development of related fields. In this study, we propose a method to create a dataset for infrared and visible image fusion with semantic segmentation information. We utilize general images from existing semantic segmentation datasets and generate corresponding infrared images using style transfer techniques. This allows us to establish a labeled fusion image dataset, where each pair of infrared and visible images is accompanied by their respective semantic segmentation labels. This dataset creation method improves image fusion performance and can also provide an alignment method based on semantic segmentation masks for disparate resolution and misalignment in real-world infrared and visible images, which saves significant time and resources in the common alignment preprocessing step. 
\r\nKeywords - Image fusion, Image alignment, Semantic segmentation, Deep learning, Style transfer.","abstract_ch":"影像融合的目的是整合不同類型的輸入影像,透過影像間的互補資訊生成具更完整場景顯示和視覺感知的影像,以支援後續的進階視覺任務,例如物件偵測與語意分割等。紅外線與可見光影像融合是受到廣泛關注的研究領域,但使用深度學習方法進行模型訓練時通常需要大量的標記資料,現有的紅外線與可見光影像融合資料集卻只提供影像,缺乏精確的物件標記以及語意分割等,從而影響影像融合結果的呈現,也限制了相關領域的進一步發展。本研究提出創建具語意分割資訊的紅外線與可見光影像融合資料集方法,利用現有的語意分割資料集的一般影像,以風格轉換方式生成相對應的紅外線影像,依此建立具標記的融合影像資料集,即每組紅外線與可見光影像皆包含對應的語意分割標記。這樣的資料集建立方式能夠提升影像融合效果,也能針對實際拍攝的紅外線與可見光影像可能出現畫面解析度不同及內容錯位的問題,提供基於語意分割遮罩的對齊方法,將紅外線及可見光影像進行重新採樣對齊,對於此類研究中常見的對齊前處理能節省不少的時間與人力。\r\n關鍵字 – 影像融合、影像對齊、語意分割、深度學習、風格轉換","picture":"","personal_page":""},{"id":"72","phd":"0","class":"110","name_en":"Yun-Chi Tsai","name_ch":"蔡允齊","research_en":"Enhancing Deep-Learning Sign Language Recognition through Effective Spatial and Temporal Information Extraction","resercher_intro":"","research_ch":"擷取有效畫面域與時間域資訊進行深度學習手語辨識","abstract_en":"Automatic sign language recognition based on deep learning requires a large amount of video data for model training. However, the creation and collection of sign language videos are time-consuming and tedious processes. Limited or insufficiently diverse datasets restrict the accuracy of sign language recognition models. In this study, we propose effective spatial and temporal data extraction methods for sign language recognition. The goal is to augment the limited sign language video data to generate a larger and more diverse training dataset. The augmented data, used as inputs to deep learning networks, can be paired with simpler architectures like 3D-ResNet, which allows for achieving considerable sign language recognition performance without the need for complex or resource-intensive network structures.\r\nOur spatial data extraction employs three types of data: skeletons obtained using Mediapipe, hand region patterns or masks, and optical flows. These three data types can be used as three-channel inputs, akin to the approach often used in earlier 3D-ResNet models. Nevertheless, our distinct data types offer specific features that enhance feature extraction. For temporal data extraction, we determine certain key-frames to capture more meaningful visual information, thus employing different scene selection strategies.\r\nThe proposed spatial and temporal data extraction methods facilitate data augmentation, which simulates various hand sizes, gesture speeds, shooting angles, etc. The strategy significantly contributes to expanding the dataset and increasing its diversity. 
Experimental results demonstrate that our approach significantly improves the recognition accuracy for commonly used American Sign Language datasets.\r\nKeywords: Sign-language recognition, key-frames, deep learning","abstract_ch":"基於深度學習的自動手語辨識需要大量視訊資料進行模型訓練,然而手語視訊的製作與蒐集相當費時繁瑣,少量或不夠多樣的資料集則限制了手語辨識模型的準確率。本研究針對手語辨識提出有效的空間域與時間域資料擷取方法,希望將有限的手語視訊資料透過合理的擴增處理產生更大量與多樣的訓練資料,這些做為深度學習網路的輸入資料可搭配較簡易的架構如3D-ResNet來搭建,可以不採用複雜或需要大量訓練資源的網路架構即可獲致相當的手語辨識效果。我們的空間域資料擷取採用以Mediapipe所取得的骨架、手部區域型態或遮罩,以及移動光流,這三種資料可做為像是較早的3D-ResNet模型所常採用的三通道輸入,但與以往RGB輸入不同的是我們的三種資料各有特點而讓特徵擷取更具效果。時間域資料擷取則透過計算與決定關鍵幀的方式挑選更有意義畫面,藉此達成不同的畫面選擇策略。我們所提出的時間域與空間域資料可再用有效的資料增強模擬多種手尺寸、手勢速度、拍攝角度等,對於擴充資料集與增加多樣性都有很大的助益。實驗結果顯示我們的方法對於常用的美國手語資料集有顯著的辨識準確度提升。\r\n關鍵字–手語辨識、關鍵幀、深度學習","picture":"","personal_page":""},{"id":"73","phd":"0","class":"110","name_en":"Li-Zhu Chen","name_ch":"陳莉筑","research_en":"Character Segmentation in Scene-Text Images\r\nBased on Weakly Supervised Learning","resercher_intro":"","research_ch":"基於弱監督式學習之自然場景文字字元分割","abstract_en":"In recent years, there has been a prevailing trend in deep learning-based research for natural scene-text detection. The primary focus has generally been on word-based detection, which has yielded promising results. However, text fonts have significant variations, and the backgrounds of test images tend to be complex. Text may also be obstructed by occlusions, particularly in cases where natural scene text exhibits diverse orientations. Achieving accurate word-level detection under such circumstances is challenging and can also impact the subsequent text recognition accuracy. To address the difficulty of detecting irregularly oriented words, this paper proposes a pixel-level character detection network. By detecting individual characters, the detection boxes can adhere more closely to the text boundaries, reducing the negative influence of complex backgrounds on the detection network. Lighter-weight recognition networks can thus be employed for subsequent text recognition, reducing the resource and time requirements for training. The main challenge in character detection lies in the fact that existing natural scene-text detection datasets focus on word-level annotations, since character-level annotation is a laborious and time-consuming task. To overcome this challenge, we generate a large volume of synthetic data that closely resembles real-world scenarios. We employ partially annotated data for training, incorporating weakly supervised learning techniques and the inclusion of real-world data during training. For real-world data without character-level annotations, we adopt an iterative update approach to automatically learn more reliable character positions through the use of updated results to improve the accuracy of the model. Additionally, we propose a new evaluation method for character detection to address the lack of character-level annotated test datasets. Experimental results demonstrate the superiority of our method over other character detection models on the ICDAR2017, TotalText, and CTW-1500 datasets. We also apply the same approach to train models for character detection in other languages to validate the feasibility of the proposed method. 
\r\nIndex Terms – Deep learning, semantic segmentation, arbitrary orientations text localization, weakly supervised learning.","abstract_ch":"近年來基於深度學習於自然場景文字檢測的相關研究盛行,普遍以偵測字詞(word)為主要目標,並取得不錯的效果。然而,文字字體型態多變,且待測影像背景趨於複雜,文字可能受到遮蔽物阻擋,特別是當自然場景文字走向多元時,準確的字詞偵測並不容易達成,也影響下一階段文字辨識的準確度。本研究提出像素級字元(character)偵測網路,透過偵測字元的方式嘗試解決不規則走向字詞不易偵測的問題。字元偵測能讓偵測框更緊貼文字邊緣,降低複雜背景對於偵測網路所造成的影響,後續的文字辨識或可使用較輕量的辨識網路,減少訓練所需的資源與時間。字元偵測的主要挑戰在於現有自然場景文字檢測資料集皆採用字詞標記,因為針對字元的人工標記相當耗時費力。我們藉由生成大量貼近真實場景的合成資料來解決訓練集缺少字元標記的問題,並結合弱監督式學習在含有字詞標記的真實影像進行模型訓練。對於這些沒有字元標記的真實資料,我們以迭代更新結果的方式使網路自動學習偵測更可靠的字元位置,提升模型表現。另外,因應缺少字元標記的測試資料,我們提出新的字元偵測評估方式。實驗結果顯示我們的方法在ICDAR2017、TotalText和CTW-1500資料集上皆優於其他字元偵測模型,我們也將同樣的方式運用於訓練中文字元偵測以驗證所提出方法在其他語言內容的可行性。\r\n關鍵字 – 深度學習、語意分割、任意走向文字定位、弱監督式學習","picture":"","personal_page":""},{"id":"74","phd":"0","class":"110","name_en":"Bo-Hong Huang","name_ch":"黃博鴻","research_en":"Detecting Forged Images and DeepFake Videos via Block Consistency Evaluation","resercher_intro":"","research_ch":"基於區塊一致性評估之影像竄改與深偽視訊偵測","abstract_en":"With the advancement of technology, image manipulation has evolved to a point where it can significantly alter the content of images and\/or videos while maintaining a high level of realism. The emergence of DeepFake technology has further revolutionized and facilitated these manipulations, posing a significant threat to the integrity of digital visual media due to malicious intent. Although several countermeasures have been proposed, the diversity and constant evolution of tampering techniques make it challenging, if not impractical, to collect a comprehensive dataset for supervised learning. Even if such a dataset were available, the sheer volume of data would present a formidable challenge.\r\nTherefore, in this study, we propose a deep learning-based forensic method that leverages block consistency evaluation to identify forgery or affected areas in images and videos. This approach aims to circumvent the need for training with various types of tampered data by utilizing information from original or unaltered image blocks for identification. We train a convolutional neural network to extract features from image blocks and employ a Siamese network for similarity measurement between block pairs to determine potential tampered areas. Furthermore, to combat image inpainting, we incorporate a segmentation network for further refinement of the tampered areas. When dealing with DeepFake, we first locate the facial regions and then assess the authenticity of videos by comparing the similarity of facial regions across consecutive frames.\r\nWe test and validate the proposed method on publicly available datasets that encompass a wide range of image and video types, covering various tampering techniques. Through comparison with other methods, we demonstrate the superiority of our approach in terms of accuracy and stability, showcasing its feasibility and potential. These findings underscore the effectiveness and promise of the proposed scheme in addressing the challenges posed by image manipulation and DeepFake technologies. 
\r\nKeywords: Image inpainting, DeepFakes, Siamese network, deep learning","abstract_ch":"隨著時代的進步,現在的圖像竄改已具有顯著改變圖像和\/或視訊內容的能力,同時保持極高的逼真度,而深度偽造(DeepFake)技術的出現更造成巨大的變革和便利性,卻也因為各種惡意目的的操作下,對於數位視訊判斷真實性帶來很大的威脅。現今已經提出不少對抗方案,然竄改方式的多樣性和不斷演進變化,要收集所有類型的竄改資料進行監督式學習是相當困難或不切實際的,即使蒐集齊全也要面臨資料集過於龐大的問題。\r\n因此,在本研究中,我們從另一角度出發,提出了一種基於區塊相似性的深度學習辨識方法,通過評估區塊內容參數的一致性來識別圖像和視訊中的偽造或受影響區域。這種方法旨在避免使用各種類型的竄改資料進行訓練,而是利用原始或未修改的圖像區塊的信息來實現辨識。我們訓練了一個卷積神經網路來提取圖像區塊的特徵,並使用Siamese網路進行區塊對之間的相似度比對,以確定可能的竄改區域。此外,為了對抗圖像竄改,我們還引入了分割網路來對竄改區域進行進一步的精細處理。在處理DeepFake問題時,我們首先定位人臉區域,然後通過比對前後幀中的人臉區域相似度來判斷視訊的真實性。我們在公開的資料集上對所提出的方法進行了測試和驗證,以驗證所提出方法的可行性。這些數據集包含了各種不同類型的圖像和視訊,涵蓋了多種竄改操作。通過與其他方法的比較,我們證明了所提出的方案在準確性和穩定性方面的優越性,這一結果也顯示了所提出方案的可行性和潛力。\r\n關鍵字 – 圖像竄改、深度偽造、孿生網路、深度學習 ","picture":"","personal_page":""},{"id":"75","phd":"0","class":"110","name_en":"Ling-Wei Yeh","name_ch":"葉凌瑋","research_en":"Using Positive and Negative Images for Supervised Training to Achieve Luminance-Adaptive Fusion of Infrared and Visible Light Images","resercher_intro":"","research_ch":"運用正負影像進行監督式訓練以實現紅外光與可見光之\r\n畫面亮度自適應融合","abstract_en":"The main task of fusing infrared and visible light images is to preserve the \r\nspectral information of the same scene in a single frame. However, the extreme \r\nbrightness differences between the two input images can lead to content \r\ninterference and affect the presentation of complementary information in the \r\nfused image. Existing image fusion methods often perform well for lower \r\nbrightness images, but when one of the images contains high-brightness content, \r\nwe observed a decrease in texture contrast in the fused image. To overcome the \r\nissue of poor fusion results caused by extremely high and low brightness \r\nimages, we propose a new training method that utilizes self-supervised learning \r\nwith positive and negative images in a deep learning neural network. We \r\nextract image gradients to generate ground-truth that preserve the details from \r\nthe original images for reference in supervised learning. Additionally, we use \r\nedge enhancement as the ground-truth for the gradients of the fused image to \r\nmitigate the adverse effects of brightness on preserving fused details. We also \r\nintroduce a channel attention module to enhance or weaken different channels \r\nin the feature maps. The training process measures the similarity between the \r\npositive and negative images and the designed ground truth, as well as the \r\nsimilarity between the inverted negative fused image and the positive fused \r\nimage. 
This encourages the deep learning network to preserve detailed features \r\nand achieve luminance-adaptive image fusion. Experimental results \r\ndemonstrate the effectiveness of our proposed method, confirming that the generation of ground-truth can guide the preservation of information in the \r\nfusion of infrared and visible light images.\r\nIndex Terms – Image Fusion, Deep Learning, Convolution Neural Network,\r\nSupervised Learning, Luminance Self-Adaptive","abstract_ch":"紅外光與可見光影像融合的目的是將同一場景之不同光譜影像資訊保留於單一畫面中。然而,兩輸入圖的較大亮度差異可能導致彼此內容干擾,影響融合影像的互補資訊呈現。現存的影像融合方法通常對於較低亮度影像有較佳的效果,但是當其中一張影像具有較高亮度內容時,我們發現融合影像的紋理對比度經常出現降低的情況。為了避免極高亮度和極低亮度影像所引起的融合效果不理想,我們提出新的深度學習模型訓練方法,運用正負影像進行自監督式訓練。考量畫面內容的邊緣是融合影像的呈現重點,我們計算影像梯度提取紋理以保留原圖細節供監督式學習參考。畫面中不同區域的紋理協助產生訓練融合影像的引導圖,我們另也運用邊緣增強做為融合影像梯度的參考以降低極端亮度對於畫面細節的影響。我們引入通道注意力模塊來針對特徵圖中不同通道進行強化或弱化,並加速模型訓練。監督式訓練計算正負影像與引導圖間的相似度、負融合影像反轉後與正融合影像間的相似度,以及融合影像梯度的相似度,盡可能保留畫面細節,並實現亮度自適應的影像融合目標。實驗結果證明我們所提出的方法能夠取得良好的成效。\r\n關鍵字 – 影像融合、深度學習、卷積神經網路、監督式學習、亮度自適應 ","picture":"","personal_page":""},{"id":"66","phd":"0","class":"109","name_en":"Yi-Ting Tung","name_ch":"董怡廷","research_en":"Designs of the Traditional Chinese Scene Text Dataset and Performance Evaluation for Text Detection and Recognition","resercher_intro":"","research_ch":"繁體中文場景資料集建置暨文字定位與辨識之評估","abstract_en":"Texts in pictures contain rich information. Extracting and recognizing these texts in images, i.e., scene text detection and recognition, help to facilitate many interesting and potential applications. Therefore, scene text analytics have become one of the research topics in the field of computer vision. Nevertheless, most existing datasets and competitions related to scene text detection and recognition focused on English or other languages. The Traditional Chinese used in Taiwan has not received much attention in this field. In order to promote the research of Traditional Chinese scene text analytics, in this study, we collected a large volume of street-view images to form the dataset called “Traditional Chinese Street-View Texts” (TCSVT), containing 20,188 images with careful annotations. The characters in this dataset have various forms and the strings have varying orientations, sizes, and fonts. We formulated a set of labeling principles for texts containing Chinese so that the annotations can be more standardized. The labels of text lines and characters include their locations, contents and the language types. This dataset was then adopted in the 2021 AICUP Traditional Chinese Scene Text Recognition Competition. This competition has three stages: 1) Text-line Localization, 2) Traditional Chinese Text-line Recognition and 3) Text Spotting and Recognition in Complex Streetscapes. We set up reasonable evaluation metrics for each task. The competition started in April 2021 and ended in December 2021. The numbers of teams participating in the three stages are 341, 183 and 128, respectively. 
The numbers of valid submissions of the three tasks are 246, 60 and 91, respectively.\nIndex Terms—Deep learning, scene text dataset, text detection, text recognition","abstract_ch":"場景文字包含非常豐富的影像相關訊息,擷取並辨識畫面中的文字內容能夠促成許多具潛力的應用,因此場景文字分析目前為電腦視覺領域所關注的研究議題之一。然而,現有場景文字資料集或相關競賽多集中於英文或其他語言的處理,台灣所使用的繁體中文尚未有較完整的資料。為了促進繁體中文字辨識領域的發展,本研究蒐集大量繁體中文街景圖片,包含 20,188 張街景影像,經後處理與標記後整合為繁體中文場景文字資料集。由於中文字的走向、大小、字體相當多元,為了讓標記資料趨於一致,我們訂定較符合包含中文場景文字的標記原則,其中的字串與字元都帶有位置與內容,並加上語言種類。資料集經錯誤檢查與整理後,應用於日前所舉辦的繁體中文場景文字辨識競賽。此競賽共分成三項任務,初階賽-文字定位、進階賽-繁體中文字元辨識,以及高階賽-複雜街景之中英數字辨識。本論文針對各階段競賽訂定評分原則,並展示競賽最終結果。比賽於 2021 年 4 月開始,2021 年12 月結束。每項競賽的參賽隊伍數與提交次數分別為,初階賽 341 組 246 次有效提交; 進階賽 183 組 60 次有效提交; 高階賽 128 組 91 次有效提交。\n關鍵字— 深度學習、場景文字資料集、文字定位、文字辨識","picture":"","personal_page":""},{"id":"67","phd":"0","class":"109","name_en":"Chia-Yin Lin","name_ch":"林佳穎","research_en":"Character Spotting and Language Recognition for Multilingual Scene Texts based on Image Segmentation","resercher_intro":"","research_ch":"基於影像分割之多語言場景文字字元偵測與語言辨識","abstract_en":"In recent years, scene text analysis based on deep learning techniques draws a lot of research attention. Text detection in natural scenes is an important step of scene text analysis and most of the existing text detection designs are based on string detection. However, a string may contain words of different languages so it is not easy to mark the language to which the string belongs accurately. Scene text recognition using string-level annotations needs to consider the effect of irregular orientations and requires a lot of training data and training time. Conversely, character-based recognition methodologies do not need to consider orientations, which simplifies the training processes. Multilingual natural scene text recognition may benefit from the flexibility of selecting suitable recognition models according to different language characteristics. In this research, we use a high-resolution network architecture to label word regions and point out the centers of characters, and also employ multiple channels for substring language classification. Due to the lack of character-level annotations in real datasets, we propose a weakly supervised learning approach for characters, enabling the network to improve the detection of characters significantly. The performance of multi-language recognition is verified by using individual classifiers after detection or by performing language recognition at the same time. 
The feasibility of the proposed design is verified by showing the character detection of different languages, including Latin, Chinese, Japanese, and Korean, as examples.","abstract_ch":"基於深度學習的自然場景文字分析相關研究在近年來十分盛行,文字區域偵測更是其中的重要環節。現今文字偵測大多以字串為標記單位,然而字串中可能包含不同語言的文字,標記時較不易確認該字串文字所屬語言。本研究提出以字元為單位的偵測方式,不僅能準確標記所屬語言,也讓辨識時能採用相對應語言模型以達到更好的效果。對於辨識模型而言,字串需要考量不規則的文字走向,且字串辨識模型通常需要較大量的訓練資料與訓練時間。反觀字元辨識則不太需要考慮文字走向,訓練模型相對簡單省時,且面對多語言自然場景文字時能更有彈性地根據語言特性,選擇適合的辨識單位與方法。本研究使用高解析度網路架構,以字元為偵測單位,標記字元區域並點出字元中心,且利用多個通道進行語言分類。由於真實資料集字元標記的缺乏,我們提出針對字元的弱監督式學習方法,使得網路在缺乏字元標記的情況下也能在偵測字元的表現有明顯的效果提升。在多語言分類上,不管是偵測後用個別分類器亦或是在偵測的同時進行語言辨識皆有一定的效果,驗證了字元辨識的可行性。我們實驗以拉丁文(英數字)、中文、\n日文、韓文為範例,分析本設計的可行性與合理性。\n\n關鍵字 – 深度學習、街景文字定位、多語言文本辨識、弱監督式學習","picture":"","personal_page":""},{"id":"68","phd":"0","class":"109","name_en":"Yu-Jen Chen","name_ch":"陳昱任","research_en":"Suitable Data Input for Deep-Learning-Based Sign-Language Recognition with a Small Training Dataset","resercher_intro":"","research_ch":"適用於少量訓練資料之深度學習手語辨識輸入組合","abstract_en":"Deep learning-based sign language recognition usually requires a large number of sign language videos to train neural network models. In this study, we consider generating effective sign language training data to help construct deep learning recognition models through feature extraction and expansion of training data when a smaller number of sign language videos are used for training. We use Mediapipe to obtain the hand skeleton from the sign language video, analyze several hand skeleton adjustment policies and color arrangement, and generate hand masks from the skeleton to simulate hands of different persons. Since the miss detection of hands may happen due to the motion blurring caused by rapid hand movements, we incorporate optical flows to ensure that the hand movement information is retained in each frame. We use different spatial and temporal processing strategies to simulate different hand sizes, different filming angles, and different hand speeds. The experimental results show that the proposed approach is effective in improving the accuracy of sign language recognition in the American Sign Language dataset.\nIndex Terms - Sign Language Recognition, Feature Extraction, Deep Learning","abstract_ch":"基於深度學習的手語辨識通常需要大量視訊來訓練神經網路模型,本研究考量在手語視訊較不足的情況下,透過特徵擷取及擴大訓練資料等方式,產生有效的手語訓練資料以協助建構深度學習辨識模型。我們利用Mediapipe嘗試由手語視訊中取得手部骨架,分析幾種手部骨架調整策略以及顏色安排,並由骨架產生手部遮罩以模擬生成不同人的手部型態。由於手部偵測有時會因手指快速移動的動態模糊導致失誤,我們因此結合光流圖以確保每張畫面保留手部移動資訊。我們將手部骨架、手部型態以及畫面光流作為3D-ResNet模型的三個通道輸入,採用不同的空間域變化與時間域採樣策略,模擬不同大小的手、不同拍攝角度、不同手速等情形。實驗結果顯示我們所提出的方式於美國手語資料集中可以有效提高辨識準確度。\n關鍵字 - 手語辨識、特徵擷取、深度學習","picture":"","personal_page":""},{"id":"69","phd":"0","class":"109","name_en":"Chun Tsao","name_ch":"曹鈞","research_en":"Quality Assessment of Image Retargeting based on Importance of Objects","resercher_intro":"","research_ch":"基於物件重要性程度之影像尺寸調整評估機制","abstract_en":"Many image retargeting methods have been proposed to resize images to\nfit in various sizes of display devices with less perceptual distortion. Assessing\nthe quality of retargeted images has thus become an important task for\ndeveloping such methods. In this research, we propose an image retargeting\nquality assessment (IRQA) based on importance of objects. We utilize\nsemantic segmentation to classify pixels, which are assigned with different\nimportance values representing the sensitivity of human eyes to distortion. 
A\nvisual saliency map is created to better fit the subjective perception of humans\nand is then used in the evaluation method called “Aspect Ratio Similarity”\n(ARS) to improve its accuracy. Furthermore, as observing that human eyes\ntend to be affected more by the global information loss in images in which\nthere is no obvious foreground object, we propose the strategy of information\nloss adjustment in such images. We first utilize semantic information to\ndetermine whether a foreground object exists and then adopt different degrees\nof information loss penalty to improve the accuracy of the assessment. The\nexperimental results show that the proposed approach is effective in\nevaluating the image retargeting methods and outperforms existing quality\nassessment methods.","abstract_ch":"為了將影像完整呈現於各種尺寸的輸出裝置,且盡量減少視覺上的\n扭曲變形,許多基於內容之影像尺寸調整機制被提出,如何有效地評估\n各種方法的效果成為一項重要任務。本研究提出一個基於物件重要程度\n的影像尺寸調整評估機制,透過語義分割方法將影像中的所有像素點分\n類,根據語義中的類別,給予該所在區域不同的視覺重要程度,依此做\n為人眼視覺對於該區域受破壞的敏感度衡量,希冀獲致更貼近使用者主\n觀感受的顯著圖,並將其應用於長寬比相似性畫質衡量演算法以提升準\n確度。我們另外觀察到人眼觀看無前景物影像時容易受到畫面整體資訊\n損失的影響,因此提出無明顯前景物資訊損失懲罰調整策略。我們先利\n用語義資訊判斷場景中有無明顯前景物,再給予不同大小級別的資訊損\n失懲罰,提高無明顯前景物場景的評分準確度。實驗結果顯示,本研究\n能有效評估影像尺寸調整機制,與現有方法相較有更高的準確度。","picture":"","personal_page":""},{"id":"70","phd":"0","class":"109","name_en":"Kuan-Chung Wang","name_ch":"王冠中","research_en":"Adversarial Perturbation against Deepfakes\r\nbased on Visual Perceptual Model","resercher_intro":"","research_ch":"基於視覺感知模型之深度偽造對抗性擾動","abstract_en":"The emergence of Deepfakes poses a serious threat to the authenticity of\r\ndigital videos. Recently, many studies have proposed methods for detecting and\r\nidentifying the presence of Deepfakes in videos. On the other hand, some\r\nresearchers adopted the approach of digital watermarking by embedding\r\nadversarial signals in public images to make the tampering results generated by\r\nDeepfake models deviate from their expected goals, so as to avoid producing\r\neffective falsified content. Most existing watermarking methods embedded such\r\nadversarial signals in the pixel domain. However, in order to prevent the quality\r\nof original image from being damaged by overly strong watermark signals,\r\nmaking large changes to the pixel values is not feasible. In this research, we\r\npropose to embed the adversarial watermark signals in the frequency domain of\r\nimages. After converting the image from RGB color channels to YUV channels,\r\nthe DCT (Discrete Cosine Transform) is applied on each channel. The Watson’s\r\nperception model is employed to determine the maximum possible change of DCT\r\ncoefficients to ensure that the modification won’t be noticed by the human eyes.\r\nThe perceptual mask is also used to determine the modification step size of the\r\nwatermark in the training stage. 
The experimental results show that embedding\r\nsuch stronger watermarking signals can introduce more severe distortions on the\r\nimage generated by the Deepfake models.\r\n\r\nKeywords: Deepfakes, adversarial watermark, deep learning","abstract_ch":"深度偽造技術的出現對於數位視訊真實性帶來很大的威脅,近期許多研\r\n究針對深度偽造內容是否存在於視訊中發表相關的偵測與辨識方法,另也\r\n有研究學者提出在公開的影像中嵌入所謂對抗性浮水印,試圖使深偽模型\r\n所生成的竄改影像內容偏離預期結果,避免產生有效的竄改內容。現有的浮\r\n水印方法多於像素域中加入這種對抗性訊號,然而為了避免過強的浮水印\r\n訊號損及原影像畫質,無法在像素值施予較大幅度的改變。本研究提出於影\r\n像頻率域中嵌入對抗性浮水印,將影像轉換至亮度及色度空間後計算離散\r\n餘弦轉換(Discrete Cosine Transform, DCT),透過 Watson 感知模型計算在不\r\n被人眼察覺下,確保 DCT 係數的修改低於可能的最大改變量,並依此決定\r\n浮水印在訓練階段時的修改步長。實驗結果顯示,所加入的高強度浮水印訊\r\n號確實能使深偽模型所生成的影像更容易發生嚴重失真,同時藉由計算影\r\n像畫質衡量來證實這樣的方法與像素值嵌入方法相比可有效降低對於原影\r\n像畫質的破壞。\r\n\r\n關鍵字 – 深度偽造、對抗性浮水印、深度學習","picture":"","personal_page":""},{"id":"62","phd":"0","class":"108","name_en":"Sin-Wun Syu","name_ch":"許馨文","research_en":"Traditional Chinese Scene Text\r\nRecognition Strategies based on Deep\r\nLearning Networks","resercher_intro":"","research_ch":"基於深度學習網路之繁體中文場景文字\r\n辨識策略","abstract_en":"Text recognition is an important task for extracting information from imagery\r\ndata. Scene text recognition is one of its challenging scenarios since the texts\r\nappearing in natural scenes may have diversified fonts or size, be occluded by\r\nother objects and be captured from varying angles or under different light conditions. In contrast to alpha­numerical characters, Traditional Chinese Characters (TCC) receive less attention and the large number of TCC makes it difficult to collect and label enough scene­text images. This research aims at\r\ndeveloping a set of strategies for TCC recognition. We develop a synthetic\r\ndataset using a variety of data augmentation methods, including text deformations, noise adding and background changes, which appear often in natural\r\nscenes. A segmentation­based text spotting scheme is used to locate the areas of text­lines and characters so that the characters can be recognized by the\r\ntrained model and then linked into meaning text­lines. The text­lines can be\r\ncorrected via network search, which will further boost the model performance\r\nafter re­training. The experimental results show that the proposed strategies\r\nwork better in recognizing TCC in natural scenes, when compared with existing publicly available tools.\r\nIndex Terms—Deep Learning, Optical Character Recognition, Scene Text Recognition, Text­line Correction","abstract_ch":" 文字辨識是一個從圖像中提取文字特徵的影像辨識任務,目前也有許多相關的\r\n應用場景,例如:印刷文件文字辨識、手寫字辨識、車牌辨識等。相較於針對文件\r\n掃描的文字識別,自然場景中的文字因為多樣化的字型、角度、光線變化以及障礙\r\n物遮擋等,增加了文字辨識的挑戰性。繁體中文自然場景文字識別的相關研究目前\r\n較為少見,主因是僅台灣廣泛地使用繁體中文字,且相較於英數字,中文字元種類\r\n數量龐大,蒐集足夠數量的街景文字圖片十分困難,影像標記也非常耗時。本研究\r\n使用多種字型檔產生人工資料集,並針對街景文字場景設計多種資料增強方法,包\r\n括調整文字大小、傾斜角度、背景紋理變化以及文字輪廓外框等,於訓練過程中策\r\n略性隨機調用,期使人工資料集達到模擬真實街景影像的效果,不僅增強資料的可\r\n靠性,也解決了資料類別不平衡、以及可能的標記錯誤。本研究提出基於深度學習\r\n網路的繁體中文字辨識策略,並且設計文字串校正機制,針對字串中少部分文字辨\r\n識錯誤的情況,使用校正方法來提升文字串的整體辨識準確度。實驗結果顯示,本\r\n研究能有效識別自然場景中的繁體中文字,與現有方法評比擁有更佳的準確度。\r\n關鍵字— 深度學習、光學字元辨識、繁體中文字辨識、字串校正","picture":"","personal_page":"http:\/\/msp.csie.ncu.edu.tw\/ssw\/"},{"id":"63","phd":"0","class":"108","name_en":"Chi-Hsuan Huang","name_ch":"黃啟軒","research_en":"Using Synthetic Data to Construct Deep Learning Datasets for Air-Writing Applications","resercher_intro":"","research_ch":"利用虛擬資料建構深度學習訓練集以實現凌空書寫應用\u000b","abstract_en":"Air-writing is a novel input way for applications in human-computer\r\ninteraction. 
By the real-time fingertip detection in the screen captured by the\r\ncamera, the coordinates of fingertip are formed into trajectory, and then recognize\r\nthe word represented by the trajectory. This method can be used as a text input\r\nmethod such as smart glasses, and touchless writing is also useful in some\r\nhygiene-sensitive fields, for example, to prevent users from being infected with\r\nviruses such as COVID-19 due to contact with the device. This research aims to\r\npropose first-person and third-person air-writing techniques based on deep\r\nlearning. First, since deep learning technology relies on a large amount of labeled\r\ndata, we choose to use Unity3D to establish our training dataset. We synthesize\r\nthe hand model into a random natural background or a single-color background\r\nto efficiently and quickly generate accurately labeled synthetic data. We added\r\ndifferent skin colors, skin materials, light changes and motion blurriness to the\r\nhand model to increase the diversity, and then simulate the rotation and movement\r\nof the hands in various writing situations. We use the object detection model to\r\ndetect the position of the fingertip to form a text trajectory and delete the\r\nredundant strokes generated during the writing process, so that the trajectory is\r\ncloser to the text itself. Finally, we use the handwritten dataset and artificially\r\ngenerated various printed characters for training, and use the classification model\r\nResNeSt to recognize nearly 5000 Chinese characters. The experimental results\r\nshow that the large amount of high quality labeled synthetic data we generate can\r\neffectively train the model and realize the real-time air-writing mechanism.\r\nIndex Terms – fingertip detection, air-writing, synthetic datasets, traditional\r\nChinese character recognition.","abstract_ch":" 凌空書寫是一項新穎的人機互動輸入方式,使用者自然地在空中書寫想\r\n要輸入於若干機器或設備的文字,藉由攝影機所拍攝的畫面中進行即時指\r\n尖偵測,將指尖座標點形成軌跡,進而辨識該軌跡所代表的文字。凌空書寫\r\n可做為如智慧型眼鏡的文字輸入方法,非接觸式的書寫方式也能使用於若\r\n干衛生敏感場域,例如降低在醫院的使用者因接觸設備而感染病毒的風險。\r\n本研究旨在提出基於深度學習之第一人稱以及第三人稱凌空書寫技術。由\r\n於深度學習技術的使用需仰賴大量標記資料,我們選擇以 Unity3D 建立訓\r\n練資料集,將所建構的手部虛擬模型合成於隨機的自然背景或單一顏色背\r\n景中,藉此有效且快速地生成標記精準合成資料。我們利用手部模型的改變,\r\n模擬書寫過程中的旋轉以及移動來增加資料多樣性。在較複雜得第三人稱\r\n場景中,我們更加入隨機變換的人臉以及人體軀幹讓虛擬資料更接近真實\r\n情況。我們利用物件偵測模型偵測指尖位置以形成文字軌跡,並刪除書寫過\r\n程中所產生的冗餘筆跡,讓處理後筆跡更貼近文字本身。我們結合手寫字與\r\n印刷字形成綜合資料集訓練文字辨識模型,採用ResNeSt架構來辨識近5000\r\n個中文字。實驗結果顯示我們所產生的大量且精準標記合成資料可有效訓\r\n練模型,協助實現包括第一與第三人稱的即時凌空書寫。\r\n關鍵字 – 指尖偵測、 凌空書寫、 合成資料、 繁體中文文字辨識","picture":"","personal_page":"http:\/\/msp.csie.ncu.edu.tw\/hch\/"},{"id":"64","phd":"0","class":"108","name_en":"Jun-Ming Wong","name_ch":"翁浚銘","research_en":"Using a Small Video Dataset to Construct\r\na Taiwanese-Sign-Language Word\r\nClassification Model","resercher_intro":"","research_ch":"以少量視訊建構台灣手語詞分類模型","abstract_en":"Sign languages (SL) are visual languages that use shapes of hands,\r\nmovements, and even facial expressions to convey information, acting\r\nas the primary communication tool for hearing-impaired people. Sign\r\nlanguage recognition (SLR) based on deep learning technologies has attracted much attention in recent years. Nevertheless, training neural\r\nnetworks requires a massive number of SL videos. Their preparation process is time-consuming and cumbersome. This research proposes using\r\na set of SL videos to build effective training data for the classification\r\nof Taiwanese Sign Language (TSL) vocabulary. First, we begin with a\r\nseries of TSL teaching videos from the video-sharing platform. 
Then,\r\nMask-RCNN[1] is employed to extract the segmentation masks of hands\r\nand faces in all video frames. Next, spatial-domain data augmentation is\r\napplied to create the training sets with different contents. Varying temporal domain sampling strategies are also employed to simulate the speeds\r\nof different signers. Finally, the attention-based 3D-ResNet trained by\r\nthe synthetic dataset is used to classify a variety of TSL vocabulary. The\r\nexperimental results show the promising performance and the feasibility\r\nof the approach for SLR.\r\nIndex Terms— Taiwanese sign language, pose estimation, gesture recognition, pattern recognition, video classification","abstract_ch":"手語是一種視覺語言,利用手形、動作,甚至面部表情傳達訊息以作為聽障人\r\n士主要的溝通工具。以深度學習技術進行手語辨識在近年來受到矚目,然而神經網\r\n路訓練資料需仰賴大量手語視訊,其製作過程頗費時繁瑣。本研究提出利用單一手\r\n語視訊建構深度學習訓練資料的方法,實現在視訊畫面中辨識台灣手語詞彙。\r\n首先,我們由視訊共享平台中取得一系列手語教學視訊,透過 Mask RCNN[1]\r\n找出所有教學畫面中的手部和面部分割遮罩,再透過空間域數據增強來創建更多不\r\n同內容的訓練集。我們也採用不同的時間域採樣策略,模擬不同手譯員的速度。最\r\n後我們以具注意力機制的 3D-ResNet 對多種台灣手語辭彙進行分類,實驗結果顯\r\n示,我們所產生的合成資料集能在手語辭彙辨識上帶來幫助。\r\n關鍵字— 台灣手語、動作評估、手勢識別、行為識別、影片分類","picture":"","personal_page":""},{"id":"65","phd":"0","class":"108","name_en":"Yu-Hong Hou","name_ch":"侯昱宏","research_en":"Exploiting Distance to Boundary for\nSegmentation-based Scene-Text Spotting\n","resercher_intro":"","research_ch":"利用邊界距離\r\n改進裁切式場景文字偵測\r\n","abstract_en":"Scene text spotting helps to locate regions of interest in images as texts inside pictures often provide abundant information. Many existing schemes adopted the segmentation-based methodology, which classifies each pixel as a specific type, usually text or background. Major advantages of pixel prediction include ease of implementation, good performance and flexibility. However, appropriately separating words in such schemes remains a challenging issue.\r\nThis research investigates the use of distance to boundary for partitioning texts to achieve more accurate scene text spotting. The proposed scheme can be used to extract single characters, words, text-lines or objects with similar textures. It is also applicable to detecting texts bounded by rectangles, quadrilaterals or boxes with arbitrary shapes. The labeling process is relatively efficient. The issues of network architecture, categorical imbalance and post-processing are discussed. The experimental results demonstrate the feasibility of the proposed design, which can help to improve segmentation-based scene-text spotting approaches.\r\n\r\nIndex Terms – Deep learning, scene text spotting, semantic segmentation ","abstract_ch":"由於影像中的文字提供了豐富的資訊,場景文字定位有助於擷取影像中的感興趣區域。現今許多場景文字定位方法採用基於裁切的像素預測方式,即將每個像素分類為特定類型,經常是文字類別與背景類別,再將屬於文字的像素聚集成需要偵測的文字區域。像素預測方式的優點包括易於實現、良好的性能以及應用的靈活性。然而,自然場景中的文字有著不同大小、形狀及顏色,要正確地分離文句仍是具有挑戰的議題。本研究提出運用邊界距離的方式來協助分割文字像素,以達成更精確的場景文字定位。我們的方法可用於提取單一字元、單詞、文字串或具有相似紋理的圖案,同時也適用於檢測以矩形、四邊形或任意形狀包圍的文字框。此外,文字標記的過程相比於其他方法亦更為簡便。我們探討了網路架構、分類不平衡與後處理等議題。實驗結果顯示此設計的可行性,證實其有助於改進基於裁切的場景文字定位方法。\n\n\n\n關鍵字 – 深度學習、街景文字定位、語義分割。\n","picture":"","personal_page":""},{"id":"58","phd":"0","class":"107","name_en":"Kung-Yu Su","name_ch":"蘇冠宇","research_en":"Traditional Chinese Scene Text Recognition based on Attention-Residual Network","resercher_intro":"","research_ch":"基於注意力殘差網路之\r\n繁體中文街景文字辨識\r\n","abstract_en":"Texts in natural scenes, especially street views, usually contain rich information related to the images. 
Although recognition of scanned documents has been well studied, scene text recognition is still a challenging task due to variable text fonts, inconsistent lighting conditions, different text orientations, background noises, angle of camera shooting and possible image distortions. This research aims at developing effective Traditional Chinese recognition scheme for streetscape based on deep learning techniques. It should be noted that constructing a suitable training dataset is an essential step and will affect the recognition performance significantly. However, the large alphabet size of Chinese characters is certainly an issue, which may cause the so-called data imbalance problem when collecting corresponding images. In the proposed scheme, a synthetic dataset with automatic labeling is constructed using several fonts and data augmentation. In an investigated image, the potential regions of characters and text-lines are located. For the located single characters, the possibly skewed images are rectified by the spatial transform network to enhance the performance. Next, the proposed attention-residual network improves the recognition accuracy in this large-scale classification. Finally, the recognized characters are combined using detected text-lines and corrected by the information from Google Place API with the location information. The experimental results show that the proposed scheme can correctly extract the texts from the selected areas in investigated images. The recognition performance is superior to Line OCR and Google Vision in complex street scenes.\r\n\r\nIndex Terms – scene text recognition, scene text detection, synthetic data\r\n","abstract_ch":"街景招牌文字經常傳達豐富的資訊,若能經由視覺技術辨識這些影像中的文字將有利於許多相關應用的開發。儘管電腦視覺於光學文本辨識已有相當成熟的技術,但自然場景文字辨識仍是非常具有挑戰性的任務。除了更多樣的字體、文字大小、與使用者拍攝角度等因素外,繁體中文字訓練資料目前仍不多見,眾多中文字也很難平均地蒐集相對應的照片,即使蒐集了足夠資料也會面臨數據不平衡問題。因此,本研究使用數種繁體中文字體產生高品質訓練影像及標記資料,模擬街景上複雜的文字變化,同時避免人工標記可能造成的誤差。除此之外,本文中亦探討如何使人工生成繁體文字影像更貼近街景真實文字,透過調整光線明亮度、幾何轉換、增加外框輪廓等方式產生多樣化訓練資料以增強模型的可靠性。對於文字偵測及辨識,我們採用兩階段演算法。首先我們採用Deep Lab模型以語意分割方式偵測街景中的單字與文本行所在區域,接著使用STN (Spatial Transformer Network) 修正偵測階段所框列的傾斜文字以利後續辨識階段的特徵提取。我們改良了ResNet50 模型,透過注意力機制改善模型在大型分類任務中的準確率。最後,我們透過使用者的GPS資訊與Google Place API中的地點資訊進行交叉比對,藉此驗證與修正模型輸出文字,增強街景文字的辨識能力。實驗結果顯示本研究能有效偵測及辨識繁體中文街景文字,並在複雜街景測試下表現優於Line OCR及Google Vision。\r\n\r\n關鍵字 – 電腦視覺、深度學習、街景文字偵測、繁體中文字辨識\r\n","picture":"","personal_page":""},{"id":"59","phd":"0","class":"107","name_en":"Yu-Jung Chen","name_ch":"陳宥榕","research_en":"Hand Feature Extraction and Gesture Recognition\r\nfor Taiwan Sign Language by Using Synthetic\r\nDatasets","resercher_intro":"","research_ch":"使用虛擬合成資料實現臺灣手語\r\n特徵擷取暨手型辨識","abstract_en":"Hearing-impaired people rely on sign languages to communicate with each\r\nother but may have problems interacting with the persons who may not understand\r\nsign languages. Since sign languages belong to a type of visual languages,\r\ncomputer vision approaches to recognizing sign languages are usually considered\r\nfeasible to bridge the gap. However, recognition of sign languages is a complex\r\ntask, which requires classifying hand shapes, hand motions and facial expressions.\r\nThe detection and classification of hand gestures should be the first step because\r\nhands are the most important elements. This research thus focuses on hand feature\r\nextraction and gesture recognition for Taiwan Sign Language (TSL) videos.\r\nFirst, we established a synthetic dataset by using Unity3D. 
The advantage of\r\nusing synthetic data is to reduce the effort of manual labeling and to avoid possible\r\nerrors. A large dataset with high quality labeling can thus be achieved. The dataset\r\nis generated by changing hand shapes, colors and orientations. The background\r\nimages are also changed to increase the robustness of the model. Motion\r\nblurriness is also added to make the synthetic data look closers to real cases. We\r\ncompare three feature extractions: bounding boxes, semantic segmentation\r\ngenerated by the ResNeSt+Detectron2 and the heatmap generated by the\r\nEfficientDet. The bounding boxes are selected for the subsequent gesture\r\nrecognition. We also employ Unity3D to create several basic sign gestures for\r\nTSL, and then use ResNeSt for classification and recognition.\r\nExperimental results demonstrate that the synthetic dataset can effectively\r\nhelp to train the suitable models for hand feature extraction and gesture\r\nrecognition in TSL videos.\r\nIndex Terms - synthetic datasets, Taiwanese sign language, feature extraction,\r\ngesture recognition.","abstract_ch":"本研究針對臺灣手語視訊進行手部特徵擷取暨手型辨識。首先,我們\r\n以 Unity3D 建立訓練資料集,利用 3D 手部模型合成於自然場景、人物場景\r\n及純色背景中,快速地且大量地產生高品質訓練資料,其中包含手部影像、\r\n手部輪廓、手部關節點。透過合成資料的使用,可以減少人工標記所可能產\r\n生的負擔與誤差。我們討論如何讓人工合成影像更貼近實際影像,藉由調整\r\n背景複雜度、膚色多樣性及加入動態模糊等方式產生多樣化影像以增加模\r\n型可靠度。接著,我們比較利用 ResNeSt+ Detectron2 模型產生的邊界框\r\n(bounding box)和語義分割(semantic segmentation)、以及改良 EfficientDet 模\r\n型所產生之熱圖(heatmap)的完整性後,最終我們使用邊界框作為手型辨識\r\n的特徵擷取,利用邊界框切出手語視訊中的手部影像進行手型辨識。我們同\r\n樣以 Unity3D 建立訓練資料集,利用 3D 手部模型製作數個臺灣手語基本手\r\n型,再利用 ResNeSt 進行分類辨識。實驗結果顯示本研究所產生的大量且\r\n高品質虛擬合成資料能有效的應用於手部特徵擷取,及臺灣手語之手型辨\r\n識。\r\n關鍵字 – 虛擬合成資料、臺灣手語、 特徵擷取、 手型辨識。","picture":"","personal_page":""},{"id":"60","phd":"0","class":"107","name_en":"Pei-Han Kao","name_ch":"高珮涵","research_en":"Steganalysis in Digital Images based\r\non Dual Path Networks","resercher_intro":"","research_ch":"基於雙路徑網路之影像隱寫分析","abstract_en":"Steganography is a technique to embed a large amount of information in such \r\ncarriers as images, audio, videos and even texts to achieve effective secret \r\ncommunications. On the other hand, steganalysis is the adversarial technique \r\naimed at determining whether the investigated carriers contain hidden information. \r\nIn the field of steganalysis, heuristic features were usually adopted. Recently deep \r\nlearning techniques are often employed but most existing methods still use certain \r\nhigh-pass filters to apply pre-processing. In this research, we focus on image \r\nsteganalysis and adopt the dual path networks (DPN) to achieve an end-to-end \r\narchitecture. The proposed scheme uses ResNet to extract features, and then \r\nemploys DenseNet to extract deeper and smaller features. It combines the \r\nadvantages of both networks to form a DPN blocks with shared weights. The \r\nscheme uses the group convolution to reduce the amount of computation. Finally, \r\ndual path blocks with different parameters are tested to build suitable steganalysis\r\narchitectures. SRNet, which uses ResNet, performs quite well in image \r\nsteganalysis. We first replace its ResNet blocks with DPN blocks for comparison. \r\nThe detection accuracy is improved and confirms that the structure using DPN is \r\nhelpful to steganalysis. We then use DPN blocks to build our architecture and then \r\ncompare the performance with the existing steganalysis architectures. 
Finally, we \r\nuse the ALASKA II dataset to verify the feasibility of the proposed scheme.\r\nIndex Terms - Steganalysis, steganography, deep learning, convolutional \r\nneural networks.\r\n","abstract_ch":"隱寫術(Steganography)是將若干影像甚至音視訊等資料做為載體,再把\r\n大量機密資訊嵌入於其中以達成秘密通訊的效果,而隱寫分析(Steganalysis)\r\n則是偵測可能的載體以確認秘密資訊是否藏於其中。關於隱寫分析,以往多\r\n採用人工設計之特徵擷取,但需要耗費較多人力以及依賴相關研究經驗,近\r\n期則以深度學習技術為主,但也多使用指定的濾波器對待測資料進行人工\r\n預處理,無法達成完全的自動學習。本研究主要為影像隱寫分析,使用雙路\r\n徑卷積神經網路(Dual Path Networks, DPN)達成端到端(end-to-end)架構,以\r\nResNet 擷取特徵,再以 DenseNet 提取更深層且細微的特徵,結合兩者的優\r\n勢,組成權值共享的雙通道區塊(DPN blocks),並採用 ResNeXt 的分組卷積\r\n降低計算量,使用不同參數的雙通道區塊組合以利隱寫分析。SRNet 為目前\r\n隱寫分析模型中效果較為優異者,當中採用了 ResNet 作為特徵擷取,我們\r\n將其替換為雙通道區塊進行比較,偵測準確度有所提升,也證實了 DPN 有\r\n助於隱寫特徵的擷取。接著我們將整體架構改為以 DPN 為主,與以往的隱\r\n寫分析架構不同,並與這些架構比較以彰顯所提出架構的可行性。\r\n關鍵字:隱寫分析、隱寫術、深度學習、卷積神經網路。\r\n","picture":"","personal_page":""},{"id":"61","phd":"0","class":"107","name_en":"Hsin-Tzu Wang","name_ch":"王心慈","research_en":"Evaluating Image Block Consistency by\r\nDeep Learning for Locating Forgery Areas","resercher_intro":"","research_ch":"基於深度學習影像區塊一致性衡量之\r\n竄改區域偵測","abstract_en":"Identifying the type of a camera used to capture an investigated image is a useful image\r\nforensic tool, which usually employs machine learning or deep learning techniques to train\r\nvarious models for effective forgery detection. In this research, we propose a forensic scheme\r\nto detect and even locate image manipulations based on deep-learning-based camera model\r\nidentification. Since the ways of tampering images are very diverse, it’s difficult to collect\r\nenough tampered images for supervised learning. The proposed method avoids using tampered\r\nimages of various kinds as the training data but employ the information of original pictures. We\r\nfirst train a convolutional neural network to acquire generic features for identifying camera\r\nmodels. Next, the similarity measurement using the Siamese network to evaluate the\r\nconsistency of image block pairs is employed to locate approximate tampered areas. Finally,\r\nwe refine the tampering areas through a segmentation network.\r\nThe main contributions of this research include: (1) extending the study of image region\r\nconsistency to image forensics applications, (2) designing a better block comparison algorithm,\r\nand (3) improving the accuracy of detected tampered regions. We test the proposed methods\r\nusing public tempered image database and our own data to verify their feasibility. 
The results\r\nalso show that the proposed scheme outperforms existing ones in locating tampered areas.\r\nKeywords: Image forensics, deep learning, Siamese network, image segmentation","abstract_ch":"由於數位相機與智慧型手機的普及,人們可以輕易地取得各式高解析度數位影像,\r\n而便利的相片編修工具讓幾乎所有的使用者都能自行修改數位影像,這也意味著數位影\r\n像內容有可能受到有心人士的竄改,並將其網路或社群網站中散布,更改的影像不僅混\r\n淆視聽,更可能被作為操縱輿論的工具。然而,目前對於多樣化的影像竄改方式仍無完\r\n善應對的方法,數位影像內容的真實性因此受到若干質疑。\r\n在影像鑑識領域中,一個重要的分支為來源相機模型的辨識,本研究以相機模型辨\r\n識為基礎,提出可運用於偵測各種影像畫面竄改的影像鑑識架構。所提出的方法不需要\r\n使用竄改影像作為訓練資料,而是採用原始影像或相片自身資訊,透過卷積神經網路,\r\n設計能夠學習相機模型的通用特徵提取器,接著運用孿生網路來學習比較兩個圖片區塊\r\n是否具備一致性,再根據比較結果選取適當的竄改區域,接著透過前景提取技術精修竄\r\n改區域。本研究的主要貢獻為 (1) 將影像區域一致性的研究延伸至影像鑑識應用、 (2)\r\n設計更好的區塊比較模式、 (3) 改善竄改區域準確度。實驗結果證實本機制的實用性,\r\n並與現有方法的評比中取得最好的效果。","picture":"","personal_page":""},{"id":"54","phd":"0","class":"106","name_en":"Guan-Xin Zeng","name_ch":"曾冠鑫","research_en":"A Fully Convolutional Text-Line Extraction Network with Connectionist Refined Proposals","resercher_intro":"","research_ch":"一個結合連接區域精修之全卷積文字串擷取網路","abstract_en":"Texts appearing in images are often regions of interest and locating such areas for further analysis may help to extract image-related information and facilitate many interesting applications. Pixel-based segmentation and region-based object classification are two methodologies for locating text areas in images and have their own pros and cons. In this research, we proposed a text detection scheme consisting a main pixel-based classification network and a supplemented region proposal network. The main network is a Fully Convolutional Network (FCN) employing Feature Pyramid Network (FPN) and Atrous Spatial Pyramid Pooling (ASPPP) to identify text areas and borders with higher recall. Certain areas are further processed by the supplemented refinement network, i.e., a simplified Connectionist Text Proposal Network (CTPN) with higher precision. Non-Maximum Suppression (NMS) is then applied to form suitable text-lines. The experimental results show feasibility of the proposed text-detection scheme.Index Terms – text detection, street view, fully convolutional network, region \r\nproposal network ","abstract_ch":"影像中的文字為重要的感興趣區域(regions of interest),在影像中定位文字供後續處理能夠幫助該影像相關資訊的擷取,並有利於許多有趣應用的開發。近年來語義分割和通用物件檢測框架技術已被文字偵測任務所廣泛採用,兩者在實作中有各自的優勢與缺點。本研究提出結合兩者優點的文字偵測機制,其中包含一個主要文字串偵測網路輔以一個文字精修網路。主要網路利用語意分割的方式並搭配FPN (Feature Pyramid Network)與ASPP (Atrous Spatial Pyramid Pooling)等技術,強化特徵提取效果,藉此偵測文字區域與邊框,將其視為主要結果且具備高召回率。我們接著使用以區域檢測框架為基礎的精修網路再次分析可能的文字區域,將主要結果中較不確定區域以精修網路協助判斷,最後再使用非極大值抑制技術(Non-Maximum Suppression, NMS)得到最終的文字區域偵測結果。實驗結果顯示本研究能有效的在複雜場景中偵測文字,並藉此探討兩種不同架構的深度學習網路在目標應用中的使用方式。關鍵字 – 文字偵測、街景、全卷積神經網路、區域候選網絡。","picture":"","personal_page":""},{"id":"55","phd":"0","class":"106","name_en":"Yung-Han Chen","name_ch":"陳永瀚","research_en":"Egocentric-View Real-Time Fingertip Detection based on Regional Convolutional Neural Networks","resercher_intro":"","research_ch":"基於區域卷積神經網路之第一人稱視角即時手指偵測\n","abstract_en":"This research investigates real-time fingertip detection in RGB images\/frames captured from such wearable devices as smart glasses. First, we established a synthetic dataset by using Unity3D and focused on the pointing gesture for egocentric view. The advantage of synthetic data is to avoid manual labeling errors and provide a large benchmark dataset with high quality. We discuss the dataset generation and how to produce the images in a natural way. 
Second, a modified Mask Regional Convolution Neural Network (Mask R-CNN) is proposed with one region-based CNN for hand detection and another three-layer CNN for locating the fingertip. We employ MobileNetV2 as the backbone network and simplify the number of bottleneck layers to avoid redundant features. Moreover, we improve the accuracy of detecting small objects by employing FPN and RoIAlign. We achieve fingertip detection with 25 milliseconds per frame for the 640×480 resolution by GPU and average 8.31 pixel errors. The processing speed is high enough to facilitate several interesting applications. One application is to trace the location of a user’s fingertip from first-person perspective to form writing trajectories. A text input mechanism for smart glasses can thus be implemented to enable a user to write letters\/characters in air as the input and even interact with the system using simple gestures. Experimental results demonstrate the feasibility of this new text input methodology.Index Terms – fingertip, smart glasses, region proposal network, air-writing","abstract_ch":"本研究針對第一人稱視角 RGB 影像,進行手指指尖即時偵測,並依此於智慧型眼鏡中實作空中手寫輸入的應用。首先,我們以 Unity3D 建立訓練資料集,即利用 3D 手部模型合成於自然場景中以快速地產生大量且高品質的訓練影像與標記資料,同時避免人工標記所可能產生的誤差。我們討論如何讓人工合成影像更貼近實際影像,並利用包含調整背景複雜度、光線明亮度、色彩對比等方式產生多樣化的影像以增加模型的可靠度。接著,我們改良 Mask R-CNN 模型,藉由簡化特徵提取網路,以及改善網路模型對於偵測小物件的適應性,讓所提出的模型在精準度或速度上都為該領域最佳,在 640×480 的 RGB 影像上進行手指偵測,平均像素誤差僅 8.31 像素點,處理畫幀速度達到每秒 38.8 張。最後我們整合手指偵測網路模型於智慧型眼鏡中,以手指指尖移動軌跡作為手寫輸入,再利用 Google Input API 辨識文字以回傳候選字給智慧型眼鏡使用者選擇,建立適用於智慧型眼鏡的新互動輸入法。關鍵字 – 手指偵測、智慧型眼鏡應用、區域卷積神經網路、空中手寫。","picture":"","personal_page":"http:\/\/msp.csie.ncu.edu.tw\/cyh\/"},{"id":"56","phd":"0","class":"106","name_en":"Dai-Yan Wei","name_ch":"韋岱延","research_en":"Content-Based Multi-Operator Retargeting and Its Quality Evaluation ","resercher_intro":"","research_ch":"基於內容分析之多運算子畫面尺寸調整與品質衡量機制","abstract_en":"This research proposes a content-based multi-operator image retargeting scheme, enabling the retargeted images to preserve its content after adaptation in various displays. Besides, a quality evaluation model is also proposed to compare original images and retargeted images. The proposed multi-operator retargeting scheme is termed “SCAN” as it contains Seam caving, Cropping, Adding seams and Normalization (scaling). This research mainly concentrates on improving the step of content-based cropping in SCAN. We classify images into two categories via foreground detection and adopt different types of visual saliency to determine appropriate cropping limits. The face detection is also introduced to protect face areas appearing at the edges of an image from being removed. A building detection mechanism is employed to determine whether a building in an image is significant or not. The experimental shows that the improved multi-operator retargeting scheme can effectively preserve the content and objects’ shape when dealing with various images. In the proposed quality evaluation model, we make use of SIFT Flow to compare the contents of original and retargeted images and identify possible geometric distortion and line distortion. We further consider salient objects and image semantics in the evaluation process. With these attributes, we utilize the neural network regression model to determine the weights of every feature in order to fit the Mean Opinion Score (MOS). The results show that the proposed model is closer to MOS than other evaluation methods. 
Keyword ─ Multi-Operators, Foreground Detection, Retarget, Quality Evaluation, SIFT Flow, Line Distortion, Geometric Distortion, Regression Analysis. ","abstract_ch":"本論文研究提出基於畫面內容之多運算子影像尺寸調整機制,希望在顯示畫面於不同輸出設備時仍能保持畫質,本研究亦提出適用於此應用的畫質衡量模型,合理評估原始影像與修改後影像的差異。首先我們改良多運算子畫面調整機制 SCAN,它包含了圖縫裁減(Seam carving)、邊緣裁切(Cropping)、增加圖縫(Add seams)與畫面縮放(Normalization)。本研究主要改善邊緣裁切步驟,透過前景物偵測將影像分類,根據類別及畫面中的物體以不同的視覺顯著圖決定適當裁切位置。此外,我們加入人臉與建築物偵測,避免出現於畫面邊緣的人臉可能遭受不當裁切,並判斷建築物是否為畫面重要內容。實驗結果顯示所提出的改良式多運算子畫面調整機制在各式影像中能有效維持內容完整。在畫質衡量模型中,我們利用 SIFT Flow 比較原始影像及濃縮影像的內容差異,考量可能出現的幾何扭曲及線段扭曲,根據畫面顯著物及語意相關程度,以類神經網路迴歸分析找出平均意見分數(MOS)對每種屬性的依據,進而得到更貼近於人眼主觀感受的評估。實驗結果顯示,與其他評估方法相較,我們所提出的模型更貼近於 MOS的結果。關鍵字 ─ 多運算子畫面調整機制、前景物偵測、濃縮影像品質衡量、SIFT Flow、線段扭曲、幾何扭曲、迴歸分析 ","picture":"","personal_page":""},{"id":"57","phd":"0","class":"106","name_en":"Jia-Ming Hu","name_ch":"胡家銘","research_en":"Steganalysis in JPEG Images based on Densely Connected Convolutional Neural","resercher_intro":"","research_ch":"基於密集連接卷積神經網路之\r\nJPEG影像隱寫分析","abstract_en":"Steganography is a technique for hiding large amounts of data in such carriers as images and videos, and steganalysis is a technique to determine whether additional information is hidden in the carrier. In this study, deep learning is used to train a densely connected convolutional neural networks (CNN) to design a new steganalysis architecture for JPEG image steganography. The model training does not require manual preprocessing and can automatically learn effective features. In the feature extraction, the module does not use pooling operations to avoid suppressing steganographic signals. The detection is targeted at three steganographic schemes and a “multi-mode” decision combining single models is proposed The experimental results show that the proposed model outperforms the state-of-the-art approaches, including the most advanced steganalysis model - SRNet.","abstract_ch":"隱寫術(Steganography)是將大量資料隱藏於如影像、視訊等載體之技術,而隱寫分析(Steganalysis)為隱寫術的對抗技術,用於判斷載體中是否隱藏額外資訊。本研究訓練一個由密集連結卷積神經網路(Convolutional Neural Networks,CNN)模塊為主體的JPEG影像隱寫分析架構。本研究的模型訓練不需人工預處理,可自動學習有效特徵。在特徵提取的部分,本機制不使用池化操作以避免抑制隱寫訊號,偵測則以三種JPEG隱寫術為對象。此外,我們設計所謂「多模式決策方法」,結合多個單一模型一起檢測,並以原先的個別單一訓練模型比較。實驗結果顯示我們的模型幾乎都超越了比較的目標,包括目前最先進的隱寫分析模型-SRNet。\r\n\r\n關鍵詞:隱寫分析、隱寫術、卷積神經網路。","picture":"","personal_page":""},{"id":"50","phd":"0","class":"105","name_en":"Chen-Kuang Hsieh","name_ch":"謝鎮光","research_en":"Detecting Shot Manipulation in H.264\/AVC Videos by Analyzing the Changes of Macroblock Coding Modes","resercher_intro":"","research_ch":"基於分析宏區塊編碼模式變化之H.264\/AVC視訊畫面異常操作偵測","abstract_en":"The purpose of this research is to determine if an investigated H.264\/AVC video stream has suffered such manipulations as shot deletion, insertion or replacement. The scheme is designed based on the information of H.264\/AVC de-blocking filter, in which the boundary strength (BS) is utilized for analyzing the coding modes of macroblocks. We first employ the BS values to form two feature maps, PRG (Prediction Residual Graph) and IPG (Inter-Prediction Graph). We then further extract two features based on the two feature maps, including VRF (Variation of Residual Footprint) and DOF (Degree of Fragments) to deal with different anti-detection methods or varying coding scenarios. The distances between the detected peaks from VRF and DOF are used to acquire the original GOP (Group of Pictures) size. Finally, with the detected original GOP size and the identified peaks, we can observe the abnormal periodic phenomenon and locate the exact positions of shot manipulation. 
Several tests and a state-of-the-art anti-detection method that modifies quantized coefficients and coding modes are applied in the experiments to verify the robustness of the proposed scheme. \r\n\r\nKeywords: H.264\/AVC; video coding; coding mode analyzing; de-blocking filter; frame\/video shot insertion\/deletion; \r\n","abstract_ch":"此研究的目的為鑑定H.264\/AVC視訊是否曾遭受畫面異常操作,包括畫面片段刪減、插入或抽換等,所提出的方法利用H.264\/AVC去區塊濾波器(De-blocking filter)所產生的邊界強度(Boundary Strength, BS)資訊以分析宏區塊(Macroblock)編碼模式。首先,我們以BS資訊產生兩種特徵圖,分別為PRG (Prediction Residual Graph)與IPG (Inter-Prediction Graph),再依此兩種特徵圖,分別提出VRF (Variation of Residual Footprint)與DOF (Degree of Fragments)兩種特徵,以因應不同的編碼情形與反偵測方法。我們藉由VRF與DOF所產生的數據進行峰值偵測,統計各峰值間的距離後取得此視訊遭受編輯前的原始GOP(Group of Pictures)大小,再分析所產生的峰值圖與原始GOP大小來追蹤峰值的異常週期,並進一步找出待測視訊的畫面操作位置。此外,本論文也針對一個基於改動編碼量化係數及編碼模式,且讓眾多偵測方法失效的新穎反偵測方法進行驗證。實驗結果顯示,不論在二次編碼時採用固定量化係數(Quantization Parameter)或是固定位元率(Constant Bitrate)編碼,在施予若干反偵測與否的情形下都能有穩定且強健的偵測效果。\r\n關鍵字 - H264\/AVC; 視訊編碼; 編碼模式分析;去區塊濾波器; 畫面增刪操作;\r\n","picture":"","personal_page":""},{"id":"51","phd":"0","class":"105","name_en":"Po-Wei Chang","name_ch":"張博崴","research_en":"Text Detection in Street View Images with Hierarchical Fully Convolution Neural Networks","resercher_intro":"","research_ch":"使用階層式全卷積神經網路偵測街景文字","abstract_en":"Considering that traffic\/shop signs appearing in street view images contain important visual information such as locations of scenes, effects of advertising on billboards, and the information of store, etc., a text\/graph detection mechanism in street view images is proposed in this research. However, many of these objects in street view images are not easy to extract with a fixed template. In addition, street view images often contain cluttered backgrounds such as buildings or trees, which may block some parts of the signs, complicating the related detection. Weather, light conditions and filming angle may also increase the challenges. Another issue is related to the Chinese writing style as the characters can be written vertically or horizontally. Detecting different directions of text-lines is one of the contributions in this research. The proposed detection mechanism is divided into two parts. A fully convolutional network (FCN) is used to train a detection model for effectively locating the positions of signs in street view images, which will be viewed as the regions of interest. The text-lines and graphs in the sign regions can then be successfully extracted by Region Proposal Network (RPN). Finally, post-processing is applied to distinguish horizontal and vertical text-lines, and eliminate false detections. 
Experimental results show the feasibility of the proposed scheme, especially when complex street views are investigated.\r\n\r\nIndex Terms – text detection, sign detection, street view, fully convolutional network, region proposal network","abstract_ch":"考量街景圖像中所出現的交通路牌與商家招牌等傳達了重要的影像相關資訊,本研究提出街景影像之招牌\/路牌偵測機制,於其中定位文字與圖形區域。研究的挑戰在於街景影像常包含與文字紋理相似的雜亂背景,且畫面中的招牌或路牌可能遭到其他物體遮蔽,天候、光線和拍攝角度等因素亦增加偵測的困難。此外,中文字能夠以垂直和水平方式書寫,因此必須能夠偵測這些不同方向的文字並加以區分。我們所提出的偵測機制分成兩個部分,第一部分定位影像中的路牌及招牌所屬區域,採用全卷積網路(Fully Convolutional Network, FCN)訓練街景路牌及招牌偵測模型,將偵測的招路牌視為感興趣區域(Region of Interest, ROI)。第二部分則於ROI中擷取文字及商標,我們使用區域候選網絡(Region Proposal Network, RPN)訓練文字偵測模型,藉此對影像分別做水平與垂直的文字串偵測,再根據第一部分所偵測的ROI,減少RPN對文字的錯誤偵測。最後我們進行後處理以結合水平及垂直文字串,排除錯誤偵測和處理文字串的複雜交集情形,以文字串長寬比、面積、交集情況、招牌背景顏色等來判定有效的區域。實驗結果顯示本研究能有效的在複雜街景畫面中找出招\/路牌並偵測文字與圖案區域,並探討兩種不同架構的深度學習網路在此應用中的使用方式。\r\n\r\n關鍵字 – 文字偵測、招牌偵測、街景、全卷積神經網路、區域候選網絡。","picture":"","personal_page":""},{"id":"52","phd":"0","class":"105","name_en":"Po-Wei Hsieh","name_ch":"謝柏維","research_en":"Chinese Character Segmentation via Fully Convolutional Neural Network","resercher_intro":"","research_ch":"基於全卷積神經網路之中文字分割機制","abstract_en":"Texts and artificial symbols in natural scenes convey important information, so capturing text content from images has many potential applications. However, most current methods are based on the processing of phonetic scripts, and the methods for logographic scripts such as Chinese still leave room for improvement. This study proposes a Chinese character text detection mechanism based on semantic segmentation for natural scene images, in which each individual Chinese character is labeled. The proposed method is divided into two stages: in the first stage, we trained a Fully Convolutional Network (FCN) as the Chinese text detection model for natural scenes. We adopted real natural scene images as the training data and added synthetic datasets to enhance the detection ability of the model. In the second stage, the text areas are separated and the text boxes are grouped according to their regional distribution, and the character information in different writing directions and layouts is combined to improve the practical value. The experimental results show that the proposed method can effectively detect Chinese text in natural scenes, and we explore the impact of each step on the detection results.\r\nIndex Terms – text detection, natural scenes, fully convolutional neural networks","abstract_ch":"自然場景中的文字與人工符號傳達的重要訊息,因此從影像中擷取文字具有許多潛在的用途,然而目前的方法多根基於對拼音文字文本處理,對於中文這類語素文字文本仍有改進的空間。本研究嘗試以單一中文字作為標記重點,提出結合語意分割 (semantic segmentation) 的自然場景中文字偵測機制。我們所提出的方法分成兩階段:第一階段採用全卷積網路 (Fully Convolutional Network, FCN) 訓練對自然場景的中文文本偵測模型,在訓練時除了真實場景訓練集資料外,也加入模擬資料彌補資料集的缺失,強化模型的偵測能力。第二階段則協助分離文字區域,並以區域分布關係對文字框分組,使節和的文字串在不同文字書寫方向和排版中仍然有效,提升應用價值。實驗結果顯示所提出的方法能有效偵測中文文本,並探討各步驟對偵測結果的影響。\r\n關鍵字 – 文字偵測、自然場景、全卷積神經網路。\r\n","picture":"","personal_page":""},{"id":"53","phd":"0","class":"105","name_en":"Dao-Wei Yang","name_ch":"楊道偉","research_en":"An Adaptive Vehicle Detection Scheme for Urban Traffic Scenes based on Convolutional Neural Networks","resercher_intro":"","research_ch":"基於卷積神經網路之市區道路場景自適應車輛偵測機制\r\n","abstract_en":"A large number of digital cameras have been installed at intersections in urban areas to help monitor traffic conditions. Making better use of the scenes captured by these traffic surveillance cameras can facilitate the construction of advanced Intelligent Transportation Systems. 
This research aims at developing an adaptive vehicle detection scheme for urban traffic scenes, which collects roadside surveillance videos from publicly available sources. The proposed scheme consists of two main phases; the first phase is to collect a small number of traffic surveillance images for training a general model using Faster R-CNN. The second phase utilizes background subtraction to extract vehicle proposals. A sufficient number of vehicles are collected by comparing proposals with the results using the general model. The collected vehicles are superimposed on the constructed background in an appropriate order to achieve semi-automatic generation of training data with annotations. These training data are used to train a second-phase adaptive model. The experimental results show that the proposed scheme performs quite well and can handle vehicle occlusion problem.\r\nIndex Terms––Urban Scenes, Adaptation Model, Vehicle Detection, Vehicle Recognition, Faster R-CNN, Background Subtraction\r\n","abstract_ch":"近年來大量的攝影機被架設於市區路口以協助檢視各種交通狀況,若能善用這些畫面將有助於先進智慧型運輸系統(Intelligent Transportation System)的建置。本研究嘗試以開放式政府資料庫蒐集市區道路監控影像,提出場景適應式行駛車輛偵測機制。由於路口攝影機通常有著不同角度的畫面,而畫面中可能存在各式背景,例如建築物、路邊物、招牌與行道樹等,加上人與車輛在道路上可能發生相互遮蔽的情況,都讓單一離線偵測模型存在若干改進空間。本研究所提出的方法分為兩個階段;第一階段蒐集少量市區道路影像,利用Faster R-CNN訓練通用車輛偵測模型,並對目標場景進行車輛偵測與分類。第二階段則利用背景建立法產生車輛遮罩,搭配第一階段的通用模型偵測結果,經比對蒐集足量的單一種類車輛,並以時序方式貼在目標場景中,以幾乎自動的方式產生大量該場景標記資料。我們將這些標記資料再以Faster R-CNN訓練第二階段場景適應式模型,以此模型進行車輛偵測及後續可能的車流估計。實驗結果顯示所提出的方法能有效偵測與分類市區場景車輛,對於遮蔽車輛偵測也有不錯的表現。\r\n關鍵字 ─ 市區道路影像、自適應模型、車輛偵測、車輛識別、Faster R-CNN、背景建立 \r\n","picture":"","personal_page":""},{"id":"46","phd":"0","class":"104","name_en":"Tzu-Liang Hsu","name_ch":"許子亮","research_en":"A Biomedical Signal Acquisition Platform for Wearable and Mobile Devices with Cloud Computing","resercher_intro":"","research_ch":"穿戴式生理量測於行動裝置與雲端伺服器平台之實現\r\n","abstract_en":"This research presents a biomedical signal collecting scheme, which consists of a self-developed software architecture and a MSP430F5438A biomedical signal collecting hardware. Biomedical signals will be collected by MSP430F5438A with a 500Hz sampling rate. The collected signals in the proposed scheme include Electroencephalography, Electrooculography, Electrocardiography, and Photoplethysmography. The hardware will acquire these signals and transfer the data to a mobile device running an Android system through the Bluetooth interface. The signals will be preprocessed and the data will be transferred to a cloud-computing server. The mobile device also provides a real-time signal viewing interface for users to check. A web service, a file transfer service, and a signal processing service are running on this cloud-computing server to support multiple users. \r\nThis proposed scheme relies on communications between multiple platforms to redistribute the computing resources. The biomedical signals often have a large volume of data, especially when we collect them with a high frequency and from multiple signal sources. Therefore, the major research objectives are to reduce the cost of signal collection and to solve the problem that healthcare professionals cannot acquire real-time data. 
The demo shows the feasibility of the proposed scheme.\r\n\r\nIndex Terms- Wearable devices, Mobile devices, Cloud computing, Biomedical signal processing\r\n","abstract_ch":"本研究提出利用雲端監控之生理訊號偵測機制,此機制藉由自行開發的軟體架構,結合實驗室所研製的生理訊號採集樣本硬體MSP430F5438A完成整體研究實作。生理訊號由MSP430F5438A以頻率500Hz採樣,採樣內容包含腦波、眼動圖、血氧飽和指數、心電圖等訊號,並通過藍牙無線介面傳輸至移動式裝置。移動式裝置採用Android系統,生理訊號在此裝置經過初步前處理後,再同步上傳至雲端伺服器記錄結果,我們亦在移動裝置上提供使用者生理訊號即時查看功能。雲端伺服器包含一個網頁伺服器,以及一個檔案傳輸伺服器,並建立背景服務實現生理訊號即時濾波與分析,提供即時瀏覽網頁功能讓遠端監測者觀看,完成服務多人之訊號偵測系統架構實作。\r\n本機制利用多平台溝通與分散運算資源,降低高採樣率所造成的硬體負擔與建置成本,也提供了一個生理訊號監測即時平台,可將其應用於任一生理訊號即時回報環境,如醫學或生理訊號教學評估系統,解決生理訊號蒐集者無法獲得即時資訊的困難。\r\n關鍵字 - 穿戴式裝置、行動裝置、雲端運算、生理訊號處理\r\n","picture":"","personal_page":""},{"id":"47","phd":"0","class":"104","name_en":"Chien-Yu Chien","name_ch":"簡宇謙","research_en":"Video Editing Detection Based on Coding Mode Analysis by Deep Learning Techniques ","resercher_intro":"","research_ch":"基於深度學習與編碼模式分析之視訊剪輯檢測 ","abstract_en":"Digital videos are ubiquitous these days, serving as an important source or \r\nmedium for information dissemination. Since many digital surveillance cameras \r\nare deployed around cities, the content of digital videos often provides effective \r\nvisual evidence of crime scene investigation or proof on the court. Nevertheless, \r\neasiness of content manipulation raises certain concerns over authenticity of \r\ndigital videos. A malicious user may tamper the video via widely available \r\nediting tools to change the meaning of content so that the subsequent \r\nexamination or analysis would be affected. This research aims at providing a \r\nvideo forensic tool for determining whether an investigated video has been \r\nedited. The considered editing operations include frame\/video segment insertion \r\nor deletion. The methodology is to employ the fact that video editing usually \r\nresults in double compression\/encoding and affects the coding selection of \r\ncertain frames when the fixed GOP (Group of Pictures) is used. The coding \r\nmodes of H.264\/AVC are utilized to examine the traces of video editing via \r\nconvolutional neural networks. The abnormal frames appearing periodically will \r\nbe located to determine the original GOP size of the first encoding, which helps \r\nto identify the exact frame or video editing location. Several testing cases are \r\ndesigned in the experiments, coupled with the state-of-the-art anti-forensic \r\napproach, to verify the feasibility of the proposed method. \r\nKeywords – H.264\/AVC, double compression, coding mode analyzing, deep \r\nlearning, video tampering detection. ","abstract_ch":"數位視訊為資訊傳遞的重要媒介,在現今廣設監控攝影機的環境下,其\r\n影像內容常被做為偵查現場或法律程序上的證據。然而,數位視訊容易編\r\n修的問題引發了若干疑慮,若有人懷著惡意而對視訊進行編輯修改,可能\r\n造成影像內容的差異而影響事後的檢視判斷結果。本研究的主要目的是鑑\r\n定數位視訊片段是否曾被編輯,可能的攻擊包括畫面插入、刪減以及替換\r\n等。研究方法主要針對於被修改視訊的二次壓縮與GOP (Group of Pictures)\r\n的關係所產生的特徵,利用 H.264\/AVC 標準中的編碼模式資訊追蹤經由編\r\n輯操作所留下的不尋常痕跡。近期深度學習領域的蓬勃發展建立了更可靠\r\n的內容辨識技術,我們選擇使用卷積神經網路分析一連串畫面中的編碼不\r\n正常跡象,再經由偵測有規律性位置的異常畫面以判斷該視訊於首次壓縮\r\n時所使用的原始 GOP 大小。最後,我們以 GOP 與異常畫面位置等資訊推斷\r\n此視訊被編輯的實際位置。在實驗中我們設計了許多不同的測試情境,同\r\n時搭配新穎的反偵測機制進行驗證,以確認所提出方法的強健性。 \r\n \r\n關鍵 字 – H.264\/AVC 、二次壓縮、編碼模式分析、深度學習、視訊竄改偵\r\n測。 ","picture":"","personal_page":""},{"id":"49","phd":"0","class":"104","name_en":"Chih-Yun Fang","name_ch":"方志筠","research_en":"Quality Assessment of Image Retargeting based on Line Bending and Geometric Distortion","resercher_intro":"","research_ch":"基於線段扭曲與幾何變形之影像濃縮畫質衡量機制","abstract_en":"Image retargeting is a technique to output images with a different aspect\r\nratio from those of displaying devices. 
Various methods exist but they may not\r\nbe consistent with different kinds of images. It is essential to develop good\r\nquality assessment approaches in this field. In this research, we propose an\r\nobjective quality assessment for image retargeting with original images as the\r\nreference. The proposed scheme includes two parts: line bending and geometric\r\ndistortion, both of which are based on SIFT Flow to examine the degree of pixel\r\nshifting. The proposed scheme basically employs the variation of SIFT Flow to\r\ndetermine the geometric distortion. At the same time, some important lines are\r\ndrawn according to saliency map to determine line distortion, which is used to\r\ngrade the quality when the geometric distortion is not reliable. The setting of the\r\nparameters aims at making the score closer to the mean opinion scores (MOS),\r\nwhich represents that the quality assessment is consonant with human’s vision.\r\nExperimental results show the accuracy of the proposed scheme by comparing\r\nwith other objective image quality assessment methods and different retargeting\r\napproaches.\r\nIndex Terms - Image quality assessment, Distortion, Important line, SIFT Flow","abstract_ch":"影像濃縮(retargeting)的目標是讓具有固定尺寸的影像資料在各種解析\r\n度的畫面輸出中都能有良好的成像。然而,不同的影像濃縮方法往往對於\r\n調整後影像有相異的效果,因此接近使用者主觀感受的畫質衡量機制是必\r\n要的。本研究針對在具有原始影像做為參考下的濃縮影像提出客觀的畫質\r\n衡量機制,本機制分成兩個部分,包括重要線段扭曲(Line Bending)與幾何\r\n形狀變形(Geometric Distortion),兩者都會利用SIFT Flow 去判別像素點在\r\n兩張圖中的的位移情形。我們主要檢視SIFT Flow 的變化情況決定影像的幾\r\n何變形程度,另一方面根據視覺顯著圖(Saliency Map)在畫面上標示重要線\r\n段,再以SIFT Flow 找出線段扭曲位置,當幾何變形被判定失效時改以線段\r\n扭曲程度評量畫質。評量的參數設定儘量貼近平均意見分數(MOS),期望與\r\n人眼的主觀感受一致。實驗結果測試了不同影像與濃縮方法的評估結果,\r\n並與其他的客觀衡量方式比較,以顯示我們所提出的機制符合主觀感受的\r\n準確程度。\r\n關鍵字 - 影像畫質衡量、畫面扭曲、重要線段、SIFT Flow","picture":"","personal_page":""},{"id":"4","phd":"0","class":"103","name_en":"Yung-Chieh Chou","name_ch":"周永杰","research_en":"Toward More Efficient Multi-Operator Media Retargeting for Digital Images and Videos","resercher_intro":"","research_ch":"發展更具效率之多運算子影像視訊畫面尺寸調整機制","abstract_en":"This research presents a multi-operator image retargeting scheme, which can be further expanded to video retargeting. The objective is to effectively and efficiently adjust the image or video frame to the targeted resolution. Given an image or frame, the content-based cropping and scaling will be applied. The visual saliency map is calculated and the superpixels are formed via Simple Linear Iterative Clustering (SLIC) to serve as the reference to extract the visually significant foreground objects. Next, the degree of cropping and scaling will be determined by the saliency map. Seam carving can also be employed to make the resolution closer to the target if the efficiency is not an important issue. Seam caving checks the one-directional gradients and uses dynamic programming to remove the saliency with minimal significance. Local update helps to reduce the computational burden. Saliency points are identified and helps to decide when to stop the seam carving process. For certain images, inserting seams is also useful to decrease the the degree of scaling. Experimental results show that the proposed method does maintain the significant objects of the image and is also more feasible. \n For video retargeting, the data in compressed video stream, including the motion vectors and motion compensation, are used to classify the types of shots. If the shot belongs to a fixed scene, seam carving can be applied. Otherwise, only cropping and scaling are used. 
To avoid removing the foreground objects, the motion feature map is formed, combined with the visual saliency map, to achieve seam carving and cropping. The experimental results show that the proposed scheme can deal with a variety of shots and outperforms existing algorithms.\n\n Index Terms-Multi-Operator, Content-based Cropping, Seam Carving, Visual Saliency Map, H.264 Motion Vector, Motion Feature Map.\n","abstract_ch":"本研究提出多運算子影像與視訊尺寸調整(retargeting)演算法,目的在於有效率地調整影像畫面至目標解析度,並將演算法延伸應用於視訊。對於數位影像,我們適當地施予基於內容之邊緣裁切(content-based cropping)和縮放(scaling),首先計算影像中的視覺顯著特徵(visual saliency feature),並將影像透過SLIC(Simple Linear Iterative Clustering)演算法切割成較大的超級像素(superpixel),擷取畫面中的前景物作為畫面切割的依據,接著逐一比較視覺特徵圖進行邊緣裁切與等比例縮放。若時間允許,圖縫裁減(seam carving)可被使用讓畫面更接近目標長寬比。圖縫裁減主要計算畫面梯度,採用動態規劃刪除最小能量圖縫並進行圖縫的局部更新,最後定義突出點以限制圖縫數量並決定裁減停止點。對於某些適合的影像,我們亦可增加圖縫來降低畫面直接縮放程度。由實驗結果顯示,我們確實有效率地維持影像主體,演算法也達到較高的實用性。另外,我們將影像處理延伸至視訊資料,考量視訊壓縮域動態資料計算,透過H.264\/AVC視訊壓縮編碼時所產生的運動向量(motion vector)和運動補償資訊(motion compensation)判斷鏡頭種類,若為非固定式場景,我們使用邊緣裁切以及縮放的方式處理畫面;若為固定場景,則可使用圖縫裁減機制。為了防止運動中的前景物在裁切過程中被移除而造成失真,我們將壓縮域中的位移向量製作運動特徵圖(motion feature map),結合視覺特徵圖協助圖縫裁減和邊緣裁切。實驗結果顯示我們的方法可以廣泛處理不同種類的鏡頭,在畫面前景物形狀的維持以及背景保留上,亦優於其他視訊畫面調整演算法。\n 關鍵字 -多種運算子、邊緣裁切、圖縫裁減、視覺顯著特徵、H.264\/AVC、運動向量、運動特徵圖。\n","picture":"","personal_page":""},{"id":"5","phd":"0","class":"103","name_en":"Jia-Hao Hu","name_ch":"胡家豪","research_en":"A Targeted Person Searching Scheme in Digital Videos based on Face Quality Assessment and Recognition","resercher_intro":"","research_ch":"基於人臉畫質衡量與識別之視訊目標人物搜尋機制","abstract_en":"This research presents a targeted person searching scheme in digital videos. It is assumed that a user is given an exemplar video containing a person to be searched and a video, from which the scenes related to the targeted person will be extracted. First, the exemplar video will be processed to select multiple representative images of persons, which will be shown on a user interface for the user to select the images of a targeted person. After choosing the images which best characterize the targeted person, the scheme will apply the face assessment process to build the model of the targeted person. The model can be employed to search for that person in other videos. We hope that, with the assistance of such a scheme, searching for people in videos can be facilitated. Such applications as actor comparison in videos, retrieval of people, or digital evidence collection can be achieved.\r\nThe scheme mainly relies on the face tracking method to find consecutive pictures or frames that contain human faces. With the acquisition of multiple images, we can build a more stable model of the targeted person and further develop a reliable face assessment method to choose better images for recognition. The assessment process not only avoids images with poor quality, but also reduces operating time and effort. The face assessment method takes four factors into consideration, including out-of-plane rotation, sharpness, brightness, and resolution. By analyzing parameters and recognition outcomes, we can understand the effects of different settings and interface influence, and investigate the utility of all aspects for face matching in videos. 
Experimental results show the accuracy of the proposed scheme and the possible improvement in the future.\r\nIndex Terms- Face Assessment, Face Recognition, Support Vector Machine\r\n","abstract_ch":"本研究提出數位視訊目標人物搜尋機制,此機制假設使用者被提供一個包含目標人物的範例視訊,以及一個時間可能較長的待測視訊。範例視訊首先經過前處理,將代表多個視訊片段的人物畫面顯示於介面供使用者選取。使用者點選同個目標人物不盡相同之各式樣貌畫面後,本機制經選圖程序挑出若干目標人物影像建立目標樣板,並根據樣本於待測視訊中搜尋與標記可能包含目標人物的視訊片段。我們冀望透過這樣的機制協助建立以圖找圖之視訊中人物搜尋相關應用,例如視訊人物比對、檢索或錄影蒐證等。\r\n本機制主要藉由人臉追蹤方法尋找視訊中包含人臉的連續畫面,利用多張影像的取得建立較穩定之目標人物樣板,再發展可靠的人臉評分方式以選擇較佳的人臉影像方便之後的識別與偵測,避免採用品質不佳的圖片影響運作,也可以減少相關的時間耗費。人臉評分方式主要考量人臉角度、銳利度、光線明暗與離鏡頭遠近等因素,鑒於人臉角度對於識別上的影響甚鉅,我們藉由雙邊濾波判斷人臉是否過度偏斜,實驗結果顯示我們所提出機制的準確度與未來可能的應用暨改進方向。\r\n關鍵字-人臉評分、人臉識別、支持向量機\r\n","picture":"","personal_page":""},{"id":"6","phd":"0","class":"103","name_en":"Chiung-Fang Chang","name_ch":"張瓊方","research_en":"Detecting Texts and Graphs in Street View Images by Convolutional Neural Networks\n","resercher_intro":"","research_ch":"使用卷積神經網路偵測街景文字圖案","abstract_en":"Considering that traffic and shop signs appearing in street view images contain useful information, such as locations of scenes or effects of advertising billboard, a text and graph detection mechanism in street view images is proposed in this research. Many of these artificial objects in street view images are not easy to extract with a fixed template. Besides, cluttered backgrounds containing such items as buildings or trees may block some parts of the signs, increasing the challenges of detection. The weather or light conditions further complicate the detection process in this research, the proposed detection mechanism is divided into two parts; first, we use the Fully Convolutional Network (FCN) segmentation technique to train a detection model for effectively locating the positions of signs in street view images. In the second part, we extract the texts and graphs in the selected areas employing the characteristics of signs in such images. By observing that, regardless of their shapes, the texts\/graphs are usually superimposed on smooth areas, we construct smooth-region maps according to the gradient magnitudes and then confirm the actual areas of signs. The texts and graphs can then be extracted by Maximally Stable Extremal Regions (MSER), which is suitable for text detection. Experimental results show that this mechanism can effectively extract texts and graphs in various types of complex street scenes.\n","abstract_ch":"本論文提出於街景畫面中尋找文字與圖案的偵測機制,主要考量街景環境所拍攝的畫面常出現具識別性的人為標記,包括交通路牌與商家招牌,這些人造圖案提供了關於該影像的若干資訊,例如拍攝的所在位置與商家招牌的廣告效果等。然而,這類物件的多種圖案或形狀並不容易以固定的樣板予以分析,再加上街景影像常包含雜亂背景(建築、道路、林木等),路\/招牌在畫面中也可能重疊,或遭到街道中的其他物體遮蔽,而天候光線等因素也會影響偵測結果,這些因素都增加了偵測街景影像人為資訊的困難。我們所提出的偵測機制分成兩個部分,第一部分定位影像中之路牌及招牌所屬區域,我們採用基於全卷積網路(Fully Convolutional Network, FCN)分割技術,訓練街景路牌及招牌的偵測模型,以期迅速且有效地確認目標。第二部分則於該區域中擷取文字及商標,我們利用招牌及路牌的特性,即不論兩者形狀為何,通常都由一塊平滑區域組成背景,而文字及商標存在於其中。我們依據灰階梯度強度(Gradient Magnitude),建構平滑區域圖,再根據第一部分所偵測的區域,以比對平滑區域的方式確認畫面中招牌的實際區域,根據文字與圖案的特性定義人為資訊位置機率圖。最後以適用於文本檢測的最大穩定極值區域 (Maximally Stable Extremal Regions, MSER)方法,從資訊位置機率大的區域中擷取文字及商標。實驗結果顯示本機制在各類複雜街景畫面中能有效取得文字與圖案,並依此探討FCN在此應用中的使用方式。","picture":"","personal_page":""},{"id":"7","phd":"0","class":"103","name_en":"Ching-Chun Chiu","name_ch":"邱敬淳","research_en":"Detection of Video Shot Editing by Deblocking Filter of H.264\/AVC","resercher_intro":"","research_ch":"基於H.264\/AVC去區塊濾波器之視訊片段編輯偵測","abstract_en":"The purpose of this research is to develop a forensic scheme to determine whether an investigated video has been tampered by editing processes, including shot deletion, replacement or insertion, and so on. 
A detection mechanism based on the H.264\/AVC de-blocking filter is proposed. Considering that the original video is encoded with H.264\/AVC and so is the investigated video, when certain shots in the original encoded video are edited, the re-encoding may convert some I frames in the original GOP (Group of Pictures) into P frames. Such abnormal coding information generated by the tampering operations is employed to assess the authenticity of the investigated video.\nMost previously proposed tampering detection methods utilize the information of coding residuals. This study makes use of the de-blocking filter related information in H.264\/AVC, which is more difficult to attack with anti-detection methods than the existing approaches. We extract the Boundary Strength (BS), which the de-blocking filter uses as the basis for evaluating the filtering strength of 4x4 block boundaries. Two graphs are formed for analysis, i.e., the Prediction Residual Graph (PRG) and the Inter-Prediction Graph (IPG), each represented as a two-dimensional image. In order to deal with various kinds of anti-detection or tampering operations, the proposed method defines three kinds of discontinuities by analyzing PRG or IPG. Three evaluation methods are thus developed, including (1) VRF (Variation of Residual Footprint), which operates on PRG to improve the existing VPF (Variation of Prediction Footprint), (2) DOF (Degree of Fragments), which processes IPG, and (3) VCF (Variation of Centroid Footprint), which calculates the offset of the centroid in PRG. Finally, we estimate the distances between the detected peaks and find the distance that occurs most frequently to acquire the original GOP size, followed by the determination of the editing position. Besides, the latest anti-detection technology, an attacking method based on changing the quantized coefficients, is also used to verify the proposed detection mechanism based on the de-blocking filter. 
Experimental results show that, no matter using the fixed QP (Quantization Parameter) or CBR (Constant Bit Rate) encoding, the video after anti-detection attack still reveals abnormal phenomena, which demonstrate the robustness of the proposed method.\nIndex Terms - H.264\/AVC, video editing, de-blocking filter, frame insertion\/deletion","abstract_ch":"本論文提出基於H.264\/AVC去區塊濾波器之視訊編輯偵測機制,研究目的在於判斷待測編碼視訊是否曾發生改變視訊內容的操作,例如片段刪除、置換或插入等。考量原始視訊以H.264\/AVC進行編碼,而鑑識方取得的待測視訊多也以同樣的格式編碼,當原始編碼視訊經過若干編輯與再次編碼後,可能造成原先GOP(Group of Pictures)中的I畫面轉換成P畫面,我們利用這些畫面所產生的異常編碼情況評估編碼視訊的真實性。\n大多數已被提出的竄改視訊偵測方法多利用區塊預測殘差資訊,本研究則採用H.264\/AVC中的去區塊濾波器相關訊息,它比現有的方式更難被施予反偵測攻擊。我們將去區塊濾波器中用來判斷4x4區塊邊界執行濾波強度的依據,即邊界強度(Boundary Strength,BS)取出,把每張畫面的BS值以二維圖像建構兩種分佈圖:Prediction Residual Graph (PRG)以及Inter-Prediction Graph (IPG)。為因應各類反偵測或是竄改所造成畫面編碼內容不連續,本論文定義三種畫面分佈不連續狀況,分別為數量、破碎程度以及質心位置不連續。數量不連續通常發生在編碼資訊被更動但仍遵守H.264\/AVC正常編碼的情況,而破碎程度及質心位置不連續兩者則較常發生於編碼資訊被改動且可能受到若干進階反偵測攻擊時。若改動區塊數量較多則造成分析圖中發生更多破碎的情況,而改動數量少通常造成分析圖的所謂質心偏移。基於上述狀況,我們使用兩種圖像搭配成三種偵測方法:(1) VRF(Variation of Residual Footprint),即利用PRG改良現有VPF(Variation of Prediction Footprint),以計算數量不連續為主的偵測方法。(2) DOF (Degree Of Fragments),針對IPG中破碎程度不連續的偵測方法。(3) VCF(Variation of Centroid Footprint),計算PRG之質心位置偏移量作為偵測方法。最後,我們估測峰值間距,尋找出現次數最多的間距得到首次編碼的GOP大小,再取得視訊片段編修位置。此外,本研究也以最新的反偵測技術,一個基於改變量化後係數的攻擊方法,來對所提出的基於去區塊濾波器偵測機制進行驗證。實驗結果顯示,不論編碼時使用固定QP(Quantization Parameter)或採用CBR(Constant Bit Rate)編碼,經過反偵測攻擊後的視訊仍然暴露異常情況,證明本研究所提出方法的強健性。\n關鍵字 – H.264\/AVC、視訊編修偵測、去區塊濾波器、畫面增刪。","picture":"","personal_page":""},{"id":"8","phd":"0","class":"102","name_en":"Tsung-Fu Tsai","name_ch":"蔡宗甫","research_en":"A Secret Sharing Scheme in Halftone Images Based on Multi-Scale Error Diffusion","resercher_intro":"","research_ch":"基於多層次誤差擴散之數位半色調資料隱藏","abstract_en":"This research presents a secret sharing scheme in halftone images. Some gray-level images of the same resolution are selected and transferred to halftone ones, which are responsible for carrying a secret halftone image. Given the pixels of secret image as the constraint, the host images are generated using Multi-scale Error Diffusion (MED). The original pixels of host images are examined and the modified MED ensures that the resulting pixels of the host images should satisfy the required conditions. After grouping all the processed halftone images, the secret image can be successfully revealed. The research objective is maintaining the quality of all the halftone images in this secret sharing scenario. Another proposed method is termed “mutual embedding,” in which a halftone share can be decoded using all the other shares by modifying the initial setting in this secret sharing scheme. Besides, the approach of selecting host images is proposed so that suitable images can be chosen from an image database to ensure the quality of resulting halftone images. 
The experimental results and discussions demonstrate the interesting characteristics of the proposed scheme.\r\n","abstract_ch":"本論文提出一個利用數位半色調影像做為機密資訊傳遞的共享機制。多張數位\n影像被當做載體並轉換為數位半色調影像,影像在轉換的過程中,我們利用半色調影像的特性,將相同大小的機密半色調影像嵌入於載體中,達成秘密通訊及資料共享的目標。此機制的數位半色調轉換基於多層次誤差擴散演算法,每次選擇適當的影像與位置放置白點,並根據機密半色調影像的內容讓同位置的白點個數滿足隱藏條件,研究的主要目標在於嵌入秘密影像的同時亦保持載體半色調影像的畫質。此外,我們提出數位半色調影像的互嵌機制,在不指定欲嵌入機密影像的情況下,能夠在參與共享的M 張影像中,利用任何M-1 張影像擷取剩餘的一張半色調影像。\n為了從資料庫中選擇適當的影像作為資料載體,我們提供選圖機制做為實作參考。\n實驗結果顯示所有數位半色調影像能夠維持良好的畫質,能夠成功地將機密半色調影像被嵌入與擷取,達到秘密通訊與資訊共享之目的。","picture":"","personal_page":""},{"id":"9","phd":"0","class":"102","name_en":"Meng-Huan Li","name_ch":"李孟桓","research_en":"Robust and Accurate Iris Mask Estimation using Convolutional Neural Network\n","resercher_intro":"","research_ch":"應用卷積神經網路的虹膜遮罩預估","abstract_en":"Iris recognition has a lot of applications. A typical iris recognition system has several stages, including acquisition, segmentation, iris mask generation, feature extraction and matching. In order to increase the accuracy of iris recognition, many studies focus on iris segmentation, feature extraction and matching. However, iris masks can also have a great impact on the accuracy of recognition. \nIn this study, we propose an iris mask estimation algorithm based on deep learning. After pre-processing the iris images and the corresponding masks, we train these data in convolution neural networks (CNN), which help to achieve a higher accuracy in matching iris masks for different images than rule-based algorithms. The accuracy of matching by using patch-based CNN is 92.87%, with the 0.147% EER (Equal Error Rate) and the accuracy of applying multi-channel fully convolution networks is 95.56%, with an even lower EER equal to 0.0851%.\n","abstract_ch":"生物特徵辨識是指基於一個人的生理或者行為特徵作為身分辨識機處的一種技術,虹膜辨識是生物特徵辨識中一種精確度、普遍性、獨特性很高,且侵入性很低的辨識方式。在一個典型的虹膜辨識系統當中包含了以下幾個階段:1.影像擷取、2.虹膜切割、3.虹膜遮罩產生、4.特徵提取、5.特徵比對,為了提高虹膜辨識的準確率,許多的研究裡都關注在如何正確切割虹膜、提取特徵以及特徵比對,然而虹膜遮罩的正確與否也是虹膜辨識準確性的重要因素之一。\n在本篇論文中,我們嘗試了多種的神經網路架構來對虹膜遮罩進行預估,最後提出了兩種基於深度學習 (Deep Learning) 的演算法來學習輸入虹膜影像的遮罩,我會將虹膜影像和其對應正確的虹膜遮罩做些許前處理後,輸入進我們建置好的深度學習網路學習其特徵,學習完特徵後的網路在輸入新的虹膜影像時也能順利的預測其對應虹膜影像遮罩,使產生虹膜遮罩的正確率相對於 rule-based 或其他演算法產生的虹膜遮罩高,且能提升虹膜辨識最終的準確性,使用patch-based CNN的虹膜遮罩正確率可以達到92.87%、EER為0.147%,使用multi-channel FCN的虹膜遮罩正確率可以達到95.56%、EER為0.0851%。","picture":"","personal_page":""},{"id":"10","phd":"0","class":"102","name_en":"Kun-Zhang Chen","name_ch":"陳堃彰","research_en":"Design of Fingerprinting Watermark in Digital Compressed Videos for Streaming Services","resercher_intro":"","research_ch":"適用於壓縮視訊串流服務之溯源數位浮水印設計","abstract_en":"With the expeditious development of digital video streaming over the Internet, avoiding the infringement of copyright and illegal distribution of videos has become increasingly important. Digital watermarking techniques may provide an effective solution. In this study, we propose a practical design of fingerprinting watermarking scheme for video streaming services. In order to track the source of the illegal redistribution, we embed the watermark signal representing the identity of a legitimate video recipient in the compressed domain of the transmitted videos. Once a pirated copy is found somewhere, the embedded fingerprint can uniquely determine the malicious subscriber. For the sake of less computational complexity in the streaming server, compressed video would be decoded partially and embedded with the fingerprint in the quantization indices, and then re-encoded. 
Since the blind detection is performed in the proposed scheme, the feature point extraction such as SURF is necessarily employed to determine the embedding positions of the watermarks, so that embedding and detection can be synchronized against geometrically deformed video. The autocorrelation function is introduced to figure out whether a watermark exists or not, and the message is read by the cross-correlation function. The experimental results demonstrate the feasibility of the proposed scheme.","abstract_ch":"隨著網路多媒體視訊串流的蓬勃發展,數位多媒體內容易於複製和散佈,如何保障視訊著作權不受非法侵害日趨成為相當重要的議題,而數位浮水印技術被視為一種有效的防範措施。本論文提出實作於視訊串流服務中的數位視訊浮水印設計。透過在壓縮域嵌入浮水印訊號代表合法視訊接收者的相關身分資訊,有效協助追蹤非法視訊拷貝的散佈來源。運用部分解碼、嵌入處理和視訊編碼將浮水印訊號嵌入在視訊畫格中的特定區域,這些區域採用特徵或感興趣點來決定,有助於後續浮水印偵測。藉由嵌入訊號的自我相似特性,使浮水印偵測不需要原始視訊比對。利用移位浮水印設計,根據嵌入訊號的互相關函數表示不同位元,藉此提高浮水印攜載資訊容量。嵌入訊號理當能承受一定程度的轉碼處理或幾何形變轉換攻擊。為了實現視訊串流應用的需求,我們考量浮水印的視覺感知程度、容量、強健性和偵測方法,實驗結果也顯示我們提出架構的可行性。","picture":"","personal_page":""},{"id":"11","phd":"0","class":"101","name_en":"Yong-Quan Chen","name_ch":"陳勇全","research_en":"A Desktop Course Recording System based on Gesture Control","resercher_intro":"","research_ch":"基於手勢控制之桌面式課程錄影輔助系統\r\n","abstract_en":"A desktop course recording system is presented in this thesis. We combine the gesture control and operations of a PTZ camera to simplify the recording of lectures. To facilitate the recording process, the lecturer can easily write the lecture note on a paper or show the course material to enrich the content of course. In addition, the lecturer can also teach with the course slide. The proposed mechanism targets at controlling the system by gestures as simple as possible to reduce the complexity so that the lecturer can focus more on explaining the course material during the course recording.\nTo achieve the functions mentioned above, we employ the depth camera, i.e., Kinect, to reliably detect the lecturer’s gestures and the PTZ camera to record the course content. The lecturer can use several gestures to control the PTZ camera according to different situations. Considering that the lecturers may have various backgrounds, we not only simplify the system setting but also make the gesture commands stable and intuitive. This system can thus reduce the cost of labor and time for the post video production. 
The ultimate goal is to enable more learners willing to use the system to record videos for remote distance learning, so that more students can benefit from such a learning experience.\nIndex Terms – Lecture recording, gesture control, remote distance learning","abstract_ch":"本論文實作簡易桌面式教學輔助系統,藉由手勢判斷與PTZ攝影機的結合,簡化授課教師於錄製授課影片時,在不同情境下對授課教材的操作,讓課程錄影的過程更為順暢、直觀且具有彈性。本機制讓授課教師方便地在桌面上書寫教學內容,並自行操作PTZ攝影機拍攝課程筆記或講義,也可將書本與相關資料之文字與圖片等實體教材加入於課程中以豐富教學內容。授課教師可選擇使用投影片進行教學,同時以桌面手寫的方式對課程內容進行說明。此外,本機制盡量透過簡易的手勢達成需要的PTZ攝影機操控,降低教學錄影的複雜度,讓授課教師將心力專注於課程講解。 為了達成上述功能,我們使用Kinect深度攝影機以可靠地偵測手勢,並藉此操控PTZ攝影機對課程內容進行錄製。授課教師能夠依照不同的需求做出對應的手勢,自行控制PTZ攝影機,讓鏡頭的運作更有效地擷取授課內容,或以手勢調整授課教材。考量授課教師的各種不同背景,我們盡量簡化系統設定與手勢設計,除了讓教師更方便且直覺地操作攝影機及授課教材之外,也能夠增加系統操作的穩定度。對於錄製授課視訊或是遠距教學等相關應用,講者對於器材的自行操作可減少拍攝時的人力,亦可降低事後編輯課程內容與相關後製所需耗費的時間與成本,讓更多教師願意以錄影方式製作課程教材,嘉惠更多課程學員。\n關鍵詞:課程錄影、手勢控制、遠距教學","picture":"","personal_page":""},{"id":"12","phd":"0","class":"101","name_en":"Tzu-Hao Hsiang","name_ch":"向子豪","research_en":"SCAN: A Multi-Operator Image Retargeting Scheme","resercher_intro":"","research_ch":"SCAN: 一個多運算子的影像畫面調整機制\r\n","abstract_en":"This research presents a multi-operator retargeting mechanism termed “SCAN”, in which seam carving, cropping, adding seams and normalization (scaling) are applied on images in an automatic manner. The content-based cropping will first be used to remove insignificant portions on sides. Then a new seam carving algorithm based on both the global saliency and local saliency is proposed to rid of the pixels in the middle of the image. Efficiency is the major advantage of this seam carving algorithm. When the background is not complex, some seams may be inserted in a similar way as the proposed seam carving procedures to make the aspect ratio closer to the target one. Finally, the image is scaled or normalized directly. Experimental results will demonstrate the feasibility and advantages of the proposed method.\nKeywords-retargeting; seam carving; cropping; scaling; saliency\n","abstract_ch":"本研究提出一個多運算子的影像調整機制,其中包含了圖縫裁減(Seam carving)、邊緣裁切(Cropping)、圖縫增加(Adding seam)以及正規化(Normalization)或影像直接縮放,故此機制又稱為SCAN。首先我們根據畫面內容物紀錄兩側最多能夠裁切的位置,並進行第一次邊緣裁切,移除影像兩側不重要的部分,並且使得兩邊裁切的數目相同。接著,我們考慮局部能量以及全域能量實作一個新的圖縫裁減方法以去移除影像中間較不重要的圖縫,此演算法可有效率地移除大量圖縫。當影像背景不複雜時,類似圖縫裁減方法可以被應用於圖縫增加之上,使得畫面更接近目標長寬比例。若影像尺寸仍未達目標,可進行第二次的影像邊緣裁切。最後在施予畫面直接縮放。實驗結果顯示所提出的方法之可行性與優勢。\n\n關鍵字 - 影像重新定位、圖縫裁減、邊緣裁切、視覺顯著特徵、多運算子\n","picture":"","personal_page":""},{"id":"13","phd":"0","class":"101","name_en":"Yu-lun Lin","name_ch":"林瑜綸","research_en":"A Novel Data Transmission Scheme on Voice over Internet Protocol","resercher_intro":"","research_ch":"網路電話之額外訊息傳輸機制","abstract_en":"With the rapid growth of networking technologies and digital data, content transmission on the Internet has become a common operation. Traditionally, we encrypt the data to make them unreadable so that only the recipient who owns the decryption key can read the data successfully. Nevertheless, it is still possible that the eavesdropper may still intercept the data for further analysis. Besides, the application of cryptography in certain occasions may not be allowed. The encrypted data may even arouse the interest from the eavesdroppers. Therefore, how to transmit the data in a more secured way has been an important research issue. \nIn this research, an additional data transmission scheme in Voice over Internet Protocol (VoIP) is designed. The H.264\/AVC bit-stream is responsible for transmission the major data. The file size and the number of macroblocks for embedding a data unit are carried in G.729. 
By signaling the start and the end of macroblocks in H.264\/AVC for embedding a data unit, we can know whether the packet lost has occurred and the retransmission can be applied. The experimental results will show the feasibility of the proposed method.","abstract_ch":"在資訊數位化與網路化的趨勢下,以網際網路傳遞訊息已非常普及。傳統方法採用加解密技術保護訊息,將明文資料轉換為無法識別的亂碼,僅讓擁有解碼金鑰的接收者可正確解讀資訊,因而確保了資料的安全性。然而,在資料傳遞的過程中仍有可能被截獲並且破譯。另外,傳輸加密資料在某些場合中是不被允許的,或是對於訊息的加密反而招致竊聽者或攻擊者的興趣。因此,如何加強資料傳遞的隱蔽性以確保訊息安全是項重要的議題。\n本論文中提出在網路電話Voice over Internet Protocol (VoIP) 系統上的額外訊息傳輸機制。我們利用H.264\/AVC視訊來達成有效的資料傳輸,另外記錄並告知接收端每單位資料量所需要的編碼宏塊數,讓接收者可檢視封包宏塊開始位置與前個封包的結束位置來判斷是否發生封包遺失以利重新傳遞。這些關於訊息大小與單位資料量所需宏塊數資訊將以G.729音訊封包傳送,同時減低了視訊處理的工作量。實驗結果將顯示本機制的實用性。","picture":"","personal_page":""},{"id":"14","phd":"0","class":"100","name_en":"Lu-Jui Chueh","name_ch":"闕呂叡","research_en":"A Practical Design of Digital Video Watermarking for Tracking","resercher_intro":"","research_ch":"適用於使用者來源追蹤之數位視訊浮水印設計\r\n","abstract_en":"With the rapid growth of networking technologies and the advances of data compression, a large number of multimedia files are transmitted, shared and downloaded. Due to the high commercial values and entertainment, digital videos are popular and watching digital video streams has become a common activity in our daily life. Although the users do enjoy the convenience from the digital video streaming, the illegal spreading of copyrighted videos draws concerns from content owners or creators. Digital watermark is proposed as a tool for protecting the intellectual property right. One of its functions is fingerprinting. That is, the content owner will embed the fingerprint, which represents the identity of the recipient, into the digital videos. Once an illegal copy is found, we may detect the watermark and trace the origin of illegal distribution. In this research, we propose a practical fingerprinting scheme for digital video streaming. We embed the watermarks into widely used MPEG-4 videos. In order to avoid the heavy computational burden in the video servers, the watermark is embedded into partially decoded data. SIFT is employed to facilitate the “blind detection” even after the video transcoding. The visibility, capacity, robustness and false detection are examined to satisfy the requirements of such applications. The experimental results show the feasibility of the proposed scheme.\nKeyword- digital watermark;partially decoding ; SIFT;blind detection \n","abstract_ch":"由於網際網路的蓬勃發展與資料壓縮技術的進步,大量的多媒體影音資訊在網路上被傳遞與下載。數位視訊因具有較高的商業價值與娛樂性,加上影音分享平台的普及,享受數位視訊串流已成為現今普遍的娛樂活動之一。然而,雖然使用者獲得了影視訊數位化所帶來的便利,數位資料的任意散佈引發了影音版權所有人的疑慮,數位智權管理因此成為現今重要的議題。數位浮水印被提出作為協助智權保護的工具之一,其中一項功能是用來追蹤多媒體資料的非法散佈者,即視訊所有人或提供者將資料傳送給某位合法使用者之前,將代表該使用者的資料嵌入於數位視訊當中,在發現某個在網路上流傳的數位視訊後,我們可由該視訊中偵測數位浮水印,一來可依其內容追蹤惡意散佈來源,二來或可用於降低使用者任意無償分享的意願。在本論文中,我們提出以追蹤來源為主要應用的數位視訊浮水印機制,在網路上經常作為分享的MPEG-4視訊中嵌入代表使用者的數位浮水印。為了減少視訊伺服器端嵌入浮水印所需耗費的龐大時間,我們利用部分解碼的方式實作壓縮域浮水印。同時,本機制利用SIFT在選定的畫面中決定嵌入與偵測浮水印位置,以利浮水印訊號在轉檔或是畫面形變後仍能進行盲偵測,另透過成對浮水印的設計以及浮水印位移的方法,增加使用者內容嵌入量以及避免偵測端誤判。實作上考量了數位浮水印的強韌性、不可視性與容量等議題,實驗結果展示此方法的可行性。\n\n關鍵字- 數位浮水印;部分解碼;SIFT;盲偵測 ","picture":"","personal_page":""},{"id":"15","phd":"0","class":"100","name_en":"Jie Lain","name_ch":"連捷","research_en":"Detecting and Anti-Detecting Shot Insertion and Deletion in H.264\/AVC Videos","resercher_intro":"","research_ch":"H.264\/AVC視訊片段增刪之偵測與反偵測\r\n","abstract_en":"Digital multimedia data can be edited easily by the powerful software these days. Therefore, many digital forensic techniques have been developed to authenticate multimedia content. 
Anti-forensic techniques are also proposed to remove editing traces. These anti-forensic methods study the weaknesses of existing detection algorithms to make editing undetectable. This thesis presents an anti-forensic method employing two features, abnormal coding modes and distribution of quantized transform coefficients, which are generated by the frame\/scene adding or deletion. First, the coding modes are examined in the Rate Distortion Optimization (RDO) process to limit the use of intra coding blocks in certain frames. Then, the relationship between QP and rate are examined to predict the reasonable distribution of quantized coefficients. Next, we change the reconstruction content to erase the detection features by adjusting the quantized coefficients according to the predicted distribution. Following the above steps, we store the coding modes and the processed coefficients and then copy them back in the second encoding process. The experimental results show that our scheme can successfully eliminate the features. Finally, we discuss two possible detection methods, deblocking energy and examining QP values in the rate control. The former method detects the forgery by checking the deblocking intensity of the reconstruction frames, and the latter method uses a known rate control mechanism to determine whether a correct QP value is assigned. These methods are effective in certain appropriate conditions and deserve more discussions. \n\nKey word – H264\/AVC, video forensic, video anti-forensic, frame adding\/deletion, video transcoding.\n","abstract_ch":"影像與視訊處理軟體的普及讓數位資料內容的真實性遭到若干懷疑,近期許多研究試圖偵測多媒體資料是否曾被編輯或竄改,以及相對應的反偵測方式,藉由兩者的交互改進協助提昇多媒體資料的真實性。反偵測技術尋求現有偵測方式的弱點,將編輯後的多媒體資料內可能留下的某種特徵移除以讓偵測失敗。本論文提出利用畫面增刪攻擊而引入的兩種特徵,即異常編碼模式與H.264\/AVC整數轉換中的量化係數分佈,實作視訊編輯竄改的反偵測方法。首先,根據連續畫面編碼模式之間的關係,在RDO (Rate Distortion Optimization)中限制不合理數量的畫面內預測模式。接著,使用編碼時的QP (Quantization Parameter)與位元率的關係,預測異常畫面內應有的整數轉換量化係數分佈,再將過多的非零係數逐步調整至預測的目標分佈,即藉由改變重建畫面內容移除可偵測特徵。完成以上步驟之後,將編碼模式與處理後的係數儲存起來,於編碼第二次時複製回去得到反偵測影片。實驗結果顯示,我們的方法成功地掩飾了視訊中的畫面增刪攻擊。論文最後也分別討論了兩個偵測方法,即使用去方塊濾波能量進行偵測,以及使用位元率控制應給定QP之偵測方法。雖然在這些方法中存在了某些限制條件,但是在合適的情況下仍具有值得研究的偵測效果。\n\n關鍵字 – H264\/AVC、影片竄改偵測、影片竄改反偵測 、畫面增刪、影片重壓縮。\n","picture":"","personal_page":""},{"id":"16","phd":"0","class":"100","name_en":"Wei-Yu Chen","name_ch":"陳威宇","research_en":"Recoverable Partial Scrambling for H.264\/AVC","resercher_intro":"","research_ch":"H.264可回復局部擾亂機制\r\n","abstract_en":"Protecting personal privacy on digital images and videos is important these days. In this research, we present a privacy protection mechanism in H.264\/AVC videos. The private visual information on video frames is scrambled by processing the data in the compressed bitstream directly so that the private region is not visible to the regular users. Nevertheless, the scrambled region can be restored to the original content by authorized users. Basically, the scrambling is applied by extracting and removing some data from the H.264\/AVC bitstream. These data will be embedded into the bitstream so that the recovery can be applied successfully by placing these data back. In other words, the de-scrambling is achieved via the methodology of information hiding. Since the H.264\/AVC encoder makes use of the spatial and temporal dependency for reducing the data size, careless partial scrambling on H.264\/AVC compressed bit-stream will result in drift errors. To solve this problem, the restricted H.264\/AVC encoding is employed to prevent the modified data from affecting the subsequent video content. 
Experimental results show that our method can effectively scramble the privacy region, which can be recovered by using the hidden information. In addition, the size of partially scrambled video is kept under good control.\n\nIndex Terms—H.264\/AVC, Partial scrambling, Privacy protection, Information hiding.\n","abstract_ch":"數位影音被廣泛使用於各種傳播媒體,對於畫面中的個人隱私保護成為一項重要議題。本論文提出了一個作用於H.264\/AVC視訊壓縮架構上的隱私保護機制,對於欲保密區域的視訊畫面資料進行擾亂,使得隱私區域在一般使用者面前是模糊而無法被正常觀看的,只有權限使用者能將畫面完整還原成原始影像。\n我們擾亂的方式是修改H.264\/AVC視訊編碼串流中的資料,使得解碼端得到錯誤資訊產生擾亂效果,最後再將原始資料利用資料隱藏的方式嵌入於視訊串流中,讓具有權限的使用者藉由取出該資料而確實還原隱私畫面區域。由於H.264\/AVC編碼利用空間與時間的相依性以獲得良好的壓縮效果,當我們實作區域性擾亂時,會因漂移誤差(drift error)的產生,造成非保密區域的畫面也受到影響。為了解決此一問題,本研究採取限制編碼的方式以及slice方式,藉由限制H.264\/AVC編碼過程中的畫面預測,有效解決飄移誤差的發生。實驗結果顯示本方法能夠確實擾亂隱私區域,使畫面模糊讓人眼看不清楚,權限使用者能夠確實還原畫面,達到個人隱私權的保護,而且整體視訊的資料量能獲得控制,符合相關的視訊應用需求。\n\n關鍵字—H.264\/AVC、區域擾亂、隱私保護、資訊隱藏。\n","picture":"","personal_page":""},{"id":"17","phd":"0","class":"100","name_en":"Hao-Wei Wu","name_ch":"吳浩維","research_en":"Image Retargeting by Cropping, Seam Carving and Scaling","resercher_intro":"","research_ch":"結合邊緣裁切與圖縫裁減暨縮放之影像畫面調整技術\r\n","abstract_en":"A new multi-operator image retargeting approach is proposed in this research. Cropping, seam carving and scaling are applied sequentially on the image to acquire the image with the targeted resolution. The saliency map of the image is first computed to serve as the reference for the subsequent processing. The foreground objects that occupy larger areas will be extracted and the boundaries of objects will be used to determine the edges for cropping. Then, seam carving is applied to remove insignificant content by employing the dynamic programming. The local energy decides when the seam carving process should be stopped. For certain appropriate images, the seams are increased so that the resulting aspect ratio can be approaching the targeted one. Finally, the image is simply scaled to the resolution of the display. The experimental results demonstrate that the essential part the image can be maintained to avoid the serious distortion from the resolution changes. Compared with the images obtained by adopting more complicated methodologies, the image of our scheme is not inferior so the efficiency can be achieved. \n\n\nIndex Terms-Multi-Operator, Image Retargeting, Cropping, Seam Carving, Visual Saliency Feature.\n","abstract_ch":"本研究提出結合多種運算子(multi-operator)的影像重新定位(image retargeting)演算法,數位影像被循序且適當地施予邊緣裁切(cropping)、圖縫裁減(seam carving)與縮放(scaling)等三種方式,達到目標解析度。首先,我們計算影像中的視覺顯著特徵 (visual saliency feature)以作為畫面調整的依據。在邊緣裁切中,擁有較大連通數的前景物體將被擷取,並以其邊緣訂出裁切邊界。圖縫裁減則利用動態規劃方法(dynamic programming)刪除最小能量圖縫,並利用限制刪除圖縫後所產生的局部能量大小決定圖縫裁減停止點。對於某些適合的影像,我們以增加圖縫的方式讓影像長寬比例進一步接近目標長寬比。最後,畫面將直接被縮放至目標大小以在適當的顯示器上呈現影像內容。實驗結果顯示,經由上述簡易的判斷與操作,我們確實能夠維持影像主體,避免在不同長寬比例的影像大小改變下產生嚴重失真,處理後的影像與其他使用較複雜方式所獲得的影像相差無幾,本演算法因而具有較高的實用性。\n\n\n關鍵字 -多種運算子、影像重新定位、邊緣裁切、圖縫裁減、視覺顯著特徵。\n","picture":"","personal_page":""},{"id":"18","phd":"0","class":"99","name_en":"Sheng-Hao Chang","name_ch":"張勝豪","research_en":"A dual-camera tracking system with object information hiding","resercher_intro":"","research_ch":"結合目標物資訊隱藏之雙攝影機追蹤系統\r\n","abstract_en":"In this thesis, we propose a framework consisting of two cameras. One static camera is used to detect and track the objects, and one Pan-TiltZoom (PTZ) camera is used to control\/collect high-resolution images. In addition, the information hiding technique is exploited to simplify the querying process of examining high resolution images corresponding to objects appearing in videos. 
The proposed scheme is composed of three main components. The first part is a moving object detection model, in which the moving objects will be collected from the static camera and be tracked continuously. The second part utilizes the position and size information derived in part one to control the PTZ camera. The face detection module is adopted here to recognize and preserve the region of interest (ROI) images with suitable resolution from the PTZ camera. The third part is an information hiding scheme. Those ROI images acquired from the PTZ camera and its related labels are embedded into the corresponding video frames. The authorized user can query these ROI images from the video. Experimental results demonstrate that the proposed scheme can detect moving objects, track them simultaneously, and capture the high resolution ROI images at about 25 fps (Frames per Second). The information hiding technique offers not only a simplified querying method but also reduces the number of files that need to be stored. Index Terms—PTZ camera, Static camera, Face Detection, Information Hiding, MPEG-4.","abstract_ch":"本研究提出結合固定式攝影機,與可左右轉動(Pan)、上下傾斜(Tilt)與縮放(Zoom) 的PTZ 攝影機之視訊監控機制。經由分析固定式攝影機所拍攝之畫面,偵測畫面中移動物體並加以追蹤,再控制PTZ 攝影機取得目標物較高解析度的影像,並將此細節影像經由資訊隱藏技術隱藏至固定式攝影機所記錄之影片中。本論文主要分為三個部分:第一部分為移動目標偵測,我們首先建立固定式攝影機畫面的背景資訊,以此找出可能的移動物體,並持續追蹤每個移動物體。第二部分為PTZ 攝影機控制模組,將固定式攝影機取得之目標資訊轉換成對應參數以自動控制PTZ 攝影機,並利用人臉偵測技術擷取目標的細部影像。第三部分為資訊隱藏,將記錄下來的細部影像以及與此目標相關的標籤嵌入至影片中,事後若要查詢資料時,即可利用此標籤取得特定目標相對應的細部影像。實驗結果顯示此系統能有效控制PTZ 攝影機去追蹤特定目標,並取得足夠解析度之目標物影像。利用資訊隱藏技術,除了擁有一種新穎的影像查詢機制外,還能夠減少所需儲存的檔案數目,甚至減少檔案儲存所需空間。關鍵字—PTZ 攝影機,固定式攝影機,臉部偵測,資訊隱藏。","picture":"","personal_page":""},{"id":"19","phd":"0","class":"99","name_en":"Long-Wang Huang","name_ch":"黃龍旺","research_en":"A Constant Quality Coding Framework for H.264\/AVC","resercher_intro":"","research_ch":"H.264\/AVC畫質平穩編碼架構\r\n","abstract_en":"Quality control is important in video coding, which tries to dynamically adjust the encoder parameters for achieving the target distortion. In this thesis, we propose a quality control framework for constant quality coding in H.264\/AVC. The proposed scheme can assign a suitable Quantization Parameter (QP) to each frame based on the scene complexity. In intra-coded frames, we evaluate the scene complexity based on the quality measurements of the resized and singular value decomposition processed frames. With the proposed model, we can adjust the QP to achieve the target distortion. Our proposed framework can use different quality measurements such as Peak Signal to Noise Ratio and Structural Similarity. For inter-coded frames, we employ the additional temporal information by simple motion estimation to improve the prediction accuracy. We also propose a dynamic encoding mechanism for the model adjustment. When the content has large variations, we may encode the frame twice. Otherwise, we encode it only once. In addition, the effect of scene changes on the model update is also considered to reduce the quality deviation from the target. Experimental results show that our scheme performs well in various test videos. 
Index Terms{D-Q model, constant quality coding.","abstract_ch":"畫質控制在視訊編碼中是一個很重要的議題,本研究提出一個根據畫面內容分析來動態調整編碼參數的方式,讓壓縮視訊達到品質恆定的需求。我們利用了畫面縮放與奇異值分解的方式來判定畫面複雜度,藉由訓練,我們將複雜度相對應至失真模型參數,以便選擇適當的量化參數。我們另也採用較簡單的畫面間預測來協助P畫面中的複雜度計算。我們所提出的架構可以被使用於不同的品質衡量標準,例如PSNR或SSIM。為了獲得更精確的恆定品質畫面,我們使用了兩種編碼,一種是當畫面變動過大時,採用編碼兩次的方式,第一次經由統計模型預估編碼參數,第二次以第一次編碼的結果來更新統計模型進行編碼。另一種則是畫面變動較小時,我們根據之前畫面的編碼結果更新預估編碼參數模型。實驗結果顯示我們提出的方式對於各種不同的影片皆可以達到恆定品質效果。關鍵字- 失真-量化模型、恆定品質","picture":"","personal_page":""},{"id":"20","phd":"0","class":"99","name_en":"Yu-Chuan Chang","name_ch":"張育銓","research_en":"A Geometrically Resilient Digital Image Watermarking Scheme Based on SIFT and Extended Template Embedding","resercher_intro":"","research_ch":"基於SIFT特徵點擷取與延伸樣板嵌入之強健型數位影像浮水印\r\n","abstract_en":"Synchronized watermark detection is an important issue. The embedded watermark may not be detected successfully if the image has undergone such geometrical transformations as rotation, cropping, scaling or even random bending. This research presents a feature-based still image watermarking approach. Scale-Invariant Feature Transform (SIFT) is first applied to locate the interest points, from which we form the invariant regions for watermark embedding. To resist geometrical transformations, the extended synchronization templates, which help to ensure that reasonably large invariant regions will be available for carrying the watermark payload and\/or for increasing the confidence of watermark detection, will also be embedded. In the detection phase, after SIFT, the template is first determined locally by adjusting the related affine parameters of the grid to match with the possible hidden template signal so that the watermark can be retrieved afterwards. Experimental results show the feasibility of the proposed method. Keywords—digital watermark; geometrical transformations; SIFT; StirMark .","abstract_ch":"當嵌入數位浮水印的靜態影像遭到例如旋轉、裁切、縮放,甚至是隨機變形等幾何攻擊時,經常造成數位浮水印偵測的失敗。本研究提出了基於特徵點擷取之強健型數位浮水印方法,來抵抗幾何變形攻擊所產生的同步問題。首先,我們利用尺度不變特徵轉換(Scale-space Invariant Feature Transform)演算法來擷取特徵點作為定位,依據此特徵點位置延伸出大量的局部不變區域,在每個局部不變區域中嵌入浮水印訊號,接著再嵌入解決同步問題的樣板訊號。較大的偵測區域使得浮水印嵌入量獲得提升,並且提高偵測的可信度。在偵測隱藏訊號時,由於影像可能遭受各種攻擊,導致特徵點資訊與原先不同。因此,我們在使用SIFT 擷取特徵點後,對於每個特徵點所建構的不變區域參數進行微調整,以尋找最佳的可能嵌入區域。利用延伸樣板解決同步問題後,我們即可從中擷取出數位浮水印資訊。實驗結果顯示我們所提出的浮水印方法對於各種不同的幾何攻擊與訊號處理攻擊,皆具有合理的強健性。關鍵字—數位浮水印;幾何變形攻擊;SIFT;Stirmark","picture":"","personal_page":""},{"id":"21","phd":"0","class":"99","name_en":"Shau-Yu Shiau","name_ch":"蕭少宇","research_en":"A Privacy Protection Scheme in H.264\/AVC by Information Hiding","resercher_intro":"","research_ch":"利用資料隱藏實現H.264壓縮視訊之隱私保護機制\r\n","abstract_en":"Protecting personal privacy on digital images and videos is important these days. In this research, we present a privacy protection mechanism in H.264\/AVC videos. The private visual information on video frames is scrambled by processing the data in the compressed bitstream directly so that the private region is not visible to the regular users. Nevertheless, the scrambled region can be restored to the original content by authorized users. Basically, the scrambling is applied by extracting and removing some data from the H.264\/AVC bitstream. These data will be embedded into the bitstream so that the recovery can be applied successfully by placing these data back. In other words, the de-scrambling is achieved via the methodology of information hiding. 
Since the H.264\/AVC encoder makes use of the spatial and temporal dependency for reducing the data size, careless partial scrambling on H.264\/AVC compressed bit-stream will result in drift errors. To solve this problem, the restricted H.264\/AVC encoding is employed to prevent the modified data from affecting the subsequent video content. Experimental results show that our method can effectively scramble the privacy region, which can be recovered by using the hidden information. In addition, the size of partially scrambled video is kept under good control. Index Terms—H.264\/AVC; Partial scrambling; Privacy protection; Information hiding.","abstract_ch":"數位影音被廣泛使用於各種傳播媒體,對於畫面中的個人隱私保護成為一項重要議題。本論文提出了一個作用於H.264\/AVC視訊壓縮架構上的隱私權保護機制,我們對於欲保密區域的資料進行擾亂,使得隱私區域在一般使用者面前是模糊而無法被正常觀看的,只有合法授權者能將畫面還原成原始影像。我們擾亂的方式是去修改H.264\/AVC視訊編碼串流中的資料,使得解碼端得到錯誤資訊產生擾亂的效果,最後再將原始資料利用資料隱藏的方式嵌入於視訊串流中,以讓合法授權的接收者藉由取出資料而確實還原隱私畫面。由於H.264\/AVC編碼利用空間與時間的相依性以獲得良好的壓縮效果,當我們實作區域性擾亂時,會因漂移誤差(drift error)的產生,造成非保密區域的畫面也受到影響。為了解決此一問題,本研究採取限制編碼的方式,藉由限制H.264\/AVC編碼過程中的畫面預測,有效解決飄移誤差的發生。實驗結果顯示本方法能確實擾亂隱私區域,使畫面模糊讓人眼看不清楚,授權的接收者能確實還原畫面,達到個人隱私權的保護,而且整體視訊的資料量能獲得控制,以符合各種相關的視訊應用。關鍵字—H.264\/AVC, 區域擾亂, 隱私保護, 資訊隱藏。","picture":"","personal_page":""},{"id":"22","phd":"0","class":"98","name_en":"Ying-Chang Wu","name_ch":"吳盈樟","research_en":"Dynamic Video Stream Switching by Using SP\/SI Frames of H.264\/AVC","resercher_intro":"","research_ch":"利用H.264\/AVC之SP\/SI畫面實現動態視訊串流切換\r\n","abstract_en":"The SP-frame in H.264 facilitates the drift-free bit-stream switching. However, the switching points have to be periodically inserted into the bit-stream in advance and thus the low-delay bit-stream switching may not be achieved. If many switching points are assigned, the coding performance of SP\/SI frames will be affected. In this research, we consider the dynamic bit-stream switching in H.264\/AVC, which can be applied on both the multi-rate coding and multi-view coding for video adaptation and free-viewpoint switching. A novel coding scheme is proposed to enable the drift-free switching at any frame. We separate a frame into two components, i.e. the motion compensated part and residuals, which are encoded independently by employing the concept of SP\/SI frames. For inter-view coding, the inter-correlation from the different views is further utilized. We apply SIFT to estimate the global displacement and the histogram matching algorithm to correct the color inconsistency in different scenes so that the negative effects of secondary SP frames can be reduced. Experimental results show that our scheme has good performance in free bit-stream switching. Index Terms - H.264\/AVC, multi-bitrate, multi-view, video switching, SI, SP","abstract_ch":"H.264 視訊編碼標準提出SP\/SI 畫面來達到無損串流切換。然而,SP\/SI 畫面必須先在原本的串流中設定切換點,因此通常無法達成即時切換的效果。若欲設定多個切換點,整體編碼效率將因而受到影響。本論文提出基於H.264 編碼標準之動態串流切換,這個新的編碼機制能夠達成即時無損串流切換,且能夠運用於多位元率串流與多視角串流,達到視訊的調適性和視角的自由選擇性。我們將一張畫面分成移動補償畫面和冗餘畫面,利用SP\/SI 的概念各自獨立編碼。對於多視角視訊,利用不同視訊畫面間的關係,減少資料大小,利用SIFT 演算法來預估整個畫面的位移,並且利用直方圖比對的方法,降低不同視角間的顏色差異,提升編碼效益。實驗結果顯示,我們所提出的方法在動態串流切換有良好的編碼表現結果。關鍵字-- H.264\/AVC,多位元率,多視角視訊, 影像切換, SI, SP","picture":"","personal_page":""},{"id":"23","phd":"0","class":"98","name_en":"Hui-Chun Hsu","name_ch":"許惠君","research_en":"A Joint Adaptive Rate-Quantization Model and Region of Interest Intra Coding of H.264\/AVC","resercher_intro":"","research_ch":"結合內容相關位元率量化模型與興趣區域之H.264\/AVC畫面內預測編碼\r\n","abstract_en":"This thesis presents a joint content adaptive rate-quantization model and region of interest intra coding of H.264\/AVC. 
The rate control of video coding is an important issue and the intra coding plays a very crucial role. Inappropriate assignment of bitrates in intra coding will deteriorate the overall coding performance. We will first present a more accurate content adaptive Rate-Quantization (R-Q) model, by which we can obtain the relationship between the Quantization Parameter (QP) of a macroblock and the block complexity. Given a target bit-rate, we can thus assign a more suitable QP for a frame. In addition, since our model is built on blocks, or more specifically macroblocks, Region of Interest (ROI) coding can also be achieved. More bits can be assigned to the ROI by using a lower quantization parameter (QP) so that the perceptual quality can be maintained within the limited bit-rate. Our macorblock-level R-Q model, compared with the traditional frame-level RQ model, is more flexible and can achieve the target bit rate more accurately.\n\nKey word—H.264\/AVC, ROI, Rate Control, Rate-Quantization Model\n","abstract_ch":"本論文提出一個結合內容相關位元率量化模型與興趣區域之H.264\/AVC畫面內預測編碼。在視訊編碼的位元率控制中,畫面內預測編碼佔了很重要的地位,其過高或過低的位元率將影響整體的編碼效率。在本篇論文中,我們提出了一個較準確的畫面內預測位元率量化模型(Rate-Quantization model, R-Q model)。我們根據區塊內容的複雜度與目標之位元率,求得此區塊之編碼率與量化參數(Quantization Parameter, QP)值的相對應關係。此外,在有了此內容相關位元率量化模型後,我們可結合興趣區域編碼,使得本機制在有限的位元率下,於興趣區域給予較多位元數,也就是較低之QP值,而使其具有較佳的畫質,而在視覺較不注意之區域給予較少位元數,藉由內容相關位元率量化模組與興趣區域的結合能在有限之位元數下達到較佳的人眼視覺效果。我們將畫面分成三個區域以各自給予適當的QP值。我們的實驗顯示,整體畫面平均之峰值訊號雜訊比 (Peak Signal to Noise Ratio, PSNR) 雖下降0.27 dB,但人眼視覺最關注區域之PSNR值增加了1.2 dB,而人眼最關注的前兩個區域則增加了0.51 dB。與傳統畫面階層之R-Q model相較,此區塊階層之R-Q model更具彈性,更容易達到目標位元率。\n\n關鍵字-H.264\/AVC, 興趣區域, 位元率控制, 位元率-量化模型\n","picture":"","personal_page":""},{"id":"24","phd":"0","class":"98","name_en":"Kai-Yi Cheng","name_ch":"程凱驛","research_en":"An Adaptive Traffic Flow Analysis Scheme Based on Scene-Specific Sample Collection and Training","resercher_intro":"","research_ch":"基於視訊場景資料蒐集與訓練之自適應車流估計機制\r\n","abstract_en":"This research presents a framework of analyzing the traffic information in the surveillance videos from the static roadside cameras to assist solving the vehicle occlusion problem for more accurate traffic flow estimation and vehicle classification. The proposed scheme consists of two main parts. The first part is a model training mechanism, in which the traffic and vehicle information will be collected from the characteristics of masks. Their statistics are employed to automatically establish the models of scene, including the implicit shape model of vehicles and the support vector machine of feature points. It should be noted that the proposed self-training mechanism can reduce a great deal of human efforts. The second part adopts the established implicit shape model and support vector machine to recognize vehicles. Each feature point is classified into a vehicle type and processed by the corresponding ISM. 
Experimental results demonstrate that the proposed scheme can deal with the scenes with different characteristics in the traffic surveillance videos.\n\nIndex Terms - Vehicle, ISM, SVM, SURF, Self-Training,\n","abstract_ch":"本研究提出針對固定式道路監控攝影機所拍攝畫面之分析工具,用於獲取道路上的交通資訊,以對車流進行估算。本論文主要分為兩個部分:第一部分為模型訓練機制,我們首先對畫面內容進行去背景,並利用形態學方法得到可能的車輛遮罩,再對遮罩面積進行統計分析後,取得畫面中可能之不同種類車輛大小資訊,並依此收集不同種類車輛之樣本影像。在每個區域自動取得定量之訓練樣本後,我們以支援向量機 (Support Vector Machine)搭配隱式型態模式(Implicit Shape Model)的技術,對資料進行訓練及相關處理,此自適應演算方式可以大幅減少模型建置的人力需求。第二部分為辨識機制,我們使用訓練完成的SVM對特徵點進行分類過濾,再利用訓練完成的ISM對場景中的車輛影像進行辨識,協助解決車輛影像交疊問題,同時提升車輛分類準確度。實驗結果顯示這個機制確實能夠適應不同的交通場景,有效對車輛進行辨識,達成車輛計數或車流估算的目的。\n\n關鍵字 - 車輛, ISM, SVM, SURF, 自我訓練,\n","picture":"","personal_page":""},{"id":"25","phd":"0","class":"98","name_en":"Ya-Xin Cheng","name_ch":"鄭雅欣","research_en":"A Compressed Domain Digital Watermarking Scheme for Transaction Tracking in Video Distribution","resercher_intro":"","research_ch":"使用於視訊散佈來源追蹤之壓縮域數位浮水印設計\r\n","abstract_en":"Due to the prevalent uses of digital video technology and the internet, many authorized videos have been illegally copied, downloaded and distributed. The concerns over intellectual property right infringement of video content are thus raised. Digital fingerprinting is a means of tracking the users who distribute the copyrighted multimedia data to provide a convincing evidence for the litigation. In this thesis, a partially decoding fingerprinting scheme is proposed. The quantization indices of the compressed video are decoded and embedded with the fingerprint and then encoded into the fingerprinted video. The computationally expensive motion estimation can thus be avoided in this transcoding process. In order to avoid the drift errors, which result from the error propagations in the inter-coded frames, only the non-referenced blocks are used for fingerprinting. Since the blind detection is adopted in the proposed scheme, the synchronization templates are embedded and their embedding positions are determined according to SIFT feature points such that the synchronized detection of geometrically attacked video can be achieved. Experimental results show that, even though the fingerprints are only embedded in the non-referenced areas and the video may be geometrically attacked, the fingerprinting watermark can still be detected successfully.","abstract_ch":"近年來由於網路的蓬勃發展與壓縮技術的進步,許多的多媒體檔案被任意的散佈或下載,數位智權管理變得日益重要,而數位浮水印技術可被用來追蹤這些多媒體資料的非法散佈者。在本論文中,我們選擇對MPEG-4視訊壓縮標準嵌入數位浮水印。為了減少將浮水印嵌入視訊所需耗費的時間,我們利用部分解碼來實作壓縮域浮水印,即在解碼端解碼出量化指標,加入預先準備好的嵌入資訊,再經由編碼端的可變長度編碼組成壓縮視訊。為了避免嵌入的訊號因視訊壓縮時的預測機制造成飄移錯誤(drift error),我們提出將訊號只嵌入每張畫面中的不被參考區塊。此外,由於我們使用盲檢測(blind detection)以符合實際應用及偵測的公正性,我們使用特定的SIFT(Scale Invariant Feature Transform)特徵點嵌入樣板以達成偵測的同步。實驗結果顯示,即使因只嵌入不被參考的區塊且經過量化讓嵌入的資訊量大幅減少,我們依然能夠於可能被攻擊的視訊中正確偵測浮水印,且具有一般數位浮水印所必須具備的基本特性。","picture":"","personal_page":""},{"id":"26","phd":"0","class":"98","name_en":"Chia-Yang Chiou","name_ch":"邱家揚","research_en":"SSIM-Based Constant Frame Quality Control for H.264\/AVC","resercher_intro":"","research_ch":"基於結構相似度之H.264\/AVC視訊畫面品質恆定控制\r\n","abstract_en":"The digital videos require effective compression to facilitate their transmission and storage. However, it is not easy to control the quality of compressed videos. If the related applications need to preserve the content of every frame in the video, the traditional rate control mechanism may not be suitable. In this research, we propose a constant frame quality control technique, which can reduce the quality variations between successive frames to avoid serious perceptual distortion. 
The proposed scheme may thus be helpful in the applications of video archiving or surveillance videos. In addition, by constructing the relationship between the quality and quantization parameter, the proposed method may also benefit the traditional rate control coding. Objective distortion metrics such as mean squared error or peak signal to noise ratio are poorly correlated with the human perceptual quality. Recently, various image\/video quality metrics based on the HVS have been proposed and the structural similarity (SSIM) index has been shown to be effective. Therefore, we adopt SSIM as the quality metric for our constant frame quality control. Compared with the reference software of H.264\/AVC, JM, our approach can reduce SSIM variation significantly. Furthermore, at the same SSIM index, the proposed scheme achieves lower overall bit-rate.","abstract_ch":"數位視訊需經有效的資料壓縮程序以利其傳輸與儲存。然而,壓縮後的視訊畫質並不易受到控制,連續畫陎品質的劇烈變動,可能造成若干視覺上的失真。若相關應用需要將每張畫陎內容適當地保存,只考慮位元率控制的編碼機制也許並不是非常適合。本研究提出視訊畫質恆定控制技術,減少視訊畫陎間的品質差異,提升視訊觀賞的流暢度與舒適感,有利於重要資料的保存。此外,如果我們能夠在編碼前即合理預測視訊畫質與編碼參數間的關係,對於位元率控制編碼也將有所助益。以往的視訊畫質估測經常採用均方誤差的方式,但均方誤差與人眼視覺之間缺乏良好的相關性。因此研究人員提出了各種畫質估測方式,而結構相似性指標(SSIM)是其中之一。由於結構相似性指標有效地模擬人眼視覺系統中擷取影像結構訊息的功能,所以我們採用結構相似性指標做為畫質量測依據,並藉此提出失真-量化參數預測模型、模型參數的預測方法以及實際編碼中的動態處理程序等,以達成畫陎品質恆定控制之目的。與H.264\/AVC 參考軟體JM 相較,我們所提出的方法不僅大幅降低了畫陎品質的變化,在同樣的整體視訊品質下,也擁有較低的位元率。","picture":"","personal_page":""},{"id":"27","phd":"0","class":"97","name_en":"Ming-Tse Lu","name_ch":"呂明澤","research_en":"A Practical Design of High-Volume Steganography in Digital Videos","resercher_intro":"","research_ch":"以數位視訊檔案為載體之高容量資訊隱藏設計\r\n","abstract_en":"In this research, we consider to exploit the large volume of audio\/vidio data streams in compressed video clips\/files for effective steganography. By observing that most of the widely distributed video files employ H.264\/AVC and MPEG AAC for video\/audio compression, we examine the coding fea- tures in these data streams and determine good choices of data modifica- tions for reliable and acceptable information hiding, in which the percep- tual quality, compressed bit-stream length, payload of embedding, effec- tiveness of extraction and efficiency of execution are taken into account. Different settings can thus be used to cope with varying requirements of applications. Experimental results demonstrate that the payload of the selected features for achieving a good balance among several constraints can be over 10% of the compressed video file size.\nIndex Terms— Steganography; H.264\/AVC; MPEG AAC.","abstract_ch":"本論文提出一個以數位影音視訊檔案為載體之高容量機 密資訊隱藏機制。近年來由於網際網路的蓬勃發展以及許多 影音社群平台的流行,在網路上觀看傳播高畫質數位串流影 音媒體更加便利。高畫質檔案多以 FLV 格式且以 H.264\/AVC 編碼為大宗,其音訊則以 MPEG AAC 編碼為主流。本研究 充分地利用了高畫質編碼技術之細節特徵資訊,將機密資訊 藏入於高畫質數位視訊檔案之影像及音訊串流中,以達成機 密通訊之目的。本論文大致分成兩部分,第一部分為對 H.264\/AVC 之視訊資訊隱藏。本研究提供了多種嵌入方法以 因應不同的應用及影片。論文的第二部分為對 MPEG AAC 之音訊資訊隱藏機制。結合視訊及音訊的嵌入機制,達成高 容量資訊隱藏,且盡量保持音視訊品質之目的。此外,為了 提升資訊嵌入的效率,我們亦提出了一個模式複製的程序, 以增進資料嵌入之執行速度。\n實驗結果顯示本論文所提出的演算法能夠藏入大於 10% 視訊檔案大小之機密資訊容量,並且對於音視訊品質及影片 的壓縮率皆列入考量。\n關鍵字—H.264\/AVC, MPEG AAC, 資訊隱藏。","picture":"","personal_page":""},{"id":"28","phd":"0","class":"97","name_en":"Sheng-Yuan Tsao","name_ch":"曹盛淵","research_en":"A Digital Watermarking Scheme Resisting Geometrical Transformations","resercher_intro":"","research_ch":"抵抗幾何攻擊之數位浮水印設計\r\n","abstract_en":"Fingerprinting is one of the potential applications of digital watermark- ing, which is expected to be helpful in discouraging illegal copying and pro- tecting the intellectual property rights of content owners. 
By embedding the watermark representing the individual fingerprint of the intended receiver in the content, we may trace down the source of distribution according to the extracted hidden signal of an illegal copy. In this research, we propose two video fingerprinting schemes for digital videos. Considering that the video frames may be geometrically modified, both of the schemes make use of Scale Invariant Feature Transform (SIFT) to deal with such attacks. To be more specific, our feature-based fingerprinting schemes employ the invariant re- gions of each specific frame based on the orientation and the scale of the scale-space feature points. The watermark will be embedded into DCT co- efficients or quantization indices, which will appear in the coding structure of such video codec as MPEG2 or MPEG4. Our first scheme requires the original frames in the watermark detector for recovering the attacked frames into the original shape before the watermark detection as this application may not strictly require the blind watermark detection. The second scheme doesn’t require the original frames to simplify the watermark detection. The experimental results will demonstrate the feasibility of the proposed methods.\nIndex Terms Digital watermark, fingerprinting, SIFT, Geometrical Dis- tortion.","abstract_ch":"指紋追蹤(fingerprinting)是數位浮水印中具有潛力的一種 應用,主要被期望用來阻止非法的複製以及保護文件擁有者 的智慧財產權。藉著嵌入代表接收者個人的浮水印,我們能 夠根據那些從不合法的複製中擷取出來的隱藏訊號來追蹤 散佈的來源。在這篇論文中,我們提出了兩個視訊的指紋追 蹤的方法。考慮到視訊畫面可能會受到幾何上的修改,兩個 方法皆採用了 SIFT 來解決這類的問題。更精確的說,在以 特徵點為基礎的指紋追蹤方法中,我們根據尺度空間所找到 的特徵點尺度以及方向來產生具有不變性的區域。而浮水印 也將被嵌入至視訊編碼規格如 MPEG2 或 MPEG4 中的 DCT 係數或量化指標。由於在指紋追蹤的應用中並未強制地要求 需要盲檢測,因此在我們第一個方法中,利用原始的畫面來 將受到攻擊的畫面回復成原始的形狀才去偵測浮水印。而第 二個方法則是不需要原始畫面來簡化浮水印的偵測。實驗結 果將會證明所提出方法的可行性。\n關鍵字—數位浮水印, 指紋辨識, SIFT, 幾何扭曲攻擊。","picture":"","personal_page":""},{"id":"29","phd":"0","class":"97","name_en":"Kai-Kai Hsu","name_ch":"許凱凱","research_en":"Adaptive Traffic Scene Analysis by using Implicit Shape Model","resercher_intro":"","research_ch":"利用隱式型態模式之自適應車行監控畫面分析系統\r\n","abstract_en":"This research presents a framework of analyzing the traffic in- formation in the surveillance videos from the static roadside cam- eras to assist resolving the vehicle occlusion problem for more accu- rate traffic flow estimation and vehicle classification. The proposed scheme consists of two main parts. The first part is a model train- ing mechanism, in which the traffic and vehicle information will be collected and their statistics are employed to automatically estab- lish the model of the scene and the implicit shape model of vehicles. It should be noted that the proposed self-training mechanism can reduce a great deal of human efforts. The second part adopts the established implicit shape model, which is a highly flexible learned representation, for vehicle recognition when possible occlusions of vehicles are detected. 
Experimental results demonstrate that the proposed scheme can deal with the scenes with different character- istics and the occlusion problem in traffic surveillance videos can be reasonably resolved.\nIndex Terms— Vehicle; traffic surveillance; occlusion; SIFT.","abstract_ch":"本研究提出一個針對固定式道路監視畫面之分析工具,用以協助 解決車輛影像交疊問題,並提升車流評估及車輛分類準確度。本論 文主要分為兩個部份,第一部份為模型訓練機制,經由搜集之交通 場景及車輛相關資訊,分析其統計特性,取得目標道路車流方向及 出現之機車、汽車、公車等各類車輛大小資訊,接著以自動化的方 式建立交通場景模型及代表車輛之隱式型態模式 (ISM)。值得注意的 是,此自適應機制可以大幅減少模型建置的人力需求。第二部份結 合了訓練完成的 ISM ,對可能發生車輛影像交疊的部份進行辨識。 實驗結果顯示了這個機制確實能夠適應不同的交通場景,並且有效 地解決道路監視器畫面中車輛影像交疊的問題。\n關鍵字 — 車輛,交通監控,交疊,SIFT。","picture":"","personal_page":""},{"id":"30","phd":"0","class":"97","name_en":"Yao-Hsuen Huang","name_ch":"黃耀萱","research_en":"A Robust Image Watermarking Scheme based on Scale-Space Feature Point Detection","resercher_intro":"","research_ch":"利用特徵點偵測之強健型數位影像浮水印\r\n","abstract_en":"Geometrical transformations, such as cropping, rotation, scaling or even random bending, cause the synchronization problem of detecting the digital image watermark. This research presents a feature-based watermarking scheme to deal with geometrical attacks in still images. First, the scale-invariant feature extraction is applied to locate the interest points that can survive the signal processing\nprocedures and affine transforms. A local invariant region based on the scale-space features of an image is then acquired. At each invariant region, two signals will be embedded, {em i.e.} the watermark carrying the hidden information and the extended synchronization pattern or grid, which helps to ensure that a reasonably large invariant region be available for carrying the watermark payload and increasing the confidence of watermark\nextraction. The detection of the grid is based on the local search by adjusting the related parameters of the grid to match with the possible hidden pattern so that the watermark can be retrieved afterwards. Experimental results demonstrate that the proposed scheme is robust against common image processing and geometrical\nattacks.","abstract_ch":"幾何變形攻擊,包括裁切、旋轉、尺度縮放,甚至隨機變形等,所產生的同步問題,對於數位影像浮水印的偵測影響極大。本研究提出了一種基於特徵點擷取之數位浮水印方法,來抵抗對於靜態圖片的幾何變形攻擊。首先,尺度不變之特徵點將會被擷取出來做為定位點,此特徵點亦能存活於一般的訊號處理以及仿射轉換等攻擊。利用此類特徵點適當的強韌性,我們依據特徵點位置建構出多個局部幾何不變之格狀形區域,並於每個局部幾何不變區域中嵌入兩種訊號,第一種為隱藏資訊之數位浮水印,以及第二種做為同步機制的訊號,此訊號又稱為樣板訊號。樣版訊號可以確保較大範圍之局部幾何不變區域能夠被擴張建構出來以提供經過幾何攻擊後之浮水印偵測。較大的偵測區域將使得浮水印嵌入量獲得提升,且浮水印的可信度也隨之增加。在偵測浮水印時,我們對於每個特徵點所定位之格狀區域參數進行微調整,尋找局部之最佳可能區域,浮水印訊號將因此被更可靠地偵測。實驗結果顯示我們所提出的浮水印方法對於幾何攻擊具有強健性,並且能夠抵抗一般的訊號處理攻擊。","picture":"","personal_page":""},{"id":"31","phd":"0","class":"97","name_en":"Tzu-Hsin Tseng","name_ch":"曾子欣","research_en":"A Practical Highlight Extraction and Classification Scheme in Baseball Videos Based on Transition Effect Detection","resercher_intro":"","research_ch":"利用串場效果偵測之實用棒球比賽精華擷取暨分類系統\r\n","abstract_en":"This research presents a system of analyzing video content in the MPEG compressed videos for classifying the highlights in baseball videos. The system makes use of the transition effects inserted preceding and fol- lowing the slow motion replays by the broadcaster, which demonstrate highlights of the game. First, we examine the characteristics of the tran- sition effects via the camera changes and the video content analysis to construct the transition effect template, which can help to locate all the appearances of transition effects. Next, we search the pitching views which appear before the transition effects and construct the pitching view model for this game. 
Finally, after we locate the highlight candidates, we will apply HMM (Hidden Markov Model) to analyze and classify the content to ensure that the extracted highlights match our definitions of high-level highlight semantics. Because the system is based on MPEG compressed video data streams, it can save a large amount of computa- tional complexity. The experimental results show the feasibility of the potential solution.\nIndex Terms–Baseball, video, highlight, transition effect, slow motion, MPEG, SVD, HMM","abstract_ch":"棒球比賽轉播在目前商用電視台十分常見,一場比賽的完整視訊 內容通常為九局上下,然而隨著實際比賽的變化以及球團戰術的實行, 其比賽時間之長往往超乎預期,對於觀看棒球比賽者而言,並不是每 個人都有充裕的時間能將整場比賽瀏覽完畢而且並非所有內容觀看 者都非常感興趣,因此,若是能事先將大量的比賽資訊透過系統做分 析整理並擷取出不同的精華片段提供觀看者做選擇,則可讓觀看者節 省大量的時間並得到想要的資訊。\n本論文提出一個在 MPEG 壓縮域底下對輸入系統的視訊內容做分 析並擷取出精華片段再加以分類的系統。由於精采畫面的前後通常都 會有電視台加入的串場效果以通知觀看者,因此我們首先透過鏡頭變 化偵測加上視訊內容分析找出串場效果畫面的資訊,並利用數位訊號 處理的方式準確的建立出串場效果模板,以此模版我們可定位出所有 串場效果出現的位置。再來,我們以 SVD 對所有出現在串場效果前後 的投捕畫面做特徵擷取並定出一個特徵的閥值範圍,可作為投捕畫面 的定位判斷,投捕畫面至串場效果即為分析精華片段的開始與結束點。 最後,在找到串場效果畫面以及投捕畫面之後,我們可將此段畫面以 HMM 加以分析與分類,可得知此片段是否為我們所定義的某種精華片 段。\n由於系統是建立在MPEG視訊壓縮的資訊串流做計算,因此可節省 大量複雜的運算。實驗結果顯示本系統在視訊畫面的分析上以及精華 擷取的分類上具有相當高的準確度與運算效率。\n關鍵字—棒球,視訊,精華,串場效果,慢動作,MPEG,SVD,HMM","picture":"","personal_page":""},{"id":"32","phd":"0","class":"97","name_en":"Men-Tu Juan","name_ch":"阮門督","research_en":"A Highway Preceding Vehicle Detction Scheme by Using Implicit Shape Model ","resercher_intro":"","research_ch":"利用隱式型態模式之高速公路前車偵測機制\r\n","abstract_en":"Developing a practical driver assistance system for ensuring driving safety has become an increasingly important issue. The major risk of driving on the highway comes from possible collisions of the vehicle with the preceding one because a suitable distance is not well maintained. Therefore, knowing the relative position of the preceding vehicle and the surrounding cars should significantly reduce the risks. In this thesis, we would like to develop a highway preceding vehicle detection\/tracking scheme, in which a monocular vision-based system for detecting the preceding vehicle in close and mid-range view will be designed to help provide a better view for the drivers.\nOur approach is based on an appearance-based methodology, i.e. Implicit Shape Model. A codebook is built for vehicle detection and tracking by using the training images captured from the real scenes. The collection of training images are divided into three parts: fully rear view, partially rear view from left and from the right sides. By applying scale-invariant feature transform (SIFT) to extract the interest points, we have a set of good features presenting the preceding vehicles. Then, we group those features to build up the codebook by clustering. Three models will thus be constructed. For detection and tracking the objects, we apply SIFT detector again in the real scenes. In each scene, we compare the extracted features with the codebook to find its matched representative features. Once a model is found, we can identify the ROI based on the scale and position indicated in the models. We can continue searching for the left and right side of pre-identified ROI to detect more possible vehicles. The experimental results show that vehicles can be detected in each of the three areas, i.e. 
right in front of the driver and his left\/right-hand side areas.","abstract_ch":"發展實用的駕駛輔助系統來確保駕駛安全已日漸成為一項重要的課題。駕駛者在高速公路上所面臨的主要危險來自於與前車未保持適當距離而導致可能的車輛碰撞。因此,得知與前車以及周圍車輛的相對位置將可大幅降低此種危險。在本論文中,我們發展一個高速公路前車偵測\/追蹤機制,設計以單眼視覺為基礎之系統用來偵測前方車輛位置。我們的方法主要根據一種型體之方法論,即隱含式外型模型。我們先由實際場景得到欲用來訓練的圖像,在藉此建立一個碼簿。這些收集到的訓練圖像可被區分成三個部份:完整拍攝車後的景象,部分從左方拍攝車後的景象,及部分從右方拍攝車後的景象。透過使用尺度不變特徵轉換 (SIFT) 擷取興趣點,我們可以擁有代表前方車輛的良好特徵。接著,我們利用群聚的方式集群這些特徵來建立我們的碼簿。為了偵測以及追蹤物件,我們再次使用尺度不變特徵轉換在實際場景上。在每一個場景中,我們比較擷取出來的特徵與碼簿以找出匹配的代表特徵。一旦模型建立完成,我們可以根據尺度以及在模型內指示出的位置辨識出有興趣區間 (ROI) 。我們可以繼續搜尋先前定義的左邊車輛ROI以及右邊車輛ROI以偵測更多可能的車輛。實驗結果顯示車輛可以在三個區域中被偵測。","picture":"","personal_page":""},{"id":"33","phd":"0","class":"96","name_en":"Chung-Chi Tsai","name_ch":"蔡鐘琦","research_en":"A Video Copy Detection Scheme Based on Spatial and Temporal Feature Extraction\n","resercher_intro":"","research_ch":"基於時間域與空間域特徵擷取之影片複製偵測機制","abstract_en":"Digital videos are distributed widely these days on various kinds of media thanks to the proliferation of cheaper but increasingly powerful personal computers, the prevalence of high-speed networking facilities and the advanced video coding technologies. Many video web servers are available nowadays to provide convenient platforms for users to upload and share digital videos. However, video content providers do not always support these video web servers since many videos are uploaded\/shared without their permission and infringe their intellectual property rights (IPR). The popular video servers may often be requested to remove certain video clips or even be sued for the copyright violation. Therefore, the issues of copyright protection become critical for the owners of popular video web servers to reduce such controversies or disputes.\nIn this research, we aim at providing a feasible content-based video copy detection scheme. The content of the uploaded video will be matched with those of the original videos stored in the video web servers to determine whether it is a duplicate copy that may infringe the copyright. To be more specific, the content matching will be based on the comparison of the significant features, which are extracted from the uploaded and original videos and act as the signature or video hash, instead of the videos themselves to avoid the requirement of extremely large storage. First, the shot boundary detection is applied on the videos to determine the candidates of key frames. The key frames with unique visual characteristics will be selected as the anchor points for content matching. Then the spatial or pixel domain hash will be extracted via the techniques of vector quantization and singular value decomposition from the anchor frames for efficient retrieval. Finally, the temporal features, i.e. the shot lengths, will be matched to further ensure the correctness of content matching. The research objective is to maintain a good balance between robustness, discrimination and efficiency. 
We believe that the contribution of this research will also be helpful to such fields as consumer multimedia collection, multimedia linking and content analysis.\nIndex Terms— Video copy detection, MPEG, VQ, Scene-change detection, keyframe.","abstract_ch":"隨著個人電腦與各式錄影設備的普及, 配合寬頻網路的建置, 以及先 進的視訊編碼技術, 大量的數位視訊得以廣泛地傳播與流通。 同時, 許 多視訊共享網站被建立, 提供多樣的途徑讓使用者上傳與分享數位視訊, 而目前的網路頻寬有相當大的部分即為傳遞此類網站視訊資料所使用, 由此可看出其受歡迎的程度。 然而, 對於擁有內容版權的電影\/電視公司 來說, 這樣的任意分享並不為他們所支持, 他們不願意讓內容被無償使 用, 而大量未經授權的影片被放置於分享平台也可能影響其獲利。 因此, 越來越多的著名視訊分享網站被要求移除某些違反版權的影片片段, 甚 至遭到以違反著作權條款為由所控告。 如何保護著作權並減少版權問題 所引發的爭議成為這些視訊分享網站所要面對的重要議題。\n本研究的目的在於提供一個藉由視訊內容比對來偵測視訊複製的機 制。 簡言之, 當視訊片段被上傳後, 該片段經處理後所產生的特徵資料 會與儲存於視訊網站上的原始特徵資料比對, 以判斷上傳資料是否來自 於原版影片的複製。 為有效達成此目的, 我們將由視訊資料中擷取基於 內容所產生的簽章或是雜湊函數以增進執行效率, 避免大量視訊資料的 儲存。 我們將先利用場景切換偵測技術將影片分成多個片段, 並由這些 切換場景畫面中找出關鍵畫面, 再由這些關鍵畫面上取得空間域或像素 域上的雜湊函數值。 我們利用向量量化以及奇異值分解等方式產生所需 比對的像素域特徵資料。 利用正確比對所得到的畫面做為定位點, 然後 我們再使用時間域特徵來確認視訊內容比對的準確性。 本研究的主要挑 戰在於如何於視訊雜湊函數的強健性、 視訓分辨性與比對效率三者間取 得平衡。 我們相信本研究的產出不僅能夠提供一個視訊複製偵測的方式, 並且將有助於多媒體內容分析研究及其相關應用。\n關鍵字— 複製影片偵測,MPEG, 向量量化, 場景偵測, 關鍵畫面。","picture":"","personal_page":""},{"id":"34","phd":"0","class":"96","name_en":"Ching-Yu Wu","name_ch":"吳靖宇","research_en":"A Joint Watermarking and ROI Coding Scheme for Traffic Surveillance Videos","resercher_intro":"","research_ch":"結合數位浮水印與興趣區域位元率控制之車行視訊編碼\r\n","abstract_en":"A new application of information hiding by employing the digital watermarking techniques to facilitate the data annotation in traffic surveillance videos is proposed in this research. As there are more and more roadside surveillance cameras are deployed, the applications related to traffic surveillance systems become important. In the pro- posed schemes, the data collected from intelligent transportation sys- tems (ITS) are embedded into the corresponding regions of traffic scenes to facilitate the data management. The scheme consists of two parts. The first part is the object-based watermarking, in which the infor- mation of each vehicle collected from other sensors\/sources in the ITS will be conveyed\/stored along with the visual data via information hid- ing. The traffic scene captured by a video camera will be analyzed and the individual vehicles are extracted as moving objects, which will be embedded with the associated information. The scheme is integrated with H.264\/AVC, which is assumed to be adopted by the surveillance system, to achieve an efficient implementation. The second part is a Region Of Interest (ROI) rate control mechanism for encoding traffic surveillan videos, which helps to improve the overall performance. The quality of vehicles in the video is thus well preserved and a good rate- distortion performance can be attained. 
Experimental results show that this potential scheme works well in traffic surveillance videos.\nIndex Terms— Digital watermark, H.264\/AVC, information hiding, ROI, rate control, ITS.\n","abstract_ch":"本論文提出一個結合數位浮水印與興趣區域 (ROI) 位元率控制的車行視訊 編碼機制。 近年來道路監視器被大量地設立以增加對人車安全的保障, 同時也 使得交通監控相關應用更為廣泛。 本研究將智慧型運輸系統 (ITS) 所蒐集的車 輛相關訊息, 利用數位浮水印技術嵌入於監控畫面中, 以減少用來解讀畫面的 資料量, 便利資料的整理, 並且增加資料的可讀性。 本論文大致分成兩部分, 第 一部分為以區分車行畫面前景與背景物為基礎的數位浮水印嵌入與偵測機制。 透過背景建立, 我們將畫面中包含車輛的區塊擷取出來, 並將其對應至 ITS 感 測器所得到個別車輛資訊以進行浮水印嵌入, 浮水印技術則與壓縮車行視訊所 使用的 H.264\/AVC 緊密結合以增加系統執行的效率, 確保視訊畫質不受影 響, 維持壓縮視訊的長度, 並確認隱藏訊息可被正確的偵測。 論文的第二部份 為建構於 ROI 之位元率控制機制, 此機制利用車行畫面的特性, 訓練相關模 型以適用於各種交通監控場景, 透過有效的預測, 我們能夠準確的將較多的位 元分配於 ROI。 因此, 我們不僅可藉此告知解碼端關於數位浮水印的嵌入位置 以利其偵測, 並可協助系統提高整體視訊編碼表現, 有效提升車輛部分的影像 畫質。 實驗結果顯示了此車行畫面編碼系統的優點與實用性。\n關鍵字— 數位浮水印,H.264\/AVC, 資料隱藏,ROI, 位元率控制, 智慧型 運輸系統。","picture":"","personal_page":""},{"id":"35","phd":"0","class":"96","name_en":"Ing-Fan Chen","name_ch":"陳穎凡","research_en":"The H.264\/AVC Video Content Authentication Scheme by Using Digital Watermarking","resercher_intro":"","research_ch":"植基於數位浮水印之H.264\/AVC視訊內容驗證機制\r\n","abstract_en":"Digitization of videos brings a lot of convenience to the transmission and archiving of visual data. However, the ease of manipulation of digital videos gives rise to some concerns about their authenticity, especially when digital videos are employed in the applications of surveillance. In this research, we try to tackle this problem by using the digital watermarking techniques. A practical digital video watermarking scheme for authenticating the H.264\/AVC compressed videos is proposed to ensure their correct content order. The watermark signals, which represent the serial numbers of video segments, are embedded into nonzero quantization indices of frames to achieve both the effectiveness of watermarking and the compact data size. The human visual characteristics are taken into account to guarantee the imperceptibility of watermark signals and to attain an efficient implementation in H.264\/AVC. The issues of synchronized watermark detections are settled by selecting the shot-change frames for calculating the distortion-resilient hash, which helps to determine the watermark sequence. The experimental results demonstrate the feasibility of the proposed scheme as the embedded watermarks can survive the allowed transcoding processes while the edited segments in the tampered video can be located.","abstract_ch":"數位科技的快速發展與先進的資料壓縮技術讓視訊資料的儲存與傳輸變得更為便捷,數位攝影機已經取代了傳統類比式錄影設備而廣泛運用於各種應用中。然而,數位資料易於編修的特性卻使其真實性受到若干質疑。在視訊資料中,片段的插入、刪除、更換或互換是最常見也最容易施行的攻擊方式,這樣的攻擊可在不被察覺的情形下改變資料的內容。本研究針對此類攻擊提出一個基於數位浮水印之視訊內容驗證機制,我們透過數位浮水印的嵌入與正確的偵測與驗證,來確認該視訊的內容順序。在本機制中,數位浮水印在視訊資料被壓縮的過程中被嵌入而在解壓縮的過程中被偵測以增加運算效率。H.264\/AVC優異的編碼效能使其被廣泛地運用於多種場合,我們因此選擇將數位浮水印機制建構於H.264\/AVC之上。\n為了達成浮水印有效嵌入與偵測,以及避免檔案大小的增加,我們將浮水印訊號嵌入於非零的量化係數上,而數位浮水印資訊則包含了影片片段的序號。為了避免浮水印的嵌入造成影片畫質下降,我們根據Watson''s視訊模型中的亮度遮罩(luminance masking)來調整浮水印的能量,在不影響人眼視覺的情況下增加浮水印的強健度。此外,我們也偵測場景轉換畫面以其產生能夠適度抵抗失真壓縮的雜湊數,再利用此雜湊數產生擾亂浮水印的序列,一方面可以達到浮水印嵌入和偵測的同步,另一方面也增加了浮水印的安全性。在論文中我們對浮水印嵌入與偵測的方式做深入討論,以期確實能實作於H.264\/AVC中,並達成預期目標。實驗結果顯示藉由數位浮水印的方式可協助驗證數位影片的完整性與真實性。","picture":"","personal_page":""},{"id":"36","phd":"0","class":"96","name_en":"Chih-Wei Hsu","name_ch":"徐治瑋","research_en":"A Joint Encryption and Digital Watermarking Scheme in H.264\/AVC Videos for Digital Rights Management","resercher_intro":"","research_ch":"應用於數位智權管理之H.264\/AVC視訊加解密暨數位浮水印機制\n","abstract_en":"H.264\/AVC is expected to be widely used in streaming appli- cations due to its decent coding performance. 
Many commercial and surveillance systems will employ H.264\/AVC to facilitate the transmission and archiving of video content. However, the con- venience of distributing digital videos may also raise certain con- cerns from content providers and owners so the issues of Digital Rights Management (DRM) of videos become critical these days. In this research, we propose a joint encryption and digital wa- termarking scheme in H.264\/AVC compressed videos. We first present a novel selective encryption scheme under the framework of this state-of-the-art video codec. The main idea is to not only make the encrypted video useless for anyone who does not own the correct key for decryption but also keep the compatibility of H.264\/AVC syntax so that the complexity of decoding can be reduced. In other words, the video frame can be extracted for viewing but is scrambled to conceal its intelligibility. The ideas of partial encryption lead to more efficient power management. In addition, we also present a low complexity fingerprinting al- gorithm by using the techniques of digital watermarking for en- crypted videos. The information of target user can be embedded for the purposes of trailer tracking. Experimental results demon- strate that the proposed scheme effectively scramble the video frame content and bring a negligible impact on the coding perfor- mance. Moreover, the coding complexity is almost not affected. The proposed fingerprinting algorithm can embed a reasonable amount of information into the compressed bit-stream without degrading the video quality.\nIndex Terms— Selective Encryption, Partial Encryption, Dig- ital Watermark, H.264\/AVC, Digital Rights Management.","abstract_ch":"H.264\/AVC優異的編碼壓縮效能使其被廣泛地使用於網路視訊串流 傳輸的各種應用。 然而, 網路與壓縮技術所帶來的方便性也讓未授權視 訊更容易地被散播, 數位影片的智權管理因此成為一項重要的議題。\n本論文針對 H.264\/AVC 提出一個結合選擇性加密與低運算複雜度 數位浮水印的智權管理機制。 本機制利用選擇性加密, 在儘量不影響影 片大小的前提下, 有效率地擾亂原來的視訊畫面, 加密後的影片依然保 有 H.264\/AVC 的相容性, 雖可被播放但有著凌亂的畫面。 選擇性加 密可以減少影片加解密時所需要的時間, 也間接減少了電力上損耗。 此 外, 本機制更在視訊壓縮與加密的過程中, 抽換視訊串流中的一部分資 料, 並在此段資料中嵌入相對應使用者的個人資訊, 在缺乏該段資料的 情況下, 視訊畫面將受到進一步的擾亂。 使用者為了觀賞正確畫面必須 另行下載此段資料, 而在該資料與原先視訊串流結合的過程中, 代表使 用者的數位浮水印將被嵌入, 以供日後追蹤之用, 也就是當非法散播的 H.264\/AVC 影片被發現後, 可經由此訊息追溯下載使用者。 此部份設 計的目標在於如何減少額外資料的傳輸量, 以及對於影片畫質的保障。\n實驗結果顯示本論文所提出的演算法能夠有效擾亂畫面, 並且對於影 片的壓縮率只有些微的影響, 而視訊串流壓縮的時間則幾乎沒有增加, 浮 水印嵌入演算法亦不影響視訊畫質。\n關鍵字— 選擇性加密, 部分加密,H.264\/AVC, 數位浮水印, 智權管 理。","picture":"","personal_page":""},{"id":"37","phd":"0","class":"96","name_en":"Chin-Yi Cho","name_ch":"卓晉億","research_en":"A TV News Analysis Scheme based on Text and Anchorperson Identification","resercher_intro":"","research_ch":"基於文字與主播偵測之新聞視訊分析系統\r\n","abstract_en":"With the Proliferation of multimedia data, requests for effective and efficient video retrieval are growing. Among the various kinds of digital videos, TV news videos play an important role in broadcasting nowadays and may also serve as a major source of daily information for people these days. In Taiwan, there are several TV news stations and duplicated news videos are repeated again and again. Watching them may be a waste of time. Considering that the digital recording facilities are widely available\nnow, we propose a classification scheme that can cluster the recorded TV news video segments so that the viewers may choose to watch the related archived news and even retrieve the useful information from them.\nIn the proposed scheme, we make use of the text in TV news for clustering videos. 
It should be noted that the text analysis in Taiwan’s TV news needs further processing since the text areas in Taiwan’s TV news may include various information including the caption, weather report, and stock market indices etc. It’s challenging to locate the area where we are really interested in. Furthermore, video OCR is not mature enough and does not work quite well in Taiwan’s TV news broadcasting because of the special and different text fonts used in each TV news channel. We apply the low-level feature extraction and SVM to locate the possible region of interest, which should help to differentiate new segments from commercials. Then the anchorperson scene will be located to divide a piece of news into two parts, one part with the anchorperson describing the news and the other part related to the news content itself. Next, we extract the caption in the second part, in which the text is more stable and representative. After refining the extracted text areas, a cross-correlation process is used to find the similar pattern in captions of video segments to relate them together. Experimental results will be\nshown to demonstrate the feasibility of this potential solution.","abstract_ch":"在數位科技漸趨成熟的今日,大量的影音資訊藉由數位化與日益進步的壓縮技術而得到廣泛的傳遞與永久的保存。現今的使用者能夠藉由不同的管道取得大量的多媒體資訊,但龐大的多媒體資料若需以人工方式搜尋或加註以分類則是相當耗時的。因此,如何協助使用者有效率地搜尋及萃取多媒體資訊的技術與工具成為一個相當重要的研究議題。\n本研究針對新聞視訊提出協助內容擷取與分類的工具。在新聞視訊內容中,文字是最重要的特徵之一,少許的幾個文字可為新聞內容給予精確的註解,若能對新聞中的文字進行有效的識別,將有助於對新聞內容的認識與了解。然而,在台灣的新聞頻道中,畫面文字包括了新聞標題、氣象預報、股市行情與跑馬燈,內容繁複,且文字字體與字型及其大小格式不一,而目前的文字識別軟體僅能針對少數已訓練過字型做識別,無法作用於台灣多數新聞頻道中的文字,如何從複雜的新聞畫面中擷取出利於分析的區域,便成為待解決的問題。此外,穿插於新聞播報中的廣告會使得內容分析受到影響,因此我們必須予以有效剔除以利分析。本研究將針對有代表性意義的文字區域進行偵測擷取及相關處理,並對上述問題提出解決的方法。","picture":"","personal_page":""},{"id":"38","phd":"0","class":"95","name_en":"Chi-Heng Lan","name_ch":"藍啟恆","research_en":"A Highlight Extraction and Classification Scheme for Baseball Videos","resercher_intro":"","research_ch":"棒球比賽精華片段擷取分類系統\r\n","abstract_en":"This paper presents a practical highlight extraction and classification schemes for baseball videos. The approach relies on precise detections of transition effects inserted at the beginning and the end of the replays in the game, which demonstrate the game highlights. It is worth noting that the complexity of the highlight extraction procedure should be limited since it is an auxiliary function of a digital video recorder. Therefore, in the proposed system, the features of MPEG compressed videos are used for subsequent processing to archive efficiency. The properties of transition effects are exploited so that the effects can be accurately retrieved for locating the video segments of replays. Next, the pitching view, which is the starting point of every play in baseball games, will be extracted via Support Vector Machine (SVM). The contents of the play can then be analyzed and classified to determine their types or exciting levels. We classify the extracted highlight segments by using Hidden Markov model (HMM). 
Experimental results show that the accuracy is good enough to achieve the practical highlight extraction for baseball videos.","abstract_ch":"本論文提出一個實用的棒球比賽精華片段擷取與分類的方法。由於 棒球比賽時間較久,若使用者需要在短時間內瀏覽多場比賽,一個自動 的精華片段擷取將帶來許多便利。值得注意的是,精華片段擷取可視為 一個數位錄影機的附屬功能,因此,它所耗費的計算資源應該遠少於數 位視訊的解壓縮過程以符合實際應用的需求。我們所提出的系統首先將 偵測所謂串場效果的位置。由於棒球比賽精彩片段重播出現的前後,通 常會由電視台加入串場效果以告知觀眾,我們利用串場效果獨特的視覺 特性,準確偵測其位置,即重播畫面出現處。接著,我們以 SVM (Support Vector Machine) 分類器找出投打對決畫面,作為此段重播實際畫面的起 點。再來,我們對於這些精彩畫面以 HMM (Hidden Markov Model)加以 分析與分類,將內外野精采畫面的情境擷取出來,讓使用者更容易得到 所需的內容。本系統主要是建構在 MPEG2 數位視訊壓縮格式上,即我們 有效利用壓縮後的 MPEG 串流資訊,做為後續分析與處理的主要參考依 據。如此設計不僅有效降低系統運算的複雜度,也讓我們的系統與其它 已提出的方法相較,更具有實用性。實驗結果顯示,我們所提出的系統 具有相當高的準確度。","picture":"","personal_page":""},{"id":"39","phd":"0","class":"95","name_en":"Kuei-Chih Chen","name_ch":"陳桂枝","research_en":"Real-Time Multi-Camera Tracking by Exploiting Features of H.264\/AVC","resercher_intro":"","research_ch":"利用H.264\/AVC特徵之多攝影機即時追蹤系統\r\n","abstract_en":"Visual surveillance systems have played a more and more important role these days to protect the safety and\/or properties of people. In this research, we consider building a real-time multi-camera tracking system by using PTZ(Pan-Tilt-Zoom) cameras. Several PTZ cameras are deployed in an indoor environment to track a single walking person. Each PTZ camera can dynamically change its focus to have a better view of the tracked object so that the subsequent intelligent processing or understanding of the scene can be applied. Cameras can communicate with each other by exchanging simple messages via the wireless networking. The captured video will be compressed with the state-of-the-art video codec, H.264\/AVC, to facilitate the video transmission and storage. In order to achieve the effective and real-time tracking, we make use of the residuals of H.264\/AVC to detect the position and size of the moving object, especially when the PTZ camera is not in the still mode. The camera that is close to the object will apply the principal tracking while the nearby camera(s) may apply zooming-in on the walking person so that the system can have multiple views of the target. The experimental results show the feasibility of our proposed multi-camera tracking system as the cases of failures in tracking the moving object can be significantly reduced.","abstract_ch":"視訊監控在現今的生活中扮演著極為重要的角色,越來越多的監控攝影機被佈建於各個場合以協助保障人員或財產的安全。然而,大量的監控資料管理不易,負責監看畫面的管理人員注意力有限,監控的效能因而打了折扣。為了彌補人力監視與操作上的不足,我們提出使用多台PTZ(Pan-Tilt-Zoom)攝影機實現室內環境多角度自動化人物即時追蹤。透過多台攝影機之間的溝通,對目標人物持續且有效的追蹤,並透過多角度與變焦拍攝(zoom),讓管理人員獲得更清晰的視訊監控資訊。\n 本研究所提出之監控系統採用H.264\/AVC壓縮技術來處理視訊資料以利畫面資訊的傳輸與保存。為減低系統負擔以達成即時追蹤,我們利用H.264編碼過程中所擷取之冗餘(residual)資訊,輔助PTZ攝影機在旋轉過程中,對出現在畫面中之移動人物作有效的偵測。對於單一移動人物,系統會以一台距離目標物較近之攝影機執行主要追蹤,同時以鄰近幾台攝影機配合施予多角度變焦輔助拍攝。實驗結果顯示,利用多台追蹤攝影機可減少目標物消失於監控畫面的情況,經由充分利用PTZ攝影機的功能,可得到更清晰且多角度的監控畫面。\n","picture":"","personal_page":""},{"id":"40","phd":"0","class":"94","name_en":"Chien-Chung Chen","name_ch":"陳建昌 ","research_en":"A Baseball Highlight Extraction Scheme based on Transition Effect \r\nDetection and Content Analysis.","resercher_intro":"","research_ch":"植基於串場效果偵測與內容分析之棒球比賽精華擷取系統\r\n","abstract_en":"Watching sports videos has always been an important and popular recreation. The audiences nowadays can enjoy watching the sports games at home with their high-quality audio-visual facilities and even record the videos by using digital video recorders. When the audiences choose to record the video for time-shift purposes, they may not be interested in watching the whole game but the video highlights only. 
In addition, the highlight parts in current sportscasts are always followed by slow-motion replays. A transition effect is usually inserted between the normal frame and the replaying frame to inform the audiences of the replay. Therefore, the appearance of a transition effect has a direct linkage to the video highlight. In this research, we propose to detect transition effects for baseball videos highlight extraction. In order to reduce the computational cost of hardware, the proposed method processes MPEG compressed bit-streams directly. We make use of the color information of MPEG streams and the motion information including motion vectors and the macro-block types in frames. Then we analyze to determine whether the transition effects occur by the characteristics of transition effects. Next, we use the detected transition effects to train a template, which will be used for matching in the remaining parts of video. Furthermore, we classify the replay segments so that the user can choose the video segment that he or she really likes to watch. Since the users will be more interested in watching the normal scenes of highlights, we trace back to find out the pitching view as the starting point of a highlight. Experimental results show the feasibility of the highlight extraction system.","abstract_ch":"由於數位化與儲存技術的進步,多媒體資訊在應用的需求上不斷成 長,對於數位視訊資料提供快速搜尋機制,甚至取出精華等成為當今研 究的重點。而運動比賽分析是視訊研究的主要項目之ㄧ,其在娛樂與商 業上的價值相當高,考慮一般比賽時間較長且精華片段較一般影片明 確,本論文提出一種快速且準確的運動比賽精華萃取系統,以提供使用 者更佳的服務,並以棒球比賽為主要測試項目。\n轉播單位較觀眾具有專業水準,因此當出現精采畫面時,會提供相 關重播鏡頭給觀眾欣賞。為了區隔重播片段與現場畫面,因此會利用串 場效果作為緩衝。由於各家轉播單位與轉播的內容往往大相逕庭,串場 效果也是呈現多樣化。本研究藉由通用性的串場效果偵測,找出串場效 果,並由此定位重播片段位置,進而分析重播片段內容,找出球賽精華。 最後再由重播片段往前搜尋,找到原始事件位置,作為精華片段的標記 位置。整個系統建構在現今使用的數位儲存格式 MPEG 視訊壓縮的資訊 做計算,節省大量複雜運算。實驗結果也顯示本系統具有相當高的準確 度與效率。","picture":"","personal_page":""},{"id":"41","phd":"0","class":"94","name_en":"Hsiang-Yi Ma","name_ch":"馬翔毅","research_en":"Object Detection and Tracking for a Moving Surveillance Camera by Using Dynamic Background Compensation","resercher_intro":"","research_ch":"使用動態背景補償以偵測與追蹤移動監控畫面之前景物\r\n","abstract_en":"There are increasing demands to detect usual\/unusual events in various environments nowadays. Deploying cameras in public\/private areas to form a ubiquitous surveillance system is thought to be very helpful in ensuring safety of people in many aspects. However, as more and more cameras are being installed, it may become impractical and cumbersome to find available human resources to achieve effective surveillance. Advanced surveillance systems that can actively monitor an area\/object of interest and automatically identify abnormal situations are considered to be a promising solution. The advanced surveillance systems rely on analyzing the visual data recorded by the cameras to determine if unusual events happen. The issue of object tracking in video frames is thus very important and needs to be investigated thoroughly.\nIn this research, we adopt Pan-Tilt-Zoom (PTZ) cameras in our surveillance environment and propose a novel detection and tracking algorithm for dynamic scene videos captured by a PTZ camera. In our system, we first use the static scene tracking algorithm to construct the background and then use the dynamic scene tracking algorithm when the camera starts moving. The optical flow approach is used to detect the background motion and then predict the current background image. The background subtraction is then applied to obtain the rough foreground regions. 
In order to better predict the next frame, we compensate the predicted background to prevent error propagation. Finally, the watershed algorithm is applied to obtain a more precise contour of the foreground object. The camera is controlled to move for tracking the object accordingly. Experimental results show the feasibility of the proposed system.","abstract_ch":"自動化監控是近年來熱門的研究方向,由於監控人員無法永遠專注 地監視攝影畫面,利用自動化監控來幫忙追蹤監視是必要的。但目前的 監控系統是以固定式定點拍攝的攝影機為主要設備,由於攝影範圍有 限,因而會有許多的死角,造成監視上的困難。因此,我們採用可旋轉 式攝影機(Pan-Tilt-Zoom camera),利用其可控制移轉的特性來增加監控視 野。\n本研究提出動態背景預測法來偵測並追蹤物體。當攝影機未移動 時,利用背景相減找出移動物體並取出其特徵點,在攝影機追蹤移動時, 採用光流運算得到移動物體的特徵點移動後的位置與背景移動向量,進 而估計出當前的背景影像,再用動態背景補償的方式,防止預測誤差傳 播擴散,最後採用分水嶺演算法得到物體更精確的輪廓,以控制攝影機 完成追蹤。\n當移動物體突然轉向或停止時,我們所提出的方法依然可以正確的 追蹤物體,實驗結果證實我們所提出方式的可行性。","picture":"","personal_page":""},{"id":"42","phd":"0","class":"94","name_en":"Hong-Min Chang","name_ch":"張鴻閔","research_en":"Content Authentication of Visual Data by Using\r\nSingular Value Decomposition and Vector Quantization.","resercher_intro":"","research_ch":"利用奇異值分解與向量量化以達成視覺資料之內容認證\r\n","abstract_en":"The creation and distribution of digital contents have become increasingly convenient these days thanks to the rapid growth of digital signal processing techniques and broadband networking infrastructure. Editing digital images\/videos, which was thought to be only achievable by professionals, can now be done with inexpensive and widely available software and hardware. A content authentication scheme for digital images and videos is proposed in this research. In order to avoid the content from being unnoticeably tampered by using digital editing facilities, reliable visual features are extracted as the authentication code and transmitted along with the images and videos. The integrity of content in a picture or a frame can thus be guaranteed by comparing the similarity between the transmitted authentication code and the extracted feature. The features are sensitive to malicious modification of data but are resilient to allowed lossy compression. The receiver can calculate the similarity between the transmitted digest and the extracted one to determine if the content has been tampered. The tampered region can even be located to help identify the motivation of the attacker. Experimental results show the feasibility of the proposed algorithm.","abstract_ch":"數位多媒體壓縮技術純熟與網路的快速發展使得數位化的影像、視訊等多媒體資料兼顧高品質與體積小,並且對於多媒體的發佈變也便的容易許多。由於數位多媒體資料易於編輯與複製的特性,使得數位化的資料易受竄改,使得多媒體資料的真實性(autnetity)與完整性(intergrity)受到質疑。多媒體的驗證主要目的為了確保數位多媒體資料的完整性與真實性避免受到竄改。在本篇論文中,提出一個以數位簽章為基礎的強健的影像\/視訊驗證系統。主要透過量化編碼的方式取得區塊內容特徵的擷取形成驗證碼,再經由壓縮加密隨著影像\/視訊傳送或儲存。對於惡意的竄改發出警告,其利用畫面資訊的區域性標示出竄改區塊;而對於影像\/視訊的有損壓縮,像是JPEG、H.264\/AVC等類型的合法操作,均能夠通過我們的驗證系統。更提出在H.264\/AVC的監視系統下的改良方法,使我們系統產生的驗證碼長度小與提升驗證的效率,且不會影響驗證的準確率。","picture":"","personal_page":""},{"id":"43","phd":"0","class":"94","name_en":"Ming-Lun Li","name_ch":"李明倫","research_en":"Content-Adaptive Digital Watermarking in H.264 for Video Authentication","resercher_intro":"","research_ch":"應用於H.264視訊內容認證之適應式數位浮水印\r\n","abstract_en":"Digital contents have become increasingly popular nowadays due to their convenience of transferring and storage. In addition, the rapid growth of broadband networks and advanced coding technologies make creation and distribution of digital contents much easier and faster than ever. However, digital contents can be easily modified and the malicious tampering of data may change the meaning of contents. In this research, we propose a digital watermarking scheme under the framework of H.264\/AVC. 
The watermark is embedded into video frames to ensure the correct frame order. Attacks such as frame dropping, swapping, or insertion can thus be revealed through unambiguous watermark detection.\nWe use the DC values of blocks as the image features, which help determine whether a shot change occurs. The image hash value is calculated by content analysis and used to generate the watermark sequence. The watermarking scheme uses a human perceptual model to adjust the watermark energy so that the watermark robustness can be enhanced without degrading the visual quality.\nBy integrating the watermarking approach with H.264\/AVC, the watermark embedding\/detection can be done in a very efficient manner. The content-analysis mechanism not only makes the watermark imperceptible but also simplifies the watermark detection process. The experimental results show that the embedded watermark can survive transcoding processes such as changing quality parameters or the coding structure. In addition, frame modification attacks will be revealed through the watermark extraction results.\n","abstract_ch":"由於數位技術的進步,傳統的類比資料轉為數位格式儲存,不論是收藏或是攜帶都更為便利;網際網路的發達,也促進了數位資料的普及;此外,多媒體編碼技術的進步,數位資料所需的容量越來越小,對於上述的網路傳輸與儲存也變得更為方便。\n然而,數位資料所遇到的問題也接踵而來。由於數位資料易於複製的特性,使用者可以輕易的對其複製散播,使得創作者的智慧財產權受損。更甚者,利用偽造、變更、剪接等方式,對其內容更改,破壞原先內容的含義,進而做為己用。以監視系統為例,數位監視影像因為內容易於更改的特性,大幅降低了影像的公信力與可靠性。因此,我們提出了一個完善的數位浮水印機制,對於數位影像的內容加以驗證,由萃取出的浮水印,可判斷數位影像內容是否遭受竄改、加入畫面、移除畫面或是畫面調換等剪輯攻擊。\n我們以區塊DC值做為畫面之特徵,利用畫面內容分析的方式,判斷畫面場景切換以及影像雜湊值,再由場景內之畫面內容與單一畫面的雜湊值做為浮水印序列的生成參數,利用人類視覺模組依據畫面適應性地調整浮水印能量,在不破壞畫面品質下,盡可能地增強浮水印的強健性。\n利用我們所提出的浮水印方法,嵌入的浮水印將顯示影像畫面或片段的順序,因此任何企圖更改影像片段順序或是以刪除或插入畫面來改變視訊內容的攻擊都將被偵測出來。此外,本研究方法不但能在維持原始影像品質下,對影像內容做有效的驗證,還能夠抵抗重新壓縮等攻擊。我們相信本研究所提出之H.264\/AVC視訊壓縮與數位浮水印整合機制,將對建立更完美的多媒體資訊架構作出貢獻。\n","picture":"","personal_page":""},{"id":"44","phd":"0","class":"93","name_en":"Chun-Chieh Chen","name_ch":"陳俊傑","research_en":"Video\/Image Content Authentication by Vector Quantization","resercher_intro":"","research_ch":"植基於向量量化之視訊\/影像內容驗證技術\r\n","abstract_en":"The rapid growth of signal processing techniques and widespread networking facilities make creation and distribution of digital contents much easier and faster than ever. Users nowadays can produce images\/videos with their low-cost software and hardware and then transmit and share the digital data with others by using broadband networks. The convenience brought by digital technology definitely benefits most content users, but certain concerns may arise. One challenging issue is related to the content authentication of digital images\/videos. Many surveillance cameras store data in digital formats to facilitate their storage and transmission. However, the ease of editing digital data may void their effectiveness as evidence or proof in court. Thus, the authenticity of digital images\/videos should be further ensured so that they can serve as reliable evidence without doubt.\nIn this thesis, a content authentication scheme for digital images and videos is proposed. The classified vector quantization of image\/frame blocks is employed to create a digital digest or an authentication code, which is less sensitive to the lossy compression process and will be transmitted with the images\/videos. The receiver can calculate the similarity between the transmitted digest and the extracted one to determine if the content has been tampered with. The tampered region can even be located to help identify the motivation of the attacker. 
The proposed scheme is closely tied to the H.264 video codec to achieve better efficiency. Issues of codebook design and security are also discussed thoroughly. Experimental results show the feasibility of the proposed scheme.","abstract_ch":"網路的快速發展以及先進的影像\/視訊壓縮技術,為高品質數位媒體的發佈帶來許多便利,然而數位化多媒體也具有容易複製、存取與修改的特性。因此,多媒體資料的驗證即判斷多媒體資料的真實性(authenticity)與完整性(integrity)將變得更加重要。在本篇論文中,我們將現有的驗證技術與挑戰作討論,並且提出符合現今使用環境的驗證系統。此多媒體驗證技術利用畫面資訊的區域性(localization)指出影像畫面中可能遭受竄改的部份,並且抵抗影像\/視訊壓縮所造成的影響。\n對於畫面竄改的認定,我們以畫面內容意義改變與否作為惡意竄改的定義。有別於以往的多媒體驗證系統,我們所提出的影像\/視訊驗證技術是以向量量化(Vector Quantization)為基礎,利用影像\/視訊壓縮格式中既有的DCT係數區塊取出DC係數值,作為日後驗證的依據,接著透過簡單的區塊分類,將分別是具有資訊的區塊或者平滑的均勻區塊經由不同向量量化方法來達到影像內容驗證的目的。實驗顯示這樣的方法可適用於靜態影像JPEG壓縮與H.264等之視訊壓縮格式。","picture":"","personal_page":""},{"id":"45","phd":"0","class":"93","name_en":"Yu-Wei Wang","name_ch":"王昱偉","research_en":"Detecting Transition Logo for Sports Video Highlight Extraction","resercher_intro":"","research_ch":"針對與運動比賽精彩畫面相關串場效果之偵測\r\n","abstract_en":"Watching sports videos has always been an important and popular recreation and the broadcasting of sports games takes up a large portion of TV programming. With the rapid advancement of digital technologies, audiences nowadays can enjoy watching the sports games at home with their high-quality audio-visual facilities and even record the videos by using digital video recorders (DVR). When the audiences choose to record the video for time-shift purposes, they may not be interested in watching the whole game but only the video highlight parts. The audiences may benefit a lot if a novel DVR is developed to extract the highlights from sports videos automatically and accurately.\nIn sports video broadcasting nowadays, the highlight parts are always followed by slow-motion replays. In addition, the editor usually inserts a transition effect between the normal frame and the replaying frame to inform the audiences. Therefore, the appearance of a transition effect has a direct linkage to the video highlight. In this thesis, we propose to detect transition effects for sports video highlight extraction. In order to reduce the computational and hardware costs, the proposed method processes MPEG compressed bit-streams recorded from the DVR directly. We make use of the color information in MPEG streams, i.e., DC coefficients, and the motion information, including motion vectors and the macro-block types in frames. We then analyze these features to determine whether a transition effect occurs, based on the characteristics of transition effects, which include a short duration, rapid color changes, and fast object motion. We tested several videos of baseball games with different transition effects. 
The experimental results demonstrate an accuracy of 70%.","abstract_ch":"觀賞運動賽事長久以來是一項重要的娛樂活動,而運動賽事的轉播也一直在電視節目中佔有相當大的比例。隨著數位錄影機的日益普及,許多無法即時欣賞比賽的觀眾會使用數位錄影機錄下賽事以隨後觀賞。考慮到運動比賽通常持續數小時,而比賽精華只佔其中的一部份,若能開發先進的數位錄影設備使其能夠自動地將觀眾感到有興趣的比賽精華部份擷取出來,將可為使用者帶來便利。\n在現今的運動賽事轉播中,比賽精華片段會伴隨著慢動作重播,而轉播單位通常會在重播前後加上所謂串場效果以告知觀眾。我們在本論文中提出偵測與運動比賽精華相關之串場效果以達到比賽精華片段的擷取。我們的研究方法將直接處理經由數位廣播所傳來或由數位錄影機所錄製的MPEG串流,從MPEG串流中抽取及計算特徵值以節省訊號處理所需的計算時間與硬體需求。我們使用MPEG串流中所包含的色彩資訊,即DC值,以及動作資訊,包括動作向量以及在畫面中不同型態巨區塊比例等特徵,然後利用串場效果時間短暫、顏色變化與效果移動快速等特性,分析影片以找出可能是串場效果的片段。我們測試了以棒球為主的影片,其中包含了不同型態的串場效果。整體而言可達70%的準確度。","picture":"","personal_page":""},{"id":"1","phd":"1","class":"98","name_en":"Ching-Yu Wu","name_ch":"吳靖宇","research_en":"Model-Based Fast Algorithm and Rate Control for H.264\/AVC","resercher_intro":"","research_ch":"基於內容適應式模型之H.264\/AVC快速演算法與位元率控制","abstract_en":"H.264\/AVC has become the most frequently used video codec nowadays. A lot of efforts have been made to pursue highly efficient video coding and to maintain good rate-distortion performance of video compression. In this dissertation, several content-adaptive models are developed to increase the speed of video encoding and to achieve better rate\/quality control in H.264\/AVC. The dissertation consists of three major parts. First, an efficient intra-prediction mode decision mechanism is presented. A projection-based approach, which utilizes the reconstructed surrounding pixels and block content to compute the predicted block residuals (PBR), can effectively eliminate the less probable modes from the computation of Rate Distortion Optimization. According to the PBR and coding information acquired during the encoding process, some prediction modes and macroblock types can be further skipped to accelerate the intra coding. Then, after considering the efficiency of intra coding, we investigate the Rate-Quantization (R-Q) issue in the intra coding of H.264\/AVC. Assigning an appropriate Quantization Parameter (QP) to the intra-coded frames is very important to video coding. A content-adaptive R-Q model is thus presented to predict the bit usage of intra-coded frames. The relationship between the QP of a macroblock and the block complexity is derived so that a suitable QP can be determined under a target bit-rate. Since the proposed model is built on macroblocks, Region of Interest (ROI) coding can also be achieved. By adjusting the QP value at the macroblock level, more bits can be assigned to the ROI to better preserve its perceptual quality. Finally, we tackle the problem of rate\/quality control for regular video encoding by estimating the resultant quality or distortion associated with QP. A Distortion-Quantization (D-Q) model is proposed to predict the distortion level, which is defined as the difference between the original video frame and the decoded one in terms of the sum of squared errors. As in the R-Q model, the proposed D-Q model also has only one adjustable parameter related to the macroblock content and provides a mapping between QP and the corresponding distortion before the exact encoding process. Given a targeted frame quality measured in peak signal-to-noise ratio (PSNR), this model helps to assign a suitable QP value to each frame. 
Two applications are then considered, i.e., single-pass constant frame PSNR coding and two-pass coding with an additional bitrate or storage constraint, both of which can facilitate applications such as video archiving and editing.","abstract_ch":"H.264\/AVC影片編碼已經成為目前最常被使用的編碼標準,許多研究也因此致力於追求更高效率的影片編碼,以及維持良好的位元率-失真表現。在本篇論文中,我們研發了許多內容適應性的模型來加快H.264\/AVC的編碼速度,以及達到更好的位元率和品質控制。\n本論文由三個部分組成。首先,我們提出了一個畫面內預測的快速演算法。區塊邊緣的重建像素以及區塊內的內容,會先經由投影的方式產生兩組向量,再進一步計算出預測區塊冗餘。此一特徵值能夠有效地先行刪除一些較可能被位元-失真最佳化所刪除的預測模式,來提升編碼速度。根據預測區塊冗餘以及其他在編碼過程中擷取的資訊,本研究提出了更進一步的跳躍方法來跳過某些編碼模式和區塊種類,使得畫面內編碼能進一步加速。\n再者,增加了畫面內編碼的效率之後,我們探討畫面內編碼的位元-量化關係。如何適當地分配量化參數給I-畫面,對於影片編碼來說相當重要。我們提出了一個內容適應性的位元-量化模型,來預測I-畫面的位元使用量。藉由分析量化參數以及區塊複雜度之間的關係,在目標位元率決定之後,決定一個適合的量化參數來進行編碼。由於提出的模型是建構在巨區塊層,感興趣區域可以藉此使用較多的位元、以及較低的量化參數來編碼,進而達到提升人眼的視覺品質。\n最後,我們藉由估測失真與量化參數間的關係,進一步探討位元率控制以及品質控管問題。我們提出了一個內容適應性的失真-量化模型,來預測畫面或區塊的失真程度。和之前的位元-量化模型類似,該模型只有一個可使用巨區塊內容調整的參數,並且能在每張畫面被編碼之前,就先行預測該畫面的失真程度。在由訊噪比所定義的畫面品質被設定之後,該模型能夠幫助計算出適合的量化參數。藉由此模型,我們進一步探討兩個恆定畫質的影片編碼應用,希望能幫助相關專業應用,例如影片的儲存以及編輯,達到更好的品質與效果。","picture":"","personal_page":""},{"id":"2","phd":"1","class":"94","name_en":"Chin-Sung Wu","name_ch":"吳錦松","research_en":"Spatial and Temporal Feature Extraction for Digital Video Copy Detection","resercher_intro":"","research_ch":"應用於數位影片複製偵測之空間域與時間域特徵擷取機制\r\n","abstract_en":"In this research, the techniques of spatial and temporal feature extraction are proposed for digital video copy detection. An efficient content-based video copy detection scheme based on spatial and temporal feature extraction and matching is presented. The key-frames are selected to generate the spatial features, which are used as the anchor points for the temporal feature matching. The design considers the video coding structure, so the efficiency and the compact size of the feature database are the main contributions of the proposed framework. The experimental results show that the extracted features can facilitate fast content matching for identifying possible copies. The method should thus be feasible for matching content in very large video databases.","abstract_ch":"本研究提出一個應用於數位影片複製偵測之空間域與時間域特徵抽取機制,可應用於鑑別使用者上傳的影片之合法性。本研究方法是以影像內容的特徵為檢索基礎,並且提出新的空間域(spatial)特徵值為索引的快速近似方法來抵抗影片失真的狀況與提高辨識性,最後再使用新的時間域(temporal)特徵值的快速相似匹配方法,來鑑別使用者上傳的影片之合法性。本系統架構是基於H.264\/AVC壓縮域的影片來執行解碼,進而分析每一組GOP(Group of Picture)值,來進行切換鏡頭畫面(SCF)的偵測分析,再從這些SCF集合中獲取時間域的數位特徵值,緊接著繼續從這些SCF集合中篩選出一個或是數個具代表此部Video的key frame,再透過我們提出的一種新的空間域特徵值提取方法來得到空間域的數位特徵值,最後我們僅需將產出的空間域與時間域的數位特徵值,儲存進資料庫中,並不需要花費大量的儲存空間來儲存原版影片。日後若需要鑑別使用者上傳的影片之合法性時,僅需比較在影片資料庫的空間域與時間域的數位特徵值即可。\n我們提出的植基於內容的影片複製偵測方法,是適用於線上大量影片的複製偵測,例如鑑別使用者上傳到YouTube伺服器的影片之合法性。經過實際測試,在資料庫為252小時的影片中,使用者上傳影片的一張key frame的執行匹配計算時間約0.016秒。我們使用了MUSCLE-VCD-2007[34]與YouTube上大量影片來當作影片資料庫,並且使用一些失真的相似影片(例如在影片中加入noise、亮度改變、對比度改變、frame loss、frame insert、frame change、移位、旋轉、time shift)與不同影片來執行複製鑑別,實驗數據顯示了本機制是一個強健與高辨識性的系統,在對龐大的資料庫進行比較時,有高平均的查全率(Recall)與準確率(Precision),並能夠迅速地鑑別上傳的影片之合法性。","picture":"","personal_page":""}]