In this project we propose an efficient method for searching and retrieving videos of human actions using a spatio-temporal localization algorithm. Content-based video retrieval (CBVR) is an extension of content-based image retrieval (CBIR). We use a highly efficient localization model that first performs temporal localization based on histograms of evenly spaced time slices, then spatial localization based on histograms of a 2-D spatial grid. The existing method used the Dollar detector. In this project we use the Histogram of Oriented Gradients (HOG) descriptor for feature extraction and an SVM classifier for classification. We also show that relevance feedback can be applied to our localization and ranking algorithm. The system gives high performance and accuracy. As a result, it is more applicable to real-world problems than prior content-based video retrieval systems. It can be used for surveillance and in restricted areas for recognizing human actions.


The existing method is a video surveillance system for a stationary-camera environment that can extract moving targets from a video stream in real time and classify them into predefined categories according to their spatiotemporal properties. Targets are detected by computing the pixel-wise difference between consecutive frames, and then classified with a temporally boosted classifier and “spatiotemporal-oriented energy” analysis. The classifier can successfully recognize five types of objects: a person, a bicycle, a motorcycle, a vehicle, and a person with an umbrella. In addition, targets that do not match any of the AdaBoost-based classifier’s categories are processed by a secondary classification module that categorizes them as crowds of individuals or non-crowds. This classification task can be performed effectively by analyzing a target’s spatiotemporal-oriented energies, which provide a rich description of the target’s spatial and dynamic features.

Moving target recognition involves two major steps: feature extraction and classification. The feature extraction step derives a set of features from the video stream. The classification step analyzes the extracted features in order to make an appropriate classification decision. A variety of machine learning classification techniques have been investigated for surveillance tasks, e.g., support vector machines, naïve Bayes classification, and AdaBoost.

People are usually the main objects of interest in surveillance tasks. An AdaBoost-based classifier is very effective at identifying individuals, but it is difficult to design and train it to recognize groups or crowds of people because of their varying shapes. In this work, a “crowd” is defined as two or more people in a small spatial region. Although crowd (group) recognition has received some attention in recent years, most research that has focused on determining the number of people in a small spatial region has been cast within the people counting and tracking paradigms. Since occlusion and projective effects are two of the major challenges associated with crowd detection, many existing systems use 3-D positions of humans and require camera calibration. Human trackers that do not require camera models have also been proposed.

Few approaches use temporal information to detect crowds without employing explicit tracking techniques. One notable exception is an approach that uses space-time slices to detect crowds in urban road environments. As with the AdaBoost-based classifier, the use of temporal information can greatly improve the performance of crowd detection systems. However, to avoid the complex tracking process and the detection and segmentation tasks that are often involved, the method analyzes spatiotemporal-oriented energies because they encapsulate spatial and dynamic information and do not require specific motion computations.

The framework detects and classifies moving targets in video streams based on the targets’ spatiotemporal properties. Targets are detected by computing the pixel-wise difference between consecutive frames, and then classified with a temporally boosted classifier and “spatiotemporal-oriented energy” analysis. The classifier improves weak classifiers by allowing them to make use of previous information when evaluating a frame. In addition, the framework includes a method for processing targets that do not match any of the AdaBoost-based classifier’s categories. Such targets are categorized as crowds of individuals or non-crowds. Moving crowd recognition is shown to be performed effectively by using spatiotemporal-oriented energies. The framework was tested on an extensive dataset, and the reported detection rates indicate that it is highly effective at recognizing all the predefined object classes.
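The detection step described above, the pixel-wise difference between consecutive frames, can be illustrated with a minimal sketch. It is not the existing system's implementation: the frame layout (nested lists of grayscale intensities), the `threshold` value, and the function names `difference_mask` and `bounding_box` are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # assumed layout: rows of 0-255 grayscale intensities


def difference_mask(prev: Frame, curr: Frame, threshold: int = 25) -> List[List[bool]]:
    """Mark pixels whose intensity changed by more than `threshold` (assumed value)."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def bounding_box(mask: List[List[bool]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) enclosing all changed pixels, or None."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, changed in enumerate(row) if changed]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)


# Usage: a 2x2 bright target appears in an otherwise static 4x5 scene.
prev_frame = [[0] * 5 for _ in range(4)]
curr_frame = [[0] * 5 for _ in range(4)]
for y in (1, 2):
    for x in (2, 3):
        curr_frame[y][x] = 200

print(bounding_box(difference_mask(prev_frame, curr_frame)))  # (1, 2, 2, 3)
```

A single bounding box is the simplest possible target; a real system would separate the mask into connected regions before passing each one to the classifier.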


The histogram of oriented gradients (HOG) is a feature descriptor used to detect objects in computer vision and image processing. The HOG technique counts occurrences of gradient orientations in localized portions of an image, such as a detection window or region of interest.

Implementation of the HOG descriptor algorithm is as follows:

  1. Divide the image into small connected regions called cells, and for each cell compute a histogram of gradient directions (edge orientations) for the pixels within the cell.
  2. Discretize each cell's histogram into angular bins according to gradient orientation.
  3. Each pixel in a cell contributes a weighted gradient to its corresponding angular bin.
  4. Groups of adjacent cells form spatial regions called blocks. Grouping cells into blocks is the basis for grouping and normalizing the histograms.
  5. The normalized group of cell histograms is the block histogram. The set of all block histograms forms the descriptor.
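The five steps above can be sketched in plain Python. This is a simplified illustration, not the project's code: the cell size of 4 pixels, the 2x2-cell blocks, the 9 unsigned orientation bins, magnitude weighting, and L2 block normalization are assumed parameter choices, and the function names are hypothetical.

```python
import math
from typing import List

Image = List[List[float]]  # assumed layout: rows of grayscale intensities


def cell_histogram(img: Image, top: int, left: int, cell: int, bins: int = 9) -> List[float]:
    """Steps 1-3: magnitude-weighted histogram of unsigned gradient orientations."""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(top, top + cell):
        for x in range(left, left + cell):
            # Central differences, clamped at the image border.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            hist[int(angle // (180.0 / bins)) % bins] += magnitude
    return hist


def hog_descriptor(img: Image, cell: int = 4, block: int = 2,
                   bins: int = 9, eps: float = 1e-6) -> List[float]:
    """Steps 4-5: group cells into overlapping blocks and L2-normalize each block."""
    rows, cols = len(img) // cell, len(img[0]) // cell
    cells = [[cell_histogram(img, r * cell, c * cell, cell, bins)
              for c in range(cols)] for r in range(rows)]
    descriptor: List[float] = []
    for r in range(rows - block + 1):
        for c in range(cols - block + 1):
            vec = [v for dr in range(block) for dc in range(block)
                   for v in cells[r + dr][c + dc]]
            norm = math.sqrt(sum(v * v for v in vec) + eps ** 2)
            descriptor.extend(v / norm for v in vec)
    return descriptor


# Usage: an 8x8 image with one vertical edge gives 2x2 cells and a single block.
edge_image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
descriptor = hog_descriptor(edge_image)
print(len(descriptor))  # 36 values: 4 cells x 9 bins
```

Because the edge is vertical, every gradient points horizontally and all of the energy falls into the first orientation bin of each cell. In the system described here, a descriptor of this kind would be the input to the SVM classifier.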


We first prepared each dataset for the retrieval experiments. The datasets were scaled uniformly to 240 pixels in height (maintaining aspect ratio) and 15 frames per second, so the feature extraction procedure was identical for both. We extracted features from each dataset at an average rate of 180 features per second, detecting features with the multiscale Dollar detector and describing them with HOG3-D. The resulting features were clustered into 1000 code words after PCA was performed to capture 95% of the features’ variance. Time-slice histograms were generated over the whole dataset in batch before the main retrieval experiments; because these preprocessing steps can be completed before a retrieval search is run, they are not included in the performance statistics. Each time slice was 10 frames in length, and the 2-D spatial grid was divided into 10 by 10 pixel blocks. These parameters were chosen based on observations of the minimum length and size of the actions within the dataset.
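The time-slice histograms can be sketched as follows, using the 10-frame slice length given above. The feature layout (a frame index paired with a code word id), the use of histogram intersection as the matching score, the `min_score` cutoff, and all function names are assumptions for illustration; the source does not specify the comparison measure.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Feature = Tuple[int, int]  # assumed layout: (frame index, code word id)


def time_slice_histograms(features: Iterable[Feature],
                          slice_len: int = 10) -> Dict[int, Counter]:
    """Count code word occurrences in each time slice of `slice_len` frames."""
    slices: Dict[int, Counter] = {}
    for frame, word in features:
        slices.setdefault(frame // slice_len, Counter())[word] += 1
    return slices


def intersection(query: Counter, candidate: Counter) -> int:
    """Histogram intersection (assumed score): code word mass shared by both."""
    return sum(min(count, candidate[word]) for word, count in query.items())


def candidate_slices(query: Counter, slices: Dict[int, Counter],
                     min_score: int) -> List[int]:
    """Temporal localization: keep slices whose overlap with the query is high enough."""
    return sorted(i for i, hist in slices.items()
                  if intersection(query, hist) >= min_score)


# Usage: seven features spread over three 10-frame slices.
features = [(0, 5), (3, 5), (9, 7), (12, 5), (25, 7), (27, 7), (28, 5)]
slices = time_slice_histograms(features)
query = Counter({7: 2, 5: 1})
print(candidate_slices(query, slices, min_score=2))  # [0, 2]
```

Since the histograms depend only on the dataset, they can be built once in batch, which matches the preprocessing described above. Only the cheap comparison against the query runs at search time, and the surviving slices are the ones passed on to spatial localization.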


Video Frame


• High performance and accuracy.
• Reduced time consumption.


• Used for surveillance.
• Can be used with CCTV in banks, home security systems, colleges, and schools.


Efficient content-based search systems, such as the model presented here, are becoming increasingly relevant in today’s world, as sophisticated searches are increasingly necessary to navigate huge amounts of data.
Through theoretical discussion and experimental results, we have demonstrated the basic practical applicability of our system to the task of real-world video search. In designing our algorithm, we have taken an efficiency-first approach. This has resulted in a fast, permissive temporal-then-spatial localization technique, followed by a more orthodox histogram ranking step, both of which can be assisted by relevance feedback.


