INTRODUCTION
Improving production efficiency, ensuring animal welfare, and reducing environmental impact require technologies for growth estimation, individual identification, and behavior monitoring [1–3]. As computer technology has advanced, livestock monitoring systems have evolved with it. These systems are broadly categorized into sensor-based contact methods and video-based non-contact methods [4–7]. Sensor-based contact methods collect behavioral data from a sensor attached to the ear or gather real-time information from a microchip implanted in the neck of an animal. However, these methods are prone to sensor failure, difficult to scale to large populations, and stressful for the animals [8]. In particular, radio frequency identification (RFID) tags, which are widely used because of their low cost, suffer from a short read range, an inability to read multiple tags simultaneously, and time-consuming operation, and the attachment process itself can stress the animals [9–12]. Video-based non-contact monitoring technologies, by contrast, require no physical contact. They therefore eliminate handling stress, allow an animal's condition to be checked remotely, and enable monitoring even at night.
Recently, the rapid advancement of computer technologies, including deep learning algorithms, has enabled the analysis of accumulated data to continuously monitor and analyze animal conditions without human intervention, resulting in efficient and automated monitoring systems. This review examines and summarizes research related to video processing and convolutional neural network (CNN)-based deep learning for animal face recognition [13–36], identification [8,28,37–44] and re-identification [45], and assesses its applicability to precision livestock farming (PLF) for improving animal welfare and production efficiency (Table 1 and Fig. 1).
Table 1. Studies on animal face recognition, identification, and re-identification using deep learning

| Research area | Reference | Target animal | Dataset | Pre-trained/transfer learning | Feature | Algorithm |
|---|---|---|---|---|---|---|
| Wildlife recognition | [26] | Wildlife | Wildlife Spotter | × | − | Lite AlexNet, VGG-16, ResNet50 |
| Wildlife recognition | [27] | Wildlife | Fishmarket, MS COCO 2017 | × | − | WildARe-YOLO |
| Wildlife face recognition | [29] | Chimpanzee | Self-created dataset | × | Annotation automation framework | SSD, CNN |
| Wildlife face recognition | [25] | Giant panda | Self-created dataset, ImageNet | ○ | Pre-trained AlexNet, GoogLeNet, ResNet-50, VGG-16 | NIPALS |
| Wildlife face recognition | [21] | Panda | Self-created dataset, COCO | ○ | Pre-trained Faster R-CNN, fine-tuned ResNet-50 | DNN |
| Wildlife face recognition | [39] | Golden snub-nosed monkey | Self-created dataset | × | − | Faster R-CNN |
| Livestock face recognition | [24] | Pig | Self-created dataset | × | Automatic selection of training and testing data | Haar cascade, deep CNN |
| Livestock face recognition | [31] | Sheep | Self-created dataset | × | − | YOLOv5s, RepVGG |
| Livestock face recognition | [28] | Aberdeen-Angus cow | Self-created dataset | ○ | Pre-trained VGGFACE, VGGFACE2 | − |
| Livestock face recognition | [34] | Cattle | Self-created dataset | × | Embedded system, automatic dataset processing | CNN |
| Livestock face recognition | [35] | Cattle | Self-created dataset | × | Channel pruning | YOLOv5 |
| Identification | [37] | Cattle | Self-created dataset | × | − | Inception-V3 CNN, LSTM |
| Identification | [42] | Cattle | ImageNet, COCO | ○ | Mobile devices | YOLOv5, ResNet18 landmark |
| Identification | [44] | Horse, etc. | THDD dataset | ○ | Hybrid | YOLOv7, SIFT, FLANN |
| Re-identification | [40] | Amur tiger | ATRW, ImageNet | ○ | Pre-trained SSD-MobileNet-v1, SSD-MobileNet-v2, DarkNet | YOLOv3 |
PLF has grown alongside advancements in sensing technology, big data, and deep learning. PLF applies these technologies to individual recognition and behavior monitoring, feed intake and weight measurement, barn temperature control, body temperature and estrus detection, activity levels, gait, body condition, and carcass traits [1–3]. The goal of PLF is to enhance farm management efficiency, conserve resources, improve animal welfare, and maximize productivity by implementing real-time data monitoring and automated management systems.
The monitoring methods used in PLF are categorized into sensor-based contact methods and video-based non-contact methods. Contact methods involve attaching devices such as collars, bands, ear tags, and RFID tags to animals to collect data. While these methods can gather accurate physiological data, they may cause stress to the animals and are challenging to manage and maintain at scale [8–12]. Non-contact methods collect data remotely, without direct contact, using tools such as CCTV, specialized cameras, drone cameras, and sound detection systems, and rely primarily on analyzing video and image data [4–7]. Although these methods may be less accurate than contact-based methods, they benefit animal welfare because they cause no handling stress. They also allow monitoring over a relatively wide area and are more economical in terms of equipment management and maintenance.
Non-contact monitoring is primarily performed through object detection [46–49], a technology that detects objects in images or videos and indicates the location of each object [50–52]. Even more detailed analysis becomes possible when object detection is combined with a CNN [13–21], because CNNs enable powerful feature extraction while preserving the spatial structure of large volumes of images. Furthermore, various architectures and high-performance algorithms have been developed. Recently, research has addressed not only object recognition [22–36] but also object identification [28,45].
Object recognition involves classifying the type of object detected in an image or video, thereby distinguishing a specific object from other objects. Object identification goes further by matching the recognized object against a database to determine which specific individual it is. As a real-world application of object recognition, inter-species recognition research aims to recognize faces across different animal species so that a single system can handle multiple species [34].
The field of human face recognition is already widely used in biometric authentication; the deep learning-based algorithm ArcFace, which converts the features of each face into embedding vectors, achieves an accuracy of 99.78% [24]. The field of animal recognition and identification, by contrast, has attracted significant research in recent years but has produced fewer results. Animal identification involves distinguishing and recognizing specific animals; it can be applied to monitoring individual animals' health, behavior, and reproductive status, and to the protection of endangered species [27]. Technologies for animal face identification, recognition, re-identification, and inter-species recognition can be used to monitor the health status, growth patterns, and behavior patterns of individual animals. For wild animals, these technologies play a crucial role in conservation and research by helping to estimate population sizes and monitor migration paths [41].
Understanding the health and behavior patterns of animals in the livestock sector is crucial for early disease detection, diagnosis, and animal welfare. As a result, animal recognition technologies are essential in PLF [22]. Furthermore, research on animal re-identification is also being conducted. This research aims to recognize previously identified animals for long-term monitoring of behavior patterns, survival rates, and migration paths [8,41–45,48,53]. To accurately identify individual animals, it is necessary to precisely detect their location within an image through object detection and accurately classify them. The more accurate the detection results are, the more accurate the recognition results will be.
Traditional object detection algorithms rely on hand-crafted features based on color, gradient, texture, and shape, combined with classifiers such as k-nearest neighbors (KNN), support vector machines (SVM), and Bayesian classifiers. These methods are suitable for detecting small, distinct objects, but they are less accurate and inefficient for real-world images that contain background noise. Object detection has improved significantly in accuracy thanks to advances in machine and deep learning, and it is now used in various fields, including PLF for non-invasive identification [46,47].
Generally, deep learning-based object detection algorithms can be divided into one- and two-stage methods. One-stage algorithms process the image only once within the network to directly extract features, classify them, and determine their location. Examples include You Only Look Once (YOLO) and Single Shot MultiBox Detector (SSD). On the other hand, two-stage algorithms, such as R-CNN, Fast R-CNN, and Faster R-CNN, first select region proposals within the image, and then classify and refine the boundaries of the objects in each region. These algorithms require large training and validation datasets to show accurate learning results.
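To make the distinction concrete, the sketch below runs a pre-trained two-stage detector (Faster R-CNN via torchvision) on a single image. The model choice, the input file name, and the 0.8 score threshold are illustrative assumptions, not details from the reviewed studies.

```python
# A minimal detection sketch (assumed setup: PyTorch + torchvision).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Load a COCO pre-trained two-stage detector.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("cow.jpg").convert("RGB")  # hypothetical input image
with torch.no_grad():
    prediction = model([to_tensor(image)])[0]

keep = prediction["scores"] > 0.8        # illustrative confidence threshold
boxes = prediction["boxes"][keep]        # [x1, y1, x2, y2] for each object
labels = prediction["labels"][keep]      # COCO class indices
```

A one-stage detector such as YOLO or SSD would produce the same kind of boxes and scores in a single forward pass, generally trading some accuracy for speed.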
Recording and observing animal behavior through video is common, but manually processing large amounts of data requires significant time and labor. Animals pose particular difficulties: individual characteristics differ across species, living environments are diverse, and animals do not cooperate during image acquisition, so the collected data are often insufficient for adequate training. In fields such as image recognition, video processing, and speech recognition, CNNs require a substantial amount of training data to produce an effective recognition system [13–21].
Animal recognition and identification datasets are designed to distinguish and identify animals at the species or individual level. These datasets include images or videos of animals, as well as metadata describing the characteristics of each animal. Recently, there has been increasing interest in long-term tracking to observe how individual animals change and behave over time and in different environments. This has led to the use of animal re-identification datasets. These datasets are used to re-identify specific animals across various times, locations, or other conditions [41]. However, animal re-identification datasets are not widely available, and the few well-summarized datasets often have small data sizes, limited annotations, and images captured in non-wild settings.
Fortunately, with the advancement of facial recognition technology, more and more open-source datasets are being made available for research, and animal datasets are becoming increasingly diverse. Labeled Faces in the Wild (LFW) provides a total of 13,233 annotated face images from 5,749 people in natural and complex environments [54]. ImageNet offers over 14 million images, including animal images with backgrounds, categorized into 27 major categories and over 20,000 subcategories [55]. PASCAL visual object classes (VOC) includes approximately 11,530 images containing 27,450 objects, with bounding boxes and pixel-level masks encoded by class [56]. Datasets that include various animals are Animal Web [57], which contains over 21,000 species-specific face images, Animals with Attributes [58], which includes 37,322 images from 50 species in versions 1 and 2, Animal Faces-HQ [59], which contains a total of 15,000 high-resolution animal face images from three categories (dogs, cats, and wild animals), and ZooAnimal Faces (https://www.kaggle.com/datasets/jirkadaberger/zoo-animals), which includes face images of zoo animals.
Wild animal image datasets captured in various environments are mainly collected through automatic camera traps and include metadata such as species, location, date, and time. Notable datasets include Smithsonian Wild provided by the Smithsonian Conservation Biology Institute, AfriCam (https://emammal.si.edu/), Caltech Camera Traps (https://beerys.github.io/CaltechCameraTraps/) provided by the California Institute of Technology, and Wild Animal Face, which is extensively used in computer vision and machine learning research for training and evaluating animal face recognition models. Datasets collected for specific wild animal research include Amur Tiger Re-identification in the Wild, which contains images of wild Amur tigers [45], the Grévy’s zebra dataset (https://datasets.wri.org/dataset/grevy-s-zebra-population-in-kenya-1977-78) containing images of Grevy’s zebras in Kenya, Chimpanzee Faces in the Wild (ChimpFace), which stores images of wild chimpanzee faces, and the African elephant dataset, which includes images of various ear shapes and facial features of African elephants. The Animal Movement and Location dataset collects movement patterns and location information of wild animals and is used in re-identification research.
With the increasing importance of PLF, livestock image datasets are also being actively collected. Notable datasets include CattleCV (https://www.kaggle.com/datasets/trainingdatapro/cows-detection-dataset), which contains thousands of cattle images and health data, Afimilk Cow, and Dairy Cattle Behavior. Pig image datasets include PigPeNet, which contains over 10,000 pig face images, and RiseNet, which includes 7,647 pig face images collected from 57 videos [34]. Other livestock image datasets include ThoDTEL 2015, which contains 1,410 images from 50 horses, Sheep Face, which contains hundreds of sheep face images, Goat-21, which contains approximately 2,100 goat face images, and Poultry-10K, which contains about 10,000 chicken images (https://livestockdata.org/datasets).
There are various ways to improve the performance of machine or deep learning models. Images collected in different environments often contain noise: subjects obscured by obstacles, or frames darkened or blurred by lighting. Data pre-processing is therefore necessary to improve data quality before model training and analysis, enhancing the model's efficiency and accuracy. Image pre-processing includes resizing images for consistent input, improving image quality, and restoring images to make analysis easier, using techniques such as histogram equalization, grayscale conversion, image smoothing, noise removal, and image restoration. Additionally, to increase the generalization performance of the model and prevent overfitting to the same data, data augmentation is performed to artificially increase the diversity of the dataset and extend the limited data. Image augmentation techniques include mirror imaging, rotation, scale transformation, translation, left-right flipping, zooming in/out, color dithering, noise addition, and distortion [60–62].
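As an illustration of the augmentation techniques listed above, the following sketch composes a few of them with torchvision; the specific transforms and parameter values are assumptions chosen for the example, not a pipeline from any cited study.

```python
# Illustrative augmentation pipeline (assumed setup: torchvision).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # resize + zoom in/out
    transforms.RandomHorizontalFlip(p=0.5),               # mirror imaging / flipping
    transforms.RandomRotation(degrees=15),                # rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2),               # color dithering
    transforms.ToTensor(),
])
# Applying `augment` to the same image repeatedly yields different variants,
# artificially extending a limited dataset.
```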
Training recognition models using deep learning requires a vast amount of training data. Even when utilizing open datasets or performing image augmentation, it is often challenging to secure a sufficient amount of labeled image data for specific animals. In such situations, pre-training and transfer learning are used to improve model performance and enable efficient training [11,46,63]. Pre-training involves using large-scale datasets like ImageNet to pre-train the model to learn general features and set stable initial weights. This accelerates training and enhances model performance. The process of adjusting the weights of a pre-trained model to fit a new task is called fine-tuning [13,34], and it is used to achieve optimal performance. Recent studies actively explore enhancing network performance through both pre-training and fine-tuning.
Transfer learning is a technique that utilizes a pre-trained model for a new task by using the lower layers of the pre-trained model as feature extractors. By retraining a model learned from a previous task, transfer learning allows rapid learning on new datasets and improves model performance even in data-scarce situations. Even when data is sufficient, using the weights of an existing model as initial values through transfer learning can reduce the training time and allow training to proceed efficiently, thereby improving performance.
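A minimal sketch of this idea in PyTorch is shown below: an ImageNet pre-trained ResNet-50 serves as a frozen feature extractor, and only a new classification head is fine-tuned. The number of identities and the learning rate are placeholder assumptions.

```python
# Transfer-learning sketch (assumed setup: PyTorch + torchvision).
import torch
import torch.nn as nn
import torchvision

num_identities = 50  # hypothetical number of individual animals

model = torchvision.models.resnet50(weights="IMAGENET1K_V2")  # pre-trained weights
for param in model.parameters():
    param.requires_grad = False                  # freeze the pre-trained lower layers

model.fc = nn.Linear(model.fc.in_features, num_identities)   # new task-specific head

# Fine-tuning: only the new head is updated; unfreezing more layers later
# allows deeper adaptation once training is stable.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```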
To recognize animal faces and identify the species or individuals from given images or video frames, it is necessary to extract animal face features using deep learning models like CNNs and train classifiers based on these features. CNNs introduce convolutional layers within the network to learn feature maps that represent the spatial similarities of patterns found in images. This makes them effective deep learning models for processing and analyzing visual data like images or videos [16,17].
CNNs consist of convolutional layers, which extract local features from the input image, pooling layers, which reduce the spatial size to decrease computation and emphasize important features, and fully connected layers, which perform classification tasks at the end of the network. The training process uses a backpropagation algorithm to calculate the gradient of the loss function and to update the network weights, and employs optimization techniques such as gradient descent to minimize errors.
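The structure described above can be summarized in a few lines of PyTorch; the layer widths and the 224 × 224 input size are illustrative assumptions.

```python
# A small CNN of the kind described above (assumed setup: PyTorch).
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: local features
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling layer: spatial reduction
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # fully connected layer

    def forward(self, x):                     # x: (batch, 3, 224, 224)
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```

Training such a network with a cross-entropy loss and a gradient-descent optimizer, with backpropagation computing the gradients, follows the standard deep learning loop.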
Standard CNN frameworks include AlexNet, VGG16, GoogLeNet/InceptionNet, ResNet, and CapsNet [19]. With the advancement of deep learning technologies such as CNNs, research on recognizing, identifying, or re-identifying animal faces using these technologies has been actively progressing. Animal face recognition is the process of determining whether a detected animal face belongs to a specific animal or species. Distinct from this, animal face detection involves locating the face of an animal in an image or video, identifying the position of the face, and marking the area with a box.
Animal face identification is the process of confirming whether a recognized animal face belongs to a specific individual within the same species. Re-identification refers to repeatedly identifying the same animal over time and across different locations. Re-identification techniques rely on algorithms that measure the similarity between feature vectors and compare them against an existing database to determine whether two observations belong to the same individual. These techniques are necessary for tracking individuals and analyzing behaviors.
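The database-matching step common to identification and re-identification can be sketched as a nearest-neighbor search over embedding vectors; the 512-dimensional embedding size and the random placeholder data below are purely illustrative.

```python
# Cosine-similarity matching sketch (assumed setup: NumPy).
import numpy as np

def cosine_scores(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Similarity between one query embedding and all gallery embeddings."""
    query = query / np.linalg.norm(query)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return gallery @ query

gallery = np.random.rand(100, 512)  # stored embeddings of known individuals
query = np.random.rand(512)         # embedding of a newly detected face

scores = cosine_scores(query, gallery)
best_match = int(np.argmax(scores))  # most similar known individual
# A threshold on scores[best_match] decides between "same individual"
# and "previously unseen individual".
```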
Experiments in 2018 were conducted to classify animal and non-animal images using the Wildlife Spotter dataset, and to recognize and identify birds, rats, bandicoots, rabbits, wallabies, and other mammals using three CNN architectures: Lite AlexNet, VGG-16, and ResNet50 [28]. The results showed that ResNet50 achieved the highest accuracy and performance. However, while fine-tuning slightly improved the performance of VGG-16, it decreased the performance of ResNet50 due to overfitting.
In a study published in 2024 [29], a lightweight wildlife-recognition model, WildARe-YOLO, was proposed and evaluated on the Wild Animal Facing Extinction, Fishmarket, and Microsoft Common Objects in Context (MS COCO) 2017 datasets. Compared with recent deep learning models, the proposed technique increased frames per second (FPS) by 17.65%, reduced model parameters by 28.55%, and decreased floating-point operations (FLOPs) by 50.92%. In a paper published in 2019, a deep learning-based automated pipeline was developed to efficiently annotate datasets by providing a toolset and an automated framework. This pipeline identifies and tracks individuals, and provides gender and identity recognition, from a video archive collected over 14 years from 23 chimpanzees [31].
Annotation was performed with the web-based VGG Image Annotator (VIA) by drawing tight bounding boxes around each chimpanzee's head. The proposed model achieved 84% accuracy in 60 ms on a Titan X GPU and in 30 seconds on a standard CPU, surpassing expert annotators in both speed and accuracy. Using 50 hours of frontal, side, and extreme-side videos, an SSD model was employed to detect faces, and a deep CNN model was trained for face and gender recognition. The recognition model trained with the generated annotations achieved 92.47% identity recognition accuracy and 96.16% gender recognition accuracy; using only frontal faces, it achieved 95.07% and 97.36%, respectively.
Matkowski et al. [27] obtained 163 images from 28 Chengdu giant pandas and manually extracted frontal-face images. They proposed a two-stage algorithm to recognize panda faces using a classifier based on the NIPALS algorithm, which was also used to calculate comparison scores between panda images. Compared with networks pre-trained on ImageNet (AlexNet, GoogLeNet, ResNet-50, and VGG-16), the proposed method achieved 6.43% and 8.59% higher accuracy than the second-best performer, ResNet-50.
Another study built a dataset of 6,441 images from 218 pandas, with manual annotations of panda faces, ears, eyes, noses, and mouths [23]. A Faster R-CNN detection network pre-trained on the COCO dataset was applied for face detection, and normalized face images were fed to a deep neural network (DNN), yielding a fully automated deep learning algorithm for panda face recognition. A fine-tuned ResNet-50 was then used to verify panda IDs, achieving 96.27% accuracy in panda recognition and 100% accuracy in detection.
In 2020, a deep network model called Tri-AI was developed; it was reported that the model could quickly detect, identify, and track individuals using Faster R-CNN from videos or still images in a dataset containing 102,399 images of 1,040 known individuals [40]. In frame-by-frame detection and identification of 22 golden snub-nosed monkeys on a test set of 10 videos, the model demonstrated a face detection accuracy of 98.70%, an individual identification accuracy of 92.01%, and an accuracy of 87.03% for identifying new individuals.
Wildlife recognition technologies play a crucial role in achieving various ecological and conservation goals, such as protecting endangered species, tracking population numbers, and monitoring behavior. Deep learning models like ResNet, Faster R-CNN, and YOLO are widely utilized for wildlife detection and identification, with their performance heavily influenced by the quality and quantity of datasets. Additionally, significant efforts are being made to develop lightweight models and high-performance algorithms that reduce computational costs while maintaining high accuracy.
For pig face recognition, an adaptive approach was proposed that automatically selects high-quality training and test data before applying a deep CNN, together with an augmentation approach to improve accuracy [26]. The method measures the structural similarity index (SSIM) between pig face images to remove near-identical frames and uses a two-stage Haar cascade classifier to automatically detect pig faces and eyes. After selecting high-quality training and test images, the deep CNN recognizes the pig faces.
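A hedged sketch of such SSIM-based frame filtering is given below, using OpenCV and scikit-image; the 0.95 threshold is an assumed value for illustration, not the one used in the study.

```python
# Near-duplicate frame removal via SSIM (assumed setup: OpenCV + scikit-image).
import cv2
from skimage.metrics import structural_similarity as ssim

def filter_frames(frames, threshold=0.95):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = [frames[0]]
    for frame in frames[1:]:
        prev = cv2.cvtColor(kept[-1], cv2.COLOR_BGR2GRAY)
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if ssim(prev, curr) < threshold:  # low similarity -> genuinely new frame
            kept.append(frame)
    return kept
```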
Meanwhile, a technique was proposed to improve the accuracy and robustness of sheep face recognition [32]. Faces detected by YOLOv5's object detection algorithm in images taken at various distances and angles are cropped, important features are extracted with the Shuffle Attention (SA) [63] spatial-channel attention mechanism and the Reparameterizable VGG (RepVGG) algorithm, and features of the same scale are fused. The SA block enhances the network's feature extraction ability, while the RepVGG block improves recognition efficiency through lossless compression. The proposed model achieved 95.95% accuracy on a side-face dataset, 97.64% on a frontal-face dataset, and 99.43% on a full-face dataset.

A study of cow face recognition applied transfer learning, additional data augmentation, and fine-tuning to an RGB dataset containing 315 face images of 91 Aberdeen-Angus cows. The pre-trained neural networks VGGFACE and VGGFACE2 were compared, with VGGFACE2 achieving the better accuracy of 97.1% [30].
In a 2022 study, Li [35] constructed a dataset of 10,239 cow face images collected under various angles and lighting conditions from 103 cows on a farm, and proposed a lightweight neural network of six convolutional layers for cow face recognition. The network uses global average pooling instead of fully connected layers on top of the convolutional layers, reducing the parameter count to 0.17 M, the model size to 2.01 MB, and the computation to 9.17 million floating-point operations (MFLOPs). The model achieved a recognition accuracy of 98.7%, and Gradient-weighted Class Activation Mapping (Grad-CAM) was used to visualize and confirm which features were being extracted. The small size of the model also allows it to run on embedded systems or portable devices, enabling real-time cow identification [35].
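The parameter savings from replacing fully connected layers with global average pooling can be seen in a small comparison; the channel count, feature-map size, and class count below are assumptions for illustration, not the dimensions of Li's network.

```python
# Why global average pooling shrinks the classifier head (assumed setup: PyTorch).
import torch.nn as nn

# Conventional head: flatten a 256 x 7 x 7 feature map into a linear layer.
head_with_fc = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 7 * 7, 103),   # ~1.29 M parameters for 103 classes
)

# GAP head: average each channel to a single value first.
head_with_gap = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),       # 256 x 7 x 7 -> 256 x 1 x 1
    nn.Flatten(),
    nn.Linear(256, 103),           # ~26 K parameters
)
```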
In a 2024 study, Weng proposed a method for automatically detecting cow faces using a YOLOv5 network-based approach. The dataset consisted of images taken at various angles of 80 cows (Simmental beef cattle and Holstein dairy cows) at a farm in Hohhot, Inner Mongolia, using five smartphones. The study applied channel pruning and model quantization to reduce the model size, the number of parameters, and FLOPs by 86.10%, 88.19%, and 63.25%, respectively, compared to the original YOLOv5 model. This enabled real-time cow face detection on mobile devices [36].
An identification method was proposed that uses the Inception-V3 CNN to extract image features from each frame and trains a long short-term memory (LSTM) network to capture temporal information and identify individual animals [39]. Combining the strengths of the Inception-V3 and LSTM networks, the cattle recognition method achieved 88% accuracy on 15-frame video clips and 91% on 20-frame clips. These results were superior to frameworks using only CNNs and demonstrated the method's ability to extract and learn additional identification-related information from video data.
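The CNN-plus-LSTM idea can be outlined as follows: per-frame feature vectors from a CNN backbone form a sequence that an LSTM classifies. The 2,048-dimensional feature size matches Inception-V3's standard pooled output; the hidden size is an assumption for the sketch.

```python
# CNN + LSTM identification sketch (assumed setup: PyTorch).
import torch
import torch.nn as nn

class CnnLstmIdentifier(nn.Module):
    def __init__(self, num_ids: int, feat_dim: int = 2048, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_ids)

    def forward(self, frame_features):       # (batch, num_frames, feat_dim)
        _, (h, _) = self.lstm(frame_features) # h: final hidden state per clip
        return self.fc(h[-1])                 # identity logits per video clip
```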
Li et al. [41] conducted a re-identification study on the Amur Tiger Re-identification in the Wild (ATRW) dataset, built from 92 Amur tigers, a critically endangered species with fewer than 600 individuals remaining. The dataset includes 8,076 high-resolution video clips capturing tigers in various poses and lighting conditions, annotated with bounding boxes, pose keypoints, and tiger identities. The study used deep models to perform re-identification of Amur tigers. It also benchmarked object detectors: SSD-MobileNet-v1 [52] and SSD-MobileNet-v2 [26] with ImageNet pre-trained backbones, TinyDSOD [17] trained from scratch on the training set, and YOLOv3 [24] with the ImageNet pre-trained DarkNet backbone, demonstrating that these models can be used for the protection and management of individual animals.
Dac et al. [43] proposed a face recognition pipeline for Holstein-Friesian dairy cows recorded in fixed-frame RGB videos at a robotic dairy farm at Dookie College, University of Melbourne, Victoria, Australia. In the pipeline, a MobileNetV2 model pre-trained and fine-tuned on widely known public datasets such as ImageNet and COCO produces face representations that are registered in a database. For input cow images, a YOLOv5 model detects the face and extracts the facial region, and landmark features such as the eyes and nose are extracted with a ResNet18-based landmark prediction model. Finally, faces are encoded as embedding features by a ResNet101-based model, and face matching compares similarity scores between the encoded result and the embeddings of other cow faces in the database. The method was tested on an NVIDIA Jetson Nano device for real-time operation, achieving 84% accuracy for 89 cows captured more than twice [43].
Qiao et al. [51] proposed a deep learning framework for cow identification using a dataset of 363 videos collected from 50 cows. Spatial features were extracted using CNNs, while spatiotemporal information across sequential frames was learned using a bidirectional long short-term memory (BiLSTM) network. The proposed model achieved 93.3% accuracy and 91.0% recall, outperforming existing methods such as Inception-V3, MLP, SimpleRNN, LSTM, and BiLSTM.
Ahmad et al. [8] introduced a method for automatically identifying animals by detecting their faces and muzzles with the YOLOv7 model and then extracting muzzle pattern features with the Scale-Invariant Feature Transform (SIFT) algorithm. The extracted features are matched against a database using the Fast Library for Approximate Nearest Neighbors (FLANN) algorithm. The method achieved over 99.5% accuracy in cow identification and demonstrated a lightweight structure and real-time performance, making it suitable for embedded systems and mobile devices. Deep learning often relies on high-performance computing hardware, limiting its application on mobile devices; however, as small mobile devices have become widespread, recent studies [8,35,36] have focused on improving detection accuracy and speed while reducing computational cost, or on quickly and accurately detecting obstacles in outdoor environments [36,64–67]. Similar research has also begun in the field of livestock face recognition.
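As an illustration of the SIFT-plus-FLANN matching stage in such an approach, the sketch below uses standard OpenCV calls. It is a hedged illustration of the technique, not the authors' implementation; the file names and the 0.7 ratio threshold are assumptions.

```python
# Muzzle-pattern matching with SIFT + FLANN (assumed setup: OpenCV).
import cv2

sift = cv2.SIFT_create()

query = cv2.imread("muzzle_query.png", cv2.IMREAD_GRAYSCALE)      # hypothetical crop
template = cv2.imread("muzzle_gallery.png", cv2.IMREAD_GRAYSCALE) # stored template

_, desc_query = sift.detectAndCompute(query, None)
_, desc_template = sift.detectAndCompute(template, None)

# FLANN with a KD-tree index (algorithm 1), then Lowe's ratio test.
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
matches = flann.knnMatch(desc_query, desc_template, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]

print(f"{len(good)} good matches")  # more matches -> more likely the same animal
```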
CONCLUSION
This review examines contactless techniques for animal face recognition, identification, and re-identification. In the data collection phase, animal face images are captured under various angles and lighting conditions, and data preprocessing normalizes the images to enhance the efficiency and accuracy of model training. Data augmentation and transfer learning (e.g., using pre-trained models like VGG and ResNet) are employed to address data scarcity, followed by fine-tuning to adapt the models to specific animal datasets. The integration of video processing and CNN-based deep learning presents a highly promising approach for PLF. These technologies enhance production efficiency, improve animal welfare, and reduce environmental impact. They provide accurate and efficient tools for growth estimation, individual identification, and behavior monitoring, driving innovation in livestock management.